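Thanks to this blog post. I had been wondering for a while whether torchtext could handle this dataset. My own attempt did not work, and yesterday a search turned up this tutorial, which was really useful.
Let's first look at the preprocessing pipeline I used before.
The word2vec model was already trained earlier, so it is not retrained here.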
import pandas as pd
import numpy as np
import spacy
# Read data from files
train_data = pd.read_csv( "./drive/My Drive/NLPdata/train.tsv", header=0, delimiter="\t", quoting=3,encoding='latin-1' )
test_data = pd.read_csv( "./drive/My Drive/NLPdata/test.tsv", header=0, delimiter="\t", quoting=3,encoding='latin-1')
# unlabeled_train = pd.read_csv( "./train01.tsv", header=0, delimiter="\t", quoting=3,encoding='latin-1' )
# Verify the number of reviews that were read (100,000 in total)
print("Read %d labeled train reviews, %d labeled test reviews, "% (train_data["Phrase"].size, test_data["Phrase"].size ))
Load the previously generated word2vec model
import logging
import gensim
from gensim.models import word2vec
model=gensim.models.KeyedVectors.load_word2vec_format("./drive/My Drive/NLPdata/word2Vec03.bin",binary=True)
index2word=model.index2word
print(len(index2word))
index2word_set=set(model.index2word)
print(len(index2word_set))
print(model)
Process the corpus data
This includes sentence splitting, tokenization, lowercasing, and so on.
# text: corpus text that has already been tokenized (a list of words)
# model: the word2vec model generated earlier
# num_features: dimensionality of each word vector in the word2vec model (200 here)
def word2vec(text,model,num_features):
    featureVec = np.zeros((num_features,),dtype="float32")
    nwords=0
    for word in text:
        if word in index2word_set:
            nwords+=1
            featureVec=np.add(featureVec,model[word])
    # If no word is in the vocabulary, nwords is 0 and this yields NaN;
    # those NaN entries are set to 0 further down.
    featureVec = np.divide(featureVec,nwords)
    return featureVec
# print(word2vec(token))
def getAvgFeatureVecs(phrases,model,num_features):
    counter=0
    phraseFeatureVecs = np.zeros((len(phrases),num_features),dtype="float32")
    for phrase in phrases:
        # Report progress every 2000 phrases
        if counter % 2000==0:
            print("Phrase %d of %d" % (counter, len(phrases)))
        phraseFeatureVecs[counter]=word2vec(phrase, model, num_features)
        counter = counter+1
    return phraseFeatureVecs
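To make the averaging step concrete, here is a minimal sketch that uses plain Python lists and a toy dictionary in place of NumPy and the trained model. The names `toy_vectors` and `average_vector` and all the numbers are invented for illustration. Unlike the code above, this sketch returns a zero vector when no word is known rather than producing NaN.

```python
# Toy stand-in for the trained word2vec model: a plain dict of 3-d vectors.
# These words and numbers are invented for illustration only.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}

def average_vector(words, vectors, num_features):
    """Average the vectors of the in-vocabulary words in `words`."""
    total = [0.0] * num_features
    known = 0
    for word in words:
        if word in vectors:
            known += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    if known == 0:
        # No known word: return zeros instead of dividing by zero.
        return total
    return [t / known for t in total]

print(average_vector(["good", "movie", "zzz"], toy_vectors, 3))  # [2.0, 1.0, 1.0]
print(average_vector(["zzz"], toy_vectors, 3))                   # [0.0, 0.0, 0.0]
```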
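from nltk.corpus import stopwords
import re
def phrase_to_wordlist(phrase, remove_stopwords=False):
    # Replace every non-letter character with a space
    phrase_text = re.sub("[^a-zA-Z]"," ", phrase)
    # Lowercase and split on whitespace
    words = phrase_text.lower().split()
    # Stop-word removal is disabled, so remove_stopwords currently has no effect
    # if remove_stopwords:
    #     stops = set(stopwords.words("english"))
    #     words = [w for w in words if not w in stops]
    return words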
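For reference, this is roughly what the cleaning step would do with stop-word removal switched on. It is only a sketch: `TOY_STOPWORDS` is a tiny hypothetical set standing in for nltk's `stopwords.words("english")`, and `phrase_to_words` is an illustrative name, not the function used above.

```python
import re

# A tiny hypothetical stop-word set; the commented-out code in the post
# would use nltk's stopwords.words("english") instead.
TOY_STOPWORDS = {"a", "the", "of", "is"}

def phrase_to_words(phrase, remove_stopwords=False, stops=TOY_STOPWORDS):
    # Keep letters only, then lowercase and split on whitespace.
    letters_only = re.sub("[^a-zA-Z]", " ", phrase)
    words = letters_only.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in stops]
    return words

print(phrase_to_words("The plot is a series of 3 twists!", remove_stopwords=True))
# ['plot', 'series', 'twists']
```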
Process the training and test set data
clean_train_phrases = []
for phrase in train_data["Phrase"]:
    clean_train_phrases.append( phrase_to_wordlist( phrase, remove_stopwords=True ))
num_features=200
trainDataVecs = getAvgFeatureVecs( clean_train_phrases, model, num_features )
clean_test_phrases = []
for phrase in test_data["Phrase"]:
    clean_test_phrases.append( phrase_to_wordlist( phrase, remove_stopwords=True ))
num_features=200
testDataVecs = getAvgFeatureVecs( clean_test_phrases, model, num_features )
Replace NaN values in the vectorized data
# np.isnan(trainDataVecs).any()
nullFeatureVec = np.zeros((200,),dtype="float32")
# print(trainDataVecs[4])
trainDataVecs[np.isnan(trainDataVecs)] = 0
print(trainDataVecs[3])
Next, let's see how to process the data with torchtext. Comparing the two, I find it really is much more elegant.
Read the data
import pandas as pd
data=pd.read_csv(r'C:\Users\jwc19\Desktop\sentiment-analysis-on-movie-reviews\train.tsv',sep='\t')
test=pd.read_csv(r'C:\Users\jwc19\Desktop\sentiment-analysis-on-movie-reviews\test.tsv',sep='\t')
data.head()
Split the dataset with sklearn
Split the training data into a training set and a validation set at an 8:2 ratio.
from sklearn.model_selection import train_test_split
train,val=train_test_split(data,test_size=0.2)
train.to_csv("train.csv",index=False)
val.to_csv('val.csv',index=False)
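Conceptually, an 8:2 split is a shuffle followed by a cut. The sketch below illustrates that idea with the standard library only. It is not how sklearn implements `train_test_split`, and `split_rows` is a hypothetical helper. Fixing the seed makes the split repeatable, which `train_test_split` supports through its `random_state` argument.

```python
import random

def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of `rows` and cut it into (train, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * test_size))
    return shuffled[n_val:], shuffled[:n_val]

train_rows, val_rows = split_rows(range(10))
print(len(train_rows), len(val_rows))  # 8 2
```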
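Build the tokenizer and define the Field
Torchtext takes a declarative approach to loading data: you tell torchtext what you want the data to look like, and torchtext handles the rest.
This declaration is made with a Field, which specifies how you want the data to be processed.
By default, a Field expects its input to be a sequence of words and maps those words to integers.
This mapping is called the vocab. If a field is already numeric and does not need to be treated as a sequence,
you can set use_vocab=False and sequential=False.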
import spacy
import torch
from torchtext import data, datasets
from torchtext.vocab import Vectors
from torch.nn import init
# Fall back to the CPU when no GPU is available
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
spacy_en=spacy.load("en")
def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]
label=data.Field(sequential=False, use_vocab=False)
text=data.Field(sequential=True, tokenize=tokenize_en,lower=True)
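The word-to-integer mapping that a sequential Field relies on can be sketched in a few lines of plain Python. This is an illustration loosely modelled on the idea of a vocab, not torchtext's actual code. The `<unk>` and `<pad>` specials, the tie-breaking rule, and the function names are assumptions made for the example.

```python
from collections import Counter

def build_vocab(tokenized_texts, specials=("<unk>", "<pad>")):
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent first; ties broken alphabetically so the order is stable.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered                 # index -> string
    stoi = {tok: i for i, tok in enumerate(itos)}   # string -> index
    return itos, stoi

def numericalize(tokens, stoi):
    # Words that are not in the vocab map to the index of "<unk>".
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [["a", "good", "movie"], ["a", "bad", "movie"]]
itos, stoi = build_vocab(corpus)
print(itos)                                         # ['<unk>', '<pad>', 'a', 'movie', 'bad', 'good']
print(numericalize(["a", "great", "movie"], stoi))  # [2, 0, 3]
```

A label column that already holds integers needs none of this, which is why the label Field above sets use_vocab=False and sequential=False.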
Define the Dataset
The fields know what to do when given raw data. Now we need to tell the fields what data they should process. This is done with Datasets.
Torchtext has many built-in Datasets for handling various data formats.
From the official TabularDataset documentation: Defines a Dataset of columns stored in CSV, TSV, or JSON format.
TabularDataset handles csv/tsv files easily, so we use it to create the Dataset.
train, val=data.TabularDataset.splits(
path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code',
train='train.csv',
validation='val.csv',
format='csv',
skip_header=True,
fields=[
('PhraseId',None),
('SentenceId',None),
('Phrase',text),
('Sentiment',label)
]
)
# splits() returns a tuple even when only one file is given, so unpack it
test,=data.TabularDataset.splits(
path=r'C:\Users\jwc19\Desktop\sentiment-analysis-on-movie-reviews',
test='test.tsv',
format='tsv',
skip_header=True,
fields=[
('PhraseId',None),
('SentenceId',None),
('Phrase',text),
]
)
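The `fields` list pairs each column, in file order, with the Field that should process it, and `None` means the column is ignored. The standard-library sketch below imitates that behaviour on two made-up rows. The processing functions are stand-ins for real Field objects, and nothing here is torchtext's own implementation.

```python
import csv
import io

# Two made-up rows in the same column layout as train.csv.
raw = io.StringIO(
    "PhraseId,SentenceId,Phrase,Sentiment\n"
    "1,1,A good movie,3\n"
    "2,1,good movie,3\n"
)

# (column name, processing function); None means "skip this column".
fields = [
    ("PhraseId", None),
    ("SentenceId", None),
    ("Phrase", lambda s: s.lower().split()),
    ("Sentiment", int),
]

reader = csv.reader(raw)
next(reader)  # corresponds to skip_header=True
examples = []
for row in reader:
    example = {}
    for (name, process), value in zip(fields, row):
        if process is not None:
            example[name] = process(value)
    examples.append(example)

print(examples[0])  # {'Phrase': ['a', 'good', 'movie'], 'Sentiment': 3}
```

Because columns are matched by position, every column in the file needs an entry in the list, even the ones that are dropped.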
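Build the vocab
Torchtext can convert words to numbers, but it needs to be told the full set of words it has to handle. GloVe vectors are used here, and the library downloads them for you.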
text.build_vocab(train,vectors='glove.6B.100d')
text.vocab.vectors.unk_init = init.xavier_uniform
print(text.vocab.itos[1510])
print(text.vocab.stoi['bore'])
# Word-vector matrix: text.vocab.vectors
print(text.vocab.vectors.shape)
word_vec = text.vocab.vectors[text.vocab.stoi['bore']]
print(word_vec.shape)
print(word_vec)
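The relationship between `stoi`, `itos` and the vector matrix can be shown with a small pure-Python sketch: row i of the matrix holds the vector for the word `itos[i]`. The 2-d vectors and the vocab order below are invented, and giving zeros to words without a pretrained vector is an assumption made for this example.

```python
# Hypothetical pretrained vectors (2-d for brevity) and a vocab order.
pretrained = {"good": [0.1, 0.2], "movie": [0.3, 0.4]}
itos = ["<unk>", "<pad>", "good", "movie", "bore"]
stoi = {tok: i for i, tok in enumerate(itos)}
dim = 2

# Row i holds the vector for itos[i]; tokens without a pretrained vector get zeros.
matrix = [pretrained.get(tok, [0.0] * dim) for tok in itos]

print(len(matrix), dim)        # 5 2
print(matrix[stoi["movie"]])   # [0.3, 0.4]
print(matrix[stoi["bore"]])    # [0.0, 0.0]
```

A matrix laid out this way can be copied into an embedding layer, so that the integer ids produced by the vocab index directly into the pretrained vectors.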