Overview
For training word vectors, the word2vec model provided by the gensim library is a common choice. Below is a simple example of using that model, followed by an attempt to implement word-vector training with PyTorch.
Data: the word.txt file
The file contents are as follows:
pen pencil pencilcase ruler book bag comic-book post-card newspaper schoolbag eraser crayon sharpener storybook notebook Chinese-book English-book math book magazine dictionary foot head face hair nose mouth eye ear arm hand finger leg tail
red blue yellow green white black pink purple orange brown
cat dog pig duck rabbit horse elephant ant fish bird eagle beaver snake mouse squirrel kangaroo monkey panda bear lion tiger fox zebra deer giraffe goose hen turkey lamb sheep-goat cow donkey squid lobster shark seal sperm whale killer-whale
friend boy girl mother father sister brother uncle man woman Mr Miss lady mom dad parents grandparents grandma grandmother grandpa grandfather aunt cousin son daughter baby kid classmate queen visitor neighbour principal university-student pen-pal tourist people robot
teacher student doctor nurse driver farmer singer writer actor actress artist TV-reporter engineer accountant policeman salesperson cleaner baseball-player assistant police
rice bread beef milk water egg fish tofu cake hot-dog hamburger French-fries cookie biscuit jam noodles meat chicken pork mutton vegetable salad soup-ice icecream Coke juice tea coffee-breakfast lunch dinner supper meal
apple banana pear orange watermelon grape eggplant green-beans tomato potato peach strawberry cucumber onion carrot cabbage
jacket shirt Tshirt skirt dress jeans pants socks shoes sweater coat raincoat shorts sneakers slippers sandals boots hat cap sunglasses tie scarf gloves trousers cloth
bike bus train boat ship yacht car taxi jeep van plane airplane subway underground motor-cycle
window door desk chair bed computer board fan light teachers-desk picture wall floor curtain trash-bin closet mirror end-table football soccer present walkman lamp phone sofa shelf fridge table TV airconditioner key lock photo chart plate knife fork spoon chopsticks pot gift-toy doll ball balloon kite jigsaw-puzzle box umbrella zipper violin yoyo nest hole tube toothbrush menu ecard email traffic-light money medicine
home room bedroom bathroom living-room kitchen classroom school park library post-office police-office hospital cinema bookstore farm zoo garden-study playground canteen teachers-office library gym washroom art-room computer-room music-room TV-room flat company factory fruit-stand pet-shop nature-park theme-park science-museum Great-Wall supermarket bank country village city hometown bus-stop
sports science Moral-Education Social-Studies-Chinese math PE English
China PRC America USA UK England Canada CAN Australia New-York London Sydney Moscow-Cairo
cold warm cool snowy sunny hot rainy windy cloudy weather-report
river lake stream forest path road house-bridge building rain cloud sun mountain sky-rainbow wind air moon
flower grass tree seed sprout plant rose leaf
Monday Tuesday Wednesday Thursday Friday Saturday Sunday weekend
Jan Feb Mar April May June July Aug Sept Oct Nov Dec
spring summer fall autumn winter
south north east west left right
have-a-fever hurt have-a-cold have-a-toothache have-a-headache have-a-sore-throat
one two three four five six seven eight nine ten-eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty thirty forty fifty-sixty seventy eighty ninety fortytwo hundred one a-hundred-and-thirtysix first second third fourth fifth eighth ninth twelfth twentieth thirtieth fortieth fiftieth sixtieth seventieth eightieth ninetieth fiftysixth
big small long tall short young old strong thin active quiet nice kind strict smart funny tasty sweet salty sour fresh favourite clean tired excited angry happy bored sad taller shorter stronger older younger bigger heavier longer thinner smaller good fine great heavy new fat happy right hungry cute little lovely beautiful colourful pretty cheap expensive juicy tender healthy ill helpful high easy proud sick better higher
in on under near behind next-to over in-front-of
I we you he she it they my our your his her
play swim skate fly jump walk run climb fight swing eat sleep like have turn buy take live teach go study learn sing dance row do do-homework do-housework watch-TV read-books cook-meals water-flowers sweep-floor clean-bedroom make-bed set-table wash-clothes do-dishes use-a-computer do-morning-exercises eat-breakfast eat-dinner go-to-school have-English-class play-sports getup climb-mountains go-shopping play-piano visit-grandparents go-hiking fly-kites make-a-snowman plant-trees draw-pictures cook-dinner read-a-book answer-phone listen-to-music clean-room write-a-letter write-an-email drink-water take-pictures watch-insects pick-up-leaves do-an-experiment catch-butterflies count-insects collect-insects collect-leaves write-a-report play-chess have-a-picnic get-to ride-a-bike play-violin make-kites collect-stamps meet welcome thank love work drink taste smell feed shear milk look guess help pass show use clean open close put paint tell kick bounce ride stop wait find drive fold send wash shine become feel think meet fall leave wake-up put-on take-off hang-up wear go-home go-to-bed play-computer-games play-chess empty-trash put-away clothes get-off take-a-trip read-a-magazine go-to-cinema go-straight
This dataset is fairly small, but it should be enough for a simple demo. Each category of words is on its own line.
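If you want a quick sense of how large this vocabulary actually is, here is a small sketch that counts the lines (categories) and distinct words in word.txt:
import io
with open('word.txt', 'r', encoding='utf-8') as f:
    lines = f.read().strip().split('\n')
words = {w for line in lines for w in line.split()}
print(len(lines), 'categories,', len(words), 'distinct words')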
Training word2vec with the gensim library
First install the gensim library: pip install gensim. Once it is installed, the training code is as follows:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
sentences = LineSentence('word.txt')
# Read the word.txt file shown above; this returns an iterator that yields each line as a list of words
model = Word2Vec(sentences, size=16, window=5, min_count=0, workers=4, iter=5000)
# Train the model; the parameters are:
# size: each word is represented by a 16-dimensional vector
# window: window size of 5, i.e. up to 5 words before and 5 words after the current word are used as context
# min_count: minimum word frequency; words occurring fewer times are ignored (the default is 5, set to 0 here because the dataset is small)
# workers: number of worker threads (4 here)
# iter: train for 5000 epochs
model.save('gensim_16.mdl')
# Save the model
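Note that the parameter names above follow the gensim 3.x API, which is what the rest of this article uses. In gensim 4.0 and later, size and iter were renamed; a minimal sketch of the equivalent call on the newer API (same values, only the keyword names change):
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
sentences = LineSentence('word.txt')
model = Word2Vec(sentences, vector_size=16, window=5, min_count=0, workers=4, epochs=5000)
# vector_size replaces size and epochs replaces iter in gensim >= 4.0
model.save('gensim_16.mdl')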
Because the dataset is small, training finishes in about a minute. The test code is as follows:
model = Word2Vec.load('gensim_16.mdl')
# Load the saved model
items = model.wv.most_similar('bear')
# model.wv provides many utility methods; most_similar returns words ranked by similarity to the given word, from highest to lowest
for i, item in enumerate(items):
    print(i, item[0], item[1])
print(model.wv.similarity('bear', 'tiger'))
# Compute the similarity between two words
Inspecting the word vectors
model.wv.index2word lists the word corresponding to each vector in the model; the result looks like this:
['book',
'math',
'orange',
'fish',
'milk',
...
]
So all of the word vectors can be retrieved with model[model.wv.index2word]; the result looks like this:
array([[ 2.1858058 , -1.1265628 , 0.7986337 , ..., -3.3885555 ,
-5.0689073 , -2.3837712 ],
[ 2.5849087 , 0.6549566 , 1.0028977 , ..., -1.8795928 ,
-4.4294124 , -4.1221085 ],
[ 0.9784559 , -4.1107635 , 0.8471646 , ..., -3.7726424 ,
-0.33898747, -3.4206762 ],
...,
[ 2.0379307 , -1.7257718 , 0.98616403, ..., -2.5776517 ,
-0.8687243 , 1.4909588 ],
[ 1.8207592 , -1.4406224 , 0.66797787, ..., -2.2530203 ,
-0.6574308 , 1.4921187 ],
[ 1.2744113 , -1.1354392 , 0.6139609 , ..., -1.8367131 ,
-0.59694195, 1.073009 ]], dtype=float32)
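A single word's vector can also be looked up directly through model.wv; a quick sketch, using 'bear', one of the words in the dataset above:
vec = model.wv['bear']
print(vec.shape)
# (16,), matching the size passed to Word2Vec
As an aside, in gensim 4.0 and later index2word is renamed index_to_key and the model[...] shorthand is removed, so the equivalents there are model.wv.index_to_key and model.wv[model.wv.index_to_key].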
Dimensionality-reduction visualization
We can reduce the word vectors above to two dimensions and plot them on a set of axes to see how the vectors are distributed. The code is as follows:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE(n_components=2, init='pca', n_iter=5000)
# Reduce the data to 2 dimensions with t-SNE, using PCA initialization and 5000 iterations
embed_two = tsne.fit_transform(model[model.wv.index2word])
# Transform the word vectors into their 2-D embedding
labels = model.wv.index2word
# The labels are the words corresponding to each vector
plt.figure(figsize=(15, 12))
for i, label in enumerate(labels[:80]):
    # Plot the 2-D distribution of the first 80 word vectors
    x, y = embed_two[i, :]
    plt.scatter(x, y)
    plt.annotate(label, (x, y), ha='center', va='top')
    # Annotate each point with its word
# plt.savefig('word.png')
plt.show()
The result:
You can see the word vectors cluster into several groups; for example, the colour words mostly end up in one cluster in the middle, while the animals cluster near the bottom.
References:
https://blog.csdn.net/zhl493722771/article/details/82781675
https://blog.csdn.net/qq_27586341/article/details/90025288
Defining and training a model with PyTorch
The previous section used the gensim library's word2vec model directly to train word vectors; next we try to train them with PyTorch. First we need to choose a training scheme. There are generally two:
CBOW (Continuous Bag-of-Words): predict the current word from its context words
Skip-Gram: predict the context words from the current word
For example, suppose we have a sequence [a, b, c, d, e]. With CBOW, if the model's input/output pairs are [(x1, y1), (x2, y2), ...], they might be [(a, c), (b, c), (d, c), (e, c)]: the inputs a, b, d and e can all map to c, and likewise the inputs a, c, d and e can all map to b, and so on, i.e. many words predict one word. Skip-Gram trains on the reverse relation, so in the same situation the pairs would be [(c, a), (c, b), (c, d), (c, e)].
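To make the two schemes concrete, here is a small illustrative sketch that builds both kinds of training pairs for the toy sequence ['a', 'b', 'c', 'd', 'e'] with a context window of 2 (this is only an illustration; the pair construction actually used for training appears in the preprocessing step below):
seq = ['a', 'b', 'c', 'd', 'e']
window = 2
skip_gram_pairs = []   # (current word, one context word)
cbow_pairs = []        # (context word list, current word)
for i, w in enumerate(seq):
    context = [seq[i + j] for j in range(-window, window + 1)
               if j != 0 and 0 <= i + j < len(seq)]
    for c in context:
        skip_gram_pairs.append((w, c))
    cbow_pairs.append((context, w))
print(skip_gram_pairs[:4])  # [('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c')]
print(cbow_pairs[2])        # (['a', 'b', 'd', 'e'], 'c')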
Here we train with the Skip-Gram approach; the steps are as follows:
Importing modules
import torch
from torch import nn
import matplotlib.pyplot as plt
Initial definitions
First we select the GPU if one is available, then define the context window size. It is set to 2 here, meaning that within the dataset the two words before and the two words after the current word can all be predicted from it. A few other initial definitions follow; the code is:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.is_available()
context_size = 2
# Context size, i.e. the maximum distance between the current word and a related word
lr = 1e-3
batch_size = 64
li_loss = []
Data preprocessing
Read the word dataset from above and build the input/output pairs based on the window size just defined. The code is as follows:
with open("word.txt", "r", encoding="utf-8") as f:
lines = f.read().strip().split("\n")
set_words = set()
# 存放所有單詞集合
for words in lines:
for word in words.split():
set_words.add(word)
word_to_id = {word: i for i, word in enumerate(set_words)}
# Map each word to an index id
id_to_word = {word_to_id[word]: word for word in word_to_id}
# Map each id back to its word
word_size = len(set_words)
# Number of distinct words
train_x = []
train_y = []
for words in lines:
    li_words = words.split()
    # All words on this line
    for i, word in enumerate(li_words):
        for j in range(-context_size, context_size + 1):
            # Relate each word to the words within its context window
            if i + j < 0 or i + j > len(li_words) - 1 or li_words[i + j] == word:
                # Skip out-of-range positions and the current word itself
                continue
            train_x.append(word_to_id[word])
            train_y.append(word_to_id[li_words[i + j]])
# Skip-Gram training data: the input is the current word and the output is each word within its context window
We can inspect part of the input and output data:
print("init:", lines[0].split()[:10])
print("x:", [ id_to_word[each] for each in train_x[:10]])
print("y:", [ id_to_word[each] for each in train_y[:10]])
# Output:
# init: ['pen', 'pencil', 'pencilcase', 'ruler', 'book', 'bag', 'comic-book', 'post-card', 'newspaper', 'schoolbag']
# x: ['pen', 'pen', 'pencil', 'pencil', 'pencil', 'pencilcase', 'pencilcase', 'pencilcase', 'pencilcase', 'ruler']
# y: ['pencil', 'pencilcase', 'pen', 'pencilcase', 'ruler', 'pen', 'pencil', 'ruler', 'book', 'pencil']
As you can see, the first word pen predicts pencil and pencilcase, pencil predicts pen, pencilcase and ruler, and so on: each word only predicts the two words before and after it, and out-of-range positions are skipped.
Defining the model
An embedding layer learns the word-vector table; a fully connected layer followed by a log-softmax then produces a probability for every word in the vocabulary. The code is as follows:
class EmbedWord(nn.Module):
    def __init__(self, word_size, context_size):
        super(EmbedWord, self).__init__()
        self.embedding = nn.Embedding(word_size, 16)
        # Represent each word with a 16-dimensional vector
        self.linear = nn.Linear(16, word_size)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        x = self.embedding(x)
        x = self.linear(x)
        x = self.log_softmax(x)
        return x
model = EmbedWord(word_size, context_size).to(device)
Defining the optimizer and loss
loss_fun = nn.NLLLoss()
# Use negative log-likelihood as the loss, paired with the LogSoftmax output of the model
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
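As an aside, ending the model with LogSoftmax and training with NLLLoss is mathematically the same as feeding the raw linear outputs to nn.CrossEntropyLoss; a minimal sketch of the equivalence (the batch size of 4 is arbitrary):
logits = torch.randn(4, word_size)           # stand-in for the linear layer's raw outputs
targets = torch.randint(0, word_size, (4,))  # stand-in target word ids
loss_a = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits), targets)
loss_b = nn.CrossEntropyLoss()(logits, targets)
print(torch.isclose(loss_a, loss_b))         # tensor(True): equal up to floating-point error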
Training
Because the word-vector table is fairly large, we train for 50000 epochs here (for better results, training for 100000 or more epochs is recommended), printing the training progress every 50 epochs. The code is as follows:
model.train()
for epoch in range(len(li_loss), 50000):
    if epoch % 2000 == 0 and epoch > 0:
        optimizer.param_groups[0]['lr'] /= 1.05
        # Decay the learning rate slightly every 2000 epochs
    for batch in range(0, len(train_x) - batch_size, batch_size):
        word = torch.tensor(train_x[batch: batch + batch_size]).long().to(device)
        label = torch.tensor(train_y[batch: batch + batch_size]).to(device)
        out = model(word)
        loss = loss_fun(out, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    li_loss.append(loss.item())
    # Store the loss as a plain float so the computation graph is not kept alive
    if epoch % 50 == 0 and epoch > 0:
        print('epoch: {}, Loss: {}, lr: {}'.format(epoch, loss, optimizer.param_groups[0]['lr']))
        plt.plot(li_loss[-500:])
        plt.show()
        for w in range(5):
            # Every 50 epochs, check the predictions for the first 5 words
            pred = model(torch.tensor(w).long().to(device))
            print("{} -> ".format(id_to_word[w]), end="\t")
            for i, each in enumerate((-pred).argsort()[:10]):
                print("{}:{}".format(i, id_to_word[int(each)]), end=" ")
            print()
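Unlike the gensim example, the code above never saves the trained model. One possible way to persist the learned embedding table (the file name embed_word.pt is just an example) is:
torch.save(model.state_dict(), 'embed_word.pt')
# Later, rebuild the model and reload the weights:
# model = EmbedWord(word_size, context_size).to(device)
# model.load_state_dict(torch.load('embed_word.pt'))
# embeddings = model.embedding.weight.detach().cpu().numpy()  # (word_size, 16) array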
Dimensionality-reduction visualization
This code is almost the same as before:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
result = model(torch.tensor([i for i in range(word_size)]).long().to(device))
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
embed_two = tsne.fit_transform(model.embedding.weight.cpu().detach().numpy())
# Reduce the word vectors to 2-D to inspect their spatial distribution
# embed_two = tsne.fit_transform(result.cpu().detach().numpy())
labels = [id_to_word[i] for i in range(200)]
# Only plot the distribution of the first 200 words here
plt.figure(figsize=(15, 12))
for i, label in enumerate(labels):
    x, y = embed_two[i, :]
    plt.scatter(x, y)
    plt.annotate(label, (x, y), ha='center', va='top')
# plt.savefig('word_vectors_tsne.png')
plt.show()
Visualization result:
Looking closely, part of the data is indeed clustered nicely, but compared with the gensim result above it is still somewhat worse.
Complete code
import torch
from torch import nn
import matplotlib.pyplot as plt
# ----------------------------
# Initial definitions
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.is_available()
context_size = 2
# Context size, i.e. the maximum distance between the current word and a related word
lr = 1e-3
batch_size = 64
li_loss = []
# ----------------------------
# Data preprocessing
with open("word.txt", "r", encoding="utf-8") as f:
    lines = f.read().strip().split("\n")
set_words = set()
# Set of all distinct words
for words in lines:
    for word in words.split():
        set_words.add(word)
word_to_id = {word: i for i, word in enumerate(set_words)}
# Map each word to an index id
id_to_word = {word_to_id[word]: word for word in word_to_id}
# Map each id back to its word
word_size = len(set_words)
# Number of distinct words (computed here, after set_words has been built)
train_x = []
train_y = []
for words in lines:
    li_words = words.split()
    # All words on this line
    for i, word in enumerate(li_words):
        for j in range(-context_size, context_size + 1):
            # Relate each word to the words within its context window
            if i + j < 0 or i + j > len(li_words) - 1 or li_words[i + j] == word:
                # Skip out-of-range positions and the current word itself
                continue
            train_x.append(word_to_id[word])
            train_y.append(word_to_id[li_words[i + j]])
# Skip-Gram training data: the input is the current word and the output is each word within its context window
# ----------------------------
# Model definition
class EmbedWord(nn.Module):
    def __init__(self, word_size, context_size):
        super(EmbedWord, self).__init__()
        self.embedding = nn.Embedding(word_size, 16)
        # Represent each word with a 16-dimensional vector
        self.linear = nn.Linear(16, word_size)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        x = self.embedding(x)
        x = self.linear(x)
        x = self.log_softmax(x)
        return x
model = EmbedWord(word_size, context_size).to(device)
# ----------------------------
# Optimizer and loss
loss_fun = nn.NLLLoss()
# Use negative log-likelihood as the loss, paired with the LogSoftmax output of the model
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# ----------------------------
# Training
model.train()
for epoch in range(len(li_loss), 50000):
    if epoch % 2000 == 0 and epoch > 0:
        optimizer.param_groups[0]['lr'] /= 1.05
        # Decay the learning rate slightly every 2000 epochs
    for batch in range(0, len(train_x) - batch_size, batch_size):
        word = torch.tensor(train_x[batch: batch + batch_size]).long().to(device)
        label = torch.tensor(train_y[batch: batch + batch_size]).to(device)
        out = model(word)
        loss = loss_fun(out, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    li_loss.append(loss.item())
    # Store the loss as a plain float so the computation graph is not kept alive
    if epoch % 50 == 0 and epoch > 0:
        print('epoch: {}, Loss: {}, lr: {}'.format(epoch, loss, optimizer.param_groups[0]['lr']))
        plt.plot(li_loss[-500:])
        plt.show()
        for w in range(5):
            # Every 50 epochs, check the predictions for the first 5 words
            pred = model(torch.tensor(w).long().to(device))
            print("{} -> ".format(id_to_word[w]), end="\t")
            for i, each in enumerate((-pred).argsort()[:10]):
                print("{}:{}".format(i, id_to_word[int(each)]), end=" ")
            print()
# ----------------------------
# Dimensionality-reduction visualization
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
result = model(torch.tensor([i for i in range(word_size)]).long().to(device))
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
embed_two = tsne.fit_transform(model.embedding.weight.cpu().detach().numpy())
# Reduce the word vectors to 2-D to inspect their spatial distribution
# embed_two = tsne.fit_transform(result.cpu().detach().numpy())
labels = [id_to_word[i] for i in range(200)]
# Only plot the distribution of the first 200 words here
plt.figure(figsize=(15, 12))
for i, label in enumerate(labels):
    x, y = embed_two[i, :]
    plt.scatter(x, y)
    plt.annotate(label, (x, y), ha='center', va='top')
# plt.savefig('word_vectors_tsne.png')
plt.show()
References:
https://blog.csdn.net/weixin_40759186/article/details/87857361
https://my.oschina.net/earnp/blog/1113897