Introduction to Word2vec
Word2Vec is a word-vector model proposed by Mikolov et al. at Google.
- Input: a large corpus of tokenized text
- Output: a dense vector representing each word
The significance of word vectors is that they turn natural language into vectors a computer can work with. Compared with bag-of-words and TF-IDF models, word vectors capture a word's context and semantics and can measure similarity between words, which makes them valuable in many NLP tasks such as text classification and sentiment analysis.
詞向量經(jīng)典例子:![][01]
[01]:http://latex.codecogs.com/png.latex?\vec{man}-\vec{woman}\approx\vec{king}-\vec{queen}
gensim已經(jīng)用python封裝好了word2vec的實(shí)現(xiàn),有語料的話可以直接訓(xùn)練了溺蕉,參考中英文維基百科語料上的Word2Vec實(shí)驗
會使用gensim訓(xùn)練詞向量伶丐,并不表示真的掌握了word2vec,只表示會讀文檔會調(diào)接口而已焙贷。
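For illustration, a minimal gensim training call might look like the sketch below (assuming gensim 4.x, where the dimension argument is vector_size; corpus.txt is a hypothetical file with one tokenized sentence per line):
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
sentences = LineSentence('corpus.txt')            # hypothetical corpus: one tokenized sentence per line
model = Word2Vec(sentences, vector_size=300, window=5,
                 min_count=5, sg=1, negative=5)   # sg=1: skip-gram with negative sampling
print(model.wv.most_similar('king'))              # nearest neighbours in the vector space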
Word2vec詳細(xì)實(shí)現(xiàn)
word2vec的詳細(xì)實(shí)現(xiàn)撵割,簡而言之贿堰,就是一個三層的神經(jīng)網(wǎng)絡(luò)辙芍。要理解word2vec的實(shí)現(xiàn),需要的預(yù)備知識是神經(jīng)網(wǎng)絡(luò)和Logistic Regression羹与。
神經(jīng)網(wǎng)絡(luò)結(jié)構(gòu)
上圖是Word2vec的簡要流程圖故硅。首先假設(shè),詞庫里的詞數(shù)為10000; 詞向量的長度為300(根據(jù)斯坦福CS224d的講解纵搁,詞向量一般為25-1000維吃衅,300維是一個好的選擇)。下面以單個訓(xùn)練樣本為例腾誉,依次介紹每個部分的含義徘层。
- Input layer: the input is the one-hot representation of a word, a vector of length 10,000. Suppose the word is ants and its ID in the vocabulary is i; then the i-th component of the input vector is 1 and all others are 0.
[0, 0, ..., 0, 0, 1, 0, 0, ..., 0, 0]
- Hidden layer: the number of hidden units equals the word-vector length. The hidden-layer parameters form a [10000, 300] matrix, and this parameter matrix is exactly the word vectors. Recall matrix multiplication: a one-hot row vector with a 1 in position i, multiplied by a matrix, simply picks out the i-th row of that matrix. So the hidden layer maps the 10,000-dimensional one-hot vector to the 300-dimensional word vector we ultimately want.
Matrix multiplication
- Output layer: the output layer has one unit per vocabulary word (10,000), with a [300, 10000] parameter matrix. The word vector is multiplied by this matrix and normalized with a softmax, producing a 10,000-dimensional vector in which each component is the probability that the corresponding vocabulary word appears in the context of the input word (here, ants).
Output layer
The figure above computes the probability that car co-occurs with ants; the 300-dimensional vector for car is one column of the output-layer parameter matrix. Since that matrix is [300, 10000], the output layer computes the co-occurrence probability of every vocabulary word with ants. After training, the output-layer parameters are no longer used.
- Training: a training sample (x, y) has both an input and an output; we know which word actually co-occurs with ants, so y is also a 10,000-dimensional vector. As in logistic regression, the loss is the cross-entropy between the network's final output vector and y, and it is minimized with stochastic gradient descent.
Cross-entropy
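To make the three layers concrete, here is a scaled-down numpy sketch of one forward pass and its cross-entropy loss (toy sizes of 10 words and 4 dimensions instead of 10,000 and 300; the variable names are illustrative and not part of the notebook code below):
import numpy as np
vocab_size, embed_dim = 10, 4                      # toy sizes; the text assumes 10000 and 300
W_in = np.random.randn(vocab_size, embed_dim)      # hidden-layer parameters = the word vectors
W_out = np.random.randn(embed_dim, vocab_size)     # output-layer parameters
x = np.zeros(vocab_size); x[3] = 1.0               # one-hot input for the word with ID 3
hidden = x @ W_in                                  # equals W_in[3], that word's vector
scores = hidden @ W_out                            # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                               # softmax over the vocabulary
y = np.zeros(vocab_size); y[7] = 1.0               # one-hot vector of the word actually seen in context
loss = -np.sum(y * np.log(probs))                  # cross-entropy loss
print(loss)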
The steps above cover the case of one word as input and one context word as output, but reality is more involved. What exactly is the context? Do we use one word to predict the words around it, or many surrounding words to predict a single word? This is where the two training models, skip-gram and CBOW, come in.
Skip-gram and CBOW
- skip-gram: the core idea is to predict the surrounding words from the center word. Suppose the center word is cat and the window size is 2; cat is then used to predict the two words to its left and the two to its right. Here cat is the network's input and each predicted word is a label. The figure below shows an example:
skip-gram
With a window size of 2, the center word slides over the whole text one position at a time. Each position of the center word produces at most 4 training pairs (input, label).
- CBOW (continuous bag-of-words): if you understand skip-gram, CBOW is simply the reverse: all the surrounding words are used to predict the center word. Now each position of the center word yields only one training sample. With the same example as above, CBOW produces the following 4 training samples:
- ([quick, brown], the)
- ([the, brown, fox], quick)
- ([the, quick, fox, jumps], brown)
- ([quick, brown, jumps, over], fox)
Now the input may be 4 words while the label is a single word. What do we do? Simply average: after the hidden layer, the 4 input words are mapped to four 300-dimensional vectors, and their average becomes the input to the next layer.
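As a plain-Python illustration (a toy sketch, separate from the notebook's generate_batch further down), generating skip-gram and CBOW training pairs with a window of 2 could be written as:
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
skip_gram_pairs = []   # (input = center word, label = one context word)
cbow_pairs = []        # (input = all context words, label = center word)
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    skip_gram_pairs.extend((center, c) for c in context)
    cbow_pairs.append((context, center))
print(skip_gram_pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
print(cbow_pairs[1])        # (['the', 'brown', 'fox'], 'quick')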
Comparing the two, skip-gram generates more training samples and captures finer semantic detail between words, so with a sufficiently large and clean corpus it tends to outperform CBOW. With a small corpus, where such detail is hard to capture anyway, CBOW's averaging can actually work better.
負(fù)采樣(Negative Sampling)
實(shí)際訓(xùn)練時,還是假設(shè)詞庫有10000個詞桨醋,詞向量300維棚瘟,那么每一層神經(jīng)網(wǎng)絡(luò)的參數(shù)是300萬個,輸出層相當(dāng)于有一萬個可能類的多分類問題喜最≠苏海可以想象,這樣的計算量非常非常非常大瞬内。
作者M(jìn)ikolov等人提出了許多優(yōu)化的方法迷雪,在這里著重講一下負(fù)采樣。
負(fù)采樣的思想非常簡單虫蝶,簡單地令人發(fā)指:我們知道最終神經(jīng)網(wǎng)絡(luò)經(jīng)過softmax輸出一個向量章咧,只有一個概率最大的對應(yīng)正確的單詞,其余的稱為negative sample能真。現(xiàn)在只選擇5個negative sample赁严,所以輸出向量就只是一個6維的向量扰柠。要考慮的參數(shù)不是300萬個,而減少到了1800個疼约! 這樣做看上去很偷懶卤档,實(shí)際效果卻很好,大大提升了運(yùn)算效率程剥。
Recall that every training step makes a small adjustment to the network's parameters. In word2vec, a single training sample does not touch all of them: if the input word is cat, then even though the hidden layer has 3,000,000 parameters, this step updates only the 300 parameters corresponding to cat, because the hidden-layer output depends only on those 300.
Negative sampling works well, and we do not need many negative samples. In the paper, Mikolov et al. suggest 5-20 negative samples for small datasets and 2-5 for large ones.
How exactly are the negative-sample words chosen? The paper gives the following formula:
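![][02]
[02]:http://latex.codecogs.com/png.latex?P(w_i)=\frac{f(w_i)^{3/4}}{\sum_{j=0}^{n}f(w_j)^{3/4}}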
where f(w) is the word frequency. As you can see, the choice of negative samples depends only on word frequency: the more frequent a word, the more likely it is to be chosen.
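For intuition only, here is a toy numpy sketch (with made-up counts, not part of the notebook code below) of drawing negative samples from this distribution:
import numpy as np
counts = np.array([100.0, 50.0, 10.0, 5.0, 1.0])            # toy word frequencies f(w)
probs = counts ** 0.75                                       # raise to the 3/4 power
probs /= probs.sum()                                         # normalize to get P(w_i)
negatives = np.random.choice(len(counts), size=5, p=probs)   # draw 5 negative word IDs
print(negatives)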
Tensorflow實(shí)現(xiàn)
最后用tensorflow動手實(shí)踐一下。參考Udacity Deep Learning的一次作業(yè)
這里只是訓(xùn)練了128維的詞向量臼节,并通過TSNE的方法可視化撬陵。作為練手和深入理解word2vec不錯,實(shí)戰(zhàn)還是推薦gensim网缝。
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE
Download the data from the source website if necessary.
url = 'http://mattmahoney.net/dc/'
def maybe_download(filename, expected_bytes):
"""Download a file if not present, and make sure it's the right size."""
if not os.path.exists(filename):
filename, _ = urlretrieve(url + filename, filename)
statinfo = os.stat(filename)
if statinfo.st_size == expected_bytes:
print('Found and verified %s' % filename)
else:
print(statinfo.st_size)
raise Exception(
'Failed to verify ' + filename + '. Can you get to it with a browser?')
return filename
filename = maybe_download('text8.zip', 31344016)
Found and verified text8.zip
Read the data into a string.
def read_data(filename):
"""Extract the first file enclosed in a zip file as a list of words"""
with zipfile.ZipFile(filename) as f:
data = tf.compat.as_str(f.read(f.namelist()[0])).split()
return data
words = read_data(filename)
print('Data size %d' % len(words))
Data size 17005207
Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000
def build_dataset(words):
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
dictionary = dict()
for word, _ in count:
dictionary[word] = len(dictionary)
data = list()
unk_count = 0
for word in words:
if word in dictionary:
index = dictionary[word]
else:
index = 0 # dictionary['UNK']
unk_count = unk_count + 1
data.append(index)
count[0][1] = unk_count
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
return data, count, dictionary, reverse_dictionary
data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words # Hint to reduce memory.
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
Function to generate a training batch for the skip-gram model.
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
global data_index
assert batch_size % num_skips == 0
assert num_skips <= 2 * skip_window
batch = np.ndarray(shape=(batch_size), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
span = 2 * skip_window + 1 # [ skip_window target skip_window ]
buffer = collections.deque(maxlen=span)
for _ in range(span):
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
for i in range(batch_size // num_skips):
target = skip_window # target label at the center of the buffer
targets_to_avoid = [ skip_window ]
for j in range(num_skips):
while target in targets_to_avoid:
target = random.randint(0, span - 1)
targets_to_avoid.append(target)
batch[i * num_skips + j] = buffer[skip_window]
labels[i * num_skips + j, 0] = buffer[target]
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
return batch, labels
print('data:', [reverse_dictionary[di] for di in data[:8]])
for num_skips, skip_window in [(2, 1), (4, 2)]:
data_index = 0
batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
print(' batch:', [reverse_dictionary[bi] for bi in batch])
print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']
with num_skips = 2 and skip_window = 1:
batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term']
labels: ['anarchism', 'as', 'originated', 'a', 'as', 'term', 'a', 'of']
with num_skips = 4 and skip_window = 2:
batch: ['as', 'as', 'as', 'as', 'a', 'a', 'a', 'a']
labels: ['originated', 'term', 'anarchism', 'a', 'of', 'as', 'originated', 'term']
Skip-Gram
Train a skip-gram model.
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
#######important#########
num_sampled = 64 # Number of negative examples to sample.
graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
# Input data.
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
# Variables.
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
softmax_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Model.
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))
# Optimizer.
# Note: The optimizer will optimize the softmax_weights AND the embeddings.
# This is because the embeddings are defined as a variable quantity and the
# optimizer's `minimize` method will by default modify all variable quantities
# that contribute to the tensor it is passed.
# See docs on `tf.train.Optimizer.minimize()` for more details.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001
with tf.Session(graph=graph) as session:
tf.global_variables_initializer().run()
print('Initialized')
average_loss = 0
for step in range(num_steps):
batch_data, batch_labels = generate_batch(
batch_size, num_skips, skip_window)
feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
_, l = session.run([optimizer, loss], feed_dict=feed_dict)
average_loss += l
if step % 2000 == 0:
if step > 0:
average_loss = average_loss / 2000
# The average loss is an estimate of the loss over the last 2000 batches.
print('Average loss at step %d: %f' % (step, average_loss))
average_loss = 0
# note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
sim = similarity.eval()
for i in range(valid_size):
valid_word = reverse_dictionary[valid_examples[i]]
top_k = 8 # number of nearest neighbors
nearest = (-sim[i, :]).argsort()[1:top_k+1]
log = 'Nearest to %s:' % valid_word
for k in range(top_k):
close_word = reverse_dictionary[nearest[k]]
log = '%s %s,' % (log, close_word)
print(log)
final_embeddings = normalized_embeddings.eval()
num_points = 400
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])
def plot(embeddings, labels):
assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
pylab.figure(figsize=(15,15)) # in inches
for i, label in enumerate(labels):
x, y = embeddings[i,:]
pylab.scatter(x, y)
pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
ha='right', va='bottom')
pylab.show()
words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)
CBOW
data_index_cbow = 0
def get_cbow_batch(batch_size, num_skips, skip_window):
global data_index_cbow
assert batch_size % num_skips == 0
assert num_skips <= 2 * skip_window
batch = np.ndarray(shape=(batch_size), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
span = 2 * skip_window + 1 # [ skip_window target skip_window ]
buffer = collections.deque(maxlen=span)
for _ in range(span):
buffer.append(data[data_index_cbow])
data_index_cbow = (data_index_cbow + 1) % len(data)
for i in range(batch_size // num_skips):
target = skip_window # target label at the center of the buffer
targets_to_avoid = [ skip_window ]
for j in range(num_skips):
while target in targets_to_avoid:
target = random.randint(0, span - 1)
targets_to_avoid.append(target)
batch[i * num_skips + j] = buffer[skip_window]
labels[i * num_skips + j, 0] = buffer[target]
buffer.append(data[data_index_cbow])
data_index_cbow = (data_index_cbow + 1) % len(data)
cbow_batch = np.ndarray(shape=(batch_size), dtype=np.int32)
cbow_labels = np.ndarray(shape=(batch_size // (skip_window * 2), 1), dtype=np.int32)
for i in range(batch_size):
cbow_batch[i] = labels[i]
cbow_batch = np.reshape(cbow_batch, [batch_size // (skip_window * 2), skip_window * 2])
for i in range(batch_size // (skip_window * 2)):
# center word
cbow_labels[i] = batch[2 * skip_window * i]
return cbow_batch, cbow_labels
# actual batch_size = batch_size // (2 * skip_window)
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
#######important#########
num_sampled = 64 # Number of negative examples to sample.
graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
# Input data.
train_dataset = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), skip_window * 2])
train_labels = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
# Variables.
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
softmax_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Model.
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# reshape embed
embed = tf.reshape(embed, (skip_window * 2, batch_size // (skip_window * 2), embedding_size))
# average embed
embed = tf.reduce_mean(embed, 0)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))
# Optimizer.
# Note: The optimizer will optimize the softmax_weights AND the embeddings.
# This is because the embeddings are defined as a variable quantity and the
# optimizer's `minimize` method will by default modify all variable quantities
# that contribute to the tensor it is passed.
# See docs on `tf.train.Optimizer.minimize()` for more details.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001
with tf.Session(graph=graph) as session:
tf.global_variables_initializer().run()
print('Initialized')
average_loss = 0
for step in range(num_steps):
batch_data, batch_labels = get_cbow_batch(
batch_size, num_skips, skip_window)
feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
_, l = session.run([optimizer, loss], feed_dict=feed_dict)
average_loss += l
if step % 2000 == 0:
if step > 0:
average_loss = average_loss / 2000
# The average loss is an estimate of the loss over the last 2000 batches.
print('Average loss at step %d: %f' % (step, average_loss))
average_loss = 0
# note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
sim = similarity.eval()
for i in range(valid_size):
valid_word = reverse_dictionary[valid_examples[i]]
top_k = 8 # number of nearest neighbors
nearest = (-sim[i, :]).argsort()[1:top_k+1]
log = 'Nearest to %s:' % valid_word
for k in range(top_k):
close_word = reverse_dictionary[nearest[k]]
log = '%s %s,' % (log, close_word)
print(log)
final_embeddings = normalized_embeddings.eval()
num_points = 400
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])
words = [reverse_dictionary[i] for i in range(1, num_points+1)]  # label the same word IDs that were embedded above
plot(two_d_embeddings, words)
References
- Le Q. V., Mikolov T. Distributed Representations of Sentences and Documents. 2014, 4: II-1188.
- Mikolov T., Sutskever I., Chen K., et al. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 2013, 26: 3111-3119.
- Word2Vec Tutorial - The Skip-Gram Model
- Udacity Deep Learning
- Stanford CS224d, Lectures 2 & 3