I. Two Important Papers
(1) Efficient Estimation of Word Representations in Vector Space
This paper mainly describes the principles, so these notes focus on it.
(2) Distributed Representations of Words and Phrases and their Compositionality
This paper mainly covers the mathematical details.
II. Structure of the Paper
0. Abstract: proposes two new architectures for computing word vectors efficiently, and validates them on a word similarity task.
1. Introduction: background on word vectors; goals of this paper; previous work.
2. Model Architectures: LSA/LDA; feedforward neural networks; recurrent neural networks; parallel training of networks.
3. New Log-linear Models: introduces the two new model architectures, CBOW and Skip-gram.
4. Results: description of the evaluation task; maximizing accuracy; comparison of model architectures; parallel training of models on large amounts of data; the Microsoft Research Sentence Completion Challenge.
5. Examples of the Learned Relationships: examples of the word-to-word relationships the models learn.
6. Conclusion: high-quality word vectors; an efficient training scheme; using them as pre-trained word vectors improves other NLP tasks.
7. Follow-Up Work: single-machine C++ code.
III. Detailed Walkthrough of the Paper
0. Abstract
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Summary:
1. Proposes two novel model architectures for computing word vectors.
2. Uses a word similarity task (syntactic and semantic) to evaluate and compare the quality of the word vectors.
3. Greatly reduces the computational cost while improving the quality of the word vectors.
4. Achieves state-of-the-art results on the semantic and syntactic tasks.
1. Introduction
Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. An example is the popular N-gram model used for statistical language modeling - today, it is possible to train N-grams on virtually all available data (trillions of words [3]).
However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.
With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data set, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].
1.1 Goals of the Paper
The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred of millions of words, with a modest dimensionality of the word vectors between 50 - 100.
We use recently proposed techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity [20]. This has been observed earlier in the context of inflectional languages - for example, nouns can have multiple word endings, and if we search for similar words in a subspace of the original vector space, it is possible to find words that have similar endings [13, 14].
Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector(”King”) - vector(”Man”) + vector(”Woman”) results in a vector that is closest to the vector representation of the word Queen [20].
In this paper, we try to maximize accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words. We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy. Moreover, we discuss how training time and accuracy depends on the dimensionality of the word vectors and on the amount of the training data.
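To make the word-offset technique above concrete, here is a minimal sketch of the analogy computation using cosine similarity. The four vectors are hand-made toy values (not learned embeddings), chosen only so that the example reproduces the King - Man + Woman ≈ Queen behaviour:

```python
import numpy as np

# Toy, hand-made vectors purely for illustration; real word2vec vectors are
# learned from data and typically have tens to hundreds of dimensions.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vocab):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the query words."""
    target = vocab[a] - vocab[b] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("king", "man", "woman", vectors))  # prints: queen
```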
1.2 Previous Work
Representation of words as continuous vectors has a long history [10, 26, 8]. A very popular model architecture for estimating neural network language model (NNLM) was proposed in [1], where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others.
Another interesting architecture of NNLM was presented in [13, 14], where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.
It was later shown that the word vectors can be used to significantly improve and simplify many NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different model architectures and trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word vectors were made available for future research and comparison. However, as far as we know, these architectures were significantly more computationally expensive for training than the one proposed in [13], with the exception of certain version of log-bilinear model where diagonal weight matrices are used [23].
2 Model Architectures
Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words [20, 31]; LDA moreover becomes computationally very expensive on large data sets.
Similar to [18], to compare different model architectures we define first the computational complexity of a model as the number of parameters that need to be accessed to fully train the model. Next, we will try to maximize the accuracy, while minimizing the computational complexity.
For all the following models, the training complexity is proportional to
O = E × T × Q, (1)
where E is number of the training epochs, T is the number of the words in the training set and Q is defined further for each model architecture. Common choice is E = 3 - 50 and T up to one billion. All models are trained using stochastic gradient descent and backpropagation [26].
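As a quick illustration of equation (1), the sketch below simply multiplies out O = E × T × Q for some assumed values; the numbers are placeholders within the ranges mentioned above, not results from the paper:

```python
# Back-of-the-envelope use of equation (1): O = E * T * Q.
# All values below are illustrative assumptions.
def training_cost(E, T, Q):
    """Total training complexity is proportional to epochs * training words * per-example cost."""
    return E * T * Q

E = 3              # number of training epochs (common choice: 3-50)
T = 1_000_000_000  # number of words in the training set (up to one billion)
Q = 1_000          # per-example term; defined per architecture in the next subsections

print(f"O is proportional to {training_cost(E, T, Q):.2e}")
```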
2.1 Feedforward Neural Net Language Model (NNLM)
The probabilistic feedforward neural network language model has been proposed in [1]. It consists of input, projection, hidden and output layers. At the input layer, N previous words are encoded using 1-of-V coding, where V is size of the vocabulary. The input layer is then projected to a projection layer P that has dimensionality N × D, using a shared projection matrix. As only N inputs are active at any given time, composition of the projection layer is a relatively cheap operation.
The NNLM architecture becomes complex for computation between the projection and the hidden layer, as values in the projection layer are dense. For a common choice of N = 10, the size of the projection layer (P) might be 500 to 2000, while the hidden layer size H is typically 500 to 1000 units. Moreover, the hidden layer is used to compute probability distribution over all the words in the vocabulary, resulting in an output layer with dimensionality V . Thus, the computational complexity per each training example is
Q = N × D + N × D × H + H × V, (2)
where the dominating term is H × V. However, several practical solutions were proposed for avoiding it; either using hierarchical versions of the softmax [25, 23, 18], or avoiding normalized models completely by using models that are not normalized during training [4, 9]. With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around log2(V). Thus, most of the complexity is caused by the term N × D × H.
In our models, we use hierarchical softmax where the vocabulary is represented as a Huffman binary tree. This follows previous observations that the frequency of words works well for obtaining classes in neural net language models [16]. Huffman trees assign short binary codes to frequent words, and this further reduces the number of output units that need to be evaluated: while balanced binary tree would require log2(V) outputs to be evaluated, the Huffman tree based hierarchical softmax requires only about log2(Unigram perplexity(V)). For example when the vocabulary size is one million words, this results in about two times speedup in evaluation. While this is not crucial speedup for neural network LMs as the computational bottleneck is in the N × D × H term, we will later propose architectures that do not have hidden layers and thus depend heavily on the efficiency of the softmax normalization.
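To see why a Huffman tree helps, the following sketch builds Huffman codes for a toy vocabulary; both the words and their counts are made-up assumptions, but the output shows the key property the paper relies on: frequent words get shorter codes, i.e. shorter paths from the root of the hierarchical softmax tree.

```python
import heapq
import itertools

# Toy vocabulary with assumed frequencies (illustrative only).
freqs = {"the": 500, "of": 300, "king": 20, "queen": 15, "zygote": 1}

def huffman_codes(frequencies):
    """Build Huffman codes; more frequent words receive shorter binary codes."""
    counter = itertools.count()  # tie-breaker so heapq never compares the dicts
    heap = [(f, next(counter), {w: ""}) for w, f in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

for word, code in sorted(huffman_codes(freqs).items(), key=lambda wc: len(wc[1])):
    print(f"{word:>7}: {code}")   # e.g. "the" gets 1 bit, "zygote" gets 4 bits
```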
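To see which term dominates equation (2), and what the hierarchical softmax buys, the sketch below plugs in values chosen from the ranges given in this subsection (N = 10, a projection layer of N × D = 1000 units, H = 500, V = 1,000,000; these specific numbers are assumptions, not the paper's experimental settings):

```python
import math

# Assumed values within the ranges of section 2.1: N previous words, word
# dimensionality D, hidden size H, vocabulary size V.
N, D, H, V = 10, 100, 500, 1_000_000

projection   = N * D                          # input -> projection layer
hidden       = N * D * H                      # projection -> hidden layer
full_softmax = H * V                          # hidden -> output over the full vocabulary
hier_softmax = H * math.ceil(math.log2(V))    # ~H * log2(V) with a binary output tree

print("Q with full softmax:        ", projection + hidden + full_softmax)
print("Q with hierarchical softmax:", projection + hidden + hier_softmax)
# With the full softmax the H*V term dominates; once the output is organised as
# a binary tree, the N*D*H term becomes the main cost, as the text argues.
```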
2.2 Recurrent Neural Net Language Model (RNNLM)
Recurrent neural network based language model has been proposed to overcome certain limitations of the feedforward NNLM, such as the need to specify the context length (the order of the model N), and because theoretically RNNs can efficiently represent more complex patterns than the shallow neural networks [15, 2]. The RNN model does not have a projection layer; only input, hidden and output layer. What is special for this type of model is the recurrent matrix that connects hidden layer to itself, using time-delayed connections. This allows the recurrent model to form some kind of short term memory, as information from the past can be represented by the hidden layer state that gets updated based on the current input and the state of the hidden layer in the previous time step.
The complexity per training example of the RNN model is
Q = H × H + H × V, (3)
where the word representations D have the same dimensionality as the hidden layer H. Again, the term H × V can be efficiently reduced to H × log2(V ) by using hierarchical softmax. Most of the complexity then comes from H × H.
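For readers who want the recurrence spelled out, here is a minimal numpy sketch of one RNNLM time step; the dimensions, random initialisation and tanh nonlinearity are assumptions, and the point is only that the H × H recurrent matrix and the H × V output matrix correspond to the two terms of equation (3):

```python
import numpy as np

H, V = 100, 10_000                     # hidden size and vocabulary size (illustrative)
rng = np.random.default_rng(0)
W_hh = rng.normal(0, 0.01, (H, H))     # recurrent matrix: the H x H term in eq. (3)
W_xh = rng.normal(0, 0.01, (H, V))     # column i acts as the H-dimensional vector of word i
W_hv = rng.normal(0, 0.01, (V, H))     # hidden -> output over the vocabulary: the H x V term

def step(h_prev, word_idx):
    """One time step: mix the current word with the previous hidden state."""
    x = np.zeros(V)
    x[word_idx] = 1.0                  # 1-of-V coding of the current word
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    scores = W_hv @ h                  # unnormalised scores for the next word
    probs = np.exp(scores - scores.max())
    return h, probs / probs.sum()      # softmax distribution over the vocabulary

h = np.zeros(H)
h, p = step(h, word_idx=42)
print(p.shape)                         # (10000,)
```

In practice the H × V output term is again reduced to roughly H × log2(V) with the hierarchical softmax described in section 2.1.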