Paper: Bag of Tricks for Efficient Text Classification
1. Introduction
We evaluate the quality of our approach fastText on two different tasks, namely tag prediction and sentiment analysis.
Two evaluation tasks: tag prediction and sentiment analysis.
2. Model architecture
A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, e.g., a logistic regression or an SVM.
Sentence classification: represent the sentence with a bag of words (BoW), then train a linear classifier.
However, linear classifiers do not share parameters among features and classes. This possibly limits their generalization in the context of large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low rank matrices or to use multilayer neural networks.
Drawback of linear classifiers: they do not share parameters, so generalization suffers when the output space is large.
Solutions: factorize the linear classifier into low-rank matrices, or use a multilayer neural network.
The first weight matrix A is a look-up table over the words.
The word representations are then averaged into a text representation, which is in turn fed to a linear classifier.
The text representation is a hidden variable which can potentially be reused.
The fastText model is similar to CBOW: CBOW averages the word vectors of the context words to predict the center word, while fastText averages the word vectors of the whole document to predict the label.
A softmax is used to compute the class probabilities, and the negative log-likelihood is used as the loss function.
This model is trained asynchronously on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate.
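As a concrete illustration of this architecture, here is a minimal NumPy sketch (not the actual fastText implementation): a look-up table A, the averaged text representation, a linear classifier B, the softmax and the negative log-likelihood. All sizes and ids below are made-up toy values.

```python
import numpy as np

# Toy dimensions (made-up values for illustration only).
vocab_size, embed_dim, num_classes = 10000, 10, 5

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # word look-up table
B = rng.normal(scale=0.1, size=(embed_dim, num_classes))  # linear classifier

def forward(word_ids):
    """Average word embeddings into a text vector, then classify it."""
    text_vec = A[word_ids].mean(axis=0)   # hidden text representation
    logits = text_vec @ B
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over the classes
    return text_vec, probs

def nll_loss(probs, label):
    """Negative log-likelihood of the true class."""
    return -np.log(probs[label])

# Example: a "document" of 6 word ids with a made-up true label 2.
text_vec, probs = forward(np.array([4, 250, 3, 87, 4, 999]))
print(probs, nll_loss(probs, 2))
```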
2.1 Hierarchical softmax
When the number of classes is large, computing the linear classifier becomes expensive, so a hierarchical softmax based on the Huffman coding tree is used to reduce the cost.
Based on Huffman coding: each node is associated with the probability of the path from the root down to that node.
Complexity drops from O(kh) to O(h log2(k)), where k is the number of classes and h the dimension of the text representation.
Another advantage: the hierarchical softmax is also fast at test time when searching for the most likely class.
Each node is associated with a probability that is the probability of the path from the root to that node. If the node is at depth l+1 with parents n_1, ..., n_l, its probability is P(n_{l+1}) = ∏_{i=1}^{l} P(n_i).
The probability of a node is therefore always lower than that of its parent. Exploring the tree with a depth-first search while tracking the maximum probability among the leaves allows branches with small probabilities to be discarded.
This approach is further extended to compute the T-top targets at the cost of O(log(T)), using a binary heap.
The words and word n-grams in the input layer form the feature vector, which is mapped to the hidden layer through a linear transformation; the model is trained by maximizing the likelihood, and a Huffman tree built from the weight of each class and the model parameters serves as the output layer.
fastText also exploits the fact that the classes are imbalanced (some classes appear far more often than others) by using the Huffman algorithm to build the tree that represents the classes. Consequently, frequent classes sit at smaller depths in the tree than infrequent ones, which makes the computation even more efficient.
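The following is a minimal sketch of this idea, assuming made-up class frequencies: a Huffman tree is built over the classes with Python's heapq, and each leaf's probability is the product of the binary decisions along its root-to-leaf path (random scores stand in for the real node weights). It only illustrates the structure, not fastText's actual code.

```python
import heapq
import itertools
import math
import random

# Made-up class frequencies: frequent classes end up closer to the root.
class_freq = {"news": 500, "sports": 300, "tech": 120, "art": 50, "misc": 30}

# Build a Huffman tree: repeatedly merge the two least frequent nodes.
tie = itertools.count()  # tie-breaker so heapq never compares dict nodes
heap = [(f, next(tie), {"label": c}) for c, f in class_freq.items()]
heapq.heapify(heap)
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(tie), {"left": left, "right": right}))
root = heap[0][2]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probs(node, parent_prob=1.0):
    """Yield (label, probability) for every leaf.

    Each internal node makes a binary (sigmoid) decision; a leaf's probability
    is the product of the decisions on its root-to-leaf path, so it can never
    exceed its parent's probability. The decision scores are random stand-ins
    for the dot product of the node weights with the text vector.
    """
    if "label" in node:
        yield node["label"], parent_prob
        return
    p_right = sigmoid(random.uniform(-2, 2))
    yield from leaf_probs(node["left"], parent_prob * (1.0 - p_right))
    yield from leaf_probs(node["right"], parent_prob * p_right)

probs = dict(leaf_probs(root))
print(probs, "sum =", sum(probs.values()))  # leaf probabilities sum to 1
```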
2.2 N-gram features
For a given text, the bag-of-words (BoW) model ignores word order, grammar and syntax: the text is treated simply as a collection of words, and each word's occurrence is independent of whether any other word occurs.
we use a bag of n-grams as additional features to capture some partial information about the local word order.
N-gram features are added to capture local word order; the hashing trick is used to reduce the dimensionality.
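A minimal sketch of the bigram + hashing-trick idea; the bucket count and the use of Python's built-in hash() are illustrative assumptions (the real fastText uses its own stable hash function and bucket size).

```python
# Map word bigrams into a fixed number of hash buckets so the model never
# has to store an explicit (and huge) bigram vocabulary.
NUM_BUCKETS = 2_000_000  # made-up size

def bigram_bucket_ids(tokens, num_buckets=NUM_BUCKETS):
    """Map each bigram to a bucket id; collisions are accepted by design.
    Note: hash() is randomized per process; a real system needs a stable hash."""
    return [hash(w1 + " " + w2) % num_buckets
            for w1, w2 in zip(tokens, tokens[1:])]

tokens = "the movie was surprisingly good".split()
print(bigram_bucket_ids(tokens))  # four bucket ids, one per bigram
```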
3 Experiments
First, we compare it to existing text classifiers on the problem of sentiment analysis.
Then, we evaluate its capacity to scale to large output space on a tag prediction dataset.
① Sentiment analysis
② Tag prediction with a very large output space
3.1 Sentiment analysis
We present the results in Figure 1. We use 10 hidden units and run fastText for 5 epochs with a learning rate selected on a validation set from {0.05, 0.1, 0.25, 0.5}.
On this task, adding bigram information improves the performance by 1-4%. Overall our accuracy is slightly better than char-CNN and char-CRNN, and a bit worse than VDCNN.
Note that we can increase the accuracy slightly by using more n-grams, for example with trigrams.
We tune the hyperparameters on the validation set and observe that using n-grams up to 5 leads to the best performance.
3.2 Tag prediction
To test the scalability of our approach, further evaluation is carried out on the YFCC100M dataset, which consists of almost 100M images with captions, titles and tags. We focus on predicting the tags according to the title and caption (we do not use the images).
We remove the words and tags occurring less than 100 times and split the data into a train, validation and test set.
We consider a frequency-based baseline which predicts the most frequent tag, and we also compare with Tagspace, for which we consider the linear version.
We run fastText for 5 epochs and compare it to Tagspace for two sizes of the hidden layer, i.e., 50 and 200. Both models achieve a similar performance with a small hidden layer, but adding bigrams gives us a significant boost in accuracy.
At test time, Tagspace needs to compute the scores for all the classes which makes it relatively slow, while our fast inference gives a significant speed-up when the number of classes is large (more than 300K here).
Overall, we are more than an order of magnitude faster at obtaining a model with better quality.
4 Discussion and conclusion
Unlike unsupervisedly trained word vectors from word2vec, our word features can be averaged together to form good sentence representations.
In several tasks, fastText obtains performance on par with recently proposed methods inspired by deep learning, while being much faster.
Although deep neural networks have in theory much higher representational power than shallow models, it is not clear if simple text classification problems such as sentiment analysis are the right ones to evaluate them.
The input is a sentence; x1 through xN are the words or n-grams of that sentence. Each of them corresponds to a vector, and averaging these vectors gives the text vector, which is then used to predict the label. When there are only a few classes, this is just the plainest softmax; when the number of labels is huge, hierarchical softmax is needed. Since the paper introduces n-gram vectors in addition to word vectors, and the number of n-grams is very large, the model would have too many parameters; hashing buckets are therefore used, which may map several n-grams to the same vector. This saves a great deal of memory.
Comparison between word2vec and fastText:
word2vec averages the word vectors of the words in a local context window to predict the center word; fastText averages the word vectors of the whole sentence (or document) to predict the label.
word2vec does not use the plain softmax, because there are far too many words to predict; it uses hierarchical softmax or negative sampling instead. fastText uses the plain softmax when the number of labels is small and hierarchical softmax when it is large. fastText does not use negative sampling, because negative sampling does not produce proper probabilities.
Background:
Negative log-likelihood function
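In the paper, the model is trained by minimizing the negative log-likelihood over the N training documents, where x_n is the normalized bag of features of the n-th document, y_n its label, A the look-up matrix, B the classifier weights and f the softmax function:

-\frac{1}{N}\sum_{n=1}^{N} y_n \log\big(f(B A x_n)\big)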
Code:
After embedding each word in the sentence, the word representations are averaged into a text representation, which is in turn fed to a linear classifier.
It uses a softmax function to compute the probability distribution over the predefined classes.
Then cross-entropy is used to compute the loss.
The bag-of-words representation does not consider word order.
To take word order into account, n-gram features are used to capture some partial information about the local word order.
When the number of classes is large, computing the linear classifier is computationally expensive, so hierarchical softmax is used to speed up training.
    use bi-grams and/or tri-grams
    use NCE loss to speed up the softmax computation (not the hierarchical softmax of the original paper)
Training the model (a minimal sketch follows the steps below):
1. load data (X: list of int, y: int)
2. create session
3. feed data
4. training
(5. validation)
(6. prediction)
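A minimal sketch of these steps in the TensorFlow 1.x session style the list implies. All names, shapes and hyperparameters are made-up placeholders; a plain softmax cross-entropy is used instead of the NCE loss mentioned above, and the learning rate is kept fixed rather than linearly decayed, to keep the sketch short.

```python
import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # run in graph/session mode

VOCAB, SEQ_LEN, EMBED, CLASSES = 10000, 20, 10, 5  # made-up sizes

# 1. load data: random ids stand in for the real padded word-id matrix.
X_train = np.random.randint(0, VOCAB, size=(512, SEQ_LEN)).astype(np.int32)
y_train = np.random.randint(0, CLASSES, size=(512,)).astype(np.int32)

# Graph: embed -> average -> linear classifier -> softmax cross-entropy.
x_ph = tf.compat.v1.placeholder(tf.int32, [None, SEQ_LEN])
y_ph = tf.compat.v1.placeholder(tf.int32, [None])
embeddings = tf.compat.v1.get_variable("A", [VOCAB, EMBED])
W = tf.compat.v1.get_variable("B", [EMBED, CLASSES])
text_vec = tf.reduce_mean(tf.nn.embedding_lookup(embeddings, x_ph), axis=1)
logits = tf.matmul(text_vec, W)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_ph, logits=logits))
train_op = tf.compat.v1.train.GradientDescentOptimizer(0.5).minimize(loss)

# 2. create session / 3. feed data / 4. training
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    for epoch in range(5):
        for i in range(0, len(X_train), 64):
            batch = {x_ph: X_train[i:i + 64], y_ph: y_train[i:i + 64]}
            _, batch_loss = sess.run([train_op, loss], feed_dict=batch)
        print("epoch", epoch, "loss", batch_loss)
    # (5./6.) validation and prediction would run sess.run(logits, ...) on held-out data.
```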