TextCNN Explained

Paper: Convolutional Neural Networks for Sentence Classification (Kim, 2014)

Top NLP conferences: ACL, EMNLP, NAACL (this paper was published at EMNLP).

1. Paper Overview:

Abstract: Uses a convolutional neural network for sentence-level text classification and achieves strong results on multiple datasets.

Introduction: By combining pre-trained word vectors with a convolutional neural network, the paper proposes a simple and effective text classification model.

Model: The TextCNN architecture and its regularization.

Datasets and Experimental Setup: Description of the datasets, the hyperparameter settings, and the experimental results.

Results and Discussion: Experimental analysis, including a discussion of the number of channels and of how the word vectors are used.

Conclusion: Summary of the paper.

2. Objectives

(1) TextCNN

Convolution layer

Pooling layer

(2) Reducing overfitting

Regularization

Dropout

(3) Hyperparameter selection

Word vector configuration

Filter size

Number of filters

Activation function

Regularization

(4) Code implementation (see the sketch below)
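
The original notes list "code implementation" but include no code, so here is a minimal PyTorch sketch of the architecture as described in these notes (filter windows of 3, 4, 5 with 100 feature maps each, 1-max pooling, dropout 0.5, then a linear output layer). Class names, default arguments, and the vocabulary/class sizes in the usage comment are illustrative placeholders, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Minimal TextCNN: embedding -> 1-D convolutions -> 1-max pooling -> dropout -> linear."""
    def __init__(self, vocab_size, embed_dim=300, num_classes=2,
                 filter_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv2d per window size h; each filter spans h words by the full embedding width.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, num_filters, kernel_size=(h, embed_dim)) for h in filter_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)                  # (batch, seq_len, embed_dim)
        x = x.unsqueeze(1)                             # (batch, 1, seq_len, embed_dim)
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x)).squeeze(3)             # (batch, num_filters, seq_len - h + 1)
            p = F.max_pool1d(c, c.size(2)).squeeze(2)  # 1-max pooling -> (batch, num_filters)
            pooled.append(p)
        out = torch.cat(pooled, dim=1)                 # (batch, num_filters * len(filter_sizes))
        return self.fc(self.dropout(out))              # unnormalized class scores

# Example usage with placeholder sizes:
# model = TextCNN(vocab_size=20000)
# logits = model(torch.randint(0, 20000, (50, 40)))   # mini-batch of 50 sentences, 40 tokens each
```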

3. Paper Walkthrough

The development of deep learning

The development of word vectors

The development of CNNs

(1) Introduction

The development of word vectors: Deep learning models have achieved remarkable results in computer vision (Krizhevsky et al., 2012) and speech recognition (Graves et al., 2013) in recent years. Within natural language processing, much of the work with deep learning methods has involved learning word vector representations through neural language models (Bengio et al., 2003; Yih et al., 2011; Mikolov et al., 2013) and performing composition over the learned word vectors for classification (Collobert et al., 2011). Word vectors, wherein words are projected from a sparse, 1-of-V encoding (here V is the vocabulary size) onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions. In such dense representations, semantically close words are likewise close—in euclidean or cosine distance—in the lower dimensional vector space.

The development of CNNs: Convolutional neural networks (CNN) utilize layers with convolving filters that are applied to local features (LeCun et al., 1998). Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing (Yih et al., 2014), search query retrieval (Shen et al., 2014), sentence modeling (Kalchbrenner et al., 2014), and other traditional NLP tasks (Collobert et al., 2011).

In the present work, we train a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model. These vectors were trained by Mikolov et al. (2013) on 100 billion words of Google News (the source of the word vectors), and are publicly available. We initially keep the word vectors static and learn only the other parameters of the model. Despite little tuning of hyperparameters, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are ‘universal’ feature extractors that can be utilized for various classification tasks (i.e., pre-trained word vectors can be reused across tasks). Learning task-specific vectors through fine-tuning results in further improvements. We finally describe a simple modification to the architecture to allow for the use of both pre-trained and task-specific vectors by having multiple channels (mixing both kinds of word vectors).

Our work is philosophically similar to Razavian et al. (2014) which showed that for image classification, feature extractors obtained from a pretrained deep learning model perform well on a variety of tasks—including tasks that are very different from the original task for which the feature extractors were trained.

Fine-tuning a simple CNN model on top of pre-trained word vectors is enough to achieve strong results on text classification tasks.

Task-specific word vectors obtained by fine-tuning the pre-trained vectors give even better results.

The paper also proposes a model that uses both static pre-trained word vectors and task-specific (fine-tuned) word vectors.

Finally, the model achieves the best classification accuracy on four of seven text classification tasks.

(2) Model

Related paper: A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification (covered in Section 4 below).

2.1 Regularization (regularization in TextCNN)

1. Dropout: during training, each neuron is dropped (its output set to zero) with a fixed probability, which improves the model's ability to generalize.
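
As a small illustration (my own, not from the paper), this is how dropout behaves in PyTorch: active during training, disabled at evaluation time.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)     # each element is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()                 # training mode: roughly half the activations are dropped,
print(drop(x))               # surviving values are scaled by 1 / (1 - p) = 2.0

drop.eval()                  # evaluation mode: dropout is a no-op
print(drop(x))               # prints the input unchanged
```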

(3) Datasets and Experimental Setup

MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005).

SST-1: Stanford Sentiment Treebank—an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).

SST-2: Same as SST-1 but with neutral reviews removed and binary labels.

Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004).

TREC: TREC question dataset—task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002).

CR: Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004).

MPQA: Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).

3.1 Hyperparameters and Training

Parameter settings:

filter windows (h): 3, 4, 5, with 100 feature maps each

dropout rate (p): 0.5

l2 norm constraint (s): 3

mini-batch size: 50

Source: chosen via grid search on the SST-2 dev set.
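
A sketch of one way these settings might be wired up in PyTorch. The `model`, `train_loader`, and loss are assumed to exist (e.g., the TextCNN sketch earlier), and rescaling the final layer's weight vectors after each update is one common way to implement the l2 norm constraint s = 3; this is an illustration, not the authors' code.

```python
import torch

def clip_weight_norm(linear, max_norm=3.0):
    """Rescale each output neuron's weight vector so its L2 norm stays <= max_norm."""
    with torch.no_grad():
        norms = linear.weight.norm(dim=1, keepdim=True)   # (num_classes, 1)
        scale = (max_norm / norms).clamp(max=1.0)         # only shrink, never grow
        linear.weight.mul_(scale)

# model = TextCNN(vocab_size=20000)                # hypothetical model from the earlier sketch
# optimizer = torch.optim.Adadelta(model.parameters())
# loss_fn = torch.nn.CrossEntropyLoss()
# for token_ids, labels in train_loader:           # mini-batches of size 50 (loader assumed)
#     optimizer.zero_grad()
#     loss = loss_fn(model(token_ids), labels)
#     loss.backward()
#     optimizer.step()
#     clip_weight_norm(model.fc, max_norm=3.0)     # enforce the s = 3 constraint
```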

3.2 Pre-trained Word Vectors

We use the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors have dimensionality of 300 and were trained using the continuous bag-of-words architecture.

Word vectors: word2vec

Training corpus size: 100 billion words

Data: Google News

Dimensionality: 300

Architecture: CBOW
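
A hedged sketch of how these pre-trained vectors could be loaded into an embedding layer with gensim. The file name, the `word_to_idx` vocabulary mapping, and the ±0.25 range for unknown words are assumptions for illustration, not details given in the notes.

```python
import numpy as np
import torch
from gensim.models import KeyedVectors

# Assumed local copy of the public Google News vectors (300-d, CBOW).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def build_embedding_matrix(word_to_idx, dim=300):
    """Copy pre-trained vectors where available; random init for out-of-vocabulary words."""
    matrix = np.random.uniform(-0.25, 0.25, (len(word_to_idx), dim)).astype("float32")
    for word, idx in word_to_idx.items():
        if word in kv:
            matrix[idx] = kv[word]
    return torch.from_numpy(matrix)

# embedding = torch.nn.Embedding.from_pretrained(build_embedding_matrix(word_to_idx),
#                                                freeze=True)   # freeze=True -> the static setting
```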

(4) Results and Discussion

Results of our models against other methods are listed in table 2. Our baseline model with all randomly initialized words (CNN-rand) does not perform well on its own (a CNN with randomly initialized vectors performs poorly). While we had expected performance gains through the use of pre-trained vectors, we were surprised at the magnitude of the gains. Even a simple model with static vectors (CNN-static) performs remarkably well (using pre-trained word vectors gives a very large improvement), giving competitive results against the more sophisticated deep learning models that utilize complex pooling schemes (Kalchbrenner et al., 2014) or require parse trees to be computed beforehand (Socher et al., 2013). These results suggest that the pretrained vectors are good, ‘universal’ feature extractors and can be utilized across datasets. Finetuning the pre-trained vectors for each task gives still further improvements (CNN-non-static).

4.1 Multichannel vs. Single Channel Models

We had initially hoped that the multichannel architecture would prevent overfitting (the hope was that the static channel would keep the learned vectors from deviating too far from the original values) and thus work better than the single channel model, especially on smaller datasets. The results, however, are mixed (in practice the two perform about the same), and further work on regularizing the fine-tuning process is warranted. For instance, instead of using an additional channel for the non-static portion, one could maintain a single channel but employ extra dimensions that are allowed to be modified during training (i.e., keep a single channel and add extra trainable dimensions).
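
A minimal sketch of the two-channel idea (my own, assuming a pre-built `pretrained` weight tensor): one copy of the embeddings is frozen, the other is fine-tuned, and both channels are fed to the same filters.

```python
import torch
import torch.nn as nn

class TwoChannelEmbedding(nn.Module):
    """Static + non-static channels initialized from the same pre-trained matrix."""
    def __init__(self, pretrained):                        # pretrained: (vocab, 300) tensor
        super().__init__()
        self.static = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)
        self.tuned = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

    def forward(self, token_ids):                          # (batch, seq_len)
        # Stack as two input channels: (batch, 2, seq_len, embed_dim).
        return torch.stack([self.static(token_ids), self.tuned(token_ids)], dim=1)

# In the TextCNN sketch above, the convolutions would then use
# nn.Conv2d(2, num_filters, kernel_size=(h, embed_dim)) so each filter sees both channels.
```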

4.2 Static vs. Non-static Representations

As is the case with the single channel non-static model, the multichannel model is able to fine-tune the non-static channel to make it more specific to the task-at-hand. For example, good is most similar to bad in word2vec, presumably because they are (almost) syntactically equivalent (in word2vec, good and bad end up close because they occur in nearly the same syntactic contexts). But for vectors in the non-static channel that were finetuned on the SST-2 dataset, this is no longer the case (table 3). Similarly, good is arguably closer to nice than it is to great for expressing sentiment, and this is indeed reflected in the learned vectors.

For (randomly initialized) tokens not in the set of pre-trained vectors, fine-tuning allows them to learn more meaningful representations: the network learns that exclamation marks are associated with effusive expressions and that commas are conjunctive (table 3).

4.3 Further Observations

We report on some further experiments and observations:

The large gap is attributed to TextCNN using more feature maps. Kalchbrenner et al. (2014) report much worse results with a CNN that has essentially the same architecture as our single channel model. For example, their Max-TDNN (Time Delay Neural Network) with randomly initialized words obtains 37.4% on the SST-1 dataset, compared to 45.0% for our model. We attribute such discrepancy to our CNN having much more capacity (multiple filter widths and feature maps).

Dropout proved to be such a good regularizer that it was fine to use a larger than necessary network and simply let dropout regularize it. Dropout consistently added 2%–4% relative performance.

When randomly initializing words not in word2vec, we obtained slight improvements by sampling each dimension from U[-a, a] where a was chosen such that the randomly initialized vectors have the same variance as the pre-trained ones. It would be interesting to see if employing more sophisticated methods to mirror the distribution of pre-trained vectors in the initialization process gives further improvements.
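
A small sketch of the initialization trick described above: sample each dimension of an unknown word from U[-a, a], with a chosen so the random vectors have the same variance as the pre-trained ones (since Var(U[-a, a]) = a^2 / 3). The function and argument names are my own.

```python
import numpy as np

def init_oov_vectors(pretrained_matrix, num_oov, rng=np.random.default_rng(0)):
    """Random vectors whose per-dimension variance matches the pre-trained vectors."""
    target_var = pretrained_matrix.var()     # empirical variance of the known vectors
    a = np.sqrt(3.0 * target_var)            # Var(U[-a, a]) = a^2 / 3  ->  a = sqrt(3 * var)
    dim = pretrained_matrix.shape[1]
    return rng.uniform(-a, a, size=(num_oov, dim)).astype(pretrained_matrix.dtype)

# oov_vectors = init_oov_vectors(word2vec_matrix, num_oov=1234)   # word2vec_matrix assumed given
```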

We briefly experimented with another set of publicly available word vectors trained by Collobert et al. (2011) on Wikipedia, and found that word2vec gave far superior performance. It is not clear whether this is due to Mikolov et al. (2013)’s architecture or the 100 billion word Google News dataset. (word2vec performed much better, but it is unclear whether the gain comes from the model architecture or from the larger training corpus.)

Adadelta (Zeiler, 2012) gave similar results to Adagrad (Duchi et al., 2011) but required fewer epochs.

5 Conclusion

In the present work we have described a series of experiments with convolutional neural networks built on top of word2vec. Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. Our results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.

Key points:

Pre-trained word vectors: Word2Vec, GloVe

CNN architecture: 1-D convolution, pooling layer

Hyperparameter choices: filter sizes, how the word vectors are configured

Novel contributions:

Proposes TextCNN, a CNN-based text classification model

Proposes several ways of configuring the word vectors

Achieves the best results on four text classification tasks

Runs extensive experiments and analysis on the hyperparameters

Takeaways:

Fine-tuning on top of pre-trained vectors gives very good results, which suggests the pre-trained word vectors capture general-purpose features

On top of pre-trained word vectors, a simple model can outperform more complex models

For words not covered by the pre-trained vectors, fine-tuning lets them learn more meaningful representations.

4. Hyperparameter Selection (companion paper: A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification)

Embedding type

Filter size

Number of filters

Activation function

Dropout

L2 regularization

4.1 Baseline Configuration

We first consider the performance of a baseline CNN configuration. Specifically, we start with the architectural decisions and hyperparameters used in previous work (Kim, 2014) and described in Table 2. To contextualize the variance in performance attributable to various architecture decisions and hyperparameter settings, it is critical to assess the variance due strictly to the parameter estimation procedure. Most prior work, unfortunately, has not reported such variance, despite a highly stochastic learning procedure (prior work ignored the variance introduced by these stochastic factors). This variance is attributable to estimation via SGD, random dropout, and random weight parameter initialization. Holding all variables (including the folds) constant, we show that the mean performance calculated via 10-fold cross validation (CV) exhibits relatively high variance over repeated runs (even with every setting held fixed, the 10-fold CV results still fluctuate noticeably). We replicated CV experiments 100 times for each dataset (the CV experiment itself is repeated 100 times), so that each replication was a 10-fold CV, wherein the folds were fixed. We recorded the average performance for each replication and report the mean, minimum and maximum average accuracy (or AUC) values observed over 100 replications of CV (that is, we report means and ranges of averages calculated over 10-fold CV; reporting the mean, minimum and maximum makes the run-to-run fluctuation visible). This provides a sense of the variance we might observe without any changes to the model. We did this for both static and non-static methods. For all experiments, we used the same preprocessing steps for the data as in (Kim, 2014). For SGD, we used the ADADELTA update rule (Zeiler, 2012), and set the minibatch size to 50. We randomly selected 10% of the training data as the validation set for early stopping.
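
A sketch of that evaluation protocol: repeat 10-fold CV many times with fixed folds and report the mean, min, and max of the per-replication averages. The data and classifier below are stand-ins (synthetic features and a scikit-learn SGD-trained linear model) just to show the loop structure; in the paper each run trains the CNN, which is far more expensive, and the run-to-run variance comes from its stochastic training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data and model; in the paper each run trains the TextCNN instead.
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # folds fixed across replications

replication_means = []
for rep in range(100):                        # 100 replications of 10-fold CV
    clf = SGDClassifier(random_state=rep)     # variance comes from stochastic training, not the folds
    scores = cross_val_score(clf, X, y, cv=folds)
    replication_means.append(scores.mean())

replication_means = np.array(replication_means)
print(f"mean={replication_means.mean():.4f} "
      f"min={replication_means.min():.4f} max={replication_means.max():.4f}")
```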

4.2 Effect of input word vectors (embedding configuration)

A nice property of sentence classification models that start with distributed representations of words as inputs is the flexibility such architectures afford to swap in different pre-trained word vectors during model initialization. Therefore, we first explore the sensitivity of CNNs for sentence classification with respect to the input representations used. Specifically, we replaced word2vec with GloVe representations (two kinds of word vectors: word2vec and GloVe). Google word2vec uses a local context window model trained on 100 billion words from Google News (Mikolov et al., 2013), while GloVe is a model based on global word-word co-occurrence statistics (Pennington et al., 2014). We used a GloVe model trained on a corpus of 840 billion tokens of web data. For both word2vec and GloVe we induce 300-dimensional word vectors. We report results achieved using GloVe representations in Table 3. Here we only report non-static GloVe results (which again uniformly outperformed the static variant).

We also experimented with concatenating word2vec and GloVe representations, thus creating 600-dimensional word vectors to be used as input to the CNN. Pre-trained vectors may not always be available for specific words (either in word2vec or GloVe, or both); in such cases, we randomly initialized the corresponding subvectors. Results are reported in the final column of Table 3.

word2vec: 300-dimensional, trained on 100 billion words

GloVe: 300-dimensional, trained on 840 billion tokens

word2vec + GloVe (concatenated): 600-dimensional
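
A sketch of the 600-dimensional concatenation described above. `w2v_matrix` and `glove_matrix` are assumed to be (vocab, 300) arrays aligned to the same vocabulary, with randomly initialized rows wherever a word is missing from one of the two sources.

```python
import numpy as np
import torch

def concat_embeddings(w2v_matrix, glove_matrix):
    """Concatenate two aligned (vocab, 300) matrices into (vocab, 600) input vectors."""
    assert w2v_matrix.shape == glove_matrix.shape
    combined = np.concatenate([w2v_matrix, glove_matrix], axis=1)      # (vocab, 600)
    return torch.nn.Embedding.from_pretrained(
        torch.from_numpy(combined.astype("float32")),
        freeze=False)                                                  # non-static variant

# embedding = concat_embeddings(w2v_matrix, glove_matrix)
# The downstream CNN then uses embed_dim=600 when sizing its filters.
```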

4.3 Effect of filter region size

5 Conclusions

5.1 Summary of Main Empirical Findings

Even with all settings fixed, performance still fluctuates. Prior work has tended to report only the mean performance on datasets achieved by models. But this overlooks variance due solely to the stochastic inference procedure used. This can be substantial: holding everything constant (including the folds), so that variance is due exclusively to the stochastic inference procedure, we find that mean accuracy (calculated via 10 fold cross-validation) has a range of up to 1.5 points. And the range over the AUC achieved on the irony dataset is even greater – up to 3.4 points (see Table 3). More replication should be performed in future work, and ranges/variances should be reported, to prevent potentially spurious conclusions regarding relative model performance.

We find that, even when tuning them to the task at hand, the choice of input word vector representation (e.g., between word2vec and GloVe) has an impact on performance; however, different representations perform better for different tasks (the choice of word vectors matters, and different vectors suit different tasks). At least for sentence classification, both seem to perform better than using one-hot vectors directly. We note, however, that: (1) this may not be the case if one has a sufficiently large amount of training data (with enough training data, one-hot vectors may do just as well or better), and (2) the recent semi-supervised CNN model proposed by Johnson and Zhang (Johnson and Zhang, 2015) may improve performance, as compared to the simpler version of the model considered here (i.e., proposed in (Johnson and Zhang, 2014)) (a more sophisticated model may perform better).

The filter region size can have a large effect on performance, and should be tuned.

The number of feature maps (i.e., the number of filters per region size) can also play an important role in the performance, and increasing the number of feature maps will increase the training time of the model.

1-max pooling uniformly outperforms other pooling strategies.

Regularization has relatively little effect on the performance of the model.

5.2 Specific advice to practitioners

Drawing upon our empirical results, we provide the following guidance regarding CNN architecture and hyperparameters for practitioners looking to deploy CNNs for sentence classification tasks.

Consider starting with the basic configuration described in Table 2 and using non-static word2vec or GloVe rather than one-hot vectors. However, if the training dataset size is very large, it may be worthwhile to explore using one-hot vectors. Alternatively, if one has access to a large set of unlabeled in-domain data, (Johnson and Zhang, 2015) might also be an option.

Choosing the filter size: Line-search over the single filter region size to find the ‘best’ single region size (search over single region sizes to pick the best one). A reasonable range might be 1-10. However, for datasets with very long sentences like CR, it may be worth exploring larger filter region sizes (for long sentences, larger filters are worth trying). Once this ‘best’ region size is identified, it may be worth exploring combining multiple filters using region sizes near this single best size (after the best size is found, combinations of nearby sizes are also worth trying), given that empirically multiple ‘good’ region sizes always outperformed using only the single best region size.
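
A sketch of the suggested line-search. `train_and_evaluate` is a hypothetical helper that trains a TextCNN with the given filter region sizes and returns dev-set accuracy; it is not part of the paper.

```python
# Hypothetical helper: trains a TextCNN with the given filter region sizes
# and returns accuracy on the validation set.
# def train_and_evaluate(filter_sizes): ...

candidate_sizes = range(1, 11)                       # the suggested 1-10 range
single_scores = {h: train_and_evaluate((h,)) for h in candidate_sizes}
best = max(single_scores, key=single_scores.get)     # best single region size

# Then try combinations of sizes near the best one, e.g. (best-1, best, best+1).
nearby = [
    (best,),
    (best, best + 1),
    (max(1, best - 1), best, best + 1),
]
combo_scores = {sizes: train_and_evaluate(sizes) for sizes in nearby}
print(best, combo_scores)
```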

Alter the number of feature maps for each filter region size from 100 to 600, and when this is being explored, use a small dropout rate (0.0-0.5) and a large max norm constraint. Note that increasing the number of feature maps will increase the running time, so there is a trade-off to consider. Also pay attention whether the best value found is near the border of the range (Bengio, 2012). If the best value is near 600, it may be worth trying larger values.

Consider different activation functions if possible: ReLU and tanh are the best overall candidates. And it might also be worth trying no activation function at all (i.e., the identity) for our one-layer CNN.

No need to spend effort on alternatives: Use 1-max pooling; it does not seem necessary to expend resources evaluating alternative strategies.

On regularization: When increasing the number of feature maps begins to reduce performance, try imposing stronger regularization, e.g., a dropout rate larger than 0.5.

When assessing the performance of a model (or a particular configuration thereof), it is imperative to consider variance. Therefore, replications of the cross-fold validation procedure should be performed, and variances and ranges should be reported.

五革砸、研究成果及意義

(一)研究成果

在七個(gè)文本分類任務(wù)中的四個(gè)取得了最好的分類效果

CNN-rand:使用隨機(jī)初始化向量

CNN-static:使用靜態(tài)預(yù)訓(xùn)練的詞向量

CNN-non-static:使用微調(diào)的預(yù)訓(xùn)練的詞向量

CNN-multichannel:同時(shí)使用靜態(tài)預(yù)訓(xùn)練的詞向量和微調(diào)的預(yù)訓(xùn)練的詞向量

(二)歷史意義

開啟了基于深度學(xué)習(xí)的文本分類的序幕

推動了卷積神經(jīng)網(wǎng)絡(luò)在自然語言處理的發(fā)展
