Sequence to Sequence Learning with Neural Networks


Abstract: The paper proposes a general end-to-end approach to sequence-to-sequence learning in which both the Encoder and the Decoder are multi-layer LSTMs. The model achieves very strong results on machine translation.

The Model / Introduction: To handle variable-length inputs and variable-length outputs, LSTMs are used as the Encoder and Decoder, with very good results.

Experiment: Two different LSTMs serve as the Encoder and Decoder; the Encoder encodes the source sentence into a fixed-length vector, and the Decoder generates the target-language words one at a time.

Related Work: Work related to the paper.

Conclusion: Summarizes the paper and outlines future directions.

一呕诉、評價指標(biāo)

1.人工評價:通過人主觀對翻譯進行打分

優(yōu)點:準(zhǔn)確

缺點:速度慢,價格昂貴

2.機器自動評價:通過設(shè)置指標(biāo)對翻譯結(jié)果自動評價

優(yōu)點:較為準(zhǔn)確吃度,速度快甩挫,免費

缺點:可能和人工評價有一些出入

The BLEU metric

The problem with using only 1-grams: translating each word in isolation can already earn a high score, with no regard for the fluency of the sentence.

Fix: combine several n-gram orders; BLEU uses 1- to 4-grams (see the sketch below).
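Below is a minimal sentence-level BLEU sketch in Python to make the 1-gram vs. 1-4-gram point concrete. The function names, the smoothing floor, and the example sentences are my own illustration; the paper itself reports corpus-level BLEU aggregated over the whole test set.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions (n = 1..max_n),
    combined by a geometric mean and multiplied by a brevity penalty."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Tiny floor so the geometric mean stays defined when a precision is zero.
        log_prec_sum += math.log(max(clipped, 1e-9) / total)
    geo_mean = math.exp(log_prec_sum / max_n)
    # Brevity penalty: candidates shorter than the closest reference are penalized.
    ref_len = min((len(r) for r in references), key=lambda rl: (abs(rl - len(candidate)), rl))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * geo_mean

print(bleu("the cat is on the mat here".split(),
           ["the cat is on the mat".split()]))   # ~0.81
```

A word-by-word output with the right words in the wrong order still scores well on 1-grams but is punished by the 2- to 4-gram precisions, which is exactly the fluency argument above.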

Task setting: the input is a sequence and the output is a sequence.

Basic idea: an Encoder encodes the input sequence into a fixed-length vector, and a Decoder generates the output from that vector (see the sketch below).
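A minimal PyTorch sketch of this Encoder-Decoder idea, assuming teacher forcing during training. All sizes (vocabularies, embedding and hidden dimensions, number of layers) are illustrative placeholders rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encode the source into the encoder LSTM's final (hidden, cell) state,
    then decode target tokens conditioned on that fixed-size state."""
    def __init__(self, src_vocab=10000, tgt_vocab=10000, emb=256, hidden=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder's final state is the fixed-length summary of the source sentence.
        _, state = self.encoder(self.src_emb(src_ids))
        # Teacher forcing: feed the gold target prefix, predict the next token at each step.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)                     # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq()
logits = model(torch.randint(0, 10000, (2, 7)), torch.randint(0, 10000, (2, 9)))
print(logits.shape)                                  # torch.Size([2, 9, 10000])
```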

二法精、論文詳解

Abstract

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. Approach: In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Results: Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier. (Reversing the source input shortens dependencies and makes optimization easier.)

1. DNNs achieve excellent results on many tasks, but they cannot solve sequence-to-sequence problems on their own.

2. Multi-layer LSTMs are used as the Encoder and Decoder, reaching a BLEU score of 34.8 on WMT'14 English-to-French.

3. In addition, the LSTM handles long sentences well; using the deep NMT model to rerank the outputs of a statistical machine translation system raises BLEU from 33.3 to 36.5.

4. The LSTM learns good local and global features. Finally, reversing the source sentence greatly improves translation quality, because it shortens the dependency distance between source and target words.

1 Introduction

Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent performance on difficult problems such as speech recognition [13, 7] and visual object recognition [19, 6, 21, 20] (DNNs perform very well across many different tasks). DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps. A surprising example of the power of DNNs is their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size [27]. So, while neural networks are related to conventional statistical models, they learn an intricate computation. Furthermore, large DNNs can be trained with supervised backpropagation whenever the labeled training set has enough information to specify the network's parameters. Thus, if there exists a parameter setting of a large DNN that achieves good results (for example, because humans can solve the task very rapidly), supervised backpropagation will find these parameters and solve the problem.

Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality (despite their power, DNNs can only handle fixed-length inputs and outputs). It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a-priori (this is a serious limitation, because sequence problems generally have variable, unknown lengths). For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful.

Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation (the LSTM consumes one timestep at a time and builds up a large vector), and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM's ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs (fig. 1).

Related work: There have been a number of related attempts to address the general sequence to sequence learning problem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18] who were the first to map the entire input sentence to a vector, and is related to Cho et al. [5] although the latter was used only for rescoring hypotheses produced by a phrase-based system. Graves [10] introduced a novel differentiable attention mechanism that allows neural networks to focus on different parts of their input, and an elegant variant of this idea was successfully applied to machine translation by Bahdanau et al. [2]. The Connectionist Sequence Classification is another popular technique for mapping sequences to sequences with neural networks, but it assumes a monotonic alignment between the inputs and the outputs [11].

The main result of this work is the following. On the WMT'14 English to French translation task, we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep LSTMs (with 384M parameters and 8,000 dimensional state each) using a simple left-to-right beam-search decoder. This is by far the best result achieved by direct translation with large neural networks. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81 BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized whenever the reference translation contained a word not covered by these 80k. This result shows that a relatively unoptimized small-vocabulary neural network architecture which has much room for improvement outperforms a phrase-based SMT system.

Finally, we used the LSTM to rescore the publicly available 1000-best lists of the SMT baseline on the same task [29]. By doing so, we obtained a BLEU score of 36.5, which improves the baseline by 3.2 BLEU points and is close to the previous best published result on this task (which is 37.0 [9]).

Surprisingly, the LSTM did not suffer on very long sentences (the LSTM's quality did not degrade as sentences grew longer), despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set (the order of the source words is reversed at input time). By doing so, we introduced many short term dependencies that made the optimization problem much simpler (the short-range dependencies introduced this way make optimization easier) (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.

A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation. Given that translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning, as sentences with similar meanings are close to each other while different sentence meanings will be far. A qualitative evaluation supports this claim, showing that our model is aware of word order and is fairly invariant to the active and passive voice.

In short: the LSTM learns to map variable-length input sentences to fixed-dimensional vectors that capture their meaning, so sentences with similar meanings end up close together and dissimilar ones far apart; a qualitative evaluation supports this, showing the model captures word order and is fairly invariant to active vs. passive voice.

Deep neural networks are very successful, but they have difficulty with sequence-to-sequence problems.

This paper solves sequence-to-sequence problems with a new Seq2Seq architecture whose Encoder and Decoder are both LSTMs.

There is already a substantial body of prior work on this problem, including Seq2Seq models and attention mechanisms.

The deep Seq2Seq model of this paper achieves very strong results on machine translation.

2 The Model

Tricks:

1. Use different LSTMs for the Encoder and the Decoder.

2. Deep LSTMs work better than shallow LSTMs.

3. Feeding the source sentence in reversed order greatly improves translation quality (see the sketch after this list).
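Trick 3 is purely a data-preprocessing step: only the source side is reversed, the target side is untouched. A tiny sketch (the example tokens are made up for illustration):

```python
def reverse_source(src_tokens, tgt_tokens):
    """Reverse the source sentence only, so that 'a b c -> x y z' becomes
    'c b a -> x y z' and the first source word sits next to the first target word."""
    return list(reversed(src_tokens)), tgt_tokens

src, tgt = reverse_source("je suis etudiant".split(), "i am a student".split())
print(src, tgt)   # ['etudiant', 'suis', 'je'] ['i', 'am', 'a', 'student']
```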

Experimental results and analysis:

Rescoring: the LSTM is used to score and rerank the n-best list produced by the statistical machine translation baseline.
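A sketch of what rescoring an n-best list with the LSTM could look like. This is my own illustration rather than the paper's code: model is assumed to be a Seq2Seq-style module (such as the sketch earlier) returning per-step logits over the target vocabulary, and each hypothesis is a (1, T) tensor of target token ids starting with a start-of-sentence symbol.

```python
import torch
import torch.nn.functional as F

def rescore(model, src_ids, hypotheses):
    """Rerank an n-best list (e.g., an SMT baseline's 1000-best list) by the
    LSTM's log-probability of each hypothesis given the source sentence."""
    scores = []
    for tgt_ids in hypotheses:
        logits = model(src_ids, tgt_ids[:, :-1])            # predict tokens 1..T-1
        logp = F.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, tgt_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        scores.append(tok_logp.sum().item())                 # sentence log-probability
    order = sorted(range(len(hypotheses)), key=lambda i: -scores[i])
    return [hypotheses[i] for i in order]                    # best-scoring hypothesis first
```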

We initialized all of the LSTM's parameters with the uniform distribution between -0.08 and 0.08.
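In PyTorch, that initialization rule could be written as follows (the helper name is mine; the nn.LSTM below is only a stand-in for the full model):

```python
import torch.nn as nn

def init_uniform(module, scale=0.08):
    """Draw every parameter from a uniform distribution on [-scale, scale]."""
    for p in module.parameters():
        nn.init.uniform_(p, -scale, scale)

init_uniform(nn.LSTM(256, 512, num_layers=4))   # stand-in for the Seq2Seq model
```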

We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs, we began halving the learning rate every half epoch. We trained our models for a total of 7.5 epochs. (Initial learning rate 0.7; after 5 epochs it is halved every half epoch; 7.5 epochs in total.)
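A sketch of that schedule with plain SGD (no momentum); the model and the training-loop body are placeholders:

```python
import torch
import torch.nn as nn

model = nn.LSTM(256, 512, num_layers=4)                # stand-in for the Seq2Seq model
opt = torch.optim.SGD(model.parameters(), lr=0.7)      # plain SGD, no momentum

for half_epoch in range(1, 16):                        # 7.5 epochs = 15 half-epochs
    # ... train on one half-epoch of data here ...
    if half_epoch / 2 > 5:                             # after epoch 5, halve every half-epoch
        for group in opt.param_groups:
            group["lr"] *= 0.5
    print(half_epoch / 2, "epochs done, lr =", opt.param_groups[0]["lr"])
```

With five halvings (at epochs 5.5, 6, 6.5, 7, and 7.5), the learning rate ends at 0.7 / 32.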

We used batches of 128 sequences for the gradient and divided it by the size of the batch (namely, 128).

Although LSTMs tend to not suffer from the vanishing gradient problem, they can have exploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10, 25] by scaling it when its norm exceeded a threshold. For each training batch, we compute s = ||g||_2, where g is the gradient divided by 128. If s > 5, we set g = 5g/s. (Although LSTMs rarely suffer from vanishing gradients, they can suffer from exploding gradients, so the gradient norm is clipped.)
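The rule s = ||g||_2, g ← 5g/s when s > 5 is exactly the rescaling that torch.nn.utils.clip_grad_norm_ applies across all parameters. A sketch (the dummy loss exists only to produce gradients; averaging the loss over the batch plays the role of the divide-by-128 step mentioned above):

```python
import torch
import torch.nn as nn

model = nn.LSTM(256, 512, num_layers=2)               # stand-in for the Seq2Seq model
out, _ = model(torch.randn(10, 128, 256))             # (seq_len, batch=128, input_size)
loss = out.pow(2).mean()                              # dummy loss, averaged over the batch
loss.backward()

# s = ||g||_2 over all parameters; if s > 5, every gradient is scaled by 5 / s.
s = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
print("gradient norm before clipping:", float(s))
```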

Different sentences have different lengths. Most sentences are short (e.g., length 20-30) but some sentences are long (e.g., length > 100), so a minibatch of 128 randomly chosen training sentences will have many short sentences and few long sentences, and as a result, much of the computation in the minibatch is wasted. To address this problem, we made sure that all sentences in a minibatch are roughly of the same length, yielding a 2x speedup. (Sentences inside a minibatch should have similar lengths; sorting sentences by length before batching keeps each minibatch roughly uniform and improves efficiency.)
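One simple way to implement this bucketing; the paper does not give the batching code, so the helper below is only an illustration:

```python
import random

def make_minibatches(pairs, batch_size=128):
    """Group (source, target) token-list pairs so each minibatch holds sentences
    of roughly the same source length, reducing wasted computation on padding."""
    pairs = sorted(pairs, key=lambda p: len(p[0]))                    # sort by source length
    batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
    random.shuffle(batches)                                           # keep the epoch order random
    return batches
```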

Key points

• Validates the effectiveness of the Seq2Seq model for sequence-to-sequence tasks.

• Identifies, through experiments, several tricks that improve translation quality.

• The Deep NMT model.

Innovations

• Proposes a new neural machine translation model: the Deep NMT model.

• Proposes several tricks that improve neural machine translation, such as multi-layer LSTMs and reversed source input.

• Achieves very strong results on WMT'14 English-to-French translation.

Takeaways

• The Seq2Seq model uses one LSTM to extract features from the input sequence, reading one word per time step to produce a fixed-dimensional sentence representation, and the Decoder then uses another LSTM to generate the output sequence from that vector.

• Our experiments also support this conclusion: the sentence representations produced by the model capture word-order information and can recognize active and passive voice expressing the same meaning.
