Sequence to Sequence Learning with Neural Networks


Abstract: The paper proposes a general end-to-end approach to sequence-to-sequence learning in which both the Encoder and the Decoder are multi-layer LSTMs. The model achieves very strong results on machine translation.

The Model / Introduction: To handle variable-length inputs and variable-length outputs, LSTMs are used as the Encoder and Decoder, with very good results.

Experiment: Two different LSTMs serve as the Encoder and Decoder; the Encoder encodes the source sentence into a fixed-length vector, and the Decoder generates the target-language words one at a time.

Related Work: Work related to the paper.

Conclusion: Summarizes the paper and outlines future directions.

一呕诉、評價指標(biāo)

1.人工評價:通過人主觀對翻譯進行打分

優(yōu)點:準(zhǔn)確

缺點:速度慢,價格昂貴

2.機器自動評價:通過設(shè)置指標(biāo)對翻譯結(jié)果自動評價

優(yōu)點:較為準(zhǔn)確吃度,速度快甩挫,免費

缺點:可能和人工評價有一些出入

The BLEU metric

The problem with using only 1-grams: translating each word in isolation can already earn a high score, with no regard for the fluency of the sentence.

Fix: combine several n-gram orders; BLEU uses 1- to 4-grams (see the sketch below).
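Below is a minimal sentence-level BLEU sketch in Python to make the 1-gram vs. 1-4-gram point concrete. The function names, the smoothing floor, and the example sentences are my own illustration; the paper itself reports corpus-level BLEU aggregated over the whole test set.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions (n = 1..max_n),
    combined by a geometric mean and multiplied by a brevity penalty."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Tiny floor so the geometric mean stays defined when a precision is zero.
        log_prec_sum += math.log(max(clipped, 1e-9) / total)
    geo_mean = math.exp(log_prec_sum / max_n)
    # Brevity penalty: candidates shorter than the closest reference are penalized.
    ref_len = min((len(r) for r in references), key=lambda rl: (abs(rl - len(candidate)), rl))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * geo_mean

print(bleu("the cat is on the mat here".split(),
           ["the cat is on the mat".split()]))   # ~0.81
```

A word-by-word output with the right words in the wrong order still scores well on 1-grams but is punished by the 2- to 4-gram precisions, which is exactly the fluency argument above.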

Task setting: the input is a sequence and the output is a sequence.

Basic idea: an Encoder encodes the input sequence into a fixed-length vector, and a Decoder generates the output from that vector (see the sketch below).
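A minimal PyTorch sketch of this Encoder-Decoder idea, assuming teacher forcing during training. All sizes (vocabularies, embedding and hidden dimensions, number of layers) are illustrative placeholders rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encode the source into the encoder LSTM's final (hidden, cell) state,
    then decode target tokens conditioned on that fixed-size state."""
    def __init__(self, src_vocab=10000, tgt_vocab=10000, emb=256, hidden=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder's final state is the fixed-length summary of the source sentence.
        _, state = self.encoder(self.src_emb(src_ids))
        # Teacher forcing: feed the gold target prefix, predict the next token at each step.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)                     # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq()
logits = model(torch.randint(0, 10000, (2, 7)), torch.randint(0, 10000, (2, 9)))
print(logits.shape)                                  # torch.Size([2, 9, 10000])
```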

二法精、論文詳解

Abstract

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. Approach: In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Results: Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier. (Reversing the source input shortens dependencies and makes optimization easier.)

1. DNNs achieve excellent results on many tasks, but they cannot solve sequence-to-sequence problems on their own.

2. Multi-layer LSTMs are used as the Encoder and Decoder, reaching a BLEU score of 34.8 on WMT'14 English-to-French.

3. In addition, the LSTM handles long sentences well; using the deep NMT model to rerank the outputs of a statistical machine translation system raises BLEU from 33.3 to 36.5.

4. The LSTM learns good local and global features. Finally, reversing the source sentence greatly improves translation quality, because it shortens the dependency distance between source and target words.

1 Introduction

Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent performance on difficult problems such as speech recognition [13, 7] and visual object recognition [19, 6, 21, 20] (DNNs perform very well across many different tasks). DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps. A surprising example of the power of DNNs is their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size [27]. So, while neural networks are related to conventional statistical models, they learn an intricate computation. Furthermore, large DNNs can be trained with supervised backpropagation whenever the labeled training set has enough information to specify the network's parameters. Thus, if there exists a parameter setting of a large DNN that achieves good results (for example, because humans can solve the task very rapidly), supervised backpropagation will find these parameters and solve the problem.

Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality (despite their power, DNNs can only handle fixed-length inputs and outputs). It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a-priori (this is a serious limitation, because sequence problems generally have variable, unknown lengths). For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful.

Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation (the LSTM consumes one timestep at a time and builds up a large vector), and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM's ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs (fig. 1).

Related work: There have been a number of related attempts to address the general sequence to sequence learning problem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18] who were the first to map the entire input sentence to a vector, and is related to Cho et al. [5] although the latter was used only for rescoring hypotheses produced by a phrase-based system. Graves [10] introduced a novel differentiable attention mechanism that allows neural networks to focus on different parts of their input, and an elegant variant of this idea was successfully applied to machine translation by Bahdanau et al. [2]. The Connectionist Sequence Classification is another popular technique for mapping sequences to sequences with neural networks, but it assumes a monotonic alignment between the inputs and the outputs [11].

The main result of this work is the following. On the WMT'14 English to French translation task, we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep LSTMs (with 384M parameters and 8,000 dimensional state each) using a simple left-to-right beam-search decoder. This is by far the best result achieved by direct translation with large neural networks. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81 BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized whenever the reference translation contained a word not covered by these 80k. This result shows that a relatively unoptimized small-vocabulary neural network architecture which has much room for improvement outperforms a phrase-based SMT system.

Finally, we used the LSTM to rescore the publicly available 1000-best lists of the SMT baseline on the same task [29]. By doing so, we obtained a BLEU score of 36.5, which improves the baseline by 3.2 BLEU points and is close to the previous best published result on this task (which is 37.0 [9]).

Surprisingly, the LSTM did not suffer on very long sentences (the LSTM's quality did not degrade as sentences grew longer), despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set (the order of the source words is reversed at input time). By doing so, we introduced many short term dependencies that made the optimization problem much simpler (the short-range dependencies introduced this way make optimization easier) (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.

A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation. Given that translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning, as sentences with similar meanings are close to each other while different sentence meanings will be far. A qualitative evaluation supports this claim, showing that our model is aware of word order and is fairly invariant to the active and passive voice.

In short: the LSTM learns to map variable-length input sentences to fixed-dimensional vectors that capture their meaning, so sentences with similar meanings end up close together and dissimilar ones far apart; a qualitative evaluation supports this, showing the model captures word order and is fairly invariant to active vs. passive voice.

Deep neural networks are very successful, but they have difficulty with sequence-to-sequence problems.

This paper solves sequence-to-sequence problems with a new Seq2Seq architecture whose Encoder and Decoder are both LSTMs.

There is already a substantial body of prior work on this problem, including Seq2Seq models and attention mechanisms.

The deep Seq2Seq model of this paper achieves very strong results on machine translation.

2 The Model

Tricks:

1. Use different LSTMs for the Encoder and the Decoder.

2. Deep LSTMs work better than shallow LSTMs.

3. Feeding the source sentence in reversed order greatly improves translation quality (see the sketch after this list).
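Trick 3 is purely a data-preprocessing step: only the source side is reversed, the target side is untouched. A tiny sketch (the example tokens are made up for illustration):

```python
def reverse_source(src_tokens, tgt_tokens):
    """Reverse the source sentence only, so that 'a b c -> x y z' becomes
    'c b a -> x y z' and the first source word sits next to the first target word."""
    return list(reversed(src_tokens)), tgt_tokens

src, tgt = reverse_source("je suis etudiant".split(), "i am a student".split())
print(src, tgt)   # ['etudiant', 'suis', 'je'] ['i', 'am', 'a', 'student']
```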

Experimental results and analysis:

Rescoring: the LSTM is used to score and rerank the n-best list produced by the statistical machine translation baseline.
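A sketch of what rescoring an n-best list with the LSTM could look like. This is my own illustration rather than the paper's code: model is assumed to be a Seq2Seq-style module (such as the sketch earlier) returning per-step logits over the target vocabulary, and each hypothesis is a (1, T) tensor of target token ids starting with a start-of-sentence symbol.

```python
import torch
import torch.nn.functional as F

def rescore(model, src_ids, hypotheses):
    """Rerank an n-best list (e.g., an SMT baseline's 1000-best list) by the
    LSTM's log-probability of each hypothesis given the source sentence."""
    scores = []
    for tgt_ids in hypotheses:
        logits = model(src_ids, tgt_ids[:, :-1])            # predict tokens 1..T-1
        logp = F.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, tgt_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        scores.append(tok_logp.sum().item())                 # sentence log-probability
    order = sorted(range(len(hypotheses)), key=lambda i: -scores[i])
    return [hypotheses[i] for i in order]                    # best-scoring hypothesis first
```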

We initialized all of the LSTM's parameters with the uniform distribution between -0.08 and 0.08.
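In PyTorch, that initialization rule could be written as follows (the helper name is mine; the nn.LSTM below is only a stand-in for the full model):

```python
import torch.nn as nn

def init_uniform(module, scale=0.08):
    """Draw every parameter from a uniform distribution on [-scale, scale]."""
    for p in module.parameters():
        nn.init.uniform_(p, -scale, scale)

init_uniform(nn.LSTM(256, 512, num_layers=4))   # stand-in for the Seq2Seq model
```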

We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs, we began halving the learning rate every half epoch. We trained our models for a total of 7.5 epochs. (Initial learning rate 0.7; after 5 epochs it is halved every half epoch; 7.5 epochs in total.)
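A sketch of that schedule with plain SGD (no momentum); the model and the training-loop body are placeholders:

```python
import torch
import torch.nn as nn

model = nn.LSTM(256, 512, num_layers=4)                # stand-in for the Seq2Seq model
opt = torch.optim.SGD(model.parameters(), lr=0.7)      # plain SGD, no momentum

for half_epoch in range(1, 16):                        # 7.5 epochs = 15 half-epochs
    # ... train on one half-epoch of data here ...
    if half_epoch / 2 > 5:                             # after epoch 5, halve every half-epoch
        for group in opt.param_groups:
            group["lr"] *= 0.5
    print(half_epoch / 2, "epochs done, lr =", opt.param_groups[0]["lr"])
```

With five halvings (at epochs 5.5, 6, 6.5, 7, and 7.5), the learning rate ends at 0.7 / 32.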

We used batches of 128 sequences for the gradient and divided it by the size of the batch (namely, 128).

Although LSTMs tend to not suffer from the vanishing gradient problem, they can have exploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10, 25] by scaling it when its norm exceeded a threshold. For each training batch, we compute s = ||g||_2, where g is the gradient divided by 128. If s > 5, we set g = 5g/s. (Although LSTMs rarely suffer from vanishing gradients, they can suffer from exploding gradients, so the gradient norm is clipped.)
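The rule s = ||g||_2, g ← 5g/s when s > 5 is exactly the rescaling that torch.nn.utils.clip_grad_norm_ applies across all parameters. A sketch (the dummy loss exists only to produce gradients; averaging the loss over the batch plays the role of the divide-by-128 step mentioned above):

```python
import torch
import torch.nn as nn

model = nn.LSTM(256, 512, num_layers=2)               # stand-in for the Seq2Seq model
out, _ = model(torch.randn(10, 128, 256))             # (seq_len, batch=128, input_size)
loss = out.pow(2).mean()                              # dummy loss, averaged over the batch
loss.backward()

# s = ||g||_2 over all parameters; if s > 5, every gradient is scaled by 5 / s.
s = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
print("gradient norm before clipping:", float(s))
```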

Different sentences have different lengths. Most sentences are short (e.g., length 20-30) but some sentences are long (e.g., length > 100), so a minibatch of 128 randomly chosen training sentences will have many short sentences and few long sentences, and as a result, much of the computation in the minibatch is wasted. To address this problem, we made sure that all sentences in a minibatch are roughly of the same length, yielding a 2x speedup. (Sentences inside a minibatch should have similar lengths; sorting sentences by length before batching keeps each minibatch roughly uniform and improves efficiency.)
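One simple way to implement this bucketing; the paper does not give the batching code, so the helper below is only an illustration:

```python
import random

def make_minibatches(pairs, batch_size=128):
    """Group (source, target) token-list pairs so each minibatch holds sentences
    of roughly the same source length, reducing wasted computation on padding."""
    pairs = sorted(pairs, key=lambda p: len(p[0]))                    # sort by source length
    batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
    random.shuffle(batches)                                           # keep the epoch order random
    return batches
```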

Key points

• Validates the effectiveness of the Seq2Seq model for sequence-to-sequence tasks.

• Identifies, through experiments, several tricks that improve translation quality.

• The Deep NMT model.

Innovations

• Proposes a new neural machine translation model: the Deep NMT model.

• Proposes several tricks that improve neural machine translation, such as multi-layer LSTMs and reversed source input.

• Achieves very strong results on WMT'14 English-to-French translation.

Takeaways

• The Seq2Seq model uses one LSTM to extract features from the input sequence, reading one word per time step to produce a fixed-dimensional sentence representation, and the Decoder then uses another LSTM to generate the output sequence from that vector.

• Our experiments also support this conclusion: the sentence representations produced by the model capture word-order information and can recognize active and passive voice expressing the same meaning.
