Understanding the Transformer, Part 1

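I. Why the Transformer became so popular:

1. The model is simple and easy to understand. The encoder and decoder modules are highly similar and share most of their structure.

2. The encoder is easy to parallelize, so the model trains quickly.

3. It performs very well, achieving state-of-the-art results in NMT (neural machine translation) and other fields.

II. The Transformer model:

The Transformer takes one sequence as input and generates another sequence. Opening up the model, we see that it consists of two parts: the encoders and the decoders.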

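Figure: Transformer

The encoders and the decoders are both stacks: layers with the same structure are stacked on top of one another to form the encoding component and the decoding component.

Figure: Transformer

First, let's open up an individual encoder:

Figure: encoder

Each encoder contains a self-attention layer and a feed-forward neural network.

The encoder's input first passes through the self-attention layer. This layer lets each word see how it relates to itself and to the other words, and turns it into a word vector that draws on every word in the sentence while staying focused on the word itself. The output of self-attention then passes through a feed-forward neural network; the output at every position is processed by the same feed-forward network.

The decoder has the same self-attention and feed-forward layers, but between them sits an additional encoder-decoder attention layer, which helps the decoder focus on the encoder positions that matter most.

Figure: decoder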

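How the tensors change

1. Embedding (512 dimensions here).
2. Once embedded, each word passes through the two sub-layers of the encoder.

Figure: how the tensors change

Here we can see a key property of the Transformer: the word at each position follows its own path through the encoder. The paths are connected, and depend on one another, in the self-attention layer. The feed-forward layer has no such dependencies, so the paths can run in parallel through it.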

Encoder

The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network with each vector flowing through it separately.
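To make the "same network, separate paths" point concrete, here is a minimal NumPy sketch. The weights are random stand-ins, the inner width of 2048 comes from the paper rather than from this post, and the names W1, W2, and feed_forward are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 3  # d_ff = 2048 is the paper's value (assumption here)

# Hypothetical, randomly initialised weights shared by every position.
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def feed_forward(x):
    """Two linear layers with a ReLU in between; accepts one vector or a matrix of rows."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

Z = rng.normal(size=(seq_len, d_model))  # stand-in for the self-attention output, one row per word

one_by_one = np.stack([feed_forward(z) for z in Z])  # each position on its own path
all_at_once = feed_forward(Z)                        # same network, all positions in one call

print(np.allclose(one_by_one, all_at_once))
```

Because no position reads another position's vector inside feed_forward, the per-position loop and the single batched call give the same result, which is why this layer parallelizes so easily.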

The self-attention mechanism

Suppose we use the Transformer to translate this sentence: "The animal didn't cross the street because it was too tired". When translating "it", we know that it refers to the animal, not the street. So it would be very useful if the embedding at the position of "it" could take in some information about "animal". Self-attention exists to do exactly this.

As the figure below shows, self-attention links the word "it" strongly to certain words, which receive comparatively high attention scores.

As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on "The Animal", and baked a part of its representation into the encoding of "it".

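Self-attention in detail

Step 1: To perform self-attention, every input position produces three vectors: a Query vector, a Key vector, and a Value vector. Each is obtained by multiplying the input embedding by one of three matrices (that is, by a linear transformation).

Note that in the Transformer architecture these new vectors are smaller than the input vector: the input is 512-dimensional, while each of the three projected vectors is 64-dimensional.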

Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
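Step 1 can be sketched in a few lines of NumPy. The matrices below are random stand-ins for the learned weights; only the shapes (512 in, 64 out) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

# Hypothetical stand-ins for the learned projection matrices WQ, WK, WV.
WQ = rng.normal(size=(d_model, d_k))
WK = rng.normal(size=(d_model, d_k))
WV = rng.normal(size=(d_model, d_k))

x1 = rng.normal(size=d_model)  # embedding of the first word (random stand-in)

# One matrix multiplication per projection turns the 512-d embedding into 64-d vectors.
q1, k1, v1 = x1 @ WQ, x1 @ WK, x1 @ WV
print(q1.shape, k1.shape, v1.shape)
```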

Step 2 is to compute scores. When self-attention encodes the word at a given position, we want a score for every word of the sentence with respect to that word. The score a word receives says how much attention to place on it while encoding the current word, or, put differently, how relevant or similar it is to the current word.

In the Transformer, this score is the dot product of the query vector and the key vector. So when we process the first word with self-attention, the first score is the dot product of q_1 and k_1, and the second score is the dot product of q_1 and k_2.

Figure: score

Steps 3 and 4: divide the scores by 8, the square root of 64, which is the dimension of the key vectors. This is said to make the model's gradients more stable. We then pass the scaled scores through a softmax, which turns them into probability scores: all positive and summing to 1.

Figure: scores

This softmax score says how much attention to give each word while processing the current word.
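Steps 2 to 4 can be checked by hand on a two-word sentence. In the sketch below the raw scores 112 and 96 are illustrative values chosen for this example, not numbers taken from a trained model.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_k = 64
raw_scores = np.array([112.0, 96.0])  # illustrative values for q1·k1 and q1·k2
scaled = raw_scores / np.sqrt(d_k)    # divide by 8
weights = softmax(scaled)             # positive, sums to 1

print(scaled)            # [14. 12.]
print(weights.round(2))  # [0.88 0.12]
```

Without the division by 8, the softmax of 112 and 96 would put essentially all of the weight on the first word, which gives a feel for why the scaling helps.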

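Step 5 is to multiply each value vector by its attention score.

Step 6 is to sum these weighted value vectors; the sum is the new vector representation of the current word.

With the word vectors produced by self-attention in hand, we can pass them on to the feed-forward network.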
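Putting the steps together for a single position gives the following sketch. It uses a toy dimension of 4 instead of 64 and hand-picked vectors so the result is easy to read; the function name attend is made up for this example.

```python
import numpy as np

def attend(q, K, V):
    """Self-attention output for one query vector q against all keys and values."""
    scores = K @ q / np.sqrt(q.shape[-1])    # steps 2-3: dot products, then scaling
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # step 4: softmax
    return weights @ V                       # steps 5-6: weight each value vector, then sum

# Toy 2-word example with d_k = 4 (the post uses 64).
K = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
V = np.array([[10.0, 0.0, 0.0, 0.0],
              [0.0, 10.0, 0.0, 0.0]])
q1 = np.array([8.0, 0.0, 0.0, 0.0])          # points strongly at the first key

z1 = attend(q1, K, V)
print(z1.round(2))
```

Since q1 matches the first key far better than the second, z1 ends up close to the first value vector with only a small contribution from the second.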

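Matrix computation in self-attention

First, we compute the Query, Key, and Value matrices for all the word vectors. We stack the word vectors of the sentence into a matrix X, one row per word, and multiply X by the weight matrices WQ, WK, and WV, each of which is a linear transformation.

Figure: word -> Q, K, V

We then carry out the self-attention steps described above with matrix multiplications.

Figure: self-attention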
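In matrix form the six steps collapse into softmax(Q K^T / sqrt(d_k)) V. A minimal NumPy sketch, with random stand-ins for the learned weights and a made-up function name:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """Self-attention for all positions at once; X holds one word vector per row."""
    Q, K, V = X @ WQ, X @ WK, X @ WV
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) scaled dot products
    return softmax(scores) @ V               # each output row is a weighted sum of value rows

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 2, 512, 64
X = rng.normal(size=(seq_len, d_model))
# Hypothetical random weights in place of trained ones.
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))

Z = self_attention(X, WQ, WK, WV)
print(Z.shape)  # (2, 64)
```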

Multi-Headed attention

In the paper, each embedding vector produces not one set of key, value, and query vectors but several sets. This is called "multi-headed" attention, and it has two benefits:
1. The model is better able to form different attention patterns that focus on different words.
2. The attention layer gets multiple "representation spaces".

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.

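Each attention head ends up producing a matrix that represents all the word vectors in the sentence. The Transformer uses eight heads, so we get eight matrices. Self-attention is followed by a feed-forward network, so do we need to run the feed-forward network eight times? No: we concatenate the eight matrices, multiply the result by another weight matrix WO to obtain the output Z, and run the feed-forward network on Z.

Figure: concatenate all 8 matrices and pass the result to the feed-forward network

Putting it all together, the figure below shows everything the self-attention module does.

Sentence as token IDs -> word embeddings -> linear transformations into 8 sets of Q/K/V -> attention -> concatenate the z matrices and apply a linear transformation to obtain the output matrix Z (the input to the feed-forward network)

Figure: self-attention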
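The whole pipeline for eight heads can be sketched as follows. All weights are random stand-ins, and the function name and the heads list layout are choices made for this example rather than anything prescribed by the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, WO):
    """heads is a list of (WQ, WK, WV) triples, one triple per attention head."""
    outputs = []
    for WQ, WK, WV in heads:
        Q, K, V = X @ WQ, X @ WK, X @ WV
        outputs.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    concat = np.concatenate(outputs, axis=-1)  # (seq_len, n_heads * d_k)
    return concat @ WO                          # (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_k = 2, 512, 8, 64
X = rng.normal(size=(seq_len, d_model))
# Hypothetical random weights in place of trained ones.
heads = [tuple(rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))
         for _ in range(n_heads)]
WO = rng.normal(size=(n_heads * d_k, d_model)) * 0.05

Z = multi_head_attention(X, heads, WO)
print(Z.shape)  # (2, 512)
```

Note that 8 heads of width 64 concatenate back to 512 columns, so Z has the same shape as the input X and can go straight into the feed-forward network.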

Now that we have multi-headed attention, let's look at where the attention falls at each position when we encode the word "it".

As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" -- in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".

If we put all the attention heads together, we get the figure below.

Positional Encoding

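So far, the model has taken no account of word order. Even if we shuffled the words of the sentence completely, it would make no difference to the Transformer. To add information about the order of the words, we introduce positional encoding.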

To give the model a sense of the order of the words, we add positional encoding vectors -- the values of which follow a specific pattern.

If we assume the input embedding has 4 dimensions, the positional encodings look roughly like this.

A real example of positional encoding with a toy embedding size of 4

In the figure below, each row is one positional encoding vector. The first row is the positional encoding of the first word, and so on. Each row holds 512 numbers between -1 and 1, and the values are color-coded.

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
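A minimal sketch of the layout described in the caption above: sine values in the left half and cosine values in the right half. It uses the paper's frequencies 1 / 10000^(2i / d_model) and is not a copy of get_timing_signal_1d(); the paper itself interleaves sine and cosine across even and odd dimensions instead of concatenating them. An even d_model is assumed.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sine values in the left half of each row, cosine values in the right half."""
    half = d_model // 2
    positions = np.arange(n_positions)[:, None]         # (n_positions, 1)
    freqs = 1.0 / (10000 ** (np.arange(half) / half))   # (half,), equals 1 / 10000^(2i / d_model)
    angles = positions * freqs                          # (n_positions, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

pe = positional_encoding(20, 512)  # 20 words, embedding size 512, as in the figure
print(pe.shape)                    # (20, 512)
```

Each row of pe is added to the embedding of the word at that position. Because the values come from a formula rather than a learned table, rows for positions longer than anything seen in training can be generated the same way.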

Residuals

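One more detail: each sub-layer in the encoder has a residual connection around it, followed by layer normalization, as shown in the figure below.

The figure below shows the same thing in more detail, at the level of individual vectors.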
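The "add and normalize" step can be sketched as LayerNorm(x + Sublayer(x)). In the sketch below the learned gain and bias of layer normalization are left out, and the sub-layer is a made-up stand-in rather than real self-attention.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each row to zero mean and unit variance (learned gain and bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalisation."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))
W = rng.normal(size=(512, 512)) * 0.05  # hypothetical weights for a stand-in sub-layer

out = add_and_norm(X, lambda x: np.tanh(x @ W))
print(out.shape, out.mean(axis=-1).round(6), out.std(axis=-1).round(3))
```

In an encoder layer this wrapper would be applied twice: once around self-attention and once around the feed-forward network.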

The decoder has the same architecture. Putting the encoder and the decoder together gives the figure below.

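Decoder

The encoder starts by processing the input sequence. The output of the last encoder layer is then transformed into a set of attention vectors K and V, which the decoder uses as the raw material for decoding.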

After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).

During decoding, the decoder outputs one token per step. This repeats until it outputs a special end-of-sequence token, which signals that decoding is finished.

The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

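Self-attention in the decoder differs slightly from the encoder. In the decoder, the self-attention layer may only see the text that has already been decoded. To enforce this, we mask every position after the current output position by setting its score to -inf before the softmax.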
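The masking trick is easy to see on a toy score matrix. In this sketch all raw scores are set to zero (an arbitrary choice for illustration), so the only structure in the result comes from the mask.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # toy scores: every pair of positions equally compatible

# True above the diagonal, i.e. wherever a position would look at a later one.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

weights = softmax(masked)  # exp(-inf) = 0, so future positions get zero weight
print(weights.round(2))
```

Row i of weights spreads its attention evenly over positions 0 to i and gives exactly zero to everything after it.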

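The encoder-decoder attention layer works just like ordinary multi-headed self-attention, except that its Queries come entirely from the decoder layer below it, while its Keys and Values come from the encoder's output vectors.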
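A single-head sketch of this layer, with random stand-ins for the weights and inputs and a made-up function name. The only difference from self-attention is where Q, K, and V are computed from.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(dec_X, enc_out, WQ, WK, WV):
    Q = dec_X @ WQ    # queries come from the decoder layer below
    K = enc_out @ WK  # keys come from the encoder output
    V = enc_out @ WV  # values come from the encoder output
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (tgt_len, src_len)
    return weights @ V

rng = np.random.default_rng(0)
src_len, tgt_len, d_model, d_k = 5, 3, 512, 64
enc_out = rng.normal(size=(src_len, d_model))  # stand-in for the encoder stack output
dec_X = rng.normal(size=(tgt_len, d_model))    # stand-in for the decoder states so far
# Hypothetical random weights in place of trained ones.
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))

Z = encoder_decoder_attention(dec_X, enc_out, WQ, WK, WV)
print(Z.shape)  # (3, 64)
```

The output has one row per decoder position, and each row is a mixture of encoder value vectors, so the source and target sentences may have different lengths.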

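The final linear and softmax layers

The decoder stack outputs a vector of floating-point numbers. How do we turn it into a word? That is the job of the final linear layer and the softmax layer.

The linear layer is a simple fully connected layer that projects the decoder's final output onto a much larger logits vector. Suppose the model knows 10,000 words (its output vocabulary), learned from the training set. The logits vector then has 10,000 dimensions, and each value is the score of one word.

The softmax layer turns these scores into probabilities (all positive, summing to 1). The index with the highest probability identifies the output word.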

This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.
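The last two layers amount to one matrix multiplication, a softmax, and an argmax. In this sketch the projection weights and the decoder output are random stand-ins; only the sizes (512 and 10,000) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10000

# Hypothetical weights of the final fully connected layer.
W_out = rng.normal(size=(d_model, vocab_size)) * 0.05
b_out = np.zeros(vocab_size)

decoder_output = rng.normal(size=d_model)  # stand-in for the decoder stack's output vector

logits = decoder_output @ W_out + b_out    # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax: positive, sums to 1

word_index = int(np.argmax(probs))         # index of the most probable word
print(probs.shape, word_index)
```

A real system would then look word_index up in its vocabulary to get the output word.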

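Training

Now that we have walked through the forward pass of a trained Transformer, it is useful to look at how training works. During training, the model goes through the same forward pass, but since we train on a labeled dataset, we can compare its predicted output with the actual output. For the sake of visualization, assume the output vocabulary contains only 6 words ("a", "am", "i", "thanks", "student", "eos").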

The output vocabulary of our model is created in the preprocessing phase before we even begin training.

Once the vocabulary is defined, we can represent each word as a vector of the same width as the vocabulary, for example with one-hot encoding. The example below encodes "am".

Example: one-hot encoding of our output vocabulary
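The one-hot encoding of "am" over the six-word vocabulary can be written out as follows. The helper names are made up for this example.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "eos"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector as wide as the vocabulary with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("am"))  # [0. 1. 0. 0. 0. 0.]
```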


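Next, let's discuss the model's loss: the quantity we optimize during training to guide learning toward an accurate model.

The loss function

Let's illustrate training with a simple example: translating "merci" into "thanks". We want the output probability distribution to point at the word "thanks", but because the untrained model is randomly initialized, it is unlikely to produce the desired output.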

Since the model's parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output.

How do we compare two probability distributions? We simply use either cross-entropy or the Kullback-Leibler divergence. This, however, is an extremely simplified example. More realistically, the input is a whole sentence: for example, the input is "je suis étudiant" and the expected output is "i am a student". In that case we want the model to output a sequence of probability distributions such that:

Each probability distribution has the same width as the vocabulary.

The first probability distribution gives the highest probability to "i".

The second probability distribution gives the highest probability to "am".

And so on, until the fifth output distribution points at the "eos" token.

The targeted probability distributions we'll train our model against in the training example for one sample sentence.
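Comparing target and predicted distributions with cross-entropy can be sketched like this. The two predicted distributions are invented for illustration: a uniform one standing in for an untrained model, and one that puts 0.95 on the right word standing in for a trained model.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "eos"]
target_words = ["i", "am", "a", "student", "eos"]

# One one-hot target row per output position: shape (5, 6).
targets = np.eye(len(vocab))[[vocab.index(w) for w in target_words]]

def cross_entropy(targets, predicted, eps=1e-12):
    """Mean over positions of -sum_w target[w] * log(predicted[w])."""
    return float(-(targets * np.log(predicted + eps)).sum(axis=-1).mean())

untrained = np.full(targets.shape, 1.0 / len(vocab))  # made-up stand-in: uniform guesses
trained = 0.95 * targets + 0.01 * (1.0 - targets)     # made-up stand-in: 0.95 on the right word

loss_untrained = cross_entropy(targets, untrained)    # ln 6, about 1.79
loss_trained = cross_entropy(targets, trained)        # -ln 0.95, about 0.05
print(round(loss_untrained, 2), round(loss_trained, 2))
```

With one-hot targets the Kullback-Leibler divergence equals the cross-entropy, because the target distribution has zero entropy, so either choice leads to the same training signal here.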


After training on a large enough dataset for long enough, we hope the model produces probability distributions like these:

Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.

Because the model produces its outputs one step at a time, we can have it pick the word with the highest probability at each step and discard the rest. This way of producing a prediction is called greedy decoding. Another way is beam search: at each step keep only the two most probable outputs, predict the next step from each of them, again keep the two most probable outputs, and repeat until prediction ends.
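The difference between the two strategies shows up even on a tiny example. In the sketch below the "model" is a hypothetical lookup table of next-token probabilities, invented so that the greedy choice at the first step turns out not to be the best sentence overall.

```python
import math

VOCAB = ["a", "b", "eos"]

# Hypothetical toy "model": next-token probabilities given the prefix so far.
TABLE = {
    (): [0.6, 0.4, 0.0],
    ("a",): [0.3, 0.2, 0.5],
    ("b",): [0.05, 0.05, 0.9],
}

def next_probs(prefix):
    # Any prefix missing from the table simply ends the sentence.
    return TABLE.get(tuple(prefix), [0.0, 0.0, 1.0])

def greedy_decode(max_len=5):
    seq = []
    while len(seq) < max_len and (not seq or seq[-1] != "eos"):
        probs = next_probs(seq)
        seq.append(VOCAB[probs.index(max(probs))])  # keep only the single best token
    return seq

def beam_search(beam_width=2, max_len=5):
    beams = [([], 0.0)]  # (tokens, log-probability of the whole prefix)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "eos":
                candidates.append((seq, score))  # finished hypotheses are carried over
                continue
            for token, p in zip(VOCAB, next_probs(seq)):
                if p > 0:
                    candidates.append((seq + [token], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == "eos" for seq, _ in beams):
            break
    return beams[0][0]

print(greedy_decode())  # ['a', 'eos']
print(beam_search())    # ['b', 'eos']
```

Greedy decoding commits to "a" (probability 0.6) and ends with a sentence of total probability 0.6 * 0.5 = 0.30. Beam search keeps "b" alive as a second hypothesis and finds "b eos" with probability 0.4 * 0.9 = 0.36.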

Further reading

Attention Is All You Need

The Transformer blog post: Transformer: A Novel Neural Network Architecture for Language Understanding

Tensor2Tensor announcement.

Jupyter Notebook provided as part of the Tensor2Tensor repo

Tensor2Tensor repo.
