Note 1: Transformer

Attention Is All You Need [1]

1. Encoder-Decoder

  • The encoder maps an input sequence of symbol representations (x_1,\ldots, x_n) to a sequence of continuous representations z = (z_1, \ldots, z_n).
  • Given z, the decoder then generates an output sequence (y_1, \ldots, y_m) of symbols one element at a time.
  • At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
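A toy sketch of this auto-regressive loop in Python (toy_decoder_step below is a made-up stand-in for the real decoder, which would attend over z and the previously generated symbols):

    # Toy auto-regressive generation loop; toy_decoder_step is a placeholder.
    def toy_decoder_step(z, generated):
        # Stand-in "decoder": just counts upward until the end symbol appears.
        return generated[-1] + 1 if generated else 0

    def generate(z, end_symbol=3, max_len=10):
        generated = []
        for _ in range(max_len):
            y_t = toy_decoder_step(z, generated)  # depends only on z and y_1..y_{t-1}
            generated.append(y_t)
            if y_t == end_symbol:
                break
        return generated

    print(generate(z=None))  # [0, 1, 2, 3]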
Overview of the Transformer
  • Encoder
    • It has 6 identical layers.
    • Each layer has two sub-layers: a multi-head self-attention sub-layer followed by a position-wise fully connected feed-forward sub-layer.
    • Each sub-layer uses a residual connection and is followed by layer normalization.
    • The residual connection computes x + F(x), where F(x) is the function implemented by the sub-layer itself, so each sub-layer's output is LayerNorm(x + F(x)) (see the sketch after this list).
  • Decoder
    • It has 6 identical layers.
    • Each layer has three sub-layers:
      • A masked multi-head self-attention sub-layer ensures that the prediction at time t can only depend on the known outputs at positions less than t.
      • A multi-head attention sub-layer attends over the output of the encoder stack (encoder-decoder attention).
      • A position-wise fully connected feed-forward sub-layer.
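A minimal NumPy sketch of the sub-layer wrapping described above (the attention and feed-forward functions are passed in as placeholders here; their actual definitions are in Sections 2 and 3):

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Normalize each position over the feature dimension.
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def sublayer(x, f):
        # Residual connection followed by layer normalization: LayerNorm(x + F(x)).
        return layer_norm(x + f(x))

    def encoder_layer(x, self_attn, feed_forward):
        # One encoder layer: self-attention sub-layer, then position-wise
        # feed-forward sub-layer, each wrapped in residual + LayerNorm.
        x = sublayer(x, self_attn)
        x = sublayer(x, feed_forward)
        return x

    # Usage with identity functions standing in for the real sub-layers.
    x = np.random.randn(4, 512)                       # n = 4 positions, d_model = 512
    out = encoder_layer(x, lambda t: t, lambda t: t)  # shape (4, 512) is preserved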

2. Attention

  • An attention function maps a query and a set of key-value pairs to an output, computed as a weighted sum of the values (a code sketch of both attention variants appears after this section's list).
  • Scaled Dot-product Attention
    • Given a query Q \in R^{1 \times d_k} and a set of m key-value pairs packed into matrices K \in R^{m \times d_k} and V \in R^{m \times d_v}:
      Attention(Q, K, V)=softmax(\frac{QK^T}{\sqrt{d_k}})V
  • Multi-head Attention
    • Multi-head attention jointly attends to information from different representation subspaces at different positions.
    • Given n queries Q \in R^{n \times d_{model}}, keys K \in R^{m \times d_{model}} and values V \in R^{m \times d_{model}}:
      MultiHead(Q, K, V)=Concat({head}_1, \cdots, {head}_h)W^o
      where \ {head}_i=Attention(QW_i^Q, KW_i^K, VW_i^V)
      • W_i^Q \in R^{d_{model} \times d_k}, W_i^K \in R^{d_{model} \times d_k}, W_i^V \in R^{d_{model} \times d_v}, {head}_i \in R^{n \times d_v}, Concat(\cdot) \in R^{n \times hd_v}, W^o \in R^{hd_v \times d_{model}}.
      • Q and MultiHead(Q, K, V) have the same dimension R^{n \times d_{model}}.
      • First, it linearly projects the queries, keys and values h times with different learned projections W_i^Q, W_i^K and W_i^V.
      • Next, it concatenates the outputs of all h heads.
      • At last, it projects the concatenated matrix back to d_{model} dimensions with W^o.
        An illustrated example can be found in [2].
  • Attention in Transformer
    • Encoder's multi-head:
      • Q=K=V=the output of the previous layer.
      • Each position in the encoder can attend to all positions in the previous layer of the encoder.
    • Decoder's masked multi-head:
      • Q=K=V=the output of the previous decoder layer; the masking is applied inside the attention computation.
      • For example, if we predict the t-th output token, all positions from t onward have to be masked.
      • This prevents leftward information flow in the decoder in order to preserve the auto-regressive property.
      • It masks out (by setting to -\infty) all values in the input of the softmax which correspond to illegal connections during the scaled dot-product attention.
    • Decoder's multi-head:
      • Q=the output of the previous decoder layer, K=V=the encoder stack's output.
      • This allows every position in the decoder to attend over all positions in the input sequence.
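A minimal NumPy sketch of the two attention variants above, including the optional mask used by the decoder's masked self-attention (the random projection matrices are illustrative stand-ins for trained parameters, and d_v = d_k here):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V, mask=None):
        # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        if mask is not None:
            scores = np.where(mask, scores, -np.inf)  # illegal connections -> -inf
        return softmax(scores) @ V

    def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, mask=None):
        # W_q, W_k, W_v are lists of h per-head projections; W_o maps back to d_model.
        heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv, mask)
                 for wq, wk, wv in zip(W_q, W_k, W_v)]
        return np.concatenate(heads, axis=-1) @ W_o

    # Tiny example: n = 3 positions, d_model = 8, h = 2 heads, d_k = d_v = 4.
    n, d_model, h, d_k = 3, 8, 2, 4
    rng = np.random.default_rng(0)
    x = rng.normal(size=(n, d_model))          # encoder self-attention: Q = K = V = x
    W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
    W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
    W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
    W_o = rng.normal(size=(h * d_k, d_model))
    causal_mask = np.tril(np.ones((n, n), dtype=bool))  # position t attends to t' <= t
    out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, mask=causal_mask)
    print(out.shape)  # (3, 8): same d_model as the input queries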

3. Position-wise Feed-forward Networks

FFN(x)=max(0, xW_1+b_1)W_2+b_2

  • ReLU activation function: max(0, x).
  • It is applied to each position separately and identically; the parameters differ from layer to layer.
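A minimal NumPy sketch of this position-wise feed-forward network (in the paper d_model = 512 and the inner dimension d_ff = 2048; the random weights below are only for shape checking):

    import numpy as np

    def position_wise_ffn(x, W1, b1, W2, b2):
        # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position (row) of x.
        return np.maximum(0, x @ W1 + b1) @ W2 + b2

    d_model, d_ff, n = 512, 2048, 4
    rng = np.random.default_rng(0)
    x = rng.normal(size=(n, d_model))
    W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
    print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (4, 512)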

4. Positional Encoding

  • To make use of the order of the sequence.
  • Added to the input embeddings at the bottoms of the encoder and decoder stacks.
  • They have the same dimension d_{model} as the embeddings, so the two can be summed.
  • PE(pos, 2i)=sin(pos/10000^{2i / d_{model}})
    PE(pos, 2i+1)=cos(pos/10000^{2i / d_{model}})
    • pos is the position index: 1 \leq pos \leq n in the encoder and 1 \leq pos \leq m in the decoder.
    • i indexes the dimension pairs: each i gives one sine dimension (2i) and one cosine dimension (2i+1), so the pairs together cover all d_{model} dimensions.
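A minimal NumPy sketch of the sinusoidal positional encoding (using the usual 0-based implementation convention for pos and the dimension pair index i):

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
        # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
        pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
        angles = pos / np.power(10000, 2 * i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)             # even dimensions
        pe[:, 1::2] = np.cos(angles)             # odd dimensions
        return pe

    pe = positional_encoding(seq_len=50, d_model=512)
    print(pe.shape)  # (50, 512): same dimension as the embeddings, so they can be summed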

Reference

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
[2] 口仆. Transformer 原理解析. https://zhuanlan.zhihu.com/p/135873679
