1. Starting from the encoder-decoder
The encoder-decoder is a framework widely used in generative models; here we use a translation system as the running example.
- Add encoder-decoder figure
The encoder side takes the source sentence source = (x1, x2, ..., xn). The encoder encodes source into C (a non-linear transformation into an intermediate semantic representation), and the decoder uses C together with the history it has already generated to produce target = (y1, y2, ..., ym). When source and target are in different languages, this is a translation system.
- Add translation animation
The encoder interprets the semantics of source; the decoder renders that semantics into target.
Note that the encoder and decoder here can be RNN models or attention-based models, and the two sides do not have to use the same model type.
2. The attention mechanism
- Add a diagram of the attention mechanism
The basic seq2seq model uses an RNN as its backbone. Its limitation is that when a sentence is long, compressing the whole input into a fixed-dimensional semantic vector C cannot retain all of the semantic information the decoder needs. The idea of attention is therefore: instead of handing only the encoder's final vector to the decoder and asking it to extract the whole sentence from that, give the decoder every vector the encoder produces; when generating the new sequence, the decoder decides for itself which vectors to attend to, i.e. which vectors to combine.
The attention mechanism proceeds as follows (see the sketch after this list):
- Compare the decoder's current hidden state vector ht (red in the diagram) with every encoder hidden state vector hs (blue), using a score function to compute how much ht attends to each hs.
- Using these attention scores as weights, take a weighted average of all encoder hidden states hs to obtain the context vector.
- Combine the context vector with the decoder hidden state into an attention vector, which is the output at that time step.
- The attention vector is fed to the decoder as input at the next time step.
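To make these four steps concrete, here is a minimal sketch of one decoder time step using a dot-product score function. The tensors ht and hs and all dimension sizes are made up purely for illustration and are not part of the original text; the imports are the ones used throughout this post.

import tensorflow as tf
import numpy as np

# toy shapes: batch = 1, encoder length = 5, hidden size = 4 (illustrative only)
hs = tf.random.normal((1, 5, 4))   # all encoder hidden states
ht = tf.random.normal((1, 1, 4))   # decoder hidden state at the current step

# 1. score: how much ht attends to each hs (dot-product score function)
score = tf.matmul(ht, hs, transpose_b=True)        # (1, 1, 5)
attention_weights = tf.nn.softmax(score, axis=-1)  # (1, 1, 5)

# 2. context vector: attention-weighted average of the encoder states
context_vector = tf.matmul(attention_weights, hs)  # (1, 1, 4)

# 3. attention vector: combine the context vector with the decoder state
combine_layer = tf.keras.layers.Dense(4, activation='tanh')
attention_vector = combine_layer(tf.concat([context_vector, ht], axis=-1))  # (1, 1, 4)
# 4. attention_vector would be fed back as the decoder input at the next time step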
So what is self-attention?
The core idea is that when building the representation (rep) of each element in a sequence, we also attend to the other elements of the same sequence and fold that contextual information into the element's own representation.
To be completed....
3. Transformer
The attention mechanism is usually embedded in a seq2seq model, and earlier seq2seq models typically used RNNs for both the encoder and the decoder. Attention solves the problem of carrying information across long sequences, but an RNN still cannot be parallelized; combining the two ideas gives the Transformer.
In the Transformer, the decoder attends to the encoder's sequence via encoder-decoder attention, while the encoder and decoder each process their own sequence with self-attention. No RNN is used, so the computation can be parallelized. It seems likely that the Transformer will dominate seq2seq models going forward.
4. Implementing the Transformer from scratch
Assume that after word2vec-style processing and other preprocessing we obtain the source- and target-language index tensors:
sour: tf.Tensor(
[[8135 105 10 1304 7925 8136 0 0]
[8135 17 3905 6013 12 2572 7925 8136]], shape=(2, 8), dtype=int64)
tar: tf.Tensor(
[[4201 10 241 80 27 3 4202 0 0 0]
[4201 162 467 421 189 14 7 553 3 4202]], shape=(2, 10), dtype=int64)
In the shapes, 2 is the batch_size, while 8 and 10 are the source and target sequence lengths (shorter sentences are padded with 0). 8135/8136 and 4201/4202 are the start and end tokens of the source and target vocabularies. After each token is embedded, one more dimension is added, giving shape = (batch_size, seq_len, d_model), where d_model is the embedding dimension.
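As a concrete sketch of that last step, assuming a toy embedding dimension d_model = 4 (the embedding layers below and the vocabulary sizes 8137/4203 are illustrative assumptions, chosen only so the start/end token ids above fit):

d_model = 4  # assumed toy value
sour_embedding = tf.keras.layers.Embedding(input_dim=8137, output_dim=d_model)
tar_embedding = tf.keras.layers.Embedding(input_dim=4203, output_dim=d_model)
emb_sour = sour_embedding(sour)  # (2, 8, d_model)
emb_tar = tar_embedding(tar)     # (2, 10, d_model)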
4.1 Scaled dot-product attention
Two kinds of masks are used here: the padding mask and the look-ahead mask. The padding mask hides the zero-padded positions of a sequence from the Transformer; the look-ahead mask ensures the decoder only sees tokens it has already produced. In the implementation, masked positions are set to 1, so for either mask the positions whose value is 1 are the masked ones.
def create_padding_mask(seq):
    # the padding mask marks the positions of the index sequence that equal 0 with a 1
    mask = tf.cast(tf.equal(seq, 0), tf.float32)
    # add two axes so the mask broadcasts over the head and query dimensions
    return mask[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)
# the look-ahead mask is the upper-right triangle (positions after the current one)
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)
For example:
look_ahead_mask for emb_tar: tf.Tensor(
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]], shape=(10, 10), dtype=float32)
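For reference, the tensors above can be reproduced roughly like this (a sketch using the sour and tar tensors shown earlier):

# padding mask for the source batch: 1 where sour is 0-padded
inp_padding_mask = create_padding_mask(sour)             # (2, 1, 1, 8)
# look-ahead mask sized to the target sequence length (10 in the example)
look_ahead_mask = create_look_ahead_mask(tar.shape[-1])  # (10, 10)
print(look_ahead_mask)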
The essence of attention is to take a query and compute against a set of key-value pairs to produce an output; running in parallel simply means taking several queries against the same set of keys at once. The weight of each value is obtained by matching its corresponding key against the query. That is:
First prepare Q, K and V. Here Q and K are simply the embedded source sentence emb_sour, and V is a binarized tensor with the same shape as Q and K (purely for illustration). See the figure below:
def scaled_dot_product_attention(q, k, v, mask):
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
    # scale by sqrt(dk) to keep the logits in a reasonable range
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    # apply the mask: masked positions get a large negative logit
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
    # softmax over the key axis, then use the weights to average v
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights
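A quick sanity check on the toy batch, with V binarized as described above (slicing the padding mask so it broadcasts against 3-D inputs is an assumption of this sketch, not part of the original):

q = emb_sour                                  # (2, 8, d_model)
k = emb_sour                                  # (2, 8, d_model)
v = tf.cast(emb_sour > 0, tf.float32)         # (2, 8, d_model), binarized for illustration
padding_mask = create_padding_mask(sour)[:, 0, :, :]   # (2, 1, 8), broadcasts over the query axis
output, attention_weights = scaled_dot_product_attention(q, k, v, padding_mask)
print(output.shape)             # (2, 8, d_model)
print(attention_weights.shape)  # (2, 8, 8)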
4.2 Multi-head attention
num_heads * depth = d_model
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads  # number of heads to split d_model into
        self.d_model = d_model      # dimension before split_heads
        assert d_model % self.num_heads == 0
        self.depth = d_model // self.num_heads  # dimension of each head

        self.wq = tf.keras.layers.Dense(d_model)  # separate linear projections for q, k, v
        self.wk = tf.keras.layers.Dense(d_model)  # no activation function
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)  # linear layer applied after concatenating the heads

    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        # project q, k, v into the d_model-dimensional space
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
        # split d_model into num_heads sub-spaces of size depth
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        # broadcasting lets every head of every sentence run attention on its own qi, ki, vi
        # the output gains a head dimension
        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        # transpose then reshape to concatenate the num_heads depth dimensions back into d_model
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        # (batch_size, seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        # (batch_size, seq_len_q, d_model)
        # final linear projection
        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
        return output, attention_weights
- Add animation
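A shape check of the layer on the same toy batch (d_model = 4 and num_heads = 2 are assumed values, so each head has depth 2):

mha = MultiHeadAttention(d_model=4, num_heads=2)
padding_mask = create_padding_mask(sour)               # (2, 1, 1, 8)
output, attention_weights = mha(emb_sour, emb_sour, emb_sour, padding_mask)
print(output.shape)             # (2, 8, 4)
print(attention_weights.shape)  # (2, 2, 8, 8)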
4.3 Stacking up the Transformer
- Add animation
Layer hierarchy:
Transformer
    Encoder
        input embedding
        positional encoding
        N Encoder layers
            sub-layer 1: Encoder self-attention
            sub-layer 2: Feed Forward
    Decoder
        output embedding
        positional encoding
        N Decoder layers
            sub-layer 1: Decoder self-attention
            sub-layer 2: Decoder-Encoder attention
            sub-layer 3: Feed Forward
    Final Dense Layer
4.3.1 Encoder Layer
As the figure below shows, an encoder layer contains two sub-layers, MHA and FFN. In Add & Norm, each sub-layer has a residual connection to mitigate vanishing gradients, and each sub-layer applies layer normalization over the last dimension d_model, so that the output's mean and variance are close to 0 and 1.
The FFN (feed-forward network) is a component present in both the encoder layer and the decoder layer; its input and output dimensions are the same.
def point_wise_feed_forward_network(d_model, dff):
    # two linear transformations with a ReLU in between
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)                  # (batch_size, seq_len, d_model)
    ])
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)
        # layer norm is very common in RNN-based models; one layer norm per sub-layer
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        # likewise, one dropout layer per sub-layer
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        # training is passed in because dropout behaves differently in training and inference
        # apart from attn, every tensor has shape (batch_size, input_seq_len, d_model)
        # attn.shape == (batch_size, num_heads, input_seq_len, input_seq_len)
        # sub-layer 1: MHA
        attn_output, attn = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)
        # sub-layer 2: FFN
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)  # remember training
        out2 = self.layernorm2(out1 + ffn_output)
        return out2
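A shape check of a single encoder layer (dff = 8 is an assumed toy value); the output keeps the input shape, which is what allows the layers to be stacked:

enc_layer = EncoderLayer(d_model=4, num_heads=2, dff=8)
out = enc_layer(emb_sour, training=False, mask=create_padding_mask(sour))
print(out.shape)  # (2, 8, 4), same shape as the input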
4.3.2 Decoder layer
Overall, a decoder layer contains three parts:
- the decoder's own masked MHA 1
- MHA 2 between the decoder and the encoder
- FFN
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()
        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, combined_mask, inp_padding_mask):
        # every sub-layer outputs (batch_size, target_seq_len, d_model)
        # enc_output is the Encoder output, shape (batch_size, input_seq_len, d_model)
        # attn_weights_block1.shape == (batch_size, num_heads, target_seq_len, target_seq_len)
        # attn_weights_block2.shape == (batch_size, num_heads, target_seq_len, input_seq_len)
        # sub-layer 1: Decoder self-attention; v, k, q are all x
        # needs both the decoder look-ahead mask and the target padding mask
        attn1, attn_weights_block1 = self.mha1(x, x, x, combined_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)
        # sub-layer 2: the Decoder layer attends to the Encoder output sequence
        # v, k are enc_output; q is out1, the result of MHA 1
        # the padding mask keeps it from attending to <pad>
        attn2, attn_weights_block2 = self.mha2(enc_output, enc_output, out1, inp_padding_mask)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)
        # sub-layer 3: FFN
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)
        return out3, attn_weights_block1, attn_weights_block2
In the code above, the first MHA takes a combined_mask argument, which combines the target's own padding mask (the target's, not the source's) with the look-ahead mask. They are combined as shown below: simply take the element-wise maximum of the two masks. The padding mask used by the second MHA is the source's.
tar_padding_mask = create_padding_mask(tar)
look_ahead_mask = create_look_ahead_mask(tar.shape[-1])
combined_mask = tf.maximum(tar_padding_mask, look_ahead_mask)
4.3.3 Positional encoding
The attention mechanism lets any position in a sequence observe any other position in O(1) operations, which solves the long-range dependency problem, but it cannot express order information well. Positional encodings are therefore added to the Transformer: they are added directly to the word embeddings and have the same dimension as d_model.
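For reference, the encoding used here is the sinusoidal one from the original Transformer paper; the code below computes the same sine/cosine values, concatenating the sine half and the cosine half along the last axis rather than interleaving them:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$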
This encoding has a nice property: given the encoding PE(pos) at one position, the encoding PE(pos+k) of the position k steps away can be expressed as a linear transformation of PE(pos).
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    # apply sin to even indices in the array; 2i
    sines = np.sin(angle_rads[:, 0::2])
    # apply cos to odd indices in the array; 2i+1
    cosines = np.cos(angle_rads[:, 1::2])
    pos_encoding = np.concatenate([sines, cosines], axis=-1)
    pos_encoding = pos_encoding[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)
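A quick look at the output (the position count and d_model here are toy values):

pos_encoding = positional_encoding(position=50, d_model=4)
print(pos_encoding.shape)  # (1, 50, 4): only the first seq_len positions get added to the embeddings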
4.3.4 Encoder
The Encoder contains three components:
- an input embedding layer
- positional encoding
- N Encoder layers
Input: (batch_size, seq_len)
Output: (batch_size, seq_len, d_model)
class Encoder(tf.keras.layers.Layer):
    # num_layers: number of EncoderLayers
    # input_vocab_size: used to convert indices into word embeddings
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, rate=0.1):
        super(Encoder, self).__init__()
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(input_vocab_size, self.d_model)
        # build num_layers Encoder layers
        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        # x.shape == (batch_size, input_seq_len)
        input_seq_len = tf.shape(x)[1]
        # turn the 2-D index sequence into 3-D word embeddings, multiply by sqrt(d_model)
        # as in the paper, then add the corresponding positional encodings
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :input_seq_len, :]
        # apply dropout to the sum of embeddings and positional encodings as regularization
        x = self.dropout(x, training=training)
        # run through the n encoder layers
        for i, enc_layer in enumerate(self.enc_layers):
            x = enc_layer(x, training, mask)
        return x
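A shape check of the full Encoder on the toy batch (the hyperparameters and the vocabulary size 8137 are assumptions carried over from the earlier sketches):

encoder = Encoder(num_layers=2, d_model=4, num_heads=2, dff=8, input_vocab_size=8137)
enc_out = encoder(sour, training=False, mask=create_padding_mask(sour))
print(enc_out.shape)  # (2, 8, 4)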
4.3.5 Decoder
The structure mirrors the Encoder: an output embedding layer, positional encoding, and N decoder layers (a shape check follows the class below).
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size, rate=0.1):
        super(Decoder, self).__init__()
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(target_vocab_size, self.d_model)
        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, combined_mask, inp_padding_mask):
        tar_seq_len = tf.shape(x)[1]
        attention_weights = {}  # collects the attention weights of every Decoder layer
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :tar_seq_len, :]
        x = self.dropout(x, training=training)
        for i, dec_layer in enumerate(self.dec_layers):
            x, block1, block2 = dec_layer(x, enc_output, training, combined_mask, inp_padding_mask)
            attention_weights['decoder_layer{}_block1'.format(i + 1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i + 1)] = block2
        return x, attention_weights
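And the matching shape check for the Decoder, reusing combined_mask from 4.3.2 and enc_out from the Encoder sketch above (the vocabulary size 4203 is again an assumption):

decoder = Decoder(num_layers=2, d_model=4, num_heads=2, dff=8, target_vocab_size=4203)
dec_out, attn_weights = decoder(tar, enc_out, training=False,
                                combined_mask=combined_mask,
                                inp_padding_mask=create_padding_mask(sour))
print(dec_out.shape)  # (2, 10, 4)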
4.3.6 Assembling the full Transformer
It consists of three parts:
- Encoder
- Decoder
- Final linear layer
Input:
sour sequence: (batch_size, sour_seq_len)
tar sequence: (batch_size, tar_seq_len)
Output:
generated sequence: (batch_size, tar_seq_len, tar_vocab_size)
attention weights
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, rate=0.1):
        super(Transformer, self).__init__()
        self.encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, rate)
        self.decoder = Decoder(num_layers, d_model, num_heads, dff, target_vocab_size, rate)
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, sour, tar, training, enc_padding_mask, combined_mask, dec_padding_mask):
        enc_output = self.encoder(sour, training, enc_padding_mask)
        # enc_output.shape == (batch_size, sour_seq_len, d_model)
        dec_output, attention_weights = self.decoder(tar, enc_output, training, combined_mask, dec_padding_mask)
        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        # pass the decoder output through the final linear layer
        final_output = self.final_layer(dec_output)
        # (batch_size, tar_seq_len, target_vocab_size)
        return final_output, attention_weights
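Finally, a minimal end-to-end sketch that wires everything together on the toy batch; all hyperparameters are the assumed toy values used throughout:

transformer = Transformer(num_layers=2, d_model=4, num_heads=2, dff=8,
                          input_vocab_size=8137, target_vocab_size=4203)

enc_padding_mask = create_padding_mask(sour)
dec_padding_mask = create_padding_mask(sour)  # the decoder's second MHA masks the source padding
combined_mask = tf.maximum(create_padding_mask(tar),
                           create_look_ahead_mask(tar.shape[-1]))

predictions, attention_weights = transformer(sour, tar, False,
                                             enc_padding_mask, combined_mask, dec_padding_mask)
print(predictions.shape)  # (2, 10, 4203): one distribution over the target vocabulary per position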
- Add the final video