Original article link: http://www.reibang.com/p/d7ce41b58801
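This article walks through modeling.py, the code for the main body of the model. Before reading it, readers should have some understanding of the theory behind BERT, especially the structure and principles of the Transformer. There is plenty of material online, so this article does not go into the theory in much detail.
One reason I wrote this up is to help organize my own learning: only when you write things out do you find which parts you understand and which you do not. My knowledge is limited, so where my understanding falls short, corrections are welcome.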
1. Configuration
class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range
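The model configuration is fairly simple. In order, the parameters are: vocabulary size (vocab_size), hidden size (hidden_size), number of Transformer layers (num_hidden_layers), number of attention heads (num_attention_heads), intermediate layer size (intermediate_size), activation function (hidden_act), hidden-layer dropout rate (hidden_dropout_prob), dropout rate inside attention (attention_probs_dropout_prob), maximum sequence length (max_position_embeddings), vocabulary size of token_type_ids (type_vocab_size), and the stddev of the truncated_normal_initializer (initializer_range).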
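As a small illustration of how these settings relate to each other, the sketch below uses a plain dictionary (a hypothetical stand-in, not the real `BertConfig` class) holding the defaults listed above and derives the per-head width in the same way `transformer_model` does later in this article. The `vocab_size` value and the helper name `head_size` are assumptions made only for this example.

```python
# Hypothetical stand-in for BertConfig: a plain dict with the defaults shown above.
config = {
    "vocab_size": 30000,  # assumed value, for illustration only
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "type_vocab_size": 16,
}


def head_size(hidden_size, num_attention_heads):
    """Width of each attention head; hidden_size must divide evenly."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be a multiple of num_attention_heads")
    return hidden_size // num_attention_heads


size_per_head = head_size(config["hidden_size"], config["num_attention_heads"])
print(size_per_head)  # 64
```

With the defaults, each of the 12 heads works on a 64-dimensional slice of the 768-dimensional hidden vector.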
2. Word embedding
def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])
  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))
  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)
  input_shape = get_shape_list(input_ids)
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)
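This function builds embedding_table and performs the word embedding lookup, optionally through a one-hot matrix multiplication (use_one_hot_embeddings). It returns both the embedded output and embedding_table.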
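The two lookup paths give the same result. The following pure-Python sketch, with a made-up toy table, shows that multiplying one-hot rows by the table selects exactly the rows a direct lookup would return. The function names `lookup` and `one_hot_lookup` are hypothetical and only mimic what `tf.nn.embedding_lookup` and `tf.one_hot` plus `tf.matmul` do in the code above.

```python
def lookup(table, ids):
    """Direct row lookup, analogous to tf.nn.embedding_lookup."""
    return [table[i] for i in ids]


def one_hot_lookup(table, ids):
    """One-hot rows multiplied by the table, analogous to tf.one_hot + tf.matmul."""
    vocab_size = len(table)
    width = len(table[0])
    out = []
    for i in ids:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        row = [sum(one_hot[j] * table[j][k] for j in range(vocab_size))
               for k in range(width)]
        out.append(row)
    return out


# Toy embedding table: vocab_size=4, embedding_size=2 (made-up values).
table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
ids = [2, 0, 3]
print(lookup(table, ids) == one_hot_lookup(table, ids))  # True
```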
3. Post-processing the word embeddings
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]
  output = input_tensor
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings
  output = layer_norm_and_dropout(output, dropout_prob)
  return output
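This step mainly adds information: each word's position embedding and token type embedding are added to its word vector, and the result is returned after layer normalization and dropout.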
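The additive part of this step can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the TensorFlow implementation: the function name `add_embeddings` and all numeric values are made up, and the final layer normalization and dropout are left out. It shows that the position embedding depends only on the position index (so it is shared across the batch), while the token type embedding is picked per token.

```python
def add_embeddings(word_emb, token_type_ids, token_type_table, position_table):
    """word_emb: [batch][seq][width]; returns the element-wise sum of the
    word, token type, and position embeddings."""
    out = []
    for b, sequence in enumerate(word_emb):
        rows = []
        for pos, vec in enumerate(sequence):
            type_vec = token_type_table[token_type_ids[b][pos]]
            pos_vec = position_table[pos]  # same for every example in the batch
            rows.append([w + t + p for w, t, p in zip(vec, type_vec, pos_vec)])
        out.append(rows)
    return out


word_emb = [[[1.0, 1.0], [2.0, 2.0]]]  # batch=1, seq=2, width=2
token_type_table = [[10.0, 10.0], [20.0, 20.0]]
position_table = [[100.0, 100.0], [200.0, 200.0], [300.0, 300.0]]  # 3 positions
result = add_embeddings(word_emb, [[0, 1]], token_type_table, position_table)
print(result)  # [[[111.0, 111.0], [222.0, 222.0]]]
```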
4. Building the attention mask
def create_attention_mask_from_input_mask(from_tensor, to_mask):
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]
  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]
  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
  mask = broadcast_ones * to_mask
  return mask
This converts a 2D mask of shape [batch_size, to_seq_length] into a 3D mask of shape [batch_size, from_seq_length, to_seq_length] for use in attention.
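The broadcast in the function above simply repeats each sequence's 2D mask row once per query position. A minimal pure-Python sketch of the same shape change follows; the function name `create_attention_mask` and the toy mask are assumptions for illustration.

```python
def create_attention_mask(from_seq_length, to_mask):
    """to_mask: [batch][to_seq] of 0/1 -> [batch][from_seq][to_seq] of floats."""
    return [[[float(m) for m in row] for _ in range(from_seq_length)]
            for row in to_mask]


to_mask = [[1, 1, 0]]  # one sequence whose last position is padding
mask = create_attention_mask(3, to_mask)
print(mask)  # [[[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]]
```

Every row is identical: the mask only controls which positions can be attended to, not which positions do the attending, so the row for the padding position itself is not zeroed out.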
5. Attention layer
def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")
  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")
  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`
  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)
  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))
  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))
  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))
  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)
  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)
  # `attention_scores` = [B, N, F, T], scaled by 1/sqrt(H)
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))
  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
    attention_scores += adder
  attention_probs = tf.nn.softmax(attention_scores)
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])
  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)
  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])
  return context_layer
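Here comes the centerpiece of the whole network. Most of the Transformer lives in this function: the input from_tensor serves as the query, and to_tensor serves as the key and value. For self-attention, from_tensor and to_tensor are the same tensor.
(1) The function first validates the input shapes and obtains batch_size, from_seq_length, and to_seq_length. If the input is a 3D tensor, it is reshaped into a 2D matrix (taking word embeddings as an example: [batch_size, seq_length, hidden_size] -> [batch_size*seq_length, hidden_size]).
(2) Fully connected linear projections produce query_layer, key_layer, and value_layer, whose second dimension becomes num_attention_heads * size_per_head (throughout the model, hidden_size = num_attention_heads * size_per_head). transpose_for_scores then splits query_layer and key_layer into multiple heads; value_layer goes through the same reshape and transpose further down.
(3) attention_probs (the attention scores after softmax) is computed according to the formula attention_probs = softmax(Q K^T / sqrt(size_per_head)):
If attention_mask is not None, a large negative number (-10000.0) is added to the scores at masked positions, so that after softmax the corresponding probabilities are close to 0. Dropout is then applied.
(4) Finally, attention_probs is multiplied by value, and the result is returned as a 3D tensor or a 2D matrix.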
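Steps (3) and (4) can be sketched for a single head in plain Python. This is a simplified illustration, not the TensorFlow code: it assumes the query, key, and value projections have already been applied, it skips dropout and the multi-head reshape, and the names `softmax` and `single_head_attention` as well as the toy inputs are made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(query, key, value, mask=None):
    """query: [F][H], key/value: [T][H], mask: [F][T] of 0/1. Returns [F][H]."""
    size_per_head = len(query[0])
    scale = 1.0 / math.sqrt(float(size_per_head))
    context = []
    for f, q in enumerate(query):
        # Scaled dot product of this query with every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        if mask is not None:
            # Same trick as the source: add -10000 where the mask is 0.
            scores = [s + (1.0 - mask[f][t]) * -10000.0
                      for t, s in enumerate(scores)]
        probs = softmax(scores)
        # Probability-weighted sum of the value vectors.
        context.append([sum(p * v[h] for p, v in zip(probs, value))
                        for h in range(size_per_head)])
    return context


query = key = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask = [[1, 1, 0]] * 3  # the third position is padding
masked = single_head_attention(query, key, value, mask)
unmasked = single_head_attention(query, key, value)
```

In `masked`, the padded value vector `[5.0, 5.0]` contributes essentially nothing, so each output row sums to about 1; in `unmasked` it leaks into every row.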
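Summary:
Readers can compare this code with the network architecture diagram.
Compared with other Transformer implementations, this function is simplified in several places:
(1) There is no causality mask. My guess is that this is because BERT has no decoder, so the lower-triangular mask is unnecessary. Seen from another angle, this is exactly what makes the Transformer bidirectional.
(2) There is no query mask. The reason is similar to (1): the encoder only uses self-attention, where query and key come from the same sequence, so a single key mask is enough.
(3) There is no residual connection or normalization of the query inside this function. In this code base they are applied by the caller, transformer_model.
Note that the scale operation is not dropped: attention_scores is multiplied by 1.0 / math.sqrt(float(size_per_head)) before the softmax.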
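To make the difference between the two kinds of mask concrete, here is a small sketch that contrasts a decoder-style causal mask with the padding-only mask that BERT's encoder uses. Both helper names are hypothetical and the masks are for a single sequence.

```python
def causal_mask(seq_length):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1.0 if j <= i else 0.0 for j in range(seq_length)]
            for i in range(seq_length)]


def padding_mask(input_mask):
    """BERT-style mask: every position may attend to every non-padding position."""
    return [[float(m) for m in input_mask] for _ in input_mask]


print(causal_mask(3))
# [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
print(padding_mask([1, 1, 0]))
# [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
```

With the padding mask, the first token can already see the second one, which is what "bidirectional" means here.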
6. Transformer
def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))
  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))
  prev_output = reshape_to_matrix(input_tensor)
  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output
      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)
        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          attention_output = tf.concat(attention_heads, axis=-1)
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)
  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
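The Transformer is built on top of attention and works in the following steps:
(1) Compute attention_head_size = int(hidden_size / num_attention_heads), which splits the hidden-layer output evenly across the attention heads. Then reshape input_tensor into a 2D matrix.
(2) Apply multi-head attention to the input, followed by: linear projection -> dropout -> residual connection with layer_input -> layer norm -> intermediate linear projection (with the activation function) -> linear projection -> dropout -> residual connection with attention_output -> layer norm.
The width of the intermediate projection (intermediate_size) can be chosen freely, but the other linear projections must all use hidden_size so that the shapes line up for the residual additions.
(3) Repeat this for num_hidden_layers layers, saving the output of each one, and finally return either the outputs of all layers or only the output of the last layer.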
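The order of operations inside one layer can be sketched on a single token vector as follows. This is only a rough illustration under several assumptions: `attention_fn` and `feed_forward_fn` are hypothetical placeholders for "attention plus its output projection" and "intermediate projection plus output projection", dropout is omitted, and `layer_norm` here has no learned scale or shift (the `eps` value is also an assumption).

```python
import math


def layer_norm(vec, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def encoder_layer(x, attention_fn, feed_forward_fn):
    """Residual + layer norm around each of the two sub-layers."""
    attention_output = layer_norm(
        [a + b for a, b in zip(attention_fn(x), x)])
    layer_output = layer_norm(
        [a + b for a, b in zip(feed_forward_fn(attention_output),
                               attention_output)])
    return layer_output


# Placeholder sub-layers; real ones would be learned projections.
attention_fn = lambda v: [0.5 * t for t in v]
feed_forward_fn = lambda v: [2.0 * t for t in v]

prev_output = [1.0, 2.0, 3.0, 4.0]
all_layer_outputs = []
for _ in range(3):  # stands in for num_hidden_layers
    prev_output = encoder_layer(prev_output, attention_fn, feed_forward_fn)
    all_layer_outputs.append(prev_output)
```

Because every sub-layer output has the same width as its input, the residual additions are well defined and the layers can be stacked any number of times.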
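Summary:
This further confirms that transformer_model contains only the encoder and no decoder, so the multi-head attention in every layer is self-attention. It corresponds to the encoder part of the architecture figure in the paper, which the original post outlined in red.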
7. BertModel
class BertModel(object):
  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=True,
               scope=None):
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0
    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]
    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)
      with tf.variable_scope("encoder"):
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)
      self.sequence_output = self.all_encoder_layers[-1]
      with tf.variable_scope("pooler"):
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))
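At last we reach the entry point of the model.
(1) Set up the parameters. When is_training is False, both dropout probabilities are set to 0.0. If input_mask is None, every input_mask value is set to 1, meaning nothing is filtered out; if token_type_ids is None, every token_type_ids value is set to 0.
(2) Embed the input input_ids, then run embedding_postprocessor, which, as described earlier, adds the position and token_type information to the word vectors.
(3) Convert the input mask into attention_mask, then call transformer_model to run the encoder.
(4) Take the output of the last layer as sequence_output and compute pooled_output. pooled_output is obtained by taking the first slice of sequence_output (the first token) and passing it through a linear projection with a tanh activation; it can be used for classification problems.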
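The pooling in step (4) can be sketched for one sequence in plain Python. This is an illustrative sketch only: the function name `pool_first_token`, the toy weights, and the added bias term are assumptions, whereas the real code uses a learned `tf.layers.dense` layer of width hidden_size.

```python
import math


def pool_first_token(sequence_output, weights, bias):
    """sequence_output: [seq][hidden]; weights: [hidden][hidden]; bias: [hidden].

    Takes the vector at position 0 (the [CLS] token in BERT), applies a dense
    layer, and squashes the result with tanh.
    """
    first = sequence_output[0]
    hidden = len(bias)
    return [math.tanh(sum(first[i] * weights[i][j] for i in range(len(first)))
                      + bias[j])
            for j in range(hidden)]


sequence_output = [[0.5, -1.0], [3.0, 4.0], [7.0, 8.0]]  # seq=3, hidden=2
identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights so the result is easy to check
pooled = pool_first_token(sequence_output, identity, [0.0, 0.0])
# With identity weights and zero bias, pooled is just tanh of the first token.
```

Only the first position feeds into the result; the other token vectors in `sequence_output` are ignored by the pooler.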
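8. Summary:
(1) The main flow of BERT is to embed the input first (including the position and token_type embeddings) and then call the Transformer to produce the output. The embedding output, embedding_table, the outputs of all Transformer layers, the output of the last Transformer layer, and pooled_output are all accessible, and can be used for fine-tuning in transfer learning and for prediction tasks.
(2) BERT uses only the encoder of the Transformer; there is no decoder. This is because the model exists purely to serve pre-training, and pre-training is done through language modeling, which differs from other task-specific NLP problems. A decoder can be added yourself when doing transfer learning.
(3) Precisely because there is no decoder, the attention function also drops a good deal of functionality that would otherwise be needed.
Other, less central functions are not covered here; interested readers can look at the source code.
In the next article we will continue studying the other modules of the BERT source code, including training, prediction, and input/output handling.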
Reference
1. https://github.com/google-research/bert/blob/master/modeling.py
2. https://github.com/Kyubyong/transformer
3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding