bert_config.json: parameter configuration of the model
{
  "attention_probs_dropout_prob": 0.1,  # dropout probability applied to the attention probabilities (after the softmax in dot-product attention)
  "hidden_act": "gelu",                 # activation function
  "hidden_dropout_prob": 0.1,           # dropout probability for the hidden layers
  "hidden_size": 768,                   # number of hidden units
  "initializer_range": 0.02,            # initializer range
  "intermediate_size": 3072,            # width of the feed-forward (up-projection) layer
  "max_position_embeddings": 512,       # must be no smaller than seq_length; used to build the position embeddings
  "num_attention_heads": 12,            # number of attention heads in each hidden layer
  "num_hidden_layers": 12,              # number of hidden layers
  "type_vocab_size": 2,                 # number of segment_ids classes, i.e. [0, 1]
  "vocab_size": 30522                   # number of tokens in the vocabulary
}
The model configuration class
class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
      hidden_size: Size of the encoder layers and the pooler layer.
      num_hidden_layers: Number of hidden layers in the Transformer encoder.
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder.
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
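A quick usage sketch of the class above (the checkpoint path is an assumption; any bert_config.json containing the fields listed at the top of this section works):

# Sketch: load a pretrained config and inspect a couple of fields.
config = BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
print(config.hidden_size)        # 768
print(config.num_hidden_layers)  # 12
print(config.to_json_string())   # round-trips back to JSON text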
For the model as a whole, it is important to be clear about the following:
1. What is fed into the model?
2. What are the model's labels?
3. How is the loss computed?
1. The model's input
The model's input comes from train_input_fn; a sample training instance looks like this:
[tokens: [CLS] ancient sage [MASK] [MASK] the name kang un ##im [MASK] ##ant to a monk - - pumped water nightly that he might study by day , so i [MASK] the [MASK] of cloak ##s [MASK] para ##sol ##acies , at the sacred doors of her [MASK] - room [MASK] im ##bib ##e celestial knowledge . from my youth i felt in me a [SEP] fallen star , i am , bobbie ! ' continued he , [MASK] ##ively , stroking his lean [MASK] - - ' a fallen star ! - [MASK] fallen , if the dignity [MASK] philosophy will allow of the simi ##le , among the hog [MASK] of the lower world - [MASK] indeed , even into the hog - bucket itself . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: False
masked_lm_positions: 3 4 6 7 10 29 31 35 38 46 49 71 77 83 92 98 110 116 124
masked_lm_labels: - - name is ##port , guardian and ##s lecture , sir pens stomach - of ##s - bucket
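This also answers questions 2 and 3 above: the masked-LM labels are the original tokens at masked_lm_positions, and the loss is a cross-entropy over the vocabulary computed only at those positions (plus a binary next-sentence loss driven by is_random_next). A minimal sketch with toy values taken from the instance above (plain Python, not the actual pretraining code):

# Toy illustration: the first two masked positions of the instance above.
tokens = ["[CLS]", "ancient", "sage", "[MASK]", "[MASK]", "the", "name"]  # truncated
masked_lm_positions = [3, 4]   # indices of the [MASK] tokens
masked_lm_labels = ["-", "-"]  # original tokens the model must recover

# During pretraining the model produces log-probabilities over the vocabulary at
# every position; the masked-LM loss gathers only the masked positions:
#   loss = -mean over masked positions of log_probs[batch, position, label_id]
for pos, label in zip(masked_lm_positions, masked_lm_labels):
  assert tokens[pos] == "[MASK]"
  print("position %d: predict %r" % (pos, label))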
embedding_lookup()
def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
      for TPUs.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])
  # print(input_ids)  # shape=(32, 128, 1)

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))
  # print(embedding_table)  # shape=(30522, 768)

  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  # print(output)  # shape=(32, 128, 768): batch_size=32, seq_length=128, embedding_size=hidden_size=768
  # print(embedding_table)  # shape=(30522, 768)
  return (output, embedding_table)
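A minimal sketch of how this function is called inside BertModel (TF 1.x graph mode; the concrete shapes are assumptions matching the printouts above, i.e. batch_size=32, seq_length=128):

# Sketch only: assumes TF 1.x and the helpers defined in modeling.py are in scope.
input_ids = tf.placeholder(tf.int32, shape=[32, 128])  # [batch_size, seq_length]

embedding_output, embedding_table = embedding_lookup(
    input_ids=input_ids,
    vocab_size=30522,      # vocab_size in bert_config.json
    embedding_size=768,    # BertModel passes config.hidden_size here
    initializer_range=0.02,
    word_embedding_name="word_embeddings",
    use_one_hot_embeddings=False)

# embedding_output: [32, 128, 768]; embedding_table: [30522, 768]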
# Constant-initializer demo (TF 1.x): tf.constant_initializer fills a variable
# with a fixed value.
v1_cons = tf.get_variable('v1_cons', shape=[1, 4], initializer=tf.constant_initializer())
v2_cons = tf.get_variable('v2_cons', shape=[1, 4], initializer=tf.constant_initializer(9))

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print("v1_cons:", sess.run(v1_cons))  # [[0. 0. 0. 0.]]
  print("v2_cons:", sess.run(v2_cons))  # [[9. 9. 9. 9.]]
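For comparison, the initializer BERT actually uses for its weight matrices is a truncated normal rather than a constant; create_initializer, called in embedding_lookup above, is essentially:

def create_initializer(initializer_range=0.02):
  """Creates a `truncated_normal_initializer` with the given range."""
  return tf.truncated_normal_initializer(stddev=initializer_range)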
embedding_postprocessor
embedding_postprocessor adds token_type_embeddings and position_embeddings, i.e. the Segment Embeddings and Position Embeddings in the figure.
Embedding structure diagram: taken from "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
Note, however, that the Position Embeddings in this code differ from the originally proposed Transformer: here they are learned during training, whereas the traditional Transformer uses fixed values (shown below).
[Figure: the fixed sinusoidal position encoding used by the original Transformer]
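For reference, the fixed encoding proposed in "Attention Is All You Need" is the sinusoidal function

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the token position and i indexes the embedding dimension; BERT instead learns the position_embeddings table shown in the code below.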
As shown above, the input consists of two natural sentences, sentence A "my dog is cute" and sentence B "he likes playing". Each word and special symbol must first be converted into a word embedding vector, because a neural network can only operate on numbers. The special token [SEP] separates the two sentences; the first half receives segment embedding A and the second half receives segment embedding B.
Because BERT must model the relationship between sentences, one of its tasks is to predict whether sentence B actually follows sentence A. This classification task relies on the special token [CLS] placed at the very front of the A/B pair, which can be viewed as a representation that aggregates the whole input sequence.
Finally, the position encoding is required by the Transformer architecture itself: a purely attention-based method cannot encode positional relationships between words the way a CNN or RNN can, yet it is exactly this property that lets it model the relationship between two words regardless of their distance. To make the Transformer aware of word order, we therefore add position information to every word via position embeddings.
# Adds token-type (segment) embeddings and position embeddings on top of the
# word embeddings.
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  # print(input_tensor)  # shape=(32, 128, 768)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]   # 32
  seq_length = input_shape[1]   # 128
  width = input_shape[2]        # 768

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  # print(output)  # shape=(32, 128, 768)
  return output
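A minimal sketch of how BertModel chains the two embedding functions (shapes follow the printouts above; token_type_ids are the segment_ids shown in the training instance earlier):

# Sketch only: assumes TF 1.x and the helpers from modeling.py are in scope.
token_type_ids = tf.placeholder(tf.int32, shape=[32, 128])  # segment_ids (0/1)

embedding_output = embedding_postprocessor(
    input_tensor=embedding_output,   # [32, 128, 768] from embedding_lookup
    use_token_type=True,
    token_type_ids=token_type_ids,
    token_type_vocab_size=2,         # type_vocab_size in bert_config.json
    token_type_embedding_name="token_type_embeddings",
    use_position_embeddings=True,
    position_embedding_name="position_embeddings",
    initializer_range=0.02,
    max_position_embeddings=512,
    dropout_prob=0.1)

# embedding_output is still [32, 128, 768]: word + segment + position embeddings,
# followed by layer norm and dropout. This is what feeds the Transformer encoder.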
Once the model has been trained, the key question is how to use it.
How is the model used, then? The BertModel class exposes two functions. get_pooled_output returns the representation of the first token, [CLS], for each example in the batch; BERT treats this token as a summary of the entire input, so it is suited to sentence-level classification. get_sequence_output returns BERT's final output, with shape [batch_size, seq_length, hidden_size]; it can be read as the final representation of every token in each example and is suited to token-level (seq2seq-style) tasks. A usage sketch follows.
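A minimal sketch, assuming modeling.py from the google-research/bert repository is importable and that the input tensors come from your own input pipeline:

import tensorflow as tf
import modeling  # modeling.py from the google-research/bert repository

bert_config = modeling.BertConfig.from_json_file("bert_config.json")

# Placeholder inputs; in practice these come from train_input_fn.
input_ids = tf.placeholder(tf.int32, shape=[None, 128])
input_mask = tf.placeholder(tf.int32, shape=[None, 128])
segment_ids = tf.placeholder(tf.int32, shape=[None, 128])

model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

pooled_output = model.get_pooled_output()      # [batch_size, hidden_size]: sentence-level tasks
sequence_output = model.get_sequence_output()  # [batch_size, seq_length, hidden_size]: token-level tasks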
BERT stands for Bidirectional Encoder Representations from Transformers. "Bidirectional" means that when processing a word, the model can draw on information from both the words before it and the words after it. This bidirectionality comes from the fact that, unlike a traditional language model, BERT does not predict the most likely current word given all preceding words; instead it randomly masks some words and uses all of the unmasked words to predict them. The figure below compares three pre-training models: BERT and ELMo both use bidirectional information, while OpenAI GPT uses unidirectional information.