Reading the BERT model code

bert_config.json — the parameter configuration of the model

{
"attention_probs_dropout_prob": 0.1, # dropout probability applied to the attention weights after the softmax
"hidden_act": "gelu", # activation function
"hidden_dropout_prob": 0.1, # dropout probability for the hidden layers
"hidden_size": 768, # number of hidden units
"initializer_range": 0.02, # weight initialization range
"intermediate_size": 3072, # width of the feed-forward (up-projection) layer
"max_position_embeddings": 512, # must be >= seq_length; used to build the position embeddings
"num_attention_heads": 12, # number of attention heads in each hidden layer
"num_hidden_layers": 12, # number of hidden layers
"type_vocab_size": 2, # number of segment_ids classes, i.e. [0, 1]
"vocab_size": 30522 # number of tokens in the vocabulary
}
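
The "gelu" activation above is the Gaussian Error Linear Unit. As a point of reference, a minimal sketch of the commonly used tanh approximation, written for TF 1.x like the rest of this code:

import numpy as np
import tensorflow as tf

def gelu(x):
  # gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
  cdf = 0.5 * (1.0 + tf.tanh(
      np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))
  return x * cdf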

The model configuration class

class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.
    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`. (size of the vocabulary)
      hidden_size: Size of the encoder layers and the pooler layer. (number of hidden units)
      num_hidden_layers: Number of hidden layers in the Transformer encoder. (number of hidden layers)
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder. (number of multi-head attention heads)
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range
 
  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config
 
  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))
 
  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output
 
  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
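
A minimal usage sketch of this class (assuming the code above is importable as `modeling`; the config path is only an example):

import modeling

# Load the configuration that ships with a pretrained checkpoint.
config = modeling.BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
print(config.hidden_size)        # 768
print(config.num_hidden_layers)  # 12
print(config.to_json_string())   # round-trips back to the JSON shown at the top

# Or build a config directly from a Python dict.
config2 = modeling.BertConfig.from_dict({"vocab_size": 30522, "hidden_size": 768})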
 

For the model as a whole, it is important to be clear about the following:

1. What are the inputs to the model?

2. What are the model's labels?

3. How is the loss computed?

1. The model's inputs
The model's input is produced by train_input_fn. A training instance looks like this:

[tokens: [CLS] ancient sage [MASK] [MASK] the name kang un ##im [MASK] ##ant to a monk - - pumped water nightly that he might study by day , so i [MASK] the [MASK] of cloak ##s [MASK] para ##sol ##acies , at the sacred doors of her [MASK] - room [MASK] im ##bib ##e celestial knowledge . from my youth i felt in me a [SEP] fallen star , i am , bobbie ! ' continued he , [MASK] ##ively , stroking his lean [MASK] - - ' a fallen star ! - [MASK] fallen , if the dignity [MASK] philosophy will allow of the simi ##le , among the hog [MASK] of the lower world - [MASK] indeed , even into the hog - bucket itself . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: False
masked_lm_positions: 3 4 6 7 10 29 31 35 38 46 49 71 77 83 92 98 110 116 124
masked_lm_labels: - - name is ##port , guardian and ##s lecture , sir pens stomach - of ##s - bucket
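
When such instances are serialized by create_pretraining_data.py and read back for training, each batch becomes a dict of fixed-size tensors. A sketch of their names and shapes (feature names follow run_pretraining.py; batch_size=32, max_seq_length=128, max_predictions_per_seq=20 are assumed here, and the zero fills are just placeholders):

import tensorflow as tf  # TF 1.x

batch_size, max_seq_length, max_predictions_per_seq = 32, 128, 20

features = {
    # token ids of "[CLS] sentence A [SEP] sentence B [SEP]" plus padding
    "input_ids": tf.zeros([batch_size, max_seq_length], tf.int32),
    # 1 for real tokens, 0 for padding
    "input_mask": tf.zeros([batch_size, max_seq_length], tf.int32),
    # 0 for sentence A, 1 for sentence B (the segment_ids above)
    "segment_ids": tf.zeros([batch_size, max_seq_length], tf.int32),
    # positions of the masked tokens (masked_lm_positions above)
    "masked_lm_positions": tf.zeros([batch_size, max_predictions_per_seq], tf.int32),
    # original ids of the masked tokens (masked_lm_labels above) -- the MLM labels
    "masked_lm_ids": tf.zeros([batch_size, max_predictions_per_seq], tf.int32),
    # 1.0 for real masked positions, 0.0 for padded prediction slots
    "masked_lm_weights": tf.zeros([batch_size, max_predictions_per_seq], tf.float32),
    # 1 if is_random_next is True, else 0 -- the next-sentence label
    "next_sentence_labels": tf.zeros([batch_size, 1], tf.int32),
}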
embedding_lookup()
 
def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.
 
  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
      for TPUs.
 
  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])
    #print(input_ids) #shape=(32, 128, 1)
 
  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))
  #print(embedding_table) #shape=(30522, 768)
 
  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)
 
  input_shape = get_shape_list(input_ids)
 
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  #print(output) #shape=(32, 128, 768)  batch_size=32, seq_length=128, embedding_size(=hidden_size)=768
  #print(embedding_table) #shape=(30522, 768)
  return (output, embedding_table)
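
A minimal call sketch using the shapes seen in the printed comments above (TF 1.x; the placeholder input and the `modeling` import are illustrative). Note that BertModel passes embedding_size=hidden_size (768), so the 128 in the printed shape is the sequence length, not the embedding width:

import tensorflow as tf
import modeling

input_ids = tf.placeholder(tf.int32, shape=[32, 128])  # [batch_size, seq_length]

embedding_output, embedding_table = modeling.embedding_lookup(
    input_ids=input_ids,
    vocab_size=30522,
    embedding_size=768,
    initializer_range=0.02,
    word_embedding_name="word_embeddings",
    use_one_hot_embeddings=False)

print(embedding_output.shape)  # (32, 128, 768)
print(embedding_table.shape)   # (30522, 768)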

# Constant initializer example: tf.constant_initializer() defaults to 0,
# tf.constant_initializer(9) fills the variable with 9.
v1_cons = tf.get_variable('v1_cons', shape=[1, 4], initializer=tf.constant_initializer())
v2_cons = tf.get_variable('v2_cons', shape=[1, 4], initializer=tf.constant_initializer(9))
# v1_cons after initialization: [[0. 0. 0. 0.]]
# v2_cons after initialization: [[9. 9. 9. 9.]]
embedding_postprocessor
embedding_postprocessor adds the token_type_embedding and the position_embedding, i.e. the Segment Embeddings and Position Embeddings in the figure below.
[Figure: the BERT embedding structure (Token + Segment + Position Embeddings), from "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".]

Note that the Position Embeddings in this code differ from the Transformer as originally proposed: here they are learned during training, whereas the original Transformer uses fixed sinusoidal values, shown below.
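For reference, the fixed encoding proposed in "Attention Is All You Need" is

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

where pos is the token position and i indexes the embedding dimension.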


[Figure: example BERT input for sentence A 「my dog is cute」 and sentence B 「he likes playing」, showing the token, segment, and position embeddings being summed.]

As shown above, the input consists of two natural sentences, sentence A 「my dog is cute」 and sentence B 「he likes playing」. Each word and special symbol first has to be converted into an embedding vector, since a neural network can only operate on numbers. The special token [SEP] separates the two sentences: the first half receives segment encoding A and the second half receives segment encoding B.

因?yàn)橐>渥又g的關(guān)系里烦,BERT 有一個(gè)任務(wù)是預(yù)測(cè) B 句是不是 A 句后面的一句話凿蒜,而這個(gè)分類任務(wù)會(huì)借助 A/B 句最前面的特殊符 [CLS] 實(shí)現(xiàn),該特殊符可以視為匯集了整個(gè)輸入序列的表征招驴。

Finally, the position encoding is dictated by the Transformer architecture itself: a purely attention-based model cannot encode the positional relationship between words the way a CNN or RNN does, yet it is precisely this property that allows it to model the relationship between two words regardless of how far apart they are. To let the Transformer perceive word order, we therefore add position information to every token through position encodings.

# Adds token-type (segment) and position embeddings on top of the word embeddings
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  #print(input_tensor) #shape=(32, 128, 768)
  """Performs various post-processing on a word embedding tensor.
  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.
 
  Returns:
    float tensor with same shape as `input_tensor`.
 
  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]   #32
  seq_length = input_shape[1]   #128
  width = input_shape[2]        #768
 
  output = input_tensor
 
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings
 
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())
      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings
 
  output = layer_norm_and_dropout(output, dropout_prob)
  #print(output) #shape=(32, 128, 768)
  return output
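
Putting the two steps together, this is roughly how BertModel builds its embedding layer (a sketch for TF 1.x; the placeholder tensors and the `modeling` import are illustrative):

import tensorflow as tf
import modeling

input_ids = tf.placeholder(tf.int32, shape=[32, 128])       # token ids
token_type_ids = tf.placeholder(tf.int32, shape=[32, 128])  # segment ids (0 or 1)

# 1) word embeddings: [32, 128] -> [32, 128, 768]
embedding_output, embedding_table = modeling.embedding_lookup(
    input_ids=input_ids,
    vocab_size=30522,
    embedding_size=768,
    initializer_range=0.02,
    word_embedding_name="word_embeddings")

# 2) add segment and position embeddings, then layer norm + dropout
embedding_output = modeling.embedding_postprocessor(
    input_tensor=embedding_output,
    use_token_type=True,
    token_type_ids=token_type_ids,
    token_type_vocab_size=2,
    token_type_embedding_name="token_type_embeddings",
    use_position_embeddings=True,
    position_embedding_name="position_embeddings",
    initializer_range=0.02,
    max_position_embeddings=512,
    dropout_prob=0.1)

print(embedding_output.shape)  # (32, 128, 768)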

Once the model is trained, the key question is how to use it.
The BertModel class provides two functions for this. get_pooled_output() returns the representation of the first token ([CLS]) of each sequence in the batch; BERT treats this token as a summary of the entire input, so it is suited to sentence-level classification tasks. get_sequence_output() returns BERT's final output with shape [batch_size, seq_length, hidden_size]; it can be understood intuitively as the final representation of every token in each sequence, and is suited to token-level or seq2seq-style tasks.
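
A minimal sketch of how these are used (TF 1.x; the config path and placeholder inputs are only examples):

import tensorflow as tf
import modeling

config = modeling.BertConfig.from_json_file("bert_config.json")

input_ids = tf.placeholder(tf.int32, shape=[32, 128])
input_mask = tf.placeholder(tf.int32, shape=[32, 128])
segment_ids = tf.placeholder(tf.int32, shape=[32, 128])

model = modeling.BertModel(
    config=config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids)

pooled_output = model.get_pooled_output()      # [batch_size, hidden_size] -- sentence-level tasks
sequence_output = model.get_sequence_output()  # [batch_size, seq_length, hidden_size] -- token-level tasks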

BERT stands for Bidirectional Encoder Representations from Transformers. "Bidirectional" means that when the model processes a word, it can use information from both the words before it and the words after it. This bidirectionality comes from the fact that, unlike a traditional language model, BERT does not predict the most likely current word given all of the preceding words; instead it randomly masks some words and predicts them using all of the unmasked words. Of the three pre-training models usually compared here, BERT and ELMo both use bidirectional information, while OpenAI GPT uses unidirectional information.
