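[Repost] BERT Series (2): Source Code Walkthrough of the Model Body

Original article: http://www.reibang.com/p/d7ce41b58801

This article walks through modeling.py, the code for the model body. Before reading it, readers should have some understanding of the theory behind BERT, especially the structure of the Transformer. There is plenty of material online, so this article does not go into the theory in depth.

One reason I wrote this up is to help myself study and organize what I have learned: only when you write things out do you find what you do and do not understand. My own level is limited and some points are surely not understood well enough, so corrections are welcome.

1. Configuration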

class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

The model configuration is fairly simple. In order, the fields are: vocabulary size, hidden size, number of Transformer layers, number of attention heads, activation function, intermediate layer size, hidden dropout rate, dropout rate inside attention, maximum sequence length, vocabulary size of token_type_ids, and the stdev of the truncated_normal_initializer.
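
To make the relationship between these fields concrete, here is a minimal sketch in plain Python. It is not part of modeling.py: the dict and the helper `attention_head_size` are hypothetical, and `vocab_size` is an illustrative value only, since the class gives it no default. The divisibility check mirrors the one that transformer_model performs later in the file.

```python
# Values mirror the BertConfig defaults shown above; vocab_size is an
# illustrative assumption, since the class has no default for it.
config = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
}


def attention_head_size(hidden_size, num_attention_heads):
    """Return the width of one attention head, or fail if it is not whole."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            "hidden size (%d) is not a multiple of the number of attention "
            "heads (%d)" % (hidden_size, num_attention_heads))
    return hidden_size // num_attention_heads


head_size = attention_head_size(config["hidden_size"],
                                config["num_attention_heads"])
print(head_size)  # 64
```

With the defaults, each of the 12 heads therefore works on a 64-wide slice of the 768-wide hidden vector.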

2. Word embedding

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

This function builds the embedding_table and performs the word embedding lookup, optionally through a one-hot matrix multiplication, and returns both the embedding result and the embedding_table.
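
The two branches of embedding_lookup produce the same result, which a small sketch can show. The code below uses plain Python lists instead of TensorFlow, and the helpers `one_hot` and `matmul` and the toy table are made up for illustration.

```python
def one_hot(index, depth):
    """Return a one-hot row of length `depth` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Toy table: vocab_size = 4, embedding_size = 3.
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
input_ids = [2, 0, 3]

# Branch 1: direct row lookup (what tf.nn.embedding_lookup does).
lookup = [embedding_table[i] for i in input_ids]

# Branch 2: one-hot rows multiplied by the table.
one_hot_ids = [one_hot(i, len(embedding_table)) for i in input_ids]
via_matmul = matmul(one_hot_ids, embedding_table)

print(via_matmul == lookup)  # True
```

Each one-hot row selects exactly one row of the table, so the matrix product is just a lookup expressed as dense arithmetic.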

3. Post-processing the word embeddings

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]
  output = input_tensor
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings
  output = layer_norm_and_dropout(output, dropout_prob)
  return output

This step mainly adds information: the position of each word and its token type are added to the word vectors, and the result is returned after layer normalization and dropout.
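
A minimal sketch of that sum-then-normalize step for a single sequence, in plain Python. This is a simplification under stated assumptions: dropout is left out (as at inference time), the layer norm here has no learned scale and offset, and the names `postprocess`, `layer_norm`, and the `epsilon` value are illustrative rather than taken from modeling.py.

```python
import math


def layer_norm(vector, epsilon=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned params)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in vector]


def postprocess(word_embeddings, token_type_ids, token_type_table,
                position_table):
    """Add token-type and position embeddings, then layer-normalize each token."""
    output = []
    for position, (word_vec, type_id) in enumerate(
            zip(word_embeddings, token_type_ids)):
        summed = [w + t + p for w, t, p in zip(
            word_vec, token_type_table[type_id], position_table[position])]
        output.append(layer_norm(summed))
    return output


word_embeddings = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.5, 1.5]]
token_type_ids = [0, 1]
token_type_table = [[0.1, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2]]
position_table = [[0.0, 0.1, 0.0, 0.1],
                  [0.1, 0.0, 0.1, 0.0],
                  [0.2, 0.2, 0.0, 0.0]]

result = postprocess(word_embeddings, token_type_ids, token_type_table,
                     position_table)
```

Only the first seq_length rows of the position table are used, which corresponds to the tf.slice call in the original function.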

4. Building the attention mask

def create_attention_mask_from_input_mask(from_tensor, to_mask):
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]
  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]
  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
  mask = broadcast_ones * to_mask
  return mask

This converts a 2D mask of shape [batch_size, to_seq_length] into a 3D mask of shape [batch_size, from_seq_length, to_seq_length] for use in attention.
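
The broadcast can be pictured with nested lists. The sketch below is a plain-Python stand-in for the `broadcast_ones * to_mask` multiplication; the function name `create_attention_mask` is hypothetical.

```python
def create_attention_mask(from_seq_length, to_mask):
    """Expand [batch, to_seq] 0/1 masks to [batch, from_seq, to_seq] floats."""
    return [[[float(m) for m in row] for _ in range(from_seq_length)]
            for row in to_mask]


# Two sequences of length 4; zeros mark padding positions.
input_mask = [[1, 1, 1, 0],
              [1, 1, 0, 0]]
mask = create_attention_mask(4, input_mask)

print(mask[0])
# [[1.0, 1.0, 1.0, 0.0], [1.0, 1.0, 1.0, 0.0],
#  [1.0, 1.0, 1.0, 0.0], [1.0, 1.0, 1.0, 0.0]]
```

Every query position gets the same row, so the mask only hides padded key positions. Rows that belong to padded query positions are not zeroed out.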

5. Attention layer

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    attention_scores += adder

  attention_probs = tf.nn.softmax(attention_scores)

  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer

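Here comes the centerpiece of the whole network! Most of the Transformer lives in this function. The input from_tensor serves as the query, and to_tensor serves as the key and value. For self-attention, from_tensor and to_tensor are the same tensor.

(1) The function first validates the input shapes and obtains batch_size, from_seq_length, and to_seq_length. If the input is a 3D tensor, it is reshaped into a 2D matrix (taking word embeddings as the input, for example, [batch_size, seq_length, hidden_size] -> [batch_size*seq_length, hidden_size]).

(2) Fully connected linear projections produce query_layer, key_layer, and value_layer, whose second dimension becomes num_attention_heads * size_per_head (throughout the model, hidden_size = num_attention_heads * size_per_head by default). transpose_for_scores then splits the query and key into multiple heads; the value is reshaped and transposed the same way further down.

(3) Compute attention_probs from the attention scores according to the formula: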

[image: scaled dot-product attention formula, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V]

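If attention_mask is not None, a large negative number is added to the masked positions, so that the corresponding probabilities are close to 0 after the softmax; dropout is then applied.

(4) Finally, attention_probs is multiplied with the value, and the result is returned as either a 3D tensor or a 2D matrix.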
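
Steps (3) and (4) can be sketched for a single head and a single sequence in plain Python. This is an illustration under stated assumptions, not the TensorFlow code: the linear projections, the multi-head split, and dropout are left out, and the names `softmax` and `attention` are hypothetical. The scale factor and the -10000.0 adder follow the function above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, key, value, mask=None):
    """query: [F][H], key and value: [T][H], mask: [F][T] of 0/1.

    Returns (context [F][H], attention_probs [F][T]).
    """
    size_per_head = len(query[0])
    scale = 1.0 / math.sqrt(float(size_per_head))
    context, all_probs = [], []
    for f, q in enumerate(query):
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        if mask is not None:
            # Masked positions (mask value 0) get a large negative score.
            scores = [s + (1.0 - mask[f][t]) * -10000.0
                      for t, s in enumerate(scores)]
        probs = softmax(scores)
        all_probs.append(probs)
        context.append([sum(p * v[h] for p, v in zip(probs, value))
                        for h in range(len(value[0]))])
    return context, all_probs


# Self-attention: the same toy tensor is query, key, and value.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [[1, 1, 0]] * 3  # the third token is padding
context, probs = attention(tokens, tokens, tokens, mask)
```

Because the masked score is about 10000 below the others, its exponential underflows to zero and the third token receives no attention weight in any row.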

Summary:

You can read this code side by side with the network architecture diagram:

[image: Transformer network architecture diagram]

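Compared with other Transformer implementations, this function is simplified in many places. Four points stand out:

(1) There is no separate scale step; the scaling is folded into a single multiplication of attention_scores by 1.0 / sqrt(size_per_head) right after the matmul;

(2) There is no causality mask. My guess is that this is mainly because BERT has no decoder, so the lower-triangular mask is not needed. Seen from another angle, this is exactly what makes the Transformer bidirectional;

(3) There is no query mask. The reason is similar to (2): the encoder only uses self-attention, where the query and key are the same, so a single key mask is enough;

(4) There is no residual connection or normalization on the query inside this function; transformer_model applies the residual connection and layer norm after calling it.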
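
To illustrate point (2), the sketch below contrasts a decoder-style causal mask with the padding-only mask that BERT builds. Both helper functions are hypothetical and written in plain Python for illustration; neither appears in modeling.py.

```python
def causal_mask(seq_length):
    """Decoder-style mask: position f may only attend to positions t <= f."""
    return [[1 if t <= f else 0 for t in range(seq_length)]
            for f in range(seq_length)]


def padding_mask(input_mask):
    """BERT-style mask: every position sees every non-padding position."""
    return [list(input_mask) for _ in input_mask]


print(causal_mask(3))           # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(padding_mask([1, 1, 0]))  # [[1, 1, 0], [1, 1, 0], [1, 1, 0]]
```

With the padding mask, the first token can attend to the second one as well as the reverse, which is the bidirectional behavior described above.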

6. Transformer

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          attention_output = tf.concat(attention_heads, axis=-1)
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

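The Transformer builds on attention, in the following steps:

(1) Compute attention_head_size: attention_head_size = int(hidden_size / num_attention_heads), i.e. the hidden output is divided evenly among the attention heads. Then input_tensor is reshaped into a 2D matrix;

(2) Apply multi-head attention to input_tensor, then: linear projection -> dropout -> residual connection with layer_input -> layer norm -> intermediate linear projection (with the activation function) -> linear projection -> dropout -> residual connection with attention_output -> layer norm.

The output size of the intermediate linear projection (intermediate_size) can be chosen freely; the other linear projections must all use hidden_size so that the shapes line up for the residual additions.

(3) This is repeated num_hidden_layers times, saving the output of each layer, and finally either the outputs of all layers or only the output of the last layer is returned.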
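
The chain of operations that follows the attention call can be sketched for one token position in plain Python. Several things here are assumptions for illustration: the helper names, the `params` layout, and the toy weights are made up; dropout is omitted; layer norm has no learned scale and offset; and `gelu` uses the common tanh approximation.

```python
import math


def dense(x, weights, bias, activation=None):
    """x: [in], weights: [in][out], bias: [out]."""
    out = [sum(xi * w for xi, w in zip(x, col)) + b
           for col, b in zip(zip(*weights), bias)]
    return [activation(o) for o in out] if activation else out


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def layer_norm(x, epsilon=1e-12):
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in x]


def encoder_sublayers(layer_input, attention_head, params):
    """Everything in one layer after attention_layer, for one position."""
    # Output projection, then residual with the layer input and layer norm.
    projected = dense(attention_head, params["attn_w"], params["attn_b"])
    attention_output = layer_norm(
        [p + r for p, r in zip(projected, layer_input)])
    # Feed-forward: expand to intermediate_size, project back to hidden_size.
    intermediate = dense(attention_output, params["inter_w"],
                         params["inter_b"], gelu)
    layer_output = dense(intermediate, params["out_w"], params["out_b"])
    # Second residual (with attention_output) and layer norm.
    return layer_norm([o + r for o, r in zip(layer_output, attention_output)])


# Toy sizes: hidden_size = 2, intermediate_size = 3.
params = {
    "attn_w": [[1.0, 0.0], [0.0, 1.0]], "attn_b": [0.0, 0.0],
    "inter_w": [[0.5, -0.5, 1.0], [1.0, 0.5, -1.0]],
    "inter_b": [0.0, 0.0, 0.0],
    "out_w": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], "out_b": [0.1, -0.1],
}
layer_output = encoder_sublayers([1.0, 3.0], [0.5, -0.5], params)
```

The output has the same width as the input, which is what allows the same block to be stacked num_hidden_layers times.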

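Summary:

This further confirms that the Transformer in this function contains only the encoder and no decoder, so the multi-head attention in every layer is encoder self-attention. It corresponds to the part boxed in red in the paper's figure: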

[image: Transformer architecture figure from the paper, with the encoder part boxed in red]

7. BertModel

class BertModel(object):
  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=True,
               scope=None):
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      self.sequence_output = self.all_encoder_layers[-1]
      with tf.variable_scope("pooler"):
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

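At last we reach the entry point of the model.

(1) Set up the parameters. If input_mask is None, every input_mask value is set to 1, i.e. nothing is filtered out; if token_type_ids is None, every token_type_ids value is set to 0;

(2) Embed the input input_ids, then apply embedding_postprocessor, which, as described earlier, mainly adds position and token_type information to the word vectors;

(3) Convert the attention_mask, then run the encoder by calling transformer_model;

(4) Take the output of the last layer as sequence_output, and compute pooled_output by taking the first slice (the first token) of sequence_output and passing it through a linear projection with a tanh activation (this can be used for classification problems).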
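
The pooler in step (4) is small enough to sketch directly. The plain-Python version below assumes toy weights (an identity matrix and zero bias) purely for illustration; the function name `pool_first_token` is hypothetical.

```python
import math


def pool_first_token(sequence_output, weights, bias):
    """sequence_output: [batch][seq][hidden] -> pooled output [batch][hidden].

    Takes the first token of each sequence and applies dense + tanh.
    """
    pooled = []
    for sequence in sequence_output:
        first_token = sequence[0]
        pooled.append([
            math.tanh(sum(x * w for x, w in zip(first_token, col)) + b)
            for col, b in zip(zip(*weights), bias)
        ])
    return pooled


sequence_output = [
    [[1.0, -1.0], [5.0, 5.0]],  # sequence 1: first token, second token
    [[0.0, 2.0], [3.0, 3.0]],   # sequence 2
]
identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights, for illustration only
pooled = pool_first_token(sequence_output, identity, [0.0, 0.0])
```

Only the first position of each sequence contributes to the pooled output; the remaining positions stay available through sequence_output.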

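8. Summary

(1) The main flow of BERT is to embed first (including the position and token_type embeddings) and then call the Transformer to get the output. The embedding output, the embedding_table, the outputs of all Transformer layers, the output of the last Transformer layer, and pooled_output are all available, for use in transfer-learning fine-tuning and in prediction tasks;

(2) BERT uses only the Transformer encoder, with no decoder. This is because the model exists purely to serve pre-training, and pre-training is done through language modeling, unlike other task-specific NLP models. A decoder can be added as needed when doing transfer learning;

(3) Precisely because there is no decoder, the attention function also drops a lot of functionality it does not need.

Other minor functions are not covered here; interested readers can look at the source code.

In the next article we will continue with the other modules of the BERT source code, including training, prediction, and input/output handling.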

Reference

1. https://github.com/google-research/bert/blob/master/modeling.py

2. https://github.com/Kyubyong/transformer

3. Attention Is All You Need

4. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Last edited: 2019.08.10 21:01:04


