BERT Testing in Practice
Runtime environment: NF5288M5
Runtime environment: NF5488M5
Google BERT Pre-training Task
http://www.reibang.com/p/22e462f01d8c
Sample Generation
The BERT repository ships a script for generating pre-training data. Its input is the text to be trained on (a plain-text file) plus the vocabulary file, and its output is the processed tfrecord file. The training text has one sentence per line, with a blank line separating paragraphs/documents.
# Input file format:
(1) One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text. (Because we use the sentence boundaries for the "next sentence prediction" task).
(2) Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task doesn't span between documents.
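As a concrete illustration of that format, the sketch below writes a tiny input file: one sentence per line, with a blank line marking the document boundary. The file name and the sentences are arbitrary examples, not part of the original workflow.
```
# Write a tiny example input file: one sentence per line,
# documents separated by a blank line. The file name is arbitrary.
sample = (
    "The quick brown fox jumps over the lazy dog.\n"
    "It then disappears back into the forest.\n"
    "\n"
    "This is the first sentence of a second document.\n"
    "The blank line above marks the document boundary.\n"
)

with open("my_corpus.txt", "w", encoding="utf-8") as f:
    f.write(sample)
```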
The data-generation script loads the entire input file into memory before processing it, so for a very large file the script has to be invoked several times to produce separate TFRecord files (see the sharding sketch after the command below).
python create_pretraining_data.py \
  --input_file=./sample_text.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
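Because the script keeps everything in memory, a very large corpus can first be split on document boundaries and the script run once per shard. This is only a sketch of one way to do the split; the shard count and file names are illustrative assumptions, not part of the original workflow.
```
# Split a large corpus (documents separated by blank lines) into shards so
# create_pretraining_data.py can be run separately on each one.
# NUM_SHARDS and the file names are illustrative.
NUM_SHARDS = 8

with open("big_corpus.txt", encoding="utf-8") as f:
    documents = [d.strip() for d in f.read().split("\n\n") if d.strip()]

shards = [[] for _ in range(NUM_SHARDS)]
for i, doc in enumerate(documents):
    shards[i % NUM_SHARDS].append(doc)

for shard_id, docs in enumerate(shards):
    with open("corpus_shard_%02d.txt" % shard_id, "w", encoding="utf-8") as f:
        f.write("\n\n".join(docs) + "\n")

# Then run create_pretraining_data.py once per shard, e.g.
#   --input_file=corpus_shard_00.txt --output_file=tf_examples_00.tfrecord ...
```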
*** While generating the data, the script kept reporting that the output was empty. I ended up debugging it line by line and found the input file could not be read, even though the path itself checked out. It turned out there was an extra space at the end of the file name.
Training Steps
If you want to train from scratch, do not pass the init_checkpoint hyperparameter. To explain the parameters below: input_file is the pre-training dataset, i.e. the tfrecord file produced in the step above; output_dir is the directory for logs and model checkpoints; do_train and do_eval control whether each of those two phases runs, and at least one of them must be True; bert_config_file holds the parameters needed to build the BERT model (the json file included in the downloaded model); init_checkpoint is the starting point for training; the remaining parameters are the batch size, maximum sequence length, maximum number of masked predictions per sequence, number of training steps, number of learning-rate warm-up steps, and the initial learning rate.
python run_pretraining.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
1) The input data is in TFRecord format and can be generated with the script from the sample-generation step above.
The TFRecord examples contain the following fields (a decoding sketch follows this list):
Input_ids:: [101, 1131, 3090, 1106, 9416, 1103, 18127, 103, 117, 1115, 103, 1821, 170, 14798, 103, 4267, 20394, 1785, 2111, 103, 102, 170, 4984, 2851, 117, 178, 1821, 117, 6442, 106, 112, 1598, 1119, 103, 8228, 8788, 103, 117, 15992, 103, 8290, 3472, 118, 118, 112, 103, 4984, 2851, 106, 103, 118, 4984, 117, 1191, 1103, 103, 1104, 103, 103, 2621, 1104, 1103, 27466, 17893, 117, 1621, 1103, 16358, 5700, 1104, 1103, 2211, 1362, 118, 118, 5750, 117, 1256, 1154, 103, 16358, 1403, 118, 15398, 2111, 119, 1218, 117, 1170, 1155, 117, 178, 6111, 1437, 1128, 1103, 1236, 1106, 1103, 19026, 112, 188, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Input_mask:: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Segment_ids:: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Masked_lm_positions:: [4, 7, 10, 14, 16, 19, 33, 36, 39, 45, 49, 55, 57, 58, 79, 92, 0, 0, 0, 0]
Masked_lm_ids:: [1143, 1864, 178, 1104, 6871, 119, 117, 1193, 1117, 170, 118, 14931, 5027, 1209, 1103, 1209, 0, 0, 0, 0]
Masked_lm_weights:: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
Next_sentence_label:: 0
2) Batch the data with tf.data; the function used is input_fn_builder.
3) Build the model; the inputs are input_ids, input_mask, and segment_ids.
4) Embedding lookup and embedding post-processing, producing an output of shape batch * seq_len * emb_size.
5) Pass through transformer_model to get sequence_output of shape batch * seq_len * hidden_size; for sentence-level classification tasks the pooled_output of shape batch * hidden_size can be used instead.
6) Compute the masked-LM loss from sequence_output: a per-example loss of shape batch_size * mask_lm_ids_length plus a scalar total loss.
7) Compute the next-sentence loss from pooled_output: a two-way classification of shape batch_size * 2, a per-example loss of shape batch_size * 1, plus a scalar total loss.
8) The optimizer back-propagates the loss to obtain gradients and updates the weights according to the learning rate.
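To verify the fields listed in step 1, each record in the generated file can be decoded back into a tf.train.Example. The sketch below is a minimal inspection script, assuming TensorFlow 1.x (as the BERT repo requires) and the output path from the command above; the feature names are the ones written by create_pretraining_data.py.
```
import tensorflow as tf  # TensorFlow 1.x, as used by the BERT repo

# Path assumed from the create_pretraining_data.py command above.
record_path = "/tmp/tf_examples.tfrecord"

for raw_record in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.Example.FromString(raw_record)
    feat = example.features.feature
    print("input_ids:           ", list(feat["input_ids"].int64_list.value))
    print("input_mask:          ", list(feat["input_mask"].int64_list.value))
    print("segment_ids:         ", list(feat["segment_ids"].int64_list.value))
    print("masked_lm_positions: ", list(feat["masked_lm_positions"].int64_list.value))
    print("masked_lm_ids:       ", list(feat["masked_lm_ids"].int64_list.value))
    print("masked_lm_weights:   ", list(feat["masked_lm_weights"].float_list.value))
    print("next_sentence_labels:", list(feat["next_sentence_labels"].int64_list.value))
    break  # inspect only the first record
```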
Pre-training speed 1
NF5288M5: 96-100 examples/s
Pre-training speed 2
NF5488M5: 107-112 examples/s
BERT Model Construction Process
http://www.reibang.com/p/d7ce41b58801
(1) Model configuration
The model configuration is fairly simple. In order, the fields are: vocabulary size, hidden-layer size, number of transformer layers, number of attention heads, activation function, intermediate-layer size, hidden-layer dropout rate, attention dropout rate, maximum sequence length, token_type_ids vocabulary size, and the stdev of the truncated_normal_initializer (see the config sketch after this list).
(2) Word embedding
(3) Post-processing of the word embeddings (adding position and token-type/segment information)
(4) Building the attention mask
(5) Attention layer (multi-head attention)
(6) Transformer
(7) Assembling the BERT model
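As a reference for the configuration fields in (1), the sketch below simply loads and prints bert_config.json; the values quoted in the comment are the published BERT-Base ones, and the file path is assumed to be wherever $BERT_BASE_DIR points.
```
import json

# bert_config.json ships with the downloaded model. For BERT-Base (uncased)
# the published values are: vocab_size=30522, hidden_size=768,
# num_hidden_layers=12, num_attention_heads=12, hidden_act="gelu",
# intermediate_size=3072, hidden_dropout_prob=0.1,
# attention_probs_dropout_prob=0.1, max_position_embeddings=512,
# type_vocab_size=2, initializer_range=0.02.
with open("bert_config.json") as f:  # i.e. $BERT_BASE_DIR/bert_config.json
    config = json.load(f)

for key, value in sorted(config.items()):
    print(key, "=", value)
```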
Results returned by the BERT model
*** The main BERT flow is: embeddings first (including position and token_type embeddings), then the transformer to produce the output. The embedding output, the embedding_table, the outputs of all transformer layers, the output of the final transformer layer, and the pooled_output can all be retrieved and used for fine-tuning in transfer learning and for prediction tasks (see the sketch below);
*** BERT uses only the encoder part of the transformer; there is no decoder step. This is because the model exists purely for pre-training, which is done through a language model rather than a specific downstream NLP task. A decoder can be added when doing transfer learning;
*** Precisely because there is no decoder step, the attention function also drops a lot of functionality that is not needed.
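A minimal sketch of pulling those outputs out of modeling.BertModel from the repo (TensorFlow 1.x; the batch size and sequence length in the placeholders are illustrative assumptions):
```
import tensorflow as tf  # TensorFlow 1.x
import modeling          # modeling.py from the BERT repository

bert_config = modeling.BertConfig.from_json_file("bert_config.json")

# Placeholder inputs; batch size 8 and sequence length 128 are illustrative.
input_ids = tf.placeholder(tf.int32, shape=[8, 128], name="input_ids")
input_mask = tf.placeholder(tf.int32, shape=[8, 128], name="input_mask")
segment_ids = tf.placeholder(tf.int32, shape=[8, 128], name="segment_ids")

model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

sequence_output = model.get_sequence_output()    # [batch, seq_len, hidden_size]
pooled_output = model.get_pooled_output()        # [batch, hidden_size]
embedding_output = model.get_embedding_output()  # embeddings after post-processing
embedding_table = model.get_embedding_table()    # [vocab_size, hidden_size]
all_layers = model.get_all_encoder_layers()      # outputs of every transformer layer
```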
BERT Pre-training Tips
1) Masked LM and next-sentence prediction loss
```
***** Eval results *****
  global_step = 20
  loss = 0.0979674
  masked_lm_accuracy = 0.985479
  masked_lm_loss = 0.0979328
  next_sentence_accuracy = 1.0
  next_sentence_loss = 3.45724e-05
```
2) When swapping in your own vocabulary, pay attention to the value of vocab_size (see the check after this list).
3) Start training from an existing checkpoint when working with a domain-specific corpus, e.g. movie-review analysis.
4) The learning rate we used in the paper was 1e-4. However, if you are doing additional steps of pre-training starting from an existing BERT checkpoint, you should use a smaller learning rate (e.g., 2e-5).
5) Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128. The fully-connected/convolutional cost is the same, but the attention cost is far greater for the 512-length sequences. Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly. Note that this does require generating the data twice with different values of `max_seq_length`.
6) Is this code compatible with Cloud TPUs? What about GPUs?
Yes, all of the code in this repository works out-of-the-box with CPU, GPU, and Cloud TPU. However, GPU training is single-GPU only.
7) Why choose the BERT-Base, Uncased model? Three reasons: 1. the training corpus is English, so neither the Chinese nor the multilingual model is needed; 2. hardware is limited: if your GPU has less than 16 GB of memory, stick with Base rather than Large; 3. "cased" keeps case distinctions while "uncased" lowercases everything; unless you know your task is case-sensitive (e.g. named entity recognition or part-of-speech tagging), uncased usually works better.
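For tip 2, here is a quick sanity check that the vocabulary file and bert_config.json agree; the file paths are placeholders for whatever is in your model directory.
```
import json

# The number of lines in vocab.txt must equal vocab_size in bert_config.json,
# otherwise the word-embedding table will have the wrong shape.
with open("vocab.txt", encoding="utf-8") as f:
    num_tokens = sum(1 for _ in f)

with open("bert_config.json") as f:
    config_vocab_size = json.load(f)["vocab_size"]

print("tokens in vocab.txt: ", num_tokens)
print("vocab_size in config:", config_vocab_size)
assert num_tokens == config_vocab_size, "update vocab_size when changing the vocabulary"
```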