# Input file format:
(1) One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text. (Because we use the sentence boundaries for the "next sentence prediction" task).
(2) Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task doesn't span between documents.
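
The two rules above can be sketched as a small parser. This is only an illustration of the format, not the code the BERT scripts use; the function name `read_documents` and the sample sentences are made up for the example.

```python
def read_documents(lines):
    """Group non-empty lines (sentences) into documents separated by blank lines."""
    documents = [[]]
    for line in lines:
        sentence = line.strip()
        if not sentence:
            # A blank line closes the current document (if it has any content).
            if documents[-1]:
                documents.append([])
            continue
        documents[-1].append(sentence)
    # Drop an empty trailing document left behind by a final blank line.
    return [doc for doc in documents if doc]


sample = [
    "This is the first sentence.",
    "This is the second sentence.",
    "",
    "A new document starts here.",
]
docs = read_documents(sample)
print(len(docs), "documents")
```

Keeping documents separate is what lets the "next sentence" pairs be drawn from within one document only.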
python create_pretraining_data.py \
  --input_file=./sample_text.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
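
To see how `masked_lm_prob` and `max_predictions_per_seq` interact, here is a minimal sketch of the per-sequence count of masked positions. The formula is what the data-generation script appears to use (round the expected count, keep at least one, cap at the maximum); treat it as an assumption, and note that `num_masked_positions` is a name invented for this example.

```python
def num_masked_positions(num_tokens, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Number of positions chosen for masked-LM prediction in one sequence (assumed formula)."""
    wanted = int(round(num_tokens * masked_lm_prob))
    # At least one prediction, never more than the configured cap.
    return min(max_predictions_per_seq, max(1, wanted))


for n in (3, 104, 128, 512):
    print(n, "tokens ->", num_masked_positions(n), "masked positions")
```

With a sequence length of 128 the cap of 20 is rarely hit; with much longer sequences it would be, so the two flags are usually scaled together.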
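To train from scratch, leave out the init_checkpoint flag. The parameters in the command below are: input_file, the pre-training dataset, i.e. the tfrecord file produced by the step above; output_dir, the folder where logs and model checkpoints are written; do_train and do_eval, which control whether training and evaluation are run (at least one of them must be True); bert_config_file, the parameters needed to build the BERT model, a JSON file included in the downloaded model files; init_checkpoint, the starting point for training. The remaining flags set the batch size, the maximum sequence length, the maximum number of predicted (masked) tokens per sequence, the number of training steps, the number of learning-rate warmup steps, and the initial learning rate.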
python run_pretraining.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
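
To make the roles of `num_train_steps`, `num_warmup_steps` and `learning_rate` concrete, the sketch below assumes a linear warmup to the initial rate followed by a linear decay to zero over the whole run. This is believed to match the schedule in the BERT optimizer code, but it is written here as a plain-Python assumption; `learning_rate_at` is a hypothetical helper, not a function from the repository.

```python
def learning_rate_at(step, init_lr=2e-5, num_train_steps=20, num_warmup_steps=10):
    """Learning rate at a given global step (assumed: linear warmup, then linear decay)."""
    if num_warmup_steps and step < num_warmup_steps:
        # Warmup: ramp linearly from 0 towards init_lr.
        return init_lr * step / num_warmup_steps
    # Decay: linear from init_lr at step 0 down to 0 at num_train_steps.
    return init_lr * (1.0 - min(step, num_train_steps) / num_train_steps)


for step in (0, 5, 10, 15, 20):
    print(step, learning_rate_at(step))
```

With only 20 steps and 10 warmup steps, as in the demo command, the rate spends half the run warming up, which is fine for a smoke test but not for real training.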
Input_ids:: [101, 1131, 3090, 1106, 9416, 1103, 18127, 103, 117, 1115, 103, 1821, 170, 14798, 103, 4267, 20394, 1785, 2111, 103, 102, 170, 4984, 2851, 117, 178, 1821, 117, 6442, 106, 112, 1598, 1119, 103, 8228, 8788, 103, 117, 15992, 103, 8290, 3472, 118, 118, 112, 103, 4984, 2851, 106, 103, 118, 4984, 117, 1191, 1103, 103, 1104, 103, 103, 2621, 1104, 1103, 27466, 17893, 117, 1621, 1103, 16358, 5700, 1104, 1103, 2211, 1362, 118, 118, 5750, 117, 1256, 1154, 103, 16358, 1403, 118, 15398, 2111, 119, 1218, 117, 1170, 1155, 117, 178, 6111, 1437, 1128, 1103, 1236, 1106, 1103, 19026, 112, 188, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Input_mask:: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Segement id:: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Mask_lm_position:: [4, 7, 10, 14, 16, 19, 33, 36, 39, 45, 49, 55, 57, 58, 79, 92, 0, 0, 0, 0]
Mask_lm_ids:: [1143, 1864, 178, 1104, 6871, 119, 117, 1193, 1117, 170, 118, 14931, 5027, 1209, 1103, 1209, 0, 0, 0, 0]
Mask_lm_weights:: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
Next_sentence_label:: 0
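
The fields above fit together as follows: the masked-LM positions index into the input ids, the masked-LM ids hold the original token at each of those positions, and a weight of 0.0 marks a padding slot. A minimal sketch, using only the first eight input ids of the example and a shortened, padded position list (the helper name `restore_original_ids` is made up for illustration):

```python
def restore_original_ids(input_ids, positions, label_ids, weights):
    """Put the original token ids back at the masked positions."""
    restored = list(input_ids)
    for pos, label, weight in zip(positions, label_ids, weights):
        if weight > 0:  # weight 0.0 means this slot is padding, not a real prediction
            restored[pos] = label
    return restored


# First eight ids of the example; 103 is the [MASK] id in this vocabulary.
input_ids = [101, 1131, 3090, 1106, 9416, 1103, 18127, 103]
positions = [4, 7, 0, 0]        # last two entries are padding
label_ids = [1143, 1864, 0, 0]
weights = [1.0, 1.0, 0.0, 0.0]

restored = restore_original_ids(input_ids, positions, label_ids, weights)
print(restored)
```

Note that position 4 in the example holds 9416 rather than 103: not every selected position is replaced by `[MASK]`, which is consistent with BERT sometimes substituting a random token or keeping the original.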
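2) Batch the data with tf.data; the function is input_fn_builder.
3) Build the model; the inputs are input_ids, input_mask and segment_ids.
4) Embedding and embedding post-processing; the output has size batch * seq_len * emb_size.
5) Pass through transformer_model to get sequence_out of size batch * seq_len * hidden_size; for sentence-level tasks such as classification, pool_out of size batch * hidden_size can be used as the output instead.
6) Compute mask_lm_loss: the input is sequence_out; the per-example loss has size batch_size * mask_lm_ids_length, and the total loss is a scalar.
7) Compute next_seq_loss: the input is pool_out, which yields a batch_size * 2 classification; the per-example loss has size batch_size * 1, and the total loss is a scalar.
8) Optimizer: backpropagate the loss to compute gradients, then update the weights using the learning rate.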
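
Steps 6) and 7) can be sketched in plain Python without TensorFlow. The sketch assumes the masked-LM loss is a weighted average of per-slot cross-entropies (with a small constant in the denominator so that all-padding batches do not divide by zero) and that the next-sentence loss is a plain mean; both are assumptions about the implementation, and the function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_lm_loss(logits, label_ids, label_weights):
    """logits: one row of vocabulary scores per masked slot, flattened over the batch."""
    per_example = [-log_softmax(row)[label] for row, label in zip(logits, label_ids)]
    numerator = sum(w * loss for w, loss in zip(label_weights, per_example))
    denominator = sum(label_weights) + 1e-5  # padded slots carry weight 0.0
    return numerator / denominator


def next_sentence_loss(logits, labels):
    """logits: one row of two scores per example."""
    per_example = [-log_softmax(row)[label] for row, label in zip(logits, labels)]
    return sum(per_example) / len(per_example)


# Toy vocabulary of 4 tokens; the second slot is padding and is ignored.
mlm = masked_lm_loss([[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]], [1, 0], [1.0, 0.0])
nsp = next_sentence_loss([[0.0, 0.0]], [0])
total_loss = mlm + nsp  # the scalar that step 8) backpropagates
print(mlm, nsp, total_loss)
```

The total loss being the sum of the two parts is easy to check against an evaluation log: the reported loss should equal the masked-LM loss plus the next-sentence loss.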
(2) Word embedding
(4) Build the attention mask
(5) Attention layer (multi-head attention)
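
A minimal sketch of step (4) for a single sequence, assuming every query position may attend to every non-padding key position and that masked-out keys receive a large negative bias before the softmax. This appears to match how BERT builds its mask, but the helper names and the penalty value of -10000.0 are stated here as assumptions.

```python
def attention_mask_from_input_mask(input_mask):
    """Return a seq_len x seq_len mask: entry [i][j] is 1 if key j is a real token."""
    seq_len = len(input_mask)
    return [list(input_mask) for _ in range(seq_len)]


def additive_bias(mask, penalty=-10000.0):
    """Bias added to the attention scores: 0 for visible keys, a large negative value otherwise."""
    return [[0.0 if visible else penalty for visible in row] for row in mask]


input_mask = [1, 1, 0]  # two real tokens followed by one padding token
mask = attention_mask_from_input_mask(input_mask)
bias = additive_bias(mask)
print(mask)
print(bias)
```

After the softmax, a score shifted by the penalty contributes essentially zero weight, so padding tokens do not influence the attention output in step (5).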
1) Masked LM and next sentence prediction loss
```
***** Eval results *****
  global_step = 20
  loss = 0.0979674
  masked_lm_accuracy = 0.985479
  masked_lm_loss = 0.0979328
  next_sentence_accuracy = 1.0
  next_sentence_loss = 3.45724e-05
```
3) Start training from a checkpoint, using a domain-specific corpus, e.g. for movie review analysis.
4) The learning rate we used in the paper was 1e-4. However, if you are doing additional steps of pre-training starting from an existing BERT checkpoint, you should use a smaller learning rate (e.g., 2e-5).
5) Longer sequences are disproportionately expensive because attention is quadratic in the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128. The fully-connected/convolutional cost is the same, but the attention cost is far greater for the 512-length sequences. Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly. Note that this does require generating the data twice with different values of `max_seq_length`.
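
The comparison in point 5) can be checked with a little arithmetic. The sketch counts tokens per batch (a proxy for the fully-connected cost) and entries in the attention score matrices (a proxy for the attention cost, per head and per layer); constant factors are ignored and the function names are illustrative.

```python
def tokens_per_batch(batch_size, seq_len):
    """Proxy for the fully-connected cost: one unit of work per token."""
    return batch_size * seq_len


def attention_score_entries(batch_size, seq_len):
    """Proxy for the attention cost: one seq_len x seq_len score matrix per sequence."""
    return batch_size * seq_len * seq_len


short_batch = (256, 128)  # 256 sequences of length 128
long_batch = (64, 512)    # 64 sequences of length 512

same_tokens = tokens_per_batch(*short_batch) == tokens_per_batch(*long_batch)
attention_ratio = attention_score_entries(*long_batch) / attention_score_entries(*short_batch)
print(same_tokens, attention_ratio)
```

Both batches contain 32,768 tokens, but the 512-length batch has four times as many attention score entries, which is why most steps are better spent at length 128.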
6) Is this code compatible with Cloud TPUs? What about GPUs?
Yes, all of the code in this repository works out-of-the-box with CPU, GPU, and Cloud TPU. However, GPU training is single-GPU only.
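7) Why choose the BERT-Base, Uncased model? There are three reasons: 1. The training corpus is English, so the Chinese and multilingual models are ruled out; 2. Hardware is limited: if your GPU has less than 16 GB of memory, stick with base and do not bother with large; 3. cased means the model is case-sensitive, uncased means it is not. Unless you know for certain that your task is case-sensitive (for example named entity recognition or part-of-speech tagging), uncased usually gives better results.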
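
To illustrate what "uncased" means in practice, the sketch below lowercases text and strips accent marks, which is believed to be roughly what happens when `do_lower_case=True`. The function `uncased_normalize` is a made-up stand-in for illustration, not the tokenizer from the BERT repository.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase and strip combining accent marks (assumed uncased preprocessing)."""
    text = unicodedata.normalize("NFD", text.lower())
    # Category "Mn" covers the combining marks produced by NFD decomposition.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


print(uncased_normalize("Paris"), uncased_normalize("paris"), uncased_normalize("Café"))
```

After this step "Paris" and "paris" are the same string, which is harmless for most tasks but removes a useful cue for tasks such as named entity recognition.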