The TensorFlow version of BERT open-sourced by Google is available at https://github.com/google-research/bert. This article walks through some key parts of that official code.
We start with the data preprocessing, analyzing how the raw dataset is turned into features that can be fed into the BERT model.
DataProcessor
The abstract base class `DataProcessor` defines four methods that subclasses must implement: `get_train_examples`, `get_dev_examples`, `get_test_examples`, and `get_labels`. It also provides a `_read_tsv` helper for reading raw TSV dataset files.
For a binary text-classification task, we can understand the data processing by studying `ColaProcessor`, a concrete subclass of `DataProcessor`. Its key methods for handling the raw data are as follows:
```python
class ColaProcessor(DataProcessor):
  """Processor for the CoLA data set (GLUE version)."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_labels(self):
    """See base class."""
    return ["0", "1"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      # Only the test set has a header
      if set_type == "test" and i == 0:
        continue
      guid = "%s-%s" % (set_type, i)
      if set_type == "test":
        text_a = tokenization.convert_to_unicode(line[1])
        label = "0"
      else:
        text_a = tokenization.convert_to_unicode(line[3])
        label = tokenization.convert_to_unicode(line[1])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples


class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label
```
`get_train_examples` first reads the raw dataset file `train.tsv` via `_read_tsv`, then calls `_create_examples` to convert each line of the dataset into an `InputExample` object.
- In `_create_examples`, for the training and dev sets `line[1]` is the label and `line[3]` is the sentence text; for the test set `line[1]` is the sentence text and there is no label, so every label is set to "0" (see the CoLA dataset for the exact column layout). Note that all strings are converted with `tokenization.convert_to_unicode` for compatibility with both Python 2 and Python 3. A minimal sketch of this conversion follows this list.
- An `InputExample` has four attributes: `guid` is just a unique identifier, `text_a` is the first sentence, `text_b` is the second sentence (optional, for sentence-pair tasks), and `label` is the label (optional; the test set has none).
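Here is a minimal sketch of that conversion (the two TSV rows are made up for illustration, and it assumes `run_classifier.py` and `tokenization.py` from the BERT repo are importable):

```python
# Assumes the BERT repo (run_classifier.py, tokenization.py) is on PYTHONPATH.
from run_classifier import ColaProcessor

# Two made-up rows in the CoLA train.tsv layout: [source, label, original_label, sentence].
lines = [
    ["gj04", "1", "", "The sailors rode the breeze clear of the rocks."],
    ["gj04", "0", "*", "The more books I ask to whom he will give."],
]

examples = ColaProcessor()._create_examples(lines, "train")
for ex in examples:
    print(ex.guid, ex.label, ex.text_a)
# train-0 1 The sailors rode the breeze clear of the rocks.
# train-1 0 The more books I ask to whom he will give.
```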
tokenizer
The main class that tokenizes the sentence strings `text_a` and `text_b` of an `InputExample` is `FullTokenizer`.
```python
tokenizer = tokenization.FullTokenizer(
    vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)


class FullTokenizer(object):
  """Runs end-to-end tokenziation."""

  def __init__(self, vocab_file, do_lower_case=True):
    self.vocab = load_vocab(vocab_file)
    self.inv_vocab = {v: k for k, v in self.vocab.items()}
    self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

  def tokenize(self, text):
    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)
    return split_tokens

  def convert_tokens_to_ids(self, tokens):
    return convert_by_vocab(self.vocab, tokens)

  def convert_ids_to_tokens(self, ids):
    return convert_by_vocab(self.inv_vocab, ids)
```
`FullTokenizer` loads the vocabulary with `load_vocab`, so that the tokens produced by tokenization can later be mapped to their ids. Tokenization itself is delegated to `BasicTokenizer` and `WordpieceTokenizer`: the former does plain tokenization based on punctuation, whitespace, and so on, while the latter splits the former's output into finer-grained pieces.
- Note that `BasicTokenizer` splits Chinese text into individual characters by inserting spaces around every Chinese character, so each character is subsequently treated as a separate token.
- `WordpieceTokenizer` uses the vocabulary `vocab` to split words into finer-grained pieces; for example, "unaffable" is further split into ["un", "##aff", "##able"] (see the sketch after this list). For Chinese it does nothing, because the previous step has already split the text into characters. Words that cannot be found in `vocab` are mapped to the `[UNK]` token.
- WordPiece is one way of handling the OOV problem; see the google/sentencepiece project for details.
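A small sketch of how the tokenizer is used (the vocab path is an assumption; use the `vocab.txt` of whichever pretrained checkpoint you downloaded, and the exact subword split depends on that vocabulary):

```python
# Assumes tokenization.py from the BERT repo and a downloaded English checkpoint.
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",  # path is an assumption
    do_lower_case=True)

tokens = tokenizer.tokenize("He is unaffable.")
# e.g. ['he', 'is', 'un', '##aff', '##able', '.']  (exact pieces depend on the vocab)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```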
convert_single_example
Next, the `InputExample` objects produced by the processor are converted into features that can actually be fed into the network.
```python
if FLAGS.do_train:
  train_examples = processor.get_train_examples(FLAGS.data_dir)
  num_train_steps = int(
      len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
  num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)

if FLAGS.do_train:
  train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
  file_based_convert_examples_to_features(
      train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
  tf.logging.info("***** Running training *****")
  tf.logging.info("  Num examples = %d", len(train_examples))
  tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
  tf.logging.info("  Num steps = %d", num_train_steps)
  train_input_fn = file_based_input_fn_builder(
      input_file=train_file,
      seq_length=FLAGS.max_seq_length,
      is_training=True,
      drop_remainder=True)
  estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)


## The feature-extraction function
def file_based_convert_examples_to_features(
    examples, label_list, max_seq_length, tokenizer, output_file):
  """Convert a set of `InputExample`s to a TFRecord file."""

  writer = tf.python_io.TFRecordWriter(output_file)

  for (ex_index, example) in enumerate(examples):
    if ex_index % 10000 == 0:
      tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

    feature = convert_single_example(ex_index, example, label_list,
                                     max_seq_length, tokenizer)

    def create_int_feature(values):
      f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
      return f

    features = collections.OrderedDict()
    features["input_ids"] = create_int_feature(feature.input_ids)
    features["input_mask"] = create_int_feature(feature.input_mask)
    features["segment_ids"] = create_int_feature(feature.segment_ids)
    features["label_ids"] = create_int_feature([feature.label_id])
    features["is_real_example"] = create_int_feature(
        [int(feature.is_real_example)])

    tf_example = tf.train.Example(features=tf.train.Features(feature=features))
    writer.write(tf_example.SerializeToString())
  writer.close()
```
對于Processor處理后得到每個InputExample對象邓馒,file_based_convert_examples_to_features
函數(shù)會把這些對象轉(zhuǎn)化成能夠送入bert網(wǎng)絡的特征嘶朱,并將其保存到一個TFRecord文件中」夂ǎ可以發(fā)現(xiàn)疏遏,該過程提取特征的關鍵函數(shù)是convert_single_example
:
```python
def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
  """Converts a single `InputExample` into a single `InputFeatures`."""

  ## Map each label to an id
  label_map = {}
  for (i, label) in enumerate(label_list):
    label_map[label] = i

  tokens_a = tokenizer.tokenize(example.text_a)
  tokens_b = None
  if example.text_b:
    tokens_b = tokenizer.tokenize(example.text_b)

  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]

  # The convention in BERT is:
  # (a) For sequence pairs:
  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
  # (b) For single sequences:
  #  tokens:   [CLS] the dog is hairy . [SEP]
  #  type_ids: 0     0   0   0  0     0 0
  #
  # For classification tasks, the first vector (corresponding to [CLS]) is
  # used as the "sentence vector". Note that this only makes sense because
  # the entire model is fine-tuned.
  tokens = []
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")
  segment_ids.append(0)

  if tokens_b:
    for token in tokens_b:
      tokens.append(token)
      segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)

  input_ids = tokenizer.convert_tokens_to_ids(tokens)

  # The mask has 1 for real tokens and 0 for padding tokens. Only real
  # tokens are attended to.
  input_mask = [1] * len(input_ids)

  # Zero-pad up to the sequence length.
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

  assert len(input_ids) == max_seq_length
  assert len(input_mask) == max_seq_length
  assert len(segment_ids) == max_seq_length

  label_id = label_map[example.label]

  feature = InputFeatures(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids,
      label_id=label_id,
      is_real_example=True)
  return feature
```
- First, `tokenizer.tokenize` splits `text_a` (and `text_b`, if present) into tokens. If the tokenized sentence(s) are too long they are truncated, so that the total length after adding `[CLS]` at the start and `[SEP]` at the sentence boundaries stays within `max_seq_length`.
- `segment_ids` (i.e. `type_ids`) indicate whether a token comes from the first or the second sentence; the embeddings for `type=0` and `type=1` are learned during pre-training. In theory this is not strictly necessary, since `[SEP]` already marks the sentence boundary, but the type ids make it easier for the model to tell which sequence a token belongs to.
- `convert_tokens_to_ids` maps the tokens produced by tokenization to ids using the vocabulary `vocab`.
- If a sequence is shorter than `max_seq_length`, it is zero-padded up to the fixed length `max_seq_length`. `input_mask=1` marks tokens that come from the sentence, while `input_mask=0` marks padding tokens.
- Finally, the extracted `input_ids`, `input_mask`, and `segment_ids` are packed into an `InputFeatures` object (a worked example follows this list). This completes the data processing before the data is fed into the network.
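As a worked illustration of the final feature layout (the token ids below are made-up placeholders, not values from a real `vocab.txt`):

```python
# Single sentence "the dog is hairy ." with max_seq_length = 8.
tokens      = ["[CLS]", "the", "dog", "is", "hairy", ".", "[SEP]"]  # 7 tokens
input_ids   = [11, 21, 31, 41, 51, 61, 12, 0]  # made-up ids, zero-padded to length 8
input_mask  = [1,  1,  1,  1,  1,  1,  1,  0]  # 1 = real token, 0 = padding
segment_ids = [0,  0,  0,  0,  0,  0,  0,  0]  # single sentence, so all 0
```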
The next post will walk through the model's network architecture, including how the input ids are mapped to word embeddings.