It's been a long time since I last updated, which I feel a bit guilty about. I no longer have the enthusiasm I had when I started this, so I'll just jot down a rough record.
First, BERT is a pre-trained model in the same spirit as word2vec. word2vec has a very simple structure: a shallow neural network whose hidden-layer weights are taken as the word vectors. BERT is more complex: it stacks many layers of the Transformer encoder (BERT-Base has 12 layers, which is what I used in my experiments). For the details of the Transformer, https://jalammar.github.io/illustrated-transformer/ explains it quite intuitively, or you can go read the TensorFlow source code directly.
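For concreteness, the BERT-Base hyperparameters live in the bert_config.json that ships with the released checkpoint. A minimal sketch of loading or building the config with modeling.BertConfig; the literal values below are from the released uncased English BERT-Base and are listed purely as an illustration:

import modeling  # modeling.py from the google-research/bert repo

# Usual way: read the config that ships with the pre-trained checkpoint.
bert_config = modeling.BertConfig.from_json_file("bert_config.json")

# Equivalent explicit construction for uncased English BERT-Base:
bert_config = modeling.BertConfig(
    vocab_size=30522,
    hidden_size=768,               # H
    num_hidden_layers=12,          # L: the 12 transformer layers mentioned above
    num_attention_heads=12,        # A
    intermediate_size=3072,        # feed-forward inner size
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02)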
BERT的開源代碼里是這樣寫的(在modeling.py里):
class BertModel(object):
  """BERT model ("Bidirectional Encoder Representations from Transformers").

  Example usage:

  # Already been converted into WordPiece token ids
  input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
  input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
  token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

  config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
      num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)

  model = modeling.BertModel(config=config, is_training=True,
      input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

  label_embeddings = tf.get_variable(...)
  pooled_output = model.get_pooled_output()
  logits = tf.matmul(pooled_output, label_embeddings)
  ...
  """

  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=False,
               scope=None):
    """Constructor for BertModel.

    Args:
      config: `BertConfig` instance.
      is_training: bool. true for training model, false for eval model. Controls
        whether dropout will be applied.
      input_ids: int32 Tensor of shape [batch_size, seq_length].
      input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
      token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
        embeddings or tf.embedding_lookup() for the word embeddings.
      scope: (optional) variable scope. Defaults to "bert".

    Raises:
      ValueError: The config is invalid or one of the input tensor shapes
        is invalid.
    """
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids.
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      self.sequence_output = self.all_encoder_layers[-1]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))
As you can see, it first does the embedding lookup for each id, and the encoder is simply transformer_model. sequence_output is the last hidden layer (for reading-comprehension tasks this is what gets used directly), while pooled_output is what we take for the classification task, i.e. the output at the [CLS] position.
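As a minimal sketch of how those two outputs are pulled out of the class above (bert_config and the three input tensors are assumed to already exist, e.g. built as in the docstring example):

model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids)

# [batch_size, seq_length, hidden_size]: per-token vectors from the last
# transformer layer, used directly for reading-comprehension style tasks.
sequence_output = model.get_sequence_output()

# [batch_size, hidden_size]: the tanh-transformed [CLS] vector from the
# pooler, which the classification head is built on.
pooled_output = model.get_pooled_output()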
I honestly don't remember the experimental details very well any more; classification uses run_classifier.py. Two things need changing. First, to run on your own data you have to adapt the file-reading part (the DataProcessor; a rough sketch is given after the loss snippet below). Second, the original code does single-label classification; for multi-label classification the final output step has to go from one softmax to multiple sigmoids, which means changing, in the create_model function,
probabilities = tf.nn.softmax(logits, axis=-1)
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
to
probabilities = tf.nn.sigmoid(logits)
per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss_batch = tf.reduce_mean(per_example_loss, axis=1)
loss = tf.reduce_mean(loss_batch)
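(One thing to watch: tf.nn.sigmoid_cross_entropy_with_logits expects labels to be float multi-hot vectors with the same shape as logits, not a single int class id, so the label conversion in convert_single_example has to change accordingly.) As for the file-reading change mentioned above, in practice it means adding your own DataProcessor subclass to run_classifier.py and registering it in the processors dict in main(). A rough sketch for the plain single-label case, assuming a tab-separated file with the text in the first column and the label in the second (the file names and label set here are made up for illustration):

class MyProcessor(DataProcessor):
  """Hypothetical processor for a home-grown TSV dataset."""

  def get_train_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    # Full label set of the task; the order must stay fixed across runs.
    return ["label_0", "label_1", "label_2"]

  def _create_examples(self, lines, set_type):
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%d" % (set_type, i)
      text_a = tokenization.convert_to_unicode(line[0])
      label = tokenization.convert_to_unicode(line[1])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

Then add an entry like "mytask": MyProcessor to the processors dict and run with --task_name=mytask.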
One more thing: the original code is built around a lot of callback functions, which makes it awkward to pull out the model-loading and prediction parts on their own. You can rewrite that bit, for example:
import os
import tensorflow as tf

# Let GPU memory grow on demand instead of grabbing it all up front.
gpu_config = tf.ConfigProto()
gpu_config.gpu_options.allow_growth = True
sess = tf.Session(config=gpu_config)
sess.run(tf.global_variables_initializer())

# Restore our fine-tuned checkpoint.
saver = tf.train.Saver()
if os.path.exists(MODEL_PATH):
    saver.restore(sess, tf.train.latest_checkpoint(MODEL_PATH))

# Feed one batch of features and run the prediction op.
feed_dict = {input_ids: batch_input_ids, input_mask: batch_input_mask,
             segment_ids: batch_segment_ids}
sess.run([probabilities], feed_dict)
That's roughly it, a very traditional, easy-to-read TensorFlow way of loading and predicting. MODEL_PATH is the checkpoint directory of the model we fine-tuned ourselves, probabilities is the probabilities tensor defined in create_model, the input_ids / input_mask / segment_ids inputs are tf.placeholder tensors we define ourselves, and batch_input_ids etc. are batches of the correspondingly named features from the feature-conversion step. This way the loading and prediction logic is decoupled from all the fn callbacks. Alternatively, you can convert the ckpt into a SavedModel (pb plus variables) and deploy it with TensorFlow Serving.
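For completeness, here is a rough sketch of how the graph in front of that session snippet could be put together and then exported for TensorFlow Serving. It assumes create_model keeps its run_classifier.py signature and that max_seq_length, num_labels and bert_config match the training run; the labels placeholder is only there because create_model also builds the loss:

max_seq_length = 128  # assumption: must match what was used at training time

# Placeholders that replace the Estimator's input_fn.
input_ids = tf.placeholder(tf.int32, [None, max_seq_length], name="input_ids")
input_mask = tf.placeholder(tf.int32, [None, max_seq_length], name="input_mask")
segment_ids = tf.placeholder(tf.int32, [None, max_seq_length], name="segment_ids")
# Dummy at predict time; for the multi-label variant above this would instead
# be tf.float32 with shape [None, num_labels].
labels = tf.placeholder(tf.int32, [None], name="labels")

(total_loss, per_example_loss, logits, probabilities) = create_model(
    bert_config, False, input_ids, input_mask, segment_ids,
    labels, num_labels, use_one_hot_embeddings=False)

# ... restore the fine-tuned checkpoint into `sess` as shown above, then
# optionally export a SavedModel (pb + variables) for TensorFlow Serving:
tf.saved_model.simple_save(
    sess, "export/1",
    inputs={"input_ids": input_ids,
            "input_mask": input_mask,
            "segment_ids": segment_ids},
    outputs={"probabilities": probabilities})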