1. Explain BERT.
- A bidirectional, two-stage pre-trained model built on WordPiece tokenization.
- Special tokens: [CLS], [SEP].
- BERT_Base (12 layers), BERT_Large (24 layers).
- Pre-training: Task #1: Masked LM; Task #2: Next Sentence Prediction.
- Special tokens for task inputs: start and end tokens (<s>, <e>), delimiter token ($).
- Fine-tuning: Two Sentence Classification, Single Sentence Classification, Question Answering, Single Sentence Tagging.
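A quick way to see the special tokens and WordPiece splitting in practice is the Hugging Face transformers tokenizer; a minimal sketch, assuming the bert-base-uncased checkpoint can be downloaded:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits out-of-vocabulary words into "##"-prefixed sub-tokens.
print(tokenizer.tokenize("unaffable"))   # e.g. ['una', '##ffa', '##ble'] (exact split depends on the vocab)

# For a sentence pair, the encoder input is [CLS] A [SEP] B [SEP].
enc = tokenizer("how are you", "i am fine")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'i', 'am', 'fine', '[SEP]']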
2. Can you implement WordPiece? I had forgotten the steps, so the question was changed to: build a vocabulary, i.e. word2idx and idx2word, from several files. The BPE algorithm.
The WordPiece algorithm can be viewed as a variant of BPE. The difference is that WordPiece adds the new subword that most increases the likelihood of the training data, instead of merging the most frequent byte pair (a minimal BPE sketch follows the algorithm steps below).
Algorithm:
- Prepare a sufficiently large training corpus.
- Decide on the desired subword vocabulary size.
- Split each word into a sequence of characters.
- Train a language model on the data from step 3.
- Among all candidate subword units, pick the one whose addition to the language model most increases the likelihood of the training data, and add it as a new unit.
- Repeat step 5 until the vocabulary size from step 2 is reached, or the likelihood gain falls below a threshold.
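Since the fallback in the interview was plain BPE, here is a minimal sketch of the classic BPE merge-learning loop (count adjacent symbol pairs, merge the most frequent one); the toy word-frequency dictionary and merge count are made up for illustration:

import re
from collections import Counter

def get_pair_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol in the vocab keys."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words are pre-split into characters, with </w> marking the word end.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
num_merges = 10
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print(best)  # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...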
def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    return text.split()


class WordpieceTokenizer(object):
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.

        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.

        For example:
            input = "unaffable"
            output = ["un", "##aff", "##able"]

        Args:
            text: A single token or whitespace separated tokens. This should have
                already been passed through `BasicTokenizer`.

        Returns:
            A list of wordpiece tokens.
        """
        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                # Greedily look for the longest substring that is in the vocab.
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens
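A quick usage check of the tokenizer above, with a tiny made-up vocabulary (a real vocab would come from a trained WordPiece model):

vocab = {"un", "##aff", "##able", "[UNK]"}
wp = WordpieceTokenizer(vocab=vocab, unk_token="[UNK]")
print(wp.tokenize("unaffable"))   # ['un', '##aff', '##able']

For comparison, the bpe and _tokenize methods below are taken from a GPT-2 style byte-level BPE tokenizer; they rely on attributes and helpers defined elsewhere in that class (get_pairs, self.bpe_ranks, self.byte_encoder, self.cache, self.pat, and the re module).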
def bpe(self, token):
    if token in self.cache:
        return self.cache[token]
    word = tuple(token)
    pairs = get_pairs(word)

    if not pairs:
        return token

    while True:
        bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
        if bigram not in self.bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        else:
            pairs = get_pairs(word)
    word = " ".join(word)
    self.cache[token] = word
    return word
def _tokenize(self, text):
    """Tokenize a string."""
    bpe_tokens = []
    for token in re.findall(self.pat, text):
        token = "".join(
            self.byte_encoder[b] for b in token.encode("utf-8")
        )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
        bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
    return bpe_tokens
# '想要有直升機\n想要和你飛到宇宙去\n想要和你融化在一起\n融化在宇宙里\n我每天每天每'
# The dataset has over 60,000 characters. For easier printing we replace newlines with
# spaces and use only the first 10,000 characters to train the model.
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[0:10000]
idx_to_char = list(set(corpus_chars))
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
vocab_size = len(char_to_idx)
vocab_size  # 1027
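The interview question actually asked for a word-level vocabulary built from several files; a minimal sketch of that (the file paths and the whitespace tokenization are assumptions):

from collections import Counter

def build_vocab(file_paths, min_freq=1, specials=("<pad>", "<unk>")):
    """Builds word2idx / idx2word from a list of text files."""
    counter = Counter()
    for path in file_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counter.update(line.strip().split())  # naive whitespace tokenization
    # Special tokens first, then words sorted by frequency (ties broken alphabetically).
    idx2word = list(specials) + [
        w for w, c in sorted(counter.items(), key=lambda x: (-x[1], x[0])) if c >= min_freq
    ]
    word2idx = {w: i for i, w in enumerate(idx2word)}
    return word2idx, idx2word

# Example with hypothetical files:
# word2idx, idx2word = build_vocab(["a.txt", "b.txt"])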
3. Explain BERT. Partway through, the interviewer interrupted me: estimate roughly how many parameters one BERT layer has.
Explaining BERT: see question 1. The relevant modules (from the Hugging Face implementation) and the learnable weights in each:
# BertEmbeddings:
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
# BertAttention (BertSelfAttention + BertSelfOutput):
#### BertSelfAttention
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
#### BertSelfOutput
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# BertIntermediate
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
# BertOutput
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# BertPooler
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
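Putting numbers to the modules above for BERT_Base (hidden_size H=768, intermediate_size 4H=3072, vocab 30522, 512 positions, 2 token types), a rough per-layer estimate; the sketch below just multiplies out the Linear/LayerNorm shapes listed above:

H, I = 768, 3072          # hidden size and feed-forward (intermediate) size in BERT_Base

# BertSelfAttention Q, K, V projections + BertSelfOutput dense: four H x H Linears with bias.
attention = 4 * (H * H + H)
# Two LayerNorms per layer (after attention and after the FFN), each with weight + bias.
layernorms = 2 * 2 * H
# BertIntermediate (H -> 4H) and BertOutput (4H -> H), each with bias.
ffn = (H * I + I) + (I * H + H)

per_layer = attention + ffn + layernorms
print(per_layer)              # 7,087,872 -- roughly 7M parameters per encoder layer
print(12 * per_layer)         # ~85M for the 12-layer encoder

# Embeddings: word (30522) + position (512) + token type (2), each H-dimensional,
# plus the embedding LayerNorm; the pooler adds one more H x H dense.
embeddings = (30522 + 512 + 2) * H + 2 * H
pooler = H * H + H
print(12 * per_layer + embeddings + pooler)   # ~110M, matching the reported BERT_Base size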
4. What are add & norm in BERT, and what are they for?
- add: residual connection. It mainly mitigates problems such as vanishing gradients during back-propagation when the model is deep.
- norm: Layer Normalization. It addresses the problem that the distribution of activations inside the network shifts heavily, which slows down learning.
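This is exactly what BertSelfOutput / BertOutput do; a minimal PyTorch sketch of the add & norm pattern (module and variable names here are illustrative, not BERT's own):

import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by Layer Normalization, as in each BERT sub-layer."""
    def __init__(self, hidden_size, dropout=0.1, eps=1e-12):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=eps)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        # "add" (residual connection) then "norm" (layer normalization)
        return self.LayerNorm(hidden_states + input_tensor)

x = torch.randn(2, 8, 768)            # (batch, seq_len, hidden)
sublayer_out = torch.randn(2, 8, 768)
print(AddNorm(768)(sublayer_out, x).shape)   # torch.Size([2, 8, 768])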
5. Do you know BERT's extensions: RoBERTa, SpanBERT, XLM, ALBERT? Introduce a few models besides BERT.
- RoBERTa: same architecture as BERT; drops NSP, uses dynamic masking, larger batches, more data, longer training, and byte-level BPE.
- SpanBERT: masks contiguous spans instead of individual tokens and adds a Span Boundary Objective (SBO); trains on single segments without NSP.
- XLM: cross-lingual pre-training; adds Translation Language Modeling (TLM) on parallel sentence pairs on top of MLM/CLM.
- ALBERT: factorized embedding parameterization and cross-layer parameter sharing to cut parameters; replaces NSP with Sentence Order Prediction (SOP).
- Transformer-XL (extra long): segment-level recurrence with cached hidden states plus relative positional encodings, for much longer context.
- XLNet: generalized autoregressive pre-training with permutation language modeling and two-stream self-attention, built on Transformer-XL.
6. In which module of the BERT source code does masking happen, and how does BERT mask?
The attention (padding) mask is applied inside multi-head attention (BertSelfAttention), shown below; the [MASK] replacement for the MLM objective is done when the training data is generated (see the sketch after the code).
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        if encoder_hidden_states is not None:
            mixed_key_layer = self.key(encoder_hidden_states)
            mixed_value_layer = self.value(encoder_hidden_states)
            attention_mask = encoder_attention_mask
        else:
            mixed_key_layer = self.key(hidden_states)
            mixed_value_layer = self.value(hidden_states)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in the BertModel forward() function):
            # masked positions hold a large negative value, so they vanish after the softmax.
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        return outputs
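The attention_mask above only hides padding (or encoder padding for cross-attention). The [MASK]-ing the question is really about happens when MLM training data is built: 15% of token positions are selected, and of those 80% become [MASK], 10% a random token, and 10% stay unchanged. A minimal sketch of that rule, modeled on the mask_tokens helper in the Hugging Face training examples (special-token and padding handling omitted):

import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    """Prepare masked inputs/labels for MLM: 80% [MASK], 10% random, 10% original."""
    labels = inputs.clone()
    # Sample ~15% of the positions to predict.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # the loss is only computed on the selected positions

    # 80% of the selected positions -> [MASK]
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # half of the remaining 20% -> a random token
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # the final 10% keep the original token (nothing to do)
    return inputs, labels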
7. Estimate BERT's number of parameters.
Estimating BERT's parameter count: see question 3.
- BERT_Base: L=12, H=768, A=12, Total Parameters=110M.
- BERT_Large: L=24, H=1024, A=16, Total Parameters=340M.
8. How does RoBERTa differ from BERT during pre-training?
- Static vs. Dynamic Masking
- Model Input Format and Next Sentence Prediction (FULL-SENTENCES, NSP removed)
- Training with Large Batches
- Text Encoding (byte-level BPE)
9. Introduce RoBERTa; why use wwm?
RoBERTa: as above. On wwm:
Whole Word Masking (wwm) is an upgrade to BERT released by Google on May 31, 2019; it mainly changes how training samples are generated in the pre-training stage. In short, the original WordPiece tokenization splits a full word into several subwords, and when training samples are generated those subwords are masked independently at random. With whole-word masking, if any WordPiece subword of a word is masked, the other subwords belonging to the same word are masked as well, i.e. the whole word is masked.
Note that "mask" here is meant in the broad sense (replace with [MASK]; keep the original token; replace with a random token), not only the case of replacing the token with the [MASK] label. For a more detailed explanation and examples see: #4
Likewise, because the official Google BERT-base, Chinese model tokenizes Chinese at character granularity, it does not take Chinese word segmentation (CWS), as used in traditional NLP, into account. We applied whole-word masking to Chinese, trained on Chinese Wikipedia (both simplified and traditional), and used the HIT LTP toolkit as the word segmenter, i.e. all characters that make up the same word are masked together.
The following shows how whole-word Mask samples are generated. Note: for ease of understanding, only replacement with the [MASK] label is considered.
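A minimal sketch (not the original repo's sample table) of the grouping step behind wwm, using English-style "##" WordPiece prefixes; for Chinese wwm the groups would instead come from the LTP word segmenter:

import random

def whole_word_mask_indices(tokens, mask_ratio=0.15, seed=0):
    """Group '##' sub-tokens with their preceding token, then mask whole groups."""
    random.seed(seed)
    # Build word groups: a '##' piece always belongs to the previous group.
    groups, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                groups.append(current)
            current = [i]
    if current:
        groups.append(current)
    # Pick whole groups until ~mask_ratio of the tokens are covered.
    random.shuffle(groups)
    masked, budget = [], max(1, int(len(tokens) * mask_ratio))
    for g in groups:
        if len(masked) >= budget:
            break
        masked.extend(g)
    return sorted(masked)

tokens = ["the", "un", "##aff", "##able", "philosopher", "speaks"]
idx = whole_word_mask_indices(tokens, mask_ratio=0.5)
print([("[MASK]" if i in idx else t) for i, t in enumerate(tokens)])
# e.g. ['the', '[MASK]', '[MASK]', '[MASK]', 'philosopher', 'speaks'] -- all pieces of a word are masked together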
10. Differences between BERT, GPT and ELMO (model structure, training method).
- BERT: bidirectional two-stage pre-trained model: Pre-training (MLM, NSP) + Fine-tuning (...).
- GPT: unidirectional two-stage pre-trained model: Pre-training (LM) + Fine-tuning (...).
- ELMO: pre-trained in two directions: Pre-training (two unidirectional LMs) + supervised NLP tasks (using the representations from each LSTM layer as features).
11. Why does BERT use only the Transformer encoder and not the decoder?
In pre-training BERT uses the Masked Language Model (an autoencoding, AE, objective), which predicts the [MASK]-ed tokens jointly from left and right context. The decoder is built around a unidirectional (causal) language model (LM) and cannot condition on both sides at once. That is why BERT uses the encoder rather than the decoder.
12. How do XLNet and BERT differ? This covers autoregressive (AR) vs. autoencoding (AE) models, XLNet's permutation language model, and two-stream attention.
BERT is an autoencoding (AE) model: it corrupts the input with [MASK] and reconstructs it, so it sees bidirectional context, but it assumes the masked tokens are independent of each other and the [MASK] symbol never appears at fine-tuning time. XLNet is autoregressive (AR), but through permutation language modeling it still captures bidirectional context, combining the strengths of AR and AE; two-stream attention (a content stream plus a query stream that sees only the position, not the content, of the token being predicted) is what makes the permutation objective trainable.