1. Explain BERT.
- A bidirectional, two-stage pre-trained model built on WordPiece tokenization.
- Special tokens: [CLS], [SEP].
- BERT_Base (12 layers), BERT_Large (24 layers).
- Pre-training: Task #1: Masked LM; Task #2: Next Sentence Prediction.
- Special tokens for task inputs: start and end tokens (<s>, <e>), delimiter token ($).
- Fine-tuning: Two Sentence Classification, Single Sentence Classification, Question Answering, Single Sentence Tagging.
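A quick way to see the special tokens and WordPiece splitting in practice is the Hugging Face transformers tokenizer; a minimal sketch, assuming the bert-base-uncased checkpoint can be downloaded:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits out-of-vocabulary words into "##"-prefixed sub-tokens.
print(tokenizer.tokenize("unaffable"))   # e.g. ['una', '##ffa', '##ble'] (exact split depends on the vocab)

# For a sentence pair, the encoder input is [CLS] A [SEP] B [SEP].
enc = tokenizer("how are you", "i am fine")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'i', 'am', 'fine', '[SEP]']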
2. Can you implement WordPiece? I had forgotten the steps, so the question was changed to: build a vocabulary, i.e. word2idx and idx2word, from several files. The BPE algorithm.
The WordPiece algorithm can be viewed as a variant of BPE. The difference is that WordPiece adds the new subword that most increases the likelihood of the training data, instead of merging the most frequent byte pair (a minimal BPE sketch follows the algorithm steps below).
Algorithm:
- Prepare a sufficiently large training corpus.
- Decide on the desired subword vocabulary size.
- Split each word into a sequence of characters.
- Train a language model on the data from step 3.
- Among all candidate subword units, pick the one whose addition to the language model most increases the likelihood of the training data, and add it as a new unit.
- Repeat step 5 until the vocabulary size from step 2 is reached, or the likelihood gain falls below a threshold.
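Since the fallback in the interview was plain BPE, here is a minimal sketch of the classic BPE merge-learning loop (count adjacent symbol pairs, merge the most frequent one); the toy word-frequency dictionary and merge count are made up for illustration:

import re
from collections import Counter

def get_pair_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol in the vocab keys."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words are pre-split into characters, with </w> marking the word end.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
num_merges = 10
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print(best)  # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...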
def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    return text.split()


class WordpieceTokenizer(object):
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.

        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.

        For example:
            input = "unaffable"
            output = ["un", "##aff", "##able"]

        Args:
            text: A single token or whitespace separated tokens. This should have
                already been passed through `BasicTokenizer`.

        Returns:
            A list of wordpiece tokens.
        """
        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                # Greedily look for the longest substring that is in the vocab.
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens
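A quick usage check of the tokenizer above, with a tiny made-up vocabulary (a real vocab would come from a trained WordPiece model):

vocab = {"un", "##aff", "##able", "[UNK]"}
wp = WordpieceTokenizer(vocab=vocab, unk_token="[UNK]")
print(wp.tokenize("unaffable"))   # ['un', '##aff', '##able']

For comparison, the bpe and _tokenize methods below are taken from a GPT-2 style byte-level BPE tokenizer; they rely on attributes and helpers defined elsewhere in that class (get_pairs, self.bpe_ranks, self.byte_encoder, self.cache, self.pat, and the re module).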
def bpe(self, token):
    if token in self.cache:
        return self.cache[token]
    word = tuple(token)
    pairs = get_pairs(word)

    if not pairs:
        return token

    while True:
        bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
        if bigram not in self.bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        else:
            pairs = get_pairs(word)
    word = " ".join(word)
    self.cache[token] = word
    return word
def _tokenize(self, text):
    """Tokenize a string."""
    bpe_tokens = []
    for token in re.findall(self.pat, text):
        token = "".join(
            self.byte_encoder[b] for b in token.encode("utf-8")
        )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
        bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
    return bpe_tokens
# '想要有直升機\n想要和你飛到宇宙去\n想要和你融化在一起\n融化在宇宙里\n我每天每天每'
# The dataset has over 60,000 characters. For easier printing we replace newlines with
# spaces and use only the first 10,000 characters to train the model.
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[0:10000]
idx_to_char = list(set(corpus_chars))
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
vocab_size = len(char_to_idx)
vocab_size  # 1027
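The interview question actually asked for a word-level vocabulary built from several files; a minimal sketch of that (the file paths and the whitespace tokenization are assumptions):

from collections import Counter

def build_vocab(file_paths, min_freq=1, specials=("<pad>", "<unk>")):
    """Builds word2idx / idx2word from a list of text files."""
    counter = Counter()
    for path in file_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counter.update(line.strip().split())  # naive whitespace tokenization
    # Special tokens first, then words sorted by frequency (ties broken alphabetically).
    idx2word = list(specials) + [
        w for w, c in sorted(counter.items(), key=lambda x: (-x[1], x[0])) if c >= min_freq
    ]
    word2idx = {w: i for i, w in enumerate(idx2word)}
    return word2idx, idx2word

# Example with hypothetical files:
# word2idx, idx2word = build_vocab(["a.txt", "b.txt"])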
3. Explain BERT. Partway through, the interviewer interrupted me: estimate roughly how many parameters one BERT layer has.
Explaining BERT: see question 1. The relevant modules (from the Hugging Face implementation) and the learnable weights in each:
# BertEmbeddings:
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
# BertAttention (BertSelfAttention + BertSelfOutput):
#### BertSelfAttention
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
#### BertSelfOutput
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# BertIntermediate
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
# BertOutput
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# BertPooler
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
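Putting numbers to the modules above for BERT_Base (hidden_size H=768, intermediate_size 4H=3072, vocab 30522, 512 positions, 2 token types), a rough per-layer estimate; the sketch below just multiplies out the Linear/LayerNorm shapes listed above:

H, I = 768, 3072          # hidden size and feed-forward (intermediate) size in BERT_Base

# BertSelfAttention Q, K, V projections + BertSelfOutput dense: four H x H Linears with bias.
attention = 4 * (H * H + H)
# Two LayerNorms per layer (after attention and after the FFN), each with weight + bias.
layernorms = 2 * 2 * H
# BertIntermediate (H -> 4H) and BertOutput (4H -> H), each with bias.
ffn = (H * I + I) + (I * H + H)

per_layer = attention + ffn + layernorms
print(per_layer)              # 7,087,872 -- roughly 7M parameters per encoder layer
print(12 * per_layer)         # ~85M for the 12-layer encoder

# Embeddings: word (30522) + position (512) + token type (2), each H-dimensional,
# plus the embedding LayerNorm; the pooler adds one more H x H dense.
embeddings = (30522 + 512 + 2) * H + 2 * H
pooler = H * H + H
print(12 * per_layer + embeddings + pooler)   # ~110M, matching the reported BERT_Base size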
4. What are add & norm in BERT, and what are they for?
- add: residual connection. It mainly mitigates problems such as vanishing gradients during back-propagation when the model is deep.
- norm: Layer Normalization. It addresses the problem that the distribution of activations inside the network shifts heavily, which slows down learning.
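This is exactly what BertSelfOutput / BertOutput do; a minimal PyTorch sketch of the add & norm pattern (module and variable names here are illustrative, not BERT's own):

import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by Layer Normalization, as in each BERT sub-layer."""
    def __init__(self, hidden_size, dropout=0.1, eps=1e-12):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=eps)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        # "add" (residual connection) then "norm" (layer normalization)
        return self.LayerNorm(hidden_states + input_tensor)

x = torch.randn(2, 8, 768)            # (batch, seq_len, hidden)
sublayer_out = torch.randn(2, 8, 768)
print(AddNorm(768)(sublayer_out, x).shape)   # torch.Size([2, 8, 768])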
5. Do you know BERT's extensions: RoBERTa, SpanBERT, XLM, ALBERT? Introduce a few models besides BERT.
- RoBERTa: same architecture as BERT; drops NSP, uses dynamic masking, larger batches, more data, longer training, and byte-level BPE.
- SpanBERT: masks contiguous spans instead of individual tokens and adds a Span Boundary Objective (SBO); trains on single segments without NSP.
- XLM: cross-lingual pre-training; adds Translation Language Modeling (TLM) on parallel sentence pairs on top of MLM/CLM.
- ALBERT: factorized embedding parameterization and cross-layer parameter sharing to cut parameters; replaces NSP with Sentence Order Prediction (SOP).
- Transformer-XL (extra long): segment-level recurrence with cached hidden states plus relative positional encodings, for much longer context.
- XLNet: generalized autoregressive pre-training with permutation language modeling and two-stream self-attention, built on Transformer-XL.
6. In which module of the BERT source code does masking happen, and how does BERT mask?
The attention (padding) mask is applied inside multi-head attention (BertSelfAttention), shown below; the [MASK] replacement for the MLM objective is done when the training data is generated (see the sketch after the code).
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        if encoder_hidden_states is not None:
            mixed_key_layer = self.key(encoder_hidden_states)
            mixed_value_layer = self.value(encoder_hidden_states)
            attention_mask = encoder_attention_mask
        else:
            mixed_key_layer = self.key(hidden_states)
            mixed_value_layer = self.value(hidden_states)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in the BertModel forward() function):
            # masked positions hold a large negative value, so they vanish after the softmax.
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        return outputs
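The attention_mask above only hides padding (or encoder padding for cross-attention). The [MASK]-ing the question is really about happens when MLM training data is built: 15% of token positions are selected, and of those 80% become [MASK], 10% a random token, and 10% stay unchanged. A minimal sketch of that rule, modeled on the mask_tokens helper in the Hugging Face training examples (special-token and padding handling omitted):

import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    """Prepare masked inputs/labels for MLM: 80% [MASK], 10% random, 10% original."""
    labels = inputs.clone()
    # Sample ~15% of the positions to predict.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # the loss is only computed on the selected positions

    # 80% of the selected positions -> [MASK]
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # half of the remaining 20% -> a random token
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # the final 10% keep the original token (nothing to do)
    return inputs, labels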
7. Estimate BERT's number of parameters.
Estimating BERT's parameter count: see question 3.
- BERT_Base: L=12, H=768, A=12, Total Parameters=110M.
- BERT_Large: L=24, H=1024, A=16, Total Parameters=340M.
8. How does RoBERTa differ from BERT during pre-training?
- Static vs. Dynamic Masking
- Model Input Format and Next Sentence Prediction (FULL-SENTENCES, NSP removed)
- Training with Large Batches
- Text Encoding (byte-level BPE)
9. Introduce RoBERTa; why use wwm?
RoBERTa: as above. On wwm:
Whole Word Masking (wwm) is an upgrade to BERT released by Google on May 31, 2019; it mainly changes how training samples are generated in the pre-training stage. In short, the original WordPiece tokenization splits a full word into several subwords, and when training samples are generated those subwords are masked independently at random. With whole-word masking, if any WordPiece subword of a word is masked, the other subwords belonging to the same word are masked as well, i.e. the whole word is masked.
Note that "mask" here is meant in the broad sense (replace with [MASK]; keep the original token; replace with a random token), not only the case of replacing the token with the [MASK] label. For a more detailed explanation and examples see: #4
Likewise, because the official Google BERT-base, Chinese model tokenizes Chinese at character granularity, it does not take Chinese word segmentation (CWS), as used in traditional NLP, into account. We applied whole-word masking to Chinese, trained on Chinese Wikipedia (both simplified and traditional), and used the HIT LTP toolkit as the word segmenter, i.e. all characters that make up the same word are masked together.
The following shows how whole-word Mask samples are generated. Note: for ease of understanding, only replacement with the [MASK] label is considered.
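A minimal sketch (not the original repo's sample table) of the grouping step behind wwm, using English-style "##" WordPiece prefixes; for Chinese wwm the groups would instead come from the LTP word segmenter:

import random

def whole_word_mask_indices(tokens, mask_ratio=0.15, seed=0):
    """Group '##' sub-tokens with their preceding token, then mask whole groups."""
    random.seed(seed)
    # Build word groups: a '##' piece always belongs to the previous group.
    groups, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                groups.append(current)
            current = [i]
    if current:
        groups.append(current)
    # Pick whole groups until ~mask_ratio of the tokens are covered.
    random.shuffle(groups)
    masked, budget = [], max(1, int(len(tokens) * mask_ratio))
    for g in groups:
        if len(masked) >= budget:
            break
        masked.extend(g)
    return sorted(masked)

tokens = ["the", "un", "##aff", "##able", "philosopher", "speaks"]
idx = whole_word_mask_indices(tokens, mask_ratio=0.5)
print([("[MASK]" if i in idx else t) for i, t in enumerate(tokens)])
# e.g. ['the', '[MASK]', '[MASK]', '[MASK]', 'philosopher', 'speaks'] -- all pieces of a word are masked together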
10. Differences between BERT, GPT and ELMO (model structure, training method).
- BERT: bidirectional two-stage pre-trained model: Pre-training (MLM, NSP) + Fine-tuning (...).
- GPT: unidirectional two-stage pre-trained model: Pre-training (LM) + Fine-tuning (...).
- ELMO: pre-trained in two directions: Pre-training (two unidirectional LMs) + supervised NLP tasks (using the representations from each LSTM layer as features).
11. Why does BERT use only the Transformer encoder and not the decoder?
In pre-training BERT uses the Masked Language Model (an autoencoding, AE, objective), which predicts the [MASK]-ed tokens jointly from left and right context. The decoder is built around a unidirectional (causal) language model (LM) and cannot condition on both sides at once. That is why BERT uses the encoder rather than the decoder.
12. How do XLNet and BERT differ? This covers autoregressive (AR) vs. autoencoding (AE) models, XLNet's permutation language model, and two-stream attention.
BERT is an autoencoding (AE) model: it corrupts the input with [MASK] and reconstructs it, so it sees bidirectional context, but it assumes the masked tokens are independent of each other and the [MASK] symbol never appears at fine-tuning time. XLNet is autoregressive (AR), but through permutation language modeling it still captures bidirectional context, combining the strengths of AR and AE; two-stream attention (a content stream plus a query stream that sees only the position, not the content, of the token being predicted) is what makes the permutation objective trainable.