Task04 Implementing the BERT Model

1 BertTokenizer (Tokenization)

Components: BasicTokenizer and WordpieceTokenizer

What BasicTokenizer does:

Splits sentences on punctuation and whitespace; Chinese characters are pre-processed (spaces are inserted around them) so that the text is split character by character

Keeps any token listed in never_split from being split

Optionally lowercases the text

Removes invalid characters

What WordpieceTokenizer does:

Further splits words into subwords

A subword sits between a character and a whole word: it preserves the meaning of the word while avoiding both the vocabulary explosion caused by English plural and tense variants and the out-of-vocabulary (OOV) problem

Separating word stems from inflectional affixes shrinks the vocabulary and makes training easier

Commonly used BertTokenizer methods:

from_pretrained: initializes a tokenizer from a directory containing the vocabulary file (vocab.txt);

tokenize: splits text (a word or a sentence) into a list of subwords;

convert_tokens_to_ids: converts a list of subwords into the corresponding vocabulary indices;

convert_ids_to_tokens: the inverse of the previous method;

convert_tokens_to_string: joins a list of subwords back into a word or sentence by merging the "##" pieces;

encode:

for a single sentence, splits it into subwords, adds the special tokens to form the "[CLS], x, [SEP]" structure, and converts the result into vocabulary indices;

for a pair of sentences (only the first two are used if more are given), splits them and adds the special tokens to form "[CLS], x1, [SEP], x2, [SEP]" before converting to indices;

decode: turns the output of encode back into a complete sentence.
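A short demonstration of these methods (a sketch that assumes the bert-base-uncased vocabulary can be downloaded; the exact subword splits and ids depend on that vocabulary file):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("BERT embeddings")      # e.g. ['bert', 'em', '##bed', '##ding', '##s']
ids = tokenizer.convert_tokens_to_ids(tokens)
assert tokenizer.convert_ids_to_tokens(ids) == tokens
print(tokenizer.convert_tokens_to_string(tokens))   # 'bert embeddings' (## pieces merged back)

encoded = tokenizer.encode("hello world", "how are you")
print(tokenizer.decode(encoded))                    # '[CLS] hello world [SEP] how are you [SEP]'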

2 BertModel (the BERT model itself)

Structure: mainly a Transformer encoder, consisting of

embeddings: an instance of BertEmbeddings, which maps input token ids to their vector representations;

encoder: an instance of BertEncoder;

pooler: an instance of BertPooler; this part is optional

Commonly used BertModel methods:

get_input_embeddings: returns the word_embeddings part of the embedding module, i.e. the word vectors;

set_input_embeddings: assigns new weights to the word_embeddings part of the embedding module;

_prune_heads: prunes attention heads; it takes a dict of the form {layer_num: list of heads to prune in this layer} and removes the specified heads from the given layers.
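A quick demonstration of these helpers (a sketch assuming the pretrained weights can be downloaded; the pruned head indices are purely illustrative). prune_heads is the public wrapper that calls the _prune_heads method described above:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.get_input_embeddings())   # e.g. Embedding(30522, 768, padding_idx=0)

# Remove heads 0 and 1 from layer 0, and head 2 from layer 11 (illustrative indices).
model.prune_heads({0: [0, 1], 11: [2]})
print(model.encoder.layer[0].attention.self.num_attention_heads)   # 12 - 2 = 10 heads remain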

2.1 BertEmbeddings

Output: the sum of three parts, word_embeddings, token_type_embeddings and position_embeddings, passed through a LayerNorm + Dropout layer; the result has shape (batch_size, sequence_length, hidden_size)

word_embeddings: the embeddings of the subwords

token_type_embeddings: indicates which sentence the current token belongs to, distinguishing the two sentences of a pair (and the padding) from each other

position_embeddings: the position embedding of each token, which lets the model distinguish word order

Why LayerNorm + Dropout is needed:

layer normalization puts the embeddings into a distribution centred at the origin with standard deviation 1, a spherical space that becomes sparser towards the outside
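A minimal sketch of this sum, using hypothetical tensors and the bert-base-uncased sizes; it mirrors the description above rather than the exact BertEmbeddings code:

import torch
from torch import nn

vocab_size, max_pos, type_vocab, hidden = 30522, 512, 2, 768
word_emb = nn.Embedding(vocab_size, hidden)
pos_emb = nn.Embedding(max_pos, hidden)
type_emb = nn.Embedding(type_vocab, hidden)
layer_norm = nn.LayerNorm(hidden, eps=1e-12)
dropout = nn.Dropout(0.1)

input_ids = torch.tensor([[101, 7592, 2088, 102]])           # (batch_size, sequence_length)
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # 0, 1, 2, ...
token_type_ids = torch.zeros_like(input_ids)                 # single sentence -> all zeros

embeddings = word_emb(input_ids) + pos_emb(position_ids) + type_emb(token_type_ids)
embeddings = dropout(layer_norm(embeddings))
print(embeddings.shape)   # torch.Size([1, 4, 768])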

2.2 BertEncoder

Technique worth noting: gradient checkpointing, which reduces memory usage by saving fewer nodes of the computation graph and recomputing them during the backward pass
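A minimal illustration of the idea with torch.utils.checkpoint and a toy feed-forward block (a sketch, not the actual BertEncoder code): activations inside the checkpointed call are not kept and are recomputed on the backward pass, trading compute for memory.

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

layer = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
x = torch.rand(2, 16, 768, requires_grad=True)

out = checkpoint(layer, x)   # forward runs normally, activations are recomputed in backward
out.sum().backward()
print(x.grad.shape)          # torch.Size([2, 16, 768])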

2.2.1 BertAttention

BertSelfAttention

Initialization: checks that the hidden size and the number of attention heads are compatible, then sets up the corresponding parameters

Forward pass:

the basic formula of multi-head self-attention:

\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \\ \text{head}_i = \text{SDPA}(QW_i^Q, KW_i^K, VW_i^V) \\ \text{SDPA}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

transpose_for_scores: reshapes hidden_size into the multi-head layout and swaps the middle two dimensions so the matrix multiplications can be done per head (see the sketch below)

torch.einsum: sums products of the input elements according to a subscript expression

position_embedding_type:

absolute: the default; nothing extra needs to be done

relative_key: adds a relative-position term computed against the key layer

relative_key_query: applies the relative position embedding to both the query and the key layers
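A standalone sketch of the per-head computation above (the sizes are illustrative and the tensors random; torch.einsum is shown reproducing the same dot product as the matmul form):

import math
import torch

batch, seq_len, hidden, num_heads = 2, 5, 768, 12
head_size = hidden // num_heads

def transpose_for_scores(x):
    # (batch, seq_len, hidden) -> (batch, num_heads, seq_len, head_size)
    return x.view(batch, seq_len, num_heads, head_size).permute(0, 2, 1, 3)

q = transpose_for_scores(torch.rand(batch, seq_len, hidden))
k = transpose_for_scores(torch.rand(batch, seq_len, hidden))
v = transpose_for_scores(torch.rand(batch, seq_len, hidden))

scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(head_size)   # QK^T / sqrt(d_k)
probs = scores.softmax(dim=-1)                                         # attention weights
context = torch.matmul(probs, v)
print(context.shape)   # torch.Size([2, 12, 5, 64])

# torch.einsum expresses the same dot product through subscript notation:
scores_einsum = torch.einsum("bhld,bhrd->bhlr", q, k) / math.sqrt(head_size)
print(torch.allclose(scores, scores_einsum))   # True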

BertSelfOutput:

the forward pass uses the LayerNorm + Dropout combination together with a residual connection; the residual eases the training difficulty that comes with very deep networks and keeps the layer more sensitive to its original input.

2.2.2 BertIntermediate

Main structure: a fully connected layer followed by an activation

Fully connected layer: expands the hidden dimension to intermediate_size

Activation: the default activation is gelu, computed with a tanh-based approximation
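The tanh-based approximation usually quoted for GELU, written out as a sketch (transformers also ships an exact erf-based variant):

import math
import torch

def gelu_tanh_approx(x):
    # 0.5 * x * (1 + tanh( sqrt(2/pi) * (x + 0.044715 * x^3) ))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

x = torch.linspace(-3, 3, 7)
print(gelu_tanh_approx(x))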

2.2.3 BertOutput

Main structure: a fully connected layer, Dropout + LayerNorm, and a residual connection

2.3 BertPooler

Main role: takes the first token of the sentence, i.e. the vector corresponding to [CLS], and passes it through a fully connected layer and an activation function to produce the output.

3 Hands-on practice

3.1 BertTokenizer code

import collections
import os
import unicodedata
from typing import List, Optional, Tuple

from transformers.tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace
from transformers.utils import logging

logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt",
    }
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "bert-base-uncased": 512,
}

PRETRAINED_INIT_CONFIGURATION = {
    "bert-base-uncased": {"do_lower_case": True},
}


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        token = token.rstrip("\n")
        vocab[token] = index
    return vocab


def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens


class BertTokenizer(PreTrainedTokenizer):
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

    def __init__(
        self,
        vocab_file,
        do_lower_case=True,
        do_basic_tokenize=True,
        never_split=None,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        tokenize_chinese_chars=True,
        strip_accents=None,
        **kwargs
    ):
        super().__init__(
            do_lower_case=do_lower_case,
            do_basic_tokenize=do_basic_tokenize,
            never_split=never_split,
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            tokenize_chinese_chars=tokenize_chinese_chars,
            strip_accents=strip_accents,
            **kwargs,
        )

        if not os.path.isfile(vocab_file):
            raise ValueError(
                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
                "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
            )
        self.vocab = load_vocab(vocab_file)
        self.ids_to_tokens = collections.OrderedDict(
            [(ids, tok) for tok, ids in self.vocab.items()])
        self.do_basic_tokenize = do_basic_tokenize
        if do_basic_tokenize:
            self.basic_tokenizer = BasicTokenizer(
                do_lower_case=do_lower_case,
                never_split=never_split,
                tokenize_chinese_chars=tokenize_chinese_chars,
                strip_accents=strip_accents,
            )
        self.wordpiece_tokenizer = WordpieceTokenizer(
            vocab=self.vocab, unk_token=self.unk_token)

    @property
    def do_lower_case(self):
        return self.basic_tokenizer.do_lower_case

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)

    def _tokenize(self, text):
        split_tokens = []
        if self.do_basic_tokenize:
            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
                # If the token is part of the never_split set
                if token in self.basic_tokenizer.never_split:
                    split_tokens.append(token)
                else:
                    split_tokens += self.wordpiece_tokenizer.tokenize(token)
        else:
            split_tokens = self.wordpiece_tokenizer.tokenize(text)
        return split_tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.vocab.get(token, self.vocab.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.ids_to_tokens.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        out_string = " ".join(tokens).replace(" ##", "").strip()
        return out_string

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A BERT sequence has the following format:

        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        cls = [self.cls_token_id]
        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer ``prepare_for_model`` method.

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )
        if token_ids_1 is not None:
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence
        pair mask has the following format:

        ::

            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
            | first sequence    | second sequence |

        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
            sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        index = 0
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") +
                VOCAB_FILES_NAMES["vocab_file"]
            )
        else:
            vocab_file = (filename_prefix +
                          "-" if filename_prefix else "") + save_directory
        with open(vocab_file, "w", encoding="utf-8") as writer:
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
                        " Please check that the vocabulary is not corrupted!"
                    )
                    index = token_index
                writer.write(token + "\n")
                index += 1
        return (vocab_file,)


class BasicTokenizer(object):
    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
        if never_split is None:
            never_split = []
        self.do_lower_case = do_lower_case
        self.never_split = set(never_split)
        self.tokenize_chinese_chars = tokenize_chinese_chars
        self.strip_accents = strip_accents

    def tokenize(self, text, never_split=None):
        """
        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
        WordPieceTokenizer.

        Args:
            **never_split**: (`optional`) list of str
                Kept for backward compatibility purposes. Now implemented directly at the base class level (see
                :func:`PreTrainedTokenizer.tokenize`) List of token not to split.
        """
        # union() returns a new set by concatenating the two sets.
        never_split = self.never_split.union(
            set(never_split)) if never_split else self.never_split
        text = self._clean_text(text)

        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        if self.tokenize_chinese_chars:
            text = self._tokenize_chinese_chars(text)
        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if token not in never_split:
                if self.do_lower_case:
                    token = token.lower()
                    if self.strip_accents is not False:
                        token = self._run_strip_accents(token)
                elif self.strip_accents:
                    token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token, never_split))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text, never_split=None):
        """Splits punctuation on a piece of text."""
        if never_split is not None and text in never_split:
            return [text]
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1
        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like the all of the other languages.
        if (
            (cp >= 0x4E00 and cp <= 0x9FFF)
            or (cp >= 0x3400 and cp <= 0x4DBF)
            or (cp >= 0x20000 and cp <= 0x2A6DF)
            or (cp >= 0x2A700 and cp <= 0x2B73F)
            or (cp >= 0x2B740 and cp <= 0x2B81F)
            or (cp >= 0x2B820 and cp <= 0x2CEAF)
            or (cp >= 0xF900 and cp <= 0xFAFF)
            or (cp >= 0x2F800 and cp <= 0x2FA1F)
        ):
            return True
        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xFFFD or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)


class WordpieceTokenizer(object):
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """
        Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
        tokenization using the given vocabulary.

        For example, :obj:`input = "unaffable"` will return as output :obj:`["un", "##aff", "##able"]`.

        Args:
          text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer`.

        Returns:
          A list of wordpiece tokens.
        """
        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens


bt = BertTokenizer.from_pretrained('bert-base-uncased')

bt('I like natural language progressing!')


{'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

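Running decode on the input_ids above reverses the process; decode keeps the special tokens by default, and the uncased model returns lowercased text (output shown roughly as I would expect it from bert-base-uncased):

bt.decode([101, 1045, 2066, 3019, 2653, 27673, 999, 102])
# roughly: '[CLS] i like natural language progressing! [SEP]'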

3.2 BertSelfAttention

import math

import torch
from torch import nn


class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(
            config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = getattr(
            config, "position_embedding_type", "absolute")
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(
                2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[
            :-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    ):
        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        is_cross_attention = encoder_hidden_states is not None

        if is_cross_attention and past_key_value is not None:
            # reuse k,v, cross_attentions
            key_layer = past_key_value[0]
            value_layer = past_key_value[1]
            attention_mask = encoder_attention_mask
        elif is_cross_attention:
            key_layer = self.transpose_for_scores(
                self.key(encoder_hidden_states))
            value_layer = self.transpose_for_scores(
                self.value(encoder_hidden_states))
            attention_mask = encoder_attention_mask
        elif past_key_value is not None:
            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))
            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
        else:
            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))

        query_layer = self.transpose_for_scores(mixed_query_layer)

        if self.is_decoder:
            # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
            # Further calls to cross_attention layer can then reuse all cross-attention
            # key/value_states (first "if" case)
            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
            # all previous decoder key/value_states. Further calls to uni-directional self-attention
            # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
            # if encoder bi-directional self-attention `past_key_value` is always `None`
            past_key_value = (key_layer, value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(
            query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            seq_length = hidden_states.size()[1]
            position_ids_l = torch.arange(
                seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(
                seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r
            positional_embedding = self.distance_embedding(
                distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(
                dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum(
                    "bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum(
                    "bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum(
                    "bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + \
                    relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / \
            math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in the BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[
            :-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (
            context_layer,)

        if self.is_decoder:
            outputs = outputs + (past_key_value,)
        return outputs

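A quick smoke test of the BertSelfAttention module defined above (a sketch; it reuses the bert-base-uncased configuration, which is also used in Section 3.5, and leaves attention_mask and head_mask as None):

from transformers.models.bert.configuration_bert import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
self_attn = BertSelfAttention(config)
hidden_states = torch.rand(1, 4, config.hidden_size)      # (batch, seq_len, hidden)
outputs = self_attn(hidden_states, output_attentions=True)
print(outputs[0].shape)   # context layer: torch.Size([1, 4, 768])
print(outputs[1].shape)   # attention probabilities: torch.Size([1, 12, 4, 4])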

3.3 BertSelfOutput

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(
            config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states


3.4 BertOutput

class BertOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(
            config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states


3.5 BertPooler

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output


from transformers.models.bert.configuration_bert import *
import torch

# Load the configuration
config = BertConfig.from_pretrained("bert-base-uncased")
bert_pooler = BertPooler(config=config)
print("input to bert pooler size: {}".format(config.hidden_size))

# Run bert_pooler on a random input
batch_size = 1
seq_len = 2
hidden_size = 768
x = torch.rand(batch_size, seq_len, hidden_size)
y = bert_pooler(x)
print(y.size())


input to bert pooler size: 768

torch.Size([1, 768])


4 Summary

This task walked through the BERT source code, covering BertTokenizer and BertModel. BertTokenizer splits sentences and breaks them into subwords. BertModel is the model itself and consists of three parts: BertEmbeddings, BertEncoder and BertPooler. BertEmbeddings builds the input representation from the word, position and token_type embeddings; BertEncoder is made up of BertAttention, BertIntermediate and BertOutput; BertPooler extracts the first token of the sentence. The whole walkthrough is best read together with the BERT architecture notes from Task03.
