1 BertTokenizer (tokenization)
Components: BasicTokenizer and WordPieceTokenizer
Main responsibilities of BasicTokenizer:
splits sentences on punctuation and whitespace; for Chinese characters, it falls back to character-level splitting by inserting spaces around each character during preprocessing
never_split can be used to keep specified tokens from being split
optionally lower-cases the text
removes invalid characters
Main responsibilities of WordPieceTokenizer:
further splits words into subwords
a subword sits between a character and a word: it preserves the meaning of the word while avoiding both the vocabulary explosion caused by English inflection (plurals, tenses) and the OOV problem of unseen words
splitting stems from inflectional affixes shrinks the vocabulary and makes training easier
Common BertTokenizer methods (a short usage sketch follows this list):
from_pretrained: initializes a tokenizer from a directory containing the vocabulary file (vocab.txt);
tokenize: splits text (a word or a sentence) into a list of subwords;
convert_tokens_to_ids: maps a list of subwords to their indices in the vocabulary;
convert_ids_to_tokens: the inverse of the previous method;
convert_tokens_to_string: joins a subword list back into a word or sentence by merging the "##" pieces;
encode:
for a single sentence, tokenizes it, adds the special tokens to form "[CLS], x, [SEP]", and converts the result into a list of vocabulary indices;
for a sentence pair (only the first two sentences are kept if more are given), tokenizes both, adds special tokens to form "[CLS], x1, [SEP], x2, [SEP]", and converts the result into a list of indices;
decode: turns the output of encode back into a complete sentence.
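As a quick illustration of the methods above, here is a minimal sketch (it assumes the bert-base-uncased vocabulary can be downloaded; the example word is taken from the WordpieceTokenizer docstring further below):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("unaffable")           # greedy longest-match split into subwords
ids = tokenizer.convert_tokens_to_ids(tokens)      # subword list -> vocabulary indices
print(tokenizer.convert_ids_to_tokens(ids))        # indices -> subword list
print(tokenizer.convert_tokens_to_string(tokens))  # merge the "##" pieces back into "unaffable"

ids_with_special = tokenizer.encode("unaffable")   # adds [CLS]/[SEP] and converts to indices
print(tokenizer.decode(ids_with_special))          # back to a sentence, including [CLS]/[SEP]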
2 BertModel (the BERT model itself)
Components: essentially a Transformer encoder
embeddings: an instance of BertEmbeddings, which maps token ids to their vector representations;
encoder: an instance of BertEncoder;
pooler: an instance of BertPooler; this part is optional
Common BertModel methods (a short usage sketch follows this list):
get_input_embeddings: returns the word_embeddings inside the embedding module, i.e. the token-embedding table;
set_input_embeddings: assigns a new word_embeddings table to the embedding module;
_prune_heads: prunes attention heads; it takes a dict of the form {layer_num: list of heads to prune in this layer} and removes the specified heads from the specified layers.
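A minimal sketch of these methods (it assumes the bert-base-uncased weights can be downloaded; the pruning dict is purely illustrative):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

emb = model.get_input_embeddings()   # an nn.Embedding of shape (vocab_size, hidden_size)
print(emb.weight.shape)              # torch.Size([30522, 768]) for bert-base-uncased

# the public prune_heads method delegates to _prune_heads:
# drop heads 0 and 1 in layer 0, and head 2 in layer 11
model.prune_heads({0: [0, 1], 11: [2]})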
2.1 BertEmbeddings
Output: the sum of three parts, word_embeddings, token_type_embeddings and position_embeddings, passed through a LayerNorm + Dropout layer; the result has shape (batch_size, sequence_length, hidden_size); a simplified sketch of this computation follows this subsection
word_embeddings: the embeddings of the subwords themselves
token_type_embeddings: indicate which sentence the current token belongs to, distinguishing the sentences from padding and the two sentences of a pair from each other
position_embeddings: the position embedding of each token in the sentence, used so that word order is not lost
Why LayerNorm + Dropout is needed:
after layer normalization, the embeddings are distributed in a ball-shaped region centered at the origin with standard deviation 1, becoming sparser towards the outside
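The following simplified re-implementation (a sketch, not the library class; it assumes the default bert-base-uncased sizes) shows how the three embeddings are summed and normalized:

import torch
from torch import nn

class SimpleBertEmbeddings(nn.Module):
    """Sum of word, position and token_type embeddings, then LayerNorm + Dropout."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512, type_vocab_size=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_len, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids=None):
        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        embeddings = (self.word_embeddings(input_ids)
                      + self.position_embeddings(position_ids)
                      + self.token_type_embeddings(token_type_ids))
        return self.dropout(self.LayerNorm(embeddings))

x = torch.randint(0, 30522, (1, 8))
print(SimpleBertEmbeddings()(x).shape)  # torch.Size([1, 8, 768])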
2.2 BertEncoder
Technical aside: gradient checkpointing saves memory by keeping fewer nodes of the computation graph and recomputing the dropped activations during the backward pass
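BertEncoder wraps each layer call with torch.utils.checkpoint when gradient checkpointing is turned on; the generic sketch below (not the BertEncoder code itself) shows the mechanism on a plain stack of linear layers:

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

layers = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
x = torch.rand(4, 768, requires_grad=True)

hidden = x
for layer in layers:
    # activations inside `layer` are not stored; they are recomputed during backward
    hidden = checkpoint(layer, hidden)
hidden.sum().backward()
print(x.grad.shape)  # torch.Size([4, 768])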
2.2.1 BertAttention
BertSelfAttention
Initialization: checks that the hidden size is an integer multiple of the number of attention heads, then assigns the various parameters (query/key/value projections, dropout, position-embedding type)
Forward pass:
the basic formulas of multi-head self-attention:
\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \\ \text{head}_i = \text{SDPA}(QW_i^Q, KW_i^K, VW_i^V) \\ \text{SDPA}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
transpose_for_scores: reshapes the hidden_size dimension into (num_attention_heads, attention_head_size) and swaps the middle two dimensions so that the matrix multiplications can be done per head in one batched matmul
torch.einsum: sums over products of the input elements according to an index-notation (Einstein summation) string
position_embedding_type:
absolute: the default; no extra processing is needed
relative_key: a relative-distance embedding acts as an additional key, and its product with the query is added to the attention scores
relative_key_query: both a query-with-distance term and a key-with-distance term are added to the attention scores (see the sketch below)
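To make transpose_for_scores and the "bhld,lrd->bhlr" einsum concrete, here is a small sketch on random tensors (the dimensions are illustrative, not tied to a particular checkpoint):

import math
import torch

batch, seq_len, num_heads, head_size = 2, 5, 12, 64
hidden = torch.rand(batch, seq_len, num_heads * head_size)

def transpose_for_scores(x):
    # (batch, seq, hidden) -> (batch, heads, seq, head_size)
    return x.view(batch, seq_len, num_heads, head_size).permute(0, 2, 1, 3)

q = transpose_for_scores(hidden)
k = transpose_for_scores(hidden)

# scaled dot-product attention scores: (batch, heads, seq, seq)
scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(head_size)

# relative-position term: contract head_size between q (bhld) and a (seq, seq, head_size)
# distance embedding (lrd); the result has the same shape as the attention scores
positional_embedding = torch.rand(seq_len, seq_len, head_size)
rel_scores = torch.einsum("bhld,lrd->bhlr", q, positional_embedding)
print(scores.shape, rel_scores.shape)  # both torch.Size([2, 12, 5, 5])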
BertSelfOutput:
The forward pass applies a dense layer followed by Dropout + LayerNorm; the residual connection eases the training difficulty brought by stacking many layers and keeps the output sensitive to the original input.
2.2.2 BertIntermediate
Main structure: a fully connected layer plus an activation
Fully connected layer: expands the hidden dimension to intermediate_size
Activation: the default activation is gelu, which is approximated with a tanh-based expression (given below)
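The tanh-based approximation of GELU used by the original BERT code is:
\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)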
2.2.3 BertOutput
Main structure: a fully connected layer (projecting intermediate_size back to hidden_size), Dropout + LayerNorm, and a residual connection
2.3 BertPooler
Main role: takes the first token of the sequence, i.e. the vector corresponding to [CLS], and passes it through a fully connected layer and a Tanh activation to produce the pooled output.
3 Hands-on practice
3.1 BertTokenizer code
import collections
import os
import unicodedata
from typing import List, Optional, Tuple
from transformers.tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace
from transformers.utils import logging
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
PRETRAINED_VOCAB_FILES_MAP = {
? ? "vocab_file": {
? ? ? ? "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt",
? ? }
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
? ? "bert-base-uncased": 512,
}
PRETRAINED_INIT_CONFIGURATION = {
? ? "bert-base-uncased": {"do_lower_case": True},
}
def load_vocab(vocab_file):
? ? """Loads a vocabulary file into a dictionary."""
? ? vocab = collections.OrderedDict()
? ? with open(vocab_file, "r", encoding="utf-8") as reader:
? ? ? ? tokens = reader.readlines()
? ? for index, token in enumerate(tokens):
? ? ? ? token = token.rstrip("\n")
? ? ? ? vocab[token] = index
? ? return vocab
def whitespace_tokenize(text):
? ? """Runs basic whitespace cleaning and splitting on a piece of text."""
? ? text = text.strip()
? ? if not text:
? ? ? ? return []
? ? tokens = text.split()
? ? return tokens
class BertTokenizer(PreTrainedTokenizer):
? ? vocab_files_names = VOCAB_FILES_NAMES
? ? pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
? ? pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
? ? max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
? ? def __init__(
? ? ? ? self,
? ? ? ? vocab_file,
? ? ? ? do_lower_case=True,
? ? ? ? do_basic_tokenize=True,
? ? ? ? never_split=None,
? ? ? ? unk_token="[UNK]",
? ? ? ? sep_token="[SEP]",
? ? ? ? pad_token="[PAD]",
? ? ? ? cls_token="[CLS]",
? ? ? ? mask_token="[MASK]",
? ? ? ? tokenize_chinese_chars=True,
? ? ? ? strip_accents=None,
? ? ? ? **kwargs
? ? ):
? ? ? ? super().__init__(
? ? ? ? ? ? do_lower_case=do_lower_case,
? ? ? ? ? ? do_basic_tokenize=do_basic_tokenize,
? ? ? ? ? ? never_split=never_split,
? ? ? ? ? ? unk_token=unk_token,
? ? ? ? ? ? sep_token=sep_token,
? ? ? ? ? ? pad_token=pad_token,
? ? ? ? ? ? cls_token=cls_token,
? ? ? ? ? ? mask_token=mask_token,
? ? ? ? ? ? tokenize_chinese_chars=tokenize_chinese_chars,
? ? ? ? ? ? strip_accents=strip_accents,
? ? ? ? ? ? **kwargs,
? ? ? ? )
? ? ? ? if not os.path.isfile(vocab_file):
? ? ? ? ? ? raise ValueError(
? ? ? ? ? ? ? ? f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
? ? ? ? ? ? ? ? "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
? ? ? ? ? ? )
? ? ? ? self.vocab = load_vocab(vocab_file)
? ? ? ? self.ids_to_tokens = collections.OrderedDict(
? ? ? ? ? ? [(ids, tok) for tok, ids in self.vocab.items()])
? ? ? ? self.do_basic_tokenize = do_basic_tokenize
? ? ? ? if do_basic_tokenize:
? ? ? ? ? ? self.basic_tokenizer = BasicTokenizer(
? ? ? ? ? ? ? ? do_lower_case=do_lower_case,
? ? ? ? ? ? ? ? never_split=never_split,
? ? ? ? ? ? ? ? tokenize_chinese_chars=tokenize_chinese_chars,
? ? ? ? ? ? ? ? strip_accents=strip_accents,
? ? ? ? ? ? )
? ? ? ? self.wordpiece_tokenizer = WordpieceTokenizer(
? ? ? ? ? ? vocab=self.vocab, unk_token=self.unk_token)
? ? @property
? ? def do_lower_case(self):
? ? ? ? return self.basic_tokenizer.do_lower_case
? ? @property
? ? def vocab_size(self):
? ? ? ? return len(self.vocab)
? ? def get_vocab(self):
? ? ? ? return dict(self.vocab, **self.added_tokens_encoder)
? ? def _tokenize(self, text):
? ? ? ? split_tokens = []
? ? ? ? if self.do_basic_tokenize:
? ? ? ? ? ? for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
? ? ? ? ? ? ? ? # If the token is part of the never_split set
? ? ? ? ? ? ? ? if token in self.basic_tokenizer.never_split:
? ? ? ? ? ? ? ? ? ? split_tokens.append(token)
? ? ? ? ? ? ? ? else:
? ? ? ? ? ? ? ? ? ? split_tokens += self.wordpiece_tokenizer.tokenize(token)
? ? ? ? else:
? ? ? ? ? ? split_tokens = self.wordpiece_tokenizer.tokenize(text)
? ? ? ? return split_tokens
? ? def _convert_token_to_id(self, token):
? ? ? ? """Converts a token (str) in an id using the vocab."""
? ? ? ? return self.vocab.get(token, self.vocab.get(self.unk_token))
? ? def _convert_id_to_token(self, index):
? ? ? ? """Converts an index (integer) in a token (str) using the vocab."""
? ? ? ? return self.ids_to_tokens.get(index, self.unk_token)
? ? def convert_tokens_to_string(self, tokens):
? ? ? ? """Converts a sequence of tokens (string) in a single string."""
? ? ? ? out_string = " ".join(tokens).replace(" ##", "").strip()
? ? ? ? return out_string
? ? def build_inputs_with_special_tokens(
? ? ? ? self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
? ? ) -> List[int]:
? ? ? ? """
? ? ? ? Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
? ? ? ? adding special tokens. A BERT sequence has the following format:
? ? ? ? - single sequence: ``[CLS] X [SEP]``
? ? ? ? - pair of sequences: ``[CLS] A [SEP] B [SEP]``
? ? ? ? Args:
? ? ? ? ? ? token_ids_0 (:obj:`List[int]`):
? ? ? ? ? ? ? ? List of IDs to which the special tokens will be added.
? ? ? ? ? ? token_ids_1 (:obj:`List[int]`, `optional`):
? ? ? ? ? ? ? ? Optional second list of IDs for sequence pairs.
? ? ? ? Returns:
? ? ? ? ? ? :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
? ? ? ? """
? ? ? ? if token_ids_1 is None:
? ? ? ? ? ? return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
? ? ? ? cls = [self.cls_token_id]
? ? ? ? sep = [self.sep_token_id]
? ? ? ? return cls + token_ids_0 + sep + token_ids_1 + sep
? ? def get_special_tokens_mask(
? ? ? ? self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
? ? ) -> List[int]:
? ? ? ? """
? ? ? ? Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
? ? ? ? special tokens using the tokenizer ``prepare_for_model`` method.
? ? ? ? Args:
? ? ? ? ? ? token_ids_0 (:obj:`List[int]`):
? ? ? ? ? ? ? ? List of IDs.
? ? ? ? ? ? token_ids_1 (:obj:`List[int]`, `optional`):
? ? ? ? ? ? ? ? Optional second list of IDs for sequence pairs.
? ? ? ? ? ? already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
? ? ? ? ? ? ? ? Whether or not the token list is already formatted with special tokens for the model.
? ? ? ? Returns:
? ? ? ? ? ? :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
? ? ? ? """
? ? ? ? if already_has_special_tokens:
? ? ? ? ? ? return super().get_special_tokens_mask(
? ? ? ? ? ? ? ? token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
? ? ? ? ? ? )
? ? ? ? if token_ids_1 is not None:
? ? ? ? ? ? return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
? ? ? ? return [1] + ([0] * len(token_ids_0)) + [1]
? ? def create_token_type_ids_from_sequences(
? ? ? ? self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
? ? ) -> List[int]:
? ? ? ? """
? ? ? ? Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence
? ? ? ? pair mask has the following format:
? ? ? ? ::
? ? ? ? ? ? 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
? ? ? ? ? ? | first sequence? ? | second sequence |
? ? ? ? If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
? ? ? ? Args:
? ? ? ? ? ? token_ids_0 (:obj:`List[int]`):
? ? ? ? ? ? ? ? List of IDs.
? ? ? ? ? ? token_ids_1 (:obj:`List[int]`, `optional`):
? ? ? ? ? ? ? ? Optional second list of IDs for sequence pairs.
? ? ? ? Returns:
? ? ? ? ? ? :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
? ? ? ? ? ? sequence(s).
? ? ? ? """
? ? ? ? sep = [self.sep_token_id]
? ? ? ? cls = [self.cls_token_id]
? ? ? ? if token_ids_1 is None:
? ? ? ? ? ? return len(cls + token_ids_0 + sep) * [0]
? ? ? ? return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
? ? def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
? ? ? ? index = 0
? ? ? ? if os.path.isdir(save_directory):
? ? ? ? ? ? vocab_file = os.path.join(
? ? ? ? ? ? ? ? save_directory, (filename_prefix + "-" if filename_prefix else "") +
? ? ? ? ? ? ? ? VOCAB_FILES_NAMES["vocab_file"]
? ? ? ? ? ? )
? ? ? ? else:
? ? ? ? ? ? vocab_file = (filename_prefix +
? ? ? ? ? ? ? ? ? ? ? ? ? "-" if filename_prefix else "") + save_directory
? ? ? ? with open(vocab_file, "w", encoding="utf-8") as writer:
? ? ? ? ? ? for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
? ? ? ? ? ? ? ? if index != token_index:
? ? ? ? ? ? ? ? ? ? logger.warning(
? ? ? ? ? ? ? ? ? ? ? ? f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
? ? ? ? ? ? ? ? ? ? ? ? " Please check that the vocabulary is not corrupted!"
? ? ? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? ? ? index = token_index
? ? ? ? ? ? ? ? writer.write(token + "\n")
? ? ? ? ? ? ? ? index += 1
? ? ? ? return (vocab_file,)
class BasicTokenizer(object):
? ? def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
? ? ? ? if never_split is None:
? ? ? ? ? ? never_split = []
? ? ? ? self.do_lower_case = do_lower_case
? ? ? ? self.never_split = set(never_split)
? ? ? ? self.tokenize_chinese_chars = tokenize_chinese_chars
? ? ? ? self.strip_accents = strip_accents
? ? def tokenize(self, text, never_split=None):
? ? ? ? """
? ? ? ? Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
? ? ? ? WordPieceTokenizer.
? ? ? ? Args:
? ? ? ? ? ? **never_split**: (`optional`) list of str
? ? ? ? ? ? ? ? Kept for backward compatibility purposes. Now implemented directly at the base class level (see
? ? ? ? ? ? ? ? :func:`PreTrainedTokenizer.tokenize`) List of token not to split.
? ? ? ? """
? ? ? ? # union() returns a new set by concatenating the two sets.
? ? ? ? never_split = self.never_split.union(
? ? ? ? ? ? set(never_split)) if never_split else self.never_split
? ? ? ? text = self._clean_text(text)
? ? ? ? # This was added on November 1st, 2018 for the multilingual and Chinese
? ? ? ? # models. This is also applied to the English models now, but it doesn't
? ? ? ? # matter since the English models were not trained on any Chinese data
? ? ? ? # and generally don't have any Chinese data in them (there are Chinese
? ? ? ? # characters in the vocabulary because Wikipedia does have some Chinese
? ? ? ? # words in the English Wikipedia.).
? ? ? ? if self.tokenize_chinese_chars:
? ? ? ? ? ? text = self._tokenize_chinese_chars(text)
? ? ? ? orig_tokens = whitespace_tokenize(text)
? ? ? ? split_tokens = []
? ? ? ? for token in orig_tokens:
? ? ? ? ? ? if token not in never_split:
? ? ? ? ? ? ? ? if self.do_lower_case:
? ? ? ? ? ? ? ? ? ? token = token.lower()
? ? ? ? ? ? ? ? ? ? if self.strip_accents is not False:
? ? ? ? ? ? ? ? ? ? ? ? token = self._run_strip_accents(token)
? ? ? ? ? ? ? ? elif self.strip_accents:
? ? ? ? ? ? ? ? ? ? token = self._run_strip_accents(token)
? ? ? ? ? ? split_tokens.extend(self._run_split_on_punc(token, never_split))
? ? ? ? output_tokens = whitespace_tokenize(" ".join(split_tokens))
? ? ? ? return output_tokens
? ? def _run_strip_accents(self, text):
? ? ? ? """Strips accents from a piece of text."""
? ? ? ? text = unicodedata.normalize("NFD", text)
? ? ? ? output = []
? ? ? ? for char in text:
? ? ? ? ? ? cat = unicodedata.category(char)
? ? ? ? ? ? if cat == "Mn":
? ? ? ? ? ? ? ? continue
? ? ? ? ? ? output.append(char)
? ? ? ? return "".join(output)
? ? def _run_split_on_punc(self, text, never_split=None):
? ? ? ? """Splits punctuation on a piece of text."""
? ? ? ? if never_split is not None and text in never_split:
? ? ? ? ? ? return [text]
? ? ? ? chars = list(text)
? ? ? ? i = 0
? ? ? ? start_new_word = True
? ? ? ? output = []
? ? ? ? while i < len(chars):
? ? ? ? ? ? char = chars[i]
? ? ? ? ? ? if _is_punctuation(char):
? ? ? ? ? ? ? ? output.append([char])
? ? ? ? ? ? ? ? start_new_word = True
? ? ? ? ? ? else:
? ? ? ? ? ? ? ? if start_new_word:
? ? ? ? ? ? ? ? ? ? output.append([])
? ? ? ? ? ? ? ? start_new_word = False
? ? ? ? ? ? ? ? output[-1].append(char)
? ? ? ? ? ? i += 1
? ? ? ? return ["".join(x) for x in output]
? ? def _tokenize_chinese_chars(self, text):
? ? ? ? """Adds whitespace around any CJK character."""
? ? ? ? output = []
? ? ? ? for char in text:
? ? ? ? ? ? cp = ord(char)
? ? ? ? ? ? if self._is_chinese_char(cp):
? ? ? ? ? ? ? ? output.append(" ")
? ? ? ? ? ? ? ? output.append(char)
? ? ? ? ? ? ? ? output.append(" ")
? ? ? ? ? ? else:
? ? ? ? ? ? ? ? output.append(char)
? ? ? ? return "".join(output)
? ? def _is_chinese_char(self, cp):
? ? ? ? """Checks whether CP is the codepoint of a CJK character."""
? ? ? ? # This defines a "chinese character" as anything in the CJK Unicode block:
? ? ? ? #? https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
? ? ? ? #
? ? ? ? # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
? ? ? ? # despite its name. The modern Korean Hangul alphabet is a different block,
? ? ? ? # as is Japanese Hiragana and Katakana. Those alphabets are used to write
? ? ? ? # space-separated words, so they are not treated specially and handled
? ? ? ? # like the all of the other languages.
? ? ? ? if (
? ? ? ? ? ? (cp >= 0x4E00 and cp <= 0x9FFF)
? ? ? ? ? ? or (cp >= 0x3400 and cp <= 0x4DBF)? #
? ? ? ? ? ? or (cp >= 0x20000 and cp <= 0x2A6DF)? #
? ? ? ? ? ? or (cp >= 0x2A700 and cp <= 0x2B73F)? #
? ? ? ? ? ? or (cp >= 0x2B740 and cp <= 0x2B81F)? #
? ? ? ? ? ? or (cp >= 0x2B820 and cp <= 0x2CEAF)? #
? ? ? ? ? ? or (cp >= 0xF900 and cp <= 0xFAFF)
? ? ? ? ? ? or (cp >= 0x2F800 and cp <= 0x2FA1F)? #
? ? ? ? ):? #
? ? ? ? ? ? return True
? ? ? ? return False
? ? def _clean_text(self, text):
? ? ? ? """Performs invalid character removal and whitespace cleanup on text."""
? ? ? ? output = []
? ? ? ? for char in text:
? ? ? ? ? ? cp = ord(char)
? ? ? ? ? ? if cp == 0 or cp == 0xFFFD or _is_control(char):
? ? ? ? ? ? ? ? continue
? ? ? ? ? ? if _is_whitespace(char):
? ? ? ? ? ? ? ? output.append(" ")
? ? ? ? ? ? else:
? ? ? ? ? ? ? ? output.append(char)
? ? ? ? return "".join(output)
class WordpieceTokenizer(object):
? ? """Runs WordPiece tokenization."""
? ? def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
? ? ? ? self.vocab = vocab
? ? ? ? self.unk_token = unk_token
? ? ? ? self.max_input_chars_per_word = max_input_chars_per_word
? ? def tokenize(self, text):
? ? ? ? """
? ? ? ? Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
? ? ? ? tokenization using the given vocabulary.
? ? ? ? For example, :obj:`input = "unaffable"` wil return as output :obj:`["un", "##aff", "##able"]`.
? ? ? ? Args:
? ? ? ? ? text: A single token or whitespace separated tokens. This should have
? ? ? ? ? ? already been passed through `BasicTokenizer`.
? ? ? ? Returns:
? ? ? ? ? A list of wordpiece tokens.
? ? ? ? """
? ? ? ? output_tokens = []
? ? ? ? for token in whitespace_tokenize(text):
? ? ? ? ? ? chars = list(token)
? ? ? ? ? ? if len(chars) > self.max_input_chars_per_word:
? ? ? ? ? ? ? ? output_tokens.append(self.unk_token)
? ? ? ? ? ? ? ? continue
? ? ? ? ? ? is_bad = False
? ? ? ? ? ? start = 0
? ? ? ? ? ? sub_tokens = []
? ? ? ? ? ? while start < len(chars):
? ? ? ? ? ? ? ? end = len(chars)
? ? ? ? ? ? ? ? cur_substr = None
? ? ? ? ? ? ? ? while start < end:
? ? ? ? ? ? ? ? ? ? substr = "".join(chars[start:end])
? ? ? ? ? ? ? ? ? ? if start > 0:
? ? ? ? ? ? ? ? ? ? ? ? substr = "##" + substr
? ? ? ? ? ? ? ? ? ? if substr in self.vocab:
? ? ? ? ? ? ? ? ? ? ? ? cur_substr = substr
? ? ? ? ? ? ? ? ? ? ? ? break
? ? ? ? ? ? ? ? ? ? end -= 1
? ? ? ? ? ? ? ? if cur_substr is None:
? ? ? ? ? ? ? ? ? ? is_bad = True
? ? ? ? ? ? ? ? ? ? break
? ? ? ? ? ? ? ? sub_tokens.append(cur_substr)
? ? ? ? ? ? ? ? start = end
? ? ? ? ? ? if is_bad:
? ? ? ? ? ? ? ? output_tokens.append(self.unk_token)
? ? ? ? ? ? else:
? ? ? ? ? ? ? ? output_tokens.extend(sub_tokens)
? ? ? ? return output_tokens
bt = BertTokenizer.from_pretrained('bert-base-uncased')
bt('I like natural language progressing!')
{'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
3.2 BertSelfAttention
import math
import torch
from torch import nn
class BertSelfAttention(nn.Module):
? ? def __init__(self, config):
? ? ? ? super().__init__()
? ? ? ? if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
? ? ? ? ? ? raise ValueError(
? ? ? ? ? ? ? ? f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
? ? ? ? ? ? ? ? f"heads ({config.num_attention_heads})"
? ? ? ? ? ? )
? ? ? ? self.num_attention_heads = config.num_attention_heads
? ? ? ? self.attention_head_size = int(
? ? ? ? ? ? config.hidden_size / config.num_attention_heads)
? ? ? ? self.all_head_size = self.num_attention_heads * self.attention_head_size
? ? ? ? self.query = nn.Linear(config.hidden_size, self.all_head_size)
? ? ? ? self.key = nn.Linear(config.hidden_size, self.all_head_size)
? ? ? ? self.value = nn.Linear(config.hidden_size, self.all_head_size)
? ? ? ? self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
? ? ? ? self.position_embedding_type = getattr(
? ? ? ? ? ? config, "position_embedding_type", "absolute")
? ? ? ? if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
? ? ? ? ? ? self.max_position_embeddings = config.max_position_embeddings
? ? ? ? ? ? self.distance_embedding = nn.Embedding(
? ? ? ? ? ? ? ? 2 * config.max_position_embeddings - 1, self.attention_head_size)
? ? ? ? self.is_decoder = config.is_decoder
? ? def transpose_for_scores(self, x):
? ? ? ? new_x_shape = x.size()[
? ? ? ? ? ? :-1] + (self.num_attention_heads, self.attention_head_size)
? ? ? ? x = x.view(*new_x_shape)
? ? ? ? return x.permute(0, 2, 1, 3)
? ? def forward(
? ? ? ? self,
? ? ? ? hidden_states,
? ? ? ? attention_mask=None,
? ? ? ? head_mask=None,
? ? ? ? encoder_hidden_states=None,
? ? ? ? encoder_attention_mask=None,
? ? ? ? past_key_value=None,
? ? ? ? output_attentions=False,
? ? ):
? ? ? ? mixed_query_layer = self.query(hidden_states)
? ? ? ? # If this is instantiated as a cross-attention module, the keys
? ? ? ? # and values come from an encoder; the attention mask needs to be
? ? ? ? # such that the encoder's padding tokens are not attended to.
? ? ? ? is_cross_attention = encoder_hidden_states is not None
? ? ? ? if is_cross_attention and past_key_value is not None:
? ? ? ? ? ? # reuse k,v, cross_attentions
? ? ? ? ? ? key_layer = past_key_value[0]
? ? ? ? ? ? value_layer = past_key_value[1]
? ? ? ? ? ? attention_mask = encoder_attention_mask
? ? ? ? elif is_cross_attention:
? ? ? ? ? ? key_layer = self.transpose_for_scores(
? ? ? ? ? ? ? ? self.key(encoder_hidden_states))
? ? ? ? ? ? value_layer = self.transpose_for_scores(
? ? ? ? ? ? ? ? self.value(encoder_hidden_states))
? ? ? ? ? ? attention_mask = encoder_attention_mask
? ? ? ? elif past_key_value is not None:
? ? ? ? ? ? key_layer = self.transpose_for_scores(self.key(hidden_states))
? ? ? ? ? ? value_layer = self.transpose_for_scores(self.value(hidden_states))
? ? ? ? ? ? key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
? ? ? ? ? ? value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
? ? ? ? else:
? ? ? ? ? ? key_layer = self.transpose_for_scores(self.key(hidden_states))
? ? ? ? ? ? value_layer = self.transpose_for_scores(self.value(hidden_states))
? ? ? ? query_layer = self.transpose_for_scores(mixed_query_layer)
? ? ? ? if self.is_decoder:
? ? ? ? ? ? # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
? ? ? ? ? ? # Further calls to cross_attention layer can then reuse all cross-attention
? ? ? ? ? ? # key/value_states (first "if" case)
? ? ? ? ? ? # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
? ? ? ? ? ? # all previous decoder key/value_states. Further calls to uni-directional self-attention
? ? ? ? ? ? # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
? ? ? ? ? ? # if encoder bi-directional self-attention `past_key_value` is always `None`
? ? ? ? ? ? past_key_value = (key_layer, value_layer)
? ? ? ? # Take the dot product between "query" and "key" to get the raw attention scores.
? ? ? ? attention_scores = torch.matmul(
? ? ? ? ? ? query_layer, key_layer.transpose(-1, -2))
? ? ? ? if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
? ? ? ? ? ? seq_length = hidden_states.size()[1]
? ? ? ? ? ? position_ids_l = torch.arange(
? ? ? ? ? ? ? ? seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
? ? ? ? ? ? position_ids_r = torch.arange(
? ? ? ? ? ? ? ? seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
? ? ? ? ? ? distance = position_ids_l - position_ids_r
? ? ? ? ? ? positional_embedding = self.distance_embedding(
? ? ? ? ? ? ? ? distance + self.max_position_embeddings - 1)
? ? ? ? ? ? positional_embedding = positional_embedding.to(
? ? ? ? ? ? ? ? dtype=query_layer.dtype)? # fp16 compatibility
? ? ? ? ? ? if self.position_embedding_type == "relative_key":
? ? ? ? ? ? ? ? relative_position_scores = torch.einsum(
? ? ? ? ? ? ? ? ? ? "bhld,lrd->bhlr", query_layer, positional_embedding)
? ? ? ? ? ? ? ? attention_scores = attention_scores + relative_position_scores
? ? ? ? ? ? elif self.position_embedding_type == "relative_key_query":
? ? ? ? ? ? ? ? relative_position_scores_query = torch.einsum(
? ? ? ? ? ? ? ? ? ? "bhld,lrd->bhlr", query_layer, positional_embedding)
? ? ? ? ? ? ? ? relative_position_scores_key = torch.einsum(
? ? ? ? ? ? ? ? ? ? "bhrd,lrd->bhlr", key_layer, positional_embedding)
? ? ? ? ? ? ? ? attention_scores = attention_scores + \
? ? ? ? ? ? ? ? ? ? relative_position_scores_query + relative_position_scores_key
? ? ? ? attention_scores = attention_scores / \
? ? ? ? ? ? math.sqrt(self.attention_head_size)
? ? ? ? if attention_mask is not None:
? ? ? ? ? ? # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
? ? ? ? ? ? attention_scores = attention_scores + attention_mask
? ? ? ? # Normalize the attention scores to probabilities.
? ? ? ? attention_probs = nn.Softmax(dim=-1)(attention_scores)
? ? ? ? # This is actually dropping out entire tokens to attend to, which might
? ? ? ? # seem a bit unusual, but is taken from the original Transformer paper.
? ? ? ? attention_probs = self.dropout(attention_probs)
? ? ? ? # Mask heads if we want to
? ? ? ? if head_mask is not None:
? ? ? ? ? ? attention_probs = attention_probs * head_mask
? ? ? ? context_layer = torch.matmul(attention_probs, value_layer)
? ? ? ? context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
? ? ? ? new_context_layer_shape = context_layer.size()[
? ? ? ? ? ? :-2] + (self.all_head_size,)
? ? ? ? context_layer = context_layer.view(*new_context_layer_shape)
? ? ? ? outputs = (context_layer, attention_probs) if output_attentions else (
? ? ? ? ? ? context_layer,)
? ? ? ? if self.is_decoder:
? ? ? ? ? ? outputs = outputs + (past_key_value,)
? ? ? ? return outputs
3.3 BertSelfOutput
class BertSelfOutput(nn.Module):
? ? def __init__(self, config):
? ? ? ? super().__init__()
? ? ? ? self.dense = nn.Linear(config.hidden_size, config.hidden_size)
? ? ? ? self.LayerNorm = nn.LayerNorm(
? ? ? ? ? ? config.hidden_size, eps=config.layer_norm_eps)
? ? ? ? self.dropout = nn.Dropout(config.hidden_dropout_prob)
? ? def forward(self, hidden_states, input_tensor):
? ? ? ? hidden_states = self.dense(hidden_states)
? ? ? ? hidden_states = self.dropout(hidden_states)
? ? ? ? hidden_states = self.LayerNorm(hidden_states + input_tensor)
? ? ? ? return hidden_states
3.4 BertOutput
class BertOutput(nn.Module):
? ? def __init__(self, config):
? ? ? ? super().__init__()
? ? ? ? self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
? ? ? ? self.LayerNorm = nn.LayerNorm(
? ? ? ? ? ? config.hidden_size, eps=config.layer_norm_eps)
? ? ? ? self.dropout = nn.Dropout(config.hidden_dropout_prob)
? ? def forward(self, hidden_states, input_tensor):
? ? ? ? hidden_states = self.dense(hidden_states)
? ? ? ? hidden_states = self.dropout(hidden_states)
? ? ? ? hidden_states = self.LayerNorm(hidden_states + input_tensor)
? ? ? ? return hidden_states
3.5 BertPooler
class BertPooler(nn.Module):
? ? def __init__(self, config):
? ? ? ? super().__init__()
? ? ? ? self.dense = nn.Linear(config.hidden_size, config.hidden_size)
? ? ? ? self.activation = nn.Tanh()
? ? def forward(self, hidden_states):
? ? ? ? # We "pool" the model by simply taking the hidden state corresponding
? ? ? ? # to the first token.
? ? ? ? first_token_tensor = hidden_states[:, 0]
? ? ? ? pooled_output = self.dense(first_token_tensor)
? ? ? ? pooled_output = self.activation(pooled_output)
? ? ? ? return pooled_output
from transformers.models.bert.configuration_bert import *
import torch
# configuration parameters
config = BertConfig.from_pretrained("bert-base-uncased")
bert_pooler = BertPooler(config = config)
print("input to bert pooler size: {}".format(config.hidden_size))
# call bert_pooler
batch_size = 1
seq_len = 2
hidden_size = 768
x = torch.rand(batch_size, seq_len, hidden_size)
y = bert_pooler(x)
print(y.size())
input to bert pooler size: 768
torch.Size([1, 768])
4 Summary
This task walked through the BERT source code, covering BertTokenizer and BertModel. BertTokenizer splits sentences and decomposes them into subwords. BertModel is the model class itself and consists of three main parts: BertEmbeddings, BertEncoder and BertPooler. BertEmbeddings builds the word, position and token_type embeddings; BertEncoder is composed of BertAttention, BertIntermediate and BertOutput; BertPooler extracts the first token of the sequence. The whole walkthrough is best read together with the BERT architecture described in Task03.