1. BERT
BERT is built mainly from a stack of Transformer Encoders; its main components are an Embedding layer and the Encoder layers.
1.1 Embedding
There are three kinds of embeddings in BERT:
- Token Embedding (word embedding)
- Position Embedding (position embedding)
- Segment Embedding
1.1.1 Token Embedding
Token Embedding maps each token to a word vector. The raw input has shape [batch, seq_len]; after Token Embedding the data has shape [batch, seq_len, d_model].
Internally, Token Embedding can be thought of as follows: initialize a 2-D weight matrix of shape [vocab_size, d_model], one-hot encode the input to shape [batch, seq_len, vocab_size], and then multiply the two tensors. This is verified below:
- Token encoding with PyTorch's built-in embedding:
import torch
import torch.nn.functional as F
## Verify Token Embedding
input = torch.tensor([[1,4,2,3,4],[4,2,3,1,5]],dtype = torch.long)
init_weight = torch.rand(6,3) # 6 is the vocabulary size, 3 is d_model
print(init_weight)
# init_weight values (from one random run):
# tensor([[0.2741, 0.7190, 0.5863],
# [0.9283, 0.3595, 0.8193],
# [0.6051, 0.4441, 0.6545],
# [0.8852, 0.9930, 0.6367],
# [0.0421, 0.1417, 0.6370],
# [0.3956, 0.5442, 0.4503]])
out = F.embedding(input,init_weight)
print(out)
# out values:
# tensor([[[0.9283, 0.3595, 0.8193],
# [0.0421, 0.1417, 0.6370],
# [0.6051, 0.4441, 0.6545],
# [0.8852, 0.9930, 0.6367],
# [0.0421, 0.1417, 0.6370]],
#
# [[0.0421, 0.1417, 0.6370],
# [0.6051, 0.4441, 0.6545],
# [0.8852, 0.9930, 0.6367],
# [0.9283, 0.3595, 0.8193],
# [0.3956, 0.5442, 0.4503]]])
- Verification by one-hot encoding the indices and doing a matrix multiplication, with init_weight fixed to the same values as above:
import numpy as np
input2 = np.array([
[[0,1,0,0,0,0],[0,0,0,0,1,0],[0,0,1,0,0,0],[0,0,0,1,0,0],[0,0,0,0,1,0]],
[[0,0,0,0,1,0],[0,0,1,0,0,0],[0,0,0,1,0,0],[0,1,0,0,0,0],[0,0,0,0,0,1]]
])
init_weight = np.array([[0.2741, 0.7190, 0.5863],
[0.9283, 0.3595, 0.8193],
[0.6051, 0.4441, 0.6545],
[0.8852, 0.9930, 0.6367],
[0.0421, 0.1417, 0.6370],
[0.3956, 0.5442, 0.4503]])
for i in range(len(input2)):
    out = np.dot(input2[i], init_weight)
    print(out)
# [[0.9283 0.3595 0.8193]
# [0.0421 0.1417 0.637 ]
# [0.6051 0.4441 0.6545]
# [0.8852 0.993 0.6367]
# [0.0421 0.1417 0.637 ]]
# [[0.0421 0.1417 0.637 ]
# [0.6051 0.4441 0.6545]
# [0.8852 0.993 0.6367]
# [0.9283 0.3595 0.8193]
# [0.3956 0.5442 0.4503]]
The two results are identical, so we can infer that the embedding layer internally amounts to converting each token index of the sentence into a one-hot vector and multiplying it by a weight matrix; the weights are randomly initialized at first and are then learned during training. The encoded output has shape [batch, seq_len, d_model].
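- The same equivalence can also be checked directly in PyTorch with F.one_hot and a matrix multiplication; a minimal sketch reusing the toy sizes from the example above:
import torch
import torch.nn.functional as F

input = torch.tensor([[1, 4, 2, 3, 4], [4, 2, 3, 1, 5]], dtype=torch.long)
init_weight = torch.rand(6, 3)                 # vocab_size = 6, d_model = 3

# Built-in lookup: [batch, seq_len] -> [batch, seq_len, d_model]
out_lookup = F.embedding(input, init_weight)

# One-hot + matrix multiplication: [batch, seq_len, vocab_size] @ [vocab_size, d_model]
one_hot = F.one_hot(input, num_classes=6).float()
out_matmul = one_hot @ init_weight

print(torch.allclose(out_lookup, out_matmul))  # True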
1.1.2 Position Embedding
The Position Embedding in BERT differs from the one in the Transformer: the Transformer computes the value of each dimension directly from a fixed (sinusoidal) formula, whereas in BERT the position embeddings are learned. For example, if the sequence length is 512, each sentence gets the 1-D position index array [0, 1, 2, ..., 511]; repeating it batch times gives an input of shape [batch, seq_len], which is fed into an embedding lookup (the computation is the same one-hot-plus-matrix-multiplication as in Token Embedding). The final output has shape [batch, seq_len, d_model], the same as the Token Embedding output.
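- A minimal sketch of such a learned position embedding (the sizes here are toy values for illustration, not the actual BERT configuration):
import torch
import torch.nn as nn

batch, seq_len, d_model, max_len = 2, 5, 8, 512

# One trainable vector per position index; the table is learned during training
position_embedding = nn.Embedding(max_len, d_model)

# Position indices [0, 1, ..., seq_len-1], repeated for every sentence in the batch
position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0).expand(batch, seq_len)

out = position_embedding(position_ids)
print(out.shape)  # torch.Size([2, 5, 8]) -> [batch, seq_len, d_model]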
1.1.3 Segment Embedding
BERT can also handle classification tasks over input sentence pairs, such as judging whether two pieces of text are semantically similar. The two sentences of a pair are simply concatenated and fed into the model. How, then, does BERT tell the two sentences of a pair apart? The answer is the segment embeddings. They are generally not needed for single sentences and are only used for sentence pairs. Their encoded output also has shape [batch, seq_len, d_model].
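- A minimal sketch of how segment ids work for a sentence pair (the token layout below is made up for illustration): positions belonging to the first sentence get segment id 0, positions belonging to the second get segment id 1, and each id is looked up in a small embedding table with only two entries:
import torch
import torch.nn as nn

d_model = 8

# One sentence pair of length 8: the first five positions belong to sentence A,
# the last three to sentence B (layout chosen only for illustration)
token_type_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]])

segment_embedding = nn.Embedding(2, d_model)   # 2 segment types: sentence A vs. sentence B
out = segment_embedding(token_type_ids)
print(out.shape)  # torch.Size([1, 8, 8]) -> [batch, seq_len, d_model]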
- The embedding code from the pre-trained BERT model is as follows:
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings.
    """
    def __init__(self, config):
        super(BertEmbeddings, self).__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)  # layer normalization acts over the last dimension
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1)  # sentence length; input_ids usually has shape [batch_size, seq_length]
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)  # expand to the same shape as input_ids
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        words_embeddings = self.word_embeddings(input_ids)  # input_ids is fed directly into the word embedding
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        embeddings = words_embeddings + position_embeddings + token_type_embeddings  # the sum of the three is the encoder input
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
class BertLayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root).
        """
        super(BertLayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        u = x.mean(-1, keepdim=True)  # LayerNorm normalizes over the last dimension
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.weight * x + self.bias
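- As a quick sanity check, the two modules above can be exercised end to end with a small, hand-rolled config; the field values below are illustrative only, not the real bert-base numbers (which use vocab_size=30522, hidden_size=768, etc.):
from types import SimpleNamespace

# Illustrative config for a toy run, not the actual pre-trained BERT configuration
config = SimpleNamespace(
    vocab_size=100,
    hidden_size=16,
    max_position_embeddings=32,
    type_vocab_size=2,
    hidden_dropout_prob=0.1,
)

embeddings = BertEmbeddings(config)
input_ids = torch.randint(0, 100, (2, 10))     # [batch, seq_len]
token_type_ids = torch.zeros_like(input_ids)   # single-sentence input: all segment ids are 0
out = embeddings(input_ids, token_type_ids)
print(out.shape)  # torch.Size([2, 10, 16]) -> [batch, seq_len, hidden_size]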