寫在前面
部分自己手敲代碼:鏈接
1 緒論
預(yù)訓(xùn)練(Pre-train)
即首先在一個原任務(wù)上預(yù)先訓(xùn)練一個初始模型,然後在下游任務(wù)(目標(biāo)任務(wù))上繼續(xù)對該模型進(jìn)行精調(diào)(Fine-Tune),從而達(dá)到提高下游任務(wù)準(zhǔn)確率的目的,本質(zhì)上也是一種遷移學(xué)習(xí)(Transfer Learning)
2 自然語言處理基礎(chǔ)
2.1 文本的表示
2.1.1 獨熱表示
One-hot Encoding無法使用余弦函數(shù)有效計算相似度(任意兩個不同詞的余弦相似度都是0),同時會造成數(shù)據(jù)稀疏(Data Sparsity)問題
2.1.2 詞的分布式表示
分布式語義假設(shè):詞的含義可以由其上下文的分布進(jìn)行表示
使得利用共現(xiàn)頻次構(gòu)建的向量能夠在一定程度上反映詞之間的相似性
2.1.2.1 上下文
可以使用詞在句子中的一個固定窗口內(nèi)的詞作為其上下文,也可以使用詞所在的文檔本身作為上下文
- 前者反映詞的局部性質(zhì):具有相似詞法、句法屬性的詞將會具有相似的向量表示
- 后者更多反映詞代表的主題信息
2.1.2.2 共現(xiàn)頻次作為詞的向量表示的問題
- 高頻詞誤導(dǎo)計算結(jié)果
- 高階關(guān)系無法反映
- 仍有稀疏性問題
例子:“A”與“B”共現(xiàn)過,“B”與“C”共現(xiàn)過,“C”與“D”共現(xiàn)過,只能知道“A”與“C”都和“B”共現(xiàn)過,但“A”與“D”這種高階關(guān)系無法知曉
2.1.2.3 奇異值分解
可以使用奇異值分解的做法解決共現(xiàn)頻次無法反映詞之間高階關(guān)系的問題
奇異值分解後的U的每一行表示對應(yīng)詞的d維向量表示,由於U的各列相互正交,可以認(rèn)為詞表示的每一維表達(dá)了該詞的一種獨立的“潛在語義”
分解結(jié)果中上下文比較相近的詞在空間上的距離也比較近,這種方法也被稱為潛在語義分析(Latent Semantic Analysis,LSA)
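下面用一個很小的假想共現(xiàn)矩陣給出SVD取詞向量的示意代碼(自擬示例,非原書代碼,矩陣數(shù)值僅作演示):
```python
import numpy as np

# 假想的共現(xiàn)矩陣:行/列對應(yīng)詞表中的詞(數(shù)值僅作演示)
words = ["我", "愛", "北京", "天安門"]
M = np.array([[0, 2, 1, 0],
              [2, 0, 1, 1],
              [1, 1, 0, 2],
              [0, 1, 2, 0]], dtype=float)

U, S, Vt = np.linalg.svd(M)    # M = U @ diag(S) @ Vt
d = 2                          # 只保留前d個奇異值對應(yīng)的維度
word_vecs = U[:, :d] * S[:d]   # 每一行即對應(yīng)詞的d維向量表示

# 用余弦相似度比較兩個詞
def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

print(cos(word_vecs[words.index("北京")], word_vecs[words.index("天安門")]))
```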
2.1.3 詞嵌入表示
經(jīng)常直接簡稱為詞向量:利用自然語言文本中蘊含的自監(jiān)督學(xué)習(xí)信號(即詞與上下文的共現(xiàn)信息)先預(yù)訓(xùn)練詞向量,再用於下游任務(wù),往往會獲得更好的效果
2.1.4 詞袋表示
BOW(Bag-of-Words),不考慮順序,將文本中全部詞所對應(yīng)的向量表示(可以是獨熱表示,也可以是分布式表示或詞向量)相加,即構(gòu)成了文本的向量表示。如果使用獨熱表示,文本向量的每一維恰好是對應(yīng)詞在文本中出現(xiàn)的次數(shù)
2.2 自然語言處理任務(wù)
2.2.1 n-gram
句首加上<BOS>,句尾加上<EOS>
2.2.2 平滑
- 當(dāng)n比較大或者測試句子中含有未登錄詞(Out-Of-Vocabulary,OOV)時,會出現(xiàn)零概率,可以使用加1平滑
- 當(dāng)訓(xùn)練集較小時,加1平滑會為低頻或未出現(xiàn)的事件給出過高的概率估計,所以改為加δ平滑(0<δ<1)。例如對bigram語言模型,平滑後的條件概率為(計算示意見下):
$$
P\left(w_{i} \mid w_{i-1}\right)=\frac{C\left(w_{i-1} w_{i}\right)+\delta}{C\left(w_{i-1}\right)+\delta|\mathbb{V}|}
$$
其中C(·)為訓(xùn)練語料中的計數(shù),|V|為詞表大小
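下面給出加δ平滑的bigram概率計算示意(自擬示例,非原書代碼,語料與δ取值僅作演示):
```python
from collections import Counter

def bigram_prob_add_delta(w_prev, w, unigram_cnt, bigram_cnt, vocab_size, delta=0.1):
    """加delta平滑的bigram條件概率 P(w | w_prev)"""
    return (bigram_cnt[(w_prev, w)] + delta) / (unigram_cnt[w_prev] + delta * vocab_size)

corpus = [["<BOS>", "我", "愛", "北京", "<EOS>"],
          ["<BOS>", "我", "愛", "天安門", "<EOS>"]]
unigram_cnt = Counter(w for sent in corpus for w in sent)
bigram_cnt = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
V = len(unigram_cnt)

print(bigram_prob_add_delta("我", "愛", unigram_cnt, bigram_cnt, V))    # 出現(xiàn)過的bigram
print(bigram_prob_add_delta("我", "北京", unigram_cnt, bigram_cnt, V))  # 未出現(xiàn)的bigram,概率不再為0
```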
2.2.3 語言模型性能評估
方法:
- 運用到具體的任務(wù)中,得到外部任務(wù)評價(計算代價高)
- 困惑度(Perplexity,PPL),內(nèi)部評價
困惑度的定義為測試集中全部單詞概率的幾何平均的倒數(shù):
$$
\operatorname{PPL}(\mathbb{D})=P\left(w_{1} w_{2} \cdots w_{N}\right)^{-\frac{1}{N}}
$$
直接連乘會導(dǎo)致浮點下溢,實際計算時通常轉(zhuǎn)換到對數(shù)空間求和,再取指數(shù)還原
困惑度越小,單詞序列的概率越大。困惑度越低的語言模型并不總能在外部任務(wù)上得到更好的性能指標(biāo),但兩者通常具有一定的正相關(guān)性,因此困惑度是一種快速評價語言模型性能的指標(biāo)
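實際實現(xiàn)時通常這樣在對數(shù)空間計算困惑度(自擬示意,假設(shè)已能得到每個詞的條件概率):
```python
import math

def perplexity(word_probs):
    """word_probs: 測試集中每個詞(按順序)的條件概率列表"""
    # 連乘轉(zhuǎn)為對數(shù)求和,避免浮點下溢
    log_sum = sum(math.log(p) for p in word_probs)
    N = len(word_probs)
    return math.exp(-log_sum / N)

print(perplexity([0.2, 0.1, 0.05, 0.3]))  # 概率越大,困惑度越小
```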
2.3 自然語言處理基礎(chǔ)任務(wù)
2.3.1 中文分詞
前向最大匹配分詞的明顯缺點是傾向於切分出較長的詞,也會有切分歧義的問題
見附錄代碼2
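附錄代碼以作者給出的鏈接為準(zhǔn),這里補(bǔ)一個前向最大匹配的極簡示意(自擬,詞典與max_len僅作演示),其結(jié)果也恰好體現(xiàn)了上面說的切分歧義:
```python
def fmm_segment(text, word_dict, max_len=4):
    """前向最大匹配分詞:每次從當(dāng)前位置嘗試最長的詞典詞"""
    result, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if l == 1 or piece in word_dict:  # 單字直接切出
                result.append(piece)
                i += l
                break
    return result

word_dict = {"研究", "研究生", "生命", "起源"}
print(fmm_segment("研究生命起源", word_dict))  # ['研究生', '命', '起源'],傾向切長詞導(dǎo)致的錯誤切分
```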
2.3.2 子詞切分
以英語為代表的語言,如果只按照天然的分隔符(空格)進(jìn)行切分,會造成一定的數(shù)據(jù)稀疏問題,而且會導(dǎo)致詞表過大而降低處理速度。傳統(tǒng)做法是基於語言學(xué)規(guī)則的詞形還原(Lemmatization)和詞干提取(Stemming),但其結(jié)果可能不是一個完整的詞
2.3.2.1 子詞切分算法
原理:都是使用盡量長且頻次高的子詞對單詞進(jìn)行切分,典型算法如字節(jié)對編碼(Byte Pair Encoding,BPE)
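下面是BPE學(xué)習(xí)合并規(guī)則的極簡示意(自擬,僅演示“統(tǒng)計最高頻相鄰子詞對并合并”的核心循環(huán),語料僅作演示):
```python
from collections import Counter

def merge_word(word, pair):
    """把詞中相鄰的子詞對pair合并成一個新子詞(詞以空格分隔子詞)"""
    symbols = word.split()
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return " ".join(out)

def learn_bpe(word_freqs, num_merges):
    """word_freqs: {以空格分隔成字符的詞: 頻次};返回學(xué)到的合并規(guī)則列表"""
    merges = []
    for _ in range(num_merges):
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pair_freqs[(symbols[i], symbols[i + 1])] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)  # 頻次最高的相鄰子詞對
        merges.append(best)
        word_freqs = {merge_word(w, best): f for w, f in word_freqs.items()}
    return merges

word_freqs = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
print(learn_bpe(word_freqs, 3))  # [('e', 's'), ('es', 't'), ('l', 'o')]
```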
2.3.3 詞性標(biāo)注
主要難點在於歧義性,即同一個詞在不同的上下文中可能有不同的詞性
2.3.4 句法分析
給定一個句子囱晴,分析句子的句法成分信息,輔助下游處理任務(wù)
句法結(jié)構(gòu)表示法:
S表示起始符號,NP為名詞短語,VP為動詞短語,sub表示主語,obj表示賓語
例子
您轉(zhuǎn)的這篇文章很無知。
您轉(zhuǎn)這篇文章很無知。第一句話的主語是“文章”,第二句話的主語是“轉(zhuǎn)”這個動作
2.3.5 語義分析
從詞語的顆粒度考慮,一個詞語可能具有多重語義,例如“打”;確定其具體詞義的任務(wù)稱為詞義消歧(Word Sense Disambiguation,WSD),可以借助語義詞典(例如WordNet)定義的詞義列表來確定
2.4 自然語言處理應(yīng)用任務(wù)
2.4.1 信息抽取
- NER
- 關(guān)系抽取(實體之間的語義關(guān)系,如夫妻、子女、工作單位和地理空間上的位置關(guān)系等二元關(guān)系)
- 事件抽取(識別人們感興趣的事件以及事件所涉及的時間、地點和人物等關(guān)鍵信息)
2.4.2 情感分析
- 情感分類(識別文中蘊含的情感類型或者情感強度)
- 情感信息抽取(抽取文中的情感元素,如評價詞語、評價對象和評價搭配等)
2.4.3 問答系統(tǒng)(QA)
- 檢索式問答系統(tǒng)
- 知識庫問答系統(tǒng)
- 常問問題集問答系統(tǒng)
- 閱讀理解式問答系統(tǒng)
2.4.4 機(jī)器翻譯(MT)
“理性主義”:基於規(guī)則
“經(jīng)驗主義”:數(shù)據(jù)驅(qū)動
基於深度學(xué)習(xí)的機(jī)器翻譯也稱為神經(jīng)機(jī)器翻譯(Neural Machine Translation,NMT)
2.4.5 對話系統(tǒng)
對話系統(tǒng)主要分為任務(wù)型對話系統(tǒng)和開放域?qū)υ捪到y(tǒng),後者也被稱為聊天機(jī)器人(Chatbot)
2.4.5.1 任務(wù)型對話系統(tǒng)
包含三個模塊:NLU(自然語言理解)、DM(對話管理)德迹、NLG(自然語言生成)
1)NLU通常包含話語的領(lǐng)域芽卿、槽值、意圖
2)DM通常包含對話狀態(tài)跟蹤(Dialogue State Tracking)、對話策略優(yōu)化(Dialogue Policy Optimization)
對話狀態(tài)一般表示為語義槽和值的列表。例如,對於用戶話語
U:幫我訂一張明天去北京的機(jī)票
在其NLU結(jié)果的基礎(chǔ)上進(jìn)行對話狀態(tài)跟蹤,可得到當(dāng)前對話狀態(tài):【到達(dá)地=北京;出發(fā)時間=明天;出發(fā)地=NULL;數(shù)量=1】
獲取到當(dāng)前對話狀態(tài)後,進(jìn)行對話策略優(yōu)化,即選擇下一步采用什么樣的策略(也叫動作),比如可以繼續(xù)詢問出發(fā)地等
NLG通常通過寫模板即可實現(xiàn)
2.5 基本問題
2.5.1 文本分類
2.5.2 結(jié)構(gòu)預(yù)測
2.5.2.1 序列標(biāo)注(Sequence Labeling)
CRF既考慮了每個詞屬於某個標(biāo)簽的概率(發(fā)射概率),還考慮了標(biāo)簽之間的相互關(guān)系(轉(zhuǎn)移概率)
2.5.2.2 序列分割
人名(PER)、地名(LOC)、機(jī)構(gòu)名(ORG)
輸入:“我愛北京天安門”,分詞結(jié)果:“我 愛 北京 天安門”,NER結(jié)果:“北京天安門=LOC”
2.5.2.3 圖結(jié)構(gòu)生成
輸入的是自然語言,輸出結(jié)果是一個以圖表示的結(jié)構(gòu),算法有兩大類:基於圖和基於轉(zhuǎn)移
2.5.3 序列到序列問題(Seq2Seq)
也稱為編碼器-解碼器(Encoder-Decoder)模型
2.6 評價指標(biāo)
2.6.1 準(zhǔn)確率(Accuracy)
最簡單直觀的評價指標(biāo),常被用於文本分類、詞性標(biāo)注等問題
2.6.2 F值
針對某一類別的評價
計算公式為
$$
F=\frac{\left(\beta^{2}+1\right) \cdot P \cdot R}{\beta^{2} \cdot P+R}
$$
式中,β是加權(quán)調(diào)和參數(shù);P是精確率(Precision);R是召回率(Recall)。當(dāng)權(quán)重β=1時,表示精確率和召回率同樣重要,此時也稱F1值,即 F1 = 2PR/(P+R)
“正確識別的命名實體數(shù)目”為1(“哈爾濱”),“識別出的命名實體總數(shù)”為2(“張”和“哈爾濱”),“測試文本中命名實體的總數(shù)”為2(“張三”和“哈爾濱”),那么此時精確率和召回率皆為1/2 = 0.5,最終的F1 = 0.5,與基於詞計算的準(zhǔn)確率(0.875)相比,該值更為合理
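按上例的數(shù)字驗算一下(示意代碼,非原書代碼):
```python
def precision_recall_f1(num_correct, num_predicted, num_gold):
    p = num_correct / num_predicted   # 精確率
    r = num_correct / num_gold        # 召回率
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# 正確識別1個("哈爾濱"),共識別出2個,標(biāo)準(zhǔn)答案共2個
print(precision_recall_f1(1, 2, 2))   # (0.5, 0.5, 0.5)
```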
2.6.3 其他評價
BLEU值是最常用的機(jī)器翻譯自動評價指標(biāo)
2.7 習(xí)題
3 基礎(chǔ)工具集與常用數(shù)據(jù)集
3.1 NLTK工具集
3.1.1 語料庫和詞典資源
3.1.1.1 停用詞
英文中的“a”、“the”、“of”、“to”等
from nltk.corpus import stopwords
stopwords.words('english')
3.1.1.2 常用語料庫
NLTK提供了多種語料庫(文本數(shù)據(jù)集),如圖書、電影評論和聊天記錄等,它們可以被分為兩類,即未標(biāo)注語料庫(又稱生語料庫或生文本,Raw Text)和人工標(biāo)注語料庫(Annotated Corpus)
- 未標(biāo)注語料庫
比如說小說的原文等
- 人工標(biāo)注語料庫
3.1.1.3 常用詞典
WordNet
普林斯頓大學(xué)構(gòu)建的英文語義詞典(也稱作辭典,Thesaurus),其主要特色是定義了同義詞集合(Synset),每個同義詞集合由具有相同意義的詞義組成;WordNet為每一個同義詞集合提供了簡短的釋義(Gloss),不同同義詞集合之間還具有一定的語義關(guān)系
from nltk.corpus import wordnet
syns = wordnet.synsets("bank")
syns[0].name()
syns[1].definition()
SentiWordNet
基于WordNet標(biāo)注的同義詞集合的情感傾向性詞典娜亿,有褒義丽已、貶義、中性三個情感值
3.1.2 NLP工具集
3.1.2.1 分句
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize
text = gutenberg.raw('austen-emma.txt')
sentences = sent_tokenize(text)
print(sentences[100])
# 結(jié)果:
# Mr. Knightley loves to find fault with me, you know-- \nin a joke--it is all a joke.
3.1.2.2 標(biāo)記解析
一個句子是由若干標(biāo)記(Token)按順序構(gòu)成的,其中標(biāo)記既可以是一個詞,也可以是標(biāo)點符號等,這些標(biāo)記是自然語言處理最基本的輸入單元。
將句子分割為標(biāo)記的過程叫作標(biāo)記解析(Tokenization)。英文中的單詞之間通常使用空格進(jìn)行分割,不過標(biāo)點符號通常和前面的單詞連在一起,因此標(biāo)記解析的一項主要工作是將標(biāo)點符號和前面的單詞拆分開。和分句一樣,也無法使用簡單的規(guī)則進(jìn)行標(biāo)記解析:仍以符號“.”為例,它既可作為句號,也可以作為標(biāo)記的一部分,如不能簡單地將“Mr.”分成兩個標(biāo)記。同樣,NLTK提供了標(biāo)記解析功能,也稱作標(biāo)記解析器(Tokenizer)
from nltk.tokenize import word_tokenize
word_tokenize(sentences[100])
# 結(jié)果:
# ['Mr.','Knightley','loves','to','find','fault','with','me',',','you','know','--','in','a','joke','--','it','is','all','a','joke','.']
3.1.2.3 詞性標(biāo)注
from nltk import pos_tag
print(pos_tag(word_tokenize("They sat by the fire.")))
print(pos_tag(word_tokenize("They fire a gun.")))
# 結(jié)果:
# [('They', 'PRP'), ('sat', 'VBP'), ('by', 'IN'), ('the', 'DT'), ('fire', 'NN'), ('.', '.')]
# [('They', 'PRP'), ('fire', 'VBP'), ('a', 'DT'), ('gun', 'NN'), ('.', '.')]
3.1.2.4 其他工具
命名實體識別耕挨、組塊分析细卧、句法分析等
3.2 LTP工具集(哈工大)
提供中文分詞、詞性標(biāo)注、命名實體識別、依存句法分析和語義角色標(biāo)注等功能,具體用法可查閱其API文檔
3.3 Pytorch
# 1.創(chuàng)建
torch.empty(2, 3) # 未初始化
torch.randn(2, 3) # 標(biāo)準(zhǔn)正態(tài)
torch.zeros(2, 3, dtype=torch.long) # 張量為整數(shù)類型
torch.zeros(2, 3, dtype=torch.double) # 雙精度浮點數(shù)
torch.tensor([[1.0, 3.8, 2.1], [8.6, 4.0, 2.4]]) # 通過列表創(chuàng)建
torch.arange(1, 4) # 生成1到3的整數(shù)序列(不包含4)
# 2.GPU
torch.rand(2, 3).to("cuda")
# 3.加減乘除都是元素運算
x = torch.tensor([1, 2, 3], dtype=torch.double)
y = torch.tensor([4, 5, 6], dtype=torch.double)
print(x * y)
# tensor([ 4., 10., 18.], dtype=torch.float64)
# 4.點積
x.dot(y)
# 5.所有元素求平均
x.mean()
# 6.按維度求平均
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.double)
print(x.mean(dim=0))
print(x.mean(dim=1))
# tensor([2.5000, 3.5000, 4.5000], dtype=torch.float64)
# tensor([2., 5.], dtype=torch.float64)
# 7.拼接(按列和行)
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.double)
y = torch.tensor([[7, 8, 9], [10, 11, 12]], dtype=torch.double)
torch.cat((x, y), dim=0)
# tensor([[ 1., 2., 3.],
# [ 4., 5., 6.],
# [ 7., 8., 9.],
# [10., 11., 12.]], dtype=torch.float64)
# 8.梯度
x = torch.tensor([2.], requires_grad=True)
y = torch.tensor([3.], requires_grad=True)
z = (x + y) * (y - 2)
print(z)
z.backward()
print(x.grad, y.grad)
# tensor([5.], grad_fn=<MulBackward0>)
# tensor([1.]) tensor([6.])
# 9.調(diào)整形狀
view和reshape的區(qū)別是:view要求張量在內(nèi)存中是連續(xù)的(可以用is_contiguous()判斷是否連續(xù)),reshape沒有此要求,其他用法一樣
transpose交換維度(一次只能交換兩個維度),permute可以一次重排多個維度
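用形狀變化直觀對比一下(自擬示意):
```python
import torch

x = torch.arange(24).view(2, 3, 4)        # 形狀 (2, 3, 4)
print(x.reshape(4, 6).shape)              # torch.Size([4, 6])
print(x.transpose(1, 2).shape)            # 交換兩個維度 -> torch.Size([2, 4, 3])
print(x.permute(2, 0, 1).shape)           # 一次重排多個維度 -> torch.Size([4, 2, 3])

# transpose/permute之後張量不再連續(xù),需先contiguous()才能使用view
y = x.transpose(1, 2)
print(y.is_contiguous())                  # False
print(y.contiguous().view(2, 12).shape)   # torch.Size([2, 12])
```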
# 10.升維和降維
a = torch.tensor([1, 2, 3, 4])
b = a.unsqueeze(dim=0)
print(b, b.shape)
c = b.squeeze() # 去掉所有形狀中為1的維
print(c, c.shape)
# tensor([[1, 2, 3, 4]]) torch.Size([1, 4])
# tensor([1, 2, 3, 4]) torch.Size([4])
3.4 語料處理
import re
# 刪除空的成對符號
def remove_empty_paired_punc(in_str):
return in_str.replace('()', '').replace('《》', '').replace('【】', '').replace('[]', '')
# 刪除多余的html標(biāo)簽
def remove_html_tags(in_str):
html_pattern = re.compile(r'<[^>]+>', re.S)
return html_pattern.sub('', in_str)
# 刪除不可見控制字符
def remove_control_chars(in_str):
control_chars = ''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))
control_chars = re.compile('[%s]' % re.escape(control_chars))
return control_chars.sub('', in_str)
3.5 數(shù)據(jù)集
- Common Crawl
- HuggingFace Datasets(超多數(shù)據(jù)集)
使用HuggingFace Datasets之前,先用pip安裝datasets,其提供數(shù)據(jù)集以及評價方法
3.6 習(xí)題
4 自然語言處理中的神經(jīng)網(wǎng)絡(luò)基礎(chǔ)
4.1 多層感知器模型
4.1.1 感知機(jī)
將輸入轉(zhuǎn)換成特征向量x的過程稱為特征提取(Feature Extraction)
4.1.2 線性回歸
和感知機(jī)類似,但線性回歸直接輸出連續(xù)的實數(shù)值:y = w·x + b
4.1.3 邏輯回歸
在線性回歸的輸出上套一個Sigmoid函數(shù):y = σ(w·x + b),其中 σ(z) = 1/(1+e^(-z));邏輯回歸模型常用於分類問題
4.1.4 softmax回歸
例如數(shù)字識別任務(wù)共有10個類別,對每個類別分別計算一個得分,再經(jīng)softmax歸一化得到屬於各類別的概率
使用矩陣表示即 y = softmax(Wx + b)
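補(bǔ)充softmax回歸的標(biāo)準(zhǔn)形式(按通用定義整理,符號可能與原書略有出入):
$$
P(y=i \mid \boldsymbol{x})=\operatorname{softmax}(\boldsymbol{W} \boldsymbol{x}+\boldsymbol{b})_{i}=\frac{\exp \left(\boldsymbol{w}_{i}^{\top} \boldsymbol{x}+b_{i}\right)}{\sum_{j} \exp \left(\boldsymbol{w}_{j}^{\top} \boldsymbol{x}+b_{j}\right)}
$$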
4.1.5 多層感知機(jī)(Multi-Layer Perceptron,MLP)
多層感知機(jī)是解決線性不可分問題的方案:堆疊多層線性分類器,并在隱含層中加入非線性激活函數(shù)
4.1.6 模型實現(xiàn)
4.1.6.1 nn
from torch import nn
linear = nn.Linear(32, 2) # 輸入特征數(shù)目為32維,輸出特征數(shù)目為2維
inputs = torch.rand(3, 32) # 創(chuàng)建一個形狀為(3,32)的隨機(jī)張量,3為batch批次大小
outputs = linear(inputs)
print(outputs)
# 輸出:
# tensor([[ 0.2488, -0.3663],
# [ 0.4467, -0.5097],
# [ 0.4149, -0.7504]], grad_fn=<AddmmBackward>)
# 輸出為(3, 2),即(batch,輸出維度)
4.1.6.2 激活函數(shù)
from torch.nn import functional as F
# 對于每個元素進(jìn)行sigmoid
activation = F.sigmoid(outputs)
print(activation)
# 結(jié)果:
# tensor([[0.6142, 0.5029],
# [0.5550, 0.4738],
# [0.6094, 0.4907]], grad_fn=<SigmoidBackward>)
# 沿著第2維(行方向)進(jìn)行softmax纬朝,即對于每批次中的各樣例分別進(jìn)行softmax
activation = F.softmax(outputs, dim=1)
print(activation)
# 結(jié)果:
# tensor([[0.6115, 0.3885],
# [0.5808, 0.4192],
# [0.6182, 0.3818]], grad_fn=<SoftmaxBackward>)
activation = F.relu(outputs)
print(activation)
# 結(jié)果:
# tensor([[0.4649, 0.0115],
# [0.2210, 0.0000],
# [0.4447, 0.0000]], grad_fn=<ReluBackward0>)
4.1.6.3 多層感知機(jī)
import torch
from torch import nn
from torch.nn import functional as F
# 多層感知機(jī)
class MLP(nn.Module):
def __init__(self, input_dim, hidden_dim, num_class):
super(MLP, self).__init__()
# 線性變換:輸入 -> 隱層
self.linear1 = nn.Linear(input_dim, hidden_dim)
# ReLU
self.activate = F.relu
# 線性變換:隱層 -> 輸出
self.linear2 = nn.Linear(hidden_dim, num_class)
def forward(self, inputs):
hidden = self.linear1(inputs)
activation = self.activate(hidden)
outputs = self.linear2(activation)
probs = F.softmax(outputs, dim=1) # 獲得每個輸入屬于某個類別的概率
return probs
mlp = MLP(input_dim=4, hidden_dim=5, num_class=2)
# 3個輸入batch收叶,4為每個輸入的維度
inputs = torch.rand(3, 4)
probs = mlp(inputs)
print(probs)
# 結(jié)果:
# tensor([[0.3465, 0.6535],
# [0.3692, 0.6308],
# [0.4319, 0.5681]], grad_fn=<SoftmaxBackward>)
4.2 CNN
4.2.1 模型結(jié)構(gòu)
計算最後輸出邊長為 (n + 2p - f)/s + 1(步長s為1時即 n + 2p - f + 1)
其中,n為輸入邊長,p為padding大小,f為卷積核寬度,s為步長
卷積神經(jīng)網(wǎng)絡(luò)本質(zhì)上也是一種前饋神經(jīng)網(wǎng)絡(luò)
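舉個小例子對應(yīng)後文的Conv1d代碼(按上式驗算):序列長度 n=7,不加padding(p=0),卷積核寬度 f=2,步長 s=1,則輸出長度為
$$
\frac{n+2 p-f}{s}+1=\frac{7+0-2}{1}+1=6
$$
與後面Conv1d示例中輸出形狀 (2, 6, 3) 的序列長度一致。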
4.2.2 模型實現(xiàn)
4.2.2.1 卷積
PyTorch提供了Conv1d、Conv2d、Conv3d,自然語言處理中常用的是一維卷積Conv1d
簡單來說,2d卷積在平面上先橫著掃再豎著掃,1d卷積只沿一個方向掃,3d卷積則是三維立體地掃
代碼實現(xiàn):注意PyTorch的Conv1d要求輸入形狀為(batch, 通道數(shù), 序列長度),即參與卷積的通道(這里是embedding維度)必須位於倒數(shù)第2維,因此傳參前需要先轉(zhuǎn)置,把要卷積的維度放到倒數(shù)第2維,卷積之後一般再轉(zhuǎn)置回來(這點確實繞了一圈)
import torch
from torch.nn import Conv1d
inputs = torch.ones(2, 7, 5)
conv1 = Conv1d(in_channels=5, out_channels=3, kernel_size=2)
inputs = inputs.permute(0, 2, 1)
outputs = conv1(inputs)
outputs = outputs.permute(0, 2, 1)
print(outputs, outputs.shape)
# 結(jié)果:
# tensor([[[ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679]],
# [[ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679]]], grad_fn=<PermuteBackward> torch.Size([2, 6, 3]))
4.2.2.2 卷積、池化、全連接
import torch
from torch.nn import Conv1d
# 輸入批次大小為2,即有兩個序列娶聘,每個序列長度為6闻镶,輸入的維度為5
inputs = torch.rand(2, 5, 6)
print("inputs = ", inputs, inputs.shape)
# class torch.nn.Conv1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
# in_channels 詞向量維度
# out_channels 卷積產(chǎn)生的通道
# kernel_size 卷積核尺寸,卷積大小實際為 kernel_size*in_channels
# 定義一個一維卷積丸升,輸入通道為5铆农,輸出通道為2,卷積核寬度為4
conv1 = Conv1d(in_channels=5, out_channels=2, kernel_size=4)
# 卷積核的權(quán)值是隨機(jī)初始化的
print("conv1.weight = ", conv1.weight, conv1.weight.shape)
# 再定義一個一維卷積狡耻,輸入通道為5墩剖,輸出通道為2,卷積核寬度為3
conv2 = Conv1d(in_channels=5, out_channels=2, kernel_size=3)
outputs1 = conv1(inputs)
outputs2 = conv2(inputs)
# 輸出1為2個序列夷狰,兩個序列長度為3岭皂,大小為2
print("outputs1 = ", outputs1, outputs1.shape)
# 輸出2為2個序列,兩個序列長度為4沼头,大小為2
print("outputs2 = ", outputs2, outputs2.shape)
# inputs = tensor([[[0.5801, 0.6436, 0.1947, 0.6487, 0.8968, 0.3009],
# [0.8895, 0.0390, 0.5899, 0.1805, 0.1035, 0.9368],
# [0.1585, 0.8440, 0.8345, 0.0849, 0.4730, 0.5783],
# [0.3659, 0.2716, 0.4990, 0.6657, 0.2565, 0.9945],
# [0.6403, 0.2125, 0.6234, 0.1210, 0.3517, 0.6784]],
# [[0.0855, 0.1844, 0.3558, 0.1458, 0.9264, 0.9538],
# [0.1427, 0.9598, 0.2031, 0.2354, 0.5456, 0.6808],
# [0.8981, 0.6998, 0.1424, 0.7445, 0.3664, 0.9132],
# [0.9393, 0.6905, 0.1617, 0.7266, 0.6220, 0.0726],
# [0.6940, 0.1242, 0.0561, 0.3435, 0.1775, 0.8076]]]) torch.Size([2, 5, 6])
# conv1.weight = Parameter containing:
# tensor([[[ 0.1562, -0.1094, -0.0228, 0.1879],
# [-0.0304, 0.1720, 0.0392, 0.0476],
# [ 0.0479, 0.0050, -0.0942, 0.0502],
# [-0.0905, -0.1414, 0.0421, 0.0708],
# [ 0.0671, 0.2107, 0.1556, 0.1809]],
# [[ 0.0453, 0.0267, 0.0821, 0.0792],
# [ 0.0428, 0.1096, 0.0132, 0.1285],
# [-0.0082, 0.2208, 0.2189, 0.1461],
# [ 0.0550, -0.0019, -0.0607, -0.1238],
# [ 0.0730, 0.1778, -0.0817, 0.2204]]], requires_grad=True) torch.Size([2, 5, 4])
# outputs1 = tensor([[[0.2778, 0.5726, 0.2568],
# [0.6502, 0.7603, 0.6844]],
# ...
# [-0.0940, -0.2529, -0.2081, 0.0786]],
# [[-0.0102, -0.0118, 0.0119, -0.1874],
# [-0.5899, -0.0979, -0.1233, -0.1664]]], grad_fn=<SqueezeBackward1>) torch.Size([2, 2, 4])
from torch.nn import MaxPool1d
# 輸出序列長度3
pool1 = MaxPool1d(3)
# 輸出序列長度4
pool2 = MaxPool1d(4)
outputs_pool1 = pool1(outputs1)
outputs_pool2 = pool2(outputs2)
print(outputs_pool1)
print(outputs_pool2)
# outputs_pool1和outputs_pool2是兩個獨立的張量,需要先用squeeze去掉最後一個長度為1的維度(把2行1列的矩陣變成向量),再用cat拼接起來
outputs_pool_squeeze1 = outputs_pool1.squeeze(dim=2)
print(outputs_pool_squeeze1)
outputs_pool_squeeze2 = outputs_pool2.squeeze(dim=2)
print(outputs_pool_squeeze2)
outputs_pool = torch.cat([outputs_pool_squeeze1, outputs_pool_squeeze2], dim=1)
print(outputs_pool)
# tensor([[[0.5726],
# [0.7603]],
# [[0.4595],
# [0.9858]]], grad_fn=<SqueezeBackward1>)
# tensor([[[-0.0104],
# [ 0.0786]],
# [[ 0.0119],
# [-0.0979]]], grad_fn=<SqueezeBackward1>)
# tensor([[0.5726, 0.7603],
# [0.4595, 0.9858]], grad_fn=<SqueezeBackward1>)
# tensor([[-0.0104, 0.0786],
# [ 0.0119, -0.0979]], grad_fn=<SqueezeBackward1>)
# tensor([[ 0.5726, 0.7603, -0.0104, 0.0786],
# [ 0.4595, 0.9858, 0.0119, -0.0979]], grad_fn=<CatBackward>)
from torch.nn import Linear
linear = Linear(4, 2)
outputs_linear = linear(outputs_pool)
print(outputs_linear)
# tensor([[-0.0555, -0.0656],
# [-0.0428, -0.0303]], grad_fn=<AddmmBackward>)
4.2.3 TextCNN網(wǎng)絡(luò)結(jié)構(gòu)
class TextCNN(nn.Module):
def __init__(self, config):
super(TextCNN, self).__init__()
self.is_training = True
self.dropout_rate = config.dropout_rate
self.num_class = config.num_class
self.use_element = config.use_element
self.config = config
self.embedding = nn.Embedding(num_embeddings=config.vocab_size,
embedding_dim=config.embedding_size)
self.convs = nn.ModuleList([
nn.Sequential(nn.Conv1d(in_channels=config.embedding_size,
out_channels=config.feature_size,
kernel_size=h),
# nn.BatchNorm1d(num_features=config.feature_size),
nn.ReLU(),
nn.MaxPool1d(kernel_size=config.max_text_len-h+1))
for h in config.window_sizes
])
self.fc = nn.Linear(in_features=config.feature_size*len(config.window_sizes),
out_features=config.num_class)
if os.path.exists(config.embedding_path) and config.is_training and config.is_pretrain:
print("Loading pretrain embedding...")
self.embedding.weight.data.copy_(torch.from_numpy(np.load(config.embedding_path)))
def forward(self, x):
embed_x = self.embedding(x)
#print('embed size 1',embed_x.size()) # 32*35*256
# batch_size x text_len x embedding_size -> batch_size x embedding_size x text_len
embed_x = embed_x.permute(0, 2, 1)
#print('embed size 2',embed_x.size()) # 32*256*35
out = [conv(embed_x) for conv in self.convs] #out[i]:batch_size x feature_size*1
#for o in out:
# print('o',o.size()) # 32*100*1
out = torch.cat(out, dim=1) # 沿第2個維度(行)拼接起來,比如5*2*1和5*3*1拼接變成5*5*1
#print(out.size(1)) # 32*400*1
out = out.view(-1, out.size(1))
#print(out.size()) # 32*400
if not self.use_element:
out = F.dropout(input=out, p=self.dropout_rate)
out = self.fc(out)
return out
4.3 RNN
4.3.1 RNN和HMM的區(qū)別
4.3.2 模型實現(xiàn)
from torch.nn import RNN
# 每個時刻輸入大小為4,隱含層大小為5
rnn = RNN(input_size=4, hidden_size=5, batch_first=True)
# 輸入批次大小為2,即有2個序列,序列長度為3,輸入大小為4
inputs = torch.rand(2, 3, 4)
# 得到輸出和更新之后的隱藏狀態(tài)
outputs, hn = rnn(inputs)
print(outputs)
print(hn)
print(outputs.shape, hn.shape)
# tensor([[[-0.1413, 0.1952, -0.2586, -0.4585, -0.4973],
# [-0.3413, 0.3166, -0.2132, -0.5002, -0.2506],
# [-0.0390, 0.1016, -0.1492, -0.4582, -0.0017]],
# [[ 0.1747, 0.2208, -0.1599, -0.4487, -0.1219],
# [-0.1236, 0.1097, -0.2268, -0.4487, -0.0603],
# [ 0.0973, 0.3031, -0.1482, -0.4647, 0.0809]]],
# grad_fn=<TransposeBackward1>)
# tensor([[[-0.0390, 0.1016, -0.1492, -0.4582, -0.0017],
# [ 0.0973, 0.3031, -0.1482, -0.4647, 0.0809]]],
# grad_fn=<StackBackward>)
# torch.Size([2, 3, 5]) torch.Size([1, 2, 5])
import torch
from torch.autograd import Variable
from torch import nn
# 首先建立一個簡單的循環(huán)神經(jīng)網(wǎng)絡(luò):輸入維度為20毡庆, 輸出維度是50坑赡, 兩層的單向網(wǎng)絡(luò)
basic_rnn = nn.RNN(input_size=20, hidden_size=50, num_layers=2)
"""
通過 weight_ih_l0 來訪問第一層中的 w_{ih},因為輸入 x_{t}是20維么抗,輸出是50維毅否,所以w_{ih}是一個50*20維的向量,另外要訪問第
二層網(wǎng)絡(luò)可以使用 weight_ih_l1.對于 w_{hh}蝇刀,可以用 weight_hh_l0來訪問螟加,而 b_{ih}則可以通過 bias_ih_l0來訪問。當(dāng)然可以對它
進(jìn)行自定義的初始化吞琐,只需要記得它們是 Variable捆探,取出它們的data,對它進(jìn)行自定的初始化即可站粟。
"""
print(basic_rnn.weight_ih_l0.size(), basic_rnn.weight_ih_l1.size(), basic_rnn.weight_hh_l0.size())
# 隨機(jī)初始化輸入和隱藏狀態(tài)
toy_input = Variable(torch.randn(3, 1, 20))
h_0 = Variable(torch.randn(2*1, 1, 50))
print(toy_input[0].size())
# 將輸入和初始隱藏狀態(tài)傳入網(wǎng)絡(luò),得到輸出和更新之後的隱藏狀態(tài);本例中輸出形狀為(3, 1, 50)
toy_output, h_n = basic_rnn(toy_input, h_0)
print(toy_output[-1])
print(h_n)
print(h_n[1])
# torch.Size([50, 20]) torch.Size([50, 50]) torch.Size([50, 50])
# torch.Size([1, 20])
# tensor([[-0.5984, -0.3677, 0.0775, 0.2553, 0.1232, -0.1161, -0.2288, 0.1609,
# -0.1241, -0.3501, -0.3164, 0.3403, 0.0332, 0.2511, 0.0951, 0.2445,
# 0.0558, -0.0419, -0.1222, 0.0901, -0.2851, 0.1737, 0.0637, -0.3362,
# -0.1706, 0.2050, -0.3277, -0.2112, -0.4245, 0.0265, -0.0052, -0.4551,
# -0.3270, -0.1220, -0.1531, -0.0151, 0.2504, 0.5659, 0.4878, -0.0656,
# -0.7775, 0.4294, 0.2054, 0.0318, 0.4798, -0.1439, 0.3873, 0.1039,
# 0.1654, -0.5765]], grad_fn=<SelectBackward>)
# tensor([[[ 0.2338, 0.1578, 0.7547, 0.0439, -0.6009, 0.1042, -0.4840,
# -0.1806, -0.2075, -0.2174, 0.2023, 0.3301, -0.1899, 0.1618,
# 0.0790, 0.1213, 0.0053, -0.2586, 0.6376, 0.0315, 0.6949,
# 0.3184, -0.4901, -0.0852, 0.4542, 0.1393, -0.0074, -0.8129,
# -0.1013, 0.0852, 0.2550, -0.4294, 0.2316, 0.0662, 0.0465,
# -0.1976, -0.6093, 0.4097, 0.3909, -0.1091, -0.3569, 0.0366,
# 0.0665, 0.5302, -0.1765, -0.3919, -0.0308, 0.0061, 0.1447,
# 0.2676]],
# [[-0.5984, -0.3677, 0.0775, 0.2553, 0.1232, -0.1161, -0.2288,
# 0.1609, -0.1241, -0.3501, -0.3164, 0.3403, 0.0332, 0.2511,
# 0.0951, 0.2445, 0.0558, -0.0419, -0.1222, 0.0901, -0.2851,
# 0.1737, 0.0637, -0.3362, -0.1706, 0.2050, -0.3277, -0.2112,
# -0.4245, 0.0265, -0.0052, -0.4551, -0.3270, -0.1220, -0.1531,
# -0.0151, 0.2504, 0.5659, 0.4878, -0.0656, -0.7775, 0.4294,
# 0.2054, 0.0318, 0.4798, -0.1439, 0.3873, 0.1039, 0.1654,
# ...
# -0.1706, 0.2050, -0.3277, -0.2112, -0.4245, 0.0265, -0.0052, -0.4551,
# -0.3270, -0.1220, -0.1531, -0.0151, 0.2504, 0.5659, 0.4878, -0.0656,
# -0.7775, 0.4294, 0.2054, 0.0318, 0.4798, -0.1439, 0.3873, 0.1039,
# 0.1654, -0.5765]], grad_fn=<SelectBackward>)
初始化時,還可以設(shè)置其他網(wǎng)絡(luò)參數(shù),如bidirectional=True、num_layers等
4.3.3 LSTM
from torch.nn import LSTM
lstm = LSTM(input_size=4, hidden_size=5, batch_first=True)
inputs = torch.rand(2, 3, 4)
# outputs為整個輸出序列的隱含層,hn為最後一個時刻的隱含層,cn為最後一個時刻的記憶細(xì)胞
outputs, (hn, cn) = lstm(inputs)
# 輸出兩個序列,每個序列長度為3,大小為5
print(outputs)
print(hn)
print(cn)
# 輸出隱含層序列以及最後一個時刻隱含層和記憶細(xì)胞的形狀
print(outputs.shape, hn.shape, cn.shape)
# tensor([[[-0.1102, 0.0568, 0.0929, 0.0579, -0.1300],
# [-0.2051, 0.0829, 0.0245, 0.0202, -0.2124],
# [-0.2509, 0.0854, 0.0882, -0.0272, -0.2385]],
# [[-0.1302, 0.0804, 0.0200, 0.0543, -0.1033],
# [-0.2794, 0.0736, 0.0247, -0.0406, -0.2233],
# [-0.2913, 0.1044, 0.0407, 0.0044, -0.2345]]],
# grad_fn=<TransposeBackward0>)
# tensor([[[-0.2509, 0.0854, 0.0882, -0.0272, -0.2385],
# [-0.2913, 0.1044, 0.0407, 0.0044, -0.2345]]],
# grad_fn=<StackBackward>)
# tensor([[[-0.3215, 0.2153, 0.1180, -0.0568, -0.4162],
# [-0.3982, 0.2704, 0.0568, 0.0097, -0.3959]]],
# grad_fn=<StackBackward>)
# torch.Size([2, 3, 5]) torch.Size([1, 2, 5]) torch.Size([1, 2, 5])
4.4 注意力模型
seq2seq這樣的模型有一個基本假設(shè):原始序列的最後一個隱含狀態(tài)(一個向量)包含了該序列的全部信息。這一假設(shè)顯然不合理,當(dāng)序列比較長時尤其困難,所以有了注意力模型
4.4.1 注意力機(jī)制
$$
\operatorname{attn}(\boldsymbol{q}, \boldsymbol{k})=\left\{\begin{array}{ll}
\boldsymbol{w}^{\top} \tanh (\boldsymbol{W}[\boldsymbol{q} ; \boldsymbol{k}]) & \text{多層感知器} \\
\boldsymbol{q}^{\top} \boldsymbol{W} \boldsymbol{k} & \text{雙線性} \\
\boldsymbol{q}^{\top} \boldsymbol{k} & \text{點積} \\
\dfrac{\boldsymbol{q}^{\top} \boldsymbol{k}}{\sqrt{d}} & \text{縮放點積,避免因向量維度}d\text{過大導(dǎo)致點積結(jié)果過大}
\end{array}\right.
$$
4.4.2 自注意力模型
具體地,假設(shè)輸入為由n個向量組成的序列 x_1, x_2, …, x_n,輸出為每個向量對應(yīng)的新的向量表示 y_1, y_2, …, y_n,其中所有向量的大小均為d。那么,y_i 的計算公式為
$$
\boldsymbol{y}_{i}=\sum_{j=1}^{n} \alpha_{i j} \boldsymbol{x}_{j}
$$
式中,j是整個序列的索引值;α_ij 是 x_i 與 x_j 之間的注意力(權(quán)重),其通過attn函數(shù)計算,然後再經(jīng)過softmax函數(shù)進(jìn)行歸一化後獲得。直觀上的含義是:如果 x_i 與 x_j 越相關(guān),則它們計算的注意力值就越大,那么 x_j 對 x_i 對應(yīng)的新的表示 y_i 的貢獻(xiàn)就越大
通過自注意力機(jī)制,可以直接計算兩個距離較遠(yuǎn)的時刻之間的關(guān)系。而在循環(huán)神經(jīng)網(wǎng)絡(luò)中,由於信息是沿著時刻逐層傳遞的,因此當(dāng)兩個相關(guān)性較大的時刻距離較遠(yuǎn)時,會產(chǎn)生較大的信息損失。雖然引入了門控機(jī)制模型(如LSTM等),但也只是治標(biāo)不治本。因此,基於自注意力機(jī)制的自注意力模型已經(jīng)逐步取代循環(huán)神經(jīng)網(wǎng)絡(luò),成為自然語言處理的標(biāo)準(zhǔn)模型
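下面用縮放點積注意力給出自注意力的一個極簡實現(xiàn)示意(自擬代碼,非原書實現(xiàn);Q、K、V由同一輸入經(jīng)三個參數(shù)矩陣線性變換得到):
```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    """x: (batch, seq_len, d);wq/wk/wv: (d, d) 的參數(shù)矩陣"""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    # 縮放點積注意力:alpha[i][j] 表示第i個位置對第j個位置的關(guān)注程度
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, seq_len, seq_len)
    alpha = F.softmax(scores, dim=-1)
    return alpha @ v                              # 每個位置的新表示是所有位置的加權(quán)和

x = torch.rand(2, 5, 16)
wq, wk, wv = (torch.rand(16, 16) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)        # torch.Size([2, 5, 16])
```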
4.4.3 Transformer
4.4.3.1 融入位置信息
兩種方式,位置嵌入(Position Embedding)和位置編碼(Position Encodings)
- 位置嵌入與詞嵌入類似
- 位置編碼是將位置索引值通過函數(shù)映射到一個d維向量(一種常用的函數(shù)映射見下方公式)
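位置編碼常用的函數(shù)映射之一是原始Transformer論文中的正余弦編碼(列出以供參考,原書符號可能略有不同):
$$
\mathrm{PE}_{(pos,\, 2i)}=\sin \left(\frac{pos}{10000^{2 i / d}}\right), \quad
\mathrm{PE}_{(pos,\, 2i+1)}=\cos \left(\frac{pos}{10000^{2 i / d}}\right)
$$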
4.4.3.2 Transformer塊(Block)
包含自注意力、層歸一化(Layer Normalization)、殘差連接(Residual Connections)等組件
4.4.3.3 自注意力計算結(jié)果互斥
自注意力結(jié)果需要經(jīng)過softmax歸一化,導(dǎo)致即使一個輸入和多個其他的輸入相關(guān),也無法同時為這些輸入賦予較大的注意力值,即自注意力結(jié)果之間是互斥的,無法同時關(guān)注多個輸入,所以使用多頭自注意力(Multi-Head Self-Attention)模型
4.4.3.4 Transformer模型的優(yōu)缺點
優(yōu)點:與循環(huán)神經(jīng)網(wǎng)絡(luò)相比,Transformer能夠直接建模輸入序列單元之間更長距離的依賴關(guān)系,從而使得Transformer對於長序列建模的能力更強(qiáng)。另外,在Transformer的編碼階段,可以利用GPU等多核計算設(shè)備并行地計算Transformer塊內(nèi)部的自注意力模型,而循環(huán)神經(jīng)網(wǎng)絡(luò)需要逐個時刻計算,因此Transformer具有更高的訓(xùn)練速度。
缺點:與循環(huán)神經(jīng)網(wǎng)絡(luò)相比,Transformer的一個明顯缺點是參數(shù)量過於龐大。每一層Transformer塊的大部分參數(shù)集中在自注意力模型中輸入向量的三個角色映射矩陣、多頭機(jī)制導(dǎo)致的參數(shù)倍增以及引入非線性的多層感知器等。
更主要的是,還需要堆疊多層Transformer塊,參數(shù)量又?jǐn)U大多倍,最終導(dǎo)致一個實用的Transformer模型含有巨大的參數(shù)量。巨大的參數(shù)量導(dǎo)致Transformer模型非常不容易訓(xùn)練,尤其是當(dāng)訓(xùn)練數(shù)據(jù)較小時
4.5 神經(jīng)網(wǎng)絡(luò)模型的訓(xùn)練
4.5.1 損失函數(shù)
均方誤差MSE(Mean Squared Error)和交叉熵?fù)p失CE(Cross-Entropy)
4.5.2 小批次梯度下降
import torch
from torch import nn, optim
from torch.nn import functional as F
class MLP(nn.Module):
def __init__(self, input_dim, hidden_dim, num_class):
super(MLP, self).__init__()
self.linear1 = nn.Linear(input_dim, hidden_dim)
self.activate = F.relu
self.linear2 = nn.Linear(hidden_dim, num_class)
def forward(self, inputs):
hidden = self.linear1(inputs)
activation = self.activate(hidden)
outputs = self.linear2(activation)
log_probs = F.log_softmax(outputs, dim=1)
return log_probs
# 異或問題的4個輸入
x_train = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
# 每個輸入對應(yīng)的輸出類別
y_train = torch.tensor([0, 1, 1, 0])
# 創(chuàng)建多層感知器模型川慌,輸入層大小為2,隱含層大小為5祠乃,輸出層大小為2(即有兩個類別)
model = MLP(input_dim=2, hidden_dim=5, num_class=2)
criterion = nn.NLLLoss() # 當(dāng)使用log_softmax輸出時,需要使用負(fù)對數(shù)似然損失(Negative Log Likelihood,NLL)
optimizer = optim.SGD(model.parameters(), lr=0.05)
for epoch in range(500):
y_pred = model(x_train)
loss = criterion(y_pred, y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("Parameters:")
for name, param in model.named_parameters():
print (name, param.data)
y_pred = model(x_train)
print("y_pred = ", y_pred)
print("Predicted results:", y_pred.argmax(axis=1))
# 結(jié)果:
# Parameters:
# linear1.weight tensor([[-0.4509, -0.5591],
# [-1.2904, 1.2947],
# [ 0.8418, 0.8424],
# [-0.4408, -0.1356],
# [ 1.2886, -1.2879]])
# linear1.bias tensor([ 4.5582e-01, -2.5727e-03, -8.4167e-01, -1.7634e-03, -1.5244e-04])
# linear2.weight tensor([[ 0.5994, -1.4792, 1.0836, -0.2860, -1.0873],
# [-0.2534, 0.9911, -0.7348, 0.0413, 1.3398]])
# linear2.bias tensor([ 0.7375, -0.1796])
# y_pred = tensor([[-0.2398, -1.5455],
# [-2.3716, -0.0980],
# [-2.3101, -0.1045],
# [-0.0833, -2.5269]], grad_fn=<LogSoftmaxBackward>)
# Predicted results: tensor([0, 1, 1, 0])
注:
- nn.Linear可理解為一個由input_dim個神經(jīng)元和output_dim個神經(jīng)元構(gòu)成的兩層全連接網(wǎng)絡(luò)
- argmax(axis=1)函數(shù)找到第2維方向上最大值的索引(對於二維張量,即在每一行內(nèi)沿列方向比較)
- 可以將輸出層的log_softmax去掉,改用CrossEntropyLoss作為損失函數(shù),其在計算損失時會自動進(jìn)行softmax計算;這樣在模型預(yù)測時可以提高速度,因為不必進(jìn)行softmax運算,直接將輸出分?jǐn)?shù)最高的類別作為預(yù)測結(jié)果即可
- 除了SGD還有Adam、Adagrad等,這些優(yōu)化器是對原始梯度下降的改進(jìn),改進(jìn)思路包括動態(tài)調(diào)整學(xué)習(xí)率、對梯度進(jìn)行積累等
4.6 情感分類實戰(zhàn)
4.6.1 詞表映射
from collections import defaultdict
class Vocab:
def __init__(self, tokens=None):
self.idx_to_token = list()
self.token_to_idx = dict()
if tokens is not None:
if "<unk>" not in tokens:
tokens = tokens + ["<unk>"]
for token in tokens:
self.idx_to_token.append(token)
self.token_to_idx[token] = len(self.idx_to_token) - 1
self.unk = self.token_to_idx['<unk>']
@classmethod
def build(cls, text, min_freq=1, reserved_tokens=None):
token_freqs = defaultdict(int)
for sentence in text:
for token in sentence:
token_freqs[token] += 1
uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
return cls(uniq_tokens)
def __len__(self):
# 返回詞表的大小
return len(self.idx_to_token)
def __getitem__(self, token):
# 查找輸入標(biāo)記對應(yīng)的索引值,如果該標(biāo)記不存在,則返回<unk>的索引值(0)
return self.token_to_idx.get(token, self.unk)
def convert_tokens_to_ids(self, tokens):
return [self[token] for token in tokens]
def convert_ids_to_tokens(self, indices):
return [self.idx_to_token[index] for index in indices]
注:@classmethod表示的是類方法
4.6.2 詞向量層
# 詞表大小為8,向量維度為3
embedding = nn.Embedding(8, 3)
input = torch.tensor([[0, 1, 2, 1], [4, 6, 6, 7]], dtype=torch.long) # torch.long = torch.int64
output = embedding(input)
output
# 即在原始輸入后增加了一個長度為3的維
# 結(jié)果:
# tensor([[[ 0.1747, 0.7580, 0.3107],
# [ 0.1595, 0.9152, 0.2757],
# [ 1.0136, -0.5204, 1.0620],
# [ 0.1595, 0.9152, 0.2757]],
# [[-0.9784, -0.3794, 1.2752],
# [-0.4441, -0.2990, 1.0913],
# [-0.4441, -0.2990, 1.0913],
# [ 2.0153, -1.0434, -0.9038]]], grad_fn=<EmbeddingBackward>)
4.6.3 融入詞向量的MLP
import torch
from torch import nn
from torch.nn import functional as F
class MLP(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_class):
super(MLP, self).__init__()
# 詞向量層
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# 線性變換:詞向量層 -> 隱含層
self.linear1 = nn.Linear(embedding_dim, hidden_dim)
self.activate = F.relu
# 線性變換:激活層 -> 輸出層
self.linear2 = nn.Linear(hidden_dim, num_class)
def forward(self, inputs):
embeddings = self.embedding(inputs)
# 將序列中多個Embedding進(jìn)行聚合(求平均)
embedding = embeddings.mean(dim=1)
hidden = self.activate(self.linear1(embedding))
outputs = self.linear2(hidden)
# 獲得每個序列屬于某一個類別概率的對數(shù)值
probs = F.log_softmax(outputs, dim=1)
return probs
mlp = MLP(vocab_size=8, embedding_dim=3, hidden_dim=5, num_class=2)
inputs = torch.tensor([[0, 1, 2, 1], [4, 6, 6, 7]], dtype=torch.long)
outputs = mlp(inputs)
print(outputs)
# 結(jié)果:
# tensor([[-0.6600, -0.7275],
# [-0.6108, -0.7828]], grad_fn=<LogSoftmaxBackward>)
4.6.4 文本長度統(tǒng)一
input1 = torch.tensor([0, 1, 2, 1], dtype=torch.long)
input2 = torch.tensor([2, 1, 3, 7, 5], dtype=torch.long)
input3 = torch.tensor([6, 4, 2], dtype=torch.long)
input4 = torch.tensor([1, 3, 4, 3, 5, 7], dtype=torch.long)
inputs = [input1, input2, input3, input4]
offsets = [0] + [i.shape[0] for i in inputs]
print(offsets)
# cumsum沿維度累加,即0, 0+4=4, 4+5=9, 9+3=12
offsets = torch.tensor(offsets[: -1]).cumsum(dim=0)
print(offsets)
inputs = torch.cat(inputs)
print(inputs)
embeddingbag = nn.EmbeddingBag(num_embeddings=8, embedding_dim=3)
embeddings = embeddingbag(inputs, offsets)
print(embeddings)
# 結(jié)果:
# [0, 4, 5, 3, 6]
# tensor([ 0, 4, 9, 12])
# tensor([0, 1, 2, 1, 2, 1, 3, 7, 5, 6, 4, 2, 1, 3, 4, 3, 5, 7])
# tensor([[-0.6750, 0.8048, -0.1771],
# [ 0.2023, -0.1735, 0.2372],
# [ 0.4699, -0.2902, 0.3136],
# [ 0.2327, -0.2667, 0.0326]], grad_fn=<EmbeddingBagBackward>)
4.6.5 數(shù)據(jù)處理
def load_sentence_polarity():
from nltk.corpus import sentence_polarity
vocab = Vocab.build(sentence_polarity.sents())
train_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][: 4000] \
+ [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][: 4000]
test_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][4000: ] \
+ [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][4000: ]
return train_data, test_data, vocab
train_data, test_data, vocab = load_sentence_polarity()
4.6.5.1 構(gòu)建DataLoader對象
from torch.utils.data import DataLoader, Dataset
data_loader = DataLoader(
dataset,
batch_size=64,
collate_fn=collate_fn,
shuffle=True
)
# dataset為Dataset類的一個對象励稳,用于存儲數(shù)據(jù)
class BowDataset(Dataset):
def __init__(self, data):
# data為原始的數(shù)據(jù)佃乘,如使用load_sentence_polarity函數(shù)生成的訓(xùn)練數(shù)據(jù)
self.data = data
def __len__(self):
return len(self.data)
def __getitem__(self, i):
# 返回下標(biāo)為i的樣例
return self.data[i]
# collate_fn參數(shù)指向一個函數(shù),用於對一個批次的樣本進(jìn)行整理,如將其轉(zhuǎn)換為張量等
def collate_fn(examples):
# 從獨立樣本集合中構(gòu)建各批次的輸入輸出
inputs = [torch.tensor(ex[0]) for ex in examples]
targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
offsets = [0] + [i.shape[0] for i in inputs]
offsets = torch.tensor(offsets[: -1]).cumsum(dim=0)
inputs = torch.cat(inputs)
return inputs, offsets, targets
4.6.6 MLP的訓(xùn)練和測試
# tqdm 進(jìn)度條
from tqdm.auto import tqdm
import torch
from torch import nn, optim
from torch.nn import functional as F
class MLP(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_class):
super(MLP, self).__init__()
# 詞向量層
self.embedding = nn.EmbeddingBag(vocab_size, embedding_dim)
# 線性變換:詞向量層 -> 隱含層
self.linear1 = nn.Linear(embedding_dim, hidden_dim)
self.activate = F.relu
# 線性變換:激活層 -> 輸出層
self.linear2 = nn.Linear(hidden_dim, num_class)
def forward(self, inputs, offsets):
embedding = self.embedding(inputs, offsets)
hidden = self.activate(self.linear1(embedding))
outputs = self.linear2(hidden)
# 獲得每個序列屬于某一個類別概率的對數(shù)值
probs = F.log_softmax(outputs, dim=1)
return probs
embedding_dim = 128
hidden_dim = 256
num_class = 2
batch_size = 32
num_epoch = 5
# 加載數(shù)據(jù)
train_data, test_data, vocab = load_sentence_polarity()
train_data = BowDataset(train_data)
test_data = BowDataset(test_data)
train_data_loader = DataLoader(train_data, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_data_loader = DataLoader(test_data, batch_size=1, collate_fn=collate_fn, shuffle=False)
# 加載模型
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MLP(len(vocab), embedding_dim, hidden_dim, num_class)
model.to(device)
# 訓(xùn)練
nll_loss = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()
for epoch in range(num_epoch):
total_loss = 0
for batch in tqdm(train_data_loader, desc=f"Training Epoch {epoch + 1}"):
inputs, offsets, targets = [x.to(device) for x in batch]
log_probs = model(inputs, offsets)
loss = nll_loss(log_probs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Loss:{total_loss:.2f}")
# 測試
acc = 0
for batch in tqdm(test_data_loader, desc=f"Testing"):
inputs, offsets, targets = [x.to(device) for x in batch]
with torch.no_grad():
output = model(inputs, offsets)
acc += (output.argmax(dim=1) == targets).sum().item()
print(f"Acc: {acc / len(test_data_loader):.2f}")
# 結(jié)果:
# Training Epoch 1: 100%|██████████| 250/250 [00:03<00:00, 64.04it/s]
# Training Epoch 2: 100%|██████████| 250/250 [00:04<00:00, 55.40it/s]
# Training Epoch 3: 100%|██████████| 250/250 [00:03<00:00, 82.54it/s]
# Training Epoch 4: 100%|██████████| 250/250 [00:03<00:00, 73.36it/s]
# Training Epoch 5: 100%|██████████| 250/250 [00:03<00:00, 72.61it/s]
# Testing: 33%|███▎ | 879/2662 [00:00<00:00, 4420.03it/s]
# Loss:45.66
# Testing: 100%|██████████| 2662/2662 [00:00<00:00, 4633.54it/s]
# Acc: 0.73
4.6.7 基于CNN的情感分類
復(fù)習(xí)一下conv1d
由於MLP詞袋模型表示文本時,只考慮文本中的詞語信息,忽略了詞組信息;卷積可以提取詞組信息,例如卷積核寬度為2時,就可以提取“不 喜歡”這樣的特征
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import defaultdict
# 進(jìn)度條
from tqdm.auto import tqdm
class Vocab:
def __init__(self, tokens=None):
self.idx_to_token = list()
self.token_to_idx = dict()
if tokens is not None:
if "<unk>" not in tokens:
tokens = tokens + ["<unk>"]
for token in tokens:
self.idx_to_token.append(token)
self.token_to_idx[token] = len(self.idx_to_token) - 1
self.unk = self.token_to_idx['<unk>']
@classmethod
def build(cls, text, min_freq=1, reserved_tokens=None):
token_freqs = defaultdict(int)
for sentence in text:
for token in sentence:
token_freqs[token] += 1
uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
return cls(uniq_tokens)
def __len__(self):
# 返回詞表的大小
return len(self.idx_to_token)
def __getitem__(self, token):
# 查找輸入標(biāo)記對應(yīng)的索引值,如果該標(biāo)記不存在,則返回<unk>的索引值(0)
return self.token_to_idx.get(token, self.unk)
def convert_tokens_to_ids(self, tokens):
return [self[token] for token in tokens]
def convert_ids_to_tokens(self, indices):
return [self.idx_to_token[index] for index in indices]
class CnnDataset(Dataset):
def __init__(self, data):
self.data = data
def __len__(self):
return len(self.data)
def __getitem__(self, i):
return self.data[i]
def collate_fn(examples):
inputs = [torch.tensor(ex[0]) for ex in examples]
targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
# 對batch內(nèi)的樣本進(jìn)行padding,使其具有相同長度
inputs = pad_sequence(inputs, batch_first=True)
return inputs, targets
class CNN(nn.Module):
def __init__(self, vocab_size, embedding_dim, filter_size, num_filter, num_class):
super(CNN, self).__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.conv1d = nn.Conv1d(embedding_dim, num_filter, filter_size, padding=1)
self.activate = F.relu
self.linear = nn.Linear(num_filter, num_class)
def forward(self, inputs): # inputs: (32, 47),即32個長度為47的序列
embedding = self.embedding(inputs) # embedding: (32, 47, 128),相當(dāng)於在原有維度上增加了一個詞向量維度
convolution = self.activate(self.conv1d(embedding.permute(0, 2, 1))) # convolution: (32, 100, 47)
pooling = F.max_pool1d(convolution, kernel_size=convolution.shape[2]) # pooling: (32, 100, 1)
pooling_squeeze = pooling.squeeze(dim=2) # pooling_squeeze: (32, 100)
outputs = self.linear(pooling_squeeze) # outputs: (32, 2)
log_probs = F.log_softmax(outputs, dim=1) # log_probs: (32, 2)
return log_probs
def load_sentence_polarity():
from nltk.corpus import sentence_polarity
vocab = Vocab.build(sentence_polarity.sents())
train_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][: 4000] \
+ [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][: 4000]
test_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][4000: ] \
+ [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][4000: ]
return train_data, test_data, vocab
#超參數(shù)設(shè)置
embedding_dim = 128
hidden_dim = 256
num_class = 2
batch_size = 32
num_epoch = 5
filter_size = 3
num_filter = 100
#加載數(shù)據(jù)
train_data, test_data, vocab = load_sentence_polarity()
train_dataset = CnnDataset(train_data)
test_dataset = CnnDataset(test_data)
train_data_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_data_loader = DataLoader(test_dataset, batch_size=1, collate_fn=collate_fn, shuffle=False)
#加載模型
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CNN(len(vocab), embedding_dim, filter_size, num_filter, num_class)
model.to(device) #將模型加載到CPU或GPU設(shè)備
#訓(xùn)練過程
nll_loss = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001) #使用Adam優(yōu)化器
model.train()
for epoch in range(num_epoch):
total_loss = 0
for batch in tqdm(train_data_loader, desc=f"Training Epoch {epoch + 1}"):
inputs, targets = [x.to(device) for x in batch]
log_probs = model(inputs)
loss = nll_loss(log_probs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Loss: {total_loss:.2f}")
#測試過程
acc = 0
for batch in tqdm(test_data_loader, desc=f"Testing"):
inputs, targets = [x.to(device) for x in batch]
with torch.no_grad():
output = model(inputs)
acc += (output.argmax(dim=1) == targets).sum().item()
#輸出在測試集上的準(zhǔn)確率
print(f"Acc: {acc / len(test_data_loader):.2f}")
# 結(jié)果:
# Training Epoch 1: 100%|██████████| 250/250 [00:06<00:00, 36.27it/s]
# Loss: 165.55
# Training Epoch 2: 100%|██████████| 250/250 [00:08<00:00, 31.13it/s]
# Loss: 122.83
# Training Epoch 3: 100%|██████████| 250/250 [00:06<00:00, 36.45it/s]
# Loss: 76.39
# Training Epoch 4: 100%|██████████| 250/250 [00:06<00:00, 41.66it/s]
# Loss: 33.92
# Training Epoch 5: 100%|██████████| 250/250 [00:06<00:00, 39.79it/s]
# Loss: 12.04
# Testing: 100%|██████████| 2662/2662 [00:00<00:00, 2924.88it/s]
#
# Acc: 0.72
4.6.8 基于Transformer的情感分類
!!!
<p style="color:red">需要重新研讀</p>
!!!
4.7 詞性標(biāo)注實戰(zhàn)
!!!
<p style="color:red">需要重新研讀</p>
!!!
4.8 習(xí)題
5 靜態(tài)詞向量預(yù)訓(xùn)練模型
5.1 神經(jīng)網(wǎng)絡(luò)語言模型
N-gram語言模型存在明顯的缺點:
- 容易受數(shù)據(jù)稀疏的影響,一般需要平滑處理
- 無法對超過N的上下文依賴關(guān)系進(jìn)行建模
所以,基於神經(jīng)網(wǎng)絡(luò)的語言模型(如RNN、Transformer等)幾乎替代了N-gram語言模型
5.1.1 預(yù)訓(xùn)練任務(wù)
監(jiān)督信號來自於數(shù)據(jù)自身,這種學(xué)習(xí)方式稱為自監(jiān)督學(xué)習(xí)
5.1.1.1 前饋神經(jīng)網(wǎng)絡(luò)語言模型
(1)輸入層
由當(dāng)前時刻t的歷史詞序列構(gòu)成,可以用獨熱編碼表示,也可以用位置下標(biāo)表示
(2)詞向量層
將每個詞用低維、稠密的實值向量表示;輸入x為歷史序列各詞的詞向量拼接後的結(jié)果,詞向量矩陣記為E
(3)隱含層
設(shè)W^hid為輸入層到隱含層之間的線性變換矩陣,b^hid為偏置,隱含層可以表示為 h = f(W^hid·x + b^hid),其中f為激活函數(shù)
(4)輸出層
輸出層經(jīng)過線性變換和softmax歸一化,所以語言模型 P(w_t | w_{t-n+1:t-1}) 即輸出向量中w_t對應(yīng)的概率(完整公式見下方補(bǔ)充)
參數(shù)量為:
詞向量參數(shù) + 隱含層參數(shù) + 輸出層參數(shù),即 |V|·d + (n-1)·d·m + m·|V|(另加相應(yīng)偏置)
m和d是常數(shù),模型的自由參數(shù)數(shù)量隨詞表大小呈線性增長社裆,且歷史詞數(shù)n的增大并不會顯著增加參數(shù)的數(shù)量
注:語言模型訓(xùn)練完成后的矩陣E為預(yù)訓(xùn)練得到的靜態(tài)詞向量
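把上面各層缺失的公式按前饋神經(jīng)網(wǎng)絡(luò)語言模型的通用形式補(bǔ)全(激活函數(shù)f等細(xì)節(jié)以原書為準(zhǔn)):
$$
\begin{aligned}
\boldsymbol{x} &=\left[\boldsymbol{v}_{w_{t-n+1}} ; \cdots ; \boldsymbol{v}_{w_{t-1}}\right] \\
\boldsymbol{h} &=f\left(\boldsymbol{W}^{\mathrm{hid}} \boldsymbol{x}+\boldsymbol{b}^{\mathrm{hid}}\right) \\
P\left(w_{t} \mid w_{t-n+1: t-1}\right) &=\operatorname{softmax}\left(\boldsymbol{W}^{\mathrm{out}} \boldsymbol{h}+\boldsymbol{b}^{\mathrm{out}}\right)_{w_{t}}
\end{aligned}
$$
其中詞向量維度為d、隱含層維度為m、詞表大小為|V|,故參數(shù)量約為 $|\mathbb{V}| d+(n-1) d m+m|\mathbb{V}|$(另加偏置),隨|V|線性增長,隨n增長緩慢。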
5.1.1.2 循環(huán)神經(jīng)網(wǎng)絡(luò)語言模型
RNN可以處理不定長的依賴,例如“他喜歡吃蘋果”中“吃”依賴較近的上文,而“他感冒了,於是下班之後去了醫(yī)院”中“醫(yī)院”則依賴較遠(yuǎn)處的“感冒”
(1)輸入層
由當(dāng)前時刻t之前的歷史詞序列構(gòu)成,可以用獨熱編碼表示,也可以用位置下標(biāo)表示
(2)詞向量層
t時刻的輸入由前一個詞w_{t-1}的詞向量和t-1時刻的隱含狀態(tài)h_{t-1}組成
(3)隱含層
h_t = tanh(W^xh·v_{w_{t-1}} + W^hh·h_{t-1} + b),其中W^xh和W^hh分別是詞向量v_{w_{t-1}}、前一時刻隱含狀態(tài)h_{t-1}與隱含層之間的權(quán)值矩陣,公式中常常將兩者區(qū)分開寫
(4)輸出層
所以RNN當(dāng)序列較長時,訓(xùn)練中存在梯度消失或梯度爆炸的問題,以前的做法是在反向傳播過程中按長度進(jìn)行截斷,從而得到有效的訓(xùn)練,現(xiàn)在一般用LSTM和Transformer替代
5.1.2 模型實現(xiàn)
5.1.2.1 前饋神經(jīng)網(wǎng)絡(luò)語言模型
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence
from collections import defaultdict
# 進(jìn)度條
from tqdm.auto import tqdm
import nltk
BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"
nltk.download('reuters')
nltk.download('punkt')
# from zipfile import ZipFile
# file_loc = '/root/nltk_data/corpora/reuters.zip'
# with ZipFile(file_loc, 'r') as z:
# z.extractall('/root/nltk_data/corpora/')
# file_loc = '/root/nltk_data/corpora/punkt.zip'
# with ZipFile(file_loc, 'r') as z:
# z.extractall('/root/nltk_data/corpora/')
class Vocab:
def __init__(self, tokens=None):
self.idx_to_token = list()
self.token_to_idx = dict()
if tokens is not None:
if "<unk>" not in tokens:
tokens = tokens + ["<unk>"]
for token in tokens:
self.idx_to_token.append(token)
self.token_to_idx[token] = len(self.idx_to_token) - 1
self.unk = self.token_to_idx['<unk>']
@classmethod
def build(cls, text, min_freq=1, reserved_tokens=None):
token_freqs = defaultdict(int)
for sentence in text:
for token in sentence:
token_freqs[token] += 1
uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
return cls(uniq_tokens)
def __len__(self):
# 返回詞表的大小
return len(self.idx_to_token)
def __getitem__(self, token):
# 查找輸入標(biāo)記對應(yīng)的索引值,如果該標(biāo)記不存在,則返回<unk>的索引值(0)
return self.token_to_idx.get(token, self.unk)
def convert_tokens_to_ids(self, tokens):
return [self[token] for token in tokens]
def convert_ids_to_tokens(self, indices):
return [self.idx_to_token[index] for index in indices]
def get_loader(dataset, batch_size, shuffle=True):
data_loader = DataLoader(
dataset,
batch_size=batch_size,
collate_fn=dataset.collate_fn,
shuffle=shuffle
)
return data_loader
# 讀取Reuters語料庫
def load_reuters():
from nltk.corpus import reuters
text = reuters.sents()
text = [[word.lower() for word in sentence]for sentence in text]
vocab = Vocab.build(text, reserved_tokens=[PAD_TOKEN, BOS_TOKEN, EOS_TOKEN])
corpus = [vocab.convert_tokens_to_ids(sentence) for sentence in text]
return corpus, vocab
# 保存詞向量
def save_pretrained(vocab, embeds, save_path):
"""
Save pretrained token vectors in a unified format, where the first line
specifies the `number_of_tokens` and `embedding_dim` followed with all
token vectors, one token per line.
"""
with open(save_path, "w") as writer:
writer.write(f"{embeds.shape[0]} {embeds.shape[1]}\n")
for idx, token in enumerate(vocab.idx_to_token):
vec = " ".join(["{:.4f}".format(x) for x in embeds[idx]])
writer.write(f"{token} {vec}\n")
print(f"Pretrained embeddings saved to: {save_path}")
# Dataset類
class NGramDataset(Dataset):
def __init__(self, corpus, vocab, context_size=2):
self.data = []
self.bos = vocab[BOS_TOKEN] # 句首標(biāo)記id
self.eos = vocab[EOS_TOKEN] # 句尾標(biāo)記id
for sentence in tqdm(corpus, desc="Data Construction"):
# 插入句首坞靶、句尾標(biāo)記符
sentence = [self.bos] + sentence + [self.eos]
# 如句子長度小于預(yù)定義的上下文大小、則跳過
if len(sentence) < context_size:
continue
for i in range(context_size, len(sentence)):
# 模型輸入:長度為context_size的上下文
context = sentence[i-context_size: i]
# 當(dāng)前詞
target = sentence[i]
# 每個訓(xùn)練樣本由(context, target)構(gòu)成
self.data.append((context, target))
def __len__(self):
return len(self.data)
def __getitem__(self, i):
return self.data[i]
def collate_fn(self, examples):
# 從獨立樣本集合中構(gòu)建批次的輸入輸出蝴悉,并轉(zhuǎn)換為PyTorch張量類型
inputs = torch.tensor([ex[0] for ex in examples], dtype=torch.long)
targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
return (inputs, targets)
# 模型
class FeedForwardNNLM(nn.Module):
def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim):
super(FeedForwardNNLM, self).__init__()
# 詞向量層
self.embeddings = nn.Embedding(vocab_size, embedding_dim)
self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim)
self.linear2 = nn.Linear(hidden_dim, vocab_size)
self.activate = F.relu
def forward(self, inputs):
embeds = self.embeddings(inputs).view((inputs.shape[0], -1))
hidden = self.activate(self.linear1(embeds))
output = self.linear2(hidden)
log_probs = F.log_softmax(output, dim=1)
return log_probs
# 訓(xùn)練
embedding_dim = 128
hidden_dim = 256
batch_size = 1024
context_size = 3
num_epoch = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
corpus, vocab = load_reuters()
dataset = NGramDataset(corpus, vocab, context_size)
data_loader = get_loader(dataset, batch_size)
nll_loss = nn.NLLLoss()
model = FeedForwardNNLM(len(vocab), embedding_dim, context_size, hidden_dim)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()
total_losses = []
for epoch in range(num_epoch):
total_loss = 0
for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
inputs, targets = [x.to(device) for x in batch]
optimizer.zero_grad()
log_probs = model(inputs)
loss = nll_loss(log_probs, targets)
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Loss: {total_loss:.2f}")
total_losses.append(total_loss)
save_pretrained(vocab, model.embeddings.weight.data, "/home/ffnnlm.vec")
# 結(jié)果
# [nltk_data] Downloading package reuters to /root/nltk_data...
# [nltk_data] Package reuters is already up-to-date!
# [nltk_data] Downloading package punkt to /root/nltk_data...
# [nltk_data] Package punkt is already up-to-date!
# Data Construction: 100%
# 54716/54716 [00:03<00:00, 19224.51it/s]
# Training Epoch 0: 100%
# 1628/1628 [00:35<00:00, 34.02it/s]
# Loss: 8310.34
# Training Epoch 1: 100%
# 1628/1628 [00:36<00:00, 44.29it/s]
# Loss: 6934.16
# Training Epoch 2: 100%
# 1628/1628 [00:36<00:00, 44.31it/s]
# Loss: 6342.58
# Training Epoch 3: 100%
# 1628/1628 [00:37<00:00, 42.65it/s]
# Loss: 5939.16
# Training Epoch 4: 100%
# 1628/1628 [00:37<00:00, 42.70it/s]
# Loss: 5666.03
# Training Epoch 5: 100%
# 1628/1628 [00:38<00:00, 42.76it/s]
# Loss: 5477.37
# Training Epoch 6: 100%
# 1628/1628 [00:38<00:00, 42.18it/s]
# Loss: 5333.53
# Training Epoch 7: 100%
# 1628/1628 [00:38<00:00, 42.44it/s]
# Loss: 5214.55
# Training Epoch 8: 100%
# 1628/1628 [00:38<00:00, 42.16it/s]
# Loss: 5111.15
# Training Epoch 9: 100%
# 1628/1628 [00:38<00:00, 42.21it/s]
# Loss: 5021.05
# Pretrained embeddings saved to: /home/ffnnlm.vec
5.1.2.2 循環(huán)神經(jīng)網(wǎng)絡(luò)語言模型
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import defaultdict
from tqdm.auto import tqdm
import nltk
BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"
nltk.download('reuters')
nltk.download('punkt')
# from zipfile import ZipFile
# file_loc = '/root/nltk_data/corpora/reuters.zip'
# with ZipFile(file_loc, 'r') as z:
# z.extractall('/root/nltk_data/corpora/')
# file_loc = '/root/nltk_data/corpora/punkt.zip'
# with ZipFile(file_loc, 'r') as z:
# z.extractall('/root/nltk_data/corpora/')
class Vocab:
def __init__(self, tokens=None):
self.idx_to_token = list()
self.token_to_idx = dict()
if tokens is not None:
if "<unk>" not in tokens:
tokens = tokens + ["<unk>"]
for token in tokens:
self.idx_to_token.append(token)
self.token_to_idx[token] = len(self.idx_to_token) - 1
self.unk = self.token_to_idx['<unk>']
@classmethod
def build(cls, text, min_freq=1, reserved_tokens=None):
token_freqs = defaultdict(int)
for sentence in text:
for token in sentence:
token_freqs[token] += 1
uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
return cls(uniq_tokens)
def __len__(self):
# 返回詞表的大小
return len(self.idx_to_token)
def __getitem__(self, token):
# 查找輸入標(biāo)記對應(yīng)的索引值,如果該標(biāo)記不存在,則返回<unk>的索引值(0)
return self.token_to_idx.get(token, self.unk)
def convert_tokens_to_ids(self, tokens):
return [self[token] for token in tokens]
def convert_ids_to_tokens(self, indices):
return [self.idx_to_token[index] for index in indices]
def get_loader(dataset, batch_size, shuffle=True):
data_loader = DataLoader(
dataset,
batch_size=batch_size,
collate_fn=dataset.collate_fn,
shuffle=shuffle
)
return data_loader
# 讀取Reuters語料庫
def load_reuters():
from nltk.corpus import reuters
text = reuters.sents()
text = [[word.lower() for word in sentence]for sentence in text]
vocab = Vocab.build(text, reserved_tokens=[PAD_TOKEN, BOS_TOKEN, EOS_TOKEN])
corpus = [vocab.convert_tokens_to_ids(sentence) for sentence in text]
return corpus, vocab
# 保存詞向量
def save_pretrained(vocab, embeds, save_path):
"""
Save pretrained token vectors in a unified format, where the first line
specifies the `number_of_tokens` and `embedding_dim` followed with all
token vectors, one token per line.
"""
with open(save_path, "w") as writer:
writer.write(f"{embeds.shape[0]} {embeds.shape[1]}\n")
for idx, token in enumerate(vocab.idx_to_token):
vec = " ".join(["{:.4f}".format(x) for x in embeds[idx]])
writer.write(f"{token} {vec}\n")
print(f"Pretrained embeddings saved to: {save_path}")
class RnnlmDataset(Dataset):
def __init__(self, corpus, vocab):
self.data = []
self.bos = vocab[BOS_TOKEN]
self.eos = vocab[EOS_TOKEN]
self.pad = vocab[PAD_TOKEN]
for sentence in tqdm(corpus, desc="Dataset Construction"):
# 模型輸入:BOS_TOKEN, w_1, w_2, ..., w_n
input = [self.bos] + sentence
# 模型輸出:w_1, w_2, ..., w_n, EOS_TOKEN
target = sentence + [self.eos]
self.data.append((input, target))
def __len__(self):
return len(self.data)
def __getitem__(self, i):
return self.data[i]
def collate_fn(self, examples):
# 從獨立樣本集合中構(gòu)建batch輸入輸出
inputs = [torch.tensor(ex[0]) for ex in examples]
targets = [torch.tensor(ex[1]) for ex in examples]
# 對batch內(nèi)的樣本進(jìn)行padding拍冠,使其具有相同長度
inputs = pad_sequence(inputs, batch_first=True, padding_value=self.pad)
targets = pad_sequence(targets, batch_first=True, padding_value=self.pad)
return (inputs, targets)
class RNNLM(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim):
super(RNNLM, self).__init__()
# 詞嵌入層
self.embeddings = nn.Embedding(vocab_size, embedding_dim)
# 循環(huán)神經(jīng)網(wǎng)絡(luò):這里使用LSTM
self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
# 輸出層
self.output = nn.Linear(hidden_dim, vocab_size)
def forward(self, inputs):
embeds = self.embeddings(inputs)
# 計算每一時刻的隱含層表示
hidden, _ = self.rnn(embeds)
output = self.output(hidden)
log_probs = F.log_softmax(output, dim=2)
return log_probs
embedding_dim = 64
context_size = 2
hidden_dim = 128
batch_size = 1024
num_epoch = 10
# 讀取文本數(shù)據(jù),構(gòu)建RNNLM訓(xùn)練數(shù)據(jù)集
corpus, vocab = load_reuters()
dataset = RnnlmDataset(corpus, vocab)
data_loader = get_loader(dataset, batch_size)
# 負(fù)對數(shù)似然損失函數(shù)簇抵,忽略pad_token處的損失
nll_loss = nn.NLLLoss(ignore_index=dataset.pad)
# 構(gòu)建RNNLM,并加載至device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = RNNLM(len(vocab), embedding_dim, hidden_dim)
model.to(device)
# 使用Adam優(yōu)化器
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()
for epoch in range(num_epoch):
total_loss = 0
for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
inputs, targets = [x.to(device) for x in batch]
optimizer.zero_grad()
log_probs = model(inputs)
loss = nll_loss(log_probs.view(-1, log_probs.shape[-1]), targets.view(-1))
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Loss: {total_loss:.2f}")
save_pretrained(vocab, model.embeddings.weight.data, "/home/rnnlm.vec")
5.2 Word2vec詞向量
5.2.1 CBOW
用周圍詞預(yù)測中心詞
(1)輸入層
窗口大小為5時,輸入層由4個維度為詞表長度|V|的獨熱表示向量構(gòu)成
(2)詞向量層
輸入層中每個詞的獨熱表示向量經(jīng)由詞向量矩陣E映射至詞向量空間:
詞w對應(yīng)的詞向量即為矩陣E中相應(yīng)位置的列向量,E則為由所有詞向量構(gòu)成的矩陣或查找表。令C_t表示w_t的上下文單詞集合,對C_t中所有詞的詞向量取平均,就得到了w_t的上下文表示
(3)輸出層
令E′為隱含層到輸出層的權(quán)值矩陣,記v′_w為E′中與詞w對應(yīng)的行向量,那么輸出w_t的概率可由上下文表示與v′_{w_t}的點積經(jīng)softmax歸一化計算得到(公式見下方補(bǔ)充)。
在CBOW模型的參數(shù)中,矩陣E(上下文矩陣)和E′(中心詞矩陣)均可作為詞向量矩陣,它們分別描述了詞表中的詞在作為條件上下文或目標(biāo)詞時的不同性質(zhì)。在實際中,通常只用E就能夠滿足應(yīng)用需求,但是在某些任務(wù)中,對兩者進(jìn)行組合得到的向量可能會取得更好的表現(xiàn)
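把上面缺失的公式按CBOW的通用形式補(bǔ)全(記詞向量矩陣為E、輸出矩陣為E′,具體符號以原書為準(zhǔn)):
$$
\boldsymbol{v}_{C_{t}}=\frac{1}{\left|C_{t}\right|} \sum_{w \in C_{t}} \boldsymbol{v}_{w}, \qquad
P\left(w_{t} \mid C_{t}\right)=\frac{\exp \left(\boldsymbol{v}_{C_{t}}^{\top} \boldsymbol{v}_{w_{t}}^{\prime}\right)}{\sum_{w^{\prime} \in \mathbb{V}} \exp \left(\boldsymbol{v}_{C_{t}}^{\top} \boldsymbol{v}_{w^{\prime}}^{\prime}\right)}
$$
式中 $C_t$ 為 $w_t$ 的上下文單詞集合,$\boldsymbol{v}_w$ 取自E,$\boldsymbol{v}'_w$ 取自E′。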
5.2.2 Skip-gram模型
中心詞預(yù)測周圍
過程:用中心詞w_t的詞向量,經(jīng)輸出矩陣預(yù)測其上下文窗口C_t內(nèi)每個詞出現(xiàn)的概率(公式見下方補(bǔ)充)
與CBOW模型類似,Skip-gram模型中的權(quán)值矩陣E(中心詞矩陣)與E′(上下文矩陣)均可作為詞向量矩陣使用
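Skip-gram的預(yù)測過程可按通用形式補(bǔ)全為:
$$
P\left(c \mid w_{t}\right)=\frac{\exp \left(\boldsymbol{v}_{w_{t}}^{\top} \boldsymbol{v}_{c}^{\prime}\right)}{\sum_{w^{\prime} \in \mathbb{V}} \exp \left(\boldsymbol{v}_{w_{t}}^{\top} \boldsymbol{v}_{w^{\prime}}^{\prime}\right)}, \quad c \in C_{t}
$$
即用中心詞 $w_t$ 的詞向量分別預(yù)測其上下文窗口內(nèi)的每個詞 $c$。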
5.2.3 參數(shù)估計
與神經(jīng)網(wǎng)絡(luò)語言模型類似,可以通過優(yōu)化分類損失對CBOW模型和Skip-gram模型進(jìn)行訓(xùn)練,需要估計的參數(shù)為 θ = {E, E′}。例如,給定一段長為T的詞序列 w_1, w_2, …, w_T:
5.2.3.1 CBOW 模型的負(fù)對數(shù)似然損失函數(shù)為:
$$
\mathcal{L}(\theta)=-\sum_{t=1}^{T} \log P\left(w_{t} \mid C_{t}\right)
$$
式中,C_t 表示 w_t 的上下文單詞集合
5.2.3.2 Skip-gram 模型的負(fù)對數(shù)似然損失函數(shù)為:
$$
\mathcal{L}(\theta)=-\sum_{t=1}^{T} \sum_{-k \leqslant j \leqslant k,\, j \neq 0} \log P\left(w_{t+j} \mid w_{t}\right)
$$
5.2.4 負(fù)采樣
負(fù)采樣(Negative Sampling)構(gòu)造了一個新的有監(jiān)督學(xué)習(xí)問題:給定兩個單詞,比如orange和juice,預(yù)測它們是否構(gòu)成一對上下文詞-目標(biāo)詞(context-target),即這兩個詞是否會在一個窗口內(nèi)相鄰出現(xiàn),這是一個二分類問題
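以Skip-gram為例,負(fù)采樣把上述softmax多分類替換為若干個二分類,其目標(biāo)函數(shù)的常見寫法如下(K為負(fù)樣本數(shù),P_n為負(fù)采樣分布,細(xì)節(jié)以原書為準(zhǔn)):
$$
\log \sigma\left(\boldsymbol{v}_{w_{t}}^{\top} \boldsymbol{v}_{c}^{\prime}\right)+\sum_{k=1}^{K} \mathbb{E}_{\tilde{w}_{k} \sim P_{n}(w)}\left[\log \sigma\left(-\boldsymbol{v}_{w_{t}}^{\top} \boldsymbol{v}_{\tilde{w}_{k}}^{\prime}\right)\right]
$$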