《自然語言處理基于預(yù)訓(xùn)練模型的方法》筆記

寫在前面

部分自己手敲代碼:鏈接

封面圖

1 緒論

預(yù)訓(xùn)練(Pre-train)

即首先在一個原任務(wù)上預(yù)先訓(xùn)練一個初始模型,然后在下游任務(wù)(目標(biāo)任務(wù))上繼續(xù)對該模型進(jìn)行精調(diào)(Fine-Tune),從而達(dá)到提高下游任務(wù)準(zhǔn)確率的目的,本質(zhì)上也是一種遷移學(xué)習(xí)(Transfer Learning)

2 自然語言處理基礎(chǔ)

2.1 文本的表示

2.1.1 獨熱表示

One-hot Encoding無法使用余弦函數(shù)計算相似度,同時會造成數(shù)據(jù)稀疏(Data Sparsity)

2.1.2 詞的分布式表示

分布式語義假設(shè):詞的含義可以由其上下文的分布進(jìn)行表示

使得利用上下文共現(xiàn)頻次構(gòu)建的向量能夠體現(xiàn)一定的詞義相似性

2.1.2.1 上下文

可以使用詞在句子中的一個固定窗口內(nèi)的詞作為其上下文,也可以使用所在的文檔本身作為上下文

  • 前者反映詞的局部性質(zhì):具有相似詞法、句法屬性的詞將會具有相似的向量表示
  • 后者更多反映詞代表的主題信息

2.1.2.2 共現(xiàn)頻次作為詞的向量表示的問題

  1. 高頻詞誤導(dǎo)計算結(jié)果
  2. 高階關(guān)系無法反映
  3. 仍有稀疏性問題

例子:“A”與“B”共現(xiàn)過,“B”與“C”共現(xiàn)過,“C”與“D”共現(xiàn)過,只能知道“A”與“C”都和“B”共現(xiàn)過,但“A”與“D”這種高階關(guān)系沒法知曉

2.1.2.3 奇異值分解

可以使用奇異值分解的做法解決共現(xiàn)頻次無法反映詞之間高階關(guān)系的問題

M = U\Sigma V^{T}
奇異值分解后的U的每一行其實表示對應(yīng)詞的d維向量表示,由于U的各列相互正交,則可以認(rèn)為詞表示的每一維表達(dá)了該詞的一種獨立的“潛在語義”

分解結(jié)果中,上下文比較相近的詞在空間上的距離也比較近,這種方法稱為潛在語義分析(Latent Semantic Analysis,LSA)
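
下面是一個利用奇異值分解從共現(xiàn)矩陣中得到低維詞向量的最小示意(玩具共現(xiàn)矩陣,非書中代碼,僅演示思路):

import numpy as np

# 假設(shè)已統(tǒng)計得到詞表["A", "B", "C", "D"]的詞-詞共現(xiàn)矩陣
M = np.array([[0, 2, 0, 0],
              [2, 0, 3, 0],
              [0, 3, 0, 1],
              [0, 0, 1, 0]], dtype=float)

U, S, Vt = np.linalg.svd(M)
d = 2                              # 只保留前2維“潛在語義”
word_vecs = U[:, :d] * S[:d]       # 每一行即為對應(yīng)詞的低維向量表示

def cos_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

# “A”與“D”雖從未直接共現(xiàn),但其低維向量的相似度可以體現(xiàn)高階關(guān)系
print(cos_sim(word_vecs[0], word_vecs[3]))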

2.1.3 詞嵌入表示

經(jīng)常直接簡稱為詞向量,利用自然語言文本中蘊含的自監(jiān)督學(xué)習(xí)信號(即詞與上下文的共現(xiàn)信息)先來預(yù)訓(xùn)練詞向量,往往會獲得更好的效果

2.1.4 詞袋表示

BOW不考慮詞的順序,將文本中全部詞所對應(yīng)的向量表示(可以是獨熱表示,也可以是分布式表示或詞向量)相加,即構(gòu)成了文本的向量表示。如果使用獨熱表示,文本向量的每一維恰好是相應(yīng)的詞在文本中出現(xiàn)的次數(shù)

2.2 自然語言處理任務(wù)

2.2.1 n-gram

句首加上<BOS>,句尾加上<EOS>

2.2.2 平滑

  • 當(dāng)n比較大或者測試句子中含有未登錄詞(Out-Of-Vocabulary,OOV)時,會出現(xiàn)零概率,可以使用加1平滑
    P(w_i) = \frac{C(w_i) + 1}{\sum_{w}(C(w) + 1)} = \frac{C(w_i) + 1}{N + |V|}
  • 當(dāng)訓(xùn)練集較小時,加1會得到過高的概率估計,所以轉(zhuǎn)為加\delta平滑,其中0\le \delta \le 1,例如bigram語言模型,平滑后條件概率為(見下面的代碼示例):
    P(w_i|w_{i-1}) = \frac{C(w_{i-1}w_i) + \delta }{\sum_{w}(C(w_{i-1}w) + \delta )} = \frac{C(w_{i-1}w_i) + \delta }{C(w_{i-1}) + \delta |V|}
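
下面給出加\delta平滑的bigram概率計算的一個簡單示意(語料與\delta取值均為假設(shè)):

from collections import Counter

corpus = [["<BOS>", "我", "愛", "自然", "語言", "<EOS>"],
          ["<BOS>", "我", "愛", "北京", "<EOS>"]]
unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
V = len(unigram)
delta = 0.5

def p_bigram(w_prev, w):
    # P(w|w_prev) = (C(w_prev, w) + delta) / (C(w_prev) + delta * |V|)
    return (bigram[(w_prev, w)] + delta) / (unigram[w_prev] + delta * V)

print(p_bigram("我", "愛"))   # 在訓(xùn)練語料中出現(xiàn)過的bigram
print(p_bigram("我", "恨"))   # 未出現(xiàn)過,但概率不為0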

2.2.3 語言模型性能評估

方法:

  1. 運用到具體的任務(wù)中,得到外部任務(wù)評價(計算代價高)
  2. 困惑度(Perplexity,PPL),內(nèi)部評價

困惑度:

\operatorname{PPL}\left(\mathbb{D}^{\text {test }}\right)=\left(\prod_{i=1}^{N} P\left(w_{i} \mid w_{1: i-1}\right)\right)^{-\frac{1}{N}}

連乘會浮點下溢

\operatorname{PPL}\left(\mathbb{D}^{\text {test }}\right)=2^{-\frac{1}{N} \sum_{i=1}^{N} \log _{2} P\left(w_{i} \mid w_{1: i-1}\right)}

困惑度越小,單詞序列的概率越大。困惑度越低的語言模型并不總能在外部任務(wù)上得到更好的性能指標(biāo),但二者一般有一定的正相關(guān)性,因此困惑度是一種快速評價語言模型性能的指標(biāo)
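
在對數(shù)空間中計算困惑度可以避免連乘下溢,下面是一個簡單示意(概率值為假設(shè)):

import math

# probs為模型對測試序列中每個詞給出的條件概率 P(w_i | w_{1:i-1})
probs = [0.2, 0.1, 0.05, 0.3]
N = len(probs)
ppl = 2 ** (-sum(math.log2(p) for p in probs) / N)
print(ppl)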

2.3 自然語言處理基礎(chǔ)任務(wù)

2.3.1 中文分詞

前向最大匹配分詞的明顯缺點是傾向于切分出較長的詞,也會有切分歧義的問題

見附錄代碼2
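
附錄代碼這里未展開,下面給出前向最大匹配分詞的一個最小示意實現(xiàn)(詞典與最大詞長均為假設(shè)):

def fmm_word_seg(sentence, lexicon, max_len):
    """前向最大匹配分詞:每次從當(dāng)前位置嘗試匹配詞典中最長的詞"""
    words = []
    begin = 0
    while begin < len(sentence):
        end = min(begin + max_len, len(sentence))
        while end > begin + 1:
            if sentence[begin:end] in lexicon:
                break
            end -= 1
        words.append(sentence[begin:end])
        begin = end
    return words

lexicon = {"研究", "研究生", "生命", "命", "的", "起源"}
print(fmm_word_seg("研究生命的起源", lexicon, max_len=3))
# ['研究生', '命', '的', '起源'] —— 體現(xiàn)了傾向于切分長詞帶來的歧義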

2.3.2 子詞切分

以英語為代表的語言,如果只按照天然的分隔符(空格)進(jìn)行切分,會造成一定的數(shù)據(jù)稀疏問題,而且會導(dǎo)致詞表過大而降低處理速度,所以有基于傳統(tǒng)語言學(xué)規(guī)則的詞形還原(Lemmatization)和詞干提取(Stemming),但是其結(jié)果可能不是一個完整的詞

2.3.2.1 子詞切分算法

原理:都是使用盡量長且頻次高的子詞對單詞進(jìn)行切分,常用的是字節(jié)對編碼算法(Byte Pair Encoding,BPE)
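
下面是BPE合并操作學(xué)習(xí)過程的一個最小示意(玩具詞頻,非書中代碼):

from collections import Counter

def get_pair_stats(vocab):
    """統(tǒng)計詞表中所有相鄰子詞對的頻次"""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """將選中的子詞對合并為一個新子詞"""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

# 單詞先切成字符序列,</w>表示詞尾
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):
    pairs = get_pair_stats(vocab)
    best = max(pairs, key=pairs.get)   # 每輪合并頻次最高的子詞對
    vocab = merge_pair(best, vocab)
    print(best)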

2.3.3 詞性標(biāo)注

主要難點在于歧義性,一個詞在不同的上下文中可能有不同的意思

2.3.4 句法分析

給定一個句子,分析句子的句法成分信息,輔助下游處理任務(wù)

句法結(jié)構(gòu)表示法:

S表示起始符號,NP表示名詞短語,VP表示動詞短語,sub表示主語,obj表示賓語

例子

您轉(zhuǎn)的這篇文章很無知枯芬。
您轉(zhuǎn)這篇文章很無知。

第一句話主語是“文章”,第二句話的主語是“轉(zhuǎn)”這個動作

2.3.5 語義分析

從詞語的顆粒度考慮,一個詞語可能具有多重語義,例如“打”;確定詞在具體語境中含義的任務(wù)稱為詞義消歧(Word Sense Disambiguation,WSD),可以借助語義詞典(例如WordNet)來確定
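
例如,可以使用NLTK中基于WordNet釋義重疊的Lesk算法做一個簡單的詞義消歧嘗試(結(jié)果僅供示意):

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sent = word_tokenize("I went to the bank to deposit money.")
sense = lesk(sent, "bank", pos="n")   # 返回一個WordNet同義詞集合
print(sense, sense.definition())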

2.4 自然語言處理應(yīng)用任務(wù)

2.4.1 信息抽取

  • NER
  • 關(guān)系抽取(實體之間的語義關(guān)系,如夫妻、子女、工作單位和地理空間上的位置關(guān)系等二元關(guān)系)
  • 事件抽取(識別人們感興趣的事件以及事件所涉及的時間、地點和人物等關(guān)鍵信息)


2.4.2 情感分析

  • 情感分類(識別文中蘊含的情感類型或者情感強度)
  • 情感信息抽取(抽取文中的情感元素,如評價詞語、評價對象和評價搭配等)

2.4.3 問答系統(tǒng)(QA)

  • 檢索式問答系統(tǒng)
  • 知識庫問答系統(tǒng)
  • 常問問題集問答系統(tǒng)
  • 閱讀理解式問答系統(tǒng)

2.4.4 機(jī)器翻譯(MT)

“理性主義”:基于規(guī)則
“經(jīng)驗主義”:數(shù)據(jù)驅(qū)動
基于深度學(xué)習(xí)的MT也稱為神經(jīng)機(jī)器翻譯(NMT)

2.4.5 對話系統(tǒng)

對話系統(tǒng)主要分為任務(wù)型對話系統(tǒng)和開放域?qū)υ捪到y(tǒng),后者也被稱為聊天機(jī)器人(Chatbot)

2.4.5.1 任務(wù)型對話系統(tǒng)

包含三個模塊:NLU(自然語言理解)、DM(對話管理)德迹、NLG(自然語言生成)

1)NLU通常包含對話語的領(lǐng)域、意圖和槽值的理解
2)DM通常包含對話狀態(tài)跟蹤(Dialogue State Tracking)和對話策略優(yōu)化(Dialogue Policy Optimization)
對話狀態(tài)一般表示為語義槽和值的列表卸例。

U:幫我訂一張明天去北京的機(jī)票

例如,對以上用戶話語的NLU結(jié)果進(jìn)行對話狀態(tài)跟蹤,可以得到當(dāng)前對話狀態(tài):【到達(dá)地=北京;出發(fā)時間=明天;出發(fā)地=NULL;數(shù)量=1】。
獲取到當(dāng)前對話狀態(tài)后,再進(jìn)行對話策略優(yōu)化,即選擇下一步采取什么樣的策略(也叫動作),比如可以繼續(xù)詢問出發(fā)地等

NLG通常通過寫模板即可實現(xiàn)

2.5 基本問題

2.5.1 文本分類

2.5.2 結(jié)構(gòu)預(yù)測

2.5.2.1 序列標(biāo)注(Sequence Labeling)

CRF既考慮了每個詞屬于某個標(biāo)簽的概率(發(fā)射概率),還考慮了標(biāo)簽之間的相互關(guān)系(轉(zhuǎn)移概率)

2.5.2.2 序列分割

人名(PER)到腥、地名(LOC)眨八、機(jī)構(gòu)名(ORG)

輸入:“我愛北京天安門”,分詞:“我 愛 北京 天安門”,NER結(jié)果:“北京天安門=LOC”

2.5.2.3 圖結(jié)構(gòu)生成

輸入的是自然語言,輸出結(jié)果是一個以圖表示的結(jié)構(gòu),算法有兩大類:基于圖和基于轉(zhuǎn)移

2.5.3 序列到序列問題(Seq2Seq)

也稱為編碼器-解碼器(Encoder-Decoder)模型

2.6 評價指標(biāo)

2.6.1 準(zhǔn)確率(Accuracy)

最簡單直觀的評價指標(biāo),常被用于文本分類、詞性標(biāo)注等問題

ACC^{cls} = \frac{正確分類的文本數(shù)}{測試文本總數(shù)}

ACC^{pos} = \frac{正確標(biāo)注的詞數(shù)}{測試文本中詞的總數(shù)}

2.6.2 F值

針對某一類別的評價

F值 = \frac{(\beta ^{2} + 1)PR}{\beta ^{2}P+R}

\beta是加權(quán)調(diào)和參數(shù);P是精確率(Precision);R是召回率(Recall)。當(dāng)權(quán)重\beta=1時,表示精確率和召回率同樣重要,此時也稱F1值

F_{1} = \frac{2PR}{P+R}

“正確識別的命名實體數(shù)目”為1(“哈爾濱”),“識別出的命名實體總數(shù)”為2(“張”和“哈爾濱”),“測試文本中命名實體的總數(shù)”為2(“張三”和“哈爾濱”),那么此時精確率和召回率皆為1/2 = 0.5,最終的F1 = 0.5,與基于詞計算的準(zhǔn)確率(0.875)相比,該值更為合理
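
按照上述定義,可以用一個簡單的函數(shù)復(fù)算這個例子(這里取\beta=1):

def prf(num_correct, num_predicted, num_gold, beta=1.0):
    # P = 正確識別數(shù)/識別出的總數(shù),R = 正確識別數(shù)/標(biāo)準(zhǔn)答案總數(shù)
    p = num_correct / num_predicted
    r = num_correct / num_gold
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f

# 書中NER的例子:正確識別1個,共識別出2個,標(biāo)準(zhǔn)答案共2個
print(prf(1, 2, 2))   # (0.5, 0.5, 0.5)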

2.6.3 其他評價

BLEU值是最常用的機(jī)器翻譯自動評價指標(biāo)

2.7 習(xí)題

3 基礎(chǔ)工具集與常用數(shù)據(jù)集

3.1 NLTK工具集

3.1.1 語料庫和詞典資源

3.1.1.1 停用詞

英文中的“a”、“the”、“of”、“to”等

from nltk.corpus import stopwords

stopwords.words('english')

3.1.1.2 常用語料庫

NLTK提供了多種語料庫(文本數(shù)據(jù)集),如圖書、電影評論和聊天記錄等,它們可以被分為兩類,即未標(biāo)注語料庫(又稱生語料庫或生文本,Raw text)和人工標(biāo)注語料庫(Annotated corpus)

  • 未標(biāo)注語料庫

比如說小說的原文等

  • 人工標(biāo)注語料庫

3.1.1.3 常用詞典

WordNet

普林斯頓大學(xué)構(gòu)建的英文語義詞典(也稱作辭典,Thesaurus),其主要特色是定義了同義詞集合(Synset),每個同義詞集合由具有相同意義的詞義組成,并為每一個同義詞集合提供了簡短的釋義(Gloss),不同同義詞集合之間還具有一定的語義關(guān)系

from nltk.corpus import wordnet
syns = wordnet.synsets("bank")
syns[0].name()
syns[1].definition()

SentiWordNet

基于WordNet標(biāo)注的同義詞集合的情感傾向性詞典,有褒義、貶義、中性三個情感值
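
一個簡單的查詢示例如下(需要先下載sentiwordnet與wordnet數(shù)據(jù),結(jié)果僅供示意):

from nltk.corpus import sentiwordnet as swn

# 取"good"的第一個同義詞集合,查看褒義/貶義/中性得分
s = list(swn.senti_synsets("good"))[0]
print(s.pos_score(), s.neg_score(), s.obj_score())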

3.1.2 NLP工具集

3.1.2.1 分句

from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize

text = gutenberg.raw('austen-emma.txt')
sentences = sent_tokenize(text)
print(sentences[100])

# 結(jié)果:
# Mr. Knightley loves to find fault with me, you know-- \nin a joke--it is all a joke.

3.1.2.2 標(biāo)記解析

一個句子是由若干標(biāo)記(Token)按順序構(gòu)成的,其中標(biāo)記既可以是一個詞,也可以是標(biāo)點符號等,這些標(biāo)記是自然語言處理最基本的輸入單元。

將句子分割為標(biāo)記的過程叫作標(biāo)記解析(Tokenization)。英文中的單詞之間通常使用空格進(jìn)行分割,不過標(biāo)點符號通常和前面的單詞連在一起,因此標(biāo)記解析的一項主要工作是將標(biāo)點符號和前面的單詞進(jìn)行拆分。和分句一樣,也無法使用簡單的規(guī)則進(jìn)行標(biāo)記解析,仍以符號“.”為例,它既可作為句號,也可以作為標(biāo)記的一部分,如不能簡單地將“Mr.”分成兩個標(biāo)記。同樣,NLTK提供了標(biāo)記解析功能,也稱作標(biāo)記解析器(Tokenizer)

from nltk.tokenize import word_tokenize
word_tokenize(sentences[100])

# 結(jié)果:
# ['Mr.','Knightley','loves','to','find','fault','with','me',',','you','know','--','in','a','joke','--','it','is','all','a','joke','.']

3.1.2.3 詞性標(biāo)注

from nltk import pos_tag

print(pos_tag(word_tokenize("They sat by the fire.")))
print(pos_tag(word_tokenize("They fire a gun.")))

# 結(jié)果:
# [('They', 'PRP'), ('sat', 'VBP'), ('by', 'IN'), ('the', 'DT'), ('fire', 'NN'), ('.', '.')]
# [('They', 'PRP'), ('fire', 'VBP'), ('a', 'DT'), ('gun', 'NN'), ('.', '.')]

3.1.2.4 其他工具

命名實體識別、組塊分析、句法分析等

3.2 LTP工具集(哈工大)

中文分詞、詞性標(biāo)注、命名實體識別、依存句法分析和語義角色標(biāo)注等,具體查API文檔

3.3 Pytorch

# 1.創(chuàng)建
torch.empty(2, 3)                                 # 未初始化
torch.randn(2, 3)                                 # 標(biāo)準(zhǔn)正態(tài)
torch.zeros(2, 3, dtype=torch.long)               # 張量為整數(shù)類型
torch.zeros(2, 3, dtype=torch.double)             # 雙精度浮點數(shù)
torch.tensor([[1.0, 3.8, 2.1], [8.6, 4.0, 2.4]])  # 通過列表創(chuàng)建
torch.arange(1, 4)                                # 生成1到3的整數(shù)(不含4)

# 2.GPU
torch.rand(2, 3).to("cuda")                       

# 3.加減乘除都是元素運算
x = torch.tensor([1, 2, 3], dtype=torch.double)
y = torch.tensor([4, 5, 6], dtype=torch.double)
print(x * y)
# tensor([ 4., 10., 18.], dtype=torch.float64)

# 4.點積
x.dot(y)

# 5.所有元素求平均
x.mean()

# 6.按維度求平均
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.double)

print(x.mean(dim=0))
print(x.mean(dim=1))

# tensor([2.5000, 3.5000, 4.5000], dtype=torch.float64)
# tensor([2., 5.], dtype=torch.float64)

# 7.拼接(按列和行)
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.double)
y = torch.tensor([[7, 8, 9], [10, 11, 12]], dtype=torch.double)
torch.cat((x, y), dim=0)
# tensor([[ 1.,  2.,  3.],
#        [ 4.,  5.,  6.],
#        [ 7.,  8.,  9.],
#        [10., 11., 12.]], dtype=torch.float64)

# 8.梯度
x = torch.tensor([2.], requires_grad=True)
y = torch.tensor([3.], requires_grad=True)

z = (x + y) * (y - 2)
print(z)
z.backward()
print(x.grad, y.grad)

# tensor([5.], grad_fn=<MulBackward0>)
# tensor([1.]) tensor([6.])

# 9.調(diào)整形狀
# view和reshape的區(qū)別是:view要求張量在內(nèi)存中是連續(xù)的(可以用is_contiguous判斷是否連續(xù)),其他用法一樣
# transpose交換維度(一次只能交換兩個維度),permute可以一次交換多個維度
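# 下面是一個簡單示意(形狀僅作示例):
x = torch.arange(12).view(3, 4)           # 調(diào)整為3行4列
x = x.reshape(4, 3)                       # reshape不要求張量連續(xù)
x = x.transpose(0, 1)                     # 交換第0維和第1維,形狀變?yōu)?3, 4)
print(x.is_contiguous())                  # transpose之后通常不再連續(xù)(False)
y = torch.rand(2, 3, 4).permute(2, 0, 1)  # 一次交換多個維度,形狀變?yōu)?4, 2, 3)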

# 10.升維和降維
a = torch.tensor([1, 2, 3, 4])
b = a.unsqueeze(dim=0)
print(b, b.shape)
c = b.squeeze()         # 去掉所有形狀中為1的維
print(c, c.shape)
# tensor([[1, 2, 3, 4]]) torch.Size([1, 4])
# tensor([1, 2, 3, 4]) torch.Size([4])


3.4 語料處理


# 刪除空的成對符號
def remove_empty_paired_punc(in_str):
    return in_str.replace('()', '').replace('《》', '').replace('【】', '').replace('[]', '')
    
# 刪除多余的html標(biāo)簽
def remove_html_tags(in_str):
    html_pattern = re.compile(r'<[^>]+>', re.S)
    return html_pattern.sub('', in_str)

# 刪除不可見控制字符
def remove_control_chars(in_str):
    control_chars = ''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))
    control_chars = re.compile('[%s]' % re.escape(control_chars))
    return control_chars.sub('', in_str)

3.5 數(shù)據(jù)集

  • Common Crawl
  • HuggingFace Datasets(超多數(shù)據(jù)集)

使用HuggingFace Datasets之前,先pip安裝datasets,其提供數(shù)據(jù)集以及評價方法

3.6 習(xí)題

4 自然語言處理中的神經(jīng)網(wǎng)絡(luò)基礎(chǔ)

4.1 多層感知器模型

4.1.1 感知機(jī)

y = \begin{cases}1, 如果w\cdot x+b \ge 0 \\ 0,否則 \end{cases}
將輸入轉(zhuǎn)換成特征向量x的過程稱為特征提取(Feature Extraction)

4.1.2 線性回歸

和感知機(jī)類似,y = wx + b

4.1.3 邏輯回歸

y = \frac{1}{1+e^{-z}}

其中,導(dǎo)數(shù)y' = y(1-y),邏輯回歸模型常用于分類問題

4.1.4 softmax回歸

例如數(shù)字識別,每個類z_{i}=w_{i 1} x_{1}+w_{i 2} x_{2}+\cdots+w_{i n} x_{n}+b_{i},結(jié)果為

y_{i}=\operatorname{Softmax}(z)_{i}=\frac{\mathrm{e}^{z_{i}}}{\mathrm{e}^{z_{1}}+\mathrm{e}^{z_{2}}+\cdots+\mathrm{e}^{z_{m}}}

使用矩陣表示y = Softmax(Wx+b)

4.1.5 多層感知機(jī)(Multi-layer Perceptron,MLP)

多層感知機(jī)是解決線性不可分問題的方案,即堆疊多層線性分類器,并在隱層加入非線性激活函數(shù)

4.1.6 模型實現(xiàn)

4.1.6.1 nn


from torch import nn

linear = nn.Linear(32, 2) # 輸入特征數(shù)目為32維,輸出特征數(shù)目為2維
inputs = torch.rand(3, 32) # 創(chuàng)建一個形狀為(3,32)的隨機(jī)張量,3為batch批次大小
outputs = linear(inputs)
print(outputs)

# 輸出:
# tensor([[ 0.2488, -0.3663],
#         [ 0.4467, -0.5097],
#         [ 0.4149, -0.7504]], grad_fn=<AddmmBackward>)

# 輸出為(3, 2),即(batch,輸出維度)

4.1.6.2 激活函數(shù)


from torch.nn import functional as F

# 對于每個元素進(jìn)行sigmoid
activation = F.sigmoid(outputs) 
print(activation)

# 結(jié)果:
# tensor([[0.6142, 0.5029],
#         [0.5550, 0.4738],
#         [0.6094, 0.4907]], grad_fn=<SigmoidBackward>)

# 沿著第2維(行方向)進(jìn)行softmax,即對于每批次中的各樣例分別進(jìn)行softmax
activation = F.softmax(outputs, dim=1)
print(activation)

# 結(jié)果:
# tensor([[0.6115, 0.3885],
#         [0.5808, 0.4192],
#         [0.6182, 0.3818]], grad_fn=<SoftmaxBackward>)

activation = F.relu(outputs)
print(activation)

# 結(jié)果:
# tensor([[0.4649, 0.0115],
#         [0.2210, 0.0000],
#         [0.4447, 0.0000]], grad_fn=<ReluBackward0>)

4.1.6.3 多層感知機(jī)


import torch
from torch import nn
from torch.nn import functional as F

# 多層感知機(jī)
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_class):
        super(MLP, self).__init__()
        # 線性變換:輸入 -> 隱層
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        # ReLU
        self.activate = F.relu
        # 線性變換:隱層 -> 輸出
        self.linear2 = nn.Linear(hidden_dim, num_class)

    def forward(self, inputs):
        hidden = self.linear1(inputs)
        activation = self.activate(hidden)
        outputs = self.linear2(activation)
        probs = F.softmax(outputs, dim=1)   # 獲得每個輸入屬于某個類別的概率
        return probs

mlp = MLP(input_dim=4, hidden_dim=5, num_class=2)
# 批次大小為3,每個輸入的維度為4
inputs = torch.rand(3, 4)
probs = mlp(inputs)
print(probs)

# 結(jié)果:
# tensor([[0.3465, 0.6535],
#         [0.3692, 0.6308],
#         [0.4319, 0.5681]], grad_fn=<SoftmaxBackward>)

4.2 CNN

4.2.1 模型結(jié)構(gòu)

計算最后輸出邊長為
\frac{n + 2p - f}{s} + 1
其中,n為輸入邊長,p為padding大小,f為卷積核寬度,s為步長(見下面的驗證代碼)
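
可以用一個小例子驗證該公式(n=10、p=1、f=3、s=2均為假設(shè)值):

import torch
from torch.nn import Conv1d

conv = Conv1d(in_channels=4, out_channels=8, kernel_size=3, stride=2, padding=1)
x = torch.rand(1, 4, 10)
print(conv(x).shape)   # torch.Size([1, 8, 5]),與 (10 + 2*1 - 3) // 2 + 1 = 5 一致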

卷積神經(jīng)網(wǎng)絡(luò)也屬于前饋神經(jīng)網(wǎng)絡(luò)的一種

4.2.2 模型實現(xiàn)

4.2.2.1 卷積

Conv1d、Conv2d、Conv3d,自然語言處理中常用的是一維卷積(Conv1d)

簡單來說,2d先橫著掃再豎著掃,1d只能豎著掃,3d是三維立體掃

代碼實現(xiàn):注意PyTorch的Conv1d要求輸入通道位于倒數(shù)第2維(即形狀為“批次×通道×長度”),因此傳參前要先轉(zhuǎn)置一下,把需要做卷積的embedding維度放到倒數(shù)第2維,卷積之后一般會再轉(zhuǎn)置回來(這點確實繞了一圈)

import torch
from torch.nn import Conv1d

inputs = torch.ones(2, 7, 5)
conv1 = Conv1d(in_channels=5, out_channels=3, kernel_size=2)
inputs = inputs.permute(0, 2, 1)
outputs = conv1(inputs)
outputs = outputs.permute(0, 2, 1)
print(outputs, outputs.shape)


# 結(jié)果:
# tensor([[[ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679]],

#         [[ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679],
#          [ 0.4823, -0.3208, -0.2679]]], grad_fn=<PermuteBackward> torch.Size([2, 6, 3]))

4.2.2.2 卷積疚漆、池化酣胀、全鏈接


import torch
from torch.nn import Conv1d

# 輸入批次大小為2,即有兩個序列,每個序列長度為6,輸入的維度為5
inputs = torch.rand(2, 5, 6)
print("inputs = ", inputs, inputs.shape)
# class torch.nn.Conv1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
# in_channels 詞向量維度
# out_channels 卷積產(chǎn)生的通道
# kernel_size 卷積核尺寸,卷積大小實際為 kernel_size*in_channels

# 定義一個一維卷積丸升,輸入通道為5铆农,輸出通道為2,卷積核寬度為4
conv1 = Conv1d(in_channels=5, out_channels=2, kernel_size=4)
# 卷積核的權(quán)值是隨機(jī)初始化的
print("conv1.weight = ", conv1.weight, conv1.weight.shape)
# 再定義一個一維卷積狡耻,輸入通道為5墩剖,輸出通道為2,卷積核寬度為3
conv2 = Conv1d(in_channels=5, out_channels=2, kernel_size=3)

outputs1 = conv1(inputs)
outputs2 = conv2(inputs)
# 輸出1為2個序列,每個序列長度為3,大小為2
print("outputs1 = ", outputs1, outputs1.shape)
# 輸出2為2個序列,每個序列長度為4,大小為2
print("outputs2 = ", outputs2, outputs2.shape)

# inputs =  tensor([[[0.5801, 0.6436, 0.1947, 0.6487, 0.8968, 0.3009],
#          [0.8895, 0.0390, 0.5899, 0.1805, 0.1035, 0.9368],
#          [0.1585, 0.8440, 0.8345, 0.0849, 0.4730, 0.5783],
#          [0.3659, 0.2716, 0.4990, 0.6657, 0.2565, 0.9945],
#          [0.6403, 0.2125, 0.6234, 0.1210, 0.3517, 0.6784]],

#         [[0.0855, 0.1844, 0.3558, 0.1458, 0.9264, 0.9538],
#          [0.1427, 0.9598, 0.2031, 0.2354, 0.5456, 0.6808],
#          [0.8981, 0.6998, 0.1424, 0.7445, 0.3664, 0.9132],
#          [0.9393, 0.6905, 0.1617, 0.7266, 0.6220, 0.0726],
#          [0.6940, 0.1242, 0.0561, 0.3435, 0.1775, 0.8076]]]) torch.Size([2, 5, 6])
# conv1.weight =  Parameter containing:
# tensor([[[ 0.1562, -0.1094, -0.0228,  0.1879],
#          [-0.0304,  0.1720,  0.0392,  0.0476],
#          [ 0.0479,  0.0050, -0.0942,  0.0502],
#          [-0.0905, -0.1414,  0.0421,  0.0708],
#          [ 0.0671,  0.2107,  0.1556,  0.1809]],

#         [[ 0.0453,  0.0267,  0.0821,  0.0792],
#          [ 0.0428,  0.1096,  0.0132,  0.1285],
#          [-0.0082,  0.2208,  0.2189,  0.1461],
#          [ 0.0550, -0.0019, -0.0607, -0.1238],
#          [ 0.0730,  0.1778, -0.0817,  0.2204]]], requires_grad=True) torch.Size([2, 5, 4])
# outputs1 =  tensor([[[0.2778, 0.5726, 0.2568],
#          [0.6502, 0.7603, 0.6844]],
# ...
#          [-0.0940, -0.2529, -0.2081,  0.0786]],

#         [[-0.0102, -0.0118,  0.0119, -0.1874],
#          [-0.5899, -0.0979, -0.1233, -0.1664]]], grad_fn=<SqueezeBackward1>) torch.Size([2, 2, 4])

from torch.nn import MaxPool1d

# 對長度為3的輸出序列做最大池化,池化后長度為1
pool1 = MaxPool1d(3)
# 對長度為4的輸出序列做最大池化,池化后長度為1
pool2 = MaxPool1d(4)
outputs_pool1 = pool1(outputs1)
outputs_pool2 = pool2(outputs2)

print(outputs_pool1)
print(outputs_pool2)

# 由于outputs_pool1和outputs_pool2是兩個獨立的張量,需要用cat拼接起來;先刪除最后一個維度,將2行1列的矩陣變成1個向量
outputs_pool_squeeze1 = outputs_pool1.squeeze(dim=2)
print(outputs_pool_squeeze1)
outputs_pool_squeeze2 = outputs_pool2.squeeze(dim=2)
print(outputs_pool_squeeze2)
outputs_pool = torch.cat([outputs_pool_squeeze1, outputs_pool_squeeze2], dim=1)
print(outputs_pool)

# tensor([[[0.5726],
#          [0.7603]],

#         [[0.4595],
#          [0.9858]]], grad_fn=<SqueezeBackward1>)
# tensor([[[-0.0104],
#          [ 0.0786]],

#         [[ 0.0119],
#          [-0.0979]]], grad_fn=<SqueezeBackward1>)
# tensor([[0.5726, 0.7603],
#         [0.4595, 0.9858]], grad_fn=<SqueezeBackward1>)
# tensor([[-0.0104,  0.0786],
#         [ 0.0119, -0.0979]], grad_fn=<SqueezeBackward1>)
# tensor([[ 0.5726,  0.7603, -0.0104,  0.0786],
#         [ 0.4595,  0.9858,  0.0119, -0.0979]], grad_fn=<CatBackward>)

from torch.nn import Linear

linear = Linear(4, 2)
outputs_linear = linear(outputs_pool)
print(outputs_linear)

# tensor([[-0.0555, -0.0656],
#         [-0.0428, -0.0303]], grad_fn=<AddmmBackward>)

4.2.3 TextCNN網(wǎng)絡(luò)結(jié)構(gòu)


import os
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F

class TextCNN(nn.Module):
    def __init__(self, config):
        super(TextCNN, self).__init__()
        self.is_training = True
        self.dropout_rate = config.dropout_rate
        self.num_class = config.num_class
        self.use_element = config.use_element
        self.config = config
 
        self.embedding = nn.Embedding(num_embeddings=config.vocab_size, 
                                embedding_dim=config.embedding_size)
        self.convs = nn.ModuleList([
                nn.Sequential(nn.Conv1d(in_channels=config.embedding_size, 
                                        out_channels=config.feature_size, 
                                        kernel_size=h),
#                              nn.BatchNorm1d(num_features=config.feature_size), 
                              nn.ReLU(),
                              nn.MaxPool1d(kernel_size=config.max_text_len-h+1))
                     for h in config.window_sizes
                    ])
        self.fc = nn.Linear(in_features=config.feature_size*len(config.window_sizes),
                            out_features=config.num_class)
        if os.path.exists(config.embedding_path) and config.is_training and config.is_pretrain:
            print("Loading pretrain embedding...")
            self.embedding.weight.data.copy_(torch.from_numpy(np.load(config.embedding_path)))    
    
    def forward(self, x):
        embed_x = self.embedding(x)
        
        #print('embed size 1',embed_x.size())  # 32*35*256
# batch_size x text_len x embedding_size  -> batch_size x embedding_size x text_len
        embed_x = embed_x.permute(0, 2, 1)
        #print('embed size 2',embed_x.size())  # 32*256*35
        out = [conv(embed_x) for conv in self.convs]  #out[i]:batch_size x feature_size*1
        #for o in out:
        #    print('o',o.size())  # 32*100*1
        out = torch.cat(out, dim=1)  # 沿第二個維度(行)拼接起來,比如說5*2*1和5*3*1拼接變成5*5*1
        #print(out.size(1)) # 32*400*1
        out = out.view(-1, out.size(1)) 
        #print(out.size())  # 32*400 
        if not self.use_element:
            out = F.dropout(input=out, p=self.dropout_rate)
            out = self.fc(out)
        return out

4.3 RNN

4.3.1 RNN和HMM的區(qū)別

4.3.2 模型實現(xiàn)


from torch.nn import RNN

# 每個時刻輸入大小為4,隱含層大小為5
rnn = RNN(input_size=4, hidden_size=5, batch_first=True)
# 輸入批次大小為2,即有2個序列,序列長度為3,輸入大小為4
inputs = torch.rand(2, 3, 4)
# 得到輸出和更新之后的隱藏狀態(tài)
outputs, hn = rnn(inputs)

print(outputs)
print(hn)
print(outputs.shape, hn.shape)

# tensor([[[-0.1413,  0.1952, -0.2586, -0.4585, -0.4973],
#          [-0.3413,  0.3166, -0.2132, -0.5002, -0.2506],
#          [-0.0390,  0.1016, -0.1492, -0.4582, -0.0017]],

#         [[ 0.1747,  0.2208, -0.1599, -0.4487, -0.1219],
#          [-0.1236,  0.1097, -0.2268, -0.4487, -0.0603],
#          [ 0.0973,  0.3031, -0.1482, -0.4647,  0.0809]]],
#        grad_fn=<TransposeBackward1>)
# tensor([[[-0.0390,  0.1016, -0.1492, -0.4582, -0.0017],
#          [ 0.0973,  0.3031, -0.1482, -0.4647,  0.0809]]],
#        grad_fn=<StackBackward>)
# torch.Size([2, 3, 5]) torch.Size([1, 2, 5])
import torch
from torch.autograd import Variable
from torch import nn

# 首先建立一個簡單的循環(huán)神經(jīng)網(wǎng)絡(luò):輸入維度為20,隱含層維度是50,兩層的單向網(wǎng)絡(luò)
basic_rnn = nn.RNN(input_size=20, hidden_size=50, num_layers=2)
"""
通過 weight_ih_l0 來訪問第一層中的 w_{ih},因為輸入 x_{t} 是20維,隱含層是50維,所以 w_{ih} 是一個50×20的矩陣;要訪問第
二層網(wǎng)絡(luò)可以使用 weight_ih_l1。對于 w_{hh},可以用 weight_hh_l0 來訪問,而 b_{ih} 則可以通過 bias_ih_l0 來訪問。當(dāng)然可以對它們
進(jìn)行自定義的初始化,只需要記得它們是 Variable,取出它們的 data,對其進(jìn)行自定義的初始化即可。
"""
print(basic_rnn.weight_ih_l0.size(), basic_rnn.weight_ih_l1.size(), basic_rnn.weight_hh_l0.size())

# 隨機(jī)初始化輸入和隱藏狀態(tài)
toy_input = Variable(torch.randn(3, 1, 20))
h_0 = Variable(torch.randn(2*1, 1, 50))

print(toy_input[0].size())
# 將輸入和隱藏狀態(tài)傳入網(wǎng)絡(luò),得到輸出和更新之后的隱藏狀態(tài),輸出形狀是(3, 1, 50)
toy_output, h_n = basic_rnn(toy_input, h_0)
print(toy_output[-1])

print(h_n)
print(h_n[1])

# torch.Size([50, 20]) torch.Size([50, 50]) torch.Size([50, 50])
# torch.Size([1, 20])
# tensor([[-0.5984, -0.3677,  0.0775,  0.2553,  0.1232, -0.1161, -0.2288,  0.1609,
#          -0.1241, -0.3501, -0.3164,  0.3403,  0.0332,  0.2511,  0.0951,  0.2445,
#           0.0558, -0.0419, -0.1222,  0.0901, -0.2851,  0.1737,  0.0637, -0.3362,
#          -0.1706,  0.2050, -0.3277, -0.2112, -0.4245,  0.0265, -0.0052, -0.4551,
#          -0.3270, -0.1220, -0.1531, -0.0151,  0.2504,  0.5659,  0.4878, -0.0656,
#          -0.7775,  0.4294,  0.2054,  0.0318,  0.4798, -0.1439,  0.3873,  0.1039,
#           0.1654, -0.5765]], grad_fn=<SelectBackward>)
# tensor([[[ 0.2338,  0.1578,  0.7547,  0.0439, -0.6009,  0.1042, -0.4840,
#           -0.1806, -0.2075, -0.2174,  0.2023,  0.3301, -0.1899,  0.1618,
#            0.0790,  0.1213,  0.0053, -0.2586,  0.6376,  0.0315,  0.6949,
#            0.3184, -0.4901, -0.0852,  0.4542,  0.1393, -0.0074, -0.8129,
#           -0.1013,  0.0852,  0.2550, -0.4294,  0.2316,  0.0662,  0.0465,
#           -0.1976, -0.6093,  0.4097,  0.3909, -0.1091, -0.3569,  0.0366,
#            0.0665,  0.5302, -0.1765, -0.3919, -0.0308,  0.0061,  0.1447,
#            0.2676]],

#         [[-0.5984, -0.3677,  0.0775,  0.2553,  0.1232, -0.1161, -0.2288,
#            0.1609, -0.1241, -0.3501, -0.3164,  0.3403,  0.0332,  0.2511,
#            0.0951,  0.2445,  0.0558, -0.0419, -0.1222,  0.0901, -0.2851,
#            0.1737,  0.0637, -0.3362, -0.1706,  0.2050, -0.3277, -0.2112,
#           -0.4245,  0.0265, -0.0052, -0.4551, -0.3270, -0.1220, -0.1531,
#           -0.0151,  0.2504,  0.5659,  0.4878, -0.0656, -0.7775,  0.4294,
#            0.2054,  0.0318,  0.4798, -0.1439,  0.3873,  0.1039,  0.1654,
# ...
#          -0.1706,  0.2050, -0.3277, -0.2112, -0.4245,  0.0265, -0.0052, -0.4551,
#          -0.3270, -0.1220, -0.1531, -0.0151,  0.2504,  0.5659,  0.4878, -0.0656,
#          -0.7775,  0.4294,  0.2054,  0.0318,  0.4798, -0.1439,  0.3873,  0.1039,
#           0.1654, -0.5765]], grad_fn=<SelectBackward>)

初始化時,還可以設(shè)置其他網(wǎng)絡(luò)參數(shù),如bidirectional=True、num_layers等

4.3.3 LSTM


from torch.nn import LSTM

lstm = LSTM(input_size=4, hidden_size=5, batch_first=True)
inputs = torch.rand(2, 3, 4)
# outputs為輸出序列的隱含層,hn為最后一個時刻的隱含層,cn為最后一個時刻的記憶細(xì)胞
outputs, (hn, cn) = lstm(inputs)
# 輸出兩個序列,每個序列長度為3,大小為5
print(outputs)
print(hn)
print(cn)
# 輸出隱含層序列和最后一個時刻隱含層以及記憶細(xì)胞的形狀
print(outputs.shape, hn.shape, cn.shape)

# tensor([[[-0.1102,  0.0568,  0.0929,  0.0579, -0.1300],
#          [-0.2051,  0.0829,  0.0245,  0.0202, -0.2124],
#          [-0.2509,  0.0854,  0.0882, -0.0272, -0.2385]],

#         [[-0.1302,  0.0804,  0.0200,  0.0543, -0.1033],
#          [-0.2794,  0.0736,  0.0247, -0.0406, -0.2233],
#          [-0.2913,  0.1044,  0.0407,  0.0044, -0.2345]]],
#        grad_fn=<TransposeBackward0>)
# tensor([[[-0.2509,  0.0854,  0.0882, -0.0272, -0.2385],
#          [-0.2913,  0.1044,  0.0407,  0.0044, -0.2345]]],
#        grad_fn=<StackBackward>)
# tensor([[[-0.3215,  0.2153,  0.1180, -0.0568, -0.4162],
#          [-0.3982,  0.2704,  0.0568,  0.0097, -0.3959]]],
#        grad_fn=<StackBackward>)
# torch.Size([2, 3, 5]) torch.Size([1, 2, 5]) torch.Size([1, 2, 5])

4.4 注意力模型

seq2seq這樣的模型有一個基本假設(shè),就是原始序列的最后一個隱含狀態(tài)(一個向量)包含了該序列的全部信息。這一假設(shè)顯然不合理,當(dāng)序列比較長的時候就更困難了,所以有了注意力模型

4.4.1 注意力機(jī)制

$$
\operatorname{attn}(\boldsymbol{q}, \boldsymbol{k})=\begin{cases}
\boldsymbol{w}^{\top} \tanh (\boldsymbol{W}[\boldsymbol{q} ; \boldsymbol{k}]) & 多層感知器 \\
\boldsymbol{q}^{\top} \boldsymbol{W} \boldsymbol{k} & 雙線性 \\
\boldsymbol{q}^{\top} \boldsymbol{k} & 點積 \\
\dfrac{\boldsymbol{q}^{\top} \boldsymbol{k}}{\sqrt{d}} & 縮放點積(避免因向量維度d過大導(dǎo)致點積結(jié)果過大)
\end{cases}
$$

4.4.2 自注意力模型

具體地,假設(shè)輸入為n個向量組成的序列x_1,x_2,...,x_n,輸出為每個向量對應(yīng)的新的向量表示y_1,y_2,...,y_n,其中所有向量的大小均為d。那么,y_i的計算公式為

y_i = \sum_{j=1}^{n} \alpha_{ij}x_j

式中栈源,j是整個序列的索引值;\alpha_{ij}x_ix_j之間的注意力(權(quán)重)挡爵,其通過attn函數(shù)計算,然后再經(jīng)過softmax函數(shù)進(jìn)行歸一化后獲得甚垦。直觀上的含義是如果x_ix_j越相關(guān)茶鹃,則它們計算的注意力值就越大,那么x_jx_i對應(yīng)的新的表示y_i的貢獻(xiàn)就越大

通過自注意力機(jī)制,可以直接計算兩個距離較遠(yuǎn)的時刻之間的關(guān)系。而在循環(huán)神經(jīng)網(wǎng)絡(luò)中,由于信息是沿著時刻逐層傳遞的,因此當(dāng)兩個相關(guān)性較大的時刻距離較遠(yuǎn)時,會產(chǎn)生較大的信息損失。雖然引入了門控機(jī)制模型(如LSTM等),但治標(biāo)不治本。因此,基于自注意力機(jī)制的自注意力模型已經(jīng)逐步取代循環(huán)神經(jīng)網(wǎng)絡(luò),成為自然語言處理的標(biāo)準(zhǔn)模型
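
下面是縮放點積自注意力計算過程的最小示意(這里簡化地令Q=K=V為輸入本身,實際模型中會先分別經(jīng)過線性變換):

import torch
import torch.nn.functional as F

def self_attention(x):
    """x: (batch, n, d),返回每個位置的新表示y"""
    d = x.shape[-1]
    scores = torch.bmm(x, x.transpose(1, 2)) / (d ** 0.5)  # 縮放點積,(batch, n, n)
    alpha = F.softmax(scores, dim=-1)                      # 歸一化得到注意力權(quán)重
    return torch.bmm(alpha, x)                             # y_i = sum_j alpha_ij * x_j

x = torch.rand(2, 5, 8)           # batch=2,序列長度5,向量維度8
print(self_attention(x).shape)    # torch.Size([2, 5, 8])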

4.4.3 Transformer

4.4.3.1 融入位置信息

兩種方式,位置嵌入(Position Embedding)和位置編碼(Position Encodings)

  • 位置嵌入與詞嵌入類似
  • 位置編碼是將位置索引值通過函數(shù)映射到一個d維向量(見下面的正弦位置編碼示例)
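
下面是常見的正弦/余弦位置編碼函數(shù)的一個參考實現(xiàn)示意(max_len與d_model均為示例值):

import torch

def sinusoidal_position_encoding(max_len, d_model):
    """偶數(shù)維用sin、奇數(shù)維用cos,返回(max_len, d_model)的位置編碼矩陣"""
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()            # (d_model/2,)
    div = torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe

print(sinusoidal_position_encoding(50, 128).shape)   # torch.Size([50, 128])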

4.4.3.2 Transformer塊(Block)

包含自注意力、層歸一化(Layer Normalization)和殘差連接(Residual Connections)

4.4.3.3 自注意力計算結(jié)果互斥

自注意力結(jié)果需要經(jīng)過歸一化,導(dǎo)致即使一個輸入和多個其他的輸入相關(guān),也無法同時為這些輸入賦予較大的注意力值,即自注意力結(jié)果之間是互斥的,無法同時關(guān)注多個輸入,所以使用多頭自注意力模型(見下面的示例)
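
下面用PyTorch自帶的nn.MultiheadAttention演示多頭自注意力(batch_first參數(shù)需要較新版本的PyTorch,各參數(shù)均為示例值):

import torch
from torch import nn

# 8個頭各自在不同子空間計算注意力,再拼接并線性變換
mha = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)
x = torch.rand(2, 10, 128)                # batch=2,序列長度10,向量維度128
out, attn_weights = mha(x, x, x)          # 自注意力:query=key=value=x
print(out.shape, attn_weights.shape)      # torch.Size([2, 10, 128]) torch.Size([2, 10, 10])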

4.4.3.4 Transformer模型的優(yōu)缺點

優(yōu)點:與循環(huán)神經(jīng)網(wǎng)絡(luò)相比,Transformer能夠直接建模輸入序列單元之間更長距離的依賴關(guān)系,從而使得Transformer對于長序列建模的能力更強。另外,在Transformer的編碼階段,由于可以利用GPU等多核計算設(shè)備并行地計算Transformer塊內(nèi)部的自注意力模型,而循環(huán)神經(jīng)網(wǎng)絡(luò)需要逐個時刻計算,因此Transformer具有更高的訓(xùn)練速度。

缺點:與循環(huán)神經(jīng)網(wǎng)絡(luò)相比,Transformer的一個明顯缺點是參數(shù)量過于龐大。每一層Transformer塊的大部分參數(shù)集中在自注意力模型中輸入向量的三個角色映射矩陣、多頭機(jī)制導(dǎo)致的相應(yīng)參數(shù)倍增和引入非線性的多層感知器等。
更主要的是,還需要堆疊多層Transformer塊,從而參數(shù)量又?jǐn)U大多倍,最終導(dǎo)致一個實用的Transformer模型含有巨大的參數(shù)量。巨大的參數(shù)量導(dǎo)致Transformer模型非常不容易訓(xùn)練,尤其是當(dāng)訓(xùn)練數(shù)據(jù)較小時

4.5 神經(jīng)網(wǎng)絡(luò)模型的訓(xùn)練

4.5.1 損失函數(shù)

均方誤差(Mean Squared Error,MSE)和交叉熵?fù)p失(Cross-Entropy,CE)

4.5.2 小批次梯度下降


import torch
from torch import nn, optim
from torch.nn import functional as F

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_class):
        super(MLP, self).__init__()

        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.activate = F.relu
        self.linear2 = nn.Linear(hidden_dim, num_class)

    def forward(self, inputs):
        hidden = self.linear1(inputs)
        activation = self.activate(hidden)
        outputs = self.linear2(activation)
        log_probs = F.log_softmax(outputs, dim=1)
        return log_probs

# 異或問題的4個輸入
x_train = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
# 每個輸入對應(yīng)的輸出類別
y_train = torch.tensor([0, 1, 1, 0])

# 創(chuàng)建多層感知器模型,輸入層大小為2,隱含層大小為5,輸出層大小為2(即有兩個類別)
model = MLP(input_dim=2, hidden_dim=5, num_class=2)

criterion = nn.NLLLoss() # 當(dāng)使用log_softmax輸出時,需要調(diào)用負(fù)對數(shù)似然損失(Negative Log Likelihood,NLL)
optimizer = optim.SGD(model.parameters(), lr=0.05)

for epoch in range(500):
    y_pred = model(x_train)
    loss = criterion(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("Parameters:")
for name, param in model.named_parameters():
    print (name, param.data)

y_pred = model(x_train)
print("y_pred = ", y_pred)
print("Predicted results:", y_pred.argmax(axis=1))

# 結(jié)果:
# Parameters:
# linear1.weight tensor([[-0.4509, -0.5591],
#         [-1.2904,  1.2947],
#         [ 0.8418,  0.8424],
#         [-0.4408, -0.1356],
#         [ 1.2886, -1.2879]])
# linear1.bias tensor([ 4.5582e-01, -2.5727e-03, -8.4167e-01, -1.7634e-03, -1.5244e-04])
# linear2.weight tensor([[ 0.5994, -1.4792,  1.0836, -0.2860, -1.0873],
#         [-0.2534,  0.9911, -0.7348,  0.0413,  1.3398]])
# linear2.bias tensor([ 0.7375, -0.1796])
# y_pred =  tensor([[-0.2398, -1.5455],
#         [-2.3716, -0.0980],
#         [-2.3101, -0.1045],
#         [-0.0833, -2.5269]], grad_fn=<LogSoftmaxBackward>)
# Predicted results: tensor([0, 1, 1, 0])

注:

  • nn.Linear可以理解為input_dim個神經(jīng)元與output_dim個神經(jīng)元之間的全連接層
  • argmax(axis=1)函數(shù)找到第二個維度方向上最大值的索引(對于二維張量,就是對每行沿列方向取最大值)
  • 可以將輸出層的softmax層去掉,改用CrossEntropyLoss作為損失函數(shù),其在計算損失時會自動進(jìn)行softmax計算,這樣在模型預(yù)測時可以提高速度,因為不需要進(jìn)行softmax運算,直接將輸出分?jǐn)?shù)最高的類別作為預(yù)測結(jié)果即可
  • 除了SGD還有Adam、Adagrad等,這些優(yōu)化器是對原始梯度下降的改進(jìn),改進(jìn)思路包括動態(tài)調(diào)整學(xué)習(xí)率、對梯度進(jìn)行累積等

4.6 情感分類實戰(zhàn)

4.6.1 詞表映射


from collections import defaultdict

class Vocab:
    def __init__(self, tokens=None):
        self.idx_to_token = list()
        self.token_to_idx = dict()

        if tokens is not None:
            if "<unk>" not in tokens:
                tokens = tokens + ["<unk>"]
            for token in tokens:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1
            self.unk = self.token_to_idx['<unk>']

    @classmethod

    def build(cls, text, min_freq=1, reserved_tokens=None):
        token_freqs = defaultdict(int)
        for sentence in text:
            for token in sentence:
                token_freqs[token] += 1
        uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
        uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
        return cls(uniq_tokens)

    def __len__(self):
        # 返回詞表的大小
        return len(self.idx_to_token)

    def __getitem__(self, token):
        # 查找輸入標(biāo)記對應(yīng)的索引值,如果該標(biāo)記不存在,則返回標(biāo)記<unk>的索引值(0)
        return self.token_to_idx.get(token, self.unk)

    def convert_tokens_to_ids(self, tokens):
        return [self[token] for token in tokens]

    def convert_ids_to_tokens(self, indices):
        return [self.idx_to_token[index] for index in indices]

注:@classmethod表示的是類方法

4.6.2 詞向量層


# 詞表大小為8,向量維度為3
embedding = nn.Embedding(8, 3)
input = torch.tensor([[0, 1, 2, 1], [4, 6, 6, 7]], dtype=torch.long) # torch.long = torch.int64
output = embedding(input)

output
# 即在原始輸入后增加了一個長度為3的維

# 結(jié)果:
# tensor([[[ 0.1747,  0.7580,  0.3107],
#          [ 0.1595,  0.9152,  0.2757],
#          [ 1.0136, -0.5204,  1.0620],
#          [ 0.1595,  0.9152,  0.2757]],

#         [[-0.9784, -0.3794,  1.2752],
#          [-0.4441, -0.2990,  1.0913],
#          [-0.4441, -0.2990,  1.0913],
#          [ 2.0153, -1.0434, -0.9038]]], grad_fn=<EmbeddingBackward>)

4.6.3 融入詞向量的MLP


import torch
from torch import nn
from torch.nn import functional as F

class MLP(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_class):
        super(MLP, self).__init__()
        # 詞向量層
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # 線性變換:詞向量層 -> 隱含層
        self.linear1 = nn.Linear(embedding_dim, hidden_dim)
        self.activate = F.relu
        # 線性變換:激活層 -> 輸出層
        self.linear2 = nn.Linear(hidden_dim, num_class)

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        # 將序列中多個Embedding進(jìn)行聚合(求平均)
        embedding = embeddings.mean(dim=1)
        hidden = self.activate(self.linear1(embedding))
        outputs = self.linear2(hidden)
        # 獲得每個序列屬于某一個類別概率的對數(shù)值
        probs = F.log_softmax(outputs, dim=1)
        return probs

mlp = MLP(vocab_size=8, embedding_dim=3, hidden_dim=5, num_class=2)
inputs = torch.tensor([[0, 1, 2, 1], [4, 6, 6, 7]], dtype=torch.long)
outputs = mlp(inputs)
print(outputs)

# 結(jié)果:
# tensor([[-0.6600, -0.7275],
#         [-0.6108, -0.7828]], grad_fn=<LogSoftmaxBackward>)

4.6.4 文本長度統(tǒng)一


input1 = torch.tensor([0, 1, 2, 1], dtype=torch.long)
input2 = torch.tensor([2, 1, 3, 7, 5], dtype=torch.long)
input3 = torch.tensor([6, 4, 2], dtype=torch.long)
input4 = torch.tensor([1, 3, 4, 3, 5, 7], dtype=torch.long)
inputs = [input1, input2, input3, input4]

offsets = [0] + [i.shape[0] for i in inputs]
print(offsets)

# cumsum累加,即0+4=4,4+5=9,9+3=12
offsets = torch.tensor(offsets[: -1]).cumsum(dim=0)
print(offsets)

inputs = torch.cat(inputs)
print(inputs)
embeddingbag = nn.EmbeddingBag(num_embeddings=8, embedding_dim=3)
embeddings = embeddingbag(inputs, offsets)
print(embeddings)

# 結(jié)果:
# [0, 4, 5, 3, 6]
# tensor([ 0,  4,  9, 12])
# tensor([0, 1, 2, 1, 2, 1, 3, 7, 5, 6, 4, 2, 1, 3, 4, 3, 5, 7])
# tensor([[-0.6750,  0.8048, -0.1771],
#         [ 0.2023, -0.1735,  0.2372],
#         [ 0.4699, -0.2902,  0.3136],
#         [ 0.2327, -0.2667,  0.0326]], grad_fn=<EmbeddingBagBackward>)

4.6.5 數(shù)據(jù)處理


def load_sentence_polarity():
    
    from nltk.corpus import sentence_polarity

    vocab = Vocab.build(sentence_polarity.sents())

    train_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][: 4000] \
        + [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][: 4000] 

    test_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][4000: ] \
        + [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][4000: ]

    return train_data, test_data, vocab

train_data, test_data, vocab = load_sentence_polarity()

4.6.5.1 構(gòu)建DataLoader對象


from torch.utils.data import DataLoader, Dataset

data_loader = DataLoader(
                        dataset,
                        batch_size=64,
                        collate_fn=collate_fn,
                        shuffle=True
                        )

# dataset為Dataset類的一個對象励稳,用于存儲數(shù)據(jù)
class BowDataset(Dataset):
    def __init__(self, data):
        # data為原始的數(shù)據(jù),如使用load_sentence_polarity函數(shù)生成的訓(xùn)練數(shù)據(jù)
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        # 返回下標(biāo)為i的樣例
        return self.data[i]

# collate_fn參數(shù)指向一個函數(shù),用于對一個批次的樣本進(jìn)行整理,如將其轉(zhuǎn)換為張量等

def collate_fn(examples):
    # 從獨立樣本集合中構(gòu)建各批次的輸入輸出
    inputs = [torch.tensor(ex[0]) for ex in examples]
    targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
    offsets = [0] + [i.shape[0] for i in inputs]
    offsets = torch.tensor(offsets[: -1]).cumsum(dim=0)
    inputs = torch.cat(inputs)
    return inputs, offsets, targets

4.6.6 MLP的訓(xùn)練和測試


# tqdm 進(jìn)度條
from tqdm.auto import tqdm
import torch
from torch import nn, optim
from torch.nn import functional as F

class MLP(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_class):
        super(MLP, self).__init__()
        # 詞向量層
        self.embedding = nn.EmbeddingBag(vocab_size, embedding_dim)
        # 線性變換:詞向量層 -> 隱含層
        self.linear1 = nn.Linear(embedding_dim, hidden_dim)
        self.activate = F.relu
        # 線性變換:激活層 -> 輸出層
        self.linear2 = nn.Linear(hidden_dim, num_class)

    def forward(self, inputs, offsets):
        embedding = self.embedding(inputs, offsets)
        hidden = self.activate(self.linear1(embedding))
        outputs = self.linear2(hidden)
        # 獲得每個序列屬于某一個類別概率的對數(shù)值
        probs = F.log_softmax(outputs, dim=1)
        return probs

embedding_dim = 128
hidden_dim = 256
num_class = 2
batch_size = 32
num_epoch = 5

# 加載數(shù)據(jù)
train_data, test_data, vocab = load_sentence_polarity()
train_data = BowDataset(train_data)
test_data = BowDataset(test_data)
train_data_loader = DataLoader(train_data, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_data_loader = DataLoader(test_data, batch_size=1, collate_fn=collate_fn, shuffle=False)

# 加載模型
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MLP(len(vocab), embedding_dim, hidden_dim, num_class)
model.to(device)

# 訓(xùn)練
nll_loss = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epoch):
    total_loss = 0

    for batch in tqdm(train_data_loader, desc=f"Training Epoch {epoch + 1}"):
        inputs, offsets, targets = [x.to(device) for x in batch]
        log_probs = model(inputs, offsets)
        loss = nll_loss(log_probs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
print(f"Loss:{total_loss:.2f}")

# 測試
acc = 0
for batch in tqdm(test_data_loader, desc=f"Testing"):
    inputs, offsets, targets = [x.to(device) for x in batch]
    
    with torch.no_grad():
        output = model(inputs, offsets)
        acc += (output.argmax(dim=1) == targets).sum().item()

print(f"Acc: {acc / len(test_data_loader):.2f}")

# 結(jié)果:
# Training Epoch 1: 100%|██████████| 250/250 [00:03<00:00, 64.04it/s]
# Training Epoch 2: 100%|██████████| 250/250 [00:04<00:00, 55.40it/s]
# Training Epoch 3: 100%|██████████| 250/250 [00:03<00:00, 82.54it/s]
# Training Epoch 4: 100%|██████████| 250/250 [00:03<00:00, 73.36it/s]
# Training Epoch 5: 100%|██████████| 250/250 [00:03<00:00, 72.61it/s]
# Testing:  33%|███▎      | 879/2662 [00:00<00:00, 4420.03it/s]
# Loss:45.66
# Testing: 100%|██████████| 2662/2662 [00:00<00:00, 4633.54it/s]
# Acc: 0.73

4.6.7 基于CNN的情感分類

復(fù)習(xí)一下conv1d

由于MLP詞袋模型表示文本時,只考慮文本中的詞語信息,忽略了詞組信息,而卷積可以提取詞組信息,例如卷積核寬度為2時,就可以提取特征“不 喜歡”

import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import defaultdict
# 進(jìn)度條
from tqdm.auto import tqdm

class Vocab:
    def __init__(self, tokens=None):
        self.idx_to_token = list()
        self.token_to_idx = dict()

        if tokens is not None:
            if "<unk>" not in tokens:
                tokens = tokens + ["<unk>"]
            for token in tokens:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1
            self.unk = self.token_to_idx['<unk>']

    @classmethod

    def build(cls, text, min_freq=1, reserved_tokens=None):
        token_freqs = defaultdict(int)
        for sentence in text:
            for token in sentence:
                token_freqs[token] += 1
        uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
        uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
        return cls(uniq_tokens)

    def __len__(self):
        # 返回詞表的大小
        return len(self.idx_to_token)

    def __getitem__(self, token):
        # 查找輸入標(biāo)記對應(yīng)的索引值,如果該標(biāo)記不存在,則返回標(biāo)記<unk>的索引值(0)
        return self.token_to_idx.get(token, self.unk)

    def convert_tokens_to_ids(self, tokens):
        return [self[token] for token in tokens]

    def convert_ids_to_tokens(self, indices):
        return [self.idx_to_token[index] for index in indices]

class CnnDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        return self.data[i]

def collate_fn(examples):
    inputs = [torch.tensor(ex[0]) for ex in examples]
    targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
    # 對batch內(nèi)的樣本進(jìn)行padding,使其具有相同長度
    inputs = pad_sequence(inputs, batch_first=True)
    return inputs, targets

class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, filter_size, num_filter, num_class):
        super(CNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv1d = nn.Conv1d(embedding_dim, num_filter, filter_size, padding=1)
        self.activate = F.relu
        self.linear = nn.Linear(num_filter, num_class)
    def forward(self, inputs): # inputs: (32, 47) 敛苇,32個長度為47的序列
        embedding = self.embedding(inputs)  # embedding: (32, 47, 128),相當(dāng)于在原有輸入上增加了一個詞向量維度
        convolution = self.activate(self.conv1d(embedding.permute(0, 2, 1))) # convolution: (32, 100, 47)
        pooling = F.max_pool1d(convolution, kernel_size=convolution.shape[2]) # pooling: (32, 100, 1)
        pooling_squeeze = pooling.squeeze(dim=2) # pooling_squeeze: (32, 100)
        outputs = self.linear(pooling_squeeze) # outputs: (32, 2)
        log_probs = F.log_softmax(outputs, dim=1) # log_probs: (32, 2)
        return log_probs


def load_sentence_polarity():
    
    from nltk.corpus import sentence_polarity

    vocab = Vocab.build(sentence_polarity.sents())
    train_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][: 4000] \
        + [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][: 4000] 
    test_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][4000: ] \
        + [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][4000: ]

    return train_data, test_data, vocab

#超參數(shù)設(shè)置
embedding_dim = 128
hidden_dim = 256
num_class = 2
batch_size = 32
num_epoch = 5
filter_size = 3
num_filter = 100

#加載數(shù)據(jù)
train_data, test_data, vocab = load_sentence_polarity()
train_dataset = CnnDataset(train_data)
test_dataset = CnnDataset(test_data)
train_data_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_data_loader = DataLoader(test_dataset, batch_size=1, collate_fn=collate_fn, shuffle=False)

#加載模型
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CNN(len(vocab), embedding_dim, filter_size, num_filter, num_class)
model.to(device) #將模型加載到CPU或GPU設(shè)備

#訓(xùn)練過程
nll_loss = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001) #使用Adam優(yōu)化器

model.train()
for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(train_data_loader, desc=f"Training Epoch {epoch + 1}"):
        inputs, targets = [x.to(device) for x in batch]
        log_probs = model(inputs)
        loss = nll_loss(log_probs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Loss: {total_loss:.2f}")

#測試過程
acc = 0
for batch in tqdm(test_data_loader, desc=f"Testing"):
    inputs, targets = [x.to(device) for x in batch]
    with torch.no_grad():
        output = model(inputs)
        acc += (output.argmax(dim=1) == targets).sum().item()

#輸出在測試集上的準(zhǔn)確率
print(f"Acc: {acc / len(test_data_loader):.2f}")

# 結(jié)果:
# Training Epoch 1: 100%|██████████| 250/250 [00:06<00:00, 36.27it/s]
# Loss: 165.55
# Training Epoch 2: 100%|██████████| 250/250 [00:08<00:00, 31.13it/s]
# Loss: 122.83
# Training Epoch 3: 100%|██████████| 250/250 [00:06<00:00, 36.45it/s]
# Loss: 76.39
# Training Epoch 4: 100%|██████████| 250/250 [00:06<00:00, 41.66it/s]
# Loss: 33.92
# Training Epoch 5: 100%|██████████| 250/250 [00:06<00:00, 39.79it/s]
# Loss: 12.04
# Testing: 100%|██████████| 2662/2662 [00:00<00:00, 2924.88it/s]
# 
# Acc: 0.72

4.6.8 基于Transformer的情感分類

!!!
<p style="color:red">需要重新研讀</p>
!!!

4.7 詞性標(biāo)注實戰(zhàn)

!!!
<p style="color:red">需要重新研讀</p>
!!!

4.8 習(xí)題

5 靜態(tài)詞向量預(yù)訓(xùn)練模型

5.1 神經(jīng)網(wǎng)絡(luò)語言模型

N-gram語言模型存在明顯的缺點:

  • 容易受數(shù)據(jù)稀疏的影響,一般需要平滑處理
  • 無法對長度超過N的上下文依賴關(guān)系進(jìn)行建模

所以,基于神經(jīng)網(wǎng)絡(luò)的語言模型(如RNN、Transformer等)幾乎替代了N-gram語言模型

5.1.1 預(yù)訓(xùn)練任務(wù)

監(jiān)督信號來自于數(shù)據(jù)自身,這種學(xué)習(xí)方式稱為自監(jiān)督學(xué)習(xí)

5.1.1.1 前饋神經(jīng)網(wǎng)絡(luò)語言模型

(1)輸入層

由當(dāng)前時刻t的歷史詞序列w_{t-n+1:t-1}構(gòu)成,可以用獨熱編碼表示,也可以用位置下標(biāo)表示

(2)詞向量層

用低維、稠密向量表示,x\in R^{(n-1)d}表示歷史序列詞向量拼接后的結(jié)果,詞向量矩陣為E\in R^{d\times |V|}

(3)隱含層

W^{hid}\in R^{m\times (n-1)d}為輸入層到隱含層之間的線性變換矩陣,b^{hid}\in R^{m}為偏置,隱含層可以表示為:

h = f(W^{hid}x+b^{hid})

(4)輸出層

y = softmax(W^{out}h+b^{out})

所以語言模型

\theta = \{E, W^{hid}, b^{hid}, W^{out}, b^{out}\}

參數(shù)量為:
詞向量層+隱含層(權(quán)重與偏置)+輸出層(權(quán)重與偏置),即|V|\times d+m\times (n-1)d+m+|V|\times m+|V|

m和d是常數(shù),模型的自由參數(shù)數(shù)量隨詞表大小呈線性增長,且歷史詞數(shù)n的增大并不會顯著增加參數(shù)的數(shù)量

注:語言模型訓(xùn)練完成后的矩陣E為預(yù)訓(xùn)練得到的靜態(tài)詞向量

5.1.1.2 循環(huán)神經(jīng)網(wǎng)絡(luò)語言模型

RNN可以處理不定長的上下文依賴,例如“他喜歡吃蘋果”中依賴較近的“吃”,“他感冒了,于是下班之后去了醫(yī)院”中則依賴較遠(yuǎn)的“感冒”

(1)輸入層

由當(dāng)前時刻t的歷史詞序列w_{1:t-1}構(gòu)成,可以用獨熱編碼表示,也可以用位置下標(biāo)表示

(2)詞向量層

t時刻的輸入由前一個詞w_{t-1}的詞向量和t-1時刻的隱含狀態(tài)h_{t-1}拼接組成

x_{t} = [v_{w_{t-1}};h_{t-1}]

(3)隱含層

h_t = tanh(W^{hid}x_t+b^{hid})

W^{hid}\in R^{m\times (d+m)},b^{hid}\in R^m;其中W^{hid}=[U;V],U\in R^{m\times d}、V\in R^{m\times m}分別是v_{w_{t-1}}、h_{t-1}與隱含層之間的權(quán)值矩陣,公式中常常將二者區(qū)分開寫:

h_t = tanh(Uv_{w_{t-1}}+Vh_{t-1}+b^{hid})

(4)輸出層

y = softmax(W^{out}h_t+b^{out})

所以RNN當(dāng)序列較長時,訓(xùn)練中存在梯度消失或者梯度爆炸的問題,以前的做法是在反向傳播過程中按長度進(jìn)行截斷,從而得到有效的訓(xùn)練,現(xiàn)在多用LSTM和Transformer替代

5.1.2 模型實現(xiàn)

5.1.2.1 前饋神經(jīng)網(wǎng)絡(luò)語言模型


import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence
from collections import defaultdict
# 進(jìn)度條
from tqdm.auto import tqdm
import nltk

BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"

nltk.download('reuters')
nltk.download('punkt')

# from zipfile import ZipFile
# file_loc = '/root/nltk_data/corpora/reuters.zip'
# with ZipFile(file_loc, 'r') as z:
#   z.extractall('/root/nltk_data/corpora/')


# file_loc = '/root/nltk_data/corpora/punkt.zip'
# with ZipFile(file_loc, 'r') as z:
#   z.extractall('/root/nltk_data/corpora/')

class Vocab:
    def __init__(self, tokens=None):
        self.idx_to_token = list()
        self.token_to_idx = dict()

        if tokens is not None:
            if "<unk>" not in tokens:
                tokens = tokens + ["<unk>"]
            for token in tokens:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1
            self.unk = self.token_to_idx['<unk>']

    @classmethod

    def build(cls, text, min_freq=1, reserved_tokens=None):
        token_freqs = defaultdict(int)
        for sentence in text:
            for token in sentence:
                token_freqs[token] += 1
        uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
        uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
        return cls(uniq_tokens)

    def __len__(self):
        # 返回詞表的大小
        return len(self.idx_to_token)

    def __getitem__(self, token):
        # 查找輸入標(biāo)記對應(yīng)的索引值,如果該標(biāo)記不存在,則返回標(biāo)記<unk>的索引值(0)
        return self.token_to_idx.get(token, self.unk)

    def convert_tokens_to_ids(self, tokens):
        return [self[token] for token in tokens]

    def convert_ids_to_tokens(self, indices):
        return [self.idx_to_token[index] for index in indices]

def get_loader(dataset, batch_size, shuffle=True):
    data_loader = DataLoader(
        dataset,
        batch_size=batch_size,
        collate_fn=dataset.collate_fn,
        shuffle=shuffle
    )
    return data_loader

# 讀取Reuters語料庫
def load_reuters():

    from nltk.corpus import reuters

    text = reuters.sents()
    text = [[word.lower() for word in sentence]for sentence in text]
    vocab = Vocab.build(text, reserved_tokens=[PAD_TOKEN, BOS_TOKEN, EOS_TOKEN])
    corpus = [vocab.convert_tokens_to_ids(sentence) for sentence in text]
    return corpus, vocab


# 保存詞向量
def save_pretrained(vocab, embeds, save_path):
    """
    Save pretrained token vectors in a unified format, where the first line
    specifies the `number_of_tokens` and `embedding_dim` followed with all
    token vectors, one token per line.
    """
    with open(save_path, "w") as writer:
        writer.write(f"{embeds.shape[0]} {embeds.shape[1]}\n")
        for idx, token in enumerate(vocab.idx_to_token):
            vec = " ".join(["{:.4f}".format(x) for x in embeds[idx]])
            writer.write(f"{token} {vec}\n")
    print(f"Pretrained embeddings saved to: {save_path}")


# Dataset類
class NGramDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2):
        self.data = []
        self.bos = vocab[BOS_TOKEN] # 句首標(biāo)記id
        self.eos = vocab[EOS_TOKEN] # 句尾標(biāo)記id
        for sentence in tqdm(corpus, desc="Data Construction"):
            # 插入句首、句尾標(biāo)記符
            sentence = [self.bos] + sentence + [self.eos]
            # 如句子長度小于預(yù)定義的上下文大小、則跳過
            if len(sentence) < context_size:
                continue
            for i in range(context_size, len(sentence)):
                # 模型輸入:長度為context_size的上下文
                context = sentence[i-context_size: i]
                # 當(dāng)前詞
                target = sentence[i]
                # 每個訓(xùn)練樣本由(context, target)構(gòu)成
                self.data.append((context, target))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        # 從獨立樣本集合中構(gòu)建批次的輸入輸出,并轉(zhuǎn)換為PyTorch張量類型
        inputs = torch.tensor([ex[0] for ex in examples], dtype=torch.long)
        targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
        return (inputs, targets)

# 模型
class FeedForwaardNNLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim):
        super(FeedForwaardNNLM, self).__init__()
        # 詞向量層
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, vocab_size)
        self.activate = F.relu

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((inputs.shape[0], -1))
        hidden = self.activate(self.linear1(embeds))
        output = self.linear2(hidden)
        log_probs = F.log_softmax(output, dim=1)
        return log_probs

# 訓(xùn)練
embedding_dim = 128
hidden_dim = 256
batch_size = 1024
context_size = 3
num_epoch = 10

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
corpus, vocab = load_reuters()
dataset = NGramDataset(corpus, vocab, context_size)
data_loader = get_loader(dataset, batch_size)
nll_loss = nn.NLLLoss()
model = FeedForwaardNNLM(len(vocab), embedding_dim, context_size, hidden_dim)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
total_losses = []

for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
        inputs, targets = [x.to(device) for x in batch]
        optimizer.zero_grad()
        log_probs = model(inputs)
        loss = nll_loss(log_probs, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Loss: {total_loss:.2f}")
    total_losses.append(total_loss)


save_pretrained(vocab, model.embeddings.weight.data, "/home/ffnnlm.vec")

# 結(jié)果
# [nltk_data] Downloading package reuters to /root/nltk_data...
# [nltk_data]   Package reuters is already up-to-date!
# [nltk_data] Downloading package punkt to /root/nltk_data...
# [nltk_data]   Package punkt is already up-to-date!
# Data Construction: 100%
# 54716/54716 [00:03<00:00, 19224.51it/s]
# Training Epoch 0: 100%
# 1628/1628 [00:35<00:00, 34.02it/s]
# Loss: 8310.34
# Training Epoch 1: 100%
# 1628/1628 [00:36<00:00, 44.29it/s]
# Loss: 6934.16
# Training Epoch 2: 100%
# 1628/1628 [00:36<00:00, 44.31it/s]
# Loss: 6342.58
# Training Epoch 3: 100%
# 1628/1628 [00:37<00:00, 42.65it/s]
# Loss: 5939.16
# Training Epoch 4: 100%
# 1628/1628 [00:37<00:00, 42.70it/s]
# Loss: 5666.03
# Training Epoch 5: 100%
# 1628/1628 [00:38<00:00, 42.76it/s]
# Loss: 5477.37
# Training Epoch 6: 100%
# 1628/1628 [00:38<00:00, 42.18it/s]
# Loss: 5333.53
# Training Epoch 7: 100%
# 1628/1628 [00:38<00:00, 42.44it/s]
# Loss: 5214.55
# Training Epoch 8: 100%
# 1628/1628 [00:38<00:00, 42.16it/s]
# Loss: 5111.15
# Training Epoch 9: 100%
# 1628/1628 [00:38<00:00, 42.21it/s]
# Loss: 5021.05
# Pretrained embeddings saved to: /home/ffnnlm.vec

5.1.2.2 循環(huán)神經(jīng)網(wǎng)絡(luò)語言模型


import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import defaultdict
from tqdm.auto import tqdm
import nltk

BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"

nltk.download('reuters')
nltk.download('punkt')

# from zipfile import ZipFile
# file_loc = '/root/nltk_data/corpora/reuters.zip'
# with ZipFile(file_loc, 'r') as z:
#   z.extractall('/root/nltk_data/corpora/')


# file_loc = '/root/nltk_data/corpora/punkt.zip'
# with ZipFile(file_loc, 'r') as z:
#   z.extractall('/root/nltk_data/corpora/')

class Vocab:
    def __init__(self, tokens=None):
        self.idx_to_token = list()
        self.token_to_idx = dict()

        if tokens is not None:
            if "<unk>" not in tokens:
                tokens = tokens + ["<unk>"]
            for token in tokens:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1
            self.unk = self.token_to_idx['<unk>']

    @classmethod

    def build(cls, text, min_freq=1, reserved_tokens=None):
        token_freqs = defaultdict(int)
        for sentence in text:
            for token in sentence:
                token_freqs[token] += 1
        uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
        uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
        return cls(uniq_tokens)

    def __len__(self):
        # 返回詞表的大小
        return len(self.idx_to_token)

    def __getitem__(self, token):
        # 查找輸入標(biāo)記對應(yīng)的索引值,如果該標(biāo)記不存在,則返回標(biāo)記<unk>的索引值(0)
        return self.token_to_idx.get(token, self.unk)

    def convert_tokens_to_ids(self, tokens):
        return [self[token] for token in tokens]

    def convert_ids_to_tokens(self, indices):
        return [self.idx_to_token[index] for index in indices]

def get_loader(dataset, batch_size, shuffle=True):
    data_loader = DataLoader(
        dataset,
        batch_size=batch_size,
        collate_fn=dataset.collate_fn,
        shuffle=shuffle
    )
    return data_loader

# 讀取Reuters語料庫
def load_reuters():

    from nltk.corpus import reuters

    text = reuters.sents()
    text = [[word.lower() for word in sentence]for sentence in text]
    vocab = Vocab.build(text, reserved_tokens=[PAD_TOKEN, BOS_TOKEN, EOS_TOKEN])
    corpus = [vocab.convert_tokens_to_ids(sentence) for sentence in text]
    return corpus, vocab


# 保存詞向量
def save_pretrained(vocab, embeds, save_path):
    """
    Save pretrained token vectors in a unified format, where the first line
    specifies the `number_of_tokens` and `embedding_dim` followed with all
    token vectors, one token per line.
    """
    with open(save_path, "w") as writer:
        writer.write(f"{embeds.shape[0]} {embeds.shape[1]}\n")
        for idx, token in enumerate(vocab.idx_to_token):
            vec = " ".join(["{:.4f}".format(x) for x in embeds[idx]])
            writer.write(f"{token} {vec}\n")
    print(f"Pretrained embeddings saved to: {save_path}")

class RnnlmDataset(Dataset):
    def __init__(self, corpus, vocab):
        self.data = []
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        self.pad = vocab[PAD_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            # 模型輸入:BOS_TOKEN, w_1, w_2, ..., w_n
            input = [self.bos] + sentence
            # 模型輸出:w_1, w_2, ..., w_n, EOS_TOKEN
            target = sentence + [self.eos]
            self.data.append((input, target))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        # 從獨立樣本集合中構(gòu)建batch輸入輸出
        inputs = [torch.tensor(ex[0]) for ex in examples]
        targets = [torch.tensor(ex[1]) for ex in examples]
        # 對batch內(nèi)的樣本進(jìn)行padding,使其具有相同長度
        inputs = pad_sequence(inputs, batch_first=True, padding_value=self.pad)
        targets = pad_sequence(targets, batch_first=True, padding_value=self.pad)
        return (inputs, targets)

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(RNNLM, self).__init__()
        # 詞嵌入層
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # 循環(huán)神經(jīng)網(wǎng)絡(luò):這里使用LSTM
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # 輸出層
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        # 計算每一時刻的隱含層表示
        hidden, _ = self.rnn(embeds)
        output = self.output(hidden)
        log_probs = F.log_softmax(output, dim=2)
        return log_probs

embedding_dim = 64
context_size = 2
hidden_dim = 128
batch_size = 1024
num_epoch = 10

# 讀取文本數(shù)據(jù),構(gòu)建RNNLM訓(xùn)練數(shù)據(jù)集
corpus, vocab = load_reuters()
dataset = RnnlmDataset(corpus, vocab)
data_loader = get_loader(dataset, batch_size)

# 負(fù)對數(shù)似然損失函數(shù),忽略pad_token處的損失
nll_loss = nn.NLLLoss(ignore_index=dataset.pad)
# 構(gòu)建RNNLM,并加載至device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = RNNLM(len(vocab), embedding_dim, hidden_dim)
model.to(device)
# 使用Adam優(yōu)化器
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
        inputs, targets = [x.to(device) for x in batch]
        optimizer.zero_grad()
        log_probs = model(inputs)
        loss = nll_loss(log_probs.view(-1, log_probs.shape[-1]), targets.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Loss: {total_loss:.2f}")

save_pretrained(vocab, model.embeddings.weight.data, "/home/rnnlm.vec")

5.2 Word2vec詞向量

5.2.1 CBOW

用周圍詞預(yù)測中心詞

(1)輸入層

窗口為5,輸入層由4個維度為詞表長度|\mathbb{V}|的獨熱表示向量構(gòu)成

(2)詞向量層

輸入層中每個詞的獨熱表示向量經(jīng)由矩陣 \boldsymbol{E} \in \mathbb{R}^{d \times|\mathbb{V}|} 映射至詞向量空間:

\boldsymbol{v}_{w_{i}}=\boldsymbol{E} \boldsymbol{e}_{w_{i}}

w_{i} 對應(yīng)的詞向量即為矩陣 \boldsymbol{E} 中相應(yīng)位置的列向量, \boldsymbol{E} 則為由所有詞向量構(gòu) 成的矩陣或查找表碟摆。令 \mathcal{C}_{t}=\left\{w_{t-k}, \cdots, w_{t-1}, w_{t+1}, \cdots, w_{t+k}\right\} 表示 w_{t} 的上下文單詞集合, 對 \mathcal{C}_{t} 中所有詞向量取平均, 就得到了 w_{t} 的上下文表示:

\boldsymbol{v}_{\mathcal{C}_{t}}=\frac{1}{\left|\mathcal{C}_{t}\right|} \sum_{w \in \mathcal{C}_{t}} \boldsymbol{v}_{w}

(3)輸出層

\boldsymbol{E}^{\prime} \in \mathbb{R}^{|V| \times d} 為隱含層到輸出層的權(quán)值矩陣, 記 \boldsymbol{v}_{w_{i}}^{\prime}\boldsymbol{E}^{\prime} 中與 w_{i} 對應(yīng)的行向量, 那么 輸出 w_{t} 的概率可由下式計算:

P\left(w_{t} \mid \mathcal{C}_{t}\right)=\frac{\exp \left(\boldsymbol{v}_{\mathcal{C}_{t}} \cdot \boldsymbol{v}_{w_{t}}^{\prime}\right)}{\sum_{w^{\prime} \in \mathbb{V}} \exp \left(\boldsymbol{v}_{\mathcal{C}_{t}} \cdot \boldsymbol{v}_{w^{\prime}}^{\prime}\right)}

在 CBOW 模型的參數(shù)中, 矩陣 \boldsymbol{E} (上下文矩陣) 和 \boldsymbol{E}^{\prime} (中心詞矩陣) 均可作為詞向量矩陣, 它們分別描述了詞表中的詞在作為條件上下文或目標(biāo)詞時的不同性質(zhì)。在實際中, 通常只用 E 就能夠滿足應(yīng)用需求, 但是在某些任務(wù)中, 對兩者進(jìn)行組合得到的向量可能會取得更好的表現(xiàn)
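
下面是CBOW模型前向計算的一個最小示意(非書中代碼,超參數(shù)均為示例值):

import torch
from torch import nn
import torch.nn.functional as F

class CBOW(nn.Module):
    """對上下文詞向量取平均后預(yù)測中心詞"""
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)   # 上下文矩陣E
        self.output = nn.Linear(embedding_dim, vocab_size)          # 中心詞矩陣E'
    def forward(self, contexts):                 # contexts: (batch, 2k)
        embeds = self.embeddings(contexts)       # (batch, 2k, d)
        hidden = embeds.mean(dim=1)              # 上下文表示v_Ct:(batch, d)
        return F.log_softmax(self.output(hidden), dim=1)

model = CBOW(vocab_size=1000, embedding_dim=64)
contexts = torch.randint(0, 1000, (8, 4))        # 批次為8,窗口為5時上下文詞數(shù)為4
print(model(contexts).shape)                     # torch.Size([8, 1000])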

5.2.2 Skip-gram模型

中心詞預(yù)測周圍

過程:

P\left(c \mid w_{t}\right)=\frac{\exp \left(\boldsymbol{v}_{w_{t}} \cdot \boldsymbol{v}_{c}^{\prime}\right)}{\sum_{w^{\prime} \in \mathbb{V}} \exp \left(\boldsymbol{v}_{w_{t}} \cdot \boldsymbol{v}_{w^{\prime}}^{\prime}\right)}

式中, c \in\left\{w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\right\}。
與 CBOW 模型類似,Skip-gram 模型中的權(quán)值矩陣 \boldsymbol{E}(中心詞矩陣)與 \boldsymbol{E}^{\prime}(上下文矩陣)均可作為詞向量矩陣使用。

5.2.3 參數(shù)估計

與神經(jīng)網(wǎng)絡(luò)語言模型類似,可以通過優(yōu)化分類損失對 CBOW 模型和 Skip-gram 模型進(jìn)行訓(xùn)練,需要估計的參數(shù)為 \boldsymbol{\theta}=\left\{\boldsymbol{E}, \boldsymbol{E}^{\prime}\right\}。例如,給定一段長為 T 的詞序列 w_{1} w_{2} \cdots w_{T}

5.2.3.1 CBOW 模型的負(fù)對數(shù)似然損失函數(shù)為:

\mathcal{L}(\boldsymbol{\theta})=-\sum_{t=1}^{T} \log P\left(w_{t} \mid \mathcal{C}_{t}\right)

式中, \mathcal{C}_{t}=\left\{w_{t-k}, \cdots, w_{t-1}, w_{t+1}, \cdots, w_{t+k}\right\}。

5.2.3.2 Skip-gram 模型的負(fù)對數(shù)似然損失函數(shù)為:

\mathcal{L}(\boldsymbol{\theta})=-\sum_{t=1}^{T} \sum_{-k \leqslant j \leqslant k, j \neq 0} \log P\left(w_{t+j} \mid w_{t}\right)

5.2.4 負(fù)采樣

負(fù)采樣(Negative Sampling)構(gòu)造了一個新的有監(jiān)督學(xué)習(xí)問題:給定兩個單詞,比如orange和juice,去預(yù)測這是否是一對上下文詞-目標(biāo)詞(context-target)對,即這兩個詞是否會在一句話中相鄰出現(xiàn),這是一個二分類問題
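
下面是Skip-gram負(fù)采樣(SGNS)損失計算方式的一個示意(張量均為隨機(jī)生成,僅演示損失的形式):

import torch
import torch.nn.functional as F

def sgns_loss(v_w, v_c_pos, v_c_negs):
    """v_w:中心詞向量(batch, d);v_c_pos:真實上下文詞向量(batch, d);
    v_c_negs:隨機(jī)采樣的k個負(fù)樣本上下文詞向量(batch, k, d)"""
    pos_score = torch.sum(v_w * v_c_pos, dim=1)                    # (batch,)
    neg_score = torch.bmm(v_c_negs, v_w.unsqueeze(2)).squeeze(2)   # (batch, k)
    pos_loss = F.logsigmoid(pos_score)                             # 正樣本判為1
    neg_loss = F.logsigmoid(-neg_score).sum(dim=1)                 # 負(fù)樣本判為0
    return -(pos_loss + neg_loss).mean()

v_w = torch.rand(8, 64); v_pos = torch.rand(8, 64); v_negs = torch.rand(8, 5, 64)
print(sgns_loss(v_w, v_pos, v_negs))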
