Natural Language Processing (NLP): Combining BERT with LSTM

Background

Natural language processing (NLP) is a major branch of deep learning (alongside computer vision and speech). The field has matured considerably in recent years and is gradually being adopted in industry, and Google's open-sourced BERT is another milestone for NLP. This article combines BERT with an LSTM to build a binary classifier for text data.

Required third-party libraries

  • pandas
  • numpy
  • torch
  • transformers
  • sklearn

Working with these libraries assumes some familiarity with machine learning and deep learning.
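
If they are not already installed, a typical setup (assuming pip; sklearn is installed under the name scikit-learn) looks like this:

pip install pandas numpy torch transformers scikit-learn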

Data and pretrained BERT
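
The dataset used below (dianping.csv) contains 2,000 Dianping restaurant reviews, with a comment column holding the review text and a sentiment column holding the 0/1 label; the pretrained weights are a local copy of the Chinese whole-word-masking BERT under chinese-bert_chinese_wwm_pytorch/data. If that local directory is not available, one hedged alternative is to pull equivalent weights from the Hugging Face Hub (the model id below is an assumption, not part of the original setup):

from transformers import BertTokenizer, BertModel

# assumed Hub id for Chinese whole-word-masking BERT; swap in your own local path if you have one
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm")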

Complete walkthrough

  • Data preprocessing
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

np.random.seed(2020)
torch.manual_seed(2020)
USE_CUDA = torch.cuda.is_available()
if USE_CUDA:
    torch.cuda.manual_seed(2020)
data = pd.read_csv('./dianping.csv', encoding='utf-8')
# strip punctuation, \xa0 and whitespace
def pretreatment(comments):
    result_comments = []
    punctuation = '。,?!:%&~()、;“”&|,.?!:%&~();""'
    for comment in comments:
        comment = ''.join([c for c in comment if c not in punctuation])
        comment = ''.join(comment.split())   # removes \xa0 and other whitespace
        result_comments.append(comment)

    return result_comments
result_comments=pretreatment(list(data['comment'].values))
len(result_comments)

2000

result_comments[:1]

['口味不知道是我口高了還是這家真不怎么樣我感覺口味確實很一般很一般上菜相當(dāng)快我敢說菜都是提前做好的幾乎都不熱菜品酸湯肥牛干辣干辣的還有一股泡椒味著實受不了環(huán)境室內(nèi)整體裝修確實不錯但是大廳人多太亂服務(wù)一般吧說不上好但是也不差價格一般大眾價格都能接受人太多了排隊很厲害以后不排隊也許還會來比如早去路過排隊就不值了票據(jù)六日沒票告我周一到周五可能有票相當(dāng)不正規(guī)在這一點同等價位遠(yuǎn)不如外婆家']

  • First, encode the text into character ids with transformers
from transformers import BertTokenizer,BertModel

tokenizer = BertTokenizer.from_pretrained("./chinese-bert_chinese_wwm_pytorch/data")
result_comments_id=tokenizer(result_comments,padding=True,truncation=True,max_length=200,return_tensors='pt')
result_comments_id

{'input_ids': tensor([[ 101, 1366, 1456, ..., 0, 0, 0],
[ 101, 5831, 1501, ..., 0, 0, 0],
[ 101, 6432, 4696, ..., 0, 0, 0],
...,
[ 101, 7566, 4408, ..., 0, 0, 0],
[ 101, 2207, 6444, ..., 0, 0, 0],
[ 101, 2523, 679, ..., 0, 0, 0]]), 'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]])}

result_comments_id['input_ids'].shape

torch.Size([2000, 200])
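
To see what these ids correspond to, a quick check (a sketch using the tokenizer's standard convert_ids_to_tokens method) maps the first few ids of the first review back to characters:

# first ten token ids of the first comment mapped back to tokens ([CLS] plus the leading characters)
print(tokenizer.convert_ids_to_tokens(result_comments_id['input_ids'][0][:10].tolist()))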

  • Split the dataset
from sklearn.model_selection import train_test_split
X=result_comments_id['input_ids']
y=torch.from_numpy(data['sentiment'].values).float()

X_train,X_test, y_train, y_test =train_test_split(X,y,test_size=0.3,shuffle=True,stratify=y,random_state=2020)
len(X_train),len(X_test)

(1400, 600)

X_valid,X_test,y_valid,y_test=train_test_split(X_test,y_test,test_size=0.5,shuffle=True,stratify=y_test,random_state=2020)
len(X_valid),len(X_test)

(300, 300)

X_train.shape

torch.Size([1400, 200])

y_train.shape

torch.Size([1400])

y_train[:1]

tensor([1.])

  • Datasets and data loaders
# create Tensor datasets
train_data = TensorDataset(X_train, y_train)
valid_data = TensorDataset(X_valid, y_valid)
test_data = TensorDataset(X_test,y_test)

# dataloaders
batch_size = 32

# make sure to SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size,drop_last=True)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size,drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size,drop_last=True)
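
As an optional sanity check, one batch from the training loader should be token ids of shape [batch_size, 200] with a matching label vector:

# pull a single batch and confirm the shapes
sample_x, sample_y = next(iter(train_loader))
print(sample_x.shape, sample_y.shape)   # expected: torch.Size([32, 200]) torch.Size([32])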
  • Build the model
if(USE_CUDA):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.

class bert_lstm(nn.Module):
    def __init__(self, hidden_dim,output_size,n_layers,bidirectional=True, drop_prob=0.5):
        super(bert_lstm, self).__init__()
 
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.bidirectional = bidirectional
        
        # BERT ---------------- key point: the pretrained BERT model is embedded inside this custom module
        self.bert = BertModel.from_pretrained("./chinese-bert_chinese_wwm_pytorch/data")
        for param in self.bert.parameters():
            param.requires_grad = True   # fine-tune all BERT weights together with the LSTM head
        
        # LSTM layers
        self.lstm = nn.LSTM(768, hidden_dim, n_layers, batch_first=True,bidirectional=bidirectional)
        
        # dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        # linear and sigmoid layers
        if bidirectional:
            self.fc = nn.Linear(hidden_dim*2, output_size)
        else:
            self.fc = nn.Linear(hidden_dim, output_size)
          
        #self.sig = nn.Sigmoid()
 
    def forward(self, x, hidden):
        batch_size = x.size(0)
        # BERT token embeddings: last_hidden_state, shape [batch, seq_len, 768]
        # (only input_ids are passed here, so padding positions are not masked out)
        x = self.bert(x)[0]
        
        # lstm_out
        #x = x.float()
        lstm_out, (hidden_last,cn_last) = self.lstm(x, hidden)
        #print(lstm_out.shape)      #[32, 200, 768]
        #print(hidden_last.shape)   #[4, 32, 384]
        #print(cn_last.shape)       #[4, 32, 384]
        
        # the bidirectional case needs separate handling
        if self.bidirectional:
            # forward direction: last layer, last time step
            hidden_last_L = hidden_last[-2]
            #print(hidden_last_L.shape)  #[32, 384]
            # backward direction: last layer, last time step
            hidden_last_R = hidden_last[-1]
            #print(hidden_last_R.shape)   #[32, 384]
            # concatenate the two directions
            hidden_last_out = torch.cat([hidden_last_L, hidden_last_R], dim=-1)
            #print(hidden_last_out.shape,'hidden_last_out')   #[32, 768]
        else:
            hidden_last_out=hidden_last[-1]   #[32, 384]
            
            
        # dropout and fully-connected layer
        out = self.dropout(hidden_last_out)
        #print(out.shape)    #[32,768]
        out = self.fc(out)
        
        return out
    
    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        
        number = 1
        if self.bidirectional:
            number = 2
        
        if (USE_CUDA):
            hidden = (weight.new(self.n_layers*number, batch_size, self.hidden_dim).zero_().float().cuda(),
                      weight.new(self.n_layers*number, batch_size, self.hidden_dim).zero_().float().cuda()
                     )
        else:
            hidden = (weight.new(self.n_layers*number, batch_size, self.hidden_dim).zero_().float(),
                      weight.new(self.n_layers*number, batch_size, self.hidden_dim).zero_().float()
                     )
        
        return hidden
output_size = 1
hidden_dim = 384   #768/2
n_layers = 2
bidirectional = True  # True: use a bidirectional LSTM

net = bert_lstm(hidden_dim, output_size,n_layers, bidirectional)

#print(net)
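
Before training, it can be worth confirming the output shape with a single forward pass (a sketch; at this point the model is still on the CPU, so this is slow but only done once):

# optional shape check on one batch from the training loader
h0 = net.init_hidden(batch_size)
xb, yb = next(iter(train_loader))
out = net(xb, h0)
print(out.shape)   # expected: torch.Size([32, 1]) — one logit per comment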
  • Train the model
# loss and optimization functions
lr = 2e-5
# output_size is 1 and the labels are 0/1 floats, so binary cross-entropy on the raw logit is the matching loss
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

# training params
epochs = 10
# batch_size=50
print_every = 7
clip=5 # gradient clipping
 
# move model to GPU, if available
if(USE_CUDA):
    net.cuda()
net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)
    counter = 0
 
    # batch loop
    for inputs, labels in train_loader:
        counter += 1
        
        if(USE_CUDA):
            inputs, labels = inputs.cuda(), labels.cuda()
        h = tuple([each.data for each in h])
        net.zero_grad()
        output = net(inputs, h)
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        nn.utils.clip_grad_norm_(net.parameters(), clip)   # apply the gradient clipping configured above
        optimizer.step()
 
        # loss stats
        if counter % print_every == 0:
            net.eval()
            with torch.no_grad():
                val_h = net.init_hidden(batch_size)
                val_losses = []
                for inputs, labels in valid_loader:
                    val_h = tuple([each.data for each in val_h])

                    if(USE_CUDA):
                        inputs, labels = inputs.cuda(), labels.cuda()

                    output = net(inputs, val_h)
                    val_loss = criterion(output.squeeze(), labels.float())

                    val_losses.append(val_loss.item())
 
            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/10... Step: 7... Loss: 0.679703... Val Loss: 0.685275
Epoch: 1/10... Step: 14... Loss: 0.713852... Val Loss: 0.674887
.............
Epoch: 10/10... Step: 35... Loss: 0.078265... Val Loss: 0.370415
Epoch: 10/10... Step: 42... Loss: 0.171208... Val Loss: 0.323075

  • Test
test_losses = [] # track loss
num_correct = 0
 
# init hidden state
h = net.init_hidden(batch_size)
 
net.eval()
# iterate over test data
for inputs, labels in test_loader:
    h = tuple([each.data for each in h])
    if(USE_CUDA):
        inputs, labels = inputs.cuda(), labels.cuda()
    output = net(inputs, h)
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    # one logit per example: sigmoid gives the positive-class probability, threshold at 0.5
    prob = torch.sigmoid(output.squeeze())
    pred = (prob > 0.5).long()

    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not USE_CUDA else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)

print("Test loss: {:.3f}".format(np.mean(test_losses)))
 
# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.442
Test accuracy: 0.827
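
Since scikit-learn is already a dependency, a per-class precision/recall breakdown is easy to add. A minimal sketch, reusing the same loop structure and the sigmoid-plus-0.5-threshold convention from the test loop above (note that drop_last=True means the final partial batch is skipped, as in that loop):

from sklearn.metrics import classification_report

all_preds, all_labels = [], []
h = net.init_hidden(batch_size)
net.eval()
with torch.no_grad():
    for inputs, labels in test_loader:
        h = tuple([each.data for each in h])
        if USE_CUDA:
            inputs, labels = inputs.cuda(), labels.cuda()
        output = net(inputs, h)
        pred = (torch.sigmoid(output.squeeze()) > 0.5).long()
        all_preds.extend(pred.cpu().tolist())
        all_labels.extend(labels.long().cpu().tolist())

print(classification_report(all_labels, all_preds, digits=3))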

  • Run inference directly with the trained model
def predict(net, test_comments):
    result_comments = pretreatment(test_comments)   # preprocessing: strip punctuation
    
    # convert to token ids
    tokenizer = BertTokenizer.from_pretrained("./chinese-bert_chinese_wwm_pytorch/data")
    result_comments_id=tokenizer(result_comments,padding=True,truncation=True,max_length=120,return_tensors='pt')
    tokenizer_id=result_comments_id['input_ids']
    inputs=tokenizer_id
    batch_size = inputs.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(USE_CUDA):
        inputs = inputs.cuda()
    
    net.eval()
    with torch.no_grad():
        # get the model output: one logit per comment (this helper assumes a single comment at a time)
        output = net(inputs, h)
        prob = torch.sigmoid(output)   # positive-class probability
        pred = (prob > 0.5).long()
        # print the probability before thresholding
        print('Predicted probability: {:.6f}'.format(prob.item()))
        if(pred.item() == 1):
            print("Predicted label: positive")
        else:
            print("Predicted label: negative")
comment1 = ['菜品一般,不好吃!!']
predict(net, comment1)  

Predicted probability: 0.015379
Predicted label: negative

comment2 = ['環(huán)境不錯']
predict(net, comment2)

Predicted probability: 0.972344
Predicted label: positive

comment3 = ['服務(wù)員還可以,就是菜有點不好吃']
predict(net, comment3)

Predicted probability: 0.581665
Predicted label: positive

comment4 = ['服務(wù)員還可以,就是菜不好吃']
predict(net, comment4)

Predicted probability: 0.353724
Predicted label: negative

  • Save the model
# save the learned parameters (state_dict) only
torch.save(net.state_dict(), './大眾點評二分類_parameters.pth')
  • Load the saved model and run inference
output_size = 1
hidden_dim = 384   #768/2
n_layers = 2
bidirectional = True  # True: use a bidirectional LSTM

net = bert_lstm(hidden_dim, output_size,n_layers, bidirectional)
net.load_state_dict(torch.load('./大眾點評二分類_parameters.pth'))

<All keys matched successfully>
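
If the checkpoint was saved on a GPU machine but is being loaded where no GPU is available, torch.load accepts a map_location argument; a hedged variant:

# CPU-only variant: remap GPU-saved tensors onto the CPU while loading
net.load_state_dict(torch.load('./大眾點評二分類_parameters.pth', map_location=torch.device('cpu')))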

# move model to GPU, if available
if(USE_CUDA):
    net.cuda()
comment1 = ['菜品一般,不好吃!!']
predict(net, comment1)

Predicted probability: 0.015379
Predicted label: negative
