What is Slot Filling?
Slot filling is a basic problem in natural language understanding and a simplified way of handling meaning. The idea is close to the frame-based school of linguistics: a set of typed semantic slots is defined in advance, the input words are filled into those slots one by one, and the meaning of an utterance is then extracted or retrieved from what the slots contain. Our task here is to fill the words of flight-booking utterances (the ATIS dataset) into the various semantic slots.
Why use SimpleRNN?
Slot filling is a one-output-per-input application of RNNs: after training, every word in a sentence is assigned to an appropriate slot.
What distinguishes an RNN from an ordinary neural network is that the output at time t depends not only on the current input and the weights but also on the earlier inputs, whereas other architectures treat the input and output at each step as independent, with no correlation between them. For the language-understanding problem at hand, language unfolds linearly in time and each word is clearly influenced by the words before it, so an RNN is a natural choice.
We use SimpleRNN here because it is simple enough to serve as an exercise for getting familiar with the framework; afterwards, more powerful RNN variants such as LSTMs can be used for further improvement.
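To make the previous point concrete, here is a minimal NumPy sketch of what a SimpleRNN cell computes at each time step (the names x_t, h_prev, W, U and b are illustrative, not Keras's internal variables):

import numpy as np

def simple_rnn_step(x_t, h_prev, W, U, b):
    # the new hidden state depends on the current input x_t AND on the previous
    # hidden state h_prev, which is how earlier words influence the current label
    return np.tanh(x_t @ W + h_prev @ U + b)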
Overview of the approach:
- Load the data, using mesnilgr's load.py as modified by chsasank.
- Define the model. We use the Keras Sequential API: a 100-dimensional word-embedding layer first maps each input word to a vector in a high-dimensional space (in which words that are close in meaning and syntax lie close together); we then add a Dropout layer to reduce overfitting, a SimpleRNN layer, and a TimeDistributed wrapper around a Dense softmax layer so that a label is predicted at every time step. Finally we assemble these layers and choose the optimizer and loss function: rmsprop, which still makes useful updates late in training, and categorical_crossentropy, which matches the multi-class nature of the problem.
- Train the model. To save computation one would normally train with mini-batches, but our data consists of individual sentences, and splitting them by a fixed batch_size could introduce spurious dependencies (consecutive sentences are independent of each other). We therefore treat each sentence as one batch for training, validation and prediction, and compute the average loss of each epoch by hand.
- Evaluate the model and make predictions. We assess the model by watching the validation loss and the F1 score of its predictions; the F1 score is computed with signsmile's conlleval.py.
- Save the model.
import numpy as np
import pickle
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN
from keras.layers.core import Dense,Dropout
from keras.utils import to_categorical
from keras.layers.wrappers import TimeDistributed
from matplotlib import pyplot as plt
import data.load
from metrics.accuracy import evaluate
Using TensorFlow backend.
Load Data
train_set,valid_set,dicts = data.load.atisfull()
# print(train_set[:1])
# dicts = {'label2idx':{},'words2idx':{},'table2idx':{}}
w2idx,labels2idx = dicts['words2idx'],dicts['labels2idx']
train_x,_,train_label = train_set
val_x,_,val_label = valid_set
idx2w = {w2idx[i]:i for i in w2idx}
idx2lab = {labels2idx[i]:i for i in labels2idx}
n_classes = len(idx2lab)
n_vocab = len(idx2w)
words_train = [[idx2w[i] for i in w[:]] for w in train_x]
labels_train = [[idx2lab[i] for i in w[:]] for w in train_label]
words_val = [[idx2w[i] for i in w[:]] for w in val_x]
# labels_val = [[idx2lab[i] for i in w[:]] for w in val_label]
labels_val = []
for w in val_label:
    for i in w[:]:
        labels_val.append(idx2lab[i])
print('Real Sentence : {}'.format(words_train[0]))
print('Encoded Form : {}'.format(train_x[0]))
print('='*40)
print('Real Label : {}'.format(labels_train[0]))
print('Encoded Form : {}'.format(train_label[0]))
Real Sentence : ['i', 'want', 'to', 'fly', 'from', 'boston', 'at', 'DIGITDIGITDIGIT', 'am', 'and', 'arrive', 'in', 'denver', 'at', 'DIGITDIGITDIGITDIGIT', 'in', 'the', 'morning']
Encoded Form : [232 542 502 196 208 77 62 10 35 40 58 234 137 62 11 234 481 321]
========================================
Real Label : ['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-depart_time.time', 'I-depart_time.time', 'O', 'O', 'O', 'B-toloc.city_name', 'O', 'B-arrive_time.time', 'O', 'O', 'B-arrive_time.period_of_day']
Encoded Form : [126 126 126 126 126 48 126 35 99 126 126 126 78 126 14 126 126 12]
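Each example is a variable-length vector of word indices with exactly one label index per word; during training every such sentence will be fed to the model as a batch of size 1 with shape (1, T) and a one-hot target of shape (1, T, n_classes). A quick sanity check (illustrative only, using the variables defined above):

assert all(len(s) == len(l) for s, l in zip(train_x, train_label))
print('training sentences: {}, vocabulary size: {}, label types: {}'.format(len(train_x), n_vocab, n_classes))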
Define and Compile the model
model = Sequential()
model.add(Embedding(n_vocab,100))
model.add(Dropout(0.25))
model.add(SimpleRNN(100,return_sequences=True))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.compile(optimizer = 'rmsprop',loss = 'categorical_crossentropy')
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 100) 57200
_________________________________________________________________
dropout_1 (Dropout) (None, None, 100) 0
_________________________________________________________________
simple_rnn_1 (SimpleRNN) (None, None, 100) 20100
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 127) 12827
=================================================================
Total params: 90,127
Trainable params: 90,127
Non-trainable params: 0
_________________________________________________________________
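The parameter counts in the summary can be reproduced by hand; with the shapes above, n_vocab is 572 and n_classes is 127 (a small check, not part of the original notebook):

emb_params = n_vocab * 100                      # 572 * 100 = 57,200
rnn_params = 100 * 100 + 100 * 100 + 100        # input weights + recurrent weights + bias = 20,100
dense_params = (100 + 1) * n_classes            # 101 * 127 = 12,827, applied at every time step
print(emb_params + rnn_params + dense_params)   # 90,127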
Train the model
def train_the_model(n_epochs,train_x,train_label,val_x,val_label):
    epoch,train_avgloss,val_avgloss,f1s = [],[],[],[]
    for i in range(1,n_epochs+1):
        epoch.append(i)
        ## training: one sentence = one batch
        train_avg_loss = 0
        for n_batch,sent in enumerate(train_x):
            label = train_label[n_batch]
            # label to one-hot, shape (1, T, n_classes)
            label = to_categorical(label,num_classes=n_classes)[np.newaxis,:]
            sent = sent[np.newaxis,:]
            loss = model.train_on_batch(sent,label)
            train_avg_loss += loss
        train_avg_loss = train_avg_loss/len(train_x)
        train_avgloss.append(train_avg_loss)
        ## evaluate & predict on the validation set
        val_pred_label,pred_label_val,val_avg_loss = [],[],0
        for n_batch,sent in enumerate(val_x):
            label = val_label[n_batch]
            label = to_categorical(label,num_classes=n_classes)[np.newaxis,:]
            sent = sent[np.newaxis,:]
            loss = model.test_on_batch(sent,label)
            val_avg_loss += loss
            pred = model.predict_on_batch(sent)
            pred = np.argmax(pred,-1)[0]
            val_pred_label.append(pred)
        val_avg_loss = val_avg_loss/len(val_x)
        val_avgloss.append(val_avg_loss)
        # flatten the predicted label sequences for conlleval-style scoring
        for w in val_pred_label:
            for k in w[:]:
                pred_label_val.append(idx2lab[k])
        prec, rec, f1 = evaluate(labels_val,pred_label_val, verbose=False)
        print('Training epoch {}\t train_avg_loss = {} \t val_avg_loss = {}'.format(i,train_avg_loss,val_avg_loss))
        print('precision: {:.2f}% \t recall: {:.2f}% \t f1 :{:.2f}%'.format(prec,rec,f1))
        print('-'*60)
        f1s.append(f1)
    return epoch,f1s,val_avgloss,train_avgloss
epoch,f1s,val_avgloss,train_avgloss = train_the_model(40,train_x,train_label,val_x,val_label)
Output:
Training epoch 1 train_avg_loss = 0.5546463992293973 val_avg_loss = 0.4345020865901363
precision: 84.79% recall: 80.79% f1 :82.74%
------------------------------------------------------------
Training epoch 2 train_avg_loss = 0.2575569036037627 val_avg_loss = 0.36228470020366654
precision: 86.64% recall: 83.86% f1 :85.22%
------------------------------------------------------------
Training epoch 3 train_avg_loss = 0.2238766908014994 val_avg_loss = 0.33974187403771694
precision: 88.03% recall: 85.55% f1 :86.77%
------------------------------------------------------------
……
------------------------------------------------------------
Training epoch 40 train_avg_loss = 0.09190682124901069 val_avg_loss = 0.2697056618613356
precision: 92.51% recall: 91.47% f1 :91.99%
------------------------------------------------------------
Visualization
Inspect the validation loss to choose a suitable number of epochs.
%matplotlib inline
plt.xlabel('epoch')
plt.ylabel('loss')
plt.plot(epoch,train_avgloss,'b',label='training loss')
plt.plot(epoch,val_avgloss,'r',label='validation loss')
plt.legend()
plt.show()
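One simple way to turn that visual check into a number (a small sketch using the lists returned by train_the_model):

best_epoch = epoch[int(np.argmin(val_avgloss))]   # epoch with the lowest validation loss
print('Validation loss is lowest at epoch {}'.format(best_epoch))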
print('Best F1 score: {:.2f}%'.format(max(f1s)))
Best F1 score: 92.56%
Save the model
model.save('slot_filling_with_simpleRNN.h5')
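The saved model can later be reloaded and used to tag a new, already index-encoded sentence; a minimal sketch, assuming the idx2lab mapping built above is still available:

from keras.models import load_model

reloaded = load_model('slot_filling_with_simpleRNN.h5')
sent = val_x[0][np.newaxis, :]                            # one sentence as a batch of size 1
pred = np.argmax(reloaded.predict_on_batch(sent), -1)[0]
print([idx2lab[k] for k in pred])                         # predicted slot label for every word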
Analysis of the results
The final F1 score obtained with SimpleRNN is 92.56%, still well below the 95.47% achieved by a senior labmate. This is mainly down to the choice of model: a SimpleRNN only carries the influence of preceding words into the prediction, yet in language the following words also affect the current one. The model can therefore be improved by choosing a more powerful architecture or adding layers that capture information from the words that follow, as sketched below.
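A minimal sketch of that idea, replacing the SimpleRNN layer with a bidirectional LSTM so that every word also sees the words that follow it (layer sizes copied from the model above, not tuned):

from keras.layers import LSTM, Bidirectional

bi_model = Sequential()
bi_model.add(Embedding(n_vocab, 100))
bi_model.add(Dropout(0.25))
bi_model.add(Bidirectional(LSTM(100, return_sequences=True)))  # one pass forward and one backward over the sentence
bi_model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
bi_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')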