The basic workflow for sentiment analysis with word vectors and deep learning is:
1. Train word vectors.
2. Preprocess and tokenize each sentence into a sequence of words, fix a maximum sequence length (truncate longer sentences, pad shorter ones), assign each word an index, and map each index to its word vector.
3. Define the network structure, for example one LSTM layer plus a fully connected layer, with dropout for better generalization, then train.
4. Tune the hyperparameters while watching loss and accuracy on the training and validation sets; when validation accuracy consistently falls instead of rising (usually accompanied by rising validation loss), the model has started to overfit, so stop training and retrain on all of the data with that epoch/iteration count and those hyperparameters.
1. Word vectors
The corpus and steps for training word vectors were covered in an earlier article; you can add the sentiment-analysis corpus into that word-vector training, so the method and code are omitted here. It is worth noting that the word-vector corpus should ideally come from the same domain as the text whose sentiment you want to analyze, and the more of it the better. Also, the quality of word segmentation strongly affects the quality of the word vectors, so some extra preprocessing helps, such as removing stop words or adding domain terms as a custom dictionary.
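Since that code is omitted, here is a minimal gensim sketch of the step, assuming a gensim 3.x install and a hypothetical one-document-per-line corpus file; the vector size must match the vocab_dim (256) used in the code below:

import multiprocessing
import jieba
from gensim.models import Word2Vec

# Hypothetical corpus file: one document per line, same domain as the sentiment data.
with open("sentiment_corpus.txt") as f:
    sentences = [jieba.lcut(line.strip()) for line in f]

# size must equal vocab_dim (256) used by the Keras/TensorFlow code below.
w2v = Word2Vec(sentences, size=256, window=5, min_count=5,
               workers=multiprocessing.cpu_count())
w2v.save("../models/word2vec.model")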
2. From text to numbers
The figure below illustrates the text-to-numbers process well. From the word vectors we get a vocabulary in which each word has an index (for example, the word's position in the vocabulary plus 1), and that index corresponds to the word's vector, giving an embedding matrix like the one in the figure. Note that one special index, such as 0, must be reserved for out-of-vocabulary words. Tokenizing a sentence yields a sequence of words; for example, "I thought the movie was incredible and inspiring" tokenizes into "I", "thought", "the", "movie", "was", "incredible", "and", "inspiring", and mapping each word to its index gives the vector [41 804 201534 1005 15 7446 5 13767]. The input must be converted to a fixed length (max_len), say 10; this sentence has only 8 words, so the remaining 2 positions are padded with 0. Looking [41 804 201534 1005 15 7446 5 13767 0 0] up in the embedding matrix then gives a tensor of shape [batch_size = 1, max_len = 10, word2vec_dimension = 50]. That is the network input.
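A minimal sketch of this lookup-and-pad step, using a toy word-to-index mapping (the indices below are just the ones from the example, not a real vocabulary):

from keras.preprocessing import sequence

# Toy vocabulary: word -> index; index 0 is reserved for unknown/padding words.
w2indx = {"I": 41, "thought": 804, "the": 201534, "movie": 1005,
          "was": 15, "incredible": 7446, "and": 5, "inspiring": 13767}

words = "I thought the movie was incredible and inspiring".split()
ids = [w2indx.get(w, 0) for w in words]                         # [41, 804, ..., 13767]
ids = sequence.pad_sequences([ids], maxlen=10, padding='post')  # shape (1, 10), zeros appended
# Looking these 10 indices up in the embedding matrix (one 50-d vector per index)
# yields the [1, 10, 50] input tensor described above.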
3. Network structure
Step 2 describes the input; the output is a one-hot vector. For 3-class classification (positive, negative, neutral) the outputs are [1 0 0], [0 1 0] and [0 0 1], and the softmax output can be read as per-class probabilities. For binary classification the output can instead be a single 0/1 label; a sigmoid maps the output into the range 0 to 1, which can likewise be read as a probability. With input and output defined, the next step is to define the model/network structure and let the model learn its parameters. Since the corpus here is small, the model should be kept as simple as possible: one CNN layer (including its pooling layer, of course) or an RNN layer (LSTM, GRU, bidirectional LSTM), followed by a fully connected layer. In experiments the bidirectional LSTM worked best, reaching over 95% accuracy on the test set. This matches intuition: a CNN only extracts local word windows and ignores wider context, and an LSTM processes the sentence left to right without using information to the right of the current word, so a bi-LSTM adds a second, reversed pass.
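For comparison, a minimal Keras sketch of the one-layer CNN option mentioned above (filter count and window size are illustrative assumptions; n_symbols, vocab_dim, embedding_weights and input_length are the same variables as in the training code of section 5, where the full bi-LSTM version appears):

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dropout, Dense

cnn = Sequential()
cnn.add(Embedding(output_dim=vocab_dim, input_dim=n_symbols,
                  weights=[embedding_weights], input_length=input_length))
cnn.add(Conv1D(64, 3, activation='relu'))   # 64 filters over 3-word windows
cnn.add(GlobalMaxPooling1D())               # the pooling layer
cnn.add(Dropout(0.4))
cnn.add(Dense(1, activation='sigmoid'))     # binary output read as a probability
cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])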
4. Training
Split the data into training and validation sets (an 80/20 split), train on the training set, and compute loss and accuracy on the validation set as well. Normally the training loss keeps falling and accuracy keeps rising until convergence; the validation curves start out the same, but at some point validation loss starts rising and accuracy starts dropping, which signals overfitting, so apply early stopping at that moment. Then retrain on the whole dataset with the chosen hyperparameters to get the final model.
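Keras can automate that stopping rule with an EarlyStopping callback; a minimal sketch (the patience value is an illustrative assumption):

from keras.callbacks import EarlyStopping

# Stop once validation loss has not improved for 2 consecutive epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=2, verbose=1)
model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,
          validation_data=(x_test, y_test), callbacks=[early_stop])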
5. Python code
Keras training
# -*- coding: utf-8 -*-
import time
import yaml
import sys
from sklearn.model_selection import train_test_split
import multiprocessing
import numpy as np
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Bidirectional
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout,Activation
from keras.models import model_from_yaml
np.random.seed(35) # For Reproducibility
import jieba
import pandas as pd
import sys
sys.setrecursionlimit(1000000)
# set parameters:
vocab_dim = 256
maxlen = 150
batch_size = 32
n_epoch = 5
input_length = 150
validation_rate = 0.2  # train/validation split ratio (section 4); 0.0 would leave x_test empty and break the validation_data/evaluate calls below
cpu_count = multiprocessing.cpu_count()
def read_txt(filename):
    f = open(filename)
    res = []
    for i in f:
        res.append(i.replace("\n", ""))
    del(res[0])
    return res
# Load the training files
def loadfile():
    neg = read_txt("./bida_neg.txt")
    pos = read_txt('./bida_pos.txt')
    combined = np.concatenate((pos, neg))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))
    return combined, y
# Tokenize each sentence and strip newline characters
def tokenizer(text):
    ''' Simple Parser converting each document to lower-case, then
        removing the breaks for new lines and finally splitting on the
        whitespace
    '''
    text = [jieba.lcut(document.replace('\n', '')) for document in text]
    return text
def create_dictionaries(model=None,
                        combined=None):
    ''' Function does a number of Jobs:
        1- Creates a word to index mapping
        2- Creates a word to vector mapping
        3- Transforms the Training and Testing Dictionaries
    '''
    if (combined is not None) and (model is not None):
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(),
                            allow_update=True)
        w2indx = {v: k+1 for k, v in gensim_dict.items()}  # index of every word with frequency above 10
        w2vec = {word: model[word] for word in w2indx.keys()}  # word vector of every word with frequency above 10

        def parse_dataset(combined):
            ''' Words become integers
            '''
            data = []
            for sentence in combined:
                new_txt = []
                for word in sentence:
                    try:
                        new_txt.append(w2indx[word])
                    except:
                        new_txt.append(0)
                data.append(new_txt)
            return data

        combined = parse_dataset(combined)
        combined = sequence.pad_sequences(combined, maxlen=maxlen)  # index sequence of the words in each sentence
        return w2indx, w2vec, combined
    else:
        print('No data provided...')
def get_data(index_dict, word_vectors, combined, y):
    n_symbols = len(index_dict) + 1  # number of word indices; words with frequency below 10 get index 0, hence the +1
    embedding_weights = np.zeros((n_symbols, vocab_dim))  # the vector for index 0 is all zeros
    for word, index in index_dict.items():  # starting from index 1, assign each word its word vector
        embedding_weights[index, :] = word_vectors[word]
    x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=validation_rate)
    return n_symbols, embedding_weights, x_train, y_train, x_test, y_test
def word2vec_train(model, combined):
    index_dict, word_vectors, combined = create_dictionaries(model=model, combined=combined)
    return index_dict, word_vectors, combined
## Define the network structure
def train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test):
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        mask_zero=True,
                        weights=[embedding_weights],
                        input_length=input_length))  # Adding Input Length
    model.add(Bidirectional(LSTM(32, activation='sigmoid', inner_activation='sigmoid')))
    model.add(Dropout(0.4))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    print('Compiling the Model...')
    model.compile(loss='binary_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    print("Train...")
    model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch, verbose=1, validation_data=(x_test, y_test))
    print("Evaluate...")
    score = model.evaluate(x_test, y_test,
                           batch_size=batch_size)
    yaml_string = model.to_yaml()
    with open('lstm_data/lstm.yml', 'w') as outfile:
        outfile.write(yaml.dump(yaml_string, default_flow_style=True))
    model.save_weights('lstm_data/lstm.h5')
    print('Test score:', score)
# Train the model and save it
def train():
    combined, y = loadfile()
    combined = tokenizer(combined)
    model = Word2Vec.load("../models/word2vec.model")
    index_dict, word_vectors, combined = create_dictionaries(model, combined)
    n_symbols, embedding_weights, x_train, y_train, x_test, y_test = get_data(index_dict, word_vectors, combined, y)
    train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test)

if __name__ == '__main__':
    train()
The code above is for binary classification, with the output mapped into the 0~1 range. For multi-class classification, replace the sigmoid activation with softmax, change loss='binary_crossentropy' to loss='categorical_crossentropy', and convert the labels with y = to_categorical(y, num_classes=classes), as sketched below.
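A minimal sketch of those changes for a 3-class (positive/negative/neutral) variant; note that Dense(1) also has to become Dense(num_classes), which the paragraph above leaves implicit:

from keras.utils import to_categorical

num_classes = 3
y = to_categorical(y, num_classes=num_classes)   # e.g. label 2 -> [0, 0, 1]

model.add(Dense(num_classes))                    # instead of Dense(1)
model.add(Activation('softmax'))                 # instead of Activation('sigmoid')
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])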
Prediction
# -*- coding: utf-8 -*-
import time
import yaml
import sys
from sklearn.model_selection import train_test_split
import multiprocessing
import numpy as np
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout,Activation
from keras.models import model_from_yaml
import jieba
import pandas as pd
# set parameters:
vocab_dim = 256
maxlen = 150
batch_size = 32
n_epoch = 5
input_length = 150
cpu_count = multiprocessing.cpu_count()
def init_dictionaries(w2v_model):
    gensim_dict = Dictionary()
    gensim_dict.doc2bow(w2v_model.wv.vocab.keys(),
                        allow_update=True)
    w2indx = {v: k+1 for k, v in gensim_dict.items()}
    w2vec = {word: w2v_model[word] for word in w2indx.keys()}
    return w2indx, w2vec
def process_words(w2indx, words):
    temp = []
    for word in words:
        try:
            temp.append(w2indx[word])
        except:
            temp.append(0)
    res = sequence.pad_sequences([temp], maxlen=maxlen)
    return res
def input_transform(string, w2index):
    words = jieba.lcut(string)
    return process_words(w2index, words)
def load_model():
    print('loading model......')
    with open('lstm_data/lstm.yml', 'r') as f:
        yaml_string = yaml.load(f)
    model = model_from_yaml(yaml_string)
    model.load_weights('lstm_data/lstm.h5')
    model.compile(loss='binary_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    w2v_model = Word2Vec.load('../models/word2vec.model')
    return model, w2v_model
def lstm_predict(string, model, w2index):
    data = input_transform(string, w2index)
    data.reshape(1, -1)
    result = model.predict_classes(data)
    prob = model.predict_proba(data)
    print(string)
    print("prob:" + str(prob))
    if result[0][0] == 1:
        # print(string, ' positive')
        return 1
    else:
        # print(string, ' negative')
        return -1
if __name__ == '__main__':
    model, w2v_model = load_model()
    w2index, _ = init_dictionaries(w2v_model)
    lstm_predict("平安大跌", model, w2index)
TensorFlow training
#coding = utf-8
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
import numpy as np
from random import randint
from sklearn.model_selection import train_test_split
import tensorflow as tf
import jieba
def read_txt(filename):
    f = open(filename)
    res = []
    for i in f:
        res.append(i.replace("\n", ""))
    del(res[0])
    return res
def loadfile():
    neg = read_txt("../data/bida_neg.txt")
    pos = read_txt('../data/bida_pos.txt')
    combined = np.concatenate((pos, neg))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))
    return combined, y
def create_dictionaries(model=None):
    if model is not None:
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(),
                            allow_update=True)
        w2index = {v: k+1 for k, v in gensim_dict.items()}
        vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')
        for k, v in gensim_dict.items():
            vectors[k+1] = model[v]
        return w2index, vectors
def get_train_batch(batch_size):
    labels = []
    arr = np.zeros([batch_size, max_seq_length])
    for i in range(batch_size):
        num = randint(0, len(X_train) - 1)
        labels.append(y_train[num])
        arr[i] = X_train[num]
    return arr, labels
def get_test_batch(batch_size):
    labels = []
    arr = np.zeros([batch_size, max_seq_length])
    for i in range(batch_size):
        num = randint(0, len(X_test) - 1)
        labels.append(y_test[num])
        arr[i] = X_test[num]
    return arr, labels
def get_all_batches(batch_size=32, mode="train"):
    X, y = None, None
    if mode == "train":
        X = X_train
        y = y_train
    elif mode == "test":
        X = X_test
        y = y_test
    batches = int(len(y) / batch_size)
    arrs = [X[i*batch_size:i*batch_size + batch_size] for i in range(batches)]
    labels = [y[i*batch_size:i*batch_size + batch_size] for i in range(batches)]
    if batches * batch_size < len(y):  # keep the final partial batch, if any
        arrs.append(X[batches*batch_size:len(y)])
        labels.append(y[batches*batch_size:len(y)])
    return arrs, labels
def parse_dataset(sentences, w2index, max_len):
    data = []
    for sentence in sentences:
        words = jieba.lcut(sentence.replace('\n', ''))
        new_txt = np.zeros((max_len), dtype='int32')
        index = 0
        for word in words:
            try:
                new_txt[index] = w2index[word]
            except:
                new_txt[index] = 0
            index += 1
            if index >= max_len:
                break
        data.append(new_txt)
    return data
batch_size = 32
lstm_units = 64
num_classes = 2
iterations = 50000
num_dimensions = 256
max_seq_len = 150
max_seq_length = 150
validation_rate = 0.2
random_state = 9876
output_keep_prob = 0.5
learning_rate = 0.001
combined, y = loadfile()
model = Word2Vec.load("../models/word2vec.model")
w2index, vectors = create_dictionaries(model)
X = parse_dataset(combined, w2index, max_seq_len)
y = [[1,0] if yi == 1 else [0,1] for yi in y]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_rate, random_state=random_state)
tf.reset_default_graph()
labels = tf.placeholder(tf.float32, [None, num_classes])
input_data = tf.placeholder(tf.int32, [None, max_seq_length])
data = tf.placeholder(tf.float32, [None, max_seq_length, num_dimensions])
data = tf.nn.embedding_lookup(vectors, input_data)
#bidirectional lstm
lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
(output_fw, output_bw),_ = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw, cell_bw=lstm_bw,inputs = data, dtype=tf.float32)
outputs = tf.concat([output_fw, output_bw], axis=2)
# Fully connected layer.
weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
dtype=tf.float32)
bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
initializer=tf.zeros_initializer())
last = tf.transpose(outputs, [1,0,2])
last = tf.gather(last, int(last.get_shape()[0]) - 1)
logits = (tf.matmul(last, weight) + bias)
prediction = tf.nn.softmax(logits)
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())
cal_iter = 500
loss_train, loss_test = 0.0, 0.0
acc_train, acc_test = 0.0, 0.0
print("start training...")
for i in range(iterations):
    # Next batch of reviews
    next_batch, next_batch_labels = get_train_batch(batch_size)
    sess.run(optimizer, {input_data: next_batch, labels: next_batch_labels})
    # Save the network and report metrics every cal_iter (500) iterations
    if (i % cal_iter == 0):
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt")
        print("iteration: " + str(i))
        train_acc, train_loss = 0.0, 0.0
        test_acc, test_loss = 0.0, 0.0
        train_arrs, train_labels = get_all_batches(300)
        test_arrs, test_labels = get_all_batches(300, "test")
        for k in range(len(train_labels)):
            temp1, temp2 = sess.run([accuracy, loss], {input_data: train_arrs[k], labels: train_labels[k]})
            train_acc += temp1
            train_loss += temp2
        train_acc /= len(train_labels)
        train_loss /= len(train_labels)
        for k in range(len(test_labels)):
            temp1, temp2 = sess.run([accuracy, loss], {input_data: test_arrs[k], labels: test_labels[k]})
            test_acc += temp1
            test_loss += temp2
        test_acc /= len(test_labels)
        test_loss /= len(test_labels)
        print("train accuracy: " + str(train_acc) + ", train loss: " + str(train_loss))
        print("test accuracy: " + str(test_acc) + ", test loss: " + str(test_loss))
Prediction
import tensorflow as tf
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary
import numpy as np
import jieba
def create_dictionaries(model=None):
    if model is not None:
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(),
                            allow_update=True)
        w2index = {v: k+1 for k, v in gensim_dict.items()}
        vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')
        for k, v in gensim_dict.items():
            vectors[k+1] = model[v]
        return w2index, vectors
def parse_dataset(sentence, w2index, max_len):
    words = jieba.lcut(sentence.replace('\n', ''))
    new_txt = np.zeros((max_len), dtype='int32')
    index = 0
    for word in words:
        try:
            new_txt[index] = w2index[word]
        except:
            new_txt[index] = 0
        index += 1
        if index >= max_len:
            break
    return [new_txt]
batch_size = 32
lstm_units = 64
num_classes = 2
iterations = 100000
num_dimensions = 256
max_seq_len = 150
max_seq_length = 150
validation_rate = 0.2
random_state = 333
output_keep_prob = 0.5
model = Word2Vec.load("../models/word2vec.model")
w2index, vectors = create_dictionaries(model)
tf.reset_default_graph()
labels = tf.placeholder(tf.float32, [None, num_classes])
input_data = tf.placeholder(tf.int32, [None, max_seq_length])
data = tf.placeholder(tf.float32, [None, max_seq_length, num_dimensions])
data = tf.nn.embedding_lookup(vectors,input_data)
"""
bi-lstm
"""
#bidirectional lstm
lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
(output_fw, output_bw),_ = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw, cell_bw=lstm_bw,inputs = data, dtype=tf.float32)
outputs = tf.concat([output_fw, output_bw], axis=2)
# Fully connected layer.
weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
dtype=tf.float32)
bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
initializer=tf.zeros_initializer())
#last = tf.reshape(outputs, [-1, 2 * lstm_units])
last = tf.transpose(outputs, [1,0,2])
last = tf.gather(last, int(last.get_shape()[0]) - 1)
logits = (tf.matmul(last, weight) + bias)
prediction = tf.nn.softmax(logits)
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
sess = tf.InteractiveSession()
saver = tf.train.Saver()
#saver.restore(sess, 'models/pretrained_lstm.ckpt-27000.data-00000-of-00001')
saver.restore(sess, tf.train.latest_checkpoint('models'))
l = ["平安銀行大跌", "平安銀行暴跌", "平安銀行扭虧為盈","小米將加深與TCL合作",
"蘋果手機(jī)現(xiàn)在賣的不如以前了","蘋果和三星的糟糕業(yè)績(jī)預(yù)示著全球商業(yè)領(lǐng)域?qū)⒔?jīng)歷更加嚴(yán)
峻的考驗(yàn)酬土。"
,"這道菜不好吃"]
for s in l:
print(s)
X = parse_dataset(s, w2index, max_seq_len)
predictedSentiment = sess.run(prediction, {input_data: X})[0]
print(predictedSentiment[0], predictedSentiment[1])
References:
https://github.com/adeshpande3/LSTM-Sentiment-Analysis/blob/master/Oriole%20LSTM.ipynb
https://buptldy.github.io/2016/07/20/2016-07-20-sentiment%20analysis/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
https://arxiv.org/abs/1408.5882