TextCNN is a convolutional neural network for text classification proposed in 2014. Because its structure is simple and it works well, it is widely used in NLP tasks such as text classification and recommendation. I have also explored its use in practice at work, and this post is a summary.
The TextCNN network structure
Data preprocessing
Before going into the concrete structure of TextCNN, let's first look at what kind of data it processes and what input format it expects. Suppose we have a text classification task: given a piece of text, we need to decide which category it belongs to, such as sports, economy, entertainment or technology. The training set looks like the figure below:
The first column is the text content and the second column is its label. The data set first has to be processed, in the following steps:
- Tokenization. Chinese text classification requires word segmentation, and there are many open-source Chinese tokenizers such as Jieba. After segmentation some further cleanup is usually done: removing very high-frequency and very low-frequency words, stripping meaningless symbols, and so on.
- Building the vocabulary and word indices. Building the vocabulary means counting which words occur in the text and then assigning each word a unique integer index so it can be looked up easily. Indexing the vocabulary above gives something like the figure below:
The vocabulary above shows that the word “谷歌” (Google) can be represented by the number 0 and “樂視” (LeEco) by the number 1.
- Representing the training text with word indices. Using the word-to-index mapping above, the first text sample in the training set can be written as a sequence of numbers like the following:
At this point the text preprocessing is essentially done: representing training text written in natural language as discrete data is the first step of any NLP job. A minimal sketch of these steps is given below.
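The steps above translate into only a few lines of Python. A minimal sketch, assuming jieba for segmentation; the stop-word handling, the frequency cut-off and the choice of 0 as the padding/unknown index are simplifications, not the exact pipeline used here:

```python
# -*- coding: utf-8 -*-
# Minimal preprocessing sketch: tokenize, build a vocabulary, map texts to index sequences.
import jieba
from collections import Counter

def tokenize(texts, stop_words=None):
    stop_words = stop_words or set()
    return [[w for w in jieba.lcut(t) if w.strip() and w not in stop_words] for t in texts]

def build_vocab(tokenized_texts, min_freq=1):
    counter = Counter(w for doc in tokenized_texts for w in doc)
    words = [w for w, c in counter.most_common() if c >= min_freq]
    # Reserve 0 for padding / unknown words, so real words start at index 1.
    return {w: i + 1 for i, w in enumerate(words)}

def texts_to_ids(tokenized_texts, vocab, max_len=20):
    ids = []
    for doc in tokenized_texts:
        seq = [vocab.get(w, 0) for w in doc][:max_len]
        seq += [0] * (max_len - len(seq))   # pad to a fixed length
        ids.append(seq)
    return ids

corpus = [u"谷歌 和 樂視 達(dá)成 合作", u"體育 比賽 今晚 開始"]   # toy examples
tokens = tokenize(corpus)
vocab = build_vocab(tokens)
print(texts_to_ids(tokens, vocab, max_len=10))
```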
The TextCNN architecture
The structure of TextCNN is fairly simple: the input first passes through an embedding layer to obtain an embedding representation of the sentence, then through a convolution layer that extracts sentence features, and finally through a fully connected layer that produces the output. The whole model looks like this:
The figure above is the illustration given in the paper; each layer is described below.
- embedding layer: this layer encodes the input natural language into a distributed representation; the details can be found in the word2vec papers and are not repeated here. You can either use pre-trained word vectors or learn a set of word vectors from scratch while training TextCNN, although the former is more than 100 times faster. When pre-trained vectors are used there are two variants, static and non-static: static keeps the word vectors fixed while training TextCNN, non-static fine-tunes them during training, so non-static usually gives better results. A common compromise is not to update the embedding layer on every batch but, say, once every 100 batches, which saves training time while still fine-tuning the word vectors.
- convolution layer: this layer extracts different n-gram features through convolution. After the embedding layer, an input sentence or document becomes a two-dimensional matrix: if the text length is |T| and the word-vector dimension is |d|, the matrix has size |T| x |d|, and all the convolutions that follow operate on it. A kernel is usually of size n x |d|, where n is the kernel length and |d| is the kernel width; the width equals the word-vector dimension, so the convolution only slides along the text sequence, and n can take several values such as 2, 3, 4 or 5. For a |T| x |d| text, a kernel of size 2 x |d| produces a (|T| - 2 + 1) x 1 vector. TextCNN uses several kernel sizes at the same time, with several kernels per size: with kernel sizes 2, 3, 4 and 5 (each times |d|) and 128 kernels per size, the convolution layer has 4 x 128 kernels in total.
The figure above is a not-quite-ideal illustration found on Google: the red horizontal boxes are the kernels and the red vertical boxes are the convolution outputs. The kernel sizes in the figure are 1, 2 and 3, the vertical direction is the text-sequence direction, and the kernels can only slide up and down. The convolution layer is essentially an n-gram feature extractor, and different kernels pick up different features. In text classification, for example, some kernels may capture entertainment n-grams such as “范冰冰” (Fan Bingbing) or “電影” (movie), while others capture economy n-grams such as “去產(chǎn)能” (cutting overcapacity) or “調(diào)結(jié)構(gòu)” (structural adjustment). Texts from different domains contain different n-grams, and whichever kernels they activate determines the class they end up in.
- max-pooling layer: take the maximum of each of the one-dimensional vectors produced by the convolutions and concatenate the maxima as the output of this layer. With kernel sizes 2, 3, 4 and 5 and 128 kernels per size, the convolution layer yields 4 x 128 one-dimensional vectors (their lengths differ, which does not matter when taking a maximum); after max-pooling we get 4 x 128 scalars, which are concatenated into the final result, a 512 x 1 vector. The point of max-pooling is to keep, for each n-gram feature extracted by the convolutions, only its strongest activation. (The sketch right after this list traces these shapes end to end.)
- fully-connected layer: nothing special here; it is stacked on top of the max-pooling output to produce the final result. In practice several fully connected layers can be stacked to increase the learning capacity of the network.
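To make these shapes concrete, here is a tiny NumPy sketch (no training, random numbers) that follows one embedded sentence with |T| = 20 and |d| = 128 through kernel sizes 2/3/4/5 with 128 kernels each, ending in the 512-dimensional vector described above:

```python
import numpy as np

T, d = 20, 128          # sentence length and embedding size
kernel_sizes = [2, 3, 4, 5]
n_filters = 128

x = np.random.randn(T, d)                              # embedded sentence, shape [T, d]
pooled = []
for n in kernel_sizes:
    W = np.random.randn(n, d, n_filters)               # one bank of n x d kernels
    # "Valid" convolution along the text axis only: T - n + 1 positions.
    feature_map = np.stack(
        [np.tensordot(x[i:i + n, :], W, axes=([0, 1], [0, 1])) for i in range(T - n + 1)]
    )                                                   # shape [T - n + 1, n_filters]
    pooled.append(feature_map.max(axis=0))              # max-pool over positions -> [n_filters]

sentence_vec = np.concatenate(pooled)                   # 4 * 128 = 512 dimensions
print(sentence_vec.shape)                               # (512,)
```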
That is the whole TextCNN architecture. What follows is my own implementation (TensorFlow version); corrections and suggestions are welcome.
A TextCNN implementation
TensorFlow code tends to follow a pattern. In most cases there are three files: train.py, model.py and predict.py, plus usually a data_helper.py that handles the training data.
model.py: defines the model structure.
train.py: builds the training program, including the main training loop, logging of the relevant values, and model saving.
predict.py: runs predictions.
Only model.py and train.py are listed here; a rough sketch of the helpers data_helper.py would provide follows below.
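data_helper.py / load_data.py is not listed in this post. What the scripts below need from it is essentially two things: fixed-length index matrices for `_inputs_x` and one-hot label matrices for `_inputs_y`. A rough sketch of such helpers (the function names here are illustrative, not the ones in the actual load_data.py):

```python
import numpy as np

def pad_sequences(seqs, max_len, pad_id=0):
    # Cut or pad every index sequence to exactly max_len tokens.
    out = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(seqs):
        out[i, :min(len(s), max_len)] = s[:max_len]
    return out

def to_one_hot(labels, n_class):
    # Turn integer class labels into the one-hot matrix expected by _inputs_y.
    out = np.zeros((len(labels), n_class), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out
```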
model.py
# -*- coding:utf-8 -*-
import tensorflow as tf
import numpy as np


class Settings(object):
    """
    configuration class
    """
    def __init__(self, vocab_size=100000, embedding_size=128):
        self.model_name = "CNN"
        self.embedding_size = embedding_size
        self.filter_size = [2, 3, 4, 5]
        self.n_filters = 128
        self.fc_hidden_size = 1024
        self.n_class = 2
        self.vocab_size = vocab_size
        self.max_words_in_doc = 20


class TextCNN(object):
    """
    Text CNN
    """
    def __init__(self, settings, pre_trained_word_vectors=None):
        self.model_name = settings.model_name
        self.embedding_size = settings.embedding_size
        self.filter_size = settings.filter_size
        self.n_filter = settings.n_filters
        self.fc_hidden_size = settings.fc_hidden_size
        self.n_filter_total = self.n_filter * len(self.filter_size)
        self.n_class = settings.n_class
        self.max_words_in_doc = settings.max_words_in_doc
        self.vocab_size = settings.vocab_size

        """ define the network structure """
        # input placeholders
        with tf.name_scope("inputs"):
            self._inputs_x = tf.placeholder(tf.int64, [None, self.max_words_in_doc], name="_inputs_x")
            self._inputs_y = tf.placeholder(tf.float32, [None, self.n_class], name="_inputs_y")
            self._keep_dropout_prob = tf.placeholder(tf.float32, name="_keep_dropout_prob")

        # embedding layer
        with tf.variable_scope("embedding"):
            if isinstance(pre_trained_word_vectors, np.ndarray):  # use pre-trained word vectors
                assert pre_trained_word_vectors.shape[1] == self.embedding_size, \
                    "number of columns of pre_trained_word_vectors must equal embedding size"
                self.embedding = tf.get_variable(name='embedding',
                                                 shape=pre_trained_word_vectors.shape,
                                                 initializer=tf.constant_initializer(pre_trained_word_vectors),
                                                 trainable=True)
            else:
                self.embedding = tf.Variable(tf.truncated_normal((self.vocab_size, self.embedding_size)))

        # conv-pool
        inputs = tf.nn.embedding_lookup(self.embedding, self._inputs_x)  # [batch_size, words, embedding]  # look-up layer
        inputs = tf.expand_dims(inputs, -1)  # [batch_size, words, embedding, 1]
        pooled_output = []

        for i, filter_size in enumerate(self.filter_size):  # filter_size = [2, 3, 4, 5]
            with tf.variable_scope("conv-maxpool-%s" % filter_size):
                # conv layer
                filter_shape = [filter_size, self.embedding_size, 1, self.n_filter]
                W = self.weight_variable(shape=filter_shape, name="W_filter")
                b = self.bias_variable(shape=[self.n_filter], name="b_filter")
                conv = tf.nn.conv2d(inputs, W, strides=[1, 1, 1, 1], padding="VALID", name='text_conv')  # [batch, words-filter_size+1, 1, channel]
                # apply activation
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                # max pooling
                pooled = tf.nn.max_pool(h, ksize=[1, self.max_words_in_doc - filter_size + 1, 1, 1], strides=[1, 1, 1, 1], padding="VALID", name='max_pool')  # [batch, 1, 1, channel]
                pooled_output.append(pooled)

        h_pool = tf.concat(pooled_output, 3)  # concat on 4th dimension
        self.h_pool_flat = tf.reshape(h_pool, [-1, self.n_filter_total], name="h_pool_flat")

        # add dropout
        with tf.name_scope("dropout"):
            self.h_dropout = tf.nn.dropout(self.h_pool_flat, self._keep_dropout_prob, name="dropout")

        # output layer
        with tf.name_scope("output"):
            W = self.weight_variable(shape=[self.n_filter_total, self.n_class], name="W_out")
            b = self.bias_variable(shape=[self.n_class], name="bias_out")
            self.scores = tf.nn.xw_plus_b(self.h_dropout, W, b, name="scores")  # class scores
            print("self.scores : {}".format(self.scores.get_shape()))
            self.predictions = tf.argmax(self.scores, 1, name="predictions")  # predicted label, the output
            print("self.predictions : {}".format(self.predictions.get_shape()))

    # helper functions
    def weight_variable(self, shape, name):
        initial = tf.truncated_normal(shape, stddev=0.1)
        return tf.Variable(initial, name=name)

    def bias_variable(self, shape, name):
        initial = tf.constant(0.1, shape=shape)
        return tf.Variable(initial, name=name)
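Before wiring the model into the full training script, a quick smoke test with random data is a cheap way to catch shape mistakes. A sketch (TF 1.x, assuming the two classes above are in scope; the numbers are arbitrary):

```python
# Quick graph smoke test with random data; no real training involved.
import numpy as np
import tensorflow as tf

settings = Settings(vocab_size=1000, embedding_size=128)   # defaults: max_words_in_doc=20, n_class=2
model = TextCNN(settings)                                  # no pre-trained vectors, random embedding

x = np.random.randint(0, settings.vocab_size, size=(4, settings.max_words_in_doc))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    scores = sess.run(model.scores, feed_dict={model._inputs_x: x,
                                               model._keep_dropout_prob: 1.0})
    print(scores.shape)   # expected: (4, 2), i.e. [batch_size, n_class]
```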
train.py
#coding=utf-8
import tensorflow as tf
from datetime import datetime
import os
from load_data import load_dataset, load_dataset_from_pickle
from cnn_model import TextCNN
from cnn_model import Settings

# Data loading params
tf.flags.DEFINE_string("train_data_path", 'data/train_query_pair_test_data.pickle', "data directory")
tf.flags.DEFINE_string("embedding_W_path", "./data/embedding_matrix.pickle", "pre-trained embedding matrix")
tf.flags.DEFINE_integer("vocab_size", 3627705, "vocabulary size")  # set this according to the size of the vocabulary
tf.flags.DEFINE_integer("num_classes", 2, "number of classes")
tf.flags.DEFINE_integer("embedding_size", 100, "Dimensionality of word embedding (default: 100)")
tf.flags.DEFINE_integer("batch_size", 256, "Batch Size (default: 256)")
tf.flags.DEFINE_integer("num_epochs", 1, "Number of training epochs (default: 1)")
tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)")
tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)")
tf.flags.DEFINE_integer("max_words_in_doc", 30, "Maximum number of words per document (default: 30)")
tf.flags.DEFINE_integer("evaluate_every", 100, "evaluate every this many batches")
tf.flags.DEFINE_float("learning_rate", 0.001, "learning rate")
tf.flags.DEFINE_float("keep_prob", 0.5, "dropout keep probability")

FLAGS = tf.flags.FLAGS

train_x, train_y, dev_x, dev_y, W_embedding = load_dataset_from_pickle(FLAGS.train_data_path, FLAGS.embedding_W_path)
train_sample_n = len(train_y)
print(len(train_y))
print(len(dev_y))
print("data load finished")
print("W_embedding : {} {}".format(W_embedding.shape[0], W_embedding.shape[1]))

# model configuration
settings = Settings()
"""
Different parameters can be configured here; vocab_size and embedding_size must match the training data.
"""
settings.embedding_size = FLAGS.embedding_size
settings.vocab_size = FLAGS.vocab_size

# GPU settings
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=1.0)

with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    # first build the model inside the session
    textcnn = TextCNN(settings=settings, pre_trained_word_vectors=W_embedding)

    # define loss and accuracy here in train.py; these two metrics should not be defined inside the model
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=textcnn.scores,
                                                                      labels=textcnn._inputs_y,
                                                                      name='loss'))
    with tf.name_scope('accuracy'):
        # textcnn.predictions is already the result of an argmax, so do not apply argmax again here
        predict = textcnn.predictions
        label = tf.argmax(textcnn._inputs_y, axis=1, name='label')
        acc = tf.reduce_mean(tf.cast(tf.equal(predict, label), tf.float32))

    # create a directory for the intermediate results of training
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    timestamp = "textcnn" + timestamp
    out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
    print("Writing to {}\n".format(out_dir))

    # global variable tracking how many optimization steps have been taken so far
    global_step = tf.Variable(0, trainable=False)

    # define the optimizer, collect the trainable variables and compute their gradients
    optimizer = tf.train.AdamOptimizer(FLAGS.learning_rate)
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    grads_and_vars = tuple(zip(grads, tvars))
    train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)  # apply_gradients increments global_step automatically

    # a second train op that does not update the pre-trained word vectors
    tvars_no_embedding = [tvar for tvar in tvars if 'embedding' not in tvar.name]
    grads_no_embedding = tf.gradients(loss, tvars_no_embedding)
    grads_and_vars_no_embedding = tuple(zip(grads_no_embedding, tvars_no_embedding))
    train_op_no_embedding = optimizer.apply_gradients(grads_and_vars_no_embedding, global_step=global_step)

    # Keep track of gradient values and sparsity (optional)
    grad_summaries = []
    for g, v in grads_and_vars:
        if g is not None:
            grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
            grad_summaries.append(grad_hist_summary)
    grad_summaries_merged = tf.summary.merge(grad_summaries)

    loss_summary = tf.summary.scalar('loss', loss)
    acc_summary = tf.summary.scalar('accuracy', acc)

    train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])
    train_summary_dir = os.path.join(out_dir, "summaries", "train")
    train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

    dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
    dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
    dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)

    # save model
    checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
    checkpoint_prefix = os.path.join(checkpoint_dir, "model")
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=2)

    # initialize all variables
    sess.run(tf.global_variables_initializer())

    def train_step(x_batch, y_batch):
        feed_dict = {
            textcnn._inputs_x: x_batch,
            textcnn._inputs_y: y_batch,
            textcnn._keep_dropout_prob: 0.5
        }
        _, step, summaries, cost, accuracy = sess.run([train_op, global_step, train_summary_op, loss, acc], feed_dict)
        time_str = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, cost, accuracy))
        train_summary_writer.add_summary(summaries, step)
        return step

    def train_step_no_embedding(x_batch, y_batch):
        feed_dict = {
            textcnn._inputs_x: x_batch,
            textcnn._inputs_y: y_batch,
            textcnn._keep_dropout_prob: 0.5
        }
        _, step, summaries, cost, accuracy = sess.run([train_op_no_embedding, global_step, train_summary_op, loss, acc], feed_dict)
        time_str = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, cost, accuracy))
        train_summary_writer.add_summary(summaries, step)
        return step

    def dev_step(x_batch, y_batch, writer=None):
        feed_dict = {
            textcnn._inputs_x: x_batch,
            textcnn._inputs_y: y_batch,
            textcnn._keep_dropout_prob: 1.0
        }
        step, summaries, cost, accuracy = sess.run([global_step, dev_summary_op, loss, acc], feed_dict)
        time_str = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print("++++++++++++++++++dev++++++++++++++{}: step {}, loss {:g}, acc {:g}".format(time_str, step, cost, accuracy))
        if writer:
            writer.add_summary(summaries, step)

    for epoch in range(FLAGS.num_epochs):
        print('current epoch %s' % (epoch + 1))
        for i in range(0, train_sample_n, FLAGS.batch_size):
            x = train_x[i:i + FLAGS.batch_size]
            y = train_y[i:i + FLAGS.batch_size]
            step = train_step(x, y)
            if step % FLAGS.evaluate_every == 0:
                dev_step(dev_x, dev_y, dev_summary_writer)
            if step % FLAGS.checkpoint_every == 0:
                path = saver.save(sess, checkpoint_prefix, global_step=step)
                print("Saved model checkpoint to {}\n".format(path))
The key to writing TensorFlow code is defining the network structure; reading good code and studying carefully how its network-definition parts are organized pays off. It also helps to know the input and output tensors of every TensorFlow API you call, especially their shapes, because that is where mistakes happen most easily in practice.
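predict.py is not included above. A minimal sketch of what it might look like, under the assumption that the checkpoints were written by the train.py above (the checkpoint path and the input pipeline are placeholders):

```python
# -*- coding:utf-8 -*-
# Minimal predict.py sketch (TF 1.x): rebuild the graph, restore the latest checkpoint
# written by train.py, and run textcnn.predictions on new inputs.
import numpy as np
import tensorflow as tf
from cnn_model import TextCNN, Settings

checkpoint_dir = "runs/<your-run-directory>/checkpoints"   # placeholder, fill in the real run path

settings = Settings(vocab_size=3627705, embedding_size=100)  # must match the training flags
settings.max_words_in_doc = 30

with tf.Session() as sess:
    # NOTE: if training passed pre_trained_word_vectors, pass the same matrix here too,
    # so that the embedding variable gets the same name as in the checkpoint.
    textcnn = TextCNN(settings=settings)
    saver = tf.train.Saver()
    saver.restore(sess, tf.train.latest_checkpoint(checkpoint_dir))

    # x must come from the same tokenize -> index -> pad pipeline used at training time.
    x = np.zeros((1, settings.max_words_in_doc), dtype=np.int64)
    labels = sess.run(textcnn.predictions,
                      feed_dict={textcnn._inputs_x: x,
                                 textcnn._keep_dropout_prob: 1.0})
    print(labels)   # predicted class index per input row
```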
Practical experience
Having used TextCNN for query recommendation at work, and combining that with the related literature, a few observations:
1. TextCNN is an n-gram feature extractor, so it cannot handle n-grams that never appear in the training set well. Some n-grams may also be too strong and end up distracting the model, causing misclassification.
2. TextCNN is not very sensitive to word order. In query recommendation I randomly shuffled the terms obtained by segmenting the positive samples, and accuracy hardly dropped; part of the reason, of course, is that short queries themselves do not depend much on term order. A neighbouring team used TextCNN to detect gambling web pages with accuracy close to 95%; after randomly shuffling the page content (long text), accuracy was still roughly 85%.
3. TextCNN is good at classifying long text and can reach very high accuracy on it.
4. TextCNN has many tunable hyper-parameters in its architecture; see the references at the end of this post for details.
References
Convolutional Neural Networks for Sentence Classification
A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification
Copyright notice: this is an original article by CSDN blogger "gg-123", released under the CC 4.0 BY-SA license; please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/u012762419/article/details/79561441