Sequence classification predicts a single class label for an entire input sequence. Sentiment analysis is one example: it predicts the attitude a writer expresses toward a topic, which can in turn be used to predict election results or product and movie ratings.
We use a movie-review dataset from the Internet Movie Database (IMDB). The target value is binary: positive or negative. Because the language is full of negation, irony, and ambiguity, it is not enough to check whether individual words appear. Instead we build a recurrent network over word vectors, read each review word by word, and train a classifier that predicts the sentiment of the whole review from the activation at the last word.
The IMDB review dataset comes from the Stanford AI Lab: http://ai.stanford.edu/~amaas/data/sentiment/ . It is a compressed tar archive; positive and negative reviews are read from text files in two separate folders. We extract the plain text with a regular expression and convert all letters to lowercase.
Word-vector embeddings represent word meaning more richly than one-hot encodings. The vocabulary determines the index of each word, which is used to look up the correct word vector. Sequences are padded to the same length so that several reviews can be fed into the network as a batch.
The sequence classification model takes two placeholders: the input data (the word-vector sequences) and the target value (the sentiment). It also takes a params object holding configuration such as the optimizer.
The sequence length of each example in the current batch is computed dynamically. The data arrives as a single tensor in which every sequence is zero-padded to the length of the longest review. We reduce each word vector to the maximum of its absolute values: a zero (padding) vector yields the scalar 0, while a real word vector yields a positive real number. tf.sign() then discretizes these values to 0 or 1, and summing along the time steps gives the sequence lengths. The resulting tensor has one entry per sequence in the batch, each a scalar sequence length.
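A toy illustration of this length computation in NumPy (made-up values; the model itself uses the equivalent TensorFlow ops):

import numpy as np

# Batch of 2 sequences, 3 time steps, 2-dimensional word vectors; the second
# sequence has two zero-padded steps.
batch = np.array([[[0.5, -0.2], [0.1, 0.3], [0.2, 0.1]],
                  [[0.7,  0.4], [0.0, 0.0], [0.0, 0.0]]])
used = np.sign(np.max(np.abs(batch), axis=2))   # [[1. 1. 1.] [1. 0. 0.]]
length = used.sum(axis=1).astype(int)           # [3 1]
print(length)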
The params object defines the cell type and the number of units. The length property tells the RNN how many time steps to run for each sequence in the batch. We fetch the last activation of every sequence and feed it into a softmax layer. Because each review has a different length, the last relevant output of the RNN sits at a different index for each sequence in the batch, so we need to index along the time-step dimension (the batch has shape sequences × time_steps × word_vectors). tf.gather() only indexes along the first dimension, so we flatten the first two dimensions of the output activations (shape sequences × time_steps × output units) and build an index from the sequence lengths, adding length - 1 to select the last valid time step of each sequence.
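A toy illustration of the flattening and indexing, again in NumPy with made-up numbers: with a batch of 2 sequences, 3 time steps, and lengths [3, 1], the last relevant rows sit at flat indices 0*3+2 = 2 and 1*3+0 = 3.

import numpy as np

output = np.arange(12).reshape(2, 3, 2)    # sequences × time_steps × units
length = np.array([3, 1])
batch_size, max_length, output_size = output.shape
index = np.arange(batch_size) * max_length + (length - 1)   # [2 3]
flat = output.reshape(-1, output_size)     # shape (6, 2)
relevant = flat[index]                     # last valid step of each sequence
print(relevant)                            # [[4 5] [6 7]]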
Gradient clipping keeps gradient values within a reasonable range. Any cost function that makes sense for classification can be used, and the model outputs a probability distribution over all classes. Adding gradient clipping improves learning by limiting the largest possible weight update; RNNs are hard to train, and with a poorly chosen combination of hyperparameters the weights easily diverge.
TensorFlow supports this via the optimizer instance's compute_gradients function, which derives the gradients so we can modify them, and apply_gradients, which applies the weight changes. A gradient component smaller than -limit is set to -limit; one larger than limit is set to limit. A TensorFlow derivative can also be None, meaning a variable has no relation to the cost function. Mathematically it should be a zero vector, but None allows internal performance optimizations, so we simply pass the None values back unchanged.
Reviews are fed into the recurrent network one word at a time, so each time step consists of a batch of word vectors. The batching function looks up the word vectors and pads all sequences to the same length. To train the model we define the hyperparameters, load the dataset and the word vectors, and run the model on the preprocessed training batches. Whether training succeeds depends largely on the network structure, the hyperparameters, and the quality of the word vectors. Pretrained word vectors can be loaded from the skip-gram based word2vec project (https://code.google.com/archive/p/word2vec/ ) or from the Stanford NLP group's GloVe model (https://nlp.stanford.edu/projects/glove ).
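The Embedding class below expects a bz2-compressed vocabulary file and a NumPy array of embeddings, the format produced in the earlier Wikipedia skip-gram chapter. If you want to plug in pretrained GloVe vectors instead, a rough conversion sketch could look like the following (the file name glove.6B.300d.txt and the output paths are assumptions, not part of the book's code):

import bz2
import numpy as np

# Each line of a GloVe text file is: word value1 value2 ... valueN.
words, vectors = [], []
with open('glove.6B.300d.txt', encoding='utf-8') as file_:
    for line in file_:
        parts = line.rstrip().split(' ')
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])

# Note: the Embedding class maps unknown words to index 0, so it may be worth
# reserving row 0 for an explicit <unk> vector.
np.save('embeddings.npy', np.array(vectors, dtype=np.float32))
with bz2.open('vocabulary.bz2', 'wt') as file_:
    file_.write('\n'.join(words))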
Kaggle runs an open learning competition on this IMDB review data (https://kaggle.com/c/word2vec-nlp-tutorial ), so you can compare your predictions against other people's.
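The listings below import a few small utilities (download, lazy_property, AttrDict) from a helpers module that is not reproduced here. A minimal sketch of what that module might contain (an assumption; the book's actual helpers may differ):

import functools
import os
import urllib.request


def download(url, cache_dir):
    # Download url into cache_dir (unless already cached) and return the local path.
    os.makedirs(cache_dir, exist_ok=True)
    filepath = os.path.join(cache_dir, url.split('/')[-1])
    if not os.path.isfile(filepath):
        urllib.request.urlretrieve(url, filepath)
    return filepath


def lazy_property(function):
    # Evaluate the property once and cache the result, so the graph nodes are
    # only constructed on first access.
    attribute = '_cache_' + function.__name__

    @property
    @functools.wraps(function)
    def wrapper(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)
    return wrapper


class AttrDict(dict):
    # Dictionary whose keys can also be read as attributes, e.g. params.rnn_hidden.
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)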
import tarfile
import re

from helpers import download


class ImdbMovieReviews:

    DEFAULT_URL = \
        'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

    def __init__(self, cache_dir, url=None):
        self._cache_dir = cache_dir
        self._url = url or type(self).DEFAULT_URL

    def __iter__(self):
        filepath = download(self._url, self._cache_dir)
        with tarfile.open(filepath) as archive:
            for filename in archive.getnames():
                if filename.startswith('aclImdb/train/pos/'):
                    yield self._read(archive, filename), True
                elif filename.startswith('aclImdb/train/neg/'):
                    yield self._read(archive, filename), False

    def _read(self, archive, filename):
        with archive.extractfile(filename) as file_:
            data = file_.read().decode('utf-8')
            data = type(self).TOKEN_REGEX.findall(data)
            data = [x.lower() for x in data]
            return data
import bz2

import numpy as np


class Embedding:

    def __init__(self, vocabulary_path, embedding_path, length):
        self._embedding = np.load(embedding_path)
        with bz2.open(vocabulary_path, 'rt') as file_:
            self._vocabulary = {k.strip(): i for i, k in enumerate(file_)}
        self._length = length

    def __call__(self, sequence):
        data = np.zeros((self._length, self._embedding.shape[1]))
        indices = [self._vocabulary.get(x, 0) for x in sequence]
        embedded = self._embedding[indices]
        data[:len(sequence)] = embedded
        return data

    @property
    def dimensions(self):
        return self._embedding.shape[1]
import tensorflow as tf

from helpers import lazy_property


class SequenceClassificationModel:

    def __init__(self, data, target, params):
        self.data = data
        self.target = target
        self.params = params
        self.prediction
        self.cost
        self.error
        self.optimize

    @lazy_property
    def length(self):
        used = tf.sign(tf.reduce_max(tf.abs(self.data), reduction_indices=2))
        length = tf.reduce_sum(used, reduction_indices=1)
        length = tf.cast(length, tf.int32)
        return length

    @lazy_property
    def prediction(self):
        # Recurrent network.
        output, _ = tf.nn.dynamic_rnn(
            self.params.rnn_cell(self.params.rnn_hidden),
            self.data,
            dtype=tf.float32,
            sequence_length=self.length,
        )
        last = self._last_relevant(output, self.length)
        # Softmax layer.
        num_classes = int(self.target.get_shape()[1])
        weight = tf.Variable(tf.truncated_normal(
            [self.params.rnn_hidden, num_classes], stddev=0.01))
        bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
        prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
        return prediction

    @lazy_property
    def cost(self):
        cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
        return cross_entropy

    @lazy_property
    def error(self):
        mistakes = tf.not_equal(
            tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
        return tf.reduce_mean(tf.cast(mistakes, tf.float32))

    @lazy_property
    def optimize(self):
        gradient = self.params.optimizer.compute_gradients(self.cost)
        try:
            limit = self.params.gradient_clipping
            gradient = [
                (tf.clip_by_value(g, -limit, limit), v)
                if g is not None else (None, v)
                for g, v in gradient]
        except AttributeError:
            print('No gradient clipping parameter specified.')
        optimize = self.params.optimizer.apply_gradients(gradient)
        return optimize

    @staticmethod
    def _last_relevant(output, length):
        batch_size = tf.shape(output)[0]
        max_length = int(output.get_shape()[1])
        output_size = int(output.get_shape()[2])
        index = tf.range(0, batch_size) * max_length + (length - 1)
        flat = tf.reshape(output, [-1, output_size])
        relevant = tf.gather(flat, index)
        return relevant
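The preprocess_batched helper imported by the training script below is not shown in the listing. A minimal sketch, assuming it embeds each review with the Embedding class and groups reviews with one-hot labels ([1, 0] = negative, [0, 1] = positive) into batches of batch_size:

import numpy as np


def preprocess_batched(iterator, length, embedding, batch_size):
    iterator = iter(iterator)
    while True:
        data = np.zeros((batch_size, length, embedding.dimensions))
        target = np.zeros((batch_size, 2))
        for index in range(batch_size):
            try:
                text, label = next(iterator)
            except StopIteration:
                return                      # drop the final partial batch
            data[index] = embedding(text)   # pad and look up word vectors
            target[index] = [0, 1] if label else [1, 0]
        yield data, target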
import tensorflow as tf

from helpers import AttrDict
from Embedding import Embedding
from ImdbMovieReviews import ImdbMovieReviews
from preprocess_batched import preprocess_batched
from SequenceClassificationModel import SequenceClassificationModel

IMDB_DOWNLOAD_DIR = './imdb'
WIKI_VOCAB_DIR = '../01_wikipedia/wikipedia'
WIKI_EMBED_DIR = '../01_wikipedia/wikipedia'

params = AttrDict(
    rnn_cell=tf.contrib.rnn.GRUCell,
    rnn_hidden=300,
    optimizer=tf.train.RMSPropOptimizer(0.002),
    batch_size=20,
)

reviews = ImdbMovieReviews(IMDB_DOWNLOAD_DIR)
length = max(len(x[0]) for x in reviews)
embedding = Embedding(
    WIKI_VOCAB_DIR + '/vocabulary.bz2',
    WIKI_EMBED_DIR + '/embeddings.npy', length)
batches = preprocess_batched(reviews, length, embedding, params.batch_size)

data = tf.placeholder(tf.float32, [None, length, embedding.dimensions])
target = tf.placeholder(tf.float32, [None, 2])
model = SequenceClassificationModel(data, target, params)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for index, batch in enumerate(batches):
    feed = {data: batch[0], target: batch[1]}
    error, _ = sess.run([model.error, model.optimize], feed)
    print('{}: {:3.1f}%'.format(index + 1, 100 * error))
References:
TensorFlow for Machine Intelligence (《面向機器智能的TensorFlow實踐》)