Preface
This article is based on the LSTM tutorial notebook 6_lstm.ipynb from the TensorFlow GitHub repository. The notebook both implements an LSTM and exercises the TensorFlow API in practice, so it kills two birds with one stone.
Source code: https://github.com/Salon-sai/learning-tensorflow/tree/master/lesson4
A personal note
I originally wanted to try my hand at NLP, but I have been buried in project work lately (I badly need a new computer, so I keep joking about taking on side gigs to fund one), and I have had a lot weighing on my mind, sleeping badly and fretting. That is why it took me so long to finish this article.
LSTM theory
There is a very good article on Jianshu whose figures and formulas are worth going through:
[譯] 理解 LSTM 網(wǎng)絡(luò) (a Chinese translation of "Understanding LSTM Networks").
An LSTM paper for reference: https://arxiv.org/pdf/1402.1128v1.pdf
In essence, an LSTM decides what earlier content to forget and what part of the current input to remember. An LSTM is not a complete RNN by itself; it only reworks the RNN's hidden layer, replacing it with a carefully designed set of forget, input and output gates plus a cell state.
As the gap between relevant pieces of text keeps growing, a plain RNN loses the ability to connect information that far apart (personally I think this is tied to the vanishing gradient problem: in a very deep unrolled network the gradient shrinks layer by layer, so the earlier cells cannot learn from later content and end up learning only from nearby information). An LSTM does not suffer from this nearly as much.
(I have not worked through a proof of the formulas, so take the following as my own reading.) An LSTM passes two things on to the next time step: the cell state and the output h_t.
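Schematically, each step consumes the current input together with the previous output and state and produces a new pair:

$$(h_t, c_t) = \mathrm{LSTMCell}(x_t,\; h_{t-1},\; c_{t-1})$$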
Hands-on code
-
config.py
# config.py
# -*- coding: utf-8 -*-
import string

class ModelConfig(object):
    def __init__(self):
        self.num_unrollings = 10      # number of characters unrolled per training sequence
        self.batch_size = 64          # number of sequences in each batch
        self.vocabulary_size = len(string.ascii_lowercase) + 1  # 26 lowercase letters plus the space character
        self.summary_frequency = 100  # how often to report the loss and generate samples
        self.num_steps = 7001         # number of training steps
        self.num_nodes = 64           # number of hidden units in the LSTM cell

config = ModelConfig()
As shown above, config.py simply holds the hyperparameters shared by the rest of the code.
-
handle_data.py
# -*- coding: utf-8 -*-
import tensorflow as tf
import string
import zipfile
import numpy as np

first_letter = ord(string.ascii_lowercase[0])

class LoadData(object):
    def __init__(self, valid_size=1000):
        self.text = self._read_data()
        self.valid_text = self.text[:valid_size]
        self.train_text = self.text[valid_size:]

    def _read_data(self, filename='text8.zip'):
        with zipfile.ZipFile(filename) as f:
            # take the first file inside the archive
            name = f.namelist()[0]
            print('file name : %s ' % name)
            data = tf.compat.as_str(f.read(name))
        return data

def char2id(char):
    # map a character to its id: a-z -> 1..26, space -> 0
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print("Unexpected character: %s " % char)
        return 0

def id2char(dictid):
    # map an id back to its character
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

def characters(probabilities):
    # map a batch of probability (or one-hot) vectors to their most likely characters
    return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
    # check that the generated batches reassemble into readable character sequences
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s
A reminder: the data I use here is text8.zip, which you can download yourself. LoadData extracts the text from the archive and then splits it into train_text and valid_text. The module also defines the char2id and id2char helpers, which are used later on.
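A quick sanity check of the character/id mapping (a small sketch, assuming the helpers above are importable as handle_data):

from handle_data import char2id, id2char

print(char2id('a'), char2id('z'), char2id(' '))   # 1 26 0
print(id2char(1), id2char(26), id2char(0))        # a z ' '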
-
BatchGenerator.py
# -*- coding: utf-8 -*-
import numpy as np
from handle_data import char2id
from config import config

class BatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        # distance between the starting positions of the batch_size parallel streams
        segment = self._text_size // self._batch_size
        # current cursor position of each stream
        self._cursor = [offset * segment for offset in range(self._batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        """
        Generate a single batch from the current cursor positions.
        Its shape is (batch_size, vocabulary_size).
        """
        batch = np.zeros(shape=(self._batch_size, config.vocabulary_size), dtype=np.float)
        for b in range(self._batch_size):
            # one-hot vector for the character under each cursor
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch

    def next(self):
        # the last batch of the previous call is prepended, so each call
        # returns sequences of length num_unrollings + 1
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches
This is a batch generator: given batch_size and num_unrollings, it produces batch_size character sequences at a time. The class can look a bit convoluted at first; running its output through the batches2string function from handle_data is the easiest way to see what it produces.
train_batches = BatchGenerator(train_text, config.batch_size, config.num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)
print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
It prints something like this:
['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nationa', 'd monasteri', 'raca prince', 'chard baer ', 'rgical lang', 'for passeng', 'the nationa', 'took place ', 'ther well k', 'seven six s', 'ith a gloss', 'robably bee', 'to recogniz', 'ceived the ', 'icant than ', 'ritic of th', 'ight in sig', 's uncaused ', ' lost as in', 'cellular ic', 'e size of t', ' him a stic', 'drugs confu', ' take to co', ' the priest', 'im to name ', 'd barred at', 'standard fo', ' such as es', 'ze on the g', 'e of the or', 'd hiver one', 'y eight mar', 'the lead ch', 'es classica', 'ce the non ', 'al analysis', 'mormons bel', 't or at lea', ' disagreed ', 'ing system ', 'btypes base', 'anguages th', 'r commissio', 'ess one nin', 'nux suse li', ' the first ', 'zi concentr', ' society ne', 'elatively s', 'etworks sha', 'or hirohito', 'litical ini', 'n most of t', 'iskerdoo ri', 'ic overview', 'air compone', 'om acnm acc', ' centerline', 'e than any ', 'devotional ', 'de such dev']
[' a']
['an']
Notice that each printed list has batch_size entries, and each string has length num_unrollings + 1. A careful look also shows that consecutive strings start segment = text_size // batch_size characters apart in the original text. The _next_batch function produces a single array of batch_size one-hot characters, one per stream, with neighbouring streams segment characters apart; next then calls it num_unrollings times in a row, and joining the results column-wise gives exactly the strings printed above. With that, we have an object that iterates over the data set for us. The class is well worth stepping through in a debugger.
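As a concrete illustration of how the cursors are laid out (a toy sketch with made-up values, not part of the project code):

# toy example: 4 streams over a 20-character text
text = "abcdefghijklmnopqrst"
batch_size = 4
segment = len(text) // batch_size                              # 20 // 4 = 5
cursors = [offset * segment for offset in range(batch_size)]   # [0, 5, 10, 15]
print([text[c] for c in cursors])                              # ['a', 'f', 'k', 'p']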
-
sample.py
# -*- coding: utf-8 -*-
import random
import numpy as np
from config import config

def sample_distribution(distribution):
    # sample one index from a probability distribution (inverse-CDF sampling)
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    # turn a predicted distribution into a randomly sampled one-hot vector
    p = np.zeros(shape=[1, config.vocabulary_size], dtype=np.float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution():
    # generate a random probability vector of shape 1 x 27
    b = np.random.uniform(0.0, 1.0, size=[1, config.vocabulary_size])
    return b / np.sum(b, 1)[:, None]
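A quick look at what these helpers return (a small sketch; it also borrows the characters helper from handle_data):

from sample import sample, random_distribution
from handle_data import characters

p = random_distribution()    # shape (1, 27), the single row sums to 1
one_hot = sample(p)          # shape (1, 27), exactly one entry set to 1.0
print(characters(one_hot))   # e.g. ['k'] -- one randomly drawn character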
-
lstm_model.py
# -*- coding: utf-8 -*-
import tensorflow as tf
from config import config

class LSTM_Cell(object):
    def __init__(self, train_data, train_label, num_nodes=64):
        # input gate parameters
        with tf.variable_scope("input", initializer=tf.truncated_normal_initializer(-0.1, 0.1)) as input_layer:
            self.ix, self.im, self.ib = self._generate_w_b(
                x_weights_size=[config.vocabulary_size, num_nodes],
                m_weights_size=[num_nodes, num_nodes],
                biases_size=[1, num_nodes])
        # cell update (candidate state) parameters
        with tf.variable_scope("memory", initializer=tf.truncated_normal_initializer(-0.1, 0.1)) as update_layer:
            self.cx, self.cm, self.cb = self._generate_w_b(
                x_weights_size=[config.vocabulary_size, num_nodes],
                m_weights_size=[num_nodes, num_nodes],
                biases_size=[1, num_nodes])
        # forget gate parameters
        with tf.variable_scope("forget", initializer=tf.truncated_normal_initializer(-0.1, 0.1)) as forget_layer:
            self.fx, self.fm, self.fb = self._generate_w_b(
                x_weights_size=[config.vocabulary_size, num_nodes],
                m_weights_size=[num_nodes, num_nodes],
                biases_size=[1, num_nodes])
        # output gate parameters
        with tf.variable_scope("output", initializer=tf.truncated_normal_initializer(-0.1, 0.1)) as output_layer:
            self.ox, self.om, self.ob = self._generate_w_b(
                x_weights_size=[config.vocabulary_size, num_nodes],
                m_weights_size=[num_nodes, num_nodes],
                biases_size=[1, num_nodes])
        # classifier that maps the LSTM output back to vocabulary logits
        self.w = tf.Variable(tf.truncated_normal([num_nodes, config.vocabulary_size], -0.1, 0.1))
        self.b = tf.Variable(tf.zeros([config.vocabulary_size]))
        # output and state carried across unrollings (not trained directly)
        self.saved_output = tf.Variable(tf.zeros([config.batch_size, num_nodes]), trainable=False)
        self.saved_state = tf.Variable(tf.zeros([config.batch_size, num_nodes]), trainable=False)
        self.train_data = train_data
        self.train_label = train_label

    def _generate_w_b(self, x_weights_size, m_weights_size, biases_size):
        x_w = tf.get_variable("x_weights", x_weights_size)
        m_w = tf.get_variable("m_weights", m_weights_size)
        b = tf.get_variable("biases", biases_size, initializer=tf.constant_initializer(0.0))
        return x_w, m_w, b

    def _run(self, input, output, state):
        # `output` is h_{t-1}, `state` is c_{t-1}
        forget_gate = tf.sigmoid(tf.matmul(input, self.fx) + tf.matmul(output, self.fm) + self.fb)
        input_gate = tf.sigmoid(tf.matmul(input, self.ix) + tf.matmul(output, self.im) + self.ib)
        update = tf.matmul(input, self.cx) + tf.matmul(output, self.cm) + self.cb
        # element-wise: keep part of the old state and add part of the new candidate
        state = state * forget_gate + tf.tanh(update) * input_gate
        output_gate = tf.sigmoid(tf.matmul(input, self.ox) + tf.matmul(output, self.om) + self.ob)
        return output_gate * tf.tanh(state), state

    def loss_func(self):
        outputs = list()
        output = self.saved_output
        state = self.saved_state
        for i in self.train_data:
            output, state = self._run(i, output, state)
            outputs.append(output)
        # finally, the length of outputs is num_unrollings
        with tf.control_dependencies([
                self.saved_output.assign(output),
                self.saved_state.assign(state)]):
            # tf.concat(outputs, 0) stacks the num_unrollings outputs along dim 0,
            # so logits has shape [num_unrollings * batch_size, vocabulary_size]
            logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), self.w, self.b)
            # the labels are concatenated the same way so they line up with the logits
            loss = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(
                    labels=tf.concat(self.train_label, 0),
                    logits=logits))
            train_prediction = tf.nn.softmax(logits)
        return logits, loss, train_prediction
This is the core of the whole article. __init__ defines quite a few parameters; rather than describing them one by one, it is easier to map them onto the symbols of the standard LSTM formulation:
- x_t: the input vector of the LSTM cell
- h_t: the output vector of the LSTM cell
- c_t: the cell state vector
- W, U and b: the parameter matrices and bias vectors
- f_t, i_t and o_t: the gate vectors
Specifically:
- f_t is the forget gate. It weights how much of the old information is kept (0 means forget, 1 means keep).
- i_t is the input gate. It weights how much of the new content is taken in (0 means discard, 1 means keep).
- o_t is the output gate. It controls how much of the cell state shows up in the output.
These formulas are exactly what _run computes (they are written out below); with the variable names above, the code should read naturally. The sigmoid function is what squashes each gate into the 0-1 range. One thing to watch out for: when computing the new cell state, the gates are combined with the state element-wise, not with a matrix product (I initially got this wrong by misreading the formula). Also note that for the final output h_t, applying an activation to the state is not strictly required in every LSTM variant; here the output gate o_t is simply multiplied element-wise with tanh of the state.
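For reference, the update implemented by _run can be written out as follows (the subscripts mirror the fx/fm/fb, ix/im/ib, cx/cm/cb and ox/om/ob weights above; $\odot$ is element-wise multiplication):

$$
\begin{aligned}
f_t &= \sigma(x_t W_{fx} + h_{t-1} W_{fm} + b_f) \\
i_t &= \sigma(x_t W_{ix} + h_{t-1} W_{im} + b_i) \\
\tilde{c}_t &= \tanh(x_t W_{cx} + h_{t-1} W_{cm} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(x_t W_{ox} + h_{t-1} W_{om} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$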
loss_func unrolls the cell over the num_unrollings inputs, concatenates the outputs into a [num_unrollings * batch_size, num_nodes] matrix, projects it to vocabulary logits, and then computes the softmax cross-entropy between the predictions and the (equally concatenated) targets. That gives us the loss.
-
Helper functions in main.py
# main.py
# -*- coding: utf-8 -*-
import numpy as np
import tensorflow as tf

from config import config
from handle_data import LoadData, characters
from BatchGenerator import BatchGenerator
from lstm_model import LSTM_Cell
from sample import sample, random_distribution

def get_optimizer(loss):
    global_step = tf.Variable(0)
    # decay the learning rate from 10.0 by a factor of 0.1 every 5000 steps
    learning_rate = tf.train.exponential_decay(
        10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    # To avoid exploding gradients, compute the global L2 norm of all gradients;
    # if it exceeds 1.25, every gradient is rescaled by (1.25 / global_norm).
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    # apply the clipped gradients as the gradient-descent update
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)
    return optimizer, learning_rate

def logprob(predictions, labels):
    # average cross-entropy (in nats) between predictions and one-hot labels
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]
These two helpers respectively build the optimizer and compute the cross-entropy used to report perplexity. The part worth noting is the optimizer: unlike the plain optimizers used before, it adds tf.clip_by_global_norm(gradients, 1.25). The LSTM's redesign of the RNN hidden layer largely deals with vanishing gradients, which are the nastier problem because once gradients vanish the earlier cells can hardly learn anything; what remains is the risk of exploding gradients, and those can be suppressed simply by rescaling gradients that grow too large. That is exactly what the clipping here does. On top of that, the learning rate is decayed exponentially as training progresses.
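A tiny numerical illustration of what tf.clip_by_global_norm does (a toy sketch with made-up gradients, not part of the project code):

import numpy as np

grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]        # hypothetical gradients
global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))    # sqrt(9 + 16 + 144) = 13.0
clip_norm = 1.25
if global_norm > clip_norm:
    grads = [g * (clip_norm / global_norm) for g in grads]
# each gradient is scaled by 1.25 / 13.0, so the global norm becomes exactly 1.25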
-
Training
Set up the data pipeline and the model
loadData = LoadData()
train_text = loadData.train_text
valid_text = loadData.valid_text
train_batcher = BatchGenerator(text=train_text, batch_size=config.batch_size, num_unrollings=config.num_unrollings)
valid_batcher = BatchGenerator(text=valid_text, batch_size=1, num_unrollings=1)

# the training data consists of num_unrollings + 1 placeholders
train_data = list()
for _ in range(config.num_unrollings + 1):
    train_data.append(
        tf.placeholder(tf.float32, shape=[config.batch_size, config.vocabulary_size]))
train_input = train_data[:config.num_unrollings]
train_label = train_data[1:]

# define the lstm train model
model = LSTM_Cell(
    train_data=train_input,
    train_label=train_label)
# get the loss and the prediction
logits, loss, train_prediction = model.loss_func()
optimizer, learning_rate = get_optimizer(loss)
Our train_data consists of num_unrollings + 1 batches whose characters are adjacent in the text. Since the LSTM is trained to predict which character is most likely to appear at the next position, the labels are simply the inputs shifted by one character.
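A toy illustration of that one-character offset (using the first string printed by batches2string earlier; not actual project code):

sequence = 'ons anarchi'      # num_unrollings + 1 = 11 characters
inputs = sequence[:-1]        # 'ons anarch' -> fed to the LSTM step by step
labels = sequence[1:]         # 'ns anarchi' -> what the LSTM should predict at each step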
Define the sampling graph
# input, output and reset op for sampling (letting the trained network generate text)
sample_input = tf.placeholder(tf.float32, shape=[1, config.vocabulary_size])
save_sample_output = tf.Variable(tf.zeros([1, config.num_nodes]))
save_sample_state = tf.Variable(tf.zeros([1, config.num_nodes]))
reset_sample_state = tf.group(
    save_sample_output.assign(tf.zeros([1, config.num_nodes])),
    save_sample_state.assign(tf.zeros([1, config.num_nodes])))
sample_output, sample_state = model._run(
    sample_input, save_sample_output, save_sample_state)
with tf.control_dependencies([save_sample_output.assign(sample_output),
                              save_sample_state.assign(sample_state)]):
    # predicted distribution over the next character
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, model.w, model.b))
Sampling here means that every so often during training we let the current network generate a short piece of text, so you can see how well (or, frankly, how poorly) it is learning. The thing to pay attention to is tf.control_dependencies. TensorFlow does not execute the graph in source order; operations without explicit dependencies have no guaranteed ordering. We must make sure the output and state are saved first, because the next step has to reuse the previous output and state.
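A minimal sketch of the control_dependencies pattern (hypothetical variables, just showing how it forces ordering):

import tensorflow as tf

counter = tf.Variable(0)
increment = counter.assign_add(1)
with tf.control_dependencies([increment]):
    # anything created in this block only runs after `increment` has run
    read_after_increment = tf.identity(counter)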
Start training
# training
with tf.Session() as session:
    tf.global_variables_initializer().run()
    print("Initialized....")
    mean_loss = 0
    for step in range(config.num_steps):
        batches = train_batcher.next()
        feed_dict = dict()
        for i in range(config.num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        # accumulate the loss so we can report the average for each reporting period
        mean_loss += l
        if step % config.summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / config.summary_frequency
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, labels))))
            if step % (config.summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            reset_sample_state.run()
Just like before: the prepared batches are fed to the loss through feed_dict, the loss is accumulated over the iterations for reporting, and the clipped gradient-descent optimizer updates the model.
Summary
- This was only a deep dive into the formulas and principles behind LSTM (without proving its convergence or its handling of long-term dependencies), plus some practice with TensorFlow operations.
- Using one-hot vectors as the character representation is a dead end; to improve accuracy you would want learned, word2vec-style embeddings for each character (or word), roughly along the lines of the sketch below.
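A minimal sketch of that idea (hypothetical embedding_dim and variable names, not part of the project code):

import tensorflow as tf
from config import config

embedding_dim = 16   # assumed value
embeddings = tf.Variable(tf.random_uniform([config.vocabulary_size, embedding_dim], -1.0, 1.0))
char_ids = tf.placeholder(tf.int32, shape=[config.batch_size])   # character ids instead of one-hot rows
embedded = tf.nn.embedding_lookup(embeddings, char_ids)          # shape [batch_size, embedding_dim]
# the LSTM input weights would then be [embedding_dim, num_nodes] instead of [vocabulary_size, num_nodes]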