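Natural language processing covers speech processing and text processing. Speech recognition lets a computer "understand" human speech by extracting the text that the speech carries.
Fukoku Mutual Life Insurance in Japan spent 1.7 million US dollars installing an artificial intelligence system that converts customers' speech into text and analyzes whether the words are positive or negative. Intelligent customer service is a research focus for AI companies. The model used for this kind of task is the recurrent neural network (RNN).
Model selection. In the diagram from Andrej Karpathy's article "The Unreasonable Effectiveness of Recurrent Neural Networks" (http://karpathy.github.io/2015/05/21/rnn-effectiveness/), each rectangle is a vector and each arrow is a function. The bottom row holds the input vectors, the top row holds the output vectors, and the middle row holds the RNN state. The diagram shows five patterns:
One to one: no RNN is involved. This is the vanilla mode, from a fixed-size input to a fixed-size output (for example image classification).
One to many: sequence output. Image captioning takes one image and outputs a sequence of words, combining a CNN with an RNN, and images with language.
Many to one: sequence input. Sentiment analysis takes a piece of text and classifies it as positive or negative, for example classifying Taobao product reviews with an LSTM.
Many to many, asynchronous: sequence input and sequence output. Machine translation is the typical case: an RNN reads an English sentence and then outputs it in French.
Many to many, synchronized: sequence input and sequence output in step with each other. Video classification, where every frame gets a label, is the typical case.
In every pattern the RNN state transformation in the middle is fixed and can be applied as many times as needed, so the sequence length does not have to be constrained in advance.
Natural language processing tasks include speech synthesis (text to speech), speech recognition, voiceprint recognition (voiceprint authentication), and text processing (word segmentation, sentiment analysis, text mining).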
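The point that the same recurrent step is reused for any sequence length can be illustrated with a few lines of NumPy. This is only a sketch of a vanilla RNN forward pass with random weights; the function name `rnn_forward` and all sizes are made up for the illustration and do not come from the article.

```python
import numpy as np

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a vanilla RNN over a sequence and return the hidden state at every step."""
    hidden = np.zeros(w_hh.shape[0])
    states = []
    for x in inputs:  # the same weights are reused at every time step
        hidden = np.tanh(w_xh @ x + w_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8  # illustrative sizes
w_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
w_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

short_seq = rng.normal(size=(3, input_size))
long_seq = rng.normal(size=(11, input_size))

# Many to many (synchronized): keep one state per input step.
per_step_states = rnn_forward(long_seq, w_xh, w_hh, b_h)
# Many to one: keep only the final state, e.g. as input to a sentiment classifier.
final_state = rnn_forward(short_seq, w_xh, w_hh, b_h)[-1]
print(per_step_states.shape, final_state.shape)
```

Both calls use the same three weight arrays even though the sequences have different lengths, which is what makes the length unconstrained.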
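English spoken-digit recognition. The script https://github.com/pannous/tensorflow-speech-recognition/blob/master/speech2text-tflearn.py builds a very simple speech recognizer in about 20 lines of Python: an LSTM recurrent neural network trained with TFLearn on a dataset of spoken English digits. The spoken numbers pcm dataset is at http://pannous.net/spoken_numbers.tar . It contains recordings of several speakers, male and female, reading the digits 0 to 9 in English. Each clip (a wav file) contains exactly one spoken digit, and files are named {digit}_speakername_xxx.
Define the input data and preprocess it. Each utterance is turned into a matrix of Mel frequency cepstral coefficient (MFCC) feature vectors. The audio is split into frames, the logarithm of the mel-scaled spectrum of each frame is taken, and an inverse transform (a discrete cosine transform) yields the MFCCs that represent the speech.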
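The framing, logarithm and inverse-transform steps can be sketched in NumPy. Note the simplification: this version leaves out the mel filter bank, so it computes plain cepstral coefficients rather than true MFCCs. The frame length, hop size, and the function names `cepstral_features` and `pad_to_length` are assumptions made for the illustration; in practice a library routine would be used.

```python
import numpy as np

def cepstral_features(signal, frame_len=400, hop=160, n_coeffs=20):
    """Frame the signal, take the log power spectrum, then apply a DCT-II."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)            # window each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectrum per frame
    log_power = np.log(power + 1e-10)                  # small constant avoids log(0)
    n_bins = log_power.shape[1]
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_bins)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_bins))
    return dct_basis @ log_power.T                     # shape (n_coeffs, n_frames)

def pad_to_length(features, length=80):
    """Truncate or zero-pad along the time axis so every utterance has the same width."""
    features = features[:, :length]
    padding = length - features.shape[1]
    return np.pad(features, ((0, 0), (0, padding)), mode="constant")

# Half a second of a 440 Hz tone at an assumed 16 kHz sample rate.
signal = np.sin(2 * np.pi * 440 * np.arange(8000) / 16000.0)
features = pad_to_length(cepstral_features(signal))
print(features.shape)
```

The padded result has 20 coefficients by 80 frames, the same shape as the `width` and `height` constants in the script that follows.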
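Define the network model: an LSTM.
Train the model and save it.
Predict with the model: feed in any speech file and predict the digit.
Speech recognition can be used in intelligent input methods, fast meeting transcription, voice control systems, and smart homes.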
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import division, print_function, absolute_import

import tflearn
import speech_data

learning_rate = 0.0001
training_iters = 300000  # number of training steps (not used by the loop below)
batch_size = 64

width = 20    # number of MFCC features per frame
height = 80   # (max) length of an utterance, in frames
classes = 10  # digits 0-9

# Generator that yields batches of MFCC features and their labels
batch = word_batch = speech_data.mfcc_batch_generator(batch_size)
X, Y = next(batch)
trainX, trainY = X, Y
testX, testY = X, Y  # same batch for training and validation: overfits on purpose for now

# Network building: one LSTM layer followed by a softmax classifier
net = tflearn.input_data([None, width, height])
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, classes, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate,
                         loss='categorical_crossentropy')

# Training
model = tflearn.DNN(net, tensorboard_verbose=0)
# Resume from a previously saved model. Comment this line out on the first
# run, when no saved model exists yet.
model.load("tflearn.lstm.model")
while True:  # runs until interrupted
    model.fit(trainX, trainY, n_epoch=100, validation_set=(testX, testY),
              show_metric=True, batch_size=batch_size)
    _y = model.predict(X)
    model.save("tflearn.lstm.model")
    print(_y)  # predicted class probabilities
    print(Y)   # ground-truth labels
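The predictions printed by the script are rows of ten class probabilities, one per digit. A small sketch of turning such rows into digit labels follows; the probabilities are invented for the example and `decode_digits` is a hypothetical helper, not part of the original script.

```python
import numpy as np

def decode_digits(probabilities):
    """Return the most likely digit for each row of class probabilities."""
    return np.argmax(np.asarray(probabilities), axis=1)

# Two made-up softmax outputs over the classes 0-9.
fake_predictions = [
    [0.05, 0.80, 0.05, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.90, 0.02, 0.01],
]
print(decode_digits(fake_predictions))  # prints [1 7]
```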
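Intelligent chatbots. The direction for the future is natural-language human-computer interaction. Examples include Apple Siri, Microsoft Cortana and XiaoIce, Google Now, Baidu Duer, Alexa (the voice assistant built into Amazon's Bluetooth speaker, the Amazon Echo), and Facebook's assistant M. By talking with a "voice robot", users are guided to the service they need. Going forward, chatbots will be embedded in smart hardware and smart home devices.
Chatbot technology has gone through three generations. The first generation relied on feature engineering and large numbers of hand-written logic rules. The second generation relied on a retrieval base: given a question or a chat message, it looked up the stored answer that matched best. The third generation uses deep learning: a seq2seq+Attention model is trained on large amounts of data and generates the output from the input.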
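The second-generation, retrieval-based approach can be sketched in a few lines. This toy version scores stored questions by word overlap (Jaccard similarity), which is an assumption chosen for brevity; real systems use much richer matching. The question and answer pairs are invented.

```python
def tokenize(text):
    """Lowercase the text and return its set of whitespace-separated words."""
    return set(text.lower().split())

def retrieve_answer(question, qa_pairs):
    """Return the stored answer whose question shares the most words with the query."""
    query = tokenize(question)

    def jaccard(candidate):
        words = tokenize(candidate)
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    _, best_answer = max(qa_pairs, key=lambda pair: jaccard(pair[0]))
    return best_answer

# Invented retrieval base of (question, answer) pairs.
qa_pairs = [
    ("what is your name", "I am a demo bot."),
    ("how is the weather today", "I cannot see outside, sorry."),
    ("where are you from", "I live in a Python list."),
]
reply = retrieve_answer("what is the weather like", qa_pairs)
print(reply)
```

A retrieval bot can only return answers it already stores, which is the limitation the generative third generation addresses.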
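Principles and construction of the seq2seq+Attention model. It is a translation model: it translates one sequence into another. Two RNN language models, one acting as the encoder and one as the decoder, together form an RNN encoder-decoder. The encoder-decoder framework is widely used in text processing: input -> encoder -> semantic encoding C -> decoder -> output. It is a general model for generating a target from a context. For a sentence pair (X, Y), the framework takes the sentence X as input and generates the target sentence Y. If X and Y are in different languages, the task is machine translation. If X and Y are a question and its answer, the task is a chatbot. If X is an image and Y its description, the task is image captioning.
X is a sequence of words x1, x2, ... and Y is a sequence of words y1, y2, .... The encoder encodes the input X into the intermediate semantic encoding C. The decoder decodes C, and at each time step i it combines C with the history y1, y2, ..., yi-1 that it has already generated to produce yi. Every word of the generated sentence uses the same encoding C. This works well for short sentences, but for long sentences the output drifts away from the meaning.
In a real chat system the encoder and decoder are RNN or LSTM models. Once a sentence is longer than about 30 words, the performance of the LSTM model drops sharply, so an Attention model is introduced to improve results on long sentences. The attention mechanism resembles a person concentrating on one task and ignoring everything else around it: the words in the source sentence that matter most for the word being generated receive higher weights, which produces a more accurate reply. With attention, the encoder-decoder framework becomes: input -> encoder -> semantic encodings C1, C2, C3 -> decoder -> outputs Y1, Y2, Y3. The intermediate encoding Ci changes at every step, so each Yi is generated more accurately.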
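How a different encoding Ci arises at each decoding step can be shown with a small NumPy sketch. It scores each encoder state against the current decoder state, normalizes the scores with a softmax, and takes the weighted sum. Dot-product scoring is an assumption chosen for brevity and is not necessarily the scoring used by the TensorFlow model later in this article; the vectors are made up.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Return the context vector Ci and the attention weights for one decoding step."""
    scores = encoder_states @ decoder_state          # one score per source position
    scores = scores - scores.max()                   # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: weights sum to 1
    context = weights @ encoder_states               # weighted sum of encoder states
    return context, weights

# Three made-up encoder states, one per source word.
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Two different decoder states attend to different source words.
context_1, weights_1 = attention_context(np.array([2.0, 0.0]), encoder_states)
context_2, weights_2 = attention_context(np.array([0.0, 2.0]), encoder_states)
print(weights_1.round(3), weights_2.round(3))
```

Without attention, both decoding steps would see the same fixed C. Here the first step puts more weight on the first source word and the second step on the second source word.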
Best practice. The example uses https://github.com/suriyadeepan/easy_seq2seq , which depends on TensorFlow 0.12.1. The data comes from the Cornell Movie Dialogs Corpus, http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html , which contains dialogue from more than 600 films.
Process the chat data.
First organize the dataset into "question" and "answer" files, producing .enc (question) and .dec (answer) files: test.dec (test set answers), test.enc (test set questions), train.dec (training set answers), train.enc (training set questions).
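A minimal sketch of splitting question and answer pairs into the four files listed above. It only builds the lists in memory, keyed by file name; writing them out one sentence per line is left to the caller. The function name `split_pairs`, the 10% test fraction and the sample pairs are assumptions for the illustration, not part of easy_seq2seq.

```python
import random

def split_pairs(pairs, test_fraction=0.1, seed=0):
    """Shuffle (question, answer) pairs and split them into aligned train/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    return {
        "train.enc": [question for question, _ in train],
        "train.dec": [answer for _, answer in train],
        "test.enc": [question for question, _ in test],
        "test.dec": [answer for _, answer in test],
    }

# Invented pairs: question i is answered by answer i.
pairs = [("question %d" % i, "answer %d" % i) for i in range(10)]
files = split_pairs(pairs)
print({name: len(lines) for name, lines in files.items()})
```

Line n of a .enc file must stay aligned with line n of the matching .dec file, which is why questions and answers are shuffled together as pairs.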
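Build the vocabulary, then convert questions and answers into id form. Each vocabulary file holds 20,000 entries: vocab20000.dec (answer vocabulary) and vocab20000.enc (question vocabulary). _GO, _EOS, _UNK and _PAD are special tokens of the seq2seq model. _GO marks the start of a reply. _EOS marks the end of a reply. _UNK stands for any word that is not in the vocabulary and so replaces rare words. _PAD pads sequences so that all sequences in a batch have the same length. The converted ids files include test.enc.ids20000, train.dec.ids20000 and train.enc.ids20000. In an ids file each line is one question or answer, and each id on the line stands for the word at that position.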
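Vocabulary building and id conversion can be sketched as follows. Placing the special tokens first, in the order _PAD, _GO, _EOS, _UNK, is assumed to match the convention of the TensorFlow translate `data_utils` module; the helper names and sentences are made up for the illustration.

```python
from collections import Counter

_PAD, _GO, _EOS, _UNK = "_PAD", "_GO", "_EOS", "_UNK"
START_VOCAB = [_PAD, _GO, _EOS, _UNK]  # special tokens take the first ids

def build_vocab(sentences, max_size):
    """Map the most frequent words to ids, after the special tokens."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    words = [word for word, _ in counts.most_common(max_size - len(START_VOCAB))]
    return {word: idx for idx, word in enumerate(START_VOCAB + words)}

def sentence_to_ids(sentence, vocab):
    """Replace each word with its id, or with the _UNK id if it is not in the vocabulary."""
    return [vocab.get(word, vocab[_UNK]) for word in sentence.split()]

sentences = ["how are you", "how is it going", "are you ok"]
vocab = build_vocab(sentences, max_size=8)
ids = sentence_to_ids("how are they", vocab)
print(ids)
```

The word "they" never appears in the training sentences, so it is mapped to the _UNK id. Words that fall outside the size limit are handled the same way.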
Train with the encoder-decoder framework.
Define the training parameters in seq2seq.ini.
[strings]
# Mode: train, test, serve
mode = train
train_enc = data/train.enc
train_dec = data/train.dec
test_enc = data/test.enc
test_dec = data/test.dec
# Folder where checkpoints, vocabulary and temporary data will be stored
working_directory = working_dir/
[ints]
# Vocabulary size; 20,000 is a reasonable size
enc_vocab_size = 20000
dec_vocab_size = 20000
# Number of recurrent layers: 1/2/3
num_layers = 3
# Units per layer; typical options: 128, 256, 512, 1024
layer_size = 256
# Dataset size limit; 0 means no limit
max_train_data_size = 0
batch_size = 64
# Steps per checkpoint. At a checkpoint, model parameters are saved,
# the model is evaluated, and results are printed
steps_per_checkpoint = 300
[floats]
# Comments sit on their own lines: a trailing "# ..." after a value is not
# stripped by the parser by default and would break the float() conversion
# Initial learning rate
learning_rate = 0.5
# Factor by which the learning rate is decayed
learning_rate_decay_factor = 0.99
# Gradients are clipped to this maximum global norm
max_gradient_norm = 5.0
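The section names in the file tell the loader which type each value should be cast to. A minimal sketch of such a loader, using Python 3's standard `configparser`, is shown here. `inline_comment_prefixes` is passed explicitly because the parser does not strip trailing `#` comments by default. The `SAMPLE` text and the function name `load_config` are made up for the illustration.

```python
import configparser

# A shortened, made-up config in the same three-section layout.
SAMPLE = """
[strings]
mode = train
[ints]
num_layers = 3
layer_size = 256
[floats]
learning_rate = 0.5  # inline comment
"""

def load_config(text):
    """Parse the config text and cast each value according to its section."""
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    config = {}
    config.update({key: int(value) for key, value in parser.items("ints")})
    config.update({key: float(value) for key, value in parser.items("floats")})
    config.update({key: str(value) for key, value in parser.items("strings")})
    return config

config = load_config(SAMPLE)
print(config)
```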
Define the seq2seq network model in seq2seq_model.py, written for TensorFlow 0.12. The file defines the seq2seq+Attention model class described in "Grammar as a Foreign Language", http://arxiv.org/abs/1412.7449 . The class has three functions: the model initializer (__init__), the training step (step), and the function that fetches the next batch of training data (get_batch).
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import random

import numpy as np
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf
from tensorflow.models.rnn.translate import data_utils


class Seq2SeqModel(object):

    def __init__(self, source_vocab_size, target_vocab_size, buckets, size,
                 num_layers, max_gradient_norm, batch_size, learning_rate,
                 learning_rate_decay_factor, use_lstm=False,
                 num_samples=512, forward_only=False):
        """Create the model.

        Args:
          source_vocab_size: size of the source (question) vocabulary.
          target_vocab_size: size of the target (answer) vocabulary.
          buckets: a list of pairs (I, O), where I specifies maximum input length
            that will be processed in that bucket, and O specifies maximum output
            length. Training instances that have inputs longer than I or outputs
            longer than O will be pushed to the next bucket and padded accordingly.
            We assume that the list is sorted, e.g., [(2, 4), (8, 16)].
          size: number of units in each layer of the model.
          num_layers: number of layers in the model.
          max_gradient_norm: gradients will be clipped to maximally this norm.
          batch_size: the size of the batches used during training;
            the model construction is independent of batch_size, so it can be
            changed after initialization if this is convenient, e.g., for decoding.
          learning_rate: learning rate to start with.
          learning_rate_decay_factor: decay learning rate by this much when needed.
          use_lstm: if true, we use LSTM cells instead of GRU cells.
          num_samples: number of samples for sampled softmax.
          forward_only: if set, we do not construct the backward pass in the model.
        """
        self.source_vocab_size = source_vocab_size
        self.target_vocab_size = target_vocab_size
        self.buckets = buckets
        self.batch_size = batch_size
        self.learning_rate = tf.Variable(float(learning_rate), trainable=False)
        self.learning_rate_decay_op = self.learning_rate.assign(
            self.learning_rate * learning_rate_decay_factor)
        self.global_step = tf.Variable(0, trainable=False)

        # If we use sampled softmax, we need an output projection.
        output_projection = None
        softmax_loss_function = None
        # Sampled softmax only makes sense if we sample fewer classes than
        # the vocabulary size.
        if num_samples > 0 and num_samples < self.target_vocab_size:
            w = tf.get_variable("proj_w", [size, self.target_vocab_size])
            w_t = tf.transpose(w)
            b = tf.get_variable("proj_b", [self.target_vocab_size])
            output_projection = (w, b)

            def sampled_loss(inputs, labels):
                labels = tf.reshape(labels, [-1, 1])
                return tf.nn.sampled_softmax_loss(w_t, b, inputs, labels,
                                                  num_samples,
                                                  self.target_vocab_size)
            softmax_loss_function = sampled_loss

        # Create the internal multi-layer cell for our RNN.
        single_cell = tf.nn.rnn_cell.GRUCell(size)
        if use_lstm:
            single_cell = tf.nn.rnn_cell.BasicLSTMCell(size)
        cell = single_cell
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=0.5)
        # Note: with num_layers > 1 the dropout-wrapped cell above is replaced
        # by a stack of plain cells, so dropout only applies to one layer.
        if num_layers > 1:
            cell = tf.nn.rnn_cell.MultiRNNCell([single_cell] * num_layers)

        # The seq2seq function: we use embedding for the input and attention.
        def seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
            return tf.nn.seq2seq.embedding_attention_seq2seq(
                encoder_inputs, decoder_inputs, cell,
                num_encoder_symbols=source_vocab_size,
                num_decoder_symbols=target_vocab_size,
                embedding_size=size,
                output_projection=output_projection,
                feed_previous=do_decode)

        # Feeds for inputs.
        self.encoder_inputs = []
        self.decoder_inputs = []
        self.target_weights = []
        for i in xrange(buckets[-1][0]):  # Last bucket is the biggest one.
            self.encoder_inputs.append(tf.placeholder(
                tf.int32, shape=[None], name="encoder{0}".format(i)))
        for i in xrange(buckets[-1][1] + 1):
            self.decoder_inputs.append(tf.placeholder(
                tf.int32, shape=[None], name="decoder{0}".format(i)))
            self.target_weights.append(tf.placeholder(
                tf.float32, shape=[None], name="weight{0}".format(i)))

        # Our targets are decoder inputs shifted by one.
        targets = [self.decoder_inputs[i + 1]
                   for i in xrange(len(self.decoder_inputs) - 1)]

        # Training outputs and losses.
        if forward_only:
            self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
                self.encoder_inputs, self.decoder_inputs, targets,
                self.target_weights, buckets,
                lambda x, y: seq2seq_f(x, y, True),
                softmax_loss_function=softmax_loss_function)
            # If we use output projection, we need to project outputs for decoding.
            if output_projection is not None:
                for b in xrange(len(buckets)):
                    self.outputs[b] = [
                        tf.matmul(output, output_projection[0]) + output_projection[1]
                        for output in self.outputs[b]
                    ]
        else:
            self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
                self.encoder_inputs, self.decoder_inputs, targets,
                self.target_weights, buckets,
                lambda x, y: seq2seq_f(x, y, False),
                softmax_loss_function=softmax_loss_function)

        # Gradients and update operation for training the model.
        # Note: AdamOptimizer() is created with its default learning rate, so
        # self.learning_rate and its decay op do not affect these updates.
        params = tf.trainable_variables()
        if not forward_only:
            self.gradient_norms = []
            self.updates = []
            opt = tf.train.AdamOptimizer()
            for b in xrange(len(buckets)):
                gradients = tf.gradients(self.losses[b], params)
                clipped_gradients, norm = tf.clip_by_global_norm(
                    gradients, max_gradient_norm)
                self.gradient_norms.append(norm)
                self.updates.append(opt.apply_gradients(
                    zip(clipped_gradients, params), global_step=self.global_step))

        self.saver = tf.train.Saver(tf.global_variables())

    def step(self, session, encoder_inputs, decoder_inputs, target_weights,
             bucket_id, forward_only):
        """Run a step of the model feeding the given inputs.

        Args:
          session: tensorflow session to use.
          encoder_inputs: list of numpy int vectors to feed as encoder inputs
            (the question).
          decoder_inputs: list of numpy int vectors to feed as decoder inputs
            (the answer).
          target_weights: list of numpy float vectors to feed as target weights.
          bucket_id: which bucket of the model to use.
          forward_only: whether to do the backward step or only forward.

        Returns:
          A triple consisting of gradient norm (or None if we did not do backward),
          average perplexity, and the outputs.

        Raises:
          ValueError: if length of encoder_inputs, decoder_inputs, or
            target_weights disagrees with bucket size for the specified bucket_id.
        """
        # Check if the sizes match.
        encoder_size, decoder_size = self.buckets[bucket_id]
        if len(encoder_inputs) != encoder_size:
            raise ValueError("Encoder length must be equal to the one in bucket,"
                             " %d != %d." % (len(encoder_inputs), encoder_size))
        if len(decoder_inputs) != decoder_size:
            raise ValueError("Decoder length must be equal to the one in bucket,"
                             " %d != %d." % (len(decoder_inputs), decoder_size))
        if len(target_weights) != decoder_size:
            raise ValueError("Weights length must be equal to the one in bucket,"
                             " %d != %d." % (len(target_weights), decoder_size))

        # Input feed: encoder inputs, decoder inputs, target_weights, as provided.
        input_feed = {}
        for l in xrange(encoder_size):
            input_feed[self.encoder_inputs[l].name] = encoder_inputs[l]
        for l in xrange(decoder_size):
            input_feed[self.decoder_inputs[l].name] = decoder_inputs[l]
            input_feed[self.target_weights[l].name] = target_weights[l]

        # Since our targets are decoder inputs shifted by one, we need one more.
        last_target = self.decoder_inputs[decoder_size].name
        input_feed[last_target] = np.zeros([self.batch_size], dtype=np.int32)

        # Output feed: depends on whether we do a backward step or not.
        if not forward_only:
            output_feed = [self.updates[bucket_id],         # Update op.
                           self.gradient_norms[bucket_id],  # Gradient norm.
                           self.losses[bucket_id]]          # Loss for this batch.
        else:
            output_feed = [self.losses[bucket_id]]          # Loss for this batch.
            for l in xrange(decoder_size):                  # Output logits.
                output_feed.append(self.outputs[bucket_id][l])

        outputs = session.run(output_feed, input_feed)
        if not forward_only:
            # With a backward pass: gradient norm, loss, no outputs.
            return outputs[1], outputs[2], None
        else:
            # Forward only: no gradient norm, loss, output logits.
            return None, outputs[0], outputs[1:]

    def get_batch(self, data, bucket_id):
        """Get a random batch of data from the specified bucket, for use in step.

        Args:
          data: a tuple of size len(self.buckets) in which each element contains
            lists of pairs of input and output data that we use to create a batch.
          bucket_id: integer, which bucket to get the batch for.

        Returns:
          The triple (encoder_inputs, decoder_inputs, target_weights) for
          the constructed batch that has the proper format to call step(...) later.
        """
        encoder_size, decoder_size = self.buckets[bucket_id]
        encoder_inputs, decoder_inputs = [], []

        # Get a random batch of encoder and decoder inputs from data,
        # pad them if needed, reverse encoder inputs and add GO to decoder.
        for _ in xrange(self.batch_size):
            encoder_input, decoder_input = random.choice(data[bucket_id])

            # Encoder inputs are padded and then reversed.
            encoder_pad = [data_utils.PAD_ID] * (encoder_size - len(encoder_input))
            encoder_inputs.append(list(reversed(encoder_input + encoder_pad)))

            # Decoder inputs get an extra "GO" symbol, and are padded then.
            decoder_pad_size = decoder_size - len(decoder_input) - 1
            decoder_inputs.append([data_utils.GO_ID] + decoder_input +
                                  [data_utils.PAD_ID] * decoder_pad_size)

        # Now we create batch-major vectors from the data selected above.
        batch_encoder_inputs, batch_decoder_inputs, batch_weights = [], [], []

        # Batch encoder inputs are just re-indexed encoder_inputs.
        for length_idx in xrange(encoder_size):
            batch_encoder_inputs.append(
                np.array([encoder_inputs[batch_idx][length_idx]
                          for batch_idx in xrange(self.batch_size)], dtype=np.int32))

        # Batch decoder inputs are re-indexed decoder_inputs, we create weights.
        for length_idx in xrange(decoder_size):
            batch_decoder_inputs.append(
                np.array([decoder_inputs[batch_idx][length_idx]
                          for batch_idx in xrange(self.batch_size)], dtype=np.int32))

            # Create target_weights to be 0 for targets that are padding.
            batch_weight = np.ones(self.batch_size, dtype=np.float32)
            for batch_idx in xrange(self.batch_size):
                # We set weight to 0 if the corresponding target is a PAD symbol.
                # The corresponding target is decoder_input shifted by 1 forward.
                if length_idx < decoder_size - 1:
                    target = decoder_inputs[batch_idx][length_idx + 1]
                if length_idx == decoder_size - 1 or target == data_utils.PAD_ID:
                    batch_weight[batch_idx] = 0.0
            batch_weights.append(batch_weight)
        return batch_encoder_inputs, batch_decoder_inputs, batch_weights
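The bucketing and padding that get_batch performs can be followed on a single pair with the standalone sketch below. The id values for PAD, GO and EOS are assumed to match `data_utils` (0, 1 and 2). The helper names `pick_bucket` and `pad_pair`, the bucket list and the token ids are made up for the illustration.

```python
PAD_ID, GO_ID, EOS_ID = 0, 1, 2  # assumed to match data_utils

def pick_bucket(buckets, source_ids, target_ids):
    """Return the index of the first bucket that is large enough for the pair."""
    for bucket_id, (source_size, target_size) in enumerate(buckets):
        if len(source_ids) < source_size and len(target_ids) < target_size:
            return bucket_id
    return None  # the pair is too long for every bucket

def pad_pair(buckets, bucket_id, source_ids, target_ids):
    """Pad and reverse the encoder input; prepend GO to the decoder input and pad it."""
    encoder_size, decoder_size = buckets[bucket_id]
    encoder_pad = [PAD_ID] * (encoder_size - len(source_ids))
    encoder_input = list(reversed(source_ids + encoder_pad))
    decoder_pad = [PAD_ID] * (decoder_size - len(target_ids) - 1)
    decoder_input = [GO_ID] + target_ids + decoder_pad
    return encoder_input, decoder_input

buckets = [(5, 10), (10, 15)]
source, target = [11, 12, 13], [21, 22, EOS_ID]
bucket_id = pick_bucket(buckets, source, target)
encoder_input, decoder_input = pad_pair(buckets, bucket_id, source, target)
print(bucket_id, encoder_input, decoder_input)
```

Padding to the size of the chosen bucket rather than to the longest sentence in the whole dataset keeps short exchanges from being padded out to 40 or 50 tokens.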
Train the model. Set mode to "train" in seq2seq.ini and run execute.py.
Validate the model. Set mode to "test" in seq2seq.ini and run execute.py.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import os
import random
import sys
import time

import numpy as np
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf

import data_utils
import seq2seq_model

try:
    from ConfigParser import SafeConfigParser
except ImportError:
    # Python 3 renamed the module to configparser, and SafeConfigParser was
    # removed in Python 3.12; ConfigParser is its replacement.
    from configparser import ConfigParser as SafeConfigParser

gConfig = {}


def get_config(config_file='seq2seq.ini'):
    parser = SafeConfigParser()
    parser.read(config_file)
    # Get the ints, floats and strings.
    _conf_ints = [(key, int(value)) for key, value in parser.items('ints')]
    _conf_floats = [(key, float(value)) for key, value in parser.items('floats')]
    _conf_strings = [(key, str(value)) for key, value in parser.items('strings')]
    return dict(_conf_ints + _conf_floats + _conf_strings)


# We use a number of buckets and pad to the closest one for efficiency.
# See seq2seq_model.Seq2SeqModel for details of how they work.
_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]


def read_data(source_path, target_path, max_size=None):
    """Read data from source and target files and put into buckets.

    Args:
      source_path: path to the files with token-ids for the source language.
      target_path: path to the file with token-ids for the target language;
        it must be aligned with the source file: n-th line contains the desired
        output for n-th line from the source_path.
      max_size: maximum number of lines to read, all other will be ignored;
        if 0 or None, data files will be read completely (no limit).

    Returns:
      data_set: a list of length len(_buckets); data_set[n] contains a list of
        (source, target) pairs read from the provided data files that fit
        into the n-th bucket, i.e., such that len(source) < _buckets[n][0] and
        len(target) < _buckets[n][1]; source and target are lists of token-ids.
    """
    data_set = [[] for _ in _buckets]
    with tf.gfile.GFile(source_path, mode="r") as source_file:
        with tf.gfile.GFile(target_path, mode="r") as target_file:
            source, target = source_file.readline(), target_file.readline()
            counter = 0
            while source and target and (not max_size or counter < max_size):
                counter += 1
                if counter % 100000 == 0:
                    print("  reading data line %d" % counter)
                    sys.stdout.flush()
                source_ids = [int(x) for x in source.split()]
                target_ids = [int(x) for x in target.split()]
                target_ids.append(data_utils.EOS_ID)
                for bucket_id, (source_size, target_size) in enumerate(_buckets):
                    if len(source_ids) < source_size and len(target_ids) < target_size:
                        data_set[bucket_id].append([source_ids, target_ids])
                        break
                source, target = source_file.readline(), target_file.readline()
    return data_set


def create_model(session, forward_only):
    """Create model and initialize or load parameters."""
    model = seq2seq_model.Seq2SeqModel(
        gConfig['enc_vocab_size'], gConfig['dec_vocab_size'], _buckets,
        gConfig['layer_size'], gConfig['num_layers'],
        gConfig['max_gradient_norm'], gConfig['batch_size'],
        gConfig['learning_rate'], gConfig['learning_rate_decay_factor'],
        forward_only=forward_only)

    if 'pretrained_model' in gConfig:
        model.saver.restore(session, gConfig['pretrained_model'])
        return model

    ckpt = tf.train.get_checkpoint_state(gConfig['working_directory'])
    if ckpt and ckpt.model_checkpoint_path:
        print("Reading model parameters from %s" % ckpt.model_checkpoint_path)
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        print("Created model with fresh parameters.")
        session.run(tf.global_variables_initializer())
    return model


def train():
    # Prepare the dataset.
    print("Preparing data in %s" % gConfig['working_directory'])
    enc_train, dec_train, enc_dev, dec_dev, _, _ = data_utils.prepare_custom_data(
        gConfig['working_directory'], gConfig['train_enc'], gConfig['train_dec'],
        gConfig['test_enc'], gConfig['test_dec'],
        gConfig['enc_vocab_size'], gConfig['dec_vocab_size'])

    # Set up the session config to use the BFC allocator.
    config = tf.ConfigProto()
    config.gpu_options.allocator_type = 'BFC'

    with tf.Session(config=config) as sess:
        # Create model.
        print("Creating %d layers of %d units."
              % (gConfig['num_layers'], gConfig['layer_size']))
        model = create_model(sess, False)

        # Read data into buckets and compute their sizes.
        print("Reading development and training data (limit: %d)."
              % gConfig['max_train_data_size'])
        dev_set = read_data(enc_dev, dec_dev)
        train_set = read_data(enc_train, dec_train, gConfig['max_train_data_size'])
        train_bucket_sizes = [len(train_set[b]) for b in xrange(len(_buckets))]
        train_total_size = float(sum(train_bucket_sizes))

        # A bucket scale is a list of increasing numbers from 0 to 1 that we'll use
        # to select a bucket. Length of [scale[i], scale[i+1]] is proportional to
        # the size of the i-th training bucket, as used later.
        train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                               for i in xrange(len(train_bucket_sizes))]

        # This is the training loop.
        step_time, loss = 0.0, 0.0
        current_step = 0
        previous_losses = []
        while True:
            # Choose a bucket according to data distribution. We pick a random number
            # in [0, 1] and use the corresponding interval in train_buckets_scale.
            random_number_01 = np.random.random_sample()
            bucket_id = min([i for i in xrange(len(train_buckets_scale))
                             if train_buckets_scale[i] > random_number_01])

            # Get a batch and make a step.
            start_time = time.time()
            encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                train_set, bucket_id)
            _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                         target_weights, bucket_id, False)
            step_time += (time.time() - start_time) / gConfig['steps_per_checkpoint']
            loss += step_loss / gConfig['steps_per_checkpoint']
            current_step += 1

            # Once in a while, we save checkpoint, print statistics, and run evals.
            if current_step % gConfig['steps_per_checkpoint'] == 0:
                # Print statistics for the previous epoch.
                perplexity = math.exp(loss) if loss < 300 else float('inf')
                print("global step %d learning rate %.4f step-time %.2f perplexity "
                      "%.2f" % (model.global_step.eval(), model.learning_rate.eval(),
                                step_time, perplexity))
                # Decrease learning rate if no improvement was seen over last 3 times.
                if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
                    sess.run(model.learning_rate_decay_op)
                previous_losses.append(loss)
                # Save checkpoint and zero timer and loss.
                checkpoint_path = os.path.join(gConfig['working_directory'],
                                               "seq2seq.ckpt")
                model.saver.save(sess, checkpoint_path, global_step=model.global_step)
                step_time, loss = 0.0, 0.0
                # Run evals on development set and print their perplexity.
                for bucket_id in xrange(len(_buckets)):
                    if len(dev_set[bucket_id]) == 0:
                        print("  eval: empty bucket %d" % (bucket_id))
                        continue
                    encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                        dev_set, bucket_id)
                    _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                                 target_weights, bucket_id, True)
                    eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
                    print("  eval: bucket %d perplexity %.2f" % (bucket_id, eval_ppx))
                sys.stdout.flush()


def decode():
    with tf.Session() as sess:
        # Create model and load parameters.
        model = create_model(sess, True)
        model.batch_size = 1  # We decode one sentence at a time.

        # Load vocabularies.
        enc_vocab_path = os.path.join(gConfig['working_directory'],
                                      "vocab%d.enc" % gConfig['enc_vocab_size'])
        dec_vocab_path = os.path.join(gConfig['working_directory'],
                                      "vocab%d.dec" % gConfig['dec_vocab_size'])
        enc_vocab, _ = data_utils.initialize_vocabulary(enc_vocab_path)
        _, rev_dec_vocab = data_utils.initialize_vocabulary(dec_vocab_path)

        # Decode from standard input.
        sys.stdout.write("> ")
        sys.stdout.flush()
        sentence = sys.stdin.readline()
        while sentence:
            # Get token-ids for the input sentence.
            token_ids = data_utils.sentence_to_token_ids(
                tf.compat.as_bytes(sentence), enc_vocab)
            # Which bucket does it belong to?
            bucket_id = min([b for b in xrange(len(_buckets))
                             if _buckets[b][0] > len(token_ids)])
            # Get a 1-element batch to feed the sentence to the model.
            encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                {bucket_id: [(token_ids, [])]}, bucket_id)
            # Get output logits for the sentence.
            _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                             target_weights, bucket_id, True)
            # This is a greedy decoder - outputs are just argmaxes of output_logits.
            outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
            # If there is an EOS symbol in outputs, cut them at that point.
            if data_utils.EOS_ID in outputs:
                outputs = outputs[:outputs.index(data_utils.EOS_ID)]
            # Print out the reply sentence corresponding to outputs.
            print(" ".join([tf.compat.as_str(rev_dec_vocab[output])
                            for output in outputs]))
            print("> ", end="")
            sys.stdout.flush()
            sentence = sys.stdin.readline()


def self_test():
    """Test the translation model."""
    with tf.Session() as sess:
        print("Self-test for neural translation model.")
        # Create model with vocabularies of 10, 2 small buckets, 2 layers of 32.
        model = seq2seq_model.Seq2SeqModel(10, 10, [(3, 3), (6, 6)], 32, 2,
                                           5.0, 32, 0.3, 0.99, num_samples=8)
        sess.run(tf.global_variables_initializer())

        # Fake data set for both the (3, 3) and (6, 6) bucket.
        data_set = ([([1, 1], [2, 2]), ([3, 3], [4]), ([5], [6])],
                    [([1, 1, 1, 1, 1], [2, 2, 2, 2, 2]), ([3, 3, 3], [5, 6])])
        for _ in xrange(5):  # Train the fake model for 5 steps.
            bucket_id = random.choice([0, 1])
            encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                data_set, bucket_id)
            model.step(sess, encoder_inputs, decoder_inputs, target_weights,
                       bucket_id, False)


def init_session(sess, conf='seq2seq.ini'):
    global gConfig
    gConfig = get_config(conf)

    # Create model and load parameters.
    model = create_model(sess, True)
    model.batch_size = 1  # We decode one sentence at a time.

    # Load vocabularies.
    enc_vocab_path = os.path.join(gConfig['working_directory'],
                                  "vocab%d.enc" % gConfig['enc_vocab_size'])
    dec_vocab_path = os.path.join(gConfig['working_directory'],
                                  "vocab%d.dec" % gConfig['dec_vocab_size'])
    enc_vocab, _ = data_utils.initialize_vocabulary(enc_vocab_path)
    _, rev_dec_vocab = data_utils.initialize_vocabulary(dec_vocab_path)
    return sess, model, enc_vocab, rev_dec_vocab


def decode_line(sess, model, enc_vocab, rev_dec_vocab, sentence):
    # Get token-ids for the input sentence.
    token_ids = data_utils.sentence_to_token_ids(
        tf.compat.as_bytes(sentence), enc_vocab)
    # Which bucket does it belong to?
    bucket_id = min([b for b in xrange(len(_buckets))
                     if _buckets[b][0] > len(token_ids)])
    # Get a 1-element batch to feed the sentence to the model.
    encoder_inputs, decoder_inputs, target_weights = model.get_batch(
        {bucket_id: [(token_ids, [])]}, bucket_id)
    # Get output logits for the sentence.
    _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                     target_weights, bucket_id, True)
    # This is a greedy decoder - outputs are just argmaxes of output_logits.
    outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
    # If there is an EOS symbol in outputs, cut them at that point.
    if data_utils.EOS_ID in outputs:
        outputs = outputs[:outputs.index(data_utils.EOS_ID)]
    return " ".join([tf.compat.as_str(rev_dec_vocab[output]) for output in outputs])


if __name__ == '__main__':
    if len(sys.argv) > 1:
        # Use the config file given on the command line.
        gConfig = get_config(sys.argv[1])
    else:
        # Get configuration from seq2seq.ini.
        gConfig = get_config()

    print('\n>> Mode : %s\n' % (gConfig['mode']))

    if gConfig['mode'] == 'train':
        # Start training.
        train()
    elif gConfig['mode'] == 'test':
        # Interactive decode.
        decode()
    else:
        # "serve" mode is not started from this script.
        # Use : >> python ui/app.py
        # It uses seq2seq_serve.ini as conf file.
        print('Serve Usage : >> python ui/app.py')
        print('# uses seq2seq_serve.ini as conf file')
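The greedy decoding used in decode and decode_line can be isolated into a short, TensorFlow-free sketch: take the argmax of each step's logits, cut the sequence at the first EOS, and map ids back to words. The vocabulary and logits are made up, and `EOS_ID = 2` is assumed to match `data_utils`.

```python
import numpy as np

EOS_ID = 2  # assumed to match data_utils.EOS_ID

def greedy_decode(output_logits, rev_vocab):
    """Pick the highest-scoring token at each step and stop at the first EOS."""
    token_ids = [int(np.argmax(logit)) for logit in output_logits]
    if EOS_ID in token_ids:
        token_ids = token_ids[:token_ids.index(EOS_ID)]
    return " ".join(rev_vocab[token_id] for token_id in token_ids)

# Made-up reverse vocabulary and per-step logits.
rev_vocab = ["_PAD", "_GO", "_EOS", "_UNK", "hello", "there"]
logits = [
    np.array([0.1, 0.0, 0.2, 0.0, 3.0, 0.5]),  # argmax 4: "hello"
    np.array([0.0, 0.0, 0.1, 0.0, 0.4, 2.5]),  # argmax 5: "there"
    np.array([0.0, 0.0, 4.0, 0.0, 0.1, 0.1]),  # argmax 2: _EOS
    np.array([0.0, 0.0, 0.0, 0.0, 5.0, 0.0]),  # after EOS, ignored
]
print(greedy_decode(logits, rev_vocab))  # prints "hello there"
```

Greedy decoding commits to the best token at each step and never revisits a choice. A beam search would keep several candidate replies alive instead, at the cost of more computation.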
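Combining a text-based chatbot with speech recognition yields a robot that can hold a spoken conversation directly. The system architecture is:
person -> speech recognition (ASR) -> natural language understanding (NLU) -> dialogue management -> natural language generation (NLG) -> speech synthesis (TTS) -> person. Source: Communications of the Chinese Association for Artificial Intelligence (中國人工智能學會通訊), 2016, Vol. 6, No. 1.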
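The five stages of this pipeline can be wired together as plain functions. Every stage below is a hypothetical stub written only to show the data flow (the "audio" is a dict that already carries its transcript, the NLU is a keyword check, the NLG is a template lookup). None of this comes from a real ASR or TTS library.

```python
def recognize_speech(audio):
    """ASR stub: pretend the audio has already been transcribed."""
    return audio["transcript"]

def understand(text):
    """NLU stub: keyword-based intent detection."""
    intent = "greet" if "hello" in text.lower() else "unknown"
    return {"intent": intent, "text": text}

def manage_dialogue(history, frame):
    """Dialogue management stub: record the intent and choose the next action."""
    history.append(frame["intent"])
    return "say_hello" if frame["intent"] == "greet" else "ask_rephrase"

def generate_reply(action):
    """NLG stub: look the action up in a template table."""
    templates = {
        "say_hello": "Hello! How can I help?",
        "ask_rephrase": "Sorry, could you rephrase that?",
    }
    return templates[action]

def synthesize(text):
    """TTS stub: return the text tagged as if it were audio."""
    return {"audio_for": text}

def respond(audio, history):
    """Run one turn: ASR -> NLU -> dialogue management -> NLG -> TTS."""
    text = recognize_speech(audio)
    frame = understand(text)
    action = manage_dialogue(history, frame)
    return synthesize(generate_reply(action))

history = []
reply = respond({"transcript": "Hello there"}, history)
print(reply["audio_for"])
```

In a system like the one built above, the seq2seq model would take over the NLU, dialogue management and NLG stages, and a speech recognizer would supply the text.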
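Turing Robot works on improving dialogue and semantic accuracy and on making its system more intelligent in Chinese-language contexts. Emotibot (竹間智能科技) researches emotional robots with memory and self-learning, aiming for robots that truly understand multimodal, multichannel information, respond in a highly human-like way, and communicate in the most natural form of language exchange. Tencent has social conversation data: WeChat is the largest corpus of natural-language exchanges, and by using this vast real-world data together with mini programs it aims to become the entry point to all services.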
References:
《TensorFlow技術解析與實戰》 (TensorFlow: Technical Analysis and Practice)