Without further ado, let's start with the figures.
[Figure: seq2seq in NMT]
[Figure: seq2seq with attention added]
The basic seq2seq model:
outputs, states = basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell)
Here, encoder_inputs is the input to the encoder and decoder_inputs is the input to the decoder.
The feed_previous parameter determines whether the decoder is fed the decoder inputs or its own output from the previous time step. Typically the decoder inputs are used during training, and the previous step's output is used during inference.
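As a minimal sketch of what those inputs look like (the sequence lengths, batch size, input size, and cell size below are illustrative assumptions, not values from the text), basic_rnn_seq2seq expects one 2-D tensor per time step:

import tensorflow as tf

# Illustrative sizes only (assumptions for this sketch).
encoder_length, decoder_length = 5, 6
batch_size, input_size = 64, 32

# basic_rnn_seq2seq takes lists of [batch_size, input_size] tensors,
# one list entry per time step.
encoder_inputs = [tf.placeholder(tf.float32, [batch_size, input_size])
                  for _ in range(encoder_length)]
decoder_inputs = [tf.placeholder(tf.float32, [batch_size, input_size])
                  for _ in range(decoder_length)]

cell = tf.contrib.rnn.GRUCell(128)
outputs, states = tf.contrib.legacy_seq2seq.basic_rnn_seq2seq(
    encoder_inputs, decoder_inputs, cell)
# outputs: a list of decoder_length tensors, each [batch_size, 128]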
output_projection:
A parameter used in the decoder. If it is not specified, the model's outputs have shape [batch_size, num_decoder_symbols]. When num_decoder_symbols is very large, the model instead computes the loss on a smaller tensor of shape [batch_size, num_samples] (using a sampled softmax loss function) and then maps the outputs back to the full-size tensor through output_projection.
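A hedged sketch of that pattern, in the style of the old TensorFlow translate tutorial (the sizes and the proj_w/proj_b names are assumptions): the projection weights double as the sampled-softmax weights, and the resulting softmax_loss_function is the one passed to model_with_buckets below.

import tensorflow as tf

size = 512                   # RNN hidden size (assumed)
num_decoder_symbols = 40000  # target vocabulary size (assumed)
num_samples = 512            # number of sampled classes per step (assumed)

# Projection from the hidden size back to the full vocabulary.
w = tf.get_variable("proj_w", [size, num_decoder_symbols])
b = tf.get_variable("proj_b", [num_decoder_symbols])
output_projection = (w, b)

def sampled_loss(labels, logits):
    # `logits` here are the raw decoder outputs of width `size`; the
    # projection to the vocabulary happens inside sampled_softmax_loss.
    labels = tf.reshape(labels, [-1, 1])
    return tf.nn.sampled_softmax_loss(
        weights=tf.transpose(w),
        biases=b,
        labels=labels,
        inputs=logits,
        num_sampled=num_samples,
        num_classes=num_decoder_symbols)

softmax_loss_function = sampled_loss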
bucket:
The mechanism used for variable-length seq2seq: inputs and targets are placed into buckets of different lengths, which avoids having to build a new graph for every possible length combination.
At each SGD update, one bucket is sampled at random from all buckets according to a probability distribution, batch_size training examples are drawn at random from that bucket, and the parameters of the corresponding sub-graph are optimized; the sub-graphs share their weights.
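The sampling code below refers to a BUCKETS list that the excerpt never defines; it is simply a list of (max encoder length, max decoder length) pairs, for example (values purely illustrative):

BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

Each training pair then goes into the smallest bucket it fits, padded up to that bucket's lengths.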
import numpy as np

# Size of each bucket and its cumulative selection probability.
train_bucket_sizes = [len(train_data.inputs[b]) for b in xrange(len(BUCKETS))]
train_total_size = float(sum(train_bucket_sizes))
train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                       for i in xrange(len(train_bucket_sizes))]

for step in xrange(num_train_steps):  # num_train_steps: assumed training-loop bound
    # Pick a bucket at random according to the data distribution.
    random_number_01 = np.random.random_sample()
    bucket_id = min([i for i in xrange(len(train_buckets_scale))
                     if train_buckets_scale[i] > random_number_01])
# Builds one attention seq2seq sub-graph; do_decode toggles feed_previous.
def seq2seq_f(encoder_inputs,
              decoder_inputs,
              cell,
              num_encoder_symbols,
              num_decoder_symbols,
              embedding_size,
              output_projection,
              do_decode):
    return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
        encoder_inputs,
        decoder_inputs,
        cell,
        num_encoder_symbols,
        num_decoder_symbols,
        embedding_size,
        output_projection=output_projection,
        feed_previous=do_decode)
outputs, losses = tf.contrib.legacy_seq2seq.model_with_buckets(
    encoder_inputs,
    decoder_inputs,
    targets,
    target_weights,
    buckets,
    lambda x, y: seq2seq_f(x, y,
                           cell,
                           num_encoder_symbols,
                           num_decoder_symbols,
                           embedding_size,
                           output_projection,
                           False),
    softmax_loss_function=softmax_loss_function)
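model_with_buckets returns one list of per-step outputs and one scalar loss per bucket. A hedged sketch of how the per-bucket losses are typically turned into training ops (the optimizer choice and learning rate are assumptions, not from the text); at each step only the op belonging to the sampled bucket_id is run, and all sub-graphs share the same variables:

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5)
train_ops = [optimizer.minimize(loss) for loss in losses]  # one op per bucket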