1. Load the data
# English source data
with open("data/small_vocab_en", "r", encoding="utf-8") as f:
source_text = f.read()
# French target data
with open("data/small_vocab_fr", "r", encoding="utf-8") as f:
target_text = f.read()
2. Inspect the data
# Statistics for the English corpus
source_sentences = source_text.split('\n')
source_word_counts = [len(sentence.split()) for sentence in source_sentences]
# Statistics for the French corpus
target_sentences = target_text.split('\n')
target_word_counts = [len(sentence.split()) for sentence in target_sentences]
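A few print statements are enough to get a feel for the corpus size and vocabulary. The lines below are an illustrative sketch (not part of the original notebook), using only the standard-library Counter:

from collections import Counter

print("English sentences:", len(source_text.split('\n')))
print("French sentences:", len(target_text.split('\n')))
print("Unique English words:", len(Counter(source_text.lower().split())))
print("Unique French words:", len(Counter(target_text.lower().split())))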
3. Data preprocessing
3.1 Build the vocabularies
# Build the English vocabulary
source_vocab = list(set(source_text.lower().split()))
# Build the French vocabulary
target_vocab = list(set(target_text.lower().split()))
3.2 Add special tokens
# Special tokens
SOURCE_CODES = ['<PAD>', '<UNK>']
TARGET_CODES = ['<PAD>', '<EOS>', '<UNK>', '<GO>']
3.3 Mapping tables between words and ids
# Build the mapping tables for the English corpus
source_vocab_to_int = {word: idx for idx, word in enumerate(SOURCE_CODES + source_vocab)}
source_int_to_vocab = {idx: word for idx, word in enumerate(SOURCE_CODES + source_vocab)}
# Build the mapping tables for the French corpus
target_vocab_to_int = {word: idx for idx, word in enumerate(TARGET_CODES + target_vocab)}
target_int_to_vocab = {idx: word for idx, word in enumerate(TARGET_CODES + target_vocab)}
3.4 Convert text to ints
def text_to_int(sentence, map_dict, max_length=20, is_target=False):
    """Convert a sentence to a list of ids, padded with <PAD> to max_length."""
    text_to_idx = []
    # ids of the special tokens
    unk_idx = map_dict.get("<UNK>")
    pad_idx = map_dict.get("<PAD>")
    eos_idx = map_dict.get("<EOS>")

    # For source sentences, simply map each word to its id
    if not is_target:
        for word in sentence.lower().split():
            text_to_idx.append(map_dict.get(word, unk_idx))
    # For target sentences, additionally append <EOS> at the end
    else:
        for word in sentence.lower().split():
            text_to_idx.append(map_dict.get(word, unk_idx))
        text_to_idx.append(eos_idx)

    # Truncate if the sequence is too long
    if len(text_to_idx) > max_length:
        return text_to_idx[:max_length]
    # Otherwise pad with <PAD> up to max_length
    else:
        text_to_idx = text_to_idx + [pad_idx] * (max_length - len(text_to_idx))
        return text_to_idx
import numpy as np
import tqdm
from keras.utils import to_categorical  # keras.utils.to_categorical in Keras 2.x

# Convert the source sentences, Tx = 20
source_text_to_int = []
for sentence in tqdm.tqdm(source_text.split("\n")):
    source_text_to_int.append(text_to_int(sentence, source_vocab_to_int, 20,
                                          is_target=False))
# Convert the target sentences, Ty = 25
target_text_to_int = []
for sentence in tqdm.tqdm(target_text.split("\n")):
    target_text_to_int.append(text_to_int(sentence, target_vocab_to_int, 25,
                                          is_target=True))
X = np.array(source_text_to_int)
Y = np.array(target_text_to_int)
# One-hot encode X and Y
Xoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(source_vocab_to_int)), X)))
Yoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(target_vocab_to_int)), Y)))
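As a quick sanity check (illustrative, not from the original notebook), the resulting arrays should have shapes (num_sentences, Tx) and (num_sentences, Tx, source_vocab_size), and likewise for the target side:

print(X.shape, Xoh.shape)   # (num_sentences, 20)  (num_sentences, 20, len(source_vocab_to_int))
print(Y.shape, Yoh.shape)   # (num_sentences, 25)  (num_sentences, 25, len(target_vocab_to_int))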
4. Build the model
As in the previous post, the encoder embeds the input into dense vectors, an LSTM compresses them into a fixed-length vector S, and S is fed into the decoder to generate the new sequence. The model therefore breaks down into four parts:
- Model inputs: model_inputs
- Encoder: encoder_layer
- Decoder: input side decoder_layer_inputs / training decoder_layer_train / inference decoder_layer_inference
- The assembled seq2seq model
The full code can be carried over from the previous post; a compressed sketch of the same structure is given below.
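For reference, this is a minimal Keras sketch of the encoder-decoder just described, trained with teacher forcing. It is not the exact code from the previous post; the layer sizes simply reuse the hyperparameters listed in section 5 below.

from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

Tx, Ty = 20, 25
rnn_size = 128                                  # matches section 5 below
encoder_embedding_size = decoder_embedding_size = 100

# Encoder: embed the source ids and keep only the final LSTM state as S
encoder_inputs = Input(shape=(Tx,), name="encoder_inputs")
enc_emb = Embedding(len(source_vocab_to_int), encoder_embedding_size)(encoder_inputs)
_, state_h, state_c = LSTM(rnn_size, return_state=True)(enc_emb)

# Decoder: at training time it reads the target sequence shifted right with <GO>
# (teacher forcing) and is initialised with the encoder state S
decoder_inputs = Input(shape=(Ty,), name="decoder_inputs")
dec_emb = Embedding(len(target_vocab_to_int), decoder_embedding_size)(decoder_inputs)
dec_outputs = LSTM(rnn_size, return_sequences=True)(dec_emb,
                                                    initial_state=[state_h, state_c])
outputs = Dense(len(target_vocab_to_int), activation="softmax")(dec_outputs)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")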
5. Model prediction and hyperparameter tuning
epochs = 10
batch_size = 128
rnn_size = 128
rnn_num_layers = 1
encoder_embedding_size = 100
decoder_embedding_size = 100
learning_rate = 0.001
# print results every 50 training steps
display_step = 50
We train for 10 epochs with a single-layer LSTM; the encoder and decoder embeddings are both 100-dimensional, and results are printed every 50 training steps. The corpus is small, roughly 130,000 sentence pairs, which is quite limited for a translation model that depends so heavily on data, and because of that the data was not split into training and test sets.
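Assuming the Keras sketch from section 4, training with these settings would look roughly like the lines below; decoder_input (the target shifted right with a leading <GO>) is a helper built here purely for illustration.

go_idx = target_vocab_to_int["<GO>"]
# Teacher-forcing input: prepend <GO> and drop the last token of each target
decoder_input = np.concatenate(
    [np.full((Y.shape[0], 1), go_idx), Y[:, :-1]], axis=1)

model.fit([X, decoder_input], Yoh, batch_size=batch_size, epochs=epochs)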
The final loss settles at around 0.01.
Using a BiLSTM encoder would capture more context from both directions, and adding attention would let the decoder use a different context vector S for each output word, which improves decoding accuracy; a sketch of the BiLSTM swap follows.
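A hedged sketch of swapping the encoder LSTM for a BiLSTM (illustrative only; the decoder LSTM would then need 2 * rnn_size units to accept the concatenated states):

from keras.layers import Bidirectional, Concatenate

enc_out, fh, fc, bh, bc = Bidirectional(LSTM(rnn_size, return_state=True))(enc_emb)
state_h = Concatenate()([fh, bh])   # concatenate forward and backward hidden states
state_c = Concatenate()([fc, bc])   # concatenate forward and backward cell states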