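JavaScript Type Inference (3): A Walkthrough of the Algorithm Model
Building the Training Model
In the previous section we covered how to generate the training, test, and validation sets, and how to generate the vocabularies.
These five files are the basic material for training: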
import cntk as C
import numpy as np

files = {
    'train': { 'file': 'data/train.ctf', 'location': 0 },
    'valid': { 'file': 'data/valid.ctf', 'location': 0 },
    'test': { 'file': 'data/test.ctf', 'location': 0 },
    'source': { 'file': 'data/source_wl', 'location': 1 },
    'target': { 'file': 'data/target_wl', 'location': 1 }
}
The vocabularies need a format conversion first: we load them into hash tables that map each entry to its index:
# load dictionaries
with open(files['source']['file']) as f:
    source_wl = [line.rstrip('\n') for line in f]
with open(files['target']['file']) as f:
    target_wl = [line.rstrip('\n') for line in f]
source_dict = {word: i for i, word in enumerate(source_wl)}
target_dict = {word: i for i, word in enumerate(target_wl)}
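As a minimal sketch of what these dictionaries are used for, the snippet below maps tokens to indices and builds the one-hot rows that an input of width vocab_size represents. The tiny word list and the helper name `one_hot` are assumptions for illustration only; the real entries come from data/source_wl.

```python
import numpy as np

# Hypothetical stand-in for the contents of data/source_wl (one token per line).
demo_wl = ["<unk>", "var", "x", "=", "1"]
demo_dict = {word: i for i, word in enumerate(demo_wl)}

def one_hot(tokens, word_to_index):
    """Return a (len(tokens), vocab_size) float32 matrix with a single 1 per row."""
    out = np.zeros((len(tokens), len(word_to_index)), dtype=np.float32)
    for row, tok in enumerate(tokens):
        out[row, word_to_index[tok]] = 1.0
    return out

encoded = one_hot(["var", "x", "=", "1"], demo_dict)
print(encoded.argmax(axis=1))  # the index of each token in the vocabulary
```

In the actual pipeline the CTF files already store these indices in sparse form, so no dense matrix of this kind needs to be built by hand.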
Next come some global parameters:
# number of words in the source vocabulary and number of target labels
vocab_size = len(source_dict)
num_labels = len(target_dict)
epoch_size = 17.955*1000*1000
minibatch_size = 5000
emb_dim = 300
hidden_dim = 650
num_epochs = 10
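To get a feel for these numbers, here is a rough back-of-the-envelope sketch. It assumes every minibatch delivers exactly minibatch_size samples, which need not hold in practice: the training loop later counts the samples actually returned, so the real figure may differ.

```python
import math

epoch_size = round(17.955 * 1000 * 1000)  # samples per epoch, rounded to an integer
minibatch_size = 5000                      # samples requested per minibatch
num_epochs = 10

# Assumption: each minibatch is completely full.
minibatches_per_epoch = math.ceil(epoch_size / minibatch_size)
total_samples = epoch_size * num_epochs

print(minibatches_per_epoch, total_samples)
```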
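Next we define three inputs, x, y, and t, whose sizes follow the input vocabulary, the number of output labels, and the hidden layer dimension, respectively:
# Create the containers for the input feature (x), the label (y), and the enhancement input (t)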
x = C.sequence.input_variable(vocab_size, name="x")
y = C.sequence.input_variable(num_labels, name="y")
t = C.sequence.input_variable(hidden_dim, name="t")
Now let's look at the overall training flow:
model = create_model()
enc, dec = model(x, t)
trainer = create_trainer()
train()
The Training Model
First comes a word embedding layer:
def create_model():
    embed = C.layers.Embedding(emb_dim, name='embed')
Then come two bidirectional recurrent networks (built from GRUs), a fully connected layer, and a dropout:
    encoder = BiRecurrence(C.layers.GRU(hidden_dim//2), C.layers.GRU(hidden_dim//2))
    recoder = BiRecurrence(C.layers.GRU(hidden_dim//2), C.layers.GRU(hidden_dim//2))
    project = C.layers.Dense(num_labels, name='classify')
    do = C.layers.Dropout(0.5)
Then we combine the pieces above:
    def recode(x, t):
        inp = embed(x)
        inp = C.layers.LayerNormalization()(inp)
        enc = encoder(inp)
        # t has the same width as enc (hidden_dim), so the two can be added
        rec = recoder(enc + t)
        proj = project(do(rec))
        dec = C.ops.softmax(proj)
        return enc, dec
    return recode
The bidirectional recurrent network is defined as follows:
def BiRecurrence(fwd, bwd):
    F = C.layers.Recurrence(fwd)
    G = C.layers.Recurrence(bwd, go_backwards=True)
    x = C.placeholder()
    # concatenate the forward and backward outputs at every position
    apply_x = C.splice(F(x), G(x))
    return apply_x
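The idea behind BiRecurrence can be sketched in plain NumPy. This is an illustration under stated assumptions, not the CNTK implementation: a simple tanh RNN cell stands in for the GRU, and the names `simple_rnn` and `bi_recurrence` are hypothetical. It shows why two cells of width hidden_dim//2 yield an output of width hidden_dim.

```python
import numpy as np

def simple_rnn(xs, W, U, reverse=False):
    """Run h_t = tanh(W x_t + U h_prev) over a sequence (a plain RNN, not a GRU)."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    h = np.zeros(U.shape[0])
    out = np.zeros((len(xs), U.shape[0]))
    for t in order:
        h = np.tanh(W @ xs[t] + U @ h)
        out[t] = h  # stored at the original position, also for the backward pass
    return out

def bi_recurrence(xs, fwd_params, bwd_params):
    """Concatenate forward and backward states per step (the role of C.splice)."""
    f = simple_rnn(xs, *fwd_params)
    b = simple_rnn(xs, *bwd_params, reverse=True)
    return np.concatenate([f, b], axis=1)

rng = np.random.default_rng(0)
in_dim, half = 4, 3                  # half plays the role of hidden_dim // 2
xs = rng.normal(size=(5, in_dim))    # one sequence of 5 steps
fwd = (rng.normal(size=(half, in_dim)), rng.normal(size=(half, half)))
bwd = (rng.normal(size=(half, in_dim)), rng.normal(size=(half, half)))
ys = bi_recurrence(xs, fwd, bwd)
print(ys.shape)  # (steps, 2 * half)
```

Each output row therefore sees context from both the left (forward half) and the right (backward half) of its position.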
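Building the Training Process
First we define the loss function. It returns two parts: the cross-entropy loss and the classification error:
def criterion(model, labels):
    # cross-entropy between the one-hot labels and the softmax output
    ce = -C.reduce_sum(labels*C.ops.log(model))
    errs = C.classification_error(model, labels)
    return ce, errs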
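The two quantities can be checked by hand on a tiny example. The NumPy sketch below mirrors the formulas for two tokens and three labels; `criterion_np` is a hypothetical name, and it averages the error directly, whereas in CNTK the averaging is done by the trainer.

```python
import numpy as np

def criterion_np(probs, labels):
    """Cross-entropy summed over all entries, plus the fraction of wrong argmax picks."""
    ce = -np.sum(labels * np.log(probs))
    errs = np.mean(np.argmax(probs, axis=1) != np.argmax(labels, axis=1))
    return ce, errs

probs = np.array([[0.7, 0.2, 0.1],    # predicts label 0
                  [0.1, 0.8, 0.1]])   # predicts label 1
labels = np.array([[1.0, 0.0, 0.0],   # true label 0 (correct)
                   [0.0, 0.0, 1.0]])  # true label 2 (wrong)
ce, errs = criterion_np(probs, labels)
print(ce, errs)
```

Only the probability assigned to the true label enters the cross-entropy, so here it equals -(log 0.7 + log 0.1), and one of the two predictions is wrong.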
With the loss function in place, we train by gradient descent using the Adam algorithm with momentum:
def create_trainer():
    # clip(argmax(y), 0, 1) is 0 where the label index is 0 and 1 elsewhere
    masked_dec = dec*C.ops.clip(C.ops.argmax(y), 0, 1)
    loss, label_error = criterion(masked_dec, y)
    loss *= C.ops.clip(C.ops.argmax(y), 0, 1)
    lr_schedule = C.learning_parameter_schedule_per_sample([1e-3]*2 + [5e-4]*2 + [1e-4], epoch_size=int(epoch_size))
    momentum_as_time_constant = C.momentum_as_time_constant_schedule(1000)
    learner = C.adam(parameters=dec.parameters,
                     lr=lr_schedule,
                     momentum=momentum_as_time_constant,
                     gradient_clipping_threshold_per_sample=15,
                     gradient_clipping_with_truncation=True)
    progress_printer = C.logging.ProgressPrinter(tag='Training', num_epochs=num_epochs)
    trainer = C.Trainer(dec, (loss, label_error), learner, progress_printer)
    C.logging.log_number_of_parameters(dec)
    return trainer
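The factor C.ops.clip(C.ops.argmax(y), 0, 1) works as a mask: it is 0 where the label index is 0 and 1 for every other label, so positions labeled 0 contribute nothing to the loss. The NumPy sketch below shows the effect on made-up per-token losses. That index 0 stands for tokens with no type to predict is an assumption; the text does not say what the first entry of target_wl is.

```python
import numpy as np

labels = np.array([[1, 0, 0],    # label index 0
                   [0, 1, 0],    # label index 1
                   [0, 0, 1]],   # label index 2
                  dtype=np.float32)
per_token_loss = np.array([2.0, 0.5, 1.5], dtype=np.float32)  # made-up values

# Same expression as in create_trainer, written with NumPy.
mask = np.clip(np.argmax(labels, axis=1), 0, 1)
masked_loss = per_token_loss * mask
print(mask, masked_loss)
```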
Training
With the model defined, we can start training.
First, we use CNTK's io package (C.io) to define a data reader:
def create_reader(path, is_training):
    return C.io.MinibatchSource(C.io.CTFDeserializer(path, C.io.StreamDefs(
        source = C.io.StreamDef(field='S0', shape=vocab_size, is_sparse=True),
        slot_labels = C.io.StreamDef(field='S1', shape=num_labels, is_sparse=True)
    )), randomize=is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)
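For orientation, here is a sketch of what lines in such a CTF file look like when the two streams are named S0 and S1 and stored sparsely. It assumes the files produced in the previous section follow the usual CNTK text format (sequence id, then `|name index:value` per stream); the helper name `to_ctf_lines` and the index values are made up for illustration.

```python
def to_ctf_lines(seq_id, source_ids, target_ids):
    """Format one sequence as CTF lines: one line per token, sparse one-hot entries."""
    if len(source_ids) != len(target_ids):
        raise ValueError("source and target must have the same length")
    return [
        f"{seq_id} |S0 {s}:1 |S1 {t}:1"
        for s, t in zip(source_ids, target_ids)
    ]

for line in to_ctf_lines(0, [3, 7], [1, 0]):
    print(line)
```

All lines that share the same leading sequence id belong to one sequence, which is what lets the reader hand whole sequences to the recurrent layers.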
Then we can use this reader to load the data and start training:
def train():
    train_reader = create_reader(files['train']['file'], is_training=True)
    step = 0
    pp = C.logging.ProgressPrinter(freq=10, tag='Training')
    for epoch in range(num_epochs):
        epoch_end = (epoch+1) * epoch_size
        while step < epoch_end:
            data = train_reader.next_minibatch(minibatch_size, input_map={
                x: train_reader.streams.source,
                y: train_reader.streams.slot_labels
            })
            # Enhance data
            enhance_data(data, enc)
            # Train model
            trainer.train_minibatch(data)
            pp.update_with_trainer(trainer, with_metric=True)
            # count the samples actually delivered in this minibatch
            step += data[y].num_samples
        pp.epoch_summary(with_metric=True)
        trainer.save_checkpoint("models/model-" + str(epoch + 1) + ".cntk")
        validate()
    evaluate()
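The bookkeeping in this loop is worth a closer look: step is never reset, and an epoch ends as soon as the running sample count reaches (epoch+1) * epoch_size, so any overshoot carries into the next epoch. The sketch below reproduces only that counting with made-up numbers; `run_epochs` and the minibatch sizes are hypothetical.

```python
import itertools

def run_epochs(num_epochs, epoch_size, batch_sizes):
    """Mimic the loop's counters; batch_sizes yields the samples per minibatch."""
    step = 0
    batches_per_epoch = []
    for epoch in range(num_epochs):
        epoch_end = (epoch + 1) * epoch_size
        count = 0
        while step < epoch_end:
            step += next(batch_sizes)  # stands in for data[y].num_samples
            count += 1
        batches_per_epoch.append(count)
    return step, batches_per_epoch

sizes = itertools.cycle([40, 35])  # hypothetical, uneven minibatch sizes
final_step, per_epoch = run_epochs(num_epochs=3, epoch_size=100, batch_sizes=sizes)
print(final_step, per_epoch)
```

Because the overshoot is carried over, later epochs can take fewer minibatches than earlier ones even though epoch_size is fixed.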
In the code above, enhance_data needs some explanation.
Our data is not purely linear, so it needs an extra data-enhancement step. For each sequence, the function averages the encoder outputs over all positions that hold the same input token, and feeds that average back into the model as t:
def enhance_data(data, enc):
    guesses = enc.eval({x: data[x]})
    inputs = C.ops.argmax(x).eval({x: data[x]})
    tables = []
    for i in range(len(inputs)):
        ts = []
        table = {}   # token index -> summed encoder output
        counts = {}  # token index -> number of occurrences
        for j in range(len(inputs[i])):
            inp = int(inputs[i][j])
            if inp not in table:
                # copy so the in-place += below does not write into guesses
                table[inp] = guesses[i][j].copy()
                counts[inp] = 1
            else:
                table[inp] += guesses[i][j]
                counts[inp] += 1
        for inp in table:
            table[inp] /= counts[inp]
        for j in range(len(inputs[i])):
            inp = int(inputs[i][j])
            ts.append(table[inp])
        tables.append(np.array(np.float32(ts)))
    s = C.io.MinibatchSourceFromData(dict(t=(tables, C.layers.typing.Sequence[C.layers.typing.tensor])))
    mems = s.next_minibatch(minibatch_size)
    data[t] = mems[s.streams['t']]
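The core of enhance_data, stripped of the CNTK plumbing, can be sketched for a single sequence as follows. The function name `average_by_token` and the small vectors are made up for illustration; in the real code the vectors are the encoder outputs of width hidden_dim.

```python
import numpy as np

def average_by_token(token_ids, vectors):
    """Replace each position's vector by the mean over all positions with the same token id."""
    token_ids = np.asarray(token_ids)
    vectors = np.asarray(vectors, dtype=np.float32)
    out = np.empty_like(vectors)
    for tok in np.unique(token_ids):
        positions = token_ids == tok
        out[positions] = vectors[positions].mean(axis=0)
    return out

ids = [5, 9, 5]                                  # token 5 appears twice
vecs = [[1.0, 0.0], [4.0, 4.0], [3.0, 2.0]]
t_seq = average_by_token(ids, vecs)
print(t_seq)
```

Both occurrences of token 5 end up with the same averaged vector, so every use of an identifier within a sequence shares one summary of the contexts it appears in.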
Testing and Validation
Testing and validation also need the data-enhancement step described above:
def validate():
    valid_reader = create_reader(files['valid']['file'], is_training=False)
    while True:
        data = valid_reader.next_minibatch(minibatch_size, input_map={
            x: valid_reader.streams.source,
            y: valid_reader.streams.slot_labels
        })
        # an empty result means the single sweep over the file is finished
        if not data:
            break
        enhance_data(data, enc)
        trainer.test_minibatch(data)
    trainer.summarize_test_progress()
evaluate follows exactly the same logic as validate; only the file it reads differs:
def evaluate():
    test_reader = create_reader(files['test']['file'], is_training=False)
    while True:
        data = test_reader.next_minibatch(minibatch_size, input_map={
            x: test_reader.streams.source,
            y: test_reader.streams.slot_labels
        })
        if not data:
            break
        # Enhance data
        enhance_data(data, enc)
        # Test model
        trainer.test_minibatch(data)
    trainer.summarize_test_progress()
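Both loops accumulate a metric minibatch by minibatch and summarize it once at the end. As a rough sketch of one way such a summary can be formed, the snippet below combines per-minibatch average errors into an overall error weighted by sample count. This is an assumption for illustration, not a description of CNTK's internals; `aggregate_error` and the numbers are hypothetical.

```python
def aggregate_error(batches):
    """batches: iterable of (average_error, num_samples) pairs, one per minibatch."""
    total_errors = 0.0
    total_samples = 0
    for avg_err, n in batches:
        total_errors += avg_err * n
        total_samples += n
    return total_errors / total_samples if total_samples else float("nan")

# A full minibatch and a smaller final one, with made-up error rates.
overall = aggregate_error([(0.10, 5000), (0.20, 2500)])
print(overall)
```

Weighting by sample count matters because the last minibatch of a sweep is usually smaller than the others, and a plain mean of the per-minibatch averages would overweight it.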