[tf] Using tf.data in NLP tasks

Reading data

A good approach is to read data through tf.data.Dataset.from_generator, because it accepts any Python iterator, which makes it easier to preprocess the data flexibly.

def generator_fn():
    for digit in range(2):
        line = 'I am digit {}'.format(digit)
        words = line.split()
        yield [w.encode() for w in words], len(words)

There are many other ways to read data: tf.data.TextLineDataset reads from text files, tf.data.Dataset.from_tensor_slices reads from NumPy arrays, and tf.data.TFRecordDataset reads from TFRecords. As an NLP researcher, though, unless you need one of those three readers specifically for a performance gain, tf.data.Dataset.from_generator is the best choice for flexibility.

shapes = ([None], ())
types = (tf.string, tf.int32)

dataset = tf.data.Dataset.from_generator(generator_fn,
    output_shapes=shapes, output_types=types)

Testing that it works

tf.enable_eager_execution() must be called at program startup, right after import tensorflow as tf.

 import tensorflow as tf
 tf.enable_eager_execution()

 for tf_words, tf_size in dataset:
     print(tf_words, tf_size)
 >>> tf.Tensor([b'I' b'am' b'digit' b'0'], shape=(4,), dtype=string) tf.Tensor(4, shape=(), dtype=int32)
 >>> tf.Tensor([b'I' b'am' b'digit' b'1'], shape=(4,), dtype=string) tf.Tensor(4, shape=(), dtype=int32)

Alternatively, use the old-school tf.Session() approach, which requires creating an iterator first.
Then create an op that fetches the next element; each time it is run, it returns one element and advances the iterator by one position.

 iterator = dataset.make_one_shot_iterator()
 node = iterator.get_next()
 with tf.Session() as sess:
     print(sess.run(node))
     print(sess.run(node))  # Each call moves the iterator to its next position
 >>> (array([b'I', b'am', b'digit', b'0'], dtype=object), 4)
 >>> (array([b'I', b'am', b'digit', b'1'], dtype=object), 4)

Reading files and tokenizing

The biggest advantage of tf.data.Dataset.from_generator() is that the text can be preprocessed in plain Python, without hunting for the equivalent TensorFlow functions.

 from pathlib import Path

 def parse_fn(line_words, line_tags):
     # Encode in Bytes for TF
     words = [w.encode() for w in line_words.strip().split()]
     tags = [t.encode() for t in line_tags.strip().split()]
     assert len(words) == len(tags), "Words and tags lengths don't match"
     return (words, len(words)), tags

 def generator_fn(words, tags):
     with Path(words).open('r') as f_words, Path(tags).open('r') as f_tags:
         for line_words, line_tags in zip(f_words, f_tags):
             yield parse_fn(line_words, line_tags)
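
Because the generator is plain Python, it can be checked without TensorFlow. The sketch below repeats parse_fn and generator_fn so that it runs on its own, and assumes two small hypothetical files (words.txt and tags.txt, written to a temporary directory purely for illustration).

```python
import tempfile
from pathlib import Path


def parse_fn(line_words, line_tags):
    # Encode in bytes, as TensorFlow string tensors expect
    words = [w.encode() for w in line_words.strip().split()]
    tags = [t.encode() for t in line_tags.strip().split()]
    assert len(words) == len(tags), "Words and tags lengths don't match"
    return (words, len(words)), tags


def generator_fn(words, tags):
    # Read both files in lockstep, one sentence per line
    with Path(words).open('r') as f_words, Path(tags).open('r') as f_tags:
        for line_words, line_tags in zip(f_words, f_tags):
            yield parse_fn(line_words, line_tags)


def demo():
    # Hypothetical two-sentence corpus (file names and contents are made up)
    with tempfile.TemporaryDirectory() as tmp:
        words_path = Path(tmp, 'words.txt')
        tags_path = Path(tmp, 'tags.txt')
        words_path.write_text('John lives in Paris\nHello world\n')
        tags_path.write_text('B-PER O O B-LOC\nO O\n')
        # Exhaust the generator while the temporary files still exist
        return list(generator_fn(words_path, tags_path))


examples = demo()
for (words, nwords), tags in examples:
    print(words, nwords, tags)
```

Each yielded item has the ((words, nwords), tags) structure that the shapes and types passed to from_generator must describe.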

Next, build the dataset inside an input_fn, which will later be used together with tf.estimator. The functions used here are all covered in another post of mine.

prefetch ensures that a batch of data is pre-loaded on the computing device so that it does not suffer from data starvation.

 import functools

 def input_fn(words, tags, params=None, shuffle_and_repeat=False):
     params = params if params is not None else {}
     shapes = (([None], ()), [None])
     types = ((tf.string, tf.int32), tf.string)
     defaults = (('<pad>', 0), 'O')

     dataset = tf.data.Dataset.from_generator(
         functools.partial(generator_fn, words, tags),
         output_shapes=shapes, output_types=types)

     if shuffle_and_repeat:
         dataset = dataset.shuffle(params['buffer']).repeat(params['epochs'])

     dataset = (dataset
                .padded_batch(params.get('batch_size', 20), shapes, defaults)
                .prefetch(1))
     return dataset

Running it shows that the padding works as intended.

(Figure: run output)
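
As a rough illustration of the padding behaviour, the pure-Python sketch below mimics what padded_batch does with the defaults above: every sequence in a batch is padded to the length of the longest one, words with '<pad>' and tags with 'O'. pad_batch is a hypothetical helper written for this illustration, not part of tf.data.

```python
def pad_batch(examples, pad_word='<pad>', pad_tag='O'):
    """Pad a list of ((words, nwords), tags) examples to a common length."""
    max_len = max(nwords for (_, nwords), _ in examples)
    words_batch, nwords_batch, tags_batch = [], [], []
    for (words, nwords), tags in examples:
        extra = max_len - nwords
        words_batch.append(list(words) + [pad_word] * extra)
        nwords_batch.append(nwords)  # true lengths are kept as they are
        tags_batch.append(list(tags) + [pad_tag] * extra)
    return (words_batch, nwords_batch), tags_batch


batch = [
    ((['John', 'lives', 'in', 'Paris'], 4), ['B-PER', 'O', 'O', 'B-LOC']),
    ((['Hello', 'world'], 2), ['O', 'O']),
]
(padded_words, lengths), padded_tags = pad_batch(batch)
print(padded_words)
print(lengths)
print(padded_tags)
```

The unpadded lengths are what the model later receives as nwords, so padded positions can be ignored.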

tf.estimator

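It provides a high-level interface for training, evaluation, and prediction. Two components must be defined before using it.
A model function model_fn(features, labels, mode, params) -> tf.estimator.EstimatorSpec
- features and labels: the tensors needed for training.
- mode: a string that specifies whether model_fn is being used for prediction, evaluation, or training.
- params: a dictionary that holds the hyperparameters.
input_fn: the function defined earlier that returns a tf.data.Dataset; it yields the features and labels tensors that model_fn consumes.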

def model_fn(features, labels, mode, params):
    # Define the inference graph
    graph_outputs = some_tensorflow_applied_to(features)

    if mode == tf.estimator.ModeKeys.PREDICT:
        # Extract the predictions
        predictions = some_dict_from(graph_outputs)
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)
    else:
        # Compute loss, metrics, tensorboard summaries
        loss = compute_loss_from(graph_outputs, labels)
        metrics = compute_metrics_from(graph_outputs, labels)

        if mode == tf.estimator.ModeKeys.EVAL:
            return tf.estimator.EstimatorSpec(
                mode, loss=loss, eval_metric_ops=metrics)

        elif mode == tf.estimator.ModeKeys.TRAIN:
            # Get train operator
            train_op = compute_train_op_from(graph_outputs, labels)
            return tf.estimator.EstimatorSpec(
                mode, loss=loss, train_op=train_op)

        else:
            raise NotImplementedError('Unknown mode {}'.format(mode))

A concrete example

tf.contrib.lookup.index_table_from_file maps strings to ids inside the TensorFlow graph.

Here, params['words'] is the path to a file containing one lexeme (an element of my vocabulary) per line. I use TensorFlow's built-in lookup tables to map token strings to lexeme ids. The same convention is used to store the vocabulary of tags.

dropout = params['dropout']
words, nwords = features
training = (mode == tf.estimator.ModeKeys.TRAIN)
vocab_words = tf.contrib.lookup.index_table_from_file(
    params['words'], num_oov_buckets=1)
with Path(params['tags']).open() as f:
    indices = [idx for idx, tag in enumerate(f) if tag.strip() != 'O']
    num_tags = len(indices) + 1
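
To make the lookup convention concrete, here is a minimal pure-Python sketch of the idea: a token's id is its line number in the vocabulary file, and unknown tokens fall into an out-of-vocabulary bucket placed after the last known id. build_lookup and the toy vocabularies are hypothetical, and Python's hash merely stands in for whatever hashing the real table uses, so only the single-bucket case is claimed to match.

```python
def build_lookup(vocab_lines, num_oov_buckets=1):
    """Return a function mapping a token to its id (line number in the vocab)."""
    vocab = {line.strip(): idx for idx, line in enumerate(vocab_lines)}
    vocab_size = len(vocab)

    def lookup(token):
        if token in vocab:
            return vocab[token]
        # With a single bucket, every unknown token gets id == vocab_size
        return vocab_size + hash(token) % num_oov_buckets

    return lookup


# Toy vocabulary files, given as lists of lines (made up for illustration)
word_lines = ['John\n', 'lives\n', 'in\n', 'Paris\n']
tag_lines = ['B-PER\n', 'I-PER\n', 'O\n', 'B-LOC\n']

lookup = build_lookup(word_lines, num_oov_buckets=1)
word_ids = [lookup(w) for w in ['John', 'visits', 'Paris']]

# Same computation as in the snippet above: ids of all tags other than 'O'
indices = [idx for idx, tag in enumerate(tag_lines) if tag.strip() != 'O']
num_tags = len(indices) + 1
print(word_ids, indices, num_tags)
```

The list indices is what the metrics later use to score only the non-'O' classes.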

Create the word embeddings.
Pretrained word vectors can be loaded.

word_ids = vocab_words.lookup(words)
glove = np.load(params['glove'])['embeddings']  # np.array
variable = np.vstack([glove, [[0.]*params['dim']]])  # For unknown words
variable = tf.Variable(variable, dtype=tf.float32, trainable=False)
embeddings = tf.nn.embedding_lookup(variable, word_ids)
embeddings = tf.layers.dropout(embeddings, rate=dropout, training=training)
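
The extra row stacked under the GloVe matrix is what unknown words map to: with one out-of-vocabulary bucket, their id equals the vocabulary size, which is exactly the index of the appended zero row. A tiny sketch with plain lists (the 3-dimensional vectors are made up for illustration):

```python
dim = 3  # toy embedding size; the post uses 300
glove = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6]]  # hypothetical vectors for a 2-word vocabulary

# Append one all-zero row for unknown words, like the np.vstack call above
table = glove + [[0.0] * dim]


def embed(word_ids):
    # Plain-list stand-in for an embedding lookup
    return [table[i] for i in word_ids]


vectors = embed([1, 2, 0])  # id 2 == len(glove) is the unknown-word id
print(vectors)
```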

We use the most efficient LSTM cell available, which runs all of the LSTM operations as a single fused op.

t = tf.transpose(embeddings, perm=[1, 0, 2])  # Make time-major
lstm_cell_fw = tf.contrib.rnn.LSTMBlockFusedCell(params['lstm_size'])
lstm_cell_bw = tf.contrib.rnn.LSTMBlockFusedCell(params['lstm_size'])
lstm_cell_bw = tf.contrib.rnn.TimeReversedFusedRNN(lstm_cell_bw)
output_fw, _ = lstm_cell_fw(t, dtype=tf.float32, sequence_length=nwords)
output_bw, _ = lstm_cell_bw(t, dtype=tf.float32, sequence_length=nwords)
output = tf.concat([output_fw, output_bw], axis=-1)
output = tf.transpose(output, perm=[1, 0, 2])  # Make batch-major
output = tf.layers.dropout(output, rate=dropout, training=training)

LSTMBlockFusedCell expects time-major input, so tf.transpose is used to swap the batch and time axes.

LSTMBlockFusedCell is an extremely efficient LSTM implementation that uses a single TF op for the entire LSTM. It should be both faster and more memory-efficient than LSTMBlockCell.

Adding a CRF

logits = tf.layers.dense(output, num_tags)
crf_params = tf.get_variable("crf", [num_tags, num_tags], dtype=tf.float32)
pred_ids, _ = tf.contrib.crf.crf_decode(logits, crf_params, nwords)
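
Decoding a linear-chain CRF amounts to finding the highest-scoring tag sequence given per-token scores and a tag-to-tag transition matrix, which is typically done with the Viterbi algorithm. The sketch below is a simplified, single-sequence version in pure Python; it is an illustration of the technique, not the library's implementation, and viterbi_decode and the toy scores are hypothetical.

```python
def viterbi_decode(scores, transitions):
    """Find the best tag path for one unpadded sequence.

    scores: [seq_len][num_tags] per-token scores.
    transitions[i][j]: score of moving from tag i to tag j.
    Returns (best_path, best_score).
    """
    num_tags = len(transitions)
    trellis = [list(scores[0])]  # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(scores)):
        prev = trellis[-1]
        row, back = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda i: prev[i] + transitions[i][j])
            row.append(prev[best_prev] + transitions[best_prev][j] + scores[t][j])
            back.append(best_prev)
        trellis.append(row)
        backpointers.append(back)

    # Follow the backpointers from the best final tag
    last = max(range(num_tags), key=lambda j: trellis[-1][j])
    best_score = trellis[-1][last]
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, best_score


scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
no_penalty = [[0.0, 0.0], [0.0, 0.0]]
switch_penalty = [[0.0, -2.0], [-2.0, 0.0]]  # discourage changing tags

path_free, score_free = viterbi_decode(scores, no_penalty)
path_crf, score_crf = viterbi_decode(scores, switch_penalty)
print(path_free, score_free)
print(path_crf, score_crf)
```

With zero transitions the decoder simply picks the best tag per token; with the penalty it prefers a consistent sequence, which is the reason for putting a CRF on top of the per-token logits.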

Metrics and TensorBoard

import tf_metrics

# Metrics (tags: gold tag ids; pred_ids: CRF predictions)
weights = tf.sequence_mask(nwords)
metrics = {
    'acc': tf.metrics.accuracy(tags, pred_ids, weights),
    'precision': tf_metrics.precision(tags, pred_ids, num_tags, indices, weights),
    'recall': tf_metrics.recall(tags, pred_ids, num_tags, indices, weights),
    'f1': tf_metrics.f1(tags, pred_ids, num_tags, indices, weights),
}
# TensorBoard summaries
for metric_name, op in metrics.items():
    tf.summary.scalar(metric_name, op[1])
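
The weights built from nwords make sure that padded positions do not count towards the metrics. A minimal pure-Python sketch of that idea follows; sequence_mask and masked_accuracy are hypothetical helpers and the tag ids are made up.

```python
def sequence_mask(lengths, max_len):
    """True for real tokens, False for padding."""
    return [[pos < n for pos in range(max_len)] for n in lengths]


def masked_accuracy(gold, pred, mask):
    """Accuracy over the positions where the mask is True."""
    correct = total = 0
    for g_row, p_row, m_row in zip(gold, pred, mask):
        for g, p, m in zip(g_row, p_row, m_row):
            if m:
                total += 1
                correct += int(g == p)
    return correct / total if total else 0.0


gold = [[1, 0, 0, 2], [0, 0, 0, 0]]  # second sentence has 2 real tokens
pred = [[1, 0, 2, 2], [0, 1, 0, 0]]
mask = sequence_mask([4, 2], max_len=4)
accuracy = masked_accuracy(gold, pred, mask)
print(mask)
print(accuracy)
```

Without the mask, the two padded positions of the second sentence would count as correct predictions and inflate the score.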

Evaluating the model

if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, eval_metric_ops=metrics)

elif mode == tf.estimator.ModeKeys.TRAIN:
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, train_op=train_op)

Instantiating the Estimator

params = {
    'dim': 300,
    'dropout': 0.5,
    'num_oov_buckets': 1,
    'epochs': 25,
    'batch_size': 20,
    'buffer': 15000,
    'lstm_size': 100,
    'words': str(Path(DATADIR, 'vocab.words.txt')),
    'chars': str(Path(DATADIR, 'vocab.chars.txt')),
    'tags': str(Path(DATADIR, 'vocab.tags.txt')),
    'glove': str(Path(DATADIR, 'glove.npz'))
}
cfg = tf.estimator.RunConfig(save_checkpoints_secs=120)
estimator = tf.estimator.Estimator(model_fn, 'results/model', cfg, params)

Train an Estimator with early stopping

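Because the calls differ only in a few arguments, there is no need to write a separate function for each dataset; instead we use functools.partial to wrap input_fn for the different datasets.
Train with early stopping to keep the model with the highest F1, using tf.contrib.estimator.stop_if_no_increase_hook.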

# 1. Define our input_fn
train_inpf = functools.partial(input_fn, 'words.train.txt', 'tags.train.txt',
                               params, shuffle_and_repeat=True)
eval_inpf = functools.partial(input_fn, 'words.testa.txt', 'tags.testa.txt',
                              params)

# 2. Create a hook
Path(estimator.eval_dir()).mkdir(parents=True, exist_ok=True)
hook = tf.contrib.estimator.stop_if_no_increase_hook(
    estimator, 'f1', 500, min_steps=8000, run_every_secs=120)
train_spec = tf.estimator.TrainSpec(input_fn=train_inpf, hooks=[hook])
eval_spec = tf.estimator.EvalSpec(input_fn=eval_inpf, throttle_secs=120)

# 3. Train with early stopping
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
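
The idea behind the early-stopping hook can be sketched in a few lines of plain Python: look at the evaluation metric recorded at each global step and stop once the best value is older than a given number of steps. This is a simplified illustration under that assumption, not the hook's actual code; should_stop and the F1 history are hypothetical.

```python
def should_stop(eval_results, max_steps_without_increase, min_steps=0):
    """eval_results: dict mapping global step -> metric value (e.g. F1)."""
    best_value, best_step = None, None
    for step in sorted(eval_results):
        if step < min_steps:
            continue  # never stop before min_steps
        value = eval_results[step]
        if best_value is None or value > best_value:
            best_value, best_step = value, step
        if step - best_step >= max_steps_without_increase:
            return True
    return False


# Made-up F1 history: the best value is reached at step 8200
history = {8000: 0.80, 8200: 0.85, 8400: 0.84, 8600: 0.85, 8800: 0.83}
early = {step: f1 for step, f1 in history.items() if step <= 8600}

stop_early = should_stop(early, 500, min_steps=8000)
stop_late = should_stop(history, 500, min_steps=8000)
print(stop_early, stop_late)
```

Note that a tie (0.85 again at step 8600) does not count as an increase in this sketch, so the clock keeps running from step 8200.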

Last edited: 2019.01.16 15:20:33

