- Start with reader.py, whose main job is to read the data and generate the sequence data.
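import sys
import tensorflow as tf  # TensorFlow 1.x (tf.gfile)

def _read_words(filename):
    # Split the file into tokens; each newline becomes the token "<eos>".
    with tf.gfile.GFile(filename, "r") as f:
        if sys.version_info[0] >= 3:
            return f.read().replace("\n", "<eos>").split()
        else:
            return f.read().decode("utf-8").replace("\n", "<eos>").split()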
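tf.gfile.GFile is TensorFlow's file I/O wrapper. It is mainly useful for file operations on file systems such as HDFS, and it also reads ordinary local files.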
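For a local file, the same tokenizing behavior can be sketched with Python's built-in open(). This is only an illustration: the helper name read_words_local and the two-line sample text are assumptions made here, not part of reader.py.

```python
import os
import tempfile


def read_words_local(filename):
    """Read a local text file and split it into tokens, marking line ends with <eos>."""
    with open(filename, "r", encoding="utf-8") as f:
        # Each newline is replaced by "<eos>" before splitting on whitespace.
        return f.read().replace("\n", "<eos>").split()


# Write a tiny hypothetical sample whose lines are padded with spaces.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(" the cat sat \n on the mat \n")
    sample_path = tmp.name

sample_words = read_words_local(sample_path)
os.remove(sample_path)
print(sample_words)
```

Note that the sample lines are padded with spaces on both sides. Without that whitespace, "<eos>" would fuse with the neighboring words, because split() only separates on whitespace.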
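import collections

def _build_vocab(filename):
    data = _read_words(filename)
    counter = collections.Counter(data)
    # Sort by descending count, breaking ties alphabetically by word.
    count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
    words, _ = list(zip(*count_pairs))
    # The most frequent word gets id 0, the next one id 1, and so on.
    word_to_id = dict(zip(words, range(len(words))))
    return word_to_id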
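A Counter is a dict subclass for counting hashable objects. It is an unordered collection in which the elements are stored as dictionary keys and their counts as dictionary values. counter.items() returns the (element, count) pairs; in Python 3 this is a view rather than a list.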
c.items() # convert to a list of (elem, cnt) pairs
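A small sketch of the Counter behavior described above, using a made-up token list (the data is an assumption for illustration only):

```python
import collections

# Hypothetical token stream, as _read_words might return it.
tokens = ["the", "cat", "<eos>", "the", "mat", "<eos>", "the"]
counter = collections.Counter(tokens)

# In Python 3, items() returns a view; wrap it in list() to get a real list.
pairs = list(counter.items())
sorted_pairs = sorted(pairs)  # sorted by word only to make the output stable

print(counter["the"])
print(sorted_pairs)
```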
When zip() is called with *list or *tuple, the list or tuple is unpacked and its elements are passed to the function as separate positional arguments (this requires the function to accept a variable number of positional arguments).
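The unpacking can be seen on a short list of (word, count) pairs; the pairs below are invented for illustration:

```python
# Hypothetical, already sorted (word, count) pairs.
count_pairs = [("the", 3), ("<eos>", 2), ("cat", 1)]

# zip(*count_pairs) is the same call as
# zip(("the", 3), ("<eos>", 2), ("cat", 1)),
# so it regroups the pairs into one tuple of words and one tuple of counts.
words, counts = zip(*count_pairs)

# Pair every word with its rank to obtain the word-to-id mapping.
word_to_id = dict(zip(words, range(len(words))))
print(words, counts, word_to_id)
```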
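The sorted() call here orders the pairs by count in descending order. The key swaps the two fields of each counter.items() tuple, putting the negated count first and the word second, so words with equal counts are ordered alphabetically.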
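To see the effect of the key on ties, here is a minimal sketch with invented counts in which two pairs of words share the same count:

```python
# Hypothetical counts: "the" and "<eos>" tie at 3, "cat" and "mat" tie at 1.
word_counts = {"cat": 1, "the": 3, "mat": 1, "<eos>": 3}

# Negating the count sorts high counts first; the word itself breaks ties.
ordered = sorted(word_counts.items(), key=lambda x: (-x[1], x[0]))
print(ordered)
```

Because "<" sorts before lowercase letters, "<eos>" comes ahead of "the" when their counts are equal.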
def ptb_producer(raw_data, batch_size, num_steps, name=None):
    with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):
        raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)
        data_len = tf.size(raw_data)
        batch_len = data_len // batch_size
        # Drop the leftover tokens and lay the data out as batch_size rows.
        data = tf.reshape(raw_data[0 : batch_size * batch_len],
                          [batch_size, batch_len])
        epoch_size = (batch_len - 1) // num_steps
        assertion = tf.assert_positive(
            epoch_size,
            message="epoch_size == 0, decrease batch_size or num_steps")
        with tf.control_dependencies([assertion]):
            epoch_size = tf.identity(epoch_size, name="epoch_size")
        # i cycles through 0 .. epoch_size - 1 in order.
        i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
        x = tf.strided_slice(data, [0, i * num_steps],
                             [batch_size, (i + 1) * num_steps])
        x.set_shape([batch_size, num_steps])
        # y is x shifted one token to the right (the next-word targets).
        y = tf.strided_slice(data, [0, i * num_steps + 1],
                             [batch_size, (i + 1) * num_steps + 1])
        y.set_shape([batch_size, num_steps])
        return x, y
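The raw data is reshape()d according to batch_size into a matrix of shape batch_size × batch_len, where batch_len = data_len // batch_size is the number of tokens in each row. The number of batches per epoch is epoch_size = (batch_len - 1) // num_steps.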
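The slicing can be traced without TensorFlow. The following is a plain-Python sketch of one epoch only, not the queue-based implementation above; the function name ptb_batches and the toy input range(11) are assumptions for illustration.

```python
def ptb_batches(raw_data, batch_size, num_steps):
    """Yield (x, y) batches as nested lists, mirroring the slicing in ptb_producer."""
    data_len = len(raw_data)
    batch_len = data_len // batch_size
    # Row r holds tokens r*batch_len .. (r+1)*batch_len - 1; leftovers are dropped.
    data = [raw_data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size <= 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        # y is x shifted one token to the right: the next-word targets.
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        yield x, y


# Toy input: token ids 0..10 with batch_size=2 and num_steps=2.
batches = list(ptb_batches(list(range(11)), batch_size=2, num_steps=2))
for x, y in batches:
    print(x, y)
```

With 11 tokens and batch_size=2, batch_len is 5 and the last token is dropped; epoch_size is (5 - 1) // 2 = 2, so one epoch yields two batches.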
- I'll pick this up again when more questions come up :)