This article covers the common ways of reading data under the TensorFlow framework.
1持舆、Preloaded data: 預(yù)加載數(shù)據(jù)
就是我們常見的寫在程序里面的數(shù)據(jù)格式。
#coding=utf8
import tensorflow as tf
a = tf.constant([[1,2],[3,4]])
b = tf.constant([[1,2],[3,4]])
c = tf.matmul(a,b)
with tf.Session() as sess:
    print(sess.run(c))
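For larger arrays, a related preloading pattern from the old TensorFlow "reading data" guide is to park the data in a non-trainable Variable that is initialized once from a placeholder, so the array is not serialized into the graph definition as a constant. A minimal sketch, with names of my own choosing:

#coding=utf8
import tensorflow as tf
import numpy as np

data = np.arange(12, dtype=np.int32).reshape(6, 2)   # the in-memory dataset
init_holder = tf.placeholder(tf.int32, shape=data.shape)
# trainable=False and collections=[] keep the data out of training and checkpoints
data_var = tf.Variable(init_holder, trainable=False, collections=[])
with tf.Session() as sess:
    sess.run(data_var.initializer, feed_dict={init_holder: data})
    print(sess.run(data_var))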
2伪窖、Feeding: Python產(chǎn)生數(shù)據(jù)逸寓,再把數(shù)據(jù)喂給后端。
這種方法經(jīng)常用到覆山。
#coding=utf8
import tensorflow as tf
a = tf.placeholder(tf.int32,shape=[2,2])
b = tf.placeholder(tf.int32,shape=[2,2])
##a and b are placeholders; the data is only loaded when the program runs
c = tf.matmul(a,b)
a1 = [[1,2],[3,4]]
b1 = [[1,2],[3,4]]
##a1 and b1 are the data we feed into a and b; they could also be read from a file
with tf.Session() as sess:
    print(sess.run(c,feed_dict={a:a1,b:b1}))
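As the comment says, the fed values can just as well come from a file. A minimal sketch with NumPy (the file names datas/a.csv and datas/b.csv are hypothetical 2x2 comma-separated files):

#coding=utf8
import tensorflow as tf
import numpy as np

a = tf.placeholder(tf.float32, shape=[2, 2])
b = tf.placeholder(tf.float32, shape=[2, 2])
c = tf.matmul(a, b)
# load the feed values from disk instead of hard-coding them
a1 = np.loadtxt('datas/a.csv', delimiter=',').reshape(2, 2)
b1 = np.loadtxt('datas/b.csv', delimiter=',').reshape(2, 2)
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: a1, b: b1}))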
3. Reading from file: read directly from files
(1) read from CSV or txt
Sometimes the data files we get are in CSV or txt format.
Single reader, single sample (batch_size=1)
#coding=utf8
import tensorflow as tf
#create the filename queue
filenames = ['datas/A.csv','datas/B.csv']
filename_queue = tf.train.string_input_producer(filenames,shuffle=True)
#shuffle=True reads the filename queue in random order (the default)
TFReader = tf.TextLineReader()
key,value = TFReader.read(filename_queue)
example, label = tf.decode_csv(value, record_defaults=[[], []])
##record_defaults=[[], []] gives the default type of each parsed column;
##there are as many return values as the file has columns
##the delimiter defaults to the English comma and can be changed
##see https://www.tensorflow.org/versions/master/api_docs/python/tf/decode_csv for the details of tf.decode_csv()
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess,coord=coord)
    for i in range(100):
        ##loops over reads; the queue cycles, so this works even if the files have fewer lines in total
        ##fetch example and label in a single run() call: calling eval() on each
        ##separately would dequeue two different records and mis-pair them
        e, l = sess.run([example, label])
        print(e, l)
    coord.request_stop()
    coord.join(threads)
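If the columns have known types, or the file uses a different separator, record_defaults can carry typed default values and field_delim can be set explicitly. A small sketch (the three-column layout here is an assumption of mine):

#coding=utf8
import tensorflow as tf

# assumes a record looks like: 3.5;7;cat
value = tf.constant('3.5;7;cat')
feat, count, name = tf.decode_csv(
    value,
    record_defaults=[[0.0], [0], ['unknown']],  # float32, int32, string defaults
    field_delim=';')
with tf.Session() as sess:
    print(sess.run([feat, count, name]))  # [3.5, 7, b'cat']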
Single reader, multiple samples (batch_size)
#coding=utf8
import tensorflow as tf
#create the filename queue
filenames = ['datas/A.csv','datas/B.csv']
filename_queue = tf.train.string_input_producer(filenames,shuffle=False)
#shuffle=True (the default) would read the filename queue in random order
TFReader = tf.TextLineReader()
key,value = TFReader.read(filename_queue)
example, label = tf.decode_csv(value, record_defaults=[[], []])
##record_defaults=[[], []] gives the default type of each parsed column;
##there are as many return values as the file has columns
##the delimiter defaults to the English comma and can be changed
example_batch,label_batch = tf.train.batch([example,label],
                                           batch_size=5,
                                           capacity=100,
                                           num_threads=2)
# ###shuffled reading
# example_batch,label_batch = tf.train.shuffle_batch([example,label],
#                                                    batch_size=5,
#                                                    capacity=100,
#                                                    min_after_dequeue=50,
#                                                    num_threads=2)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess,coord=coord)
    for i in range(10):
        ##loops over reads; the queue cycles, so this works even if the files have fewer lines in total
        print(example_batch.eval())
    coord.request_stop()
    coord.join(threads)
Multiple readers, multiple samples
#coding=utf8
import tensorflow as tf
#create the filename queue
filenames = ['datas/A.csv','datas/B.csv']
filename_queue = tf.train.string_input_producer(filenames,shuffle=False)
#shuffle=True (the default) would read the filename queue in random order
example_list = []
for _ in range(2):
    ##the 2 means two readers are created: each entry needs its own reader so that
    ##batch_join can run them in parallel; decoding one shared read op twice would not
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    example_list.append(tf.decode_csv(value, record_defaults=[[], []]))
example_batch,label_batch = tf.train.batch_join(example_list,batch_size=5)
# tf.train.batch_join() lets several readers read the data in parallel, one thread per reader.
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess,coord=coord)
    for i in range(10):
        ##loops over reads; the queue cycles, so this works even if the files have fewer lines in total
        ##again, fetch both tensors in one run() call so they stay paired
        e, l = sess.run([example_batch, label_batch])
        print(e, l)
    coord.request_stop()
    coord.join(threads)
tf.train.batch and tf.train.shuffle_batch read with a single reader but can use multiple threads. tf.train.batch_join and tf.train.shuffle_batch_join can be set up with multiple readers, each reader using one thread. As for the efficiency of the two approaches: with a single reader, two threads already hit the speed limit; with multiple readers, two readers hit it. So more threads is not automatically faster, and beyond that point extra threads actually lower throughput. A sketch of shuffle_batch_join follows.
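tf.train.shuffle_batch_join is mentioned above but not demonstrated; here is a minimal sketch under the same CSV setup (same file names, parameter values are my own picks):

#coding=utf8
import tensorflow as tf

filename_queue = tf.train.string_input_producer(['datas/A.csv','datas/B.csv'])
example_list = []
for _ in range(2):
    reader = tf.TextLineReader()             # one reader per list entry
    _, value = reader.read(filename_queue)
    example_list.append(tf.decode_csv(value, record_defaults=[[], []]))
# multi-reader shuffled batching: one thread per reader, samples shuffled
example_batch, label_batch = tf.train.shuffle_batch_join(example_list,
                                                         batch_size=5,
                                                         capacity=100,
                                                         min_after_dequeue=50)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run([example_batch, label_batch]))
    coord.request_stop()
    coord.join(threads)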
(2)锚扎、read from tfrecords
這種讀取方式常被用來讀取圖片數(shù)據(jù),先是將圖片數(shù)據(jù)寫入到tfrecords文件中馁启,當(dāng)要使用時再從中讀取工秩,速度很快,但是將圖片格式的文件寫入tfrecords文件后所占用的磁盤內(nèi)存更大进统?有弊有利,在圖像處理時要先將其處理成相同大小的圖片保存(很重要浪听,我在測試過程中沒有找到這么儲存不同大小的圖片)螟碎。
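One workaround I have seen for variable-sized images (untested here, so treat it as a sketch) is to store each image's height and width as extra int64 features and reshape dynamically when reading:

#coding=utf8
import tensorflow as tf

## when writing, store the dimensions next to the raw bytes, e.g. for a PIL image:
# w, h = image.size
# sample = tf.train.Example(features=tf.train.Features(feature={
#     'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[h])),
#     'width':  tf.train.Feature(int64_list=tf.train.Int64List(value=[w])),
#     'image':  tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()]))
# }))

## when reading, reshape with the stored dimensions instead of a fixed [208,208,3]
def parse_variable_size(serialized_sample):
    features = tf.parse_single_example(serialized_sample, features={
        'height': tf.FixedLenFeature([], tf.int64),
        'width': tf.FixedLenFeature([], tf.int64),
        'image': tf.FixedLenFeature([], tf.string)})
    image = tf.decode_raw(features['image'], tf.uint8)
    shape = tf.stack([tf.cast(features['height'], tf.int32),
                      tf.cast(features['width'], tf.int32), 3])
    return tf.reshape(image, shape)

Note that batching still needs equal shapes, so the images would have to be resized or padded after parsing anyway; storing one fixed size is simply easier.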
#coding=utf8
import tensorflow as tf
import os
import numpy as np
from PIL import Image
## write the tfrecords file
def write_tfrecords():
    np.random.seed(100)
    path = 'E:/cats_dogs/'
    savepath = 'datas/test.tfrecords'
    files = [path+item for item in os.listdir(path) if item.endswith('.jpg')]
    np.random.shuffle(files)
    train_files = files[:23000]
    test_files = files[23000:]
    TFWriter = tf.python_io.TFRecordWriter(savepath)
    for i,file in enumerate(test_files):
        if i%1000==0:
            print(i)
        lab = file.split('/')[-1].split('.')[0].strip()
        if lab=='cat':
            label = 1
        else:
            label = 0
        image = Image.open(file)
        image = image.resize((208,208))
        imagerow = image.tobytes()
        sample = tf.train.Example(features=tf.train.Features(feature={
            'label':tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            'image':tf.train.Feature(bytes_list=tf.train.BytesList(value=[imagerow]))
        }))
        TFWriter.write(sample.SerializeToString())
    TFWriter.close()
# write_tfrecords()
## read from the tfrecords file
def read_tfrecords():
    filepath = 'datas/test.tfrecords'
    filename_queue = tf.train.string_input_producer([filepath])
    TFReader = tf.TFRecordReader()
    _,serialize_sample = TFReader.read(filename_queue)
    features = tf.parse_single_example(serialize_sample,features={
        'label':tf.FixedLenFeature([],tf.int64),
        'image':tf.FixedLenFeature([],tf.string)
    })
    image = tf.decode_raw(features['image'],tf.uint8)
    image = tf.reshape(image,shape=[208,208,3])
    label = tf.cast(features['label'],tf.int32)
    return image,label

## read a batch
def next_batch(batch_size):
    # import matplotlib.pyplot as plt
    image,label = read_tfrecords()
    image_batch,label_batch = tf.train.shuffle_batch([image,label],
                                                     batch_size=batch_size,
                                                     capacity=200,
                                                     min_after_dequeue=100,
                                                     num_threads=32)
    # with tf.Session() as sess:
    #     sess.run(tf.global_variables_initializer())
    #     coord = tf.train.Coordinator()
    #     threads = tf.train.start_queue_runners(coord=coord)
    #     image,label = sess.run([image_batch,label_batch])
    #     for i in range(2):
    #         print(label[i])
    #         plt.imshow(image[i])
    #         plt.show()
    #     coord.request_stop()
    #     coord.join(threads)
    image_batch = tf.cast(image_batch,tf.float32)
    label_batch = tf.cast(label_batch,tf.int32)
    return image_batch,label_batch
# next_batch(2)
The method above is very fast and does not hog memory, but the drawback is that the data is fixed. When working with images you often enlarge the dataset by adding noise to the images, e.g. blurring, brightness changes, deletions. If you applied those first and then wrote the results into a tfrecords file, it would be an enormous waste of disk space. So I sometimes prefer to read the image files directly, add the noise on the fly, and then send them to training. The drawback of that is the constant high-speed reading of image files, which is very heavy on the CPU. A small augmentation sketch comes next, followed by a full direct-reading example.
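A minimal sketch of on-the-fly augmentation with tf.image ops, meant to sit between decoding and batching (the particular ops and parameter values are my own choices):

#coding=utf8
import tensorflow as tf

def augment(image):
    # image: a [H, W, 3] tensor; work in float32 [0, 1] so the deltas below make sense
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.random_flip_left_right(image)              # horizontal mirror
    image = tf.image.random_brightness(image, max_delta=0.2)    # brightness noise
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return image

Because these ops are part of the graph, every pass over the data sees a freshly perturbed copy, and nothing extra is written to disk.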
import tensorflow as tf
import numpy as np
import os
import math
# you need to change this to your data directory
# train_dir = '/home/acrobat/DataSets/cats_vs_dogs/train/'
def get_files(file_dir, ratio):
    """
    Args:
        file_dir: file directory
        ratio: ratio of validation datasets
    Returns:
        list of images and labels
    """
    cats = []
    label_cats = []
    dogs = []
    label_dogs = []
    for file in os.listdir(file_dir):
        name = file.split(sep='.')
        if name[0]=='cat':
            cats.append(file_dir + file)
            label_cats.append(0)
        else:
            dogs.append(file_dir + file)
            label_dogs.append(1)
    print('There are %d cats\nThere are %d dogs' %(len(cats), len(dogs)))
    image_list = np.hstack((cats, dogs))
    label_list = np.hstack((label_cats, label_dogs))
    temp = np.array([image_list, label_list])
    temp = temp.transpose()
    np.random.shuffle(temp)
    all_image_list = temp[:, 0]
    all_label_list = temp[:, 1]
    n_sample = len(all_label_list)
    n_val = math.ceil(n_sample*ratio) # number of validation samples
    n_train = n_sample - n_val # number of training samples
    tra_images = all_image_list[0:n_train]
    tra_labels = all_label_list[0:n_train]
    tra_labels = [int(float(i)) for i in tra_labels]
    val_images = all_image_list[n_train:]  # slicing with [n_train:-1] would drop the last sample
    val_labels = all_label_list[n_train:]
    val_labels = [int(float(i)) for i in val_labels]
    return tra_images,tra_labels,val_images,val_labels
def get_batch(image, label, image_W, image_H, batch_size, capacity):
    """
    Args:
        image: list type
        label: list type
        image_W: image width
        image_H: image height
        batch_size: batch size
        capacity: the maximum elements in queue
    Returns:
        image_batch: 4D tensor [batch_size, width, height, 3], dtype=tf.float32
        label_batch: 1D tensor [batch_size], dtype=tf.int32
    """
    image = tf.cast(image, tf.string)
    label = tf.cast(label, tf.int32)
    # make an input queue
    input_queue = tf.train.slice_input_producer([image, label])
    label = input_queue[1]
    image_contents = tf.read_file(input_queue[0])
    image = tf.image.decode_jpeg(image_contents, channels=3)
    image = tf.image.resize_image_with_crop_or_pad(image, image_W, image_H)
    # if you want to test the generated batches of images, you might want to comment the following line.
    image = tf.image.per_image_standardization(image)
    image_batch, label_batch = tf.train.batch([image, label],
                                              batch_size=batch_size,
                                              num_threads=64,
                                              capacity=capacity)
    # you can also use shuffle_batch
    # image_batch, label_batch = tf.train.shuffle_batch([image,label],
    #                                                   batch_size=BATCH_SIZE,
    #                                                   num_threads=64,
    #                                                   capacity=CAPACITY,
    #                                                   min_after_dequeue=CAPACITY-1)
    label_batch = tf.reshape(label_batch, [batch_size])
    image_batch = tf.cast(image_batch, tf.float32)
    return image_batch, label_batch
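A usage sketch tying the two functions together (the directory comes from the commented line above; batch size and image size are arbitrary picks of mine):

#coding=utf8
import tensorflow as tf

train_dir = '/home/acrobat/DataSets/cats_vs_dogs/train/'
tra_images, tra_labels, val_images, val_labels = get_files(train_dir, ratio=0.2)
image_batch, label_batch = get_batch(tra_images, tra_labels,
                                     image_W=208, image_H=208,
                                     batch_size=16, capacity=256)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    imgs, labs = sess.run([image_batch, label_batch])
    print(imgs.shape, labs)  # (16, 208, 208, 3) and 16 labels
    coord.request_stop()
    coord.join(threads)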
(3) read from bin
Sometimes the data comes in binary (bin) format, so the binary file has to be read back out.
The cifar example on the official site reads from a bin file. A bin file has to be stored with a fixed size layout: each sample's values occupy a set number of bytes, the label occupies a set number of bytes, and this is identical for every sample, one stored right after another. That way tf.FixedLengthRecordReader can read a fixed number of bytes each time, exactly the bytes of one stored sample (label included), and tf.decode_raw parses them.
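For completeness, here is a minimal sketch of how such a fixed-length file could be written with NumPy (write_bin is my own helper, not part of the cifar tooling); the reading code follows:

#coding=utf8
import numpy as np

# one record = 1 label byte + 32*32*3 image bytes, samples stored back to back
def write_bin(filename, labels, images):
    with open(filename, 'wb') as f:
        for label, img in zip(labels, images):
            # img: a uint8 array laid out depth-first as [3, 32, 32], like cifar
            f.write(np.uint8(label).tobytes())
            f.write(img.astype(np.uint8).tobytes())

# write_bin('./data/train.bin', [0, 1], np.zeros((2, 3, 32, 32), dtype=np.uint8))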
import tensorflow as tf
import numpy as np
# predefined image metadata
labelBytes = 1
widthBytes = 32
heightBytes = 32
depthBytes = 3
imageBytes = widthBytes*heightBytes*depthBytes
recordBytes = imageBytes+labelBytes
filename_queue = tf.train.string_input_producer(["./data/train.bin"])
reader = tf.FixedLengthRecordReader(record_bytes=recordBytes) # read the binary file in fixed-length records
key,value = reader.read(filename_queue)
bytes = tf.decode_raw(value,out_type=tf.uint8) # decode to uint8: an 8-bit, 3-channel image with values 0-255
label = tf.cast(tf.strided_slice(bytes,[0],[labelBytes]),tf.int32) # slice off the label and cast to int32
##tf.strided_slice() splits up the record bytes that were read
originalImg = tf.reshape(tf.strided_slice(bytes,[labelBytes],[labelBytes+imageBytes]),[depthBytes,heightBytes,widthBytes])
# slice off the image; in this storage layout the depth axis comes first
img = tf.transpose(originalImg,[1,2,0]) # reorder the axes so depth comes last
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(100):
        imgArr = sess.run(img)
        print(imgArr.shape)
    coord.request_stop()
    coord.join(threads)
See https://www.tensorflow.org/versions/master/api_docs/python/tf/strided_slice for the details of tf.strided_slice.
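As a quick illustration of what the two tf.strided_slice calls above are doing, a toy record of my own:

#coding=utf8
import tensorflow as tf

record = tf.constant([7, 10, 20, 30], dtype=tf.uint8)  # 1 label byte + 3 payload bytes
label = tf.strided_slice(record, [0], [1])    # -> [7]
payload = tf.strided_slice(record, [1], [4])  # -> [10, 20, 30]
with tf.Session() as sess:
    print(sess.run([label, payload]))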
These are the common ways of reading and writing data in TensorFlow that I have run into so far; I will keep adding to the list as I meet more.
Addendum: here is a fairly detailed tutorial on reading and writing tfrecords files that I learned a lot from:
http://blog.csdn.net/u010223750/article/details/70482498
References:
http://honggang.io/2016/08/19/tensorflow-data-reading/
http://blog.csdn.net/freedom098/article/details/56008784