This article covers the common ways of reading data under the TensorFlow framework.
1持舆、Preloaded data: 預(yù)加載數(shù)據(jù)
就是我們常見的寫在程序里面的數(shù)據(jù)格式。
#coding=utf8
import tensorflow as tf
a = tf.constant([[1,2],[3,4]])
b = tf.constant([[1,2],[3,4]])
c = tf.matmul(a,b)
with tf.Session() as sess:
    print(sess.run(c))
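For larger arrays, a related preloading pattern from the old TensorFlow "reading data" guide is to park the data in a non-trainable Variable that is initialized once from a placeholder, so the array is not serialized into the graph definition as a constant. A minimal sketch, with names of my own choosing:

#coding=utf8
import tensorflow as tf
import numpy as np

data = np.arange(12, dtype=np.int32).reshape(6, 2)   # the in-memory dataset
init_holder = tf.placeholder(tf.int32, shape=data.shape)
# trainable=False and collections=[] keep the data out of training and checkpoints
data_var = tf.Variable(init_holder, trainable=False, collections=[])
with tf.Session() as sess:
    sess.run(data_var.initializer, feed_dict={init_holder: data})
    print(sess.run(data_var))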
2伪窖、Feeding: Python產(chǎn)生數(shù)據(jù)逸寓,再把數(shù)據(jù)喂給后端。
這種方法經(jīng)常用到覆山。
#coding=utf8
import tensorflow as tf
a = tf.placeholder(tf.int32,shape=[2,2])
b = tf.placeholder(tf.int32,shape=[2,2])
##a and b are placeholders; the data is only loaded when the program runs
c = tf.matmul(a,b)
a1 = [[1,2],[3,4]]
b1 = [[1,2],[3,4]]
##a1 and b1 are the data we feed into a and b; they could also be read from a file
with tf.Session() as sess:
    print(sess.run(c,feed_dict={a:a1,b:b1}))
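As the comment says, the fed values can just as well come from a file. A minimal sketch with NumPy (the file names datas/a.csv and datas/b.csv are hypothetical 2x2 comma-separated files):

#coding=utf8
import tensorflow as tf
import numpy as np

a = tf.placeholder(tf.float32, shape=[2, 2])
b = tf.placeholder(tf.float32, shape=[2, 2])
c = tf.matmul(a, b)
# load the feed values from disk instead of hard-coding them
a1 = np.loadtxt('datas/a.csv', delimiter=',').reshape(2, 2)
b1 = np.loadtxt('datas/b.csv', delimiter=',').reshape(2, 2)
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: a1, b: b1}))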
3. Reading from file: read directly from files
(1) read from CSV or txt
Sometimes the data files we get are in CSV or txt format.
Single reader, single sample (batch_size=1)
#coding=utf8
import tensorflow as tf
#create the filename queue
filenames = ['datas/A.csv','datas/B.csv']
filename_queue = tf.train.string_input_producer(filenames,shuffle=True)
#shuffle=True reads the filename queue in random order (the default)
TFReader = tf.TextLineReader()
key,value = TFReader.read(filename_queue)
example, label = tf.decode_csv(value, record_defaults=[[], []])
##record_defaults=[[], []] gives the default type of each parsed column;
##there are as many return values as the file has columns
##the delimiter defaults to the English comma and can be changed
##see https://www.tensorflow.org/versions/master/api_docs/python/tf/decode_csv for the details of tf.decode_csv()
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess,coord=coord)
    for i in range(100):
        ##loops over reads; the queue cycles, so this works even if the files have fewer lines in total
        ##fetch example and label in a single run() call: calling eval() on each
        ##separately would dequeue two different records and mis-pair them
        e, l = sess.run([example, label])
        print(e, l)
    coord.request_stop()
    coord.join(threads)
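If the columns have known types, or the file uses a different separator, record_defaults can carry typed default values and field_delim can be set explicitly. A small sketch (the three-column layout here is an assumption of mine):

#coding=utf8
import tensorflow as tf

# assumes a record looks like: 3.5;7;cat
value = tf.constant('3.5;7;cat')
feat, count, name = tf.decode_csv(
    value,
    record_defaults=[[0.0], [0], ['unknown']],  # float32, int32, string defaults
    field_delim=';')
with tf.Session() as sess:
    print(sess.run([feat, count, name]))  # [3.5, 7, b'cat']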
Single reader, multiple samples (batch_size)
#coding=utf8
import tensorflow as tf
#create the filename queue
filenames = ['datas/A.csv','datas/B.csv']
filename_queue = tf.train.string_input_producer(filenames,shuffle=False)
#shuffle=True (the default) would read the filename queue in random order
TFReader = tf.TextLineReader()
key,value = TFReader.read(filename_queue)
example, label = tf.decode_csv(value, record_defaults=[[], []])
##record_defaults=[[], []] gives the default type of each parsed column;
##there are as many return values as the file has columns
##the delimiter defaults to the English comma and can be changed
example_batch,label_batch = tf.train.batch([example,label],
                                           batch_size=5,
                                           capacity=100,
                                           num_threads=2)
# ###shuffled reading
# example_batch,label_batch = tf.train.shuffle_batch([example,label],
#                                                    batch_size=5,
#                                                    capacity=100,
#                                                    min_after_dequeue=50,
#                                                    num_threads=2)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess,coord=coord)
    for i in range(10):
        ##loops over reads; the queue cycles, so this works even if the files have fewer lines in total
        print(example_batch.eval())
    coord.request_stop()
    coord.join(threads)
Multiple readers, multiple samples
#coding=utf8
import tensorflow as tf
#create the filename queue
filenames = ['datas/A.csv','datas/B.csv']
filename_queue = tf.train.string_input_producer(filenames,shuffle=False)
#shuffle=True (the default) would read the filename queue in random order
example_list = []
for _ in range(2):
    ##the 2 means two readers are created: each entry needs its own reader so that
    ##batch_join can run them in parallel; decoding one shared read op twice would not
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    example_list.append(tf.decode_csv(value, record_defaults=[[], []]))
example_batch,label_batch = tf.train.batch_join(example_list,batch_size=5)
# tf.train.batch_join() lets several readers read the data in parallel, one thread per reader.
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess,coord=coord)
    for i in range(10):
        ##loops over reads; the queue cycles, so this works even if the files have fewer lines in total
        ##again, fetch both tensors in one run() call so they stay paired
        e, l = sess.run([example_batch, label_batch])
        print(e, l)
    coord.request_stop()
    coord.join(threads)
tf.train.batch and tf.train.shuffle_batch read with a single reader but can use multiple threads. tf.train.batch_join and tf.train.shuffle_batch_join can be set up with multiple readers, each reader using one thread. As for the efficiency of the two approaches: with a single reader, two threads already hit the speed limit; with multiple readers, two readers hit it. So more threads is not automatically faster, and beyond that point extra threads actually lower throughput. A sketch of shuffle_batch_join follows.
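tf.train.shuffle_batch_join is mentioned above but not demonstrated; here is a minimal sketch under the same CSV setup (same file names, parameter values are my own picks):

#coding=utf8
import tensorflow as tf

filename_queue = tf.train.string_input_producer(['datas/A.csv','datas/B.csv'])
example_list = []
for _ in range(2):
    reader = tf.TextLineReader()             # one reader per list entry
    _, value = reader.read(filename_queue)
    example_list.append(tf.decode_csv(value, record_defaults=[[], []]))
# multi-reader shuffled batching: one thread per reader, samples shuffled
example_batch, label_batch = tf.train.shuffle_batch_join(example_list,
                                                         batch_size=5,
                                                         capacity=100,
                                                         min_after_dequeue=50)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run([example_batch, label_batch]))
    coord.request_stop()
    coord.join(threads)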
(2)锚扎、read from tfrecords
這種讀取方式常被用來讀取圖片數(shù)據(jù),先是將圖片數(shù)據(jù)寫入到tfrecords文件中馁启,當(dāng)要使用時再從中讀取工秩,速度很快,但是將圖片格式的文件寫入tfrecords文件后所占用的磁盤內(nèi)存更大进统?有弊有利,在圖像處理時要先將其處理成相同大小的圖片保存(很重要浪听,我在測試過程中沒有找到這么儲存不同大小的圖片)螟碎。
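One workaround I have seen for variable-sized images (untested here, so treat it as a sketch) is to store each image's height and width as extra int64 features and reshape dynamically when reading:

#coding=utf8
import tensorflow as tf

## when writing, store the dimensions next to the raw bytes, e.g. for a PIL image:
# w, h = image.size
# sample = tf.train.Example(features=tf.train.Features(feature={
#     'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[h])),
#     'width':  tf.train.Feature(int64_list=tf.train.Int64List(value=[w])),
#     'image':  tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()]))
# }))

## when reading, reshape with the stored dimensions instead of a fixed [208,208,3]
def parse_variable_size(serialized_sample):
    features = tf.parse_single_example(serialized_sample, features={
        'height': tf.FixedLenFeature([], tf.int64),
        'width': tf.FixedLenFeature([], tf.int64),
        'image': tf.FixedLenFeature([], tf.string)})
    image = tf.decode_raw(features['image'], tf.uint8)
    shape = tf.stack([tf.cast(features['height'], tf.int32),
                      tf.cast(features['width'], tf.int32), 3])
    return tf.reshape(image, shape)

Note that batching still needs equal shapes, so the images would have to be resized or padded after parsing anyway; storing one fixed size is simply easier.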
#coding=utf8
import tensorflow as tf
import os
import numpy as np
from PIL import Image
## write the tfrecords file
def write_tfrecords():
    np.random.seed(100)
    path = 'E:/cats_dogs/'
    savepath = 'datas/test.tfrecords'
    files = [path+item for item in os.listdir(path) if item.endswith('.jpg')]
    np.random.shuffle(files)
    train_files = files[:23000]
    test_files = files[23000:]
    TFWriter = tf.python_io.TFRecordWriter(savepath)
    for i,file in enumerate(test_files):
        if i%1000==0:
            print(i)
        lab = file.split('/')[-1].split('.')[0].strip()
        if lab=='cat':
            label = 1
        else:
            label = 0
        image = Image.open(file)
        image = image.resize((208,208))
        imagerow = image.tobytes()
        sample = tf.train.Example(features=tf.train.Features(feature={
            'label':tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            'image':tf.train.Feature(bytes_list=tf.train.BytesList(value=[imagerow]))
        }))
        TFWriter.write(sample.SerializeToString())
    TFWriter.close()
# write_tfrecords()
## read from the tfrecords file
def read_tfrecords():
    filepath = 'datas/test.tfrecords'
    filename_queue = tf.train.string_input_producer([filepath])
    TFReader = tf.TFRecordReader()
    _,serialize_sample = TFReader.read(filename_queue)
    features = tf.parse_single_example(serialize_sample,features={
        'label':tf.FixedLenFeature([],tf.int64),
        'image':tf.FixedLenFeature([],tf.string)
    })
    image = tf.decode_raw(features['image'],tf.uint8)
    image = tf.reshape(image,shape=[208,208,3])
    label = tf.cast(features['label'],tf.int32)
    return image,label

## read a batch
def next_batch(batch_size):
    # import matplotlib.pyplot as plt
    image,label = read_tfrecords()
    image_batch,label_batch = tf.train.shuffle_batch([image,label],
                                                     batch_size=batch_size,
                                                     capacity=200,
                                                     min_after_dequeue=100,
                                                     num_threads=32)
    # with tf.Session() as sess:
    #     sess.run(tf.global_variables_initializer())
    #     coord = tf.train.Coordinator()
    #     threads = tf.train.start_queue_runners(coord=coord)
    #     image,label = sess.run([image_batch,label_batch])
    #     for i in range(2):
    #         print(label[i])
    #         plt.imshow(image[i])
    #         plt.show()
    #     coord.request_stop()
    #     coord.join(threads)
    image_batch = tf.cast(image_batch,tf.float32)
    label_batch = tf.cast(label_batch,tf.int32)
    return image_batch,label_batch
# next_batch(2)
The method above is very fast and does not hog memory, but the drawback is that the data is fixed. When working with images you often enlarge the dataset by adding noise to the images, e.g. blurring, brightness changes, deletions. If you applied those first and then wrote the results into a tfrecords file, it would be an enormous waste of disk space. So I sometimes prefer to read the image files directly, add the noise on the fly, and then send them to training. The drawback of that is the constant high-speed reading of image files, which is very heavy on the CPU. A small augmentation sketch comes next, followed by a full direct-reading example.
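A minimal sketch of on-the-fly augmentation with tf.image ops, meant to sit between decoding and batching (the particular ops and parameter values are my own choices):

#coding=utf8
import tensorflow as tf

def augment(image):
    # image: a [H, W, 3] tensor; work in float32 [0, 1] so the deltas below make sense
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.random_flip_left_right(image)              # horizontal mirror
    image = tf.image.random_brightness(image, max_delta=0.2)    # brightness noise
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return image

Because these ops are part of the graph, every pass over the data sees a freshly perturbed copy, and nothing extra is written to disk.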
import tensorflow as tf
import numpy as np
import os
import math
# you need to change this to your data directory
# train_dir = '/home/acrobat/DataSets/cats_vs_dogs/train/'
def get_files(file_dir, ratio):
    """
    Args:
        file_dir: file directory
        ratio: ratio of validation datasets
    Returns:
        list of images and labels
    """
    cats = []
    label_cats = []
    dogs = []
    label_dogs = []
    for file in os.listdir(file_dir):
        name = file.split(sep='.')
        if name[0]=='cat':
            cats.append(file_dir + file)
            label_cats.append(0)
        else:
            dogs.append(file_dir + file)
            label_dogs.append(1)
    print('There are %d cats\nThere are %d dogs' %(len(cats), len(dogs)))
    image_list = np.hstack((cats, dogs))
    label_list = np.hstack((label_cats, label_dogs))
    temp = np.array([image_list, label_list])
    temp = temp.transpose()
    np.random.shuffle(temp)
    all_image_list = temp[:, 0]
    all_label_list = temp[:, 1]
    n_sample = len(all_label_list)
    n_val = math.ceil(n_sample*ratio) # number of validation samples
    n_train = n_sample - n_val # number of training samples
    tra_images = all_image_list[0:n_train]
    tra_labels = all_label_list[0:n_train]
    tra_labels = [int(float(i)) for i in tra_labels]
    val_images = all_image_list[n_train:]  # slicing with [n_train:-1] would drop the last sample
    val_labels = all_label_list[n_train:]
    val_labels = [int(float(i)) for i in val_labels]
    return tra_images,tra_labels,val_images,val_labels
def get_batch(image, label, image_W, image_H, batch_size, capacity):
    """
    Args:
        image: list type
        label: list type
        image_W: image width
        image_H: image height
        batch_size: batch size
        capacity: the maximum elements in queue
    Returns:
        image_batch: 4D tensor [batch_size, width, height, 3], dtype=tf.float32
        label_batch: 1D tensor [batch_size], dtype=tf.int32
    """
    image = tf.cast(image, tf.string)
    label = tf.cast(label, tf.int32)
    # make an input queue
    input_queue = tf.train.slice_input_producer([image, label])
    label = input_queue[1]
    image_contents = tf.read_file(input_queue[0])
    image = tf.image.decode_jpeg(image_contents, channels=3)
    image = tf.image.resize_image_with_crop_or_pad(image, image_W, image_H)
    # if you want to test the generated batches of images, you might want to comment the following line.
    image = tf.image.per_image_standardization(image)
    image_batch, label_batch = tf.train.batch([image, label],
                                              batch_size=batch_size,
                                              num_threads=64,
                                              capacity=capacity)
    # you can also use shuffle_batch
    # image_batch, label_batch = tf.train.shuffle_batch([image,label],
    #                                                   batch_size=BATCH_SIZE,
    #                                                   num_threads=64,
    #                                                   capacity=CAPACITY,
    #                                                   min_after_dequeue=CAPACITY-1)
    label_batch = tf.reshape(label_batch, [batch_size])
    image_batch = tf.cast(image_batch, tf.float32)
    return image_batch, label_batch
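A usage sketch tying the two functions together (the directory comes from the commented line above; batch size and image size are arbitrary picks of mine):

#coding=utf8
import tensorflow as tf

train_dir = '/home/acrobat/DataSets/cats_vs_dogs/train/'
tra_images, tra_labels, val_images, val_labels = get_files(train_dir, ratio=0.2)
image_batch, label_batch = get_batch(tra_images, tra_labels,
                                     image_W=208, image_H=208,
                                     batch_size=16, capacity=256)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    imgs, labs = sess.run([image_batch, label_batch])
    print(imgs.shape, labs)  # (16, 208, 208, 3) and 16 labels
    coord.request_stop()
    coord.join(threads)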
(3) read from bin
Sometimes the data comes in binary (bin) format, so the binary file has to be read back out.
The cifar example on the official site reads from a bin file. A bin file has to be stored with a fixed size layout: each sample's values occupy a set number of bytes, the label occupies a set number of bytes, and this is identical for every sample, one stored right after another. That way tf.FixedLengthRecordReader can read a fixed number of bytes each time, exactly the bytes of one stored sample (label included), and tf.decode_raw parses them.
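For completeness, here is a minimal sketch of how such a fixed-length file could be written with NumPy (write_bin is my own helper, not part of the cifar tooling); the reading code follows:

#coding=utf8
import numpy as np

# one record = 1 label byte + 32*32*3 image bytes, samples stored back to back
def write_bin(filename, labels, images):
    with open(filename, 'wb') as f:
        for label, img in zip(labels, images):
            # img: a uint8 array laid out depth-first as [3, 32, 32], like cifar
            f.write(np.uint8(label).tobytes())
            f.write(img.astype(np.uint8).tobytes())

# write_bin('./data/train.bin', [0, 1], np.zeros((2, 3, 32, 32), dtype=np.uint8))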
import tensorflow as tf
import numpy as np
# predefined image metadata
labelBytes = 1
widthBytes = 32
heightBytes = 32
depthBytes = 3
imageBytes = widthBytes*heightBytes*depthBytes
recordBytes = imageBytes+labelBytes
filename_queue = tf.train.string_input_producer(["./data/train.bin"])
reader = tf.FixedLengthRecordReader(record_bytes=recordBytes) # read the binary file in fixed-length records
key,value = reader.read(filename_queue)
bytes = tf.decode_raw(value,out_type=tf.uint8) # decode to uint8: an 8-bit, 3-channel image with values 0-255
label = tf.cast(tf.strided_slice(bytes,[0],[labelBytes]),tf.int32) # slice off the label and cast to int32
##tf.strided_slice() splits up the record bytes that were read
originalImg = tf.reshape(tf.strided_slice(bytes,[labelBytes],[labelBytes+imageBytes]),[depthBytes,heightBytes,widthBytes])
# slice off the image; in this storage layout the depth axis comes first
img = tf.transpose(originalImg,[1,2,0]) # reorder the axes so depth comes last
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(100):
        imgArr = sess.run(img)
        print(imgArr.shape)
    coord.request_stop()
    coord.join(threads)
See https://www.tensorflow.org/versions/master/api_docs/python/tf/strided_slice for the details of tf.strided_slice.
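As a quick illustration of what the two tf.strided_slice calls above are doing, a toy record of my own:

#coding=utf8
import tensorflow as tf

record = tf.constant([7, 10, 20, 30], dtype=tf.uint8)  # 1 label byte + 3 payload bytes
label = tf.strided_slice(record, [0], [1])    # -> [7]
payload = tf.strided_slice(record, [1], [4])  # -> [10, 20, 30]
with tf.Session() as sess:
    print(sess.run([label, payload]))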
These are the common ways of reading and writing data in TensorFlow that I have run into so far; I will keep adding to the list as I meet more.
Addendum: here is a fairly detailed tutorial on reading and writing tfrecords files that I learned a lot from:
http://blog.csdn.net/u010223750/article/details/70482498
References:
http://honggang.io/2016/08/19/tensorflow-data-reading/
http://blog.csdn.net/freedom098/article/details/56008784