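Preface
I have recently been learning Keras and needed LeCun's MNIST handwritten digit dataset, so I downloaded the four compressed archives directly from the official site.
After decompressing them I found that each archive contains a single idx-ubyte file and no image files. Going back and reading the official site more carefully, I learned that this is the IDX file format, a file format for storing vectors and multidimensional matrices.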
The IDX file format
The official site describes it as follows:
THE IDX FILE FORMAT
the IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.
The basic format is
magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data
The magic number is an integer (MSB first). The first 2 bytes are always 0.
The third byte codes the type of the data:
0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)
The 4-th byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices....
The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).
The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
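To make the header layout concrete, here is a minimal sketch that decodes the magic number and the dimension sizes from raw bytes with `struct`. The names `IDX_TYPES` and `parse_idx_header` are illustrative helpers of my own, not part of any library; the type codes are the ones listed in the description above, and the sample header uses the values documented for the MNIST training image file.

```python
import struct

# Type codes from the IDX description, mapped to (struct format char, size in bytes).
IDX_TYPES = {
    0x08: ('B', 1),  # unsigned byte
    0x09: ('b', 1),  # signed byte
    0x0B: ('h', 2),  # short
    0x0C: ('i', 4),  # int
    0x0D: ('f', 4),  # float
    0x0E: ('d', 8),  # double
}


def parse_idx_header(buf):
    """Hypothetical helper: return (type_code, dims, data_offset) for an IDX byte string."""
    # Magic number: two zero bytes, then the type code, then the number of dimensions.
    zero, type_code, num_dims = struct.unpack_from('>HBB', buf, 0)
    if zero != 0 or type_code not in IDX_TYPES:
        raise ValueError('not a valid IDX header')
    # Each dimension size is a 4-byte big-endian integer.
    dims = struct.unpack_from('>%dI' % num_dims, buf, 4)
    # The data starts right after the magic number and the dimension sizes.
    return type_code, dims, 4 + 4 * num_dims


# Header of an MNIST image file: magic 0x00000803, then 60000 images of 28 x 28.
header = struct.pack('>IIII', 0x00000803, 60000, 28, 28)
print(parse_idx_header(header))  # (8, (60000, 28, 28), 16)
```

Reading the third and fourth bytes separately, rather than treating the magic number as one opaque integer, is what lets a single function handle both the idx1 label files and the idx3 image files.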
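Parsing script
Following the rules above, I used Python's struct module to read the files (if you are not familiar with the struct module, see my other blog post, "Working with byte streams / binary streams in Python: a quick guide to the struct module"). The generic interface for parsing IDX files is as follows: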
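# Parse the idx1 format
def decode_idx1_ubyte(idx1_ubyte_file):
    """
    Generic function for parsing an idx1 file
    :param idx1_ubyte_file: path to the idx1 file
    :return: np.array object
    """
    # parsing logic omitted here; the full implementation is in the script further down
    return data
# Parse the idx3 format
def decode_idx3_ubyte(idx3_ubyte_file):
    """
    Generic function for parsing an idx3 file
    :param idx3_ubyte_file: path to the idx3 file
    :return: np.array object
    """
    # parsing logic omitted here; the full implementation is in the script further down
    return data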
The parsing script for the MNIST dataset is as follows:
# encoding: utf-8
"""
@author: monitor1379
@contact: yy4f5da2@hotmail.com
@site: www.monitor1379.com
@version: 1.0
@license: Apache Licence
@file: mnist_decoder.py
@time: 2016/8/16 20:03
Converts the MNIST handwritten digit data files to the bmp image file format.
Dataset download address: http://yann.lecun.com/exdb/mnist
For details of the format conversion, see the official site and the code comments.
========================
Parsing rules for the IDX file format:
========================
THE IDX FILE FORMAT
the IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.
The basic format is
magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data
The magic number is an integer (MSB first). The first 2 bytes are always 0.
The third byte codes the type of the data:
0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)
The 4-th byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices....
The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).
The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
"""
import numpy as np
import struct
import matplotlib.pyplot as plt
# Training set image file
train_images_idx3_ubyte_file = '../../data/mnist/bin/train-images.idx3-ubyte'
# Training set label file
train_labels_idx1_ubyte_file = '../../data/mnist/bin/train-labels.idx1-ubyte'
# Test set image file
test_images_idx3_ubyte_file = '../../data/mnist/bin/t10k-images.idx3-ubyte'
# Test set label file
test_labels_idx1_ubyte_file = '../../data/mnist/bin/t10k-labels.idx1-ubyte'
def decode_idx3_ubyte(idx3_ubyte_file):
    """
    Generic function for parsing an idx3 file
    :param idx3_ubyte_file: path to the idx3 file
    :return: the dataset
    """
    # Read the binary data
    with open(idx3_ubyte_file, 'rb') as f:
        bin_data = f.read()
    # Parse the file header: magic number, number of images, image height, image width
    offset = 0
    fmt_header = '>iiii'
    magic_number, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, number of images: %d, image size: %d*%d' % (magic_number, num_images, num_rows, num_cols))
    # Parse the dataset
    image_size = num_rows * num_cols
    offset += struct.calcsize(fmt_header)
    fmt_image = '>' + str(image_size) + 'B'
    images = np.empty((num_images, num_rows, num_cols))
    for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print('parsed %d images' % (i + 1))
        images[i] = np.array(struct.unpack_from(fmt_image, bin_data, offset)).reshape((num_rows, num_cols))
        offset += struct.calcsize(fmt_image)
    return images
def decode_idx1_ubyte(idx1_ubyte_file):
    """
    Generic function for parsing an idx1 file
    :param idx1_ubyte_file: path to the idx1 file
    :return: the dataset
    """
    # Read the binary data
    with open(idx1_ubyte_file, 'rb') as f:
        bin_data = f.read()
    # Parse the file header: magic number and number of labels
    offset = 0
    fmt_header = '>ii'
    magic_number, num_images = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, number of labels: %d' % (magic_number, num_images))
    # Parse the dataset
    offset += struct.calcsize(fmt_header)
    fmt_image = '>B'
    labels = np.empty(num_images)
    for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print('parsed %d labels' % (i + 1))
        labels[i] = struct.unpack_from(fmt_image, bin_data, offset)[0]
        offset += struct.calcsize(fmt_image)
    return labels
def load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file):
    """
    TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000803(2051) magic number
    0004     32 bit integer  60000            number of images
    0008     32 bit integer  28               number of rows
    0012     32 bit integer  28               number of columns
    0016     unsigned byte   ??               pixel
    0017     unsigned byte   ??               pixel
    ........
    xxxx     unsigned byte   ??               pixel
    Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape n*row*col, where n is the number of images
    """
    return decode_idx3_ubyte(idx_ubyte_file)
def load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file):
    """
    TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000801(2049) magic number (MSB first)
    0004     32 bit integer  60000            number of items
    0008     unsigned byte   ??               label
    0009     unsigned byte   ??               label
    ........
    xxxx     unsigned byte   ??               label
    The labels values are 0 to 9.
    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape (n,), where n is the number of images
    """
    return decode_idx1_ubyte(idx_ubyte_file)
def load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file):
    """
    TEST SET IMAGE FILE (t10k-images-idx3-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000803(2051) magic number
    0004     32 bit integer  10000            number of images
    0008     32 bit integer  28               number of rows
    0012     32 bit integer  28               number of columns
    0016     unsigned byte   ??               pixel
    0017     unsigned byte   ??               pixel
    ........
    xxxx     unsigned byte   ??               pixel
    Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape n*row*col, where n is the number of images
    """
    return decode_idx3_ubyte(idx_ubyte_file)
def load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file):
    """
    TEST SET LABEL FILE (t10k-labels-idx1-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000801(2049) magic number (MSB first)
    0004     32 bit integer  10000            number of items
    0008     unsigned byte   ??               label
    0009     unsigned byte   ??               label
    ........
    xxxx     unsigned byte   ??               label
    The labels values are 0 to 9.
    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape (n,), where n is the number of images
    """
    return decode_idx1_ubyte(idx_ubyte_file)
def run():
    train_images = load_train_images()
    train_labels = load_train_labels()
    # test_images = load_test_images()
    # test_labels = load_test_labels()
    # Display the first ten samples and their labels to check that they were read correctly
    for i in range(10):
        print(train_labels[i])
        plt.imshow(train_images[i], cmap='gray')
        plt.show()
    print('done')

if __name__ == '__main__':
    run()
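
The script above unpacks one image at a time with `struct`. As a possible alternative, the sketch below reads the whole data section in one call with `np.frombuffer`. It is only a sketch under some assumptions: `decode_idx_ubyte` is a hypothetical name of mine, it handles only the unsigned-byte type code (0x08) that the MNIST files use, and it returns a read-only `uint8` array rather than the float array the script produces. The sample bytes are synthetic, built in memory so the example runs without the dataset.

```python
import struct

import numpy as np


def decode_idx_ubyte(buf):
    """Hypothetical helper: decode an unsigned-byte IDX byte string into a uint8 array."""
    # Magic number: two zero bytes, the type code, and the number of dimensions.
    zero, type_code, num_dims = struct.unpack_from('>HBB', buf, 0)
    if zero != 0 or type_code != 0x08:
        raise ValueError('expected an unsigned-byte IDX file')
    # Dimension sizes are 4-byte big-endian integers following the magic number.
    dims = struct.unpack_from('>%dI' % num_dims, buf, 4)
    offset = 4 + 4 * num_dims
    count = int(np.prod(dims))
    # Interpret the remaining bytes directly; the result is a read-only view of buf.
    data = np.frombuffer(buf, dtype=np.uint8, count=count, offset=offset)
    # IDX data is stored in C order, so a plain reshape restores the dimensions.
    return data.reshape(dims)


# Synthetic idx3 content: 2 "images" of 2 rows x 3 columns holding the bytes 0..11.
sample = struct.pack('>IIII', 0x00000803, 2, 2, 3) + bytes(bytearray(range(12)))
images = decode_idx_ubyte(sample)
print(images.shape)  # (2, 2, 3)
```

For a real file, the same function could be called as `decode_idx_ubyte(open(path, 'rb').read())`; call `.copy()` or `.astype(float)` on the result if a writable or floating-point array is needed.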