Reading large binary data with numpy
2021.01.15 LHQ
When reading large binary data (GrADS-style files containing regularly gridded data), I wanted to read only part of the file, directly from disk. A search online turned up two ways to do this in numpy:
- numpy.fromfile (file, dtype=float, count=-1, sep='', offset=0)
- numpy.memmap
numpy.fromfile
numpy.fromfile (file, dtype=float, count=-1, sep='', offset=0)
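- file: file name (an open file object also works)
- dtype: data type of the values (GrADS data is generally float32; other settings are still being explored)
- count: number of items to read. -1 reads the whole file. Note that this is a number of items, not a number of bytes, e.g. how many floats. A single horizontal level of a GrADS variable has nlat*nlon grid points, so count is nlat*nlon.
- sep: separator between items when the file is a text file. An empty separator ("") means the file is treated as binary. Spaces (" ") inside a separator match zero or more whitespace characters; a separator made up only of spaces must match at least one whitespace character.
- offset: offset in bytes from the file's current position. Defaults to 0 and is only allowed for binary files. Because it is relative to the current position, pass a freshly opened file if the offset should be counted from the start of the file.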
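The difference between count (items) and offset (bytes) is easy to get wrong, so here is a minimal sketch on a small synthetic file. The grid sizes, the file name demo.bin and the variable names are made up for illustration, and the offset argument assumes NumPy 1.17 or later.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: 3 levels of a 4 x 5 grid, stored as float32 with no header.
nlev, nlat, nlon = 3, 4, 5
values = np.arange(nlev * nlat * nlon, dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "demo.bin")  # throwaway file
values.tofile(path)

itemsize = np.dtype(np.float32).itemsize  # 4 bytes per value
level = 1                                 # zero-based level to read

# count is a number of items; offset is a number of bytes.
slab = np.fromfile(path, dtype=np.float32,
                   count=nlat * nlon,
                   offset=level * nlat * nlon * itemsize)
slab = slab.reshape(nlat, nlon)  # fromfile always returns a 1-D array

print(slab[0])  # first row of level 1: [20. 21. 22. 23. 24.]
```

Only nlat*nlon values are read here, regardless of how large the file is.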
Test code:
import time
import numpy as np

fdir = "./database/postvar202006020004400"
nlat = 501
nlon = 751

# Read one horizontal level, skipping the first level
# (nlat*nlon float32 values, 4 bytes each).
with open(fdir, 'rb') as f:
    ts1 = time.time()
    data = np.fromfile(f, dtype='float32', count=nlat*nlon, offset=nlat*nlon*4)
    te1 = time.time()
t1 = te1 - ts1
print('Elapsed:', t1)

# Read the whole file for comparison.
with open(fdir, 'rb') as f:
    ts1 = time.time()
    data = np.fromfile(f, dtype='float32')
    te1 = time.time()
t1 = te1 - ts1
print('Elapsed:', t1)

Elapsed: 0.03650185585021973
Elapsed: 1.6820800304412842
Reading the corresponding variable with xgrads instead:
# gv and ctlname are defined outside this snippet.
ts = time.time()
ds = gv.Grads_data(ctlname)
ds = ds.getsinglelevel("u", 0, 24)
te = time.time()
print("Elapsed: ", te - ts)

Elapsed:  0.08249235153198242
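Keep in mind that xgrads returns a dataset object with coordinates and metadata (an xarray.Dataset in xgrads), while numpy.fromfile returns only the raw values. Reading one level took about 0.04 s with numpy.fromfile and about 0.08 s through xgrads, so for this use case the two approaches are close in efficiency.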
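To fetch one level of one variable with numpy.fromfile, the byte offset has to be worked out by hand. A minimal sketch, assuming a headerless file laid out the way GrADS gridded binary data usually is: for each time step, each variable in turn, and for each variable its levels, with every level holding nlat*nlon float32 values. The helper name level_offset and the small synthetic file are made up for illustration; the layout of a real file comes from its .ctl descriptor.

```python
import os
import tempfile

import numpy as np


def level_offset(levels_per_var, var_index, level, time_index, nlat, nlon, itemsize=4):
    """Byte offset of one horizontal slab (hypothetical helper).

    Assumes time-major order, then variables, then levels, with no record headers.
    """
    slabs_per_time = sum(levels_per_var)
    slabs_before = (time_index * slabs_per_time
                    + sum(levels_per_var[:var_index])
                    + level)
    return slabs_before * nlat * nlon * itemsize


# Synthetic file: 2 time steps, 3 variables with 3, 1 and 2 levels, 2 x 3 grid.
levels_per_var = [3, 1, 2]
nlat, nlon, ntime = 2, 3, 2
total = ntime * sum(levels_per_var) * nlat * nlon
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
np.arange(total, dtype=np.float32).tofile(path)

# Third variable (index 2), second level (index 1), second time step (index 1).
offset = level_offset(levels_per_var, 2, 1, 1, nlat, nlon)
slab = np.fromfile(path, dtype=np.float32, count=nlat * nlon, offset=offset)
slab = slab.reshape(nlat, nlon)
print(offset, slab[0, 0])
```

The result is still a bare array; attaching latitude, longitude and time coordinates is the part that a reader such as xgrads does for you.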
numpy.memmap
This one looks a bit more complicated. The official documentation says:
Create a memory-map to an array stored in a binary file on disk.
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy’s memmap’s are array-like objects. This differs from Python’s mmap module, which uses file-like objects.
This subclass of ndarray has some unpleasant interactions with some operations, because it doesn’t quite fit properly as a subclass. An alternative to using this subclass is to create the mmap object yourself, then create an ndarray with ndarray.__new__ directly, passing the object created in its ‘buffer=’ parameter.
This class may at some point be turned into a factory function which returns a view into an mmap buffer.
Delete the memmap instance to close the memmap file.
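The alternative mentioned in the documentation (create the mmap object yourself and hand it to ndarray as its buffer) can be sketched as follows. This is only an illustration on a small synthetic file; the grid sizes, the file name demo.bin and the variable names are assumptions, not part of the original test.

```python
import mmap
import os
import tempfile

import numpy as np

# Hypothetical data: 3 levels of a 4 x 5 grid, stored as float32 with no header.
nlev, nlat, nlon = 3, 4, 5
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
np.arange(nlev * nlat * nlon, dtype=np.float32).tofile(path)

level = 2
offset = level * nlat * nlon * np.dtype(np.float32).itemsize

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# A plain ndarray (not a memmap subclass) viewing the mapped bytes.
view = np.ndarray(shape=(nlat, nlon), dtype=np.float32, buffer=mm, offset=offset)
corner = float(view[0, 0])  # touching an element reads it through the mapping
print(corner)  # 40.0

# Release the array before closing the mapping it points into.
del view
mm.close()
```

Because the result is an ordinary ndarray, it sidesteps the subclass quirks the documentation warns about, at the cost of managing the mmap object's lifetime yourself.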
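Test script:
#coding=utf-8
import numpy as np
import time

fdir = "./database/postvar202006020004400"
nlat = 501
nlon = 751

# Read the second horizontal level with fromfile.
with open(fdir, 'rb') as f:
    ts1 = time.time()
    data = np.fromfile(f, dtype='float32', count=nlat*nlon, offset=nlat*nlon*4)
    te1 = time.time()
t1 = te1 - ts1
print('Elapsed:', t1)

# Map the same level with memmap; mode='r' opens the file read-only
# (the default mode is 'r+', which needs write permission).
ts1 = time.time()
data = np.memmap(fdir, dtype=np.float32, mode='r', offset=nlat*nlon*4, shape=(nlat, nlon))
te1 = time.time()
t1 = te1 - ts1
print('Elapsed:', t1)
del data  # deleting the memmap instance closes the mapping

Elapsed: 0.0011439323425292969
Elapsed: 0.0006759166717529297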
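memmap does not read the data into memory up front, so creating it is indeed faster. Note that it only sets up the mapping: the values are read from disk when they are actually accessed, so the second timing does not include the cost of reading them.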
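The lazy behaviour can be seen on a small synthetic file. This is a sketch under assumed names and sizes (demo.bin, a 3 x 4 x 5 grid), not a benchmark: slicing the memmap gives another memmap view, and the values only become an ordinary in-memory array once they are copied.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: 3 levels of a 4 x 5 grid, stored as float32 with no header.
nlev, nlat, nlon = 3, 4, 5
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
np.arange(nlev * nlat * nlon, dtype=np.float32).tofile(path)

# Map the whole file as (level, lat, lon); mode="r" keeps it read-only.
mm = np.memmap(path, dtype=np.float32, mode="r", shape=(nlev, nlat, nlon))

lazy = mm[1]              # a view into the mapping; no bulk read has happened
loaded = np.array(mm[1])  # copying pulls the level into an in-memory ndarray

print(type(lazy).__name__, type(loaded).__name__)

# Drop the memmap objects to close the mapping; the copy stays usable.
del lazy, mm
print(loaded[0, 0])
```

Mapping the full file with a 3-D shape also replaces the manual offset arithmetic with plain indexing, which is convenient when several levels are needed.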