Shame on me! I failed to keep my promise of writing a diary entry every day. I went wild over the National Day / Mid-Autumn holiday, and a scholarship application form took me seven whole days; I'm speechless at myself. In any case, I climbed Mount Tai and visited the Confucius Temple on the mountain, so you could call it a pilgrimage (?) of sorts! One of these days I'll sit down and write a proper travel essay; today, just some study notes on Python for financial big-data analysis.
Python I/O: a comparison of several approaches
When I first started doing Tianchi competitions last year, I noticed how slow Python's I/O was (still faster than R, mind you). That was partly because I didn't know Python or the pandas library well, and partly a limitation of the hardware itself. Later I ran into some genuinely large data, such as the Tencent product-recommendation problem, which inevitably demanded more memory and more frequent I/O, and only then did I think about optimising performance. My solution back then was frankly crude: use SAS for the initial data processing, then load the result into Python with the most rudimentary I/O, because SAS is fast and easy on memory! Looking back at myself then, I hardly know whether to laugh or cry.
In fact, Python could have handled all of that work perfectly well; my skills were just too poor and my knowledge too shallow. Thought without learning is perilous. Below, I walk through the various kinds of Python I/O with examples and a performance comparison.
We use the following data for writing out and reading back:
import numpy as np
import pandas as pd
n = 1000000
c1 = pd.date_range(start='2000-01-01', periods=n, freq='2min')              # 2-minute timestamps
ci = pd.DataFrame(np.random.randint(0, 100, (n, 2)), c1).add_prefix('Int')  # columns Int0, Int1
cf = pd.DataFrame(np.random.randn(n, 2), c1).add_prefix('Float')            # columns Float0, Float1
c = pd.concat([ci, cf], axis=1).reset_index().rename(columns={'index': 'Date'})
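Before timing anything it helps to see what c actually looks like; a quick sanity check of the frame we will be exporting:

print(c.shape)    # (1000000, 5)
print(c.dtypes)   # Date: datetime64[ns]; Int0/Int1: int64; Float0/Float1: float64
c.head(3)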
For later convenience, we first build two ndarray versions of the DataFrame c. An ndarray can be turned back into a DataFrame with pd.DataFrame(), but that costs an extra step and extra resources, so it is somewhat awkward; still, here are the ndarray versions.
# Version 1: structured array with a true datetime64 Date field
dtimes = np.arange('2000-01-01 00:00:00','2003-10-20 21:20:00',dtype='datetime64[2m]')
dty = np.dtype([('Date','datetime64[m]'),('Int0','int64'),('Int1','int64'),('Float0','float64'),('Float1','float64')])
d = np.zeros(len(dtimes),dtype=dty)
for i in ['Int0','Int1','Float0','Float1']: d[i] = c[i]
d['Date'] = dtimes
# Version 2: same data, but Date stored as a 19-byte string
# (for PyTables below, which does not accept datetime64 columns)
dty = np.dtype([('Date','S19'),('Int0','int64'),('Int1','int64'),('Float0','float64'),('Float1','float64')])
dl = np.zeros(len(dtimes),dtype=dty)
for i in ['Int0','Int1','Float0','Float1']: dl[i] = c[i]
dl['Date'] = dtimes
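Going the other way, from structured array back to DataFrame, is exactly the extra step mentioned above; a minimal sketch using the d built here (depending on your pandas version, the datetime resolution may need normalising):

d_back = pd.DataFrame(d)                                  # each dtype field becomes a column
d_back['Date'] = d_back['Date'].astype('datetime64[ns]')  # normalise the datetime resolution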
The pickle module
pickle is part of Python's standard library. Its appeal is that it can persist most Python objects to disk. Here is a simple example that stores the DataFrame c directly. Since the object you read back is exactly the object you wrote out, there are no intermediate steps and no pre- or post-processing; it is extremely convenient.
import pickle
pkl_file = open('data.pkl','wb')
%time pickle.dump(c,pkl_file)
pkl_file.close()
pkl_file = open('data.pkl','rb')
%time b = pickle.load(pkl_file)
pkl_file.close()
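The same round trip reads a little more safely with context managers, which close the file handles even if dump or load raises:

with open('data.pkl', 'wb') as pkl_file:
    pickle.dump(c, pkl_file)
with open('data.pkl', 'rb') as pkl_file:
    b = pickle.load(pkl_file)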
Reading and writing text files
Here we read and write with .write and .readlines, fixing the float output format at 17 decimal places, which is enough to keep full float64 precision at this magnitude. The approach is fiddly and slow, but since the output is .csv-format text that any text editor can open, it is highly portable, so it still has its place.
csv_file = open('data.txt','wb')
csv_file.write(b'Date,Int0,Int1,Float0,Float1\n')
ca = c.values
%time for x in ca: csv_file.write(('%s,%g,%g,%.17f,%.17f\n'%tuple(x)).encode())
csv_file.close()
csv_file = open('data.txt','rb')
%time ba = csv_file.readlines()
csv_file.close()
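Note that .readlines only hands back raw bytes, so actually using the data takes another decode-and-split pass, which is a large part of why this method scores worst on code complexity below. A minimal sketch of that extra step:

header, rows = ba[0], ba[1:]     # ba[0] is the header line written above
parsed = [line.decode().rstrip('\n').split(',') for line in rows]
# every field is still a string here: [date, int, int, float, float]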
SQL database
This uses Python's built-in sqlite3 database for the queries. Reading back can also go through the pandas.io.sql module. Writing is slow, and reading is very slow as well.
import sqlite3 as sq3
query = 'create table data (Date date, Int0 int, Int1 int, Float0 float, Float1 float)'
!rm -rf data.db
con = sq3.connect('data.db')
con.execute(query)
con.commit()
%time for x in ca: con.execute('insert into data values(?,?,?,?,?)',(str(x[0]),x[1],x[2],x[3],x[4]))
con.commit()
pointer = con.execute('select * from data')
%time da = pointer.fetchall()
da = np.array(da)
con.close()
import pandas.io.sql as pds
con = sq3.connect('data.db')
%time data = pds.read_sql('select * from data',con)
con.close()
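Most of the write cost above is the Python-level loop over con.execute. The standard sqlite3 method executemany batches the same inserts and is usually much faster; it was not part of the timed comparison, but for reference (assuming a fresh data.db with the same table created):

rows = [(str(x[0]), int(x[1]), int(x[2]), float(x[3]), float(x[4])) for x in ca]
con = sq3.connect('data.db')
%time con.executemany('insert into data values(?,?,?,?,?)', rows)
con.commit()
con.close()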
Using pandas' HDF5 format
Faster and simpler than the pickle module; the drawback is that it can only store DataFrame-style data.
h5s = pd.HDFStore('data.h5s','w')
%time h5s['data'] = c
h5s.close()
h5s = pd.HDFStore('data.h5s','r')
%time f = h5s['data']
h5s.close()
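The same store can also be reached through the one-line wrappers DataFrame.to_hdf and pd.read_hdf, which open and close the HDFStore behind the scenes:

%time c.to_hdf('data.h5s', 'data', mode='w')
%time f = pd.read_hdf('data.h5s', 'data')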
Using pandas' to_csv and to_excel
I used to know only .to_csv. Constrained by the text format, it is about as slow as .write. For .to_excel only 1/10 of the data was exported in the test, because it is simply too slow.
%time c.to_csv('data.csv',index=False)
%time b = pd.read_csv('data.csv')
%time c.iloc[:int(n/10),:].to_excel('data.xlsx',index=False)
%time g = pd.read_excel('data.xlsx')
Using numpy's save
numpy's high performance is plain for all to see. The pity is that what gets written and read is an ndarray, which needs converting, a bit of a hassle. Better than pickle, but not as good as pandas' HDF5 storage.
%time np.save('data',d)
%time ea = np.load('data.npy')
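One more trick worth knowing: np.load can memory-map the .npy file instead of reading it eagerly, so opening even a huge array is nearly instant and rows are pulled from disk only when accessed:

em = np.load('data.npy', mmap_mode='r')   # near-instant, lazy load
em[:5]                                    # only now are these rows read from disk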
Using PyTables
This is Python joined directly with HDF5, and it is fast: at complevel=0 it is neck and neck with pandas' HDF5 store. Again, the pity is that what gets written and read is an ndarray, which needs converting, a bit of a hassle. There are also data-analysis operations built on PyTables; apart from the small example after the code below, I won't go into them here.
import tables as tb
h5 = tb.open_file('data.h5','w')
filters = tb.Filters(complevel=0)
%time tab = h5.create_table('/','data',dl,title='data',expectedrows=n,filters=filters)
h5.close()
h5 = tb.open_file('data.h5','r')
%time h = h5.get_node('/','data').read()
h = pd.DataFrame(h)                             # structured array -> DataFrame
h['Date'] = h['Date'].astype('datetime64[m]')   # byte strings back to datetimes
h5.close()
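As one taste of those PyTables-side operations: a table can be queried in place with read_where and a numexpr-style condition, without reading the whole array first:

h5 = tb.open_file('data.h5', 'r')
sel = h5.get_node('/', 'data').read_where('Int0 > 50')   # out-of-core selection
h5.close()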
A comparison of the I/O methods
Here timing is done with the time module, repeating each operation several times and taking the median. Four metrics are reported: byte (file size on disk, converted to MB in the final table), easy (code complexity), read (read time) and write (write time). easy takes only the values 1, 2 and 3, scored as follows: 1 = no format conversion and no loop; 2 = format conversion needed, no loop; 3 = format conversion needed, plus a loop.
We drop the far-too-slow .to_excel, which leaves seven I/O methods. Since real work puts a premium on concise code and read performance, the seven methods are ranked on those criteria; the ranked table is produced by the sort_values call at the end of the code attached below.
As the ranking shows, the pickle module wins clearly. If the ndarray format is convenient for your workflow, np.save is also a good choice. And if you need text-format I/O, use .write and .readlines rather than pandas' .to_csv.
The comparison code is attached below.
import matplotlib.pyplot as plt
import re      # used below to parse the file size out of the dir listing
import time
tl = {}
!mkdir tryio
cd tryio
# pickle
tl[('write','pickle')] = []
tl[('read','pickle')] = []
import pickle
for i in range(50):
    pkl_file = open('data.pkl','wb')
    tt = time.time()
    pickle.dump(c,pkl_file)
    tl[('write','pickle')].append(time.time()-tt)
    pkl_file.close()
    pkl_file = open('data.pkl','rb')
    tt = time.time()
    b = pickle.load(pkl_file)
    tl[('read','pickle')].append(time.time()-tt)
    pkl_file.close()
plt.plot(tl[('write','pickle')])
plt.plot(tl[('read','pickle')])
tmp = !powershell dir
tl[('byte','pickle')] = int(re.findall(r' ([0-9]+) ',' '.join(tmp))[0])
tl[('easy','pickle')] = 1
!rm -rf *.*
# txt
tl[('write','txt')] = []
tl[('read','txt')] = []
for i in range(3):
    csv_file = open('data.txt','wb')
    csv_file.write(b'Date,Int0,Int1,Float0,Float1\n')
    ca = c.values
    tt = time.time()
    for x in ca: csv_file.write(('%s,%g,%g,%.17f,%.17f\n'%tuple(x)).encode())
    tl[('write','txt')].append(time.time()-tt)
    csv_file.close()
    csv_file = open('data.txt','rb')
    tt = time.time()
    b = csv_file.readlines()
    tl[('read','txt')].append(time.time()-tt)
    csv_file.close()
plt.plot(tl[('write','txt')])
plt.plot(tl[('read','txt')])
tmp = !powershell dir
tl[('byte','txt')] = int(re.findall(r' ([0-9]+) ',' '.join(tmp))[0])
tl[('easy','txt')] = 3
!rm -rf *.*
# sql
tl[('write','sql')] = []
tl[('read','sql')] = []
import sqlite3 as sq3
for i in range(3):
    query = 'create table data (Date date, Int0 int, Int1 int, Float0 float, Float1 float)'
    !rm -rf data.db
    con = sq3.connect('data.db')
    con.execute(query)
    con.commit()
    tt = time.time()
    for x in ca: con.execute('insert into data values(?,?,?,?,?)',(str(x[0]),x[1],x[2],x[3],x[4]))
    tl[('write','sql')].append(time.time()-tt)
    con.commit()
    pointer = con.execute('select * from data')
    tt = time.time()
    da = pointer.fetchall()
    tl[('read','sql')].append(time.time()-tt)
    da = np.array(da)
    con.close()
# import pandas.io.sql as pds
# con = sq3.connect('data.db')
# %time data = pds.read_sql('select * from data',con)
# con.close()
plt.plot(tl[('write','sql')])
plt.plot(tl[('read','sql')])
tmp = !powershell dir
tl[('byte','sql')] = int(re.findall(r' ([0-9]+) ',' '.join(tmp))[0])
tl[('easy','sql')] = 3
!rm -rf *.*
# pdHDF5
tl[('write','pdHDF5')] = []
tl[('read','pdHDF5')] = []
for i in range(50):
    h5s = pd.HDFStore('data.h5s','w')
    tt = time.time()
    h5s['data'] = c
    tl[('write','pdHDF5')].append(time.time()-tt)
    h5s.close()
    h5s = pd.HDFStore('data.h5s','r')
    tt = time.time()
    f = h5s['data']
    tl[('read','pdHDF5')].append(time.time()-tt)
    h5s.close()
plt.plot(tl[('write','pdHDF5')])
plt.plot(tl[('read','pdHDF5')])
tmp = !powershell dir
tl[('byte','pdHDF5')] = int(re.findall(r' ([0-9]+) ',' '.join(tmp))[0])
tl[('easy','pdHDF5')] = 1
!rm -rf *.*
# pdcsv
tl[('write','pdcsv')] = []
tl[('read','pdcsv')] = []
for i in range(3):
    tt = time.time()
    c.to_csv('data.csv',index=False)
    tl[('write','pdcsv')].append(time.time()-tt)
    tt = time.time()
    b = pd.read_csv('data.csv')
    tl[('read','pdcsv')].append(time.time()-tt)
plt.plot(tl[('write','pdcsv')])
plt.plot(tl[('read','pdcsv')])
tmp = !powershell dir
tl[('byte','pdcsv')] = int(re.findall(r' ([0-9]+) ',' '.join(tmp))[0])
tl[('easy','pdcsv')] = 1
!rm -rf *.*
# pdexcel
tl[('write','pdexcel')] = []
tl[('read','pdexcel')] = []
for i in range(3):
    tt = time.time()
    c.iloc[:int(n/10),:].to_excel('data.xlsx',index=False)
    tl[('write','pdexcel')].append(time.time()-tt)
    tt = time.time()
    g = pd.read_excel('data.xlsx')
    tl[('read','pdexcel')].append(time.time()-tt)
plt.plot(tl[('write','pdexcel')])
plt.plot(tl[('read','pdexcel')])
tmp = !powershell dir
tl[('byte','pdexcel')] = int(re.findall(r' ([0-9]+) ',' '.join(tmp))[0])
tl[('easy','pdexcel')] = 1
!rm -rf *.*
# npsave
tl[('write','npsave')] = []
tl[('read','npsave')] = []
for i in range(50):
    tt = time.time()
    np.save('data',d)
    tl[('write','npsave')].append(time.time()-tt)
    tt = time.time()
    ea = np.load('data.npy')
    tl[('read','npsave')].append(time.time()-tt)
plt.plot(tl[('write','npsave')])
plt.plot(tl[('read','npsave')])
tmp = !powershell dir
tl[('byte','npsave')] = int(re.findall(r' ([0-9]+) ',' '.join(tmp))[0])
tl[('easy','npsave')] = 2
!rm -rf *.*
# pytable
tl[('write','pytable')] = []
tl[('read','pytable')] = []
import tables as tb
for i in range(50):
    h5 = tb.open_file('data.h5','w')
    filters = tb.Filters(complevel=0)
    tt = time.time()
    tab = h5.create_table('/','data',dl,title='data',expectedrows=n,filters=filters)
    tl[('write','pytable')].append(time.time()-tt)
    h5.close()
    h5 = tb.open_file('data.h5','r')
    tt = time.time()
    h = h5.get_node('/','data').read()
    tl[('read','pytable')].append(time.time()-tt)
    h = pd.DataFrame(h)
    h['Date'] = h['Date'].astype('datetime64[m]')
    h5.close()
plt.plot(tl[('write','pytable')])
plt.plot(tl[('read','pytable')])
tmp = !powershell dir
tl[('byte','pytable')] = int(re.findall(r' ([0-9]+) ',' '.join(tmp))[0])
tl[('easy','pytable')] = 2
!rm -rf *.*
# comparison
ptl = pd.Series(tl)
ptl.loc[['write','read']] = ptl[['write','read']].apply(np.median)
ptl.loc[['byte']] = ptl[['byte']]/1000**2
ptl = ptl.unstack(level=0)
ptl.drop('pdexcel',inplace=True)
ptl.sort_values(by=['easy','read'])
cd ..
!rm -rf tryio