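This article uses the pandas to_pickle() method to compress a file and compares the resulting formats by file size, write speed, and read speed, to show which compression format works best.
Anyone who has learned the basics of Python will know the pickle module, which serializes and deserializes data.
What is serializing data good for? One important use is convenient storage.
Serialization converts an in-memory object into a binary byte stream while preserving its data type. For example, if something comes up in the middle of processing data and you have to leave, you can serialize the data to disk as it is. Whatever type the data has at that point is the type that gets saved, and it is the same type when you load it again, so you do not have to redo the processing from scratch.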
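To make the point about types concrete, here is a minimal sketch using only the standard library pickle module. The record and its field names are made up for illustration; the same round trip applies to a DataFrame.

```python
import pickle
from datetime import date

# Illustrative record; the field names are made up for this example.
record = {
    "title": "example",
    "published": date(2020, 1, 1),
    "tags": {"news", "tech"},
    "scores": (1, 2.5),
}

# Serialize to a byte stream, then restore it.
payload = pickle.dumps(record)
restored = pickle.loads(payload)

# Compare the value types before and after the round trip.
types_before = {key: type(value) for key, value in record.items()}
types_after = {key: type(value) for key, value in restored.items()}
print(restored == record)
print(types_before == types_after)
```

A text format such as CSV would hand the date, the set, and the tuple back as plain strings, which is the extra processing that pickling avoids.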
Import libraries and data
import pandas as pd
import numpy as np
import os
import re
data = pd.read_csv('qdaily_infos.csv',header=0)
Check the size of the local file (in MB)
os.stat('qdaily_infos.csv').st_size/(1024 * 1024)
out:
19.07578182220459
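Compressing files with the to_pickle() method
read_pickle(), DataFrame.to_pickle(), and Series.to_pickle() can read and write compressed pickle files. The gzip, bz2, and xz compression types are supported for both reading and writing. According to the pandas documentation at the time of writing, the zip format supports reading only, and the archive must contain exactly one data file. The compression type can be passed explicitly or inferred from the file extension: with 'infer', the type is chosen when the file name ends in ".gz", ".bz2", ".zip", or ".xz".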
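The suffix rule can be sketched as a small lookup. The helper below, infer_compression, is hypothetical and is not pandas' own code; it only mirrors the four suffixes listed above, and it assumes that any other suffix means no compression.

```python
# Hypothetical helper, not pandas' own code: it only mirrors the suffix
# rule described above for compression='infer'.
SUFFIX_TO_COMPRESSION = {
    ".gz": "gzip",
    ".bz2": "bz2",
    ".zip": "zip",
    ".xz": "xz",
}


def infer_compression(path):
    """Return the compression name implied by the file suffix, or None."""
    for suffix, name in SUFFIX_TO_COMPRESSION.items():
        if path.endswith(suffix):
            return name
    return None


candidates = ["qdaily.pkl.gz", "qdaily.pkl.bz2", "qdaily.pkl.xz",
              "qdaily.pkl.gzip", "qdaily_infos.pkl"]
for path in candidates:
    print(path, "->", infer_compression(path))
```

Under this rule a name ending in ".gzip" matches none of the listed suffixes, so it is worth spelling the extension as ".gz" when relying on inference.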
Save directly as a pickle file
data.to_pickle('qdaily_infos.pkl')
os.stat('qdaily_infos.pkl').st_size/(1024 * 1024)
out:
11.89148235321045
As you can see, it is about 7.2 MB smaller than the original file.
Specify the compression type with the compression parameter
save_path_compress = ['qdaily_gzip.pkl.compress','qdaily_bz2.pkl.compress','qdaily_xz.pkl.compress']
compression = ['gzip','bz2','xz']

for path,compress in zip(save_path_compress,compression):
    data.to_pickle(path,compression=compress)
    print('the size of {0} is : {1}'.format(path,os.stat(path).st_size/(1024 * 1024)))
out:
the size of qdaily_gzip.pkl.compress is : 4.289911270141602
the size of qdaily_bz2.pkl.compress is : 3.002878189086914
the size of qdaily_xz.pkl.compress is : 2.890697479248047
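Conceptually, passing compression= means the object is pickled and the resulting bytes go through the matching codec. The sketch below reproduces that idea with the standard library only; it is not how pandas is implemented internally, and the sample rows are made up to stand in for the DataFrame.

```python
import bz2
import gzip
import lzma
import pickle

# Made-up, repetitive sample data standing in for the DataFrame.
rows = [{"id": i, "title": "headline number %d" % i, "category": "news"}
        for i in range(5000)]
raw = pickle.dumps(rows)

compressors = {"gzip": gzip.compress, "bz2": bz2.compress, "xz": lzma.compress}
decompressors = {"gzip": gzip.decompress, "bz2": bz2.decompress, "xz": lzma.decompress}

sizes = {"raw": len(raw)}
for name, compress in compressors.items():
    blob = compress(raw)
    sizes[name] = len(blob)
    # Decompressing and unpickling must give back the original rows.
    assert pickle.loads(decompressors[name](blob)) == rows

for name, size in sizes.items():
    print("{0}: {1} bytes".format(name, size))
```

The exact sizes depend on the data, so the ranking seen for qdaily_infos.csv will not necessarily carry over to other files.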
Define the compression type through the file suffix
save_path_infer = ['qdaily.pkl.gzip','qdaily.pkl.bz2','qdaily.pkl.xz']

for path in save_path_infer:
    data.to_pickle(path,compression='infer')
    print('The size of {0} is : {1}'.format(path,os.stat(path).st_size/(1024 * 1024)))
out:
The size of qdaily.pkl.gzip is : 11.89148235321045
The size of qdaily.pkl.bz2 is : 3.002878189086914
The size of qdaily.pkl.xz is : 2.890697479248047
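Note that qdaily.pkl.gzip comes out at exactly the size of the uncompressed pickle, which suggests that the ".gzip" suffix was not recognized and no compression was applied. One way to check what a file really contains is to look at its first few bytes. The sketch below uses the well-known signatures of the gzip, bz2, and xz formats; the helper name detect_compression and the demo file names are placeholders.

```python
import bz2
import gzip
import lzma
import os
import pickle
import tempfile

# Leading bytes that identify each container format.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"\xfd7zXZ\x00": "xz",
}


def detect_compression(path):
    """Return 'gzip', 'bz2', 'xz', or None based on the file's first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return None


# Demo on throwaway files; the file names are placeholders.
payload = pickle.dumps(list(range(1000)))
writers = {
    "plain.pkl": lambda data: data,
    "a.pkl.gz": gzip.compress,
    "a.pkl.bz2": bz2.compress,
    "a.pkl.xz": lzma.compress,
}
detected = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, encode in writers.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(encode(payload))
        detected[name] = detect_compression(path)
print(detected)
```

Running detect_compression on qdaily.pkl.gzip would be expected to return None if the file is in fact a plain pickle.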
Overall comparison
path_all = save_path_compress + save_path_infer
path_all.insert(0,'qdaily_infos.pkl')
path_all.insert(0,'qdaily_infos.csv')
Local file size comparison
for path in path_all:
    print('The size of {0} is : {1}'.format(path,os.stat(path).st_size/(1024 * 1024)))
out:
The size of qdaily_infos.csv is : 19.07578182220459
The size of qdaily_infos.pkl is : 11.89148235321045
The size of qdaily_gzip.pkl.compress is : 4.289911270141602
The size of qdaily_bz2.pkl.compress is : 3.002878189086914
The size of qdaily_xz.pkl.compress is : 2.890697479248047
The size of qdaily.pkl.gzip is : 11.89148235321045
The size of qdaily.pkl.bz2 is : 3.002878189086914
The size of qdaily.pkl.xz is : 2.890697479248047
Write time comparison
# %time is an IPython magic, so run this in IPython or Jupyter
for path in path_all:
    if path.endswith('.csv'):
        %time data.to_csv(path)
    elif path.endswith('.pkl'):
        %time data.to_pickle(path)
    elif path.find('_') > 0:
        # names like qdaily_gzip.pkl.compress carry the codec between '_' and '.'
        compress = re.findall(r'_(.*?)\.',path)[0]
        %time data.to_pickle(path,compression=compress)
    else:
        %time data.to_pickle(path,compression='infer')
out:
Wall time: 502 ms
Wall time: 62.8 ms
Wall time: 3.31 s
Wall time: 1.04 s
Wall time: 10.9 s
Wall time: 56.8 ms
Wall time: 1.02 s
Wall time: 10.7 s
Read time comparison
for path in path_all:
    if path.endswith('.csv'):
        %time pd.read_csv(path)
    elif path.endswith('.pkl'):
        %time pd.read_pickle(path)
    elif path.find('_') > 0:
        # names like qdaily_gzip.pkl.compress carry the codec between '_' and '.'
        compress = re.findall(r'_(.*?)\.',path)[0]
        %time pd.read_pickle(path,compression=compress)
    else:
        %time pd.read_pickle(path,compression='infer')
out:
Wall time: 369 ms
Wall time: 66.9 ms
Wall time: 122 ms
Wall time: 431 ms
Wall time: 401 ms
Wall time: 60.8 ms
Wall time: 450 ms
Wall time: 383 ms
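The %time magic only works inside IPython or Jupyter. In a plain Python script, a similar measurement can be sketched with time.perf_counter. The example below uses only the standard library and made-up sample rows instead of the DataFrame, and the helper names (timed, write_pickle, read_pickle) are hypothetical, so treat it as a pattern and not as a reproduction of the numbers above.

```python
import bz2
import gzip
import lzma
import os
import pickle
import tempfile
import time


def timed(func, *args):
    """Run func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def write_pickle(path, obj, opener):
    with opener(path, "wb") as handle:
        pickle.dump(obj, handle)


def read_pickle(path, opener):
    with opener(path, "rb") as handle:
        return pickle.load(handle)


# Made-up sample data; real timings depend on the data and the machine.
rows = [("row %d" % i, i, i * 0.5) for i in range(20000)]
openers = {"none": open, "gzip": gzip.open, "bz2": bz2.open, "xz": lzma.open}

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, opener in openers.items():
        path = os.path.join(tmp, "sample_{0}.pkl".format(name))
        _, write_s = timed(write_pickle, path, rows, opener)
        loaded, read_s = timed(read_pickle, path, opener)
        assert loaded == rows
        timings[name] = (write_s, read_s)

for name, (write_s, read_s) in timings.items():
    print("{0}: write {1:.4f} s, read {2:.4f} s".format(name, write_s, read_s))
```

A single run is noisy; repeating each measurement a few times and taking the minimum or median gives a steadier comparison.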
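Conclusion
The xz format compresses best, at only about 15% of the original CSV size, but it is the slowest to write (about 10.9 s).
The file named qdaily.pkl.gzip has the fastest write (56.8 ms) and read (60.8 ms) times, but its size, 62% of the original, is identical to that of the plain pickle: 'infer' looks for the ".gz" suffix and does not recognize ".gzip", so this file was not actually compressed. With compression='gzip' passed explicitly, the file is 4.29 MB (about 22.5% of the original), with a 3.31 s write and a 122 ms read, the fastest read among the compressed formats.
Weighing compression ratio, write time, and read time together, .bz2 is the most suitable compression format: it shrinks the file to 15.7% of the original, and although it writes more slowly than an uncompressed pickle, about 1 s is acceptable.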
file | size (MB) | writing_time | read_time |
---|---|---|---|
qdaily_infos.csv | 19.08 | 502 ms | 369 ms |
qdaily_infos.pkl | 11.89 | 62.8 ms | 66.9 ms |
qdaily_gzip.pkl.compress | 4.29 | 3.31 s | 122 ms |
qdaily_bz2.pkl.compress | 3.00 | 1.04 s | 431 ms |
qdaily_xz.pkl.compress | 2.89 | 10.9 s | 401 ms |
qdaily.pkl.gzip | 11.89 | 56.8 ms | 60.8 ms |
qdaily.pkl.bz2 | 3.00 | 1.02 s | 450 ms |
qdaily.pkl.xz | 2.89 | 10.7 s | 383 ms |
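The percentages quoted in the conclusion can be rechecked from the measured sizes. This short calculation takes the sizes printed earlier and expresses each one relative to the CSV:

```python
# Sizes in MB, copied from the file size comparison output.
sizes_mb = {
    "qdaily_infos.csv": 19.07578182220459,
    "qdaily_infos.pkl": 11.89148235321045,
    "qdaily_gzip.pkl.compress": 4.289911270141602,
    "qdaily_bz2.pkl.compress": 3.002878189086914,
    "qdaily_xz.pkl.compress": 2.890697479248047,
}

baseline = sizes_mb["qdaily_infos.csv"]
# Size of each file as a percentage of the CSV, rounded to one decimal.
ratios = {name: round(100 * size / baseline, 1) for name, size in sizes_mb.items()}

for name, pct in ratios.items():
    print("{0}: {1}% of the CSV".format(name, pct))
```

This gives roughly 62.3% for the plain pickle, 22.5% for gzip, 15.7% for bz2, and 15.2% for xz.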