Reading the CSV file
import pandas as pd
csv_path = 'gun_deaths_in_america.csv'
data_csv = pd.read_csv(csv_path,header=0)
data_csv.head()
data_csv.shape
(100798, 10)
%timeit pd.read_csv(csv_path,header=0)
114 ms ± 5.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Checking file size
Size of the file on disk
import os
os.stat('gun_deaths_in_america.csv').st_size  # size in bytes
4824404
Size in memory
data_csv.memory_usage(deep=True).sum()
30368107
Memory used by each column
- object columns take up a lot of memory
- int/float columns take up little memory
data_csv.memory_usage(deep=True)
Index 80
year 806384
month 806384
intent 6495168
police 806384
sex 6249476
age 806384
race 6322009
hispanic 806384
place 6463070
education 806384
dtype: int64
data_csv.dtypes
year int64
month int64
intent object
police int64
sex object
age float64
race object
hispanic int64
place object
education float64
dtype: object
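To illustrate why the object columns dominate, here is a minimal sketch on synthetic data (not the gun-deaths dataset; the column values are made up). It assumes the text column is stored with the object dtype, as in the output above. With `deep=False`, pandas only counts the 8-byte pointers of an object column; `deep=True` also counts the Python strings they point to.

```python
import pandas as pd

# Synthetic data for illustration only; values are invented.
df = pd.DataFrame({
    "intent": pd.Series(["Suicide", "Homicide", "Accidental", "Suicide"] * 1000,
                        dtype=object),
    "year": [2012, 2013, 2014, 2012] * 1000,
})

shallow = df.memory_usage(deep=False)  # pointers only for object columns
deep = df.memory_usage(deep=True)      # also counts the string objects

print(shallow)
print(deep)
```

For the integer column both numbers agree, while for the object column the deep figure is several times larger.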
Saving as a pickle file
Saving directly as a pickle file
Once saved to disk, the pickle file is larger than the original CSV file.
data_csv.to_pickle('gun_deaths_in_america_before_transform.pkl')
pkl_path_before = 'gun_deaths_in_america_before_transform.pkl'
os.stat(pkl_path_before).st_size
5656925
Comparing read speed
Reading the pickle file is about 3 times faster than reading the CSV file!
%timeit pd.read_csv(csv_path,header=0)
102 ms ± 7.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_pickle(pkl_path_before)
32.4 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Converting dtypes before saving as a pickle file
As seen above, object columns use a lot of memory; they can be converted to the category type.
data_csv.intent.astype('category').head()
0 Suicide
1 Suicide
2 Suicide
3 Suicide
4 Suicide
Name: intent, dtype: category
Categories (4, object): [Accidental, Homicide, Suicide, Undetermined]
Convert the intent column first. Compared with 6495168 bytes as object, the category version is about 1/64 of the size.
data_csv.intent.astype('category').memory_usage(deep=True)
101303
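The saving comes from how a categorical column is stored: each distinct label is kept once, and every row holds only a small integer code. A minimal sketch on synthetic data (the values are made up, and the object dtype is set explicitly to match the column above):

```python
import pandas as pd

# Synthetic column for illustration only.
s = pd.Series(["Suicide", "Homicide", "Suicide", "Accidental"] * 1000, dtype=object)
c = s.astype("category")

print(c.cat.categories.tolist())  # each distinct label is stored once
print(c.cat.codes.dtype)          # small integer codes, one per row
print(s.memory_usage(deep=True), c.memory_usage(deep=True))
```

With only a handful of distinct labels the codes fit in one byte per row, which is why the categorical column is so much smaller.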
Converting all columns to the category type
for col in data_csv.columns:
    data_csv[col] = data_csv[col].astype('category')
Check memory usage after conversion: compared with 30368107 bytes before, memory usage drops by a factor of about 30.
data_csv.memory_usage(deep=True).sum()
1018587
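Converting every column, as done above, also turns numeric columns such as age into categories, so arithmetic on them no longer works directly. A possible variation, sketched here under assumptions, converts only object columns with few distinct values. The helper name `categorize_low_cardinality` and the `max_ratio` threshold are hypothetical and not part of the original notebook.

```python
import pandas as pd

def categorize_low_cardinality(df, max_ratio=0.5):
    """Hypothetical helper: convert only object columns with few distinct values."""
    out = df.copy()
    if len(out) == 0:
        return out
    for col in out.select_dtypes(include="object").columns:
        # Ratio of distinct values to rows; low means many repeats.
        if out[col].nunique() / len(out) < max_ratio:
            out[col] = out[col].astype("category")
    return out

# Synthetic data for illustration only.
demo = pd.DataFrame({
    "intent": pd.Series(["Suicide", "Homicide", "Suicide", "Suicide"] * 100,
                        dtype=object),
    "age": [34.0, 21.0, 60.0, 45.0] * 100,
})
converted = categorize_low_cardinality(demo)
print(converted.dtypes)
print(converted["age"].mean())  # numeric columns stay numeric
```

This gives up a little of the memory saving in exchange for keeping numeric columns usable as numbers.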
Save the converted data as a pickle file and check its size on disk. Compared with the 4824404 bytes of the original CSV, the converted file is about 4.8 times smaller.
data_csv.to_pickle('gun_deaths_in_america_after_transform.pkl')
pkl_path_after = 'gun_deaths_in_america_after_transform.pkl'
os.stat(pkl_path_after).st_size
1012643
Comparing read speed: the converted pickle loads about 41 times faster than the CSV file.
%timeit pd.read_pickle(pkl_path_after)
2.57 ms ± 262 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_csv(csv_path,header=0)
106 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Overall comparison
files = [csv_path,pkl_path_before,pkl_path_after]
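Comparing size on disk
The converted file takes the least disk space, about 4.8 times smaller than the original file, which is very useful when storing large amounts of data.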
for file in files:
    print('File size of the {0} is {1}: '.format(file,os.stat(file).st_size))
File size of the gun_deaths_in_america.csv is 4824404:
File size of the gun_deaths_in_america_before_transform.pkl is 5656925:
File size of the gun_deaths_in_america_after_transform.pkl is 1012643:
Comparing read speed
The converted pickle is read about 45 times faster than the plain CSV file in this run.
%timeit pd.read_csv(csv_path,header=0)
97.5 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_pickle(pkl_path_before)
28.5 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_pickle(pkl_path_after)
2.18 ms ± 141 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Comparing memory usage
After conversion, memory usage is about 30 times smaller than before.
for file in files:
    # Load with the reader that matches the file extension
    if os.path.splitext(file)[1] == '.csv':
        df = pd.read_csv(file, header=0)
    else:
        df = pd.read_pickle(file)
    print('memory_usage of the {0} is : {1}'.format(file, df.memory_usage(deep=True).sum()))
memory_usage of the gun_deaths_in_america.csv is : 30368107
memory_usage of the gun_deaths_in_america_before_transform.pkl is : 30368107
memory_usage of the gun_deaths_in_america_after_transform.pkl is : 1010827
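As a quick arithmetic check, the ratios can be recomputed from the numbers printed in this post. The variable names are only for illustration; the values are copied from the outputs above.

```python
# Values copied from the outputs reported in this post.
csv_bytes = 4824404          # CSV size on disk
pkl_after_bytes = 1012643    # converted pickle size on disk
mem_before = 30368107        # memory usage before conversion
mem_after = 1010827          # memory usage after conversion

disk_ratio = csv_bytes / pkl_after_bytes
memory_ratio = mem_before / mem_after

print(round(disk_ratio, 2))
print(round(memory_ratio, 2))
```

The disk ratio comes out a little under 5 and the memory ratio at roughly 30.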
The data read back is the same in every case; only the dtypes differ.
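The difference in dtypes can be seen in a small round trip. This sketch uses a tiny synthetic frame and temporary files (the file names `demo.pkl` and `demo.csv` are made up): pickle keeps the category dtype, while CSV stores plain text, so the dtype has to be inferred again on reading.

```python
import os
import tempfile
import pandas as pd

# Synthetic data for illustration only.
df = pd.DataFrame({
    "intent": pd.Series(["Suicide", "Homicide", "Suicide"], dtype=object),
    "age": [34.0, 21.0, 60.0],
}).astype({"intent": "category"})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "demo.pkl")
    csv_path = os.path.join(tmp, "demo.csv")
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)
    from_pkl = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

print(from_pkl.dtypes)  # intent is still categorical
print(from_csv.dtypes)  # intent is inferred from text again

# The values themselves are identical.
same_values = (from_pkl["intent"].astype(str).tolist()
               == from_csv["intent"].astype(str).tolist())
print(same_values)
```

If the CSV must be kept, passing `dtype={"intent": "category"}` to `pd.read_csv` is one way to get the categorical dtype back at load time.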