參考:
安裝
支持的python版本: 2.7, 3.5, 3.6
$ pip install pandas
檢查本地的pandas運(yùn)行環(huán)境是否完整,可以運(yùn)行pandas的單元測試用例
$ pip install pytest
>>> import pandas as pd
>>> pd.test()
獲取當(dāng)前使用pandas的版本信息
>>> import pandas as pd
>>> pd.__version__
'0.21.1'
概覽
pandas的基本數(shù)據(jù)結(jié)構(gòu):
-
Series
: 一維數(shù)據(jù) -
DataFrame
: 二維數(shù)據(jù) -
Panel
: 三維數(shù)據(jù)(從0.20.0版本開始秕重,已經(jīng)不再推薦使用) -
Panel4D
,PanelND
(不再推薦使用)
DataFrame
是由Series
構(gòu)成的
創(chuàng)建Series
創(chuàng)建Series
最簡單的方法
>>> s = pd.Series(data, index=index)
data
可以是不同的類型:
- python字典
- ndarray
- 標(biāo)量(比如: 5)
使用ndarray
創(chuàng)建(From ndarray)
如果data
是ndarray
,那么index
的長度必須和data
的長度相同不同,當(dāng)沒有明確index
參數(shù)時(shí),默認(rèn)使用[0, ... len(data) - 1]
作為index
溶耘。
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> s
a 0.654385
b 0.055691
c 0.856054
d 0.621810
e 1.802872
dtype: float64
>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> pd.Series(np.random.randn(5))
0 -0.467183
1 -1.333323
2 -0.493813
3 -0.067705
4 -1.310332
dtype: float64
需要注意的是: pandas
里的索引并不要求唯一性二拐,如果一個(gè)操作不支持重復(fù)的索引,會(huì)自動(dòng)拋出異常凳兵。這么做的原因是很多操作不會(huì)用到索引百新,比如GroupBy
。
>>> s = pd.Series(np.random.randn(5), index=['a', 'a', 'a', 'a', 'a'])
>>> s
a 0.847331
a -2.138021
a -0.364763
a -0.603172
a 0.363691
dtype: float64
使用dict
創(chuàng)建(From dict)
當(dāng)data
是dict
類型時(shí)庐扫,如果指定了index
參數(shù)饭望,那么就使用index
參數(shù)作為索引。否者形庭,就使用排序后的data
的key
作為index
铅辞。
>>> d = {'b': 0., 'a': 1., 'c': 2.}
# 索引的值是排序后的
>>> pd.Series(d)
a 1.0
b 0.0
c 2.0
dtype: float64
# 字典中不存在的key, 直接賦值為NaN(Not a number)
>>> pd.Series(d, index=['b', 'c', 'd', 'a'])
b 0.0
c 2.0
d NaN
a 1.0
dtype: float64
使用標(biāo)量創(chuàng)建(From scalar value)
當(dāng)data
是標(biāo)量時(shí),必須提供index
, 值會(huì)被重復(fù)到index
的長度
>>> pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
創(chuàng)建DataFrame
DataFrame是一個(gè)二維的數(shù)據(jù)結(jié)構(gòu)萨醒,可以看做是一個(gè)excel表格或一張SQL表巷挥,或者值為Series的字典。 跟Series一樣验靡,DataFrame也可以通過多種類型的數(shù)據(jù)結(jié)構(gòu)來創(chuàng)建
- 字典(包含一維ndarray數(shù)組倍宾,列表,字典或Series)
- 二維的ndarray數(shù)組
- 結(jié)構(gòu)化的ndarray
- Series
- 另一個(gè)DataFrame
除了data
之外胜嗓,還接受index
和columns
參數(shù)來分布指定行和列的標(biāo)簽
從Series字典或嵌套的字典創(chuàng)建(From dict of Series or dicts)
結(jié)果的索引是多個(gè)Series索引的合集高职,如果沒有指定columns
,就用排序后的字典的key
作為列標(biāo)簽辞州。
>>> d = {'one': pd.Series([1,2,3], index=['a', 'b', 'c']),
... 'two': pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])}
...
>>> df = pd.DataFrame(d)
>>> df
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
>>> pd.DataFrame(d, index=['d', 'b', 'a'])
one two
d NaN 4
b 2.0 2
a 1.0 1
>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
two three
d 4 NaN
b 2 NaN
a 1 NaN
>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')
從ndarray類型/列表類型的字典(From dict of ndarrays / lists)
>>> d = {'one': [1,2,3,4], 'two': [4,3,2,1]}
>>> pd.DataFrame(d)
one two
0 1 4
1 2 3
2 3 2
3 4 1
>>> pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
one two
a 1 4
b 2 3
c 3 2
d 4 1
從結(jié)構(gòu)化ndarray創(chuàng)建(From structured or record array)
>>> data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])
>>> data
array([(0, 0., b''), (0, 0., b'')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
>>> data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
>>> pd.DataFrame(data)
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'
>>> pd.DataFrame(data, index=['first', 'second'])
A B C
first 1 2.0 b'Hello'
second 2 3.0 b'World'
>>> pd.DataFrame(data, index=['first', 'second'], columns=['C', 'A', 'B'])
C A B
first b'Hello' 1 2.0
second b'World' 2 3.0
從字典列表里創(chuàng)建(a list of dicts)
>>> data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
>>> pd.DataFrame(data2)
a b c
0 1 2 NaN
1 5 10 20.0
>>> pd.DataFrame(data2, index=["first", "second"])
a b c
first 1 2 NaN
second 5 10 20.0
>>> pd.DataFrame(data2, columns=["a", "b"])
a b
0 1 2
1 5 10
從元祖字典創(chuàng)建(From a dict of tuples)
通過元祖字典怔锌,可以創(chuàng)建多索引的DataFrame
>>> pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
... ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
... ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
... ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
... ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
...
a b
a b c a b
A B 4.0 1.0 5.0 8.0 10.0
C 3.0 2.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0
通過Series創(chuàng)建(From a Series)
>>> pd.DataFrame(pd.Series([1,2,3]))
0
0 1
1 2
2 3
查看數(shù)據(jù)
>>> dates = pd.date_range('20130101', periods=6)
>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
>>> df
A B C D
2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087
2013-01-06 -1.831020 0.813526 0.403101 -1.251946
# 獲取前幾行(默認(rèn)前5行)
>>> df.head()
A B C D
2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087
# 獲取后3行
>>> df.tail(3)
A B C D
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087
2013-01-06 -1.831020 0.813526 0.403101 -1.251946
# 獲取索引
>>> df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
# 獲取列信息
>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
# 獲取數(shù)據(jù)信息
>>> df.values
array([[ 1.23189704, -0.16983942, 1.3332949 , 0.36714191],
[-0.12744988, -1.71667129, 0.91034961, 0.15118638],
[-0.24165226, -0.98464711, 0.78865554, -0.20363944],
[ 0.04498958, -0.25515787, -1.21384804, 1.07671506],
[ 0.41821265, 0.10740007, 0.61944799, 1.49408712],
[-1.8310196 , 0.81352564, 0.40310115, -1.25194611]])
# 獲取簡單的統(tǒng)計(jì)信息
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.084170 -0.367565 0.473500 0.272257
std 1.007895 0.880134 0.883494 0.970912
min -1.831020 -1.716671 -1.213848 -1.251946
25% -0.213102 -0.802275 0.457188 -0.114933
50% -0.041230 -0.212499 0.704052 0.259164
75% 0.324907 0.038090 0.879926 0.899322
max 1.231897 0.813526 1.333295 1.494087
# 轉(zhuǎn)置矩陣
>>> df.T
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 1.231897 -0.127450 -0.241652 0.044990 0.418213 -1.831020
B -0.169839 -1.716671 -0.984647 -0.255158 0.107400 0.813526
C 1.333295 0.910350 0.788656 -1.213848 0.619448 0.403101
D 0.367142 0.151186 -0.203639 1.076715 1.494087 -1.251946
# 按照列排序
>>> df.sort_values(by='B')
A B C D
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-05 0.418213 0.107400 0.619448 1.494087
2013-01-06 -1.831020 0.813526 0.403101 -1.251946
選擇數(shù)據(jù)
獲取
選擇列催束, 返回的是Series
>>> df['A']
2013-01-01 1.231897
2013-01-02 -0.127450
2013-01-03 -0.241652
2013-01-04 0.044990
2013-01-05 0.418213
2013-01-06 -1.831020
Freq: D, Name: A, dtype: float64
>>> df.A
2013-01-01 1.231897
2013-01-02 -0.127450
2013-01-03 -0.241652
2013-01-04 0.044990
2013-01-05 0.418213
2013-01-06 -1.831020
Freq: D, Name: A, dtype: float64
選擇行
>>> df[0:3]
A B C D
2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
>>> df["20130102":"20130104"]
A B C D
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
通過Label選擇
# 返回的Series
>>> df.loc[dates[0]]
A 1.231897
B -0.169839
C 1.333295
D 0.367142
Name: 2013-01-01 00:00:00, dtype: float64
# 返回的DateFrame
>>> df.loc[:, ['A', 'B']]
A B
2013-01-01 1.231897 -0.169839
2013-01-02 -0.127450 -1.716671
2013-01-03 -0.241652 -0.984647
2013-01-04 0.044990 -0.255158
2013-01-05 0.418213 0.107400
2013-01-06 -1.831020 0.813526
>>> df.loc['20130102':'20130104',['A','B']]
A B
2013-01-02 -0.127450 -1.716671
2013-01-03 -0.241652 -0.984647
2013-01-04 0.044990 -0.255158
# 降維返回
>>> df.loc['20130102',['A','B']]
A -0.127450
B -1.716671
Name: 2013-01-02 00:00:00, dtype: float64
通過Position選擇
# 返回第4行
>>> df.iloc[3]
A 0.044990
B -0.255158
C -1.213848
D 1.076715
Name: 2013-01-04 00:00:00, dtype: float64
>>> df.iloc[3:5,0:2]
A B
2013-01-04 0.044990 -0.255158
2013-01-05 0.418213 0.107400
>>> df.iloc[1:3, :]
A B C D
2013-01-02 -0.127450 -1.716671 0.910350 0.151186
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639
# 獲得指定位置的元素
>>> df.iloc[1,1]
-1.7166712884342545
>>> df.iat[1,1]
-1.7166712884342545
布爾索引
>>> df[df.A > 0]
A B C D
2013-01-01 1.231897 -0.169839 1.333295 0.367142
2013-01-04 0.044990 -0.255158 -1.213848 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087
>>> df[df > 0]
A B C D
2013-01-01 1.231897 NaN 1.333295 0.367142
2013-01-02 NaN NaN 0.910350 0.151186
2013-01-03 NaN NaN 0.788656 NaN
2013-01-04 0.044990 NaN NaN 1.076715
2013-01-05 0.418213 0.107400 0.619448 1.494087
2013-01-06 NaN 0.813526 0.403101 NaN
>>> df2=df.copy()
>>> df2['E'] = ['one','one','two','three','four','three']
>>> df2
A B C D E
2013-01-01 1.231897 -0.169839 1.333295 0.367142 one
2013-01-02 -0.127450 -1.716671 0.910350 0.151186 one
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639 two
2013-01-04 0.044990 -0.255158 -1.213848 1.076715 three
2013-01-05 0.418213 0.107400 0.619448 1.494087 four
2013-01-06 -1.831020 0.813526 0.403101 -1.251946 three
# 使用isin()來過濾
>>> df2[df2['E'].isin(['two', 'four'])]
A B C D E
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639 two
2013-01-05 0.418213 0.107400 0.619448 1.494087 four
賦值
根據(jù)日期新增加一列
>>> s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
>>> df['F'] = s1
>>> df
A B C D F
2013-01-01 1.231897 -0.169839 1.333295 0.367142 NaN
2013-01-02 -0.127450 -1.716671 0.910350 0.151186 1.0
2013-01-03 -0.241652 -0.984647 0.788656 -0.203639 2.0
2013-01-04 0.044990 -0.255158 -1.213848 1.076715 3.0
2013-01-05 0.418213 0.107400 0.619448 1.494087 4.0
2013-01-06 -1.831020 0.813526 0.403101 -1.251946 5.0
# 通過label賦值
>>> df.at[dates[0], 'A'] = 0
# 通過position賦值
>>> df.iat[0,1] = 0
# 通過ndarray賦值
>>> df.loc[:, 'D'] = np.array([5] * len(df))
>>> df
A B C D F
2013-01-01 0.000000 0.000000 1.333295 5 NaN
2013-01-02 -0.127450 -1.716671 0.910350 5 1.0
2013-01-03 -0.241652 -0.984647 0.788656 5 2.0
2013-01-04 0.044990 -0.255158 -1.213848 5 3.0
2013-01-05 0.418213 0.107400 0.619448 5 4.0
2013-01-06 -1.831020 0.813526 0.403101 5 5.0
# 通過where操作
>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
>>> df
A B C D
2013-01-01 -1.231777 -0.068987 -0.105402 1.512076
2013-01-02 -1.120426 -0.240417 0.223964 -0.559793
2013-01-03 0.697097 0.758780 -1.191408 -0.793882
2013-01-04 0.332519 0.784564 0.805932 -1.169186
2013-01-05 0.010235 0.156115 0.419567 -2.279214
2013-01-06 0.294819 -0.691370 0.294119 -0.208475
>>> df2 = df.copy()
>>> df2[df > 0] = -df2
>>> df2
A B C D
2013-01-01 -1.231777 -0.068987 -0.105402 -1.512076
2013-01-02 -1.120426 -0.240417 -0.223964 -0.559793
2013-01-03 -0.697097 -0.758780 -1.191408 -0.793882
2013-01-04 -0.332519 -0.784564 -0.805932 -1.169186
2013-01-05 -0.010235 -0.156115 -0.419567 -2.279214
2013-01-06 -0.294819 -0.691370 -0.294119 -0.208475
數(shù)據(jù)缺失
pandas使用np.nan
來表示缺失的數(shù)據(jù)冈敛,它默認(rèn)不參與任何運(yùn)算
>>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
>>> df1
A B C D F E
2013-01-01 0.000000 0.000000 1.333295 5 NaN NaN
2013-01-02 -0.127450 -1.716671 0.910350 5 1.0 NaN
2013-01-03 -0.241652 -0.984647 0.788656 5 2.0 NaN
2013-01-04 0.044990 -0.255158 -1.213848 5 3.0 NaN
>>> df1.loc[dates[0]:dates[1], 'E'] = 1
>>> df1
A B C D F E
2013-01-01 0.000000 0.000000 1.333295 5 NaN 1.0
2013-01-02 -0.127450 -1.716671 0.910350 5 1.0 1.0
2013-01-03 -0.241652 -0.984647 0.788656 5 2.0 NaN
2013-01-04 0.044990 -0.255158 -1.213848 5 3.0 NaN
# 丟棄所有包含NaN的行
>>> df1.dropna(how='any')
A B C D F E
2013-01-02 -0.12745 -1.716671 0.91035 5 1.0 1.0
# 填充所有包含NaN的元素
>>> df1.fillna(value=5)
A B C D F E
2013-01-01 0.000000 0.000000 1.333295 5 5.0 1.0
2013-01-02 -0.127450 -1.716671 0.910350 5 1.0 1.0
2013-01-03 -0.241652 -0.984647 0.788656 5 2.0 5.0
2013-01-04 0.044990 -0.255158 -1.213848 5 3.0 5.0
# 獲取元素值為nan的布爾掩碼
>>> pd.isna(df1)
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
運(yùn)算操作
Stats統(tǒng)計(jì)
運(yùn)算操作都會(huì)排除NaN
元素
>>> dates = pd.date_range('20130101', periods=6)
>>> df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=list('ABCD'))
>>> df
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
# 計(jì)算列的平均值
>>> df.mean()
A 10.0
B 11.0
C 12.0
D 13.0
dtype: float64
計(jì)算行的平均值
>>> df.mean(1)
2013-01-01 1.5
2013-01-02 5.5
2013-01-03 9.5
2013-01-04 13.5
2013-01-05 17.5
2013-01-06 21.5
Freq: D, dtype: float64
# shift(n),按照列的方向容贝,從上往下移動(dòng)n個(gè)位置
>>> s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
>>> s
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
# sub函數(shù),DataFrame相減操作, 等于 df-s
>>> df.sub(s, axis='index')
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 7.0 8.0 9.0 10.0
2013-01-04 9.0 10.0 11.0 12.0
2013-01-05 11.0 12.0 13.0 14.0
2013-01-06 NaN NaN NaN NaN
Apply
>>> df
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
# 在列方向累加
>>> df.apply(np.cumsum)
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 6 8 10
2013-01-03 12 15 18 21
2013-01-04 24 28 32 36
2013-01-05 40 45 50 55
2013-01-06 60 66 72 78
# 列方向的最大值-最小值叉抡, 得到的是一個(gè)Series
>>> df.apply(lambda x: x.max() - x.min())
A 20
B 20
C 20
D 20
dtype: int64
直方圖 Histogramming
>>> s = pd.Series(np.random.randint(0, 7, size=10))
>>> s
0 6
1 5
2 0
3 2
4 5
5 1
6 3
7 3
8 3
9 1
dtype: int64
# 索引是出現(xiàn)的數(shù)字,值是次數(shù)
>>> s.value_counts()
3 3
5 2
1 2
6 1
2 1
0 1
dtype: int64
字符串方法
>>> s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
>>> s.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
合并
Concat
>>> df = pd.DataFrame(np.random.randn(10, 4))
>>> df
0 1 2 3
0 -1.710767 -2.107488 1.441790 0.959924
1 0.509422 0.099733 0.845039 0.232462
2 -0.609247 0.533162 -0.387640 0.668803
3 0.946219 -0.326805 1.245303 1.336090
4 -1.069114 0.755313 -1.003991 -0.327009
5 1.169418 -1.225637 -2.137500 1.766341
6 -1.751095 0.279439 0.018053 1.800435
7 -0.328828 -1.513893 1.879333 0.945217
8 2.440123 -0.260918 -0.232951 -1.337775
9 -0.876878 -1.153583 -1.487573 -1.509871
# 分成小塊
>>> pieces = [df[:3], df[3:7], df[7:]]
# 合并
>>> pd.concat(pieces)
0 1 2 3
0 -1.710767 -2.107488 1.441790 0.959924
1 0.509422 0.099733 0.845039 0.232462
2 -0.609247 0.533162 -0.387640 0.668803
3 0.946219 -0.326805 1.245303 1.336090
4 -1.069114 0.755313 -1.003991 -0.327009
5 1.169418 -1.225637 -2.137500 1.766341
6 -1.751095 0.279439 0.018053 1.800435
7 -0.328828 -1.513893 1.879333 0.945217
8 2.440123 -0.260918 -0.232951 -1.337775
9 -0.876878 -1.153583 -1.487573 -1.509871
Join
跟數(shù)據(jù)庫的Join操作一樣
>>> left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
>>> left
key lval
0 foo 1
1 foo 2
>>> right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
>>> right
key rval
0 foo 4
1 foo 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
另一個(gè)例子
>>> left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
>>> left
key lval
0 foo 1
1 bar 2
>>> right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
>>> right
key rval
0 foo 4
1 bar 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 bar 2 5
Append
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
>>> df
A B C D
0 -1.521762 -0.850721 1.322354 -0.226562
1 -2.773304 -0.663303 0.895075 -0.171524
2 0.322975 -0.796484 0.379920 0.028333
3 -0.350795 1.839747 -0.359241 -0.027921
4 -0.945340 1.062598 -2.208670 0.769027
5 -0.329458 -0.145658 1.580258 -1.414820
6 -0.261757 -1.435025 -0.512306 -0.222287
7 -0.994207 -1.219057 0.781283 -1.795741
>>> s = df.iloc[3]
>>> df.append(s, ignore_index=True)
A B C D
0 -1.521762 -0.850721 1.322354 -0.226562
1 -2.773304 -0.663303 0.895075 -0.171524
2 0.322975 -0.796484 0.379920 0.028333
3 -0.350795 1.839747 -0.359241 -0.027921
4 -0.945340 1.062598 -2.208670 0.769027
5 -0.329458 -0.145658 1.580258 -1.414820
6 -0.261757 -1.435025 -0.512306 -0.222287
7 -0.994207 -1.219057 0.781283 -1.795741
8 -0.350795 1.839747 -0.359241 -0.027921
Grouping
group by
的操作需要經(jīng)過以下1個(gè)或多個(gè)步驟
根據(jù)條件分組數(shù)據(jù)(Spliting)
在各個(gè)分組上執(zhí)行函數(shù)(Applying)
-
合并結(jié)果(Combining)
>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', ... 'foo', 'bar', 'foo', 'foo'], ... 'B' : ['one', 'one', 'two', 'three', ... 'two', 'two', 'one', 'three'], ... 'C' : np.arange(1, 9), ... 'D' : np.arange(2, 10)}) ... ... >>> df A B C D 0 foo one 1 2 1 bar one 2 3 2 foo two 3 4 3 bar three 4 5 4 foo two 5 6 5 bar two 6 7 6 foo one 7 8 7 foo three 8 9 # 分組求和 >>> df.groupby('A').sum() C D A bar 12 15 foo 24 29 # 多列分組 >>> df.groupby(['A','B']).sum() C D A B bar one 2 3 three 4 5 two 6 7 foo one 8 10 three 8 9 two 8 10 >>> b = df.groupby(['A','B']).sum() # 多索引 >>> b.index MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']], labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]], names=['A', 'B']) >>> b.columns
Index(['C', 'D'], dtype='object')
Reshaping
Stack
>>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
... 'foo', 'foo', 'qux', 'qux'],
... ['one', 'two', 'one', 'two',
... 'one', 'two', 'one', 'two']]))
...
>>> tuples
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
>>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
>>> df
A B
first second
bar one 0.096893 0.479194
two -0.771606 0.331693
baz one -0.022540 0.531284
two -0.039843 1.876942
foo one 0.250473 1.163931
two -1.127163 1.447566
qux one -0.410361 -0.734333
two -0.461247 0.018531
>>> df2 = df[:4]
>>> df2
A B
first second
bar one 0.096893 0.479194
two -0.771606 0.331693
baz one -0.022540 0.531284
two -0.039843 1.876942
>>> stacked = df2.stack()
>>> stacked
first second
bar one A 0.096893
B 0.479194
two A -0.771606
B 0.331693
baz one A -0.022540
B 0.531284
two A -0.039843
B 1.876942
dtype: float64
>>> type(stacked)
pandas.core.series.Series
>>> stacked.index
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two'], ['A', 'B']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second', None])
>>> stacked.values
array([ 0.09689327, 0.47919417, -0.77160574, 0.3316934 , -0.02253955,
0.53128436, -0.03984337, 1.8769416 ])
>>> stacked.unstack()
A B
first second
bar one 0.096893 0.479194
two -0.771606 0.331693
baz one -0.022540 0.531284
two -0.039843 1.876942
>>> stacked.unstack(1)
second one two
first
bar A 0.096893 -0.771606
B 0.479194 0.331693
baz A -0.022540 -0.039843
B 0.531284 1.876942
>>> stacked.unstack(0)
first bar baz
second
one A 0.096893 -0.022540
B 0.479194 0.531284
two A -0.771606 -0.039843
B 0.331693 1.876942
數(shù)據(jù)透視表(Pivot Tables)
時(shí)間序列
pandas在時(shí)間序列上型檀,提供了很方便的按照頻率重新采樣的功能躯嫉,在財(cái)務(wù)分析上非常有用
# 把每秒的數(shù)據(jù)按5分鐘聚合
>>> rng = pd.date_range('1/1/2012', periods=100, freq='S')
>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
>>> ts.resample('5Min').sum()
2012-01-01 22073
Freq: 5T, dtype: int64
加上時(shí)區(qū)信息
>>> rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
>>> ts = pd.Series(np.random.randn(len(rng)), rng)
>>> ts
2012-03-06 -0.386974
2012-03-07 0.657785
2012-03-08 1.390234
2012-03-09 0.412904
2012-03-10 -1.189340
Freq: D, dtype: float64
>>> ts_utc = ts.tz_localize('UTC')
>>> ts_utc
2012-03-06 00:00:00+00:00 -0.386974
2012-03-07 00:00:00+00:00 0.657785
2012-03-08 00:00:00+00:00 1.390234
2012-03-09 00:00:00+00:00 0.412904
2012-03-10 00:00:00+00:00 -1.189340
Freq: D, dtype: float64
轉(zhuǎn)換成另一個(gè)時(shí)區(qū)
>>> ts_utc.tz_convert('Asia/Shanghai')
2012-03-06 08:00:00+08:00 -0.386974
2012-03-07 08:00:00+08:00 0.657785
2012-03-08 08:00:00+08:00 1.390234
2012-03-09 08:00:00+08:00 0.412904
2012-03-10 08:00:00+08:00 -1.189340
Freq: D, dtype: float64
時(shí)間跨度轉(zhuǎn)換
>>> rng = pd.date_range('1/1/2012', periods=5, freq='M')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts
2012-01-31 0.825174
2012-02-29 -2.190258
2012-03-31 -0.073171
2012-04-30 -0.404208
2012-05-31 0.245025
Freq: M, dtype: float64
>>> ps = ts.to_period()
>>> ps
2012-01 0.825174
2012-02 -2.190258
2012-03 -0.073171
2012-04 -0.404208
2012-05 0.245025
Freq: M, dtype: float64
>>> ps.to_timestamp()
2012-01-01 0.825174
2012-02-01 -2.190258
2012-03-01 -0.073171
2012-04-01 -0.404208
2012-05-01 0.245025
Freq: MS, dtype: float64
轉(zhuǎn)換季度時(shí)間
>>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
>>> ts = pd.Series(np.random.randn(len(prng)), prng)
>>> ts.head()
1990Q1 -0.590040
1990Q2 -0.750392
1990Q3 -0.385517
1990Q4 -0.380806
1991Q1 -1.252727
Freq: Q-NOV, dtype: float64
>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
>>> ts.head()
1990-03-01 09:00 -0.590040
1990-06-01 09:00 -0.750392
1990-09-01 09:00 -0.385517
1990-12-01 09:00 -0.380806
1991-03-01 09:00 -1.252727
Freq: H, dtype: float64
Categoricals分類
>>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
>>> df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
轉(zhuǎn)換原始類別為分類數(shù)據(jù)類型
>>> df["grade"] = df["raw_grade"].astype("category")
>>> df
id raw_grade grade
0 1 a a
1 2 b b
2 3 b b
3 4 a a
4 5 a a
5 6 e e
>>> df["grade"]
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
重命名分類為更有意義的名稱
>>> df["grade"].cat.categories = ["very good", "good", "very bad"]
>>> df
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad
重新安排順分類,同時(shí)添加缺少的分類(序列 .cat方法下返回新默認(rèn)序列)
>>> df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
>>> df
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad
>>> df["grade"]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
按照分類排序
>>> df.sort_values(by="grade")
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
按照分類分組壶唤,同時(shí)也會(huì)顯示空的分類
>>> df.groupby("grade").size()
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
Plotting
>>> import matplotlib.pyplot as plt
>>> ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> ts = ts.cumsum()
>>> ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x108594668>
>>> plt.show()
畫圖帶圖例的圖
>>> df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A','B'
... ,'C', 'D'])
>>> df.cumsum()
>>> plt.figure();df.plot();plt.legend(loc='best')
<matplotlib.legend.Legend at 0x111793f98>
>>> plt.show()
數(shù)據(jù)In/Out
CSV
保存到csv文件
>>> df.to_csv('foo.csv')
從csv文件讀取數(shù)據(jù)
>>> pd.read_csv('foo.csv')
HDF5
保存到HDF5倉庫
>>> df.to_hdf('foo.h5','df')
從倉庫讀取
>>> pd.read_hdf('foo.h5','df')
Excel
保存到excel
>>> df.to_excel('foo.xlsx', sheet_name='Sheet1')
從excel文件讀取
>>> pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])