Getting Started with pandas
Introduction
pandas provides data structures and data-manipulation tools designed to make cleaning and analyzing data fast and easy.
pandas is often used together with numerical computing tools such as NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries such as matplotlib. pandas is built on top of NumPy's arrays, so large amounts of data can usually be processed without writing explicit loops.
pandas is best suited to tabular or heterogeneous data, while NumPy works best with large arrays of homogeneous numerical data.
The import convention used throughout this article:
#!python
import pandas as pd
The main data structures are Series and DataFrame.
Series
A Series is a one-dimensional, array-like object consisting of a sequence of values (of NumPy-compatible types) and an associated array of data labels, called its index. The simplest Series is formed from an array of data alone:
#!python
In [2]: import pandas as pd
In [3]: obj = pd.Series([4, 7, -5, 3])
In [4]: obj
Out[4]:
0 4
1 7
2 -5
3 3
dtype: int64
In [5]: obj.values
Out[5]: array([ 4, 7, -5, 3])
In [6]: obj.index
Out[6]: Int64Index([0, 1, 2, 3], dtype='int64')
Specifying an index:
#!python
In [2]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [3]: obj2
Out[3]:
d 4
b 7
a -5
c 3
dtype: int64
In [4]: obj2.index
Out[4]: Index(['d', 'b', 'a', 'c'], dtype='object')
In [10]: obj2['a']
Out[10]: -5
In [11]: obj2['d'] = 6
In [12]: obj2[['c', 'a', 'd']]
Out[12]:
c 3
a -5
d 6
dtype: int64
As you can see, compared with a plain NumPy array, you can select values in a Series by index label.
NumPy functions and NumPy-like operations (such as filtering with a boolean array, scalar multiplication, or applying math functions) preserve the link between index and values:
#!python
In [13]: obj2[obj2 > 0]
Out[13]:
d 6
b 7
c 3
dtype: int64
In [14]: obj2 * 2
Out[14]:
d 12
b 14
a -10
c 6
dtype: int64
In [15]: obj2
Out[15]:
d 6
b 7
a -5
c 3
dtype: int64
In [17]: import numpy as np
In [18]: np.exp(obj2)
Out[18]:
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64
In [19]: 'b' in obj2
Out[19]: True
In [20]: 'e' in obj2
Out[20]: False
So a Series can be thought of as a fixed-length, ordered dict. A Series can also be created directly from a dict:
#!python
In [21]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [22]: obj3 = pd.Series(sdata)
In [23]: obj3
Out[23]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [24]: states = ['California', 'Ohio', 'Oregon', 'Texas']
In [25]: obj4 = pd.Series(sdata, index=states)
In [26]: obj4
Out[26]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In [27]: pd.isnull(obj4)
Out[27]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
In [28]: pd.notnull(obj4)
Out[28]:
California False
Ohio True
Oregon True
Texas True
dtype: bool
In [29]: obj4.isnull()
Out[29]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
In [32]: obj4.notnull()
Out[32]:
California False
Ohio True
Oregon True
Texas True
dtype: bool
Adding two Series aligns the values by index label:
#!python
In [33]: obj3
Out[33]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [34]: obj4
Out[34]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In [35]: obj3 + obj4
Out[35]:
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
In [36]: obj4.name = 'population'
In [37]: obj4.index.name = 'state'
In [38]: obj4
Out[38]:
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
In [40]: obj = pd.Series([4, 7, -5, 3])
In [41]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [42]: obj
Out[42]:
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
DataFrame
A DataFrame is a rectangular, table-like data structure containing an ordered collection of columns, each of which can hold a different type (numeric, string, boolean, and so on). It has both a row index and a column index and can be thought of as a dict of Series all sharing the same index. Internally, the data is stored as one or more two-dimensional blocks.
There are many ways to construct a DataFrame; one of the most common is to pass a dict of equal-length lists or NumPy arrays. As with Series, an index is assigned automatically, and the columns are arranged in order.
#!python
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]:
In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
...: 'year': [2000, 2001, 2002, 2001, 2002, 2003],
...: 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
In [4]:
In [4]: frame = pd.DataFrame(data)
In [5]: frame
Out[5]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
5 3.2 Nevada 2003
In [6]: frame.head()
Out[6]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
In [7]:
In [7]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[7]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
In [8]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
...: index=['one', 'two', 'three', 'four', 'five', 'six'])
In [9]: frame2
Out[9]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
In [10]: frame2['state']
Out[10]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
As shown above, the columns argument sets the column order, and index sets the row labels. As with Series, if you pass a column that isn't contained in the data, it appears filled with NaN. A column can be retrieved as a Series either with dict-like notation or as an attribute; the returned Series has the same index as the DataFrame, and its name attribute is set accordingly.
Rows can be retrieved by label with the loc attribute (or by integer position with iloc). Columns can be modified by assignment.
When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series instead, its values are aligned exactly to the DataFrame's index:
#!python
In [11]: frame2.loc['three']
Out[11]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
In [12]: frame2['debt'] = 16.5
In [13]: frame2
Out[13]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
In [14]: frame2['debt'] = np.arange(6.)
In [15]: frame2
Out[15]:
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
six 2003 Nevada 3.2 5.0
In [16]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [17]: frame2['debt'] = val
In [18]: frame2
Out[18]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN
Assigning to a column that doesn't exist creates a new column. The del keyword deletes a column:
#!python
In [19]: frame2['eastern'] = frame2['state'] == 'Ohio'
In [20]: frame2
Out[20]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False
In [21]: del frame2['eastern']
In [22]: frame2.columns
Out[22]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
The column returned by indexing is a view on the underlying data, not a copy, so any in-place modification of the returned Series is reflected in the source DataFrame. Use the Series copy method to copy a column explicitly.
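A small sketch of that advice, reusing frame2 from above: copying a column first means later modifications leave the source DataFrame untouched.
#!python
debt = frame2['debt'].copy()   # an explicit copy, independent of frame2
debt[:] = 0                    # modify the copy in place...
frame2['debt']                 # ...the original column is unchanged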
Another common form of data is a nested dict of dicts; the outer dict keys become the columns and the inner keys become the row indices:
#!python
In [23]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [24]: frame3 = pd.DataFrame(pop)
In [25]: frame3
Out[25]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [26]: frame3.T
Out[26]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
In [27]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[27]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
In [28]: pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}
In [29]: pdata
Out[29]:
{'Ohio': 2000 1.5
2001 1.7
Name: Ohio, dtype: float64, 'Nevada': 2000 NaN
2001 2.4
Name: Nevada, dtype: float64}
In [30]: pd.DataFrame(pdata)
Out[30]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
In [31]: frame3.index.name = 'year'; frame3.columns.name = 'state'
In [32]: frame3
Out[32]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [33]: frame3.values
Out[33]:
array([[ nan, 1.5],
[ 2.4, 1.7],
[ 2.9, 3.6]])
In [34]: frame2.values
Out[34]:
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)
As you can see, a DataFrame can be transposed, and a dict of Series is treated much like a nested dict. If a DataFrame's index and columns have their name attributes set, these are displayed as well. As with Series, the values attribute returns the data as a two-dimensional ndarray; if the columns have different dtypes, the dtype of the values array is chosen to accommodate all of them.
The DataFrame constructor accepts: a 2D ndarray; a dict of arrays, lists, or tuples; a NumPy structured/record array; a dict of Series; a dict of dicts; a list of dicts or Series; a list of lists or tuples; another DataFrame; or a NumPy MaskedArray.
See also: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
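As an illustrative sketch (the data below is made up), here are two of the constructor inputs listed above, a list of dicts and a 2D ndarray:
#!python
import numpy as np
import pandas as pd
# From a list of dicts: keys become columns, missing keys become NaN
pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 5}])
# From a 2D ndarray, with explicit row and column labels
pd.DataFrame(np.arange(6).reshape((2, 3)), index=['r1', 'r2'], columns=['x', 'y', 'z'])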
Index Objects
pandas's Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted into an Index:
#!python
In [35]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [36]: index = obj.index
In [37]: index
Out[37]: Index(['a', 'b', 'c'], dtype='object')
In [38]: index[1:]
Out[38]: Index(['b', 'c'], dtype='object')
In [39]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'
/usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
1722
1723 def __setitem__(self, key, value):
-> 1724 raise TypeError("Index does not support mutable operations")
1725
1726 def __getitem__(self, key):
TypeError: Index does not support mutable operations
In [40]: labels = pd.Index(np.arange(3))
In [41]: labels
Out[41]: Int64Index([0, 1, 2], dtype='int64')
In [42]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
In [43]: obj2
Out[43]:
0 1.5
1 -2.5
2 0.0
dtype: float64
In [44]: obj2.index is labels
Out[44]: True
In [45]: frame3
Out[45]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [46]: frame3.columns
Out[46]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [47]: 'Ohio' in frame3.columns
Out[47]: True
In [48]: 2003 in frame3.index
Out[48]: False
In [49]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
In [50]: dup_labels
Out[50]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
Index objects are immutable, so they can't be modified by the user; this makes it safe to share Index objects among data structures. Besides being array-like, an Index also behaves like a fixed-size set.
Index methods and properties include: append, difference, intersection, union, isin, delete, drop, insert, is_monotonic, and unique.
See also: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
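A minimal sketch of a few of those set-like operations (the labels are made up):
#!python
idx1 = pd.Index(['a', 'b', 'c'])
idx2 = pd.Index(['b', 'c', 'd'])
idx1.union(idx2)         # Index(['a', 'b', 'c', 'd'], dtype='object')
idx1.intersection(idx2)  # Index(['b', 'c'], dtype='object')
idx1.difference(idx2)    # Index(['a'], dtype='object')
idx1.isin(['a', 'd'])    # array([ True, False, False])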
Essential Functionality
This section walks through the fundamental mechanics of interacting with the data contained in a Series or DataFrame.
Reindexing
#!python
In [51]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [52]: obj
Out[52]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
# Calling reindex rearranges the data according to the new index; any label not already present is filled with NaN
In [53]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [54]: obj2
Out[54]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
In [55]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [56]: obj3
Out[56]:
0 blue
2 purple
4 yellow
dtype: object
# For ordered data such as time series, you may want to fill or interpolate values when reindexing. The method option allows this; for example, method='ffill' forward-fills values:
In [57]: obj3.reindex(range(6), method='ffill')
Out[57]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
# With a DataFrame, reindex can alter the rows, the columns, or both
In [58]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
....: index=['a', 'c', 'd'],
....: columns=['Ohio', 'Texas', 'California'])
In [59]: frame
Out[59]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [60]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
In [61]: frame2
Out[61]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
In [62]: states = ['Texas', 'Utah', 'California']
In [63]: frame.reindex(columns=states)
Out[63]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
In [69]: frame2 = frame.reindex(['a', 'b', 'c', 'd'],columns=states)
In [70]: frame2
Out[70]:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
reindex accepts arguments such as index, method, fill_value, limit, tolerance, level, and copy.
See also: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
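A quick sketch of fill_value, reusing the frame defined just above: labels introduced by reindexing are filled with 0 rather than NaN.
#!python
frame.reindex(['a', 'b', 'c', 'd'], fill_value=0)  # the new 'b' row is all 0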
Dropping Entries from an Axis
Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. Because that can require a bit of munging and set logic, the drop method instead returns a new object with the indicated values deleted from an axis:
#!python
In [71]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [72]: obj
Out[72]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [73]: new_obj = obj.drop('c')
In [74]: new_obj
Out[74]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
In [75]: obj
Out[75]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [76]: obj.drop(['d', 'c'])
Out[76]:
a 0.0
b 1.0
e 4.0
dtype: float64
In [77]: obj
Out[77]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [78]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
....: columns=['one', 'two', 'three', 'four'])
In [79]: data
Out[79]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [80]: data.drop(['Colorado', 'Ohio'])
Out[80]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In []: data.drop('two',1)
Out[57]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In []: data.drop('two', axis=1)
Out[58]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In []: data.drop(['two', 'four'], axis='columns')
Out[59]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
In []: obj.drop('c', inplace=True)
In []: obj
Out[61]:
d 4.5
b 7.2
a -5.3
dtype: float64
Indexing, Selection, and Filtering
Series indexing (obj[...]) works analogously to NumPy array indexing, except that you can use the Series's index values instead of only integers. Here are some examples:
#!python
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
Out[63]:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
obj['b']
Out[64]: 1.0
obj[1]
Out[65]: 1.0
obj[2:4]
Out[66]:
c 2.0
d 3.0
dtype: float64
obj[['b', 'a', 'd']]
Out[67]:
b 1.0
a 0.0
d 3.0
dtype: float64
obj[[1, 3]]
Out[68]:
b 1.0
d 3.0
dtype: float64
obj[obj < 2]
Out[69]:
a 0.0
b 1.0
dtype: float64
obj['b':'c']
Out[70]:
b 1.0
c 2.0
dtype: float64
obj['b':'c'] = 5
obj
Out[72]:
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
Note that slicing with labels behaves differently from normal Python slicing: the endpoint is inclusive.
#!python
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
Out[74]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data['two']
Out[75]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
data[['three', 'one']]
Out[76]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
data[:2]
Out[77]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
data[data['three'] > 5]
Out[78]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data < 5
Out[79]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
data[data < 5] = 0
data
Out[81]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
- loc and iloc
For label-based indexing on the rows of a DataFrame, the special indexing operators loc (labels) and iloc (integer positions) let you select subsets of rows and columns:
#!python
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
Out[74]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data['two']
Out[75]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
data[['three', 'one']]
Out[76]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
data[:2]
Out[77]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
data[data['three'] > 5]
Out[78]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data < 5
Out[79]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
data[data < 5] = 0
data
Out[81]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.loc['Colorado', ['two', 'three']]
Out[82]:
two 5
three 6
Name: Colorado, dtype: int32
data.iloc[2, [3, 0, 1]]
Out[83]:
four 11
one 8
two 9
Name: Utah, dtype: int32
data.iloc[2]
Out[84]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
data.iloc[[1, 2], [3, 0, 1]]
Out[85]:
four one two
Colorado 7 0 5
Utah 11 8 9
data.loc[:'Utah', 'two']
Out[86]:
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
data.iloc[:, :3][data.three > 5]
Out[87]:
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
Note that ix is now deprecated.
Integer Indexes
Working with pandas objects indexed by integers differs in some ways from the indexing semantics of built-in Python data structures; for example, the following code raises an error:
#!python
ser = pd.Series(np.arange(3.))
ser[-1]
Traceback (most recent call last):
File "<ipython-input-20-3cbe0b873a9e>", line 1, in <module>
ser[-1]
File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\indexes\base.py", line 2477, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 765, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: -1
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]
Out[22]: 2.0
ser[:1]
Out[23]:
0 0.0
dtype: float64
ser.loc[:1]
Out[24]:
0 0.0
1 1.0
dtype: float64
ser.iloc[:1]
Out[25]:
0 0.0
dtype: float64
Arithmetic and Data Alignment
pandas can perform arithmetic between objects with different indexes; the result is aligned on the union of the labels, somewhat like a database join:
#!python
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
Out[28]:
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
s2
Out[29]:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
s1 + s2
Out[30]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
Out[33]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
df2
Out[34]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
df1 + df2
Out[35]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
Out[38]:
A
0 1
1 2
df2
Out[39]:
B
0 3
1 4
df1 - df2
Out[40]:
A B
0 NaN NaN
1 NaN NaN
You can also fill in a value for labels that are missing in one of the objects:
#!python
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
df1
Out[43]:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
df2
Out[44]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df2.loc[1, 'b'] = np.nan
df2
Out[46]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 NaN 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df1 + df2
Out[47]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
df1.add(df2, fill_value=0)
Out[48]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
1 / df1
Out[49]:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 0.200000 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
df1.rdiv(1)
Out[50]:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 0.200000 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
df1.reindex(columns=df2.columns, fill_value=0)
Out[53]:
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0
Method | Description |
---|---|
add, radd | for addition (+) |
sub, rsub | for subtraction (-) |
div, rdiv | for division (/) |
floordiv, rfloordiv | for floor division (//) |
mul, rmul | for multiplication (*) |
pow, rpow | for exponentiation (**) |
- Operations between DataFrame and Series
By default, arithmetic between a DataFrame and a Series matches on the columns and broadcasts down the rows. To match on the rows and broadcast across the columns instead, use one of the arithmetic methods with axis='index' (or axis=0).
#!python
arr = np.arange(12.).reshape((3, 4))
arr
Out[55]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
Out[56]: array([ 0., 1., 2., 3.])
arr - arr[0]
Out[57]:
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
arr
Out[58]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
Out[61]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
series
Out[62]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame - series
Out[63]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2
Out[65]:
b 0
e 1
f 2
dtype: int32
frame + series2
Out[66]:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
series3 = frame['d']
frame
Out[69]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
series3
Out[70]:
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
frame.sub(series3, axis='index')
Out[71]:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:
Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. DataFrame's apply method does exactly this:
Many of the most common array statistics (such as sum and mean) are DataFrame methods, so using apply is not necessary for them. The function passed to apply need not return a scalar; it can also return a Series with multiple values:
Element-wise Python functions can be used, too. Suppose you wanted a formatted string for each floating-point value in frame; this can be done with applymap:
The reason for the name applymap is that Series has a map method for applying an element-wise function:
#!python
arr = np.arange(12.).reshape((3, 4))
arr
Out[73]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
Out[74]: array([ 0., 1., 2., 3.])
arr - arr[0]
Out[75]:
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
Out[77]:
b d e
Utah 0.255395 1.983985 0.936326
Ohio 0.319394 2.231544 -0.051256
Texas -0.041388 -0.026032 -0.446722
Oregon 1.099475 -1.432638 -0.919189
np.abs(frame)
Out[78]:
b d e
Utah 0.255395 1.983985 0.936326
Ohio 0.319394 2.231544 0.051256
Texas 0.041388 0.026032 0.446722
Oregon 1.099475 1.432638 0.919189
f = lambda x: x.max() - x.min()
frame.apply(f)
Out[80]:
b 1.140863
d 3.664181
e 1.855515
dtype: float64
frame.apply(f, axis='columns')
Out[81]:
Utah 1.728590
Ohio 2.282800
Texas 0.420690
Oregon 2.532113
dtype: float64
def f(x):
return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
Out[83]:
b d e
min -0.041388 -1.432638 -0.919189
max 1.099475 2.231544 0.936326
format = lambda x: '%.2f' % x
frame.applymap(format)
Out[85]:
b d e
Utah 0.26 1.98 0.94
Ohio 0.32 2.23 -0.05
Texas -0.04 -0.03 -0.45
Oregon 1.10 -1.43 -0.92
frame['e'].map(format)
Out[86]:
Utah 0.94
Ohio -0.05
Texas -0.45
Oregon -0.92
Name: e, dtype: object
Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object.
With a DataFrame, you can sort by index on either axis:
#!python
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
Out[88]:
a 1
b 2
c 3
d 0
dtype: int32
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],columns=['d', 'a', 'b', 'c'])
frame
Out[90]:
d a b c
three 0 1 2 3
one 4 5 6 7
frame.sort_index()
Out[91]:
d a b c
one 4 5 6 7
three 0 1 2 3
frame.sort_index(axis='columns')
Out[94]:
a b c d
three 1 2 3 0
one 5 6 7 4
The data is sorted in ascending order by default, but can also be sorted in descending order. To sort a Series by its values, use its sort_values method; any missing values are placed at the end of the Series by default.
#!python
frame.sort_index(axis='columns', ascending=False)
Out[95]:
d c b a
three 0 3 2 1
one 4 7 6 5
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()
Out[97]:
2 -3
3 2
0 4
1 7
dtype: int64
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
Out[99]:
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
obj.sort_values(ascending=False)
Out[100]:
2 7.0
0 4.0
5 2.0
4 -3.0
1 NaN
3 NaN
dtype: float64
On a DataFrame, you may want to sort by the values in one or more columns. Pass one or more column names to the by option; to sort by multiple columns, pass a list of names:
#!python
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
Out[102]:
a b
0 0 4
1 1 7
2 0 -3
3 1 2
frame.sort_values(by='b')
Out[103]:
a b
2 0 -3
3 1 2
0 0 4
1 1 7
frame.sort_values(by=['a', 'b'])
Out[104]:
a b
2 0 -3
0 0 4
3 1 2
1 1 7
Ranking assigns ranks from one through the number of valid data points in an array, and it is closely related to sorting. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. The rank methods of Series and DataFrame are the place to look; by default, rank breaks ties by assigning each group the mean rank:
Ranks can also be assigned according to the order in which the values are observed in the data:
You can rank in descending order, too:
#!python
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()
Out[106]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
obj.rank(method='first')
Out[107]:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
obj.rank(ascending=False, method='max')
Out[108]:
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
frame
Out[110]:
a b c
0 0 4.3 -2.0
1 1 7.0 5.0
2 0 -3.0 8.0
3 1 2.0 -2.5
frame.rank(axis='columns')
Out[111]:
a b c
0 2.0 3.0 1.0
1 1.0 3.0 2.0
2 2.0 1.0 3.0
3 2.0 3.0 1.0
Method | Description |
---|---|
'average' | Default: assign the average rank to each entry in the equal group |
'min' | Use the minimum rank for the whole group |
'max' | Use the maximum rank for the whole group |
'first' | Assign ranks in the order the values appear in the data |
'dense' | Like method='min', but ranks always increase by 1 between groups rather than by the number of equal elements in a group |
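A small sketch contrasting two of these tie-breaking rules on the obj Series from above:
#!python
obj.rank(method='min')    # tied values all receive the group's minimum rank
obj.rank(method='dense')  # like 'min', but the next rank increases by only 1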
Axis Indexes with Duplicate Labels
All of the examples so far have had unique axis labels (index values). While many pandas functions (like reindex) require the labels to be unique, it is not mandatory.
#!python
import pandas as pd
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
Out[3]:
a 0
a 1
b 2
b 3
c 4
dtype: int32
obj.index.is_unique
Out[4]: False
obj['a']
Out[5]:
a 0
a 1
dtype: int32
obj['c']
Out[6]: 4
import numpy as np
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
Out[10]:
0 1 2
a 0.835470 0.465657 -0.068212
a -1.067020 1.148283 1.722324
b 0.057184 -0.441111 -0.388286
b -0.363911 -0.599963 0.126594
df.loc['b']
Out[11]:
0 1 2
b 0.057184 -0.441111 -0.388286
b -0.363911 -0.599963 0.126594
Summarizing and Computing Descriptive Statistics
pandas objects come with a set of common mathematical and statistical methods. Most of these are reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent NumPy array methods, they have built-in handling for missing data. Consider a small DataFrame:
#!python
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
df
Out[14]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.sum()
Out[15]:
one 9.25
two -5.80
dtype: float64
df.sum(axis='columns')
Out[16]:
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df.mean(axis='columns', skipna=False)
Out[17]:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
df.mean(axis='columns')
Out[18]:
a 1.400
b 1.300
c NaN
d -0.275
dtype: float64
Method | Description |
---|---|
axis | Axis to reduce over; 0 for DataFrame’s rows and 1 for columns |
skipna | Exclude missing values; True by default |
level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |
Some methods, like idxmin and idxmax, return indirect statistics such as the index label where the minimum or maximum value is attained; cumsum computes cumulative sums, and describe produces multiple summary statistics in one shot:
#!python
df
Out[19]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.idxmax()
Out[20]:
one b
two d
dtype: object
df.cumsum()
Out[21]:
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
df.describe()
Out[22]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
Out[24]:
count 16
unique 3
top a
freq 8
dtype: object
Method | Description |
---|---|
count | Number of non-NA values |
describe | Compute set of summary statistics for Series or each DataFrame column |
min, max | Compute minimum and maximum values |
argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively |
idxmin, idxmax | Compute index labels at which minimum or maximum value obtained, respectively |
quantile | Compute sample quantile ranging from 0 to 1 |
sum | Sum of values |
mean | Mean of values |
median | Arithmetic median (50% quantile) of values |
mad | Mean absolute deviation from mean value |
prod | Product of all values |
var | Sample variance of values |
std | Sample standard deviation of values |
skew | Sample skewness (third moment) of values |
kurt | Sample kurtosis (fourth moment) of values |
cumsum | Cumulative sum of values |
cummin, cummax | Cumulative minimum or maximum of values, respectively |
cumprod | Cumulative product of values |
diff | Compute first arithmetic difference (useful for time series) |
pct_change | Compute percent changes |
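A brief sketch of a few methods from the table, reusing the df from this section:
#!python
df.quantile(0.5)   # 50% quantile (median) of each column, skipping NaN
df['one'].diff()   # first difference of the 'one' column
df.cumprod()       # cumulative product down each column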
Correlation and Covariance
Some summary statistics, such as correlation and covariance, are computed from pairs of arguments. Consider some DataFrames of stock prices and volumes originally obtained from Yahoo! Finance using the add-on pandas-datareader package.
Omitted for now.
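A minimal sketch on synthetic data (the column names are made up) of corr, cov, and corrwith:
#!python
returns = pd.DataFrame(np.random.randn(250, 3) / 100,
                       columns=['AAA', 'BBB', 'CCC'])
returns['AAA'].corr(returns['BBB'])  # correlation between two Series
returns['AAA'].cov(returns['BBB'])   # covariance between two Series
returns.corr()                       # pairwise correlation matrix of the columns
returns.cov()                        # pairwise covariance matrix of the columns
returns.corrwith(returns['AAA'])     # correlate every column with one Series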
Unique Values, Value Counts, and Membership
Another class of related methods extracts information about the values contained in a one-dimensional Series. Consider this example:
#!python
import pandas as pd
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
Out[9]: array(['c', 'a', 'd', 'b'], dtype=object)
obj.value_counts()
Out[10]:
a 3
c 3
b 2
d 1
dtype: int64
pd.value_counts(obj.values, sort=False)
Out[11]:
c 3
d 1
b 2
a 3
dtype: int64
obj
Out[12]:
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object
mask = obj.isin(['b', 'c'])
mask
Out[14]:
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
obj[mask]
Out[15]:
0 c
5 b
6 b
7 c
8 c
dtype: object
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
Out[18]: array([0, 2, 1, 1, 0, 2], dtype=int64)
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})
data
Out[20]:
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
result = data.apply(pd.value_counts).fillna(0)
result
Out[22]:
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0
Method | Description |
---|---|
isin | Compute boolean array indicating whether each Series value is contained in the passed sequence of values |
match | Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations |
unique | Compute array of unique values in a Series, returned in the order observed |
value_counts | Return a Series containing unique values as its index and frequencies as its values, ordered in descending order by count |
Data Cleaning and Preparation
During data analysis and modeling, a significant amount of time (80% or more) is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way data is stored in files or databases is simply not the right format for a particular task.
In this chapter I discuss tools for missing data, duplicate data, string manipulation, and some other analytical data transformations. The next chapter focuses on combining and rearranging datasets in various ways.
Handling Missing Data
For numeric data, missing values are represented by the floating-point value NaN (Not a Number).
#!python
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
In [4]: string_data
Out[4]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
In [5]: string_data.isnull()
Out[5]:
0 False
1 False
2 True
3 False
dtype: bool
In [6]: string_data[0] = None
In [7]: string_data.isnull()
Out[7]:
0 True
1 False
2 True
3 False
dtype: bool
NA handling methods
Missing data is referred to as NA (not available); Python's built-in None is also treated as NA.
Argument | Description |
---|---|
dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill' . |
isnull | Return boolean values indicating which values are missing/NA. |
notnull | Negation of isnull . |
#!python
In [8]: from numpy import nan as NA
In [9]: data = pd.Series([1, NA, 3.5, NA, 7])
In [10]: data.dropna()
Out[10]:
0 1.0
2 3.5
4 7.0
dtype: float64
In [11]: data[data.notnull()]
Out[11]:
0 1.0
2 3.5
4 7.0
dtype: float64
In [12]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
....: [NA, NA, NA], [NA, 6.5, 3.]])
In [13]: cleaned = data.dropna()
In [14]: data
Out[14]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [15]: cleaned
Out[15]:
0 1 2
0 1.0 6.5 3.0
In [16]: data.dropna(how='all')
Out[16]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
In [17]: data[4] = NA
In [18]: data
Out[18]:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
In [19]: data.dropna(axis='columns', how='all')
Out[19]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
With how='all', a row (or column) is dropped only if all of its values are NaN. The thresh argument keeps only rows containing at least that many non-NA values.
#!python
In [21]: df = pd.DataFrame(np.random.randn(7, 3))
In [22]: df.iloc[:4, 1] = NA
In [23]: df.iloc[:2, 2] = NA
In [24]: df
Out[24]:
0 1 2
0 -0.843340 NaN NaN
1 -1.305941 NaN NaN
2 1.026378 NaN 2.176567
3 0.048885 NaN 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [25]: df.dropna()
Out[25]:
0 1 2
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [26]: df.dropna(thresh=2)
Out[26]:
0 1 2
2 1.026378 NaN 2.176567
3 0.048885 NaN 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
fillna fills in missing values. You can fill different values per column, fill with the previous row's value (forward fill), fill with the mean, and so on.
#!python
In [27]: df.fillna(0)
Out[27]:
0 1 2
0 -0.843340 0.000000 0.000000
1 -1.305941 0.000000 0.000000
2 1.026378 0.000000 2.176567
3 0.048885 0.000000 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [28]: df.fillna({1: 0.5, 2: 0})
Out[28]:
0 1 2
0 -0.843340 0.500000 0.000000
1 -1.305941 0.500000 0.000000
2 1.026378 0.500000 2.176567
3 0.048885 0.500000 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [29]: _ = df.fillna(0, inplace=True)
In [30]: df
Out[30]:
0 1 2
0 -0.843340 0.000000 0.000000
1 -1.305941 0.000000 0.000000
2 1.026378 0.000000 2.176567
3 0.048885 0.000000 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [31]: df = pd.DataFrame(np.random.randn(6, 3))
In [32]: df.iloc[2:, 1] = NA
In [33]: df.iloc[4:, 2] = NA
In [34]: df
Out[34]:
0 1 2
0 -0.081265 -0.820770 -0.746845
1 1.150648 0.977842 0.861825
2 1.823679 NaN 1.272047
3 0.293133 NaN 0.273399
4 0.235116 NaN NaN
5 1.365186 NaN NaN
In [35]: df.fillna(method='ffill')
Out[35]:
0 1 2
0 -0.081265 -0.820770 -0.746845
1 1.150648 0.977842 0.861825
2 1.823679 0.977842 1.272047
3 0.293133 0.977842 0.273399
4 0.235116 0.977842 0.273399
5 1.365186 0.977842 0.273399
In [36]: df.fillna(method='ffill', limit=2)
Out[36]:
0 1 2
0 -0.081265 -0.820770 -0.746845
1 1.150648 0.977842 0.861825
2 1.823679 0.977842 1.272047
3 0.293133 0.977842 0.273399
4 0.235116 NaN 0.273399
5 1.365186 NaN 0.273399
In [37]: data = pd.Series([1., NA, 3.5, NA, 7])
In [38]: data.fillna(data.mean())
Out[38]:
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
Argument | Description |
---|---|
value | Scalar value or dict-like object to use to fill missing values |
method | Interpolation; by default 'ffill' if function called with no other arguments |
axis | Axis to fill on; default axis=0 |
inplace | Modify the calling object without producing a copy |
limit | For forward and backward filling, maximum number of consecutive periods to fill |
Data Transformation
Removing Duplicates
#!python
In [39]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
....: 'k2': [1, 1, 2, 3, 3, 4, 4]})
In [40]: data
Out[40]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
In [41]: data.duplicated()
Out[41]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
In [42]: data.drop_duplicates()
Out[42]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
In [43]: data['v1'] = range(7)
In [44]: data.drop_duplicates(['k1'])
Out[44]:
k1 k2 v1
0 one 1 0
1 two 1 1
In [45]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[45]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6
Transforming Data Using a Function or Mapping
#!python
import numpy as np
import pandas as pd
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[5]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
lowercased = data['food'].str.lower()
lowercased
Out[8]:
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
Out[10]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[11]:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Replacing Values
#!python
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
In [5]: data
Out[5]:
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
In [6]: data.replace(-999, np.nan)
Out[6]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
In [7]: data.replace([-999, -1000], np.nan)
Out[7]:
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
In [8]: data.replace([-999, -1000], [np.nan, 0])
Out[8]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
In [9]: data.replace({-999: np.nan, -1000: 0})
Out[9]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
Renaming Axis Indexes and Columns
#!python
In [2]: import pandas as pd
In [3]: import numpy as np
In [10]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
....: index=['Ohio', 'Colorado', 'New York'],
....: columns=['one', 'two', 'three', 'four'])
In [11]: data
Out[11]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
In [5]: data.replace(4, 40)
Out[5]:
one two three four
Ohio 0 1 2 3
Colorado 40 5 6 7
New York 8 9 10 11
In [12]: transform = lambda x: x[:4].upper()
In [13]: data.index.map(transform)
Out[13]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')
In [14]: data
Out[14]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
In [15]: data.index = data.index.map(transform)
In [16]: data
Out[16]:
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
In [17]: data.rename(index=str.title, columns=str.upper)
Out[17]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
In [18]: data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
Out[18]:
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
In [19]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
In [20]: data
Out[20]:
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
Discretization and Binning
The full treatment is omitted for now.
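A minimal sketch of pd.cut with made-up age data, placing each value into a bin:
#!python
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)      # Categorical of intervals such as (18, 25]
pd.value_counts(cats)          # number of values falling into each bin
pd.cut(ages, 4, precision=2)   # alternatively, split into 4 equal-width bins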
String Manipulation
#!python
In [7]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
...: 'Rob': 'rob@gmail.com', 'Wes': np.nan}
In [8]: data = pd.Series(data)
In [9]: data
Out[9]:
Dave dave@google.com
Rob rob@gmail.com
Steve steve@gmail.com
Wes NaN
dtype: object
In [10]: data.isnull()
Out[10]:
Dave False
Rob False
Steve False
Wes True
dtype: bool
In [11]: data.str.contains('gmail')
Out[11]:
Dave False
Rob True
Steve True
Wes NaN
dtype: object
In [12]: pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
In [13]: data.str.findall(pattern, flags=re.IGNORECASE)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-13-085c16e4dbfe> in <module>()
----> 1 data.str.findall(pattern, flags=re.IGNORECASE)
NameError: name 're' is not defined
In [14]: import re
In [15]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[15]:
Dave [(dave, google, com)]
Rob [(rob, gmail, com)]
Steve [(steve, gmail, com)]
Wes NaN
dtype: object
In [16]: matches = data.str.match(pattern, flags=re.IGNORECASE)
In [17]: matches
Out[17]:
Dave True
Rob True
Steve True
Wes NaN
dtype: object
In [18]: matches.str.get(1)
Out[18]:
Dave NaN
Rob NaN
Steve NaN
Wes NaN
dtype: float64
In [19]: matches.str[0]
Out[19]:
Dave NaN
Rob NaN
Steve NaN
Wes NaN
dtype: float64
In [20]: data.str[:5]
Out[20]:
Dave dave@
Rob rob@g
Steve steve
Wes NaN
dtype: object
Method | Description |
---|---|
cat | Concatenate strings element-wise with optional delimiter |
contains | Return boolean array if each string contains pattern/regex |
count | Count occurrences of pattern |
extract | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group |
endswith | Equivalent to x.endswith(pattern) for each element |
startswith | Equivalent to x.startswith(pattern) for each element |
findall | Compute list of all occurrences of pattern/regex for each string |
get | Index into each element (retrieve i-th element) |
isalnum | Equivalent to built-in str.isalnum |
isalpha | Equivalent to built-in str.isalpha |
isdecimal | Equivalent to built-in str.isdecimal |
isdigit | Equivalent to built-in str.isdigit |
islower | Equivalent to built-in str.islower |
isnumeric | Equivalent to built-in str.isnumeric |
isupper | Equivalent to built-in str.isupper |
join | Join strings in each element of the Series with passed separator |
len | Compute length of each string |
lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element |
match | Use re.match with the passed regular expression on each element, returning matched groups as list |
pad | Add whitespace to left, right, or both sides of strings |
center | Equivalent to pad(side='both') |
repeat | Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string) |
replace | Replace occurrences of pattern/regex with some other string |
slice | Slice each string in the Series |
split | Split strings on delimiter or regular expression |
strip | Trim whitespace from both sides, including newlines |
rstrip | Trim whitespace on right side |
lstrip | Trim whitespace on left side |
Data Wrangling: Join, Combine, and Reshape
In many applications, data may be spread across a number of files or databases, or arranged in a form that is not easy to analyze. This chapter focuses on joining, combining, and reshaping data.
Hierarchical Indexing
#!python
import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
Out[5]:
a 1 -1.111004
2 -0.451764
3 -0.501180
b 1 1.007739
3 0.407470
c 1 -0.307985
2 0.608742
d 2 1.432663
3 -1.660043
dtype: float64
data['b']
Out[6]:
1 1.007739
3 0.407470
dtype: float64
data['b':'c']
Out[7]:
b 1 1.007739
3 0.407470
c 1 -0.307985
2 0.608742
dtype: float64
data.loc[['b', 'd']]
Out[8]:
b 1 1.007739
3 0.407470
d 2 1.432663
3 -1.660043
dtype: float64
data.loc[:, 2]
Out[9]:
a -0.451764
c 0.608742
d 1.432663
dtype: float64
data.unstack()
Out[10]:
1 2 3
a -1.111004 -0.451764 -0.501180
b 1.007739 NaN 0.407470
c -0.307985 0.608742 NaN
d NaN 1.432663 -1.660043
data.unstack().stack()
Out[11]:
a 1 -1.111004
2 -0.451764
3 -0.501180
b 1 1.007739
3 0.407470
c 1 -0.307985
2 0.608742
d 2 1.432663
3 -1.660043
dtype: float64
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame
Out[13]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
Out[16]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame['Ohio']
Out[17]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
- Reordering and Sorting Levels
The full treatment is omitted for now.
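A small sketch reusing the frame from above: swaplevel interchanges two index levels, and sort_index(level=...) sorts by a single level.
#!python
frame.swaplevel('key1', 'key2')            # interchange the key1 and key2 index levels
frame.sort_index(level=1)                  # sort the rows by the key2 level
frame.swaplevel(0, 1).sort_index(level=0)  # swap, then sort by the new outer level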
Combining and Merging Datasets
pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations. pandas.concat stacks multiple objects together along an axis.
The combine_first instance method splices together overlapping data, filling in missing values in one object with values from the other.
Database-Style DataFrame Joins
#!python
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df1
Out[20]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
df2
Out[22]:
data2 key
0 0 a
1 1 b
2 2 d
pd.merge(df1, df2)
Out[23]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
pd.merge(df1, df2, on='key')
Out[24]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[27]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
pd.merge(df1, df2, how='outer')
Out[28]:
data1 key data2
0 0.0 b 1.0
1 1.0 b 1.0
2 6.0 b 1.0
3 2.0 a 0.0
4 4.0 a 0.0
5 5.0 a 0.0
6 3.0 c NaN
7 NaN d 2.0
Option | Behavior |
---|---|
'inner' | Use only the key combinations observed in both tables |
'left' | Use all key combinations found in the left table |
'right' | Use all key combinations found in the right table |
'outer' | Use all key combinations observed in both tables together |
The examples above are many-to-one merges; many-to-many merges behave differently, as sketched below.
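A minimal sketch with made-up frames: a many-to-many merge forms the Cartesian product of the matching rows.
#!python
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
# three 'b' rows on the left and two on the right give 3 * 2 = 6 'b' rows
pd.merge(df1, df2, on='key', how='left')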
Data Aggregation and Group Operations
Categorizing a dataset and applying a function to each group, whether an aggregation or a transformation, is a critical part of a data analysis workflow. After loading and preparing a dataset, a common task is to compute group statistics or pivot tables.
pandas provides a flexible and high-performance groupby facility, enabling you to slice, dice, and summarize datasets in a natural way. One reason for the popularity of relational databases and SQL (Structured Query Language) is the ease with which data can be joined, filtered, transformed, and aggregated. However, query languages like SQL are rather limited in the kinds of group operations they can perform.
As you will see in this chapter, with the expressiveness of Python and pandas we can perform much more complex group operations, using any function that accepts a pandas object or NumPy array. In this chapter, you will learn how to:
- Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names).
- Compute group summary statistics, such as count, mean, or standard deviation, or a user-defined function.
- Apply a varying set of functions to each column of a DataFrame.
- Apply within-group transformations or other manipulations, such as normalization, linear regression, rank, or subset selection.
- Compute pivot tables and cross-tabulations.
- Perform quantile analysis and other statistical group analyses.
GroupBy Mechanics
Hadley Wickham, the author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations, and I think that's a good description of the process. In the first stage, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of the object; for example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object, whose form usually depends on what is being done to the data.
A grouping key can take many forms, and the keys do not have to be all of the same type:
- A list or array of values that is the same length as the axis being grouped.
- A value indicating a column name in a DataFrame.
- A dict or Series giving a correspondence between the values on the axis being grouped and the group names.
- A function to be invoked on the axis index or the individual labels in the index.
Note that the latter three methods are shortcuts for producing an array of values to be used to split up the object. Don't worry if this all seems abstract; I will give many examples throughout this chapter. To get started, here is a small tabular dataset as a DataFrame:
#!python
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),'data2' : np.random.randn(5)})
df
Out[32]:
data1 data2 key1 key2
0 -0.592555 0.537886 a one
1 0.286764 1.498792 a two
2 -0.149658 0.847675 b one
3 0.961803 -1.218945 b two
4 0.896790 1.461441 a one
Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this; the one we use here is to access data1 and call groupby with the column (a Series) at key1:
#!python
grouped = df['data1'].groupby(df['key1'])
grouped
Out[34]: <pandas.core.groupby.SeriesGroupBy object at 0x000001937BF46E48>
The grouped variable is now a GroupBy object. It hasn't actually computed anything yet except for some intermediate data about the group key df['key1']. In other words, the object has all of the information needed to apply an operation to each of the groups. For example, to compute group means we can call the GroupBy's mean method:
#!python
grouped.mean()
Out[35]:
key1
a 0.197000
b 0.406073
Name: data1, dtype: float64
The data (a Series) has been aggregated according to the group key, producing a new Series that is indexed by the unique values in the key1 column.
If instead we pass multiple arrays as a list, we get something different:
#!python
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
Out[38]:
key1 key2
a one 0.152117
two 0.286764
b one -0.149658
two 0.961803
Name: data1, dtype: float64
means.unstack()
Out[39]:
key2 one two
key1
a 0.152117 0.286764
b -0.149658 0.961803
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()
Out[42]:
California 2005 0.286764
2006 -0.149658
Ohio 2005 0.184624
2006 0.896790
Name: data1, dtype: float64
Frequently, the grouping information is in the same DataFrame as the data being worked on; in that case, you can simply pass column names (whether strings, numbers, or other Python objects) as the group keys:
#!python
df.groupby('key1').mean()
Out[44]:
data1 data2
key1
a 0.197000 1.166040
b 0.406073 -0.185635
df.groupby(['key1', 'key2']).mean()
Out[45]:
data1 data2
key1 key2
a one 0.152117 0.999663
two 0.286764 1.498792
b one -0.149658 0.847675
two 0.961803 -1.218945
df.groupby(['key1', 'key2']).size()
Out[46]:
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
You may have noticed that there is no key2 column in the result of df.groupby('key1').mean(). Because df['key2'] is not numeric data, it is said to be a nuisance column and is therefore excluded from the result. By default, all of the numeric columns are aggregated, though it is possible to filter down to a subset, as you'll see soon. Any missing values in a group key are excluded from the result.
- Iterating Over Groups
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:
#!python
for name, group in df.groupby('key1'):
print(name)
print(group)
a
data1 data2 key1 key2
0 -0.592555 0.537886 a one
1 0.286764 1.498792 a two
4 0.896790 1.461441 a one
b
data1 data2 key1 key2
2 -0.149658 0.847675 b one
3 0.961803 -1.218945 b two
In the case of multiple keys, the first element of each tuple is itself a tuple of key values, as the sketch below illustrates.
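A minimal sketch of that iteration, reusing the df from above:
#!python
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))   # e.g. ('a', 'one'), ('a', 'two'), ('b', 'one'), ('b', 'two')
    print(group)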