Getting Started with pandas

Introduction

pandas provides data structures and data-manipulation tools that make cleaning and analyzing data fast and easy.

pandas is often used together with numerical computing tools such as NumPy and SciPy, analytics libraries such as statsmodels and scikit-learn, and data visualization libraries such as matplotlib. pandas adopts NumPy's array-based style of computing, so large amounts of data can typically be processed without writing explicit loops.

pandas is designed for working with tabular or heterogeneous data, whereas NumPy is best suited to large homogeneous numerical arrays.

Throughout this article pandas is imported with the following convention:


#!python

import pandas as pd


The two main data structures are Series and DataFrame.

Series

A Series is a one-dimensional, array-like object containing a sequence of values (of NumPy-like types) and an associated array of data labels, called its index. The simplest Series is formed from a sequence of values alone:

#!python

In [2]: import pandas as pd

In [3]: obj = pd.Series([4, 7, -5, 3])

In [4]: obj
Out[4]: 
0    4
1    7
2   -5
3    3
dtype: int64

In [5]: obj.values
Out[5]: array([ 4,  7, -5,  3])

In [6]: obj.index
Out[6]: Int64Index([0, 1, 2, 3], dtype='int64')

Specifying the index:

#!python

In [2]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [3]: obj2
Out[3]: 
d    4
b    7
a   -5
c    3
dtype: int64

In [4]: obj2.index
Out[4]: Index(['d', 'b', 'a', 'c'], dtype='object')

In [10]: obj2['a']
Out[10]: -5

In [11]: obj2['d'] = 6

In [12]: obj2[['c', 'a', 'd']]
Out[12]: 
c    3
a   -5
d    6
dtype: int64

So compared with a plain NumPy array, you can also use labels in the index to select values from a Series.

NumPy functions and NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the link between index and values:

#!python

In [13]: obj2[obj2 > 0]
Out[13]: 
d    6
b    7
c    3
dtype: int64

In [14]: obj2 * 2
Out[14]: 
d    12
b    14
a   -10
c     6
dtype: int64

In [15]: obj2
Out[15]: 
d    6
b    7
a   -5
c    3
dtype: int64

In [17]: import numpy as np

In [18]: np.exp(obj2)
Out[18]: 
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [19]: 'b' in obj2
Out[19]: True

In [20]: 'e' in obj2
Out[20]: False

A Series can therefore be thought of as a fixed-length, ordered dict. You can also create a Series directly from a dict:

#!python

In [21]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [22]: obj3 = pd.Series(sdata)

In [23]: obj3
Out[23]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [24]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [25]: obj4 = pd.Series(sdata, index=states)

In [26]: obj4
Out[26]: 
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [27]: pd.isnull(obj4)
Out[27]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]: pd.notnull(obj4)
Out[28]: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [29]: obj4.isnull()
Out[29]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [32]: obj4.notnull()
Out[32]: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Addition (indexes are aligned automatically):

#!python

In [33]: obj3
Out[33]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [34]: obj4
Out[34]: 
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [35]: obj3 + obj4
Out[35]: 
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [36]: obj4.name = 'population'

In [37]: obj4.index.name = 'state'

In [38]: obj4
Out[38]: 
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [40]: obj = pd.Series([4, 7, -5, 3])

In [41]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [42]: obj
Out[42]: 
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

Code for this article: https://github.com/china-testing/python-api-tesing/


DataFrame

A DataFrame is a rectangular table of data containing an ordered collection of columns, each of which can be of a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series all sharing the same index. Internally, the data is stored as one or more two-dimensional blocks.

There are many ways to construct a DataFrame; one of the most common is from a dict of equal-length lists or NumPy arrays. As with Series, the row index is assigned automatically, and the columns are placed in sorted order:

#!python

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: 

In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
   ...: 'year': [2000, 2001, 2002, 2001, 2002, 2003],
   ...: 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [4]: 

In [4]: frame = pd.DataFrame(data)

In [5]: frame
Out[5]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003

In [6]: frame.head()
Out[6]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

In [7]: 

In [7]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[7]: 
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

In [8]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
   ...: index=['one', 'two', 'three', 'four', 'five', 'six'])

In [9]: frame2
Out[9]: 
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

In [10]: frame2['state']
Out[10]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

As shown, columns can be used to fix the column order and index to name the rows. As with Series, if a column passed in columns is not contained in the data, it appears as NaN. A column of a DataFrame can be retrieved as a Series either with dict-like notation or as an attribute; the returned Series has the same index as the DataFrame, and its name attribute is set accordingly.

Rows can be retrieved by name with the loc attribute (or by integer position with iloc). Columns can be modified by assignment. When assigning a list or array to a column, its length must match the length of the DataFrame; when assigning a Series, its labels are aligned exactly to the DataFrame's index, inserting missing values in any holes:

#!python

In [11]: frame2.loc['three']
Out[11]: 
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [12]: frame2['debt'] = 16.5

In [13]: frame2
Out[13]: 
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

In [14]: frame2['debt'] = np.arange(6.)

In [15]: frame2
Out[15]: 
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0

In [16]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [17]: frame2['debt'] = val

In [18]: frame2
Out[18]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN

Assigning to a column that does not exist creates a new column. The del keyword deletes a column:

#!python

In [19]: frame2['eastern'] = frame2['state'] == 'Ohio'

In [20]: frame2
Out[20]: 
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False

In [21]: del frame2['eastern']

In [22]: frame2.columns
Out[22]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

The column returned by indexing is a view on the underlying data, not a copy, so any in-place modification of the returned Series will be reflected in the source DataFrame. The column can be explicitly copied with the Series's copy method.
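
A minimal sketch of the difference, using made-up values (the no-copy behavior depends on the pandas version: recent versions warn, and with copy-on-write enabled the source frame is never modified):

#!python

import pandas as pd

frame2 = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'debt': [1.5, 2.5]},
                      index=['one', 'two'])

# Without copy(): on older pandas this is a view onto frame2's data,
# so in-place edits may flow back into frame2.
debt_view = frame2['debt']

# With copy(): always an independent Series that is safe to modify.
debt_copy = frame2['debt'].copy()
debt_copy[:] = 0.0
print(frame2)   # frame2 is untouched by the edit to debt_copy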

Another common form of data is a nested dict of dicts; the outer dict keys become the columns and the inner keys become the row index:

#!python

In [23]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [24]: frame3 = pd.DataFrame(pop)

In [25]: frame3
Out[25]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

In [26]: frame3.T
Out[26]: 
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

In [27]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[27]: 
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

In [28]: pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}

In [29]: pdata
Out[29]: 
{'Ohio': 2000    1.5
 2001    1.7
 Name: Ohio, dtype: float64, 'Nevada': 2000    NaN
 2001    2.4
 Name: Nevada, dtype: float64}

In [30]: pd.DataFrame(pdata)
Out[30]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7

In [31]: frame3.index.name = 'year'; frame3.columns.name = 'state'

In [32]: frame3
Out[32]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [33]: frame3.values
Out[33]: 
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

In [34]: frame2.values
Out[34]: 
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

As shown, the DataFrame can be transposed with T, and a dict of Series is treated in much the same way as a dict of dicts. If the index and columns of a DataFrame have their name attributes set, these are also displayed. As with Series, the values attribute returns the data as a two-dimensional ndarray; if the columns are of different dtypes, the dtype of the values array is chosen to accommodate all of them.

The DataFrame constructor accepts: a 2D ndarray, a dict of arrays/lists/tuples, a NumPy structured/record array, a dict of Series, a dict of dicts, a list of dicts or Series, a list of lists or tuples, another DataFrame, or a NumPy MaskedArray. Two of these inputs are sketched below.
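
As a quick illustration, here is a minimal sketch of two of those inputs (the values are invented for the example):

#!python

import numpy as np
import pandas as pd

# From a list of dicts: keys become the columns, missing keys become NaN.
rows = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
print(pd.DataFrame(rows))

# From a 2D ndarray, with optional row and column labels.
arr = np.arange(6).reshape((2, 3))
print(pd.DataFrame(arr, index=['x', 'y'], columns=['one', 'two', 'three']))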

See also: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Index Objects

pandas's Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index:

#!python

In [35]: obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [36]: index = obj.index

In [37]: index
Out[37]: Index(['a', 'b', 'c'], dtype='object')

In [38]: index[1:]
Out[38]: Index(['b', 'c'], dtype='object')

In [39]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'

/usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   1722 
   1723     def __setitem__(self, key, value):
-> 1724         raise TypeError("Index does not support mutable operations")
   1725 
   1726     def __getitem__(self, key):

TypeError: Index does not support mutable operations

In [40]: labels = pd.Index(np.arange(3))

In [41]: labels
Out[41]: Int64Index([0, 1, 2], dtype='int64')

In [42]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)

In [43]: obj2
Out[43]: 
0    1.5
1   -2.5
2    0.0
dtype: float64

In [44]: obj2.index is labels
Out[44]: True

In [45]: frame3
Out[45]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [46]: frame3.columns
Out[46]: Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [47]: 'Ohio' in frame3.columns
Out[47]: True

In [48]: 2003 in frame3.index
Out[48]: False

In [49]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [50]: dup_labels
Out[50]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Index objects are immutable and thus cannot be modified by the user; this makes it safe to share Index objects among data structures. In addition to being array-like, an Index also behaves like a fixed-size set.

Useful Index methods and properties include append, difference, intersection, union, isin, delete, drop, insert, is_monotonic, and unique. A few of the set-like operations are sketched below.
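
A small sketch of the set-like operations (labels invented for the example):

#!python

import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right))         # all labels from both indexes
print(left.intersection(right))  # labels common to both
print(left.difference(right))    # labels in left but not in right
print(left.isin(['a', 'd']))     # boolean array, one entry per label in left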

See also: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html

Essential Functionality

This section covers the fundamental mechanics of interacting with the data contained in a Series or DataFrame.

Reindexing

#!python

In [51]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [52]: obj
Out[52]: 
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

# Calling reindex rearranges the data according to the new index; labels not already present get NaN

In [53]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [54]: obj2
Out[54]: 
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [55]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [56]: obj3
Out[56]: 
0      blue
2    purple
4    yellow
dtype: object

# For ordered data such as time series, you may want to interpolate when reindexing; the method option does this, e.g. method='ffill' forward-fills values:

In [57]: obj3.reindex(range(6), method='ffill')
Out[57]: 
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

# For a DataFrame, reindex can alter the rows, the columns, or both
In [58]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
   ....: index=['a', 'c', 'd'],
   ....: columns=['Ohio', 'Texas', 'California'])

In [59]: frame
Out[59]: 
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [60]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [61]: frame2
Out[61]: 
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

In [62]: states = ['Texas', 'Utah', 'California']

In [63]: frame.reindex(columns=states)
Out[63]: 
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

In [69]:  frame2 = frame.reindex(['a', 'b', 'c', 'd'],columns=states)

In [70]: frame2
Out[70]: 
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0

reindex accepts arguments including index, method, fill_value, limit, tolerance, level, and copy.

See also: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html

Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list. Because this requires a bit of munging and set logic, the drop method returns a new object with the indicated values deleted from the requested axis:

#!python

In [71]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [72]: obj
Out[72]: 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [73]: new_obj = obj.drop('c')

In [74]: new_obj
Out[74]: 
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [75]: obj
Out[75]: 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [76]: obj.drop(['d', 'c'])
Out[76]: 
a    0.0
b    1.0
e    4.0
dtype: float64

In [77]: obj
Out[77]: 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [78]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   ....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ....: columns=['one', 'two', 'three', 'four'])

In [79]: data
Out[79]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [80]: data.drop(['Colorado', 'Ohio'])
Out[80]: 
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In []: data.drop('two',1)
Out[57]: 
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In []: data.drop('two', axis=1)
Out[58]: 
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In []: data.drop(['two', 'four'], axis='columns')
Out[59]: 
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

In []: obj.drop('c', inplace=True)

In []: obj
Out[61]: 
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Indexing, Selection, and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except that you can use the Series's index values instead of only integers. Here are some examples:

#!python
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

obj
Out[63]: 
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

obj['b']
Out[64]: 1.0

obj[1]
Out[65]: 1.0

obj[2:4]
Out[66]: 
c    2.0
d    3.0
dtype: float64

obj[['b', 'a', 'd']]
Out[67]: 
b    1.0
a    0.0
d    3.0
dtype: float64

obj[[1, 3]]
Out[68]: 
b    1.0
d    3.0
dtype: float64

obj[obj < 2]
Out[69]: 
a    0.0
b    1.0
dtype: float64

obj['b':'c']
Out[70]: 
b    1.0
c    2.0
dtype: float64

obj['b':'c'] = 5

obj
Out[72]: 
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Note that slicing with labels behaves differently from normal Python slicing: the endpoint is inclusive.

#!python

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])

data
Out[74]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data['two']
Out[75]: 
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

data[['three', 'one']]
Out[76]: 
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

data[:2]
Out[77]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

data[data['three'] > 5]
Out[78]: 
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data < 5
Out[79]: 
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

data[data < 5] = 0

data
Out[81]: 
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
  • loc and iloc

For label-based indexing on the rows, DataFrame has the special indexing operators loc (by label) and iloc (by integer position), which select subsets of rows and columns from a DataFrame.

#!python

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])

data
Out[74]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data['two']
Out[75]: 
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

data[['three', 'one']]
Out[76]: 
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

data[:2]
Out[77]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

data[data['three'] > 5]
Out[78]: 
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data < 5
Out[79]: 
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

data[data < 5] = 0

data
Out[81]: 
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data.loc['Colorado', ['two', 'three']]
Out[82]: 
two      5
three    6
Name: Colorado, dtype: int32

data.iloc[2, [3, 0, 1]]
Out[83]: 
four    11
one      8
two      9
Name: Utah, dtype: int32

data.iloc[2]
Out[84]: 
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

data.iloc[[1, 2], [3, 0, 1]]
Out[85]: 
          four  one  two
Colorado     7    0    5
Utah        11    8    9

data.loc[:'Utah', 'two']
Out[86]: 
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

data.iloc[:, :3][data.three > 5]
Out[87]: 
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

Note that the older ix indexer is deprecated and should no longer be used.

Integer Indexes

Integer indexing on pandas objects differs somewhat from the indexing semantics of built-in Python data structures; the following code raises an error because the index contains no label -1:

#!python

ser = pd.Series(np.arange(3.))

ser[-1]
Traceback (most recent call last):

  File "<ipython-input-20-3cbe0b873a9e>", line 1, in <module>
    ser[-1]

  File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)

  File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\indexes\base.py", line 2477, in get_value
    tz=getattr(series.dtype, 'tz', None))

  File "pandas\_libs\index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value

  File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value

  File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 765, in pandas._libs.hashtable.Int64HashTable.get_item

KeyError: -1

ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

ser2[-1]
Out[22]: 2.0

ser[:1]
Out[23]: 
0    0.0
dtype: float64

ser.loc[:1]
Out[24]: 
0    0.0
1    1.0
dtype: float64

ser.iloc[:1]
Out[25]: 
0    0.0
dtype: float64

Arithmetic and Data Alignment

pandas can perform arithmetic between objects with different indexes; the labels are aligned, much like a database join:

#!python

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

s1
Out[28]: 
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

s2
Out[29]: 
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

s1 + s2
Out[30]: 
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

df1
Out[33]: 
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

df2
Out[34]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

df1 + df2
Out[35]: 
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

df1 = pd.DataFrame({'A': [1, 2]})

df2 = pd.DataFrame({'B': [3, 4]})

df1
Out[38]: 
   A
0  1
1  2

df2
Out[39]: 
   B
0  3
1  4

df1 - df2
Out[40]: 
    A   B
0 NaN NaN
1 NaN NaN

You can also fill in a value for missing entries during arithmetic:

#!python

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))

df1
Out[43]: 
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

df2
Out[44]: 
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

df2.loc[1, 'b'] = np.nan

df2
Out[46]: 
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

df1 + df2
Out[47]: 
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

df1.add(df2, fill_value=0)
Out[48]: 
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

1 / df1
Out[49]: 
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909

df1.rdiv(1)
Out[50]: 
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909


df1.reindex(columns=df2.columns, fill_value=0)
Out[53]: 
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0

Method               Description
add, radd            Methods for addition (+)
sub, rsub            Methods for subtraction (-)
div, rdiv            Methods for division (/)
floordiv, rfloordiv  Methods for floor division (//)
mul, rmul            Methods for multiplication (*)
pow, rpow            Methods for exponentiation (**)
  • Operations between DataFrame and Series

By default, arithmetic between a DataFrame and a Series matches the Series's index against the DataFrame's columns and broadcasts down the rows; pass axis='index' (or axis=0) to match on the DataFrame's row index instead and broadcast across the columns.

#!python

arr = np.arange(12.).reshape((3, 4))

arr
Out[55]: 
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

arr[0]
Out[56]: array([ 0.,  1.,  2.,  3.])

arr - arr[0]
Out[57]: 
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

arr
Out[58]: 
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])

series = frame.iloc[0]

frame
Out[61]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

series
Out[62]: 
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

frame - series
Out[63]: 
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

series2 = pd.Series(range(3), index=['b', 'e', 'f'])

series2
Out[65]: 
b    0
e    1
f    2
dtype: int32

frame + series2
Out[66]: 
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN


series3 = frame['d']

frame
Out[69]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

series3
Out[70]: 
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

frame.sub(series3, axis='index')
Out[71]: 
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects:

Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. DataFrame's apply method does exactly this:

Many of the most common array statistics (like sum and mean) are DataFrame methods, so apply is not needed for them. The function passed to apply need not return a scalar; it can also return a Series with multiple values:

Element-wise Python functions can be used, too. For example, to compute a formatted string from each floating-point value in frame, use applymap:

The reason for the name applymap is that Series has a map method for applying an element-wise function:

#!python

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame
Out[77]: 
               b         d         e
Utah    0.255395  1.983985  0.936326
Ohio    0.319394  2.231544 -0.051256
Texas  -0.041388 -0.026032 -0.446722
Oregon  1.099475 -1.432638 -0.919189

np.abs(frame)
Out[78]: 
               b         d         e
Utah    0.255395  1.983985  0.936326
Ohio    0.319394  2.231544  0.051256
Texas   0.041388  0.026032  0.446722
Oregon  1.099475  1.432638  0.919189

f = lambda x: x.max() - x.min()

frame.apply(f)
Out[80]: 
b    1.140863
d    3.664181
e    1.855515
dtype: float64

frame.apply(f, axis='columns')
Out[81]: 
Utah      1.728590
Ohio      2.282800
Texas     0.420690
Oregon    2.532113
dtype: float64

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])


frame.apply(f)
Out[83]: 
            b         d         e
min -0.041388 -1.432638 -0.919189
max  1.099475  2.231544  0.936326

format = lambda x: '%.2f' % x

frame.applymap(format)
Out[85]: 
            b      d      e
Utah     0.26   1.98   0.94
Ohio     0.32   2.23  -0.05
Texas   -0.04  -0.03  -0.45
Oregon   1.10  -1.43  -0.92

frame['e'].map(format)
Out[86]: 
Utah       0.94
Ohio      -0.05
Texas     -0.45
Oregon    -0.92
Name: e, dtype: object

Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object.

With a DataFrame, you can sort by index along either axis:

#!python

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

obj.sort_index()
Out[88]: 
a    1
b    2
c    3
d    0
dtype: int32

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],columns=['d', 'a', 'b', 'c'])

frame
Out[90]: 
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

frame.sort_index()
Out[91]: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

frame.sort_index(axis='columns')
Out[94]: 
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

The data is sorted in ascending order by default, but can also be sorted in descending order. To sort a Series by its values, use its sort_values method; any missing values are sorted to the end of the Series by default.

#!python

frame.sort_index(axis='columns', ascending=False)
Out[95]: 
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
obj = pd.Series([4, 7, -3, 2])

obj.sort_values()
Out[97]: 
2   -3
3    2
0    4
1    7
dtype: int64

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

obj.sort_values()
Out[99]: 
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

obj.sort_values(ascending=False)
Out[100]: 
2    7.0
0    4.0
5    2.0
4   -3.0
1    NaN
3    NaN
dtype: float64

With a DataFrame, you may want to sort by the values in one or more columns. Pass a column name, or a list of names for multiple columns, to the by option of sort_values:

#!python

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

frame
Out[102]: 
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

frame.sort_values(by='b')
Out[103]: 
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7

frame.sort_values(by=['a', 'b'])
Out[104]: 
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

Ranking is closely related to sorting; it assigns ranks from one through the number of valid data points in an array. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. The rank methods of Series and DataFrame are introduced next; by default, rank breaks ties by assigning each group the mean rank:

Ranks can also be assigned according to the order in which the values appear in the data:

You can, of course, rank in descending order, too:

#!python

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

obj.rank()
Out[106]: 
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

obj.rank(method='first')
Out[107]: 
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

obj.rank(ascending=False, method='max')
Out[108]: 
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})

frame
Out[110]: 
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5

frame.rank(axis='columns')
Out[111]: 
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0

Method     Description
'average'  Default: assign the average rank to each entry in the equal group
'min'      Use the minimum rank for the whole group
'max'      Use the maximum rank for the whole group
'first'    Assign ranks in the order the values appear in the data
'dense'    Like method='min', but ranks always increase by 1 between groups rather than by the number of equal elements in a group

Axis Indexes with Duplicate Labels

Every example so far has had unique axis labels (index values). While many pandas functions (like reindex) require the labels to be unique, it is not mandatory.

#!python

import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

obj
Out[3]: 
a    0
a    1
b    2
b    3
c    4
dtype: int32

obj.index.is_unique
Out[4]: False

obj['a']
Out[5]: 
a    0
a    1
dtype: int32

obj['c']
Out[6]: 4

import numpy as np

df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

df
Out[10]: 
          0         1         2
a  0.835470  0.465657 -0.068212
a -1.067020  1.148283  1.722324
b  0.057184 -0.441111 -0.388286
b -0.363911 -0.599963  0.126594

df.loc['b']
Out[11]: 
          0         1         2
b  0.057184 -0.441111 -0.388286
b -0.363911 -0.599963  0.126594

Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent NumPy array methods, they all have built-in handling for missing data. Consider a small DataFrame:

#!python

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], 
index=['a', 'b', 'c', 'd'],columns=['one', 'two'])

df
Out[14]: 
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

df.sum()
Out[15]: 
one    9.25
two   -5.80
dtype: float64

df.sum(axis='columns')
Out[16]: 
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

df.mean(axis='columns', skipna=False)
Out[17]: 
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

df.mean(axis='columns')
Out[18]: 
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

Option  Description
axis    Axis to reduce over; 0 for the DataFrame's rows and 1 for its columns
skipna  Exclude missing values; True by default
level   Reduce grouped by level if the axis is hierarchically indexed (MultiIndex)

Some methods, like idxmin and idxmax, return indirect statistics such as the index label at which the minimum or maximum value is attained; cumsum produces a cumulative sum, and describe produces a set of summary statistics in one shot.

#!python

df
Out[19]: 
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

df.idxmax()
Out[20]: 
one    b
two    d
dtype: object

df.cumsum()
Out[21]: 
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

df.describe()
Out[22]: 
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

obj.describe()
Out[24]: 
count     16
unique     3
top        a
freq       8
dtype: object

Method Description
count Number of non-NA values
describe Compute set of summary statistics for Series or each DataFrame column
min, max Compute minimum and maximum values
argmin, argmax Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax Compute index labels at which minimum or maximum value obtained, respectively
quantile Compute sample quantile ranging from 0 to 1
sum Sum of values
mean Mean of values
median Arithmetic median (50% quantile) of values
mad Mean absolute deviation from mean value
prod Product of all values
var Sample variance of values
std Sample standard deviation of values
skew Sample skewness (third moment) of values
kurt Sample kurtosis (fourth moment) of values
cumsum Cumulative sum of values
cummin, cummax Cumulative minimum or maximum of values, respectively
cumprod Cumulative product of values
diff Compute first arithmetic difference (useful for time series)
pct_change Compute percent changes

Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of arguments. The original example uses DataFrames of stock prices and volumes obtained from Yahoo! Finance via the add-on pandas-datareader package.

That example is omitted here.
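
In its place, a minimal sketch of the same methods on randomly generated "returns" (the tickers and values are made up, so no network access or pandas-datareader is needed):

#!python

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
returns = pd.DataFrame(rng.randn(250, 3), columns=['AAPL', 'IBM', 'MSFT'])

print(returns['AAPL'].corr(returns['IBM']))   # correlation of two Series
print(returns['AAPL'].cov(returns['IBM']))    # covariance of two Series
print(returns.corr())                         # full pairwise correlation matrix
print(returns.cov())                          # full pairwise covariance matrix
print(returns.corrwith(returns['AAPL']))      # each column vs. one Series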

Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a one-dimensional Series. Consider this example:

#!python

import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()

uniques
Out[9]: array(['c', 'a', 'd', 'b'], dtype=object)

obj.value_counts()
Out[10]: 
a    3
c    3
b    2
d    1
dtype: int64

pd.value_counts(obj.values, sort=False)
Out[11]: 
c    3
d    1
b    2
a    3
dtype: int64

obj
Out[12]: 
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

mask = obj.isin(['b', 'c'])

mask
Out[14]: 
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

obj[mask]
Out[15]: 
0    c
5    b
6    b
7    c
8    c
dtype: object

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

unique_vals = pd.Series(['c', 'b', 'a'])

pd.Index(unique_vals).get_indexer(to_match)
Out[18]: array([0, 2, 1, 1, 0, 2], dtype=int64)

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})

data
Out[20]: 
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4

result = data.apply(pd.value_counts).fillna(0)

result
Out[22]: 
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0

Method        Description
isin          Compute a boolean array indicating whether each Series value is contained in the passed sequence of values
match         Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations
unique        Compute an array of unique values in a Series, returned in the order observed
value_counts  Return a Series containing unique values as its index and frequencies as its values, ordered in descending count order

Data Cleaning and Preparation

During data analysis and modeling, a significant amount of time (80% or more) is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way data is stored in files or databases is not in the right format for a particular task.

This chapter discusses tools for missing data, duplicate data, string manipulation, and some other analytical data transformations. The next chapter focuses on combining and rearranging datasets in various ways.

Handling Missing Data

For numeric data, missing values are represented by the floating-point value NaN (Not a Number).

#!python

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]: string_data
Out[4]: 
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]: string_data.isnull()
Out[5]: 
0    False
1    False
2     True
3    False
dtype: bool

In [6]: string_data[0] = None

In [7]: string_data.isnull()
Out[7]: 
0     True
1    False
2     True
3    False
dtype: bool

NA Handling Methods



Missing data is referred to as NA (not available); Python's built-in None is also treated as NA.

Method   Description
dropna   Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna   Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull   Return boolean values indicating which values are missing/NA.
notnull  Negation of isnull.
#!python

In [8]: from numpy import nan as NA

In [9]: data = pd.Series([1, NA, 3.5, NA, 7])

In [10]: data.dropna()
Out[10]: 
0    1.0
2    3.5
4    7.0
dtype: float64

In [11]: data[data.notnull()]
Out[11]: 
0    1.0
2    3.5
4    7.0
dtype: float64

In [12]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....: [NA, NA, NA], [NA, 6.5, 3.]])

In [13]: cleaned = data.dropna()

In [14]: data
Out[14]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [15]: cleaned
Out[15]: 
     0    1    2
0  1.0  6.5  3.0

In [16]: data.dropna(how='all')
Out[16]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

In [17]: data[4] = NA

In [18]: data
Out[18]: 
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In [19]: data.dropna(axis='columns', how='all')
Out[19]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

With how='all', a row (or column) is dropped only when all of its values are NaN. The thresh argument keeps only rows that contain at least that many non-NA values.

#!python

In [21]: df = pd.DataFrame(np.random.randn(7, 3))

In [22]: df.iloc[:4, 1] = NA

In [23]: df.iloc[:2, 2] = NA

In [24]: df
Out[24]: 
          0         1         2
0 -0.843340       NaN       NaN
1 -1.305941       NaN       NaN
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [25]: df.dropna()
Out[25]: 
          0         1         2
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [26]: df.dropna(thresh=2)
Out[26]: 
          0         1         2
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

fillna fills in missing values. You can fill different columns with different values, fill forward from the previous row, fill with the mean, and so on.

#!python

In [27]: df.fillna(0)
Out[27]: 
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [28]: df.fillna({1: 0.5, 2: 0})
Out[28]: 
          0         1         2
0 -0.843340  0.500000  0.000000
1 -1.305941  0.500000  0.000000
2  1.026378  0.500000  2.176567
3  0.048885  0.500000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [29]: _ = df.fillna(0, inplace=True)

In [30]: df
Out[30]: 
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [31]: df = pd.DataFrame(np.random.randn(6, 3))

In [32]: df.iloc[2:, 1] = NA

In [33]: df.iloc[4:, 2] = NA

In [34]: df
Out[34]: 
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679       NaN  1.272047
3  0.293133       NaN  0.273399
4  0.235116       NaN       NaN
5  1.365186       NaN       NaN

In [35]: df.fillna(method='ffill')
Out[35]: 
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116  0.977842  0.273399
5  1.365186  0.977842  0.273399

In [36]: df.fillna(method='ffill', limit=2)
Out[36]: 
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116       NaN  0.273399
5  1.365186       NaN  0.273399

In [37]: data = pd.Series([1., NA, 3.5, NA, 7])

In [38]: data.fillna(data.mean())
Out[38]: 
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

Argument Description
value Scalar value or dict-like object to use to fill missing values
method Interpolation; by default 'ffill' if function called with no other arguments
axis Axis to fill on; default axis=0
inplace Modify the calling object without producing a copy
limit For forward and backward filling, maximum number of consecutive periods to fill

Data Transformation

Removing Duplicates

#!python

In [39]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....: 'k2': [1, 1, 2, 3, 3, 4, 4]})

In [40]: data
Out[40]: 
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

In [41]: data.duplicated()
Out[41]: 
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [42]: data.drop_duplicates()
Out[42]: 
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

In [43]: data['v1'] = range(7)

In [44]: data.drop_duplicates(['k1'])
Out[44]: 
    k1  k2  v1
0  one   1   0
1  two   1   1

In [45]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[45]: 
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6

Transforming Data Using a Function or Mapping

#!python

import numpy as np

import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data
Out[5]: 
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

lowercased = data['food'].str.lower()

lowercased
Out[8]: 
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

data['animal'] = lowercased.map(meat_to_animal)

data
Out[10]: 
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[11]: 
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Replacing Values

#!python

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [5]: data
Out[5]: 
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [6]: data.replace(-999, np.nan)
Out[6]: 
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [7]: data.replace([-999, -1000], np.nan)
Out[7]: 
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [8]: data.replace([-999, -1000], [np.nan, 0])
Out[8]: 
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [9]: data.replace({-999: np.nan, -1000: 0})
Out[9]: 
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Renaming Axis Indexes and Columns

#!python

In [2]: import pandas as pd

In [3]: import numpy as np



In [10]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....: index=['Ohio', 'Colorado', 'New York'],
   ....: columns=['one', 'two', 'three', 'four'])

In [11]: data
Out[11]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [5]: data.replace(4, 40)
Out[5]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado   40    5      6     7
New York    8    9     10    11


In [12]: transform = lambda x: x[:4].upper()

In [13]: data.index.map(transform)
Out[13]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [14]: data
Out[14]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [15]: data.index = data.index.map(transform)

In [16]: data
Out[16]: 
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

In [17]: data.rename(index=str.title, columns=str.upper)
Out[17]: 
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

In [18]: data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
Out[18]: 
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11

In [19]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [20]: data
Out[20]: 
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11

Discretization and Binning

The full treatment is omitted here; a brief sketch follows.
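
A minimal sketch of pd.cut and pd.qcut, with ages and bin edges invented for the example:

#!python

import numpy as np
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)        # bin by explicit edges, e.g. (18, 25]
print(cats.categories)
print(pd.value_counts(cats))     # how many values fall into each bin

quartiles = pd.qcut(np.random.randn(100), 4)  # four equal-sized quantile bins
print(pd.value_counts(quartiles))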

String Manipulation

#!python

In [7]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
   ...: 'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [8]: data = pd.Series(data)

In [9]: data
Out[9]: 
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [10]: data.isnull()
Out[10]: 
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

In [11]: data.str.contains('gmail')
Out[11]: 
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object

In [12]: pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [13]: data.str.findall(pattern, flags=re.IGNORECASE)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-085c16e4dbfe> in <module>()
----> 1 data.str.findall(pattern, flags=re.IGNORECASE)

NameError: name 're' is not defined

In [14]: import re

In [15]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[15]: 
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

In [16]: matches = data.str.match(pattern, flags=re.IGNORECASE)

In [17]: matches
Out[17]: 
Dave     True
Rob      True
Steve    True
Wes       NaN
dtype: object

In [18]: matches.str.get(1)
Out[18]: 
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [19]: matches.str[0]
Out[19]: 
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [20]: data.str[:5]
Out[20]: 
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object

Method Description
cat Concatenate strings element-wise with optional delimiter
contains Return boolean array if each string contains pattern/regex
count Count occurrences of pattern
extract Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
endswith Equivalent to x.endswith(pattern) for each element
startswith Equivalent to x.startswith(pattern) for each element
findall Compute list of all occurrences of pattern/regex for each string
get Index into each element (retrieve i-th element)
isalnum Equivalent to built-in str.isalnum
isalpha Equivalent to built-in str.isalpha
isdecimal Equivalent to built-in str.isdecimal
isdigit Equivalent to built-in str.isdigit
islower Equivalent to built-in str.islower
isnumeric Equivalent to built-in str.isnumeric
isupper Equivalent to built-in str.isupper
join Join strings in each element of the Series with passed separator
len Compute length of each string
lower, upper Convert cases; equivalent to x.lower() or x.upper() for each element
match Use re.match with the passed regular expression on each element, returning matched groups as list
pad Add whitespace to left, right, or both sides of strings
center Equivalent to pad(side='both')
repeat Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
replace Replace occurrences of pattern/regex with some other string
slice Slice each string in the Series
split Split strings on delimiter or regular expression
strip Trim whitespace from both sides, including newlines
rstrip Trim whitespace on right side
lstrip Trim whitespace on left side
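
Of the methods above, extract is not shown in the session; here is a small sketch reusing the same email data and pattern (rebuilt so the snippet stands alone):

#!python

import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# Each regex group becomes its own column; non-matching or NA rows are NaN.
print(data.str.extract(pattern, flags=re.IGNORECASE))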

Data Wrangling: Join, Combine, and Reshape

In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not convenient to analyze. This chapter focuses on joining, combining, and reshaping data.

Hierarchical Indexing

#!python

import pandas as pd

import numpy as np

data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],[1, 2, 3, 1, 3, 1, 2, 2, 3]])

data
Out[5]: 
a  1   -1.111004
   2   -0.451764
   3   -0.501180
b  1    1.007739
   3    0.407470
c  1   -0.307985
   2    0.608742
d  2    1.432663
   3   -1.660043
dtype: float64

data['b']
Out[6]: 
1    1.007739
3    0.407470
dtype: float64

data['b':'c']
Out[7]: 
b  1    1.007739
   3    0.407470
c  1   -0.307985
   2    0.608742
dtype: float64

data.loc[['b', 'd']]
Out[8]: 
b  1    1.007739
   3    0.407470
d  2    1.432663
   3   -1.660043
dtype: float64

data.loc[:, 2]
Out[9]: 
a   -0.451764
c    0.608742
d    1.432663
dtype: float64

data.unstack()
Out[10]: 
          1         2         3
a -1.111004 -0.451764 -0.501180
b  1.007739       NaN  0.407470
c -0.307985  0.608742       NaN
d       NaN  1.432663 -1.660043

data.unstack().stack()
Out[11]: 
a  1   -1.111004
   2   -0.451764
   3   -0.501180
b  1    1.007739
   3    0.407470
c  1   -0.307985
   2    0.608742
d  2    1.432663
   3   -1.660043
dtype: float64

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])

frame
Out[13]: 
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

frame.index.names = ['key1', 'key2']

frame.columns.names = ['state', 'color']

frame
Out[16]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

frame['Ohio']
Out[17]: 
color      Green  Red
key1 key2            
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10

  • Reordering and Sorting Levels

The full treatment is omitted here; a brief sketch follows.
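
A brief sketch of swaplevel and level-based sorting, rebuilding the hierarchically indexed frame from above so the snippet stands alone:

#!python

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

print(frame.swaplevel('key1', 'key2'))            # swap the two row index levels
print(frame.sort_index(level=1))                  # sort rows by the key2 level
print(frame.swaplevel(0, 1).sort_index(level=0))  # swap, then sort by the new outer level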

Combining and Merging Datasets

  • pandas.merge connects rows in different DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.

  • pandas.concat stacks multiple objects together along an axis.

  • The combine_first instance method splices together overlapping data, filling in missing values in one object with values from another (a short sketch follows this list).
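
A minimal sketch of combine_first on two Series with overlapping indexes (values invented for the example):

#!python

import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0.0, np.nan, 2.0, np.nan, np.nan, 5.0],
              index=['a', 'b', 'c', 'd', 'e', 'f'])

# Take values from b, falling back to a wherever b is missing.
print(b.combine_first(a))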

Database-Style DataFrame Joins

#!python

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})

df1
Out[20]: 
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

df2
Out[22]: 
   data2 key
0      0   a
1      1   b
2      2   d

pd.merge(df1, df2)
Out[23]: 
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

pd.merge(df1, df2, on='key')
Out[24]: 
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})

df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[27]: 
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

pd.merge(df1, df2, how='outer')
Out[28]: 
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0

Option   Behavior
'inner'  Use only the key combinations observed in both tables
'left'   Use all key combinations found in the left table
'right'  Use all key combinations found in the right table
'outer'  Use all key combinations observed in both tables together

The merges above are many-to-one; a many-to-many merge is sketched below.
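
A sketch of a many-to-many merge (data invented for the example): when a key appears multiple times in both tables, the result contains the Cartesian product of the matching rows.

#!python

import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

# 'b' occurs three times in df1 and twice in df2, so the result
# contains 3 * 2 = 6 rows for key 'b'.
print(pd.merge(df1, df2, on='key', how='left'))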

Data Aggregation and Group Operations

Categorizing a dataset and applying a function to each group, whether an aggregation or a transformation, is a critical part of data analysis. After a dataset has been prepared, a common task is to compute group statistics or possibly pivot tables.

pandas provides a flexible and efficient groupby facility that lets you slice, dice, and summarize datasets in a natural way. One reason for the popularity of relational databases and SQL (Structured Query Language) is the ease with which data can be joined, filtered, transformed, and aggregated. However, query languages like SQL are rather limited in the kinds of group operations they can perform.

As you will see, with the expressiveness of Python and pandas, we can perform far more complex group operations by using any function that accepts a pandas object or a NumPy array. In this chapter you will learn how to:

  • Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names).
  • Compute group summary statistics, like count, mean, or standard deviation, or a user-defined function.
  • Apply a varying set of functions to each column of a DataFrame.
  • Apply within-group transformations or other manipulations, like normalization, linear regression, ranking, or subset selection.
  • Compute pivot tables and cross-tabulations.
  • Perform quantile analysis and other statistical group analyses.

GroupBy Mechanics

Hadley Wickham, the author of many popular packages for the R programming language, coined the term split-apply-combine to describe group operations, and I think that is a good description of the process. In the first stage, the data contained in a pandas object, whether a Series, a DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of the object; for example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object, whose form usually depends on what is being done to the data.

Grouping keys can take many forms, and the keys do not all have to be of the same type:

  • A list or array of values that is the same length as the axis being grouped.
  • A value indicating a column name in a DataFrame.
  • A dict or Series giving a correspondence between the values on the axis being grouped and the group names.
  • A function to be invoked on the axis index or the individual labels in the index.

Note that the latter three methods are shortcuts: ultimately they all just produce an array of values to be used to split up the object. Don't worry if this seems abstract; many examples follow throughout this chapter. To get started, here is a small tabular dataset as a DataFrame:

#!python

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),'data2' : np.random.randn(5)})

df
Out[32]: 
      data1     data2 key1 key2
0 -0.592555  0.537886    a  one
1  0.286764  1.498792    a  two
2 -0.149658  0.847675    b  one
3  0.961803 -1.218945    b  two
4  0.896790  1.461441    a  one

Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this; the one we use here is to access data1 and call groupby with the key1 column:

#!python

grouped = df['data1'].groupby(df['key1'])

grouped
Out[34]: <pandas.core.groupby.SeriesGroupBy object at 0x000001937BF46E48>

The grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. In other words, the object has all of the information needed to apply some operation to each of the groups. For example, we can call the GroupBy object's mean method to compute the group means:

#!python

grouped.mean()
Out[35]: 
key1
a    0.197000
b    0.406073
Name: data1, dtype: float64

The data (a Series) has been aggregated according to the group key, producing a new Series that is indexed by the unique values in the key1 column.

If instead we pass multiple arrays as a list, we get something different:

#!python

means = df['data1'].groupby([df['key1'], df['key2']]).mean()

means
Out[38]: 
key1  key2
a     one     0.152117
      two     0.286764
b     one    -0.149658
      two     0.961803
Name: data1, dtype: float64

means.unstack()
Out[39]: 
key2       one       two
key1                    
a     0.152117  0.286764
b    -0.149658  0.961803

states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

years = np.array([2005, 2005, 2006, 2005, 2006])

df['data1'].groupby([states, years]).mean()
Out[42]: 
California  2005    0.286764
            2006   -0.149658
Ohio        2005    0.184624
            2006    0.896790
Name: data1, dtype: float64

More commonly, the grouping information is in the same DataFrame as the data being grouped, so you can pass column names (whether strings, numbers, or other Python objects) as the group keys:

#!python

df.groupby('key1').mean()
Out[44]: 
         data1     data2
key1                    
a     0.197000  1.166040
b     0.406073 -0.185635

df.groupby(['key1', 'key2']).mean()
Out[45]: 
              data1     data2
key1 key2                    
a    one   0.152117  0.999663
     two   0.286764  1.498792
b    one  -0.149658  0.847675
     two   0.961803 -1.218945

df.groupby(['key1', 'key2']).size()
Out[46]: 
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

You may have noticed that there is no key2 column in the result of df.groupby('key1').mean(). Because df['key2'] is not numeric data, it is said to be a nuisance column and is therefore excluded from the result. By default, all of the numeric columns are aggregated, though it is possible to filter down to a subset, as you'll see soon. Any missing values in a group key are excluded from the result.

  • Iterating over Groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:

#!python

for name, group in df.groupby('key1'):
    print(name)
    print(group)
    
a
      data1     data2 key1 key2
0 -0.592555  0.537886    a  one
1  0.286764  1.498792    a  two
4  0.896790  1.461441    a  one
b
      data1     data2 key1 key2
2 -0.149658  0.847675    b  one
3  0.961803 -1.218945    b  two

In the case of multiple keys, the first element of the tuple is itself a tuple of key values. The session ends here; a sketch of what that looks like follows.
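
A minimal sketch, rebuilding the df used earlier (the random values will of course differ from run to run):

#!python

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# With multiple keys, the group "name" is a tuple of the key values.
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)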
