Chapter 05: Getting Started with pandas

pandas provides data structures and manipulation tools that make data cleaning and analysis simple.
pandas is designed for tabular and heterogeneous data, whereas NumPy is better suited to homogeneous numerical array data.

Series

In [38]: import pandas as pd
    ...: from pandas import Series, DataFrame

In [39]: obj=pd.Series([4,7,-5,3])

A Series consists of an index and values.

In [40]: obj
Out[40]:
0    4
1    7
2   -5
3    3
dtype: int64

Custom index

In [43]: obj2=pd.Series([4,7,-5,3],index=['d','b','a','c'])

Select elements by index label:

In [48]: obj2[['a','b']]

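Using NumPy functions or NumPy-like operations (filtering with a boolean array, scalar multiplication, applying math functions, and so on) preserves the link between index and value.
So far, compared with a NumPy array, a Series is essentially an array with an index attached.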
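A minimal sketch of that behaviour, reusing the obj2 values from above (the variable names positive, doubled, and exponent are only illustrative):

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = obj2[obj2 > 0]   # boolean filtering keeps the matching labels
doubled = obj2 * 2          # scalar multiplication keeps every label
exponent = np.exp(obj2)     # a NumPy ufunc returns a Series with the same index

print(positive)
print(doubled)
print(exponent.index.tolist())
```

In each case the result is still a Series, and every value stays attached to its original label.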

A Series can be viewed as an ordered dictionary, mapping index labels to values.

In [52]: 'a' in obj2
Out[52]: True

In [49]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}  # build a Series from a dict
In [50]: obj3=pd.Series(sdata)

When constructing a Series from a dict, you can pass index to control the order of the keys.

In [58]: keys
Out[58]: ['Ohio', 'Oregon', 'Texas', 'Utah']

In [59]: obj4=pd.Series(sdata,index=keys)

In [60]: obj4
Out[60]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

If a label in the supplied index is not a key of the original dict, its value is shown as NaN:

In [61]: obj5=pd.Series(sdata,['a','b','c','d'])

In [62]: obj5
Out[62]:
a   NaN
b   NaN
c   NaN
d   NaN
dtype: float64

NaN (not a number) marks missing data.
The pandas isnull and notnull functions can be used to detect missing data:

In [66]: pd.isnull(obj5)
Out[66]:
a    True
b    True
c    True
d    True
dtype: bool

A Series has a name attribute, which can be assigned directly:

In [70]: obj4.name='population'

In [71]: obj4
Out[71]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
Name: population, dtype: int64

The index of a Series can also be replaced by direct assignment:

In [72]: obj
Out[72]:
0    4
1    7
2   -5
3    3
dtype: int64

In [73]: obj.index=['anchor','ama','avec','god']

In [74]: obj
Out[74]:
anchor    4
ama       7
avec     -5
god       3
dtype: int64

DataFrame

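A DataFrame is a tabular data structure containing an ordered collection of columns (each column may hold a different type of value).
A DataFrame has both a row index and a column index. It can be viewed as a collection of Series objects: each column is a Series, and all of these Series share the same index.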

About the structure of a DataFrame

A DataFrame is effectively made up of several Series that share one index.
For the DataFrame itself, the keys are the column labels; for each Series (column) inside it, the keys are the row labels.

In [77]: data
Out[77]:
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [80]: frame = pd.DataFrame(data)

Show the first 5 rows:

In [6]: frame.head()
Out[6]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

If you specify a sequence of columns, the DataFrame's columns are arranged in that order:

In [8]: pd.DataFrame(data,columns=['year','state','pop'])
Out[8]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

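You can specify both the columns and the index. The row index is no longer the only key, since the column labels act as keys too, so it can be set freely at construction time. If a requested column is not found in the data, it appears in the result filled with missing values: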

In [9]: pd.DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five','six'])
Out[9]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

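As noted above, a DataFrame can be seen as several Series, so a column can naturally be retrieved as a Series, either with dict-style indexing or as an attribute: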

In [11]: frame
Out[11]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

In [12]: frame['year']
Out[12]:
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [13]: frame.state
Out[13]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [14]: type(frame['year'])
Out[14]: pandas.core.series.Series

Create a new column on a DataFrame:

In [26]: frame['eastern']=frame['state']=='Ohio'

In [27]: frame
Out[27]:
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7   NaN     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4   NaN    False
five   2002  Nevada  2.9   NaN    False
six    2003  Nevada  3.2   NaN    False

Delete the column with del:

In [28]: del frame['eastern']

In [29]: frame
Out[29]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9   NaN
six    2003  Nevada  3.2   NaN

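In pandas, the column returned by indexing a DataFrame is a view on the underlying data rather than a copy, so in-place modifications to that Series can affect the original DataFrame (whether they do depends on the pandas version and its copy-on-write setting). To get an independent copy, call the object's copy method.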

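Besides nested lists, a DataFrame can also be built from a nested dict. The keys of the outer dict become the column labels and the keys of the inner dicts become the row labels (the same applies to a dict of Series).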

In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

Like a NumPy array, a DataFrame can be transposed:

In [34]: frame2.T
Out[34]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
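The transcript above never shows how frame2 was built. A plausible reconstruction, assuming frame2 is simply pd.DataFrame(pop), looks like this (the sort_index call is an addition that makes the row order explicit, since the order of the inner keys can differ between pandas versions):

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys become columns, inner keys become row labels.
frame2 = pd.DataFrame(pop).sort_index()

# Transposing swaps the two axes: states become rows, years become columns.
transposed = frame2.T

print(frame2)
print(transposed)
```

Nevada has no entry for 2000, so that cell is NaN in both layouts.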


Constructing DataFrame objects

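Summary of ways to construct a DataFrame: from nested lists, from a dict, or from a dict of Series; or from a two-dimensional ndarray (in which case both the row and column labels default to integers starting at 0).
When a list of dicts is passed to the DataFrame constructor, the union of all the keys becomes the column index.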
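A short sketch of the last two construction routes; the sample records and variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

# List of dicts: each dict is one row; the union of the keys forms the columns.
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
from_records = pd.DataFrame(records)

# 2D ndarray: rows and columns get default integer labels 0..n-1.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3))

print(from_records)
print(from_array)
```

Keys missing from a given dict (here 'c' in the first record and 'a' in the second) show up as NaN.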

Index objects

pandas Index objects hold the axis labels and other metadata (such as the axis name).
When a Series or DataFrame is constructed, any array or other sequence of labels is converted into an Index:

In [37]: frame.index
Out[37]: Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

Create a multi-level index object with pd.MultiIndex.from_arrays().

Index objects are immutable:

In [42]: index[1]='d'  # attempt to modify a value of the Index

TypeError: Index does not support mutable operations


5.2 Essential Functionality

reindex: returns a new Series conformed to the new index

In [79]: obj
Out[79]:
0      blue
2    purple
4    yellow
dtype: object

In [80]: obj.reindex(range(6))
Out[80]:
0      blue
1       NaN
2    purple
3       NaN
4    yellow
5       NaN
dtype: object

To avoid missing values when calling reindex, pass method='ffill' to forward-fill the values:

In [81]: obj.reindex(range(6),method='ffill')
Out[81]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object


Dropping entries from a Series

In [111]: se
Out[111]:
Texas         1
Utah          2
California    3
dtype: int64

In [112]: se.drop('Texas')
Out[112]:
Utah          2
California    3
dtype: int64

Dropping entries from a DataFrame

In [122]: frame.drop(['a','b'])
Out[122]:
   Texas  Utah  California
c      6     7           8

By default, drop removes labels along axis=0 (rows); pass axis=1 or axis='columns' to drop columns instead:

In [124]: frame.drop('Texas',axis=1)
Out[124]:
   Utah  California
a     1           2
b     4           5
c     7           8

Passing inplace=True to drop modifies the original object in place instead of returning a new one:

In [128]: frame
Out[128]:
   Texas  Utah  California
a      0     1           2
b      3     4           5
c      6     7           8

In [129]: frame.drop('a',inplace=True)

In [130]: frame
Out[130]:
   Texas  Utah  California
b      3     4           5
c      6     7           8

Indexing, selection, and filtering

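Indexing a Series works much like indexing a NumPy array,
except that the index labels can be strings. Note that slicing by label includes the end point: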

In [151]: se
Out[151]:
Texas         1
Utah          2
California    3
dtype: int64

In [152]: se['Texas':'California']
Out[152]:
Texas         1
Utah          2
California    3
dtype: int64

Indexing a DataFrame with a single value or a sequence retrieves one or more columns:

In [149]: data[['Texas','California']]
Out[149]:
       Texas  California
one        0           2
two        3           5
three      6           8

Boolean array indexing is also supported; it filters rows:

In [154]: data[data['Texas']>3]
Out[154]:
       Texas  Utah  California
three      6     7           8

Indexing with a boolean DataFrame is supported as well:

In [164]: data[data>5]
Out[164]:
       Texas  Utah  California
one      NaN   NaN         NaN
two      NaN   NaN         NaN
three    6.0   7.0         8.0

Plain [] indexing as shown above mainly selects columns (boolean masks filter rows). Next, loc and iloc are used to select rows.

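Selecting with a multi-level index

When the rows or columns to select from carry a multi-level index, the selection approach changes; df.reindex() can be used in that case. For example: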

In [26]: arrays=[['key1','key1','key2','key2'],['a','a','b','b']]

In [27]: mul_index=pd.MultiIndex.from_arrays(arrays,names=['level1','level2'])

In [28]: mul_index
Out[28]:
MultiIndex(levels=[['key1', 'key2'], ['a', 'b']],
           labels=[[0, 0, 1, 1], [0, 0, 1, 1]],
           names=['level1', 'level2'])

In [29]: df=pd.DataFrame(np.random.randint(10,size=(4,4)),columns=mul_index)

In [30]: df
Out[30]:
level1 key1    key2
level2    a  a    b  b
0         1  0    7  5
1         2  1    9  1
2         7  4    5  9
3         3  7    9  9

You can then drill down one level at a time to get the columns you want:

In [31]: df['key1']['a']
Out[31]:
level2  a  a
0       1  0
1       2  1
2       7  4
3       3  7

That said, I usually prefer a more explicit approach: reindex.
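The chained lookup above is not the only option. As a hedged sketch (the frame, its unique column labels, and the variable names are illustrative assumptions, not taken from the original notes), a tuple key or xs selects from multi-level columns in a single step:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with unique two-level column labels.
columns = pd.MultiIndex.from_arrays(
    [['key1', 'key1', 'key2', 'key2'], ['a', 'b', 'a', 'b']],
    names=['level1', 'level2'])
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=columns)

one_column = df[('key1', 'b')]              # a tuple gives one label per level
same_column = df.loc[:, ('key1', 'b')]      # the same selection through loc
all_b = df.xs('b', axis=1, level='level2')  # every column whose level2 label is 'b'

print(one_column)
print(all_b)
```

The tuple form names the full path to a column, while xs cuts across the outer level and returns one column per outer key.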


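###Selection with loc and iloc
In general, prefer loc and iloc for selecting rows and columns, because slicing a DataFrame directly with [] can be ambiguous and lead to errors.
loc selects rows by label:

In [176]: data.loc['one']
Out[176]:
Texas         0
Utah          1
California    2
Name: one, dtype: int32

iloc selects rows by integer position:

In [175]: data.iloc[0]
Out[175]:
Texas         0
Utah          1
California    2
Name: one, dtype: int32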

In [181]: data.iloc[0,1]
Out[181]: 1


![image.png](https://upload-images.jianshu.io/upload_images/11910087-54c5eb3073fb488d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

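###Integer indexes
The following code raises an error:

ser = pd.Series(np.arange(3.))
ser
ser[-1]

The cause is ambiguity: with an integer index, pandas cannot tell whether -1 is meant as a label or as a position. With a non-integer index there is no such ambiguity and no error: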

In [202]: ser2=pd.Series(np.arange(3.),index=['a','b','c'])

In [203]: ser2[-1]
Out[203]: 2.0


**To avoid ambiguity, prefer loc (labels) and iloc (integer positions) so that the intent is explicit.**
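A small sketch of the explicit forms, using the same two Series as above (the result variable names are illustrative):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                          # integer labels 0, 1, 2
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])  # string labels

last_by_position = ser.iloc[-1]   # position-based: the last element
first_by_label = ser.loc[0]       # label-based: the element labelled 0
last_of_ser2 = ser2.iloc[-1]      # position-based on a string-labelled Series
up_to_one = ser.loc[:1]           # label slices include the end label

print(last_by_position, first_by_label, last_of_ser2)
print(up_to_one)
```

With iloc, -1 always means "last position"; with loc, a value is always looked up as a label, so neither call depends on the type of the index.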

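###Arithmetic and data alignment
When objects with different indexes are combined arithmetically, the index of the result is the union of the two indexes.

In [154]: s1 + s2
Out[154]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

**Labels that do not overlap introduce NA values.**
**For a DataFrame, alignment happens on both the rows and the columns.**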
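The notes do not show how s1 and s2 were defined. The sketch below assumes inputs chosen so that their sum matches the output shown above, and adds two small hypothetical DataFrames (df1, df2) to illustrate alignment on both axes:

```python
import pandas as pd

# Assumed inputs, chosen so that the sum matches the output shown above.
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
total = s1 + s2   # union of labels; labels present on one side only become NaN

# For DataFrames the same alignment happens on rows and columns at once.
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
df2 = pd.DataFrame({'B': [10, 20], 'C': [30, 40]}, index=['y', 'z'])
frame_total = df1 + df2   # only the cell ('y', 'B') exists in both inputs

print(total)
print(frame_total)
```

In the DataFrame case the result has rows x, y, z and columns A, B, C, and every cell except ('y', 'B') is NaN.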

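###Arithmetic methods with fill values
**As shown above, arithmetic on DataFrame and Series objects aligns the data automatically and fills the non-overlapping positions with NA. When calling the methods below, you can pass fill_value=<value> to substitute a value for the missing entries.**
![image.png](https://upload-images.jianshu.io/upload_images/11910087-706b31e86f2a4d16.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
In the table above, the methods whose names start with r have their arguments reversed. For example, the following two computations are equivalent: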

In [213]: ser2.add(ser1)
Out[213]:
a     NaN
b     8.0
c    10.0
d     NaN
dtype: float64

In [214]: ser1.radd(ser2)
Out[214]:
a     NaN
b     8.0
c    10.0
d     NaN
dtype: float64
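To make the effect of fill_value concrete, here is a minimal sketch. The two Series are hypothetical (the originals are not shown in these notes), so the numbers differ from the transcript above:

```python
import pandas as pd

# Hypothetical inputs with partially overlapping labels.
ser1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
ser2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

plain = ser1.add(ser2)                 # NaN wherever a label is missing on one side
filled = ser1.add(ser2, fill_value=0)  # the missing side is treated as 0 instead
reversed_sub = ser1.rsub(ser2)         # arguments reversed: same as ser2 - ser1

print(plain)
print(filled)
print(reversed_sub)
```

Note that fill_value replaces a value that is missing on one side before the operation; a label missing on both sides would still produce NaN.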

**Similarly, many operations that can produce NA values accept a fill_value parameter,** for example reindexing a DataFrame:

In [221]: frame.reindex(['a','b','c'],fill_value=1)
Out[221]:
   Texas  Utah  California
a      1     1           1
b      3     4           5
c      6     7           8

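**Operations between a DataFrame and a Series follow broadcasting rules**

In [232]: frame
Out[232]:
   Texas  Utah  California
b      3     4           5
c      6     7           8

In [233]: se=frame.iloc[0]

In [234]: se
Out[234]:
Texas         3
Utah          4
California    5
Name: b, dtype: int32

Row 0 of frame is taken here so that it can be broadcast down the rows; it is therefore matched on the columns.

In [235]: frame.sub(se,axis='columns')
Out[235]:
   Texas  Utah  California
b      0     0           0
c      3     3           3

**Summary:** to broadcast down the rows, match on the columns; to broadcast across the columns, match on the rows (in that case pass axis='index' to the arithmetic method; the default is axis='columns').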
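Both directions side by side, as a sketch that rebuilds a frame with the same values as above (the names row, col, by_row, and by_col are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(3, 9).reshape(2, 3),
                     index=['b', 'c'],
                     columns=['Texas', 'Utah', 'California'])

row = frame.iloc[0]    # a Series labelled by column names
col = frame['Utah']    # a Series labelled by row labels

by_row = frame.sub(row, axis='columns')  # match on columns, broadcast down the rows
by_col = frame.sub(col, axis='index')    # match on rows, broadcast across the columns

print(by_row)
print(by_col)
```

The axis argument names the axis whose labels are matched against the index of the Series, not the direction of the broadcast.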


###Function application and mapping
**NumPy ufuncs (element-wise array functions) also work on pandas objects:**

In [246]: frame2
Out[246]:
      Texas      Utah  California
0  1.973670 -0.337147   -0.267831
1 -1.340909  0.662307    0.180784
2  0.341601  0.071515   -0.099701

In [248]: np.abs(frame2)
Out[248]:
      Texas      Utah  California
0  1.973670  0.337147    0.267831
1  1.340909  0.662307    0.180784
2  0.341601  0.071515    0.099701

**To apply a function to each row or column of a DataFrame, use its apply method**

In [238]: frame
Out[238]:
   Texas  Utah  California
b      3     4           5
c      6     7           8

In [239]: f=lambda x:x.max()-x.min()

Apply the function to each column (matching on the rows):

In [241]: frame.apply(f,axis='index')
Out[241]:
Texas         3
Utah          3
California    3
dtype: int64

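**To apply a function to every element of a DataFrame, use its applymap method** (pandas 2.1 and later rename it to DataFrame.map and deprecate applymap)

In [249]: f=lambda x:x*2

In [250]: frame2
Out[250]:
      Texas      Utah  California
0  1.973670 -0.337147   -0.267831
1 -1.340909  0.662307    0.180784
2  0.341601  0.071515   -0.099701

In [252]: frame2.applymap(f)
Out[252]:
      Texas      Utah  California
0  3.947339 -0.674294   -0.535662
1 -2.681818  1.324614    0.361568
2  0.683203  0.143031   -0.199401

P.S. Because operations on a DataFrame are vectorized, a function that is itself vectorized (such as x*2) also gives an element-wise result when passed to apply.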
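A compact sketch of the difference between per-axis and element-wise application, using a frame with the same values as above (the variable names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'Texas': [3, 6], 'Utah': [4, 7], 'California': [5, 8]},
                     index=['b', 'c'])

spread = lambda x: x.max() - x.min()
per_column = frame.apply(spread, axis='index')   # one result per column
per_row = frame.apply(spread, axis='columns')    # one result per row

# x * 2 is vectorized, so applying it to whole columns is element-wise in effect.
doubled = frame.apply(lambda x: x * 2)

print(per_column)
print(per_row)
print(doubled)
```

A reducing function such as spread collapses each column or row to one value, whereas a vectorized function returns a frame of the original shape.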

###Sorting and ranking
**(1) Sorting by index**
**obj.sort_index() (for a DataFrame, pass axis to choose which axis to sort)**
**(2) Sorting by values**
**obj.sort_values() (for a DataFrame, pass by to name one column, or a list of several columns)**
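The two methods on a small made-up frame (the data and variable names are illustrative only):

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]},
                     index=['three', 'one', 'two', 'four'])

by_row_label = frame.sort_index()             # sort rows by their labels
by_col_label = frame.sort_index(axis=1)       # sort columns by their labels
by_value = frame.sort_values(by='b')          # sort rows by a single column
by_values = frame.sort_values(by=['a', 'b'])  # sort by 'a', then by 'b' within ties

print(by_row_label)
print(by_col_label)
print(by_value)
print(by_values)
```

Both methods return a new object by default and accept ascending=False for descending order.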

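####Ranking
Use the rank method of a Series. By default, rank breaks ties by assigning each group of tied values the mean of their ranks:

In [269]: obj
Out[269]:
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

**By default, the smaller the value, the higher its priority (the lower its rank number).**

The values returned by obj.rank() are the ranks: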

In [270]: obj.rank()
Out[270]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Break ties by the order in which the values appear in the data:

In [271]: obj.rank(method='first')

Rank in descending order:

In [273]: obj.rank(method='first',ascending=False)
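The outputs of the last two calls are not shown above. A sketch that runs all three variants on the same obj (the result variable names are illustrative):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                  # ties share the mean of their ranks
first = obj.rank(method='first')      # ties are ranked in order of appearance
descending = obj.rank(method='first', ascending=False)  # largest value ranks first

print(pd.DataFrame({'value': obj, 'average': average,
                    'first': first, 'descending': descending}))
```

With method='first' the two 7s receive ranks 6 and 7 instead of sharing 6.5, so the ranks form a permutation of 1..n.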

5.3 Summarizing and Computing Descriptive Statistics

Calling describe on an object returns several summary statistics for it at once.


Unique values, value counts, and membership

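unique returns an array of the distinct values in a Series.
value_counts computes how often each value occurs in a Series.
isin performs a vectorized set-membership check and can be used to filter a Series, or a column of a DataFrame, down to a subset of values.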
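The three methods on one small Series (the sample data and variable names are illustrative):

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()         # distinct values, in order of first appearance
counts = obj.value_counts()    # number of occurrences per value
mask = obj.isin(['b', 'c'])    # boolean Series: is each value in the given set?
subset = obj[mask]             # keep only the matching entries

print(uniques)
print(counts)
print(subset)
```

The boolean mask from isin plugs straight into [] indexing, which is what makes it convenient for filtering.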


Additional pandas methods

pd.date_range('16/3/2019',periods=5)
Returns an index of 5 consecutive days starting from 16/3/2019 (16 March 2019).

tips=pd.read_csv('examples/tips.csv')
tips.head(10)


pd.crosstab()

party_counts = pd.crosstab(tips['day'], tips['size'])
party_counts


The first argument to crosstab() supplies the row index and the second supplies the column to tabulate; the method returns a new DataFrame of counts.
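Since examples/tips.csv is not included here, the sketch below uses a tiny hypothetical stand-in with the same two columns:

```python
import pandas as pd

# Hypothetical stand-in for the tips data: one row per party.
tips = pd.DataFrame({'day': ['Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
                     'size': [2, 3, 2, 2, 4]})

# Use tips['size'] rather than tips.size: the attribute is the element count.
party_counts = pd.crosstab(tips['day'], tips['size'])

print(party_counts)
```

Each cell counts the rows that have that combination of day and party size; combinations that never occur are filled with 0.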

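df.pct_change() / se.pct_change()
For each element, returns the fractional change relative to the previous element (1.0 means an increase of 100%).
It is commonly used to observe the trend of a data series.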

In [184]: se=pd.Series([1,1,2])

In [185]: se.pct_change()
Out[185]:
0    NaN
1    0.0
2    1.0
dtype: float64

pandas.DataFrame.corrwith

DataFrame.corrwith(other, axis=0, drop=False, method='pearson')

Compute pairwise correlation between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.
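In other words, it computes the correlation between the rows or columns (columns by default) of the DataFrame and the matching row or column of the Series or DataFrame passed in.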

Series.asof(date)
Returns the most recent non-NaN value at or before the given date.

pandas.Series.take
Selects the elements at the given integer positions along an axis.

pandas.Series.astype
Series.astype(dtype, copy=True, errors='raise', **kwargs)
Cast a pandas object to a specified dtype dtype.

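pandas.qcut
Bins by the number of values: the bin edges are quantiles, so each bin holds roughly the same number of values. Returns a Categorical object.
Parameter labels: names for the resulting bins.
pandas.cut
Bins by the magnitude of the values, using edges defined on the value range.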
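A side-by-side sketch of the two functions on made-up data (the bin edges, labels, and variable names are illustrative choices):

```python
import pandas as pd

ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# cut: bins are defined on the value range, so bin sizes can be very uneven.
by_value = pd.cut(ages, bins=[0, 10, 100], labels=['low', 'high'])

# qcut: bins are defined by quantiles, so each bin holds about the same count.
by_count = pd.qcut(ages, 4, labels=['q1', 'q2', 'q3', 'q4'])

print(by_value.value_counts())
print(by_count.value_counts())
```

The single outlier 100 ends up alone in the 'high' bin under cut, while qcut still puts two values in every quartile.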

pandas.get_dummies
Convert categorical variable into dummy/indicator variables
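Converts a categorical variable into dummy variables. Dummy variables, also called indicator variables, are usually encoded as 0 and 1. A one-dimensional column becomes a two-dimensional matrix.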
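A minimal sketch of the one-dimensional to two-dimensional expansion (the frame and column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c'], 'data1': range(4)})

# dtype=int forces 0/1 output; some pandas versions default to booleans.
dummies = pd.get_dummies(df['key'], dtype=int)

print(dummies)
```

Each distinct value of key becomes its own column, and every row has exactly one 1 marking its original category.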

pandas.date_range
Return a fixed frequency DatetimeIndex
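Parameter start: the first timestamp of the range
Parameter end: the last timestamp of the range
Parameter freq: a frequency string such as D, M, or Y (newer pandas versions spell the month-end and year-end aliases ME and YE)
Parameter periods: the number of periods to generate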
In [118]: pd.date_range(start='1/1/2018',periods=8)
Out[118]:

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')

pandas.Series.reset_index
Generate a new DataFrame or Series with the index reset.

This is useful when the index needs to be treated as a column, or when the index is meaningless and needs to be reset to the default before another operation.
In short, it turns the index into a regular column and returns a new Series or DataFrame with a default index.

pandas.DataFrame.join

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Join columns of another DataFrame.

Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

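When the two DataFrames share column names, you must provide suffixes (lsuffix, rsuffix).
Unless a key column is given with on, the frames are combined on their indexes.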

In [44]: df1
Out[44]:
   k1   k2
0   1    4
1   2  272
2   3    1

In [45]: df2
Out[45]:
   k3  k4
0   5   6
1   6   4
2   7   2

In [46]: df1.join(df2)
Out[46]:
   k1   k2  k3  k4
0   1    4   5   6
1   2  272   6   4
2   3    1   7   2

Several DataFrames can be joined at once by passing them as a list:

In [48]: df1
Out[48]:
   k1   k2
0   1    4
1   2  272
2   3    1

In [49]: df2
Out[49]:
   k3  k4
0   5   6
1   6   4
2   7   2

In [50]: df3
Out[50]:
   k5  k6
0   1   4
1   2   5
2   3   6

In [51]: df1.join([df2,df3])
Out[51]:
   k1   k2  k3  k4  k5  k6
0   1    4   5   6   1   4
1   2  272   6   4   2   5
2   3    1   7   2   3   6
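The examples above have no overlapping column names. A sketch of the overlapping case, with hypothetical frames left and right that both contain a column k1:

```python
import pandas as pd

left = pd.DataFrame({'k1': [1, 2, 3], 'v': [4, 5, 6]})
right = pd.DataFrame({'k1': [7, 8, 9], 'w': [0, 1, 2]})

# Both frames have a column named k1, so suffixes are needed to tell them apart.
joined = left.join(right, lsuffix='_left', rsuffix='_right')

print(joined)
```

Calling left.join(right) without suffixes raises a ValueError here, because pandas cannot keep two columns with the same name apart.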

pandas.DataFrame.merge

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Merge DataFrame or named Series objects with a database-style join.
Compared with pandas.DataFrame.join, merge provides a database-style join.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Joins two DataFrames on columns or on indexes. If the frames have no column names in common and no key columns are specified, the merge cannot be performed.

In [80]: df2
Out[80]:
  key  data2
0   a      0
1   b      1
2   d      2

In [81]: df
Out[81]:
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5

In [82]: pd.merge(df,df2,left_on='key',right_on='key')
Out[82]:
Empty DataFrame
Columns: [key, A, data2]
Index: []

In [83]: pd.merge(df,df2,left_on='key',right_on='key',how='outer')
Out[83]:
  key    A  data2
0  K0   A0    NaN
1  K1   A1    NaN
2  K2   A2    NaN
3  K3   A3    NaN
4  K4   A4    NaN
5  K5   A5    NaN
6   a  NaN    0.0
7   b  NaN    1.0
8   d  NaN    2.0

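Joins on key columns are usually one-to-many or many-to-many.
In the one-to-many case, the single row is repeated for every matching row on the "many" side.
In the many-to-many case, each key produces the Cartesian product of its matching rows. An inner join (the default) keeps only the keys present in both frames, while an outer join keeps the keys from both.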
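Both cases in a short sketch; all frames and column names here are made up for illustration:

```python
import pandas as pd

one = pd.DataFrame({'key': ['a', 'b'], 'left_val': [1, 2]})
many = pd.DataFrame({'key': ['a', 'a', 'b', 'c'],
                     'right_val': [10, 11, 12, 13]})

# One-to-many: the single 'a' row on the left is repeated for both 'a' rows.
inner = pd.merge(one, many, on='key')               # how='inner' by default
outer = pd.merge(one, many, on='key', how='outer')  # also keeps unmatched 'c'

# Many-to-many: every left row pairs with every right row sharing its key.
dup_left = pd.DataFrame({'key': ['a', 'a'], 'x': [1, 2]})
dup_right = pd.DataFrame({'key': ['a', 'a', 'a'], 'y': [7, 8, 9]})
product = pd.merge(dup_left, dup_right, on='key')

print(inner)
print(outer)
print(product)
```

The many-to-many result has 2 x 3 = 6 rows for key 'a', which is why duplicated keys on both sides can make a merge grow quickly.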

pandas.DataFrame.pivot_table

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
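Parameters:
index: Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data.
columns: same as index, but for the pivot table columns.
values: the column or columns to aggregate.
In practice, the arguments you usually need to supply to pivot_table are index and values.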

In [98]: df
Out[98]:
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

Specify the index columns of the pivot table and the corresponding values column. columns is usually optional and is used for a further breakdown.

In [97]: table=pd.pivot_table(df,values='D',index=['A','B'],columns=['C'],aggfunc=np.sum)

In [99]: table
Out[99]:
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

pandas.DataFrame.sort_values

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Sort by the values along either axis.
by accepts a single column name or a list of column names; sorting is performed along the specified axis (axis 0 by default).

pandas.concat

pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)

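merge and join combine frames column-wise; to stack frames row-wise, use concat. Since each row of a dataset is usually an independent record, concat is generally suited to combining several datasets that share the same format.
Parameter keys: labels for the outermost index level, one per concatenated object.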
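A small sketch of row-wise concatenation with and without keys; the two monthly frames are hypothetical:

```python
import pandas as pd

jan = pd.DataFrame({'city': ['A', 'B'], 'sales': [1, 2]})
feb = pd.DataFrame({'city': ['A', 'B'], 'sales': [3, 4]})

stacked = pd.concat([jan, feb], ignore_index=True)     # rows appended, index renumbered
labelled = pd.concat([jan, feb], keys=['jan', 'feb'])  # keys form the outer index level

print(stacked)
print(labelled)
print(labelled.loc['feb'])
```

With keys, the origin of every row stays recoverable, and labelled.loc['feb'] returns the rows that came from the second frame.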

pandas.Series.isin

Series.isin(values)

Check whether values are contained in Series.
That is, it checks element by element whether each value of the Series appears in values.

pandas.DataFrame.duplicated

DataFrame.duplicated(subset=None, keep='first')

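Return boolean Series denoting duplicate rows, optionally only considering certain columns.
Rows are compared from top to bottom: a row that repeats an earlier row is marked True, and first occurrences are marked False (with the default keep='first').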

pandas.DataFrame.rename

DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)

Renames the labels along the specified axis.
Typically a dict is passed, mapping the old labels to the new ones.

pandas.DataFrame.quantile

DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')

Return values at the given quantile over requested axis.
For example, passing q=0.5 computes the median.
Position formula (1-based, over the sorted values): pos = 1+(n-1)*q. When pos is not an integer, the default interpolation='linear' interpolates between the two neighbouring values.

In [28]: df
Out[28]:
   a    b
0  1    1
1  2   10
2  3  100
3  4  100

In [29]: df.quantile(0.5)
Out[29]:
a     2.5
b    55.0
Name: 0.5, dtype: float64

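In the output above, for column a: pos = 1+(4-1)*0.5 = 2.5, which is halfway between the 2nd and 3rd sorted values (2 and 3), giving 2.5. For column b the same position falls halfway between 10 and 100, giving 55.0.
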
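The position formula can be checked by hand. The helper linear_quantile below is a hypothetical illustration written for these notes, not part of pandas; it assumes the default linear interpolation:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 10, 100, 100]})

def linear_quantile(values, q):
    """Quantile by linear interpolation at 1-based position 1 + (n - 1) * q."""
    ordered = sorted(values)
    pos = 1 + (len(ordered) - 1) * q
    lower = int(pos)      # 1-based rank of the value at or just below pos
    frac = pos - lower    # how far pos lies towards the next value
    if frac == 0:
        return float(ordered[lower - 1])
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

by_hand = {col: linear_quantile(df[col], 0.5) for col in df.columns}
by_pandas = df.quantile(0.5)

print(by_hand)
print(by_pandas)
```

For q=0.5 and n=4 the position is 2.5 in both columns, so the result is the midpoint of the 2nd and 3rd sorted values.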
pandas.Series.idxmax

Series.idxmax(axis=0, skipna=True, *args, **kwargs)

Return the row label of the maximum value.

If multiple values equal the maximum, the first row label with that value is returned.

pandas.Series.map

Series.map(arg, na_action=None)

Map values of Series according to input correspondence.
Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.
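A short sketch with a dict and with a function (the sample Series and variable names are illustrative):

```python
import pandas as pd

animals = pd.Series(['cat', 'dog', 'cat', 'bird'])

legs = animals.map({'cat': 4, 'dog': 4})  # dict lookup; unmapped values become NaN
upper = animals.map(str.upper)            # function applied to each value

print(legs)
print(upper)
```

Values without an entry in the dict ('bird' here) map to NaN, which also turns the result into a float Series.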

pandas.DataFrame.plot

DataFrame.plot(x=None, y=None, kind='line', ax=None, subplots=False, sharex=None, sharey=False, layout=None, figsize=None, use_index=True, title=None, grid=None, legend=True, style=None, logx=False, logy=False, loglog=False, xticks=None, yticks=None, xlim=None, ylim=None, rot=None, fontsize=None, colormap=None, table=False, yerr=None, xerr=None, secondary_y=False, sort_columns=False, **kwds)

Make plots of DataFrame using matplotlib / pylab.
New in version 0.17.0: Each plot kind has a corresponding method on the DataFrame.plot accessor: df.plot(kind='line') is equivalent to df.plot.line().
Parameters:
kind : str
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot (axes swapped)
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot
‘hexbin’ : hexbin plot

subplots: whether to draw each column in its own subplot; default False
sharex: when subplots is True, all subplots share the x axis
sharey: when subplots is True, all subplots share the y axis

Last edited: 2019.03.27 12:54:18

© Copyright belongs to the author. For reprints or content partnerships, please contact the author.
