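1. The DataFrame object
A DataFrame arranges multiple columns of data in a fixed order, and each column may hold a different data type.
A DataFrame has two index arrays. The first is associated with the rows and closely resembles the index array of a Series: each label is associated with all the elements in its row. The second contains a series of labels, each associated with one column of data.
A DataFrame can be thought of as a dict of Series, where each column name is a key and the Series that forms that column is the corresponding value.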
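The "dict of Series" view can be illustrated with a minimal sketch. The labels x and y and the variable names below are made up for illustration and do not come from the original notes.

```python
import pandas as pd

# Two Series sharing the same row labels (illustrative data).
colors = pd.Series(['red', 'blue'], index=['x', 'y'])
prices = pd.Series([1.1, 1.2], index=['x', 'y'])

# Each dict key becomes a column name; each Series becomes that column.
df = pd.DataFrame({'colors': colors, 'price': prices})

row_labels = list(df.index)       # first index array: row labels
column_labels = list(df.columns)  # second index array: column labels
price_column = df['price']        # selecting a column gives back a Series
print(df)
```

Selecting a column by its key returns a Series, which is what makes the dict analogy useful in practice.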
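2. Defining a DataFrame object
The most common way to create a DataFrame is to pass a dict object to the DataFrame() constructor.
Each key of the dict is a column name, and each key has an array as its value.
1) Put every key-value pair of the dict into the DataFrame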
>>> import pandas as pd  # import the pandas package
>>> data={'colors':['red','blue','yellow','black'],'object':['pen','paper','ball','mug'],'price':[1.1,1.2,3.2,4]}  # define a dict: each key becomes a column name of the DataFrame, and each value supplies that column's elements (named data so the built-in dict is not shadowed)
>>> data
{'object': ['pen', 'paper', 'ball', 'mug'], 'price': [1.1, 1.2, 3.2, 4], 'colors': ['red', 'blue', 'yellow', 'black']}
>>> s=pd.DataFrame(data)  # pass the dict to the DataFrame constructor
>>> s
   colors object  price
0     red    pen    1.1
1    blue  paper    1.2
2  yellow   ball    3.2
3   black    mug    4.0
2) Select only some of the dict's entries to initialize the DataFrame
>>> import pandas as pd  # import the pandas package
>>> dic={'colos':['red','black','yellow','orange'],'object':['pen','ball','shirt','mug'],'price':[1.2,3.4,2.3,5]}  # define the dict
>>> dic
{'object': ['pen', 'ball', 'shirt', 'mug'], 'price': [1.2, 3.4, 2.3, 5], 'colos': ['red', 'black', 'yellow', 'orange']}
>>> s=pd.DataFrame(dic,columns=['price','object'])  # initialize the DataFrame from the dict, keeping only two columns, in the order given
>>> s
   price object
0    1.2    pen
1    3.4   ball
2    2.3  shirt
3    5.0    mug
3) Define a custom index for the DataFrame (the examples above define none, so pandas assigns default labels starting from 0)
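As a minimal sketch of a custom index, the index option of the constructor can be given one label per row. The labels 'one' to 'four' are illustrative and not taken from the original notes.

```python
import pandas as pd

data = {'colors': ['red', 'blue', 'yellow', 'black'],
        'price': [1.1, 1.2, 3.2, 4.0]}

# index= supplies one row label per row instead of the default 0, 1, 2, ...
df = pd.DataFrame(data, index=['one', 'two', 'three', 'four'])

labels = list(df.index)
price_two = df.loc['two', 'price']  # look up a value by row label and column name
print(df)
```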
4) Define a DataFrame without a dict, using three constructor arguments
Pass three arguments in this order: the data matrix, the index option, and the columns option. Assign the array of row labels to index and the array of column names to columns. np.arange(16).reshape(4,4) is a quick way to generate a matrix.
>>> import numpy as np
>>> import pandas as pd
>>> arry=np.arange(16)
>>> arry
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])
>>> arry=np.arange(16).reshape(4,4)
>>> arry
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> s=pd.DataFrame(arry,index=['a','b','c','d'],columns=['A','B','C','D'])
>>> s
    A   B   C   D
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
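3. Selecting elements
1) To get the names of all the columns of a DataFrame, use its columns attribute
2) To get the list of index labels of a DataFrame, use its index attribute
3) To get the elements stored in the data structure, use its values attribute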
>>> import numpy as np
>>> import pandas as pd
>>> arry=np.arange(16)
>>> arry
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])
>>> arry=np.arange(16).reshape(4,4)
>>> arry
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> s=pd.DataFrame(arry,columns=['a','b','c','d'])
>>> s
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
>>> s=pd.DataFrame(arry,index=['a','b','c','d'],columns=['A','B','C','D'])
>>> s
    A   B   C   D
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
>>> s.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> s.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
>>> s.values
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
4) To get the contents of one column, use the column name as an index, or access the column name as an attribute
First approach
>>> s['B']
a     1
b     5
c     9
d    13
Name: B, dtype: int64
Second approach
>>> s.B
a     1
b     5
c     9
d    13
Name: B, dtype: int64
5) To get a row of a DataFrame, use the iloc attribute with the row position, or the loc attribute with the row label (the older ix attribute, which accepted both, was removed in pandas 1.0)
Get a single row
>>> s
    A   B   C   D
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
>>> s.iloc[2]
A     8
B     9
C    10
D    11
Name: c, dtype: int64
>>> s.loc['c']
A     8
B     9
C    10
D    11
Name: c, dtype: int64
Get multiple rows (non-contiguous)
>>> s
    A   B   C   D
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
>>> s.iloc[[1,3]]
    A   B   C   D
b   4   5   6   7
d  12  13  14  15
>>> s.loc[['b','d']]
    A   B   C   D
b   4   5   6   7
d  12  13  14  15
Get multiple rows (contiguous)
>>> s
    A   B   C   D
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
>>> s.iloc[0:3]
   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
>>> s.loc['a':'c']
   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
Get a single element
>>> s
    A   B   C   D
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
>>> s['A'].iloc[1]  # note: select the column ['A'] first, then the row position 1
4
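The following is a compact sketch of row and element selection using the loc, iloc, and at accessors, which are available in current pandas releases. The variable names are illustrative; the frame is the same 4x4 example used in this section.

```python
import numpy as np
import pandas as pd

s = pd.DataFrame(np.arange(16).reshape(4, 4),
                 index=['a', 'b', 'c', 'd'], columns=['A', 'B', 'C', 'D'])

row_by_position = s.iloc[2]      # third row, i.e. the row labelled 'c'
row_by_label = s.loc['c']        # same row, selected by label
rows_slice_pos = s.iloc[0:3]     # positional slice: end excluded -> a, b, c
rows_slice_lab = s.loc['a':'c']  # label slice: end included -> a, b, c
single = s.at['b', 'A']          # one element by row label and column name
```

Note the asymmetry: positional slices exclude the end point, while label slices include it, so both slices above return the same three rows.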
4. Assignment
1) Assign a name to index and columns
>>> import numpy as np
>>> import pandas as pd
>>> arry=np.arange(16)
>>> arry
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])
>>> arry=np.arange(16).reshape(4,4)
>>> arry
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> s=pd.DataFrame(arry,index=['a','b','c','d'],columns=['A','B','C','D'])
>>> s
    A   B   C   D
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
>>> s.columns.name='item'
>>> s
item   A   B   C   D
a      0   1   2   3
b      4   5   6   7
c      8   9  10  11
d     12  13  14  15
>>> s.index.name='id'
>>> s
item   A   B   C   D
id
a      0   1   2   3
b      4   5   6   7
c      8   9  10  11
d     12  13  14  15
2) Add a new column
>>> s
item   A   B   C   D
id
a      0   1   2   3
b      4   5   6   7
c      8   9  10  11
d     12  13  14  15
>>> s['E']=12
>>> s
item   A   B   C   D   E
id
a      0   1   2   3  12
b      4   5   6   7  12
c      8   9  10  11  12
d     12  13  14  15  12
3) Update the values of an existing column
>>> s
item   A   B   C   D   E
id
a      0   1   2   3  12
b      4   5   6   7  12
c      8   9  10  11  12
d     12  13  14  15  12
>>> s['E']=[3,5,2,6]
>>> s
item   A   B   C   D  E
id
a      0   1   2   3  3
b      4   5   6   7  5
c      8   9  10  11  2
d     12  13  14  15  6
5. Membership of elements
>>> s
item   A   B   C   D   E   F
id
a      0   1   2   3 NaN NaN
b      4   5   6   7 NaN NaN
c      8   9  10  11 NaN NaN
d     12  13  14  15 NaN NaN
>>> s.isin([1,4])
item      A      B      C      D      E      F
id
a     False   True  False  False  False  False
b      True  False  False  False  False  False
c     False  False  False  False  False  False
d     False  False  False  False  False  False
>>> s[s.isin([1,4])]
item    A    B   C   D   E   F
id
a     NaN  1.0 NaN NaN NaN NaN
b     4.0  NaN NaN NaN NaN NaN
c     NaN  NaN NaN NaN NaN NaN
d     NaN  NaN NaN NaN NaN NaN
6. Deleting a column
>>> s
item   A   B   C   D   E   F
id
a      0   1   2   3 NaN NaN
b      4   5   6   7 NaN NaN
c      8   9  10  11 NaN NaN
d     12  13  14  15 NaN NaN
>>> del s['E']
>>> s
item   A   B   C   D   F
id
a      0   1   2   3 NaN
b      4   5   6   7 NaN
c      8   9  10  11 NaN
d     12  13  14  15 NaN
7. Filtering
>>> s
item   A   B   C   D   F
id
a      0   1   2   3 NaN
b      4   5   6   7 NaN
c      8   9  10  11 NaN
d     12  13  14  15 NaN
>>> s[s<3]
item    A    B    C   D   F
id
a     0.0  1.0  2.0 NaN NaN
b     NaN  NaN  NaN NaN NaN
c     NaN  NaN  NaN NaN NaN
d     NaN  NaN  NaN NaN NaN
8. Creating a DataFrame from a nested dict
When a nested dict is passed to the DataFrame constructor, pandas uses the outer keys as column names and the inner keys as index labels. Positions that have no corresponding element are filled with NaN.
>>> import pandas as pd
>>> dic={'red':{2012:22,2013:33},'white':{2011:13,2012:22,2013:16},'blue':{2017:17,2012:23,2018:18}}
>>> dic
{'blue': {2017: 17, 2018: 18, 2012: 23}, 'white': {2011: 13, 2012: 22, 2013: 16}, 'red': {2012: 22, 2013: 33}}
>>> s=pd.DataFrame(dic)
>>> s
      blue   red  white
2011   NaN   NaN   13.0
2012  23.0  22.0   22.0
2013   NaN  33.0   16.0
2017  17.0   NaN    NaN
2018  18.0   NaN    NaN
9. Transposing a DataFrame
>>> s
      blue   red  white
2011   NaN   NaN   13.0
2012  23.0  22.0   22.0
2013   NaN  33.0   16.0
2017  17.0   NaN    NaN
2018  18.0   NaN    NaN
>>> s.T  # the T attribute returns the transpose
       2011  2012  2013  2017  2018
blue    NaN  23.0   NaN  17.0  18.0
red     NaN  22.0  33.0   NaN   NaN
white  13.0  22.0  16.0   NaN   NaN
10. The Index object
In both Series and DataFrame, an Index object cannot be modified once it has been declared.
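A small sketch of what immutability means in practice: assigning to a single position of an Index raises TypeError, whereas replacing the whole index of a Series with a new one is allowed. The data and the mutated flag are illustrative.

```python
import pandas as pd

s = pd.Series([5, 0, 3], index=['red', 'blue', 'yellow'])

try:
    s.index[0] = 'green'  # in-place item assignment on an Index is not supported
    mutated = True
except TypeError:
    mutated = False

# Replacing the index as a whole is a different operation and works.
s.index = ['green', 'blue', 'yellow']
```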
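11. Methods related to the index
Called on a Series, idxmin() and idxmax() return the index label of the smallest and the largest element value, respectively.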
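A minimal sketch of these two methods on a small Series (the data is illustrative):

```python
import pandas as pd

s = pd.Series([5, 0, 3, 8, 4],
              index=['red', 'blue', 'yellow', 'white', 'green'])

label_of_min = s.idxmin()  # label whose value is smallest
label_of_max = s.idxmax()  # label whose value is largest
print(label_of_min, label_of_max)
```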
12. An index with duplicate labels
>>> import pandas as pd
>>> s=pd.Series(range(6),index=['a','a','b','c','c','d'])
>>> s
a    0
a    1
b    2
c    3
c    4
d    5
dtype: int64
>>> s['a']
a    0
a    1
dtype: int64
>>> s.index.is_unique  # checks whether the index is free of duplicate labels
False
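13. Reindexing
The pandas reindex function replaces the index of a Series object: it rearranges the elements of the original Series according to the new sequence of labels and produces a new Series object.
When reindexing, you can reorder the labels in the index sequence, and remove labels or add new ones.
>>> import pandas as pd
>>> s=pd.Series([1,2,3,4],index=['a','b','c','d'])
>>> s
a    1
b    2
c    3
d    4
dtype: int64
>>> s.reindex(['e','f','g','b'])
e    NaN
f    NaN
g    NaN
b    2.0
dtype: float64
However, spelling out the new index label by label as above does not scale well to a large DataFrame; automatic filling or interpolation can be used instead.
For example: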
>>> import pandas as pd
>>> s=pd.Series([1,5,6,3],index=[0,3,5,6])
>>> s
0    1
3    5
5    6
6    3
dtype: int64
>>> s.reindex(range(6),method='ffill')  # redefine the index of s as 0 to 5; ffill gives each new label the value of the nearest smaller existing label
0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64
>>> s=pd.Series([1,5,6,3],index=[0,3,5,6])
>>> s
0    1
3    5
5    6
6    3
dtype: int64
>>> s.reindex(range(6),method='bfill')  # bfill gives each new label the value of the next existing label
0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64
>>> dic={'colors':['blue','green','yellow','red','white'],'price':[1.2,1.0,0.6,0.9,1.7],'object':['ballpand','pen','pencil','paper','mug']}  # define a dict of lists
>>> dic
{'object': ['ballpand', 'pen', 'pencil', 'paper', 'mug'], 'price': [1.2, 1.0, 0.6, 0.9, 1.7], 'colors': ['blue', 'green', 'yellow', 'red', 'white']}
>>> s=pd.DataFrame(dic)  # build the DataFrame s from the dict
>>> s
   colors    object  price
0    blue  ballpand    1.2
1   green       pen    1.0
2  yellow    pencil    0.6
3     red     paper    0.9
4   white       mug    1.7
>>> s.reindex(range(5),method='ffill',columns=['colors','price','new','object'])  # add new as a column label
   colors  price     new    object
0    blue    1.2    blue  ballpand
1   green    1.0   green       pen
2  yellow    0.6  yellow    pencil
3     red    0.9     red     paper
4   white    1.7   white       mug
>>> s=pd.DataFrame(dic,index=[1,2,3,5,7])  # a DataFrame with a custom index
>>> s
   colors    object  price
1    blue  ballpand    1.2
2   green       pen    1.0
3  yellow    pencil    0.6
5     red     paper    0.9
7   white       mug    1.7
>>> s.reindex(range(5),method='ffill')  # redefine the row index
   colors    object  price
0     NaN       NaN    NaN
1    blue  ballpand    1.2
2   green       pen    1.0
3  yellow    pencil    0.6
4  yellow    pencil    0.6
14. Dropping
1) Drop one item from a Series
2) Drop several items from a Series: pass the labels to drop() as a list
3) Drop some rows from a DataFrame
4) Drop columns from a DataFrame: pass axis=1 to indicate columns
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series(np.arange(4),index=['red','blue','yellow','white'])
>>> s
red       0
blue      1
yellow    2
white     3
dtype: int64
>>> s.drop('yellow')  # drop one label and its element from the Series
red      0
blue     1
white    3
dtype: int64
>>> s.drop(['red','white'])  # drop several labels from the Series
blue      1
yellow    2
dtype: int64
>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
>>> frame
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
>>> frame.drop(['blue','yellow'])  # drop several rows from the DataFrame
       ball  pen  pencil  paper
red       0    1       2      3
white    12   13      14     15
>>> frame.drop(['pen','pencil'],axis=1)  # drop several columns from the DataFrame; axis=1 is required
        ball  paper
red        0      3
blue       4      7
yellow     8     11
white     12     15
15. Arithmetic and data alignment
1) Adding two Series objects
>>> import pandas as pd
>>> s1=pd.Series([3,2,5,1],['white','yellow','green','blue'])
>>> s2=pd.Series([1,4,7,2,1],index=['white','yellow','black','blue','brown'])
>>> s1
white     3
yellow    2
green     5
blue      1
dtype: int64
>>> s2
white     1
yellow    4
black     7
blue      2
brown     1
dtype: int64
>>> s1+s2
black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64
2) Adding two DataFrame objects
>>> import numpy as np
>>> frame1=pd.DataFrame(np.arange(16).reshape(4,4),index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
>>> frame2=pd.DataFrame(np.arange(12).reshape(4,3),index=['blue','green','white','yellow'],columns=['mug','pen','ball'])
>>> frame1
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
>>> frame2
        mug  pen  ball
blue      0    1     2
green     3    4     5
white     6    7     8
yellow    9   10    11
>>> frame1+frame2
        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN
The same results can be obtained with the add() method:
1) Adding Series
2) Adding DataFrames
>>> s1.add(s2)
black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64
>>> frame1.add(frame2)
        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN
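One practical reason to prefer the method form is that it accepts extra options. As a sketch that goes slightly beyond the original notes, the fill_value option of add() treats a label missing from one operand as the given value instead of producing NaN:

```python
import pandas as pd

s1 = pd.Series([3, 2, 5, 1], index=['white', 'yellow', 'green', 'blue'])
s2 = pd.Series([1, 4, 7, 2, 1],
               index=['white', 'yellow', 'black', 'blue', 'brown'])

plain = s1 + s2                    # labels present in only one Series give NaN
filled = s1.add(s2, fill_value=0)  # a missing operand is treated as 0
print(filled)
```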
16. Operations between DataFrame and Series
1) The Series index equals the DataFrame column names
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series([1,2,3,4],index=['a','b','c','d'])
>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),columns=['a','b','c','d'])
>>> s
a    1
b    2
c    3
d    4
dtype: int64
>>> frame
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
>>> s+frame  # each column of frame has the value of s with the matching label added to it
    a   b   c   d
0   1   3   5   7
1   5   7   9  11
2   9  11  13  15
3  13  15  17  19
>>> frame-s  # each column of frame has the value of s with the matching label subtracted from it
    a   b   c   d
0  -1  -1  -1  -1
1   3   3   3   3
2   7   7   7   7
3  11  11  11  11
2) The Series index differs from the DataFrame column names
>>> frame2=pd.DataFrame(np.arange(16).reshape(4,4),columns=['b','d','e','c'])
>>> s
a    1
b    2
c    3
d    4
dtype: int64
>>> frame2
    b   d   e   c
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
>>> s+frame2
    a     b     c     d   e
0 NaN   2.0   6.0   5.0 NaN
1 NaN   6.0  10.0   9.0 NaN
2 NaN  10.0  14.0  13.0 NaN
3 NaN  14.0  18.0  17.0 NaN
>>> frame2-s
    a     b     c    d   e
0 NaN  -2.0   0.0 -3.0 NaN
1 NaN   2.0   4.0  1.0 NaN
2 NaN   6.0   8.0  5.0 NaN
3 NaN  10.0  12.0  9.0 NaN
17. Taking the square root of every element of a DataFrame with numpy's sqrt function
>>> frame
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
>>> np.sqrt(frame)
          a         b         c         d
0  0.000000  1.000000  1.414214  1.732051
1  2.000000  2.236068  2.449490  2.645751
2  2.828427  3.000000  3.162278  3.316625
3  3.464102  3.605551  3.741657  3.872983
18. Functions applied by row or by column
1) Apply a user-defined function to each column of a DataFrame
2) Apply a user-defined function to each row of a DataFrame
>>> f=lambda x:x.max()-x.min()
>>> frame.apply(f)  # the function receives each column of the DataFrame
a    12
b    12
c    12
d    12
dtype: int64
>>> frame.apply(f,axis=1)  # axis=1 means f receives each row of the DataFrame
0    3
1    3
2    3
3    3
dtype: int64
3) Use apply to turn one DataFrame into another DataFrame, computing several statistics at once
>>> f=lambda x:pd.Series([x.min(),x.max()],index=['min','max'])  # x is one column of the DataFrame; f returns a Series indexed by min and max that holds the column's minimum and maximum
>>> frame.apply(f)  # f produces one Series per column, and these Series are combined into the resulting DataFrame
      a   b   c   d
min   0   1   2   3
max  12  13  14  15
19. Statistical functions
Most of the statistical functions for arrays also work on a DataFrame.
>>> frame
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
>>> frame.sum()
a    24
b    28
c    32
d    36
dtype: int64
>>> frame.mean()
a    6.0
b    7.0
c    8.0
d    9.0
dtype: float64
>>> frame.describe()
               a          b          c          d
count   4.000000   4.000000   4.000000   4.000000
mean    6.000000   7.000000   8.000000   9.000000
std     5.163978   5.163978   5.163978   5.163978
min     0.000000   1.000000   2.000000   3.000000
25%     3.000000   4.000000   5.000000   6.000000
50%     6.000000   7.000000   8.000000   9.000000
75%     9.000000  10.000000  11.000000  12.000000
max    12.000000  13.000000  14.000000  15.000000
>>> frame.sum(axis=1)  # to apply a statistical function across each row, pass axis=1
0     6
1    22
2    38
3    54
dtype: int64
20. Sorting and ranking
1) Sorting a Series object
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
>>> s
red       5
blue      0
yellow    3
white     8
green     4
dtype: int64
>>> s.sort_index()  # sort by index label in alphabetical order
blue      0
green     4
red       5
white     8
yellow    3
dtype: int64
>>> s.sort_index(ascending=False)  # ascending=False sorts in descending order
yellow    3
white     8
red       5
green     4
blue      0
dtype: int64
2) Sorting a DataFrame object
>>> import numpy as np
>>> import pandas as pd
>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
>>> frame
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
>>> frame.sort_index()  # sorts by row index by default: blue, red, white, yellow
        ball  pen  pencil  paper
blue       4    5       6      7
red        0    1       2      3
white     12   13      14     15
yellow     8    9      10     11
>>> frame.sort_index(axis=1)  # axis=1 sorts by column label (ball, paper, pen, pencil); whole columns change position
        ball  paper  pen  pencil
red        0      3    1       2
blue       4      7    5       6
yellow     8     11    9      10
white     12     15   13      14
21. The examples above sort by index; the following sort by the contents of the object
1) Sort the elements of a Series
s.sort_values()
2) Sort the elements of a DataFrame
>>> frame
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
>>> frame.sort_values(by='pen')
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
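Because the frame above is already ordered by pen, the effect of sorting by value is easier to see on data that is not pre-sorted. The following sketch uses sort_values(), the method that current pandas versions provide for sorting by content; the pen values here are made up for illustration.

```python
import pandas as pd

s = pd.Series([5, 0, 3, 8, 4],
              index=['red', 'blue', 'yellow', 'white', 'green'])
by_value = s.sort_values()  # ascending by element value

frame = pd.DataFrame({'ball': [0, 4, 8, 12], 'pen': [9, 1, 13, 5]},
                     index=['red', 'blue', 'yellow', 'white'])
by_pen = frame.sort_values(by='pen')                        # rows ordered by the pen column
by_pen_desc = frame.sort_values(by='pen', ascending=False)  # same, descending
print(by_pen)
```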
22. Correlation and covariance
1) Correlation and covariance between two Series objects
>>> import numpy as np
>>> import pandas as pd
>>> s1=pd.Series([3,4,3,4,5,4,3,2])
>>> s2=pd.Series([1,2,3,4,4,3,2,1])
>>> s1
0    3
1    4
2    3
3    4
4    5
5    4
6    3
7    2
dtype: int64
>>> s2
0    1
1    2
2    3
3    4
4    4
5    3
6    2
7    1
dtype: int64
>>> s1.corr(s2)  # correlation
0.7745966692414834
>>> s1.cov(s2)  # covariance
0.8571428571428571
2) Correlation and covariance within a single DataFrame
>>> frame=pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
>>> frame
        ball  pen  pencil  paper
red        1    4       3      6
blue       4    5       6      1
yellow     3    3       1      5
white      4    1       6      4
>>> frame.corr()
            ball       pen    pencil     paper
ball    1.000000 -0.276026  0.577350 -0.763763
pen    -0.276026  1.000000 -0.079682 -0.361403
pencil  0.577350 -0.079682  1.000000 -0.692935
paper  -0.763763 -0.361403 -0.692935  1.000000
>>> frame.cov()
            ball       pen    pencil     paper
ball    2.000000 -0.666667  2.000000 -2.333333
pen    -0.666667  2.916667 -0.333333 -1.333333
pencil  2.000000 -0.333333  6.000000 -3.666667
paper  -2.333333 -1.333333 -3.666667  4.666667
3) Pairwise correlation between the rows or columns of a DataFrame and a Series or another DataFrame
>>> s
red       5
blue      0
yellow    3
white     8
green     4
dtype: int64
>>> frame
        ball  pen  pencil  paper
red        1    4       3      6
blue       4    5       6      1
yellow     3    3       1      5
white      4    1       6      4
>>> frame.corrwith(s)
ball     -0.140028
pen      -0.869657
pencil    0.080845
paper     0.595854
dtype: float64
23. Assigning NaN to elements
>>> s=pd.Series([1,2,np.nan,3])
>>> s
0    1.0
1    2.0
2    NaN
3    3.0
dtype: float64
24. Filtering out NaN
>>> s
0    1.0
1    2.0
2    NaN
3    3.0
dtype: float64
>>> s.dropna()  # use the dropna function
0    1.0
1    2.0
3    3.0
dtype: float64
Alternatively, use the notnull method:
>>> s=pd.Series([1,2,np.nan,3])
>>> s
0    1.0
1    2.0
2    NaN
3    3.0
dtype: float64
>>> s[s.notnull()]
0    1.0
1    2.0
3    3.0
dtype: float64
On a DataFrame, dropna() drops an entire row (or column) as soon as it contains a single NaN element:
>>> frame=pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],index=['blue','green','red'],columns=['ball','mug','pen'])
>>> frame
       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0
>>> frame.dropna()
Empty DataFrame
Columns: [ball, mug, pen]
Index: []
To avoid dropping whole rows or columns this way, set the how option to 'all', which tells dropna to drop only the rows or columns whose elements are all NaN:
>>> frame=pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],index=['blue','green','red'],columns=['ball','mug','pen'])
>>> frame
       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0
>>> frame.dropna(how='all')
      ball  mug  pen
blue   6.0  NaN  6.0
red    2.0  NaN  5.0
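The how='all' option applies to columns as well when combined with axis=1. A short sketch on the same frame (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[6, np.nan, 6],
                      [np.nan, np.nan, np.nan],
                      [2, np.nan, 5]],
                     index=['blue', 'green', 'red'],
                     columns=['ball', 'mug', 'pen'])

rows_kept = frame.dropna(how='all')          # drops only the all-NaN row 'green'
cols_kept = frame.dropna(axis=1, how='all')  # drops only the all-NaN column 'mug'
both = rows_kept.dropna(axis=1, how='all')   # apply both in sequence
print(both)
```

Chaining the two calls leaves a 2x2 frame without any NaN, whereas a plain dropna() on this data returns an empty frame.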
25. Filling NaN elements with other values
1) Replace every NaN with the same value, using the fillna function
>>> frame=pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],index=['blue','green','red'],columns=['ball','mug','pen'])
>>> frame
       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0
>>> frame.fillna(0)
       ball  mug  pen
blue    6.0  0.0  6.0
green   0.0  0.0  0.0
red     2.0  0.0  5.0
2) Replace the NaN in different columns with different values: pass a dict that maps each column name to its replacement value
>>> frame.fillna({'ball':1,'mug':2,'pen':8})
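The per-column replacement values have to be passed as a dict. A runnable sketch on the same frame, with the result checked cell by cell:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[6, np.nan, 6],
                      [np.nan, np.nan, np.nan],
                      [2, np.nan, 5]],
                     index=['blue', 'green', 'red'],
                     columns=['ball', 'mug', 'pen'])

# Keys are column names, values are the replacement for NaN in that column.
filled = frame.fillna({'ball': 1, 'mug': 2, 'pen': 8})
print(filled)
```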
26. Hierarchical indexing and levels
1) Create a Series object with a hierarchical index
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series(np.random.rand(8),index=[['a','a','a','b','b','c','c','c'],['up','down','right','up','down','up','down','left']])
>>> s
a  up       0.587733
   down     0.425383
   right    0.356205
b  up       0.251802
   down     0.105830
c  up       0.253041
   down     0.140155
   left     0.425004
dtype: float64
2) Show the index attribute of a Series with a hierarchical index
>>> s.index
MultiIndex(levels=[['a', 'b', 'c'], ['down', 'left', 'right', 'up']],
           labels=[[0, 0, 0, 1, 1, 2, 2, 2], [3, 0, 2, 3, 0, 3, 0, 1]])
3) Select the elements under a first-level label of the hierarchical index
>>> s['a']
up       0.587733
down     0.425383
right    0.356205
dtype: float64
4) Select the elements under a second-level label of the hierarchical index
>>> s[:,'up']  # do not forget the comma
a    0.587733
b    0.251802
c    0.253041
dtype: float64
5) Select one specific element of a Series with a hierarchical index
>>> s['a','up']
0.5877327517004284
6) Convert a Series with a hierarchical index into a DataFrame
>>> s.unstack()
       down      left     right        up
a  0.425383       NaN  0.356205  0.587733
b  0.105830       NaN       NaN  0.251802
c  0.140155  0.425004       NaN  0.253041
7) Convert a DataFrame into a Series with a hierarchical index
>>> frame
       down      left     right        up
a  0.425383       NaN  0.356205  0.587733
b  0.105830       NaN       NaN  0.251802
c  0.140155  0.425004       NaN  0.253041
>>> frame.stack()
a  down     0.425383
   right    0.356205
   up       0.587733
b  down     0.105830
   up       0.251802
c  down     0.140155
   left     0.425004
   up       0.253041
dtype: float64
8) Define a DataFrame whose index and columns are both hierarchical
>>> frame=pd.DataFrame(np.random.randn(16).reshape(4,4),index=[['white','white','red','red'],['up','down','up','down']],columns=[['pen','pen','paper','paper'],[1,2,1,2]])
>>> frame
                 pen               paper
                   1         2         1         2
white up   -0.487631  0.200648  0.344613  0.144835
      down  0.246683 -0.847063 -0.391592 -0.091928
red   up   -0.132962 -1.728167  1.787231  0.374895
      down -1.033622  0.354458  0.007813 -1.203889
27. Reordering and sorting levels
>>> frame
                 pen               paper
                   1         2         1         2
white up   -0.487631  0.200648  0.344613  0.144835
      down  0.246683 -0.847063 -0.391592 -0.091928
red   up   -0.132962 -1.728167  1.787231  0.374895
      down -1.033622  0.354458  0.007813 -1.203889
>>> frame.index.names=['colors','status']
>>> frame.columns.names=['objects','id']
>>> frame
objects             pen               paper
id                    1         2         1         2
colors status
white  up     -0.487631  0.200648  0.344613  0.144835
       down    0.246683 -0.847063 -0.391592 -0.091928
red    up     -0.132962 -1.728167  1.787231  0.374895
       down   -1.033622  0.354458  0.007813 -1.203889
>>> frame.swaplevel('colors','status')  # swap the order of the colors and status levels
objects             pen               paper
id                    1         2         1         2
status colors
up     white  -0.487631  0.200648  0.344613  0.144835
down   white   0.246683 -0.847063 -0.391592 -0.091928
up     red    -0.132962 -1.728167  1.787231  0.374895
down   red    -1.033622  0.354458  0.007813 -1.203889
>>> frame
objects             pen               paper
id                    1         2         1         2
colors status
white  up     -0.487631  0.200648  0.344613  0.144835
       down    0.246683 -0.847063 -0.391592 -0.091928
red    up     -0.132962 -1.728167  1.787231  0.374895
       down   -1.033622  0.354458  0.007813 -1.203889
>>> frame.sort_index(level=0)  # sort the rows alphabetically by the first level (colors); this replaces the deprecated sortlevel()
objects             pen               paper
id                    1         2         1         2
colors status
red    down   -1.033622  0.354458  0.007813 -1.203889
       up     -0.132962 -1.728167  1.787231  0.374895
white  down    0.246683 -0.847063 -0.391592 -0.091928
       up     -0.487631  0.200648  0.344613  0.144835
28. Summary statistics by level
1) To aggregate over a row level, group by the level name and then apply the statistical function (older pandas versions accepted the level name through the level argument of sum(); that argument was removed in pandas 2.0)
>>> frame
objects             pen               paper
id                    1         2         1         2
colors status
white  up     -0.487631  0.200648  0.344613  0.144835
       down    0.246683 -0.847063 -0.391592 -0.091928
red    up     -0.132962 -1.728167  1.787231  0.374895
       down   -1.033622  0.354458  0.007813 -1.203889
>>> frame.groupby(level='colors',sort=False).sum()  # sum over the colors row level
objects       pen               paper
id              1         2         1         2
colors
white   -0.240947 -0.646416 -0.046978  0.052907
red     -1.166584 -1.373709  1.795044 -0.828994
2) To aggregate over a column level, transpose the frame, group by the level, and transpose back
>>> frame
objects             pen               paper
id                    1         2         1         2
colors status
white  up     -0.487631  0.200648  0.344613  0.144835
       down    0.246683 -0.847063 -0.391592 -0.091928
red    up     -0.132962 -1.728167  1.787231  0.374895
       down   -1.033622  0.354458  0.007813 -1.203889
>>> frame.T.groupby(level='id').sum().T  # sum over the id column level
id                    1         2
colors status
white  up     -0.143017  0.345483
       down   -0.144909 -0.938991
red    up      1.654270 -1.353272
       down   -1.025809 -0.849432
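Since pandas 2.0 no longer accepts a level argument in sum(), level-wise statistics can be sketched with groupby(level=...) instead. The example below uses deterministic integers rather than the random values above so that the results can be checked by hand; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_arrays(
    [['white', 'white', 'red', 'red'], ['up', 'down', 'up', 'down']],
    names=['colors', 'status'])
cols = pd.MultiIndex.from_arrays(
    [['pen', 'pen', 'paper', 'paper'], [1, 2, 1, 2]],
    names=['objects', 'id'])
frame = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Row level: sum the 'up' and 'down' rows of each color (sort=False keeps white before red).
by_color = frame.groupby(level='colors', sort=False).sum()

# Column level: transpose, group by the 'id' level, then transpose back.
by_id = frame.T.groupby(level='id').sum().T
print(by_color)
print(by_id)
```

Going through the transpose avoids the axis=1 form of groupby, which recent pandas versions have deprecated.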