I. Introduction to pandas
pandas is a Python data analysis package that was originally developed as a tool for financial data analysis. It provides a large number of functions and methods for processing data quickly and conveniently. Its core features are:
1. Data structures with automatic or explicit alignment of data along an axis
2. Integrated time series functionality
3. Data structures that handle both time series and non-time-series data
4. Arithmetic operations and reductions (such as summing along an axis) that respect the metadata (axis labels)
5. Flexible handling of missing data
6. Merging and other relational operations found in common databases (for example, SQL-based ones)
pandas has three kinds of data objects: Series, DataFrame, and Index objects.
II. Series
A Series is an object similar to a one-dimensional array. It consists of a set of data (of any NumPy data type) and an associated set of data labels (the index).
The string representation of a Series shows the index on the left and the values on the right.
from pandas import Series
print('Create a Series from a list')
obj=Series([4,7,-5,3])  # the index starts at 0 when none is given
print(obj)
print(obj.values)
print(obj.index)
# Output
Create a Series from a list
0 4
1 7
2 -5
3 3
dtype: int64
[ 4 7 -5 3]
RangeIndex(start=0, stop=4, step=1)
print('Specify the index of a Series')
obj2=Series([4,7,-5,3],index=['d','b','a','c'])
print(obj2)
print(obj2.index)
print(obj2['a'])
# Output
Specify the index of a Series
d 4
b 7
a -5
c 3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')
-5
obj2['d']=6
print(obj2[['c','a','d']])
# Output
c 3
a -5
d 6
dtype: int64
print(obj2>0)
# Output
d True
b True
a False
c True
dtype: bool
print(obj2[obj2>0])  # select the elements greater than 0
# Output
d 6
b 7
c 3
dtype: int64
print('b' in obj2)
print('e' in obj2)
# Output
True
False
print('Create a Series from a dict')
sdata={'Ohio':45000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=Series(sdata)
print(obj3)
# Output
Create a Series from a dict
Ohio 45000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
print('Create a Series from a dict with an explicit index; labels without a match become NaN')
states=['California','Ohio','Oregon','Texas']
obj4=Series(sdata,index=states)
print(obj4)
# Output
Create a Series from a dict with an explicit index; labels without a match become NaN
California NaN
Ohio 45000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
print('Add two Series; values with matching index labels are added')
print(obj3+obj4)
# Output
Add two Series; values with matching index labels are added
California NaN
Ohio 90000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
print('Name a Series and its index')
obj4.name='population'  # name the data "population"
obj4.index.name='state'  # name the index "state"
print(obj4)
# Output
Name a Series and its index
state
California NaN
Ohio 45000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
print('Replace the index')
obj.index=['Bob','Steve','Jeff','Ryan']
print(obj)
# Output
Replace the index
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
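III. DataFrame
1. A DataFrame is a tabular data structure. It contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on)
2. A DataFrame has both a row index and a column index. It can be thought of as a dict of Series that all share the same index
3. Data that can be passed to the DataFrame constructor
Type | Description |
---|---|
2D ndarray | A matrix of data; row and column labels can also be passed |
Dict of arrays, lists, or tuples | Each sequence becomes a column of the DataFrame; all sequences must have the same length |
NumPy structured/record array | Treated like the "dict of arrays" case |
Dict of Series | Each Series becomes a column; if no index is passed explicitly, the indexes of the Series are unioned to form the row index of the result |
Dict of dicts | Each inner dict becomes a column; the keys are unioned to form the row index of the result, as in the "dict of Series" case |
List of dicts or Series | Each item becomes a row of the DataFrame; the union of the dict keys or Series indexes becomes the column labels |
List of lists or tuples | Treated like the "2D ndarray" case |
Another DataFrame | The indexes of that DataFrame are reused unless different ones are passed explicitly |
NumPy MaskedArray | Like the "2D ndarray" case, except that masked values become NA/missing values in the resulting DataFrame |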
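The walkthrough that follows uses a dict of lists and a dict of dicts. As a minimal sketch of three other input types from the table, the snippet below builds a DataFrame from a list of dicts, a 2D ndarray, and a dict of Series. The variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A list of dicts: each dict becomes one row, and the union of the keys
# becomes the column labels. Keys missing from a row are filled with NaN.
rows = [{'state': 'Ohio', 'pop': 1.5},
        {'state': 'Nevada', 'pop': 2.4, 'year': 2001}]
from_rows = pd.DataFrame(rows)

# A 2D ndarray: row and column labels are passed explicitly.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'],
                          columns=['a', 'b', 'c'])

# A dict of Series: the Series indexes are unioned into the row index.
s1 = pd.Series([1, 2], index=['x', 'y'])
s2 = pd.Series([3, 4], index=['y', 'z'])
from_series = pd.DataFrame({'s1': s1, 's2': s2})

print(from_rows)
print(from_array)
print(from_series)
```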
import numpy as np
from pandas import Series,DataFrame
print('Create a DataFrame from a dict; the keys become the column names')
data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
      'year':[2000,2001,2002,2001,2002],
      'pop':[1.5,1.7,3.6,2.4,2.9]}
print(DataFrame(data))
print(DataFrame(data,columns=['year','state','pop']))  # specify the column order
# Output
Create a DataFrame from a dict; the keys become the column names
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
print('Specify the index; a column that does not exist in the data is filled with NaN')
frame2=DataFrame(data,columns=['year','state','pop','debt'],
                 index=['one','two','three','four','five'])
print(frame2)
# Output
Specify the index; a column that does not exist in the data is filled with NaN
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
print(frame2['state'])
# Output
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
print(frame2.year)
print(frame2['year'])
# Output
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
print(frame2.loc['three'])  # select a row by its label
# Output
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
frame2['debt']=16.5  # assign a scalar to the whole column
print(frame2)
# Output
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
frame2.debt=np.arange(5)  # assign a NumPy array to the column
print(frame2)
# Output
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
print('Assign a Series: values are aligned on the index, and labels it does not cover become NaN')
val=Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2['debt']=val
print(frame2)
# Output
Assign a Series: values are aligned on the index, and labels it does not cover become NaN
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
print('Assign to a new column')
frame2['eastern']=(frame2.state=='Ohio')  # True where state equals 'Ohio', False otherwise
print(frame2)
print(frame2.columns)
# Output
Assign to a new column
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
Index(['year', 'state', 'pop', 'debt', 'eastern'], dtype='object')
print('Transpose a DataFrame')
pop={'Nevada':{2001:2.4,2002:2.9},'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame3=DataFrame(pop)
print(frame3)
print(frame3.T)
# Output
Transpose a DataFrame
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
2001 2002 2000
Nevada 2.4 2.9 NaN
Ohio 1.7 3.6 1.5
print('Specify the index order, and initialize data from slices')
print(DataFrame(pop,index=[2001,2002,2003]))
# Output
Specify the index order, and initialize data from slices
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
pdata={'Ohio':frame3['Ohio'][:-1],'Nevada':frame3['Nevada'][:2]}
print(DataFrame(pdata))
# Output
Ohio Nevada
2001 1.7 2.4
2002 3.6 2.9
print('Name the index and the columns')
frame3.index.name='year'
frame3.columns.name='state'
print(frame3)
# Output
Name the index and the columns
state Nevada Ohio
year
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
print(frame3.values)
# Output
[[2.4 1.7]
[2.9 3.6]
[nan 1.5]]
print(frame2.values)
# Output
[[2000 'Ohio' 1.5 nan True]
[2001 'Ohio' 1.7 -1.2 True]
[2002 'Ohio' 3.6 nan True]
[2001 'Nevada' 2.4 -1.5 False]
[2002 'Nevada' 2.9 -1.7 False]]
IV. Index objects
1. pandas Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index.
2. Index objects are immutable, so users cannot modify them. Immutability matters because it allows Index objects to be shared safely among multiple data structures.
import numpy as np
import pandas as pd
from pandas import Series,DataFrame,Index
print('Get the index')
obj=Series(range(3),index=['a','b','c'])
index=obj.index
print(index[1:])
try:
    index[1]='d'  # Index objects are read-only
except TypeError as exc:
    print(type(exc))
# Output
Get the index
Index(['b', 'c'], dtype='object')
<class 'TypeError'>
print('Use an Index object')
index=Index(np.arange(3))
obj2=Series([1.5,-2.5,0],index=index)
print(obj2)
print(obj2.index is index)
# Output
Use an Index object
0 1.5
1 -2.5
2 0.0
dtype: float64
True
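print('Check whether a column or an index label exists')
pop={'Nevada':{2001:2.4,2002:2.9},
     'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame3=DataFrame(pop)
print('Ohio' in frame3.columns)
print(2003 in frame3.index)  # the index labels are integers, so test with an integer
# Output
Check whether a column or an index label exists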
True
False
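3. The main Index types in pandas
Type | Description |
---|---|
Index | The most general Index object; stores the axis labels in a NumPy array of Python objects |
Int64Index | Specialized Index for integer values (removed in pandas 2.0, where a plain Index holds integers) |
MultiIndex | "Hierarchical" index object representing multiple levels of indexing on a single axis; can be thought of as an array of tuples |
DatetimeIndex | Stores nanosecond-resolution timestamps |
PeriodIndex | Specialized Index for Period data |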
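To make the table concrete, here is a small sketch that builds one index of each kind that is still available in current pandas and prints its class name. The dates and labels are arbitrary sample values.

```python
import pandas as pd

plain = pd.Index(['a', 'b', 'c'])                           # general-purpose Index
dates = pd.date_range('2020-01-01', periods=3, freq='D')    # DatetimeIndex
periods = pd.period_range('2020-01', periods=3, freq='M')   # PeriodIndex
multi = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])

for idx in (plain, dates, periods, multi):
    print(type(idx).__name__, len(idx))
```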
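V. Basic functionality: reindexing
1. reindex creates a new object that conforms to a new index. Calling reindex on a Series rearranges the data according to the new index; if an index label is not present, a missing value is introduced.
2. For ordered data such as time series, you may want to fill or interpolate values when reindexing. The method option does this.
3. Arguments of the reindex function
Argument | Description |
---|---|
index | New sequence to use as the index. Can be an Index instance or any other sequence-like Python data structure. An Index is used exactly as is, without any copying |
method | Fill method: ffill (forward fill) or bfill (backward fill) |
fill_value | Substitute value to use where reindexing introduces missing data |
limit | Maximum number of consecutive elements to fill when forward or backward filling |
level | Match a simple Index on the given level of a MultiIndex; otherwise select a subset of it |
copy | Defaults to True, which always copies. If False, the data is not copied when the new index equals the old one |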
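The fill_value and method arguments are shown in the walkthrough of this section; limit is not. The sketch below, with made-up sample data, shows how limit caps the number of consecutive labels that a forward fill may cover.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple'], index=[0, 4])

# Forward-fill, but carry each value forward over at most 2 new labels.
limited = colors.reindex(range(7), method='ffill', limit=2)
print(limited)

# fill_value replaces the NaN that reindexing would otherwise introduce.
defaulted = colors.reindex(range(3), fill_value='none')
print(defaulted)
```

With limit=2, labels 1 and 2 receive 'blue', while label 3 stays missing because it is the third consecutive gap.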
import numpy as np
from pandas import DataFrame,Series
print('Specify a new index and order')
obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
print(obj)
# Output
Specify a new index and order
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
obj2=obj.reindex(['a','b','d','c','e'])  # returns a new object; the index of obj itself is unchanged
print(obj2)
# Output
a -5.3
b 7.2
d 4.5
c 3.6
e NaN
dtype: float64
print(obj.reindex(['a','b','d','c','e'],fill_value=0))  # fill labels that do not exist with 0
# Output
a -5.3
b 7.2
d 4.5
c 3.6
e 0.0
dtype: float64
print('Reindex and specify a fill method')
obj3=Series(['blue','purple','yellow'],index=[0,2,4])
print(obj3)
# Output
Reindex and specify a fill method
0 blue
2 purple
4 yellow
dtype: object
print(obj3.reindex(range(6),method='ffill'))  # fill each gap with the previous value
# Output
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
print('Reindex a DataFrame')
frame=DataFrame(np.arange(9).reshape(3,3),
                index=['a','c','d'],
                columns=['Ohio','Texas','California'])
print(frame)
# Output
Reindex a DataFrame
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
frame2=frame.reindex(['a','b','c','d'])
print(frame2)
# Output
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
print('Reindex the columns')
states=['Texas','Utah','California']
print(frame.reindex(columns=states))
# Output
Reindex the columns
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
print('Reindex a DataFrame and fill the missing values')
print(frame.reindex(index=['a','b','c','d'],columns=states).ffill())
print(frame.reindex(index=['a','b','d','c'],columns=states))  # .loc raises KeyError for missing labels in recent pandas
x=frame.reindex(index=['a','b','c','d'],columns=states).ffill()
print(x.drop('Utah',axis=1))
# Output
Reindex a DataFrame and fill the missing values
Texas Utah California
a 1.0 NaN 2.0
b 1.0 NaN 2.0
c 4.0 NaN 5.0
d 7.0 NaN 8.0
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
d 7.0 NaN 8.0
c 4.0 NaN 5.0
Texas California
a 1.0 2.0
b 1.0 2.0
c 4.0 5.0
d 7.0 8.0
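VI. Basic functionality: dropping entries from an axis
Dropping one or more entries from an axis is easy given an index array or list. Because this requires some data munging and set logic, the drop method returns a new object with the specified values removed from the given axis.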
import numpy as np
from pandas import Series,DataFrame
print('Drop entries from a Series by index label')
obj=Series(np.arange(5),index=['a','b','c','d','e'])
new_obj=obj.drop('c')
print(new_obj)
# Output
Drop entries from a Series by index label
a 0
b 1
d 3
e 4
dtype: int64
print(obj.drop(['d','c']))
# Output
a 0
b 1
e 4
dtype: int64
print('Drop from a DataFrame by index label or by column')
data=DataFrame(np.arange(16).reshape((4,4)),
               index=['Ohio','Colorado','Utah','New York'],
               columns=['one','two','three','four'])
print(data)
# Output
Drop from a DataFrame by index label or by column
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
print(data.drop(['Colorado','Ohio']))  # rows are dropped when no axis is given
# Output
one two three four
Utah 8 9 10 11
New York 12 13 14 15
print(data.drop('two',axis=1))
# Output
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
print(data.drop(['two','four'],axis=1))  # the original data is not modified
# Output
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
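VII. Basic functionality: indexing, selection, and filtering
1. Series indexing (obj[...]) works like NumPy array indexing, except that the index values of a Series need not be integers
2. Slicing with labels differs from normal Python slicing in that the endpoint is inclusive
3. Indexing into a DataFrame retrieves one or more columns
4. For indexing on the rows of a DataFrame, use the loc and iloc indexers. The older ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0
Note:
loc: selects rows by label
iloc: selects rows by integer position
ix: selected rows by label or by integer position (a mix of loc and iloc); no longer available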
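The difference between loc and iloc is easiest to see when the labels are themselves integers. The following sketch uses a small made-up Series whose labels are deliberately out of positional order.

```python
import pandas as pd

s = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])

by_label = s.loc[0]       # the row whose label is 0
by_position = s.iloc[0]   # the first row, whatever its label is

# Label slices include the end point; position slices exclude it.
label_slice = s.loc[2:0]       # labels 2 through 0, both included
position_slice = s.iloc[0:1]   # only the first row

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```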
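import numpy as np
from pandas import Series,DataFrame
print('Series indexing by label, by position, and by boolean mask')
obj=Series(np.arange(4),index=['a','b','c','d'])
print(obj['b'])
print(obj.iloc[3])  # integer position; plain obj[3] on a labeled index is deprecated in recent pandas
print(obj.iloc[[3]])  # a list of positions returns a Series
print(obj.iloc[[1,3]])  # fancy indexing by position
print(obj[obj<2])
# Output
Series indexing by label, by position, and by boolean mask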
1
3
d 3
dtype: int64
b 1
d 3
dtype: int64
a 0
b 1
dtype: int64
print('Slicing a Series')
print(obj['b':'c'])  # label slices include both endpoints
obj['b':'c']=5
print(obj)
# Output
Slicing a Series
b 1
c 2
dtype: int64
a 0
b 5
c 5
d 3
dtype: int64
print('DataFrame indexing')
data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],
               columns=['one','two','three','four'])
print(data)
print(data['two'])  # select one column
print(data[['three','one']])
print(data[:2])  # first two rows
print(data.loc['Colorado',['two','three']])  # a row label and a list of column labels
print(data.iloc[[1,2],[3,0,1]])  # rows and columns by position (ix is no longer available)
print(data.iloc[2])  # the row at position 2 (counting from 0)
print(data.loc[:'Utah','two'])  # rows up to and including Utah, column 'two'
# Output
DataFrame indexing
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int64
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
two 5
three 6
Name: Colorado, dtype: int64
four one two
Colorado 7 4 5
Utah 11 8 9
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
Ohio 1
Colorado 5
Utah 9
Name: two, dtype: int64
print('Select by condition')
print(data[data.three>5])
print(data<5)  # a boolean DataFrame
data[data<5]=0
print(data)
# Output
Select by condition
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
VIII. Basic functionality: arithmetic and data alignment
1. Arithmetic between objects with different indexes
2. Automatic data alignment introduces NA values at labels that do not overlap, and missing values propagate through arithmetic
3. For DataFrames, alignment happens on both the rows and the columns
4. The fill_value argument
5. Operations between a DataFrame and a Series
import numpy as np
from pandas import Series,DataFrame
print('Addition')
s1=Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])
s2=Series([-2.1,3.6,-1.5,4,3.1],index=['a','c','e','f','g'])
print(s1)
print(s2)
print(s1+s2)
# Output
Addition
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
print('DataFrame addition: values are added only where both the row and the column labels match')
df1=DataFrame(np.arange(9).reshape((3,3)),
              columns=list('bcd'),
              index=['Ohio','Texas','Colorado'])
df2=DataFrame(np.arange(12).reshape((4,3)),
              columns=list('bde'),
              index=['Utah','Ohio','Texas','Oregon'])
print(df1)
print(df2)
print(df1+df2)
# Output
DataFrame addition: values are added only where both the row and the column labels match
b c d
Ohio 0 1 2
Texas 3 4 5
Colorado 6 7 8
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
print('Filling values')
df1=DataFrame(np.arange(12).reshape((3,4)),columns=list('abcd'))
df2=DataFrame(np.arange(20).reshape((4,5)),columns=list('abcde'))
print(df1)
print(df2)
print(df1.add(df2,fill_value=0))
print(df1.reindex(columns=df2.columns,fill_value=0))
# Output
Filling values
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
a b c d e
0 0 1 2 3 0
1 4 5 6 7 0
2 8 9 10 11 0
print('Operations between a DataFrame and a Series')
arr=np.arange(12).reshape((3,4))
print(arr)
print(arr[0])
print(arr-arr[0])
frame=DataFrame(np.arange(12).reshape((4,3)),
                columns=list('bde'),
                index=['Utah','Ohio','Texas','Oregon'])
series=frame.iloc[0]
print(frame)
print(series)
print(frame-series)
series2=Series(range(3),index=list('bef'))  # range(3) yields 0, 1, 2; it is an iterable object rather than a list
print(frame+series2)
series3=frame['d']
print(frame.sub(series3,axis=0))  # match on the row index and subtract from every column
# Output
Operations between a DataFrame and a Series
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[0 1 2 3]
[[0 0 0 0]
[4 4 4 4]
[8 8 8 8]]
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
b 0
d 1
e 2
Name: Utah, dtype: int64
b d e
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
b d e
Utah -1 0 1
Ohio -1 0 1
Texas -1 0 1
Oregon -1 0 1
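IX. Basic functionality: function application and mapping
1. NumPy ufuncs (element-wise array methods)
2. The apply method of DataFrame
3. The applymap method of DataFrame, so named because Series already has a map method for element-wise operations. applymap is deprecated since pandas 2.1 in favor of DataFrame.map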
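Because the element-wise method has a different name across pandas versions, the sketch below picks whichever one is available and also contrasts it with apply along each axis. The helper name fmt and the sample values are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.234, -2.5], 'd': [0.5, 3.14159]},
                     index=['Utah', 'Ohio'])

def fmt(x):
    """Format one value with two decimal places."""
    return '%.2f' % x

# Element-wise: DataFrame.map exists from pandas 2.1 on; older versions use applymap.
if hasattr(pd.DataFrame, 'map'):
    formatted = frame.map(fmt)
else:
    formatted = frame.applymap(fmt)

# apply works on whole columns (default) or whole rows (axis=1).
col_range = frame.apply(lambda col: col.max() - col.min())
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(formatted)
print(col_range)
print(row_range)
```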
import numpy as np
from pandas import Series,DataFrame
print('Functions')
frame=DataFrame(np.random.randn(4,3),
                columns=list('bde'),
                index=['Utah','Ohio','Texas','Oregon'])
print(frame)
print(np.abs(frame))
# Output
Functions
b d e
Utah 0.446524 -0.045516 -0.745873
Ohio 0.425788 0.198789 -0.326030
Texas -1.374720 0.941039 1.248235
Oregon 0.732467 -0.823564 -1.864570
b d e
Utah 0.446524 0.045516 0.745873
Ohio 0.425788 0.198789 0.326030
Texas 1.374720 0.941039 1.248235
Oregon 0.732467 0.823564 1.864570
print('lambda and apply')
f=lambda x:x.max()-x.min()
print(frame.apply(f))  # column-wise: the maximum minus the minimum of each column
print(frame.apply(f,axis=1))  # row-wise
def f(x):  # a named function instead of a lambda
    return Series([x.min(),x.max()],index=['min','max'])
print(frame.apply(f))
# Output
lambda and apply
b 2.107187
d 1.764603
e 3.112805
dtype: float64
Utah 1.192397
Ohio 0.751818
Texas 2.622955
Oregon 2.597037
dtype: float64
b d e
min -1.374720 -0.823564 -1.864570
max 0.732467 0.941039 1.248235
print('applymap and map')
_format=lambda x:'%.2f' % x  # format x with two decimal places; the % operator takes the format string on the left and the arguments on the right
print(frame.applymap(_format))  # applied to every element
print(frame['e'].map(_format))  # a single column is a Series, so use map
# applymap applies a function to every element of a DataFrame
# map applies a function to every element of a Series
# Output
applymap and map
b d e
Utah 0.45 -0.05 -0.75
Ohio 0.43 0.20 -0.33
Texas -1.37 0.94 1.25
Oregon 0.73 -0.82 -1.86
Utah -0.75
Ohio -0.33
Texas 1.25
Oregon -1.86
Name: e, dtype: object
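X. Basic functionality: sorting and ranking
1. Sort by the row or column index
2. For a DataFrame, sort by the index on either axis
3. Ascending or descending order can be specified
4. Sort by value
5. For a DataFrame, the columns to sort by can be specified
6. The rank function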
import numpy as np
from pandas import Series,DataFrame
print('Sort by index; for a DataFrame the axis can be specified')
obj=Series(range(4),index=['d','a','b','c'])
print(obj.sort_index())
# Output
Sort by index; for a DataFrame the axis can be specified
a 1
b 2
c 3
d 0
dtype: int64
frame=DataFrame(np.arange(8).reshape((2,4)),
                index=['three','one'],
                columns=list('dabc'))
print(frame.sort_index())
print(frame.sort_index(axis=1))
print(frame.sort_index(axis=1,ascending=False))  # descending order
# Output
d a b c
one 4 5 6 7
three 0 1 2 3
a b c d
three 1 2 3 0
one 5 6 7 4
d c b a
three 0 3 2 1
one 4 7 6 5
print('Sort by value')
obj=Series([4,7,-3,2])
print(obj.sort_values())  # the older order method is obsolete
# Output
Sort by value
2 -3
3 2
0 4
1 7
dtype: int64
print('Sort a DataFrame by column')
frame=DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
print(frame)
print(frame.sort_values(by='b'))  # sort_index(by=...) is obsolete
print(frame.sort_values(by=['a','b']))  # sort by column a first, then by column b
# Output
Sort a DataFrame by column
b a
0 4 0
1 7 1
2 -3 0
3 2 1
b a
2 -3 0
3 2 1
0 4 0
1 7 1
b a
2 -3 0
0 4 0
3 2 1
1 7 1
print('rank: average rank position (starting from 1)')
obj=Series([7,-5,7,4,2,0,4])
# sorted positions: -5(1), 0(2), 2(3), 4(4), 4(5), 7(6), 7(7)
print(obj.rank())
# Output
rank: average rank position (starting from 1)
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
print(obj.rank(method='first'))  # ties are ranked in order of appearance instead of being averaged
# Output
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
print(obj.rank(ascending=False,method='max'))  # descending rank; tied values all receive the largest rank in their group
# sorted positions: 7(1), 7(2), 4(3), 4(4), 2(5), 0(6), -5(7)
# Output
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
frame=DataFrame({'b':[4.3,7,-3,2],
                 'a':[0,1,0,1],
                 'c':[-2,5,8,-2.5]})
print(frame)
print(frame.rank(axis=1))  # rank across each row; the first row [4.3, 0, -2.0] sorts as -2.0(1), 0(2), 4.3(3), giving ranks [3.0, 2.0, 1.0]
# Output
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0
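The tie-breaking methods are easier to compare side by side. This sketch reuses the same values as the rank example above and collects the result of each method in one table; 'min' and 'dense' are additional rank methods not shown in the walkthrough.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method.
ranks = pd.DataFrame({
    'value': obj,
    'average': obj.rank(),               # ties share the mean of their positions
    'min': obj.rank(method='min'),       # ties share the smallest position
    'max': obj.rank(method='max'),       # ties share the largest position
    'first': obj.rank(method='first'),   # ties ordered by appearance
    'dense': obj.rank(method='dense'),   # like 'min', but ranks have no gaps
})
print(ranks)
```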
XI. Basic functionality: indexes with duplicate values
Indexing a label that appears more than once returns a Series, while indexing a label that appears only once returns a scalar.
import numpy as np
from pandas import Series,DataFrame
print('Duplicate index labels')
obj=Series(range(5),index=['a','a','b','b','c'])
print(obj.index.is_unique)  # check whether the index labels are unique
print(type(obj['a']))
print(obj['a'].iloc[0],obj['a'].iloc[1])
# Output
Duplicate index labels
False
<class 'pandas.core.series.Series'>
0 1
df=DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
print(df)
print(df.loc['b'].iloc[0])
print(df.loc['b'].iloc[1])
# Output
0 1 2
a 0.155635 -0.099546 0.112265
a -0.918338 0.707659 0.263030
b -1.075503 0.902052 0.254616
b -0.245483 -0.058749 1.182611
0 -1.075503
1 0.902052
2 0.254616
Name: b, dtype: float64
0 -0.245483
1 -0.058749
2 1.182611
Name: b, dtype: float64
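XII. Summarizing and computing descriptive statistics
1. Common options of the reduction methods
Option | Description |
---|---|
axis | Axis to reduce over: 0 for the rows of a DataFrame, 1 for the columns |
skipna | Exclude missing values; defaults to True |
level | If the axis is hierarchically indexed (a MultiIndex), reduce grouped by level. Removed in pandas 2.0; use groupby(level=...) instead |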
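2. Common descriptive and summary statistics
Method | Description |
---|---|
count | Number of non-NA values |
describe | Compute summary statistics for a Series or for each DataFrame column |
min, max | Compute the minimum and maximum values |
argmin, argmax | Compute the index positions (integers) at which the minimum and maximum are attained |
idxmin, idxmax | Compute the index labels at which the minimum and maximum are attained |
sum | Sum of values |
mean | Mean of values |
median | Arithmetic median of values |
mad | Mean absolute deviation from the mean (removed in pandas 2.0) |
var | Sample variance of values |
std | Sample standard deviation of values |
skew | Sample skewness (third moment) of values |
kurt | Sample kurtosis (fourth moment) of values |
cumsum | Cumulative sum of values |
cummin, cummax | Cumulative minimum and cumulative maximum of values |
cumprod | Cumulative product of values |
diff | Compute the first-order difference |
pct_change | Compute percent changes |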
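Several of the methods in the table do not appear in the examples of this section. As a small sketch on made-up numbers, the snippet below shows idxmin/idxmax, cumsum, diff, and pct_change on one Series.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 9.0, 18.0], index=['a', 'b', 'c', 'd'])

lowest, highest = s.idxmin(), s.idxmax()   # index labels, not positions
running_total = s.cumsum()                 # 10, 22, 31, 49
step = s.diff()                            # change from the previous value
relative = s.pct_change()                  # fractional change from the previous value

print(lowest, highest)
print(running_total.tolist())
print(step.tolist())
print(relative.tolist())
```

The first element of diff and pct_change is NaN because there is no previous value to compare against.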
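3. Unique values and membership
Method | Description |
---|---|
isin | Compute a boolean array indicating whether each Series value is contained in the passed sequence of values |
unique | Compute the array of unique values in a Series, returned in the order observed |
value_counts | Return a Series with the unique values as its index and their frequencies as its values, sorted by count in descending order |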
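None of these three methods is demonstrated elsewhere in this post, so here is a minimal sketch on a made-up Series of letters.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()        # unique values in order of first appearance
counts = obj.value_counts()   # frequency of each value, most frequent first
mask = obj.isin(['b', 'c'])   # True where the value is 'b' or 'c'

print(uniques)
print(counts)
print(obj[mask])
```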
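4. Handling missing values
Missing values have already appeared in the examples above; a later post will cover how to handle them in detail.
5. NA handling methods
Method | Description |
---|---|
dropna | Filter axis labels based on whether the values for each label have missing data, with a threshold to adjust how much missing data to tolerate |
fillna | Fill in missing data with a given value or with a fill method (such as ffill or bfill) |
isnull | Return an object of boolean values indicating which values are missing |
notnull | The negation of isnull |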
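As a brief illustration of the table, the sketch below applies the NA handling methods to a small made-up DataFrame. It uses the ffill method for forward filling, which is assumed here as the current spelling of fillna with a fill method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])

any_na_dropped = df.dropna()            # keep rows without any NaN
all_na_dropped = df.dropna(how='all')   # drop only rows that are entirely NaN
at_least_two = df.dropna(thresh=2)      # keep rows with at least 2 non-NaN values

filled_zero = df.fillna(0)              # replace every NaN with 0
filled_forward = df.ffill()             # carry the last valid value downward

missing_count = int(df.isnull().sum().sum())
print(any_na_dropped, all_na_dropped, at_least_two, sep='\n')
print(filled_zero, filled_forward, sep='\n')
print(missing_count)
```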
import numpy as np
from pandas import Series,DataFrame
print('Sum')
df=DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],
             index=['a','b','c','d'],
             columns=['one','two'])
print(df)
print(df.sum())  # sum of each column
print(df.sum(axis=1))  # sum of each row
# Output
Sum
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
one 9.25
two -5.80
dtype: float64
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
print('Mean')
print(df.mean(axis=1,skipna=False))  # row-wise, without skipping NaN
print(df.mean(axis=1))
# Output
Mean
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
a 1.400
b 1.300
c NaN
d -0.275
dtype: float64
print('Other')
print(df.idxmax())
print(df.cumsum())  # cumulative sum down each column
print(df.describe())
obj=Series(['a','a','b','c']*4)
print(obj.describe())
# Output
Other
one b
two d
dtype: object
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
count 16
unique 3
top a
freq 8
dtype: object
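XIII. Hierarchical indexing
1. Hierarchical indexing lets you have multiple (two or more) index levels on an axis. Put abstractly, it lets you work with higher-dimensional data in a lower-dimensional form
2. Reshape a DataFrame with stack and unstack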
import numpy as np
from pandas import Series,DataFrame,MultiIndex
print('Hierarchical index on a Series')
data=Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],
                                       [1,2,3,1,2,3,1,2,2,3]])
print(data)
# Output
Hierarchical index on a Series
a 1 1.726844
2 0.513376
3 -1.901820
b 1 1.433962
2 1.784395
3 -0.109921
c 1 -0.235029
2 -0.475485
d 2 -0.528125
3 -0.076163
dtype: float64
print(data.index)
# Output
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 2),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
print(data.b)
# Output
1 1.433962
2 1.784395
3 -0.109921
dtype: float64
print(data['b':'c'])
# Output
b 1 1.433962
2 1.784395
3 -0.109921
c 1 -0.235029
2 -0.475485
dtype: float64
print(data[:2])
# Output
a 1 1.726844
2 0.513376
dtype: float64
print(data.unstack())
# Output
1 2 3
a 1.726844 0.513376 -1.901820
b 1.433962 1.784395 -0.109921
c -0.235029 -0.475485 NaN
d NaN -0.528125 -0.076163
print(data.unstack().stack())
# Output
a 1 1.726844
2 0.513376
3 -1.901820
b 1 1.433962
2 1.784395
3 -0.109921
c 1 -0.235029
2 -0.475485
d 2 -0.528125
3 -0.076163
dtype: float64
print('Hierarchical index on a DataFrame')
frame=DataFrame(np.arange(12).reshape((4,3)),
                index=[['a','a','b','b'],[1,2,1,2]],
                columns=[['Ohio','Ohio','Colorado'],
                         ['Green','Red','Green']])
print(frame)
# Output
Hierarchical index on a DataFrame
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
print(frame)
# Output
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
print(frame.index)
# Output
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
names=['key1', 'key2'])
print(frame.columns)
# Output
MultiIndex([( 'Ohio', 'Green'),
( 'Ohio', 'Red'),
('Colorado', 'Green')],
names=['state', 'color'])
print(frame.loc['a',1])
# Output
state color
Ohio Green 0
Red 1
Colorado Green 2
Name: (a, 1), dtype: int64
print(frame.loc['a',2]['Colorado'])
# Output
color
Green 5
Name: (a, 2), dtype: int64
print(frame.loc['a',2]['Ohio']['Red'])
# Output
4
3. Swapping index levels
import numpy as np
from pandas import Series,DataFrame
print('Swap index levels')
frame=DataFrame(np.arange(12).reshape((4,3)),
                index=[['a','a','b','b'],[1,2,1,2]],
                columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
frame.index.names=['key1','key2']
frame_swapped=frame.swaplevel('key1','key2')
print(frame_swapped)
print(frame_swapped.swaplevel(0,1))
# Output
Swap index levels
Ohio Colorado
Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
Ohio Colorado
Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
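swaplevel only exchanges the levels and leaves the rows in their original order. If the swapped index should also be sorted, sort_index with a level can follow. The sketch below assumes a smaller frame with plain columns x and y, chosen only to keep the output short.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((4, 2)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=['x', 'y'])
frame.index.names = ['key1', 'key2']

swapped = frame.swaplevel('key1', 'key2')          # levels exchanged, row order unchanged
sorted_swapped = swapped.sort_index(level='key2')  # rows sorted by key2, then key1

print(swapped)
print(sorted_swapped)
```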
4. Summary statistics by level
import numpy as np
from pandas import DataFrame
print('Compute statistics grouped by an index level')
frame=DataFrame(np.arange(12).reshape((4,3)),
                index=[['a','a','b','b'],[1,2,1,2]],
                columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
frame.index.names=['key1','key2']
print(frame)
print(frame.groupby(level='key2').sum())  # sum(level=...) was removed in pandas 2.0
print(frame.groupby(level='key1').sum())
# Output
Compute statistics grouped by an index level
Ohio Colorado
Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
Ohio Colorado
Green Red Green
key2
1 6 8 10
2 12 14 16
Ohio Colorado
Green Red Green
key1
a 3 5 7
b 15 17 19
print('Build a hierarchical index from columns')
frame=DataFrame({'a':range(7),
                 'b':range(7,0,-1),
                 'c':['one','one','one','two','two','two','two'],
                 'd':[0,1,2,0,1,2,3]})
print(frame)
print(frame.set_index(['c','d']))  # move columns c and d into the index
print(frame.set_index(['c','d'],drop=False))  # keep the columns as well
frame2=frame.set_index(['c','d'])
print(frame2.reset_index())
# Output
Build a hierarchical index from columns
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
5. Integer indexes
import numpy as np
from pandas import Series,DataFrame
print('Integer indexes')
ser=Series(np.arange(3))
print(ser)
try:
    print(ser[-1])  # ambiguous: with an integer index, -1 is looked up as a label
except KeyError as exc:
    print(type(exc))
ser2=Series(np.arange(3),index=['a','b','c'])
print(ser2.iloc[-1])  # position-based; plain ser2[-1] is deprecated in recent pandas
ser3=Series(range(3),index=[-5,1,3])
print(ser3.iloc[2])  # avoids the ambiguity of a plain [2]
print('Integer indexing on a DataFrame')
frame=DataFrame(np.arange(6).reshape((3,2)),index=[2,0,1])
print(frame)
print(frame.iloc[0])
print(frame.iloc[:,1])
# Output
Integer indexes
0 0
1 1
2 2
dtype: int64
<class 'KeyError'>
2
2
Integer indexing on a DataFrame
0 1
2 0 1
0 2 3
1 4 5
0 0
1 1
Name: 2, dtype: int64
2 1
0 3
1 5
Name: 1, dtype: int64
That wraps up this tour of the pandas library. In the next post, we will apply what we have learned here to a hands-on analysis of stock data.