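This post collects commonly used pandas operations. Why write it? I am taking notes while reading the book "Python for Data Analysis" (《利用Python進行數據分析》).
1. Reindexing (reindex)
Reindexing rebuilds the index, and while doing so we can perform other operations at the same time.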
DataFrame.reindex(index=None, columns=None, **kwargs)
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
Series.reindex(index=None, **kwargs)
Conform Series to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
A small example
import numpy as np
import pandas as pd
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
Out[156]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
# After reindex, labels that have no value are filled with NaN by default
obj.reindex(['a','b','c','d','e'])
Out[157]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
# fill_value: a commonly used parameter that sets the value to fill in where there is no data
obj.reindex(['a','b','c','d','e'] , fill_value=9.9)
Out[159]:
a -5.3
b 7.2
c 3.6
d 4.5
e 9.9
dtype: float64
# method: a commonly used parameter that sets how gaps are filled when the index is increasing or decreasing
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
Out[165]:
0 blue
2 purple
4 yellow
dtype: object
obj3.reindex(range(6))
Out[170]:
0 blue
1 NaN
2 purple
3 NaN
4 yellow
5 NaN
dtype: object
# ffill: forward fill
obj3.reindex(range(6),method='ffill')
Out[167]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
# bfill: backward fill
obj3.reindex(range(6),method='bfill')
Out[171]:
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
For a DataFrame, usage is much the same.
2. Dropping entries from an axis
This is mainly about using the drop method.
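Since the notes give no DataFrame example for reindex, here is a minimal sketch. The frame and its labels are made up for illustration; the only calls used are DataFrame.reindex with its index, columns and fill_value arguments.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the original notes)
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex the rows: the new label 'b' has no data, so its row is NaN
by_rows = frame.reindex(['a', 'b', 'c', 'd'])

# Reindex the columns with the columns keyword: 'Utah' is new and is all NaN
by_cols = frame.reindex(columns=['Texas', 'Utah', 'California'])

# fill_value behaves the same way as it does for a Series
filled = frame.reindex(['a', 'b', 'c', 'd'], fill_value=0)

print(by_rows)
print(by_cols)
print(filled)
```

In every case a new object is returned and frame itself is left unchanged.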
DataFrame.drop(labels, axis=0, level=None, inplace=False, errors='raise')
Return new object with labels in requested axis removed.
A small example
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
Out[174]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
obj.drop('c')
Out[175]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj.drop(['b','d'])
Out[176]:
a 0.0
c 2.0
e 4.0
dtype: float64
#DataFrame
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
Out[178]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
# By default, labels are dropped from the rows (axis=0)
data.drop(['Ohio','Utah'])
Out[179]:
one two three four
Colorado 4 5 6 7
New York 12 13 14 15
# We can pass axis to drop from the columns instead
data.drop(['two','four'],axis=1)
Out[180]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
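The drop signature above also lists inplace and errors. The sketch below reuses the same data frame to show that drop returns a new object by default, and how errors='ignore' changes what happens with a label that does not exist. The label 'Texas' is a deliberately missing label chosen for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# drop returns a new object; data itself still has all four rows
smaller = data.drop('Ohio')

# A label that is not present raises KeyError with the default errors='raise'
try:
    data.drop('Texas')
    raised = False
except KeyError:
    raised = True

# errors='ignore' skips missing labels and drops only the ones that exist
ignored = data.drop(['Ohio', 'Texas'], errors='ignore')

print(smaller)
print(raised)
print(ignored)
```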
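3. Arithmetic and data alignment
The term "data alignment" seems to come up in both numpy and pandas. When two objects are combined in an arithmetic operation, the result's index is the union of the two indexes, and during this automatic alignment the non-overlapping positions are filled with NaN.
A small example first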
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
# Where the indexes do not overlap, the result is NaN
s1+s2
Out[188]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
# With the built-in add method we can supply a fill value; the idea is the same as fill_value in reindex above
#Series.add(other, level=None, fill_value=None, axis=0)
s1.add(s2,fill_value=0)
Out[189]:
a 5.2
c 1.1
d 3.4
e 0.0
f 4.0
g 3.1
dtype: float64
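The same alignment applies to two DataFrames, on both rows and columns. A minimal sketch with two made-up frames (df1 and df2 are illustrative names, not from the original notes):

```python
import numpy as np
import pandas as pd

# Hypothetical frames whose rows and columns only partly overlap
df1 = pd.DataFrame(np.arange(4.).reshape((2, 2)),
                   index=['x', 'y'], columns=['a', 'b'])
df2 = pd.DataFrame(np.arange(4.).reshape((2, 2)),
                   index=['y', 'z'], columns=['b', 'c'])

# Plain +: only the cell present in both frames, ('y', 'b'), gets a value
plain = df1 + df2

# add with fill_value=0: a cell missing from one frame is treated as 0,
# but a cell missing from both frames is still NaN
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

Note that fill_value only substitutes for a value missing on one side; a position absent from both inputs stays NaN.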
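4. Operations between DataFrame and Series
This relies on the idea of broadcasting, which describes how arithmetic is carried out between arrays of different shapes. It is a powerful feature; for now we only take a brief look.
A small example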
arr = np.arange(12.).reshape((3, 4))
arr
Out[191]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
Out[192]: array([ 0., 1., 2., 3.])
# A 3x4 array minus a length-4 array (one row of 4 columns): this is broadcasting
arr - arr[0]
Out[193]:
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
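The subtraction above broadcasts one row down the rows. As a small sketch of the other direction, subtracting a column requires a (3, 1) shape, because numpy lines shapes up from the trailing dimension. The variable names here are illustrative.

```python
import numpy as np

arr = np.arange(12.).reshape((3, 4))

# Shape (3, 4) minus shape (4,): the row is repeated down the 3 rows
row_sub = arr - arr[0]

# arr[:, 0] has shape (3,), which does not line up with the trailing
# dimension of size 4, so numpy refuses to broadcast it
try:
    arr - arr[:, 0]
    failed = False
except ValueError:
    failed = True

# Keeping the column two-dimensional, shape (3, 1), makes it broadcast
# across the 4 columns
first_col = arr[:, [0]]
col_sub = arr - first_col

print(row_sub)
print(failed)
print(col_sub)
```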
Operations between a DataFrame and a Series work the same way.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
Out[195]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
s = frame.iloc[0]
s
Out[197]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame - s
Out[198]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
s = pd.Series(range(3),index=list('abc'))
frame
Out[223]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
s
Out[224]:
a 0
b 1
c 2
dtype: int32
frame.add(s)
Out[225]:
a b c d e
Utah NaN 1.0 NaN NaN NaN
Ohio NaN 4.0 NaN NaN NaN
Texas NaN 7.0 NaN NaN NaN
Oregon NaN 10.0 NaN NaN NaN
# We can use axis to control the direction in which the Series is broadcast
frame.add(s,axis=0)
Out[227]:
b d e
Ohio NaN NaN NaN
Oregon NaN NaN NaN
Texas NaN NaN NaN
Utah NaN NaN NaN
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
Here fill_value cannot be used to fill in a default value. I do not know why yet; it always raises an error saying it is not supported.
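The all-NaN result of frame.add(s, axis=0) above comes from s being indexed by 'a', 'b', 'c', none of which is a row label of frame. The sketch below shows axis=0 with a Series whose index does match the row labels, followed by one possible workaround for the fill_value limitation: reindexing the Series to the frame's columns first. The workaround is an assumption on my part, not something from the book, and unlike a real fill_value it does not keep the extra 'a' and 'c' columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# A Series indexed by the row labels: one column of the frame
col_d = frame['d']

# axis=0 matches on the row index and broadcasts across the columns
by_rows = frame.sub(col_d, axis=0)

# Possible workaround for fill_value: align the Series to the columns first,
# using 0 for the columns it does not have, then add as usual
s = pd.Series(range(3), index=list('abc'))
s_aligned = s.reindex(frame.columns, fill_value=0)
added = frame.add(s_aligned)

print(by_rows)
print(added)
```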
5. Function application and mapping
This section mainly covers one DataFrame method, apply, which calls the function you pass in on each column of the DataFrame (or on each row when axis=1).
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Applies function along input axis of DataFrame.
A small example
f = lambda x: x+10
# Every cell gets 10 added: f receives each column as a Series, and x+10 is applied to the whole column
frame.apply(f)
Out[230]:
b d e
Utah 10.0 11.0 12.0
Ohio 13.0 14.0 15.0
Texas 16.0 17.0 18.0
Oregon 19.0 20.0 21.0
f = lambda x: x.max() - x.min()
frame.apply(f)
Out[232]:
b 9.0
d 9.0
e 9.0
dtype: float64
# We can specify the axis along which the function is applied
frame.apply(f,axis=1)
Out[233]:
Utah 2.0
Ohio 2.0
Texas 2.0
Oregon 2.0
dtype: float64
There is also an applymap function (in pandas 2.1 and later it is deprecated in favor of DataFrame.map)
DataFrame.applymap(func)
Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame
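Note the difference between these two functions.
My current understanding is that applymap works element by element, while apply operates along an axis (this does not feel entirely clear yet; I will add more once I understand it better)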
f = lambda x: '${:,.3f}'.format(x)
frame
Out[237]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
# We used this earlier to format values
frame.applymap(f)
Out[238]:
b d e
Utah $0.000 $1.000 $2.000
Ohio $3.000 $4.000 $5.000
Texas $6.000 $7.000 $8.000
Oregon $9.000 $10.000 $11.000
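To make the apply versus applymap distinction concrete, the sketch below inspects what each one passes to the function. It is only an illustration: the hasattr check is my own way of picking DataFrame.map on newer pandas and applymap on older versions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# apply: the function is called once per column and receives a whole Series
got_series = frame.apply(lambda x: isinstance(x, pd.Series))

# Element-wise version: DataFrame.map on newer pandas, applymap on older ones
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap

# Here the function is called once per cell and receives a single float
got_scalar = elementwise(lambda x: isinstance(x, float))

# A Series has its own element-wise method, map
formatted = frame['e'].map('{:.2f}'.format)

print(got_series)
print(got_scalar)
print(formatted)
```

This is why x.max() - x.min() works with apply but would fail element-wise: a single float has no max method.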
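6. Handling missing data
Handling missing data is easy in pandas, which uses the floating-point value NaN (Not a Number) to represent missing values.
Earlier we mentioned using isnull to check for NaN values.
A small example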
a = pd.Series(['one','two',np.nan,'three'])
a
Out[240]:
0 one
1 two
2 NaN
3 three
dtype: object
a.isnull()
Out[241]:
0 False
1 False
2 True
3 False
dtype: bool
a.notnull()
Out[242]:
0 True
1 True
2 False
3 True
dtype: bool
# Python's built-in None is also treated as a missing value
a[4]=None
a
Out[247]:
0 one
1 two
2 NaN
3 three
4 None
dtype: object
a.isnull()
Out[248]:
0 False
1 False
2 True
3 False
4 True
dtype: bool
How should we handle data like this? Sometimes we initialize it to a default value, and sometimes we simply drop it.
We can drop it with the dropna function, or with boolean indexing.
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
a
Out[249]:
0 one
1 two
2 NaN
3 three
4 None
dtype: object
a.dropna()
Out[250]:
0 one
1 two
3 three
dtype: object
a[a.notnull()]
Out[251]:
0 one
1 two
3 three
dtype: object
##dataframe
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
[np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data
Out[254]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
# By default, every row that contains at least one NaN is dropped
data.dropna()
Out[255]:
0 1 2
0 1.0 6.5 3.0
# We can control this with the how parameter
how : {'any', 'all'}
any : if any NA values are present, drop that label
all : if all values are NA, drop that label
data.dropna(how='all')
Out[257]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
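The dropna signature above also lists axis and thresh, which the examples do not use. A minimal sketch on the same data, with an extra all-NaN column (column 4, added here purely for illustration):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data[4] = np.nan  # illustrative extra column that is entirely NaN

# axis=1 drops columns instead of rows; with how='all' only column 4 goes
cols_dropped = data.dropna(axis=1, how='all')

# thresh=2 keeps only the rows that have at least 2 non-NaN values
at_least_two = data.dropna(thresh=2)

print(cols_dropped)
print(at_least_two)
```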
Sometimes we want to fill in missing values rather than drop them, as with the fill_value parameter we used earlier.
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Fill NA/NaN values using the specified method
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
data
Out[261]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
data.fillna(9.9)
Out[259]:
0 1 2
0 1.0 6.5 3.0
1 1.0 9.9 9.9
2 9.9 9.9 9.9
3 9.9 6.5 3.0
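# Using method works the same way as with reindex earlier (recent pandas versions deprecate fillna(method=...) in favor of data.ffill() and data.bfill())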
data.fillna(method='ffill')
Out[262]:
0 1 2
0 1.0 6.5 3.0
1 1.0 6.5 3.0
2 1.0 6.5 3.0
3 1.0 6.5 3.0
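Two further fillna variations, sketched on the same data: a dict gives a different fill value per column, and limit caps how many consecutive NaN values a forward fill may replace. The fill values 0.0 and 0.5 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# A dict maps column label -> fill value; column 2 is not listed, so its
# NaN values are left alone
per_column = data.fillna({0: 0.0, 1: 0.5})

# Forward fill, but replace at most 1 consecutive NaN in each column
limited = data.ffill(limit=1)

print(per_column)
print(limited)
```

Like the other methods in these notes, both calls return a new object and leave data unchanged.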