上一篇學習筆記學習了如何處理缺失值(數(shù)據(jù)清理)弄贿,這一篇筆記繼續(xù)學習pandas的數(shù)據(jù)處理:數(shù)據(jù)轉(zhuǎn)換
Removing Duplicates去重
#首先還是進入ipython,當然你也可以用python進行練習
$ ipython
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
...: 'k2': [1, 1, 2, 3, 3, 4, 4]}) #構建一個dataframe
In [3]: data #這里索引5和索引6對應的行是完全一樣的
Out[3]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
In [4]: data.duplicated() #檢查是否有重復的行癞谒,這里索引6的結果返回的是True
Out[4]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
In [5]: data.drop_duplicates() #去掉.duplicated()函數(shù)判斷結果為True的行啊掏,只保留返回結果為False的行
Out[5]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
上面的去重是直接把整行去掉梗搅,你也可以按照某一列里的元素進行去重:
In [6]: data['v1'] = range(7) #把dataframe加一列v1
In [7]: data
Out[7]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
In [8]: data.drop_duplicates(['k1']) #根據(jù)k1這一列里的元素進行去重禾唁,所以只返回了索引為0和1的兩行,因為對于k1列來說无切,后面的元素都是one和two荡短,所以都是重復項
Out[8]:
k1 k2 v1
0 one 1 0
1 two 1 1
duplicated
和drop_duplicates
都只保留了重復項的第一項,你也可以選擇只保留所有重復項里最后出現(xiàn)的那一項:
In [9]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[9]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6 #索引5和6對應的兩行是重復的哆键,這里只保留了索引6對應的行
Transforming Data Using a Function or Mapping
對于許多數(shù)據(jù)集肢预,你可能想根據(jù)數(shù)組、Series或Dataframe其中某一列中的值執(zhí)行一些轉(zhuǎn)換洼哎,舉個例子:
In [11]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
...: 'Pastrami', 'corned beef', 'Bacon',
...: 'pastrami', 'honey ham', 'nova lox'],
...: 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
In [12]: data #創(chuàng)建一個dataframe烫映,一列是food,一列是ounces數(shù)
Out[12]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
然后再加一列噩峦,是每一種食物對應的喂養(yǎng)的動物:
In [13]: meat_to_animal ={
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
The map method on a Series accepts a function or dict-like object containing a map‐
ping, but here we have a small problem in that some of the meats are capitalized and
others are not. Thus, we need to convert each value to lowercase using the str.lower
Series method
這里需要注意的是锭沟,如果你要使用map
功能,你必須使你要map的兩個對象的大小寫一致识补,比如Pastrami這個單詞族淮,在data和meat_to_animal里的開頭字母的大小寫不一致,所以我們要先把data里的food一欄都改成小寫開頭:
In [14]: lowercased = data['food'].str.lower()
In [15]: lowercased
Out[15]:
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
然后將data和meat_to_animal合并凭涂,并根據(jù)food一欄進行map:
In [16]: data['animal'] = lowercased.map(meat_to_animal)
In [17]: data
Out[17]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
上面是map之后祝辣,返回的是完整的dataframe,你也可以選擇只返回map好的meat_to_animal一列:
In [18]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[18]:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Replacing Values
上一篇筆記學習了用fillna
填補缺失值切油,這只是一個特殊情況◎保現(xiàn)在來學習一下一般值的替換。使用replace
功能:
In [19]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
In [20]: data
Out[20]:
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
In [21]: import numpy as np
In [22]: data.replace(-999, np.nan) #把-999的值替換成缺失值
Out[22]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
In [23]: data.replace([-999, -1000], np.nan) #把-999和-1000替換成缺失值
Out[23]:
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
In [24]: data.replace([-999, -1000], [np.nan, 0]) #把-999替換成缺失值澎胡,把-1000替換成0
Out[24]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
In [25]: data.replace({-999: np.nan, -1000: 0}) #用字典形式進行替換
Out[25]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
Renaming Axis Indexes重命名軸索引
除了對dataframe里的值進行map孕荠,你也可以map dataframe的行名和列名:
#將行名重命名
In [26]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
In [27]: data
Out[27]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
In [28]: transform = lambda x: x[:4].upper()
In [29]: data.index.map(transform)
Out[29]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')
In [30]: data.index = data.index.map(transform)
In [31]: data
Out[31]:
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
#將列名重命名
#將列名進行大寫處理
In [32]: data.rename(index=str.title, columns=str.upper)
Out[32]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
用rename
功能同時進行行名和列名的修改:
In [33]: data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
Out[33]:
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
rename
功能默認不覆蓋原始dataframe娩鹉,你也可以通過參數(shù)設置,讓替換后的dataframe覆蓋原始dataframe:
In [34]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
In [35]: data
Out[35]:
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
Discretization and Binning離散化和分箱
連續(xù)型數(shù)據(jù)經(jīng)常是離散的稚伍,或者被分成不同的“箱”(bin)進行分析弯予。假設這里有一組年齡的數(shù)據(jù),你要把這些人的年齡分成幾組:
In [36]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
然后我們將把年齡分成:18-25个曙,26-35锈嫩,36-60,以及61以上的垦搬。你可以使用pandas里的cut
函數(shù):
In [37]: bins = [18, 25, 35, 60, 100]
In [38]: cats = pd.cut(ages,bins)
In [39]: cats #把ages里每一個數(shù)字都對應到相應的bin里
Out[39]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
除了把每一個元素對應到bin里呼寸,你還可以根據(jù)bin的索引來map你的每一個元素:
#查看你的bin
In [40]: cats.categories
Out[40]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')
#根據(jù)bin的索引位置來map你的元素
In [41]: cats.codes
Out[41]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
#查看每一個bin里map了多少個元素
In [42]: pd.value_counts(cats)
Out[42]:
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
你也可以給每一個bin命名:
In [43]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
In [44]: pd.cut(ages, bins, labels=group_names)
Out[44]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
上面是對于整數(shù)元素而言的,下面看看浮點元素的操作:
In [45]: data = np.random.rand(20)
In [46]: data #隨機生成20個0-1的數(shù)字
Out[46]:
array([0.81110965, 0.47609535, 0.71813052, 0.71125473, 0.50874785,
0.54215936, 0.51799094, 0.92532814, 0.75840793, 0.59097154,
0.28318023, 0.580895 , 0.57077378, 0.86404974, 0.10115019,
0.29316477, 0.57158071, 0.14108119, 0.36732534, 0.18085161])
In [47]: pd.cut(data,4,precision=2) #從上面20個數(shù)字里的最小值到最大值悼沿,等分成4分,就是4個bin
#precision=2的意思是保留2位小數(shù)點
Out[47]:
[(0.72, 0.93], (0.31, 0.51], (0.51, 0.72], (0.51, 0.72], (0.31, 0.51], ..., (0.1, 0.31], (0.51, 0.72], (0.1, 0.31], (0.31, 0.51], (0.1, 0.31]]
Length: 20
Categories (4, interval[float64]): [(0.1, 0.31] < (0.31, 0.51] < (0.51, 0.72] < (0.72, 0.93]]
Detecting and Filtering Outliers檢測并過濾離群值
In [48]: data = pd.DataFrame(np.random.randn(1000, 4)) #構建一個1000行骚灸,4列的datagrame由隨機數(shù)構成
In [49]: data
Out[49]:
0 1 2 3
0 0.650892 2.070260 0.538299 0.679726
1 1.074305 0.495664 -0.970136 -0.344586
2 -0.859329 1.453186 1.612602 -0.321719
3 0.558712 0.562590 0.646543 -0.483733
4 0.935728 1.516592 0.852229 1.926105
.. ... ... ... ...
995 -1.304159 -0.085793 0.482636 1.523466
996 0.285450 -1.898452 1.422107 -1.227373
997 2.083328 -1.410110 0.061404 -0.860416
998 -0.438052 -1.744427 -1.677008 -0.481408
999 0.999009 -0.240101 -1.299566 -0.172057
[1000 rows x 4 columns]
In [50]: data.describe() #查看每一列的數(shù)據(jù)情況
Out[50]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.004135 -0.041167 -0.030157 -0.015668
std 0.987452 0.996992 1.004676 0.982699
min -3.438090 -3.179065 -3.055933 -3.336112
25% -0.683605 -0.714445 -0.679892 -0.677707
50% -0.033715 -0.066457 -0.122572 -0.039982
75% 0.646951 0.578665 0.643963 0.632722
max 3.625363 3.499694 3.896634 3.082743
如果你想找第2列里絕對值大于3的數(shù)字:
In [51]: col = data[2]
In [52]: col[np.abs(col) >3]
Out[52]:
106 3.248979
365 3.896634
467 3.659385
475 -3.055933
Name: 2, dtype: float64
尋找所有行里絕對值大于3的數(shù)字:
In [53]: data[(np.abs(data) > 3).any(1)]
Out[53]:
0 1 2 3
106 0.402003 0.675224 3.248979 0.594704
135 3.625363 0.757227 0.267661 2.870315
143 -0.040237 3.025956 -0.950451 2.369624
250 0.411540 3.499694 -0.268605 0.811673
365 -1.609991 -0.691870 3.896634 -0.761072
439 0.050316 0.433346 -0.938715 -3.277411
467 -0.722803 2.093205 3.659385 0.373497
475 0.326796 0.196501 -3.055933 -0.601524
492 -3.438090 0.132217 -0.305294 -0.574376
616 -0.623880 0.588810 -0.742245 -3.239445
646 -0.009522 -3.179065 -1.350068 0.309251
778 1.959143 -0.304691 -0.760753 3.082743
871 -0.686462 3.470719 1.305890 -1.060105
919 3.452396 1.216505 -0.495451 0.237577
966 -1.427159 -0.546189 1.346565 -3.336112
把dataframe里所有大于3和小于-3的元素糟趾,一律都設置為3:
In [54]: data[np.abs(data) > 3] = np.sign(data) * 3
In [55]: data.describe()
Out[55]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.003495 -0.041985 -0.031906 -0.014898
std 0.982432 0.993157 0.998391 0.979717
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.683605 -0.714445 -0.679892 -0.677707
50% -0.033715 -0.066457 -0.122572 -0.039982
75% 0.646951 0.578665 0.643963 0.632722
max 3.000000 3.000000 3.000000 3.000000
上面的代碼里np.sign(data)
的意思是把任何數(shù)根據(jù)正負判斷,變成-1和1:
In [57]: np.sign(data)
Out[57]:
0 1 2 3
0 1.0 1.0 1.0 1.0
1 1.0 1.0 -1.0 -1.0
2 -1.0 1.0 1.0 -1.0
3 1.0 1.0 1.0 -1.0
4 1.0 1.0 1.0 1.0
.. ... ... ... ...
995 -1.0 -1.0 1.0 1.0
996 1.0 -1.0 1.0 -1.0
997 1.0 -1.0 1.0 -1.0
998 -1.0 -1.0 -1.0 -1.0
999 1.0 -1.0 -1.0 -1.0
Permutation and Random Sampling隨機重排
使用np.random.permutation
可以對Series或者dataframe的行和列進行重排:
In [58]: df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
In [59]: df
Out[59]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
In [60]: sampler = np.random.permutation(5)
In [61]: sampler
Out[61]: array([2, 4, 0, 3, 1])
In [62]: df.take(sampler)
Out[62]:
0 1 2 3
2 8 9 10 11
4 16 17 18 19
0 0 1 2 3
3 12 13 14 15
1 4 5 6 7
Computing Indicator/Dummy Variables計算指標/虛擬變量
統(tǒng)計建纳跎或機器學習應用的另一種轉(zhuǎn)換是將分類變量轉(zhuǎn)換為“虛擬”或“指標”矩陣义郑。如果DataFrame中的一列有k個不同的值,那么你將得到k個列丈钙,包含所有1和0的矩陣或DataFrame非驮。pandas有一個get_dummies
函數(shù)來完成這個任務:
In [67]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
...: 'data1': range(6)})
In [68]: df
Out[68]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
In [69]: pd.get_dummies(df['key'])
Out[69]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
上面的函數(shù)可以這樣理解:get_dummies
函數(shù)是把dataframe里其中一列抽出來,把這一列的元素作為列名雏赦,行代表是否出現(xiàn)過key這一列的元素劫笙,出現(xiàn)過即為1,沒出現(xiàn)即為0星岗。所以對于key列來說填大,b在最開始出現(xiàn)了兩次,所以在 get_dummies`返回的dataframe里俏橘,b列的前2行是1允华,其他列是0。以此類推寥掐。
可以把上面得到的dataframe改一下列名:
In [70]: dummies = pd.get_dummies(df['key'], prefix='key')
In [71]: dummies
Out[71]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0