數(shù)據(jù)規(guī)整(數(shù)據(jù)預處理,數(shù)據(jù)清洗)
數(shù)據(jù)規(guī)整的一般分類:
- 清理
- 轉換
- 合并
- 重塑
Pandas數(shù)據(jù)規(guī)整-清理:
對指定數(shù)據(jù)(如缺失數(shù)據(jù)续语、重復數(shù)據(jù))進行處理(檢查、替換荸哟、刪除)
- 缺失值的表示:np.nan
- 檢查缺失值:isnull(),notnull(),info()
- 刪除缺失值:dropna()
- 填充缺失值:fillna()
- 替換值(填充缺失值是替換值的一種情況):replace()
- 移除重復數(shù)據(jù)
- 檢測和過濾異常值
In [1]:
import pandas as pd
import numpy as np
Pandas缺失數(shù)據(jù)處理
缺失值的表示
In [2]:
a = np.array([2, 4, 8, 10, 12])
a
Out[2]:
array([ 2, 4, 8, 10, 12])
In [3]:
a + 10
Out[3]:
array([12, 14, 18, 20, 22])
Python原生缺失值表示:None
運算直接報錯
In [4]:
b = np.array([2, 4, None, 10, 12])
b
Out[4]:
array([2, 4, None, 10, 12], dtype=object)
In [5]:
# b + 10 # 缺失值導致計算報錯
使用Numpy的缺失值數(shù)據(jù)類型:np.nan
缺失值運算不會報錯锦积,和缺失值進行運算芒帕,結果還是缺失值
In [6]:
c = np.array([2, 4, np.nan, 10, 12])
c
Out[6]:
array([ 2., 4., nan, 10., 12.])
In [7]:
c + 10
Out[7]:
array([12., 14., nan, 20., 22.])
In [8]:
c.sum() # 任何數(shù)組和缺失值計算,結果還是缺失值
np.sum(c)
Out[8]:
nan
nan專有運算方法丰介,會跳過缺失值背蟆,直接計算正常值
In [9]:
np.nansum(c)
np.nanmean(c)
Out[9]:
7.0
In [10]:
(2 + 4 + 10 + 12) / 4 # 除以4鉴分, 缺失值不參與運算
Out[10]:
7.0
使用Pandas缺失值計算
Pandas中,不論缺失值是 None 還是 np.nan 带膀,都會被轉化為 NaN 的形式
NaN:非數(shù)字志珍,not a number,Pandas中它表示缺失或NA值垛叨,便于被檢測出來
本質上就是np.nan
Pandas的對象可以跳過缺失值直接進行運算
In [11]:
b = pd.Series([1,2,np.nan,4,None,6])
b
Out[11]:
0 1.0
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [12]:
b + 10
Out[12]:
0 11.0
1 12.0
2 NaN
3 14.0
4 NaN
5 16.0
dtype: float64
In [13]:
b.sum()
b.mean()
Out[13]:
3.25
In [14]:
# 缺失值賦值
b[0] = np.nan
b
Out[14]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [15]:
c = pd.DataFrame([[1,np.nan,3], [4,5,6], [np.nan,8,9]])
c
Out[15]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
In [16]:
c.sum()
Out[16]:
0 5.0
1 13.0
2 18.0
dtype: float64
通過函數(shù)檢查數(shù)據(jù)中是否含有缺失值
檢查單個空值
單個空值伦糯,底層類型為 np.nan,不能直接比較是否相同
In [17]:
3 == 3
2.5 == 2.5
np.nan == np.nan # 缺失值不能自己比較
Out[17]:
False
單個np.nan空值嗽元,可以用 np.isnan() 方法判斷是否是空值
In [18]:
np.isnan(np.nan)
np.isnan(123)
Out[18]:
False
整體判斷敛纲,表格中各列缺失值情況
In [19]:
c.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
0 2 non-null float64
1 2 non-null float64
2 3 non-null int64
dtypes: float64(2), int64(1)
memory usage: 152.0 bytes
isnull()和notnull()
- isnull():缺失值返回True,正常值返回False
- notnull():正常值返回True,缺失值返回False
In [20]:
b
Out[20]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [21]:
b.isnull()
Out[21]:
0 True
1 False
2 True
3 False
4 True
5 False
dtype: bool
In [22]:
-(b.isnull()) # 非運算,缺失值返回False
Out[22]:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
In [23]:
b.notnull()
Out[23]:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
返回所有正常值
手動過濾Series的缺失值
In [24]:
b
Out[24]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [25]:
b.notnull()
Out[25]:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
In [26]:
b[b.notnull()]
Out[26]:
1 2.0
3 4.0
5 6.0
dtype: float64
DataFrame不能通過布爾查詢方式過濾缺失值,必須使用Pandas的特定方法過濾
查到缺失值后剂癌,Series可以直接過濾淤翔,DataFrame需要進一步處理
In [27]:
c
Out[27]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
In [28]:
c.isnull()
c[c.isnull()]
Out[28]:
0 | 1 | 2 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
In [29]:
c.notnull()
c[c.notnull()]
Out[29]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
去除缺失值,只保留有效值
dropna()函數(shù)
In [30]:
b
Out[30]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [31]:
b.dropna()
Out[31]:
1 2.0
3 4.0
5 6.0
dtype: float64
In [32]:
# 等同于
b[b.notnull()]
Out[32]:
1 2.0
3 4.0
5 6.0
dtype: float64
In [33]:
c
Out[33]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
In [34]:
c.dropna() # 默認刪除缺失值所在行
c.dropna(axis=0)
Out[34]:
0 | 1 | 2 | |
---|---|---|---|
1 | 4.0 | 5.0 | 6 |
In [35]:
c.dropna(axis=1) # 按列刪除
Out[35]:
2 | |
---|---|
0 | 3 |
1 | 6 |
2 | 9 |
In [36]:
# 增加一列全部為缺失值的數(shù)據(jù)
c[3] = np.nan
c
Out[36]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [37]:
# 行或列佩谷,有1個缺失值即刪除
c.dropna(axis=1)
c.dropna(axis=1, how='any')
Out[37]:
2 | |
---|---|
0 | 3 |
1 | 6 |
2 | 9 |
In [38]:
# 行或列必須全部都是缺失值才刪
c.dropna(axis=1, how='all')
Out[38]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
根據(jù)行或列的非缺失值數(shù)量衡量刪除與否
In [39]:
c
Out[39]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [40]:
c.dropna()
Out[40]:
0 | 1 | 2 | 3 |
---|
In [41]:
c.dropna(thresh=3) # 行非缺失值數(shù)量大于等于3個旁壮,保留
Out[41]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
1 | 4.0 | 5.0 | 6 | NaN |
填充缺失值
缺失值問題除了刪除所在行列以外,還可以通過填充值解決
fillna()函數(shù)參數(shù)
In [42]:
b
Out[42]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [43]:
b.fillna(0) # 缺失值填充為0
Out[43]:
0 0.0
1 2.0
2 0.0
3 4.0
4 0.0
5 6.0
dtype: float64
In [44]:
b.fillna(b.mean()) # 缺失值填充為平均數(shù)
Out[44]:
0 4.0
1 2.0
2 4.0
3 4.0
4 4.0
5 6.0
dtype: float64
前向填充和后向填充
- method='ffill'
- method='bfill'
In [45]:
b
Out[45]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [46]:
b.mean()
Out[46]:
4.0
前向填充:使用缺失值的前一個值填充
In [47]:
b.fillna(method='ffill')
Out[47]:
0 NaN
1 2.0
2 2.0
3 4.0
4 4.0
5 6.0
dtype: float64
后向填充谐檀,使用缺失值的后一個值填充
In [48]:
b.fillna(method='bfill')
Out[48]:
0 2.0
1 2.0
2 4.0
3 4.0
4 6.0
5 6.0
dtype: float64
In [49]:
c
Out[49]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [50]:
c.fillna(method='ffill') # 前向填充抡谐,按行
Out[50]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | 4.0 | 8.0 | 9 | NaN |
In [51]:
c.fillna(method='bfill') # 后向填充,按行
Out[51]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 5.0 | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [52]:
c.fillna(method='ffill', axis=1) # 按列桐猬,前向填充
Out[52]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 1.0 | 3.0 | 3.0 |
1 | 4.0 | 5.0 | 6.0 | 6.0 |
2 | NaN | 8.0 | 9.0 | 9.0 |
給各列分別填充不同值
In [53]:
c
Out[53]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [54]:
c.fillna({0: 111, 1: 222, 2: 333, 3: 444})
Out[54]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 222.0 | 3 | 444.0 |
1 | 4.0 | 5.0 | 6 | 444.0 |
2 | 111.0 | 8.0 | 9 | 444.0 |
上面一切刪除童叠、填充操作都沒有修改原變量
修改原值參數(shù):inplace=True
In [55]:
c
Out[55]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [56]:
# c.fillna(100)
c.fillna(100, inplace=True)
In [57]:
c
Out[57]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 100.0 | 3 | 100.0 |
1 | 4.0 | 5.0 | 6 | 100.0 |
2 | 100.0 | 8.0 | 9 | 100.0 |
連續(xù)填充數(shù)量
In [58]:
c.loc[3] = np.nan
c.loc[0, 1] = np.nan
c.loc[1:3,0] = np.nan
c[3] = np.nan
c
Out[58]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3.0 | NaN |
1 | NaN | 5.0 | 6.0 | NaN |
2 | NaN | 8.0 | 9.0 | NaN |
3 | NaN | NaN | NaN | NaN |
In [59]:
c.fillna(method='ffill') # 默認全部填充
Out[59]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3.0 | NaN |
1 | 1.0 | 5.0 | 6.0 | NaN |
2 | 1.0 | 8.0 | 9.0 | NaN |
3 | 1.0 | 8.0 | 9.0 | NaN |
In [60]:
c.fillna(method='ffill', limit=2) # 設置填充多少行或列
Out[60]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3.0 | NaN |
1 | 1.0 | 5.0 | 6.0 | NaN |
2 | 1.0 | 8.0 | 9.0 | NaN |
3 | NaN | 8.0 | 9.0 | NaN |
替換值
利用fillna方法填充缺失數(shù)據(jù)是值替換的一種特殊情況, replace方法用作替換值更簡單课幕、更靈活
In [61]:
data = pd.Series([1,-999,2,-999,-1000,3])
data
Out[61]:
0 1
1 -999
2 2
3 -999
4 -1000
5 3
dtype: int64
In [62]:
# 替換單值
data.replace(-999, np.nan)
Out[62]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
In [63]:
# 替換多值,多個替換為1個
data.replace([-999, -1000], np.nan)
Out[63]:
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
In [64]:
# 多個值替換為不同數(shù)值
data.replace([-999, -1000], [0, 1])
data.replace({-999: 0, -1000: 1})
Out[64]:
0 1
1 0
2 2
3 0
4 1
5 3
dtype: int64
映射數(shù)據(jù)替換
map除了自定義函數(shù)運算,還是一種映射轉換元素以及其他數(shù)據(jù)清理工作的便捷方式
In [65]:
a = pd.DataFrame([['鬃刷','皮帶','煎蛋','觀賞'], [10,20,30,40]]).T
a
Out[65]:
0 | 1 | |
---|---|---|
0 | 鬃刷 | 10 |
1 | 皮帶 | 20 |
2 | 煎蛋 | 30 |
3 | 觀賞 | 40 |
In [66]:
y = {'鬃刷': '豬', '皮帶': '牛', '觀賞': '魚', '衣服': '棉花'}
y
Out[66]:
{'鬃刷': '豬', '皮帶': '牛', '觀賞': '魚', '衣服': '棉花'}
In [67]:
a[0].map(y)
Out[67]:
0 豬
1 牛
2 NaN
3 魚
Name: 0, dtype: object
移除重復數(shù)據(jù)
移除DataFrame的重復行
In [68]:
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,'k2':[1,1,2,3,3,4,4]})
data
Out[68]:
k1 | k2 | |
---|---|---|
0 | one | 1 |
1 | one | 1 |
2 | one | 2 |
3 | two | 3 |
4 | two | 3 |
5 | two | 4 |
6 | two | 4 |
In [69]:
# 布爾型Series,各列重復值交集
data.duplicated()
Out[69]:
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
In [70]:
# 移除重復行
data.drop_duplicates()
data[-(data.duplicated())]
Out[70]:
k1 | k2 | |
---|---|---|
0 | one | 1 |
2 | one | 2 |
3 | two | 3 |
5 | two | 4 |
In [71]:
# 移除自定義列重復行
data.drop_duplicates('k1')
Out[71]:
k1 | k2 | |
---|---|---|
0 | one | 1 |
3 | two | 3 |
In [72]:
#keep : {‘first’, ‘last’, False},
# first默認留下第一次出現(xiàn)的值
data.drop_duplicates(['k1', 'k2'], keep='first')
Out[72]:
k1 | k2 | |
---|---|---|
0 | one | 1 |
2 | one | 2 |
3 | two | 3 |
5 | two | 4 |
In [73]:
# last,留下最后一次出現(xiàn)的值
data.drop_duplicates(['k1', 'k2'], keep='last')
Out[73]:
k1 | k2 | |
---|---|---|
1 | one | 1 |
2 | one | 2 |
4 | two | 3 |
6 | two | 4 |
In [74]:
# False,刪掉所有重復值
data.drop_duplicates(['k1', 'k2'], keep=False)
Out[74]:
k1 | k2 | |
---|---|---|
2 | one | 2 |
移除重復索引值
In [75]:
obj = pd.Series(range(5), index = ['a','a','b','b','c'])
obj
Out[75]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
In [76]:
obj['a']
Out[76]:
a 0
a 1
dtype: int64
In [77]:
obj.index.duplicated()
Out[77]:
array([False, True, False, True, False])
In [78]:
obj.loc[~(obj.index.duplicated())]
Out[78]:
a 0
b 2
c 4
dtype: int64
檢測和過濾異常值
過濾或變換異常值(outlier)在很大程度上就是運用數(shù)組運算
例子:一個含有正態(tài)分布數(shù)據(jù)的DataFrame
目標:將每列數(shù)據(jù)的范圍控制在 -3 到 3 之間
In [79]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.loc[:10]
Out[79]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.056350 | -0.815325 | -1.739909 | -0.044069 |
1 | -0.342930 | 2.049838 | 0.571770 | 1.431593 |
2 | 0.155038 | -0.019762 | 0.365701 | -0.405128 |
3 | 0.896189 | -0.372501 | -0.289871 | 0.579180 |
4 | 0.619782 | -2.591634 | -1.116019 | 1.642877 |
5 | -1.558965 | 1.861874 | -0.620186 | 1.222442 |
6 | -1.404131 | -0.230156 | -1.057523 | 0.781039 |
7 | 0.329895 | -1.247459 | 1.360716 | 0.352324 |
8 | -2.796119 | -0.783126 | 0.071594 | 0.071109 |
9 | -0.638204 | 0.348094 | -0.784782 | 0.304675 |
10 | -0.125829 | -1.645295 | 0.080631 | 0.105201 |
In [80]:
# 查看數(shù)據(jù)整體指標
data.describe()
Out[80]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.039946 | -0.020816 | 0.061418 | 0.001750 |
std | 1.025141 | 0.989867 | 0.999929 | 0.995840 |
min | -3.372125 | -3.244717 | -3.219369 | -2.866783 |
25% | -0.757997 | -0.730113 | -0.626054 | -0.698372 |
50% | -0.048779 | -0.029766 | 0.075186 | 0.004677 |
75% | 0.644870 | 0.623295 | 0.744526 | 0.663374 |
max | 3.512715 | 2.932729 | 2.903664 | 3.288854 |
In [81]:
# 找出某列中絕對值大于3的值
data[2][(data[2] > 3) | (data[2] < -3)]
data[2][np.abs(data[2]) > 3]
Out[81]:
87 -3.141451
631 -3.219369
Name: 2, dtype: float64
In [82]:
# 找出全部絕對值大于3的值所在的行
data[np.abs(data) > 3]
data[np.abs(data) > 3].any(axis=1)
data[(np.abs(data) > 3).any(1)]
Out[82]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
87 | -0.159197 | 0.751127 | -3.141451 | -0.092924 |
88 | 3.032059 | 1.106646 | -0.762529 | 0.109455 |
228 | 0.313696 | 1.587848 | 1.495230 | 3.288854 |
246 | 3.061156 | -1.015008 | -0.965786 | -0.535780 |
310 | 0.900701 | -0.860197 | 2.287573 | 3.003777 |
567 | 3.150045 | -1.160207 | 2.412660 | -0.850230 |
631 | -0.658641 | -0.355802 | -3.219369 | -0.667448 |
703 | -0.813356 | -3.244717 | 0.996636 | 1.243210 |
726 | -3.372125 | 0.621298 | -0.264892 | 0.716488 |
806 | 3.512715 | 1.699141 | -0.632934 | -0.975362 |
In [83]:
# 將數(shù)據(jù)范圍限制在3到-3之間(大于3的改為3五垮,小于-3的改為-3)
np.sign(data) # 判斷數(shù)據(jù)正負乍惊,正數(shù)1,負數(shù)-1放仗,生成一個對應數(shù)據(jù)的1润绎,-1數(shù)據(jù)
data[np.abs(data) > 3] = np.sign(data) * 3
In [84]:
data.describe()
Out[84]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.040330 | -0.020571 | 0.061779 | 0.001457 |
std | 1.021605 | 0.989099 | 0.998788 | 0.994916 |
min | -3.000000 | -3.000000 | -3.000000 | -2.866783 |
25% | -0.757997 | -0.730113 | -0.626054 | -0.698372 |
50% | -0.048779 | -0.029766 | 0.075186 | 0.004677 |
75% | 0.644870 | 0.623295 | 0.744526 | 0.663374 |
max | 3.000000 | 2.932729 | 2.903664 | 3.000000 |