Pandas Data Wrangling - Transformation - Discretization and Binning
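To make analysis easier, continuous data is often discretized, that is, split into "bins" (grouping intervals).
Discretizing continuous data: continuous quantities such as rainfall, age, and height can be plotted as a histogram but cannot be grouped and aggregated directly. Once discretized, for example rainfall into light, moderate, heavy, and torrential rain, or age into juvenile, young adult, middle-aged, and elderly, the data can be grouped and aggregated.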
In [1]:
import numpy as np
import pandas as pd
Example: a set of ages to be divided into age groups
The bins are "18 to 25", "26 to 35", "36 to 60", and "over 60"
In [2]:
# ages
ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# bin edges
bins = [18, 25, 35, 60, 100]
In [3]:
cats = pd.cut(ages, bins)
cats
Out[3]:
[NaN, (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
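The result is a Categorical object (the bins), which can be treated as an array of strings naming each value's bin.
Internally it contains:
a codes attribute holding the bin label (integer code) of each age
a categories array listing the distinct bins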
In [4]:
type(cats)
Out[4]:
pandas.core.arrays.categorical.Categorical
In [4]:
cats.codes  # bin code of each value (an index into the categories below); -1 marks a value that falls in no bin
Out[4]:
array([-1, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [5]:
cats.categories  # the categories, i.e. the bin intervals
Out[5]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
In [7]:
cats[11]  # look up the bin of a single value
Out[7]:
Interval(25, 35, closed='right')
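The relationship between codes and categories can be checked by hand. The following is a minimal sketch, reusing the ages and bins from this example, that rebuilds each value's interval from its integer code; the name rebuilt is only illustrative.

```python
import numpy as np
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

# Each code is a position in cats.categories.
# A code of -1 means the value fell in no bin (NaN), so it is handled separately.
rebuilt = [cats.categories[c] if c != -1 else np.nan for c in cats.codes]

for age, code, interval in zip(ages, cats.codes, rebuilt):
    print(age, code, interval)
```

Age 18 gets code -1 because the default first bin (18, 25] excludes its left edge.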
In [8]:
# bin counts for the pd.cut result
pd.value_counts(cats)  # count how many values fall in each bin
Out[8]:
(18, 25] 4
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
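cut: by default the intervals are open on the left and closed on the right, so the start value is excluded and the end value is included
With right=False the intervals are closed on the left and open on the right, so the start value is included and the end value is excluded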
In [9]:
cats2 = pd.cut(ages, bins, right=False)
cats2
Out[9]:
[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
In [10]:
cats2.codes
Out[10]:
array([0, 0, 1, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [11]:
cats2.categories
Out[11]:
IntervalIndex([[18, 25), [25, 35), [35, 60), [60, 100)]
closed='left',
dtype='interval[int64]')
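As a side-by-side check of the two conventions, the sketch below compares how the boundary values 18 and 25 are binned. It also uses include_lowest=True, a pd.cut parameter not covered in the original notes, which keeps right-closed bins but lets the first bin include its left edge.

```python
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

default = pd.cut(ages, bins)                           # (18, 25]: 18 is excluded
left_closed = pd.cut(ages, bins, right=False)          # [18, 25): 18 is included, 25 moves to the next bin
with_lowest = pd.cut(ages, bins, include_lowest=True)  # right-closed, but the first bin also includes 18

# Codes for age 18 (position 0) and age 25 (position 2) under each setting
print(default.codes[0], left_closed.codes[0], with_lowest.codes[0])
print(default.codes[2], left_closed.codes[2], with_lowest.codes[2])
```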
Changing the bin labels
In [12]:
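cat3 = pd.cut(ages, bins)
cat3 = pd.cut(ages, bins, labels=False)  # labels=False: drop the bin labels and return integer bin codes
cat3 = pd.cut(ages, bins, labels=['少年', '青年', '中年', '老年'])  # custom bin labels (juvenile, young adult, middle-aged, elderly)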
cat3
Out[12]:
[NaN, 少年, 少年, 青年, 少年, ..., 青年, 老年, 中年, 中年, 青年]
Length: 12
Categories (4, object): [少年 < 青年 < 中年 < 老年]
In [13]:
cat3.codes
Out[13]:
array([-1, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [14]:
cat3.categories
Out[14]:
Index(['少年', '青年', '中年', '老年'], dtype='object')
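The cell above overwrites the labels=False result before displaying it. As a small sketch of what that call returns on its own: with labels=False, pd.cut gives back a plain NumPy array of bin codes rather than a Categorical, and values that fall in no bin appear as NaN.

```python
import numpy as np
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

codes = pd.cut(ages, bins, labels=False)

print(type(codes))  # a NumPy array, not a Categorical
print(codes)        # age 18 is outside (18, 25], so its entry is NaN
```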
Instead of giving the bin edges, you can give the number of bins (how many pieces to cut into); the edges are then computed automatically
In [15]:
cat4 = pd.cut(ages, 4, precision=2)  # split the data into four equal-width bins; precision=2 limits the displayed edges to 2 decimal places
cat4
Out[15]:
[(17.96, 28.75], (17.96, 28.75], (17.96, 28.75], (17.96, 28.75], (17.96, 28.75], ..., (28.75, 39.5], (50.25, 61.0], (39.5, 50.25], (39.5, 50.25], (28.75, 39.5]]
Length: 12
Categories (4, interval[float64]): [(17.96, 28.75] < (28.75, 39.5] < (39.5, 50.25] < (50.25, 61.0]]
In [20]:
(61 - 18) / 4  # (max - min) / number of bins gives the width of each bin
Out[20]:
10.75
In [25]:
18 + 10.75, 28.75 + 10.75, 39.5 + 10.75, 50.25 + 10.75
Out[25]:
(28.75, 39.5, 50.25, 61.0)
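The same arithmetic can be cross-checked against pandas itself. This sketch uses retbins=True, a pd.cut parameter not used in the original notes, to return the computed edges, and compares them with np.linspace. The lowest edge comes out slightly below 18 (displayed as 17.96 above) so that the minimum value is included in the first right-closed bin.

```python
import numpy as np
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# retbins=True also returns the bin edges pandas computed
cat4, edges = pd.cut(ages, 4, retbins=True)

# Equal-width edges computed by hand: 5 edges delimit 4 bins
manual = np.linspace(min(ages), max(ages), 5)

print(edges)
print(manual)
```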
In [29]:
cat4.codes
cat4.categories
cat4.value_counts()
pd.value_counts(cat4)
Out[29]:
(17.96, 28.75] 6
(28.75, 39.5] 3
(39.5, 50.25] 2
(50.25, 61.0] 1
dtype: int64
qcut: Binning by Sample Quantiles
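Depending on how the data is distributed, cut may not give bins that contain the same number of values
qcut bins by sample quantiles instead, which yields bins of roughly equal size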
In [16]:
cat5 = pd.qcut(ages, 4)
cat5
Out[16]:
[(17.999, 22.75], (17.999, 22.75], (22.75, 29.0], (22.75, 29.0], (17.999, 22.75], ..., (29.0, 38.0], (38.0, 61.0], (38.0, 61.0], (38.0, 61.0], (29.0, 38.0]]
Length: 12
Categories (4, interval[float64]): [(17.999, 22.75] < (22.75, 29.0] < (29.0, 38.0] < (38.0, 61.0]]
In [17]:
cat5.value_counts()
Out[17]:
(17.999, 22.75] 3
(22.75, 29.0] 3
(29.0, 38.0] 3
(38.0, 61.0] 3
dtype: int64
Passing the quartile points by hand gives the same result
In [18]:
cat6 = pd.qcut(ages, [0,0.25,0.5,0.75,1])
cat6
Out[18]:
[(17.999, 22.75], (17.999, 22.75], (22.75, 29.0], (22.75, 29.0], (17.999, 22.75], ..., (29.0, 38.0], (38.0, 61.0], (38.0, 61.0], (38.0, 61.0], (29.0, 38.0]]
Length: 12
Categories (4, interval[float64]): [(17.999, 22.75] < (22.75, 29.0] < (29.0, 38.0] < (38.0, 61.0]]
In [19]:
cat6.value_counts()
Out[19]:
(17.999, 22.75] 3
(22.75, 29.0] 3
(29.0, 38.0] 3
(38.0, 61.0] 3
dtype: int64
In [34]:
cat6.codes
cat6.categories
Out[34]:
IntervalIndex([(17.999, 22.75], (22.75, 29.0], (29.0, 38.0], (38.0, 61.0]]
closed='right',
dtype='interval[float64]')
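To see where the qcut edges come from, this sketch compares them with quantiles computed directly by NumPy. It uses retbins=True, a qcut parameter not used in the original notes, to get the edges back as an array.

```python
import numpy as np
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# retbins=True also returns the quantile-based bin edges
cat5, edges = pd.qcut(ages, 4, retbins=True)

# The same quartile points computed directly
quartiles = np.quantile(ages, [0, 0.25, 0.5, 0.75, 1])

print(edges)
print(quartiles)
print(cat5.value_counts())
```

The displayed lower edge 17.999 in the output above is only a small adjustment so that the minimum value, 18, falls inside the first right-closed bin.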
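Quantile and Bucket Analysis
pandas has tools, such as cut and qcut, that split data into pieces according to given bins or sample quantiles
Combining these functions with groupby makes bucket or quantile analysis of a dataset straightforward
Example: given an age column and a sex column, to analyze the sex breakdown within each age range, first discretize age, then group by the discretized values, and finally aggregate the sex column
Take the following simple random dataset as an example, using cut to place it into equal-length buckets: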
In [28]:
frame = pd.DataFrame({'data1': np.random.randn(1000), 'data2': np.random.randn(1000)})
frame.head()
Out[28]:
 | data1 | data2 |
---|---|---|
0 | -0.092267 | 0.455749 |
1 | 2.240468 | 0.500134 |
2 | -0.841825 | 0.796062 |
3 | 1.338347 | 1.470217 |
4 | 0.704546 | 0.485647 |
In [29]:
q = pd.cut(frame['data1'], 4)
q.head()
Out[29]:
0 (-1.487, 0.188]
1 (1.864, 3.539]
2 (-1.487, 0.188]
3 (0.188, 1.864]
4 (0.188, 1.864]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.169, -1.487] < (-1.487, 0.188] < (0.188, 1.864] < (1.864, 3.539]]
In [30]:
# q is a Series, not a Categorical
type(q)
Out[30]:
pandas.core.series.Series
In [31]:
# the categorical accessor of the Series
type(q.cat)
Out[31]:
pandas.core.arrays.categorical.CategoricalAccessor
In [80]:
q.cat.codes
q.cat.categories
Out[80]:
IntervalIndex([(-3.169, -1.487], (-1.487, 0.188], (0.188, 1.864], (1.864, 3.539]]
closed='right',
dtype='interval[float64]')
In [32]:
q.value_counts()
Out[32]:
(-1.487, 0.188] 484
(0.188, 1.864] 415
(-3.169, -1.487] 65
(1.864, 3.539] 36
Name: data1, dtype: int64
The categorical result returned by cut can be passed directly to groupby. We can then compute some statistics on the data2 column, as below
In [33]:
frame.describe()
Out[33]:
 | data1 | data2 |
---|---|---|
count | 1000.000000 | 1000.000000 |
mean | 0.067439 | -0.016663 |
std | 1.017269 | 1.015648 |
min | -3.162162 | -3.058359 |
25% | -0.573029 | -0.720729 |
50% | 0.069688 | -0.047600 |
75% | 0.738430 | 0.666327 |
max | 3.539001 | 3.629984 |
In [35]:
frame.groupby(q).size()
frame.groupby(q)['data2'].size()
Out[35]:
data1
(-3.169, -1.487] 65
(-1.487, 0.188] 484
(0.188, 1.864] 415
(1.864, 3.539] 36
Name: data2, dtype: int64
In [37]:
frame.groupby(q).sum()
frame.groupby(q)['data2'].sum()
Out[37]:
data1
(-3.169, -1.487] 20.243026
(-1.487, 0.188] -6.394616
(0.188, 1.864] -31.361414
(1.864, 3.539] 0.849714
Name: data2, dtype: float64
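The age-and-sex scenario described at the start of this section can be sketched the same way. The data, the column names age and sex, and the bin labels below are all made up for illustration. observed=False is passed explicitly because newer pandas versions warn when it is omitted for categorical keys.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'age': [19, 23, 30, 34, 41, 58, 64, 22],
    'sex': ['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M'],
})

bins = [18, 25, 35, 60, 100]
labels = ['18-25', '26-35', '36-60', '60+']

# Discretize age, then group by age bin and sex and count rows
age_group = pd.cut(df['age'], bins, labels=labels).rename('age_group')
counts = df.groupby([age_group, 'sex'], observed=False).size().unstack(fill_value=0)

print(counts)
```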
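Use a custom function to compute several metrics at once for a quick overall summary
Have the custom function build and return a dict or a Series; the per-group results can then be arranged into a DataFrame (here with unstack())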
In [40]:
def aaa(x):
    # Summary statistics for one group, returned as a labeled Series.
    # Alternative: return a dict instead.
    # return {
    #     'count': x.count(),
    #     'mean': x.mean(),
    #     'std': x.std(),
    #     'min': x.min(),
    #     'max': x.max(),
    # }
    return pd.Series([x.count(), x.mean(), x.std(), x.min(), x.max()], index=['count', 'mean', 'std', 'min', 'max'])

# frame.groupby(q).apply(aaa)
frame.groupby(q)['data2'].apply(aaa)
frame.groupby(q)['data2'].apply(aaa).unstack()
frame.groupby(q)['data2'].apply(aaa).unstack().T
Out[40]:
data1 | (-3.169, -1.487] | (-1.487, 0.188] | (0.188, 1.864] | (1.864, 3.539] |
---|---|---|---|---|
count | 65.000000 | 484.000000 | 415.000000 | 36.000000 |
mean | 0.311431 | -0.013212 | -0.075570 | 0.023603 |
std | 1.119511 | 1.004911 | 1.007641 | 0.981124 |
min | -2.234248 | -3.058359 | -2.848361 | -2.647758 |
max | 3.629984 | 3.175076 | 3.080270 | 1.592952 |
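As an alternative to a hand-written function, the same five statistics can be requested by name through agg. This is a sketch of that built-in route, not the approach used in the notes above; the random data is regenerated with an arbitrary seed, so the numbers will differ from the table shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
frame = pd.DataFrame({'data1': rng.standard_normal(1000),
                      'data2': rng.standard_normal(1000)})
q = pd.cut(frame['data1'], 4)

# One row per bin, one column per statistic
stats = frame.groupby(q, observed=False)['data2'].agg(['count', 'mean', 'std', 'min', 'max'])

print(stats.T)  # transpose to match the layout of the table above
```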
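Computing Indicator / Dummy Variables (for reference)
A transformation often used in statistical modeling and machine learning is to convert a categorical variable into dummy variables, also called an indicator matrix or one-hot encoded variables
If a column of a DataFrame has k distinct values, a matrix or DataFrame with k columns, containing only 1s and 0s, can be derived from it
pandas provides the get_dummies function for this
Purpose of one-hot encoding: turn strings, which cannot be used in calculations, into numeric values (a table or matrix) that can
String: 'A useful method for statistical applications: combine get_dummies with a discretization function such as cut'
[statistical, applications, useful, method, combine, discretization, function]
[1,1,1,1,1,1,1]
statistical: [1, 0, 0, 0, 0, 0, 0]
method: [0, 0, 0, 1, 0, 0, 0]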
In [41]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df
Out[41]:
 | key | data1 |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | b | 5 |
In [42]:
df['key']
Out[42]:
0 b
1 b
2 a
3 c
4 a
5 b
Name: key, dtype: object
Converting to one-hot encoding by hand
[a,b,c]
[1,1,1]
a: [1,0,0]
b: [0,1,0]
c: [0,0,1]
[b,b,a,c,a,b]
b:[1,1,0,0,0,1]
a:[0,0,1,0,1,0]
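The hand conversion above can be written as a single NumPy comparison. This is only a sketch of the idea, not how get_dummies is implemented: each key is compared against every distinct category, and the resulting True/False grid is cast to 1/0.

```python
import numpy as np

keys = np.array(['b', 'b', 'a', 'c', 'a', 'b'])

# Sorted distinct values become the columns: ['a', 'b', 'c']
categories = np.unique(keys)

# Broadcasting: a (6, 1) column against a (3,) row gives a (6, 3) grid
one_hot = (keys[:, None] == categories).astype(int)

print(categories)
print(one_hot)
```

Each row contains exactly one 1, and each column reproduces one of the per-category vectors listed above.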
In [43]:
pd.get_dummies(df['key'])
Out[43]:
 | a | b | c |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
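Depending on the pandas version, the same call may display True/False instead of 1/0, since the default output dtype of get_dummies became bool in pandas 2.0. A minimal sketch that should reproduce the 0/1 table regardless of version is to pass dtype=int explicitly:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

# dtype=int forces 0/1 integers instead of the version-dependent default
dummies = pd.get_dummies(df['key'], dtype=int)

print(dummies)
```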
Joining the two tables
In [61]:
dummies = pd.get_dummies(df['key'], prefix='key')
dummies
Out[61]:
 | key_a | key_b | key_c |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
In [62]:
df
Out[62]:
 | key | data1 |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | b | 5 |
In [63]:
df.join(dummies)  # join on the row index
Out[63]:
 | key | data1 | key_a | key_b | key_c |
---|---|---|---|---|---|
0 | b | 0 | 0 | 1 | 0 |
1 | b | 1 | 0 | 1 | 0 |
2 | a | 2 | 1 | 0 | 0 |
3 | c | 3 | 0 | 0 | 1 |
4 | a | 4 | 1 | 0 | 0 |
5 | b | 5 | 0 | 1 | 0 |
Example: converting a set of values to dummy variables
A useful method for statistical applications: combine get_dummies with a discretization function such as cut
In [44]:
# generate random data
np.random.seed(12345)
values = np.random.rand(10)
values
Out[44]:
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
Binning
In [45]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
x = pd.cut(values, bins)
x
Out[45]:
[(0.8, 1.0], (0.2, 0.4], (0.0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.4, 0.6], (0.8, 1.0], (0.6, 0.8], (0.6, 0.8], (0.6, 0.8]]
Categories (5, interval[float64]): [(0.0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1.0]]
In [46]:
x.categories
Out[46]:
IntervalIndex([(0.0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]]
closed='right',
dtype='interval[float64]')
In [47]:
x.codes
Out[47]:
array([4, 1, 0, 1, 2, 2, 4, 3, 3, 3], dtype=int8)
One-hot encode the binning result (dummy variables)
In [68]:
pd.get_dummies(x)
Out[68]:
 | (0.0, 0.2] | (0.2, 0.4] | (0.4, 0.6] | (0.6, 0.8] | (0.8, 1.0] |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 1 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 1 | 0 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 |
5 | 0 | 0 | 1 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 1 |
7 | 0 | 0 | 0 | 1 | 0 |
8 | 0 | 0 | 0 | 1 | 0 |
9 | 0 | 0 | 0 | 1 | 0 |
In [69]:
values
Out[69]:
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
Elements in the (0.8, 1.0] bin: positions 0 and 6
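That reading of the table can be confirmed programmatically. The sketch below copies the values printed above instead of relying on the random seed. It uses the last dummy column to find the positions that fall in (0.8, 1.0], and the column sums to get the per-bin counts.

```python
import numpy as np
import pandas as pd

# Values copied from the output above
values = np.array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
                   0.5955447, 0.96451452, 0.6531771, 0.74890664, 0.65356987])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

dummies = pd.get_dummies(pd.cut(values, bins), dtype=int)

# Positions whose indicator is 1 in the last column, i.e. the (0.8, 1.0] bin
in_top_bin = np.flatnonzero(dummies.iloc[:, -1].to_numpy())

# Summing each indicator column gives the number of values per bin
per_bin = dummies.sum().tolist()

print(in_top_bin)
print(per_bin)
```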