分組與聚合
GroupBy對(duì)象
import pandas as pd
import numpy as np
dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
data1 data2 key1 key2
0 -0.294007 -0.102993 a one
1 -0.478387 -0.530629 b one
2 -0.544628 0.569690 a two
3 -2.129353 0.863822 b three
4 1.498616 -2.339837 a two
5 -0.050811 1.013685 b two
6 0.191150 0.088627 a one
7 1.504741 0.669640 a three
dataframe根據(jù)key1進(jìn)行分組
print(type(df_obj.groupby('key1')))
<class 'pandas.core.groupby.DataFrameGroupBy'>
data1列根據(jù)key1進(jìn)行分組
print(type(df_obj['data1'].groupby(df_obj['key1'])))
<class 'pandas.core.groupby.SeriesGroupBy'>
分組運(yùn)算
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())
grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())
data1 data2
key1
a 0.471175 -0.222975
b -0.886184 0.448960
key1
a 0.471175
b -0.886184
Name: data1, dtype: float64
size
print(grouped1.size())
print(grouped2.size())
key1
a 5
b 3
dtype: int64
key1
a 5
b 3
Name: data1, dtype: int64
按列名分組
df_obj.groupby('key1')
<pandas.core.groupby.DataFrameGroupBy object at 0x000001D850F08780>
按自定義key分組杈绸,列表
self_def_key = [1, 1, 2, 2, 2, 1, 1, 1]
df_obj.groupby(self_def_key).size()
1 5
2 3
dtype: int64
按自定義key分組罩句,多層列表
df_obj.groupby([df_obj['key1'], df_obj['key2']]).size()
key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64
按多個(gè)列多層分組
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())
key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64
多層分組按key的順序進(jìn)行
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
print()
print(grouped3.mean().unstack())
data1 data2
key2 key1
one a -0.051428 -0.007183
b -0.478387 -0.530629
three a 1.504741 0.669640
b -2.129353 0.863822
two a 0.476994 -0.885074
b -0.050811 1.013685
data1 data2
key1 a b a b
key2
one -0.051428 -0.478387 -0.007183 -0.530629
three 1.504741 -2.129353 0.669640 0.863822
two 0.476994 -0.050811 -0.885074 1.013685
GroupBy對(duì)象分組迭代
單層分組
for group_name, group_data in grouped1:
print(group_name)
print(group_data)
a
data1 data2 key1 key2
0 -0.294007 -0.102993 a one
2 -0.544628 0.569690 a two
4 1.498616 -2.339837 a two
6 0.191150 0.088627 a one
7 1.504741 0.669640 a three
b
data1 data2 key1 key2
1 -0.478387 -0.530629 b one
3 -2.129353 0.863822 b three
5 -0.050811 1.013685 b two
多層分組
for group_name, group_data in grouped2:
print(group_name)
print(group_data)
('a', 'one')
data1 data2 key1 key2
0 -0.294007 -0.102993 a one
6 0.191150 0.088627 a one
('a', 'three')
data1 data2 key1 key2
7 1.504741 0.66964 a three
('a', 'two')
data1 data2 key1 key2
2 -0.544628 0.569690 a two
4 1.498616 -2.339837 a two
('b', 'one')
data1 data2 key1 key2
1 -0.478387 -0.530629 b one
('b', 'three')
data1 data2 key1 key2
3 -2.129353 0.863822 b three
('b', 'two')
data1 data2 key1 key2
5 -0.050811 1.013685 b two
GroupBy對(duì)象轉(zhuǎn)換list
list(grouped1)
[('a', data1 data2 key1 key2
0 -0.294007 -0.102993 a one
2 -0.544628 0.569690 a two
4 1.498616 -2.339837 a two
6 0.191150 0.088627 a one
7 1.504741 0.669640 a three), ('b', data1 data2 key1 key2
1 -0.478387 -0.530629 b one
3 -2.129353 0.863822 b three
5 -0.050811 1.013685 b two)]
GroupBy對(duì)象轉(zhuǎn)換dict
dict(list(grouped1))
{'a': data1 data2 key1 key2
0 -0.294007 -0.102993 a one
2 -0.544628 0.569690 a two
4 1.498616 -2.339837 a two
6 0.191150 0.088627 a one
7 1.504741 0.669640 a three, 'b': data1 data2 key1 key2
1 -0.478387 -0.530629 b one
3 -2.129353 0.863822 b three
5 -0.050811 1.013685 b two}
按列分組
print(df_obj.dtypes)
data1 float64
data2 float64
key1 object
key2 object
dtype: object
按數(shù)據(jù)類型分組
df_obj.groupby(df_obj.dtypes, axis=1).size()
df_obj.groupby(df_obj.dtypes, axis=1).sum()
其他分組方法
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.NaN
df_obj2
通過(guò)字典分組
mapping_dict = {'a':'python', 'b':'python', 'c':'java', 'd':'C', 'e':'java'}
df_obj2.groupby(mapping_dict, axis=1).size()
df_obj2.groupby(mapping_dict, axis=1).count() # 非NaN的個(gè)數(shù)
df_obj2.groupby(mapping_dict, axis=1).sum()
通過(guò)函數(shù)分組
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['AA', 'BBB', 'CC', 'D', 'EE'])
#df_obj3
def group_key(idx):
"""
idx 為列索引或行索引
"""
#return idx
return len(idx)
df_obj3.groupby(group_key).size()
# 以上自定義函數(shù)等價(jià)于
#df_obj3.groupby(len).size()
1 1
2 3
3 1
dtype: int64
通過(guò)索引級(jí)別分組
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
['A', 'A', 'B', 'C', 'B']], names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
df_obj4
根據(jù)language進(jìn)行分組
df_obj4.groupby(level='language', axis=1).sum()
df_obj4.groupby(level='index', axis=1).sum()
聚合
dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randint(1,10, 8),
'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
data1 data2 key1 key2
0 1 8 a one
1 4 8 b one
2 3 7 a two
3 6 7 b three
4 8 1 a two
5 9 3 b two
6 7 3 a one
7 8 2 a three
內(nèi)置的聚合函數(shù)
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
data1 data2
key1
a 27 21
b 19 18
data1 data2 key2
key1
a 8 8 two
b 9 8 two
data1 data2 key2
key1
a 1 1 one
b 4 3 one
data1 data2
key1
a 5.400000 4.2
b 6.333333 6.0
key1
a 5
b 3
dtype: int64
data1 data2 key2
key1
a 5 5 5
b 3 3 3
data1 data2 \
count mean std min 25% 50% 75% max count mean std
key1
a 5.0 5.400000 3.209361 1.0 3.0 7.0 8.0 8.0 5.0 4.2 3.114482
b 3.0 6.333333 2.516611 4.0 5.0 6.0 7.5 9.0 3.0 6.0 2.645751
min 25% 50% 75% max
key1
a 1.0 2.0 3.0 7.0 8.0
b 3.0 5.0 7.0 7.5 8.0
自定義聚合函數(shù)
def peak_range(df):
"""
返回?cái)?shù)值范圍
"""
#print type(df) #參數(shù)為索引所對(duì)應(yīng)的記錄
return df.max() - df.min()
print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df : df.max() - df.min()))
data1 data2
key1
a 7 7
b 5 5
data1 data2
key1
a 2.049369 3.009477
b 2.078541 1.544314
應(yīng)用多個(gè)聚合函數(shù)
同時(shí)應(yīng)用多個(gè)聚合函數(shù)
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range])) # 默認(rèn)列名為函數(shù)名
data1 data2
mean std count peak_range mean std count peak_range
key1
a 0.471175 0.977198 5 2.049369 -0.222975 1.226547 5 3.009477
b -0.886184 1.097637 3 2.078541 0.448960 0.851651 3 1.544314
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)])) # 通過(guò)元組提供新的列名
data1 data2
mean std count range mean std count range
key1
a 0.471175 0.977198 5 2.049369 -0.222975 1.226547 5 3.009477
b -0.886184 1.097637 3 2.078541 0.448960 0.851651 3 1.544314
每列作用不同的聚合函數(shù)
dict_mapping = {'data1':'mean',
'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
data1 data2
key1
a 0.471175 -1.114873
b -0.886184 1.346879
dict_mapping = {'data1':['mean','max'],
'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
data1 data2
mean max sum
key1
a 0.471175 1.504741 -1.114873
b -0.886184 -0.050811 1.346879