Difference between size() and count()
size vs. count: size includes NaN values when counting, whereas count excludes them.
user_df = user_df.groupby(['uid','time'])[['qid_time']].count()
# Returns a DataFrame
user_df = user_df.groupby(['uid','time'])[['qid_time']].size().reset_index(name='new_columns')
# size() returns a Series; .reset_index(name='new_columns') is needed to turn it into a multi-column DataFrame
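The NaN behaviour is easiest to see on a tiny table. The following is a minimal sketch on made-up data (the values and the column name n_rows are assumptions for illustration, not part of the original dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: uid 1 has one NaN in qid_time
df = pd.DataFrame({
    "uid": [1, 1, 2],
    "qid_time": [1.0, np.nan, 3.0],
})

sizes = df.groupby("uid").size()                # rows per group, NaN included
counts = df.groupby("uid")["qid_time"].count()  # non-NaN values per group

print(sizes.tolist())   # [2, 1]
print(counts.tolist())  # [1, 1]

# size() yields a Series; reset_index(name=...) turns it into a DataFrame
size_df = df.groupby("uid").size().reset_index(name="n_rows")
print(list(size_df.columns))  # ['uid', 'n_rows']
```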
Difference between df.drop() and del df['column']
1) Using del DF['column_name'] directly
This is widely considered not to be the best approach, so use it with care. See: https://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe-by-column-name
2) Using the drop method, which can be written in the following three ways:
DF = DF.drop('column_name', axis=1)
DF.drop('column_name', axis=1, inplace=True)
DF.drop(DF.columns[[0, 1]], axis=1, inplace=True)  # drop columns by position
See: https://blog.csdn.net/claroja/article/details/65661826
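As a rough illustration of the practical difference, the sketch below (toy columns a, b, c are made up) shows that drop returns a new DataFrame by default, while del modifies the DataFrame in place:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new DataFrame by default; df itself is unchanged
without_a = df.drop(columns=["a"])
print(list(without_a.columns))  # ['b', 'c']
print(list(df.columns))         # ['a', 'b', 'c']

# Dropping by position, using labels looked up from df.columns
without_first_two = df.drop(columns=df.columns[[0, 1]])
print(list(without_first_two.columns))  # ['c']

# del mutates df in place and only works with the bracket form
del df["b"]
print(list(df.columns))  # ['a', 'c']
```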
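Using loc and iloc
loc[row]['column_name'] returns the value of that field for the given row label.
user_df = user_df.groupby(['uid','time'])[['qid_time']].mean().reset_index()
# After the index is rebuilt, the index labels are simply row numbers
user_df = user_df.groupby(['uid','time'])[['qid_time']].mean()
# Here the index is not a sequential integer, so rows cannot be traversed by row number
print(type(user_df))
# Selecting several values from one row returns a Series (rarely needed)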
print(user_df.loc[2][['time','uid']])
<class 'pandas.core.series.Series'>
time 2019-01-06 12:03:35
uid 128758
# This pinpoints exactly one row and one column
print(user_df.loc[2,'time'])
2019-01-06 12:03:35
# Select multiple rows and a list of columns
print(user_df.loc[2:5,['time','uid']])
time uid
2 2019-01-06 12:03:35 128758
3 2019-01-06 12:05:35 128758
4 2019-01-06 12:07:35 128758
5 2019-01-06 12:09:35 128758
# Select multiple rows and a range of columns (label slice)
print(user_df.loc[2:5,'uid':'qid_time'])
uid time qid_time
2 128758 2019-01-06 12:03:35 2.0
3 128758 2019-01-06 12:05:35 2.0
4 128758 2019-01-06 12:07:35 2.0
df.loc[1:3, ['total_bill', 'tip']]
df.loc[1:3, 'tip': 'total_bill']
df.iloc[1:3, [1, 2]]
df.iloc[1:3, 1: 3]
df.at[3, 'tip']
df.iat[3, 1]
# df.ix was removed in pandas 1.0; use iloc or loc instead of the two calls below
# df.ix[1:3, [1, 2]]
# df.ix[1:3, ['total_bill', 'tip']]
df[1: 3]
df[['total_bill', 'tip']]
# df[1:2, ['total_bill', 'tip']] # TypeError: unhashable type
See: https://www.cnblogs.com/en-heng/p/5630849.html
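One detail worth spelling out is that loc slices by label and includes the end label, while iloc slices by position and excludes the end position. A small sketch, using made-up total_bill and tip values:

```python
import pandas as pd

# Hypothetical data with the default integer index 0..3
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
})

by_label = df.loc[1:3, ["total_bill", "tip"]]  # rows 1, 2, 3 (end label included)
by_position = df.iloc[1:3, [0, 1]]             # rows 1, 2 (end position excluded)

print(len(by_label))     # 3
print(len(by_position))  # 2

# at / iat are the scalar counterparts of loc / iloc
print(df.at[3, "tip"])   # 3.31
print(df.iat[3, 1])      # 3.31
```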
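Explaining .reset_index(drop=True)
After a computation, the index of the result may be the column (or columns) the computation was keyed on. If it is not reset, the index follows no regular pattern, which makes rows harder to locate. The index may be a single column or a composite of several columns; after a reset it becomes a single auto-incrementing index. With drop=True the original index is discarded; without it, the old index becomes ordinary column data (one or more columns) in the result. By default the index is not stored as part of the table's data (although you can choose to save it), and it does not take part in computations; it is just an attribute of the table.
- Series.reset_index()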
print(type(user_df['uid']))
print(user_df['uid'].head(3))
print(type(user_df['uid'].reset_index()))
print(user_df['uid'].reset_index().head(3))
print(type(user_df['uid'].reset_index(drop=True)))
print(user_df['uid'].reset_index(drop=True).head(3))
# Output:
<class 'pandas.core.series.Series'>
0 128758
1 128758
2 128758
<class 'pandas.core.frame.DataFrame'>
index uid
0 0 128758
1 1 128758
2 2 128758
<class 'pandas.core.series.Series'>
0 128758
1 128758
2 128758
# Another example:
user_df = user_df.groupby(['uid'])['qid_time'].mean()
<class 'pandas.core.series.Series'>
uid
128758 0.918033
181094 0.086957
182392 0.655738
user_df = user_df.groupby(['uid'])['qid_time'].mean().reset_index()
# Because a new column is added here, the Series is automatically converted to a DataFrame;
# with drop=True the data is still a single column, so it is not converted to a DataFrame
uid qid_time
0 128758 0.918033
1 181094 0.086957
2 182392 0.655738
- DataFrame.reset_index()
print(type(user_df[['uid']]))
print(user_df[['uid']].head(3))
print(type(user_df[['uid']].reset_index()))
print(user_df[['uid']].reset_index().head(3))
print(type(user_df[['uid']].reset_index(drop=True)))
print(user_df[['uid']].reset_index(drop=True).head(3))
Output:
<class 'pandas.core.frame.DataFrame'>
uid
0 128758
1 128758
2 128758
<class 'pandas.core.frame.DataFrame'>
index uid # the index column holds the original index
0 0 128758
1 1 128758
2 2 128758
<class 'pandas.core.frame.DataFrame'>
uid
0 128758
1 128758
2 128758
# Another example
# Before resetting
user_df = user_df.groupby(['uid'])[['qid_time']].mean()
qid_time
uid
128758 0.918033
181094 0.086957
182392 0.655738
# After resetting, the former index column uid becomes an ordinary column and a new auto-incrementing index is added
user_df = user_df.groupby(['uid'])[['qid_time']].mean().reset_index()
uid qid_time
0 128758 0.918033
1 181094 0.086957
2 182392 0.655738
# Resetting with drop=True discards the former index column uid
user_df = user_df.groupby(['uid'])[['qid_time']].mean().reset_index(drop=True)
qid_time
0 0.918033
1 0.086957
2 0.655738
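The same behaviour carries over to a composite index. The following sketch uses made-up data (the uid, time and qid_time values are placeholders) and compares the two variants side by side:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "uid": [1, 1, 2],
    "time": ["t1", "t2", "t1"],
    "qid_time": [2.0, 1.0, 4.0],
})

grouped = df.groupby(["uid", "time"])[["qid_time"]].mean()
print(grouped.index.nlevels)  # 2  (uid and time form a composite index)

kept = grouped.reset_index()  # index levels become ordinary columns
print(list(kept.columns))     # ['uid', 'time', 'qid_time']

dropped = grouped.reset_index(drop=True)  # index levels are discarded
print(list(dropped.columns))  # ['qid_time']
print(list(dropped.index))    # [0, 1, 2]
```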
df.groupby().agg() and df.groupby().apply()
groupdf = df.groupby(df['key1'])
for name, group in groupdf:
    print(group)  # each group is a DataFrame
    # print(name)  # name is the key of the group
Grouped statistics, keeping the group keys as columns
user_df = user_df.groupby(['uid','qid']).agg({'qid_time': ['min', 'max', 'mean']}).reset_index()
user_df.columns = ['uid', 'qid', 'min', 'max', 'mean']  # flatten the two-level column labels
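An alternative that avoids renaming the columns afterwards is named aggregation (available since pandas 0.25). This is only a sketch on invented data, not the original author's code:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "uid": [1, 1, 2],
    "qid": [10, 10, 20],
    "qid_time": [1.0, 3.0, 5.0],
})

# Each keyword names an output column: (source column, aggregation)
stats = (
    df.groupby(["uid", "qid"])
      .agg(min=("qid_time", "min"),
           max=("qid_time", "max"),
           mean=("qid_time", "mean"))
      .reset_index()
)

print(list(stats.columns))     # ['uid', 'qid', 'min', 'max', 'mean']
print(stats["mean"].tolist())  # [2.0, 5.0]
```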
Implementing SQL's group_concat
df.groupby('team').apply(lambda x: ','.join(x.user))
df.groupby('team').apply(lambda x: list(x.user))
df.groupby('team').agg({'user' : lambda x: ', '.join(x)})
See: https://stackoverflow.com/questions/18138693/replicating-group-concat-for-pandas-dataframe
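For a quick check of what these expressions produce, here is a small self-contained sketch on invented team and user values. It selects the user column first and then aggregates, which is one of several equivalent ways to write it:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "team": ["a", "a", "b"],
    "user": ["x", "y", "z"],
})

# Join the users of each team into one comma-separated string
concat = df.groupby("team")["user"].agg(",".join)
print(concat.to_dict())    # {'a': 'x,y', 'b': 'z'}

# Collect the users of each team into a list instead
as_lists = df.groupby("team")["user"].agg(list)
print(as_lists.to_dict())  # {'a': ['x', 'y'], 'b': ['z']}
```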
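When operating on DataFrame columns, when should you use single brackets and when double brackets?
- df followed by one pair of brackets selects a single column as a Series; with two pairs of brackets it still selects a single column, but the result is a DataFrame (with the column name).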
print(type(df['uid']))
<class 'pandas.core.series.Series'>
# Output:
0 128758.0
1 128758.0
2 128758.0
3 128758.0
print(type(user_df[['uid']]))
<class 'pandas.core.frame.DataFrame'>
# Output:
uid
0 128758.0
1 128758.0
2 128758.0
3 128758.0
- When grouping and aggregating with groupby
print(type(user_df['uid']))
print(type(user_df[['uid']]))
print(type(user_df['uid','qid_time']))
print(type(user_df[['uid', 'qid_time']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
Error: this syntax is not valid (it raises a KeyError)
<class 'pandas.core.frame.DataFrame'>
# Single brackets: the output carries no column label, a separate line of attributes (name, dtype) is printed at the end, and the result type is Series
user_df = user_df.groupby(['uid'])['qid_time'].mean()
print(type(user_df))
<class 'pandas.core.series.Series'>
uid
128758 0.918033
181094 0.086957
182392 0.655738
# Double brackets: if the single column being aggregated is selected with double brackets, the output carries a column label and the result type is DataFrame
user_df = user_df.groupby(['uid'])[['qid_time']].mean()
print(type(user_df))
<class 'pandas.core.frame.DataFrame'>
# Output is as follows; there is no sorted auto-incrementing index, and uid is the index here
qid_time
uid
128758 0.918033
181094 0.086957
182392 0.655738
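# To aggregate several columns, select them with a list (double brackets)
# Older pandas also accepted the single-bracket form on the next line with the same result, but it was deprecated in pandas 1.0 and raises an error in pandas 2.0+
# user_df = user_df.groupby(['uid'])['qid_time','time'].mean()
user_df = user_df.groupby(['uid'])[['qid_time','time']].mean()
# The result has at least two columns, so its type can only be DataFrame
# The key passed to groupby() should be a column label or a Series; to group by several columns pass [column1, column2], not [[column1, column2]] or df[[column1, column2]]. df[column1] also works, but only for one column, because df['column1'] is a Series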
user_df = user_df.groupby(['uid','time'])[['qid_time']].mean()
<class 'pandas.core.frame.DataFrame'>
qid_time
uid time
128758 2019-01-06 11:55:35 2.000000
2019-01-06 12:01:35 1.000000
2019-01-06 12:03:35 2.000000
user_df = user_df.groupby(['uid','time'])['qid_time'].mean()
<class 'pandas.core.series.Series'> # the type differs because of the single vs. double brackets after groupby
uid time # these two columns form a composite index
128758 2019-01-06 11:55:35 2.000000
2019-01-06 12:01:35 1.000000
2019-01-06 12:03:35 2.000000
user_df = user_df.groupby(['uid','time'])['qid_time'].mean().reset_index()
<class 'pandas.core.frame.DataFrame'>
# The two columns of the composite index become regular columns after the reset
uid time qid_time
0 128758 2019-01-06 11:55:35 2.000000
1 128758 2019-01-06 12:01:35 1.000000
2 128758 2019-01-06 12:03:35 2.000000
user_df = user_df.groupby(['uid','time'])['qid_time'].mean().reset_index(drop=True)
<class 'pandas.core.series.Series'> # both index columns are dropped and a new index is generated
0 2.000000
1 1.000000
2 2.000000
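The single- versus double-bracket rule can be checked directly. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"uid": [1, 1, 2], "qid_time": [1.0, 3.0, 5.0]})

as_series = df.groupby("uid")["qid_time"].mean()   # single brackets
as_frame = df.groupby("uid")[["qid_time"]].mean()  # double brackets

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame

# The values are identical; only the container type differs
print(as_series.tolist() == as_frame["qid_time"].tolist())  # True
```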
How do you convert a Series to a DataFrame?
For a Series with a single-column index:
s.to_frame(name='column_name')
For a Series with a multi-column composite index:
s.reset_index(name='column_name')
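The two conversions differ in what happens to the index: to_frame keeps it as the index, while reset_index moves every index level into a regular column. A short sketch with invented values (the names uid, time and qid_time are reused from the examples above purely for illustration):

```python
import pandas as pd

# Series with a single-level index named uid
s = pd.Series([0.9, 0.1], index=pd.Index([128758, 181094], name="uid"))
frame = s.to_frame(name="qid_time")
print(list(frame.columns))  # ['qid_time']  (uid stays as the index)

# Series with a two-level composite index
idx = pd.MultiIndex.from_tuples([(1, "t1"), (1, "t2")], names=["uid", "time"])
ms = pd.Series([2.0, 1.0], index=idx)
flat = ms.reset_index(name="qid_time")
print(list(flat.columns))   # ['uid', 'time', 'qid_time']
```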