import pandas as pd
import numpy as np
df = pd.DataFrame({'total_bill': [16.99, 10.34, 23.68, 23.68, 24.59],
'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
'sex': ['Female', 'Male', 'Male', 'Male', 'Female']})
對(duì)于DataFrame,我們可以看到其固有屬性
# data type of columns
df.dtypes
# indexes,行索引
df.index
# return pandas.Index,列名稱(label)
df.columns
# each row, return array[array]
df.values
# a tuple representing the dimensionality of df
df.shape
select:
SQL中的select是根據(jù)列的名稱來選仍途濉;Pandas則更為靈活伯诬,不但可根據(jù)列名稱選取晚唇,還可以根據(jù)列所在的position選取。相關(guān)函數(shù)如下:
loc盗似,基于列l(wèi)abel哩陕,可選取特定行(根據(jù)行index);
df.loc[1:3, ['total_bill', 'tip']]
df.loc[1:3, 'tip': 'total_bill']
有更為簡潔的行/列選取方式:
df[1: 3]
df[['total_bill', 'tip']]
# df[1:2, ['total_bill', 'tip']] # TypeError: unhashable type
where:
# and
df[(df['sex'] == 'Female') & (df['total_bill'] > 20)]
# or
df[(df['sex'] == 'Female') | (df['total_bill'] > 20)]
# in
df[df['total_bill'].isin([21.01, 23.68, 24.59])]
# not
df[-(df['sex'] == 'Male')]
df[-df['total_bill'].isin([21.01, 23.68, 24.59])]
# string function
df = df[(-df['app'].isin(sys_app)) & (-df.app.str.contains('^微信\d+$'))]
join:
Pandas中join的實(shí)現(xiàn)也有兩種:
# 1.
df.join(df2, how='left'...)
# 2.
pd.merge(df1, df2, how='left', left_on='app', right_on='app')
第一種方法是按DataFrame的index進(jìn)行join的赫舒,而第二種方法才是按on指定的列做join悍及。Pandas滿足left、right接癌、inner心赶、full outer四種join方式。
order:
Pandas中支持多列order缺猛,并可以調(diào)整不同列的升序/降序缨叫,有更高的排序自由度:
df.sort_values(['total_bill', 'tip'], ascending=[False, True])
replace:
replace函數(shù)提供對(duì)dataframe全局修改,亦可通過where條件進(jìn)行過濾修改(搭配loc):
# overall replace
df.replace(to_replace='Female', value='Sansa', inplace=True)
# dict replace
df.replace({'sex': {'Female': 'Sansa', 'Male': 'Leone'}}, inplace=True)
# replace on where condition
df.loc[df.sex == 'Male', 'sex'] = 'Leone'
distinct:
df.drop_duplicates(subset=['sex'], keep='first', inplace=True)
#subset枯夜,為選定的列做distinct弯汰,默認(rèn)為所有列艰山;
#keep湖雹,值選項(xiàng){'first', 'last', False},保留重復(fù)元素中的第一個(gè)曙搬、最后一個(gè)摔吏,或全部刪除鸽嫂;
#inplace ,默認(rèn)為False征讲,返回一個(gè)新的dataframe据某;若為True,則返回去重后的原dataframe
group:
group一般會(huì)配合合計(jì)函數(shù)(Aggregate functions)使用诗箍,比如:count癣籽、avg等。Pandas對(duì)合計(jì)函數(shù)的支持有限滤祖,有count和size函數(shù)實(shí)現(xiàn)SQL的count:
df.groupby('sex').size()
df.groupby('sex').count()
df.groupby('sex')['tip'].count()
sql:
select sex, max(tip), sum(total_bill) as total
from tips_tb
group by sex;
實(shí)現(xiàn)在agg()中指定dict:
df.groupby('sex').agg({'tip': np.max, 'total_bill': np.sum})
# count(distinct **)
df.groupby('tip').agg({'sex': pd.Series.nunique})
as:
df.rename(columns={'total_bill': 'total', 'tip': 'pit', 'sex': 'xes'}, inplace=True)
作者:Treant