第01章 Pandas基礎
第02章 DataFrame運算
第03章 數據分析入門
第04章 選取數據子集
第05章 布爾索引
第06章 索引對齊
第07章 分組聚合、過濾、轉換
第08章 數據清理
第09章 合并Pandas對象
第10章 時間序列分析
第11章 用Matplotlib、Pandas、Seaborn進行可視化
In[1]: import pandas as pd
import numpy as np
1. 定義聚合
# 讀取flights數據集,查詢頭部
In[2]: flights = pd.read_csv('data/flights.csv')
flights.head()
Out[2]:
# 按照AIRLINE分組,使用agg方法,傳入要聚合的列和聚合函數
In[3]: flights.groupby('AIRLINE').agg({'ARR_DELAY':'mean'}).head()
Out[3]:
# 或者要選取的列使用索引,聚合函數作為字符串傳入agg
In[4]: flights.groupby('AIRLINE')['ARR_DELAY'].agg('mean').head()
Out[4]:
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
Name: ARR_DELAY, dtype: float64
# 也可以向agg中傳入NumPy的mean函數
In[5]: flights.groupby('AIRLINE')['ARR_DELAY'].agg(np.mean).head()
Out[5]:
# 也可以直接使用mean()函數
In[6]: flights.groupby('AIRLINE')['ARR_DELAY'].mean().head()
Out[6]:
原理
# groupby方法產生的是一個DataFrameGroupBy對象
In[7]: grouped = flights.groupby('AIRLINE')
type(grouped)
Out[7]: pandas.core.groupby.DataFrameGroupBy
更多
# 如果agg接收的不是聚合函數,則會導致異常
In[8]: flights.groupby('AIRLINE')['ARR_DELAY'].agg(np.sqrt)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py:842: RuntimeWarning: invalid value encountered in sqrt
f = lambda x: func(x, *args, **kwargs)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py:3015: RuntimeWarning: invalid value encountered in sqrt
output = func(group, *args, **kwargs)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in agg_series(self, obj, func)
2177 try:
-> 2178 return self._aggregate_series_fast(obj, func)
2179 except Exception:
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_series_fast(self, obj, func)
2197 dummy)
-> 2198 result, counts = grouper.get_result()
2199 return result, counts
pandas/_libs/src/reduce.pyx in pandas._libs.lib.SeriesGrouper.get_result (pandas/_libs/lib.c:39105)()
pandas/_libs/src/reduce.pyx in pandas._libs.lib.SeriesGrouper.get_result (pandas/_libs/lib.c:38973)()
pandas/_libs/src/reduce.pyx in pandas._libs.lib._get_result_array (pandas/_libs/lib.c:32039)()
ValueError: function does not reduce
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2882 try:
-> 2883 return self._python_agg_general(func_or_funcs, *args, **kwargs)
2884 except Exception:
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _python_agg_general(self, func, *args, **kwargs)
847 try:
--> 848 result, counts = self.grouper.agg_series(obj, f)
849 output[name] = self._try_cast(result, obj, numeric_only=True)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in agg_series(self, obj, func)
2179 except Exception:
-> 2180 return self._aggregate_series_pure_python(obj, func)
2181
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_series_pure_python(self, obj, func)
2214 isinstance(res, list)):
-> 2215 raise ValueError('Function does not reduce')
2216 result = np.empty(ngroups, dtype='O')
ValueError: Function does not reduce
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
<ipython-input-8-2bcc9ccfec77> in <module>()
----> 1 flights.groupby('AIRLINE')['ARR_DELAY'].agg(np.sqrt)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2883 return self._python_agg_general(func_or_funcs, *args, **kwargs)
2884 except Exception:
-> 2885 result = self._aggregate_named(func_or_funcs, *args, **kwargs)
2886
2887 index = Index(sorted(result), name=self.grouper.names[0])
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_named(self, func, *args, **kwargs)
3015 output = func(group, *args, **kwargs)
3016 if isinstance(output, (Series, Index, np.ndarray)):
-> 3017 raise Exception('Must produce aggregated value')
3018 result[name] = self._try_cast(output, group)
3019
Exception: Must produce aggregated value
2. 用多個列和函數進行分組和聚合
# 導入數據
In[9]: flights = pd.read_csv('data/flights.csv')
flights.head()
Out[9]:
# 每家航空公司每周平均每天取消的航班數
In[10]: flights.groupby(['AIRLINE', 'WEEKDAY'])['CANCELLED'].agg('sum').head(7)
Out[10]: AIRLINE WEEKDAY
AA 1 41
2 9
3 16
4 20
5 18
6 21
7 29
Name: CANCELLED, dtype: int64
# 分組可以是多組,選取可以是多組,聚合函數也可以是多個
# 每周每家航空公司取消或改變航線的航班總數和比例
In[11]: flights.groupby(['AIRLINE', 'WEEKDAY'])['CANCELLED', 'DIVERTED'].agg(['sum', 'mean']).head(7)
Out[11]:
# 用列表和嵌套字典對多列分組和聚合
# 對于每條航線,找到總航班數,取消的數量和比例,飛行時間的平均時間和方差
In[12]: group_cols = ['ORG_AIR', 'DEST_AIR']
agg_dict = {'CANCELLED':['sum', 'mean', 'size'],
'AIR_TIME':['mean', 'var']}
flights.groupby(group_cols).agg(agg_dict).head()
# flights.groupby(['ORG_AIR', 'DEST_AIR']).agg({'CANCELLED': ['sum', 'mean', 'size'],
# 'AIR_TIME':['mean', 'var']}).head()
Out[12]:
3. 分組后去除多級索引
# 讀取數據
In[13]: flights = pd.read_csv('data/flights.csv')
flights.head()
Out[13]:
# 按'AIRLINE', 'WEEKDAY'分組澈灼,分別對DIST和ARR_DELAY聚合
In[14]: airline_info = flights.groupby(['AIRLINE', 'WEEKDAY'])\
.agg({'DIST':['sum', 'mean'],
'ARR_DELAY':['min', 'max']}).astype(int)
airline_info.head()
Out[14]:
# 行和列都有兩級索引竞川,get_level_values(0)取出第一級索引
In[15]: level0 = airline_info.columns.get_level_values(0)
level0
Out[15]: Index(['DIST', 'DIST', 'ARR_DELAY', 'ARR_DELAY'], dtype='object')
# get_level_values(1)取出第二級索引
In[16]: level1 = airline_info.columns.get_level_values(1)
level1
Out[16]: Index(['sum', 'mean', 'min', 'max'], dtype='object')
# 一級和二級索引拼接成新的列索引
In[17]: airline_info.columns = level0 + '_' + level1
In[18]: airline_info.head(7)
Out[18]:
# reset_index()可以將行索引變成單級
In[19]: airline_info.reset_index().head(7)
Out[19]:
更多
# Pandas默認會在分組運算后,將所有分組的列放在索引中,as_index設為False可以避免這么做。分組后使用reset_index,也可以達到同樣的效果
In[20]: flights.groupby(['AIRLINE'], as_index=False)['DIST'].agg('mean').round(0)
Out[20]:
# 上面這么做,會默認對AIRLINE排序,sort設為False可以避免排序
In[21]: flights.groupby(['AIRLINE'], as_index=False, sort=False)['DIST'].agg('mean')
Out[21]:
4. 自定義聚合函數
In[22]: college = pd.read_csv('data/college.csv')
college.head()
Out[22]:
# 求出每個州的本科生的平均值和標準差
In[23]: college.groupby('STABBR')['UGDS'].agg(['mean', 'std']).round(0).head()
Out[23]:
# 遠離平均值的標準差的最大個數,寫一個自定義函數
In[24]: def max_deviation(s):
std_score = (s - s.mean()) / s.std()
return std_score.abs().max()
# agg聚合函數在調用方法時,直接引入自定義的函數名
In[25]: college.groupby('STABBR')['UGDS'].agg(max_deviation).round(1).head()
Out[25]: STABBR
AK 2.6
AL 5.8
AR 6.3
AS NaN
AZ 9.9
Name: UGDS, dtype: float64
更多
# 自定義的聚合函數也適用于多個數值列
In[26]: college.groupby('STABBR')['UGDS', 'SATVRMID', 'SATMTMID'].agg(max_deviation).round(1).head()
Out[26]:
# 自定義聚合函數也可以和預先定義的函數一起使用
In[27]: college.groupby(['STABBR', 'RELAFFIL'])['UGDS', 'SATVRMID', 'SATMTMID']\
.agg([max_deviation, 'mean', 'std']).round(1).head()
Out[27]:
# Pandas使用函數名作為返回列的名字;你可以直接使用rename方法修改,或通過__name__屬性修改
In[28]: max_deviation.__name__
Out[28]: 'max_deviation'
In[29]: max_deviation.__name__ = 'Max Deviation'
In[30]: college.groupby(['STABBR', 'RELAFFIL'])['UGDS', 'SATVRMID', 'SATMTMID']\
.agg([max_deviation, 'mean', 'std']).round(1).head()
Out[30]:
5. 用 *args 和 **kwargs 自定義聚合函數
# 用inspect模塊查看groupby對象的agg方法的簽名
In[31]: college = pd.read_csv('data/college.csv')
grouped = college.groupby(['STABBR', 'RELAFFIL'])
In[32]: import inspect
inspect.signature(grouped.agg)
Out[32]: <Signature (arg, *args, **kwargs)>
如何做
# 自定義一個返回去本科生人數在1000和3000之間的比例的函數
In[33]: def pct_between_1_3k(s):
return s.between(1000, 3000).mean()
# 用州和宗教分組,再聚合
In[34]: college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between_1_3k).head(9)
Out[34]:
STABBR RELAFFIL
AK 0 0.142857
1 0.000000
AL 0 0.236111
1 0.333333
AR 0 0.279412
1 0.111111
AS 0 1.000000
AZ 0 0.096774
1 0.000000
Name: UGDS, dtype: float64
# 但是這個函數不能讓用戶自定義上下限,再新寫一個函數
In[35]: def pct_between(s, low, high):
return s.between(low, high).mean()
# 使用這個自定義聚合函數,并傳入最大和最小值
In[36]: college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, 10000).head(9)
Out[36]:
STABBR RELAFFIL
AK 0 0.428571
1 0.000000
AL 0 0.458333
1 0.375000
AR 0 0.397059
1 0.166667
AS 0 1.000000
AZ 0 0.233871
1 0.111111
Name: UGDS, dtype: float64
原理
# 顯示指定最大和最小值
In[37]: college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, high=10000, low=1000).head(9)
Out[37]:
STABBR RELAFFIL
AK 0 0.428571
1 0.000000
AL 0 0.458333
1 0.375000
AR 0 0.397059
1 0.166667
AS 0 1.000000
AZ 0 0.233871
1 0.111111
Name: UGDS, dtype: float64
# 也可以關鍵字參數和非關鍵字參數混合使用,只要非關鍵字參數在后面
In[38]: college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, high=10000).head(9)
Out[38]:
STABBR RELAFFIL
AK 0 0.428571
1 0.000000
AL 0 0.458333
1 0.375000
AR 0 0.397059
1 0.166667
AS 0 1.000000
AZ 0 0.233871
1 0.111111
Name: UGDS, dtype: float64
更多
# Pandas不支持多重聚合時,使用參數
In[39]: college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(['mean', pct_between], low=100, high=1000)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-3e3e18919cf9> in <module>()
----> 1 college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(['mean', pct_between], low=100, high=1000)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2871 if hasattr(func_or_funcs, '__iter__'):
2872 ret = self._aggregate_multiple_funcs(func_or_funcs,
-> 2873 (_level or 0) + 1)
2874 else:
2875 cyfunc = self._is_cython_func(func_or_funcs)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_multiple_funcs(self, arg, _level)
2944 obj._reset_cache()
2945 obj._selection = name
-> 2946 results[name] = obj.aggregate(func)
2947
2948 if isinstance(list(compat.itervalues(results))[0],
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2878
2879 if self.grouper.nkeys > 1:
-> 2880 return self._python_agg_general(func_or_funcs, *args, **kwargs)
2881
2882 try:
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _python_agg_general(self, func, *args, **kwargs)
852
853 if len(output) == 0:
--> 854 return self._python_apply_general(f)
855
856 if self.grouper._filter_empty_groups:
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _python_apply_general(self, f)
718 def _python_apply_general(self, f):
719 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 720 self.axis)
721
722 return self._wrap_applied_output(
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in apply(self, f, data, axis)
1800 # group might be modified
1801 group_axes = _get_axes(group)
-> 1802 res = f(group)
1803 if not _is_indexed_like(res, group_axes):
1804 mutated = True
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in <lambda>(x)
840 def _python_agg_general(self, func, *args, **kwargs):
841 func = self._is_builtin_func(func)
--> 842 f = lambda x: func(x, *args, **kwargs)
843
844 # iterate through "columns" ex exclusions to populate output dict
TypeError: pct_between() missing 2 required positional arguments: 'low' and 'high'
# 用閉包自定義聚合函數
In[40]: def make_agg_func(func, name, *args, **kwargs):
def wrapper(x):
return func(x, *args, **kwargs)
wrapper.__name__ = name
return wrapper
my_agg1 = make_agg_func(pct_between, 'pct_1_3k', low=1000, high=3000)
my_agg2 = make_agg_func(pct_between, 'pct_10_30k', 10000, 30000)
# 用兩個自定義聚合函數進行多重聚合
In[41]: college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg([my_agg1, my_agg2]).head()
Out[41]:
6. 檢查分組對象
# 查看分組對象的類型
In[42]: college = pd.read_csv('data/college.csv')
grouped = college.groupby(['STABBR', 'RELAFFIL'])
type(grouped)
Out[42]: pandas.core.groupby.DataFrameGroupBy
# 用dir函數找到該對象所有的可用函數
In[43]: print([attr for attr in dir(grouped) if not attr.startswith('_')])
['CITY', 'CURROPER', 'DISTANCEONLY', 'GRAD_DEBT_MDN_SUPP', 'HBCU', 'INSTNM', 'MD_EARN_WNE_P10', 'MENONLY', 'PCTFLOAN', 'PCTPELL', 'PPTUG_EF', 'RELAFFIL', 'SATMTMID', 'SATVRMID', 'STABBR', 'UG25ABV', 'UGDS', 'UGDS_2MOR', 'UGDS_AIAN', 'UGDS_ASIAN', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_NHPI', 'UGDS_NRA', 'UGDS_UNKN', 'UGDS_WHITE', 'WOMENONLY', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']
# 用ngroups屬性查看分組的數量
In[44]: grouped.ngroups
Out[44]: 112
# 查看每個分組的唯一識別標簽,groups屬性是一個字典,包含每個獨立分組與行索引標簽的對應
In[45]: groups = list(grouped.groups.keys())
groups[:6]
Out[45]: [('AK', 0), ('AK', 1), ('AL', 0), ('AL', 1), ('AR', 0), ('AR', 1)]
# 用get_group,傳入分組標簽的元組。例如,獲取佛羅里達州所有與宗教相關的學校
In[46]: grouped.get_group(('FL', 1)).head()
Out[46]:
# groupby對象是一個可迭代對象,可以挨個查看每個獨立分組
In[47]: from IPython.display import display
In[48]: i = 0
for name, group in grouped:
print(name)
display(group.head(2))
i += 1
if i == 5:
break
# groupby對象使用head方法,可以在一個DataFrame中顯示每個分組的頭幾行
In[49]: grouped.head(2).head(6)
Out[49]:
更多
# nth方法可以選出每個分組指定行的數據,下面選出的是第1行和最后1行
In[50]: grouped.nth([1, -1]).head(8)
Out[50]:
7. 過濾狀態(tài)
In[51]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
grouped = college.groupby('STABBR')
grouped.ngroups
Out[51]: 59
# 這等于求出不同州的個數,nunique()可以得到同樣的結果
In[52]: college['STABBR'].nunique()
Out[52]: 59
# 自定義一個計算少數民族學生總比例的函數,如果比例大于閾值,還返回True
In[53]: def check_minority(df, threshold):
minority_pct = 1 - df['UGDS_WHITE']
total_minority = (df['UGDS'] * minority_pct).sum()
total_ugds = df['UGDS'].sum()
total_minority_pct = total_minority / total_ugds
return total_minority_pct > threshold
# grouped變量有一個filter方法,可以接收一個自定義函數,決定是否保留一個分組
In[54]: college_filtered = grouped.filter(check_minority, threshold=.5)
college_filtered.head()
Out[54]:
# 通過查看形狀,可以看到過濾了60%,只有20個州的少數學生占據多數
In[55]: college.shape
Out[55]: (7535, 26)
In[56]: college_filtered.shape
Out[56]: (3028, 26)
In[57]: college_filtered['STABBR'].nunique()
Out[57]: 20
更多
# 用一些不同的閾值,檢查形狀和不同州的個數
In[58]: college_filtered_20 = grouped.filter(check_minority, threshold=.2)
college_filtered_20.shape
Out[58]: (7461, 26)
In[59]: college_filtered_20['STABBR'].nunique()
Out[59]: 57
In[60]: college_filtered_70 = grouped.filter(check_minority, threshold=.7)
college_filtered_70.shape
Out[60]: (957, 26)
In[61]: college_filtered_70['STABBR'].nunique()
Out[61]: 10
In[62]: college_filtered_95 = grouped.filter(check_minority, threshold=.95)
college_filtered_95.shape
Out[62]: (156, 26)
8. 減肥對賭
# 讀取減肥數據集,查看一月的數據
In[63]: weight_loss = pd.read_csv('data/weight_loss.csv')
weight_loss.query('Month == "Jan"')
Out[63]:
# 定義一個求減肥比例的函數
In[64]: def find_perc_loss(s):
return (s - s.iloc[0]) / s.iloc[0]
# 查看Bob在一月的減肥成果
In[65]: bob_jan = weight_loss.query('Name=="Bob" and Month=="Jan"')
find_perc_loss(bob_jan['Weight'])
Out[65]: 0 0.000000
2 -0.010309
4 -0.027491
6 -0.027491
Name: Weight, dtype: float64
# 對Name和Month進行分組,然后使用transform方法,傳入函數,對數值進行轉換
In[66]: pcnt_loss = weight_loss.groupby(['Name', 'Month'])['Weight'].transform(find_perc_loss)
pcnt_loss.head(8)
Out[66]: 0 0.000000
1 0.000000
2 -0.010309
3 -0.040609
4 -0.027491
5 -0.040609
6 -0.027491
7 -0.035533
Name: Weight, dtype: float64
# transform之后的結果,行數不變,可以賦值給原始DataFrame作為一個新列;
# 為了縮短輸出蹦玫,只選擇Bob的前兩個月數據
In[67]: weight_loss['Perc Weight Loss'] = pcnt_loss.round(3)
weight_loss.query('Name=="Bob" and Month in ["Jan", "Feb"]')
Out[67]:
# 因為最重要的是每個月的第4周,只選擇第4周的數據
In[68]: week4 = weight_loss.query('Week == "Week 4"')
week4
Out[68]:
# 用pivot重構DataFrame,讓Amy和Bob的數據并排放置
In[69]: winner = week4.pivot(index='Month', columns='Name', values='Perc Weight Loss')
winner
Out[69]:
# 用where方法選出每月的贏家
In[70]: winner['Winner'] = np.where(winner['Amy'] < winner['Bob'], 'Amy', 'Bob')
winner.style.highlight_min(axis=1)
Out[70]:
# 用value_counts()返回最后的比分
In[71]: winner.Winner.value_counts()
Out[71]: Amy 3
Bob 1
Name: Winner, dtype: int64
更多
# Pandas默認是按字母排序的
In[72]: week4a = week4.copy()
month_chron = week4a['Month'].unique()
month_chron
Out[72]: array(['Jan', 'Feb', 'Mar', 'Apr'], dtype=object)
# 轉換為Categorical變量,可以做成按時間排序
In[73]: week4a['Month'] = pd.Categorical(week4a['Month'],
categories=month_chron,
ordered=True)
week4a.pivot(index='Month', columns='Name', values='Perc Weight Loss')
Out[73]:
9. 用apply計算每州的加權平均SAT分數
# 讀取college,'UGDS', 'SATMTMID', 'SATVRMID'三列如果有缺失值則刪除行
In[74]: college = pd.read_csv('data/college.csv')
subset = ['UGDS', 'SATMTMID', 'SATVRMID']
college2 = college.dropna(subset=subset)
college.shape
Out[74]: (7535, 27)
In[75]: college2.shape
Out[75]: (1184, 27)
# 自定義一個求SAT數學成績的加權平均值的函數
In[76]: def weighted_math_average(df):
weighted_math = df['UGDS'] * df['SATMTMID']
return int(weighted_math.sum() / df['UGDS'].sum())
# 按州分組,并調用apply方法,傳入自定義函數
In[77]: college2.groupby('STABBR').apply(weighted_math_average).head()
Out[77]: STABBR
AK 503
AL 536
AR 529
AZ 569
CA 564
dtype: int64
# 效果同上
In[78]: college2.groupby('STABBR').agg(weighted_math_average).head()
Out[78]:
# 如果將列限制到SATMTMID,會報錯。這是因為不能訪問UGDS。
In[79]: college2.groupby('STABBR')['SATMTMID'].agg(weighted_math_average)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:14010)()
TypeError: an integer is required
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in agg_series(self, obj, func)
2177 try:
-> 2178 return self._aggregate_series_fast(obj, func)
2179 except Exception:
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_series_fast(self, obj, func)
2197 dummy)
-> 2198 result, counts = grouper.get_result()
2199 return result, counts
pandas/_libs/src/reduce.pyx in pandas._libs.lib.SeriesGrouper.get_result (pandas/_libs/lib.c:39105)()
pandas/_libs/src/reduce.pyx in pandas._libs.lib.SeriesGrouper.get_result (pandas/_libs/lib.c:38888)()
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in <lambda>(x)
841 func = self._is_builtin_func(func)
--> 842 f = lambda x: func(x, *args, **kwargs)
843
<ipython-input-76-01eb90aa258d> in weighted_math_average(df)
1 def weighted_math_average(df):
----> 2 weighted_math = df['UGDS'] * df['SATMTMID']
3 return int(weighted_math.sum() / df['UGDS'].sum())
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
600 try:
--> 601 result = self.index.get_value(self, key)
602
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
2476 return self._engine.get_value(s, k,
-> 2477 tz=getattr(series.dtype, 'tz', None))
2478 except KeyError as e1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4404)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4087)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5210)()
KeyError: 'UGDS'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:14010)()
TypeError: an integer is required
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2882 try:
-> 2883 return self._python_agg_general(func_or_funcs, *args, **kwargs)
2884 except Exception:
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _python_agg_general(self, func, *args, **kwargs)
847 try:
--> 848 result, counts = self.grouper.agg_series(obj, f)
849 output[name] = self._try_cast(result, obj, numeric_only=True)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in agg_series(self, obj, func)
2179 except Exception:
-> 2180 return self._aggregate_series_pure_python(obj, func)
2181
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_series_pure_python(self, obj, func)
2210 for label, group in splitter:
-> 2211 res = func(group)
2212 if result is None:
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in <lambda>(x)
841 func = self._is_builtin_func(func)
--> 842 f = lambda x: func(x, *args, **kwargs)
843
<ipython-input-76-01eb90aa258d> in weighted_math_average(df)
1 def weighted_math_average(df):
----> 2 weighted_math = df['UGDS'] * df['SATMTMID']
3 return int(weighted_math.sum() / df['UGDS'].sum())
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
600 try:
--> 601 result = self.index.get_value(self, key)
602
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
2476 return self._engine.get_value(s, k,
-> 2477 tz=getattr(series.dtype, 'tz', None))
2478 except KeyError as e1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4404)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4087)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5210)()
KeyError: 'UGDS'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:14010)()
TypeError: an integer is required
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-79-1351e4f306c7> in <module>()
----> 1 college2.groupby('STABBR')['SATMTMID'].agg(weighted_math_average)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2883 return self._python_agg_general(func_or_funcs, *args, **kwargs)
2884 except Exception:
-> 2885 result = self._aggregate_named(func_or_funcs, *args, **kwargs)
2886
2887 index = Index(sorted(result), name=self.grouper.names[0])
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_named(self, func, *args, **kwargs)
3013 for name, group in self:
3014 group.name = name
-> 3015 output = func(group, *args, **kwargs)
3016 if isinstance(output, (Series, Index, np.ndarray)):
3017 raise Exception('Must produce aggregated value')
<ipython-input-76-01eb90aa258d> in weighted_math_average(df)
1 def weighted_math_average(df):
----> 2 weighted_math = df['UGDS'] * df['SATMTMID']
3 return int(weighted_math.sum() / df['UGDS'].sum())
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
2475 try:
2476 return self._engine.get_value(s, k,
-> 2477 tz=getattr(series.dtype, 'tz', None))
2478 except KeyError as e1:
2479 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4404)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4087)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5210)()
KeyError: 'UGDS'
# apply的一個不錯的功能是通過返回Series,創建多個新的列
In[80]: from collections import OrderedDict
def weighted_average(df):
data = OrderedDict()
weight_m = df['UGDS'] * df['SATMTMID']
weight_v = df['UGDS'] * df['SATVRMID']
data['weighted_math_avg'] = weight_m.sum() / df['UGDS'].sum()
data['weighted_verbal_avg'] = weight_v.sum() / df['UGDS'].sum()
data['math_avg'] = df['SATMTMID'].mean()
data['verbal_avg'] = df['SATVRMID'].mean()
data['count'] = len(df)
return pd.Series(data, dtype='int')
college2.groupby('STABBR').apply(weighted_average).head(10)
Out[80]:
# 多創(chuàng)建兩個新的列
In[81]: from collections import OrderedDict
def weighted_average(df):
data = OrderedDict()
weight_m = df['UGDS'] * df['SATMTMID']
weight_v = df['UGDS'] * df['SATVRMID']
wm_avg = weight_m.sum() / df['UGDS'].sum()
wv_avg = weight_v.sum() / df['UGDS'].sum()
data['weighted_math_avg'] = wm_avg
data['weighted_verbal_avg'] = wv_avg
data['math_avg'] = df['SATMTMID'].mean()
data['verbal_avg'] = df['SATVRMID'].mean()
data['count'] = len(df)
return pd.Series(data, dtype='int')
college2.groupby('STABBR').apply(weighted_average).head(10)
Out[81]:
更多
# 自定義一個返回DataFrame的函數,使用NumPy的函數average計算加權平均值,使用SciPy的gmean和hmean計算幾何和調和平均值
In[82]: from scipy.stats import gmean, hmean
def calculate_means(df):
df_means = pd.DataFrame(index=['Arithmetic', 'Weighted', 'Geometric', 'Harmonic'])
cols = ['SATMTMID', 'SATVRMID']
for col in cols:
arithmetic = df[col].mean()
weighted = np.average(df[col], weights=df['UGDS'])
geometric = gmean(df[col])
harmonic = hmean(df[col])
df_means[col] = [arithmetic, weighted, geometric, harmonic]
df_means['count'] = len(df)
return df_means.astype(int)
college2.groupby('STABBR').filter(lambda x: len(x) != 1).groupby('STABBR').apply(calculate_means).head(10)
Out[82]:
10. 用連續(xù)變量分組
In[83]: flights = pd.read_csv('data/flights.csv')
flights.head()
Out[83]:
# 判斷DIST列有無缺失值
In[84]: flights.DIST.hasnans
Out[84]: False
# 再次刪除DIST列的缺失值(原書是沒有這兩段的)
In[85]: flights.dropna(subset=['DIST']).shape
Out[85]: (58492, 14)
# 使用Pandas的cut函數,將數據分成5個面元
In[86]: bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
cuts = pd.cut(flights['DIST'], bins=bins)
cuts.head()
Out[86]: 0 (500.0, 1000.0]
1 (1000.0, 2000.0]
2 (500.0, 1000.0]
3 (1000.0, 2000.0]
4 (1000.0, 2000.0]
Name: DIST, dtype: category
Categories (5, interval[float64]): [(-inf, 200.0] < (200.0, 500.0] < (500.0, 1000.0] < (1000.0, 2000.0] < (2000.0, inf]]
# 對每個面元進行統計
In[87]: cuts.value_counts()
Out[87]: (500.0, 1000.0] 20659
(200.0, 500.0] 15874
(1000.0, 2000.0] 14186
(2000.0, inf] 4054
(-inf, 200.0] 3719
Name: DIST, dtype: int64
# 面元Series可以用來進行分組
In[88]: flights.groupby(cuts)['AIRLINE'].value_counts(normalize=True).round(3).head(15)
Out[88]: DIST AIRLINE
(-inf, 200.0] OO 0.326
EV 0.289
MQ 0.211
DL 0.086
AA 0.052
UA 0.027
WN 0.009
(200.0, 500.0] WN 0.194
DL 0.189
OO 0.159
EV 0.156
MQ 0.100
AA 0.071
UA 0.062
VX 0.028
Name: AIRLINE, dtype: float64
原理
In[89]: flights.groupby(cuts)['AIRLINE'].value_counts(normalize=True)
Out[89]:
DIST AIRLINE
(-inf, 200.0] OO 0.325625
EV 0.289325
MQ 0.210809
DL 0.086045
AA 0.052165
UA 0.027427
WN 0.008604
(200.0, 500.0] WN 0.193902
DL 0.188736
OO 0.158687
EV 0.156293
MQ 0.100164
AA 0.071375
UA 0.062051
VX 0.028222
US 0.016001
NK 0.011843
B6 0.006867
F9 0.004914
AS 0.000945
(500.0, 1000.0] DL 0.205625
AA 0.143908
WN 0.138196
UA 0.131129
OO 0.106443
EV 0.100683
MQ 0.051213
F9 0.038192
NK 0.029527
US 0.025316
AS 0.023234
VX 0.003582
B6 0.002953
(1000.0, 2000.0] AA 0.263781
UA 0.199070
DL 0.165092
WN 0.159664
OO 0.046454
NK 0.045115
US 0.040462
F9 0.030664
AS 0.015931
EV 0.015579
VX 0.012125
B6 0.003313
MQ 0.002749
(2000.0, inf] UA 0.289097
AA 0.211643
DL 0.171436
B6 0.080414
VX 0.073754
US 0.065121
WN 0.046374
HA 0.027627
NK 0.019240
AS 0.011593
F9 0.003700
Name: AIRLINE, dtype: float64
更多
# 求飛行時間的0.25,0.5,0.75分位數
In[90]: flights.groupby(cuts)['AIR_TIME'].quantile(q=[.25, .5, .75]).div(60).round(2)
Out[90]: DIST
(-inf, 200.0] 0.25 0.43
0.50 0.50
0.75 0.57
(200.0, 500.0] 0.25 0.77
0.50 0.92
0.75 1.05
(500.0, 1000.0] 0.25 1.43
0.50 1.65
0.75 1.92
(1000.0, 2000.0] 0.25 2.50
0.50 2.93
0.75 3.40
(2000.0, inf] 0.25 4.30
0.50 4.70
0.75 5.03
Name: AIR_TIME, dtype: float64
# unstack方法可以將內層的索引變為列名
In[91]: labels=['Under an Hour', '1 Hour', '1-2 Hours', '2-4 Hours', '4+ Hours']
cuts2 = pd.cut(flights['DIST'], bins=bins, labels=labels)
flights.groupby(cuts2)['AIRLINE'].value_counts(normalize=True).round(3).unstack().style.highlight_max(axis=1)
Out[91]:
11. 計算城市之間的航班總數
In[92]: flights = pd.read_csv('data/flights.csv')
flights.head()
Out[92]:
# 求每兩個城市間的航班總數
In[93]: flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
flights_ct.head()
Out[93]: ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
dtype: int64
# 選出休斯頓(IAH)和亞特蘭大(ATL)之間雙方向的航班總數
In[94]: flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]]
Out[94]: ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
# 分別對每行按照出發(fā)地和目的地,按字母排序
In[95]: flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
flights_sort.head()
Out[95]:
# 因為現在每行都是獨立排序的,列名存在問題。對列重命名,然后再計算所有城市間的航班數
In[96]: rename_dict = {'ORG_AIR':'AIR1','DEST_AIR':'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
flights_ct2.head()
Out[96]: AIR1 AIR2
ABE ATL 31
ORD 24
ABI DFW 74
ABQ ATL 16
DEN 46
dtype: int64
# 找到亞特蘭大和休斯頓之間的航班數
In[97]: flights_ct2.loc[('ATL', 'IAH')]
Out[97]: 269
# 如果調換順序,則會出錯
In[98]: flights_ct2.loc[('IAH', 'ATL')]
---------------------------------------------------------------------------
IndexingError Traceback (most recent call last)
<ipython-input-98-56147a7d0bb5> in <module>()
----> 1 flights_ct2.loc[('IAH', 'ATL')]
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1323 except (KeyError, IndexError):
1324 pass
-> 1325 return self._getitem_tuple(key)
1326 else:
1327 key = com._apply_if_callable(key, self.obj)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
839
840 # no multi-index, so validate all of the indexers
--> 841 self._has_valid_tuple(tup)
842
843 # ugly hack for GH #836
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
186 for i, k in enumerate(key):
187 if i >= self.obj.ndim:
--> 188 raise IndexingError('Too many indexers')
189 if not self._has_valid_type(k, i):
190 raise ValueError("Location based indexing can only have [%s] "
IndexingError: Too many indexers
更多
# 用NumPy的sort函數可以大大提高速度
In[99]: data_sorted = np.sort(flights[['ORG_AIR', 'DEST_AIR']])
data_sorted[:10]
Out[99]: array([['LAX', 'SLC'],
['DEN', 'IAD'],
['DFW', 'VPS'],
['DCA', 'DFW'],
['LAX', 'MCI'],
['IAH', 'SAN'],
['DFW', 'MSY'],
['PHX', 'SFO'],
['ORD', 'STL'],
['IAH', 'SJC']], dtype=object)
# 重新用DataFrame構造器創建一個DataFrame,檢測其是否與flights_sorted相等
In[100]: flights_sort2 = pd.DataFrame(data_sorted, columns=['AIR1', 'AIR2'])
fs_orig = flights_sort.rename(columns={'ORG_AIR':'AIR1', 'DEST_AIR':'AIR2'})
flights_sort2.equals(fs_orig)
Out[100]: True
# 比較速度
In[101]: %timeit flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
7.82 s ± 189 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[102]: %%timeit
data_sorted = np.sort(flights[['ORG_AIR', 'DEST_AIR']])
flights_sort2 = pd.DataFrame(data_sorted, columns=['AIR1', 'AIR2'])
10.9 ms ± 325 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12. 找到持續(xù)最長的準時航班
# 創建一個Series
In[103]: s = pd.Series([1, 1, 1, 0, 1, 1, 1, 0])
s
Out[103]: 0 1
1 1
2 1
3 0
4 1
5 1
6 1
7 0
dtype: int64
# 累積求和
In[104]: s1 = s.cumsum()
s1
Out[104]: 0 1
1 2
2 3
3 3
4 4
5 5
6 6
7 6
dtype: int64
In[105]: s.mul(s1).diff()
Out[105]: 0 NaN
1 1.0
2 1.0
3 -3.0
4 4.0
5 1.0
6 1.0
7 -6.0
dtype: float64
# 將所有非負值變為缺失值
In[106]: s.mul(s1).diff().where(lambda x: x < 0)
Out[106]: 0 NaN
1 NaN
2 NaN
3 -3.0
4 NaN
5 NaN
6 NaN
7 -6.0
dtype: float64
In[107]: s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1, fill_value=0)
Out[107]: 0 1.0
1 2.0
2 3.0
3 0.0
4 1.0
5 2.0
6 3.0
7 0.0
dtype: float64
# 創建一個準時的列 ON_TIME
In[108]: flights = pd.read_csv('data/flights.csv')
flights['ON_TIME'] = flights['ARR_DELAY'].lt(15).astype(int)
flights[['AIRLINE', 'ORG_AIR', 'ON_TIME']].head(10)
Out[108]:
# 將之前的邏輯做成一個函數
In[109]: def max_streak(s):
s1 = s.cumsum()
return s.mul(s1).diff().where(lambda x: x < 0) \
.ffill().add(s1, fill_value=0).max()
In[110]: flights.sort_values(['MONTH', 'DAY', 'SCHED_DEP']) \
.groupby(['AIRLINE', 'ORG_AIR'])['ON_TIME'] \
.agg(['mean', 'size', max_streak]).round(2).head()
Out[110]:
更多
# 求最長的延誤航班
In[111]: def max_delay_streak(df):
df = df.reset_index(drop=True)
s = 1 - df['ON_TIME']
s1 = s.cumsum()
streak = s.mul(s1).diff().where(lambda x: x < 0) \
.ffill().add(s1, fill_value=0)
last_idx = streak.idxmax()
first_idx = last_idx - streak.max() + 1
df_return = df.loc[[first_idx, last_idx], ['MONTH', 'DAY']]
df_return['streak'] = streak.max()
df_return.index = ['first', 'last']
df_return.index.name='streak_row'
return df_return
In[112]: flights.sort_values(['MONTH', 'DAY', 'SCHED_DEP']) \
.groupby(['AIRLINE', 'ORG_AIR']) \
.apply(max_delay_streak) \
.sort_values(['streak','MONTH','DAY'], ascending=[False, True, True]).head(10)
Out[112]:
第01章 Pandas基礎
第02章 DataFrame運算
第03章 數據分析入門
第04章 選取數據子集
第05章 布爾索引
第06章 索引對齊
第07章 分組聚合、過濾源请、轉換
第08章 數據清理
第09章 合并Pandas對象
第10章 時間序列分析
第11章 用Matplotlib枪芒、Pandas彻况、Seaborn進行可視化