Pandas Cookbook, Chapter 08: Data Cleaning

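Chapter 01 Pandas Basics
Chapter 02 DataFrame Operations
Chapter 03 Getting Started with Data Analysis
Chapter 04 Selecting Subsets of Data
Chapter 05 Boolean Indexing
Chapter 06 Index Alignment
Chapter 07 Grouping for Aggregation, Filtration, and Transformation
Chapter 08 Data Cleaning
Chapter 09 Combining Pandas Objects
Chapter 10 Time Series Analysis
Chapter 11 Visualization with Matplotlib, Pandas, and Seaborn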

 In[1]: import pandas as pd
        import numpy as np

1. Tidying variable values stored as column names with stack

# Load the state_fruit dataset
 In[2]: state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
        state_fruit
out[2]:

# The stack method pivots all the column names into a new, innermost level of the row index
 In[3]: state_fruit.stack()
out[3]: Texas    Apple      12
                 Orange     10
                 Banana     40
        Arizona  Apple       9
                 Orange      7
                 Banana     12
        Florida  Apple       0
                 Orange     14
                 Banana    190
        dtype: int64

# Use reset_index() to turn the result into a DataFrame
 In[4]: state_fruit_tidy = state_fruit.stack().reset_index()
        state_fruit_tidy
out[4]:

# Rename the columns
 In[5]: state_fruit_tidy.columns = ['state', 'fruit', 'weight']
        state_fruit_tidy
out[5]:

# rename_axis can also be used to name the individual row index levels
 In[6]: state_fruit.stack()\
                   .rename_axis(['state', 'fruit'])
out[6]: state    fruit 
        Texas    Apple      12
                 Orange     10
                 Banana     40
        Arizona  Apple       9
                 Orange      7
                 Banana     12
        Florida  Apple       0
                 Orange     14
                 Banana    190
        dtype: int64

# Then use reset_index again, naming the value column
 In[7]: state_fruit.stack()\
                   .rename_axis(['state', 'fruit'])\
                   .reset_index(name='weight')
out[7]:
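
The stack recipe above can be tried without the CSV file. The following is a minimal sketch that rebuilds `state_fruit` inline from the values shown in out[3]; constructing the frame with `pd.DataFrame` instead of `read_csv` is an assumption made only so the snippet runs on its own.

```python
import pandas as pd

# Rebuild state_fruit inline; the values are copied from out[3] above.
state_fruit = pd.DataFrame(
    {'Apple': [12, 9, 0], 'Orange': [10, 7, 14], 'Banana': [40, 12, 190]},
    index=['Texas', 'Arizona', 'Florida'],
)

# stack moves the column labels into a new innermost row index level,
# rename_axis names both levels, and reset_index(name=...) turns the
# Series back into a DataFrame with a named value column.
state_fruit_tidy = (
    state_fruit.stack()
               .rename_axis(['state', 'fruit'])
               .reset_index(name='weight')
)
print(state_fruit_tidy)
```

Naming the levels before calling reset_index avoids the separate column-renaming step from In[5].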

# Read the state_fruit2 dataset
 In[8]: state_fruit2 = pd.read_csv('data/state_fruit2.csv')
        state_fruit2
out[8]:

# The state names are not in the row index here, so stack reshapes every column, State included, into one long Series
 In[9]: state_fruit2.stack()
out[9]: 0  State       Texas
           Apple          12
           Orange         10
           Banana         40
        1  State     Arizona
           Apple           9
           Orange          7
           Banana         12
        2  State     Florida
           Apple           0
           Orange         14
           Banana        190
        dtype: object

# Set State as the row index first and then stack to get a result similar to the earlier one
 In[10]: state_fruit2.set_index('State').stack()
out[10]: State
         Texas    Apple      12
                  Orange     10
                  Banana     40
         Arizona  Apple       9
                  Orange      7
                  Banana     12
         Florida  Apple       0
                  Orange     14
                  Banana    190
         dtype: int64

2. Tidying variable values stored as column names with melt

# Read the state_fruit2 dataset
 In[11]: state_fruit2 = pd.read_csv('data/state_fruit2.csv')
         state_fruit2
out[11]:

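# Use the melt method, passing columns to id_vars and value_vars. melt turns the original column names into a variable column and the original values into a value column.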
 In[12]: state_fruit2.melt(id_vars=['State'],
                           value_vars=['Apple', 'Orange', 'Banana'])
out[12]:

# Set an arbitrary row index
 In[13]: state_fruit2.index=list('abc')
         state_fruit2.index.name = 'letter'
 In[14]: state_fruit2
out[14]:

# var_name and value_name rename the newly created variable and value columns
 In[15]: state_fruit2.melt(id_vars=['State'],
                      value_vars=['Apple', 'Orange', 'Banana'],
                      var_name='Fruit',
                      value_name='Weight')
out[15]:

# To put all the values in one column and the old column labels in another, call melt with no arguments
 In[16]: state_fruit2.melt()
out[16]:

# To specify the id variables, only the id_vars parameter is needed
 In[17]: state_fruit2.melt(id_vars='State')
out[17]:
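
As a self-contained illustration of the melt calls above, here is a minimal sketch on an inline copy of `state_fruit2`. The values come from out[9]; the variable name `fruit_long` is introduced only for this example.

```python
import pandas as pd

# Inline stand-in for state_fruit2; the values are copied from out[9] above.
state_fruit2 = pd.DataFrame({
    'State': ['Texas', 'Arizona', 'Florida'],
    'Apple': [12, 9, 0],
    'Orange': [10, 7, 14],
    'Banana': [40, 12, 190],
})

# id_vars stay as identifier columns; every other column is unpivoted into
# a (Fruit, Weight) pair, one row per original cell.
fruit_long = state_fruit2.melt(id_vars='State',
                               var_name='Fruit',
                               value_name='Weight')
print(fruit_long)
```

Unlike stack, melt does not need the identifier column to be moved into the index first, and by default it returns a fresh integer index.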

3. Stacking multiple groups of variables simultaneously

# Read the movie dataset and select all the actor names and their Facebook likes
 In[18]: movie = pd.read_csv('data/movie.csv')
         actor = movie[['movie_title', 'actor_1_name', 'actor_2_name', 'actor_3_name', 
               'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes']]
         actor.head()
out[18]:

# Create a custom function to change the column names. wide_to_long requires each group of variables to share a stub name and end with a numeric suffix:
 In[19]: def change_col_name(col_name):
             col_name = col_name.replace('_name', '')
             if 'facebook' in col_name:
                 fb_idx = col_name.find('facebook')
                 col_name = col_name[:5] + col_name[fb_idx - 1:] + col_name[5:fb_idx-1]
             return col_name
 In[20]: actor2 = actor.rename(columns=change_col_name)
         actor2.head()
out[20]:

# Use the wide_to_long function to stack the actor and Facebook-likes columns at the same time
 In[21]: stubs = ['actor', 'actor_facebook_likes']
         actor2_tidy = pd.wide_to_long(actor2, 
                                       stubnames=stubs, 
                                       i=['movie_title'], 
                                       j='actor_num', 
                                       sep='_').reset_index()
         actor2_tidy.head()
out[21]:

# Load the data
 In[22]: df = pd.read_csv('data/stackme.csv')
         df
out[22]:

# Rename the columns
 In[23]: df2 = df.rename(columns = {'a1':'group1_a1', 'b2':'group1_b2',
                                    'd':'group2_a1', 'e':'group2_b2'})
         df2
out[23]:

# With stubnames=['group1', 'group2'] and suffix='.+', the suffix can be any characters, not just digits
 In[24]: pd.wide_to_long(df2, 
                         stubnames=['group1', 'group2'], 
                         i=['State', 'Country', 'Test'], 
                         j='Label', 
                         suffix='.+', 
                         sep='_')
out[24]:
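
The renaming step and the wide_to_long call can be checked on a tiny frame. This is a sketch under stated assumptions: the two-movie sample below (titles, actor names, like counts) is hypothetical, while `change_col_name` is the function from In[19].

```python
import pandas as pd

def change_col_name(col_name):
    # Same renaming as In[19]: move the actor number to the end of the name.
    col_name = col_name.replace('_name', '')
    if 'facebook' in col_name:
        fb_idx = col_name.find('facebook')
        col_name = col_name[:5] + col_name[fb_idx - 1:] + col_name[5:fb_idx - 1]
    return col_name

# Hypothetical two-movie sample; titles, names, and like counts are made up.
actor = pd.DataFrame({
    'movie_title': ['Movie A', 'Movie B'],
    'actor_1_name': ['Ann', 'Cid'],
    'actor_2_name': ['Bob', 'Dee'],
    'actor_1_facebook_likes': [100, 300],
    'actor_2_facebook_likes': [200, 400],
})

actor2 = actor.rename(columns=change_col_name)

# Each stub name is matched against columns of the form <stub>_<digits>.
actor2_tidy = pd.wide_to_long(actor2,
                              stubnames=['actor', 'actor_facebook_likes'],
                              i=['movie_title'],
                              j='actor_num',
                              sep='_').reset_index()
print(actor2_tidy)
```

The result has one row per movie and actor number, with the name and the like count side by side.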

4. Inverting stacked data

# Read the college dataset with the institution name as the row index, keeping only the undergraduate race columns
 In[25]: usecol_func = lambda x: 'UGDS_' in x or x == 'INSTNM'
         college = pd.read_csv('data/college.csv', 
                                   index_col='INSTNM', 
                                   usecols=usecol_func)
         college.head()
out[25]:

# Use the stack method to pivot all the horizontal column names into the vertical row index
 In[26]: college_stacked = college.stack()
         college_stacked.head(18)
out[26]: INSTNM                                         
Alabama A & M University             UGDS_WHITE    0.0333
                                     UGDS_BLACK    0.9353
                                     UGDS_HISP     0.0055
                                     UGDS_ASIAN    0.0019
                                     UGDS_AIAN     0.0024
                                     UGDS_NHPI     0.0019
                                     UGDS_2MOR     0.0000
                                     UGDS_NRA      0.0059
                                     UGDS_UNKN     0.0138
University of Alabama at Birmingham  UGDS_WHITE    0.5922
                                     UGDS_BLACK    0.2600
                                     UGDS_HISP     0.0283
                                     UGDS_ASIAN    0.0518
                                     UGDS_AIAN     0.0022
                                     UGDS_NHPI     0.0007
                                     UGDS_2MOR     0.0368
                                     UGDS_NRA      0.0179
                                     UGDS_UNKN     0.0100
dtype: float64

# The unstack method restores the original shape
 In[27]: college_stacked.unstack().head()
out[27]:

# Another way is to melt first and then pivot. Start by loading the data without setting a row index
 In[28]: college2 = pd.read_csv('data/college.csv', 
                               usecols=usecol_func)
         college2.head()
out[28]:

# Use melt to gather all the race columns into a single column
 In[29]: college_melted = college2.melt(id_vars='INSTNM', 
                                        var_name='Race',
                                        value_name='Percentage')
         college_melted.head()
out[29]:

# Invert the melt with pivot
 In[30]: melted_inv = college_melted.pivot(index='INSTNM',
                                           columns='Race',
                                           values='Percentage')
         melted_inv.head()
out[30]:

# Select rows and columns together with loc and then reset the index to get a DataFrame in the same order as the original
 In[31]: college2_replication = melted_inv.loc[college2['INSTNM'], 
                                               college2.columns[1:]]\
                                                  .reset_index()
         college2.equals(college2_replication)
out[31]: True

# Unstack the outermost row index level instead
 In[32]: college.stack().unstack(0)
out[32]:

# A simpler way to transpose a DataFrame is transpose() or T
 In[33]: college.T
out[33]:
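
Both round trips in this recipe can be sketched on a small frame. The example below reuses the `state_fruit2` values from out[9] rather than the college data, which is an assumption made to keep it self-contained; `restored` and `replicated` are names introduced here.

```python
import pandas as pd

# Small stand-in frame; the values are copied from out[9] above.
state_fruit2 = pd.DataFrame({
    'State': ['Texas', 'Arizona', 'Florida'],
    'Apple': [12, 9, 0],
    'Orange': [10, 7, 14],
    'Banana': [40, 12, 190],
})

# Round trip 1: stack, then unstack the innermost level.
wide = state_fruit2.set_index('State')
restored = wide.stack().unstack()

# Round trip 2: melt, then pivot. pivot sorts the row and column labels,
# so loc is used to put them back in the original order.
melted = state_fruit2.melt(id_vars='State', var_name='Fruit', value_name='Weight')
pivoted = melted.pivot(index='State', columns='Fruit', values='Weight')
replicated = (
    pivoted.loc[state_fruit2['State'], state_fruit2.columns[1:]]
           .reset_index()
           .rename_axis(None, axis='columns')
)
print(replicated)
```

The loc step matters because pivot returns labels in sorted order, so a direct comparison with the original frame would otherwise fail on ordering alone.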

5. Unstacking after a groupby aggregation

# Read the employee dataset and find the mean salary for each race
 In[34]: employee = pd.read_csv('data/employee.csv')
 In[35]: employee.groupby('RACE')['BASE_SALARY'].mean().astype(int)
out[35]: RACE
         American Indian or Alaskan Native    60272
         Asian/Pacific Islander               61660
         Black or African American            50137
         Hispanic/Latino                      52345
         Others                               51278
         White                                64419
         Name: BASE_SALARY, dtype: int64

# Group by race and gender and find the mean salary
 In[36]: agg = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].mean().astype(int)
         agg
out[36]: RACE                               GENDER
         American Indian or Alaskan Native  Female    60238
                                            Male      60305
         Asian/Pacific Islander             Female    63226
                                            Male      61033
         Black or African American          Female    48915
                                            Male      51082
         Hispanic/Latino                    Female    46503
                                            Male      54782
         Others                             Female    63785
                                            Male      38771
         White                              Female    66793
                                            Male      63940
         Name: BASE_SALARY, dtype: int64

# Unstack the GENDER index level
 In[37]: agg.unstack('GENDER')
out[37]:

# Unstack the RACE index level
 In[38]: agg.unstack('RACE')
out[38]:

# Group by RACE and GENDER and find the mean, maximum, and minimum salary
 In[39]: agg2 = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].agg(['mean', 'max', 'min']).astype(int)
         agg2
out[39]:

# Here unstack('GENDER') produces a MultiIndex in the columns; stack and unstack can be used to rearrange the structure
agg2.unstack('GENDER')
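
A minimal sketch of the groupby-then-unstack pattern follows. The six-row employee sample is hypothetical (only the column names match the recipe), and `by_gender` is a name introduced for the example.

```python
import pandas as pd

# Hypothetical six-row employee sample; all values are made up.
employee = pd.DataFrame({
    'RACE': ['White', 'White', 'White', 'Others', 'Others', 'Others'],
    'GENDER': ['Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'BASE_SALARY': [60000, 50000, 54000, 70000, 66000, 62000],
})

# Grouping by two columns gives a Series with a two-level row index.
agg = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].mean().astype(int)

# unstack pivots the named level into the columns, one column per gender.
by_gender = agg.unstack('GENDER')
print(by_gender)
```

With the genders side by side, comparing them within each group becomes a simple column operation.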

6. Replicating pivot_table with a groupby aggregation

# Read the flights dataset
 In[40]: flights = pd.read_csv('data/flights.csv')
         flights.head()
out[40]:

# Use pivot_table to find the total number of cancelled flights per origin airport for each airline
 In[41]: fp = flights.pivot_table(index='AIRLINE', 
                                  columns='ORG_AIR', 
                                  values='CANCELLED', 
                                  aggfunc='sum',
                                  fill_value=0).round(2)
         fp.head()
out[41]:

# A groupby aggregation cannot reproduce this table directly. First group by all the columns used in index and columns
 In[42]: fg = flights.groupby(['AIRLINE', 'ORG_AIR'])['CANCELLED'].sum()
         fg.head()
out[42]: AIRLINE  ORG_AIR
         AA       ATL         3
                  DEN         4
                  DFW        86
                  IAH         3
                  LAS         3
         Name: CANCELLED, dtype: int64

# Then use unstack to pivot the ORG_AIR index level into the column names
 In[43]: fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
         fg_unstack.head()
out[43]:

# Check whether the two approaches are equivalent
 In[44]: fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
         fp.equals(fg_unstack)
out[44]: True
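
The equivalence can be reproduced on a few rows. This sketch uses a hypothetical five-row flights sample; the comparison is done with `pd.testing.assert_frame_equal(..., check_dtype=False)` instead of `equals`, which is a substitution on my part so that a possible integer/float difference between pandas versions does not affect the check.

```python
import pandas as pd

# Hypothetical flights sample; the codes are used only as labels.
flights = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'AA', 'DL', 'DL'],
    'ORG_AIR': ['ATL', 'DFW', 'DFW', 'ATL', 'ATL'],
    'CANCELLED': [0, 1, 1, 1, 0],
})

fp = flights.pivot_table(index='AIRLINE', columns='ORG_AIR',
                         values='CANCELLED', aggfunc='sum', fill_value=0)

# Group by every column used in index and columns, aggregate, then unstack
# the level that should become the column labels.
fg_unstack = (
    flights.groupby(['AIRLINE', 'ORG_AIR'])['CANCELLED']
           .sum()
           .unstack('ORG_AIR', fill_value=0)
)

# Raises AssertionError if the two tables differ in labels or values.
pd.testing.assert_frame_equal(fp, fg_unstack, check_dtype=False)
print(fg_unstack)
```

The DL/DFW combination never occurs in the sample, so it is the cell where fill_value=0 takes effect in both versions.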

# First build a slightly more complex pivot table
 In[45]: fp2 = flights.pivot_table(index=['AIRLINE', 'MONTH'],
                                   columns=['ORG_AIR', 'CANCELLED'],
                                   values=['DEP_DELAY', 'DIST'],
                                   aggfunc=[np.mean, np.sum],
                                   fill_value=0)
         fp2.head()
out[45]:

# Reproduce the table above with groupby and unstack
 In[46]: flights.groupby(['AIRLINE', 'MONTH', 'ORG_AIR', 'CANCELLED'])[['DEP_DELAY', 'DIST']] \
                .agg(['mean', 'sum']) \
                .unstack(['ORG_AIR', 'CANCELLED'], fill_value=0) \
                .swaplevel(0, 1, axis='columns') \
                .head()
out[46]:

7. Renaming axis levels for easier reshaping

# Read the college dataset, group it, and compute summary statistics for the undergraduate population and SAT math scores
 In[47]: college = pd.read_csv('data/college.csv')
 In[48]: cg = college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']] \
                     .agg(['count', 'min', 'max']).head(6)
 In[49]: cg
out[49]:

# Both row index levels have names, but the column levels do not. Use rename_axis to name the two column levels
 In[50]:cg = cg.rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
        cg
out[50]:

# Move the AGG_FUNCS column level into the row index
 In[51]:cg.stack('AGG_FUNCS').head()
out[51]:

# By default stack places the column level at the innermost row index level; swaplevel changes the level order
 In[52]:cg.stack('AGG_FUNCS').swaplevel('AGG_FUNCS', 'STABBR', axis='index').head()
out[52]:

# Building on the previous step, apply sort_index as well
 In[53]:cg.stack('AGG_FUNCS') \
          .swaplevel('AGG_FUNCS', 'STABBR', axis='index') \
          .sort_index(level='RELAFFIL', axis='index') \
          .sort_index(level='AGG_COLS', axis='columns').head(6)
out[53]:

# Stack some levels and unstack others
 In[54]:cg.stack('AGG_FUNCS').unstack(['RELAFFIL', 'STABBR'])
out[54]:

# Stacking all the column levels returns a Series
 In[55]:cg.stack(['AGG_FUNCS', 'AGG_COLS']).head(12)
out[55]:

# Remove the names of all the row and column index levels
 In[56]:cg.rename_axis([None, None], axis='index').rename_axis([None, None], axis='columns')
out[56]:
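
Naming the column levels is what makes the later stack and swaplevel calls readable. Here is a minimal sketch on a hypothetical four-row college sample (all values made up; only the column names follow the recipe).

```python
import pandas as pd

# Hypothetical college sample; all values are made up.
college = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL', 'AL'],
    'RELAFFIL': [0, 0, 1, 1],
    'UGDS': [100.0, 300.0, 50.0, 150.0],
    'SATMTMID': [500.0, 520.0, 480.0, 510.0],
})

cg = college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']] \
            .agg(['count', 'min', 'max'])

# Name the two column levels so they can be referred to by name.
cg = cg.rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')

# Move AGG_FUNCS into the row index, then swap it with STABBR.
stacked = cg.stack('AGG_FUNCS').swaplevel('AGG_FUNCS', 'STABBR', axis='index')
print(stacked)
```

Referring to levels by name rather than by integer position keeps the chain valid even if the level order changes earlier in the pipeline.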

8. Tidying when multiple variables are stored as column names

# Read the weightlifting dataset
 In[57]:weightlifting = pd.read_csv('data/weightlifting_men.csv')
        weightlifting
out[57]:

# Use melt to put the sex_age information into a single column
 In[58]:wl_melt = weightlifting.melt(id_vars='Weight Category', 
                                     var_name='sex_age', 
                                     value_name='Qual Total')
        wl_melt.head()
out[58]:

# Use the split method to break the sex_age column into two columns
 In[59]:sex_age = wl_melt['sex_age'].str.split(expand=True)
        sex_age.head()
out[59]:      0         1
      0     M35     35-39
      1     M35     35-39
      2     M35     35-39
      3     M35     35-39
      4     M35     35-39

# Name the columns
 In[60]:sex_age.columns = ['Sex', 'Age Group']
        sex_age.head()
out[60]:

# Keep only the first character of the string, the M
 In[61]:sex_age['Sex'] = sex_age['Sex'].str[0]
        sex_age.head()
out[61]:

# Use concat to join sex_age with wl_cat_total
 In[62]:wl_cat_total = wl_melt[['Weight Category', 'Qual Total']]
        wl_tidy = pd.concat([sex_age, wl_cat_total], axis='columns')
        wl_tidy.head()
out[62]:

# The same result can also be obtained as follows
 In[63]:cols = ['Weight Category', 'Qual Total']
        sex_age[cols] = wl_melt[cols]

# assign can also be used to add the new columns dynamically
 In[64]: age_group = wl_melt.sex_age.str.extract(r'(\d{2}[-+](?:\d{2})?)', expand=False)
         sex = wl_melt.sex_age.str[0]
         new_cols = {'Sex':sex, 
                     'Age Group': age_group}
 In[65]: wl_tidy2 = wl_melt.assign(**new_cols).drop('sex_age', axis='columns')
         wl_tidy2.head()
out[65]:

# Check whether the two approaches are equivalent
 In[66]: wl_tidy2.sort_index(axis=1).equals(wl_tidy.sort_index(axis=1))
out[66]: True
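
The melt-then-split idea can be sketched on a two-column slice. The column labels below follow the "M35 35-39" pattern visible in out[59]; the "M80 80+" label and all the totals are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical slice of the weightlifting table; labels follow the
# "M35 35-39" pattern from out[59], and the totals are made up.
weightlifting = pd.DataFrame({
    'Weight Category': [56, 62],
    'M35 35-39': [137, 152],
    'M80 80+': [62, 67],
})

wl_melt = weightlifting.melt(id_vars='Weight Category',
                             var_name='sex_age',
                             value_name='Qual Total')

# The first character is the sex; the regex captures the age range,
# including open-ended ranges such as "80+".
new_cols = {
    'Sex': wl_melt['sex_age'].str[0],
    'Age Group': wl_melt['sex_age'].str.extract(r'(\d{2}[-+](?:\d{2})?)',
                                                expand=False),
}
wl_tidy = wl_melt.assign(**new_cols).drop('sex_age', axis='columns')
print(wl_tidy)
```

Passing the new columns as an unpacked dictionary is needed here because 'Age Group' contains a space and cannot be written as a keyword argument.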

9. Tidying when multiple variables are stored as column values

# Read the restaurant_inspections dataset, parsing the Date column as datetime64
 In[67]: inspections = pd.read_csv('data/restaurant_inspections.csv', parse_dates=['Date'])
         inspections.head(10)
out[67]:

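# Try to build new columns from the values of the Info column with pivot. The pandas version used here does not accept a list for index and raises an error (pandas 1.1 and later do accept one)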
 In[68]: inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in __init__(self, values, categories, ordered, fastpath)
    297             try:
--> 298                 codes, categories = factorize(values, sort=True)
    299             except TypeError:

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    559     check_nulls = not is_integer_dtype(original)
--> 560     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
    561 

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels (pandas/_libs/hashtable.c:21922)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

During handling of the above exception, another exception occurred:

NotImplementedError                       Traceback (most recent call last)
<ipython-input-68-754f69d68d6c> in <module>()
----> 1 inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in pivot(self, index, columns, values)
   3851         """
   3852         from pandas.core.reshape.reshape import pivot
-> 3853         return pivot(self, index=index, columns=columns, values=values)
   3854 
   3855     def stack(self, level=-1, dropna=True):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in pivot(self, index, columns, values)
    375             index = self[index]
    376         indexed = Series(self[values].values,
--> 377                          index=MultiIndex.from_arrays([index, self[columns]]))
    378         return indexed.unstack(columns)
    379 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/multi.py in from_arrays(cls, arrays, sortorder, names)
   1098         from pandas.core.categorical import _factorize_from_iterables
   1099 
-> 1100         labels, levels = _factorize_from_iterables(arrays)
   1101         if names is None:
   1102             names = [getattr(arr, "name", None) for arr in arrays]

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in _factorize_from_iterables(iterables)
   2191         # For consistency, it should return a list of 2 lists.
   2192         return [[], []]
-> 2193     return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in <listcomp>(.0)
   2191         # For consistency, it should return a list of 2 lists.
   2192         return [[], []]
-> 2193     return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in _factorize_from_iterable(values)
   2163         codes = values.codes
   2164     else:
-> 2165         cat = Categorical(values, ordered=True)
   2166         categories = cat.categories
   2167         codes = cat.codes

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in __init__(self, values, categories, ordered, fastpath)
    308 
    309                 # FIXME
--> 310                 raise NotImplementedError("> 1 ndim Categorical are not "
    311                                           "supported at this time")
    312 

NotImplementedError: > 1 ndim Categorical are not supported at this time

# Put 'Name', 'Date', and 'Info' into the index
 In[69]: inspections.set_index(['Name','Date', 'Info']).head(10)
out[69]:

# Use unstack to pivot the values of the Info column into new columns
 In[70]: inspections.set_index(['Name','Date', 'Info']).unstack('Info').head()
out[70]:

# Use reset_index with col_level=-1 so the former row index levels land on the innermost column level
 In[71]: insp_tidy = inspections.set_index(['Name','Date', 'Info']) \
                                        .unstack('Info') \
                                        .reset_index(col_level=-1)
         insp_tidy.head()
out[71]:

# Drop the outermost column level and rename the remaining column level to None
 In[72]: insp_tidy.columns = insp_tidy.columns.droplevel(0).rename(None)
         insp_tidy.head()
out[72]:

# Using the squeeze method avoids the MultiIndex columns above
 In[73]: inspections.set_index(['Name','Date', 'Info']) \
                    .squeeze() \
                    .unstack('Info') \
                    .reset_index() \
                    .rename_axis(None, axis='columns')
out[73]:

# pivot_table needs an aggregation function that reduces each group to a single value
 In[74]: inspections.pivot_table(index=['Name', 'Date'], 
                                 columns='Info', 
                                 values='Value', 
                                 aggfunc='first') \
                    .reset_index()\
                    .rename_axis(None, axis='columns')
out[74]:
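
The squeeze-and-unstack route can be tried on a few rows. This is a sketch on a hypothetical four-row inspections sample (names, dates, and values are made up; the column names follow the recipe).

```python
import pandas as pd

# Hypothetical inspections sample; names, dates, and values are made up.
inspections = pd.DataFrame({
    'Name': ['Cafe A', 'Cafe A', 'Diner B', 'Diner B'],
    'Date': pd.to_datetime(['2017-01-05', '2017-01-05',
                            '2017-02-10', '2017-02-10']),
    'Info': ['Grade', 'Score', 'Grade', 'Score'],
    'Value': ['A', '9', 'B', '17'],
})

# squeeze turns the one-column frame into a Series, so unstack yields
# single-level columns instead of a (Value, Info) MultiIndex.
insp_tidy = (
    inspections.set_index(['Name', 'Date', 'Info'])
               .squeeze()
               .unstack('Info')
               .reset_index()
               .rename_axis(None, axis='columns')
)
print(insp_tidy)
```

This only works when each (Name, Date, Info) combination is unique; with duplicates, unstack raises an error and an aggregating method such as pivot_table is needed instead.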

10. Tidying when two or more values are stored in the same cell

# Read the texas_cities dataset
 In[75]: cities = pd.read_csv('data/texas_cities.csv')
         cities
out[75]:

# Split Geolocation into four separate columns
 In[76]: geolocations = cities.Geolocation.str.split(pat='. ', expand=True)
         geolocations.columns = ['latitude', 'latitude direction', 'longitude', 'longitude direction']
         geolocations
out[76]:

# Convert the data types
 In[77]: geolocations = geolocations.astype({'latitude':'float', 'longitude':'float'})
         geolocations.dtypes
out[77]: latitude               float64
         latitude direction      object
         longitude              float64
         longitude direction     object
         dtype: object

# Concatenate the new columns with the original City column
 In[78]: cities_tidy = pd.concat([cities['City'], geolocations], axis='columns')
         cities_tidy
out[78]:

# Skip this cell; the author repeated the previous step here
 In[79]: pd.concat([cities['City'], geolocations], axis='columns')
out[79]:

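How it works

# The to_numeric function can automatically convert each column to integer or float (note that errors='ignore' is deprecated in recent pandas releases)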
 In[80]: temp = geolocations.apply(pd.to_numeric, errors='ignore')
         temp
out[80]:

# Check the data types again
 In[81]: temp.dtypes
out[81]: latitude               float64
         latitude direction      object
         longitude              float64
         longitude direction     object
         dtype: object

# The | character lets split break on more than one delimiter
 In[82]: cities.Geolocation.str.split(pat='° |, ', expand=True)
out[82]:

# A more complex way to extract the pieces
 In[83]: cities.Geolocation.str.extract('([0-9.]+). (N|S), ([0-9.]+). (E|W)', expand=True)
out[83]:
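
The split and extract patterns can be compared on sample strings. The two rows below are a hypothetical stand-in for texas_cities; the "<lat>° <N|S>, <lon>° <E|W>" layout is an assumption inferred from the patterns used above.

```python
import pandas as pd

# Hypothetical stand-in for texas_cities; the strings follow the
# "<lat>° <N|S>, <lon>° <E|W>" layout that the patterns above assume.
cities = pd.DataFrame({
    'City': ['Houston', 'Dallas'],
    'Geolocation': ['29.7604° N, 95.3698° W', '32.7767° N, 96.7970° W'],
})

# Splitting on either "° " or ", " yields four pieces per row.
pieces = cities['Geolocation'].str.split(pat='° |, ', expand=True)

# Capture groups pick out the same four pieces; '.' matches the degree sign.
geolocations = cities['Geolocation'].str.extract(
    r'([0-9.]+). (N|S), ([0-9.]+). (E|W)', expand=True)
geolocations.columns = ['latitude', 'latitude direction',
                        'longitude', 'longitude direction']
geolocations = geolocations.astype({'latitude': 'float', 'longitude': 'float'})

cities_tidy = pd.concat([cities['City'], geolocations], axis='columns')
print(cities_tidy)
```

extract is stricter than split: a row that does not match the whole pattern comes back as missing values rather than as a differently shaped split.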

11. Tidying when variables are stored in both column names and column values

# Read the sensors dataset
 In[84]: sensors = pd.read_csv('data/sensors.csv')
         sensors
out[84]:

# Use melt to tidy the year columns
 In[85]: sensors.melt(id_vars=['Group', 'Property'], var_name='Year').head(6)
out[85]:

# Use pivot_table to turn the values of the Property column into new column names
 In[86]: sensors.melt(id_vars=['Group', 'Property'], var_name='Year') \
                .pivot_table(index=['Group', 'Year'], columns='Property', values='value') \
                .reset_index() \
                .rename_axis(None, axis='columns')
out[86]:

# Achieve the same result with stack and unstack
 In[87]: sensors.set_index(['Group', 'Property']) \
                .stack() \
                .unstack('Property') \
                .rename_axis(['Group', 'Year'], axis='index') \
                .rename_axis(None, axis='columns') \
                .reset_index()
out[87]:
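
The two routes can be run side by side on a small frame. The sensors sample below is hypothetical (readings made up, two year columns only), and `via_melt` and `via_stack` are names introduced for the example.

```python
import pandas as pd

# Hypothetical sensors sample; the readings are made up.
sensors = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B'],
    'Property': ['Pressure', 'Temperature', 'Pressure', 'Temperature'],
    '2012': [928, 1026, 817, 1008],
    '2013': [873, 1038, 877, 1041],
})

# Route 1: melt the year columns, then pivot Property into columns.
via_melt = (
    sensors.melt(id_vars=['Group', 'Property'], var_name='Year')
           .pivot_table(index=['Group', 'Year'], columns='Property', values='value')
           .reset_index()
           .rename_axis(None, axis='columns')
)

# Route 2: stack the year columns, then unstack Property.
via_stack = (
    sensors.set_index(['Group', 'Property'])
           .stack()
           .unstack('Property')
           .rename_axis(['Group', 'Year'], axis='index')
           .rename_axis(None, axis='columns')
           .reset_index()
)
print(via_stack)
```

One difference worth knowing: pivot_table aggregates with the mean by default, so its value columns come back as floats, while the stack/unstack route keeps the original integers.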

12. Tidying when multiple observational units are stored in the same table

# Read the movie_altered dataset
 In[88]: movie = pd.read_csv('data/movie_altered.csv')
         movie.head()
out[88]:

# Insert a new column that uniquely identifies each movie
 In[89]: movie.insert(0, 'id', np.arange(len(movie)))
         movie.head()
out[89]:

# Use wide_to_long to put all the actors and directors into single columns and their Facebook likes into single columns
 In[90]: stubnames = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes']
         movie_long = pd.wide_to_long(movie, 
                                      stubnames=stubnames, 
                                      i='id', 
                                      j='num', 
                                      sep='_').reset_index()
         movie_long['num'] = movie_long['num'].astype(int)
         movie_long.head(9)
out[90]:

# Split this data into several smaller tables
 In[91]: movie_table = movie_long[['id','title', 'year', 'duration', 'rating']]
         director_table = movie_long[['id', 'director', 'num', 'director_fb_likes']]
         actor_table = movie_long[['id', 'actor', 'num', 'actor_fb_likes']]
 In[92]: movie_table.head(9)
out[92]:

 In[93]: director_table.head(9)
out[93]:

 In[94]: actor_table.head(9)
out[94]:

# Remove duplicates and missing values
 In[95]: movie_table = movie_table.drop_duplicates().reset_index(drop=True)
         director_table = director_table.dropna().reset_index(drop=True)
         actor_table = actor_table.dropna().reset_index(drop=True)
 In[96]: movie_table.head()
out[96]:

 In[97]: director_table.head()
out[97]:

# Compare memory usage
 In[98]: movie.memory_usage(deep=True).sum()
out[98]: 2318234

 In[99]: movie_table.memory_usage(deep=True).sum() + \
         director_table.memory_usage(deep=True).sum() + \
         actor_table.memory_usage(deep=True).sum()
out[99]: 2624898

# Create id columns for the actors and directors
 In[100]: director_cat = pd.Categorical(director_table['director'])
          director_table.insert(1, 'director_id', director_cat.codes)

          actor_cat = pd.Categorical(actor_table['actor'])
          actor_table.insert(1, 'actor_id', actor_cat.codes)

          director_table.head()
out[100]:

 In[101]: actor_table.head()
out[101]:

# These two tables can be used to build the intermediate (associative) tables. Start with the director tables
 In[102]: director_associative = director_table[['id', 'director_id', 'num']]
          dcols = ['director_id', 'director', 'director_fb_likes']
          director_unique = director_table[dcols].drop_duplicates().reset_index(drop=True)
          director_associative.head()         
out[102]:

 In[103]: director_unique.head()
out[103]:

# Then build the actor tables
 In[104]: actor_associative = actor_table[['id', 'actor_id', 'num']]
          acols = ['actor_id', 'actor', 'actor_fb_likes']
          actor_unique = actor_table[acols].drop_duplicates().reset_index(drop=True)
          actor_associative.head()
out[104]:

 In[105]: actor_unique.head()
out[105]:
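
The id-and-split step can be sketched on a tiny long-format table. The four-row `actor_table` below is hypothetical (ids, names, and like counts made up); `rebuilt` is a name introduced to show that the two tables can be joined back.

```python
import pandas as pd

# Hypothetical long-format actor table; ids, names, and like counts are made up.
actor_table = pd.DataFrame({
    'id': [0, 0, 1, 1],
    'actor': ['Ann', 'Bob', 'Bob', 'Cid'],
    'num': [1, 2, 1, 2],
    'actor_fb_likes': [100.0, 200.0, 200.0, 300.0],
})

# Categorical assigns one integer code per distinct name, in sorted order.
actor_cat = pd.Categorical(actor_table['actor'])
actor_table.insert(1, 'actor_id', actor_cat.codes)

# Associative table: which actor appears in which movie, and in which slot.
actor_associative = actor_table[['id', 'actor_id', 'num']]

# Unique table: one row per actor, so names and like counts are stored once.
actor_unique = (
    actor_table[['actor_id', 'actor', 'actor_fb_likes']]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Merging the two tables recovers the original long-format columns.
rebuilt = actor_associative.merge(actor_unique, on='actor_id')
print(rebuilt)
```

The repeated actor ('Bob' here) is stored once in the unique table, which is where the memory saving reported for the full dataset comes from.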

# Check how much memory the new tables use
 In[106]: movie_table.memory_usage(deep=True).sum() + \
          director_associative.memory_usage(deep=True).sum() + \
          director_unique.memory_usage(deep=True).sum() + \
          actor_associative.memory_usage(deep=True).sum() + \
          actor_unique.memory_usage(deep=True).sum()
out[106]: 1833402

 In[107]: movie_table.head()
out[107]:

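# The original movie table can be rebuilt by joining these tables back together. First merge each associative table with its actor/director table, then pivot on the num column and append its value to the column names
 In[108]: actors = actor_associative.merge(actor_unique, on='actor_id') \
                                    .drop(columns='actor_id') \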
                                    .pivot_table(index='id', columns='num', aggfunc='first')

          actors.columns = actors.columns.get_level_values(0) + '_' + \
                           actors.columns.get_level_values(1).astype(str)

          directors = director_associative.merge(director_unique, on='director_id') \
                                          .drop(columns='director_id') \
                                          .pivot_table(index='id', columns='num', aggfunc='first')

          directors.columns = directors.columns.get_level_values(0) + '_' + \
                              directors.columns.get_level_values(1).astype(str)
 In[109]: actors.head()
out[109]:

 In[110]: directors.head()
out[110]:

 In[111]: movie2 = movie_table.merge(directors.reset_index(), on='id', how='left') \
                              .merge(actors.reset_index(), on='id', how='left')
 In[112]: movie2.head()
out[112]:

 In[113]: movie.equals(movie2[movie.columns])
out[113]: True


Last edited: 2018.10.26 18:21:32

