Pandas Cookbook, Chapter 2: Essential DataFrame Operations

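Chapter 1: Pandas Foundations
Chapter 2: DataFrame Operations
Chapter 3: Beginning Data Analysis
Chapter 4: Selecting Subsets of Data
Chapter 5: Boolean Indexing
Chapter 6: Index Alignment
Chapter 7: Grouping for Aggregation, Filtration, and Transformation
Chapter 8: Data Cleaning
Chapter 9: Combining Pandas Objects
Chapter 10: Time Series Analysis
Chapter 11: Visualization with Matplotlib, Pandas, and Seaborn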

In[1]: import pandas as pd
       import numpy as np
       pd.options.display.max_columns = 40

1. Selecting multiple DataFrame columns

# Select multiple columns by passing a list
 In[2]: movie = pd.read_csv('data/movie.csv')
        movie_actor_director = movie[['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']]
        movie_actor_director.head()
Out[2]:

# Select a single column; passing a one-item list returns a DataFrame rather than a Series
 In[3]: movie[['director_name']].head()
Out[3]:
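The difference between a string key and a one-item list can be checked on a small frame. This is a minimal sketch: movie_demo and its values are made up for illustration and only borrow column names from movie.csv.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of movie.csv
movie_demo = pd.DataFrame({
    'director_name': ['Director A', 'Director B', 'Director C'],
    'actor_1_name': ['Actor X', 'Actor Y', 'Actor Z'],
    'duration': [178.0, 169.0, 148.0],
})

as_series = movie_demo['director_name']    # string key -> Series
as_frame = movie_demo[['director_name']]   # one-item list -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```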

# The wrong way to select multiple columns: without the inner list, the names are treated as a single tuple key
 In[4]: movie['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-954222273e42> in <module>()
----> 1 movie['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
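The KeyError above names a tuple because Python packs the comma-separated names into one key before pandas sees them. A minimal sketch on a made-up frame (movie_demo is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for movie.csv
movie_demo = pd.DataFrame({
    'director_name': ['Director A', 'Director B'],
    'actor_1_name': ['Actor X', 'Actor Y'],
})

try:
    # Equivalent to movie_demo[('actor_1_name', 'director_name')]
    movie_demo['actor_1_name', 'director_name']
    raised = False
except KeyError:
    raised = True

# Wrapping the names in a list selects both columns
selected = movie_demo[['actor_1_name', 'director_name']]

print(raised)
print(list(selected.columns))
```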

# Assign the list to a variable to make multi-column selection easier to read
 In[5]: cols =['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
        movie_actor_director = movie[cols]

# Count the number of columns of each data type
 In[6]: movie.get_dtype_counts()
Out[6]: float64    13
        int64       3
        object     12
        dtype: int64
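In recent pandas releases get_dtype_counts() is no longer available (to my knowledge it was deprecated and then removed in 1.0). Counting the values of the dtypes attribute gives the same summary. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical numeric columns; values are illustrative only
demo = pd.DataFrame({
    'duration': [178.0, 169.0],
    'gross': [7.5, 3.0],
    'num_voted_users': [1000, 2000],
})

# Convert each dtype to its name, then count how often each name occurs
dtype_counts = demo.dtypes.astype(str).value_counts()
print(dtype_counts)
```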

# Use select_dtypes() to select only the integer columns
 In[7]: movie.select_dtypes(include=['int']).head()
Out[7]:

# Select all numeric columns
 In[8]: movie.select_dtypes(include=['number']).head()
Out[8]:
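select_dtypes() accepts both include and exclude. The sketch below uses a made-up three-column frame (demo is hypothetical) to show how the two arguments split the columns.

```python
import pandas as pd

# Hypothetical frame with one text, one float, and one integer column
demo = pd.DataFrame({
    'movie_title': ['Title A', 'Title B'],
    'duration': [178.0, 148.0],
    'num_voted_users': [1000, 2000],
})

numeric = demo.select_dtypes(include=['number'])      # ints and floats
non_numeric = demo.select_dtypes(exclude=['number'])  # everything else
int_only = demo.select_dtypes(include=['int64'])      # explicit integer width

print(list(numeric.columns))
print(list(non_numeric.columns))
print(list(int_only.columns))
```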

# Use filter() to select the columns whose names contain a given substring
 In[9]: movie.filter(like='facebook').head()
Out[9]:

# Select columns with a regular expression (here, any column name containing a digit)
 In[10]: movie.filter(regex=r'\d').head()
Out[10]:

# Pass a list to the items parameter of filter() to select multiple columns; names that do not exist ('asdf') are silently ignored
 In[11]: movie.filter(items=['actor_1_name', 'asdf']).head()
Out[11]:
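The three filter() modes can be compared side by side. A minimal sketch on a made-up one-row frame, where only the column names matter:

```python
import pandas as pd

# Hypothetical one-row frame; values are illustrative only
demo = pd.DataFrame(
    [['Actor X', 1000, 'Director A', 33000]],
    columns=['actor_1_name', 'actor_1_facebook_likes',
             'director_name', 'movie_facebook_likes'],
)

by_substring = demo.filter(like='facebook')             # name contains 'facebook'
by_regex = demo.filter(regex=r'\d')                     # name contains a digit
by_items = demo.filter(items=['actor_1_name', 'asdf'])  # 'asdf' does not exist

print(list(by_substring.columns))
print(list(by_regex.columns))
print(list(by_items.columns))
```

Unlike movie[['actor_1_name', 'asdf']], the items form does not raise a KeyError for the unknown name.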

2. Ordering column names

# Read the movie dataset
 In[12]: movie = pd.read_csv('data/movie.csv')
 In[13]: movie.head()
Out[13]:

# Print the column index
 In[14]: movie.columns
Out[14]: Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

# Group the column names into lists, then arrange the columns in the chosen order
 In[15]: disc_core = ['movie_title','title_year', 'content_rating','genres']
         disc_people = ['director_name','actor_1_name', 'actor_2_name','actor_3_name']
         disc_other = ['color','country','language','plot_keywords','movie_imdb_link']
         cont_fb = ['director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes',
                    'actor_3_facebook_likes', 'cast_total_facebook_likes', 'movie_facebook_likes']
         cont_finance = ['budget','gross']
         cont_num_reviews = ['num_voted_users','num_user_for_reviews', 'num_critic_for_reviews']
         cont_other = ['imdb_score','duration', 'aspect_ratio', 'facenumber_in_poster']

 In[16]: new_col_order = disc_core + disc_people + disc_other + \
                    cont_fb + cont_finance + cont_num_reviews + cont_other
         set(movie.columns) == set(new_col_order)
Out[16]: True

 In[17]: movie2 = movie[new_col_order]
         movie2.head()
Out[17]:
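The set comparison in In[16] guards against dropping or misspelling a column. When it returns False, set differences show which names are at fault. A minimal sketch with a made-up four-column frame:

```python
import pandas as pd

# Hypothetical frame; values are illustrative only
demo = pd.DataFrame(
    [[178.0, 'Title A', 7.5, 'Director A']],
    columns=['duration', 'movie_title', 'gross', 'director_name'],
)

new_order = ['movie_title', 'director_name'] + ['gross', 'duration']

missing = set(demo.columns) - set(new_order)   # columns left out of the new order
unknown = set(new_order) - set(demo.columns)   # names that are not real columns

reordered = demo[new_order]
print(missing, unknown)
print(list(reordered.columns))
```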

3. Operating on the entire DataFrame

 In[18]: pd.options.display.max_rows = 8
         movie = pd.read_csv('data/movie.csv')
         # Print the number of rows and columns
         movie.shape 
Out[18]: (4916, 28)

# Print the total number of elements (rows times columns)
 In[19]: movie.size
Out[19]: 137648

# Number of dimensions of the dataset
 In[20]: movie.ndim
Out[20]: 2

# Length of the dataset (the number of rows)
 In[21]: len(movie)
Out[21]: 4916
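These attributes are related: size is the product of the two entries of shape (4916 × 28 = 137648 above), and len() returns the first entry. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical 3 x 2 frame; a missing value does not change shape or size
demo = pd.DataFrame({
    'duration': [178.0, 169.0, None],
    'imdb_score': [7.9, 7.1, 6.8],
})

n_rows, n_cols = demo.shape
print(demo.size == n_rows * n_cols)
print(len(demo) == n_rows)
print(demo.ndim)
```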

# Number of non-missing values in each column
 In[22]: movie.count()
Out[22]: color                     4897
         director_name             4814
         num_critic_for_reviews    4867
         duration                  4901
                                   ... 
         actor_2_facebook_likes    4903
         imdb_score                4916
         aspect_ratio              4590
         movie_facebook_likes      4916
         Length: 28, dtype: int64

# Minimum of each numeric column
 In[23]: movie.min()
Out[23]: num_critic_for_reviews     1.00
         duration                   7.00
         director_facebook_likes    0.00
         actor_3_facebook_likes     0.00
                                    ... 
         actor_2_facebook_likes     0.00
         imdb_score                 1.60
         aspect_ratio               1.18
         movie_facebook_likes       0.00
         Length: 16, dtype: float64

# Print summary statistics
 In[24]: movie.describe()
Out[24]:

# Use the percentiles parameter to request specific quantiles
 In[25]: pd.options.display.max_rows = 10
 In[26]: movie.describe(percentiles=[.01, .3, .99])
Out[26]:

# Print the number of missing values in each column
 In[27]: pd.options.display.max_rows = 8
 In[28]: movie.isnull().sum()
Out[28]: color                      19
         director_name             102
         num_critic_for_reviews     49
         duration                   15
                                   ... 
         actor_2_facebook_likes     13
         imdb_score                  0
         aspect_ratio              326
         movie_facebook_likes        0
         Length: 28, dtype: int64

# With skipna=False, only numeric columns that have no missing values return a result; the others return NaN
 In[29]: movie.min(skipna=False)
Out[29]: num_critic_for_reviews     NaN
         duration                   NaN
         director_facebook_likes    NaN
         actor_3_facebook_likes     NaN
                                    ... 
         actor_2_facebook_likes     NaN
         imdb_score                 1.6
         aspect_ratio               NaN
         movie_facebook_likes       0.0
         Length: 16, dtype: float64
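The effect of skipna is easiest to see with one column that has a gap and one that does not. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'duration' has a missing value, 'imdb_score' does not
demo = pd.DataFrame({
    'duration': [178.0, np.nan, 90.0],
    'imdb_score': [7.9, 6.8, 1.6],
})

default_min = demo.min()             # missing values are skipped
strict_min = demo.min(skipna=False)  # any missing value makes the result NaN

print(default_min)
print(strict_min)
```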

4. Chaining DataFrame methods

# Use the isnull method to convert every value to a boolean
 In[30]: movie = pd.read_csv('data/movie.csv')
         movie.isnull().head()
Out[30]:

# Use sum to count the True values; the result is a Series
 In[31]: movie.isnull().sum().head()
Out[31]: color                       19
         director_name              102
         num_critic_for_reviews      49
         duration                    15
         director_facebook_likes    102
         dtype: int64

# Calling sum again on that Series gives the number of missing values in the whole DataFrame, as a scalar
 In[32]: movie.isnull().sum().sum()
Out[32]: 2654

# To check whether the DataFrame has any missing value at all, chain two calls to any
 In[33]: movie.isnull().any().any()
Out[33]: True
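Each link in the chain reduces the object by one dimension: DataFrame, then Series, then scalar. A minimal sketch that names every intermediate result, on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing values in total
demo = pd.DataFrame({
    'color': ['Color', None, 'Color'],
    'duration': [np.nan, np.nan, 90.0],
})

mask = demo.isnull()               # DataFrame of booleans, same shape as demo
per_column = mask.sum()            # Series: missing values per column
total_missing = per_column.sum()   # scalar: missing values overall
has_missing = mask.any().any()     # scalar: is anything missing at all?

print(per_column)
print(total_missing, has_missing)
```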

How it works

# isnull returns a DataFrame of the same size, but with every value converted to a boolean
 In[34]: movie.isnull().get_dtype_counts()
Out[34]: bool    28
         dtype: int64

# The object columns of the movie dataset contain missing values. By default, the aggregation methods min, max, and sum return nothing for such columns.
 In[35]: movie[['color', 'movie_title', 'color']].max()
Out[35]: Series([], dtype: float64)

# To force pandas to return a value for each column, the missing values must be filled first. Here they are filled with an empty string:
 In[36]: movie.select_dtypes(['object']).fillna('').max()
Out[36]: color                                                          Color
         director_name                                          étienne Faure
         actor_2_name                                           Zubaida Sahar
         genres                                                       Western
                                                                           ...                        
         movie_imdb_link    http://www.imdb.com/title/tt5574490/?ref_=fn_t...
         language                                                        Zulu
         country                                                 West Germany
         content_rating                                                     X
         Length: 12, dtype: object

5. Working with operators on a DataFrame

# The college dataset contains both numeric and object columns; the integer 5 cannot be added to a string
 In[37]: college = pd.read_csv('data/college.csv')
         college + 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1175             result = expressions.evaluate(op, str_rep, x, y,
-> 1176                                           raise_on_error=True, **eval_kwargs)
   1177         except TypeError:

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs)
    210         return _evaluate(op, op_str, a, b, raise_on_error=raise_on_error,
--> 211                          **eval_kwargs)
    212     return _evaluate_standard(op, op_str, a, b, raise_on_error=raise_on_error)

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, raise_on_error, truediv, reversed, **eval_kwargs)
    121     if result is None:
--> 122         result = _evaluate_standard(op, op_str, a, b, raise_on_error)
    123 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, raise_on_error, **eval_kwargs)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 

TypeError: must be str, not int

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1183             with np.errstate(all='ignore'):
-> 1184                 result = get_result(other)
   1185 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get_result(other)
   1152             else:
-> 1153                 result = func(values, other)
   1154 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1201                     with np.errstate(all='ignore'):
-> 1202                         result[mask] = op(xrav, y)
   1203             else:

TypeError: must be str, not int

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-37-4749f68a2501> in <module>()
      1 college = pd.read_csv('data/college.csv')
----> 2 college + 5

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in f(self, other, axis, level, fill_value)
   1239                 self = self.fillna(fill_value)
   1240 
-> 1241             return self._combine_const(other, na_op)
   1242 
   1243     f.__name__ = name

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _combine_const(self, other, func, raise_on_error)
   3541     def _combine_const(self, other, func, raise_on_error=True):
   3542         new_data = self._data.eval(func=func, other=other,
-> 3543                                    raise_on_error=raise_on_error)
   3544         return self._constructor(new_data)
   3545 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, **kwargs)
   3195 
   3196     def eval(self, **kwargs):
-> 3197         return self.apply('eval', **kwargs)
   3198 
   3199     def quantile(self, **kwargs):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3089 
   3090             kwargs['mgr'] = self
-> 3091             applied = getattr(b, f)(**kwargs)
   3092             result_blocks = _extend_blocks(applied, result_blocks)
   3093 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1189             raise
   1190         except Exception as detail:
-> 1191             result = handle_error()
   1192 
   1193         # technically a broadcast error in numpy can 'work' by returning a

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in handle_error()
   1172                 # The 'detail' variable is defined in outer scope.
   1173                 raise TypeError('Could not operate %s with block values %s' %
-> 1174                                 (repr(other), str(detail)))  # noqa
   1175             else:
   1176                 # return the values

TypeError: Could not operate 5 with block values must be str, not int
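The failure comes from the object column, not from the numeric ones. One way around it is to keep only the numeric columns before applying the operator. A minimal sketch: college_demo is made up, STABBR is a hypothetical text column, and its dtype is set to object explicitly so the behavior matches the traceback above.

```python
import pandas as pd

# Hypothetical mixed frame: one text column, two numeric columns
college_demo = pd.DataFrame({
    'STABBR': pd.Series(['AL', 'AL'], dtype=object),
    'UGDS_WHITE': [0.25, 0.5],
    'UGDS_BLACK': [0.75, 0.5],
})

try:
    college_demo + 5          # 'AL' + 5 is not defined
    raised = False
except TypeError:
    raised = True

# Restrict to numeric columns, then the scalar is broadcast to every cell
shifted = college_demo.select_dtypes(include=['number']) + 5

print(raised)
print(shifted)
```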

# Set INSTNM as the row index, then use filter(like='UGDS_') to keep the undergraduate race-percentage columns
In[38]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
        college_ugds_ = college.filter(like='UGDS_')
In[39]: college == 'asdf' # From the Jupyter notebook: comparing college with 'asdf' is meaningless, so this cell can be ignored
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-697c8af60bcf> in <module>()
----> 1 college == 'asdf'

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in f(self, other)
   1302             # straight boolean comparisions we want to allow all columns
   1303             # (regardless of dtype to pass thru) See #4537 for discussion.
-> 1304             res = self._combine_const(other, func, raise_on_error=False)
   1305             return res.fillna(True).astype(bool)
   1306 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _combine_const(self, other, func, raise_on_error)
   3541     def _combine_const(self, other, func, raise_on_error=True):
   3542         new_data = self._data.eval(func=func, other=other,
-> 3543                                    raise_on_error=raise_on_error)
   3544         return self._constructor(new_data)
   3545 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, **kwargs)
   3195 
   3196     def eval(self, **kwargs):
-> 3197         return self.apply('eval', **kwargs)
   3198 
   3199     def quantile(self, **kwargs):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3089 
   3090             kwargs['mgr'] = self
-> 3091             applied = getattr(b, f)(**kwargs)
   3092             result_blocks = _extend_blocks(applied, result_blocks)
   3093 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1203 
   1204                 raise TypeError('Could not compare [%s] with block values' %
-> 1205                                 repr(other))
   1206 
   1207         # transpose if needed

TypeError: Could not compare ['asdf'] with block values

# Look at the first 5 rows
 In[40]: college_ugds_.head()
Out[40]:

# The data is now homogeneous (all numeric), so arithmetic works
 In[41]: college_ugds_.head() + .00501
Out[41]:

# Use floor division (//) to truncate to whole-number percentages
 In[42]: (college_ugds_.head() + .00501) // .01
Out[42]:

# Then divide by 100
 In[43]: college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
         college_ugds_op_round.head()
Out[43]:

# Round to two decimal places
 In[44]: college_ugds_round = (college_ugds_ + .00001).round(2)
         college_ugds_round.head()
Out[44]:

 In[45]: .045 + .005
Out[45]: 0.049999999999999996

 In[46]: college_ugds_op_round.equals(college_ugds_round)
Out[46]: True

# The method equivalents of the operators give the same result
 In[47]: college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
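Each arithmetic operator has a method counterpart (add, floordiv, div), so the chained-method form performs the same floating-point steps as the operator form. A minimal sketch on a made-up frame (ugds_demo and its values are hypothetical):

```python
import pandas as pd

# Hypothetical race-percentage frame; values are illustrative only
ugds_demo = pd.DataFrame(
    {'UGDS_WHITE': [0.0333, 0.5922], 'UGDS_BLACK': [0.9353, 0.2600]},
    index=['School A', 'School B'],
)

with_operators = (ugds_demo + .00501) // .01 / 100
with_methods = ugds_demo.add(.00501).floordiv(.01).div(100)

print(with_operators.equals(with_methods))
```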

6. Comparing missing values

# Pandas uses the NumPy NaN object (np.nan) to represent missing values. It is a special object that is not equal to itself:
 In[48]: np.nan == np.nan
Out[48]: False

# Python's None object, by contrast, is equal to itself
 In[49]: None == None
Out[49]: True

# Every comparison with np.nan returns False, except not-equal:
 In[50]: 5 > np.nan
Out[50]: False

 In[51]: np.nan > 5
Out[51]: False

 In[52]: 5 != np.nan
Out[52]: True

# Compare every value in college_ugds_ with .0019; the result is a boolean DataFrame
 In[53]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college_ugds_ = college.filter(like='UGDS_')
 In[54]: college_ugds_.head() == .0019
Out[54]:

# Compare a DataFrame with a DataFrame
 In[55]: college_self_compare = college_ugds_ == college_ugds_
         college_self_compare.head()
Out[55]:

# Use all() to check whether every value is True. Each column returns False because missing values are not equal to each other.
 In[56]: college_self_compare.all()
Out[56]: UGDS_WHITE    False
         UGDS_BLACK    False
         UGDS_HISP     False
         UGDS_ASIAN    False
                        ...  
         UGDS_NHPI     False
         UGDS_2MOR     False
         UGDS_NRA      False
         UGDS_UNKN     False
         Length: 9, dtype: bool

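# Comparing against np.nan with == and then summing does not count missing values; every count is 0 because NaN never equals anything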
 In[57]: (college_ugds_ == np.nan).sum()
Out[57]: UGDS_WHITE    0
         UGDS_BLACK    0
         UGDS_HISP     0
         UGDS_ASIAN    0
                      ..
         UGDS_NHPI     0
         UGDS_2MOR     0
         UGDS_NRA      0
         UGDS_UNKN     0
         Length: 9, dtype: int64

# The main way to count missing values is the isnull method:
 In[58]: college_ugds_.isnull().sum()
Out[58]: UGDS_WHITE    661
         UGDS_BLACK    661
         UGDS_HISP     661
         UGDS_ASIAN    661
                       ... 
         UGDS_NHPI     661
         UGDS_2MOR     661
         UGDS_NRA      661
         UGDS_UNKN     661
         Length: 9, dtype: int64
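The contrast between the two counts above can be reproduced on a small frame. A minimal sketch, with ugds_demo made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one fully missing row
ugds_demo = pd.DataFrame({
    'UGDS_WHITE': [0.5, np.nan, 0.25],
    'UGDS_BLACK': [0.5, np.nan, 0.75],
})

wrong_counts = (ugds_demo == np.nan).sum()   # always 0: NaN == NaN is False
right_counts = ugds_demo.isnull().sum()      # counts the missing cells

print(wrong_counts.tolist())
print(right_counts.tolist())
```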

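# The most direct way to compare two DataFrames is the equals() method. For tests, assert_frame_equal serves the same purpose:
# it returns None (so there is no Out cell) when the frames match and raises an AssertionError otherwise
 In[59]: from pandas.testing import assert_frame_equal
 In[60]: assert_frame_equal(college_ugds_, college_ugds_)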

# The eq() method works like ==: it compares element by element, unlike the equals() method mentioned above
 In[61]: college_ugds_.eq(.0019).head()
Out[61]:
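The three comparison tools behave differently when missing values are present. A minimal sketch on a made-up frame: == and eq() treat NaN as unequal, while equals() and assert_frame_equal treat NaNs in the same position as equal.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frame containing a missing value
ugds_demo = pd.DataFrame({
    'UGDS_WHITE': [0.5, np.nan],
    'UGDS_BLACK': [0.5, 0.75],
})

elementwise = ugds_demo.eq(ugds_demo)              # same as ugds_demo == ugds_demo
all_cells_equal = elementwise.all().all()          # False because of the NaN cell
frames_equal = ugds_demo.equals(ugds_demo)         # True: NaNs in place match
result = assert_frame_equal(ugds_demo, ugds_demo)  # passes silently, returns None

print(all_cells_equal, frames_equal, result)
```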

7. Transposing the direction of a DataFrame operation

 In[62]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college_ugds_ = college.filter(like='UGDS_')
         college_ugds_.head()
Out[62]:

# count() returns the number of non-missing values
 In[63]: college_ugds_.count()
Out[63]: UGDS_WHITE    6874
         UGDS_BLACK    6874
         UGDS_HISP     6874
         UGDS_ASIAN    6874
                       ... 
         UGDS_NHPI     6874
         UGDS_2MOR     6874
         UGDS_NRA      6874
         UGDS_UNKN     6874
         Length: 9, dtype: int64

# axis defaults to 0
 In[64]: college_ugds_.count(axis=0)
Out[64]: UGDS_WHITE    6874
         UGDS_BLACK    6874
         UGDS_HISP     6874
         UGDS_ASIAN    6874
                       ... 
         UGDS_NHPI     6874
         UGDS_2MOR     6874
         UGDS_NRA      6874
         UGDS_UNKN     6874
         Length: 9, dtype: int64

# This is equivalent to axis='index'
 In[65]: college_ugds_.count(axis='index')
Out[65]: UGDS_WHITE    6874
         UGDS_BLACK    6874
         UGDS_HISP     6874
         UGDS_ASIAN    6874
                       ... 
         UGDS_NHPI     6874
         UGDS_2MOR     6874
         UGDS_NRA      6874
         UGDS_UNKN     6874
         Length: 9, dtype: int64

# Count the non-missing values in each row
 In[66]: college_ugds_.count(axis='columns').head()
Out[66]: INSTNM
         Alabama A & M University               9
         University of Alabama at Birmingham    9
         Amridge University                     9
         University of Alabama in Huntsville    9
         Alabama State University               9
         dtype: int64

# Besides counting the non-missing values in each row, summing each row confirms that the percentages add up to 1
 In[67]: college_ugds_.sum(axis='columns').head()
Out[67]: INSTNM
         Alabama A & M University               1.0000
         University of Alabama at Birmingham    0.9999
         Amridge University                     1.0000
         University of Alabama in Huntsville    1.0000
         Alabama State University               1.0000
         dtype: float64
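The axis argument names the axis that is collapsed: axis='index' (0) reduces down the rows and returns one value per column, while axis='columns' (1) reduces across the columns and returns one value per row. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'School B' is entirely missing
ugds_demo = pd.DataFrame(
    {'UGDS_WHITE': [0.5, np.nan, 0.25], 'UGDS_BLACK': [0.5, np.nan, 0.75]},
    index=['School A', 'School B', 'School C'],
)

per_column = ugds_demo.count(axis='index')    # one count per column
per_row = ugds_demo.count(axis='columns')     # one count per row
row_totals = ugds_demo.sum(axis='columns')    # an all-missing row sums to 0.0

print(per_column.tolist())
print(per_row.tolist())
print(row_totals.tolist())
```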

# Use the median to get a sense of the distribution of each column
 In[68]: college_ugds_.median(axis='index')
Out[68]: UGDS_WHITE    0.55570
         UGDS_BLACK    0.10005
         UGDS_HISP     0.07140
         UGDS_ASIAN    0.01290
                        ...   
         UGDS_NHPI     0.00000
         UGDS_2MOR     0.01750
         UGDS_NRA      0.00000
         UGDS_UNKN     0.01430
         Length: 9, dtype: float64

# The cumulative sum cumsum() across columns makes it easy to see the combined share of White, Black, and Hispanic students
 In[69]: college_ugds_cumsum = college_ugds_.cumsum(axis=1)
         college_ugds_cumsum.head()
Out[69]:

# Sort by the UGDS_HISP column in descending order
 In[70]: college_ugds_cumsum.sort_values('UGDS_HISP', ascending=False)
Out[70]:
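With axis=1 the running total moves left to right, so each column holds the sum of itself and every column before it. A minimal sketch on a made-up frame, with values chosen so the sums are exact in floating point:

```python
import pandas as pd

# Hypothetical percentages; each row adds up to 1
ugds_demo = pd.DataFrame(
    {'UGDS_WHITE': [0.25, 0.5], 'UGDS_BLACK': [0.25, 0.25],
     'UGDS_HISP': [0.25, 0.125], 'UGDS_ASIAN': [0.25, 0.125]},
    index=['School A', 'School B'],
)

running = ugds_demo.cumsum(axis=1)
# UGDS_HISP now holds WHITE + BLACK + HISP for each school
ranked = running.sort_values('UGDS_HISP', ascending=False)

print(running)
print(list(ranked.index))
```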

8. Determining college campus diversity

# The 10 most diverse colleges in the United States according to US News
 In[71]: pd.read_csv('data/college_diversity.csv', index_col='School')
Out[71]:

 In[72]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college_ugds_ = college.filter(like='UGDS_')
         college_ugds_.head()
Out[72]:

 In[73]: college_ugds_.isnull().sum(axis=1).sort_values(ascending=False).head()
Out[73]: INSTNM
         Excel Learning Center-San Antonio South         9
         Philadelphia College of Osteopathic Medicine    9
         Assemblies of God Theological Seminary          9
         Episcopal Divinity School                       9
         Phillips Graduate Institute                     9
         dtype: int64

# Drop the rows in which every column is missing
 In[74]: college_ugds_ = college_ugds_.dropna(how='all')
 In[75]: college_ugds_.isnull().sum()
Out[75]: UGDS_WHITE    0
         UGDS_BLACK    0
         UGDS_HISP     0
         UGDS_ASIAN    0
                       ..
         UGDS_NHPI     0
         UGDS_2MOR     0
         UGDS_NRA      0
         UGDS_UNKN     0
         Length: 9, dtype: int64
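how='all' only removes rows where every value is missing; a row with a single gap survives. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'School B' is fully missing, 'School C' only partly
ugds_demo = pd.DataFrame(
    {'UGDS_WHITE': [0.5, np.nan, 0.25], 'UGDS_BLACK': [0.5, np.nan, np.nan]},
    index=['School A', 'School B', 'School C'],
)

drop_all_missing = ugds_demo.dropna(how='all')   # removes only School B
drop_any_missing = ugds_demo.dropna(how='any')   # the default: also removes School C

print(list(drop_all_missing.index))
print(list(drop_any_missing.index))
```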

# Use the greater-than-or-equal method ge() to turn the DataFrame into a boolean matrix
 In[76]: college_ugds_.ge(.15).head()
Out[76]:

# Sum the True values in each row
 In[77]: diversity_metric = college_ugds_.ge(.15).sum(axis='columns')
         diversity_metric.head()
Out[77]: INSTNM
         Alabama A & M University               1
         University of Alabama at Birmingham    2
         Amridge University                     3
         University of Alabama in Huntsville    1
         Alabama State University               1
         dtype: int64

# Use value_counts() to see the distribution
 In[78]: diversity_metric.value_counts()
Out[78]: 1    3042
         2    2884
         3     876
         4      63
         0       7
         5       2
         dtype: int64
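The metric counts, for each school, how many race categories make up at least 15% of undergraduates. A minimal sketch on a made-up frame (ugds_demo and its values are hypothetical):

```python
import pandas as pd

# Hypothetical percentages for three schools
ugds_demo = pd.DataFrame(
    {'UGDS_WHITE': [0.80, 0.30, 0.20], 'UGDS_BLACK': [0.10, 0.30, 0.20],
     'UGDS_HISP': [0.05, 0.25, 0.20], 'UGDS_ASIAN': [0.05, 0.15, 0.40]},
    index=['School A', 'School B', 'School C'],
)

at_least_15 = ugds_demo.ge(.15)                     # boolean matrix
diversity_metric = at_least_15.sum(axis='columns')  # True counts as 1
distribution = diversity_metric.value_counts()      # schools per metric value

print(diversity_metric.tolist())
print(distribution)
```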

# See which schools have the most race categories at or above 15%
 In[79]: diversity_metric.sort_values(ascending=False).head()
Out[79]: INSTNM
         Regency Beauty Institute-Austin          5
         Central Texas Beauty College-Temple      5
         Sullivan and Cogliano Training Center    4
         Ambria College of Nursing                4
         Berkeley College-New York                4
         dtype: int64

# Use the .loc indexer to view the rows for those index labels
 In[80]: college_ugds_.loc[['Regency Beauty Institute-Austin', 
                          'Central Texas Beauty College-Temple']]
Out[80]:

# See how the top five schools on the US News diversity list score on diversity_metric
 In[81]: us_news_top = ['Rutgers University-Newark', 
               'Andrews University', 
               'Stanford University', 
               'University of Houston',
               'University of Nevada-Las Vegas']
 In[82]: diversity_metric.loc[us_news_top]
Out[82]: INSTNM
         Rutgers University-Newark         4
         Andrews University                3
         Stanford University               3
         University of Houston             3
         University of Nevada-Las Vegas    3
         dtype: int64

# The largest single race percentage shows which schools are the least diverse
 In[83]: college_ugds_.max(axis=1).sort_values(ascending=False).head(10)
Out[83]: INSTNM
         Dewey University-Manati                               1.0
         Yeshiva and Kollel Harbotzas Torah                    1.0
         Mr Leon's School of Hair Design-Lewiston              1.0
         Dewey University-Bayamon                              1.0
                                                               ... 
         Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
         Yeshiva Shaar Hatorah                                 1.0
         Bais Medrash Elyon                                    1.0
         Yeshiva of Nitra Rabbinical College                   1.0
         Length: 10, dtype: float64

# Look at the row for Talmudical Seminary Oholei Torah
 In[84]: college_ugds_.loc['Talmudical Seminary Oholei Torah']
Out[84]: UGDS_WHITE    1.0
         UGDS_BLACK    0.0
         UGDS_HISP     0.0
         UGDS_ASIAN    0.0
                       ... 
         UGDS_NHPI     0.0
         UGDS_2MOR     0.0
         UGDS_NRA      0.0
         UGDS_UNKN     0.0
         Name: Talmudical Seminary Oholei Torah, Length: 9, dtype: float64

# Check whether any school has all nine race categories above 1%
 In[85]: (college_ugds_ > .01).all(axis=1).any()
Out[85]: True
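The last check chains a comparison, a row-wise all, and a final any. A minimal sketch on a made-up two-category frame that also shows the max(axis=1) measure used earlier:

```python
import pandas as pd

# Hypothetical percentages; 'School A' consists of a single group
ugds_demo = pd.DataFrame(
    {'UGDS_WHITE': [1.0, 0.6], 'UGDS_BLACK': [0.0, 0.4]},
    index=['School A', 'School B'],
)

largest_share = ugds_demo.max(axis=1)                 # high value = less diverse
every_group_present = (ugds_demo > .01).all(axis=1)   # per school
any_school_qualifies = every_group_present.any()      # over all schools

print(largest_share.tolist())
print(every_group_present.tolist(), any_school_qualifies)
```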


Last edited: 2018.10.26 18:20:23

