Pandas Cookbook, Chapter 02: Essential DataFrame Operations


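Chapter 01: Pandas Foundations
Chapter 02: DataFrame Operations
Chapter 03: Beginning Data Analysis
Chapter 04: Selecting Subsets of Data
Chapter 05: Boolean Indexing
Chapter 06: Index Alignment
Chapter 07: Grouping for Aggregation, Filtration, and Transformation
Chapter 08: Data Cleaning
Chapter 09: Combining Pandas Objects
Chapter 10: Time Series Analysis
Chapter 11: Visualization with Matplotlib, Pandas, and Seaborn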


In[1]: import pandas as pd
       import numpy as np
       pd.options.display.max_columns = 40

1. Selecting multiple DataFrame columns

# Select multiple columns with a list
 In[2]: movie = pd.read_csv('data/movie.csv')
        movie_actor_director = movie[['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']]
        movie_actor_director.head()
Out[2]: 
# Select a single column; a one-item list returns a DataFrame rather than a Series
 In[3]: movie[['director_name']].head()
Out[3]: 
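# The wrong way to select multiple columns: without the inner list, the names form a single tuple key and raise a KeyError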
 In[4]: movie['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-954222273e42> in <module>()
----> 1 movie['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
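
The difference between these bracket forms can be checked on a small frame. The sketch below uses a made-up three-column DataFrame named `toy` (an assumption, not the movie dataset): a list inside the brackets returns a DataFrame, a bare string returns a Series, and comma-separated names are packed into one tuple key, which is what raises the KeyError in the traceback.

```python
import pandas as pd

# Hypothetical toy data standing in for data/movie.csv
toy = pd.DataFrame({
    'director_name': ['A. Director', 'B. Director'],
    'actor_1_name': ['Actor One', 'Actor Two'],
    'duration': [120, 95],
})

as_frame = toy[['director_name']]   # list of names -> DataFrame
as_series = toy['director_name']    # single name   -> Series

try:
    # Python packs the comma-separated names into a single tuple key
    toy['director_name', 'actor_1_name']
    raised = False
except KeyError:
    raised = True

print(type(as_frame).__name__, type(as_series).__name__, raised)
```

Wrapping the names in a list, as in In[2], is the fix.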

There's more

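# Assign the list to a variable to make selecting multiple columns easier
 In[6]: cols = ['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
        movie_actor_director = movie[cols]
        # Count the columns of each data type
        movie.get_dtype_counts()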
Out[6]: float64    13
        int64       3
        object     11
        dtype: int64
# Use select_dtypes() to select the integer columns
 In[7]: movie.select_dtypes(include=['int']).head()
Out[7]: 
# Select all numeric columns
 In[8]: movie.select_dtypes(include=['number']).head()
Out[8]: 
# Use filter() to select the columns whose names contain a given substring
 In[9]: movie.filter(like='facebook').head()
Out[9]: 
# Select columns with a regular expression (a raw string avoids an invalid-escape warning)
 In[10]: movie.filter(regex=r'\d').head()
Out[10]: 
# Pass a list to the items parameter of filter() to select columns; names that do not exist are silently ignored
 In[11]: movie.filter(items=['actor_1_name', 'asdf']).head()
Out[11]: 
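
The selection helpers above can be compared side by side. This is a minimal sketch on an invented four-column frame (`toy` and its values are assumptions; only the column-name patterns mimic the movie dataset), collecting the column names each call keeps.

```python
import pandas as pd

# Hypothetical toy data; only the column names resemble the movie dataset
toy = pd.DataFrame({
    'actor_1_name': ['Actor One', 'Actor Two'],
    'actor_1_facebook_likes': [1000.0, 250.0],
    'movie_facebook_likes': [33000, 0],
    'title_year': [2009.0, 2015.0],
})

numeric_cols = toy.select_dtypes(include=['number']).columns.tolist()
like_cols = toy.filter(like='facebook').columns.tolist()    # substring match
regex_cols = toy.filter(regex=r'\d').columns.tolist()       # names containing a digit
item_cols = toy.filter(items=['actor_1_name', 'asdf']).columns.tolist()  # 'asdf' is dropped

print(numeric_cols)
print(like_cols)
print(regex_cols)
print(item_cols)
```

Unlike `toy[['actor_1_name', 'asdf']]`, which raises a KeyError, `filter(items=...)` keeps whatever names it finds.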

2. Ordering column names

# Read the movie dataset
 In[12]: movie = pd.read_csv('data/movie.csv')
 In[13]: movie.head()
Out[13]: 
# Print the column index
 In[14]: movie.columns
Out[14]: Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')
# Arrange the columns in a specified order
 In[15]: disc_core = ['movie_title','title_year', 'content_rating','genres']
         disc_people = ['director_name','actor_1_name', 'actor_2_name','actor_3_name']
         disc_other = ['color','country','language','plot_keywords','movie_imdb_link']
         cont_fb = ['director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes',
                    'actor_3_facebook_likes', 'cast_total_facebook_likes', 'movie_facebook_likes']
         cont_finance = ['budget','gross']
         cont_num_reviews = ['num_voted_users','num_user_for_reviews', 'num_critic_for_reviews']
         cont_other = ['imdb_score','duration', 'aspect_ratio', 'facenumber_in_poster']

 In[16]: new_col_order = disc_core + disc_people + disc_other + \
                    cont_fb + cont_finance + cont_num_reviews + cont_other
         set(movie.columns) == set(new_col_order)
Out[16]: True

 In[17]: movie2 = movie[new_col_order]
         movie2.head()
Out[17]: 
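
The reordering pattern can be tried on a tiny frame. In this sketch the frame `toy` and the group names `discrete` and `continuous` are invented for illustration. The length check is an addition to the set comparison used in In[16], because a set comparison alone would not notice a name listed twice.

```python
import pandas as pd

# Hypothetical toy data with four of the movie dataset's column names
toy = pd.DataFrame({
    'gross': [1.0],
    'movie_title': ['X'],
    'budget': [2.0],
    'title_year': [2009],
})

discrete = ['movie_title', 'title_year']
continuous = ['budget', 'gross']
new_order = discrete + continuous

# Same set of names, and no name listed twice
assert set(toy.columns) == set(new_order)
assert len(new_order) == len(toy.columns)

toy2 = toy[new_order]
print(toy2.columns.tolist())
```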

3. Operating on the entire DataFrame

 In[18]: pd.options.display.max_rows = 8
         movie = pd.read_csv('data/movie.csv')
         # Print the number of rows and columns
         movie.shape 
Out[18]: (4916, 28)
# Print the total number of elements (rows times columns)
 In[19]: movie.size
Out[19]: 137648
# Number of dimensions of the dataset
 In[20]: movie.ndim
Out[20]: 2
# Length of the dataset (number of rows)
 In[21]: len(movie)
Out[21]: 4916
# Number of non-missing values in each column
 In[22]: movie.count()
Out[22]: color                     4897
         director_name             4814
         num_critic_for_reviews    4867
         duration                  4901
                                   ... 
         actor_2_facebook_likes    4903
         imdb_score                4916
         aspect_ratio              4590
         movie_facebook_likes      4916
         Length: 28, dtype: int64
# Minimum of each column
 In[23]: movie.min()
Out[23]: num_critic_for_reviews     1.00
         duration                   7.00
         director_facebook_likes    0.00
         actor_3_facebook_likes     0.00
                                    ... 
         actor_2_facebook_likes     0.00
         imdb_score                 1.60
         aspect_ratio               1.18
         movie_facebook_likes       0.00
         Length: 16, dtype: float64
# Print summary statistics
 In[24]: movie.describe()
Out[24]: 
# Use the percentiles parameter to specify which quantiles to report
 In[25]: pd.options.display.max_rows = 10
 In[26]: movie.describe(percentiles=[.01, .3, .99])
Out[26]: 
# Print the number of missing values in each column
 In[27]: pd.options.display.max_rows = 8
 In[28]: movie.isnull().sum()
Out[28]: color                      19
         director_name             102
         num_critic_for_reviews     49
         duration                   15
                                   ... 
         actor_2_facebook_likes     13
         imdb_score                  0
         aspect_ratio              326
         movie_facebook_likes        0
         Length: 28, dtype: int64

There's more

# With skipna=False, only numeric columns without missing values return a result; the others return NaN
 In[29]: movie.min(skipna=False)
Out[29]: num_critic_for_reviews     NaN
         duration                   NaN
         director_facebook_likes    NaN
         actor_3_facebook_likes     NaN
                                    ... 
         actor_2_facebook_likes     NaN
         imdb_score                 1.6
         aspect_ratio               NaN
         movie_facebook_likes       0.0
         Length: 16, dtype: float64
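
The relationship between these attributes and methods can be verified on a small example. The frame below is made up (two columns, one missing value). It shows that `count()` plus `isnull().sum()` equals the number of rows, which matches the outputs above (4897 + 19 = 4916 for `color`), and how `skipna` changes `min()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single missing value
toy = pd.DataFrame({
    'duration': [100.0, np.nan, 90.0],
    'imdb_score': [7.1, 6.5, 8.0],
})

n_rows, n_cols = toy.shape
non_missing = toy.count()            # non-missing values per column
missing = toy.isnull().sum()         # missing values per column
min_default = toy.min()              # skipna=True: NaN is ignored
min_strict = toy.min(skipna=False)   # NaN propagates into the result

print(toy.size == n_rows * n_cols)
print((non_missing + missing == len(toy)).all())
print(min_default['duration'], min_strict['duration'])
```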

4. Chaining DataFrame methods

# Use the isnull method to convert every value to a boolean
 In[30]: movie = pd.read_csv('data/movie.csv')
         movie.isnull().head()
Out[30]: 
# Use sum to count the True values in each column; the result is a Series
 In[31]: movie.isnull().sum().head()
Out[31]: color                       19
         director_name              102
         num_critic_for_reviews      49
         duration                    15
         director_facebook_likes    102
         dtype: int64
# Apply sum to that Series again to get the total number of missing values in the DataFrame; the result is a scalar
 In[32]: movie.isnull().sum().sum()
Out[32]: 2654
# To check whether the DataFrame has any missing values at all, chain two any calls
 In[33]: movie.isnull().any().any()
Out[33]: True
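
Each link in the chain returns a smaller object: DataFrame, then Series, then scalar. A minimal sketch on invented data (the frame `toy` is an assumption) makes the intermediate types visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three missing values in total
toy = pd.DataFrame({
    'a': [1.0, np.nan],
    'b': [np.nan, np.nan],
    'c': ['x', 'y'],
})

step1 = toy.isnull()          # DataFrame of booleans, same shape as toy
step2 = step1.sum()           # Series: missing count per column
total_missing = step2.sum()   # scalar: missing count for the whole frame
has_missing = toy.isnull().any().any()

print(type(step1).__name__, type(step2).__name__, total_missing, has_missing)
```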

How it works

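# isnull returns a DataFrame of the same size in which every value is a boolean
# (get_dtype_counts was removed in pandas 1.0; newer versions can use .dtypes.value_counts())
 In[34]: movie.isnull().get_dtype_counts()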
Out[34]: bool    28
         dtype: int64

There's more

# The object columns of the movie dataset contain missing values. By default, the aggregation methods min, max, and sum return nothing for them.
 In[35]: movie[['color', 'movie_title', 'color']].max()
Out[35]: Series([], dtype: float64)
# To force pandas to return a value for each column, the missing values must be filled. Here they are filled with an empty string:
 In[36]: movie.select_dtypes(['object']).fillna('').max()
Out[36]: color                                                          Color
         director_name                                          étienne Faure
         actor_2_name                                           Zubaida Sahar
         genres                                                       Western
                                                                           ...                        
         movie_imdb_link    http://www.imdb.com/title/tt5574490/?ref_=fn_t...
         language                                                        Zulu
         country                                                 West Germany
         content_rating                                                     X
         Length: 12, dtype: object

5. Working with operators on a DataFrame

# The college dataset holds both numeric and object values; the integer 5 cannot be added to a string
 In[37]: college = pd.read_csv('data/college.csv')
         college + 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1175             result = expressions.evaluate(op, str_rep, x, y,
-> 1176                                           raise_on_error=True, **eval_kwargs)
   1177         except TypeError:

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs)
    210         return _evaluate(op, op_str, a, b, raise_on_error=raise_on_error,
--> 211                          **eval_kwargs)
    212     return _evaluate_standard(op, op_str, a, b, raise_on_error=raise_on_error)

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, raise_on_error, truediv, reversed, **eval_kwargs)
    121     if result is None:
--> 122         result = _evaluate_standard(op, op_str, a, b, raise_on_error)
    123 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, raise_on_error, **eval_kwargs)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 

TypeError: must be str, not int

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1183             with np.errstate(all='ignore'):
-> 1184                 result = get_result(other)
   1185 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get_result(other)
   1152             else:
-> 1153                 result = func(values, other)
   1154 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1201                     with np.errstate(all='ignore'):
-> 1202                         result[mask] = op(xrav, y)
   1203             else:

TypeError: must be str, not int

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-37-4749f68a2501> in <module>()
      1 college = pd.read_csv('data/college.csv')
----> 2 college + 5

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in f(self, other, axis, level, fill_value)
   1239                 self = self.fillna(fill_value)
   1240 
-> 1241             return self._combine_const(other, na_op)
   1242 
   1243     f.__name__ = name

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _combine_const(self, other, func, raise_on_error)
   3541     def _combine_const(self, other, func, raise_on_error=True):
   3542         new_data = self._data.eval(func=func, other=other,
-> 3543                                    raise_on_error=raise_on_error)
   3544         return self._constructor(new_data)
   3545 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, **kwargs)
   3195 
   3196     def eval(self, **kwargs):
-> 3197         return self.apply('eval', **kwargs)
   3198 
   3199     def quantile(self, **kwargs):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3089 
   3090             kwargs['mgr'] = self
-> 3091             applied = getattr(b, f)(**kwargs)
   3092             result_blocks = _extend_blocks(applied, result_blocks)
   3093 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1189             raise
   1190         except Exception as detail:
-> 1191             result = handle_error()
   1192 
   1193         # technically a broadcast error in numpy can 'work' by returning a

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in handle_error()
   1172                 # The 'detail' variable is defined in outer scope.
   1173                 raise TypeError('Could not operate %s with block values %s' %
-> 1174                                 (repr(other), str(detail)))  # noqa
   1175             else:
   1176                 # return the values

TypeError: Could not operate 5 with block values must be str, not int
# Set INSTNM as the row index and use UGDS_ to filter out the undergraduate race proportions
In[38]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
        college_ugds_ = college.filter(like='UGDS_')
In[39]: college == 'asdf' # From the Jupyter notebook: comparing college with 'asdf' is meaningless, so this cell can be ignored
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-697c8af60bcf> in <module>()
----> 1 college == 'asdf'

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in f(self, other)
   1302             # straight boolean comparisions we want to allow all columns
   1303             # (regardless of dtype to pass thru) See #4537 for discussion.
-> 1304             res = self._combine_const(other, func, raise_on_error=False)
   1305             return res.fillna(True).astype(bool)
   1306 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _combine_const(self, other, func, raise_on_error)
   3541     def _combine_const(self, other, func, raise_on_error=True):
   3542         new_data = self._data.eval(func=func, other=other,
-> 3543                                    raise_on_error=raise_on_error)
   3544         return self._constructor(new_data)
   3545 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, **kwargs)
   3195 
   3196     def eval(self, **kwargs):
-> 3197         return self.apply('eval', **kwargs)
   3198 
   3199     def quantile(self, **kwargs):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3089 
   3090             kwargs['mgr'] = self
-> 3091             applied = getattr(b, f)(**kwargs)
   3092             result_blocks = _extend_blocks(applied, result_blocks)
   3093 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1203 
   1204                 raise TypeError('Could not compare [%s] with block values' %
-> 1205                                 repr(other))
   1206 
   1207         # transpose if needed

TypeError: Could not compare ['asdf'] with block values
# View the first 5 rows
 In[40]: college_ugds_.head()
Out[40]: 
# The data is now homogeneous, so arithmetic works
 In[41]: college_ugds_.head() + .00501
Out[41]: 
# Use floor division to get whole-number percentages
 In[42]: (college_ugds_.head() + .00501) // .01
Out[42]: 
# Then divide by 100
 In[43]: college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
         college_ugds_op_round.head()
Out[43]: 
# Round to two decimal places
 In[44]: college_ugds_round = (college_ugds_ + .00001).round(2)
         college_ugds_round.head()
Out[44]: 
 In[45]: .045 + .005
Out[45]: 0.049999999999999996
 In[46]: college_ugds_op_round.equals(college_ugds_round)
Out[46]: True

There's more

# The DataFrame methods that mirror the operators give the same result
 In[47]: college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
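
The equivalence between the operator chain and the method chain can be confirmed directly. The sketch below uses a small frame of invented proportions (not rows from college.csv). The last line reproduces the floating-point surprise from In[45], which is why the recipe adds .00501 rather than exactly .005.

```python
import pandas as pd

# Hypothetical proportions; any floats in [0, 1] would do
toy = pd.DataFrame({
    'UGDS_WHITE': [0.1234, 0.5678],
    'UGDS_BLACK': [0.8766, 0.4322],
})

with_operators = (toy + .00501) // .01 / 100
with_methods = toy.add(.00501).floordiv(.01).div(100)

# Both chains apply the same operations in the same order
print(with_operators.equals(with_methods))

# Binary floating point: the sum falls just short of 0.05
print(.045 + .005 == .05)
```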

6. Comparing missing values

# Pandas uses the NumPy NaN (np.nan) object to represent missing values. It is an unusual object that is not equal to itself:
 In[48]: np.nan == np.nan
Out[48]: False
# Python's None object is equal to itself
 In[49]: None == None
Out[49]: True
# Every comparison with np.nan returns False, except not-equal:
 In[50]: 5 > np.nan
Out[50]: False

 In[51]: np.nan > 5
Out[51]: False

 In[52]: 5 != np.nan
Out[52]: True
# Compare every value of college_ugds_ with .0019; the result is a boolean DataFrame
 In[53]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college_ugds_ = college.filter(like='UGDS_')
 In[54]: college_ugds_.head() == .0019
Out[54]: 
# Compare a DataFrame with a DataFrame
 In[55]: college_self_compare = college_ugds_ == college_ugds_
         college_self_compare.head()
Out[55]: 
# Use all() to check whether every value is True; the result is False because missing values do not equal each other
 In[56]: college_self_compare.all()
Out[56]: UGDS_WHITE    False
         UGDS_BLACK    False
         UGDS_HISP     False
         UGDS_ASIAN    False
                        ...  
         UGDS_NHPI     False
         UGDS_2MOR     False
         UGDS_NRA      False
         UGDS_UNKN     False
         Length: 9, dtype: bool
# Comparing against np.nan with == and summing counts nothing, because no value equals np.nan
 In[57]: (college_ugds_ == np.nan).sum()
Out[57]: UGDS_WHITE    0
         UGDS_BLACK    0
         UGDS_HISP     0
         UGDS_ASIAN    0
                      ..
         UGDS_NHPI     0
         UGDS_2MOR     0
         UGDS_NRA      0
         UGDS_UNKN     0
         Length: 9, dtype: int64
# The primary way to count missing values is the isnull method:
 In[58]: college_ugds_.isnull().sum()
Out[58]: UGDS_WHITE    661
         UGDS_BLACK    661
         UGDS_HISP     661
         UGDS_ASIAN    661
                       ... 
         UGDS_NHPI     661
         UGDS_2MOR     661
         UGDS_NRA      661
         UGDS_UNKN     661
         Length: 9, dtype: int64
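
The contrast between `== np.nan` and `isnull()` shows up on any frame with missing values. Here is a minimal sketch on invented data (the frame `toy` is an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three missing values
toy = pd.DataFrame({
    'x': [0.1, np.nan, 0.3],
    'y': [np.nan, np.nan, 0.5],
})

eq_count = (toy == np.nan).sum()   # always 0: nothing equals NaN
null_count = toy.isnull().sum()    # the actual number of missing values
self_equal = (toy == toy).all()    # False for any column that holds a NaN

print(eq_count.tolist(), null_count.tolist(), self_equal.tolist())
```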
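# The most direct way to compare two DataFrames is the equals() method, which treats missing values in the same location as equal
 In[59]: college_ugds_.equals(college_ugds_)
Out[59]: True
# assert_frame_equal raises an AssertionError when two DataFrames differ and returns None (no output) when they are equal
 In[60]: from pandas.testing import assert_frame_equal
         assert_frame_equal(college_ugds_, college_ugds_)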

There's more

# The eq() method works elementwise like ==, unlike the equals method
 In[61]: college_ugds_.eq(.0019).head()
Out[61]: 
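
The three comparison tools behave differently around missing values. This sketch compares a made-up frame (`toy`, an assumption) with a copy of itself: `eq` works elementwise and reports False at the NaN position, `equals` returns a single boolean and treats NaNs in the same location as equal, and `assert_frame_equal` returns None when the frames match.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical toy data with one missing value
toy = pd.DataFrame({'x': [0.1, np.nan], 'y': [0.2, 0.3]})
same = toy.copy()

elementwise = toy.eq(same)               # boolean DataFrame, same as toy == same
whole_frame = toy.equals(same)           # one bool for the whole frame
result = assert_frame_equal(toy, same)   # None if equal, AssertionError otherwise

print(elementwise)
print(whole_frame, result)
```

In tests, `assert_frame_equal` is often preferred because its error message describes where the frames differ.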

7. Transposing the direction of a DataFrame operation

 In[62]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college_ugds_ = college.filter(like='UGDS_')
         college_ugds_.head()
Out[62]: 
# count() returns the number of non-missing values
 In[63]: college_ugds_.count()
Out[63]: UGDS_WHITE    6874
         UGDS_BLACK    6874
         UGDS_HISP     6874
         UGDS_ASIAN    6874
                       ... 
         UGDS_NHPI     6874
         UGDS_2MOR     6874
         UGDS_NRA      6874
         UGDS_UNKN     6874
         Length: 9, dtype: int64
# axis defaults to 0
 In[64]: college_ugds_.count(axis=0)
Out[64]: UGDS_WHITE    6874
         UGDS_BLACK    6874
         UGDS_HISP     6874
         UGDS_ASIAN    6874
                       ... 
         UGDS_NHPI     6874
         UGDS_2MOR     6874
         UGDS_NRA      6874
         UGDS_UNKN     6874
         Length: 9, dtype: int64
# Equivalent to axis='index'
 In[65]: college_ugds_.count(axis='index')
Out[65]: UGDS_WHITE    6874
         UGDS_BLACK    6874
         UGDS_HISP     6874
         UGDS_ASIAN    6874
                       ... 
         UGDS_NHPI     6874
         UGDS_2MOR     6874
         UGDS_NRA      6874
         UGDS_UNKN     6874
         Length: 9, dtype: int64
# Count the non-missing values in each row
 In[66]: college_ugds_.count(axis='columns').head()
Out[66]: INSTNM
         Alabama A & M University               9
         University of Alabama at Birmingham    9
         Amridge University                     9
         University of Alabama in Huntsville    9
         Alabama State University               9
         dtype: int64
# Besides counting the non-missing values in each row, summing each row confirms that the proportions add up to 1
 In[67]: college_ugds_.sum(axis='columns').head()
Out[67]: INSTNM
         Alabama A & M University               1.0000
         University of Alabama at Birmingham    0.9999
         Amridge University                     1.0000
         University of Alabama in Huntsville    1.0000
         Alabama State University               1.0000
         dtype: float64
# Use the median to get a sense of each column's distribution
 In[68]: college_ugds_.median(axis='index')
Out[68]: UGDS_WHITE    0.55570
         UGDS_BLACK    0.10005
         UGDS_HISP     0.07140
         UGDS_ASIAN    0.01290
                        ...   
         UGDS_NHPI     0.00000
         UGDS_2MOR     0.01750
         UGDS_NRA      0.00000
         UGDS_UNKN     0.01430
         Length: 9, dtype: float64
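
The direction of an operation is easiest to see on a frame whose rows and columns differ in number. In this sketch the school names and proportions are invented: `axis='index'` (or 0) collapses the rows and yields one value per column, while `axis='columns'` (or 1) collapses the columns and yields one value per row.

```python
import numpy as np
import pandas as pd

# Hypothetical schools and proportions; School C is entirely missing
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.5, 0.2, np.nan], 'UGDS_BLACK': [0.5, 0.8, np.nan]},
    index=['School A', 'School B', 'School C'],
)

per_column = toy.count(axis='index')     # same as axis=0: one value per column
per_row = toy.count(axis='columns')      # same as axis=1: one value per row
row_totals = toy.sum(axis='columns')     # proportions in a complete row sum to 1

print(per_column.tolist(), per_row.tolist(), row_totals.tolist())
```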

There's more

# The cumulative sum cumsum() along each row makes it easy to see the combined proportion of White, Black, and Hispanic students
 In[69]: college_ugds_cumsum = college_ugds_.cumsum(axis=1)
         college_ugds_cumsum.head()
Out[69]: 
# Sort by the UGDS_HISP column in descending order
 In[70]: college_ugds_cumsum.sort_values('UGDS_HISP', ascending=False)
Out[70]: 
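
A row-wise cumulative sum can be illustrated with proportions that are exact in binary, so the result is easy to read. The frame below is invented. Each value in a column of `running` is the sum of that column and every column to its left.

```python
import pandas as pd

# Hypothetical schools and proportions (each row sums to 1)
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.25, 0.5], 'UGDS_BLACK': [0.25, 0.25], 'UGDS_HISP': [0.5, 0.25]},
    index=['School A', 'School B'],
)

running = toy.cumsum(axis=1)   # accumulate left to right within each row
ranked = running.sort_values('UGDS_BLACK', ascending=False)

print(running)
print(ranked.index.tolist())
```

The last column of `running` equals the row total, and sorting on an intermediate column ranks schools by the combined share of the groups up to that column.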

8. Determining college campus diversity

# The 10 most diverse colleges in the US according to US News
 In[71]: pd.read_csv('data/college_diversity.csv', index_col='School')
Out[71]: 
 In[72]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college_ugds_ = college.filter(like='UGDS_')
         college_ugds_.head()
Out[72]: 
 In[73]: college_ugds_.isnull().sum(axis=1).sort_values(ascending=False).head()
Out[73]: INSTNM
         Excel Learning Center-San Antonio South         9
         Philadelphia College of Osteopathic Medicine    9
         Assemblies of God Theological Seminary          9
         Episcopal Divinity School                       9
         Phillips Graduate Institute                     9
         dtype: int64
# Drop the rows in which every column is missing
 In[74]: college_ugds_ = college_ugds_.dropna(how='all')
 In[75]: college_ugds_.isnull().sum()
Out[75]: UGDS_WHITE    0
         UGDS_BLACK    0
         UGDS_HISP     0
         UGDS_ASIAN    0
                       ..
         UGDS_NHPI     0
         UGDS_2MOR     0
         UGDS_NRA      0
         UGDS_UNKN     0
         Length: 9, dtype: int64
# Use the greater-than-or-equal method ge() to turn the DataFrame into a boolean matrix
 In[76]: college_ugds_.ge(.15).head()
Out[76]: 
# Sum the True values in each row
 In[77]: diversity_metric = college_ugds_.ge(.15).sum(axis='columns')
         diversity_metric.head()
Out[77]: INSTNM
         Alabama A & M University               1
         University of Alabama at Birmingham    2
         Amridge University                     3
         University of Alabama in Huntsville    1
         Alabama State University               1
         dtype: int64
# Use value_counts() to see the distribution
 In[78]: diversity_metric.value_counts()
Out[78]: 1    3042
         2    2884
         3     876
         4      63
         0       7
         5       2
         dtype: int64
# See which schools have the most race categories at or above 15%
 In[79]: diversity_metric.sort_values(ascending=False).head()
Out[79]: INSTNM
         Regency Beauty Institute-Austin          5
         Central Texas Beauty College-Temple      5
         Sullivan and Cogliano Training Center    4
         Ambria College of Nursing                4
         Berkeley College-New York                4
         dtype: int64
# Use the loc indexer to view the rows with those index labels
 In[80]: college_ugds_.loc[['Regency Beauty Institute-Austin', 
                          'Central Texas Beauty College-Temple']]
Out[80]:
# See how the top five most diverse colleges from US News fare in diversity_metric
 In[81]: us_news_top = ['Rutgers University-Newark', 
               'Andrews University', 
               'Stanford University', 
               'University of Houston',
               'University of Nevada-Las Vegas']
 In[82]: diversity_metric.loc[us_news_top]
Out[82]: INSTNM
         Rutgers University-Newark         4
         Andrews University                3
         Stanford University               3
         University of Houston             3
         University of Nevada-Las Vegas    3
         dtype: int64
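
The whole diversity metric fits in a few lines. This sketch rebuilds it on four invented schools (names and proportions are assumptions, not rows from college.csv): drop the all-missing row, mark each proportion that is at least 15%, and count the marks per row.

```python
import numpy as np
import pandas as pd

# Hypothetical schools; School D has no data at all
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.90, 0.40, 0.30, np.nan],
     'UGDS_BLACK': [0.05, 0.35, 0.30, np.nan],
     'UGDS_HISP':  [0.05, 0.25, 0.40, np.nan]},
    index=['School A', 'School B', 'School C', 'School D'],
)

toy = toy.dropna(how='all')                          # removes School D only
diversity_metric = toy.ge(.15).sum(axis='columns')   # groups at or above 15%
distribution = diversity_metric.value_counts()

print(diversity_metric)
print(distribution)
```

Without the `dropna` step, School D would score 0, because comparisons with NaN are False, and it would be indistinguishable from a school with data but no group above the threshold.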

There's more

# The largest single-race proportion shows which schools are the least diverse
 In[83]: college_ugds_.max(axis=1).sort_values(ascending=False).head(10)
Out[83]: INSTNM
         Dewey University-Manati                               1.0
         Yeshiva and Kollel Harbotzas Torah                    1.0
         Mr Leon's School of Hair Design-Lewiston              1.0
         Dewey University-Bayamon                              1.0
                                                               ... 
         Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
         Yeshiva Shaar Hatorah                                 1.0
         Bais Medrash Elyon                                    1.0
         Yeshiva of Nitra Rabbinical College                   1.0
         Length: 10, dtype: float64
# Look at the school Talmudical Seminary Oholei Torah
 In[84]: college_ugds_.loc['Talmudical Seminary Oholei Torah']
Out[84]: UGDS_WHITE    1.0
         UGDS_BLACK    0.0
         UGDS_HISP     0.0
         UGDS_ASIAN    0.0
                       ... 
         UGDS_NHPI     0.0
         UGDS_2MOR     0.0
         UGDS_NRA      0.0
         UGDS_UNKN     0.0
         Name: Talmudical Seminary Oholei Torah, Length: 9, dtype: float64
# Check whether any school has all nine race categories above 1%
 In[85]: (college_ugds_ > .01).all(axis=1).any()
Out[85]: True
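
Both checks in this part reduce along the columns first. In this minimal sketch on invented schools, `max(axis=1)` finds each school's largest group, and `all(axis=1)` followed by `any()` asks whether at least one school has every group above the threshold.

```python
import pandas as pd

# Hypothetical schools and proportions
toy = pd.DataFrame(
    {'UGDS_WHITE': [1.0, 0.40, 0.34],
     'UGDS_BLACK': [0.0, 0.35, 0.33],
     'UGDS_HISP':  [0.0, 0.25, 0.33]},
    index=['School A', 'School B', 'School C'],
)

largest_share = toy.max(axis=1).sort_values(ascending=False)
every_group_present = (toy > .01).all(axis=1)   # one boolean per school
any_school = every_group_present.any()          # single boolean

print(largest_share.index.tolist())
print(every_group_present.tolist(), any_school)
```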
