Pandas 1.x Cookbook (2nd Edition), Chapter 02: Essential DataFrame Operations

Chapter 01: Pandas Foundations
Chapter 02: Essential DataFrame Operations
Chapter 03: Creating and Persisting DataFrames
Chapter 04: Beginning Data Analysis
Chapter 05: Exploratory Data Analysis
Chapter 06: Selecting Subsets of Data
Chapter 07: Filtering Rows
Chapter 08: Index Alignment


2.1 Selecting Multiple DataFrame Columns

Use a list of column names to select multiple columns from a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
...     [
...         "actor_1_name",
...         "actor_2_name",
...         "actor_3_name",
...         "director_name",
...     ]
... ]
>>> movie_actor_director.head()
  actor_1_name actor_2_name actor_3_name director_name
0  CCH Pounder  Joel Dav...    Wes Studi  James Ca...
1  Johnny Depp  Orlando ...  Jack Dav...  Gore Ver...
2  Christop...  Rory Kin...  Stephani...   Sam Mendes
3    Tom Hardy  Christia...  Joseph G...  Christop...
4  Doug Walker   Rob Walker          NaN  Doug Walker
# When selecting a single column, a list and a bare key return different types.
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'>   # a DataFrame
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>   # a Series
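
If you have a Series but need a one-column DataFrame, the Series method .to_frame() converts it back (a minimal sketch using the same movies data):

>>> type(movies["director_name"].to_frame())
<class 'pandas.core.frame.DataFrame'>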

Columns can also be selected with .loc:

>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>

Storing the column names in a list beforehand improves readability:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movie_actor_director = movies[cols]

Without the inner list, pandas treats the comma-separated names as a single tuple key and raises a KeyError:

>>> movies[
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
Traceback (most recent call last):
  ...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')

2.2 Selecting Columns with Methods

Shorten the column names, then count the columns of each data type:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return (
...         str(col)
...         .replace("facebook_likes", "fb")
...         .replace("_for_reviews", "")
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts()
float64    13
int64       3
object     12
dtype: int64

Use the .select_dtypes method to select the integer columns:

>>> movies.select_dtypes(include="int").head()
   num_voted_users  cast_total_fb  movie_fb
0           886204           4834     33000
1           471220          48350         0
2           275868          11700     85000
3          1144337         106759    164000
4                8            143         0

Select all numeric columns:

>>> movies.select_dtypes(include="number").head()
   num_critics  duration  ...  aspect_ratio  movie_fb
0        723.0     178.0  ...         1.78      33000
1        302.0     169.0  ...         2.35          0
2        602.0     148.0  ...         2.35      85000
3        813.0     164.0  ...         2.35     164000
4          NaN       NaN  ...          NaN          0
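
The string "number" is shorthand for the NumPy abstract type np.number; passing that type directly selects the same columns (a quick check, assuming numpy is imported as np as at the top of the chapter):

>>> movies.select_dtypes(include=np.number).columns.equals(
...     movies.select_dtypes(include="number").columns
... )
True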

Select the integer and string (object) columns:

>>> movies.select_dtypes(include=["int", "object"]).head()
   color        direc/_name  ... conte/ating movie_fb
0  Color      James Cameron  ...       PG-13    33000
1  Color     Gore Verbinski  ...       PG-13        0
2  Color         Sam Mendes  ...       PG-13    85000
3  Color  Christopher Nolan  ...       PG-13   164000
4    NaN        Doug Walker  ...         NaN        0

Select all columns that are not floats:

>>> movies.select_dtypes(exclude="float").head()
   color director_name  ... content_rating movie_fb
0  Color  James Ca...   ...        PG-13      33000
1  Color  Gore Ver...   ...        PG-13          0
2  Color   Sam Mendes   ...        PG-13      85000
3  Color  Christop...   ...        PG-13     164000
4    NaN  Doug Walker   ...          NaN          0

Use the .filter method to select all columns whose names contain fb:

>>> movies.filter(like="fb").head()
   director_fb  actor_3_fb  ...  actor_2_fb  movie_fb
0          0.0       855.0  ...       936.0     33000
1        563.0      1000.0  ...      5000.0         0
2          0.0       161.0  ...       393.0     85000
3      22000.0     23000.0  ...     23000.0    164000
4        131.0         NaN  ...        12.0         0

The items parameter selects columns by exact name:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movies.filter(items=cols).head()
      actor_1_name  ...      director_name
0      CCH Pounder  ...      James Cameron
1      Johnny Depp  ...     Gore Verbinski
2  Christoph Waltz  ...         Sam Mendes
3        Tom Hardy  ...  Christopher Nolan
4      Doug Walker  ...        Doug Walker
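
Unlike indexing with [], .filter never raises on missing labels; names in items that do not exist are silently dropped (a small sketch, where "not_a_column" is a made-up label):

>>> movies.filter(items=["movie_title", "not_a_column"]).columns
Index(['movie_title'], dtype='object')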

The regex parameter filters column names with a regular expression; the following selects the columns whose names contain a digit:

>>> movies.filter(regex=r"\d").head()
   actor_3_fb actor_2_name  ...  actor_3_name actor_2_fb
0       855.0  Joel Dav...  ...    Wes Studi       936.0
1      1000.0  Orlando ...  ...  Jack Dav...      5000.0
2       161.0  Rory Kin...  ...  Stephani...       393.0
3     23000.0  Christia...  ...  Joseph G...     23000.0
4         NaN   Rob Walker  ...          NaN        12.0

2.3 Ordering Column Names

Guidelines for ordering columns:

  • Classify each column as categorical or continuous;
  • Group the columns by type;
  • Place the categorical columns before the continuous ones.

Here is an example. First read the data and shorten the column names:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return col.replace("facebook_likes", "fb").replace(
...         "_for_reviews", ""
...     )
>>> movies = movies.rename(columns=shorten)

Now group the column names below into categorical and continuous lists:

>>> movies.columns
Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
       'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
       'movie_fb'],
      dtype='object')
>>> cat_core = [
...     "movie_title",
...     "title_year",
...     "content_rating",
...     "genres",
... ]
>>> cat_people = [
...     "director_name",
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
... ]
>>> cat_other = [
...     "color",
...     "country",
...     "language",
...     "plot_keywords",
...     "movie_imdb_link",
... ]
>>> cont_fb = [
...     "director_fb",
...     "actor_1_fb",
...     "actor_2_fb",
...     "actor_3_fb",
...     "cast_total_fb",
...     "movie_fb",
... ]
>>> cont_finance = ["budget", "gross"]
>>> cont_num_reviews = [
...     "num_voted_users",
...     "num_user",
...     "num_critic",
... ]
>>> cont_other = [
...     "imdb_score",
...     "duration",
...     "aspect_ratio",
...     "facenumber_in_poster",
... ]

Concatenate all the lists above into the final column order, and confirm that no column has been left out:

>>> new_col_order = (
...     cat_core
...     + cat_people
...     + cat_other
...     + cont_fb
...     + cont_finance
...     + cont_num_reviews
...     + cont_other
... )
>>> set(movies.columns) == set(new_col_order)
True
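
Since a set comparison ignores duplicates, a stricter sanity check also compares lengths (a hedged extra guard, not from the book):

>>> len(new_col_order) == len(movies.columns)
True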

Pass the new column order to movies to get the reordered DataFrame:

>>> movies[new_col_order].head()
   movie_title  title_year  ... aspect_ratio facenumber_in_poster
0       Avatar      2009.0  ...         1.78          0.0
1  Pirates ...      2007.0  ...         2.35          0.0
2      Spectre      2015.0  ...         2.35          1.0
3  The Dark...      2012.0  ...         2.35          0.0
4  Star War...         NaN  ...          NaN          0.0

2.4 Summarizing a DataFrame

Inspect the dataset's attributes: shape, size, and ndim.

>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
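
.size is simply the product of the two dimensions, and len returns the row count:

>>> movies.size == movies.shape[0] * movies.shape[1]
True
>>> len(movies)
4916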

The .count method tallies the non-missing values in each column:

>>> movies.count()
color                      4897
director_name              4814
num_critic_for_reviews     4867
duration                   4901
director_facebook_likes    4814
                           ... 
title_year                 4810
actor_2_facebook_likes     4903
imdb_score                 4916
aspect_ratio               4590
movie_facebook_likes       4916
Length: 28, dtype: int64

The .min, .max, .mean, .median, and .std methods return statistics for the numeric columns:

>>> movies.min()
num_critic_for_reviews        1.00
duration                      7.00
director_facebook_likes       0.00
actor_3_facebook_likes        0.00
actor_1_facebook_likes        0.00
                            ...   
title_year                 1916.00
actor_2_facebook_likes        0.00
imdb_score                    1.60
aspect_ratio                  1.18
movie_facebook_likes          0.00
Length: 16, dtype: float64

The .describe method is very powerful: it returns descriptive statistics and quantiles in one call. To fit more information on screen, transpose the result with .T:

>>> movies.describe().T
               count         mean  ...       75%       max
num_criti...  4867.0   137.988905  ...    191.00     813.0
duration      4901.0   107.090798  ...    118.00     511.0
director_...  4814.0   691.014541  ...    189.75   23000.0
actor_3_f...  4893.0   631.276313  ...    633.00   23000.0
actor_1_f...  4909.0  6494.488491  ...  11000.00  640000.0
...              ...          ...  ...       ...       ...
title_year    4810.0  2002.447609  ...   2011.00    2016.0
actor_2_f...  4903.0  1621.923516  ...    912.00  137000.0
imdb_score    4916.0     6.437429  ...      7.20       9.5
aspect_ratio  4590.0     2.222349  ...      2.35      16.0
movie_fac...  4916.0  7348.294142  ...   2000.00  349000.0

The percentiles parameter of .describe returns any quantiles you ask for:

>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
               count         mean  ...       99%       max
num_criti...  4867.0   137.988905  ...    546.68     813.0
duration      4901.0   107.090798  ...    189.00     511.0
director_...  4814.0   691.014541  ...  16000.00   23000.0
actor_3_f...  4893.0   631.276313  ...  11000.00   23000.0
actor_1_f...  4909.0  6494.488491  ...  44920.00  640000.0
...              ...          ...  ...       ...       ...
title_year    4810.0  2002.447609  ...   2016.00    2016.0
actor_2_f...  4903.0  1621.923516  ...  17000.00  137000.0
imdb_score    4916.0     6.437429  ...      8.50       9.5
aspect_ratio  4590.0     2.222349  ...      4.00      16.0
movie_fac...  4916.0  7348.294142  ...  93850.00  349000.0

Setting skipna=False in these descriptive methods stops pandas from skipping missing values, so any column containing a NaN now returns NaN:

>>> movies.min(skipna=False)
num_critic_for_reviews     NaN
duration                   NaN
director_facebook_likes    NaN
actor_3_facebook_likes     NaN
actor_1_facebook_likes     NaN
                          ... 
title_year                 NaN
actor_2_facebook_likes     NaN
imdb_score                 1.6
aspect_ratio               NaN
movie_facebook_likes       0.0
Length: 16, dtype: float64
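
In pandas 1.x, .min on a mixed-type DataFrame silently drops the columns it cannot aggregate; passing numeric_only=True restricts it to the numeric columns explicitly. A hedged sketch (assuming the two calls agree on this dataset, which has no datetime columns):

>>> movies.min(numeric_only=True).equals(movies.min())
True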

2.5 Chaining DataFrame Methods

Use the .isnull method to test whether each value is missing:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return col.replace("facebook_likes", "fb").replace(
...         "_for_reviews", ""
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies.isnull().head()
   color  director_name  ...  aspect_ratio  movie_fb
0  False        False    ...        False      False
1  False        False    ...        False      False
2  False        False    ...        False      False
3  False        False    ...        False      False
4   True        False    ...         True      False

The .sum method adds up the True and False values (True counts as 1, False as 0), which shows how many missing values each column has:

>>> (movies.isnull().sum().head())
color             19
director_name    102
num_critic        49
duration          15
director_fb      102
dtype: int64

Chaining a second .sum gives the total number of missing values:

>>> movies.isnull().sum().sum()
2654

If you only want to know whether any value at all is missing, .any is more direct:

>>> movies.isnull().any().any()
True
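
An equivalent check that drops down to the underlying NumPy array is also common (a sketch, not from the book):

>>> movies.isnull().values.any()
True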

How it works

The .isnull method converts the original DataFrame into a Boolean DataFrame of the same shape:

>>> movies.isnull().dtypes.value_counts()
bool    28
dtype: int64

There's more

If object columns contain missing values, aggregations such as .min, .max, and .sum return an empty result:

>>> movies[["color", "movie_title", "color"]].max()
Series([], dtype: float64)

To get a non-empty result, fill in the missing values first:

>>> movies.select_dtypes(["object"]).fillna("").max()
color                            Color
director_name            étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object

For readability, chained methods are usually wrapped in parentheses and split one call per line, which makes it easy to comment out or debug each step:

>>> (
...     movies.select_dtypes(["object"])
...     .fillna("")
...     .max()
... )
color                            Color
director_name            étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object

2.6 DataFrame Operations

A DataFrame's columns may hold numbers or objects. Adding 5 directly to a DataFrame with mixed types raises a TypeError:

>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "int") to str

The .filter method selects every column whose name contains 'UGDS_'; these columns hold the undergraduate enrollment proportions by race:

>>> colleges = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

Pandas rounds with banker's rounding (round half to even): a 5 followed by more nonzero digits rounds up, while a bare trailing 5 rounds to the nearest even digit. Watch how UGDS_BLACK behaves:

>>> name = "Northwest-Shoals Community College"
>>> college_ugds.loc[name]
UGDS_WHITE    0.7912
UGDS_BLACK    0.1250
UGDS_HISP     0.0339
UGDS_ASIAN    0.0036
UGDS_AIAN     0.0088
UGDS_NHPI     0.0006
UGDS_2MOR     0.0012
UGDS_NRA      0.0033
UGDS_UNKN     0.0324
Name: Northwest-Shoals Community College, dtype: float64
>>> college_ugds.loc[name].round(2)
UGDS_WHITE    0.79
UGDS_BLACK    0.12
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

Add 0.0001 before rounding and watch the change:

>>> (college_ugds.loc[name] + 0.0001).round(2)
UGDS_WHITE    0.79
UGDS_BLACK    0.13
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

Before starting this rounding exercise, add 0.00501 to every value in college_ugds:

>>> college_ugds + 0.00501
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...     0.03831     0.94031  ...   0.01091    0.01881
Universit...     0.59721     0.26501  ...   0.02291    0.01501
Amridge U...     0.30401     0.42421  ...   0.00501    0.27651
Universit...     0.70381     0.13051  ...   0.03821    0.04001
Alabama S...     0.02081     0.92581  ...   0.02931    0.01871
...                  ...         ...  ...       ...        ...
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN

Use floor division, //, to convert the values to whole-number percentages:

>>> (college_ugds + 0.00501) // 0.01
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...         3.0        94.0  ...       1.0        1.0
Universit...        59.0        26.0  ...       2.0        1.0
Amridge U...        30.0        42.0  ...       0.0       27.0
Universit...        70.0        13.0  ...       3.0        4.0
Alabama S...         2.0        92.0  ...       2.0        1.0
...                  ...         ...  ...       ...        ...
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN
National ...         NaN         NaN  ...       NaN        NaN

Dividing by 100 brings them back to proportions:

>>> college_ugds_op_round = (
...     (college_ugds + 0.00501) // 0.01 / 100
... )
>>> college_ugds_op_round.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...        0.03        0.94  ...      0.01       0.01
Universit...        0.59        0.26  ...      0.02       0.01
Amridge U...        0.30        0.42  ...      0.00       0.27
Universit...        0.70        0.13  ...      0.03       0.04
Alabama S...        0.02        0.92  ...      0.02       0.01

Now use the .round method; to compensate for banker's rounding, add a tiny 0.00001 first:

>>> college_ugds_round = (college_ugds + 0.00001).round(2)

Verify that the two results are identical:

>>> college_ugds_op_round.equals(college_ugds_round)
True

How it works

Floating-point arithmetic introduces representation error:

>>> 0.045 + 0.005
0.049999999999999996

After adding 0.00001, the sum becomes:

>>> 0.045 + 0.005 + 0.00001
0.05001
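
If you only need to compare floats rather than round them, the standard library's math.isclose sidesteps the representation error (a sketch, not from the book):

>>> import math
>>> 0.045 + 0.005 == 0.05
False
>>> math.isclose(0.045 + 0.005, 0.05)
True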

There's more

Like Series, DataFrames have method equivalents of the arithmetic operators:

>>> college2 = (
...     college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True 

2.7 Comparing Missing Values

Pandas uses np.nan to represent missing values, and this object behaves unlike anything else:

>>> np.nan == np.nan
False
>>> None == None
True
>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
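
Because NaN is not even equal to itself, test for it with a function rather than ==; np.isnan and pd.isna are the usual choices (pd.isna also handles None):

>>> np.isnan(np.nan)
True
>>> pd.isna(np.nan), pd.isna(None)
(True, True)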

First load the data:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

== compares every element of the DataFrame to the scalar:

>>> college_ugds == 0.0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False

But if the DataFrame contains missing values, elementwise == runs into trouble:

>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Amridge U...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Alabama S...        True        True  ...      True       True

It looks fine at first, but the .all method exposes the problem:

>>> college_self_compare.all()
UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool

This happens because missing values never compare equal to each other. Likewise, testing for missing values with == np.nan finds nothing:

>>> (college_ugds == np.nan).sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

The right way to count missing values is .isna:

>>> college_ugds.isna().sum()
UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64
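
.isnull is just an alias of .isna, so either spelling works:

>>> college_ugds.isna().equals(college_ugds.isnull())
True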

The right way to compare two DataFrames is the .equals method, which treats missing values in the same locations as equal:

>>> college_ugds.equals(college_ugds)
True

There's more

The .eq method is the equivalent of ==:

>>> college_ugds.eq(0.0019)  # same as college_ugds == .0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False

The pandas.testing module provides the assertion function assert_frame_equal, which raises an AssertionError if the two DataFrames differ and returns None if they match:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True

2.8 Transposing the Direction of a DataFrame Operation

Many DataFrame methods take an axis parameter that controls the direction of the computation. axis accepts index (or 0) and columns (or 1). The strings are clearer, so they are recommended.

Read and filter the data:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350

The .count method defaults to axis=0, so the call below is equivalent to college_ugds.count(axis=0) and to college_ugds.count(axis='index'):

>>> college_ugds.count()
UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

axis參數(shù)改為columns克伊,可以得到每行非空值的數(shù)量:

>>> college_ugds.count(axis="columns").head()
INSTNM
Alabama A & M University               9
University of Alabama at Birmingham    9
Amridge University                     9
University of Alabama in Huntsville    9
Alabama State University               9
dtype: int64

Sum along each row to check whether the proportions add up to 100%:

>>> college_ugds.sum(axis="columns").head()
INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64
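
Rounding first makes it easier to scan whether each school's proportions effectively total 100% (a small follow-up sketch):

>>> college_ugds.sum(axis="columns").round(2).head()
INSTNM
Alabama A & M University               1.0
University of Alabama at Birmingham    1.0
Amridge University                     1.0
University of Alabama in Huntsville    1.0
Alabama State University               1.0
dtype: float64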

Compute the median of each column:

>>> college_ugds.median(axis="index")
UGDS_WHITE    0.55570
UGDS_BLACK    0.10005
UGDS_HISP     0.07140
UGDS_ASIAN    0.01290
UGDS_AIAN     0.00260
UGDS_NHPI     0.00000
UGDS_2MOR     0.01750
UGDS_NRA      0.00000
UGDS_UNKN     0.01430
dtype: float64

There's more

Applying cumsum across the columns accumulates the proportions row by row, offering another view of the white and black enrollment shares:

>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9686  ...    0.9862     1.0000
Universit...      0.5922      0.8522  ...    0.9899     0.9999
Amridge U...      0.2990      0.7182  ...    0.7285     1.0000
Universit...      0.6988      0.8243  ...    0.9650     1.0000
Alabama S...      0.0158      0.9366  ...    0.9863     1.0000

2.9 Determining College Campus Diversity

The ten schools with the highest US News diversity index:

>>> pd.read_csv(
...     "data/college_diversity.csv", index_col="School"
... )
                                                   Diversity Index
School
Rutgers University--Newark  Newark, NJ                        0.76
Andrews University  Berrien Springs, MI                       0.74
Stanford University  Stanford, CA                             0.74
University of Houston  Houston, TX                            0.74
University of Nevada--Las Vegas  Las Vegas, NV                0.74
University of San Francisco  San Francisco, CA                0.74
San Francisco State University  San Francisco, CA             0.73
University of Illinois--Chicago  Chicago, IL                  0.73
New Jersey Institute of Technology  Newark, NJ                0.72
Texas Woman's University  Denton, TX                          0.72

To answer the same question ourselves, first read the data:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

Count how many values are missing in each row, sorted in descending order:

>>> (
...     college_ugds.isnull()
...     .sum(axis="columns")
...     .sort_values(ascending=False)
...     .head()
... )
INSTNM
Excel Learning Center-San Antonio South         9
Philadelphia College of Osteopathic Medicine    9
Assemblies of God Theological Seminary          9
Episcopal Divinity School                       9
Phillips Graduate Institute                     9
dtype: int64

If all nine columns in a row are missing, drop the row with the .dropna method:

>>> college_ugds = college_ugds.dropna(how="all")
>>> college_ugds.isnull().sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64
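
For reference, .dropna also takes a thresh parameter requiring that many non-missing values per row; thresh=1 is equivalent to how="all" here (a hedged aside, not from the book):

>>> college.filter(like="UGDS_").dropna(thresh=1).equals(college_ugds)
True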

Use .ge to flag every racial proportion that is at least 15%:

>>> college_ugds.ge(0.15)
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False        True  ...     False      False
Universit...        True        True  ...     False      False
Amridge U...        True        True  ...     False       True
Universit...        True       False  ...     False      False
Alabama S...       False        True  ...     False      False
...                  ...         ...  ...       ...        ...
Hollywood...        True        True  ...     False      False
Hollywood...       False        True  ...     False      False
Coachella...        True       False  ...     False      False
Dewey Uni...       False       False  ...     False      False
Coastal P...        True        True  ...     False      False

Sum along each row to count how many races pass the threshold:

>>> diversity_metric = college_ugds.ge(0.15).sum(
...     axis="columns"
... )
>>> diversity_metric.head()
INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64

Use .value_counts to see how this Series is distributed:

>>> diversity_metric.value_counts()
1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64

Surprisingly, two schools have five races at 15% or more. Sort diversity_metric in descending order to find them:

>>> diversity_metric.sort_values(ascending=False).head()
INSTNM
Regency Beauty Institute-Austin          5
Central Texas Beauty College-Temple      5
Sullivan and Cogliano Training Center    4
Ambria College of Nursing                4
Berkeley College-New York                4
dtype: int64

Look at the two schools at the top:

>>> college_ugds.loc[
...     [
...         "Regency Beauty Institute-Austin",
...         "Central Texas Beauty College-Temple",
...     ]
... ]
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Regency B...      0.1867      0.2133  ...       0.0     0.2667
Central T...      0.1616      0.2323  ...       0.0     0.1515

Now see how the top five schools from the US News ranking fare on this metric:

>>> us_news_top = [
...     "Rutgers University-Newark",
...     "Andrews University",
...     "Stanford University",
...     "University of Houston",
...     "University of Nevada-Las Vegas",
... ]
>>> diversity_metric.loc[us_news_top]
INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64

There's more

List the ten least diverse schools, i.e. those where a single race's proportion is largest:

>>> (
...     college_ugds.max(axis=1)
...     .sort_values(ascending=False)
...     .head(10)
... )
INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64

Finally, is there any school where every racial proportion exceeds 1%?

>>> (college_ugds > 0.01).all(axis=1).any()
True
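
To count how many schools clear that bar, sum the Boolean Series instead of collapsing it with .any (a sketch, not from the book; output elided):

>>> (college_ugds > 0.01).all(axis=1).sum()  # number of qualifying schools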
