Chapter 01: Pandas Basics
Chapter 02: Essential DataFrame Operations
Chapter 03: Creating and Persisting DataFrames
Chapter 04: Beginning Data Analysis
Chapter 05: Exploratory Data Analysis
Chapter 06: Selecting Subsets of Data
Chapter 07: Filtering Rows
Chapter 08: Index Alignment
2.1 Selecting multiple DataFrame columns
Select multiple columns of a DataFrame by passing a list of column names:
>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
... [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
... ]
>>> movie_actor_director.head()
actor_1_name actor_2_name actor_3_name director_name
0 CCH Pounder Joel Dav... Wes Studi James Ca...
1 Johnny Depp Orlando ... Jack Dav... Gore Ver...
2 Christop... Rory Kin... Stephani... Sam Mendes
3 Tom Hardy Christia... Joseph G... Christop...
4 Doug Walker Rob Walker NaN Doug Walker
# When selecting a single column, a list key and a scalar key return different types.
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>
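The same distinction can be checked on any small DataFrame; the names `df`, `s`, and `sub` below are made up for illustration. A minimal sketch, which also shows `.to_frame()` promoting a Series back to a one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

s = df["a"]       # scalar key -> Series
sub = df[["a"]]   # list key -> DataFrame, even with a single name

assert isinstance(s, pd.Series)
assert isinstance(sub, pd.DataFrame)
# .to_frame() converts a Series into a one-column DataFrame
assert isinstance(s.to_frame(), pd.DataFrame)
```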
You can also use .loc to select multiple columns:
>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>
Storing the column names in a list beforehand makes the code more readable:
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movie_actor_director = movies[cols]
Without a list, a KeyError is raised:
>>> movies[
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
Traceback (most recent call last):
...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
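The shape of this KeyError hints at the cause: the comma-separated names form a single tuple key, and pandas looks for one column labeled by that tuple. Tuple keys are only valid with MultiIndex columns, as this small sketch with made-up data illustrates:

```python
import pandas as pd

# With MultiIndex columns, a tuple key is a valid single-column label;
# on a flat column index the same tuple raises KeyError.
cols = pd.MultiIndex.from_tuples([("cast", "actor_1"), ("crew", "director")])
df = pd.DataFrame([["CCH Pounder", "James Cameron"]], columns=cols)

assert df[("cast", "actor_1")].iloc[0] == "CCH Pounder"
```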
2.2 Selecting columns with methods
After shortening the column names, count the number of columns of each data type:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return (
... str(col)
... .replace("facebook_likes", "fb")
... .replace("_for_reviews", "")
... )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts()
float64 13
int64 3
object 12
dtype: int64
Use the .select_dtypes method to select the integer columns:
>>> movies.select_dtypes(include="int").head()
num_voted_users cast_total_fb movie_fb
0 886204 4834 33000
1 471220 48350 0
2 275868 11700 85000
3 1144337 106759 164000
4 8 143 0
Select all numeric columns:
>>> movies.select_dtypes(include="number").head()
num_critics duration ... aspect_ratio movie_fb
0 723.0 178.0 ... 1.78 33000
1 302.0 169.0 ... 2.35 0
2 602.0 148.0 ... 2.35 85000
3 813.0 164.0 ... 2.35 164000
4 NaN NaN ... NaN 0
Select the integer and string columns:
>>> movies.select_dtypes(include=["int", "object"]).head()
color direc/_name ... conte/ating movie_fb
0 Color James Cameron ... PG-13 33000
1 Color Gore Verbinski ... PG-13 0
2 Color Sam Mendes ... PG-13 85000
3 Color Christopher Nolan ... PG-13 164000
4 NaN Doug Walker ... NaN 0
Select all non-float columns:
>>> movies.select_dtypes(exclude="float").head()
color director_name ... content_rating movie_fb
0 Color James Ca... ... PG-13 33000
1 Color Gore Ver... ... PG-13 0
2 Color Sam Mendes ... PG-13 85000
3 Color Christop... ... PG-13 164000
4 NaN Doug Walker ... NaN 0
Use the .filter method to select the columns whose names contain fb:
>>> movies.filter(like="fb").head()
director_fb actor_3_fb ... actor_2_fb movie_fb
0 0.0 855.0 ... 936.0 33000
1 563.0 1000.0 ... 5000.0 0
2 0.0 161.0 ... 393.0 85000
3 22000.0 23000.0 ... 23000.0 164000
4 131.0 NaN ... 12.0 0
The items parameter can be used to select a list of columns:
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movies.filter(items=cols).head()
actor_1_name ... director_name
0 CCH Pounder ... James Cameron
1 Johnny Depp ... Gore Verbinski
2 Christoph Waltz ... Sam Mendes
3 Tom Hardy ... Christopher Nolan
4 Doug Walker ... Doug Walker
The regex parameter performs regular-expression matching; the following selects the columns whose names contain a digit:
>>> movies.filter(regex=r"\d").head()
actor_3_fb actor_2_name ... actor_3_name actor_2_fb
0 855.0 Joel Dav... ... Wes Studi 936.0
1 1000.0 Orlando ... ... Jack Dav... 5000.0
2 161.0 Rory Kin... ... Stephani... 393.0
3 23000.0 Christia... ... Joseph G... 23000.0
4 NaN Rob Walker ... NaN 12.0
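Note that .filter matches on column labels, not on the data, and items=, like=, and regex= are mutually exclusive. A small sketch with made-up columns:

```python
import pandas as pd

df = pd.DataFrame(
    {"actor_1_fb": [855.0], "movie_fb": [33000], "color": ["Color"]}
)

# like= does substring matching on labels; regex= applies a regular expression
assert list(df.filter(like="fb").columns) == ["actor_1_fb", "movie_fb"]
assert list(df.filter(regex=r"\d").columns) == ["actor_1_fb"]
```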
2.3 Ordering column names
Guidelines for ordering columns:
- split the columns into categorical and continuous;
- group related columns within the categorical and continuous sets;
- place the categorical columns before the continuous ones.
Here is an example. First read the data and shorten the column names:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return col.replace("facebook_likes", "fb").replace(
... "_for_reviews", ""
... )
>>> movies = movies.rename(columns=shorten)
Here are the column names to be ordered:
>>> movies.columns
Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
'movie_fb'],
dtype='object')
>>> cat_core = [
... "movie_title",
... "title_year",
... "content_rating",
... "genres",
... ]
>>> cat_people = [
... "director_name",
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... ]
>>> cat_other = [
... "color",
... "country",
... "language",
... "plot_keywords",
... "movie_imdb_link",
... ]
>>> cont_fb = [
... "director_fb",
... "actor_1_fb",
... "actor_2_fb",
... "actor_3_fb",
... "cast_total_fb",
... "movie_fb",
... ]
>>> cont_finance = ["budget", "gross"]
>>> cont_num_reviews = [
... "num_voted_users",
... "num_user",
... "num_critic",
... ]
>>> cont_other = [
... "imdb_score",
... "duration",
... "aspect_ratio",
... "facenumber_in_poster",
... ]
Concatenate all of the lists into the final column order, and verify that no column has been left out:
>>> new_col_order = (
... cat_core
... + cat_people
... + cat_other
... + cont_fb
... + cont_finance
... + cont_num_reviews
... + cont_other
... )
>>> set(movies.columns) == set(new_col_order)
True
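One caveat: set comparison ignores duplicates and order, so a column name accidentally listed twice would still pass; comparing sorted lists is stricter. A quick illustration with made-up names:

```python
# set() collapses duplicates, so a repeated name slips through the
# set comparison; comparing sorted lists catches it.
new_order = ["a", "b", "b"]
original = ["a", "b"]

assert set(new_order) == set(original)        # passes despite the duplicate
assert sorted(new_order) != sorted(original)  # catches it
```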
Pass the new column order to the indexing operator of movies to get a DataFrame with the columns in order:
>>> movies[new_col_order].head()
movie_title title_year ... aspect_ratio facenumber_in_poster
0 Avatar 2009.0 ... 1.78 0.0
1 Pirates ... 2007.0 ... 2.35 0.0
2 Spectre 2015.0 ... 2.35 1.0
3 The Dark... 2012.0 ... 2.35 0.0
4 Star War... NaN ... NaN 0.0
2.4 Summarizing a DataFrame
Inspect the dataset's attributes: shape, size, and ndim.
>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
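These attributes are related: .size is the product of the two entries of .shape. A minimal sketch on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

assert df.shape == (3, 2)                    # (rows, columns)
assert df.size == df.shape[0] * df.shape[1]  # 6 cells in total
assert df.ndim == 2                          # a DataFrame is always 2-D
```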
The .count method counts the non-missing values in each column:
>>> movies.count()
color 4897
director_name 4814
num_critic_for_reviews 4867
duration 4901
director_facebook_likes 4814
...
title_year 4810
actor_2_facebook_likes 4903
imdb_score 4916
aspect_ratio 4590
movie_facebook_likes 4916
Length: 28, dtype: int64
The .min, .max, .mean, .median, and .std methods return statistics for the numeric columns:
>>> movies.min()
num_critic_for_reviews 1.00
duration 7.00
director_facebook_likes 0.00
actor_3_facebook_likes 0.00
actor_1_facebook_likes 0.00
...
title_year 1916.00
actor_2_facebook_likes 0.00
imdb_score 1.60
aspect_ratio 1.18
movie_facebook_likes 0.00
Length: 16, dtype: float64
The .describe method is very powerful: it returns descriptive statistics and quantiles. To fit more information on the screen, transpose the result with .T:
>>> movies.describe().T
count mean ... 75% max
num_criti... 4867.0 137.988905 ... 191.00 813.0
duration 4901.0 107.090798 ... 118.00 511.0
director_... 4814.0 691.014541 ... 189.75 23000.0
actor_3_f... 4893.0 631.276313 ... 633.00 23000.0
actor_1_f... 4909.0 6494.488491 ... 11000.00 640000.0
... ... ... ... ... ...
title_year 4810.0 2002.447609 ... 2011.00 2016.0
actor_2_f... 4903.0 1621.923516 ... 912.00 137000.0
imdb_score 4916.0 6.437429 ... 7.20 9.5
aspect_ratio 4590.0 2.222349 ... 2.35 16.0
movie_fac... 4916.0 7348.294142 ... 2000.00 349000.0
The percentiles parameter of .describe returns arbitrary quantiles:
>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
count mean ... 99% max
num_criti... 4867.0 137.988905 ... 546.68 813.0
duration 4901.0 107.090798 ... 189.00 511.0
director_... 4814.0 691.014541 ... 16000.00 23000.0
actor_3_f... 4893.0 631.276313 ... 11000.00 23000.0
actor_1_f... 4909.0 6494.488491 ... 44920.00 640000.0
... ... ... ... ... ...
title_year 4810.0 2002.447609 ... 2016.00 2016.0
actor_2_f... 4903.0 1621.923516 ... 17000.00 137000.0
imdb_score 4916.0 6.437429 ... 8.50 9.5
aspect_ratio 4590.0 2.222349 ... 4.00 16.0
movie_fac... 4916.0 7348.294142 ... 93850.00 349000.0
If you set skipna=False in these descriptive methods, missing values propagate to the result, revealing which columns contain them:
>>> movies.min(skipna=False)
num_critic_for_reviews NaN
duration NaN
director_facebook_likes NaN
actor_3_facebook_likes NaN
actor_1_facebook_likes NaN
...
title_year NaN
actor_2_facebook_likes NaN
imdb_score 1.6
aspect_ratio NaN
movie_facebook_likes 0.0
Length: 16, dtype: float64
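The effect of skipna is easy to reproduce on a tiny Series with made-up data: with skipna=False, a single NaN makes the aggregate NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

assert s.min() == 1.0                 # NaN skipped by default
assert np.isnan(s.min(skipna=False))  # NaN propagates when not skipped
```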
2.5 Chaining DataFrame methods
Use the .isnull method to check whether each value is missing:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return col.replace("facebook_likes", "fb").replace(
... "_for_reviews", ""
... )
>>> movies = movies.rename(columns=shorten)
>>> movies.isnull().head()
color director_name ... aspect_ratio movie_fb
0 False False ... False False
1 False False ... False False
2 False False ... False False
3 False False ... False False
4 True False ... True False
The .sum method adds the True (1) and False (0) values, which clearly shows how many missing values each column has:
>>> (movies.isnull().sum().head())
color 19
director_name 102
num_critic 49
duration 15
director_fb 102
dtype: int64
Going one step further, chaining two .sum calls gives the total number of missing values:
>>> movies.isnull().sum().sum()
2654
If you only want to know whether any missing value exists, .any is more convenient:
>>> movies.isnull().any().any()
True
How it works
The .isnull method converts the original DataFrame into a boolean DataFrame of the same shape:
>>> movies.isnull().dtypes.value_counts()
bool 28
dtype: int64
There's more
If object columns contain missing values, aggregations such as .min, .max, and .sum return an empty result:
>>> movies[["color", "movie_title"]].max()
Series([], dtype: float64)
To get a non-empty result, fill in the missing values first:
>>> movies.select_dtypes(["object"]).fillna("").max()
color Color
director_name étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object
For readability, chained methods are usually wrapped in parentheses, which makes it easy to comment out or debug each step:
>>> (movies.select_dtypes(["object"]).fillna("").max())
color Color
director_name étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object
2.6 DataFrame operations
A DataFrame's columns may be numeric or object; adding 5 directly raises a TypeError:
>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
...
TypeError: can only concatenate str (not "int") to str
Use the .filter method to select the columns whose names begin with 'UGDS_'; they hold the undergraduate enrollment proportions by race:
>>> colleges = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
Pandas uses banker's rounding ("round half to even"): a 5 followed by further digits rounds up, while a bare trailing 5 rounds to the even neighbor. Watch how UGDS_BLACK changes:
>>> name = "Northwest-Shoals Community College"
>>> college_ugds.loc[name]
UGDS_WHITE 0.7912
UGDS_BLACK 0.1250
UGDS_HISP 0.0339
UGDS_ASIAN 0.0036
UGDS_AIAN 0.0088
UGDS_NHPI 0.0006
UGDS_2MOR 0.0012
UGDS_NRA 0.0033
UGDS_UNKN 0.0324
Name: Northwest-Shoals Community College, dtype: float64
>>> college_ugds.loc[name].round(2)
UGDS_WHITE 0.79
UGDS_BLACK 0.12
UGDS_HISP 0.03
UGDS_ASIAN 0.00
UGDS_AIAN 0.01
UGDS_NHPI 0.00
UGDS_2MOR 0.00
UGDS_NRA 0.00
UGDS_UNKN 0.03
Name: Northwest-Shoals Community College, dtype: float64
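Banker's rounding is also what Python's built-in round uses: exact ties go to the even neighbor. A quick check:

```python
# Ties round to the even neighbor ("round half to even")
assert round(2.5) == 2
assert round(3.5) == 4
# 0.125 is exactly representable in binary, so it is a true tie:
# the even neighbor of 0.12 / 0.13 is 0.12
assert round(0.125, 2) == 0.12
```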
Add 0.0001 before rounding and observe the change:
>>> (college_ugds.loc[name] + 0.0001).round(2)
UGDS_WHITE 0.79
UGDS_BLACK 0.13
UGDS_HISP 0.03
UGDS_ASIAN 0.00
UGDS_AIAN 0.01
UGDS_NHPI 0.00
UGDS_2MOR 0.00
UGDS_NRA 0.00
UGDS_UNKN 0.03
Name: Northwest-Shoals Community College, dtype: float64
Before starting this rounding adventure, add 0.00501 to every value in college_ugds:
>>> college_ugds + 0.00501
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.03831 0.94031 ... 0.01091 0.01881
Universit... 0.59721 0.26501 ... 0.02291 0.01501
Amridge U... 0.30401 0.42421 ... 0.00501 0.27651
Universit... 0.70381 0.13051 ... 0.03821 0.04001
Alabama S... 0.02081 0.92581 ... 0.02931 0.01871
... ... ... ... ... ...
SAE Insti... NaN NaN ... NaN NaN
Rasmussen... NaN NaN ... NaN NaN
Use floor division (//) to round the values down to whole percentages:
>>> (college_ugds + 0.00501) // 0.01
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 3.0 94.0 ... 1.0 1.0
Universit... 59.0 26.0 ... 2.0 1.0
Amridge U... 30.0 42.0 ... 0.0 27.0
Universit... 70.0 13.0 ... 3.0 4.0
Alabama S... 2.0 92.0 ... 2.0 1.0
... ... ... ... ... ...
SAE Insti... NaN NaN ... NaN NaN
Rasmussen... NaN NaN ... NaN NaN
National ... NaN NaN ... NaN NaN
Divide by 100 to get back to proportions:
>>> college_ugds_op_round = (
... (college_ugds + 0.00501) // 0.01 / 100
... )
>>> college_ugds_op_round.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.03 0.94 ... 0.01 0.01
Universit... 0.59 0.26 ... 0.02 0.01
Amridge U... 0.30 0.42 ... 0.00 0.27
Universit... 0.70 0.13 ... 0.03 0.04
Alabama S... 0.02 0.92 ... 0.02 0.01
Now use the .round method; because of the banker's rounding rule, add 0.00001 first:
>>> college_ugds_round = (college_ugds + 0.00001).round(2)
Check whether the two results are identical:
>>> college_ugds_op_round.equals(college_ugds_round)
True
How it works
Floating-point arithmetic introduces error:
>>> 0.045 + 0.005
0.049999999999999996
After adding 0.00001, it becomes:
>>> 0.045 + 0.005 + 0.00001
0.05001
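When exact decimal arithmetic matters, the standard library offers math.isclose for tolerance-based comparison and decimal.Decimal for exact decimal representation; a sketch:

```python
import math
from decimal import Decimal

assert 0.045 + 0.005 != 0.05              # binary floats cannot hold 0.045 exactly
assert math.isclose(0.045 + 0.005, 0.05)  # tolerance-based comparison
# Decimal does exact base-10 arithmetic
assert Decimal("0.045") + Decimal("0.005") == Decimal("0.05")
```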
There's more
Like Series, DataFrames have method equivalents of the arithmetic operators:
>>> college2 = (
... college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True
2.7 Comparing missing values
Pandas uses np.nan to represent missing values, and this object behaves unusually:
>>> np.nan == np.nan
False
>>> None == None
True
>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
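Because NaN never compares equal, even to itself, missingness should be tested with pd.isna (or np.isnan for floats) rather than with ==. A sketch:

```python
import numpy as np
import pandas as pd

assert np.nan != np.nan   # NaN is not equal to itself
assert pd.isna(np.nan)    # pd.isna handles NaN...
assert pd.isna(None)      # ...and None
assert not np.isnan(5.0)  # np.isnan works on floats
```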
First, load the data:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
== compares every element of the DataFrame:
>>> college_ugds == 0.0019
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False False ... False False
Universit... False False ... False False
However, if the DataFrame contains missing values, == runs into trouble:
>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... True True ... True True
Universit... True True ... True True
Amridge U... True True ... True True
Universit... True True ... True True
Alabama S... True True ... True True
This looks fine, but applying the .all method reveals the problem:
>>> college_self_compare.all()
UGDS_WHITE False
UGDS_BLACK False
UGDS_HISP False
UGDS_ASIAN False
UGDS_AIAN False
UGDS_NHPI False
UGDS_2MOR False
UGDS_NRA False
UGDS_UNKN False
dtype: bool
This is because missing values never compare equal to each other; checking for them with == np.nan, as below, always yields 0:
>>> (college_ugds == np.nan).sum()
UGDS_WHITE 0
UGDS_BLACK 0
UGDS_HISP 0
UGDS_ASIAN 0
UGDS_AIAN 0
UGDS_NHPI 0
UGDS_2MOR 0
UGDS_NRA 0
UGDS_UNKN 0
dtype: int64
The way to count missing values is the .isna method:
>>> college_ugds.isna().sum()
UGDS_WHITE 661
UGDS_BLACK 661
UGDS_HISP 661
UGDS_ASIAN 661
UGDS_AIAN 661
UGDS_NHPI 661
UGDS_2MOR 661
UGDS_NRA 661
UGDS_UNKN 661
dtype: int64
The correct way to compare two DataFrames is the .equals method:
>>> college_ugds.equals(college_ugds)
True
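Unlike ==, the .equals method treats NaNs in the same position as equal, which is why it works for self-comparison; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, np.nan]})
df2 = pd.DataFrame({"a": [1.0, np.nan]})

assert not (df1 == df2).all().all()  # element-wise: NaN != NaN
assert df1.equals(df2)               # .equals: matching NaNs count as equal
```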
There's more
The .eq method is equivalent to ==:
>>> college_ugds.eq(0.0019) # same as college_ugds == .0019
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False False ... False False
Universit... False False ... False False
Amridge U... False False ... False False
Universit... False False ... False False
Alabama S... False False ... False False
... ... ... ... ... ...
SAE Insti... False False ... False False
Rasmussen... False False ... False False
National ... False False ... False False
Bay Area ... False False ... False False
Excel Lea... False False ... False False
The pandas.testing module provides the assertion function assert_frame_equal for checking whether two DataFrames are equal: it raises an AssertionError if they differ and returns None if they match:
>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True
2.8 Transposing the direction of a DataFrame operation
Many DataFrame methods take an axis parameter, which controls the direction of the operation. axis can be index (0) or columns (1); the string form is clearer and is recommended.
Load and filter the data.
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> college_ugds.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
The .count method defaults to axis=0; the call below is equivalent to college_ugds.count(axis=0) and college_ugds.count(axis='index'):
>>> college_ugds.count()
UGDS_WHITE 6874
UGDS_BLACK 6874
UGDS_HISP 6874
UGDS_ASIAN 6874
UGDS_AIAN 6874
UGDS_NHPI 6874
UGDS_2MOR 6874
UGDS_NRA 6874
UGDS_UNKN 6874
dtype: int64
Changing the axis parameter to columns counts the non-missing values in each row:
>>> college_ugds.count(axis="columns").head()
INSTNM
Alabama A & M University 9
University of Alabama at Birmingham 9
Amridge University 9
University of Alabama in Huntsville 9
Alabama State University 9
dtype: int64
Sum across each row to check whether the proportions add up to 100%:
>>> college_ugds.sum(axis="columns").head()
INSTNM
Alabama A & M University 1.0000
University of Alabama at Birmingham 0.9999
Amridge University 1.0000
University of Alabama in Huntsville 1.0000
Alabama State University 1.0000
dtype: float64
Compute the median of each column:
>>> college_ugds.median(axis="index")
UGDS_WHITE 0.55570
UGDS_BLACK 0.10005
UGDS_HISP 0.07140
UGDS_ASIAN 0.01290
UGDS_AIAN 0.00260
UGDS_NHPI 0.00000
UGDS_2MOR 0.01750
UGDS_NRA 0.00000
UGDS_UNKN 0.01430
dtype: float64
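The two axis directions can be confirmed on a toy frame: axis='index' aggregates down each column, axis='columns' across each row. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

assert df.sum(axis="index").tolist() == [3, 7]    # down the columns
assert df.sum(axis="columns").tolist() == [4, 6]  # across the rows
```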
There's more
Using cumsum along the columns accumulates the proportions, offering another view of the white and black shares:
>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9686 ... 0.9862 1.0000
Universit... 0.5922 0.8522 ... 0.9899 0.9999
Amridge U... 0.2990 0.7182 ... 0.7285 1.0000
Universit... 0.6988 0.8243 ... 0.9650 1.0000
Alabama S... 0.0158 0.9366 ... 0.9863 1.0000
2.9 Campus diversity
The top ten schools by the US News diversity index are:
>>> pd.read_csv(
... "data/college_diversity.csv", index_col="School"
... )
Diversity Index
School
Rutgers University--Newark Newark, NJ 0.76
Andrews University Berrien Springs, MI 0.74
Stanford University Stanford, CA 0.74
University of Houston Houston, TX 0.74
University of Nevada--Las Vegas Las Vegas, NV 0.74
University of San Francisco San Francisco, CA 0.74
San Francisco State University San Francisco, CA 0.73
University of Illinois--Chicago Chicago, IL 0.73
New Jersey Institute of Technology Newark, NJ 0.72
Texas Woman's University Denton, TX 0.72
For our own analysis, first read the data:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
Count the missing values in each row and sort in descending order:
>>> (
... college_ugds.isnull()
... .sum(axis="columns")
... .sort_values(ascending=False)
... .head()
... )
INSTNM
Excel Learning Center-San Antonio South 9
Philadelphia College of Osteopathic Medicine 9
Assemblies of God Theological Seminary 9
Episcopal Divinity School 9
Phillips Graduate Institute 9
dtype: int64
If all nine values in a row are missing, drop that row with the .dropna method:
>>> college_ugds = college_ugds.dropna(how="all")
>>> college_ugds.isnull().sum()
UGDS_WHITE 0
UGDS_BLACK 0
UGDS_HISP 0
UGDS_ASIAN 0
UGDS_AIAN 0
UGDS_NHPI 0
UGDS_2MOR 0
UGDS_NRA 0
UGDS_UNKN 0
dtype: int64
Flag every racial proportion of at least 15%:
>>> college_ugds.ge(0.15)
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False True ... False False
Universit... True True ... False False
Amridge U... True True ... False True
Universit... True False ... False False
Alabama S... False True ... False False
... ... ... ... ... ...
Hollywood... True True ... False False
Hollywood... False True ... False False
Coachella... True False ... False False
Dewey Uni... False False ... False False
Coastal P... True True ... False False
Use the .sum method to count the flags in each row:
>>> diversity_metric = college_ugds.ge(0.15).sum(
... axis="columns"
... )
>>> diversity_metric.head()
INSTNM
Alabama A & M University 1
University of Alabama at Birmingham 2
Amridge University 3
University of Alabama in Huntsville 1
Alabama State University 1
dtype: int64
Use .value_counts to see how this Series is distributed:
>>> diversity_metric.value_counts()
1 3042
2 2884
3 876
4 63
0 7
5 2
dtype: int64
Surprisingly, two schools have five races each above 15%. Sort diversity_metric in descending order:
>>> diversity_metric.sort_values(ascending=False).head()
INSTNM
Regency Beauty Institute-Austin 5
Central Texas Beauty College-Temple 5
Sullivan and Cogliano Training Center 4
Ambria College of Nursing 4
Berkeley College-New York 4
dtype: int64
Look at the two top-ranked schools:
>>> college_ugds.loc[
... [
... "Regency Beauty Institute-Austin",
... "Central Texas Beauty College-Temple",
... ]
... ]
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Regency B... 0.1867 0.2133 ... 0.0 0.2667
Central T... 0.1616 0.2323 ... 0.0 0.1515
Now see how the top five schools from the US News ranking fare:
>>> us_news_top = [
... "Rutgers University-Newark",
... "Andrews University",
... "Stanford University",
... "University of Houston",
... "University of Nevada-Las Vegas",
... ]
>>> diversity_metric.loc[us_news_top]
INSTNM
Rutgers University-Newark 4
Andrews University 3
Stanford University 3
University of Houston 3
University of Nevada-Las Vegas 3
dtype: int64
There's more
Look at the ten least diverse schools:
>>> (
... college_ugds.max(axis=1)
... .sort_values(ascending=False)
... .head(10)
... )
INSTNM
Dewey University-Manati 1.0
Yeshiva and Kollel Harbotzas Torah 1.0
Mr Leon's School of Hair Design-Lewiston 1.0
Dewey University-Bayamon 1.0
Shepherds Theological Seminary 1.0
Yeshiva Gedolah Kesser Torah 1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias 1.0
Yeshiva Shaar Hatorah 1.0
Bais Medrash Elyon 1.0
Yeshiva of Nitra Rabbinical College 1.0
dtype: float64
Is there any school where every racial proportion exceeds 1%?
>>> (college_ugds > 0.01).all(axis=1).any()
True