Pandas 1.x Cookbook, 2nd Edition · Chapter 7: Filtering Rows

Chapter 1: Pandas Foundations
Chapter 2: Essential DataFrame Operations
Chapter 3: Creating and Persisting DataFrames
Chapter 4: Beginning Data Analysis
Chapter 5: Exploratory Data Analysis
Chapter 6: Selecting Subsets of Data
Chapter 7: Filtering Rows
Chapter 8: Index Alignment


7.1 Calculating boolean statistics

Read in the movie dataset and inspect the first few rows:

>>> import pandas as pd
>>> import numpy as np
>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> movie[["duration"]].head()
                                            duration
movie_title
Avatar                                         178.0
Pirates of the Caribbean: At World's End       169.0
Spectre                                        148.0
The Dark Knight Rises                          164.0
Star Wars: Episode VII - The Force Awakens       NaN

Check whether each movie runs longer than two hours:

>>> movie_2_hours = movie["duration"] > 120
>>> movie_2_hours.head(10)
movie_title
Avatar                                         True
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
John Carter                                    True
Spider-Man 3                                   True
Tangled                                       False
Avengers: Age of Ultron                        True
Harry Potter and the Half-Blood Prince         True
Name: duration, dtype: bool

Use this boolean Series to count the movies longer than two hours (True is treated as 1):

>>> movie_2_hours.sum()
1039

The percentage of movies longer than two hours:

>>> movie_2_hours.mean() * 100
21.13506916192026

The preceding steps did not drop missing values first, so the result is slightly misleading; comparisons against NaN evaluate to False:

>>> movie["duration"].dropna().gt(120).mean() * 100
21.199755152009794
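These statistics work because booleans behave as 0/1 in arithmetic, and because a comparison against NaN yields False. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# True counts as 1 and False as 0; a NaN comparison yields False.
s = pd.Series([150.0, np.nan, 130.0, 80.0])
mask = s > 120
print(mask.sum())                  # 2 -> number of values above 120
print(mask.mean())                 # 0.5 -> the NaN row counted as False
print(s.dropna().gt(120).mean())   # 2/3 after dropping the NaN first
```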

Use the .describe method to output summary statistics:

>>> movie_2_hours.describe()
count      4916
unique        2
top       False
freq       3877
Name: duration, dtype: object

How it works

Use the .value_counts method with normalize=True to get the proportions of False and True:

>>> movie_2_hours.value_counts(normalize=True)
False    0.788649
True     0.211351
Name: duration, dtype: float64

There's more

You can also build a boolean Series by comparing two columns of the same DataFrame:

>>> actors = movie[
...     ["actor_1_facebook_likes", "actor_2_facebook_likes"]
... ].dropna()
>>> (
...     actors["actor_1_facebook_likes"]
...     > actors["actor_2_facebook_likes"]
... ).mean()
0.9777687130328371

7.2 Constructing multiple boolean conditions

Read in the data:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )

Create variables to hold the boolean arrays:

>>> criteria1 = movie.imdb_score > 8
>>> criteria2 = movie.content_rating == "PG-13"
>>> criteria3 = (movie.title_year < 2000) | (
...     movie.title_year > 2009
... )

Combine all the filters into a single boolean array:

>>> criteria_final = criteria1 & criteria2 & criteria3
>>> criteria_final.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

There's more

Python evaluates a chain of boolean operations one step at a time:

>>> 5 < 10 and 3 > 4
False
>>> True and 3 > 4
False
>>> True and False
False
>>> False
False
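This matters in pandas because & and | bind more tightly than the comparison operators, so each comparison must be parenthesized before combining. A sketch with hypothetical values:

```python
import pandas as pd

# & binds tighter than <, so the unparenthesized form is misgrouped:
print(5 < 10 & 3)   # parsed as 5 < (10 & 3), i.e. 5 < 2 -> False

# With a Series, parenthesize each comparison before combining:
years = pd.Series([1995, 2005, 2012])
mask = (years < 2000) | (years > 2009)
print(mask.tolist())  # [True, False, True]
```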

7.3 Filtering with boolean arrays

Read in the data and set up the filter conditions:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> crit_a1 = movie.imdb_score > 8
>>> crit_a2 = movie.content_rating == "PG-13"
>>> crit_a3 = (movie.title_year < 2000) | (
...     movie.title_year > 2009
... )
>>> final_crit_a = crit_a1 & crit_a2 & crit_a3

Create a second set of conditions:

>>> crit_b1 = movie.imdb_score < 5
>>> crit_b2 = movie.content_rating == "R"
>>> crit_b3 = (movie.title_year >= 2000) & (
...     movie.title_year <= 2010
... )
>>> final_crit_b = crit_b1 & crit_b2 & crit_b3

Combine the two sets into the final condition:

>>> final_crit_all = final_crit_a | final_crit_b
>>> final_crit_all.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

Filter the data with the final condition:

>>> movie[final_crit_all].head()
                            color  ... movie/likes
movie_title                        ...
The Dark Knight Rises       Color  ...      164000
The Avengers                Color  ...      123000
Captain America: Civil War  Color  ...       72000
Guardians of the Galaxy     Color  ...       96000
Interstellar                Color  ...      349000

The filter can also be passed to .loc:

>>> movie.loc[final_crit_all].head()
                            color  ... movie/likes
movie_title                        ...
The Dark Knight Rises       Color  ...      164000
The Avengers                Color  ...      123000
Captain America: Civil War  Color  ...       72000
Guardians of the Galaxy     Color  ...       96000
Interstellar                Color  ...      349000

You can also specify columns in .loc:

>>> cols = ["imdb_score", "content_rating", "title_year"]
>>> movie_filtered = movie.loc[final_crit_all, cols]
>>> movie_filtered.head(10)
              imdb_score content_rating  title_year
movie_title
The Dark ...         8.5        PG-13        2012.0
The Avengers         8.1        PG-13        2012.0
Captain A...         8.2        PG-13        2016.0
Guardians...         8.1        PG-13        2014.0
Interstellar         8.6        PG-13        2014.0
Inception            8.8        PG-13        2010.0
The Martian          8.1        PG-13        2015.0
Town & Co...         4.4            R        2001.0
Sex and t...         4.3            R        2010.0
Rollerball           3.0            R        2002.0

The .iloc indexer does not accept a boolean Series, but it does accept a NumPy array:

>>> movie.iloc[final_crit_all]
Traceback (most recent call last):
  ...
ValueError: iLocation based boolean indexing cannot use an indexable as a mask    
>>> movie.iloc[final_crit_all.to_numpy()]
                            color  ... movie/likes
movie_title                        ...
The Dark Knight Rises       Color  ...      164000
The Avengers                Color  ...      123000
Captain America: Civil War  Color  ...       72000
Guardians of the Galaxy     Color  ...       96000
Interstellar                Color  ...      349000
...                           ...  ...         ...
The Young Unknowns          Color  ...           4
Bled                        Color  ...         128
Hoop Dreams                 Color  ...           0
Death Calls                 Color  ...          16
The Legend of God's Gun     Color  ...          13

There's more

All the conditions can be combined into a single expression:

>>> final_crit_a2 = (
...     (movie.imdb_score > 8)
...     & (movie.content_rating == "PG-13")
...     & (
...         (movie.title_year < 2000)
...         | (movie.title_year > 2009)
...     )
... )
>>> final_crit_a2.equals(final_crit_a)
True

7.4 Comparing row filtering and index filtering

Read in the data and filter it:

>>> college = pd.read_csv("data/college.csv")
>>> college[college["STABBR"] == "TX"].head()
                            INSTNM  ... GRAD_DEBT_MDN_SUPP
3610  Abilene Christian University  ...              25985
3611       Alvin Community College  ...               6750
3612              Amarillo College  ...              10950
3613              Angelina College  ...  PrivacySuppressed
3614       Angelo State University  ...            21319.5

Repeat the selection, this time setting the STABBR column as the row index and using label-based selection with .loc:

>>> college2 = college.set_index("STABBR")
>>> college2.loc["TX"].head()
                              INSTNM  ... GRAD_DEBT_MDN_SUPP
STABBR                                ...
TX      Abilene Christian University  ...              25985
TX           Alvin Community College  ...               6750
TX                  Amarillo College  ...              10950
TX                  Angelina College  ...  PrivacySuppressed
TX           Angelo State University  ...            21319.5

Compare the speed of the two approaches:

>>> %timeit college[college['STABBR'] == 'TX']
1.75 ms ± 187 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit college2.loc['TX']
882 μs ± 69.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Although the index lookup is faster, creating the index takes time as well:

>>> %timeit college2 = college.set_index('STABBR')
2.01 ms ± 107 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
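Whether the one-time cost of set_index pays off depends on how many lookups follow; either route selects the same rows. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"STABBR": ["TX", "CA", "TX", "NY"],
                   "INSTNM": ["a", "b", "c", "d"]})
by_mask = df[df["STABBR"] == "TX"]           # boolean filtering
by_index = df.set_index("STABBR").loc["TX"]  # index lookup
# Same rows either way; the indexed version moves STABBR into the index.
print(by_mask["INSTNM"].tolist())   # ['a', 'c']
print(by_index["INSTNM"].tolist())  # ['a', 'c']
```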

There's more

Select rows for several states at once, either with .isin or with the index:

>>> states = ["TX", "CA", "NY"]
>>> college[college["STABBR"].isin(states)]
           INSTNM         CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
192   Academy ...  San Fran...  ...        36000           35093
193   ITT Tech...  Rancho C...  ...        38800         25827.5
194   Academy ...      Oakland  ...          NaN     PrivacyS...
195   The Acad...  Huntingt...  ...        28400            9500
196   Avalon S...      Alameda  ...        21600            9860
...           ...          ...  ...          ...             ...
7528  WestMed ...       Merced  ...          NaN         15623.5
7529  Vantage ...      El Paso  ...          NaN            9500
7530  SAE Inst...   Emeryville  ...          NaN            9500
7533  Bay Area...     San Jose  ...          NaN     PrivacyS...
7534  Excel Le...  San Antonio  ...          NaN           12125
>>> college2.loc[states]
             INSTNM        CITY  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
STABBR                           ...
TX      Abilene ...     Abilene  ...        40200            25985
TX      Alvin Co...       Alvin  ...        34500             6750
TX      Amarillo...    Amarillo  ...        31700            10950
TX      Angelina...      Lufkin  ...        26900      PrivacyS...
TX      Angelo S...  San Angelo  ...        37700          21319.5
...             ...         ...  ...          ...              ...
NY      Briarcli...   Patchogue  ...        38200          28720.5
NY      Jamestow...   Salamanca  ...          NaN            12050
NY      Pratt Ma...    New York  ...        40900            26691
NY      Saint Jo...   Patchogue  ...        52000          22143.5
NY      Franklin...    Brooklyn  ...        20000      PrivacyS...

7.5 Selecting with unique and sorted indexes

Read in the dataset, set STABBR as the index, and check whether the index is monotonic:

>>> college = pd.read_csv("data/college.csv")
>>> college2 = college.set_index("STABBR")
>>> college2.index.is_monotonic
False

Sort the index and check again:

>>> college3 = college2.sort_index()
>>> college3.index.is_monotonic
True

Time selecting TX from each of the three DataFrames:

>>> %timeit college[college['STABBR'] == 'TX']
1.75 ms ± 187 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit college2.loc['TX']
1.09 ms ± 232 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit college3.loc['TX']
304 μs ± 17.8 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A sorted index speeds things up considerably. Now try a unique index:

>>> college_unique = college.set_index("INSTNM")
>>> college_unique.index.is_unique
True

Boolean indexing returns a DataFrame:

>>> college[college["INSTNM"] == "Stanford University"]
           INSTNM      CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
4217  Stanford...  Stanford  ...        86000           12782

Selecting by the unique row index returns a Series:

>>> college_unique.loc["Stanford University"]
CITY                  Stanford
STABBR                      CA
HBCU                         0
MENONLY                      0
WOMENONLY                    0
                        ...
PCTPELL                 0.1556
PCTFLOAN                0.1256
UG25ABV                 0.0401
MD_EARN_WNE_P10          86000
GRAD_DEBT_MDN_SUPP       12782
Name: Stanford University, Length: 26, dtype: object
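The return types differ: boolean filtering always yields a DataFrame, even for a single match, while .loc with a scalar label on a unique index yields a Series. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "x": [1, 2]})
uniq = df.set_index("name")
# Boolean filtering keeps the DataFrame shape; a unique scalar label
# in .loc collapses the single matching row into a Series.
print(type(df[df["name"] == "a"]).__name__)  # DataFrame
print(type(uniq.loc["a"]).__name__)          # Series
```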

There's more

Use the city name and state abbreviation together as the row index:

>>> college.index = (
...     college["CITY"] + ", " + college["STABBR"]
... )
>>> college = college.sort_index()
>>> college.head()
                   INSTNM      CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
ARTESIA, CA   Angeles ...   ARTESIA  ...          NaN           16850
Aberdeen, SD  Presenta...  Aberdeen  ...        35900           25000
Aberdeen, SD  Northern...  Aberdeen  ...        33600           24847
Aberdeen, WA  Grays Ha...  Aberdeen  ...        27000           11490
Abilene, TX   Hardin-S...   Abilene  ...        38700           25864

Select all schools from Miami, FL:

>>> college.loc["Miami, FL"].head()
                INSTNM   CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
Miami, FL  New Prof...  Miami  ...        18700            8682
Miami, FL  Manageme...  Miami  ...  PrivacyS...           12182
Miami, FL  Strayer ...  Miami  ...        49200         36173.5
Miami, FL  Keiser U...  Miami  ...        29700           26063
Miami, FL  George T...  Miami  ...        38600     PrivacyS...

Compare the speed of the two approaches:

>>> %%timeit
>>> crit1 = college["CITY"] == "Miami"
>>> crit2 = college["STABBR"] == "FL"
>>> college[crit1 & crit2]
3.05 ms ± 66.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit college.loc['Miami, FL']
369 μs ± 130 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

7.6 Translating SQL WHERE clauses

The SQL statement is:

SELECT
    UNIQUE_ID,
    DEPARTMENT,
    GENDER,
    BASE_SALARY
FROM
    EMPLOYEE
WHERE
    DEPARTMENT IN ('Houston Police Department-HPD',
                   'Houston Fire Department (HFD)') AND
    GENDER = 'Female' AND
    BASE_SALARY BETWEEN 80000 AND 120000;

Replicate the SQL statement with Pandas. First, read in the data:

>>> employee = pd.read_csv("data/employee.csv")

Inspect the dataset:

>>> employee.dtypes
UNIQUE_ID              int64
POSITION_TITLE        object
DEPARTMENT            object
BASE_SALARY          float64
RACE                  object
EMPLOYMENT_TYPE       object
GENDER                object
EMPLOYMENT_STATUS     object
HIRE_DATE             object
JOB_DATE              object
dtype: object
>>> employee.DEPARTMENT.value_counts().head()
Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Name: DEPARTMENT, dtype: int64    
>>> employee.GENDER.value_counts()
Male      1397
Female     603
Name: GENDER, dtype: int64    
>>> employee.BASE_SALARY.describe()
count      1886.000000
mean      55767.931601
std       21693.706679
min       24960.000000
25%       40170.000000
50%       54461.000000
75%       66614.000000
max      275000.000000
Name: BASE_SALARY, dtype: float64

Create the filter conditions:

>>> depts = [
...     "Houston Police Department-HPD",
...     "Houston Fire Department (HFD)",
... ]
>>> criteria_dept = employee.DEPARTMENT.isin(depts)
>>> criteria_gender = employee.GENDER == "Female"
>>> criteria_sal = (employee.BASE_SALARY >= 80000) & (
...     employee.BASE_SALARY <= 120000
... )
>>> criteria_final = (
...     criteria_dept & criteria_gender & criteria_sal
... )

Use the conditions to filter the data:

>>> select_columns = [
...     "UNIQUE_ID",
...     "DEPARTMENT",
...     "GENDER",
...     "BASE_SALARY",
... ]
>>> employee.loc[criteria_final, select_columns].head()
     UNIQUE_ID   DEPARTMENT  GENDER  BASE_SALARY
61          61  Houston ...  Female      96668.0
136        136  Houston ...  Female      81239.0
367        367  Houston ...  Female      86534.0
474        474  Houston ...  Female      91181.0
513        513  Houston ...  Female      81239.0

There's more

Like SQL, Pandas has a between method:

>>> criteria_sal = employee.BASE_SALARY.between(
...     80_000, 120_000
... )
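Like SQL's BETWEEN, .between is inclusive on both ends by default. A quick sketch with made-up salaries:

```python
import pandas as pd

sal = pd.Series([75_000, 80_000, 100_000, 120_000, 125_000])
# Inclusive on both ends, matching SQL's BETWEEN semantics.
mask = sal.between(80_000, 120_000)
print(mask.tolist())  # [False, True, True, True, False]
```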

7.7 Improving the readability of boolean indexing with .query

This recipe uses the DataFrame .query method.

Read in the data:

>>> employee = pd.read_csv("data/employee.csv")
>>> depts = [
...     "Houston Police Department-HPD",
...     "Houston Fire Department (HFD)",
... ]
>>> select_columns = [
...     "UNIQUE_ID",
...     "DEPARTMENT",
...     "GENDER",
...     "BASE_SALARY",
... ]

Build the query string:

>>> qs = (
...     "DEPARTMENT in @depts "
...     " and GENDER == 'Female' "
...     " and 80000 <= BASE_SALARY <= 120000"
... )
>>> emp_filtered = employee.query(qs)
>>> emp_filtered[select_columns].head()
     UNIQUE_ID   DEPARTMENT  GENDER  BASE_SALARY
61          61  Houston ...  Female      96668.0
136        136  Houston ...  Female      81239.0
367        367  Houston ...  Female      86534.0
474        474  Houston ...  Female      91181.0
513        513  Houston ...  Female      81239.0
Find female employees who are not in the ten largest departments:

>>> top10_depts = (
...     employee.DEPARTMENT.value_counts()
...     .index[:10]
...     .tolist()
... )
>>> qs = "DEPARTMENT not in @top10_depts and GENDER == 'Female'"
>>> employee_filtered2 = employee.query(qs)
>>> employee_filtered2.head()
     UNIQUE_ID POSITION_TITLE  ...   HIRE_DATE    JOB_DATE
0            0  ASSISTAN...    ...  2006-06-12  2012-10-13
73          73  ADMINIST...    ...  2011-12-19  2013-11-23
96          96  ASSISTAN...    ...  2013-06-10  2013-06-10
117        117  SENIOR A...    ...  1998-03-20  2012-07-21
146        146  SENIOR S...    ...  2014-03-17  2014-03-17
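The @ prefix in a query string refers to a Python variable in the enclosing scope, and chained comparisons such as a <= b <= c are allowed. A sketch with hypothetical DEPT and SALARY columns:

```python
import pandas as pd

df = pd.DataFrame({"DEPT": ["HPD", "HFD", "PWE"],
                   "SALARY": [90000, 70000, 85000]})
depts = ["HPD", "HFD"]
# @depts pulls in the local list; the salary range is a chained comparison.
out = df.query("DEPT in @depts and 80000 <= SALARY <= 120000")
print(out["DEPT"].tolist())  # ['HPD']
```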

7.8 Preserving Series size with .where

Read in the data with the movie title as the index, and drop missing values from the actor_1_facebook_likes column:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> fb_likes = movie["actor_1_facebook_likes"].dropna()
>>> fb_likes.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      40000.0
Spectre                                       11000.0
The Dark Knight Rises                         27000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

Inspect it with the describe method:

>>> fb_likes.describe()
count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64

Plot a histogram to view the distribution:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> fb_likes.hist(ax=ax)
>>> fig.savefig(
...     "c7-hist.png", dpi=300
... ) 

The plot reveals little about the distribution; most values are below 20,000:

>>> criteria_high = fb_likes < 20_000
>>> criteria_high.mean().round(2)
0.91

The .where method replaces values where the condition is False with NaN, keeping the Series the same size:

>>> fb_likes.where(criteria_high).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End          NaN
Spectre                                       11000.0
The Dark Knight Rises                             NaN
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

The other parameter of .where controls the replacement value:

>>> fb_likes.where(criteria_high, other=20000).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

Chain a second .where to set a floor as well:

>>> criteria_low = fb_likes > 300
>>> fb_likes_cap = fb_likes.where(
...     criteria_high, other=20_000
... ).where(criteria_low, 300)
>>> fb_likes_cap.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      300.0
Name: actor_1_facebook_likes, dtype: float64

The Series keeps its original size:

>>> len(fb_likes), len(fb_likes_cap)
(4909, 4909)

Plot the histogram again:

>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> fb_likes_cap.hist(ax=ax)
>>> fig.savefig(
...     "c7-hist2.png", dpi=300
... )

There's more

Pandas' .clip method sets lower and upper bounds in a single call (the older .clip_lower and .clip_upper methods were removed in Pandas 1.0):

>>> fb_likes_cap2 = fb_likes.clip(lower=300, upper=20000)
>>> fb_likes_cap2.equals(fb_likes_cap)
True
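The equivalence can be checked on synthetic data: .clip caps in one call what the two chained .where calls did:

```python
import pandas as pd

s = pd.Series([100, 500, 30_000])
capped = s.clip(lower=300, upper=20_000)
# Same result as two chained .where calls with `other` replacements.
manual = s.where(s < 20_000, 20_000).where(s > 300, 300)
print(capped.tolist())        # [300, 500, 20000]
print(capped.equals(manual))  # True
```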

7.9 Masking DataFrame rows

Read in the data and create the conditions:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> c1 = movie["title_year"] >= 2010
>>> c2 = movie["title_year"].isna()
>>> criteria = c1 | c2

The .mask method blots out the rows that meet the condition:

>>> movie.mask(criteria).head()
                                            color  ...
movie_title                                        ...
Avatar                                      Color  ...
Pirates of the Caribbean: At World's End    Color  ...
Spectre                                       NaN  ...
The Dark Knight Rises                         NaN  ...
Star Wars: Episode VII - The Force Awakens    NaN  ...

Note that rows three through five above are now missing; drop the rows that are entirely NaN:

>>> movie_mask = movie.mask(criteria).dropna(how="all")
>>> movie_mask.head()
                                          color  ...
movie_title                                      ...
Avatar                                    Color  ...
Pirates of the Caribbean: At World's End  Color  ...
Spider-Man 3                              Color  ...
Harry Potter and the Half-Blood Prince    Color  ...
Superman Returns                          Color  ...

The .equals method shows the two results are not identical:

>>> movie_boolean = movie[movie["title_year"] < 2010]
>>> movie_mask.equals(movie_boolean)
False

But their shapes match:

>>> movie_mask.shape == movie_boolean.shape
True

Compare the data types of the two results:

>>> movie_mask.dtypes == movie_boolean.dtypes
color                       True
director_name               True
num_critic_for_reviews      True
duration                    True
director_facebook_likes     True
                           ...
title_year                  True
actor_2_facebook_likes      True
imdb_score                  True
aspect_ratio                True
movie_facebook_likes       False
Length: 27, dtype: bool

Pandas provides an assert_frame_equal function that can compare DataFrames while ignoring dtype differences:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(
...     movie_boolean, movie_mask, check_dtype=False
... )
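The shape-preserving behavior of .mask can be seen on a tiny frame: matching rows become NaN and are only removed by the explicit dropna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [1999.0, 2012.0, np.nan]})
cond = (df["year"] >= 2010) | (df["year"].isna())
masked = df.mask(cond)  # shape preserved; matching rows become NaN
print(len(masked))                    # 3
print(len(masked.dropna(how="all")))  # 1
```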

There's more

Compare the speed of the two approaches:

>>> %timeit movie.mask(criteria).dropna(how='all')
11.2 ms ± 144 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit movie[movie['title_year'] < 2010]
1.07 ms ± 34.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

7.10 Selecting with booleans, integer positions, and labels

Read in the data and create the conditions:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> c1 = movie["content_rating"] == "G"
>>> c2 = movie["imdb_score"] < 4
>>> criteria = c1 & c2

Filter rows with .loc:

>>> movie_loc = movie.loc[criteria]
>>> movie_loc.head()
                                color  ... movie/likes
movie_title                            ...
The True Story of Puss'N Boots  Color  ...          90
Doogal                          Color  ...         346
Thomas and the Magic Railroad   Color  ...         663
Barney's Great Adventure        Color  ...         436
Justin Bieber: Never Say Never  Color  ...       62000

This is equivalent to indexing the DataFrame directly:

>>> movie_loc.equals(movie[criteria])
True

.iloc requires the condition to be converted to a NumPy array:

>>> movie_iloc = movie.iloc[criteria.to_numpy()]
>>> movie_iloc.equals(movie_loc)
True

Select the columns whose dtype is int64:

>>> criteria_col = movie.dtypes == np.int64
>>> criteria_col.head()
color                      False
director_name              False
num_critic_for_reviews     False
duration                   False
director_facebook_likes    False
dtype: bool
>>> movie.loc[:, criteria_col].head()
              num_voted_users  cast_total_facebook_likes  movie_facebook_likes
movie_title
Avatar             886204             4834                      33000
Pirates o...       471220            48350                          0
Spectre            275868            11700                      85000
The Dark ...      1144337           106759                     164000
Star Wars...            8              143                          0

因?yàn)槭荢eries莫杈,criteria_col必須要轉(zhuǎn)化為numpy就可以用于iloc

>>> movie.iloc[:, criteria_col.to_numpy()].head()
              num_voted_users  cast_total_facebook_likes  movie_facebook_likes
movie_title
Avatar             886204             4834                      33000
Pirates o...       471220            48350                          0
Spectre            275868            11700                      85000
The Dark ...      1144337           106759                     164000
Star Wars...            8              143                          0

Combine the row condition with a column list in .loc:

>>> cols = [
...     "content_rating",
...     "imdb_score",
...     "title_year",
...     "gross",
... ]
>>> movie.loc[criteria, cols].sort_values("imdb_score")
             content_rating  imdb_score  title_year       gross
movie_title
Justin Bi...            G           1.6      2011.0  73000942.0
Sunday Sc...            G           2.5      2008.0         NaN
Doogal                  G           2.8      2006.0   7382993.0
Barney's ...            G           2.8      1998.0  11144518.0
The True ...            G           2.9      2009.0         NaN
Thomas an...            G           3.6      2000.0  15911333.0

.iloc requires integer column positions instead:

>>> col_index = [movie.columns.get_loc(col) for col in cols]
>>> col_index
[20, 24, 22, 8]
>>> movie.iloc[criteria.to_numpy(), col_index].sort_values(
...     "imdb_score"
... )
             content_rating  imdb_score  title_year       gross
movie_title
Justin Bi...            G           1.6      2011.0  73000942.0
Sunday Sc...            G           2.5      2008.0         NaN
Doogal                  G           2.8      2006.0   7382993.0
Barney's ...            G           2.8      1998.0  11144518.0
The True ...            G           2.9      2009.0         NaN
Thomas an...            G           3.6      2000.0  15911333.0
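Index.get_loc maps a label to its integer position, which is exactly what .iloc needs. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
cols = ["c", "a"]
# Translate column labels into the integer positions .iloc expects.
positions = [df.columns.get_loc(c) for c in cols]
print(positions)                               # [2, 0]
print(df.iloc[:, positions].columns.tolist())  # ['c', 'a']
```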

(This recipe overlaps considerably with earlier ones.)


