1.NaN (null value) and None (missing value)
Missing data can take a few different forms:
- In Python, the None keyword and type indicates no value.
- The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.
In general terms, both NaN and None can be called null values.
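As a quick check of the point above, here is a minimal sketch on a hand-made Series. The values are made up for illustration and do not come from the Titanic data. It assumes a float Series, where pandas stores None as NaN on construction.

```python
import numpy
import pandas

# Hypothetical values: one real number, one None, one NaN.
values = pandas.Series([1.0, None, numpy.nan])

# In a float Series, None is stored as NaN, so the dtype stays float64.
stored_dtype = str(values.dtype)

# pandas.isnull() flags both kinds of null value.
null_mask = pandas.isnull(values)

# NaN never compares equal to itself, so == cannot be used to find it.
nan_equals_itself = numpy.nan == numpy.nan

print(stored_dtype)
print(null_mask.tolist())
print(nan_equals_itself)
```

This is why the notes rely on pandas.isnull() rather than an equality test to locate missing values.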
2.Detecting missing values / empty entries: pandas.isnull(XXX)
If we want to see which values are NaN, we can use the pandas.isnull() function, which takes a pandas Series and returns a Series of True and False values, the same way that NumPy did when we compared arrays.
input
age = titanic_survival["age"]
print(age.loc[10:20])
age_is_null = pandas.isnull(age)  # True where the value is NaN or None, False otherwise
age_null_true = age[age_is_null]  # keep only the entries whose age is missing
age_null_count = len(age_null_true)
print(age_null_count)
output
10 47.0
11 18.0
12 24.0
13 26.0
14 80.0
15 NaN
16 24.0
17 50.0
18 32.0
19 36.0
20 37.0
Name: age, dtype: float64
264
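The same count can be obtained without building the filtered Series, because True counts as 1 when a boolean Series is summed. A minimal sketch on made-up ages (not the Titanic column):

```python
import pandas

# Hypothetical ages; None marks a missing value.
ages = pandas.Series([47.0, 18.0, None, 24.0, None])

is_null = pandas.isnull(ages)

# Approach used above: filter with the boolean mask, then take the length.
count_by_filter = len(ages[is_null])

# Shorter alternative: summing a boolean Series counts the True values.
count_by_sum = int(is_null.sum())

print(count_by_filter, count_by_sum)
```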
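3.Doing arithmetic when null values are present
# Compute the mean age when the column contains null values
age_is_null = pandas.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][~age_is_null]  # keep only the non-null ages
correct_mean_age1 = sum(good_ages) / len(good_ages)
# Use Series.mean(), which skips missing values by default
correct_mean_age2 = titanic_survival["age"].mean()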
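To see why the null values have to be filtered out first, here is a small sketch on a made-up Series (hypothetical ages, not the Titanic column). A single NaN turns a plain Python sum into NaN, while Series.mean() skips it.

```python
import math

import pandas

# Hypothetical ages with one missing value.
ages = pandas.Series([20.0, 30.0, None, 40.0])

# Naive mean: the NaN propagates through sum(), so the result is NaN.
naive_mean = sum(ages) / len(ages)

# Filter out the null values first, as in the notes above.
good_ages = ages[~pandas.isnull(ages)]
manual_mean = sum(good_ages) / len(good_ages)

# Series.mean() ignores missing values by default.
builtin_mean = ages.mean()

print(math.isnan(naive_mean), manual_mean, builtin_mean)
```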
4.Using a dictionary to compute the mean fare for each passenger class
input
passenger_classes = [1, 2, 3]  # the Titanic had passenger classes 1, 2 and 3
fares_by_class = {}  # create an empty dictionary
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival['pclass'] == this_class]  # all rows for this class
    mean_fares = pclass_rows['fare'].mean()  # mean fare for this class
    fares_by_class[this_class] = mean_fares  # store the result in the dictionary
print(fares_by_class)
output
{1: 87.508991640866881, 2: 21.179196389891697, 3: 13.302888700564973}
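The loop above can also be written with DataFrame.groupby(). This is a swapped-in technique, not the approach used in these notes. A minimal sketch on a small made-up table (the column names match the Titanic data, the values do not):

```python
import pandas

# Hypothetical passengers, chosen so the means are easy to verify by hand.
passengers = pandas.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 10.0, 14.0],
})

# Dictionary approach, as in the notes.
fares_by_class_loop = {}
for this_class in [1, 2, 3]:
    class_rows = passengers[passengers["pclass"] == this_class]
    fares_by_class_loop[this_class] = class_rows["fare"].mean()

# groupby approach: group by class, take the mean fare, convert to a dict.
fares_by_class_groupby = passengers.groupby("pclass")["fare"].mean().to_dict()

print(fares_by_class_groupby)
```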
5.Using DataFrame.pivot_table()
Pivot tables provide an easy way to subset by one column and then apply a calculation like a sum or a mean.
The problem from section 4 can also be solved with DataFrame.pivot_table().
The first parameter of the method, index, tells the method which column to group by. The second parameter, values, is the column that we want to apply the calculation to. aggfunc specifies the calculation we want to perform. The default for the aggfunc parameter is actually the mean.
input1
import numpy
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=numpy.mean)
print(passenger_class_fares)
output1
pclass
1.0 87.508992
2.0 21.179196
3.0 13.302889
Name: fare, dtype: float64
input2
passenger_age = titanic_survival.pivot_table(index="pclass", values="age",aggfunc=numpy.mean)
print(passenger_age)
output2
pclass
1.0 39.159918
2.0 29.506705
3.0 24.816367
Name: age, dtype: float64
input3
import numpy
port_stats = titanic_survival.pivot_table(index='embarked', values=["fare", "survived"], aggfunc=numpy.sum)
print(port_stats)
output3
                fare  survived
embarked
C         16830.7922     150.0
Q          1526.3085      44.0
S         25033.3862     304.0
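For experimenting without the Titanic file, here is a minimal sketch of the same pivot_table() calls on a made-up table. The data is hypothetical. The aggregation is passed as a string name ("mean", "sum"), which pandas also accepts, and values is passed as a list so that the result is a DataFrame.

```python
import pandas

# Hypothetical passengers; column names follow the Titanic data.
passengers = pandas.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 10.0, 14.0],
    "survived": [1, 0, 1, 0, 1],
})

# Mean fare per class.
mean_table = passengers.pivot_table(index="pclass", values=["fare"], aggfunc="mean")

# Leaving out aggfunc gives the same result, since the mean is the default.
default_table = passengers.pivot_table(index="pclass", values=["fare"])

# Several value columns can be aggregated at once.
sum_table = passengers.pivot_table(
    index="pclass", values=["fare", "survived"], aggfunc="sum"
)

print(mean_table)
print(sum_table)
```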
6.Dropping missing values: DataFrame.dropna()
The method DataFrame.dropna() will drop any rows that contain missing values.
drop_na_rows = titanic_survival.dropna(axis=0)  # drop every row that contains a missing value
drop_na_columns = titanic_survival.dropna(axis=1)  # drop every column that contains a missing value
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])  # drop rows with a missing value in 'age' or 'sex'
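The three calls above behave quite differently, which is easy to see on a tiny made-up table (hypothetical values, Titanic-style column names):

```python
import pandas

# Hypothetical passengers: row 1 lacks an age, row 2 lacks a sex.
passengers = pandas.DataFrame({
    "age": [22.0, None, 35.0],
    "sex": ["male", "female", None],
    "fare": [7.25, 71.28, 8.05],
})

# axis=0: keep only the rows with no missing value at all.
complete_rows = passengers.dropna(axis=0)

# axis=1: keep only the columns with no missing value at all.
complete_columns = passengers.dropna(axis=1)

# subset: only look at the listed columns when deciding which rows to drop.
rows_with_age = passengers.dropna(axis=0, subset=["age"])

print(complete_rows.shape, complete_columns.shape, rows_with_age.shape)
```

Note that dropna() returns a new DataFrame and leaves the original untouched, and that the surviving rows keep their original index labels.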
7.DataFrame.loc[4] vs. DataFrame.iloc[4]
input
# We have already sorted new_titanic_survival by age
first_five_rows_1 = new_titanic_survival.iloc[5]  # the row at position 5 (the sixth row, counting from 0)
first_five_rows_2 = new_titanic_survival.loc[5]  # the row whose index label is 5
row_index_25_survived = new_titanic_survival.loc[25, 'survived']  # the 'survived' value of the row whose index label is 25
print(first_five_rows_1)
print('------------------------------------------')
print(first_five_rows_2)
output
pclass                          3
survived                        0
name         Connors, Mr. Patrick
sex                          male
age                          70.5
sibsp                           0
parch                           0
ticket                     370369
fare                         7.75
cabin                         NaN
embarked                        Q
boat                          NaN
body                          171
home.dest                     NaN
Name: 727, dtype: object
------------------------------------------
pclass                         1
survived                       1
name         Anderson, Mr. Harry
sex                         male
age                           48
sibsp                          0
parch                          0
ticket                     19952
fare                       26.55
cabin                        E12
embarked                       S
boat                           3
body                         NaN
home.dest           New York, NY
Name: 5, dtype: object
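The difference between position and label only shows up once the rows are out of their original order. A minimal sketch on a made-up three-row table (hypothetical ages):

```python
import pandas

# Hypothetical ages; the default index labels are 0, 1, 2.
passengers = pandas.DataFrame({"age": [30.0, 10.0, 20.0]})

# Sorting reorders the rows but keeps each row's original index label.
by_age = passengers.sort_values("age")

# iloc selects by position: the first row after sorting (the youngest).
first_by_position = by_age.iloc[0]

# loc selects by label: the row that carried label 0 before sorting.
row_with_label_0 = by_age.loc[0]

print(list(by_age.index))
print(first_by_position["age"], row_with_label_0["age"])
```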
8.Resetting the index: DataFrame.reset_index(drop=True)
input
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:5,0:3])
output
   pclass  survived                                               name
0     1.0       1.0               Barkworth, Mr. Algernon Henry Wilson
1     1.0       1.0  Cavendish, Mrs. Tyrell William (Julia Florence...
2     3.0       0.0                                Svensson, Mr. Johan
3     1.0       0.0                          Goldschmidt, Mr. George B
4     1.0       0.0                            Artagaveytia, Mr. Ramon
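What drop=True changes is easiest to see next to the default behaviour. A minimal sketch on a made-up sorted table (hypothetical ages):

```python
import pandas

# Hypothetical table, sorted so the index labels are out of order (1, 2, 0).
by_age = pandas.DataFrame({"age": [30.0, 10.0, 20.0]}).sort_values("age")

# drop=True: throw the old labels away and renumber from 0.
renumbered = by_age.reset_index(drop=True)

# Default (drop=False): the old labels are kept as a new column named "index".
kept_old_index = by_age.reset_index()

print(renumbered)
print(kept_old_index)
```

After the reset, loc[0] and iloc[0] refer to the same row again.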
9.Apply Functions Over a DataFrame
DataFrame.apply() will iterate through each column in a DataFrame and perform a function on each. When we create our function, we give it one parameter; the apply() method passes each column to that parameter as a pandas Series.
A DataFrame can call apply() to apply a function to every column (or every row).
input
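def null_count(column):
    # Count the null values in one column
    column_null = pandas.isnull(column)  # boolean mask: True where the value is missing
    null = column[column_null]  # keep only the missing entries
    return len(null)
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)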
output
pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
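The per-column count can be tried on a tiny made-up table, where the expected answer is obvious. In this sketch null_count stands in for the column function above, and the chained isnull().sum() call is a built-in shortcut shown for comparison, not the method used in these notes.

```python
import pandas

# Hypothetical passengers: one missing age, one missing sex, no missing fare.
passengers = pandas.DataFrame({
    "age": [22.0, None, 35.0],
    "sex": ["male", "female", None],
    "fare": [7.25, 71.28, 8.05],
})

def null_count(column):
    # apply() passes each column in as a Series.
    return len(column[pandas.isnull(column)])

# One result per column, indexed by column name.
counts_by_apply = passengers.apply(null_count)

# Built-in shortcut: build a boolean table, then sum each column.
counts_builtin = passengers.isnull().sum()

print(counts_by_apply)
```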
10.Applying a Function to a Row
input
def age_label(row):
    age = row['age']
    if pandas.isnull(age):
        return 'unknown'
    elif age < 18:
        return 'minor'
    else:
        return 'adult'
# Use axis=1 so that the apply() method applies your function over the rows
age_labels = titanic_survival.apply(age_label, axis=1)
print(age_labels[0:5])
output
0    adult
1    minor
2    minor
3    adult
4    adult
dtype: object
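The effect of the axis argument shows up in the index of the result. A minimal sketch on made-up numbers (hypothetical values, using a simple max instead of the labelling function):

```python
import pandas

# Hypothetical table with one missing age.
passengers = pandas.DataFrame({
    "age": [22.0, None, 4.0],
    "fare": [7.25, 71.28, 16.7],
})

# Default (axis=0): the function receives one column at a time,
# so the result has one entry per column.
max_per_column = passengers.apply(lambda column: column.max())

# axis=1: the function receives one row at a time,
# so the result has one entry per row.
max_per_row = passengers.apply(lambda row: row.max(), axis=1)

print(list(max_per_column.index))
print(list(max_per_row.index))
```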
11.Calculating Survival Percentage by Age Group
Now that we have age labels for everyone, let's make a pivot table to find the probability of survival for each age group.
We have added an "age_labels" column to the dataframe containing the age_labels variable from the previous step.
input
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="survived")
print(age_group_survival)
output
age_labels
adult      0.387892
minor      0.525974
unknown    0.277567
Name: survived, dtype: float64
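The notes do not show how the "age_labels" column was added, so here is one possible way, sketched end to end on a made-up table. The data is hypothetical and the column assignment is an assumption about that missing step. Because survived is 0 or 1, its mean within each group is the survival rate.

```python
import pandas

# Hypothetical passengers: two adults, two minors, two with unknown age.
passengers = pandas.DataFrame({
    "age": [30.0, 5.0, None, 40.0, 12.0, None],
    "survived": [1, 1, 0, 1, 0, 0],
})

def age_label(row):
    age = row["age"]
    if pandas.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

# Assumed step: store the row-wise labels as a new column.
passengers["age_labels"] = passengers.apply(age_label, axis=1)

# The default aggfunc is the mean, which gives the survival rate per group.
age_group_survival = passengers.pivot_table(index="age_labels", values=["survived"])

print(age_group_survival)
```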