For the Kaggle Titanic competition, I'll start by following someone else's work, to learn their feature engineering and parameter tuning.
https://www.kaggle.com/startupsci/titanic-data-science-solutions
This is the most-upvoted kernel, so let's start there.
Background: the Titanic sank in the early 20th century with 2,224 people aboard, of whom only about 32% survived. The dataset describes the passengers; the task is to clean and model that information to predict whether a given passenger survived.
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
These are the packages we'll need.
The data: https://www.kaggle.com/c/titanic/data (download it yourself).
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
combine = [train_df, test_df]
數(shù)據(jù)讀進(jìn)來(lái)之后讓我們看看都有哪些特征呢
print(train_df.columns.values)
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
'Ticket' 'Fare' 'Cabin' 'Embarked']
Which of these features are categorical, and which are numerical?
Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.
Note that Pclass is ordinal: its values carry an order.
The numerical features split into continuous and discrete values:
Continuous: Age, Fare. Discrete: SibSp, Parch.
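If you'd rather not sort the columns by eye, a quick check by dtype gets most of the way there (a small addition of mine, not in the original kernel; dtypes alone can't tell you Pclass is ordinal, that takes domain knowledge):
# split columns by dtype: numeric vs. object (string-like)
num_cols = train_df.select_dtypes(include=[np.number]).columns.tolist()
obj_cols = train_df.select_dtypes(include=['object']).columns.tolist()
print('numeric:', num_cols)
print('object :', obj_cols)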
看一下數(shù)據(jù)吧:
train_df.head()
數(shù)據(jù)里有哪些是混合型特征呢?
Ticket is a mix of numeric and alphanumeric data types. Cabin is alphanumeric.可以看到Ticket和Cabin都是字母數(shù)字混合型碴犬,但是Cabin字母是有序的
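A quick look at raw values confirms this (a small check, not in the original kernel):
# a few raw Ticket and Cabin values to show the mixed formats
print(train_df['Ticket'].head(8).tolist())
print(train_df['Cabin'].dropna().head(8).tolist())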
So which features may contain errors or typos?
That's hard to check in a large dataset, but with a small one like this we can spot problems by inspection.
For example, the Name column has spelling variants, titles, abbreviations, and so on.
Some features also need completing, since they contain null values.
Cabin > Age > Embarked features contain a number of null values, in that order, for the training dataset.
Cabin > Age are incomplete in the test dataset.
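To verify that missing-value ordering directly (again my addition, not in the original kernel):
# nulls per column, most incomplete first
print(train_df.isnull().sum().sort_values(ascending=False).head())
print(test_df.isnull().sum().sort_values(ascending=False).head())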
Good. Now look at the feature counts and dtypes:
train_df.info()
print('_'*40)
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
對(duì)數(shù)據(jù)做一些簡(jiǎn)單的統(tǒng)計(jì)分析吧唠椭!
Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).(891個(gè)樣本占總體2224個(gè)人的40%)
Survived is a categorical feature with 0 or 1 values.(是否獲救用0,1來(lái)代表,1為獲救)
Around 38% samples survived representative of the actual survival rate at 32%.(樣本中獲救率為38%忍饰,而真實(shí)獲救率為32%)
Most passengers (> 75%) did not travel with parents or children.(超過(guò)75%的人沒有和父母孩子旅行)
Nearly 30% of the passengers had siblings and/or spouse aboard.(有30%的人有配偶或兄弟姐妹在船上)
Fares varied significantly with few passengers (<1%) paying as high as $512.(少于1%付了最高可達(dá)512美刀的船費(fèi))
Few elderly passengers (<1%) within age range 65-80.(65-80歲的旅行者很少贪嫂,少于1%)
train_df.describe()
Names are unique across the dataset (count = unique = 891).
Sex has two possible values, 65% male (top=male, freq=577 / count=891).
Cabin values have several duplicates across samples; alternatively, several passengers shared a cabin.
Embarked takes three possible values; the S port was used by most passengers (top=S).
Ticket has a high ratio (22%) of duplicate values (unique=681).
train_df.describe(include=['O'])
數(shù)據(jù)分析的假設(shè)##
特征相關(guān)性分析
看看不同特征和生存率之間的關(guān)系
補(bǔ)全數(shù)據(jù)
1.需要補(bǔ)全年齡這個(gè)特征
2.港口信息也要補(bǔ)全
這兩個(gè)特征和獲救率有很大關(guān)系
修正特征
票號(hào)這個(gè)特征重復(fù)高饶深,沒有實(shí)際用途餐曹,需要?jiǎng)h除
游客編號(hào)也要?jiǎng)h掉
船艙號(hào)可能也要?jiǎng)h除
創(chuàng)造新特征
這個(gè)后面會(huì)寫到
分類
婦女,小孩敌厘,倉(cāng)位等級(jí)高的人更容易獲救
數(shù)據(jù)分組觀察##
現(xiàn)在要依據(jù)不同特征類別分組進(jìn)行觀察台猴。
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Try the remaining features the same way; see the example below.
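For example, for SibSp (this one appears in the original kernel too; Parch is analogous):
train_df[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)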
After the grouped tables, it's time to visualize the data##
Tables are never as clear as charts; visualization is a key step.
Observations.
Infants (Age <= 4) had a high survival rate.
The oldest passengers (Age = 80) survived.
A large number of 15-25 year olds did not survive.
Most passengers are in the 15-35 age range.
Decisions.
We should consider Age (classifying assumption #2) in our model training.
Complete the Age feature for null values (completing #1).
We should band age groups (creating #3).
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
Ordinal features
Observations.
Pclass=3 had the most passengers, but most of them did not survive. Confirms classifying assumption #2.
Infant passengers in Pclass=2 and Pclass=3 mostly survived. Further qualifies classifying assumption #2.
Most passengers in Pclass=1 survived. Confirms classifying assumption #3.
Pclass varies in terms of the Age distribution of passengers.
Decisions.
Consider Pclass for model training.
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
# note: seaborn >= 0.9 renamed FacetGrid's `size` argument to `height`
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();
Correlating categorical features
Observations.
Female passengers had a much better survival rate than males. Confirms classifying (#1).
Exception: in Embarked=C, males had a higher survival rate. This could be a correlation between Pclass and Embarked, and in turn between Pclass and Survived, not necessarily a direct correlation between Embarked and Survived.
Males had a better survival rate in Pclass=3 when compared with Pclass=2 for C and Q ports. Completing (#2).
Ports of embarkation have varying survival rates for Pclass=3 among male passengers. Correlating (#1).
Decisions.
Add the Sex feature to model training.
Complete the Embarked feature and add it to model training.
# grid = sns.FacetGrid(train_df, col='Embarked')
grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
相關(guān)分類和數(shù)字特征
Observations.
Higher fare paying passengers had better survival. Confirms our assumption for creating (#4) fare ranges.(船費(fèi)越貴生存率越高)
Port of embarkation correlates with survival rates. Confirms correlating (#1) and completing (#2).(登陸港口和生存率有關(guān))
Decisions.
Consider banding Fare feature.(需要考慮船費(fèi))
# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', size=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()
修正數(shù)據(jù)##
通過(guò)觀察數(shù)據(jù)尿孔,其實(shí)我們有一些觀察結(jié)論了俊柔,現(xiàn)在可以執(zhí)行這些結(jié)論了
先刪除一些特征
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]
"After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape
然后從現(xiàn)有特征中創(chuàng)造一些新的特征
Observations.
When we plot Title, Age, and Survived, we note the following observations.
Most titles band Age groups accurately. For example: Master title has Age mean of 5 years.(頭銜和年齡鏈接緊密)
Survival among Title Age bands varies slightly.(年齡段和獲救率聯(lián)系緊密)
Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).(某些頭銜獲救率確實(shí)高,有些則不然)
Decision.
We decide to retain the new Title feature for model training.(保留頭銜特征)使用正則表達(dá)式
for dataset in combine:
    # grab the word ending in '.' after a space, e.g. 'Mr.', 'Miss.'
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
pd.crosstab(train_df['Title'], train_df['Sex'])
Replace the rare, uninformative titles with a single 'Rare' category:
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
然后數(shù)字化
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
train_df.head()
然后再刪除名字和旅客ID兩個(gè)無(wú)用特征
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
然后對(duì)性別進(jìn)行數(shù)字化
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
train_df.head()
然后開始完善一些特征活合,補(bǔ)充缺失值雏婶。首先是age
有三個(gè)考慮可以使用的方法來(lái)補(bǔ)充連續(xù)性的數(shù)值特征:
1.最簡(jiǎn)單的就是在均值到標(biāo)準(zhǔn)差內(nèi)來(lái)個(gè)隨機(jī)數(shù)
2.更準(zhǔn)確一點(diǎn)的就是通過(guò)相關(guān)特征來(lái)猜測(cè)當(dāng)前特征值。在這個(gè)例子中我們發(fā)現(xiàn)年齡與性別白指、座位級(jí)別這兩個(gè)特征有關(guān)留晚,所以用這兩個(gè)特征分類的均值來(lái)代替特征值。
3.結(jié)合1告嘲、2兩個(gè)方法
由于1错维、3會(huì)引入隨機(jī)誤差,這里作者更偏向于使用2
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Gender')
grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()
The figure shows Age histograms (counts per age) for each Pclass/Sex combination.
Then create an array to hold the guessed values used to fill the Age nulls:
guess_ages = np.zeros((2,3))
guess_ages
然后迭代的通過(guò)性別橄唬、座席的六種組合來(lái)估計(jì)年齡均值
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                               (dataset['Pclass'] == j+1)]['Age'].dropna()
            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)
            age_guess = guess_df.median()
            # round the guessed age to the nearest 0.5
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                        'Age'] = guess_ages[i,j]
    dataset['Age'] = dataset['Age'].astype(int)
train_df.head()
填補(bǔ)完空白值之后在對(duì)年齡離散化
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
然后替換age
for dataset in combine:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4  # the original kernel omits '= 4', leaving ages over 64 unbanded
train_df.head()
然后就可以移除ageband這個(gè)過(guò)渡特征了
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
train_df.head()
從現(xiàn)有特征中組合出新特征各種花樣
新特征FamiliySize是 Parch SibSp之和赋焕,加了新特征之后就可以把這兩個(gè)去掉了
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Since quite a few passengers traveled alone, create another new feature, IsAlone:
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
Looking at the result, the author decides IsAlone is enough and drops FamilySize together with Parch and SibSp. Personally I would keep FamilySize: a commenter reported a lower generalization error when it is retained.
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]
train_df.head()
作者又創(chuàng)造了一個(gè)新特征是Pclass與age的乘積
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass
train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)
也沒說(shuō)效果缸血,感覺就是這個(gè)數(shù)字越大越完蛋蜜氨,大家可以測(cè)試下
對(duì)于Embarked 這個(gè)特征代表了游客上船的港口,但是training dataset有些值缺失捎泻,作者就直接用頻率最高的代替了
freq_port = train_df.Embarked.dropna().mode()[0]
freq_port
結(jié)果是S
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Survival rate does vary with the embarkation port; perhaps passengers who boarded later ended up higher in the ship, with easier access to the deck.
Then convert this categorical feature to numeric codes:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
train_df.head()
然后需要填充fare這個(gè)特征,缺失值就用頻率最高的那個(gè)值代替笆豁,然后對(duì)它離散化郎汪。
test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
test_df.head()
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
train_df.head(10)
test_df.head(10)
OK,到現(xiàn)在特征處理和數(shù)據(jù)分析、清洗闯狱、轉(zhuǎn)換就做完了煞赢!下面就該放入模型預(yù)測(cè)了,有木有很激動(dòng)哄孤?
Logistic Regression
KNN or k-Nearest Neighbors
Support Vector Machines
Naive Bayes classifier
Decision Tree
Random Forest
Perceptron
Artificial neural network
RVM or Relevance Vector Machine
These should need no introduction.
Model selection is a craft in itself: first pin down what kind of problem you are solving (here, supervised binary classification), then weigh the strengths and weaknesses of each algorithm.
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
80.359999999999999
Let's inspect the logistic regression coefficients to see how each feature pushes the prediction (a positive coefficient increases the log-odds of survival):
coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
Among single models, SVM is quite strong:
83.840000000000003
# KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
84.739999999999995
KNN does well too.
# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
86.760000000000005
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
86.760000000000005
Surprising that RF ties DT exactly; note, though, that both numbers are training accuracy, so identical scores only mean the two models fit the training data equally well.
模型評(píng)價(jià)
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
'Random Forest', 'Naive Bayes', 'Perceptron',
'Stochastic Gradient Decent', 'Linear SVC',
'Decision Tree'],
'Score': [acc_svc, acc_knn, acc_log,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
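Keep in mind that every score in this table is training accuracy, which flatters models that overfit, like the unpruned decision tree. A quick cross-validation check (my addition, not in the original kernel) gives a more honest estimate:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy for the random forest
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_train, Y_train, cv=5)
print(cv_scores.mean(), cv_scores.std())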
最后就可以提交結(jié)果咯
submission = pd.DataFrame({
"PassengerId": test_df["PassengerId"],
"Survived": Y_pred
})
# submission.to_csv('../output/submission.csv', index=False)
那么這個(gè)介紹就到這里了,謝謝大家咯晨逝,See you!