## Data Analysis
The attributes in train.csv are:
Attribute | Definition | Values
---|---|---
PassengerId | Passenger ID | 1-891
Survived | Survival | 0, 1
Pclass | Ticket class | 1, 2, 3
Name | Passenger name | e.g. Braund, Mr. Owen Harris
Sex | Sex | male, female
Age | Age | numeric, has missing values
SibSp | Number of siblings/spouses aboard | 0-8
Parch | Number of parents/children aboard | 0-6
Ticket | Ticket number | e.g. A/5 21171
Fare | Fare | e.g. 7.25
Cabin | Cabin number | e.g. C85, has missing values
Embarked | Port of embarkation | S, C, Q
test.csv lacks the Survived field; that is the target we need to predict.
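A minimal way to confirm that difference, assuming both CSV files sit in the working directory:

```python
import pandas as pd

# the only column present in train.csv but absent from test.csv is Survived
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(set(train.columns) - set(test.columns))  # {'Survived'}
```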
## Data Preprocessing
```python
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```
### Previewing the Data
```python
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
```
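For just the gaps, a one-line check over the same frame suffices; per the info() output above, Age (177 missing), Cabin (687), and Embarked (2) are the only incomplete columns:

```python
# missing values per column, largest first
print(train.isnull().sum().sort_values(ascending=False).head(3))
```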
### Define a dummies function to turn each value of a discrete feature into its own feature
```python
def dummies(col, train, test):
    train_dum = pd.get_dummies(train[col])
    test_dum = pd.get_dummies(test[col])
    train = pd.concat([train, train_dum], axis=1)
    test = pd.concat([test, test_dum], axis=1)
    train.drop(col, axis=1, inplace=True)
    test.drop(col, axis=1, inplace=True)
    return train, test

# get rid of the useless cols
dropping = ['PassengerId', 'Name', 'Ticket']
train.drop(dropping, axis=1, inplace=True)
test.drop(dropping, axis=1, inplace=True)
```
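As a quick illustration of what `pd.get_dummies` does inside this helper, on toy data (recent pandas versions return boolean columns rather than 0/1):

```python
import pandas as pd

demo = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
# one indicator column per distinct value: C, Q, S
print(pd.get_dummies(demo['Embarked']))
```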
### Handling Pclass
Looking at the relationship between Pclass and Survived, the higher the class, the higher the survival rate.
Split Pclass into three indicator features: 1, 2, and 3.
```python
print(train.Pclass.value_counts())
sns.factorplot('Pclass', 'Survived', data=train, order=[1, 2, 3])
train, test = dummies('Pclass', train, test)
```

```
3    491
1    216
2    184
Name: Pclass, dtype: int64
```
![](output_4_1.png)
### Handling Sex
Looking at Sex versus Survived, the survival rate for women is markedly higher than for men.
Split Sex into male and female indicators and drop the original feature; male is also dropped below, since it is the complement of female.
```python
print(train.Sex.value_counts(dropna=False))
sns.factorplot('Sex', 'Survived', data=train)
train, test = dummies('Sex', train, test)
# male is redundant given female, so keep only one of the two
train.drop('male', axis=1, inplace=True)
test.drop('male', axis=1, inplace=True)
```

```
male      577
female    314
Name: Sex, dtype: int64
```
![](output_5_1.png)
### Handling Age
Handle the missing values: compute the mean and standard deviation, and fill the gaps with random values drawn from the range mean ± std.
Looking at Age versus Survived, the 15 to 30 range affects the outcome the most, so add two features, Age below 15 and Age from 15 up to 30, then drop Age.
```python
# fill missing Age values in train with random draws from [mean-std, mean+std)
nan_num = len(train[train['Age'].isnull()])
age_mean = train['Age'].mean()
age_std = train['Age'].std()
filling = np.random.randint(age_mean - age_std, age_mean + age_std, size=nan_num)
train.loc[train['Age'].isnull(), 'Age'] = filling
nan_num = train['Age'].isnull().sum()

# dealing with the missing values in test (86 nulls)
nan_num = test['Age'].isnull().sum()
age_mean = test['Age'].mean()
age_std = test['Age'].std()
filling = np.random.randint(age_mean - age_std, age_mean + age_std, size=nan_num)
test.loc[test['Age'].isnull(), 'Age'] = filling
nan_num = test['Age'].isnull().sum()

# density of Age by survival outcome
s = sns.FacetGrid(train, hue='Survived', aspect=2)
s.map(sns.kdeplot, 'Age', shade=True)
s.set(xlim=(0, train['Age'].max()))
s.add_legend()

def under15(row):
    result = 0.0
    if row < 15:
        result = 1.0
    return result

def young(row):
    result = 0.0
    if row >= 15 and row < 30:
        result = 1.0
    return result

train['under15'] = train['Age'].apply(under15)
train['young'] = train['Age'].apply(young)
test['under15'] = test['Age'].apply(under15)
test['young'] = test['Age'].apply(young)

train.drop('Age', axis=1, inplace=True)
test.drop('Age', axis=1, inplace=True)
```
![](output_6_0.png)
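For reference, the same two indicators could be built without row-wise `apply`; a vectorized sketch with the same 15 and 30 cutoffs (it would have to run before Age is dropped):

```python
# vectorized equivalent of under15/young for both frames
for df in (train, test):
    df['under15'] = (df['Age'] < 15).astype(float)
    df['young'] = ((df['Age'] >= 15) & (df['Age'] < 30)).astype(float)
```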
### Handling SibSp and Parch
The larger either value gets, the lower the survival rate.
Build a combined feature family = SibSp + Parch and drop the original two.
```python
print(train.SibSp.value_counts(dropna=False))
print(train.Parch.value_counts(dropna=False))
sns.factorplot('SibSp', 'Survived', data=train, size=5)
sns.factorplot('Parch', 'Survived', data=train, size=5)
train['family'] = train['SibSp'] + train['Parch']
test['family'] = test['SibSp'] + test['Parch']
sns.factorplot('family', 'Survived', data=train, size=5)
train.drop(['SibSp', 'Parch'], axis=1, inplace=True)
test.drop(['SibSp', 'Parch'], axis=1, inplace=True)
```

```
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
```
![](output_7_1.png)
![](output_7_2.png)
![](output_7_3.png)
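The family factorplot above plots the mean of Survived per family size; the same numbers can be read off directly with a groupby:

```python
# survival rate for each family size (family = SibSp + Parch)
print(train.groupby('family')['Survived'].mean())
```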
### Handling Fare
Higher fares go with higher survival rates. test has one missing value, which is filled with the median.
```python
train.Fare.isnull().sum()
test.Fare.isnull().sum()
sns.factorplot('Survived', 'Fare', data=train, size=4)
s = sns.FacetGrid(train, hue='Survived', aspect=2)
s.map(sns.kdeplot, 'Fare', shade=True)
s.set(xlim=(0, train['Fare'].max()))
s.add_legend()
test['Fare'].fillna(test['Fare'].median(), inplace=True)
```
![](output_8_0.png)
![](output_8_1.png)
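A quick sanity check that the single gap is gone after the fill:

```python
# should print 0 after the median fill
print(test['Fare'].isnull().sum())
```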
### Handling Cabin
Too many values are missing, so drop this feature.
```python
# Cabin
print(train.Cabin.isnull().sum())
print(test.Cabin.isnull().sum())
train.drop('Cabin', axis=1, inplace=True)
test.drop('Cabin', axis=1, inplace=True)
```

```
687
327
```
### Handling Embarked
The training set has two missing values; S is the most frequent port, so fill with S.
Passengers who embarked at C show a noticeably higher survival rate, so split Embarked into S, Q, and C.
Drop S, Q, and Embarked, keeping C as the new feature.
```python
# Embarked
print(train.Embarked.isnull().sum())
print(test.Embarked.isnull().sum())
print(train['Embarked'].value_counts(dropna=False))
train['Embarked'].fillna('S', inplace=True)
sns.factorplot('Embarked', 'Survived', data=train, size=5)
train, test = dummies('Embarked', train, test)
# keep only C as the new feature
train.drop(['S', 'Q'], axis=1, inplace=True)
test.drop(['S', 'Q'], axis=1, inplace=True)
```

```
2
0
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
```
![](output_10_1.png)
## Training the Model
### Model Selection
We mainly try logistic regression, random forests, support vector machines, and k-nearest neighbors.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, KFold

def modeling(clf, ft, target):
    # relies on the module-level kf and acc_lst set up before each test run
    acc = cross_val_score(clf, ft, target, cv=kf)
    acc_lst.append(acc.mean())
    return

accuracy = []

def ml(ft, target, time):
    # acc_lst is appended by reference, so the scores recorded by the
    # modeling() calls below also end up inside accuracy
    accuracy.append(acc_lst)
    # logistic regression
    logreg = LogisticRegression()
    modeling(logreg, ft, target)
    # random forest
    rf = RandomForestClassifier(n_estimators=50, min_samples_split=4,
                                min_samples_leaf=2)
    modeling(rf, ft, target)
    # svc
    svc = SVC()
    modeling(svc, ft, target)
    # knn
    knn = KNeighborsClassifier(n_neighbors=3)
    modeling(knn, ft, target)
    # see the coefficients of the logistic regression
    logreg.fit(ft, target)
    feature = pd.DataFrame(ft.columns)
    feature.columns = ['Features']
    feature['Coefficient Estimate'] = pd.Series(logreg.coef_[0])
    print(feature)
    return
```
### Trying Different Feature Combinations
1. Using all features
```python
# test 1: all features
train_ft = train.drop('Survived', axis=1)
train_y = train['Survived']
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft, train_y, 'test_1')
```

```
  Features  Coefficient Estimate
0     Fare              0.004240
1        1              0.389135
2        2             -0.211795
3        3             -1.210494
4   female              2.689013
5  under15              1.658023
6    young              0.030681
7   family             -0.310545
8        C              0.374100
```
2. Dropping young
```python
# test 2: drop young
train_ft_2 = train.drop(['Survived', 'young'], axis=1)
test_2 = test.drop('young', axis=1)
train_ft_2.head()
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_2, train_y, 'test_2')
```

```
  Features  Coefficient Estimate
0     Fare              0.004285
1        1              0.386195
2        2             -0.207867
3        3             -1.202922
4   female              2.690898
5  under15              1.645827
6   family             -0.311682
7        C              0.376629
```
3. Dropping young and C
```python
# test 3: drop young and C
train_ft_3 = train.drop(['Survived', 'young', 'C'], axis=1)
test_3 = test.drop(['young', 'C'], axis=1)
train_ft_3.head()
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_3, train_y, 'test_3')
```

```
  Features  Coefficient Estimate
0     Fare              0.004920
1        1              0.438557
2        2             -0.225821
3        3             -1.194444
4   female              2.694665
5  under15              1.679459
6   family             -0.322922
```
4. Dropping Fare
```python
# test 4: drop Fare
train_ft_4 = train.drop(['Survived', 'Fare'], axis=1)
test_4 = test.drop(['Fare'], axis=1)
train_ft_4.head()
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_4, train_y, 'test_4')
```

```
  Features  Coefficient Estimate
0        1              0.564754
1        2             -0.242384
2        3             -1.287715
3   female              2.699738
4  under15              1.629584
5    young              0.058133
6   family             -0.269146
7        C              0.436600
```
5. Dropping C
```python
# test 5: drop C
train_ft_5 = train.drop(['Survived', 'C'], axis=1)
test_5 = test.drop('C', axis=1)
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_5, train_y, 'test_5')
```

```
  Features  Coefficient Estimate
0     Fare              0.004841
1        1              0.442430
2        2             -0.232150
3        3             -1.207308
4   female              2.691465
5  under15              1.700077
6    young              0.052091
7   family             -0.320831
```
6. Dropping Fare and young
```python
# test 6: drop Fare and young
train_ft_6 = train.drop(['Survived', 'Fare', 'young'], axis=1)
test_6 = test.drop(['Fare', 'young'], axis=1)
train_ft_6.head()
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_6, train_y, 'test_6')
```

```
  Features  Coefficient Estimate
0        1              0.562814
1        2             -0.235606
2        3             -1.274657
3   female              2.702955
4  under15              1.604597
5   family             -0.270284
6        C              0.442288
```
### Summarizing the Results
```python
accuracy_df = pd.DataFrame(data=accuracy,
                           index=['test1', 'test2', 'test3', 'test4', 'test5', 'test6'],
                           columns=['logistic', 'rf', 'svc', 'knn'])
accuracy_df
```
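To pick the winner programmatically rather than by eye, a small sketch over the `accuracy_df` built above:

```python
# (feature set, model) pair with the highest mean cross-validated accuracy
stacked = accuracy_df.stack()
print(stacked.idxmax(), stacked.max())
```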
### Choosing the Model and Features
Overall, test_4 with the support vector machine performs best, so that combination is used for the final prediction.
```python
svc = SVC()
svc.fit(train_ft_4, train_y)
svc_pred = svc.predict(test_4)
print(svc.score(train_ft_4, train_y))
submission_test = pd.read_csv("test.csv")
submission = pd.DataFrame({"PassengerId": submission_test['PassengerId'],
                           "Survived": svc_pred})
submission.to_csv("kaggle_SVC.csv", index=False)
```

```
0.832772166105
```
## Submitting the Results
![](kaggle_result.png)