## Data Analysis
The attributes in train.csv are:
Attribute | Definition | Values
---|---|---
PassengerId | Passenger ID | 1-891
Survived | Survival | 0, 1
Pclass | Ticket class | 1, 2, 3
Name | Passenger name | e.g. Braund, Mr. Owen Harris
Sex | Sex | male, female
Age | Age | numeric, has missing values
SibSp | Number of siblings/spouses aboard | 0-8
Parch | Number of parents/children aboard | 0-6
Ticket | Ticket number | e.g. A/5 21171
Fare | Fare | e.g. 7.25
Cabin | Cabin number | e.g. C85, has missing values
Embarked | Port of embarkation | S, C, Q
test.csv lacks the Survived field; that is the target we need to predict.
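A minimal way to confirm that difference, assuming both CSV files sit in the working directory:

```python
import pandas as pd

# the only column present in train.csv but absent from test.csv is Survived
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(set(train.columns) - set(test.columns))  # {'Survived'}
```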
## Data Preprocessing
```python
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```
### Previewing the Data
```python
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
```
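For just the gaps, a one-line check over the same frame suffices; per the info() output above, Age (177 missing), Cabin (687), and Embarked (2) are the only incomplete columns:

```python
# missing values per column, largest first
print(train.isnull().sum().sort_values(ascending=False).head(3))
```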
### Define a dummies function to turn each value of a discrete feature into its own feature
```python
def dummies(col, train, test):
    train_dum = pd.get_dummies(train[col])
    test_dum = pd.get_dummies(test[col])
    train = pd.concat([train, train_dum], axis=1)
    test = pd.concat([test, test_dum], axis=1)
    train.drop(col, axis=1, inplace=True)
    test.drop(col, axis=1, inplace=True)
    return train, test

# get rid of the useless cols
dropping = ['PassengerId', 'Name', 'Ticket']
train.drop(dropping, axis=1, inplace=True)
test.drop(dropping, axis=1, inplace=True)
```
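As a quick illustration of what `pd.get_dummies` does inside this helper, on toy data (recent pandas versions return boolean columns rather than 0/1):

```python
import pandas as pd

demo = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
# one indicator column per distinct value: C, Q, S
print(pd.get_dummies(demo['Embarked']))
```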
### Handling Pclass
Looking at the relationship between Pclass and Survived, the higher the class, the higher the survival rate.
Split Pclass into three indicator features: 1, 2, and 3.
```python
print(train.Pclass.value_counts())
sns.factorplot('Pclass', 'Survived', data=train, order=[1, 2, 3])
train, test = dummies('Pclass', train, test)
```

```
3    491
1    216
2    184
Name: Pclass, dtype: int64
```
![](output_4_1.png)
### Handling Sex
Looking at Sex versus Survived, the survival rate for women is markedly higher than for men.
Split Sex into male and female indicators and drop the original feature; male is also dropped below, since it is the complement of female.
```python
print(train.Sex.value_counts(dropna=False))
sns.factorplot('Sex', 'Survived', data=train)
train, test = dummies('Sex', train, test)
# male is redundant given female, so keep only one of the two
train.drop('male', axis=1, inplace=True)
test.drop('male', axis=1, inplace=True)
```

```
male      577
female    314
Name: Sex, dtype: int64
```
![](output_5_1.png)
### Handling Age
Handle the missing values: compute the mean and standard deviation, and fill the gaps with random values drawn from the range mean ± std.
Looking at Age versus Survived, the 15 to 30 range affects the outcome the most, so add two features, Age below 15 and Age from 15 up to 30, then drop Age.
```python
# fill missing Age values in train with random draws from [mean-std, mean+std)
nan_num = len(train[train['Age'].isnull()])
age_mean = train['Age'].mean()
age_std = train['Age'].std()
filling = np.random.randint(age_mean - age_std, age_mean + age_std, size=nan_num)
train.loc[train['Age'].isnull(), 'Age'] = filling
nan_num = train['Age'].isnull().sum()

# dealing with the missing values in test (86 nulls)
nan_num = test['Age'].isnull().sum()
age_mean = test['Age'].mean()
age_std = test['Age'].std()
filling = np.random.randint(age_mean - age_std, age_mean + age_std, size=nan_num)
test.loc[test['Age'].isnull(), 'Age'] = filling
nan_num = test['Age'].isnull().sum()

# density of Age by survival outcome
s = sns.FacetGrid(train, hue='Survived', aspect=2)
s.map(sns.kdeplot, 'Age', shade=True)
s.set(xlim=(0, train['Age'].max()))
s.add_legend()

def under15(row):
    result = 0.0
    if row < 15:
        result = 1.0
    return result

def young(row):
    result = 0.0
    if row >= 15 and row < 30:
        result = 1.0
    return result

train['under15'] = train['Age'].apply(under15)
train['young'] = train['Age'].apply(young)
test['under15'] = test['Age'].apply(under15)
test['young'] = test['Age'].apply(young)

train.drop('Age', axis=1, inplace=True)
test.drop('Age', axis=1, inplace=True)
```
![](output_6_0.png)
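For reference, the same two indicators could be built without row-wise `apply`; a vectorized sketch with the same 15 and 30 cutoffs (it would have to run before Age is dropped):

```python
# vectorized equivalent of under15/young for both frames
for df in (train, test):
    df['under15'] = (df['Age'] < 15).astype(float)
    df['young'] = ((df['Age'] >= 15) & (df['Age'] < 30)).astype(float)
```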
### Handling SibSp and Parch
The larger either value gets, the lower the survival rate.
Build a combined feature family = SibSp + Parch and drop the original two.
```python
print(train.SibSp.value_counts(dropna=False))
print(train.Parch.value_counts(dropna=False))
sns.factorplot('SibSp', 'Survived', data=train, size=5)
sns.factorplot('Parch', 'Survived', data=train, size=5)
train['family'] = train['SibSp'] + train['Parch']
test['family'] = test['SibSp'] + test['Parch']
sns.factorplot('family', 'Survived', data=train, size=5)
train.drop(['SibSp', 'Parch'], axis=1, inplace=True)
test.drop(['SibSp', 'Parch'], axis=1, inplace=True)
```

```
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
```
![](output_7_1.png)
![](output_7_2.png)
![](output_7_3.png)
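The family factorplot above plots the mean of Survived per family size; the same numbers can be read off directly with a groupby:

```python
# survival rate for each family size (family = SibSp + Parch)
print(train.groupby('family')['Survived'].mean())
```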
### Handling Fare
Higher fares go with higher survival rates. test has one missing value, which is filled with the median.
```python
train.Fare.isnull().sum()
test.Fare.isnull().sum()
sns.factorplot('Survived', 'Fare', data=train, size=4)
s = sns.FacetGrid(train, hue='Survived', aspect=2)
s.map(sns.kdeplot, 'Fare', shade=True)
s.set(xlim=(0, train['Fare'].max()))
s.add_legend()
test['Fare'].fillna(test['Fare'].median(), inplace=True)
```
![](output_8_0.png)
![](output_8_1.png)
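A quick sanity check that the single gap is gone after the fill:

```python
# should print 0 after the median fill
print(test['Fare'].isnull().sum())
```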
### Handling Cabin
Too many values are missing, so drop this feature.
```python
# Cabin
print(train.Cabin.isnull().sum())
print(test.Cabin.isnull().sum())
train.drop('Cabin', axis=1, inplace=True)
test.drop('Cabin', axis=1, inplace=True)
```

```
687
327
```
### Handling Embarked
The training set has two missing values; S is the most frequent port, so fill with S.
Passengers who embarked at C show a noticeably higher survival rate, so split Embarked into S, Q, and C.
Drop S, Q, and Embarked, keeping C as the new feature.
```python
# Embarked
print(train.Embarked.isnull().sum())
print(test.Embarked.isnull().sum())
print(train['Embarked'].value_counts(dropna=False))
train['Embarked'].fillna('S', inplace=True)
sns.factorplot('Embarked', 'Survived', data=train, size=5)
train, test = dummies('Embarked', train, test)
# keep only C as the new feature
train.drop(['S', 'Q'], axis=1, inplace=True)
test.drop(['S', 'Q'], axis=1, inplace=True)
```

```
2
0
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
```
![](output_10_1.png)
## Training the Model
### Model Selection
We mainly try logistic regression, random forests, support vector machines, and k-nearest neighbors.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, KFold

def modeling(clf, ft, target):
    # relies on the module-level kf and acc_lst set up before each test run
    acc = cross_val_score(clf, ft, target, cv=kf)
    acc_lst.append(acc.mean())
    return

accuracy = []

def ml(ft, target, time):
    # acc_lst is appended by reference, so the scores recorded by the
    # modeling() calls below also end up inside accuracy
    accuracy.append(acc_lst)
    # logistic regression
    logreg = LogisticRegression()
    modeling(logreg, ft, target)
    # random forest
    rf = RandomForestClassifier(n_estimators=50, min_samples_split=4,
                                min_samples_leaf=2)
    modeling(rf, ft, target)
    # svc
    svc = SVC()
    modeling(svc, ft, target)
    # knn
    knn = KNeighborsClassifier(n_neighbors=3)
    modeling(knn, ft, target)
    # see the coefficients of the logistic regression
    logreg.fit(ft, target)
    feature = pd.DataFrame(ft.columns)
    feature.columns = ['Features']
    feature['Coefficient Estimate'] = pd.Series(logreg.coef_[0])
    print(feature)
    return
```
### Trying Different Feature Combinations
1. Using all features
```python
# test 1: all features
train_ft = train.drop('Survived', axis=1)
train_y = train['Survived']
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft, train_y, 'test_1')
```

```
  Features  Coefficient Estimate
0     Fare              0.004240
1        1              0.389135
2        2             -0.211795
3        3             -1.210494
4   female              2.689013
5  under15              1.658023
6    young              0.030681
7   family             -0.310545
8        C              0.374100
```
2. Dropping young
```python
# test 2: drop young
train_ft_2 = train.drop(['Survived', 'young'], axis=1)
test_2 = test.drop('young', axis=1)
train_ft_2.head()
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_2, train_y, 'test_2')
```

```
  Features  Coefficient Estimate
0     Fare              0.004285
1        1              0.386195
2        2             -0.207867
3        3             -1.202922
4   female              2.690898
5  under15              1.645827
6   family             -0.311682
7        C              0.376629
```
3. Dropping young and C
```python
# test 3: drop young and C
train_ft_3 = train.drop(['Survived', 'young', 'C'], axis=1)
test_3 = test.drop(['young', 'C'], axis=1)
train_ft_3.head()
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_3, train_y, 'test_3')
```

```
  Features  Coefficient Estimate
0     Fare              0.004920
1        1              0.438557
2        2             -0.225821
3        3             -1.194444
4   female              2.694665
5  under15              1.679459
6   family             -0.322922
```
4. Dropping Fare
```python
# test 4: drop Fare
train_ft_4 = train.drop(['Survived', 'Fare'], axis=1)
test_4 = test.drop(['Fare'], axis=1)
train_ft_4.head()
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_4, train_y, 'test_4')
```

```
  Features  Coefficient Estimate
0        1              0.564754
1        2             -0.242384
2        3             -1.287715
3   female              2.699738
4  under15              1.629584
5    young              0.058133
6   family             -0.269146
7        C              0.436600
```
5. Dropping C
```python
# test 5: drop C
train_ft_5 = train.drop(['Survived', 'C'], axis=1)
test_5 = test.drop('C', axis=1)
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_5, train_y, 'test_5')
```

```
  Features  Coefficient Estimate
0     Fare              0.004841
1        1              0.442430
2        2             -0.232150
3        3             -1.207308
4   female              2.691465
5  under15              1.700077
6    young              0.052091
7   family             -0.320831
```
6. Dropping Fare and young
```python
# test 6: drop Fare and young
train_ft_6 = train.drop(['Survived', 'Fare', 'young'], axis=1)
test_6 = test.drop(['Fare', 'young'], axis=1)
train_ft_6.head()
# ml
kf = KFold(n_splits=3, random_state=1)
acc_lst = []
ml(train_ft_6, train_y, 'test_6')
```

```
  Features  Coefficient Estimate
0        1              0.562814
1        2             -0.235606
2        3             -1.274657
3   female              2.702955
4  under15              1.604597
5   family             -0.270284
6        C              0.442288
```
### Summarizing the Results
```python
accuracy_df = pd.DataFrame(data=accuracy,
                           index=['test1', 'test2', 'test3', 'test4', 'test5', 'test6'],
                           columns=['logistic', 'rf', 'svc', 'knn'])
accuracy_df
```
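To pick the winner programmatically rather than by eye, a small sketch over the `accuracy_df` built above:

```python
# (feature set, model) pair with the highest mean cross-validated accuracy
stacked = accuracy_df.stack()
print(stacked.idxmax(), stacked.max())
```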
### Choosing the Model and Features
Overall, test_4 with the support vector machine performs best, so that combination is used for the final prediction.
```python
svc = SVC()
svc.fit(train_ft_4, train_y)
svc_pred = svc.predict(test_4)
print(svc.score(train_ft_4, train_y))
submission_test = pd.read_csv("test.csv")
submission = pd.DataFrame({"PassengerId": submission_test['PassengerId'],
                           "Survived": svc_pred})
submission.to_csv("kaggle_SVC.csv", index=False)
```

```
0.832772166105
```
## Submitting the Results
![](kaggle_result.png)