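The main purpose of this post is to organize my own thinking. Most Titanic survival analyses online go fairly deep. I read some of them and practiced quite a bit, but still felt I needed to write things up myself to understand the relevant functions and models properly.
What follows is that summary. The analysis concentrates on cleaning, visualization, and using a model for prediction.
Platform: Jupyter Notebook
Initial data exploration
Set the plot style, a font that can render Chinese figure titles, and global parameters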
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
sns.set_style('whitegrid',{'font.sans-serif':['simhei']})
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Load the data
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
train = train_data.copy()
test = test_data.copy()
Get an overview of the data
train_data.head()
train_data.info()
test_data.info()
train_data.describe(include=['object'])
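The info() output lists non-null counts, which makes it easy to misread how much is actually missing. As a minimal sketch, assuming a small made-up frame in place of train_data (the name toy and its values are hypothetical), missing values can be counted per column like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and share of missing values per column, most incomplete first.
missing = toy.isnull().sum().to_frame("n_missing")
missing["share"] = missing["n_missing"] / len(toy)
missing = missing.sort_values("n_missing", ascending=False)
print(missing)
```

Running the same three lines on train_data and test_data should give the per-column picture that the cleaning steps rely on.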
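Columns that may be related to survival:
1. Pclass: cabin class. First class carried more people of high standing.
2. Sex: women first.
3. Age: the elderly and children were given priority.
4. SibSp: siblings and spouses aboard. Passengers with more relatives may have had a better chance of rescue.
5. Parch: parents and children aboard. Parents or children may have been let through first.
6. Fare: ticket price, which should be correlated with cabin class.
7. Embarked: port of embarkation. I think different ports may reflect differences in social standing and the like.
8. Name: names usually carry a marker of status or rank.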
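The hypotheses in the list above can be checked quickly by comparing survival rates between groups. The following is only a sketch on a made-up frame (toy and its values are hypothetical, only the column names follow train.csv); it relies on the fact that the mean of a 0/1 column is the share of ones.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as train.csv; values are made up.
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 2, 3, 3, 3, 1],
})

# The mean of a 0/1 column is the survival rate within each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```

On the real data the same groupby calls can be applied to train for any of the columns listed above.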
Data cleaning
Filling in missing values:
The train_data.info() output shows that Embarked has the fewest missing values, so fill that column first.
train[train.Embarked.isnull() == True]
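The two passengers with a missing port of embarkation both happen to be women. Draw box plots of fare by port of embarkation and class for female passengers: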
fig, ax = plt.subplots(figsize=(16,12),ncols=2)
ax1 = sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=train[train.Sex == 'female'], ax = ax[0]);
ax2 = sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=test_data[test_data.Sex == 'female'], ax = ax[1]);
ax1.set_title("Training Set", fontsize = 18)
ax2.set_title('Test Set', fontsize = 18)
fig.show()
C looks like the best match.
train['Embarked'] = train['Embarked'].fillna('C')
Fill in the missing Fare in the test set:
farevalue = test[(test.Pclass == 3) & (test.Embarked == "S") & (test.Sex == "male")].Fare.mean()
test['Fare'] = test['Fare'].fillna(farevalue)
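The fill above hard-codes one group (third class, port S, male). As a hedged sketch of the same idea in general form, a group-wise mean can be broadcast back to every row with groupby and transform. The frame toy and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names follow test.csv, values are made up.
toy = pd.DataFrame({
    "Pclass": [3, 3, 3, 1, 1],
    "Embarked": ["S", "S", "S", "C", "C"],
    "Sex": ["male", "male", "male", "female", "female"],
    "Fare": [8.0, 10.0, np.nan, 80.0, 100.0],
})

# Mean fare within each (Pclass, Embarked, Sex) group, aligned to the original rows.
group_mean = toy.groupby(["Pclass", "Embarked", "Sex"])["Fare"].transform("mean")

# Fill only the missing entries; known fares are left untouched.
toy["Fare"] = toy["Fare"].fillna(group_mean)
print(toy)
```

With a single missing value the hard-coded version is just as good. The transform form only pays off when several groups have gaps.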
The remaining columns with missing values are Age and Cabin. Cabin matters less and is mostly missing (78% missing), so replace its missing values with 'U'.
train['Cabin'] = train['Cabin'].fillna('U')
test['Cabin'] = test['Cabin'].fillna('U')
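Before handling Age, split out the information contained in the names. Passenger names (Name) have one very noticeable feature: every name includes a specific form of address, or title. Extracted into its own column, the title makes a very useful new variable that can help prediction.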
all_data = pd.concat([train, test], ignore_index = True)
all_data['Title'] = all_data['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
Title_Dict = {}
Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Title_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))
all_data['Title'] = all_data['Title'].map(Title_Dict)
sns.barplot(x="Title", y="Survived", data=all_data, palette='Set3')
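The title extraction above chains two split calls. As an alternative sketch (not the method used in this post), the same piece of the name can be captured with one regular expression through pandas' str.extract. The three sample names are taken from the dataset; the variable names are only for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the comma after the surname and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Miss', 'the Countess']
```

The result can then be passed through the same Title_Dict mapping.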
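To fill in Age, the usual approach is to substitute the median or the mean. That keeps the data set complete, but it tends to wash out the differences and relationships within the data. Here I try filling the gaps with a two-fold cross-validation (Cross-Validation) scheme instead.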
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
train = all_data[all_data['Survived'].notnull()].copy()
test = all_data[all_data['Survived'].isnull()].copy()
# Split the training set into two equal halves
train_split_1, train_split_2 = train_test_split(train, test_size=0.5, random_state=0)
def predict_age_use_cross_validationg(df1, df2, dfTest):
    # Note: get_dummies is applied to each frame separately, so this assumes
    # every Sex and Title category occurs in df1, df2 and dfTest alike.
    age_df1 = df1[['Age', 'Pclass', 'Sex', 'Title']]
    age_df1 = pd.get_dummies(age_df1, dtype=float)
    age_df2 = df2[['Age', 'Pclass', 'Sex', 'Title']]
    age_df2 = pd.get_dummies(age_df2, dtype=float)
    # .values replaces as_matrix(), which was removed in pandas 1.0
    known_age = age_df1[age_df1.Age.notnull()].values
    unknow_age_df1 = age_df1[age_df1.Age.isnull()].values
    unknown_age = age_df2[age_df2.Age.isnull()].values
    print(unknown_age.shape)
    # Column 0 is Age (the target); the remaining columns are the features
    y = known_age[:, 0]
    X = known_age[:, 1:]
    rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
    rfr.fit(X, y)
    # Fill the missing ages in both halves with the model fitted on df1
    predictedAges = rfr.predict(unknown_age[:, 1:])
    df2.loc[df2.Age.isnull(), 'Age'] = predictedAges
    predictedAges = rfr.predict(unknow_age_df1[:, 1:])
    df1.loc[df1.Age.isnull(), 'Age'] = predictedAges
    # Refit on the test rows with a known Age plus the filled df2,
    # then predict the missing ages in the test set
    age_Test = dfTest[['Age', 'Pclass', 'Sex', 'Title']]
    age_Test = pd.get_dummies(age_Test, dtype=float)
    age_Tmp = df2[['Age', 'Pclass', 'Sex', 'Title']]
    age_Tmp = pd.get_dummies(age_Tmp, dtype=float)
    age_Tmp = pd.concat([age_Test[age_Test.Age.notnull()], age_Tmp])
    known_age1 = age_Tmp.values
    unknown_age1 = age_Test[age_Test.Age.isnull()].values
    y = known_age1[:, 0]
    x = known_age1[:, 1:]
    rfr.fit(x, y)
    predictedAges = rfr.predict(unknown_age1[:, 1:])
    dfTest.loc[dfTest.Age.isnull(), 'Age'] = predictedAges
    return dfTest
t1 = train_split_1.copy()
t2 = train_split_2.copy()
tmp1 = test.copy()
t5 = predict_age_use_cross_validationg(t1,t2,tmp1)
t1 = pd.concat([t1,t2])
t3 = train_split_1.copy()
t4 = train_split_2.copy()
tmp2 = test.copy()
t6 = predict_age_use_cross_validationg(t4,t3,tmp2)
t3 = pd.concat([t3,t4])
train['Age'] = (t1['Age'] + t3['Age'])/2
test['Age'] = (t5['Age'] + t6['Age']) / 2
print (train.describe())
print (test.describe())
all_data = pd.concat([train,test])
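For comparison, the core of the idea (predict a missing Age from Pclass, Sex and Title with a random forest) can be sketched in a single pass without the two-fold split. This is a simplified illustration, not the procedure used above; the frame toy and its values are made up. One-hot encoding is done once on the whole frame so that every row shares the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 4.0, np.nan, 54.0, 2.0],
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Title": ["Mr", "Mrs", "Miss", "Mrs", "Master", "Mr", "Mr", "Miss"],
})

# One-hot encode on the full frame so known and unknown rows share columns.
features = pd.get_dummies(toy[["Pclass", "Sex", "Title"]], dtype=float)
known = toy["Age"].notnull()

# Fit on rows with a known Age, then predict Age for the remaining rows.
rfr = RandomForestRegressor(n_estimators=50, random_state=0)
rfr.fit(features[known], toy.loc[known, "Age"])

imputed = toy.copy()
imputed.loc[~known, "Age"] = rfr.predict(features[~known])
print(imputed["Age"])
```

The two-fold version above averages two such models fitted on different halves, which should make the filled values less dependent on any single fit.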