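RMS Titanic was a huge luxury ocean liner built in the early 20th century for the British White Star Line. She was the largest luxury liner in the world at the time and was called "unsinkable" and the "ship of dreams". On April 15, 1912, during her maiden voyage from Southampton to New York, she struck an iceberg in the North Atlantic and sank. Because there were not enough lifeboats, about 1,500 people died, making it one of the deadliest peacetime maritime disasters and, to this day, the most famous one. In 1997 Paramount Pictures and 20th Century Fox adapted the story of the Titanic into a film that was released worldwide, caused a sensation, and made the ship a household name.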
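The dataset provided by Kaggle contains 1309 samples in total: a training set of 891 samples, of which 342 passengers survived, and a test set of 418 samples. The dataset can be downloaded from the official Kaggle website or from my Baidu cloud drive.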
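First, an overview of the data. The dataset contains the following fields:
PassengerId: passenger ID
Survived: survival status (survived: 1; died: 0)
Pclass: cabin class (1, 2, 3)
Name: passenger name (surname + title + given name)
Sex: sex (male, female)
Age: passenger age
SibSp: number of siblings/spouses travelling with the passenger
Parch: number of parents/children travelling with the passenger (a child travelling only with a nanny counts as 0)
Ticket: ticket number (several passengers may share one ticket number)
Fare: ticket price
Cabin: cabin number
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)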
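The data overview shows that Age, Fare, Cabin and Embarked contain missing values.
The missing values are handled as follows:
Age: many values are missing, so it is left untreated at this step and imputed later with mice.
Fare: one value is missing. Following what many Kaggle solutions agree on, it is set to 8.05, the median fare of third-class passengers who embarked at Southampton.
Cabin: passengers with the same Ticket number share the same Cabin. This rule fills 12 records; the rest are set to Unknown.
Embarked: two values are missing and are filled with C.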
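One way to check which columns have gaps is to count, per column, the values that are either NA or an empty string (read.csv leaves blank text fields such as Cabin as ""). The sketch below uses a small made-up data frame rather than the Kaggle files; `toy` and `count_missing` are hypothetical names introduced only for illustration.

```r
# Toy data frame standing in for the Kaggle data (values are made up).
toy <- data.frame(
  Age      = c(22, NA, 38, NA),
  Fare     = c(7.25, 71.28, NA, 8.05),
  Cabin    = c("", "C85", "", "E46"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# A value counts as missing if it is NA or an empty string.
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col) | col == ""))
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

On the real data the same helper could be called as `count_missing(fullset)` once the training and test sets have been combined.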
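library(dplyr)  # provides bind_rows

trainset <- read.csv('./train.csv', stringsAsFactors = FALSE)
testset <- read.csv('./test.csv', stringsAsFactors = FALSE)
fullset <- bind_rows(trainset, testset)  # test rows get Survived = NA
str(fullset)
# Embarked: the two missing values (rows 62 and 830) are set to 'C'
fullset$Embarked[c(62, 830)] <- 'C'
# Fare: row 1044 gets the median fare of 3rd-class passengers from Southampton
fullset$Fare[1044] <- median(fullset[fullset$Pclass == 3 & fullset$Embarked == 'S', ]$Fare, na.rm = TRUE)
# Cabin: copy the cabin of a passenger who shares the same ticket number.
# The fill is done in place so the row order (and the later train/test split) is preserved.
CabinY <- subset(fullset, Cabin != "")
idx <- match(fullset$Ticket, CabinY$Ticket)
fill <- fullset$Cabin == "" & !is.na(idx)
fullset$Cabin[fill] <- CabinY$Cabin[idx[fill]]
# Everything still empty is marked as Unknown
fullset$Cabin[fullset$Cabin == ""] <- "Unknown"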
Age prediction
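library(mice)  # multiple imputation

# Convert the categorical variables to factors
factor_vars <- c('PassengerId', 'Pclass', 'Sex', 'Embarked')
fullset[factor_vars] <- lapply(fullset[factor_vars], function(x) as.factor(x))
# Set a random seed
set.seed(129)
# Perform mice imputation, excluding certain less-than-useful variables:
mice_mod <- mice(fullset[, !names(fullset) %in% c('PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived')], method = 'rf')
mice_output <- complete(mice_mod)
# Replace Age with the imputed values
fullset$Age <- mice_output$Age
# Split back into training (rows 1-891) and test (rows 892-1309) sets;
# they are named train and test because the model code uses those names
train <- fullset[1:891, ]
test <- fullset[892:1309, ]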
Next I choose random forest (RandomForest), decision tree (DecisionTree), a logit model and a support vector machine to build models, starting with a baseline prediction.
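The random forest formula used in this post refers to Title, FsizeD, Child and Mother, which are not created by any code shown here. A minimal sketch of how such columns might be derived follows. It is not necessarily how the author built them: the helper name `add_features`, the age threshold of 18, the family-size cut-offs and the toy rows are all assumptions made for illustration.

```r
# Hypothetical helper: derive the extra columns named in the model formula.
add_features <- function(df) {
  # Title: the text between ", " and the following "." in Name
  df$Title <- gsub("(.*, )|(\\..*)", "", df$Name)
  # Family size: siblings/spouses + parents/children + the passenger
  fsize <- df$SibSp + df$Parch + 1
  # Discretized family size (cut-offs are an assumption)
  df$FsizeD <- ifelse(fsize == 1, "singleton", ifelse(fsize < 5, "small", "large"))
  # Child: younger than 18 (threshold is an assumption)
  df$Child <- ifelse(df$Age < 18, "Child", "Adult")
  # Mother: woman over 18 with at least one parent/child aboard, not titled "Miss"
  is_mother <- df$Sex == "female" & df$Parch > 0 & df$Age > 18 & df$Title != "Miss"
  df$Mother <- ifelse(is_mother, "Mother", "Not Mother")
  df
}

# Made-up passengers, only to exercise the helper
toy <- data.frame(
  Name  = c("Doe, Mr. John", "Doe, Mrs. Jane", "Doe, Master. Tom", "Roe, Miss. Ann"),
  Sex   = c("male", "female", "male", "female"),
  Age   = c(40, 35, 5, 28),
  SibSp = c(1, 1, 0, 0),
  Parch = c(1, 1, 2, 0),
  stringsAsFactors = FALSE
)
toy_features <- add_features(toy)
print(toy_features[, c("Title", "FsizeD", "Child", "Mother")])
```

On the real data such a helper would need to run on `fullset` after the Age imputation and before the train/test split, with the new columns converted using `factor()` so that the training and test sets share the same levels.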
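library(randomForest)

# Set a random seed
set.seed(754)
# Build the model (note: not all possible variables are used).
# Title, FsizeD, Child and Mother are engineered features that must already
# exist in train and test; the code that creates them is not shown in this post.
rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FsizeD + Child + Mother, data = train)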
library(dplyr)     # mutate, dense_rank, desc, %>%
library(ggplot2)   # plotting
library(ggthemes)  # theme_few
# Show model error
plot(rf_model, ylim = c(0, 0.36))
legend('topright', colnames(rf_model$err.rate), col = 1:3, fill = 1:3)
# Get importance
importance <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[, 'MeanDecreaseGini'], 2))
# Create a rank variable based on importance
rankImportance <- varImportance %>% mutate(Rank = paste0('#', dense_rank(desc(Importance))))
# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(x = Variables, y = 0.5, label = Rank), hjust = 0, vjust = 0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') + coord_flip() + theme_few()
# Predict using the test set
prediction <- predict(rf_model, test)
# Save the solution to a data frame with two columns: PassengerId and Survived (prediction)
# (Kaggle expects the column name PassengerId, with a lowercase "d")
solution <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
# Write the solution to file
write.csv(solution, file = 'rf_mod_Solution.csv', row.names = FALSE)
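The logit model mentioned earlier is not shown in this post. As a rough sketch of what such a baseline could look like, the code below fits `glm` with `family = binomial` on synthetic data. The data frame `toy`, its made-up survival rule, the choice of predictors and the 0.5 cut-off are all assumptions for illustration, not the author's model or results.

```r
set.seed(1)
n <- 200
# Synthetic passengers (not the Kaggle data)
toy <- data.frame(
  Sex    = factor(sample(c("male", "female"), n, replace = TRUE)),
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
# Made-up survival rule, only so the example has something to fit
p <- plogis(-0.5 + 2 * (toy$Sex == "female") - 0.6 * (as.integer(toy$Pclass) - 1))
toy$Survived <- rbinom(n, 1, p)

# Hold out the last 50 rows to get a rough accuracy estimate
train_idx <- 1:150
logit_model <- glm(Survived ~ Sex + Pclass + Age,
                   data = toy[train_idx, ], family = binomial)

# Predicted survival probabilities, turned into 0/1 with a 0.5 cut-off
prob <- predict(logit_model, newdata = toy[-train_idx, ], type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
accuracy <- mean(pred == toy$Survived[-train_idx])
print(accuracy)
```

With the real data, the same pattern would use `train` in place of the synthetic rows and `predict(..., newdata = test, type = "response")` to produce a submission comparable to the random forest one.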