First Kaggle Experience: Random Forest Analysis of Machine Learning from Disaster in R


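Foreword

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic struck an iceberg and sank, and 1502 of the 2224 people on board died. Machine Learning from Disaster is a well-known Kaggle project for getting started with data analysis. Participants work through data preprocessing, feature engineering, modeling, prediction, and validation: they train a model on the 891 rows of training data (passenger or crew information plus whether each person survived) and use it to predict the survival of the passengers in the other 418 records. Because the project closely mirrors a real-world data analysis workflow, it has been named one of the five best projects for practicing data analysis.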
Five data science projects to learn data science

This article analyzes the Machine Learning from Disaster dataset in roughly the following steps:

  • Data cleaning
  • Feature engineering
  • Model design
  • Prediction

Data Preprocessing

Data Source

  1. Training dataset: train.csv;
  2. Prediction dataset: test.csv;
    https://www.kaggle.com/c/titanic

Importing and Previewing the Data

# Create the project: Machine Learning from Disaster
# Load packages
library(dplyr)
library(stringr)
library(ggthemes)
library(ggplot2)

# Once the packages are loaded, import the data
test <- read.csv("./db/test.csv", header = TRUE, stringsAsFactors = FALSE)
train <- read.csv("./db/train.csv", header = TRUE, stringsAsFactors = FALSE)

# Take a first look at the data
# Inspect the data
str(train)
str(test)
head(train)
head(test)

The results show that, apart from test lacking the Survived column, the two data frames have exactly the same columns.

> str(train)
'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

> head(train)
 PassengerId Survived Pclass                                                Name    Sex Age SibSp Parch
1           1        0      3                             Braund, Mr. Owen Harris   male  22     1     0
2           2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3           3        1      3                              Heikkinen, Miss. Laina female  26     0     0
4           4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
5           5        0      3                            Allen, Mr. William Henry   male  35     0     0
6           6        0      3                                    Moran, Mr. James   male  NA     0     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S
5           373450  8.0500              S
6           330877  8.4583              Q
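
The claim that the two data frames differ only in the Survived column can also be checked programmatically. A minimal sketch, using two tiny hypothetical data frames (toy_train and toy_test are made-up stand-ins, not the real CSV files):

```r
# Toy stand-ins for train and test (hypothetical values, not the real data)
toy_train <- data.frame(PassengerId = 1:3, Survived = c(0, 1, 1), Pclass = c(3, 1, 3))
toy_test  <- data.frame(PassengerId = 4:5, Pclass = c(2, 3))

# Columns present in the training data but missing from the test data
only_in_train <- setdiff(names(toy_train), names(toy_test))
# Columns present in the test data but missing from the training data
only_in_test <- setdiff(names(toy_test), names(toy_train))

print(only_in_train)  # "Survived"
print(only_in_test)   # character(0)
```

Running the same two setdiff() calls on the real train and test should report Survived as the only difference.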

Data Preprocessing

# Add a Survived column to the test dataset
test.survived <- data.frame(Survived = rep("None", nrow(test)), test[,])
# Combine the test and train datasets
data.combined <- rbind(train, test.survived)
data.combined$Survived <- as.factor(data.combined$Survived)
data.combined$Pclass <- as.factor(data.combined$Pclass)

In the combined data, Survived has 418 unknown values (marked "None", the rows to be predicted), Age has 263 missing values, and Fare has 1 missing value.
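
Counts like these can be obtained with a few base R calls. The sketch below runs on a small hypothetical data frame (toy, with made-up values); note that empty strings, as in Cabin, are not NA and need their own check:

```r
# Toy combined data (hypothetical values) with both NA and empty-string gaps
toy <- data.frame(
  Survived = c("0", "1", "None", "None"),
  Age      = c(22, NA, 30, NA),
  Fare     = c(7.25, 71.28, NA, 8.05),
  Cabin    = c("", "C85", "", ""),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))
# Number of rows still waiting for a prediction
unknown_survived <- sum(toy$Survived == "None")
# Empty strings are not NA, so they are counted separately
blank_cabin <- sum(toy$Cabin == "")

print(na_counts)
print(unknown_survived)
print(blank_cabin)
```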

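We now have a first picture of the test and train datasets: the training set has 891 rows and the test set has 418. Our goal is to predict survival (Survived), the dependent variable, from the 11 available independent variables shown in the figure below.


[Figure: description of the variables]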

Feature Engineering

Hypothesis: the higher the passenger class, the higher the survival rate

  ggplot(train,aes(x = Pclass, y = ..count.., fill=factor(Survived))) + 
  geom_bar(stat = "count", position='stack') + 
  xlab('Pclass') + 
  ylab('Count') + 
  ggtitle('How Pclass impact survivor') + 
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) + 
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
Rplot1.jpeg
  • The figure clearly shows that the higher the passenger class, the higher the survival rate: it falls from 62.9% in first class to 24.2% in third class
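
Percentages of this kind can be computed directly instead of being read off the bar labels. A minimal sketch with a small hypothetical data frame (the values in toy are made up; on the real data one would pass train$Pclass and train$Survived):

```r
# Toy data (hypothetical values): passenger class and survival flag
toy <- data.frame(
  Pclass   = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0)
)

# Cross-tabulate class against outcome
counts <- table(toy$Pclass, toy$Survived)
# Row-wise proportions: share of each outcome within a class
rates <- prop.table(counts, margin = 1)
# The column named "1" holds the survival rate per class
survival_rate <- rates[, "1"]

print(round(survival_rate, 3))
```

Here margin = 1 normalizes within each row, so every class sums to 1 regardless of how many passengers it had.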

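Hypothesis: the passenger name (Name) has feature potential

Passenger names (Name) have a striking property: every name contains a specific form of address, or title. Once extracted, this information can serve as a very useful new variable that helps our prediction.

# Extract the title from the passenger name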
data.combined$Title <- gsub('(.*, )|(\\..*)', '', data.combined$Name)
as.factor(data.combined$Title)
table(data.combined$Title)

        Capt          Col          Don         Dona           Dr     Jonkheer         Lady        Major 
           1            4            1            1            8            1            1            2 
      Master         Miss         Mlle          Mme           Mr          Mrs           Ms          Rev 
          61          260            2            1          757          197            2            8 
         Sir the Countess 
           1            1 
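  • Of the titles listed above, Miss, Mlle, Mme, Mrs, Mr, Ms, Lady, Major, Capt, Col, and Sir clearly indicate sex, while the sex behind Rev, Master, Jonkheer, Don, Dona, and Dr is not obvious from the title alone, so each of these is checked below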
data.combined[which(data.combined$Title %in% "Master"), "Sex"]
 [1] "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male"
[15] "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male"
[29] "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male"
[43] "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male" "male"
[57] "male" "male" "male" "male" "male"

> data.combined[which(data.combined$Title %in% "Rev"), "Sex"]
[1] "male" "male" "male" "male" "male" "male" "male" "male"

> data.combined[which(data.combined$Title %in% "Jonkheer"), "Sex"]
[1] "male"
> data.combined[which(data.combined$Title %in% "Don"), "Sex"]
[1] "male"
> data.combined[which(data.combined$Title %in% "Dona"), "Sex"]
[1] "female"
> data.combined[which(data.combined$Title %in% "Dr"), "Sex"]
[1] "male"   "male"   "male"   "male"   "male"   "male"   "female" "male" 

  • Note that Title is very strongly tied to sex: apart from Dr, every title belongs to a single sex. In other words, Title carries information that duplicates Sex and could potentially replace it
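
The title extraction and the title-by-title sex checks above can be condensed into one cross-tabulation. A minimal sketch using the first three names and sexes shown in the str(train) output; the variable names (sample_names, sample_sex, mixed) are illustrative only:

```r
# First three names and sexes from the str(train) output
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)
sample_sex <- c("male", "female", "female")

# Same regex as in the article: drop everything up to ", " and
# everything from the first "." onward, leaving only the title
titles <- gsub('(.*, )|(\\..*)', '', sample_names)
print(titles)  # "Mr" "Mrs" "Miss"

# One table instead of one query per title
title_by_sex <- table(Title = titles, Sex = sample_sex)
print(title_by_sex)

# Titles that occur with more than one sex (none in this sample)
mixed <- rownames(title_by_sex)[rowSums(title_by_sex > 0) > 1]
print(mixed)
```

Applied to data.combined$Title and data.combined$Sex, the same table would be expected to flag only Dr as mixed, in line with the outputs above.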

Effect of the Sex Feature

ggplot(data.combined[1:891,],aes(x = Sex, y = ..count.., fill=factor(Survived))) + 
  geom_bar(stat = "count", position='stack') + 
  facet_wrap(~Pclass) + 
  xlab('Sex') + 
  ylab('Count') + 
  ggtitle('How Sex impact survivor') + 
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) + 
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

Rplot2.jpeg

  • The figure shows the same pattern in every passenger class: women had a higher survival rate

Effect of the Age Feature

> summary(data.combined[1:891,"Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.42   20.12   28.00   29.70   38.00   80.00     177 
ggplot(data.combined[which(!is.na(data.combined[1:891,"Age"])),], aes(x = Age, fill=factor(Survived))) + facet_wrap(~Sex + Pclass) +
  geom_histogram(binwidth = 10) +
  xlab("Age") +
  ylab("Total Count")

> summary(data.combined[which(data.combined$Title %in% "Master"), "Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.330   2.000   4.000   5.483   9.000  14.500       8 
Rplot3.jpeg
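  • The Age column has 177 missing values, close to 20% of the train dataset. With the missing values removed, no clear pattern is visible. However, the age distribution for Master (maximum 14.5) suggests that this title denotes underage males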
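
Both observations, the share of missing ages and the age range per title, are easy to compute. A minimal sketch on a hypothetical data frame (toy, with made-up values), followed by the share implied by the figures quoted above:

```r
# Toy data (hypothetical values): title and age, with one missing age
toy <- data.frame(
  Title = c("Master", "Master", "Mr", "Mr", "Miss", "Mr"),
  Age   = c(4, 9, 35, NA, 26, 54)
)

# Share of rows with a missing age
missing_share <- mean(is.na(toy$Age))
# Oldest known age per title, ignoring missing values
max_age_by_title <- tapply(toy$Age, toy$Title, max, na.rm = TRUE)
print(max_age_by_title)

# Figures quoted in the text: 177 missing ages out of 891 training rows
train_missing_share <- 177 / 891
print(round(train_missing_share, 3))  # 0.199
```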

Effect of Family Size Features

Effect of SibSp (number of siblings and spouses)

data.combined$SibSp <- as.factor(data.combined$SibSp)
ggplot(data.combined[1:891,],aes(x = SibSp, y = ..count.., fill=factor(Survived))) + 
  geom_bar(stat = "count", position='stack') + 
  facet_wrap(~Pclass+Title) + 
  xlab('SibSp') + 
  ylab('Count') + 
  ggtitle('How Sibsp impact survivor') + 
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) + 
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
Rplot4.jpeg

Effect of Parch (number of parents or children)

data.combined$Parch <- as.factor(data.combined$Parch)
ggplot(data.combined[1:891,],aes(x = Parch, y = ..count.., fill=factor(Survived))) + 
  geom_bar(stat = "count", position='stack') + 
  facet_wrap(~Pclass+Title) + 
  xlab('Parch') + 
  ylab('Count') + 
  ggtitle('How Parch impact survivor') + 
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) + 
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
Rplot6.jpeg

Effect of Total Family Size (family.size)

Temp.SibSp <- c(train$SibSp, test$SibSp)
Temp.Parch <- c(train$Parch, test$Parch)
data.combined$family.size <- as.factor(Temp.SibSp + Temp.Parch + 1)

ggplot(data.combined[1:891,],aes(x = family.size, y = ..count.., fill=factor(Survived))) + 
  geom_bar(stat = "count", position='stack') + 
  facet_wrap(~Pclass+Title) + 
  xlab('family.size') + 
  ylab('Count') + 
  ggtitle('How family.size impact survivor') + 
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) + 
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
Rplot06.jpeg
  • Overall, the family-related columns SibSp, Parch, and family.size are weak features; passengers with family members aboard had a better chance of survival
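
The family.size code above reads SibSp and Parch from the original train and test data frames rather than from data.combined, where both columns had already been converted to factors. The sketch below, with made-up values, shows why that matters: as.numeric() on a factor returns level codes, not the original numbers.

```r
# A factor built from numeric values (hypothetical SibSp-like data)
sibsp <- as.factor(c(0, 1, 3, 8))

# Pitfall: this returns the internal level codes 1, 2, 3, 4
codes <- as.numeric(sibsp)
# Going through character recovers the original values 0, 1, 3, 8
values <- as.numeric(as.character(sibsp))

# Family size = siblings/spouses + parents/children + the passenger
parch <- c(0, 2, 1, 2)
family_size <- values + parch + 1

print(codes)
print(values)
print(family_size)
```

Either approach works: keep an untouched numeric copy, as the article does, or convert back through as.character().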

Effect of the Ticket Feature

# The ticket number (Ticket) is character data
> data.combined$Ticket[1:20]
 [1] "A/5 21171"        "PC 17599"         "STON/O2. 3101282" "113803"           "373450"          
 [6] "330877"           "17463"            "349909"           "347742"           "237736"          
[11] "PP 9549"          "113783"           "A/5. 2151"        "347082"           "350406"          
[16] "248706"           "382652"           "244373"           "345763"           "2649"  

  • The data is messy, with no discernible pattern

# Extract the first character of Ticket and tabulate it as a factor
Ticket.first.char <- ifelse(data.combined$Ticket == "", " ", substr(data.combined$Ticket, 1, 1))
> unique(Ticket.first.char)
 [1] "A" "P" "S" "1" "3" "2" "C" "7" "W" "4" "F" "L" "9" "6" "5" "8"
data.combined$Ticket.first.char <- as.factor(Ticket.first.char)

# Survival of passengers by ticket first character
ggplot(data.combined[1:891,], aes(x = Ticket.first.char, fill=factor(Survived))) +
  geom_bar() +
  ggtitle("Survivability by ticket.first.char") +
  xlab("ticket.first.char") +
  ylab("Total Count") +
  ylim(0,350) +
  labs(fill = "Survived")
Rplot7.jpeg
# Survival of passengers by ticket first character within each passenger class
ggplot(data.combined[1:891,], aes(x = Ticket.first.char, fill=factor(Survived))) +
  geom_bar() +
  facet_wrap(~Pclass) + 
  ggtitle("Pclass") +
  xlab("Ticket.first.char") +
  ylab("Total Count") +
  ylim(0,300) +
  labs(fill = "Survived")
Rplot8.jpeg
# Survival of passengers by ticket first character within each passenger class
ggplot(data.combined[1:891,], aes(x = Ticket.first.char, fill=factor(Survived))) +
  geom_bar() +
  facet_wrap(~Pclass) + 
  ggtitle("Pclass") +
  xlab("Ticket.first.char") +
  ylab("Total Count") +
  ylim(0,300) +
  labs(fill = "Survived")
Rplot9.jpeg

  • Overall, the ticket number (Ticket) is a weak feature and shows no clear pattern
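
One way to judge a feature like this is to look at the group size next to the survival rate, since a rate based on one or two passengers says very little. A minimal sketch using only the first ten tickets and Survived values printed earlier in this article (far too few rows for any conclusion, purely illustrative):

```r
# First ten Ticket and Survived values as printed earlier in the article
ticket <- c("A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450",
            "330877", "17463", "349909", "347742", "237736")
survived <- c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1)

# First character of each ticket
first_char <- substr(ticket, 1, 1)

# Group size and survival rate per first character
group_size <- table(first_char)
group_rate <- tapply(survived, first_char, mean)

summary_tbl <- data.frame(
  n    = as.vector(group_size[names(group_rate)]),
  rate = as.vector(group_rate),
  row.names = names(group_rate)
)
print(summary_tbl)
```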

Effect of the Fare Feature

# Survival distribution of passengers by fare
ggplot(data.combined[which(!is.na(data.combined[1:891,"Fare"])), ], aes(x = Fare,fill = Survived)) +
  geom_histogram(binwidth = 5,position="identity") +
  ggtitle("Combined Fare Distribution") +
  xlab("Fare") +
  ylab("Total Count") +
  ylim(0,100)
Rplot10.jpeg
# Survival distribution by fare for each passenger class and Title
ggplot(data.combined[which(!is.na(data.combined[1:891,"Fare"])), ], aes(x = Fare, fill = Survived)) +
  geom_histogram(binwidth = 5,position="identity") +
  facet_wrap(~Pclass + Title) + 
  ggtitle("Pclass, Title") +
  xlab("fare") +
  ylab("Total Count") +
  ylim(0,50) + 
  labs(fill = "Survived")
Rplot11.jpeg
  • No discernible pattern; not considered as a feature for now

Effect of the Cabin Feature

str(data.combined$Cabin)
chr [1:1309] "" "C85" "" "C123" "" "" "E46" "" "" "" "G6" "C103" "" "" "" "" "" "" "" "" "" "D56" "" ...
# Cabin is character data
# Looking at the Cabin values, there are many missing entries and the distribution is messy
> head(data.combined$Cabin,20)
 [1] ""     "C85"  ""     "C123" ""     ""     "E46"  ""     ""     ""     "G6"   "C103" ""     ""    
[15] ""     ""     ""     ""     ""     ""    

# Fill in the missing values
data.combined[which(data.combined$Cabin == ""), "Cabin"] <- "U"
data.combined$Cabin[1:20]
 [1] "U"    "C85"  "U"    "C123" "U"    "U"    "E46"  "U"    "U"    "U"    "G6"   "C103" "U"    "U"   
[15] "U"    "U"    "U"    "U"    "U"    "U"   

# Convert to a factor to look for categories
cabin.first.char <- as.factor(substr(data.combined$Cabin, 1, 1))
str(cabin.first.char)
levels(cabin.first.char)
[1] "A" "B" "C" "D" "E" "F" "G" "T" "U"

ggplot(data.combined[1:891,],aes(x = cabin.first.char, y = ..count.., fill=factor(Survived))) + 
  geom_bar(stat = "count", position='stack') + 
  facet_wrap(~Pclass) + 
  xlab('cabin.first.char') + 
  ylab('Count') + 
  ggtitle('How Cabin impact survivor') + 
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) + 
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

Rplot12.jpeg
  • With so many missing values and no clear pattern, Cabin is provisionally judged unsuitable as a feature
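
How sparse Cabin is can be quantified in a couple of lines. The sketch below uses the first 20 Cabin values printed above; it fills blanks with ifelse() (an alternative to the which()-based assignment in the article) and measures the share of unknown cabins:

```r
# First 20 Cabin values as printed earlier in the article
cabin <- c("", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C103",
           "", "", "", "", "", "", "", "")

# Replace empty strings with "U" (unknown)
cabin_filled <- ifelse(cabin == "", "U", cabin)
# The first character is the deck letter, or "U" when unknown
deck <- substr(cabin_filled, 1, 1)

# Count per deck and share of unknown cabins
deck_counts <- table(deck)
unknown_share <- mean(deck == "U")

print(deck_counts)
print(unknown_share)  # 0.75 for these 20 rows
```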

Effect of the Port of Embarkation (Embarked) Feature

# Embarked takes three values: C = Cherbourg, Q = Queenstown, S = Southampton, so it suits being handled as a factor
str(data.combined$Embarked)
levels(as.factor(data.combined$Embarked))
[1] ""  "C" "Q" "S"

# The train dataset has 2 missing values, negligible relative to the total
table(data.combined[1:891,"Embarked"])

      C   Q   S 
  2 168  77 644 

ggplot(data.combined[1:891,],aes(x = Embarked, y = ..count.., fill=factor(Survived))) + 
  geom_bar(stat = "count", position='stack') + 
  facet_wrap(~Pclass) + 
  xlab('Embarked') + 
  ylab('Count') + 
  ggtitle('How Embarked impact survivor') + 
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) + 
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

Rplot13.jpeg

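  • No clear pattern at first glance, so Embarked is judged not to be a useful feature

Having examined the effect of passenger class, name, sex, age, family size, ticket number, fare, cabin number, and port of embarkation, we conclude that passenger class, the Title in the name, sex, and family size are clearly useful features. The other variables show no clear pattern and are dropped to avoid overfitting. Note also that Title already contains sex information, so using both Title and Sex as independent variables may itself cause overfitting and calls for caution.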

Model Design

Next we build a model to predict the survival of the Titanic's passengers. Here we use the random forest classification algorithm (The RandomForest Classification Algorithm); all the earlier work was in service of this step.

# Load the randomForest package
library(randomForest)
test.subset <-data.combined[1:891,]
test.subset$Title<-as.factor(test.subset$Title)

# Use Pclass and Title as the two independent variables
set.seed(1234)
forest_Pclass_Title <- randomForest(factor(Survived)~Pclass+Title,
                       data=test.subset, 
                       importance=TRUE, 
                       ntree=1000)
varImpPlot(forest_Pclass_Title)

# Error rate statistics
> forest_Pclass_Title

Call:
 randomForest(formula = factor(Survived) ~ Pclass + Title, data = test.subset,      importance = TRUE, ntree = 1000) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 1

        OOB estimate of  error rate: 20.76%
Confusion matrix:
    0   1 class.error
0 533  16   0.0291439
1 169 173   0.4941520
[Figure: random forest ranking of the importance of the independent variables affecting passenger survival]
# Use Pclass, Title, and family.size as the three independent variables
set.seed(1234)
forest_Pclass_Title_family.size <- randomForest(factor(Survived)~Pclass+Title+family.size,
                                    data=test.subset, 
                                    importance=TRUE, 
                                    ntree=1000)
varImpPlot(forest_Pclass_Title_family.size)

# Using Pclass, Title, and family.size rather than only Pclass and Title lowers the OOB error rate by about 3.25 percentage points (20.76% to 17.51%)
> forest_Pclass_Title_family.size

Call:
 randomForest(formula = factor(Survived) ~ Pclass + Title + family.size,      data = test.subset, importance = TRUE, ntree = 1000) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 1

        OOB estimate of  error rate: 17.51%
Confusion matrix:
    0   1 class.error
0 485  64   0.1165756
1  92 250   0.2690058


Rplot15.jpeg

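This comparison shows that the best-performing set of independent variables is Pclass, Title, and family.size.
During the experiment we also deliberately added the variables previously judged to be uninformative; doing so increased the overall error rate, so the details are omitted here.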
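
The OOB error rates and per-class errors in the two printouts follow directly from their confusion matrices. A short check, with the matrices copied from the output above; the helper names oob_error and class_error are illustrative, not part of the randomForest package:

```r
# Confusion matrices copied from the two model printouts
# (rows = actual class, columns = predicted class)
labels <- list(actual = c("0", "1"), predicted = c("0", "1"))
cm_two   <- matrix(c(533, 16, 169, 173), nrow = 2, byrow = TRUE, dimnames = labels)
cm_three <- matrix(c(485, 64,  92, 250), nrow = 2, byrow = TRUE, dimnames = labels)

# Overall error: share of observations off the diagonal
oob_error <- function(cm) 1 - sum(diag(cm)) / sum(cm)
# Per-class error: share of each actual class that was misclassified
class_error <- function(cm) 1 - diag(cm) / rowSums(cm)

err_two   <- oob_error(cm_two)    # 185 / 891
err_three <- oob_error(cm_three)  # 156 / 891
improvement <- err_two - err_three

print(round(100 * c(err_two, err_three, improvement), 2))
print(class_error(cm_two))
print(class_error(cm_three))
```

The gain from adding family.size comes mostly from the survivors: the error for class 1 falls from about 49% to about 27%, while the error for class 0 rises from about 3% to about 12%.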

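  • MeanDecreaseAccuracy measures how much the random forest's prediction accuracy drops when a variable's values are replaced with random values. The larger the value, the more important the variable
  • MeanDecreaseGini uses the Gini index to measure each variable's effect on the heterogeneity of the observations at each node of the classification trees, and so compares variable importance. The larger the value, the more important the variable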
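
The two measures can be illustrated in base R without a forest. This is only a simplified sketch: randomForest computes these per tree (using out-of-bag samples for the accuracy measure), whereas the code below uses a hypothetical one-rule predictor (predict_toy) and simulated data, all of which are assumptions made for illustration.

```r
set.seed(42)
n <- 200
signal <- rbinom(n, 1, 0.5)  # feature that fully determines the label
noise  <- rnorm(n)           # feature unrelated to the label
label  <- signal             # toy target, equal to the signal feature
toy <- data.frame(signal = signal, noise = noise)

# Hypothetical stand-in for a fitted model: predicts from `signal` only
predict_toy <- function(df) df$signal
accuracy <- function(df) mean(predict_toy(df) == label)

# Permutation importance: accuracy lost after shuffling one column
permutation_importance <- function(df, column) {
  baseline <- accuracy(df)
  shuffled <- df
  shuffled[[column]] <- sample(shuffled[[column]])  # break the link to the label
  baseline - accuracy(shuffled)
}
imp <- sapply(names(toy), function(col) permutation_importance(toy, col))
print(imp)

# Gini impurity of a set of labels, and its decrease after a split
gini <- function(y) {
  p <- as.vector(table(y)) / length(y)
  1 - sum(p^2)
}
gini_decrease <- function(y, split) {
  weights  <- as.vector(table(split)) / length(split)
  children <- as.vector(tapply(y, split, gini))
  gini(y) - sum(weights * children)
}
print(gini_decrease(label, signal))     # pure children: full impurity removed
print(gini_decrease(label, noise > 0))  # near zero for an unrelated split
```

Shuffling noise leaves the predictions unchanged, so its importance is exactly zero, while shuffling signal costs roughly half the accuracy. Likewise, splitting on signal removes all impurity, while splitting on noise removes almost none.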

Prediction

With the model and the independent variables settled, the last step is prediction: the model just built can be applied directly to the test set.

validate_subset <- data.combined[892:1309,]
# Predict on the test set
prediction <- predict(forest_Pclass_Title_family.size, validate_subset)

# Save the result as a data frame in the format Kaggle requires for submissions
solution <- data.frame(PassengerId = validate_subset$PassengerId, Survived = prediction)

# Write the result to a file
write.csv(solution, file = 'rf_mod_Solution1.csv', row.names = FALSE)
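
Before uploading, it can be worth reading the file back and checking its shape. A minimal sketch with three hypothetical predictions written to a temporary file; the column names follow Kaggle's sample submission (PassengerId, Survived), and toy_solution is a made-up stand-in for the real solution data frame:

```r
# Hypothetical predictions for three test passengers
toy_solution <- data.frame(
  PassengerId = c(892, 893, 894),
  Survived    = factor(c("0", "1", "0"))
)

# Write to a temporary file and read it back, as Kaggle would see it
path <- tempfile(fileext = ".csv")
write.csv(toy_solution, file = path, row.names = FALSE)
check <- read.csv(path)

# Basic sanity checks on the submission file
has_expected_columns <- identical(names(check), c("PassengerId", "Survived"))
only_binary_values   <- all(check$Survived %in% c(0, 1))
no_duplicate_ids     <- !anyDuplicated(check$PassengerId)

print(c(has_expected_columns, only_binary_values, no_duplicate_ids))
```

For the real file, one would additionally expect nrow(check) to equal 418, the number of test passengers.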

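With this file in hand, you can upload it to Kaggle and see where you rank.
Competition page: Titanic: Machine Learning from Disaster

[Figure: competition page]

The ranking from this experiment:

[Figure: ranking result]
  • The score ranks in the top 26%, which is not ideal and leaves plenty of room for improvement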

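Summary

This article follows the steps of the "Introduction to Data Science with R" tutorial one by one. The work so far is only a first pass; the planned improvements are:

  • Handling missing values in each variable
  • Cross-validation
  • Building prediction models with other algorithms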