Ensemble Model: Stacked Model Example, R Code Walkthrough (V1.0)

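This article walks through the R code from Ensemble Model: Stacked Model Example.

Overview of the code:
It builds an ensemble model, that is, it combines the predictions of several models to obtain a predicted house price.
As the figure below shows, each feature set can consist of one or more groups of attributes: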

model combination.png

1. Data Preparation

1.1 Loading the Data

train.raw <- read.csv(file.path(DATA.DIR,"train.csv"),stringsAsFactors = FALSE)
test.raw <- read.csv(file.path(DATA.DIR,"test.csv"), stringsAsFactors = FALSE)

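1.2 Data Initialization

1.2.1 Selecting and Categorizing the Important Features

Computing feature importance

See Boruta Feature Importance Analysis for details. The main steps are:

Separate the character attributes from the numeric attributes
Categorize the attributes of the dataset by type
Fill in missing values
1. Numeric missing values are set to -1
2. Character missing values are set to "*MISSING*"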
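The steps above can be sketched on a toy data frame. This is a minimal illustration, not the original code: the way DATA_ATTR_TYPES is built here (splitting column names by class) is an assumption, since the source only shows that object being used later, and the toy values are made up.

```r
# Toy data frame standing in for the housing data (hypothetical values)
toy.df <- data.frame(
  LotArea  = c(8450L, NA, 11250L),
  MSZoning = c("RL", NA, "RM"),
  stringsAsFactors = FALSE
)

# Assumed construction: group the column names by class,
# giving one character vector of names per type
DATA_ATTR_TYPES <- split(names(toy.df), sapply(toy.df, class))

# Numeric (integer) columns: missing values become -1
for (x in DATA_ATTR_TYPES$integer) {
  toy.df[[x]][is.na(toy.df[[x]])] <- -1L
}

# Character columns: missing values become "*MISSING*"
for (x in DATA_ATTR_TYPES$character) {
  toy.df[[x]][is.na(toy.df[[x]])] <- "*MISSING*"
}

print(toy.df)
```

Using a sentinel such as -1 instead of dropping rows keeps every house in the data, which suits tree-based models that can split on the sentinel value.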

Run the Boruta analysis to obtain the importance of each feature

set.seed(13)
bor.results <- Boruta(sample.df,response,
           maxRuns=101,
           doTrace=0)

Plotting the result gives:

relative importance of each candidate explanatory attribute.png

Categorizing the features

Example code:
CONFIRMED_ATTR <- c("MSSubClass","MSZoning","LotArea","LotShape",
                                  "LandContour","Neighborhood", …… ,"Fence")

1.2.2 Splitting the Data for Cross-Validation

# create folds for training
set.seed(13)
data_folds <- createFolds(train.raw$SalePrice, k=5)

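Syntax note: createFolds(train.raw$SalePrice, k=5) is caret's createFolds(); it splits the training rows into k = 5 folds and, by default, returns a list holding the held-out row indices of each fold.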
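To make the shape of data_folds concrete, here is a base R sketch that produces a list of the same form. It is only an approximation: for a numeric outcome createFolds() samples within percentile groups of the outcome, whereas this sketch assigns rows purely at random. The names n, k, and toy_folds are hypothetical.

```r
set.seed(13)
n <- 20   # hypothetical number of training rows
k <- 5    # number of folds

# Shuffle the row indices, then deal them into k groups of equal size
fold_id   <- rep(seq_len(k), length.out = n)
toy_folds <- split(sample(n), fold_id)

# Each list element holds the held-out row indices for one fold
str(toy_folds)

# Every row is held out exactly once across the folds
stopifnot(identical(sort(unlist(toy_folds, use.names = FALSE)), seq_len(n)))
```

Because each row appears in exactly one fold, every training row later receives exactly one out-of-fold prediction, which is what the Level 1 model is trained on.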

Create Level 0 Model Feature Sets

Two feature sets are created, Feature Set 1 and Feature Set 2; both include the Boruta Confirmed and Tentative attributes.
Each feature set is generated by a user-defined R function that turns the raw training set into a feature set. No additional feature engineering is applied here.

1.3 Feature Set 1

Log-transform SalePrice; use the Boruta Confirmed and Tentative attributes

Code:
id <- df$Id
if (!is.null(df$SalePrice)) {  # SalePrice exists only in the training data
    y <- log(df$SalePrice)
} else {
    y <- NULL
}

Filling in missing values

Code:
# for numeric attributes, set missing values to -1
num_attr <- intersect(predictor_vars,DATA_ATTR_TYPES$integer)
for (x in num_attr){
  predictors[[x]][is.na(predictors[[x]])] <- -1
}

# for character attributes, set missing values to "*MISSING*" and convert to factor
char_attr <- intersect(predictor_vars,DATA_ATTR_TYPES$character)
for (x in char_attr){
  predictors[[x]][is.na(predictors[[x]])] <- "*MISSING*"
  predictors[[x]] <- factor(predictors[[x]])
}
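Putting the fragments above together, a feature-set function might look like the sketch below. The function name prep_feature_set, its argument list, and the toy data are assumptions made for illustration; the source only says that each feature set comes from a user-defined function and shows the fragments separately.

```r
# Hypothetical helper: turns a raw data frame into list(id, y, predictors)
prep_feature_set <- function(df, predictor_vars, attr_types) {
  id <- df$Id
  # SalePrice is present only in the training data; model it on the log scale
  y <- if (!is.null(df$SalePrice)) log(df$SalePrice) else NULL

  predictors <- df[, predictor_vars, drop = FALSE]

  # Numeric attributes: missing values become -1
  for (x in intersect(predictor_vars, attr_types$integer)) {
    predictors[[x]][is.na(predictors[[x]])] <- -1
  }
  # Character attributes: missing values become "*MISSING*", then factor
  for (x in intersect(predictor_vars, attr_types$character)) {
    predictors[[x]][is.na(predictors[[x]])] <- "*MISSING*"
    predictors[[x]] <- factor(predictors[[x]])
  }

  list(id = id, y = y, predictors = predictors)
}

# Toy inputs (made-up values)
toy.train <- data.frame(Id = 1:3,
                        LotArea = c(8450L, NA, 11250L),
                        MSZoning = c("RL", NA, "RM"),
                        SalePrice = c(208500, 181500, 223500),
                        stringsAsFactors = FALSE)
toy.types <- list(integer = "LotArea", character = "MSZoning")

fs <- prep_feature_set(toy.train, c("LotArea", "MSZoning"), toy.types)
str(fs)
```

Calling the same function on a data frame without SalePrice returns y = NULL, which is how one function can serve both the training and the test data.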

1.4 Feature Set 2 (xgboost)

As with Feature Set 1, first log-transform SalePrice.
As with Feature Set 1, then fill in the missing values.

2. Level 0 Model Training

2.1 Helper Function For Training

This section sets up helpers for the modeling that follows: one fits a model for a single cross-validation fold (trainOneFold), the other uses a fitted fold model to produce predictions.

train model on one data fold
The following steps are combined into a single function, trainOneFold:
1. Get the cross-validation data for one specific fold (get fold specific cv data).

cv.data <- list()
cv.data$predictors <- feature_set$train$predictors[this_fold,]
cv.data$ID <- feature_set$train$id[this_fold]
cv.data$y <- feature_set$train$y[this_fold]

2. For that fold, get the corresponding training data, i.e. all rows outside the fold (get training data for specific fold).

train.data <- list()
train.data$predictors <- feature_set$train$predictors[-this_fold,]
train.data$y <- feature_set$train$y[-this_fold]

3. Use do.call() to pass all the arguments to train() in a single call and fit the model.

 fitted_mdl <- do.call(train,
                      c(list(x=train.data$predictors,y=train.data$y),
                    CARET.TRAIN.PARMS,
                    MODEL.SPECIFIC.PARMS,
                    CARET.TRAIN.OTHER.PARMS))

Here,

do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.

train() in the caret package: Fit Predictive Models over Different Tuning Parameters.
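The do.call() pattern is easier to see on a small base R function. The sketch below mirrors the structure of the train() call (several argument lists joined with c() and passed in one go) but uses mean() so it runs without any packages; the list names are hypothetical.

```r
# Three argument lists, playing the role of the train() parameter groups
core_args  <- list(c(4, 9, NA, 16))   # unnamed, positional argument: the data
extra_args <- list(na.rm = TRUE)      # a named option
more_args  <- list(trim = 0)          # another named option

# c() joins the lists into one; do.call() spreads that list into a call
all_args <- c(core_args, extra_args, more_args)
result   <- do.call(mean, all_args)

# Equivalent to writing the call out by hand
stopifnot(identical(result, mean(c(4, 9, NA, 16), na.rm = TRUE, trim = 0)))
print(result)
```

The benefit in the article's setting is that the same do.call(train, ...) line can be reused for gbm, xgboost, ranger, and nnet; only the contents of the parameter lists change.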

4. Get the predictions, i.e. use the model fitted without this fold to predict the held-out rows, and score them with rmse().

      yhat <- predict(fitted_mdl,newdata = cv.data$predictors,type = "raw")
      score <- rmse(cv.data$y,yhat)  
      ans <- list(fitted_mdl=fitted_mdl,
            score=score,
            predictions=data.frame(ID=cv.data$ID,yhat=yhat,y=cv.data$y))
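The source does not show where rmse() is defined; it may come from a package. A minimal definition that matches how it is called above (actual values first, predictions second) could be the following sketch, with made-up prices for the usage example.

```r
# Assumed definition; the source does not show where rmse() comes from
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Because y is log(SalePrice), the score is an RMSE between log prices
y    <- log(c(200000, 150000, 300000))   # hypothetical true prices
yhat <- log(c(210000, 140000, 300000))   # hypothetical predicted prices
rmse(y, yhat)
```

Scoring on the log scale means a given relative error costs the same for a cheap house as for an expensive one.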

make prediction from a model fitted to one fold
This predicts on the test set with an already fitted fold model; it is likewise wrapped in a function, makeOneFoldTestPrediction:

fitted_mdl <- this_fold$fitted_mdl
yhat <- predict(fitted_mdl,newdata = feature_set$test$predictors,type = "raw")
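The source does not show how the per-fold test predictions would be combined. One plausible use, sketched below under that assumption, is to predict the test set once per fold model and average the results. lm() stands in for the caret model so the example needs no packages, and all data are simulated.

```r
set.seed(13)
n <- 30
train_df   <- data.frame(x = runif(n))
train_df$y <- 2 * train_df$x + rnorm(n, sd = 0.1)   # simulated target
test_df    <- data.frame(x = c(0.2, 0.5, 0.8))      # simulated test rows

toy_folds <- split(sample(n), rep(1:5, length.out = n))

# One model per fold, each fitted without that fold's rows
fold_models <- lapply(toy_folds, function(idx) lm(y ~ x, data = train_df[-idx, ]))

# Counterpart of makeOneFoldTestPrediction: test predictions from each model
fold_preds <- lapply(fold_models, predict, newdata = test_df)

# Average across the five models, element by element
avg_pred <- Reduce(`+`, fold_preds) / length(fold_preds)
print(avg_pred)
```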

2.2 gbm model

set caret training parameters

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

data splitting
pre-processing
feature selection
model tuning using resampling
variable importance estimation

CARET.TRAIN.PARMS <-list(method="gbm")   
CARET.TUNE.GRID <-expand.grid(n.trees=100, 
                            interaction.depth=10, 
                            shrinkage=0.1,
                            n.minobsinnode=10)
MODEL.SPECIFIC.PARMS <- list(verbose=0)

Here,

expand.grid(): creates a data frame from all combinations of the supplied vectors or factors.
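A short illustration of what expand.grid() returns. With one value per parameter, as in the article, the grid has a single row, which is what trainControl(method="none") expects since no resampling is available to choose between candidates. The second grid uses made-up values to show the general case.

```r
# One value per parameter: a one-row grid, i.e. a single fixed configuration
single <- expand.grid(n.trees = 100,
                      interaction.depth = 10,
                      shrinkage = 0.1,
                      n.minobsinnode = 10)
nrow(single)

# Several values per parameter: one row per combination (2 * 3 = 6 rows)
multi <- expand.grid(n.trees = c(100, 500),
                     shrinkage = c(0.01, 0.05, 0.1))
print(multi)
```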

model specific training parameter

    CARET.TRAIN.CTRL <- trainControl(method="none",
                             verboseIter=FALSE,
                             classProbs=FALSE)

    CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
                       tuneGrid=CARET.TUNE.GRID,
                       metric="RMSE")

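Here,

trainControl builds a set of parameters that control how train() fits the model. Possible arguments include:
method: the resampling method; "none", as used here, fits a single model on the whole training set without resampling
……
verboseIter: a logical that controls whether the training log is printed
classProbs: a logical that controls whether class probabilities are computed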

generate features for Level 1

This prepares the inputs for the Level 1 model prediction.

gbm_set <- llply(data_folds,trainOneFold,L0FeatureSet1)

Here, trainOneFold is the function that trains a model on one fold, and L0FeatureSet1 is the preprocessed feature set.
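The fold loop can be sketched end to end in base R. This is an illustration under assumptions: lapply() stands in for plyr's llply(), lm() stands in for the caret model, train_one_fold is a simplified stand-in for trainOneFold, and the data are simulated.

```r
set.seed(13)
n <- 30
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.1)   # simulated target

# Same layout as the article's feature sets: id, predictors, y
toy <- list(train = list(id = seq_len(n),
                         predictors = data.frame(x = x),
                         y = y))
toy_folds <- split(sample(n), rep(1:5, length.out = n))

# Simplified stand-in for trainOneFold:
# fit on the other folds, predict the held-out fold
train_one_fold <- function(this_fold, feature_set) {
  held_out <- feature_set$train$predictors[this_fold, , drop = FALSE]
  fit_df   <- cbind(feature_set$train$predictors[-this_fold, , drop = FALSE],
                    y = feature_set$train$y[-this_fold])
  fitted_mdl <- lm(y ~ ., data = fit_df)
  yhat <- predict(fitted_mdl, newdata = held_out)
  list(fitted_mdl = fitted_mdl,
       predictions = data.frame(ID = feature_set$train$id[this_fold],
                                yhat = yhat,
                                y = feature_set$train$y[this_fold]))
}

toy_set <- lapply(toy_folds, train_one_fold, toy)

# Stack the per-fold results: one out-of-fold prediction per training row
oof <- do.call(rbind, lapply(toy_set, function(x) x$predictions))
stopifnot(nrow(oof) == n, !anyDuplicated(oof$ID))
head(oof)
```

Each prediction in oof comes from a model that never saw that row, so the Level 1 model trained on these values is not fed optimistic, in-sample predictions.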

final model fit
Finally, a single GBM model is fitted on the full training set.

gbm_mdl <- do.call(train, c(list(x=L0FeatureSet1$train$predictors,y=L0FeatureSet1$train$y),
             CARET.TRAIN.PARMS,
             MODEL.SPECIFIC.PARMS,
             CARET.TRAIN.OTHER.PARMS))

CV Error Estimate

cv_y <- do.call(c,lapply(gbm_set,function(x){x$predictions$y}))
cv_yhat <- do.call(c,lapply(gbm_set,function(x){x$predictions$yhat}))
rmse(cv_y,cv_yhat)
cat("Average CV rmse:",mean(do.call(c,lapply(gbm_set,function(x){x$score}))))

Here, cat is useful for producing output in user-defined functions.
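The block above reports two numbers that are not generally the same: the RMSE over all pooled out-of-fold predictions, and the mean of the per-fold RMSE scores. The sketch below uses two hand-made folds of unequal size to show the difference; the rmse() definition and the fold contents are assumptions for illustration.

```r
# Assumed definition of rmse(), matching how it is called in the article
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Two hypothetical folds of unequal size with hand-picked errors
fold_a  <- list(y = c(0, 0, 0, 0), yhat = c(1, 1, 1, 1))   # fold RMSE is 1
fold_b  <- list(y = c(0, 0),       yhat = c(3, 3))         # fold RMSE is 3
toy_set <- list(fold_a, fold_b)

# Pooled: concatenate all folds first, then compute one RMSE over every row
cv_y    <- do.call(c, lapply(toy_set, function(x) x$y))
cv_yhat <- do.call(c, lapply(toy_set, function(x) x$yhat))
pooled  <- rmse(cv_y, cv_yhat)

# Averaged: one RMSE per fold, then the mean of those scores
averaged <- mean(sapply(toy_set, function(x) rmse(x$y, x$yhat)))

cat("pooled:", pooled, " averaged:", averaged, "\n")
```

With five folds of nearly equal size, as in the article, the two figures are usually close, but they answer slightly different questions: the pooled value weights every row equally, the averaged value weights every fold equally.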

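create test submission
The test predictions are converted back from the log scale with exp() and written to a .csv file. The code below predicts with the final model gbm_mdl; averaging the test predictions of the models fitted on the individual data folds (the folds created for cross-validation) is the other option, via makeOneFoldTestPrediction.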

test_gbm_yhat <- predict(gbm_mdl,newdata = L0FeatureSet1$test$predictors,type = "raw")
gbm_submission <- cbind(Id=L0FeatureSet1$test$id,SalePrice=exp(test_gbm_yhat))  
write.csv(gbm_submission,file="gbm_submission.csv",row.names=FALSE)

2.3 xgboost model

The xgboost model follows the same workflow as the gbm model, so the explanation is not repeated; only the main steps and code are listed below:

set caret training parameters

CARET.TRAIN.PARMS <- list(method="xgbTree")   
CARET.TUNE.GRID <-  expand.grid(nrounds=800, 
                            max_depth=10, 
                            eta=0.03, 
                            gamma=0.1, 
                            colsample_bytree=0.4, 
                            min_child_weight=1)
MODEL.SPECIFIC.PARMS <- list(verbose=0)

model specific training parameter

CARET.TRAIN.CTRL <- trainControl(method="none",
                             verboseIter=FALSE,
                             classProbs=FALSE)
CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
                       tuneGrid=CARET.TUNE.GRID,
                       metric="RMSE")

generate Level 1 features

xgb_set <- llply(data_folds,trainOneFold,L0FeatureSet2)

final model fit

xgb_mdl <- do.call(train, c(list(x=L0FeatureSet2$train$predictors,y=L0FeatureSet2$train$y),
             CARET.TRAIN.PARMS,
             MODEL.SPECIFIC.PARMS,
             CARET.TRAIN.OTHER.PARMS))

CV Error Estimate

cv_y <- do.call(c,lapply(xgb_set,function(x){x$predictions$y}))
cv_yhat <- do.call(c,lapply(xgb_set,function(x){x$predictions$yhat}))
rmse(cv_y,cv_yhat)
cat("Average CV rmse:",mean(do.call(c,lapply(xgb_set,function(x){x$score}))))

create test submission

test_xgb_yhat <- predict(xgb_mdl,newdata = L0FeatureSet2$test$predictors,type = "raw")
xgb_submission <- cbind(Id=L0FeatureSet2$test$id,SalePrice=exp(test_xgb_yhat))
write.csv(xgb_submission,file="xgb_submission.csv",row.names=FALSE)

2.4 ranger model

The ranger model follows the same workflow as the xgboost and gbm models, so the explanation is not repeated; only the main steps and code are listed below:

set caret training parameters

CARET.TRAIN.PARMS <- list(method="ranger")   
CARET.TUNE.GRID <-  expand.grid(mtry=2*as.integer(sqrt(ncol(L0FeatureSet1$train$predictors))))
MODEL.SPECIFIC.PARMS <- list(verbose=0,num.trees=500)

model specific training parameter

CARET.TRAIN.CTRL <- trainControl(method="none",
                             verboseIter=FALSE,
                             classProbs=FALSE)

CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
                       tuneGrid=CARET.TUNE.GRID,
                       metric="RMSE")

generate Level 1 features

rngr_set <- llply(data_folds,trainOneFold,L0FeatureSet1)

final model fit

rngr_mdl <- do.call(train, c(list(x=L0FeatureSet1$train$predictors,y=L0FeatureSet1$train$y),
             CARET.TRAIN.PARMS,
             MODEL.SPECIFIC.PARMS,
             CARET.TRAIN.OTHER.PARMS))

CV Error Estimate

cv_y <- do.call(c,lapply(rngr_set,function(x){x$predictions$y}))
cv_yhat <- do.call(c,lapply(rngr_set,function(x){x$predictions$yhat}))
rmse(cv_y,cv_yhat)
cat("Average CV rmse:",mean(do.call(c,lapply(rngr_set,function(x){x$score}))))

create test submission

test_rngr_yhat <- predict(rngr_mdl,newdata = L0FeatureSet1$test$predictors,type = "raw")
rngr_submission <- cbind(Id=L0FeatureSet1$test$id,SalePrice=exp(test_rngr_yhat))
write.csv(rngr_submission,file="rngr_submission.csv",row.names=FALSE)

3. Level 1 Model Training

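From the earlier results, gbm_set, xgb_set, and rngr_set hold the per-fold outputs of the gbm, xgboost, and ranger models; the out-of-fold predictions of the three models are extracted from them and become the features of the Level 1 model.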

gbm_yhat <- do.call(c,lapply(gbm_set,function(x){x$predictions$yhat}))
xgb_yhat <- do.call(c,lapply(xgb_set,function(x){x$predictions$yhat}))
rngr_yhat <- do.call(c,lapply(rngr_set,function(x){x$predictions$yhat}))

3.1 Create predictions For Level 1 Model

Open question: I did not fully understand the syntax of the following block.

L1FeatureSet <- list()  # assumed initialization: the list must exist before assigning into it
L1FeatureSet$train$id <- do.call(c,lapply(gbm_set,function(x){x$predictions$ID}))
L1FeatureSet$train$y <- do.call(c,lapply(gbm_set,function(x){x$predictions$y}))
predictors <- data.frame(gbm_yhat,xgb_yhat,rngr_yhat)
predictors_rank <- t(apply(predictors,1,rank))
colnames(predictors_rank) <- paste0("rank_",names(predictors))
L1FeatureSet$train$predictors <- predictors #cbind(predictors,predictors_rank)
L1FeatureSet$test$id <- gbm_submission[,"Id"]
L1FeatureSet$test$predictors <- data.frame(gbm_yhat=test_gbm_yhat, xgb_yhat=test_xgb_yhat, rngr_yhat=test_rngr_yhat)
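The least obvious line in the block above is t(apply(predictors,1,rank)). The sketch below runs it on three made-up rows to show what it produces. Note that in the block above the rank columns are computed but not used: the cbind(predictors,predictors_rank) alternative is commented out, so the Level 1 predictors are just the three raw prediction columns.

```r
# Hypothetical Level 0 predictions for three houses (log scale)
predictors <- data.frame(gbm_yhat  = c(12.1, 11.8, 12.5),
                         xgb_yhat  = c(12.0, 11.9, 12.6),
                         rngr_yhat = c(12.2, 11.7, 12.4))

# apply(., 1, rank) ranks the three models within each row, but returns the
# result with rows and columns swapped; t() flips it back to one row per house
predictors_rank <- t(apply(predictors, 1, rank))
colnames(predictors_rank) <- paste0("rank_", names(predictors))
print(predictors_rank)
```

For the first house the xgboost prediction is the lowest and the ranger prediction the highest, so the ranks in that row are 2, 1, 3.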

3.2 Neural Net Model

The workflow is broadly the same as for the Level 0 models:

set caret training parameters

CARET.TRAIN.PARMS <- list(method="nnet") 
CARET.TUNE.GRID <-  NULL  # NULL: use caret's default tuning grid

model specific training parameter

CARET.TRAIN.CTRL <- trainControl(method="repeatedcv",
                             number=5,
                             repeats=1,
                             verboseIter=FALSE)
CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
                        maximize=FALSE,
                       tuneGrid=CARET.TUNE.GRID,
                       tuneLength=7,
                       metric="RMSE")
# Other model specific parameters
MODEL.SPECIFIC.PARMS <- list(verbose=FALSE,linout=TRUE,trace=FALSE)

train the model

l1_nnet_mdl <- do.call(train, 
                       c(list(x=L1FeatureSet$train$predictors, y=L1FeatureSet$train$y),
                        CARET.TRAIN.PARMS,
                        MODEL.SPECIFIC.PARMS,
                        CARET.TRAIN.OTHER.PARMS))
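The article stops after training the Level 1 model. By analogy with the Level 0 submissions, a plausible last step is to predict on the Level 1 test features and convert back with exp(). The sketch below shows that flow under stated assumptions: lm() stands in for the nnet model so the example needs no packages, all predictions are simulated, and the Id values are made up.

```r
set.seed(13)
n <- 50
truth <- rnorm(n, mean = 12, sd = 0.4)   # hypothetical log prices

# Simulated out-of-fold Level 0 predictions: truth plus model-specific noise
l1_train <- data.frame(gbm_yhat  = truth + rnorm(n, sd = 0.15),
                       xgb_yhat  = truth + rnorm(n, sd = 0.12),
                       rngr_yhat = truth + rnorm(n, sd = 0.18))

# Stand-in Level 1 learner (the article trains caret's "nnet" instead)
l1_mdl <- lm(y ~ ., data = cbind(l1_train, y = truth))

# Hypothetical Level 0 test predictions for two houses
l1_test <- data.frame(gbm_yhat  = c(12.0, 12.4),
                      xgb_yhat  = c(12.1, 12.3),
                      rngr_yhat = c(11.9, 12.5))

# Predict on the log scale, then exp() to return to a price
l1_submission <- data.frame(Id = c(1461, 1462),
                            SalePrice = exp(predict(l1_mdl, newdata = l1_test)))
print(l1_submission)
```

The column names of the test features must match those used in training (gbm_yhat, xgb_yhat, rngr_yhat), which is why the article builds L1FeatureSet$test$predictors with exactly those names.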

Appendix

For additional information on model stacking see these references:

MLWave: Kaggle Ensembling Guide
Kaggle Forum Posting: Stacking
Winning Data Science Competitions: Jeong-Yoon Lee This talk is about 90 minutes long. The sections relevant to model stacking are discussed in these segments (h:mm:ss to h:mm:ss): 1:05:25 to 1:12:15 and 1:21:30 to 1:27:00.

Last edited: 2017.12.07 16:51:37
