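This article walks through the R code in Ensemble Model: Stacked Model Example.
An overview of the code follows:
The code builds an ensemble model, which combines the predictions of several models to predict house prices.
As the diagram shows, each feature set can consist of one or more sets of attributes:
1. Data Preparation
1.1 Loading the Data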
train.raw <- read.csv(file.path(DATA.DIR,"train.csv"),stringsAsFactors = FALSE)
test.raw <- read.csv(file.path(DATA.DIR,"test.csv"), stringsAsFactors = FALSE)
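1.2 Data Initialization
1.2.1 Selecting and Categorizing Important Features
Compute feature importance
See Boruta Feature Importance Analysis for details. The main steps are:
Separate character and numeric data
Classify the data set
Fill in missing values
1. Missing numeric values are set to -1
2. Missing character values are set to "*MISSING*"
Run the Boruta analysis to obtain the importance of each feature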
set.seed(13)
bor.results <- Boruta(sample.df,response, maxRuns=101, doTrace=0)
The resulting plot is shown below:
Categorize the features
Example code:
CONFIRMED_ATTR <- c("MSSubClass","MSZoning","LotArea","LotShape",
"LandContour","Neighborhood", …… ,"Fence")
1.2.2 Splitting the Data for Cross-Validation
# create folds for training
set.seed(13)
data_folds <- createFolds(train.raw$SalePrice, k=5)
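Syntax note: createFolds(train.raw$SalePrice, k=5), from the caret package, splits the rows into 5 folds and returns a list of 5 integer vectors, each holding the row indices of one held-out fold.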
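To make the shape of data_folds concrete, here is a minimal base-R sketch that mimics the structure createFolds returns: a named list of held-out row indices. It assigns rows to folds purely at random and does not reproduce caret's grouping on the outcome; the toy vector y and the helper make_folds are hypothetical names used only for illustration.

```r
# Hypothetical stand-in for caret::createFolds: random, unstratified folds.
make_folds <- function(y, k = 5) {
  n <- length(y)
  # shuffle the row indices, then deal them into k groups of near-equal size
  shuffled <- sample(seq_len(n))
  fold_id <- rep(seq_len(k), length.out = n)
  folds <- split(shuffled, fold_id)
  names(folds) <- paste0("Fold", seq_len(k))
  lapply(folds, sort)
}

set.seed(13)
y <- rnorm(23)                      # toy outcome with 23 rows
toy_folds <- make_folds(y, k = 5)

str(toy_folds)                      # five integer vectors of held-out rows

# Rows used for fitting when Fold1 is held out
train_rows <- setdiff(seq_along(y), toy_folds$Fold1)
```

Every row lands in exactly one fold, so each row is predicted exactly once by a model that never saw it during fitting.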
Create Level 0 Model Feature Sets
Two feature sets are created, Feature Set 1 and Feature Set 2.
Both include the Boruta Confirmed and Tentative attributes. Each feature set is produced by a user-defined R function that turns the raw training set into a feature set. No additional feature engineering is applied here.
1.3 Feature Set 1
Log-transform SalePrice and use the Boruta Confirmed and Tentative attributes
Code:
id <- df$Id
if (!is.null(df$SalePrice)) {
y <- log(df$SalePrice)
} else {
y <- NULL
}
Fill in missing values
Code:
# for numeric attributes, set missing values to -1
num_attr <- intersect(predictor_vars,DATA_ATTR_TYPES$integer)
for (x in num_attr){
predictors[[x]][is.na(predictors[[x]])] <- -1
}
# for character attributes, set missing values to "*MISSING*" and convert to factor
char_attr <- intersect(predictor_vars,DATA_ATTR_TYPES$character)
for (x in char_attr){
predictors[[x]][is.na(predictors[[x]])] <- "*MISSING*"
predictors[[x]] <- factor(predictors[[x]])
}
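The snippet above depends on predictors, predictor_vars and DATA_ATTR_TYPES, which are defined elsewhere in the original code. As a self-contained illustration of the same fill rule, the sketch below applies it to a toy data frame and detects column types directly instead of looking them up; the column names and values are made up for the example.

```r
# Toy data frame with missing values in one numeric and one character column
predictors <- data.frame(
  LotFrontage = c(65, NA, 80),
  Alley       = c("Grvl", NA, NA),
  stringsAsFactors = FALSE
)

# Detect column types directly (the article looks them up in DATA_ATTR_TYPES)
num_attr  <- names(predictors)[sapply(predictors, is.numeric)]
char_attr <- names(predictors)[sapply(predictors, is.character)]

# Numeric columns: missing values become -1
for (x in num_attr) {
  predictors[[x]][is.na(predictors[[x]])] <- -1
}

# Character columns: missing values become "*MISSING*", then convert to factor
for (x in char_attr) {
  predictors[[x]][is.na(predictors[[x]])] <- "*MISSING*"
  predictors[[x]] <- factor(predictors[[x]])   # missing is now its own level
}

str(predictors)
```

Filling before calling factor() matters: it makes "*MISSING*" an ordinary level rather than leaving NA entries in the factor.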
1.4 Feature Set 2 (xgboost)
As with Feature Set 1, first log-transform SalePrice.
As with Feature Set 1, then fill in missing values.
2. Level 0 Model Training
2.1 Helper Function For Training
These helpers prepare for the modeling that follows: fitting a model on one cross-validation fold (trainOneFold) and producing predictions from that fold's data and model.
train model on one data fold
The steps below are combined into one function, trainOneFold:
1. Get the cross-validation data for the specific fold (get fold specific cv data):
cv.data <- list()
cv.data$predictors <- feature_set$train$predictors[this_fold,]
cv.data$ID <- feature_set$train$id[this_fold]
cv.data$y <- feature_set$train$y[this_fold]
2. Get the training data for this fold, that is, all rows outside the fold (get training data for specific fold):
train.data <- list()
train.data$predictors <- feature_set$train$predictors[-this_fold,]
train.data$y <- feature_set$train$y[-this_fold]
3. Use do.call() to assemble and run the train() call in one step, fitting the model.
fitted_mdl <- do.call(train,
c(list(x=train.data$predictors,y=train.data$y),
CARET.TRAIN.PARMS,
MODEL.SPECIFIC.PARMS,
CARET.TRAIN.OTHER.PARMS))
where:
- do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
- train() (from the caret package) fits predictive models over different tuning parameters.
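The pattern of building an argument list in pieces and handing it to do.call() can be tried without caret. The sketch below uses paste() as a stand-in for train(); the variable names are hypothetical.

```r
# Argument lists kept in separate pieces, as in the article
base_args  <- list("gbm", "xgb", "rngr")   # positional arguments
extra_args <- list(sep = "_")               # a named argument

# c() concatenates the lists; do.call() expands them into a single call
combined <- do.call(paste, c(base_args, extra_args))

# The same call written out by hand
direct <- paste("gbm", "xgb", "rngr", sep = "_")

identical(combined, direct)
```

Keeping the parameter lists separate is what lets the same do.call(train, ...) line be reused for every model: only the contents of the lists change.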
4. Make predictions from the model fitted to one fold and score them:
yhat <- predict(fitted_mdl,newdata = cv.data$predictors,type = "raw")
score <- rmse(cv.data$y,yhat)
ans <- list(fitted_mdl=fitted_mdl,
score=score,
predictions=data.frame(ID=cv.data$ID,yhat=yhat,y=cv.data$y))
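Put together, the four steps can be sketched as one runnable function. This is a simplified illustration, not the original trainOneFold: lm() replaces caret's train() so that the example needs no extra packages, rmse is defined locally, and the toy feature set is invented.

```r
# Root mean squared error, defined here so the sketch has no dependencies
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical simplified version of trainOneFold: lm() replaces caret's train()
train_one_fold_lm <- function(this_fold, feature_set) {
  # steps 1 and 2: held-out rows versus training rows
  cv_x    <- feature_set$train$predictors[this_fold, , drop = FALSE]
  cv_y    <- feature_set$train$y[this_fold]
  train_x <- feature_set$train$predictors[-this_fold, , drop = FALSE]
  train_y <- feature_set$train$y[-this_fold]

  # step 3: fit on the training rows only
  fitted_mdl <- lm(y ~ ., data = cbind(train_x, y = train_y))

  # step 4: predict the held-out rows and score them
  yhat <- unname(predict(fitted_mdl, newdata = cv_x))
  list(fitted_mdl = fitted_mdl,
       score = rmse(cv_y, yhat),
       predictions = data.frame(ID = feature_set$train$id[this_fold],
                                yhat = yhat, y = cv_y))
}

set.seed(13)
n <- 40
toy_set <- list(train = list(
  id = seq_len(n),
  predictors = data.frame(x1 = rnorm(n), x2 = rnorm(n))
))
toy_set$train$y <- 1 + 2 * toy_set$train$predictors$x1 + rnorm(n, sd = 0.1)

fold1 <- 1:8                         # held-out rows for this fold
res <- train_one_fold_lm(fold1, toy_set)
res$score
```

The returned predictions data frame keeps ID, yhat and y together, which is what later allows the out-of-fold predictions to be reassembled as Level 1 features.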
make prediction from a model fitted to one fold
Predict on the test set with an already-fitted model. This is also wrapped in a function, makeOneFoldTestPrediction:
fitted_mdl <- this_fold$fitted_mdl
yhat <- predict(fitted_mdl,newdata = feature_set$test$predictors,type = "raw")
2.2 gbm model
set caret training parameters
The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:
- data splitting
- pre-processing
- feature selection
- model tuning using resampling
- variable importance estimation
CARET.TRAIN.PARMS <-list(method="gbm")
CARET.TUNE.GRID <-expand.grid(n.trees=100,
interaction.depth=10,
shrinkage=0.1,
n.minobsinnode=10)
MODEL.SPECIFIC.PARMS <- list(verbose=0)
where:
expand.grid(): creates a data frame from all combinations of the supplied vectors or factors.
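A quick base-R check of what expand.grid() produces may help. With one value per parameter, as in the gbm grid above, the result is a single-row data frame, that is, one fixed parameter combination. The second grid is a hypothetical variation showing how the row count grows.

```r
# One value per parameter: exactly one candidate combination
single <- expand.grid(n.trees = 100, interaction.depth = 10,
                      shrinkage = 0.1, n.minobsinnode = 10)
nrow(single)   # 1

# Two values for each of two parameters: every combination is listed
multi <- expand.grid(n.trees = c(100, 200), shrinkage = c(0.1, 0.05))
nrow(multi)    # 4
multi
```

A single-row grid suits this workflow because the models are fitted with fixed hyperparameters rather than tuned by resampling.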
model specific training parameter
CARET.TRAIN.CTRL <- trainControl(method="none",
verboseIter=FALSE,
classProbs=FALSE)
CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
tuneGrid=CARET.TUNE.GRID,
metric="RMSE")
where:
trainControl builds a set of parameters that control how the model is trained. Possible arguments include:
method: the resampling method
……
verboseIter: a logical flag for printing the training log.
classProbs: a logical flag for whether class probabilities should be computed.
generate features for Level 1
This prepares the inputs for the Level 1 model prediction.
gbm_set <- llply(data_folds,trainOneFold,L0FeatureSet1)
Here trainOneFold is the function that trains a model on one fold, and L0FeatureSet1 is the processed feature set.
final model fit
Finally, fit one GBM model on the full training set.
gbm_mdl <- do.call(train, c(list(x=L0FeatureSet1$train$predictors,y=L0FeatureSet1$train$y),
CARET.TRAIN.PARMS,
MODEL.SPECIFIC.PARMS,
CARET.TRAIN.OTHER.PARMS))
CV Error Estimate
cv_y <- do.call(c,lapply(gbm_set,function(x){x$predictions$y}))
cv_yhat <- do.call(c,lapply(gbm_set,function(x){x$predictions$yhat}))
rmse(cv_y,cv_yhat)
cat("Average CV rmse:",mean(do.call(c,lapply(gbm_set,function(x){x$score}))))
Here, cat is useful for producing output in user-defined functions.
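The block above reports two numbers that are easy to confuse: the RMSE over all pooled out-of-fold predictions, and the mean of the per-fold RMSE scores. They generally differ, as the following toy sketch shows. The data is invented, and rmse is defined locally rather than taken from a package.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy per-fold results shaped like the elements of gbm_set
toy_set <- list(
  Fold1 = list(predictions = data.frame(y = c(1, 2, 3), yhat = c(1, 2, 3))),
  Fold2 = list(predictions = data.frame(y = c(1, 2, 3), yhat = c(3, 4, 5)))
)
# Attach the per-fold score, as trainOneFold does
toy_set <- lapply(toy_set, function(x) {
  x$score <- rmse(x$predictions$y, x$predictions$yhat)
  x
})

# Pool all out-of-fold predictions, then compute a single RMSE
cv_y    <- do.call(c, lapply(toy_set, function(x) x$predictions$y))
cv_yhat <- do.call(c, lapply(toy_set, function(x) x$predictions$yhat))
pooled  <- rmse(cv_y, cv_yhat)

# Average the per-fold RMSE values instead
average <- mean(do.call(c, lapply(toy_set, function(x) x$score)))

c(pooled = pooled, average = average)
```

Here the pooled value is sqrt(2) while the average is 1, because squaring before averaging gives the poorly predicted fold more weight.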
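create test submission
The test predictions come from gbm_mdl, the final model fitted on the full training set, rather than from an average over the per-fold models. They are converted back from the log scale with exp() and written to a .csv file.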
test_gbm_yhat <- predict(gbm_mdl,newdata = L0FeatureSet1$test$predictors,type = "raw")
gbm_submission <- cbind(Id=L0FeatureSet1$test$id,SalePrice=exp(test_gbm_yhat))
write.csv(gbm_submission,file="gbm_sumbission.csv",row.names=FALSE)
2.3 xgboost model
The xgboost model follows the same workflow as the gbm model, so the explanation is not repeated. Only the main steps and code are listed below:
set caret training parameters
CARET.TRAIN.PARMS <- list(method="xgbTree")
CARET.TUNE.GRID <- expand.grid(nrounds=800,
max_depth=10,
eta=0.03,
gamma=0.1,
colsample_bytree=0.4,
min_child_weight=1)
MODEL.SPECIFIC.PARMS <- list(verbose=0)
model specific training parameter
CARET.TRAIN.CTRL <- trainControl(method="none",
verboseIter=FALSE,
classProbs=FALSE)
CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
tuneGrid=CARET.TUNE.GRID,
metric="RMSE")
generate Level 1 features
xgb_set <- llply(data_folds,trainOneFold,L0FeatureSet2)
final model fit
xgb_mdl <- do.call(train, c(list(x=L0FeatureSet2$train$predictors,y=L0FeatureSet2$train$y),
CARET.TRAIN.PARMS,
MODEL.SPECIFIC.PARMS,
CARET.TRAIN.OTHER.PARMS))
CV Error Estimate
cv_y <- do.call(c,lapply(xgb_set,function(x){x$predictions$y}))
cv_yhat <- do.call(c,lapply(xgb_set,function(x){x$predictions$yhat}))
rmse(cv_y,cv_yhat)
cat("Average CV rmse:",mean(do.call(c,lapply(xgb_set,function(x){x$score}))))
create test submission
test_xgb_yhat <- predict(xgb_mdl,newdata = L0FeatureSet2$test$predictors,type = "raw")
xgb_submission <- cbind(Id=L0FeatureSet2$test$id,SalePrice=exp(test_xgb_yhat))
write.csv(xgb_submission,file="xgb_sumbission.csv",row.names=FALSE)
2.4 ranger model
The ranger model follows the same workflow as the xgboost and gbm models, so the explanation is not repeated. Only the main steps and code are listed below:
set caret training parameters
CARET.TRAIN.PARMS <- list(method="ranger")
CARET.TUNE.GRID <- expand.grid(mtry=2*as.integer(sqrt(ncol(L0FeatureSet1$train$predictors))))
MODEL.SPECIFIC.PARMS <- list(verbose=0,num.trees=500)
model specific training parameter
CARET.TRAIN.CTRL <- trainControl(method="none",
verboseIter=FALSE,
classProbs=FALSE)
CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
tuneGrid=CARET.TUNE.GRID,
metric="RMSE")
generate Level 1 features
rngr_set <- llply(data_folds,trainOneFold,L0FeatureSet1)
final model fit
rngr_mdl <- do.call(train, c(list(x=L0FeatureSet1$train$predictors,y=L0FeatureSet1$train$y),
CARET.TRAIN.PARMS,
MODEL.SPECIFIC.PARMS,
CARET.TRAIN.OTHER.PARMS))
CV Error Estimate
cv_y <- do.call(c,lapply(rngr_set,function(x){x$predictions$y}))
cv_yhat <- do.call(c,lapply(rngr_set,function(x){x$predictions$yhat}))
rmse(cv_y,cv_yhat)
cat("Average CV rmse:",mean(do.call(c,lapply(rngr_set,function(x){x$score}))))
create test submission
test_rngr_yhat <- predict(rngr_mdl,newdata = L0FeatureSet1$test$predictors,type = "raw")
rngr_submission <- cbind(Id=L0FeatureSet1$test$id,SalePrice=exp(test_rngr_yhat))
write.csv(rngr_submission,file="rngr_sumbission.csv",row.names=FALSE)
3. Level 1 Model Training
From the earlier results, gbm_set, xgb_set and rngr_set hold the per-fold outputs of the gbm, xgboost and ranger models. The cross-validation predictions of the three models are extracted from them:
gbm_yhat <- do.call(c,lapply(gbm_set,function(x){x$predictions$yhat}))
xgb_yhat <- do.call(c,lapply(xgb_set,function(x){x$predictions$yhat}))
rngr_yhat <- do.call(c,lapply(rngr_set,function(x){x$predictions$yhat}))
3.1 Create predictions For Level 1 Model
Question: I did not fully understand the syntax of the following block.
L1FeatureSet <- list()   # must exist before its elements are assigned
L1FeatureSet$train$id <- do.call(c,lapply(gbm_set,function(x){x$predictions$ID}))
L1FeatureSet$train$y <- do.call(c,lapply(gbm_set,function(x){x$predictions$y}))
predictors <- data.frame(gbm_yhat,xgb_yhat,rngr_yhat)
predictors_rank <- t(apply(predictors,1,rank))
colnames(predictors_rank) <- paste0("rank_",names(predictors))
L1FeatureSet$train$predictors <- predictors #cbind(predictors,predictors_rank)
L1FeatureSet$test$id <- gbm_submission[,"Id"]
L1FeatureSet$test$predictors <- data.frame(gbm_yhat=test_gbm_yhat, xgb_yhat=test_xgb_yhat, rngr_yhat=test_rngr_yhat)
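The syntax of this block can be unpacked on toy data. In the sketch below, lapply() pulls one vector out of each fold's result, do.call(c, ...) concatenates those vectors end to end, and t(apply(..., 1, rank)) ranks the three model predictions within each row. The toy sets, the helper names make_set and stack, and all values are hypothetical.

```r
# Toy per-fold outputs shaped like gbm_set / xgb_set / rngr_set
make_set <- function(shift) list(
  Fold1 = list(predictions = data.frame(ID = c(2, 5), y = c(12.0, 11.5),
                                        yhat = c(12.1, 11.4) + shift)),
  Fold2 = list(predictions = data.frame(ID = c(1, 3), y = c(11.8, 12.3),
                                        yhat = c(11.9, 12.2) + shift))
)
gbm_set  <- make_set(0)
xgb_set  <- make_set(0.05)
rngr_set <- make_set(-0.05)

# lapply pulls one vector per fold; do.call(c, ...) glues them end to end
stack <- function(set, field) {
  unname(do.call(c, lapply(set, function(x) x$predictions[[field]])))
}

L1FeatureSet <- list()
L1FeatureSet$train$id <- stack(gbm_set, "ID")   # fold order, not row order
L1FeatureSet$train$y  <- stack(gbm_set, "y")
predictors <- data.frame(gbm_yhat  = stack(gbm_set, "yhat"),
                         xgb_yhat  = stack(xgb_set, "yhat"),
                         rngr_yhat = stack(rngr_set, "yhat"))

# apply(..., 1, rank) returns models x rows, so t() restores rows x models
predictors_rank <- t(apply(predictors, 1, rank))
colnames(predictors_rank) <- paste0("rank_", names(predictors))

# As in the article, only the raw predictions are used as Level 1 features
L1FeatureSet$train$predictors <- predictors
```

Because the ids come out in fold order, the three Level 0 sets have to be built from the same data_folds so that row i of every column refers to the same house. In the original block the rank columns are computed but left out, since the cbind is commented out.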
3.2 Neural Net Model
The workflow broadly mirrors that of the Level 0 models:
set caret training parameters
CARET.TRAIN.PARMS <- list(method="nnet")
CARET.TUNE.GRID <- NULL # NULL means the default tuning parameters are used
model specific training parameter
CARET.TRAIN.CTRL <- trainControl(method="repeatedcv",
number=5,
repeats=1,
verboseIter=FALSE)
CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
maximize=FALSE,
tuneGrid=CARET.TUNE.GRID,
tuneLength=7,
metric="RMSE")
# Other model specific parameters
MODEL.SPECIFIC.PARMS <- list(verbose=FALSE,linout=TRUE,trace=FALSE)
train the model
l1_nnet_mdl <- do.call(train,
c(list(x=L1FeatureSet$train$predictors, y=L1FeatureSet$train$y),
CARET.TRAIN.PARMS,
MODEL.SPECIFIC.PARMS,
CARET.TRAIN.OTHER.PARMS))
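To show what the Level 1 fit does with its inputs, here is a minimal sketch that substitutes a plain linear model for the neural network so it runs in base R. lm() is a stand-in and not the article's method, and all data is simulated. The final exp() step assumes, as in the Level 0 submissions, that the target is log(SalePrice).

```r
set.seed(13)
n <- 50
y <- rnorm(n, mean = 12, sd = 0.4)           # toy log(SalePrice)

# Toy out-of-fold predictions from three Level 0 models
l1_train <- data.frame(gbm_yhat  = y + rnorm(n, sd = 0.10),
                       xgb_yhat  = y + rnorm(n, sd = 0.12),
                       rngr_yhat = y + rnorm(n, sd = 0.15))

# lm() is a stand-in for the nnet meta-learner trained through caret
l1_mdl <- lm(y ~ ., data = cbind(l1_train, y = y))

# Toy Level 0 predictions for three test rows
l1_test <- data.frame(gbm_yhat  = c(11.9, 12.3, 12.0),
                      xgb_yhat  = c(12.0, 12.2, 12.1),
                      rngr_yhat = c(11.8, 12.4, 12.0))

test_yhat  <- unname(predict(l1_mdl, newdata = l1_test))
sale_price <- exp(test_yhat)                 # back from the log scale

coef(l1_mdl)                                 # weight given to each model
```

The Level 1 model sees only the three prediction columns, so it learns how to weight the Level 0 models rather than anything about the houses themselves.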
Appendix
For additional information on model stacking see these references:
- MLWave: Kaggle Ensembling Guide
- Kaggle Forum Posting: Stacking
- Winning Data Science Competitions: Jeong-Yoon Lee. This talk is about 90 minutes long. The sections relevant to model stacking are at 1:05:25 to 1:12:15 and 1:21:30 to 1:27:00 (h:mm:ss).