There is a lot of discussion online about the difference between Blending and Stacking, but most of it misses the point. I recently came across a good blog post (in English) that laid out the code and the discussion concisely, so I'm summarizing it here.
To put it bluntly, Blending and Stacking are essentially the same, with one difference: when training the base models, Blending does not use k-fold cross-validation (Stacking does). Instead, it sets aside a portion of the data, say 20%, which never participates in training the base models. Once the base models are trained, each of them predicts on this held-out portion, and those predictions (e.g., predicted probabilities) then serve as the features for the final (meta) model.
The code is as follows:
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
from sklearn.model_selection import KFold, train_test_split
class BlendingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, holdout_pct=0.2, use_features_in_secondary=False):
        self.base_models = base_models
        self.meta_model = meta_model
        self.holdout_pct = holdout_pct
        self.use_features_in_secondary = use_features_in_secondary

    def fit(self, X, y):
        self.base_models_ = [clone(x) for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        # Hold out a fraction of the data; the base models never see it
        X_train, X_holdout, y_train, y_holdout = train_test_split(
            X, y, test_size=self.holdout_pct)
        holdout_predictions = np.zeros((X_holdout.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models_):
            # Each base model is trained only on the non-holdout portion...
            model.fit(X_train, y_train)
            # ...then predicts the holdout rows; those predictions become
            # the meta model's training features
            holdout_predictions[:, i] = model.predict(X_holdout)
        if self.use_features_in_secondary:
            # Optionally append the original holdout features to the meta features
            self.meta_model_.fit(np.hstack((X_holdout, holdout_predictions)), y_holdout)
        else:
            self.meta_model_.fit(holdout_predictions, y_holdout)
        return self

    def predict(self, X):
        # Stack each base model's predictions column-wise as meta features
        meta_features = np.column_stack([
            model.predict(X) for model in self.base_models_
        ])
        if self.use_features_in_secondary:
            return self.meta_model_.predict(np.hstack((X, meta_features)))
        else:
            return self.meta_model_.predict(meta_features)
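To make the class above concrete, here is a minimal usage sketch; the synthetic dataset and the specific base/meta models (Lasso, RandomForestRegressor, Ridge) are my own illustrative assumptions, not from the original post:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

blend = BlendingAveragedModels(
    base_models=[Lasso(alpha=0.01), RandomForestRegressor(n_estimators=100, random_state=0)],
    meta_model=Ridge(),
    holdout_pct=0.2,
)
blend.fit(X, y)
blend_preds = blend.predict(X)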
The advantage of Blending is shorter training time, which is easy to understand: since a chunk of the data is pulled out as a holdout set, the base models are trained on less data, and the meta model is then trained on only the small holdout set, so the total time naturally drops. The drawbacks are also clear, and they mostly come down to the holdout set being small. First, the base models are trained on less data than in Stacking. Second, the meta model is trained on only the small holdout set, which can cause it to overfit that holdout data. Third, because the holdout data differs from the training data, accuracy is naturally lower than with the k-fold Stacking approach.
Stacking, on the other hand, partitions the data into n folds and iterates over them: in each iteration, n-1 folds serve as the training set and the remaining fold is predicted, with the predictions written into the meta-feature matrix. After n iterations, every sample has an out-of-fold prediction score, and of course much of the data gets reused across folds. The benefit is that the data is used to the fullest, so accuracy tends to be higher. The drawback is a potential information-leakage risk: although each fold's model never predicts its own training rows, the n training sets overlap heavily, so information about the targets can still seep into the out-of-fold features.
(The original post includes two figures here, showing the first and second fold iterations: in each round a different fold is held out and predicted by a model trained on the remaining folds.)
The code is as follows:
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5, use_features_in_secondary=False):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
        self.use_features_in_secondary = use_features_in_secondary

    def fit(self, X, y):
        """Fit all the models on the given dataset"""
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        # Train cloned base models and create out-of-fold predictions
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                # A fresh clone per fold, trained on the other n-1 folds
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                # Predict the held-out fold; after all n folds, every row
                # has exactly one out-of-fold prediction
                out_of_fold_predictions[holdout_index, i] = instance.predict(X[holdout_index])
        if self.use_features_in_secondary:
            self.meta_model_.fit(np.hstack((X, out_of_fold_predictions)), y)
        else:
            self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    def predict(self, X):
        # Average each base model's n per-fold clones, then stack column-wise
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_
        ])
        if self.use_features_in_secondary:
            return self.meta_model_.predict(np.hstack((X, meta_features)))
        else:
            return self.meta_model_.predict(meta_features)
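For completeness, the stacking class can be driven the same way; again, the data and model choices below are illustrative assumptions on my part:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor

# Same synthetic data; note that fit() builds n_folds clones per base model
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

stack = StackingAveragedModels(
    base_models=[Lasso(alpha=0.01), RandomForestRegressor(n_estimators=100, random_state=0)],
    meta_model=Ridge(),
    n_folds=5,
)
stack.fit(X, y)
stack_preds = stack.predict(X)

Note the design choice in predict: each base model's n per-fold clones are averaged rather than refitting a single model on all the data, which keeps the meta features consistent with how they were generated during fit.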