1. blending
Split the data into train and test. For each model_i (e.g. xgboost), first train it on all of the training data and predict the test set, producing a prediction vector v_i. Then run 5-fold CV on the training set: for each fold, train model_i_j on the other 4 folds, hold out the remaining fold as validation, and predict that validation fold to get t_i_j. Concatenating the 5 fold predictions yields t_i, which corresponds to v_i. Every model produces such a pair of vectors. Finally, a top-level model (e.g. LR or another linear model) is trained on the t vectors; this blender model then predicts on the v vectors.
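The procedure above can be sketched for a single base model; this is a minimal illustration with sklearn's KFold and a RandomForest standing in for xgboost, on an assumed synthetic dataset (the names t_i, v_i, model_i_j mirror the text):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# v_i: train model_i on all of the training data, predict the test set
model_i = RandomForestRegressor(n_estimators=50, random_state=0)
v_i = model_i.fit(X_train, y_train).predict(X_test)

# t_i: 5-fold out-of-fold predictions on the training set
t_i = np.zeros(len(X_train))
for train_idx, val_idx in KFold(n_splits=5).split(X_train):
    model_i_j = RandomForestRegressor(n_estimators=50, random_state=0)
    model_i_j.fit(X_train[train_idx], y_train[train_idx])
    t_i[val_idx] = model_i_j.predict(X_train[val_idx])  # t_i_j for this fold

# (t_i, y_train) trains the top-level model; v_i is its test-set input
```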
In other words, we need to generate a table like the one below: the training-set entries come from the cross-validated splits, while the test-set entries come from training on all of the training data and predicting the test set.
id   model_1  model_2  model_3  model_4  label
1    0.1      0.2      0.14     0.15     0
2    0.2      0.22     0.18     0.3      1
3    0.8      0.7      0.88     0.6      1
4    0.3      0.3      0.2      0.22     0
5    0.5      0.3      0.6      0.5      1
The advantages of blending: it is simpler than stacking and avoids information leakage, since the generalizers and the stacker use different data; other models can be added to the blender at any time.
Differences from stacking:
stacking predicts the test set directly with a model trained on the full training data;
in blending, each of the n CV sub-models also predicts the test set, and the n predictions are averaged.
Blending: train different base models on disjoint subsets of the data and take a (weighted) average of their outputs.
Stacking: split the training data into two disjoint sets; train several learners on the first set; run them on the second set; then use those predictions as inputs, and the true labels as outputs, to train a higher-level learner.
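The two-disjoint-set scheme in the stacking definition above can be sketched as follows; a minimal illustration with assumed names, two base learners and a linear meta-learner on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=1)
# first set trains the base learners, second set trains the meta-learner
X_base, X_meta, y_base, y_meta = train_test_split(X, y, test_size=0.5, random_state=1)

base = [RandomForestRegressor(n_estimators=30, random_state=1),
        KNeighborsRegressor(n_neighbors=5)]
for m in base:
    m.fit(X_base, y_base)

# base-model predictions on the second set become the meta-features
Z_meta = np.column_stack([m.predict(X_meta) for m in base])
meta = LinearRegression().fit(Z_meta, y_meta)
```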
Model-ensembling modules
## model-ensembling modules
from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
from heamy.pipeline import ModelsPipeline
## common sklearn modules
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor  ## random-forest regressor
from sklearn.neighbors import KNeighborsRegressor  ## k-nearest-neighbors regressor
from sklearn.linear_model import LinearRegression  ## linear regression model
from sklearn.model_selection import train_test_split  ## train/test split helper
from sklearn.metrics import mean_absolute_error  ## evaluation metric
from sklearn import metrics
import pandas as pd
import os
os.chdir('F://gbdt學(xué)習(xí)')  ## working directory for the output file
data = load_boston()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
stack
df = pd.DataFrame({'y_test': y_test})  ## collect test-set predictions from each approach
# create dataset
dataset = Dataset(X_train,y_train,X_test)
# initialize RandomForest & LinearRegression
model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 50},name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression,parameters={'normalize': True},name='lr')
pipeline = ModelsPipeline(model_rf,model_lr)
stack_ds = pipeline.stack(k=10,seed=111)
# Train LinearRegression on the stacked data (second stage)
stacker = Regressor(dataset=stack_ds, estimator=LinearRegression)
results = stacker.predict()  ## predictions for the test set
df['stacks'] = results
# Validate results using 10-fold cross-validation
results = stacker.validate(k=10, scorer=mean_absolute_error)
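If heamy is unavailable, the same k-fold stacking step is covered by scikit-learn's built-in StackingRegressor; a sketch on an assumed synthetic dataset in place of the boston data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=111)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)

stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(n_estimators=50, random_state=111)),
                ('lr', LinearRegression())],
    final_estimator=LinearRegression(),
    cv=10)  # 10-fold out-of-fold predictions, like pipeline.stack(k=10)
stack.fit(X_train, y_train)
mae = mean_absolute_error(y_test, stack.predict(X_test))
```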
blending
# load boston dataset from sklearn
from sklearn.datasets import load_boston
data = load_boston()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
# create dataset
dataset = Dataset(X_train,y_train,X_test)
# initialize RandomForest & LinearRegression
model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 50},name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression,parameters={'normalize': True},name='lr')
# Stack two models
# Returns new dataset with out-of-fold predictions
pipeline = ModelsPipeline(model_rf, model_lr)
stack_ds = pipeline.blend(proportion=0.2,seed=111)
# Train LinearRegression on the blended data (second stage)
stacker = Regressor(dataset=stack_ds, estimator=LinearRegression)
results = stacker.predict()  ## predictions for the test set
df['blend'] = results
# Validate results using 10-fold cross-validation
results = stacker.validate(k=10, scorer=mean_absolute_error)
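pipeline.blend(proportion=0.2) above holds out 20% of the training data to fit the second stage; a manual sklearn sketch of the same idea, on an assumed synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=8.0, random_state=111)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
# hold out 20% of the training data for the blender (proportion=0.2)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_train, y_train,
                                                test_size=0.2, random_state=111)

models = [RandomForestRegressor(n_estimators=50, random_state=111), LinearRegression()]
for m in models:
    m.fit(X_fit, y_fit)

# second-stage features: base-model predictions on the holdout and test sets
Z_hold = np.column_stack([m.predict(X_hold) for m in models])
Z_test = np.column_stack([m.predict(X_test) for m in models])
blender = LinearRegression().fit(Z_hold, y_hold)
preds = blender.predict(Z_test)
```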
weights
model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 151},name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression,parameters={'normalize': True},name='lr')
model_knn = Regressor(dataset=dataset, estimator=KNeighborsRegressor,parameters={'n_neighbors': 15},name='knn')
pipeline = ModelsPipeline(model_rf,model_lr,model_knn)
weights = pipeline.find_weights(mean_absolute_error)
result = pipeline.weight(weights)
results = result.execute()  ## weighted-average predictions for the test set
metrics.mean_absolute_error(y_test,results)
df['weights'] = results
df.to_csv('results.csv',index=False)
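find_weights above searches for averaging weights that minimize the given scorer; the idea can be sketched as a simple grid search over a single weight for two models (all names and the stand-in predictions here are assumptions, not heamy's internals):

```python
import numpy as np

# assumed stand-ins: true targets and two base models' predictions
rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
pred_a = y_true + rng.normal(scale=0.3, size=200)  # stronger model
pred_b = y_true + rng.normal(scale=0.6, size=200)  # weaker model

best_w, best_mae = 0.0, np.inf
for w in np.linspace(0, 1, 101):  # weight on model a; (1 - w) on model b
    mae = np.abs(y_true - (w * pred_a + (1 - w) * pred_b)).mean()
    if mae < best_mae:
        best_w, best_mae = w, mae
```

The weighted average with the best grid weight can never do worse than either single model on the search data, since w=0 and w=1 are both grid points.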