Task 5:模型融合

5.4 代碼示例

5.4.1 回歸\分類概率-融合:

1)簡單加權(quán)平均,結(jié)果直接融合

## 生成一些簡單的樣本數(shù)據(jù),test_prei 代表第i個模型的預(yù)測值
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

# y_test_true 代表第模型的真實值
y_test_true = [1, 3, 2, 6] 
import numpy as np
import pandas as pd

## 定義結(jié)果的加權(quán)平均函數(shù)
def Weighted_method(test_pre1,test_pre2,test_pre3,w=[1/3,1/3,1/3]):
    Weighted_result = w[0]*pd.Series(test_pre1)+w[1]*pd.Series(test_pre2)+w[2]*pd.Series(test_pre3)
    return Weighted_result
from sklearn import metrics
# 各模型的預(yù)測結(jié)果計算MAE
print('Pred1 MAE:',metrics.mean_absolute_error(y_test_true, test_pre1))
print('Pred2 MAE:',metrics.mean_absolute_error(y_test_true, test_pre2))
print('Pred3 MAE:',metrics.mean_absolute_error(y_test_true, test_pre3))
Pred1 MAE: 0.175
Pred2 MAE: 0.075
Pred3 MAE: 0.1
## 根據(jù)加權(quán)計算MAE
w = [0.3,0.4,0.3] # 定義比重權(quán)值
Weighted_pre = Weighted_method(test_pre1,test_pre2,test_pre3,w)
print('Weighted_pre MAE:',metrics.mean_absolute_error(y_test_true, Weighted_pre))
Weighted_pre MAE: 0.0575

可以發(fā)現(xiàn)加權(quán)結(jié)果相對于之前的結(jié)果是有提升的惩歉,這種我們稱其為簡單的加權(quán)平均撞叽。

還有一些特殊的形式季蚂,比如mean平均,median平均

## 定義結(jié)果的加權(quán)平均函數(shù)
def Mean_method(test_pre1,test_pre2,test_pre3):
    Mean_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).mean(axis=1)
    return Mean_result
Mean_pre = Mean_method(test_pre1,test_pre2,test_pre3)
print('Mean_pre MAE:',metrics.mean_absolute_error(y_test_true, Mean_pre))
Mean_pre MAE: 0.0666666666667
## 定義結(jié)果的加權(quán)平均函數(shù)
def Median_method(test_pre1,test_pre2,test_pre3):
    Median_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).median(axis=1)
    return Median_result
Median_pre = Median_method(test_pre1,test_pre2,test_pre3)
print('Median_pre MAE:',metrics.mean_absolute_error(y_test_true, Median_pre))
Median_pre MAE: 0.075

2) Stacking融合(回歸):

from sklearn import linear_model

def Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,test_pre1,test_pre2,test_pre3,model_L2= linear_model.LinearRegression()):
    model_L2.fit(pd.concat([pd.Series(train_reg1),pd.Series(train_reg2),pd.Series(train_reg3)],axis=1).values,y_train_true)
    Stacking_result = model_L2.predict(pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).values)
    return Stacking_result
## 生成一些簡單的樣本數(shù)據(jù)乒躺,test_prei 代表第i個模型的預(yù)測值
train_reg1 = [3.2, 8.2, 9.1, 5.2]
train_reg2 = [2.9, 8.1, 9.0, 4.9]
train_reg3 = [3.1, 7.9, 9.2, 5.0]
# y_test_true 代表第模型的真實值
y_train_true = [3, 8, 9, 5] 

test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

# y_test_true 代表第模型的真實值
y_test_true = [1, 3, 2, 6] 
model_L2= linear_model.LinearRegression()
Stacking_pre = Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,
                               test_pre1,test_pre2,test_pre3,model_L2)
print('Stacking_pre MAE:',metrics.mean_absolute_error(y_test_true, Stacking_pre))
Stacking_pre MAE: 0.0421348314607

可以發(fā)現(xiàn)模型結(jié)果相對于之前有進(jìn)一步的提升招盲,這是我們需要注意的一點是,對于第二層Stacking的模型不宜選取的過于復(fù)雜嘉冒,這樣會導(dǎo)致模型在訓(xùn)練集上過擬合曹货,從而使得在測試集上并不能達(dá)到很好的效果。

5.4.2 分類模型融合:

對于分類讳推,同樣的可以使用融合方法顶籽,比如簡單投票,Stacking...

from sklearn.datasets import make_blobs
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

1)Voting投票機(jī)制:

Voting即投票機(jī)制银觅,分為軟投票和硬投票兩種礼饱,其原理采用少數(shù)服從多數(shù)的思想。

'''
硬投票:對多個模型直接進(jìn)行投票,不區(qū)分模型結(jié)果的相對重要度镊绪,最終投票數(shù)最多的類為最終被預(yù)測的類匀伏。
'''
iris = datasets.load_iris()

x=iris.data
y=iris.target
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)

clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2, subsample=0.7,
                     colsample_bytree=0.6, objective='binary:logistic')
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                              min_samples_leaf=63,oob_score=True)
clf3 = SVC(C=0.1)

# 硬投票
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)], voting='hard')
for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.97 (+/- 0.02) [XGBBoosting]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.95 (+/- 0.03) [SVM]
Accuracy: 0.94 (+/- 0.04) [Ensemble]
'''
軟投票:和硬投票原理相同,增加了設(shè)置權(quán)重的功能蝴韭,可以為不同模型設(shè)置不同權(quán)重够颠,進(jìn)而區(qū)別模型不同的重要度。
'''
x=iris.data
y=iris.target
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)

clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2, subsample=0.8,
                     colsample_bytree=0.8, objective='binary:logistic')
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                              min_samples_leaf=63,oob_score=True)
clf3 = SVC(C=0.1, probability=True)

# 軟投票
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)], voting='soft', weights=[2, 1, 1])
clf1.fit(x_train, y_train)

for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.96 (+/- 0.02) [XGBBoosting]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.95 (+/- 0.03) [SVM]
Accuracy: 0.96 (+/- 0.02) [Ensemble]

2)分類的Stacking\Blending融合:

stacking是一種分層模型集成框架万皿。

以兩層為例摧找,第一層由多個基學(xué)習(xí)器組成敛摘,其輸入為原始訓(xùn)練集枚抵,第二層的模型則是以第一層基學(xué)習(xí)器的輸出作為訓(xùn)練集進(jìn)行再訓(xùn)練胸竞,從而得到完整的stacking模型, stacking兩層模型都使用了全部的訓(xùn)練數(shù)據(jù)。

'''
5-Fold Stacking
'''
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier,GradientBoostingClassifier
import pandas as pd
#創(chuàng)建訓(xùn)練的數(shù)據(jù)集
data_0 = iris.data
data = data_0[:100,:]

target_0 = iris.target
target = target_0[:100]

#模型融合中使用到的各個單模型
clfs = [LogisticRegression(solver='lbfgs'),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]
 
#切分一部分?jǐn)?shù)據(jù)作為測試集
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=2020)

dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
dataset_blend_test = np.zeros((X_predict.shape[0], len(clfs)))

#5折stacking
n_splits = 5
skf = StratifiedKFold(n_splits)
skf = skf.split(X, y)

for j, clf in enumerate(clfs):
    #依次訓(xùn)練各個單模型
    dataset_blend_test_j = np.zeros((X_predict.shape[0], 5))
    for i, (train, test) in enumerate(skf):
        #5-Fold交叉訓(xùn)練减余,使用第i個部分作為預(yù)測,剩余的部分來訓(xùn)練模型惩系,獲得其預(yù)測的輸出作為第i部分的新特征位岔。
        X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]
        clf.fit(X_train, y_train)
        y_submission = clf.predict_proba(X_test)[:, 1]
        dataset_blend_train[test, j] = y_submission
        dataset_blend_test_j[:, i] = clf.predict_proba(X_predict)[:, 1]
    #對于測試集,直接用這k個模型的預(yù)測值均值作為新的特征堡牡。
    dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_blend_test[:, j]))

clf = LogisticRegression(solver='lbfgs')
clf.fit(dataset_blend_train, y)
y_submission = clf.predict_proba(dataset_blend_test)[:, 1]

print("Val auc Score of Stacking: %f" % (roc_auc_score(y_predict, y_submission)))

val auc Score: 1.000000
val auc Score: 0.500000
val auc Score: 0.500000
val auc Score: 0.500000
val auc Score: 0.500000
Val auc Score of Stacking: 1.000000

Blending抒抬,其實和Stacking是一種類似的多層模型融合的形式

其主要思路是把原始的訓(xùn)練集先分成兩部分,比如70%的數(shù)據(jù)作為新的訓(xùn)練集晤柄,剩下30%的數(shù)據(jù)作為測試集擦剑。

在第一層,我們在這70%的數(shù)據(jù)上訓(xùn)練多個模型芥颈,然后去預(yù)測那30%數(shù)據(jù)的label惠勒,同時也預(yù)測test集的label。

在第二層爬坑,我們就直接用這30%數(shù)據(jù)在第一層預(yù)測的結(jié)果做為新特征繼續(xù)訓(xùn)練纠屋,然后用test集第一層預(yù)測的label做特征,用第二層訓(xùn)練的模型做進(jìn)一步預(yù)測

其優(yōu)點在于:

  • 1.比stacking簡單(因為不用進(jìn)行k次的交叉驗證來獲得stacker feature)
  • 2.避開了一個信息泄露問題:generlizers和stacker使用了不一樣的數(shù)據(jù)集

缺點在于:

  • 1.使用了很少的數(shù)據(jù)(第二階段的blender只使用training set10%的量)
  • 2.blender可能會過擬合
  • 3.stacking使用多次的交叉驗證會比較穩(wěn)健
    '''
'''
Blending
'''
 
#創(chuàng)建訓(xùn)練的數(shù)據(jù)集
#創(chuàng)建訓(xùn)練的數(shù)據(jù)集
data_0 = iris.data
data = data_0[:100,:]

target_0 = iris.target
target = target_0[:100]
 
#模型融合中使用到的各個單模型
clfs = [LogisticRegression(solver='lbfgs'),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        #ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]
 
#切分一部分?jǐn)?shù)據(jù)作為測試集
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=2020)

#切分訓(xùn)練數(shù)據(jù)集為d1,d2兩部分
X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.5, random_state=2020)
dataset_d1 = np.zeros((X_d2.shape[0], len(clfs)))
dataset_d2 = np.zeros((X_predict.shape[0], len(clfs)))
 
for j, clf in enumerate(clfs):
    #依次訓(xùn)練各個單模型
    clf.fit(X_d1, y_d1)
    y_submission = clf.predict_proba(X_d2)[:, 1]
    dataset_d1[:, j] = y_submission
    #對于測試集盾计,直接用這k個模型的預(yù)測值作為新的特征售担。
    dataset_d2[:, j] = clf.predict_proba(X_predict)[:, 1]
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_d2[:, j]))

#融合使用的模型
clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=30)
clf.fit(dataset_d1, y_d2)
y_submission = clf.predict_proba(dataset_d2)[:, 1]
print("Val auc Score of Blending: %f" % (roc_auc_score(y_predict, y_submission)))
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
Val auc Score of Blending: 1.000000

參考博客:https://blog.csdn.net/Noob_daniel/article/details/76087829

3)分類的Stacking融合(利用mlxtend):

!pip install mlxtend

import warnings
warnings.filterwarnings('ignore')
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier

from sklearn.model_selection import cross_val_score
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions

# 以python自帶的鳶尾花數(shù)據(jù)集為例
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)

label = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]

fig = plt.figure(figsize=(10,8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)

clf_cv_mean = []
clf_cv_std = []
for clf, label, grd in zip(clf_list, label, grid):
        
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label))
    clf_cv_mean.append(scores.mean())
    clf_cv_std.append(scores.std())
        
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(label)

plt.show()

可以發(fā)現(xiàn) 基模型 用 'KNN', 'Random Forest', 'Naive Bayes' 然后再這基礎(chǔ)上 次級模型加一個 'LogisticRegression',模型測試效果有著很好的提升署辉。

5.4.3 一些其它方法:

將特征放進(jìn)模型中預(yù)測族铆,并將預(yù)測結(jié)果變換并作為新的特征加入原有特征中再經(jīng)過模型預(yù)測結(jié)果 (Stacking變化)

(可以反復(fù)預(yù)測多次將結(jié)果加入最后的特征中)

def Ensemble_add_feature(train,test,target,clfs):
    
    # n_flods = 5
    # skf = list(StratifiedKFold(y, n_folds=n_flods))

    train_ = np.zeros((train.shape[0],len(clfs*2)))
    test_ = np.zeros((test.shape[0],len(clfs*2)))

    for j,clf in enumerate(clfs):
        '''依次訓(xùn)練各個單模型'''
        # print(j, clf)
        '''使用第1個部分作為預(yù)測,第2部分來訓(xùn)練模型末捣,獲得其預(yù)測的輸出作為第2部分的新特征邦邦。'''
        # X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]

        clf.fit(train,target)
        y_train = clf.predict(train)
        y_test = clf.predict(test)

        ## 新特征生成
        train_[:,j*2] = y_train**2
        test_[:,j*2] = y_test**2
        train_[:, j+1] = np.exp(y_train)
        test_[:, j+1] = np.exp(y_test)
        # print("val auc Score: %f" % r2_score(y_predict, dataset_d2[:, j]))
        print('Method ',j)
    
    train_ = pd.DataFrame(train_)
    test_ = pd.DataFrame(test_)
    return train_,test_

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

data_0 = iris.data
data = data_0[:100,:]

target_0 = iris.target
target = target_0[:100]

x_train,x_test,y_train,y_test=train_test_split(data,target,test_size=0.3)
x_train = pd.DataFrame(x_train) ; x_test = pd.DataFrame(x_test)

#模型融合中使用到的各個單模型
clfs = [LogisticRegression(),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]

New_train,New_test = Ensemble_add_feature(x_train,x_test,y_train,clfs)

clf = LogisticRegression()
# clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=30)
clf.fit(New_train, y_train)
y_emb = clf.predict_proba(New_test)[:, 1]

print("Val auc Score of stacking: %f" % (roc_auc_score(y_test, y_emb)))
Method  0
Method  1
Method  2
Method  3
Method  4
Val auc Score of stacking: 1.000000

5.4.4 本賽題示例

import pandas as pd
import numpy as np
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
%matplotlib inline

import itertools
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
# from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
# from mlxtend.plotting import plot_learning_curves
# from mlxtend.plotting import plot_decision_regions

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import GridSearchCV,cross_val_score
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error
## 數(shù)據(jù)讀取
Train_data = pd.read_csv('datalab/231784/used_car_train_20200313.csv', sep=' ')
TestA_data = pd.read_csv('datalab/231784/used_car_testA_20200313.csv', sep=' ')

print(Train_data.shape)
print(TestA_data.shape)
(150000, 31)
(50000, 30)
Train_data.head()
numerical_cols = Train_data.select_dtypes(exclude = 'object').columns
print(numerical_cols)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')
feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','price']]
X_data = Train_data[feature_cols]
Y_data = Train_data['price']

X_test  = TestA_data[feature_cols]

print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
X train shape: (150000, 26)
X test shape: (50000, 26)
def Sta_inf(data):
    print('_min',np.min(data))
    print('_max:',np.max(data))
    print('_mean',np.mean(data))
    print('_ptp',np.ptp(data))
    print('_std',np.std(data))
    print('_var',np.var(data))
print('Sta of label:')
Sta_inf(Y_data)
Sta of label:
_min 11
_max: 99999
_mean 5923.32733333
_ptp 99988
_std 7501.97346988
_var 56279605.9427
X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)
def build_model_lr(x_train,y_train):
    reg_model = linear_model.LinearRegression()
    reg_model.fit(x_train,y_train)
    return reg_model

def build_model_ridge(x_train,y_train):
    reg_model = linear_model.Ridge(alpha=0.8)#alphas=range(1,100,5)
    reg_model.fit(x_train,y_train)
    return reg_model

def build_model_lasso(x_train,y_train):
    reg_model = linear_model.LassoCV()
    reg_model.fit(x_train,y_train)
    return reg_model

def build_model_gbdt(x_train,y_train):
    estimator =GradientBoostingRegressor(loss='ls',subsample= 0.85,max_depth= 5,n_estimators = 100)
    param_grid = { 
            'learning_rate': [0.05,0.08,0.1,0.2],
            }
    gbdt = GridSearchCV(estimator, param_grid,cv=3)
    gbdt.fit(x_train,y_train)
    print(gbdt.best_params_)
    # print(gbdt.best_estimator_ )
    return gbdt

def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=120, learning_rate=0.08, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=5) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=63,n_estimators = 100)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm

2)XGBoost的五折交叉回歸驗證實現(xiàn)

## xgb
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) # ,objective ='reg:squarederror'

scores_train = []
scores = []

## 5折交叉驗證方式
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(X_data,Y_data):
    
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)

print('Train mae:',np.mean(score_train))
print('Val mae',np.mean(scores))
Train mae: 558.212360169
Val mae 693.120168439

3)劃分?jǐn)?shù)據(jù)集倒得,并用多種方法訓(xùn)練和預(yù)測

## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)

## Train and Predict
print('Predict LR...')
model_lr = build_model_lr(x_train,y_train)
val_lr = model_lr.predict(x_val)
subA_lr = model_lr.predict(X_test)

print('Predict Ridge...')
model_ridge = build_model_ridge(x_train,y_train)
val_ridge = model_ridge.predict(x_val)
subA_ridge = model_ridge.predict(X_test)

print('Predict Lasso...')
model_lasso = build_model_lasso(x_train,y_train)
val_lasso = model_lasso.predict(x_val)
subA_lasso = model_lasso.predict(X_test)

print('Predict GBDT...')
model_gbdt = build_model_gbdt(x_train,y_train)
val_gbdt = model_gbdt.predict(x_val)
subA_gbdt = model_gbdt.predict(X_test)

Predict LR...
Predict Ridge...
Predict Lasso...
Predict GBDT...
{'learning_rate': 0.1, 'n_estimators': 80}

一般比賽中效果最為顯著的兩種方法

print('predict XGB...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
subA_xgb = model_xgb.predict(X_test)

print('predict lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
subA_lgb = model_lgb.predict(X_test)
predict XGB...
predict lgb...
print('Sta inf of lgb:')
Sta_inf(subA_lgb)
Sta inf of lgb:
_min -126.864734992
_max: 90152.4775557
_mean 5917.96632163
_ptp 90279.3422907
_std 7358.88582391
_var 54153200.5693

1)加權(quán)融合

def Weighted_method(test_pre1,test_pre2,test_pre3,w=[1/3,1/3,1/3]):
    Weighted_result = w[0]*pd.Series(test_pre1)+w[1]*pd.Series(test_pre2)+w[2]*pd.Series(test_pre3)
    return Weighted_result

## Init the Weight
w = [0.3,0.4,0.3]

## 測試驗證集準(zhǔn)確度
val_pre = Weighted_method(val_lgb,val_xgb,val_gbdt,w)
MAE_Weighted = mean_absolute_error(y_val,val_pre)
print('MAE of Weighted of val:',MAE_Weighted)

## 預(yù)測數(shù)據(jù)部分
subA = Weighted_method(subA_lgb,subA_xgb,subA_gbdt,w)
print('Sta inf:')
Sta_inf(subA)
## 生成提交文件
sub = pd.DataFrame()
sub['SaleID'] = X_test.index
sub['price'] = subA
sub.to_csv('./sub_Weighted.csv',index=False)
MAE of Weighted of val: 730.877443666
Sta inf:
_min -2816.93914153
_max: 88576.7842223
_mean 5920.38233546
_ptp 91393.7233639
_std 7325.20946801
_var 53658693.7502
## 與簡單的LR(線性回歸)進(jìn)行對比
val_lr_pred = model_lr.predict(x_val)
MAE_lr = mean_absolute_error(y_val,val_lr_pred)
print('MAE of lr:',MAE_lr)
MAE of lr: 2597.45638384

2)Stacking融合

## Starking

## 第一層
train_lgb_pred = model_lgb.predict(x_train)
train_xgb_pred = model_xgb.predict(x_train)
train_gbdt_pred = model_gbdt.predict(x_train)

Strak_X_train = pd.DataFrame()
Strak_X_train['Method_1'] = train_lgb_pred
Strak_X_train['Method_2'] = train_xgb_pred
Strak_X_train['Method_3'] = train_gbdt_pred

Strak_X_val = pd.DataFrame()
Strak_X_val['Method_1'] = val_lgb
Strak_X_val['Method_2'] = val_xgb
Strak_X_val['Method_3'] = val_gbdt

Strak_X_test = pd.DataFrame()
Strak_X_test['Method_1'] = subA_lgb
Strak_X_test['Method_2'] = subA_xgb
Strak_X_test['Method_3'] = subA_gbdt
Strak_X_test.head()
## level2-method 
model_lr_Stacking = build_model_lr(Strak_X_train,y_train)
## 訓(xùn)練集
train_pre_Stacking = model_lr_Stacking.predict(Strak_X_train)
print('MAE of Stacking-LR:',mean_absolute_error(y_train,train_pre_Stacking))

## 驗證集
val_pre_Stacking = model_lr_Stacking.predict(Strak_X_val)
print('MAE of Stacking-LR:',mean_absolute_error(y_val,val_pre_Stacking))

## 預(yù)測集
print('Predict Stacking-LR...')
subA_Stacking = model_lr_Stacking.predict(Strak_X_test)

MAE of Stacking-LR: 628.399441036
MAE of Stacking-LR: 707.673951794
Predict Stacking-LR...
subA_Stacking[subA_Stacking<10]=10  ## 去除過小的預(yù)測值

sub = pd.DataFrame()
sub['SaleID'] = TestA_data.SaleID
sub['price'] = subA_Stacking
sub.to_csv('./sub_Stacking.csv',index=False)
print('Sta inf:')
Sta_inf(subA_Stacking)
Sta inf:
_min 10.0
_max: 90849.3729816
_mean 5917.39429976
_ptp 90839.3729816
_std 7396.09766172
_var 54702260.6217

3.4 經(jīng)驗總結(jié)

比賽的融合這個問題,個人的看法來說其實涉及多個層面帆竹,也是提分和提升模型魯棒性的一種重要方法:

  • 1)結(jié)果層面的融合秒紧,這種是最常見的融合方法,其可行的融合方法也有很多臭笆,比如根據(jù)結(jié)果的得分進(jìn)行加權(quán)融合绩聘,還可以做Log,exp處理等耗啦。在做結(jié)果融合的時候,有一個很重要的條件是模型結(jié)果的得分要比較近似机杜,然后結(jié)果的差異要比較大帜讲,這樣的結(jié)果融合往往有比較好的效果提升。

  • 2)特征層面的融合椒拗,這個層面其實感覺不叫融合似将,準(zhǔn)確說可以叫分割,很多時候如果我們用同種模型訓(xùn)練蚀苛,可以把特征進(jìn)行切分給不同的模型在验,然后在后面進(jìn)行模型或者結(jié)果融合有時也能產(chǎn)生比較好的效果。

  • 3)模型層面的融合堵未,模型層面的融合可能就涉及模型的堆疊和設(shè)計腋舌,比如加Staking層,部分模型的結(jié)果作為特征輸入等渗蟹,這些就需要多實驗和思考了块饺,基于模型層面的融合最好不同模型類型要有一定的差異,用同種模型不同的參數(shù)的收益一般是比較小的雌芽。

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末授艰,一起剝皮案震驚了整個濱河市,隨后出現(xiàn)的幾起案子世落,更是在濱河造成了極大的恐慌淮腾,老刑警劉巖,帶你破解...
    沈念sama閱讀 207,248評論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異谷朝,居然都是意外死亡洲押,警方通過查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,681評論 2 381
  • 文/潘曉璐 我一進(jìn)店門徘禁,熙熙樓的掌柜王于貴愁眉苦臉地迎上來诅诱,“玉大人,你說我怎么就攤上這事送朱∧锏矗” “怎么了?”我有些...
    開封第一講書人閱讀 153,443評論 0 344
  • 文/不壞的土叔 我叫張陵驶沼,是天一觀的道長炮沐。 經(jīng)常有香客問我,道長回怜,這世上最難降的妖魔是什么大年? 我笑而不...
    開封第一講書人閱讀 55,475評論 1 279
  • 正文 為了忘掉前任,我火速辦了婚禮玉雾,結(jié)果婚禮上翔试,老公的妹妹穿的比我還像新娘。我一直安慰自己复旬,他們只是感情好垦缅,可當(dāng)我...
    茶點故事閱讀 64,458評論 5 374
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著驹碍,像睡著了一般壁涎。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上志秃,一...
    開封第一講書人閱讀 49,185評論 1 284
  • 那天怔球,我揣著相機(jī)與錄音,去河邊找鬼浮还。 笑死竟坛,一個胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的碑定。 我是一名探鬼主播流码,決...
    沈念sama閱讀 38,451評論 3 401
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼延刘!你這毒婦竟也來了漫试?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 37,112評論 0 261
  • 序言:老撾萬榮一對情侶失蹤碘赖,失蹤者是張志新(化名)和其女友劉穎驾荣,沒想到半個月后外构,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 43,609評論 1 300
  • 正文 獨居荒郊野嶺守林人離奇死亡播掷,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 36,083評論 2 325
  • 正文 我和宋清朗相戀三年审编,在試婚紗的時候發(fā)現(xiàn)自己被綠了。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片歧匈。...
    茶點故事閱讀 38,163評論 1 334
  • 序言:一個原本活蹦亂跳的男人離奇死亡垒酬,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出件炉,到底是詐尸還是另有隱情勘究,我是刑警寧澤,帶...
    沈念sama閱讀 33,803評論 4 323
  • 正文 年R本政府宣布斟冕,位于F島的核電站口糕,受9級特大地震影響,放射性物質(zhì)發(fā)生泄漏磕蛇。R本人自食惡果不足惜景描,卻給世界環(huán)境...
    茶點故事閱讀 39,357評論 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望秀撇。 院中可真熱鬧超棺,春花似錦、人聲如沸呵燕。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,357評論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽虏等。三九已至,卻和暖如春适肠,著一層夾襖步出監(jiān)牢的瞬間霍衫,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,590評論 1 261
  • 我被黑心中介騙來泰國打工侯养, 沒想到剛下飛機(jī)就差點兒被人妖公主榨干…… 1. 我叫王不留敦跌,地道東北人。 一個月前我還...
    沈念sama閱讀 45,636評論 2 355
  • 正文 我出身青樓逛揩,卻偏偏與公主長得像柠傍,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子辩稽,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 42,925評論 2 344

推薦閱讀更多精彩內(nèi)容

  • 同DataWhale一起組隊學(xué)習(xí):https://tianchi.aliyun.com/notebook-ai/d...
    612twilight閱讀 1,001評論 0 2
  • 一般來說惧笛,通過融合多個不同的模型,可能提升機(jī)器學(xué)習(xí)的性能逞泄,這一方法在各種機(jī)器學(xué)習(xí)比賽中廣泛應(yīng)用患整,比如在kaggle...
    塵囂看客閱讀 19,478評論 3 19
  • KAGGLE ENSEMBLING GUIDE 聲明:文章來自網(wǎng)絡(luò)拜效,手動翻譯筆記,僅做學(xué)習(xí)參考各谚。文末附上原地址紧憾,轉(zhuǎn)...
    壹刀_文閱讀 2,591評論 0 3
  • 1.Voting 投票法針對分類模型,多個模型的分類結(jié)果進(jìn)行投票昌渤,少數(shù)服從多數(shù)赴穗。除了公平投票外,還可以給投...
    ZAK_ML閱讀 2,247評論 0 1
  • 1.轉(zhuǎn)換看問題的角度。面對問題履婉,你把它看得大煤篙,它就大,你把它看得小毁腿,它就小辑奈。你越在意它,它就越過不去已烤。相反鸠窗,你忽略...
    踐生情素閱讀 168評論 0 0