Team up and learn with DataWhale: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12281978.0.0.6802593a2HCrSE&postId=95535
Model fusion is an important step in the later stages of a competition. Broadly speaking, the common approaches are:
- Simple weighted fusion:
  - Regression (or classification probabilities): arithmetic-mean fusion, geometric-mean fusion;
  - Classification: voting;
  - Combined: rank averaging, log fusion.
- Stacking/blending:
  - Build a multi-layer model, then fit a new model on the previous layer's predictions.
- Boosting/bagging (already used in XGBoost, AdaBoost, GBDT):
  - Multi-tree boosting methods.
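Rank averaging, listed above, converts each model's predictions to ranks before averaging, which makes the fusion insensitive to differently-scaled outputs. A minimal sketch (the toy predictions are invented for illustration; `scipy.stats.rankdata` is assumed available, as it ships with the scipy dependency of scikit-learn):

```python
import numpy as np
from scipy.stats import rankdata

# Toy predictions from two models on the same 4 samples (hypothetical values)
pred_a = np.array([0.9, 0.2, 0.6, 0.4])
pred_b = np.array([100.0, 10.0, 50.0, 70.0])  # similar ordering, very different scale

# Convert each model's scores to ranks, then average the ranks per sample
rank_avg = (rankdata(pred_a) + rankdata(pred_b)) / 2
print(rank_avg)  # [4.  1.  2.5 2.5]
```

Because only the orderings enter the average, a model that outputs values on a much larger scale does not dominate the fusion, which is why rank averaging is popular for AUC-scored competitions.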
Stacking: theory
1) What is stacking
In short, stacking first trains several base learners on the initial training data, then uses those learners' predictions as a new training set on which to learn a new learner.
The method used to combine individual learners is called a combination strategy. For classification problems, we can use voting and output the class that receives the most votes. For regression problems, we can average the learners' outputs.
Voting and averaging are both effective combination strategies. Yet another strategy is to use a separate machine learning algorithm to combine the individual learners' results; that method is stacking.
In stacking, the individual learners are called first-level learners, the learner used for combining is called the second-level learner or meta-learner, and the data used to train the meta-learner is called the second-level training set. The second-level training set is produced by applying the first-level learners to the training set.
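The voting and averaging strategies described above can be illustrated in a few lines (toy predictions invented for illustration):

```python
import numpy as np

# Classification: three base learners each predict a class per sample;
# majority voting picks the most frequent class
votes = np.array([[0, 0, 1],   # sample 1: predictions from 3 classifiers
                  [1, 1, 1]])  # sample 2
majority = [np.bincount(row).argmax() for row in votes]
print(majority)  # [0, 1]

# Regression: simply average the base learners' outputs per sample
preds = np.array([[2.9, 3.1, 3.0],
                  [5.8, 6.1, 6.0]])
print(preds.mean(axis=1))
```

Stacking replaces these fixed rules with a learned combiner: the meta-learner is free to weight, shift, or otherwise transform the base predictions.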
2) How to do stacking
The algorithm is sketched in the figure below:
(Figure from Zhou Zhihua's *Machine Learning*, the "watermelon book".)
- Steps 1-3 train the individual (first-level) learners.
- Steps 5-9 use the trained learners to produce predictions; these predictions serve as the training set for the second-level learner.
- Step 11 trains the second-level learner on the first-level predictions, yielding the final model.
3) The stacking method explained
First, let's start from a "not quite correct" but easy-to-understand version of stacking.
A stacking model is essentially a layered structure. For simplicity, we only analyze two-level stacking. Suppose we have two base models, Model1_1 and Model1_2, and one second-level model, Model2.
Step 1. Train base model Model1_1 on the training set train, then use it to predict the label columns of train and test, giving P1 and T1.
Model1_1 training:
The trained Model1_1 predicts on train and test, producing predicted labels P1 and T1 respectively.
Step 2. Train base model Model1_2 on train, then use it to predict the label columns of train and test, giving P2 and T2.
Model1_2 training:
The trained Model1_2 predicts on train and test, producing predicted labels P2 and T2 respectively.
Step 3. Merge P1 with P2 and T1 with T2 to obtain a new training set train2 and a new test set test2.
Then train the second-level model Model2 with the true training labels as targets and train2 as features, and predict on test2 to obtain the final predicted label column for the test set.
That is the basic, naive idea of two-level stacking: add another model on top of the different models' predictions and retrain, obtaining the final prediction.
Stacking really is this direct an idea, but doing it naively can cause trouble when the training and test distributions are not quite the same. The problem is that predictions produced by models fit to the training labels are then retrained against those same true labels, which will inevitably overfit the training set to some degree, so the model's generalization to the test set may suffer. The question then becomes how to reduce this retraining overfitting, and there are generally two approaches:
- make the second-level model as simple as possible, e.g. a linear model;
- use K-fold cross-validation.
K-fold cross-validation (figures omitted: training procedure, prediction procedure).
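With scikit-learn, the K-fold scheme sketched above can be written compactly using `cross_val_predict`, which returns out-of-fold predictions: each sample is predicted by a model that never saw it during training, so the result is safe to use as the meta-learner's training features. A minimal sketch on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Out-of-fold predictions from two base learners
p1 = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=5)
p2 = cross_val_predict(SVC(random_state=0), X, y, cv=5)

# Stack the out-of-fold predictions as features for the meta-learner
meta_X = np.column_stack([p1, p2])
meta = LogisticRegression().fit(meta_X, y)
print(meta_X.shape)  # (150, 2)
```

This is the same idea as the hand-rolled 5-fold loop used later in this section, just expressed through the library helper.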
5.4 Code examples
5.4.1 Regression / classification-probability fusion:
1) Simple weighted averaging, fusing the results directly
## Generate some simple sample data; test_prei is the i-th model's predictions
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]
# y_test_true is the ground-truth values
y_test_true = [1, 3, 2, 6]
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
## Weighted average of the results
def Weighted_method(test_pre1, test_pre2, test_pre3, w=[1/3, 1/3, 1/3]):
    Weighted_result = w[0]*pd.Series(test_pre1) + w[1]*pd.Series(test_pre2) + w[2]*pd.Series(test_pre3)
    return Weighted_result
from sklearn import metrics
# MAE of each individual model's predictions
print('Pred1 MAE:',metrics.mean_absolute_error(y_test_true, test_pre1))
print('Pred2 MAE:',metrics.mean_absolute_error(y_test_true, test_pre2))
print('Pred3 MAE:',metrics.mean_absolute_error(y_test_true, test_pre3))
Pred1 MAE: 0.1750000000000001
Pred2 MAE: 0.07499999999999993
Pred3 MAE: 0.10000000000000009
## MAE of the weighted combination
w = [0.3, 0.4, 0.3]  # the weights
Weighted_pre = Weighted_method(test_pre1,test_pre2,test_pre3,w)
print('Weighted_pre MAE:',metrics.mean_absolute_error(y_test_true, Weighted_pre))
Weighted_pre MAE: 0.05750000000000027
The weighted result improves on the individual predictions; we call this simple weighted averaging.
There are also some special forms, such as the mean and the median.
## Mean of the results
def Mean_method(test_pre1, test_pre2, test_pre3):
    Mean_result = pd.concat([pd.Series(test_pre1), pd.Series(test_pre2), pd.Series(test_pre3)], axis=1).mean(axis=1)
    return Mean_result
Mean_pre = Mean_method(test_pre1,test_pre2,test_pre3)
print('Mean_pre MAE:',metrics.mean_absolute_error(y_test_true, Mean_pre))
Mean_pre MAE: 0.06666666666666693
## Median of the results
def Median_method(test_pre1, test_pre2, test_pre3):
    Median_result = pd.concat([pd.Series(test_pre1), pd.Series(test_pre2), pd.Series(test_pre3)], axis=1).median(axis=1)
    return Median_result
Median_pre = Median_method(test_pre1,test_pre2,test_pre3)
print('Median_pre MAE:',metrics.mean_absolute_error(y_test_true, Median_pre))
Median_pre MAE: 0.07500000000000007
2) Stacking fusion (regression):
from sklearn import linear_model
def Stacking_method(train_reg1, train_reg2, train_reg3, y_train_true, test_pre1, test_pre2, test_pre3, model_L2=linear_model.LinearRegression()):
    model_L2.fit(pd.concat([pd.Series(train_reg1), pd.Series(train_reg2), pd.Series(train_reg3)], axis=1).values, y_train_true)
    Stacking_result = model_L2.predict(pd.concat([pd.Series(test_pre1), pd.Series(test_pre2), pd.Series(test_pre3)], axis=1).values)
    return Stacking_result
## Generate some simple sample data; train_regi is the i-th model's predictions on the training set
train_reg1 = [3.2, 8.2, 9.1, 5.2]
train_reg2 = [2.9, 8.1, 9.0, 4.9]
train_reg3 = [3.1, 7.9, 9.2, 5.0]
# y_train_true is the ground truth of the training set
y_train_true = [3, 8, 9, 5]
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]
# y_test_true is the ground truth of the test set
y_test_true = [1, 3, 2, 6]
model_L2= linear_model.LinearRegression()
Stacking_pre = Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,
test_pre1,test_pre2,test_pre3,model_L2)
print('Stacking_pre MAE:',metrics.mean_absolute_error(y_test_true, Stacking_pre))
Stacking_pre MAE: 0.04213483146067476
The result improves further. One thing to note: the second-level stacking model should not be too complex, otherwise it will overfit the training set and fail to perform well on the test set.
5.4.2 Fusion of classification models:
For classification, the same fusion methods apply, e.g. simple voting, stacking, ...
from sklearn.datasets import make_blobs
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
1) Voting:
Voting comes in two flavors, soft and hard; the underlying principle is majority rule.
'''
Hard voting: the models vote directly, without weighting their outputs; the class with the most votes is the final prediction.
'''
iris = datasets.load_iris()
x=iris.data
y=iris.target
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)
clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2, subsample=0.7,
colsample_bytree=0.6, objective='binary:logistic')
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
min_samples_leaf=63,oob_score=True)
clf3 = SVC(C=0.1)
# Hard voting
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)], voting='hard')
for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.96 (+/- 0.02) [XGBBoosting]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.95 (+/- 0.03) [SVM]
Accuracy: 0.95 (+/- 0.03) [Ensemble]
'''
Soft voting: same principle as hard voting, with the added ability to set weights, so different models can be given different importance.
'''
x=iris.data
y=iris.target
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)
clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2, subsample=0.8,
colsample_bytree=0.8, objective='binary:logistic')
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
min_samples_leaf=63,oob_score=True)
clf3 = SVC(C=0.1, probability=True)
# Soft voting
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)], voting='soft', weights=[2, 1, 1])
clf1.fit(x_train, y_train)
for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.96 (+/- 0.02) [XGBBoosting]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.95 (+/- 0.03) [SVM]
Accuracy: 0.96 (+/- 0.02) [Ensemble]
2) Stacking / blending fusion for classification:
Stacking is a layered model-ensembling framework.
Taking two levels as an example: the first level consists of several base learners whose input is the original training set, and the second-level model is trained on the first-level learners' outputs, giving the complete stacking model. Both levels of stacking use all of the training data.
'''
5-Fold Stacking
'''
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier,GradientBoostingClassifier
import pandas as pd
# Create the training data
data_0 = iris.data
data = data_0[:100,:]
target_0 = iris.target
target = target_0[:100]
# The individual models used in the fusion
clfs = [LogisticRegression(solver='lbfgs'),
RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]
# Hold out part of the data as a test set
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=2020)
dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
dataset_blend_test = np.zeros((X_predict.shape[0], len(clfs)))
# 5-fold stacking
n_splits = 5
skf = StratifiedKFold(n_splits)
skf = skf.split(X, y)
for j, clf in enumerate(clfs):
    # Train each individual model in turn
    dataset_blend_test_j = np.zeros((X_predict.shape[0], 5))
    for i, (train, test) in enumerate(skf):
        # 5-fold cross-training: hold out fold i for prediction and train on the rest;
        # the held-out predictions become the new feature for fold i.
        X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]
        clf.fit(X_train, y_train)
        y_submission = clf.predict_proba(X_test)[:, 1]
        dataset_blend_train[test, j] = y_submission
        dataset_blend_test_j[:, i] = clf.predict_proba(X_predict)[:, 1]
    # For the test set, use the mean of the k fold-models' predictions as the new feature.
    dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_blend_test[:, j]))
clf = LogisticRegression(solver='lbfgs')
clf.fit(dataset_blend_train, y)
y_submission = clf.predict_proba(dataset_blend_test)[:, 1]
print("Val auc Score of Stacking: %f" % (roc_auc_score(y_predict, y_submission)))
val auc Score: 1.000000
val auc Score: 0.500000
val auc Score: 0.500000
val auc Score: 0.500000
val auc Score: 0.500000
Val auc Score of Stacking: 1.000000
Blending is a multi-layer fusion scheme similar to stacking.
Its main idea is to split the original training set into two parts, e.g. 70% of the data as the new training set and the remaining 30% as a holdout set.
At the first level, we train several models on the 70% and use them to predict the labels of the 30% holdout, and also the labels of the test set.
At the second level, we train a new model directly on the first-level predictions for the 30% holdout (as features), then feed the first-level predictions on the test set into that model to make the final prediction.
Advantages:
- 1. Simpler than stacking (no k rounds of cross-validation are needed to obtain the stacker features).
- 2. Avoids an information leak: the generalizers and the stacker use different data.
Disadvantages:
- 1. Uses only a small amount of data (the second-stage blender trains only on the holdout split).
- 2. The blender may overfit.
- 3. Stacking, which uses cross-validation repeatedly, is more robust.
'''
Blending
'''
# Create the training data
data_0 = iris.data
data = data_0[:100,:]
target_0 = iris.target
target = target_0[:100]
# The individual models used in the fusion
clfs = [LogisticRegression(solver='lbfgs'),
RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
#ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]
# Hold out part of the data as a test set
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=2020)
# Split the training data into two parts, d1 and d2
X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.5, random_state=2020)
dataset_d1 = np.zeros((X_d2.shape[0], len(clfs)))
dataset_d2 = np.zeros((X_predict.shape[0], len(clfs)))
for j, clf in enumerate(clfs):
    # Train each individual model in turn
    clf.fit(X_d1, y_d1)
    y_submission = clf.predict_proba(X_d2)[:, 1]
    dataset_d1[:, j] = y_submission
    # For the test set, use these k models' predictions directly as new features.
    dataset_d2[:, j] = clf.predict_proba(X_predict)[:, 1]
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_d2[:, j]))
# The model used for the fusion
clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=30)
clf.fit(dataset_d1, y_d2)
y_submission = clf.predict_proba(dataset_d2)[:, 1]
print("Val auc Score of Blending: %f" % (roc_auc_score(y_predict, y_submission)))
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
Val auc Score of Blending: 1.000000
Reference blog: https://blog.csdn.net/Noob_daniel/article/details/76087829
3) Stacking fusion for classification (using mlxtend):
!pip install mlxtend
import warnings
warnings.filterwarnings('ignore')
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions
# Use the iris dataset bundled with scikit-learn as an example
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)
label = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]
fig = plt.figure(figsize=(10,8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)
clf_cv_mean = []
clf_cv_std = []
for clf, label, grd in zip(clf_list, label, grid):
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %.2f (+/- %.2f) [%s]" % (scores.mean(), scores.std(), label))
    clf_cv_mean.append(scores.mean())
    clf_cv_std.append(scores.std())
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(label)
plt.show()
Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.93 (+/- 0.05) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [Naive Bayes]
Accuracy: 0.95 (+/- 0.03) [Stacking Classifier]
With 'KNN', 'Random Forest' and 'Naive Bayes' as base models and a 'LogisticRegression' added on top as the second-level model, the test performance improves nicely.
5.4.3 Some other methods:
Feed the features into models to get predictions, transform those predictions, and append them to the original features as new features; then run a model on the augmented feature set to get the final prediction (a stacking variant).
(The prediction step can be repeated several times, each time appending the results to the final feature set.)
def Ensemble_add_feature(train, test, target, clfs):
    train_ = np.zeros((train.shape[0], len(clfs) * 2))
    test_ = np.zeros((test.shape[0], len(clfs) * 2))
    for j, clf in enumerate(clfs):
        '''Train each individual model in turn'''
        clf.fit(train, target)
        y_train = clf.predict(train)
        y_test = clf.predict(test)
        ## Generate new features from the predictions
        train_[:, j*2] = y_train**2
        test_[:, j*2] = y_test**2
        train_[:, j*2+1] = np.exp(y_train)
        test_[:, j*2+1] = np.exp(y_test)
        print('Method ', j)
    train_ = pd.DataFrame(train_)
    test_ = pd.DataFrame(test_)
    return train_, test_
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
data_0 = iris.data
data = data_0[:100,:]
target_0 = iris.target
target = target_0[:100]
x_train,x_test,y_train,y_test=train_test_split(data,target,test_size=0.3)
x_train = pd.DataFrame(x_train) ; x_test = pd.DataFrame(x_test)
# The individual models used in the fusion
clfs = [LogisticRegression(),
RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]
New_train,New_test = Ensemble_add_feature(x_train,x_test,y_train,clfs)
clf = LogisticRegression()
# clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=30)
clf.fit(New_train, y_train)
y_emb = clf.predict_proba(New_test)[:, 1]
print("Val auc Score of stacking: %f" % (roc_auc_score(y_test, y_emb)))
Method 0
Method 1
Method 2
Method 3
Method 4
Val auc Score of stacking: 1.000000
5.4.4 An example on this competition's data
import pandas as pd
import numpy as np
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings('ignore')
%matplotlib inline
import itertools
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
# from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
# from mlxtend.plotting import plot_learning_curves
# from mlxtend.plotting import plot_decision_regions
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import GridSearchCV,cross_val_score
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
## Read the data
Train_data = pd.read_csv('datalab/used_car_train_20200313.csv', sep=' ')
TestA_data = pd.read_csv('datalab/used_car_testA_20200313.csv', sep=' ')
print(Train_data.shape)
print(TestA_data.shape)
(150000, 31)
(50000, 30)
Train_data.head()
numerical_cols = Train_data.select_dtypes(exclude = 'object').columns
print(numerical_cols)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
dtype='object')
feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','price']]
X_data = Train_data[feature_cols]
Y_data = Train_data['price']
X_test = TestA_data[feature_cols]
print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
X train shape: (150000, 26)
X test shape: (50000, 26)
def Sta_inf(data):
    print('_min', np.min(data))
    print('_max:', np.max(data))
    print('_mean', np.mean(data))
    print('_ptp', np.ptp(data))
    print('_std', np.std(data))
    print('_var', np.var(data))
print('Sta of label:')
Sta_inf(Y_data)
Sta of label:
_min 11
_max: 99999
_mean 5923.327333333334
_ptp 99988
_std 7501.973469876438
_var 56279605.94272992
X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)
1) Build the individual models
def build_model_lr(x_train, y_train):
    reg_model = linear_model.LinearRegression()
    reg_model.fit(x_train, y_train)
    return reg_model
def build_model_ridge(x_train, y_train):
    reg_model = linear_model.Ridge(alpha=0.8)  # alphas=range(1,100,5)
    reg_model.fit(x_train, y_train)
    return reg_model
def build_model_lasso(x_train, y_train):
    reg_model = linear_model.LassoCV()
    reg_model.fit(x_train, y_train)
    return reg_model
def build_model_gbdt(x_train, y_train):
    estimator = GradientBoostingRegressor(loss='ls', subsample=0.85, max_depth=5, n_estimators=100)
    param_grid = {
        'learning_rate': [0.05, 0.08, 0.1, 0.2],
    }
    gbdt = GridSearchCV(estimator, param_grid, cv=3)
    gbdt.fit(x_train, y_train)
    print(gbdt.best_params_)
    # print(gbdt.best_estimator_)
    return gbdt
def build_model_xgb(x_train, y_train):
    model = xgb.XGBRegressor(n_estimators=120, learning_rate=0.08, gamma=0, subsample=0.8,
                             colsample_bytree=0.9, max_depth=5)  # , objective='reg:squarederror'
    model.fit(x_train, y_train)
    return model
def build_model_lgb(x_train, y_train):
    estimator = lgb.LGBMRegressor(num_leaves=63, n_estimators=100)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm
2) XGBoost with 5-fold cross-validated regression
## xgb
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, subsample=0.8,
                       colsample_bytree=0.9, max_depth=7)  # , objective='reg:squarederror'
scores_train = []
scores = []
## 5-fold cross-validation
sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_ind, val_ind in sk.split(X_data, Y_data):
    train_x = X_data.iloc[train_ind].values
    train_y = Y_data.iloc[train_ind]
    val_x = X_data.iloc[val_ind].values
    val_y = Y_data.iloc[val_ind]
    xgr.fit(train_x, train_y)
    pred_train_xgb = xgr.predict(train_x)
    pred_xgb = xgr.predict(val_x)
    score_train = mean_absolute_error(train_y, pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y, pred_xgb)
    scores.append(score)
print('Train mae:', np.mean(scores_train))
print('Val mae', np.mean(scores))
Train mae: 600.0127885014529
Val mae 691.9976473362078
3) Split the data, then train and predict with several methods
## Split out a validation set
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
## Train and Predict
print('Predict LR...')
model_lr = build_model_lr(x_train,y_train)
val_lr = model_lr.predict(x_val)
subA_lr = model_lr.predict(X_test)
print('Predict Ridge...')
model_ridge = build_model_ridge(x_train,y_train)
val_ridge = model_ridge.predict(x_val)
subA_ridge = model_ridge.predict(X_test)
print('Predict Lasso...')
model_lasso = build_model_lasso(x_train,y_train)
val_lasso = model_lasso.predict(x_val)
subA_lasso = model_lasso.predict(X_test)
print('Predict GBDT...')
model_gbdt = build_model_gbdt(x_train,y_train)
val_gbdt = model_gbdt.predict(x_val)
subA_gbdt = model_gbdt.predict(X_test)
Predict LR...
Predict Ridge...
Predict Lasso...
Predict GBDT...
{'learning_rate': 0.2}
The two methods that are usually most effective in competitions:
print('predict XGB...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
subA_xgb = model_xgb.predict(X_test)
print('predict lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
subA_lgb = model_lgb.predict(X_test)
predict XGB...
predict lgb...
print('Sta inf of lgb:')
Sta_inf(subA_lgb)
Sta inf of lgb:
_min -113.02647702199383
_max: 90367.18180594654
_mean 5926.360831805605
_ptp 90480.20828296854
_std 7352.037499240903
_var 54052455.39024443
1) Weighted fusion
def Weighted_method(test_pre1, test_pre2, test_pre3, w=[1/3, 1/3, 1/3]):
    Weighted_result = w[0]*pd.Series(test_pre1) + w[1]*pd.Series(test_pre2) + w[2]*pd.Series(test_pre3)
    return Weighted_result
## Init the Weight
w = [0.3, 0.4, 0.3]
## Accuracy on the validation set
val_pre = Weighted_method(val_lgb,val_xgb,val_gbdt,w)
MAE_Weighted = mean_absolute_error(y_val,val_pre)
print('MAE of Weighted of val:',MAE_Weighted)
## Predictions on the test data
subA = Weighted_method(subA_lgb,subA_xgb,subA_gbdt,w)
print('Sta inf:')
Sta_inf(subA)
## Generate the submission file
sub = pd.DataFrame()
sub['SaleID'] = X_test.index
sub['price'] = subA
sub.to_csv('./sub_Weighted.csv',index=False)
MAE of Weighted of val: 721.1704120165163
Sta inf:
_min -197.09928483735297
_max: 91079.8298898976
_mean 5928.720726400139
_ptp 91276.92917473496
_std 7341.282090664513
_var 53894422.73471152
## Compare with a plain LR (linear regression)
val_lr_pred = model_lr.predict(x_val)
MAE_lr = mean_absolute_error(y_val,val_lr_pred)
print('MAE of lr:',MAE_lr)
MAE of lr: 2601.82041433559
2) Stacking fusion
## Stacking
## First level
train_lgb_pred = model_lgb.predict(x_train)
train_xgb_pred = model_xgb.predict(x_train)
train_gbdt_pred = model_gbdt.predict(x_train)
Strak_X_train = pd.DataFrame()
Strak_X_train['Method_1'] = train_lgb_pred
Strak_X_train['Method_2'] = train_xgb_pred
Strak_X_train['Method_3'] = train_gbdt_pred
Strak_X_val = pd.DataFrame()
Strak_X_val['Method_1'] = val_lgb
Strak_X_val['Method_2'] = val_xgb
Strak_X_val['Method_3'] = val_gbdt
Strak_X_test = pd.DataFrame()
Strak_X_test['Method_1'] = subA_lgb
Strak_X_test['Method_2'] = subA_xgb
Strak_X_test['Method_3'] = subA_gbdt
Strak_X_test.head()
## level2-method
model_lr_Stacking = build_model_lr(Strak_X_train,y_train)
## Training set
train_pre_Stacking = model_lr_Stacking.predict(Strak_X_train)
print('MAE of Stacking-LR:',mean_absolute_error(y_train,train_pre_Stacking))
## Validation set
val_pre_Stacking = model_lr_Stacking.predict(Strak_X_val)
print('MAE of Stacking-LR:',mean_absolute_error(y_val,val_pre_Stacking))
## Test set
print('Predict Stacking-LR...')
subA_Stacking = model_lr_Stacking.predict(Strak_X_test)
MAE of Stacking-LR: 635.088640438716
MAE of Stacking-LR: 717.0504813030163
Predict Stacking-LR...
subA_Stacking[subA_Stacking < 10] = 10  ## clip away overly small predictions
sub = pd.DataFrame()
sub['SaleID'] = TestA_data.SaleID
sub['price'] = subA_Stacking
sub.to_csv('./sub_Stacking.csv',index=False)
print('Sta inf:')
Sta_inf(subA_Stacking)
Sta inf:
_min 10.0
_max: 93069.56247871982
_mean 5926.1644584540845
_ptp 93059.56247871982
_std 7391.202609036913
_var 54629876.00783407
3.4 Lessons learned
Fusion in competitions, in my view, actually involves several levels, and it is an important way to gain score and improve model robustness:
1) Result-level fusion. This is the most common kind, and many schemes are feasible, e.g. weighting by each result's score, or applying log/exp transforms. An important condition for result fusion to work is that the models' scores be close to each other while their predictions differ substantially; such fusions usually bring a good lift.
2) Feature-level fusion. This level is arguably not fusion but rather splitting: often, when training the same kind of model, we can split the features across different models and then fuse the models or their results afterwards, which sometimes also works well.
3) Model-level fusion. This may involve model stacking and architecture design, e.g. adding a stacking layer, or feeding some models' outputs in as features; this takes experimentation and thought. For model-level fusion, it is best if the model types differ; using the same model with different parameters generally yields little gain.
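For the log/exp processing mentioned in 1), one common concrete form is the geometric mean: average the predictions in log space, then exponentiate. A minimal sketch (toy values for illustration; valid only for strictly positive predictions such as prices):

```python
import numpy as np

# Toy predictions from two models (hypothetical values)
pred1 = np.array([1.2, 3.2, 2.1, 6.2])
pred2 = np.array([0.9, 3.1, 2.0, 5.9])

# Geometric mean = exp(mean of logs); it damps the influence of large outliers
# compared with the plain arithmetic mean
geo_mean = np.exp((np.log(pred1) + np.log(pred2)) / 2)
print(geo_mean)
```

For the used-car price target in this competition, which is heavily right-skewed, averaging in log space often behaves better than averaging raw prices, though which fusion wins should still be checked on the validation set.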