Ensemble Learning
Ensemble learning models complex problems more accurately by combining multiple different models. It can be realized in many ways, but they share a common pattern: from several subsets of the overall dataset, extract features (or learners) that are discriminative on that particular subset, even if they are not necessarily as effective on other subsets; classify the data with each of them separately; and then aggregate the results.
Bagging & Boosting are the two classic representatives of ensemble learning. The simplest form of ensemble learning is to draw multiple random subsets of the data (bootstrap samples), fit a model on each subset, and then average the results. This approach is called Bagging, short for Bootstrap aggregation; it helps because averaging out the differences between the individual models makes the final model robust against anomalous data.
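As a concrete illustration, here is a minimal Bagging sketch, assuming scikit-learn regression trees as the base models and NumPy arrays as input; the function name and parameters are purely illustrative, not a reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X_train, y_train, X_test, n_models=50, seed=0):
    """Fit n_models trees on bootstrap samples and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        # Bootstrap sample: draw n indices with replacement
        idx = rng.integers(0, n, size=n)
        model = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    # Averaging the individual predictions damps the influence of anomalous data
    return np.mean(preds, axis=0)
```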
Boosting
Here we focus on Boosting algorithms, represented by AdaBoost (Adaptive Boosting). The core question Boosting tries to answer is whether a combination of a series of weak learners can produce a learner with higher performance; this performance enhancement is where the name Boosting comes from. Two important concepts in AdaBoost are the weak learner and the sample weights:
- Weak Learner: defined as a classifier whose accuracy on samples drawn from an arbitrary distribution D only needs to be better than random guessing. In practice, the weak learner can be any known supervised learning method, such as an SVM, linear regression, logistic regression, or even a deep neural network, but decision trees are the most common choice.
What weak learner you should choose is then a trade-off between three factors:
- The bias of the model. A lower bias is almost always better, but you don't want to pick something that will overfit (yes, boosting can and does overfit)
- The training time for the weak learner. Generally we want to be able to learn a weak learner quickly, as we are going to be building a few hundred (or thousand) of them.
- The prediction time for our weak learner. If we use a model that has a slow prediction rate, our ensemble of them is going to be a few hundred times slower!
- Sample weights: in practice, some samples may be more likely to occur than others, so treating every sample equally (a plain average) is an oversimplification; Boosting algorithms therefore generally use a weighted average (see the short fitting sketch after this list).
The observation weights might come from domain knowledge: for example, we may know, based on past experience, that we are more likely to encounter training example j than training example k. Therefore we should assign a larger weight to example j, since it will have a greater influence on our function approximation process.
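In practice, most scikit-learn estimators accept per-sample weights directly through the `sample_weight` argument of `fit`. Below is a minimal sketch, assuming a depth-1 decision tree (a "decision stump", the most common weak learner) and toy data generated purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only: xi is a 2-dimensional feature vector, yi ∈ {-1, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Start from uniform weights wi = 1 / N; Boosting will later re-weight them
w = np.full(len(X), 1.0 / len(X))

# A depth-1 tree ("decision stump") fitted under the given sample weights
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=w)
```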
One benefit of assigning weights to samples is that we can take those weights into account when computing a classifier's error. Concretely, consider a dataset of N samples (xi, yi), where xi is a vector of features and yi takes the value -1 or 1. For any classifier G whose prediction on input xi is G(xi) ∈ {-1, 1}, the error can be computed, without sample weights, by counting the misclassified samples:
- error = ΣI(yi ≠ G(xi)), i = 1, 2, ... , N
Once the sample weights are taken into account, this error becomes:
- error = ΣwiI(yi ≠ G(xi)) / Σwi, i = 1, 2, ... , N
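Assuming NumPy arrays for the labels, predictions, and weights, the weighted error above can be computed in a couple of lines (a sketch with illustrative names):

```python
import numpy as np

def weighted_error(y_true, y_pred, w):
    """Sum of the weights of misclassified samples, normalized by the total weight."""
    misclassified = y_true != y_pred          # the indicator I(yi != G(xi))
    return np.sum(w * misclassified) / np.sum(w)
```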
AdaBoost Algorithm Implementation
The AdaBoost algorithm proceeds as follows:
- Initialize the weight of every sample to wi = 1 / N
- For m = 1 to M, repeat:
  - Fit a classifier Gm(x) to the training data, taking the sample weights wi into account
  - Compute the weighted error errm = ΣwiI(yi ≠ Gm(xi)) / Σwi, i = 1, 2, ... , N
  - Compute αm = log((1 - errm) / errm)
  - For i = 1, 2, ... , N, update the weights wi ← wi · exp(αm · I(yi ≠ Gm(xi)))
- Output the final classifier G(x) = sign[Σ αmGm(x)], where the sum runs over m = 1, 2, ... , M
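The steps above translate almost directly into code. The following is a minimal from-scratch sketch (illustrative names, decision stumps as the weak learners); a real implementation would add input validation and early stopping:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=100):
    """Minimal AdaBoost following the steps above; y must take values in {-1, 1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                      # initialize wi = 1 / N
    learners, alphas = [], []
    for _ in range(M):
        G_m = DecisionTreeClassifier(max_depth=1)
        G_m.fit(X, y, sample_weight=w)           # fit Gm(x) under the current weights
        miss = G_m.predict(X) != y               # indicator I(yi != Gm(xi))
        err_m = np.sum(w * miss) / np.sum(w)     # weighted error errm
        err_m = np.clip(err_m, 1e-10, 1 - 1e-10) # guard against log(0) / division by zero
        alpha_m = np.log((1 - err_m) / err_m)    # alpha_m
        w = w * np.exp(alpha_m * miss)           # up-weight the misclassified samples
        w = w / np.sum(w)                        # renormalize for numerical stability
        learners.append(G_m)
        alphas.append(alpha_m)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """G(x) = sign(sum over m of alpha_m * Gm(x))."""
    agg = sum(a * g.predict(X) for a, g in zip(alphas, learners))
    return np.sign(agg)
```

Renormalizing w after each round does not change which samples dominate the next fit; it only keeps the weights from overflowing or underflowing over many rounds.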
The relationship between αm and errm in the procedure above is illustrated in the figure below, where:
- When errm < 0.5, i.e., when the classifier performs better than random guessing, we have αm > 0; and the smaller the error errm, the larger αm, meaning the current classifier Gm(x) is more accurate
- During the weight update, whenever Gm(x) fails to classify sample xi correctly, wi increases, so the next classifier Gm+1(x) pays more attention to that sample, and vice versa
The AdaBoost algorithm trains multiple weak classifiers on training data, and then combines those weak classifiers into a single boosted classifier. The combination is done through a weighted sum of the weak classifiers with weights dependent on the weak classifier accuracy.
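For everyday use, scikit-learn already ships this algorithm as `AdaBoostClassifier`. A minimal usage sketch on synthetic data is shown below; note that the keyword argument for the weak learner is `estimator` in recent releases (it was `base_estimator` in older ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps as weak learners
    n_estimators=200,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```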
Caveats
Although Boosting has an inherent advantage when it comes to resisting overfitting, this advantage is much less pronounced when deep neural networks are used as the weak learners. Moreover, when the data contains pink noise, i.e., uniform noise, Boosting still cannot avoid overfitting.