Why Would We Want to Ensemble Learners Together?
There are two competing variables in finding a well-fitting machine learning model: Bias and Variance.
Bias: When a model has high bias, it does not do a good job of bending to the data.
Variance: When a model has high variance, it changes drastically to meet the needs of every point in our dataset.
1. Bias and variance, two very important factors in machine learning algorithms
A high-bias algorithm ignores the training data and does not fit it well.
A high-variance algorithm is highly sensitive to the data; it can only reproduce what it has already seen, and it reacts poorly to situations it has never seen before (because it lacks the right bias to generalize to new cases).
What we really want is a compromise between the two, the so-called bias-variance tradeoff: an algorithm with some ability to generalize, yet still open to the training data, adjusting the model according to it.
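To make the tradeoff concrete, here is a minimal sketch (not from the original notes; it assumes scikit-learn and a synthetic dataset) comparing a high-bias and a high-variance decision tree:
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> # High bias: a depth-1 stump barely bends to the data (underfits)
>>> DecisionTreeClassifier(max_depth=1).fit(X_train, y_train).score(X_test, y_test)
>>> # High variance: an unconstrained tree fits every training point (overfits)
>>> DecisionTreeClassifier(max_depth=None).fit(X_train, y_train).score(X_test, y_test)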
Introducing Randomness Into Ensembles
Another method used to improve ensembles is to introduce randomness into high-variance algorithms before they are ensembled together. The randomness combats these algorithms' tendency to overfit (that is, to fit directly to the available data). There are two main ways randomness is introduced:
Bootstrap the data - that is, sampling the data with replacement and fitting your algorithm to the sampled data.
Subset the features - in each split of a decision tree, or with each algorithm used in an ensemble, only a subset of the total possible features is used. (Both techniques are sketched below.)
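A minimal NumPy sketch of both techniques (the array names and sizes here are hypothetical):
>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> X = rng.normal(size=(100, 8))  # 100 samples, 8 features
>>> # Bootstrap: draw row indices with replacement
>>> rows = rng.integers(0, len(X), size=len(X))
>>> X_boot = X[rows]
>>> # Feature subsetting: keep a random subset of the columns
>>> cols = rng.choice(X.shape[1], size=3, replace=False)
>>> X_sub = X_boot[:, cols]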
2. Random forests
Randomly pick a few columns of the data and build a decision tree from them; then randomly pick other columns and build another decision tree, and so on. To make a prediction, simply let all of the decision trees vote and take the most common result.
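scikit-learn packages this idea as RandomForestClassifier. A minimal sketch, assuming train/test splits named like the ones in the AdaBoost example below:
>>> from sklearn.ensemble import RandomForestClassifier
>>> # Each of the 100 trees is grown on a bootstrap sample, and only a
>>> # random subset of features (max_features) is considered at each split
>>> forest = RandomForestClassifier(n_estimators=100, max_features='sqrt')
>>> forest.fit(x_train, y_train)
>>> forest.predict(x_test)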
3峭沦、Bagging
4. AdaBoost
5逃糟、Adaboost in sklearn
>>> from sklearn.ensemble import AdaBoostClassifier
>>> model = AdaBoostClassifier()
>>> model.fit(x_train, y_train)
>>> model.predict(x_test)
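To gauge the fit, a common follow-up (assuming a y_test split exists) is:
>>> from sklearn.metrics import accuracy_score
>>> y_pred = model.predict(x_test)
>>> accuracy_score(y_test, y_pred)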
Hyperparameters
base_estimator: The model utilized for the weak learners. (Warning: Don't forget to import the model that you decide to use for the weak learner.)
n_estimators: The maximum number of weak learners used.
>>> from sklearn.tree import DecisionTreeClassifier
>>> model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2), n_estimators=4)
Recap:
In this lesson we covered ensemble methods and the two variables they trade off: bias and variance. A high-bias, low-variance model does not fit the data well and is not flexible enough; a low-bias, high-variance model is too flexible and leads to overfitting.
Ensemble methods are a commonly used way to balance bias and variance.
Two randomization techniques are used to combat overfitting:
1. Bootstrap the data - that is, sampling the data with replacement and fitting your algorithm to the sampled data.
2. Subset the features - in each split of a decision tree, or with each algorithm used in an ensemble, only a subset of the total possible features is used.
技術(shù)方法: