KAGGLE ENSEMBLING GUIDE (annotated)

[TOC]

About the translator's notes

These are just footnotes I added while reading; the English original is clearer and more precise, and some of the translations (including mine) can even be misleading.

KAGGLE ENSEMBLING GUIDE

JUNE 11, 2015

Model ensembling is a very powerful technique to increase accuracy on a variety of ML tasks. In this article I will share my ensembling approaches for Kaggle Competitions.

For the first part we look at creating ensembles from submission files. The second part will look at creating ensembles through stacked generalization/blending.

I answer why ensembling reduces the generalization error. Finally I show different methods of ensembling, together with their results and code to try it out for yourself.

“This is how you win ML competitions: you take other peoples’ work and ensemble them together.” Vitaly Kuznetsov, NIPS 2014

Creating ensembles from submission files

The most basic and convenient way to ensemble is to ensemble Kaggle submission CSV files. You only need the predictions on the test set for these methods — no need to retrain a model. This makes it a quick way to ensemble already existing model predictions, ideal when teaming up.

Translator's note: other people's submitted predictions are used as the training input of the first level of a stacking model, effectively a new set of features.

Voting ensembles.

We first take a look at a simple majority vote ensemble. Let’s see why model ensembling reduces error rate and why it works better to ensemble low-correlated model predictions.

Error correcting codes

During space missions it is very important that all signals are correctly relayed.

If we have a signal in the form of a binary string like:

1110110011101111011111011011

and somehow this signal is corrupted (a bit is flipped) to:

1010110011101111011111011011

then lives could be lost.

A coding solution was found in error correcting codes. The simplest error correcting code is a repetition-code: Relay the signal multiple times in equally sized chunks and have a majority vote.

Original signal:
1110110011

Encoded (10-bit chunks, 3 repetitions, as received with one flipped bit):
101011001111101100111110110011

Decoding:
1010110011
1110110011
1110110011

Majority vote:
1110110011

Signal corruption is a very rare occurrence and often occurs in small bursts, so it follows that a corrupted majority vote is even rarer.

As long as the corruption is not completely unpredictable (i.e. does not flip bits with a 50% chance) the signal can be repaired.

Translator's note: a repetition code transmits several copies of the original encoding; after decoding, the signal is recovered with a bitwise majority vote.

A machine learning example

Suppose we have a test set of 10 samples. The ground truth is all positive (“1”):

1111111111

We furthermore have 3 binary classifiers (A,B,C) with a 70% accuracy. You can view these classifiers for now as pseudo-random number generators which output a “1” 70% of the time and a “0” 30% of the time.

We will now show how these pseudo-classifiers are able to obtain 78% accuracy through a voting ensemble.

A pinch of maths

For a majority vote with 3 members we can expect 4 outcomes:

All three are correct
  0.7 * 0.7 * 0.7
= 0.3429

Two are correct
  0.7 * 0.7 * 0.3
+ 0.7 * 0.3 * 0.7
+ 0.3 * 0.7 * 0.7
= 0.4409

Two are wrong
  0.3 * 0.3 * 0.7
+ 0.3 * 0.7 * 0.3
+ 0.7 * 0.3 * 0.3
= 0.189

All three are wrong
  0.3 * 0.3 * 0.3
= 0.027

We see that roughly 44% of the time the majority vote corrects an error. This majority vote ensemble will be correct an average of ~78% (0.3429 + 0.4409 = 0.7838).

Translator's note: this analyzes the advantage of ensembling from a probabilistic angle.

Number of voters

Just as repetition codes gain error-correcting capability when more repetitions are added, ensembles usually improve when more ensemble members are added.

Translator's note: in theory, the more sub-models you add, the stronger the error-correcting capability.

Using the same pinch of maths as above: a voting ensemble of 5 pseudo-random classifiers with 70% accuracy would be correct ~83% of the time. One or two errors are corrected in ~66% of the majority votes (0.36015 + 0.3087).
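The same calculation extends to any number of voters. Below is a minimal sketch (my own illustration, not from the original article) that computes the majority-vote accuracy of n independent classifiers, each with accuracy p, from the binomial distribution.

```python
# Probability that a majority vote of n independent classifiers is correct,
# assuming each classifier is correct independently with probability p.
from math import comb

def majority_vote_accuracy(n, p):
    k_min = n // 2 + 1  # smallest number of correct votes that still wins the vote
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(majority_vote_accuracy(3, 0.7))  # ~0.784, i.e. 0.3429 + 0.4409 from above
print(majority_vote_accuracy(5, 0.7))  # ~0.837, the ~83% quoted for 5 voters
```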

Correlation

When I first joined the team for KDD-cup 2014, Marios Michailidis (KazAnova) proposed something peculiar. He calculated the Pearson correlation (a measure of linear correlation between variables, with 1 meaning perfectly positively correlated) for all our submission files and gathered a few well-performing models which were less correlated.

Creating an averaging ensemble from these diverse submissions gave us the biggest 50-spot jump on the leaderboard. Uncorrelated submissions clearly do better when ensembled than correlated submissions. But why?

To see this, let us take 3 simple models again. The ground truth is still all 1’s:

1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy.

These models are highly correlated in their predictions. When we take a majority vote we see no improvement:

1111111100 = 80% accuracy

Now we compare to 3 less-performing, but highly uncorrelated models:

1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy

When we ensemble this with a majority vote we get:

1111111101 = 90% accuracy

Which is an improvement: A lower correlation between ensemble model members seems to result in an increase in the error-correcting capability.

Translator's note: ensembling low-correlation sub-models works better than ensembling highly correlated ones; this is why diversity among the sub-models is one of the key requirements in ensemble learning.
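As a quick sanity check, here is a small sketch (again my own, not the article's code) that reproduces the bitwise majority vote over the three uncorrelated toy models above.

```python
# Bitwise majority vote over the three low-correlation toy models.
from collections import Counter

ground_truth = "1111111111"
models = ["1111111100",   # 80% accuracy
          "0111011101",   # 70% accuracy
          "1000101111"]   # 60% accuracy

vote = "".join(Counter(bits).most_common(1)[0][0] for bits in zip(*models))
accuracy = sum(v == t for v, t in zip(vote, ground_truth)) / len(ground_truth)
print(vote, accuracy)  # 1111111101 0.9
```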

Use for Kaggle: Forest Cover Type prediction


Majority votes make most sense when the evaluation metric requires hard predictions, for instance with (multiclass-) classification accuracy.

The forest cover type prediction challenge uses the UCI Forest CoverType dataset. The dataset has 54 attributes and there are 7 classes.

We create a simple starter model with a 500-tree Random Forest. We then create a few more models and pick the best performing one. For this task and our model selection an ExtraTreesClassifier works best.

Weighing

We then use a weighted majority vote. Why weighing? Usually we want to give a better model more weight in a vote. So in our case we count the vote of the best model 3 times; the other 4 models count for one vote each.

The reasoning is as follows: The only way for the inferior models to overrule the best model (expert) is for them to collectively (and confidently) agree on an alternative.

Translator's note: as in the saying "three cobblers with their wits combined equal one Zhuge Liang", if several weaker models confidently agree on the same decision, they can overrule the best model.

We can expect this ensemble to repair a few erroneous choices by the best model, leading to a small improvement only. That’s our punishment for forgoing a democracy and creating a Plato’s Republic.
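A rough sketch of such a weighted vote follows; the predictions below are made-up placeholders and the helper function is hypothetical, not the script used in the competition.

```python
# Weighted majority vote: the best model's vote is counted 3 times,
# every other model's vote once.
import numpy as np

def weighted_majority_vote(predictions, weights):
    """predictions: list of 1-D arrays of class labels, one array per model."""
    preds = np.asarray(predictions)               # shape (n_models, n_samples)
    classes = np.unique(preds)
    votes = np.zeros((len(classes), preds.shape[1]))
    for model_pred, w in zip(preds, weights):
        for i, c in enumerate(classes):
            votes[i] += w * (model_pred == c)     # add w votes where this model picked c
    return classes[votes.argmax(axis=0)]

# hypothetical class predictions from 5 models on 6 samples
p = [np.array([1, 2, 2, 3, 1, 2]),
     np.array([1, 2, 3, 3, 1, 1]),
     np.array([2, 2, 3, 3, 1, 2]),
     np.array([1, 1, 3, 2, 1, 2]),
     np.array([1, 2, 3, 3, 2, 2])]               # suppose this last model is the best one
print(weighted_majority_vote(p, weights=[1, 1, 1, 1, 3]))
```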

“Every city encompasses two cities that are at war with each other.” Plato in The Republic

Table 1. shows the result of training 5 models, and the resulting score when combining these with a weighted majority vote.

| MODEL | PUBLIC ACCURACY SCORE |
| --- | --- |
| GradientBoostingMachine | 0.65057 |
| RandomForest Gini | 0.75107 |
| RandomForest Entropy | 0.75222 |
| ExtraTrees Entropy | 0.75524 |
| ExtraTrees Gini (Best) | 0.75571 |
| Voting Ensemble (Democracy) | 0.75337 |
| Voting Ensemble (3*Best vs. Rest) | 0.75667 |

Translator's note: the experiment shows that giving each classifier a single, equal vote performs worse than the weighted ensemble.

Use for Kaggle: CIFAR-10 Object detection in images


CIFAR-10 is another multi-class classification challenge where accuracy matters.

Our team leader for this challenge, Phil Culliton, first found the best setup to replicate a good model from Dr. Graham.

Then he used a voting ensemble of around 30 convnet submissions (all scoring above 90% accuracy). The best single model of the ensemble scored 0.93170.

A voting ensemble of 30 models scored 0.94120. A ~0.01 reduction in error rate, pushing the resulting score beyond the estimated human classification accuracy.

Code

We have a sample voting script you could use at the MLWave Github repo. It operates on a directory of Kaggle submissions and creates a new submission. Update: Armando Segnini has added weighing.

Ensembling. Train 10 neural networks and average their predictions. It’s a fairly trivial technique that results in easy, sizeable performance improvements.

One may be mystified as to why averaging helps so much, but there is a simple reason for the effectiveness of averaging. Suppose that two classifiers have an error rate of 70%. Then, when they agree they are right. But when they disagree, one of them is often right, so now the average prediction will place much more weight on the correct answer.

The effect will be especially strong whenever the network is confident when it’s right and unconfident when it’s wrong. Ilya Sutskever A brief overview of Deep Learning.

Averaging

Averaging works well for a wide range of problems (both classification and regression) and metrics (AUC, squared error or logarithmic loss).

There is not much more to averaging than taking the mean of individual model predictions. An often heard shorthand for this on Kaggle is “bagging submissions”.

Averaging predictions often reduces overfitting. You ideally want a smooth separation between classes, and a single model’s predictions can be a little rough around the edges.
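For concreteness, a minimal sketch of "bagging submissions" with pandas; the directory layout and the Id/Prediction column names are assumptions for illustration, not something fixed by the article.

```python
# Average the prediction column of several Kaggle submission files.
import glob
import pandas as pd

files = sorted(glob.glob("submissions/*.csv"))   # hypothetical directory of submissions
subs = [pd.read_csv(f) for f in files]

ensemble = subs[0][["Id"]].copy()                # keep the Id column of the first file
ensemble["Prediction"] = sum(s["Prediction"] for s in subs) / len(subs)
ensemble.to_csv("averaged_submission.csv", index=False)
```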

Learning from noise

The above image is from the Kaggle competition: Don’t Overfit!, the black line shows a better separation than the green line. The green line has learned from noisy datapoints. No worries! Averaging multiple different green lines should bring us closer to the black line.

Remember our goal is not to memorize the training data (there are far more efficient ways to store data than inside a random forest), but to generalize well to new unseen data.

Translator's note: ensembling also helps guard against overfitting. A single model that fits the decision boundary too aggressively (a hard-margin SVM, say) overfits easily, while averaging several models dampens this and acts as a smoother.

Kaggle use: Bag of Words Meets Bags of Popcorn

This is a movie sentiment analysis contest. In a previous post we used an online perceptron script to get 95.2 AUC.

The perceptron is a decent linear classifier which is guaranteed to find a separation if the data is linearly separable. This is a welcome property to have, but you have to realize a perceptron stops learning once this separation is reached. It does not necessarily find the best separation for new data.

Translator's note: a single linear classifier does not need to reach the best possible decision boundary in one go.

So what would happen if we initialize 5 perceptrons with random weights and combine their predictions through an average? Why, we get an improvement on the test set!

| MODEL | PUBLIC AUC SCORE |
| --- | --- |
| Perceptron | 0.95288 |
| Random Perceptron | 0.95092 |
| Random Perceptron | 0.95128 |
| Random Perceptron | 0.95118 |
| Random Perceptron | 0.95072 |
| Bagged Perceptrons | 0.95427 |

Above results also illustrate that ensembling can (temporarily) save you from having to learn about the finer details and inner workings of a specific Machine Learning algorithm. If it works, great! If it doesn’t, not much harm done.


You also won’t get a penalty for averaging 10 exactly the same linear regressions. Bagging a single poorly cross-validated and overfitted submission may even bring you some gain through adding diversity (thus less correlation).

Code

We have posted a simple averaging script on Github that takes as input a directory of .csv files and outputs an averaged submission. Update: Dat Le has added a geometric averaging script. Geometric mean can outperform a plain average.
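The snippet below is only a sketch of what such a geometric average could look like (it assumes strictly positive predictions, e.g. probabilities, and the same hypothetical Id/Prediction columns as before); it is not the script mentioned above.

```python
# Row-wise geometric mean over several submission files.
import glob
import numpy as np
import pandas as pd

subs = [pd.read_csv(f) for f in sorted(glob.glob("submissions/*.csv"))]
preds = np.vstack([s["Prediction"].values for s in subs])   # shape (n_models, n_rows)

out = subs[0][["Id"]].copy()
out["Prediction"] = np.exp(np.log(preds).mean(axis=0))      # geometric mean per row
out.to_csv("geomean_submission.csv", index=False)
```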

Rank averaging

When averaging the outputs from multiple different models some problems can pop up. Not all predictors are perfectly calibrated: they may be over- or underconfident when predicting a low or high probability. Or the predictions cluster around a certain range.

In the extreme case you may have a submission which looks like this:

Id,Prediction
1,0.35000056
2,0.35000002
3,0.35000098
4,0.35000111

Such a prediction may do well on the leaderboard when the evaluation metric is ranking or threshold based like AUC. But when averaged with another model like:

Id,Prediction
1,0.57
2,0.04
3,0.96
4,0.99

it will not change the ensemble much at all.

Translator's note: the first model might rank well on its own, but averaging it with the second model barely changes the ensemble; that is why we first normalize predictions to ranks and only then combine them with other models.

Our solution is to first turn the predictions into ranks, then average these ranks.

Id,Rank,Prediction
1,1,0.35000056
2,0,0.35000002
3,2,0.35000098
4,3,0.35000111

After normalizing the averaged ranks between 0 and 1 you are sure to get an even distribution in your predictions. The resulting rank-averaged ensemble:

Id,Prediction
1,0.33
2,0.0
3,0.66
4,1.0

Translator's note: the normalization here is simply min-max scaling; e.g. for Id=1, (1 - 0) / (3 - 0) = 0.33.
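A small sketch of rank averaging along these lines, using scipy's rankdata; it reproduces the 0.33 / 0.0 / 0.66 / 1.0 example above (my own illustration, not the MLWave script).

```python
# Turn each model's predictions into ranks, average the ranks,
# then min-max normalise the averaged ranks back into [0, 1].
import numpy as np
from scipy.stats import rankdata

def rank_average(prediction_lists):
    ranks = np.vstack([rankdata(p) for p in prediction_lists])  # rank 1 = lowest prediction
    avg = ranks.mean(axis=0)
    return (avg - avg.min()) / (avg.max() - avg.min())

model_a = [0.35000056, 0.35000002, 0.35000098, 0.35000111]  # the poorly calibrated model
model_b = [0.57, 0.04, 0.96, 0.99]
print(rank_average([model_a]))           # [0.333..., 0.0, 0.666..., 1.0]
print(rank_average([model_a, model_b]))  # rank-averaged ensemble of both models
```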

Historical ranks.

Ranking requires a test set. So what do you do when you want predictions for a single new sample? You could rank it together with the old test set, but this will increase the complexity of your solution.

A solution is using historical ranks. Store the old test set predictions together with their rank. Now when you predict a new test sample like “0.35000110” you find the closest old prediction and take its historical rank (in this case rank “3” for “0.35000111”).

Translator's note: we should not re-rank against the new test data each time; instead, find the closest prediction among the stored historical ranks and reuse its rank directly.
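A tiny sketch of that historical-rank lookup (my own illustration): store the old test-set predictions with their ranks, then give a new prediction the rank of its nearest stored neighbour.

```python
# Reuse the rank of the closest old prediction for a new sample.
import numpy as np
from scipy.stats import rankdata

old_predictions = np.array([0.35000056, 0.35000002, 0.35000098, 0.35000111])
old_ranks = rankdata(old_predictions) - 1        # 0-based ranks: [1, 0, 2, 3]

def historical_rank(new_prediction):
    closest = np.argmin(np.abs(old_predictions - new_prediction))
    return old_ranks[closest]

print(historical_rank(0.35000110))  # 3.0: the nearest old prediction is 0.35000111 (rank 3)
```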

Kaggle use case: Acquire Valued Shoppers Challenge


Ranking averages do well on ranking and threshold-based metrics (like AUC) and search-engine quality metrics (like average precision at k).

The goal of the shopper challenge was to rank the chance that a shopper would become a repeat customer.

Our team first took an average of multiple Vowpal Wabbit models together with an R GLMNet model. Then we used a ranking average to improve the exact same ensemble.

| MODEL | PUBLIC | PRIVATE |
| --- | --- | --- |
| Vowpal Wabbit A | 0.60764 | 0.59962 |
| Vowpal Wabbit B | 0.60737 | 0.59957 |
| Vowpal Wabbit C | 0.60757 | 0.59954 |
| GLMNet | 0.60433 | 0.59665 |
| Average Bag | 0.60795 | 0.60031 |
| Rank average Bag | 0.61027 | 0.60187 |

I already wrote about the Avito challenge where rank averaging gave us a hefty increase.

Finally, when weighted rank averaging the bagged perceptrons from the previous chapter (1x) with the new bag-of-words tutorial (3x) on fastML.com we improve that model’s performance from 0.96328 AUC to 0.96461 AUC.

Translator's note: Vowpal Wabbit is a machine learning system written in C++; GitHub: https://github.com/JohnLangford/vowpal_wabbit

Code

A simple work-horse rank averaging script is added to the MLWave Github repo.

Competitions are effective because there are any number of techniques that can be applied to any modeling problem, but we can’t know in advance which will be most effective. Anthony Goldbloom Data Prediction Competitions — Far More than Just a Bit of Fun

Whiskey blending (image from ‘How Scotch Blended Whisky is Made’ on Youtube)

Stacked Generalization & Blending

Averaging prediction files is nice and easy, but it’s not the only method that the top Kagglers are using. The serious gains start with stacking and blending. Hold on to your top-hats and petticoats: Here be dragons. With 7 heads. Standing on top of 30 other dragons.

Translator's note: if bagging is a "parallel" ensemble, then stacking is a "serial" one: the predictions of the first-level models are used as inputs to the second-level model.

Netflix

Netflix organized and popularized the first data science competitions. Competitors in the movie recommendation challenge really pushed the state of the art on ensemble creation, perhaps so much so that Netflix decided not to implement the winning solution in production. That one was simply too complex.

Nevertheless, a number of papers and novel methods resulted from this challenge, all of them interesting, accessible and relevant reads when you want to improve your Kaggle game.

This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Netflix Engineers

Stacked generalization

Stacked generalization was introduced by Wolpert in a 1992 paper, 2 years before the seminal Breiman paper “Bagging Predictors“. Wolpert is famous for another very popular machine learning theorem: “There is no free lunch in search and optimization“.

The basic idea behind stacked generalization is to use a pool of base classifiers, then use another classifier to combine their predictions, with the aim of reducing the generalization error.

Translator's note: stacking uses a pool of base classifiers and fits a second-level classifier on their outputs, with the goal of reducing generalization error.

Let’s say you want to do 2-fold stacking:

Translator's note (my own reading): 2-fold stacking resembles n-fold cross-validation: split the train set into n parts, train on n-1 of them and predict the remaining part, repeating n times; here n = 2.

  • Split the train set in 2 parts: train_a and train_b

  • Fit a first-stage model on train_a and create predictions for train_b

  • Fit the same model on train_b and create predictions for train_a

  • Finally fit the model on the entire train set and create predictions for the test set.

  • Now train a second-stage stacker model on the probabilities from the first-stage model(s).

A stacker model gets more information on the problem space by using the first-stage predictions as features, than if it was trained in isolation.
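A minimal scikit-learn sketch of the 2-fold procedure above; the synthetic dataset, the random forest base model and the logistic regression stacker are placeholders chosen for illustration.

```python
# 2-fold stacking: out-of-fold base-model predictions become the
# training features of a second-stage (stacker) model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, y_train, X_test = X[:800], y[:800], X[800:]

base = RandomForestClassifier(n_estimators=100, random_state=0)
oof = np.zeros(len(X_train))                      # out-of-fold predictions for the train set

for tr_idx, val_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X_train):
    base.fit(X_train[tr_idx], y_train[tr_idx])
    oof[val_idx] = base.predict_proba(X_train[val_idx])[:, 1]

base.fit(X_train, y_train)                        # refit on the entire train set
test_meta = base.predict_proba(X_test)[:, 1]      # first-stage predictions for the test set

stacker = LogisticRegression()
stacker.fit(oof.reshape(-1, 1), y_train)          # second-stage model on the fold predictions
print(stacker.predict_proba(test_meta.reshape(-1, 1))[:5, 1])
```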

It is usually desirable that the level 0 generalizers are of all “types”, and not just simple variations of one another (e.g., we want surface-fitters, Turing-machine builders, statistical extrapolators, etc., etc.). In this way all possible ways of examining the learning set and trying to extrapolate from it are being exploited. This is part of what is meant by saying that the level 0 generalizers should “span the space”.

[…] stacked generalization is a means of non-linearly combining generalizers to make a new generalizer, to try to optimally integrate what each of the original generalizers has to say about the learning set. The more each generalizer has to say (which isn’t duplicated in what the other generalizer’s have to say), the better the resultant stacked generalization. Wolpert (1992) Stacked Generalization

Blending

Blending is a word introduced by the Netflix winners. It is very close to stacked generalization, but a bit simpler and less risk of an information leak. Some researchers use “stacked ensembling” and “blending” interchangeably.

With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of say 10% of the train set. The stacker model then trains on this holdout set only.

Translator's note: very similar to stacking, except that a small part of the training data is always held out as the stacker's training data, which effectively guards against information leakage.
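A rough sketch of blending under these assumptions (synthetic data, arbitrary base models, a 10% holdout): the base models never see the holdout, and the stacker trains only on their holdout predictions.

```python
# Blending: base models fit on 90% of the train set, the stacker fits
# on their predictions for the remaining 10% holdout.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.1, random_state=1)

bases = [RandomForestClassifier(n_estimators=100, random_state=1),
         ExtraTreesClassifier(n_estimators=100, random_state=1)]

holdout_meta = np.column_stack([m.fit(X_tr, y_tr).predict_proba(X_hold)[:, 1]
                                for m in bases])

blender = LogisticRegression().fit(holdout_meta, y_hold)   # the stacker sees only the holdout
print(blender.coef_)
```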

Blending has a few benefits:

  • It is simpler than stacking.

  • It wards against an information leak: The generalizers and stackers use different data.

  • You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.

    Translator's note: with stacking, teammates must train on the same folds (shared seed); blending has no such requirement.

The cons are:

  • You use less data overall.

  • The final model may overfit to the holdout set.

  • Your CV is more solid with stacking (calculated over more folds) than using a single small holdout set.

As for performance, both techniques are able to give similar results, and it seems to be a matter of preference and skill which you prefer. I myself prefer stacking.

If you can not choose, you can always do both. Create stacked ensembles with stacked generalization and out-of-fold predictions. Then use a holdout set to further combine these models at a third stage.

Translator's note: in practice stacking and blending give very similar results, so pick whichever you prefer.

Stacking with logistic regression

Stacking with logistic regression is one of the more basic and traditional ways of stacking. A script I found by Emanuele Olivetti helped me understand this.

When creating predictions for the test set, you can do that in one go, or take an average of the out-of-fold predictors. Though taking the average is the clean and more accurate way to do this, I still prefer to do it in one go as that slightly lowers both model and coding complexity.

Translator's note: out-of-fold prediction is essentially the cross-validation procedure: split the train set into n equal folds, train on n-1 folds and predict the held-out fold, repeat n times and average; this also gauges the model's stability. See the Kaggle forum thread "Cross validation strategy when blending/stacking".

Kaggle use: “Papirusy z Edhellond”


I used the above blend.py script by Emanuele to compete in this inClass competition. Stacking 8 base models (diverse ET’s, RF’s and GBM’s) with Logistic Regression gave me my second best score of 0.99409 accuracy, good for first place.

Kaggle use: KDD-cup 2014

Using this script I was able to improve a model from Yan Xu. Her model before stacking scored ~0.605 AUC. With stacking this improved to ~0.625.

Stacking with non-linear algorithms

Popular non-linear algorithms for stacking are GBM, KNN, NN, RF and ET.

Non-linear stacking with the original features on multiclass problems gives surprising gains. Obviously the first-stage predictions are very informative and get the highest feature importance. Non-linear algorithms find useful interactions between the original features and the meta-model features.

Translator's note: non-linear models also work very well as stackers.

Kaggle use: TUT Headpose Estimation Challenge

The TUT Headpose Estimation challenge can be treated as a multi-class multi-label classification challenge.

For every label a separate ensemble model was trained.

The following table shows the result of training individual models, and their improved scores when stacking the predicted class probabilities with an extremely randomized trees model.

| MODEL | PUBLIC MAE | PRIVATE MAE |
| --- | --- | --- |
| Random Forests 500 estimators | 6.156 | 6.546 |
| Extremely Randomized Trees 500 estimators | 6.317 | 6.666 |
| KNN-Classifier with 5 neighbors | 6.828 | 7.460 |
| Logistic Regression | 6.694 | 6.949 |
| Stacking with Extremely Randomized Trees | 4.772 | 4.718 |

We see that stacked generalization with standard models is able to reduce the error by around 30%(!).

Read more about this result in the paper: Computer Vision for Head Pose Estimation: Review of a Competition.

Translator's note: MAE (Mean Absolute Error) is also known as the L1-norm loss.

Code

You can find a function to create out-of-fold probability predictions in the MLWave Github repo. You could use numpy horizontal stacking (hstack) to create blended datasets.
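For example, blending original features with first-stage probabilities could look like the following sketch; the array shapes are placeholders.

```python
# Horizontally stack original features with out-of-fold class probabilities.
import numpy as np

X_train = np.random.rand(100, 54)        # placeholder original features
oof_probs = np.random.rand(100, 7)       # placeholder first-stage class probabilities

blended_train = np.hstack([X_train, oof_probs])
print(blended_train.shape)               # (100, 61)
```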

Feature weighted linear stacking

Feature-weighted linear stacking stacks engineered meta-features together with model predictions. The hope is that the stacking model learns which base model is the best predictor for samples with a certain feature value. Linear algorithms are used to keep the resulting model fast and simple to inspect.


Vowpal Wabbit can implement a form of feature-weighted linear stacking out of the box. If we have a train set like:

1 |f f_1:0.55 f_2:0.78 f_3:7.9 |s RF:0.95 ET:0.97 GBM:0.92

We can add quadratic feature interactions between the s-featurespace and the f-featurespace by adding -q fs. The features in the f-namespace can be engineered meta-features like in the paper, or they can be the original features.

Quadratic linear stacking of models

This did not have a name so I made one up. It is very similar to feature-weighted linear stacking, but it creates combinations of model predictions. This improved the score on numerous experiments, most noticeably on the Modeling Women’s Healthcare Decision competition on DrivenData.

Using the same VW training set as before:

1 |f f_1:0.55 f_2:0.78 f_3:7.9 |s RF:0.95 ET:0.97 GBM:0.92

We can train with -q ss creating quadratic feature interactions (RF*GBM) between the model predictions.

This can easily be combined with feature-weighted linear stacking: -q fs -q ss, possibly improving on both.

So now you have a case where many base models should be created. You don’t know a priori which of these models are going to be helpful in the final meta model. In the case of two stage models, it is highly likely weak base models are preferred.

So why tune these base models very much at all? Perhaps tuning here is just obtaining model diversity. But at the end of the day you don’t know which base models will be helpful. And the final stage will likely be linear (which requires no tuning, or perhaps a single parameter to give some sparsity). Mike Kim, ‘Tuning doesn’t matter. Why are you doing it?’

Translator's note: when you don't know whether a base model will help the final meta-model, there is no need to keep tuning or swapping it.

Stacking classifiers with regressors and vice versa

Stacking allows you to use classifiers for regression problems and vice versa. For instance, one may try a base model with quantile regression on a binary classification problem. A good stacker should be able to take information from the predictions, even though usually regression is not the best classifier.

Using classifiers for regression problems is a bit trickier. You use binning first: You turn the y-label into evenly spaced classes. A regression problem that requires you to predict wages can be turned into a multiclass classification problem like so:

  • Everything under 20k is class 1.
  • Everything between 20k and 40k is class 2.
  • Everything over 40k is class 3.

Translator's note: this amounts to partitioning the continuous variable into ranges and then encoding those ranges as discrete classes.

The predicted probabilities for these classes can help a stacking regressor make better predictions.
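A small sketch of the binning step with numpy; the wage values are invented, and the 20k/40k cut points follow the example above.

```python
# Bin a continuous target into three evenly spaced classes.
import numpy as np

wages = np.array([12_000, 25_000, 38_000, 41_000, 55_000, 19_999])
classes = np.digitize(wages, bins=[20_000, 40_000]) + 1   # 1: <20k, 2: 20k-40k, 3: >40k
print(classes)  # [1 2 2 3 3 1]
```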

“I learned that you never, ever, EVER go anywhere without your out-of-fold predictions. If I go to Hawaii or to the bathroom I am bringing them with. Never know when I need to train a 2nd or 3rd level meta-classifier” T. Sharf

Stacking unsupervised learned features

There is no reason we are restricted to using supervised learning techniques with stacking. You can also stack with unsupervised learning techniques.

K-Means clustering is a popular technique that makes sense here. Sofia-ML (a suite of fast incremental algorithms) implements a fast online k-means algorithm suitable for this.

Another more recent interesting addition is to use t-SNE (a non-linear dimensionality-reduction algorithm, well suited to reducing high-dimensional data to 2 or 3 dimensions for visualization): Reduce the dataset to 2 or 3 dimensions and stack this with a non-linear stacker. Using a holdout set for stacking/blending feels like the safest choice here. See here for a solution by Mike Kim, using t-SNE vectors and boosting them with XGBoost: ‘0.41599 via t-SNE meta-bagging‘.

Piotr shows a nice visualization with t-SNE on the Otto Product Classification Challenge data set.
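A hedged sketch of what such unsupervised meta-features might look like with scikit-learn: k-means centroid distances plus a 2-D t-SNE embedding, horizontally stacked onto the original features. Note that t-SNE cannot transform unseen data, which is one reason a holdout set is the safer choice here.

```python
# Add k-means distances and t-SNE coordinates as extra stacking features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=500, n_features=20, random_state=2)

km = KMeans(n_clusters=8, n_init=10, random_state=2)
cluster_distances = km.fit_transform(X)            # distance to each of the 8 centroids

tsne_coords = TSNE(n_components=2, random_state=2).fit_transform(X)

meta_features = np.hstack([X, cluster_distances, tsne_coords])
print(meta_features.shape)                         # (500, 30)
```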

Online Stacking

I spent quite a lot of time working out an idea I had for online stacking: first create small fully random trees from the hashed binary representation. Subtract or add profit depending on whether the tree makes a correct prediction. Now take the most profitable and least profitable trees and add them to the feature representation.

It worked, but only on artificial data. For instance, a linear perceptron with online random tree stacking was able to learn a non-linear XOR-problem. It did not work on any real-life data I tried it on, and believe me, I tried. So from now on I’ll be suspicious of papers which only feature artificial data sets to showcase their new algorithm.

A similar idea did work for the author of the paper: random bit regression. Here many random linear functions are created from the features, and the best are found through heavy regularization. This I was able to replicate with success on some datasets. This will be the topic of a future post.

A more concrete example of (semi-)online stacking is with ad click prediction. Models trained on recent data perform better there. So when a dataset has a temporal effect, you could use Vowpal Wabbit to train on the entire dataset, and use a more complex and powerful tool like XGBoost to train on the last day of data. Then you stack the XGBoost predictions together with the samples and let Vowpal Wabbit do what it does best: optimizing loss functions.

The natural world is complex, so it figures that ensembling different models can capture more of this complexity. Ben Hamner ‘Machine learning best practices we’ve learned from hundreds of competitions’ (video)

Everything is a hyper-parameter

When doing stacking/blending/meta-modeling it is healthy to think of every action as a hyper-parameter for the stacker model.

So for instance:

  • Not scaling the data

  • Standard-scaling the data

    e.g. z-score standardization to zero mean and unit variance

  • Min-max scaling the data

    i.e. rescaling the data into the 0-1 range

are simply extra parameters to be tuned to improve the ensemble performance. Likewise, the number of base models to use can be seen as a parameter to optimize. Feature selection (top 70%) or imputation (impute missing features with a 0) are other examples of meta-parameters.

Just as a random gridsearch is a good candidate for tuning algorithm parameters, it also works for tuning these meta-parameters.

Sometimes it is useful to allow XGBoost to see what a KNN-classifier sees. – Marios Michailidis

Model Selection

You can further optimize scores by combining multiple ensembled models.

  • There is the ad-hoc approach: Use averaging, voting or rank averaging on manually-selected well-performing ensembles.

  • Greedy forward model selection (Caruana et al.). Start with a base ensemble of 3 or so good models. Add a model when it increases the train set score the most. By allowing put-back of models, a single model may be picked multiple times (weighing). A sketch of this greedy procedure appears after this list.

    Translator's note: a greedy algorithm picks which models to combine: add a candidate model, keep it if the score improves and drop it otherwise. It is a bit like bisecting k-means, where at each step you split whichever cluster yields the largest reduction in SSE (the sum of squared errors).

  • Genetic model selection uses genetic algorithms and CV-scores as the fitness function. See for instance inversion‘s solution ‘Strategy for top 25 position‘.

  • I use a fully random method inspired by Caruana’s method: Create a 100 or so ensembles from randomly selected ensembles (without placeback). Then pick the highest scoring model.

    Translator's note: build random combinations of models and keep the highest-scoring one.
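As an illustration of the greedy (Caruana-style) variant with put-back, here is a sketch that repeatedly adds whichever model most improves the AUC of the running average of out-of-fold predictions; the model names and predictions are synthetic.

```python
# Greedy forward selection with replacement over out-of-fold predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

def greedy_selection(oof_preds, y_true, n_rounds=10):
    """oof_preds: dict of model name -> out-of-fold probability vector."""
    selected, blend = [], np.zeros_like(y_true, dtype=float)
    for _ in range(n_rounds):
        scores = {name: roc_auc_score(y_true, (blend * len(selected) + p) / (len(selected) + 1))
                  for name, p in oof_preds.items()}
        best = max(scores, key=scores.get)                     # model that helps the most
        blend = (blend * len(selected) + oof_preds[best]) / (len(selected) + 1)
        selected.append(best)                                  # the same model may be re-added
    return selected, blend

# synthetic out-of-fold predictions from three hypothetical models
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
oof = {"rf": y * 0.6 + rng.random(1000) * 0.4,
       "gbm": y * 0.5 + rng.random(1000) * 0.5,
       "knn": rng.random(1000)}
print(greedy_selection(oof, y)[0])   # model names in the order they were added
```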

Automation

When stacking for the Otto product classification competition I quickly got a good top 10 spot. Adding more and more base models and bagging multiple stacked ensembles I was able to keep improving my score.

Once I had reached 7 base models stacked by 6 stackers, a sense of panic and gloom started to set in. Would I be able to replicate all of this? These complex and slow unwieldy models were out of my comfort zone of fast and simple Machine Learning.

I spent the rest of the competition building a way to automate stacking. For the base models, purely random algorithms with purely random parameters are trained. Wrappers were written to make classifiers like VW, Sofia-ML, RGF, MLP and XGBoost play nicely with the Scikit-learn API.


The first whiteboard sketch for a parallelized automated stacker with 3 buckets

For stackers I let the script use SVM, random forests, extremely randomized trees, GBM and XGBoost with random parameters and a random subset of base models.

Finally the created stackers are averaged when their fold-predictions on the train set produced a lower loss.

This automated stacker was able to reach 57th place a week before the competition ended. It contributed to my final ensemble. The only difference was that I never spent time tuning or selecting: I started the script, went to bed, and awoke to a good solution.

The automated stacker is able to get a top 10% score without any tuning or manual model selection on a competitive task with over 3000 competitors.

Automatic stacking is one of my new big interests. Expect a few follow-up articles on this. The best result of automatic stacking was found on the TUT Headpose Estimation challenge. This black-box solution beats the current state-of-the-art set by domain experts who created special-purpose algorithms for this particular problem.

Noteworthy: This was a multi-label classification problem. Predictions for both “yaw” and “pitch” were required. Since the “yaw” and “pitch”-labels of a head pose are interrelated, stacking a model with predictions for “yaw” increased the accuracy for “pitch” predictions and vice versa. An interesting result.

Models visualized as a network can be trained using back-propagation: the stacker models then learn which base models reduce the error the most.

Ensemble Network