Why cross-validation

TL;DR

Cross-validation (CV), most commonly k-fold CV, is a model evaluation method that is widely used when fine-tuning models.
Advantages:
1. Makes fuller use of the data;
2. Helps prevent overfitting^[1];
3. Gives an estimate of the variance of the model's performance on real data;
Disadvantages:
1. Computationally expensive;
Merely splitting the whole data set into a training set and a test set, and then tuning against the test set, may cause overfitting.
Implementation:
1. Split the whole data set into one training set and one test set;
2. Split the training set into k parts;
3. Repeat the experiment k times. Each time:
  1. Hold out a different part of the data and use it as the validation set;
  2. Train the model on the remaining k-1 parts;
  3. Validate the model on the validation set and record the result;
4. (optional) Average the k results.
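
The steps above can be written out by hand as a short sketch. This is only an illustration under assumptions that are not in the original post: the bundled diabetes data set, a decision tree with max_depth=3, k=4 and the fixed random seeds are all arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative data set (an assumption; any regression data would do).
X, y = load_diabetes(return_X_y=True)

# Step 1: hold out a test set that CV never touches.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the training set into k parts.
k = 4
folds = KFold(n_splits=k, shuffle=True, random_state=0)

# Step 3: k times, train on k-1 parts and validate on the held-out part.
scores = []
for train_idx, val_idx in folds.split(X_train):
    model = DecisionTreeRegressor(max_depth=3, random_state=0)
    model.fit(X_train[train_idx], y_train[train_idx])
    # score() returns the R^2 of the regressor on the validation part.
    scores.append(model.score(X_train[val_idx], y_train[val_idx]))

# Step 4 (optional): average the k results.
mean_score = float(np.mean(scores))
print("fold scores:", np.round(scores, 3))
print("mean score: {:.3f}".format(mean_score))
```

Note that X_test and y_test are created but never used inside the loop; they are reserved for the final check of the chosen model.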

Intuition & Motivation

It's easy to learn how to perform CV, but the concepts can be a bit confusing to machine learning beginners.
That's because the intuition behind CV isn't that straightforward, and the method evolved through several stages before reaching its present form.
To understand why we need CV even though it's expensive, and what goes wrong if we don't use it, we can look at some earlier validation methods and their problems.

Training/Test split

Given a bunch of data, it's intuitive to split it into a training set and a test set. We train the model on the training set, see how it does on the test set, and then use the result to refine the model. Everything seems right! No. It's correct to split the data into a training set and a test set, but it's wrong to fine-tune models with the test set. This is the first common misunderstanding about validation:

Do not use test set to fine-tune models

Even though the training and test data are kept separate, information about the test set can leak into the model when it is tuned this way repeatedly. You may keep optimizing the model until it yields good results on the test set, which overfits the model to the test set. What we want from the test set is an estimate of how well the model generalizes, in other words how it would perform on real data that it has never seen. The test set should be used only after the model has been properly trained. But how can we modify our models without knowing how they behave? This is why there is another set of data, called the validation set.
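
The leakage described above can be made concrete with a toy simulation. This is a sketch under loud assumptions, none of which come from the original post: the features and labels are pure noise, the "models" are 200 random linear rules, and "tuning" means keeping whichever rule happens to score best on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20  # arbitrary choice for the illustration


def make_noise_data(n_samples):
    """Features and binary labels that are statistically unrelated."""
    X = rng.normal(size=(n_samples, n_features))
    y = rng.integers(0, 2, size=n_samples)
    return X, y


def accuracy(weights, X, y):
    """Accuracy of the linear rule 'predict 1 when X @ weights > 0'."""
    return float(np.mean((X @ weights > 0).astype(int) == y))


X_test, y_test = make_noise_data(100)       # the test set we keep peeking at
X_fresh, y_fresh = make_noise_data(10_000)  # stands in for unseen real data

# "Tune" by trying 200 random candidate models and keeping the one that
# looks best on the test set.
candidates = rng.normal(size=(200, n_features))
test_scores = [accuracy(w, X_test, y_test) for w in candidates]
best = candidates[int(np.argmax(test_scores))]

best_test_acc = max(test_scores)
fresh_acc = accuracy(best, X_fresh, y_fresh)
print("accuracy on the reused test set: {:.3f}".format(best_test_acc))
print("accuracy on fresh data:          {:.3f}".format(fresh_acc))
```

No rule can truly beat chance here, so any score well above 0.5 on the reused test set comes purely from selecting on it, and it should fall back to roughly 0.5 on fresh data.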

Validation set ≠ Test set

Some beginners confuse these two concepts because they serve the same purpose: evaluating the model's performance. The difference is when they are used. The test set is held out from the data first and is never used in cross-validation. The rest of the data is then split into a validation set and a training set. The validation set is used to fine-tune your models. We may run the experiment repeatedly until we get a model that seems okay on the validation set, and finally we apply it to the test set and see how it goes. The bad thing is that the model might still overfit to the validation set. The good thing is that we now have scores on both the validation set and the test set. If they differ significantly, we have more evidence for judging whether the model is overfitting or underfitting.
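
A minimal sketch of this three-way split follows. The data set, the split ratios, the decision tree model and the range of max_depth values are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative data set (an assumption; any regression data would do).
X, y = load_diabetes(return_X_y=True)

# Hold out the test set first; it is not touched during tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Split what is left into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Fine-tune against the validation set only.
best_depth, best_val_score = None, float("-inf")
for depth in range(3, 10):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_score = model.score(X_val, y_val)  # R^2 on the validation set
    if val_score > best_val_score:
        best_depth, best_val_score = depth, val_score

# Use the test set once, after tuning is finished.
final_model = DecisionTreeRegressor(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)

print("best max_depth: {}".format(best_depth))
print("validation R^2: {:.3f}".format(best_val_score))
print("test R^2:       {:.3f}".format(test_score))
```

Comparing the last two printed numbers is the check mentioned above: a large gap between the validation score and the test score hints that the tuning has overfit to the validation set.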

It's nearly perfect, except that too much data is spent purely on evaluation and the model is now rather sensitive to how we split the training set. K-fold cross-validation eases these pain points to some degree by training and validating the model k times. And here comes the third common misunderstanding.

The entire CV is done on training set

You may want to look back at the steps for performing CV in the TL;DR. We perform CV only on the training set to evaluate the model. In most cases there are many candidate parameter settings, and we run a full CV for each of them (definitely time-consuming) to see which one gets the highest score. Finally, we feed the test data to the trained model. Because CV gives us k results, it also helps us estimate how stable the model's performance is (e.g. from the standard deviation of these results).
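
As a rough sketch of running one CV per candidate setting and reading off both the mean and the spread of the k results, the snippet below uses scikit-learn's cross_val_score. The data set, the candidate max_depth values and the random seeds are assumptions for illustration only.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative data set (an assumption; any regression data would do).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The same 4 folds of the training set are reused for every candidate.
folds = KFold(n_splits=4, shuffle=True, random_state=0)

summary = {}
for depth in range(3, 10):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # One full CV per candidate: 4 fits, 4 validation R^2 scores.
    scores = cross_val_score(model, X_train, y_train, cv=folds, scoring="r2")
    summary[depth] = (scores.mean(), scores.std())
    print("max_depth={}: mean R^2 {:.3f}, std {:.3f}".format(
        depth, scores.mean(), scores.std()))

# Pick the candidate with the highest mean validation score.
best_depth = max(summary, key=lambda d: summary[d][0])
print("selected max_depth: {}".format(best_depth))
```

The standard deviation column is the per-candidate stability estimate: two settings with similar means but very different spreads are not equally trustworthy.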

Best practice

Below is an implementation of CV in scikit-learn. The results are not perfect, but it roughly shows how CV is combined with grid search for parameter fine-tuning.

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import make_scorer, r2_score

# Load data
# Note: load_boston was removed in scikit-learn 1.2, so the bundled diabetes
# regression data set is used here as a stand-in.

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2)


# Fit model with Grid search CV

def fit_model(X, y):
    """Grid-search max_depth with 4-fold CV and return the best estimator."""

    cross_validator = KFold(4)
    regressor = DecisionTreeRegressor(random_state=0)
    params = {"max_depth": list(range(3, 10))}
    scoring_func = make_scorer(r2_score, greater_is_better=True)
    grid = GridSearchCV(estimator=regressor,
                        param_grid=params,
                        scoring=scoring_func,
                        cv=cross_validator)

    # Runs one 4-fold CV per candidate max_depth, then refits the best
    # candidate on all of X, y.
    grid.fit(X, y)

    return grid.best_estimator_


optimal_reg = fit_model(X_train, y_train)

print("Parameter 'max_depth' is {} for the optimal model.".format(optimal_reg.get_params()['max_depth']))


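# Test model (the test set is used only once, after tuning)

y_predict = optimal_reg.predict(X_test)
# r2_score expects the true values first, then the predictions.
r2 = r2_score(y_test, y_predict)
print("Optimal model has r2 score: {:,.2f} on test data".format(r2))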

[1] To be more precise, overfitting cannot be completely avoided, but CV can help to identify it.
