Deciding What to Try Next
Take house-price prediction as an example. Suppose we train a regularized linear regression model to predict prices, with cost function

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

When we use the trained model to predict prices, we find that it makes unacceptably large errors. What should we try next?
Several options come to mind:
- Get more training examples;
- Try a smaller set of features;
- Try getting additional features;
- Try adding polynomial features;
- Try decreasing the regularization parameter λ;
- Try increasing the regularization parameter λ;
- ...
Any of these might help, or might not, and we certainly should not choose among them at random in practice. It is also worth noting that any one of these options can easily turn into a project lasting six months or more. We therefore need machine learning diagnostics to tell us which next step is actually worth our time.
Question:
Which of the following statements about diagnostics are true? Check all that apply.
A. It's hard to tell what will work to improve a learning algorithm, so the best approach is to go with gut feeling and just see what works.
B. Diagnostics can give guidance as to what might be more fruitful things to try to improve a learning algorithm.
C. Diagnostics can be time-consuming to implement and try, but they can still be a very good use of your time.
D. A diagnostic can sometimes rule out certain courses of action (changes to your learning algorithm) as being unlikely to improve its performance significantly.
It is not hard to see that B, C, and D are the correct answers.
Evaluating a Hypothesis
For the underfitting and overfitting problems discussed earlier, we diagnosed them by plotting the hypothesis. When the training set has many feature variables, however, we can no longer visualize the fitted function.
Instead, we can split the dataset into two parts: a training set containing 70% of the data and a test set containing the remaining 30%. We first minimize the cost function J(θ) on the training set, then measure the model's error on the test set, which tells us whether the model is underfitting or overfitting.
Note: if the dataset is ordered in some way, choose the 70% training examples and the 30% test examples at random rather than taking them in order.
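As a minimal sketch of such a random split (using NumPy; the data matrix X and label vector y are hypothetical placeholders):

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.3, seed=0):
    """Randomly split (X, y) into a training set and a test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])   # shuffle so any ordering in the data is broken
    m_test = int(X.shape[0] * test_ratio)
    test_idx, train_idx = idx[:m_test], idx[m_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```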
Linear regression
- Minimize the cost function J(θ) on the training set to obtain the parameter vector θ
- Compute the error on the test set:

$$J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left(h_\theta(x_{test}^{(i)}) - y_{test}^{(i)}\right)^2$$
Logistic regression
- Minimize the cost function J(θ) on the training set to obtain the parameter vector θ
- Compute the error on the test set:

$$J_{test}(\theta) = -\frac{1}{m_{test}}\sum_{i=1}^{m_{test}}\left[y_{test}^{(i)}\log h_\theta(x_{test}^{(i)}) + \left(1 - y_{test}^{(i)}\right)\log\left(1 - h_\theta(x_{test}^{(i)})\right)\right]$$
In addition, for logistic regression we can compute the misclassification rate, which gives a more direct picture of the model's error. Define the per-example 0/1 error

$$err(h_\theta(x), y) = \begin{cases}1 & \text{if } h_\theta(x) \ge 0.5 \text{ and } y = 0, \ \text{ or } h_\theta(x) < 0.5 \text{ and } y = 1\\0 & \text{otherwise}\end{cases}$$

The test error can then be rewritten as:

$$Test\ Error = \frac{1}{m_{test}}\sum_{i=1}^{m_{test}} err\left(h_\theta(x_{test}^{(i)}),\, y_{test}^{(i)}\right)$$
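As a minimal sketch of both error computations (NumPy; theta, X_test, and y_test are hypothetical placeholders, with X_test already including the intercept column):

```python
import numpy as np

def linreg_test_error(theta, X_test, y_test):
    """Squared-error test cost J_test(theta) for linear regression."""
    m_test = X_test.shape[0]
    h = X_test @ theta                            # h_theta(x) = theta^T x
    return np.sum((h - y_test) ** 2) / (2 * m_test)

def logreg_misclassification_error(theta, X_test, y_test):
    """Average 0/1 misclassification error for logistic regression."""
    h = 1.0 / (1.0 + np.exp(-(X_test @ theta)))   # sigmoid hypothesis
    predictions = (h >= 0.5).astype(int)          # threshold at 0.5
    return np.mean(predictions != y_test)         # fraction misclassified
```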
Question:
Suppose an implementation of linear regression (without regularization) is badly overfitting the training set. In this case, we would expect:
A. The training error J(θ) to be low and the test error Jtest(θ) to be high
B. The training error J(θ) to be low and the test error Jtest(θ) to be low
C. The training error J(θ) to be high and the test error Jtest(θ) to be low
D. The training error J(θ) to be high and the test error Jtest(θ) to be high
It is not hard to see that A is the correct answer: an overfit model fits the training set almost perfectly but generalizes poorly, so J(θ) is low while Jtest(θ) is high.
Supplementary Notes
Evaluating a Hypothesis
Once we have done some troubleshooting for errors in our predictions by:
- Getting more training examples
- Trying smaller sets of features
- Trying additional features
- Trying polynomial features
- Increasing or decreasing λ
We can move on to evaluate our new hypothesis.
A hypothesis may have a low error for the training examples but still be inaccurate (because of overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up the data into two sets: a training set and a test set. Typically, the training set consists of 70% of your data and the test set is the remaining 30%.
The new procedure using these two sets is then:
- Learn Θ and minimize Jtrain(Θ) using the training set
- Compute the test set error Jtest(Θ)
The test set error
- For linear regression:

$$J_{test}(\Theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left(h_\Theta(x_{test}^{(i)}) - y_{test}^{(i)}\right)^2$$

- For classification ~ Misclassification error (aka 0/1 misclassification error):

$$err(h_\Theta(x), y) = \begin{cases}1 & \text{if } h_\Theta(x) \ge 0.5 \text{ and } y = 0, \ \text{ or } h_\Theta(x) < 0.5 \text{ and } y = 1\\0 & \text{otherwise}\end{cases}$$

This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is:

$$Test\ Error = \frac{1}{m_{test}}\sum_{i=1}^{m_{test}} err\left(h_\Theta(x_{test}^{(i)}),\, y_{test}^{(i)}\right)$$
This gives us the proportion of the test data that was misclassified.
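Tying this together, here is a hedged end-to-end sketch: fit ordinary least squares on the training set via the normal equation (one standard way to minimize the linear-regression cost) and compare training and test error; a low J_train alongside a high J_test is exactly the overfitting signature from the question above. The data arrays are hypothetical placeholders:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Minimize the linear-regression cost via the normal equation."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def squared_error_cost(theta, X, y):
    """J(theta) = (1 / 2m) * sum of squared residuals."""
    return np.sum((X @ theta - y) ** 2) / (2 * X.shape[0])

# hypothetical data from a 70/30 split:
# theta = fit_normal_equation(X_train, y_train)
# j_train = squared_error_cost(theta, X_train, y_train)  # low if overfitting
# j_test  = squared_error_cost(theta, X_test, y_test)    # high if overfitting
```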
Model Selection and Splitting the Data into Training, Cross-Validation, and Test Sets
Suppose we want to choose a suitable model from among the following polynomials:
- $d = 1$: $h_\theta(x) = \theta_0 + \theta_1 x$
- $d = 2$: $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
- $d = 3$: $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
- ...
- $d = 10$: $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_{10} x^{10}$
where the parameter d is the degree of the polynomial. Suppose we start by splitting the data into just a training set and a test set, as before, and pick the degree whose test error is smallest.
Say the test error turns out to be smallest at d = 5. All we have found, however, is the model that happens to fit the test set best: because the extra parameter d was itself chosen using the test set, the test error is no longer an honest estimate of the generalization error.
We therefore should no longer split the data into only two parts. Instead we introduce a cross-validation set, dividing the data into three parts: a training set (60%), a cross-validation set (20%), and a test set (20%).
For the example above, we compute a separate error on each of the three sets:

$$J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$$J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}\left(h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)}\right)^2$$

$$J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left(h_\theta(x_{test}^{(i)}) - y_{test}^{(i)}\right)^2$$

We pick the degree whose cross-validation error is smallest; in this example, suppose that turns out to be d = 4. The test set is then reserved for estimating the generalization error of the chosen model.
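A minimal model-selection sketch under these assumptions (NumPy only; the 1-D feature arrays x_train, x_cv, x_test and targets y_train, y_cv, y_test are hypothetical placeholders from a 60/20/20 split):

```python
import numpy as np

def poly_features(x, d):
    """Map a 1-D feature vector to columns [1, x, x^2, ..., x^d]."""
    return np.vander(x, d + 1, increasing=True)

def fit_normal_equation(X, y):
    """Minimize the squared-error cost via the normal equation."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def cost(theta, X, y):
    """J(theta) = (1 / 2m) * sum of squared residuals."""
    return np.sum((X @ theta - y) ** 2) / (2 * X.shape[0])

def select_degree(x_train, y_train, x_cv, y_cv, max_degree=10):
    """Fit each degree on the training set; choose d by cross-validation error."""
    best = (None, np.inf, None)                      # (degree, J_cv, theta)
    for d in range(1, max_degree + 1):
        theta = fit_normal_equation(poly_features(x_train, d), y_train)
        j_cv = cost(theta, poly_features(x_cv, d), y_cv)
        if j_cv < best[1]:
            best = (d, j_cv, theta)
    return best[0], best[2]

# The test set is touched only once, to estimate the generalization error:
# d, theta = select_degree(x_train, y_train, x_cv, y_cv)
# j_test = cost(theta, poly_features(x_test, d), y_test)
```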
Question:
Consider the model selection procedure where we choose the degree of polynomial using a cross validation set. For the final model (with parameters θ), we might generally expect JCV(θ) to be lower than Jtest(θ) because:
A. An extra parameter (d, the degree of the polynomial) has been fit to the cross validation set.
B. An extra parameter (d, the degree of the polynomial) has been fit to the test set.
C. The cross validation set is usually smaller than the test set.
D. The cross validation set is usually larger than the test set.
In summary, it is not hard to see that A is the correct answer.
Supplementary Notes
Model Selection and Train/Validation/Test Sets
Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis. It could overfit, and as a result your predictions on the test set would be poor. The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any other data set.
Given many models with different polynomial degrees, we can use a systematic approach to identify the 'best' function. In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.
One way to break down our dataset into the three sets is:
- Training set: 60%
- Cross validation set: 20%
- Test set: 20%
We can now calculate three separate error values for the three different sets using the following method:
- Optimize the parameters in Θ using the training set for each polynomial degree.
- Find the polynomial degree d with the least error using the cross validation set.
- Estimate the generalization error using the test set with Jtest(Θ(d)), where Θ(d) is the parameter vector learned for the degree d with the lowest cross-validation error;
This way, the degree of the polynomial d has not been trained using the test set.