This article is a translation of the learning_curve function documentation on scikit-learn.org, together with the User Guide section it references (learning_curve).
Signature:
learning_curve(
estimator,
X,
y,
*,
groups=None,
train_sizes=array([0.1 , 0.325, 0.55 , 0.775, 1. ]),
cv=None,
scoring=None,
exploit_incremental_learning=False,
n_jobs=None,
pre_dispatch='all',
verbose=0,
shuffle=False,
random_state=None,
error_score=nan,
return_times=False,
fit_params=None,
)
Description
Learning curve.
Determines cross-validated training and test scores for different training set sizes.
A cross-validation generator splits the whole dataset k times into training and test data. Subsets of the training set with varying sizes will be used to train the estimator, and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.
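As a quick illustration of how that averaging shows up in the output (this sketch is not part of the original documentation; the iris dataset and the linear-kernel SVC are arbitrary choices), the returned score arrays have one row per training-set size and one column per cross-validation run, so averaging over the columns yields the curve:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

# Illustrative dataset and estimator (assumptions, not part of the docstring)
X, y = load_iris(return_X_y=True)
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="linear"), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)
print(train_sizes)                # absolute training-set sizes actually used
print(train_scores.shape)         # (n_ticks, n_cv_folds)
print(train_scores.mean(axis=1))  # average training score per size
print(test_scores.mean(axis=1))   # average validation score per size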
Read more in the User Guide.
The referenced User Guide section follows:
3.4. Validation curves: plotting scores to evaluate models
Every estimator has its advantages and drawbacks. Its generalization error can be decomposed into bias, variance and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data itself.
In the plot below we see the function \(f(x) = \cos(\frac{3}{2}\pi x)\) and some noisy samples from that function. Three different estimators are used to fit the function: linear regression with polynomial features of degree 1, 4 and 15. The first estimator can at best provide only a poor fit to the samples and the true function because it is too simple (high bias); the second estimator approximates it almost perfectly; and the last one fits the training data perfectly but does not fit the true function very well, i.e. it is very sensitive to varying training data (high variance).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def true_fun(X):
    return np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline(
        [
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression),
        ]
    )
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using crossvalidation
    scores = cross_val_score(
        pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10
    )
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title(
        "Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
            degrees[i], -scores.mean(), scores.std()
        )
    )
plt.show()
Bias and variance are inherent properties of estimators, and we usually have to select learning algorithms and hyperparameters so that both bias and variance are as low as possible (see Bias-variance dilemma). Another way to reduce the variance of a model is to use more training data. However, you should only collect more training data if the true function is too complex to be approximated well by an estimator with lower variance.
In the simple one-dimensional problem above it is easy to see whether the estimator suffers from bias or variance. In high-dimensional spaces, however, models can become very hard to visualize. For this reason it is often helpful to use the tools described below.
3.4.1. Validation curve
To validate a model we need a scoring function (see Metrics and scoring: quantifying the quality of predictions), for example accuracy for classifiers. The proper way of choosing multiple hyperparameters of an estimator is of course grid search or a similar method (see Tuning the hyper-parameters of an estimator) that selects the hyperparameters with the maximum score on a validation set or multiple validation sets.
Note that if we optimize the hyperparameters based on a validation score, the validation score becomes biased and is no longer a good estimate of the generalization. To get a proper estimate of the generalization we have to compute the score on another test set.
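As a minimal sketch of this point (not from the original text; the dataset, estimator and parameter grid are illustrative assumptions), one would hold out a test set first, tune on the remaining data with cross-validation, and only then score on the held-out data:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Keep a test set aside before any hyperparameter selection
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune the hyperparameter on the training part only, via cross-validation
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

print(search.best_score_)            # validation score: biased by the selection itself
print(search.score(X_test, y_test))  # estimate of generalization on unseen data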
However, it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score, to find out whether the estimator is overfitting or underfitting for some hyperparameter values. The function validation_curve can help in this case:
>>> import numpy as np
>>> from sklearn.model_selection import validation_curve
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import Ridge
>>> np.random.seed(0)
>>> X, y = load_iris(return_X_y=True)
>>> indices = np.arange(y.shape[0])
>>> np.random.shuffle(indices)
>>> X, y = X[indices], y[indices]
>>> train_scores, valid_scores = validation_curve(
... Ridge(), X, y, param_name="alpha", param_range=np.logspace(-7, 3, 3),
... cv=5)
>>> train_scores
array([[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.51..., 0.52..., 0.49..., 0.47..., 0.49...]])
>>> valid_scores
array([[0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
[0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
[0.46..., 0.25..., 0.50..., 0.49..., 0.52...]])
If the training score and the validation score are both low, the estimator is underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise it is working fairly well. A low training score with a high validation score is usually not possible. The plot below shows underfitting, overfitting and a well-working model for an SVM on the digits dataset, for different values of the kernel parameter \(\gamma\).
Code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve
X, y = load_digits(return_X_y=True)
subset_mask = np.isin(y, [1, 2]) # binary classification: 1 vs 2
X, y = X[subset_mask], y[subset_mask]
param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
SVC(),
X,
y,
param_name="gamma",
param_range=param_range,
scoring="accuracy",
n_jobs=2,
)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.title("Validation Curve with SVM")
plt.xlabel(r"$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(
param_range, train_scores_mean, label="Training score", color="darkorange", lw=lw
)
plt.fill_between(
param_range,
train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std,
alpha=0.2,
color="darkorange",
lw=lw,
)
plt.semilogx(
param_range, test_scores_mean, label="Cross-validation score", color="navy", lw=lw
)
plt.fill_between(
param_range,
test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std,
alpha=0.2,
color="navy",
lw=lw,
)
plt.legend(loc="best")
plt.show()
3.4.2. Learning curve
A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. Consider the following example, where we plot the learning curves of a naive Bayes classifier and an SVM.
For the naive Bayes classifier, both the validation score and the training score converge to a value that is quite low as the size of the training set increases. Thus, we will probably not benefit much from more training data.
In contrast, for small amounts of data the training score of the SVM is much greater than the validation score. Adding more training samples will most likely increase generalization.
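The two learning curves discussed above can be reproduced with a sketch along the following lines (the digits dataset, GaussianNB and the RBF-kernel SVC with gamma=0.001 are illustrative assumptions, not part of the original text):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

for est, name in [(GaussianNB(), "Naive Bayes"), (SVC(gamma=0.001), "SVM (RBF kernel)")]:
    sizes, train_scores, valid_scores = learning_curve(
        est, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5), n_jobs=2
    )
    # One curve for the training score, one for the cross-validation score
    plt.plot(sizes, train_scores.mean(axis=1), "o-", label=f"{name}: training score")
    plt.plot(sizes, valid_scores.mean(axis=1), "o--", label=f"{name}: cross-validation score")

plt.xlabel("Training examples")
plt.ylabel("Score")
plt.legend(loc="best")
plt.show()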
We can use the function learning_curve to generate the values that are required to plot such a learning curve (the numbers of samples that have been used, the average scores on the training sets and the average scores on the validation sets):
>>> from sklearn.model_selection import learning_curve
>>> from sklearn.svm import SVC
>>> train_sizes, train_scores, valid_scores = learning_curve(
... SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
>>> train_sizes
array([ 50, 80, 110])
>>> train_scores
array([[0.98..., 0.98 , 0.98..., 0.98..., 0.98...],
[0.98..., 1. , 0.98..., 0.98..., 0.98...],
[0.98..., 1. , 0.98..., 0.98..., 0.99...]])
>>> valid_scores
array([[1. , 0.93..., 1. , 1. , 0.96...],
[1. , 0.96..., 1. , 1. , 0.96...],
[1. , 0.96..., 1. , 1. , 0.96...]])
Parameters
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.
X : array-like of shape (n_samples, n_features)
Training vector, where `n_samples` is the number of samples and
`n_features` is the number of features.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
Target relative to X for classification or regression;
None for unsupervised learning.
groups : array-like of shape (n_samples,), default=None
Group labels for the samples used while splitting the dataset into
train/test set. Only used in conjunction with a "Group" :term:`cv`
instance (e.g., :class:`GroupKFold`).
train_sizes : array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5)
Relative or absolute numbers of training examples that will be used to
generate the learning curve. If the dtype is float, it is regarded as a
fraction of the maximum size of the training set (that is determined
by the selected validation method), i.e. it has to be within (0, 1].
Otherwise it is interpreted as absolute sizes of the training sets.
Note that for classification the number of samples usually has to
be big enough to contain at least one sample from each class.
cv : int, cross-validation generator or an iterable, default=None
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a `(Stratified)KFold`,
- :term:`CV splitter`,
- An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and ``y`` is
either binary or multiclass, :class:`StratifiedKFold` is used. In all
other cases, :class:`KFold` is used. These splitters are instantiated
with `shuffle=False` so the splits will be the same across calls.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validation strategies that can be used here.
.. versionchanged:: 0.22
``cv`` default value if None changed from 3-fold to 5-fold.
scoring : str or callable, default=None
A str (see model evaluation documentation) or
a scorer callable object / function with signature
``scorer(estimator, X, y)``.
exploit_incremental_learning : bool, default=False
If the estimator supports incremental learning, this will be
used to speed up fitting for different training set sizes.
n_jobs : int, default=None
Number of jobs to run in parallel. Training the estimator and computing
the score are parallelized over the different training and test sets.
``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.
pre_dispatch : int or str, default='all'
Number of predispatched jobs for parallel execution (default is
all). The option can reduce the allocated memory. The str can
be an expression like '2*n_jobs'.
verbose : int, default=0
Controls the verbosity: the higher, the more messages.
shuffle : bool, default=False
Whether to shuffle training data before taking prefixes of it
based on ``train_sizes``.
random_state : int, RandomState instance or None, default=None
Used when ``shuffle`` is True. Pass an int for reproducible
output across multiple function calls.
See :term:`Glossary <random_state>`.
error_score : 'raise' or numeric, default=np.nan
Value to assign to the score if an error occurs in estimator fitting.
If set to 'raise', the error is raised.
If a numeric value is given, FitFailedWarning is raised.
.. versionadded:: 0.20
return_times : bool, default=False
Whether to return the fit and score times.
fit_params : dict, default=None
Parameters to pass to the fit method of the estimator.
.. versionadded:: 0.24
Returns
train_sizes_abs : array of shape (n_unique_ticks,)
Numbers of training examples that have been used to generate the
learning curve. Note that the number of ticks might be less
than n_ticks because duplicate entries will be removed.
train_scores : array of shape (n_ticks, n_cv_folds)
Scores on training sets.
test_scores : array of shape (n_ticks, n_cv_folds)
Scores on test set.
fit_times : array of shape (n_ticks, n_cv_folds)
Times spent for fitting in seconds. Only present if ``return_times`` is True.
score_times : array of shape (n_ticks, n_cv_folds)
Times spent for scoring in seconds. Only present if ``return_times`` is True.
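To tie several of the parameters and return values above together, a hedged usage sketch might look like this (the digits dataset, the SVC and the ShuffleSplit configuration are illustrative assumptions):
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve, ShuffleSplit
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# An explicit CV splitter instead of the default 5-fold strategy
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

train_sizes_abs, train_scores, test_scores, fit_times, score_times = learning_curve(
    SVC(gamma=0.001),
    X,
    y,
    cv=cv,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the maximum training-set size
    shuffle=True,
    random_state=0,
    return_times=True,  # also return fit_times and score_times
    n_jobs=2,
)

print(train_sizes_abs)           # absolute sizes actually used (duplicates removed)
print(test_scores.mean(axis=1))  # average cross-validation score per size
print(fit_times.mean(axis=1))    # average fit time per size, in seconds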