This article is a translation of the learning_curve function documentation on scikit-learn.org, together with the User Guide section it references (learning_curve).
Signature:
learning_curve(
estimator,
X,
y,
*,
groups=None,
train_sizes=array([0.1 , 0.325, 0.55 , 0.775, 1. ]),
cv=None,
scoring=None,
exploit_incremental_learning=False,
n_jobs=None,
pre_dispatch='all',
verbose=0,
shuffle=False,
random_state=None,
error_score=nan,
return_times=False,
fit_params=None,
)
Description
Learning curve.
Determines cross-validated training and test scores for different training set sizes.
A cross-validation generator splits the whole dataset k times into training and test data. Subsets of the training set with varying sizes will be used to train the estimator, and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.
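As a quick illustration of how that averaging shows up in the output (this sketch is not part of the original documentation; the iris dataset and the linear-kernel SVC are arbitrary choices), the returned score arrays have one row per training-set size and one column per cross-validation run, so averaging over the columns yields the curve:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

# Illustrative dataset and estimator (assumptions, not part of the docstring)
X, y = load_iris(return_X_y=True)
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="linear"), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)
print(train_sizes)                # absolute training-set sizes actually used
print(train_scores.shape)         # (n_ticks, n_cv_folds)
print(train_scores.mean(axis=1))  # average training score per size
print(test_scores.mean(axis=1))   # average validation score per size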
Read more in the User Guide.
The referenced User Guide section follows:
3.4. Validation curves: plotting scores to evaluate models
Every estimator has its advantages and drawbacks. Its generalization error can be decomposed into bias, variance and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data itself.
In the plot below we see the function \(f(x) = \cos(\frac{3}{2}\pi x)\) and some noisy samples from that function. Three different estimators are used to fit the function: linear regression with polynomial features of degree 1, 4 and 15. The first estimator can at best provide only a poor fit to the samples and the true function because it is too simple (high bias); the second estimator approximates it almost perfectly; and the last one fits the training data perfectly but does not fit the true function very well, i.e. it is very sensitive to varying training data (high variance).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def true_fun(X):
    return np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline(
        [
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression),
        ]
    )
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using crossvalidation
    scores = cross_val_score(
        pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10
    )
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title(
        "Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
            degrees[i], -scores.mean(), scores.std()
        )
    )
plt.show()
Bias and variance are inherent properties of estimators, and we usually have to select learning algorithms and hyperparameters so that both bias and variance are as low as possible (see Bias-variance dilemma). Another way to reduce the variance of a model is to use more training data. However, you should only collect more training data if the true function is too complex to be approximated well by an estimator with lower variance.
In the simple one-dimensional problem above it is easy to see whether the estimator suffers from bias or variance. In high-dimensional spaces, however, models can become very hard to visualize. For this reason it is often helpful to use the tools described below.
3.4.1. Validation curve
To validate a model we need a scoring function (see Metrics and scoring: quantifying the quality of predictions), for example accuracy for classifiers. The proper way of choosing multiple hyperparameters of an estimator is of course grid search or a similar method (see Tuning the hyper-parameters of an estimator) that selects the hyperparameters with the maximum score on a validation set or multiple validation sets.
Note that if we optimize the hyperparameters based on a validation score, the validation score becomes biased and is no longer a good estimate of the generalization. To get a proper estimate of the generalization we have to compute the score on another test set.
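As a minimal sketch of this point (not from the original text; the dataset, estimator and parameter grid are illustrative assumptions), one would hold out a test set first, tune on the remaining data with cross-validation, and only then score on the held-out data:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Keep a test set aside before any hyperparameter selection
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune the hyperparameter on the training part only, via cross-validation
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

print(search.best_score_)            # validation score: biased by the selection itself
print(search.score(X_test, y_test))  # estimate of generalization on unseen data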
However, it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score, to find out whether the estimator is overfitting or underfitting for some hyperparameter values. The function validation_curve can help in this case:
>>> import numpy as np
>>> from sklearn.model_selection import validation_curve
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import Ridge
>>> np.random.seed(0)
>>> X, y = load_iris(return_X_y=True)
>>> indices = np.arange(y.shape[0])
>>> np.random.shuffle(indices)
>>> X, y = X[indices], y[indices]
>>> train_scores, valid_scores = validation_curve(
... Ridge(), X, y, param_name="alpha", param_range=np.logspace(-7, 3, 3),
... cv=5)
>>> train_scores
array([[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.51..., 0.52..., 0.49..., 0.47..., 0.49...]])
>>> valid_scores
array([[0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
[0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
[0.46..., 0.25..., 0.50..., 0.49..., 0.52...]])
If the training score and the validation score are both low, the estimator is underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise it is working fairly well. A low training score with a high validation score is usually not possible. The plot below shows underfitting, overfitting and a well-working model for an SVM on the digits dataset, for different values of the kernel parameter \(\gamma\).
Code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve
X, y = load_digits(return_X_y=True)
subset_mask = np.isin(y, [1, 2]) # binary classification: 1 vs 2
X, y = X[subset_mask], y[subset_mask]
param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
SVC(),
X,
y,
param_name="gamma",
param_range=param_range,
scoring="accuracy",
n_jobs=2,
)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.title("Validation Curve with SVM")
plt.xlabel(r"$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(
param_range, train_scores_mean, label="Training score", color="darkorange", lw=lw
)
plt.fill_between(
param_range,
train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std,
alpha=0.2,
color="darkorange",
lw=lw,
)
plt.semilogx(
param_range, test_scores_mean, label="Cross-validation score", color="navy", lw=lw
)
plt.fill_between(
param_range,
test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std,
alpha=0.2,
color="navy",
lw=lw,
)
plt.legend(loc="best")
plt.show()
3.4.2. Learning curve
A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. Consider the following example, where we plot the learning curves of a naive Bayes classifier and an SVM.
For the naive Bayes classifier, both the validation score and the training score converge to a value that is quite low as the size of the training set increases. Thus, we will probably not benefit much from more training data.
In contrast, for small amounts of data the training score of the SVM is much greater than the validation score. Adding more training samples will most likely increase generalization.
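The two learning curves discussed above can be reproduced with a sketch along the following lines (the digits dataset, GaussianNB and the RBF-kernel SVC with gamma=0.001 are illustrative assumptions, not part of the original text):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

for est, name in [(GaussianNB(), "Naive Bayes"), (SVC(gamma=0.001), "SVM (RBF kernel)")]:
    sizes, train_scores, valid_scores = learning_curve(
        est, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5), n_jobs=2
    )
    # One curve for the training score, one for the cross-validation score
    plt.plot(sizes, train_scores.mean(axis=1), "o-", label=f"{name}: training score")
    plt.plot(sizes, valid_scores.mean(axis=1), "o--", label=f"{name}: cross-validation score")

plt.xlabel("Training examples")
plt.ylabel("Score")
plt.legend(loc="best")
plt.show()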
We can use the function learning_curve to generate the values that are required to plot such a learning curve (the numbers of samples that have been used, the average scores on the training sets and the average scores on the validation sets):
>>> from sklearn.model_selection import learning_curve
>>> from sklearn.svm import SVC
>>> train_sizes, train_scores, valid_scores = learning_curve(
... SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
>>> train_sizes
array([ 50, 80, 110])
>>> train_scores
array([[0.98..., 0.98 , 0.98..., 0.98..., 0.98...],
[0.98..., 1. , 0.98..., 0.98..., 0.98...],
[0.98..., 1. , 0.98..., 0.98..., 0.99...]])
>>> valid_scores
array([[1. , 0.93..., 1. , 1. , 0.96...],
[1. , 0.96..., 1. , 1. , 0.96...],
[1. , 0.96..., 1. , 1. , 0.96...]])
Parameters
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.
X : array-like of shape (n_samples, n_features)
Training vector, where `n_samples` is the number of samples and
`n_features` is the number of features.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
Target relative to X for classification or regression;
None for unsupervised learning.
groups : array-like of shape (n_samples,), default=None
Group labels for the samples used while splitting the dataset into
train/test set. Only used in conjunction with a "Group" :term:`cv`
instance (e.g., :class:`GroupKFold`).
train_sizes : array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5)
Relative or absolute numbers of training examples that will be used to
generate the learning curve. If the dtype is float, it is regarded as a
fraction of the maximum size of the training set (that is determined
by the selected validation method), i.e. it has to be within (0, 1].
Otherwise it is interpreted as absolute sizes of the training sets.
Note that for classification the number of samples usually has to
be big enough to contain at least one sample from each class.
cv : int, cross-validation generator or an iterable, default=None
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a `(Stratified)KFold`,
- :term:`CV splitter`,
- An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and ``y`` is
either binary or multiclass, :class:`StratifiedKFold` is used. In all
other cases, :class:`KFold` is used. These splitters are instantiated
with `shuffle=False` so the splits will be the same across calls.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validation strategies that can be used here.
.. versionchanged:: 0.22
``cv`` default value if None changed from 3-fold to 5-fold.
scoring : str or callable, default=None
A str (see model evaluation documentation) or
a scorer callable object / function with signature
``scorer(estimator, X, y)``.
exploit_incremental_learning : bool, default=False
If the estimator supports incremental learning, this will be
used to speed up fitting for different training set sizes.
n_jobs : int, default=None
Number of jobs to run in parallel. Training the estimator and computing
the score are parallelized over the different training and test sets.
``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.
pre_dispatch : int or str, default='all'
Number of predispatched jobs for parallel execution (default is
all). The option can reduce the allocated memory. The str can
be an expression like '2*n_jobs'.
verbose : int, default=0
Controls the verbosity: the higher, the more messages.
shuffle : bool, default=False
Whether to shuffle training data before taking prefixes of it
based on ``train_sizes``.
random_state : int, RandomState instance or None, default=None
Used when ``shuffle`` is True. Pass an int for reproducible
output across multiple function calls.
See :term:`Glossary <random_state>`.
error_score : 'raise' or numeric, default=np.nan
Value to assign to the score if an error occurs in estimator fitting.
If set to 'raise', the error is raised.
If a numeric value is given, FitFailedWarning is raised.
.. versionadded:: 0.20
return_times : bool, default=False
Whether to return the fit and score times.
fit_params : dict, default=None
Parameters to pass to the fit method of the estimator.
.. versionadded:: 0.24
Returns
train_sizes_abs : array of shape (n_unique_ticks,)
Numbers of training examples that have been used to generate the
learning curve. Note that the number of ticks might be less
than n_ticks because duplicate entries will be removed.
train_scores : array of shape (n_ticks, n_cv_folds)
Scores on training sets.
test_scores : array of shape (n_ticks, n_cv_folds)
Scores on test set.
fit_times : array of shape (n_ticks, n_cv_folds)
Times spent for fitting in seconds. Only present if ``return_times`` is True.
score_times : array of shape (n_ticks, n_cv_folds)
Times spent for scoring in seconds. Only present if ``return_times`` is True.
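To tie several of the parameters and return values above together, a hedged usage sketch might look like this (the digits dataset, the SVC and the ShuffleSplit configuration are illustrative assumptions):
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve, ShuffleSplit
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# An explicit CV splitter instead of the default 5-fold strategy
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

train_sizes_abs, train_scores, test_scores, fit_times, score_times = learning_curve(
    SVC(gamma=0.001),
    X,
    y,
    cv=cv,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the maximum training-set size
    shuffle=True,
    random_state=0,
    return_times=True,  # also return fit_times and score_times
    n_jobs=2,
)

print(train_sizes_abs)           # absolute sizes actually used (duplicates removed)
print(test_scores.mean(axis=1))  # average cross-validation score per size
print(fit_times.mean(axis=1))    # average fit time per size, in seconds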