1. Source Code Analysis
The code we will read is from sklearn, the well-known Python machine learning library. The GBDT-related classes in sklearn (sklearn/ensemble/gradient_boosting.py) are GradientBoostingRegressor and GradientBoostingClassifier; their common parent class is BaseGradientBoosting, which contains the core boosting implementation. The main parameters are listed below (see the sklearn documentation for more detail), followed by a short usage sketch:
loss : {'ls', 'lad', 'huber', 'quantile'}, optional (default='ls')
loss function to be optimized. 'ls' refers to least squares
regression. 'lad' (least absolute deviation) is a highly robust
loss function solely based on order information of the input
variables. 'huber' is a combination of the two. 'quantile'
allows quantile regression (use `alpha` to specify the quantile).
learning_rate : float, optional (default=0.1)
learning rate shrinks the contribution of each tree by `learning_rate`.
There is a trade-off between learning_rate and n_estimators.
n_estimators : int (default=100)
The number of boosting stages to perform. Gradient boosting
is fairly robust to over-fitting so a large number usually
results in better performance.
max_depth : integer, optional (default=3)
maximum depth of the individual regression estimators. The maximum
depth limits the number of nodes in the tree. Tune this parameter
for best performance; the best value depends on the interaction
of the input variables.
....
init : BaseEstimator, None, optional (default=None)
An estimator object that is used to compute the initial
predictions. ``init`` has to provide ``fit`` and ``predict``.
If None it uses ``loss.init_estimator``.
....
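As a quick orientation before diving into the source, here is a minimal usage sketch; the synthetic dataset is ours, purely for illustration ('ls' is the loss spelling in the sklearn version discussed here):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                             # synthetic features
y = 2.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(scale=0.1, size=200)

gbr = GradientBoostingRegressor(loss='ls', learning_rate=0.1,
                                n_estimators=100, max_depth=3)
gbr.fit(X, y)
print(gbr.predict(X[:5]))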
The key method of BaseGradientBoosting is fit().
Initialization
sklearn provides several estimators for the first step of the algorithm, the initialization. The default is MeanEstimator, which uses the mean of the training targets as the initial prediction: its fit() computes the mean, and its predict() sets the initial prediction y_pred of every sample in X to that mean. (The other init estimators work analogously.)
class MeanEstimator(BaseEstimator):
    """An estimator predicting the mean of the training targets."""
    def fit(self, X, y, sample_weight=None):
        if sample_weight is None:
            self.mean = np.mean(y)
        else:
            self.mean = np.average(y, weights=sample_weight)

    def predict(self, X):
        check_is_fitted(self, 'mean')
        y = np.empty((X.shape[0], 1), dtype=np.float64)
        y.fill(self.mean)
        return y
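To see what this initialization amounts to, here is a minimal standalone sketch (plain numpy, independent of sklearn's internal classes):

import numpy as np

y = np.array([3.0, 1.0, 2.0, 4.0])
mean = np.mean(y)                          # what MeanEstimator.fit() stores
y_pred = np.full((y.shape[0], 1), mean)    # what predict() returns for any X
print(y_pred.ravel())                      # [2.5 2.5 2.5 2.5]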
Computing the Residuals
sklearn provides several loss functions (subclasses of LossFunction); the default is the least-squares loss LeastSquaresError. Each loss computes the ''pseudo-residuals'' as its negative gradient. For LeastSquaresError, negative_gradient() is simply y - y_pred. Another important method is update_terminal_regions, which updates the running prediction y_pred during each iteration.
class LeastSquaresError(RegressionLossFunction):
    """Loss function for least squares (LS) estimation.
    Terminal regions need not to be updated for least squares. """

    def init_estimator(self):
        return MeanEstimator()

    def __call__(self, y, pred, sample_weight=None):
        if sample_weight is None:
            return np.mean((y - pred.ravel()) ** 2.0)
        else:
            return (1.0 / sample_weight.sum() *
                    np.sum(sample_weight * ((y - pred.ravel()) ** 2.0)))

    def negative_gradient(self, y, pred, **kargs):
        return y - pred.ravel()

    def update_terminal_regions(self, tree, X, y, residual, y_pred,
                                sample_weight, sample_mask,
                                learning_rate=1.0, k=0):
        """Least squares does not need to update terminal regions.
        But it has to update the predictions.
        """
        # update predictions
        y_pred[:, k] += learning_rate * tree.predict(X).ravel()

    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, pred, sample_weight):
        pass
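Why the negative gradient is the plain residual: for the halved squared loss L(y, F) = ½(y − F)², we have −∂L/∂F = y − F, which is exactly what negative_gradient returns (the __call__ above omits the ½; the constant factor is irrelevant, since the next tree only needs the direction of the residuals). A tiny numeric check of this identity:

import numpy as np

# numeric check: y - pred equals the negative gradient of 0.5*(y - pred)**2
y = np.array([1.0, 2.0, 3.0])
pred = np.array([0.5, 2.5, 2.0])
eps = 1e-6

loss = lambda p: 0.5 * np.sum((y - p) ** 2)
num_grad = np.array([(loss(pred + eps * np.eye(3)[i]) - loss(pred)) / eps
                     for i in range(3)])

print(-num_grad)    # ~ [ 0.5 -0.5  1. ]
print(y - pred)     # [ 0.5 -0.5  1. ]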
Building the Model Iteratively
The most important method of BaseGradientBoosting is fit(). It begins with validation steps such as check_input and _check_params. While validating the parameters, _check_params also initializes the loss function self.loss_. For regression trees self.n_classes_ == 1 (a value greater than 1 means multi-class classification), which corresponds to self.loss_.K == 1.
----_check_params()--------
if self.loss == 'deviance':
    loss_class = (MultinomialDeviance
                  if len(self.classes_) > 2
                  else BinomialDeviance)
else:
    loss_class = LOSS_FUNCTIONS[self.loss]

if self.loss in ('huber', 'quantile'):
    self.loss_ = loss_class(self.n_classes_, self.alpha)
else:
    self.loss_ = loss_class(self.n_classes_)
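On the user side, this branch maps directly onto the constructor arguments, for example:

from sklearn.ensemble import GradientBoostingRegressor

# 'huber' and 'quantile' are the only losses that also take `alpha`,
# matching the second branch of _check_params above
gbr_huber = GradientBoostingRegressor(loss='huber', alpha=0.9)
gbr_quant = GradientBoostingRegressor(loss='quantile', alpha=0.5)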
Next, fit() checks whether the model has already been initialized (mainly to distinguish warm starts from cold ones). If it has not been initialized (we focus on the non-warm-start case, the if branch in the code below):
--------fit()----------
if not self._is_initialized():
    # init state
    self._init_state()

    # fit initial model - FIXME make sample_weight optional
    self.init_.fit(X, y, sample_weight)

    # init predictions
    y_pred = self.init_.predict(X)
    begin_at_stage = 0
else:
    # add more estimators to fitted model
    # invariant: warm_start = True
    if self.n_estimators < self.estimators_.shape[0]:
        raise ValueError('n_estimators=%d must be larger or equal to '
                         'estimators_.shape[0]=%d when '
                         'warm_start==True'
                         % (self.n_estimators,
                            self.estimators_.shape[0]))
    begin_at_stage = self.estimators_.shape[0]
    y_pred = self._decision_function(X)
    self._resize_state()
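The else branch is exercised via warm_start: fit once, raise n_estimators, and fit again so that training resumes at begin_at_stage. A small sketch (synthetic data again):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X.sum(axis=1)

gbr = GradientBoostingRegressor(n_estimators=50, warm_start=True)
gbr.fit(X, y)                      # cold start: the `if` branch
gbr.set_params(n_estimators=100)
gbr.fit(X, y)                      # resumes at stage 50: the `else` branch
print(gbr.estimators_.shape)       # (100, 1)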
During initialization, _init_state() creates the estimator used to build the initial model:
----_init_state()----
def _init_state(self):
    """Initialize model state and allocate model state data structures. """
    if self.init is None:
        self.init_ = self.loss_.init_estimator()
    elif isinstance(self.init, six.string_types):
        self.init_ = INIT_ESTIMATORS[self.init]()
    else:
        self.init_ = self.init

    self.estimators_ = np.empty((self.n_estimators, self.loss_.K),
                                dtype=np.object)
    self.train_score_ = np.zeros((self.n_estimators,), dtype=np.float64)
    # do oob?
    if self.subsample < 1.0:
        self.oob_improvement_ = np.zeros((self.n_estimators),
                                         dtype=np.float64)
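Note the last branch of the init selection: if the user passes an init estimator, it is used instead of loss_.init_estimator(). Any estimator with fit and predict works, so boosting can, for example, start from a linear fit instead of the target mean:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

# self.init_ becomes this LinearRegression instead of MeanEstimator
gbr = GradientBoostingRegressor(init=LinearRegression())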
The newly created self.init_ yields the very first y_pred before any boosting iteration. The boosting process itself is then built in _fit_stages(), whose core job is to iteratively fit each base learner:
------_fit_stages()------
# perform boosting iterations
i = begin_at_stage
for i in range(begin_at_stage, self.n_estimators):

    # subsampling
    if do_oob:
        sample_mask = _random_sample_mask(n_samples, n_inbag,
                                          random_state)
        # OOB score before adding this stage
        old_oob_score = loss_(y[~sample_mask],
                              y_pred[~sample_mask],
                              sample_weight[~sample_mask])

    # fit next stage of trees
    y_pred = self._fit_stage(i, X, y, y_pred, sample_weight,
                             sample_mask, random_state, X_idx_sorted,
                             X_csc, X_csr)
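The do_oob bookkeeping is active whenever subsample < 1.0; the per-stage improvement ends up in oob_improvement_ and can be inspected after fitting:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X.sum(axis=1)

# subsample < 1.0 switches on the OOB branch in _fit_stages
gbr = GradientBoostingRegressor(n_estimators=20, subsample=0.5).fit(X, y)
print(gbr.oob_improvement_[:5])    # per-stage improvement in OOB loss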
_fit_stage is a single iteration: it builds one regression decision tree and fits it. See the inline comments in the code below.
def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
               random_state, X_idx_sorted, X_csc=None, X_csr=None):
    """Fit another stage of ``n_classes_`` trees to the boosting model. """
    assert sample_mask.dtype == np.bool
    loss = self.loss_
    original_y = y

    # K == 1 for regression and binary classification
    for k in range(loss.K):
        if loss.is_multi_class:
            y = np.array(original_y == k, dtype=np.float64)

        # residual between the true y and the current y_pred
        # (computed as the negative gradient)
        residual = loss.negative_gradient(y, y_pred, k=k,
                                          sample_weight=sample_weight)

        # induce regression tree on residuals:
        # the residual is the target this tree will learn to predict
        tree = DecisionTreeRegressor(
            criterion=self.criterion,
            splitter='best',
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split,
            min_samples_leaf=self.min_samples_leaf,
            min_weight_fraction_leaf=self.min_weight_fraction_leaf,
            max_features=self.max_features,
            max_leaf_nodes=self.max_leaf_nodes,
            random_state=random_state,
            presort=self.presort)

        if self.subsample < 1.0:
            # no inplace multiplication!
            sample_weight = sample_weight * sample_mask.astype(np.float64)

        # fit the tree on the residuals
        if X_csc is not None:
            tree.fit(X_csc, residual, sample_weight=sample_weight,
                     check_input=False, X_idx_sorted=X_idx_sorted)
        else:
            tree.fit(X, residual, sample_weight=sample_weight,
                     check_input=False, X_idx_sorted=X_idx_sorted)

        # update tree leaves:
        # loss.update_terminal_regions uses this tree's predictions on X
        # to update the whole model's y_pred
        if X_csr is not None:
            loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred,
                                         sample_weight, sample_mask,
                                         self.learning_rate, k=k)
        else:
            loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
                                         sample_weight, sample_mask,
                                         self.learning_rate, k=k)

        # add tree to ensemble: keep this tree
        self.estimators_[i, k] = tree

    # return y_pred as updated by this iteration
    return y_pred
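Stripped of the multi-class, sparse-input, and subsampling machinery, the loop above boils down to a few lines. The following is our own condensed sketch (squared loss, K == 1), not sklearn code:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Minimal GBDT training loop mirroring _fit_stage (squared loss)."""
    init_mean = y.mean()                   # MeanEstimator: initial y_pred
    y_pred = np.full(y.shape, init_mean)
    trees = []
    for _ in range(n_estimators):
        residual = y - y_pred              # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)              # the tree learns the residuals
        y_pred += learning_rate * tree.predict(X)   # update_terminal_regions
        trees.append(tree)
    return init_mean, trees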
That is the whole of fit(), i.e. the whole model-building process. Besides building the model in fit(), GradientBoostingRegressor also has to implement another important method, predict(), which makes predictions with the fitted model.
def predict(self, X):
    """Predict regression target for X.

    Parameters
    ----------
    X : array-like of shape = [n_samples, n_features]
        The input samples.

    Returns
    -------
    y : array of shape = [n_samples]
        The predicted values.
    """
    X = check_array(X, dtype=DTYPE, order="C")
    return self._decision_function(X).ravel()
It mainly calls self._decision_function, which is also a method of the base class BaseGradientBoosting.
def _decision_function(self, X):
    # for use in inner loop, not raveling the output in single-class case,
    # not doing input validation.
    score = self._init_decision_function(X)
    predict_stages(self.estimators_, X, self.learning_rate, score)
    return score
Here score = self._init_decision_function(X) uses self.init_ to produce the initial y_pred, just as during training.
def _init_decision_function(self, X):
    """Check input and compute prediction of ``init``. """
    self._check_initialized()
    X = self.estimators_[0, 0]._validate_X_predict(X, check_input=True)
    if X.shape[1] != self.n_features:
        raise ValueError("X.shape[1] should be {0:d}, not {1:d}.".format(
            self.n_features, X.shape[1]))
    score = self.init_.predict(X).astype(np.float64)
    return score
Then predict_stages walks over the previously built trees and accumulates the final prediction; a from-scratch sketch of this follows.
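Continuing the hand-rolled sketch from the training section (our own code, not sklearn's): prediction starts from the initial value, like _init_decision_function, and then adds each tree's shrunken contribution, like predict_stages:

import numpy as np

def gbdt_predict(X, init_mean, trees, learning_rate=0.1):
    """Minimal counterpart of _decision_function / predict_stages."""
    score = np.full(X.shape[0], init_mean)        # initial prediction
    for tree in trees:
        score += learning_rate * tree.predict(X)  # shrunken contribution
    return score

# usage with gbdt_fit from the training sketch:
# init_mean, trees = gbdt_fit(X, y)
# y_hat = gbdt_predict(X, init_mean, trees)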