1. Source Code Analysis
The code we will read is from sklearn, the well-known Python machine learning library. The GBDT-related classes in sklearn (sklearn/ensemble/gradient_boosting.py) are GradientBoostingRegressor and GradientBoostingClassifier; their common parent class is BaseGradientBoosting, which contains the core boosting implementation. The main parameters are listed below (see the sklearn documentation for more detail), followed by a short usage sketch:
loss : {'ls', 'lad', 'huber', 'quantile'}, optional (default='ls')
loss function to be optimized. 'ls' refers to least squares
regression. 'lad' (least absolute deviation) is a highly robust
loss function solely based on order information of the input
variables. 'huber' is a combination of the two. 'quantile'
allows quantile regression (use `alpha` to specify the quantile).
learning_rate : float, optional (default=0.1)
learning rate shrinks the contribution of each tree by `learning_rate`.
There is a trade-off between learning_rate and n_estimators.
n_estimators : int (default=100)
The number of boosting stages to perform. Gradient boosting
is fairly robust to over-fitting so a large number usually
results in better performance.
max_depth : integer, optional (default=3)
maximum depth of the individual regression estimators. The maximum
depth limits the number of nodes in the tree. Tune this parameter
for best performance; the best value depends on the interaction
of the input variables.
....
init : BaseEstimator, None, optional (default=None)
An estimator object that is used to compute the initial
predictions. ``init`` has to provide ``fit`` and ``predict``.
If None it uses ``loss.init_estimator``.
....
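As a quick orientation before diving into the source, here is a minimal usage sketch; the synthetic dataset is ours, purely for illustration ('ls' is the loss spelling in the sklearn version discussed here):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                             # synthetic features
y = 2.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(scale=0.1, size=200)

gbr = GradientBoostingRegressor(loss='ls', learning_rate=0.1,
                                n_estimators=100, max_depth=3)
gbr.fit(X, y)
print(gbr.predict(X[:5]))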
The key method of BaseGradientBoosting is fit().
Initialization
sklearn provides several estimators for the first step of the algorithm, the initialization. The default is MeanEstimator, which uses the mean of the training targets as the initial prediction: its fit() computes the mean, and its predict() sets the initial prediction y_pred of every sample in X to that mean. (The other init estimators work analogously.)
class MeanEstimator(BaseEstimator):
    """An estimator predicting the mean of the training targets."""
    def fit(self, X, y, sample_weight=None):
        if sample_weight is None:
            self.mean = np.mean(y)
        else:
            self.mean = np.average(y, weights=sample_weight)

    def predict(self, X):
        check_is_fitted(self, 'mean')
        y = np.empty((X.shape[0], 1), dtype=np.float64)
        y.fill(self.mean)
        return y
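To see what this initialization amounts to, here is a minimal standalone sketch (plain numpy, independent of sklearn's internal classes):

import numpy as np

y = np.array([3.0, 1.0, 2.0, 4.0])
mean = np.mean(y)                          # what MeanEstimator.fit() stores
y_pred = np.full((y.shape[0], 1), mean)    # what predict() returns for any X
print(y_pred.ravel())                      # [2.5 2.5 2.5 2.5]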
Computing the Residuals
sklearn provides several loss functions (subclasses of LossFunction); the default is the least-squares loss LeastSquaresError. Each loss computes the ''pseudo-residuals'' as its negative gradient. For LeastSquaresError, negative_gradient() is simply y - y_pred. Another important method is update_terminal_regions, which updates the running prediction y_pred during each iteration.
class LeastSquaresError(RegressionLossFunction):
    """Loss function for least squares (LS) estimation.
    Terminal regions need not to be updated for least squares. """

    def init_estimator(self):
        return MeanEstimator()

    def __call__(self, y, pred, sample_weight=None):
        if sample_weight is None:
            return np.mean((y - pred.ravel()) ** 2.0)
        else:
            return (1.0 / sample_weight.sum() *
                    np.sum(sample_weight * ((y - pred.ravel()) ** 2.0)))

    def negative_gradient(self, y, pred, **kargs):
        return y - pred.ravel()

    def update_terminal_regions(self, tree, X, y, residual, y_pred,
                                sample_weight, sample_mask,
                                learning_rate=1.0, k=0):
        """Least squares does not need to update terminal regions.
        But it has to update the predictions.
        """
        # update predictions
        y_pred[:, k] += learning_rate * tree.predict(X).ravel()

    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, pred, sample_weight):
        pass
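Why the negative gradient is the plain residual: for the halved squared loss L(y, F) = ½(y − F)², we have −∂L/∂F = y − F, which is exactly what negative_gradient returns (the __call__ above omits the ½; the constant factor is irrelevant, since the next tree only needs the direction of the residuals). A tiny numeric check of this identity:

import numpy as np

# numeric check: y - pred equals the negative gradient of 0.5*(y - pred)**2
y = np.array([1.0, 2.0, 3.0])
pred = np.array([0.5, 2.5, 2.0])
eps = 1e-6

loss = lambda p: 0.5 * np.sum((y - p) ** 2)
num_grad = np.array([(loss(pred + eps * np.eye(3)[i]) - loss(pred)) / eps
                     for i in range(3)])

print(-num_grad)    # ~ [ 0.5 -0.5  1. ]
print(y - pred)     # [ 0.5 -0.5  1. ]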
Building the Model Iteratively
The most important method of BaseGradientBoosting is fit(). It begins with validation steps such as check_input and _check_params. While validating the parameters, _check_params also initializes the loss function self.loss_. For regression trees self.n_classes_ == 1 (a value greater than 1 means multi-class classification), which corresponds to self.loss_.K == 1.
----_check_params()--------
if self.loss == 'deviance':
    loss_class = (MultinomialDeviance
                  if len(self.classes_) > 2
                  else BinomialDeviance)
else:
    loss_class = LOSS_FUNCTIONS[self.loss]

if self.loss in ('huber', 'quantile'):
    self.loss_ = loss_class(self.n_classes_, self.alpha)
else:
    self.loss_ = loss_class(self.n_classes_)
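On the user side, this branch maps directly onto the constructor arguments, for example:

from sklearn.ensemble import GradientBoostingRegressor

# 'huber' and 'quantile' are the only losses that also take `alpha`,
# matching the second branch of _check_params above
gbr_huber = GradientBoostingRegressor(loss='huber', alpha=0.9)
gbr_quant = GradientBoostingRegressor(loss='quantile', alpha=0.5)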
Next, fit() checks whether the model has already been initialized (mainly to distinguish warm starts from cold ones). If it has not been initialized (we focus on the non-warm-start case, the if branch in the code below):
--------fit()----------
if not self._is_initialized():
    # init state
    self._init_state()

    # fit initial model - FIXME make sample_weight optional
    self.init_.fit(X, y, sample_weight)

    # init predictions
    y_pred = self.init_.predict(X)
    begin_at_stage = 0
else:
    # add more estimators to fitted model
    # invariant: warm_start = True
    if self.n_estimators < self.estimators_.shape[0]:
        raise ValueError('n_estimators=%d must be larger or equal to '
                         'estimators_.shape[0]=%d when '
                         'warm_start==True'
                         % (self.n_estimators,
                            self.estimators_.shape[0]))
    begin_at_stage = self.estimators_.shape[0]
    y_pred = self._decision_function(X)
    self._resize_state()
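The else branch is exercised via warm_start: fit once, raise n_estimators, and fit again so that training resumes at begin_at_stage. A small sketch (synthetic data again):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X.sum(axis=1)

gbr = GradientBoostingRegressor(n_estimators=50, warm_start=True)
gbr.fit(X, y)                      # cold start: the `if` branch
gbr.set_params(n_estimators=100)
gbr.fit(X, y)                      # resumes at stage 50: the `else` branch
print(gbr.estimators_.shape)       # (100, 1)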
During initialization, _init_state() creates the estimator used to build the initial model:
----_init_state()----
def _init_state(self):
    """Initialize model state and allocate model state data structures. """
    if self.init is None:
        self.init_ = self.loss_.init_estimator()
    elif isinstance(self.init, six.string_types):
        self.init_ = INIT_ESTIMATORS[self.init]()
    else:
        self.init_ = self.init

    self.estimators_ = np.empty((self.n_estimators, self.loss_.K),
                                dtype=np.object)
    self.train_score_ = np.zeros((self.n_estimators,), dtype=np.float64)
    # do oob?
    if self.subsample < 1.0:
        self.oob_improvement_ = np.zeros((self.n_estimators),
                                         dtype=np.float64)
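Note the last branch of the init selection: if the user passes an init estimator, it is used instead of loss_.init_estimator(). Any estimator with fit and predict works, so boosting can, for example, start from a linear fit instead of the target mean:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

# self.init_ becomes this LinearRegression instead of MeanEstimator
gbr = GradientBoostingRegressor(init=LinearRegression())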
The newly created self.init_ yields the very first y_pred before any boosting iteration. The boosting process itself is then built in _fit_stages(), whose core job is to iteratively fit each base learner:
------_fit_stages()------
# perform boosting iterations
i = begin_at_stage
for i in range(begin_at_stage, self.n_estimators):

    # subsampling
    if do_oob:
        sample_mask = _random_sample_mask(n_samples, n_inbag,
                                          random_state)
        # OOB score before adding this stage
        old_oob_score = loss_(y[~sample_mask],
                              y_pred[~sample_mask],
                              sample_weight[~sample_mask])

    # fit next stage of trees
    y_pred = self._fit_stage(i, X, y, y_pred, sample_weight,
                             sample_mask, random_state, X_idx_sorted,
                             X_csc, X_csr)
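The do_oob bookkeeping is active whenever subsample < 1.0; the per-stage improvement ends up in oob_improvement_ and can be inspected after fitting:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X.sum(axis=1)

# subsample < 1.0 switches on the OOB branch in _fit_stages
gbr = GradientBoostingRegressor(n_estimators=20, subsample=0.5).fit(X, y)
print(gbr.oob_improvement_[:5])    # per-stage improvement in OOB loss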
_fit_stage is a single iteration: it builds one regression decision tree and fits it. See the inline comments in the code below.
def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
               random_state, X_idx_sorted, X_csc=None, X_csr=None):
    """Fit another stage of ``n_classes_`` trees to the boosting model. """
    assert sample_mask.dtype == np.bool
    loss = self.loss_
    original_y = y

    # K == 1 for regression and binary classification
    for k in range(loss.K):
        if loss.is_multi_class:
            y = np.array(original_y == k, dtype=np.float64)

        # residual between the true y and the current y_pred
        # (computed as the negative gradient)
        residual = loss.negative_gradient(y, y_pred, k=k,
                                          sample_weight=sample_weight)

        # induce regression tree on residuals:
        # the residual is the target this tree will learn to predict
        tree = DecisionTreeRegressor(
            criterion=self.criterion,
            splitter='best',
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split,
            min_samples_leaf=self.min_samples_leaf,
            min_weight_fraction_leaf=self.min_weight_fraction_leaf,
            max_features=self.max_features,
            max_leaf_nodes=self.max_leaf_nodes,
            random_state=random_state,
            presort=self.presort)

        if self.subsample < 1.0:
            # no inplace multiplication!
            sample_weight = sample_weight * sample_mask.astype(np.float64)

        # fit the tree on the residuals
        if X_csc is not None:
            tree.fit(X_csc, residual, sample_weight=sample_weight,
                     check_input=False, X_idx_sorted=X_idx_sorted)
        else:
            tree.fit(X, residual, sample_weight=sample_weight,
                     check_input=False, X_idx_sorted=X_idx_sorted)

        # update tree leaves:
        # loss.update_terminal_regions uses this tree's predictions on X
        # to update the whole model's y_pred
        if X_csr is not None:
            loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred,
                                         sample_weight, sample_mask,
                                         self.learning_rate, k=k)
        else:
            loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
                                         sample_weight, sample_mask,
                                         self.learning_rate, k=k)

        # add tree to ensemble: keep this tree
        self.estimators_[i, k] = tree

    # return y_pred as updated by this iteration
    return y_pred
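Stripped of the multi-class, sparse-input, and subsampling machinery, the loop above boils down to a few lines. The following is our own condensed sketch (squared loss, K == 1), not sklearn code:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Minimal GBDT training loop mirroring _fit_stage (squared loss)."""
    init_mean = y.mean()                   # MeanEstimator: initial y_pred
    y_pred = np.full(y.shape, init_mean)
    trees = []
    for _ in range(n_estimators):
        residual = y - y_pred              # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)              # the tree learns the residuals
        y_pred += learning_rate * tree.predict(X)   # update_terminal_regions
        trees.append(tree)
    return init_mean, trees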
That is the whole of fit(), i.e. the whole model-building process. Besides building the model in fit(), GradientBoostingRegressor also has to implement another important method, predict(), which makes predictions with the fitted model.
def predict(self, X):
    """Predict regression target for X.

    Parameters
    ----------
    X : array-like of shape = [n_samples, n_features]
        The input samples.

    Returns
    -------
    y : array of shape = [n_samples]
        The predicted values.
    """
    X = check_array(X, dtype=DTYPE, order="C")
    return self._decision_function(X).ravel()
It mainly calls self._decision_function, which is also a method of the base class BaseGradientBoosting.
def _decision_function(self, X):
    # for use in inner loop, not raveling the output in single-class case,
    # not doing input validation.
    score = self._init_decision_function(X)
    predict_stages(self.estimators_, X, self.learning_rate, score)
    return score
Here score = self._init_decision_function(X) uses self.init_ to produce the initial y_pred, just as during training.
def _init_decision_function(self, X):
    """Check input and compute prediction of ``init``. """
    self._check_initialized()
    X = self.estimators_[0, 0]._validate_X_predict(X, check_input=True)
    if X.shape[1] != self.n_features:
        raise ValueError("X.shape[1] should be {0:d}, not {1:d}.".format(
            self.n_features, X.shape[1]))
    score = self.init_.predict(X).astype(np.float64)
    return score
Then predict_stages walks over the previously built trees and accumulates the final prediction; a from-scratch sketch of this follows.
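Continuing the hand-rolled sketch from the training section (our own code, not sklearn's): prediction starts from the initial value, like _init_decision_function, and then adds each tree's shrunken contribution, like predict_stages:

import numpy as np

def gbdt_predict(X, init_mean, trees, learning_rate=0.1):
    """Minimal counterpart of _decision_function / predict_stages."""
    score = np.full(X.shape[0], init_mean)        # initial prediction
    for tree in trees:
        score += learning_rate * tree.predict(X)  # shrunken contribution
    return score

# usage with gbdt_fit from the training sketch:
# init_mean, trees = gbdt_fit(X, y)
# y_hat = gbdt_predict(X, init_mean, trees)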