LightGBM
LightGBM (Light Gradient Boosting Machine) is a distributed gradient boosting framework built on decision tree algorithms. To meet the industry's need for shorter model training times, its design focuses on two points:
- Reduce the data's memory footprint, so that a single machine can use as much data as possible without sacrificing speed;
- Reduce communication cost, so that multi-machine parallel training is efficient and achieves linear speedup in computation. In short, LightGBM was designed from the start to be a fast, efficient, low-memory, high-accuracy data science tool that supports parallelism and large-scale data processing.
LightGBM is a project within Microsoft's Distributed Machine Learning Toolkit (DMTK), developed under the lead of Guolin Ke (柯國霖), one of the winners of the first Alibaba Big Data Competition in 2014. Although it had been open source for only two months, its speed and efficiency had already stood out in data science competitions: the champion solution of the Allstate Claims Severity competition used LightGBM and praised it highly.
Features
- Optimized speed and memory usage.
- Sparse-feature optimization.
- Optimized accuracy: uses leaf-wise tree growth and handles categorical features natively (see the usage sketch after this list).
- Optimized network communication.
- Supports three parallel modes (a configuration sketch follows this list).
(1) Feature parallel:
a. Workers find local best split point {feature, threshold} on the local feature set.
b. Communicate local best splits with each other and get the best one.
c. Perform the best split.
(2)數(shù)據(jù)并行:
a. Instead of “Merge global histograms from all local histograms”, LightGBM use “Reduce Scatter” to merge histograms of different (non-overlapping) features for different workers. Then workers find the local best split on local merged histograms and sync up the global best split.
b. As aforementioned, LightGBM uses histogram subtraction to speed up training. Based on this, we can communicate histograms only for one leaf, and get its neighbor’s histograms by subtraction as well.
(3)投票并行:
Voting parallel further reduces the communication cost in data-parallel to constant cost. It uses two-stage voting to reduce the communication cost of feature histograms.
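As a usage sketch of the points above (leaf-wise growth bounded by `num_leaves`, and categorical features passed without one-hot encoding), the following snippet uses the LightGBM Python API on made-up data; the column names and parameter values are illustrative only, not recommendations.

```python
# Minimal sketch: leaf-wise growth and native categorical-feature handling
# in the LightGBM Python API. Data and column names are made up.
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num_feat": rng.normal(size=1000),
    "cat_feat": pd.Categorical(rng.integers(0, 5, size=1000)),  # categorical column, no one-hot needed
})
y = (df["num_feat"] > 0).astype(int)

# Categorical columns are passed directly to the Dataset.
train_set = lgb.Dataset(df, label=y, categorical_feature=["cat_feat"])

params = {
    "objective": "binary",
    "num_leaves": 31,    # leaf-wise growth is limited by leaf count rather than depth
    "max_depth": -1,     # no explicit depth limit
    "learning_rate": 0.1,
    "verbose": -1,
}
booster = lgb.train(params, train_set, num_boost_round=50)
```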
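The three parallel modes correspond to LightGBM's `tree_learner` parameter ("feature", "data", "voting"; "serial" is the single-machine default). The snippet below is only a configuration sketch; an actual distributed run additionally needs cluster networking settings such as `num_machines`, `machines` (or `machine_list_filename`) and `local_listen_port`, which depend on your environment.

```python
# Configuration sketch only: choosing one of LightGBM's parallel tree learners.
# Real multi-machine training also requires network parameters
# (num_machines, machines / machine_list_filename, local_listen_port, ...).
feature_parallel = {"tree_learner": "feature"}  # partition the feature set across workers
data_parallel    = {"tree_learner": "data"}     # partition rows; merge histograms via Reduce Scatter
voting_parallel  = {"tree_learner": "voting"}   # two-stage voting on top of data parallel

params = {
    "objective": "binary",
    "num_leaves": 31,
    **data_parallel,  # swap in feature_parallel or voting_parallel as needed
}
```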
FAQ
What are the differences between LightGBM and XGBoost? Is their loss the same? What are the differences at the algorithm level?
A: LightGBM: histogram-based decision tree algorithm; leaf-wise leaf growth strategy; cache hit-rate optimization; direct support for categorical features. XGBoost: pre-sorted split finding; level-wise growth strategy; feature access to gradients is random memory access.
Which boosting implementations does LightGBM provide, and how do they differ?
A: gbdt: gradient boosted decision trees, trained sequentially (slower) and prone to overfitting; rf: random forest, parallel and fast; dart: slower to train; goss: prone to overfitting.
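A minimal sketch of how these boosting modes are selected through the scikit-learn-style API; the synthetic data is for illustration only, and note that `rf` mode additionally requires bagging to be enabled (here via the `subsample`/`subsample_freq` aliases):

```python
# Minimal sketch: selecting LightGBM's boosting modes via the scikit-learn API.
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] > 0).astype(int)          # toy binary target

gbdt = LGBMClassifier(boosting_type="gbdt")   # default gradient boosting
dart = LGBMClassifier(boosting_type="dart")   # dropout-style boosting, slower to train
rf   = LGBMClassifier(boosting_type="rf",     # random-forest mode needs bagging enabled
                      subsample=0.8, subsample_freq=1)
goss = LGBMClassifier(boosting_type="goss")   # gradient-based one-side sampling
# (recent LightGBM versions map boosting_type="goss" to data_sample_strategy="goss")

for name, model in [("gbdt", gbdt), ("dart", dart), ("rf", rf), ("goss", goss)]:
    model.fit(X, y)
    print(name, model.score(X, y))
```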