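The advantage of GPU computing power has already been amply demonstrated in deep learning. For applications in the tax domain, see my articles "Upgrading HanLP and Using the GPU Backend to Recognize Invoice Goods and Services Names" and "Recognizing Invoice Goods and Services Names with HanLP, Part 3: GPU Acceleration", as well as another article, "An Aside: Snow Leopard Recognition with the VGG16 Deep Learning Model". HanLP uses the TensorFlow and PyTorch deep learning frameworks; interested vendors can also try their own frameworks.
Those articles all ran on Python. In R, TensorFlow and Keras have corresponding interface packages (the backend still runs on Python); see "Deep Learning with R". More recently, R-native deep learning frameworks have also appeared, such as Torch for R and a native Apache MXNet. I have not run the latter two yet and may try them when time allows.
For installing and using GPUs on Linux, see my series of articles on Jianshu.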
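In traditional machine learning, which is mainly classification and regression, some algorithm implementations also try to use GPU computing power to improve performance. In my earlier articles "Melbourne House Price Regression Model (Python)" and "Melbourne House Price Regression Model with Tidy Models (R)", the three GBDT (gradient boosting decision tree) implementations used, XGBoost, LightGBM, and CatBoost, are widely regarded on Kaggle as world-class, and all three support running on the GPU. This raises both a question and an opportunity: under what conditions can the GPU's advantage actually be realized in this field? The question has real practical value. With so many GPUs available on cloud platforms, PCs, and laptops, whether they can be fully used is an important criterion when choosing an algorithm implementation and a technical approach. The implementations exist, and there are plenty of examples online showing a GPU advantage for classification or regression on large datasets, so the possibility is not in doubt. The problem is finding the conditions under which it holds in one's own application, and that requires actual measurement.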
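In the Melbourne house price regression example, measurements show that both the Python and the R implementations run slower on the GPU (Nvidia GeForce RTX 2060 Max-Q, 1920 CUDA cores) than on the CPU (Intel Core i7, 8 cores [16 logical cores]). The cause needs a closer look: were the GPU parameters set incorrectly, is it a property of the dataset, or is this simply what the hardware can do? Answering that clarifies the conditions under which the GPU's advantage can be realized in a real application.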
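One way to keep such CPU/GPU comparisons honest is to time the same callable several times and report the best and median wall-clock times, so that one-off costs such as data loading or kernel compilation do not dominate a single measurement. A minimal sketch follows; time_call and its arguments are hypothetical names used only for illustration, not part of any library discussed here.

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Run fn() `repeats` times and return (best, median) wall-clock seconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - start)
    return min(elapsed), statistics.median(elapsed)


# Stand-in workload; in practice fn would wrap one model fit on the CPU or the GPU.
best, median = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"best {best:.4f}s, median {median:.4f}s")
```

The timings reported in this article come from single runs, so they should be read as indicative rather than as averaged benchmarks.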
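I. Tests in R
I have recently been writing introductory articles on Tidy Models, so I cover R first and Python afterwards; the results are the same.
1. XGBoost
The open-source XGBoost framework was developed primarily at the University of Washington. The default CRAN build of XGBoost does not support GPUs; install the release from its GitHub page instead, which offers precompiled Windows and Linux builds (version 1.7.3.1 at the time of writing). To run on the GPU, add one engine parameter, tree_method="gpu_hist":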
set_engine('xgboost', tree_method="gpu_hist")
# -----------------------------------------------------------------------------------------
library(tidymodels)
library(kableExtra)
library(tidyr)
# All operating systems: register a parallel backend
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores())
registerDoParallel(cl)
# Prefer tidymodels functions when function names conflict.
tidymodels_prefer()
# Outlier threshold: 30
threshold<- 30
# ----------------------------------------------------------------------------------------
# Load the preprocessed data
melbourne<- read.csv("D:/temp/data/Melbourne_housing/Melbourne_housing_pre.csv")
# Filter out missing values
# Error: Missing data in columns: BuildingArea.
# 47 obs.
missing <- filter(melbourne, BuildingArea==0)
melbourne <- filter(melbourne, BuildingArea!=0)
# Split into training and test sets
set.seed(2023)
melbourne_split <- initial_split(melbourne, prop = 0.80)
melbourne_train <- training(melbourne_split)
melbourne_test <- testing(melbourne_split)
# ----------------------------------------------------------------------------------------------------
# Bayesian optimization
# Recipe parameters, main model parameters, and engine-specific parameters can all be tuned.
# Define the recipe: regression formula and preprocessing
melbourne_rec<-
recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
+ Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
# Normalize the numeric predictors
step_normalize(all_numeric_predictors())
# Define the model: XGBoost, with the parameters to tune; tree_method="gpu_hist" selects the GPU.
xgb_spec <-
boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune(), loss_reduction = tune(), sample_size = tune(), stop_iter = tune()) %>%
set_engine('xgboost', tree_method="gpu_hist") %>%
set_mode('regression')
# Define the workflow
xgb_wflow <-
workflow() %>%
add_model(xgb_spec) %>%
add_recipe(melbourne_rec)
# Set the parameter ranges; after finalize() all bounds are determined.
xgb_param <- xgb_wflow %>%
extract_parameter_set_dials() %>%
update(learn_rate = threshold(c(0.01,0.5))) %>%
update(trees = trees(c(500,1000))) %>%
update(tree_depth = tree_depth(c(5,15))) %>%
update(sample_size = threshold(c(0.5,1))) %>%
finalize(melbourne_train)
xgb_param
# Inspect the parameter ranges; all are now determined
xgb_param %>% extract_parameter_dials("trees")
xgb_param %>% extract_parameter_dials("min_n")
xgb_param %>% extract_parameter_dials("tree_depth")
xgb_param %>% extract_parameter_dials("learn_rate")
xgb_param %>% extract_parameter_dials("loss_reduction")
xgb_param %>% extract_parameter_dials("sample_size")
xgb_param %>% extract_parameter_dials("stop_iter")
melbourne_folds <- vfold_cv(melbourne, v = 5)
# Run Bayesian optimization
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
xgb_res_bo <-
xgb_wflow %>%
tune_bayes(
resamples = melbourne_folds,
metrics = metric_set(rsq, rmse, mae),
initial = 10,
param_info = xgb_param,
iter = 100,
control = ctrl
)
t2<-proc.time()
cat(t2-t1)
# CPU 3014 1.99 3989.22 NA NA
# GPU 5892.78 4.16 8416.28 NA NA
The GPU run took more than twice as long (8416.28 s versus 3989.22 s elapsed). Watching the data traffic during training showed a substantial amount of copying between the GPU and the CPU.
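The five numbers printed by cat(t2-t1) are the fields of R's proc.time(): user, system, and elapsed seconds, followed by the two child-process times (NA here). The figure that matters for this comparison is the elapsed time. As a quick check of the ratio, here is a small Python sketch that parses the two recorded lines; the helper name elapsed_seconds is an illustrative assumption.

```python
def elapsed_seconds(proc_time_line):
    """Return the elapsed field (third value) of a printed proc.time() difference."""
    fields = proc_time_line.split()
    return float(fields[2])


# The two result lines recorded in the R script's comments.
cpu = elapsed_seconds("3014 1.99 3989.22 NA NA")
gpu = elapsed_seconds("5892.78 4.16 8416.28 NA NA")

ratio = gpu / cpu
print(f"GPU elapsed / CPU elapsed = {ratio:.2f}")
```

The same helper applies to the LightGBM and CatBoost timings recorded later in the R code.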
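2. LightGBM
LightGBM was developed by Microsoft. Its default CRAN build does not support GPUs either; the GPU version has to be compiled from source. For details, see "Installation Guide: Build GPU Version" and the LightGBM R-package GitHub page. After compiling the GPU version of LightGBM, run build_r.R in the project root to build and install the GPU version of the lightgbm R package. The GPU version uses the OpenCL API by default, which Nvidia also supports; to build against the CUDA-specific API, see "Installation Guide: Build CUDA Version". For the earlier Python tests I compiled version 3.2.1.99 with CMake and the VS Build Tools. According to the LightGBM release notes, the latest version is currently 3.3.4, which mainly adds compatibility with R 4.2; releases after 3.2.1 bring no major changes to GPU support, so I keep testing with 3.2.1.99 and will upgrade when needed. The R-package documentation includes a simple test using LightGBM's native R API; on that example's small dataset the CPU is clearly faster than the GPU.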
Rscript build_r.R --use-gpu
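According to the LightGBM parameter documentation, the OpenCL build needs two parameters to identify the GPU vendor and device: gpu_platform_id and gpu_device_id. They can be looked up with the GPUCapsViewer tool, as shown in the figure below, but LightGBM numbers them from 0, so 1 must be subtracted from each value. On my laptop, for example, the integrated Intel graphics has platform number 1 and the Nvidia GPU has platform number 2, so in the R program gpu_platform_id is 2-1=1 and gpu_device_id is 1-1=0.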
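The subtraction described above is easy to get wrong when switching between devices. A minimal sketch of the conversion, assuming (as the text states) that the viewer counts from 1 while LightGBM counts from 0; the function name to_lightgbm_ids is hypothetical:

```python
def to_lightgbm_ids(platform_number, device_number):
    """Convert 1-based platform/device numbers to LightGBM's 0-based IDs."""
    if platform_number < 1 or device_number < 1:
        raise ValueError("viewer numbers are expected to start at 1")
    return {
        "gpu_platform_id": platform_number - 1,
        "gpu_device_id": device_number - 1,
    }


# The Nvidia GPU on the author's laptop: platform 2, device 1 in the viewer.
nvidia = to_lightgbm_ids(platform_number=2, device_number=1)
print(nvidia)
```

The resulting dictionary holds the two values that are passed to the engine together with device="gpu".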
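The data loading and other shared code is the same as before and is not repeated. Specifying device="gpu" together with the two ID parameters is enough to use the GPU: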
set_engine('lightgbm', device="gpu", gpu_platform_id=1, gpu_device_id = 0)
# Provides the parsnip interface for LightGBM
library(bonsai)
# Bayesian optimization
# Recipe parameters, main model parameters, and engine-specific parameters can all be tuned.
# Define the recipe: regression formula and preprocessing
melbourne_rec<-
recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
+ Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
# Normalize the numeric predictors
step_normalize(all_numeric_predictors())
# Define the model: LightGBM, with the parameters to tune
lgbm_spec <-
boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune(),
loss_reduction = tune(), sample_size = tune(), mtry=tune()) %>%
# set_engine('lightgbm') %>%
# The integrated Intel GPU has gpu_platform_id=0 and gpu_device_id=0; the discrete Nvidia GPU has gpu_platform_id=1
set_engine('lightgbm', device="gpu", gpu_platform_id=1, gpu_device_id = 0) %>%
set_mode('regression')
# Define the workflow
lgbm_wflow <-
workflow() %>%
add_model(lgbm_spec) %>%
add_recipe(melbourne_rec)
# The bounds of mtry are not fully determined; finalize() settles them.
lgbm_param <- lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(learn_rate = threshold(c(0.01,0.5))) %>%
update(trees = trees(c(500,1000))) %>%
update(tree_depth = tree_depth(c(5,15))) %>%
update(mtry = mtry(c(3,6))) %>%
update(sample_size = threshold(c(0.5,1))) %>%
finalize(melbourne_train)
# Inspect the parameter ranges; all are now determined
lgbm_param %>% extract_parameter_dials("trees")
lgbm_param %>% extract_parameter_dials("min_n")
lgbm_param %>% extract_parameter_dials("tree_depth")
lgbm_param %>% extract_parameter_dials("learn_rate")
lgbm_param %>% extract_parameter_dials("loss_reduction")
lgbm_param %>% extract_parameter_dials("sample_size")
lgbm_param %>% extract_parameter_dials("mtry")
melbourne_folds <- vfold_cv(melbourne, v = 5)
# Run Bayesian optimization
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
lgbm_res_bo <-
lgbm_wflow %>%
tune_bayes(
resamples = melbourne_folds,
metrics = metric_set(rsq, rmse, mae),
initial = 10,
param_info = lgbm_param,
iter = 100,
control = ctrl
)
t2<-proc.time()
cat(t2-t1)
#CPU 4760.83 2.64 5503.5 NA NA
#GPU 5834.04 5.57 8285.5 NA NA
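Here too the CPU is faster: the GPU run took about 1.5 times as long (8285.5 s versus 5503.5 s elapsed).
3. CatBoost
CatBoost is an open-source GBDT framework developed by Yandex, the Russian search engine company. Its precompiled builds for every operating system support GPUs and can be downloaded from the latest release on the project page; the latest version is currently 1.1.1. To use the GPU, add the parameter task_type = 'GPU' (see the parameter documentation).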
# Provides the parsnip interface for catboost
library(treesnip)
# Define the recipe: regression formula and preprocessing
# 'Year','YearBuilt','Distance','Lattitude','Longtitude','Propertycount',
# 'Landsize','BuildingArea', 'Rooms','Bathroom', 'Car','Type_h','Type_t','Type_u'
melbourne_rec<-
recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
+ Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
#step_log(BuildingArea, base = 10) %>%
# Normalize the numeric predictors
step_normalize(all_numeric_predictors())
# Define the model: CatBoost
cat_model<-
boost_tree(trees = 1000, learn_rate=0.05) %>%
set_engine("catboost",
loss_function = "RMSE",
eval_metric='RMSE',
task_type = 'GPU' # CatBoost runs less efficiently on the GPU than on the CPU here, possibly because the dataset is not large enough.
) %>%
set_mode("regression")
# Define the workflow
cat_wflow <-
workflow() %>%
add_model(cat_model) %>%
add_recipe(melbourne_rec)
# Train the model
t1<-proc.time()
cat_fit <- fit(cat_wflow, melbourne_train)
t2<-proc.time()
cat(t2-t1)
# CPU 2.42 0.07 2.68 NA NA
# GPU 12.77 3.44 12.78 NA NA
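CatBoost was roughly five times slower on the GPU (12.78 s versus 2.68 s elapsed), so I have not yet run the 100-iteration Bayesian optimization for it. During 5-fold cross-validation, however, both its CPU and GPU utilization were close to 100%.
II. Tests in Python
1. XGBoost
The Python version of XGBoost is a precompiled binary that already supports GPUs; installing it with pip is enough.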
pip install xgboost
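For Bayesian optimization on the GPU, the only change is again the parameter tree_method='gpu_hist'. The optimizer on the Python side differs from the one in R: the code here uses hyperopt's tpe.suggest, a Tree-structured Parzen Estimator, whereas tune_bayes() in R fits a Gaussian process model. Each hyperopt step is very fast, possibly because it evaluates only one set of candidate parameters per step (versus thousands in R), so I run 1000 iterations in Python but only 100 in R. Switching between CPU and GPU mode only requires editing the objective function f_xgb() used by the optimizer.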
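Since the CPU and GPU variants differ only in one keyword argument, the switch can be isolated in a small helper instead of commenting lines in and out. This is a sketch under that assumption; xgb_device_params is a hypothetical name, and the dictionaries are only built and printed here, not passed to XGBoost.

```python
def xgb_device_params(use_gpu):
    """Return the extra XGBoost keyword arguments for the chosen device."""
    # 'gpu_hist' is the GPU tree method used in this article (XGBoost 1.7.x).
    # An empty dict leaves tree_method at the library default, which runs on the CPU.
    return {"tree_method": "gpu_hist"} if use_gpu else {}


# Arguments shared by both variants, as in the article's objective function.
common = {"objective": "reg:squarederror", "seed": 0, "verbosity": 0}

gpu_kwargs = {**common, **xgb_device_params(use_gpu=True)}
cpu_kwargs = {**common, **xgb_device_params(use_gpu=False)}
print(gpu_kwargs)
print(cpu_kwargs)
```

Inside the objective function one would then construct the model as XGBRegressor(**gpu_kwargs, **params). Note that from XGBoost 2.0 onward the GPU is selected with device="cuda" together with tree_method="hist", and gpu_hist is deprecated; the 1.7.x release used here still expects gpu_hist.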
The part shared by all algorithms: loading the packages and the data.
# Load the shared packages
# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')
# Basic Imports
import numpy as np
import pandas as pd
import time
# Preprocessing
from sklearn.model_selection import train_test_split, KFold, cross_val_score
# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Model Tuning
from hyperopt import fmin, tpe, hp, Trials
from hyperopt.fmin import generate_trials_to_calculate
# Load the data, split it into training and validation sets, and standardize it
# 9015
df_NN = pd.read_csv("D:/temp/data/Melbourne_housing/Melbourne_housing_pre.csv", encoding="utf-8")
X=df_NN[['Year','YearBuilt','Distance','Lattitude','Longtitude','Propertycount',
'Landsize','BuildingArea', 'Rooms','Bathroom', 'Car','Type_h','Type_t','Type_u']]
y=df_NN['LogPrice']
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size = .20, random_state=42)
train_X2 = train_X.copy()
valid_X2 = valid_X.copy()
# Data standardization
mean = train_X.mean(axis=0)
train_X -= mean
std = train_X.std(axis=0)
train_X /= std
valid_X -= mean
valid_X /= std
XGBoost:
# ML Models
from xgboost import XGBRegressor
# Define the parameter search space; narrowing the ranges makes the search much faster
space_xgb = {
'max_bin': hp.choice('max_bin', range(8, 128)), # CPU 50-501 GPU 8-128
'max_depth': hp.choice('max_depth', range(3, 11)),
'n_estimators': hp.choice('n_estimators', range(100, 1001)),
'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),
'subsample': hp.uniform('subsample', 0.5, 0.99),
'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.99),
'reg_alpha': hp.uniform('reg_alpha', 0, 5), # lambda_l1
'reg_lambda': hp.uniform('reg_lambda', 0, 3), # lambda_l2
'gamma': hp.uniform('gamma',0.0, 10), # min_split_loss, min_split_gain
'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
}
# Define the objective function
def f_xgb(params):
    # Set extra_trees=True to avoid overfitting
    # CPU 4.96s/trial
    # xgb = XGBRegressor(objective ='reg:squarederror', seed = 0,verbosity=0, **params)
    # GPU 8.68s/trial
    xgb = XGBRegressor(tree_method='gpu_hist', objective ='reg:squarederror', seed = 0,verbosity=0,**params)
    #xgb_model = xgb.fit(train_X, train_y)
    #acc = xgb_model.score(valid_X,valid_y)
    # acc = cross_val_score(xgb, train_X, train_y).mean() # CPU
    # cross_val_score uses the regressor's default score (R^2); fmin() minimizes, so return its negative
    acc = cross_val_score(xgb, train_X, train_y, n_jobs=6).mean() # GPU
    return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{
'max_bin':4, # default 256
'max_depth':5, # default 6
'n_estimators':578, # default 100
'learning_rate':0.05508679239402551, # default 0.3
'subsample':0.8429852720715357, # default 1.0
'colsample_bytree':0.8413894273173292, # default 1.0
'reg_alpha': 0.809791155072757, # default 0.0
'reg_lambda':1.4490119256389808, # default 1.0
'gamma':0.008478702584417519, # default 0.0
'min_child_weight':24.524635200338793, # default 1
}])
t1 = time.time()
# GPU: 1000trial [2:24:41, 8.68s/trial, best loss: -0.9080128034320879]
best_params = fmin(f_xgb, space_xgb, algo=tpe.suggest, max_evals=999, trials=trials)
t2 = time.time()
# 8681.310757875443
print("Time elapsed: ", t2-t1)
print('best:')
print(best_params)
The Python version of XGBoost is also slower on the GPU: 8.68 s per trial versus 4.96 s on the CPU, about 75% longer.
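One detail of the tuning code is easy to miss. For parameters defined with hp.choice, hyperopt records the index of the chosen option rather than the option itself, both in the dictionary returned by fmin() and, as far as I can tell, in the starting points given to generate_trials_to_calculate(). With a range(start, stop) list of options, index and value therefore differ by start. The two helpers below are hypothetical names that make the conversion explicit; they use plain Python and do not call hyperopt.

```python
def choice_index(value, options):
    """Index that would be stored for `value` in an hp.choice over `options`."""
    return list(options).index(value)


def choice_value(index, options):
    """Actual parameter value behind a stored hp.choice index."""
    return list(options)[index]


max_bin_options = range(8, 128)          # same range as space_xgb['max_bin']
n_estimators_options = range(100, 1001)  # same range as space_xgb['n_estimators']

# Starting point used above: 'max_bin': 4 and 'n_estimators': 578 are indices.
print(choice_value(4, max_bin_options))
print(choice_value(578, n_estimators_options))

# Going the other way: to start from max_bin = 63, pass the index 63 - 8.
print(choice_index(63, max_bin_options))
```

For a whole result dictionary, hyperopt's space_eval(space, best_params) performs the same index-to-value mapping.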
2. LightGBM
To install the Python version, see the LightGBM python-package documentation. I upgraded to the latest version, 3.3.4. The install option below uses the default OpenCL API, which is what I test here.
pip install lightgbm --install-option=--gpu
To install the CUDA-specific version on Windows, set up the Visual Studio build environment first, because pip invokes it to compile the package.
pip install lightgbm --install-option=--cuda
For Bayesian optimization, add these parameters when calling lightgbm: device='gpu', gpu_platform_id=1, gpu_device_id = 0. Note that the valid range of max_bin differs between CPU and GPU; an incorrect setting on the GPU causes an "index out of range" error. The shared code is not repeated.
# ML Models
from lightgbm import LGBMRegressor
# --------------------------------------------------------------------------------------------------
# Auto search for better hyper parameters with hyperopt, only need to give a range
# Reference: https://www.pythonf.cn/read/6998
# https://lightgbm.readthedocs.io/en/latest/Parameters.html
# https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#deal-with-over-fitting
# https://lightgbm.readthedocs.io/en/latest/GPU-Performance.html
# Dealing with overfitting
# Use a smaller number of histogram bins: max_bin
# Use a smaller number of leaves: num_leaves
# Use min_child_samples (min_data_in_leaf) and min_child_weight (= min_sum_hessian_in_leaf)
# Use bagging by setting subsample (bagging_fraction) and subsample_freq (= bagging_freq)
# Use feature subsampling by setting colsample_bytree (feature_fraction)
# Use more training data
# Use regularization via reg_alpha (lambda_l1), reg_lambda (lambda_l2), and min_split_gain (min_gain_to_split)
# Try max_depth to avoid growing trees that are too deep
# Try extra_trees
# Try increasing path_smooth
# trials = generate_trials_to_calculate([{'max_bin':63-8, # default CPU 255 GPU 63
# 'max_depth':5-3, # default -1
# 'num_leaves':31-20, # default 31
# 'min_child_samples':20-10, # default 20
# 'subsample_freq':1-1, # default 1
# 'n_estimators':6000-1000, # default 10
# 'learning_rate':0.01, # default 0.1
# 'subsample':0.75, # default 1.0
# 'colsample_bytree':0.8, # default 1.0
# 'lambda_l1':0.0, # default 0.0
# 'lambda_l2':0.0, # default 0.0
# 'min_child_weight':0.001, # default 0.001
# 'min_split_gain':0.0, # default 0.0
# #'path_smooth':0.0 # default 0.0
# }])
# Narrowing the parameter ranges makes the search much faster
space_lgbm = {
'max_bin': hp.choice('max_bin', range(8, 128)), # CPU 50-501 GPU 8-128
'max_depth': hp.choice('max_depth', range(3, 31)),
'num_leaves': hp.choice('num_leaves', range(10, 256)),
'min_child_samples': hp.choice('min_child_samples', range(10, 51)),
'subsample_freq': hp.choice('subsample_freq', range(1, 6)),
'n_estimators': hp.choice('n_estimators', range(500, 6001)),
'learning_rate': hp.uniform('learning_rate', 0.005, 0.15),
'subsample': hp.uniform('subsample', 0.5, 0.99),
'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.99),
'reg_alpha': hp.uniform('reg_alpha', 0, 5), # lambda_l1
'reg_lambda': hp.uniform('reg_lambda', 0, 3), # lambda_l2
'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
'min_split_gain': hp.uniform('min_split_gain',0.0, 1),
#'path_smooth': hp.uniform('path_smooth',0.0, 3)
}
def f_lgbm(params):
    # Set extra_trees=True to avoid overfitting
    # lgbm = LGBMRegressor(seed=0,verbose=-1, **params) # CPU 4.96s/trial
    lgbm = LGBMRegressor(device='gpu', gpu_platform_id=1, gpu_device_id = 0, num_threads =3, **params) # GPU 65.93s/trial
    #lgb_model = lgbm.fit(train_X, train_y)
    #acc = lgb_model.score(valid_X,valid_y)
    # acc = cross_val_score(lgbm, train_X, train_y).mean() # CPU
    acc = cross_val_score(lgbm, train_X, train_y, n_jobs=6).mean() # GPU
    return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'max_bin':63, # default CPU 255 GPU 63
'max_depth':17, # default -1
'num_leaves':12, # default 31
'min_child_samples':14, # default 20
'subsample_freq':0, # default 1
'n_estimators':2647, # default 10
'learning_rate':0.0203187560767722, # default 0.1
'subsample':0.788703175392162, # default 1.0
'colsample_bytree':0.5203150334508861, # default 1.0
'reg_alpha': 0.988139501870491, # default 0.0
'reg_lambda':2.789779486137205, # default 0.0
'min_child_weight':21.813225361674828, # default 0.001
'min_split_gain':0.00039636685518264865, # default 0.0
#'path_smooth':0.0 # default 0.0
}])
t1 = time.time()
# 1000trial [5:23:09, 19.39s/trial, best loss: -0.9082183160929432] CPU
# 1000trial [1:22:39, 4.96s/trial, best loss: -0.9079837941918502] CPU
# 1000trial [1:02:28, 3.75s/trial, best loss: -0.9080477825539048] CPU
best_params = fmin(f_lgbm, space_lgbm, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
print(best_params)
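With 100 iterations the GPU run took 15.09 s/trial. The CPU was fully loaded and the Nvidia GPU a little over half loaded; memory use was low for such a small dataset, and there was some data-copy traffic between GPU and CPU, which should be normal. The CPU version, at 3.28 s/trial, was again more than four times faster.
3. CatBoost
The default pip install already supports GPUs: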
pip install catboost
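To use the GPU, CatBoost only needs the parameter task_type='GPU'. For Bayesian tuning, however, the parameters random_strength, subsample, and rsm are available only in the CPU version and are not supported on the GPU, and border_count has different value ranges on GPU and CPU, so take care. The shared code is again not repeated; see above.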
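One way to keep a single search space for both devices is to filter out the CPU-only parameters before building the GPU model. The sketch below relies only on the list of unsupported parameters given in the text above; the names CPU_ONLY_PARAMS and catboost_params_for are hypothetical, and the example values are made up for illustration.

```python
# Parameters the article reports as unavailable when task_type='GPU'.
CPU_ONLY_PARAMS = {"random_strength", "subsample", "rsm"}


def catboost_params_for(task_type, params):
    """Drop the CPU-only tuning parameters when training on the GPU."""
    if task_type == "GPU":
        return {k: v for k, v in params.items() if k not in CPU_ONLY_PARAMS}
    return dict(params)


# Illustrative candidate, not a tuned result.
candidate = {"depth": 4, "learning_rate": 0.08, "rsm": 0.72, "subsample": 0.95}
print(catboost_params_for("GPU", candidate))
print(catboost_params_for("CPU", candidate))
```

The article's own code takes the simpler route of commenting those three parameters out of the search space when running on the GPU.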
# ML Models
from catboost import CatBoostRegressor
# Auto search for better hyper parameters with hyperopt, only need to give a range
# Reference: https://github.com/talperetz/hyperspace/tree/master/GBDTs
# https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list
# https://catboost.ai/docs/concepts/parameter-tuning.html
# https://affine.ai/catboost-a-new-game-of-machine-learning/
'''
https://catboost.ai/en/docs/concepts/speed-up-training
Speeding up the training
1. iterations, worked
2. learning_rate, worked
2. boosting_type, Ordered, Plain, not worked
3. bootstrap_type, Bayesian, Bernoulli, MVS, Poisson, not worked
4. subsample, not worked
This parameter can be used if one of the following bootstrap types is selected:
Poisson Bernoulli MVS
5. one_hot_max_size, One-hot encoding
6. rsm, colsample_bylevel, Random subspace method
7.leaf_estimation_iterations, worked, set to 1.
Try setting the value to "1" or "5" to speed up the training on datasets with a small number of features.
8. max_ctr_complexity, worked, 0 or 2 to speed up trainning.
This parameter can affect the training time only if the dataset contains categorical features.
9. border_count, worked, set to less.
10.Reusing quantized datasets in Python, not applyable to cross_val_score()
11.Golden features. If the dataset has a feature, which is a strong predictor of the result, the
pre-quantisation of this feature may decrease the information that the model can get from it.
It is recommended to use an increased number of borders (1024) for this feature.
per_float_feature_quantization=['0:border_count=1024', '1:border_count=1024']
'''
# default values
# trials = generate_trials_to_calculate([{'border_count':254-150, # default CPU 254 GPU 128
# 'iterations':1000-500, # default 1000
# 'depth': 6-2, # default 6
# 'random_strength':1.0, # default 1.0, CPU only
# 'learning_rate': 0.03, # default 0.03
# 'subsample':0.8, # default 0.8
# 'l2_leaf_reg': 3.0, # default 3
# 'rsm':0.8, # default 1.0 CPU only
# 'fold_len_multiplier':2.0, # default 2.0
# 'bagging_temperature':1.0 # default 1.0
# }])
# Narrowing the parameter ranges makes the search much faster
space_cat = {'border_count': hp.choice('border_count', range(8, 128)), # CPU 150-351 GPU 8-128
'iterations': hp.choice('iterations', range(500, 1501)),
'depth': hp.choice('depth', range(2, 10)),
#'random_strength': hp.uniform('random_strength', 1, 20),
'learning_rate': hp.uniform('learning_rate', 0.005, 0.15),
# 'subsample': hp.uniform('subsample', 0.5, 1),
'l2_leaf_reg': hp.uniform('l2_leaf_reg', 1, 100),
# 'rsm': hp.uniform('rsm', 0.5, 0.99), # colsample_bylevel
'fold_len_multiplier': hp.uniform('fold_len_multiplier', 1.0, 10.0),
'bagging_temperature': hp.uniform('bagging_temperature', 0.0, 1.0) }
def f_cat(params):
    # cat = CatBoostRegressor(task_type='CPU', random_seed=0,
    cat = CatBoostRegressor(task_type='GPU', random_seed=0,
                            # boosting_type='Plain', bootstrap_type = 'Bayesian', max_ctr_complexity=1,
                            one_hot_max_size=3,
                            leaf_estimation_iterations=1,
                            #per_float_feature_quantization=['3:border_count=1024', '4:border_count=1024'], # Golden features: lat, long
                            verbose=False, **params) # CPU 13.05s/trial
    acc = cross_val_score(cat, train_X, train_y, n_jobs=3).mean()
    return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'border_count':112, # default CPU 254 GPU 128
'iterations':989, # default 1000
'depth': 4, # default 6
#'random_strength':6.6489521372262645, # default 1.0, CPU only
'learning_rate': 0.07811835381238333, # default 0.03
#'subsample':0.9484820488113903, # default 0.8
'l2_leaf_reg': 8.070279328038293, # default 3
#'rsm':0.7188098046587024, # default 1.0 CPU only
'fold_len_multiplier': 6.034216410528531, # default 2.0
'bagging_temperature':0.47787665340753926 # default 1.0
}])
t1 = time.time()
# 1000trial [50:28, 3.03s/trial, best loss: -0.905859099632395]
best_params = fmin(f_cat, space_cat, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
print('best:')
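For such a small dataset, the GPU run showed the GPU fully loaded, the CPU over half loaded, and memory full, and a single iteration took 350 seconds. That is not normal; either some parameters are still set incorrectly, or the dataset needs some additional preprocessing.
By contrast, the CPU version ran with the CPU fully loaded and modest memory use, at a respectable 7.46 s/trial.
III. The official LightGBM example
LightGBM's official HIGGS classification example states that the GPU should give a speedup of more than three times. On my laptop, performance was roughly equal. (The GPU was not fully loaded: the GPU versions of GBDT algorithms move only part of the computation to the GPU while much of it still runs on the CPU, so typically the CPU is fully loaded and the GPU is not.) This suggests a hardware limit: the NVIDIA GeForce RTX 2060 Max-Q in the Lenovo Legion Y9000X is not powerful enough. The dataset has 11 million rows and 28 variables and is about 7.5 GB uncompressed, which is not small, so tests on it should be informative. I use 9.9 million rows for training and 1.1 million rows (10%) for validation. Links: data download, reference material.
The part shared by all algorithms: loading the data.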
# -*- coding: utf-8 -*-
"""
Created on Thu Sep 23 15:58:22 2021
@author: Jean
"""
'''
This is a classification problem to distinguish between a signal process
which produces Higgs bosons and a background process which does not.
The first column is the class label (1 for signal, 0 for background),
followed by the 28 features (21 low-level features then 7 high-level features):
lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi,
jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag,
jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag,
m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb.
'''
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
t1 = time.time()
# Load dataset
df = pd.read_csv("D:/temp/data/HIGGS/HIGGS.csv", encoding="utf-8", header=None)
# Target column changed to int
df.iloc[:,0] = df.iloc[:,0].astype(int)
t2 = time.time()
# 50.2845721244812
print(t2-t1)
df.shape
df.head(2)
t1 = time.time()
X = df.iloc[:,1:]
y = df.iloc[:,0]
t1 = time.time()
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size = .10, random_state=2023)
t2 = time.time()
print(t2-t1)
# 5.838041305541992
Comparing CPU and GPU performance at different numbers of iterations.
# Set objective=regression to change to a regression problem
import lightgbm as lgb
# create dataset for lightgbm
dtrain = lgb.Dataset(train_X, train_y)
dvalid = lgb.Dataset(valid_X, valid_y, reference=dtrain)
t_cpu = []; t_nvidia = []; t_intel=[]
a_cpu = []; a_nvidia = []; a_intel=[]
t3 = time.time()
# Parameters shared by all three devices; only the device settings differ.
base_params = {'objective':'binary',
               'max_bin': 63,
               'num_leaves': 255,
               'learning_rate': 0.1,
               'tree_learner': 'serial',
               'task': 'train',
               'is_training_metric': 'false',
               'min_data_in_leaf': 1,
               'min_sum_hessian_in_leaf': 100,
               'ndcg_eval_at': [1, 3, 5, 10]
               }
# (message label, device parameters, timing list, AUC list) for each device
devices = [
    # CPU
    ('cpu', {'device': 'cpu'}, t_cpu, a_cpu),
    # NVIDIA GeForce RTX 2060 with Max-Q Design
    ('gpu', {'device': 'gpu', 'gpu_platform_id': 1, 'gpu_device_id': 0}, t_nvidia, a_nvidia),
    # Intel(R) UHD Graphics (no platform or device ID given)
    ('gpu', {'device': 'gpu'}, t_intel, a_intel),
]
# Recorded elapsed times (s):
#   CPU     50: 46.00722551345825   100: 138.17840361595154   150: 195.13047289848328
#   NVIDIA  50: 54.93808197975159   100: 103.01487278938293   150: 146.14963364601135
#   Intel   50: 62.83425784111023
# Recorded AUC:
#   CPU     0.8207000864285819 0.8304736031418638 0.8353184609588433
#   NVIDIA  0.8207000821252757 0.8304736011279031 0.8353184579727403
#   Intel   0.8207000820323747
for num_iterations in [50,100,150,200]:
    for label, device_params, times, aucs in devices:
        params = dict(base_params, num_iterations=num_iterations, **device_params)
        t0 = time.time()
        gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                        valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
        t1 = time.time()
        t = round(t1-t0,2)
        times.append(t)
        print('{} version elapse time: {}'.format(label, t1-t0))
        # predict
        y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
        auc_score = roc_auc_score(valid_y,y_pred)
        aucs.append(round(auc_score,4))
        print(auc_score)
t4 = time.time()
print('Total elapse time: {}'.format(t4-t3))
Plotting.
perf_t = pd.DataFrame({"iterations":[50,100,150,200],"cpu":t_cpu,"Nvidia":t_nvidia,"Intel":t_intel})
perf_a = pd.DataFrame({"iterations":[50,100,150,200],"cpu":a_cpu,"Nvidia":a_nvidia,"Intel":a_intel})
perf_a["cpu"] = perf_a["cpu"]*100
perf_a["Nvidia"] = perf_a["Nvidia"]*100
perf_a["Intel"] = perf_a["Intel"]*100
iterations = [50,100,150,200]
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"]=["SimHei"] # font with CJK glyphs (only needed for Chinese labels)
plt.rcParams["axes.unicode_minus"]=False # render minus signs correctly
fig,ax1 = plt.subplots()
ax2 = ax1.twinx() # second y-axis sharing the same x-axis
ax1.plot(iterations,t_cpu,'b', label="CPU")
ax1.plot(iterations,t_nvidia,'g', label="Nvidia")
ax1.plot(iterations,t_intel,'r', label="Intel")
ax1.legend(loc="upper left")
ax2.plot(iterations,perf_a["cpu"],"b--", label="CPU")
ax2.plot(iterations,perf_a["Nvidia"],"g--", label="Nvidia")
ax2.plot(iterations,perf_a["Intel"] ,"r--", label="Intel")
ax2.legend(loc="lower right")
ax1.set_xlabel('Iterations') # x-axis label
ax1.set_ylabel('Time (s)') # left y-axis label
ax2.set_ylabel('AUC (%)') # right y-axis label
plt.show()
Test on the CPU: with num_iterations=50 it takes about 48 seconds. The first run has to load the data and takes a little longer, so the second run is the one that counts. (Note: the outputs in this part are from runs in which all 11 million rows were used for training.)
[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.191587 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
cpu version elapse time: 47.76462769508362
Test on the Nvidia GPU: with num_iterations=50 it takes about 60 seconds. As the number of iterations grows, for example from 100 onward, the GPU becomes faster than the CPU.
[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] Using requested OpenCL platform 1 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 2060 with Max-Q Design, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.480376 secs. 0 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
gpu version elapse time: 59.27095651626587
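The transfer line in the log above can be sanity-checked with a little arithmetic. Assuming each binned feature value is stored in one byte (an inference from the numbers, plausible because max_bin=63 fits in a byte, and not something taken from the LightGBM documentation), 11,000,000 rows by 28 features matches the reported 293.73 MB when a megabyte is counted as 2^20 bytes:

```python
rows = 11_000_000   # training rows reported in the log
features = 28       # used features reported in the log

# Assumption: one byte per binned feature value.
size_mib = rows * features / 2**20
print(f"{size_mib:.2f} MiB")

# Effective host-to-device throughput for the reported 0.480376 s transfer.
throughput = size_mib / 0.480376
print(f"{throughput:.0f} MiB/s")
```

So the one-off upload costs well under a second on this machine; if the assumption holds, the initial transfer of the binned data is not what makes the 50-iteration GPU run slower than the CPU run.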
The Intel GPU integrated into the CPU takes about 70 seconds:
[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] Using GPU Device: Intel(R) UHD Graphics, Vendor: Intel(R) Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.262090 secs. 0 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
gpu version elapse time: 70.56596112251282
The figure above shows that as num_iterations increases (50, 100, 150, 200), the Nvidia GPU (green) gradually overtakes the CPU (blue); at num_iterations=200 it is already about 50% faster, and the solid red line shows that by then even the integrated Intel GPU is faster than the CPU. The red dashed line shows that AUC also rises as the number of iterations increases. With identical parameters, accuracy does not differ across devices, so the three dashed lines coincide. The GPU's advantage can therefore be confirmed in this example: higher accuracy needs more training iterations, and that is where the GPU speeds up training. In other words, the advantage only shows when the dataset is large and the number of iterations is high.
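The crossover can be estimated from the timings recorded in the benchmark code's comments (50, 100, and 150 iterations). The sketch below fits a straight line, a fixed cost plus a cost per iteration, to each device and solves for the point where the two lines meet. The linear model and the helper fit_line are assumptions made for illustration; with only three points per device, the result is a rough estimate.

```python
def fit_line(points):
    """Ordinary least-squares fit y = intercept + slope * x; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# (iterations, seconds), rounded from the recorded timings
cpu_times = [(50, 46.01), (100, 138.18), (150, 195.13)]
gpu_times = [(50, 54.94), (100, 103.01), (150, 146.15)]

cpu_intercept, cpu_slope = fit_line(cpu_times)
gpu_intercept, gpu_slope = fit_line(gpu_times)

# Iteration count at which both fitted lines predict the same training time.
break_even = (gpu_intercept - cpu_intercept) / (cpu_slope - gpu_slope)
print(f"CPU: {cpu_slope:.2f} s/iteration, GPU: {gpu_slope:.2f} s/iteration")
print(f"estimated break-even near {break_even:.0f} iterations")
```

On these numbers the GPU costs less per iteration but starts from a higher fixed cost, which is consistent with the observation that it is slower at 50 iterations and faster from 100 onward.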
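IV. Classifying the HIGGS dataset with XGBoost
The summary post on XGBoost parameter discussions from the Kaggle HIGGS competition did not yield a good AUC. I therefore trained the model from scratch with Bayesian optimization, starting from the default values; see the XGBoost parameter documentation. The data-loading code is not repeated.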
import time
from hyperopt import fmin, hp, tpe
from hyperopt.fmin import generate_trials_to_calculate
from xgboost import XGBClassifier

# Search space. hp.choice picks an index into the given range.
space_xgb = {
    'max_bin': hp.choice('max_bin', range(50, 512)),  # CPU 50-501
    'max_depth': hp.choice('max_depth', range(3, 11)),
    'n_estimators': hp.choice('n_estimators', range(100, 1001)),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.5),
    'subsample': hp.uniform('subsample', 0.5, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'reg_alpha': hp.uniform('reg_alpha', 0, 5),    # lambda_l1
    'reg_lambda': hp.uniform('reg_lambda', 0, 3),  # lambda_l2
    'gamma': hp.uniform('gamma', 0.0, 10),         # min_split_loss, min_split_gain
    'min_child_weight': hp.uniform('min_child_weight', 0.0001, 50),
}

def f_xgb(params):
    # CPU variant:
    # xgb = XGBClassifier(objective='binary:logistic', use_label_encoder=False, seed=2023,
    #                     nthread=-1, verbosity=0, **params)
    # GPU variant:
    xgb = XGBClassifier(tree_method='gpu_hist', objective='binary:logistic', use_label_encoder=False,
                        nthread=-1, seed=2023, verbosity=0, **params)
    xgb_model = xgb.fit(train_X, train_y)
    acc = xgb_model.score(valid_X, valid_y)
    # acc = cross_val_score(xgb, train_X, train_y).mean()            # CPU
    # acc = cross_val_score(xgb, train_X, train_y, n_jobs=6).mean()  # GPU
    return -acc  # fmin minimizes, so return the negative accuracy

# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values.
# For hp.choice parameters the initial value is the index into the range, hence "value - lower bound".
trials = generate_trials_to_calculate([{
    'max_bin': 256 - 50,        # default 256
    'max_depth': 6 - 3,         # default 6 [0,∞]
    'n_estimators': 200 - 100,  # starts at 200 (XGBClassifier default is 100)
    'learning_rate': 0.3,       # default 0.3 [0,1]
    'subsample': 1.0,           # default 1.0 (0,1]
    'colsample_bytree': 1.0,    # default 1.0 (0,1]
    'reg_alpha': 0,             # default 0.0
    'reg_lambda': 1.0,          # default 1.0
    'gamma': 0,                 # default 0.0 [0,∞]
    'min_child_weight': 1,      # default 1 [0,∞]
}])
t1 = time.time()
best_params = fmin(f_xgb, space_xgb, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
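The starting point above is written as 256-50, 6-3 and 200-100 because hyperopt stores an hp.choice parameter as an index into its option list, not as the value itself. The sketch below reproduces that mapping in plain Python. The helper names to_index and to_value are made up for illustration and are not part of hyperopt. For the same reason, best_params returned by fmin holds indices for these three parameters, and hyperopt's space_eval(space_xgb, best_params) converts them back to values.

```python
# Option lists behind the three hp.choice parameters of the search space.
choices = {
    "max_bin": list(range(50, 512)),
    "max_depth": list(range(3, 11)),
    "n_estimators": list(range(100, 1001)),
}

def to_index(name, value):
    """Position of `value` in the option list of parameter `name` (hypothetical helper)."""
    return choices[name].index(value)

def to_value(name, index):
    """Inverse of to_index: the option stored at `index` (hypothetical helper)."""
    return choices[name][index]

# Desired starting values, converted to the indices an initial trial needs.
wanted = {"max_bin": 256, "max_depth": 6, "n_estimators": 200}
start = {name: to_index(name, value) for name, value in wanted.items()}
print(start)
```

The printed indices match the hand-written 256-50, 6-3 and 200-100 only because every range here is contiguous with step 1; for an arbitrary option list the index has to be looked up as above.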
First, evaluate with the parameters found after 10 training runs. Although the AUC of this parameter set is not high, the GPU speedup is still fairly pronounced. The plotting code is not repeated either.
import time
from sklearn.metrics import roc_auc_score

t_cpu, t_nvidia = [], []
a_cpu, a_nvidia = [], []

# Parameters shared by the CPU and GPU runs; only tree_method differs.
base_params = {
    'objective': 'binary:logistic',
    'max_bin': 286,
    'learning_rate': 0.3359071085471539,
    'max_depth': 4,
    'min_child_weight': 6.4817419839798385,
    'colsample_bytree': 0.7209249276177966,
    'subsample': 0.5532140686826488,
    'reg_alpha': 2.2793074958255986,
    'reg_lambda': 2.4142485681002315,
    'gamma': 2.9324177415122934,
    'nthread': -1,
}
# (label, tree_method, timing list, AUC list).
# The GPU is an NVIDIA GeForce RTX 2060 with Max-Q Design.
runs = [('cpu', 'hist', t_cpu, a_cpu),
        ('gpu', 'gpu_hist', t_nvidia, a_nvidia)]

t3 = time.time()
# for num_iterations in [50]:
for num_iterations in [50, 100, 150, 200]:
    for label, tree_method, times, aucs in runs:
        params = dict(base_params, n_estimators=num_iterations, tree_method=tree_method)
        t0 = time.time()
        xgb = XGBClassifier(random_state=2023, use_label_encoder=False, **params)
        xgb_model = xgb.fit(train_X, train_y)
        t1 = time.time()
        times.append(round(t1 - t0, 2))
        print('{} version elapse time: {}'.format(label, t1 - t0))
        # predict() returns hard 0/1 labels, so this AUC is computed from class labels
        y_pred = xgb_model.predict(valid_X)
        auc_score = roc_auc_score(valid_y, y_pred)
        aucs.append(round(auc_score, 4))
        print(auc_score)
t4 = time.time()
print('Total elapse time: {}'.format(t4 - t3))
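One possible reason the AUC of this run looks low: the benchmark passes the output of predict(), i.e. hard 0/1 labels, to roc_auc_score. AUC is a ranking metric, and with hard labels it collapses to a single operating point. The sketch below illustrates the effect with a small pure-Python AUC and made-up scores (the numbers are illustrative only, not results from HIGGS). With scikit-learn one would normally pass predict_proba(valid_X)[:, 1] instead.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.40, 0.45, 0.80, 0.55, 0.60]
# Hard labels at a 0.5 threshold, which is what predict() would return.
labels = [int(p >= 0.5) for p in proba]

auc_proba = roc_auc(y_true, proba)
auc_labels = roc_auc(y_true, labels)
print(auc_proba, auc_labels)
```

In this toy case the AUC from probabilities is 8/9 while the AUC from hard labels is 2/3, so the two ways of scoring the same model are not comparable.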
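V. HIGGS dataset classification with CatBoost
The two GPU-acceleration examples above both ran in Python, so this time I try R.
The Higgs Boson Machine Learning Challenge is a Kaggle competition that ended 8 years ago, with 1,784 participating teams. An XGBoost implementation is available here, and the related discussion is in that post. Since no good AUC was reported, I decided to train a model from scratch in R with Bayesian optimization, to see whether a better parameter combination can be found. On a dataset this large training is quite time-consuming, and the GPU speedup is fairly pronounced.
With Bayesian optimization, just producing the first 10 parameter sets takes over an hour, and each parameter set needs about 5 to 10 minutes of model training. tidymodels' Gaussian process requires more initial parameter sets than there are parameters, whereas in Python a single set is enough. tidymodels generates several thousand candidate parameter combinations per iteration, which is slow but needs few iterations; in Python only one candidate is generated per iteration, so many iterations are needed. This is a notable difference between the Bayesian optimization implementations on the two platforms (tidymodels uses a Gaussian process, while the hyperopt code above uses TPE). To speed things up, I resample with bootstraps() only once instead of using 5-fold cross-validation, and then run only 10 iterations.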
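However, 5 of those runs failed because the required memory could not be allocated. According to what I found online, this is a known problem when XGBoost runs on the GPU: after a training run finishes, it does not release the memory it occupied on its own (see that post). There are some workarounds in Python; see the Memory Usage section of "XGBoost GPU Support" and the post "How do I free all memory on GPU in XGBoost". So I tried CatBoost instead.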
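One generic workaround in Python, independent of any XGBoost API, is to run every training trial in a fresh interpreter, so that the operating system reclaims all memory (including GPU memory held by that process) when it exits. The sketch below shows only the process-isolation pattern. CHILD_CODE and run_trial are hypothetical names, and the training step is a placeholder that returns a dummy score; a real child script would load the data and fit the model there.

```python
import json
import subprocess
import sys

# Code executed in the child interpreter. The training step is a placeholder (assumption):
# a real script would fit a model here and report its validation score.
CHILD_CODE = """
import json, sys
params = json.loads(sys.argv[1])
score = 0.0  # placeholder for the real validation score
print(json.dumps({"params": params, "score": score}))
"""

def run_trial(params):
    """Run one trial in a separate Python process and return its JSON result."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, json.dumps(params)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)

result = run_trial({"max_depth": 6})
print(result)
```

The price is that each trial reloads the data, so this only pays off when a single training run is long compared with the loading time.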
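CatBoost supports GPU parallelism for multi-fold cross-validation and grid search. The figure below shows 5-fold cross-validation on Windows with 2 processes via the doParallel package; both the CPU and the GPU are heavily loaded. During Bayesian optimization, however, enabling parallelism produces a message that parsnip cannot find a CatBoost implementation of the boost_tree algorithm. The likely cause is that the pkgs argument has no effect, so the parallel processes do not load the treesnip and catboost packages.
Following that post, preloading the required packages in every worker process of the parallel cluster solves the problem.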
# For GPU training on a large dataset, test minimal 2-way parallelism.
# All operating systems: register the parallel backend and load the required packages on every worker.
# https://github.com/tidymodels/tune/issues/157
library(doParallel)
cl <- makePSOCKcluster(2) # parallel::detectCores()
registerDoParallel(cl)
# Show the number of workers
foreach::getDoParWorkers()
# Load the required packages on every parallel worker
clusterEvalQ(cl, {
  library(tidymodels)
  library(treesnip)
  library(catboost)
})
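CatBoost-HIGGS-GPU: Bayesian optimization of 6 parameters with 2 parallel processes.
However, after running all night the main process ultimately failed to connect to the worker processes and read back the results. It was also far slower than a single process: generating the 10 initial points (i.e. the first 10 training runs) finishes in a little over an hour with one process, while two processes ran all night (probably too many parameters and not enough memory; details below).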
Forced gc(): 0.1 Seconds.
> Generating a set of 10 initial parameter results
Error in unserialize(socklist[[n]]) : error reading from connection
> t2<-proc.time()
> cat(t2-t1)
17.41 18.23 20864.81 NA NA
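Next I tried LightGBM and found that it can run two worker processes in parallel on the GPU. When Bayesian optimization tunes only one parameter, such as tree_depth in the figure below, the run completes without trouble. With more parameters, say 7, memory runs short, heavy swapping to disk sets in, and the program stalls and never finishes. With LightGBM the CPU and memory are maxed out, while the GPU peaks at about 30% with 2-way parallelism. Reportedly, single-process training needs roughly 3 times the memory taken by the data, so two processes need more than 6 times. The 24 GB of memory in my laptop is still not enough for multi-process parallel training on the HIGGS dataset.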
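The 3x rule of thumb can be turned into a rough estimate. The sketch below assumes the full HIGGS table (11,000,000 rows as in the LightGBM log, 28 features plus the label) is held as 8-byte doubles; the factor 3 is the hearsay figure from the paragraph above, not a measured value. It is a lower bound only: it ignores the copy held by the main R process, the train/test split, the preprocessed copies and the operating system.

```python
ROWS = 11_000_000        # rows in HIGGS, as reported in the LightGBM log
COLS = 29                # 28 features + 1 label column
BYTES_PER_VALUE = 8      # assumption: values stored as 8-byte doubles
PER_PROCESS_FACTOR = 3   # rule of thumb quoted in the text, not measured

def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30

data_bytes = ROWS * COLS * BYTES_PER_VALUE

def training_memory_gib(n_processes, factor=PER_PROCESS_FACTOR):
    """Estimated memory for the worker processes alone, in GiB."""
    return gib(data_bytes * factor * n_processes)

print(round(gib(data_bytes), 2))          # raw table
print(round(training_memory_gib(1), 2))   # one worker
print(round(training_memory_gib(2), 2))   # two workers
```

Two workers alone already come to roughly 14 GiB under these assumptions, so once the main process and the intermediate copies are added, running out of a 24 GB machine is plausible.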
LightGBM, Bayesian tuning of one parameter with 2 initial values from grid search and 10 iterations: 5 of the iterations found better parameters, a marked improvement.
Forced gc(): 0.2 Seconds.
-- Iteration 4 -----------------------------------------------------------------------------------------------------
i Current best: roc_auc=0.7369 (@iter 2)
i Gaussian process model
! Gaussian process model: X should be in range (0, 1)
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.046
i Estimating performance
√ Estimating performance
<3 Newest results: roc_auc=0.8011
Forced gc(): 0.3 Seconds.
-- Iteration 5 -----------------------------------------------------------------------------------------------------
i Current best: roc_auc=0.8011 (@iter 4)
i Gaussian process model
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.0999
i Estimating performance
√ Estimating performance
<3 Newest results: roc_auc=0.8116
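Then I tried tuning one parameter with CatBoost on the GPU with 2 parallel processes, and it completed without trouble, averaging about 530 seconds per training round. The tests show that for large datasets memory must be large enough and the number of tuned parameters must not be too large. :)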
-- Iteration 10 ----------------------------------------------------------------------------------------------------
i Current best: roc_auc=0.8295 (@iter 3)
i Gaussian process model
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.0999
i Estimating performance
√ Estimating performance
(x) Newest results: roc_auc=0.8293
> t2<-proc.time()
> cat(t2-t1)
105.55 5711.49 10611.05 NA NA
To get a better AUC, several parameters have to be tuned. In the end I ran the 6-parameter CatBoost GPU tuning in a single process; only the first 50 iterations were actually run, and that already took a whole night.
library(tidymodels)
library(kableExtra)
library(tidyr)
# This version of treesnip supports classification.
# remotes::install_github("Glemhel/treesnip", INSTALL_opts = c("--no-multiarch"))
# library(catsnip)
library(treesnip)
library(data.table)
# For GPU training on a large dataset, 2-way parallelism was tested: tuning one parameter works,
# but tuning several parameters does not finish because there is not enough memory.
# All operating systems: register the parallel backend and load the required packages on every worker.
# https://github.com/tidymodels/tune/issues/157
# library(doParallel)
# cl <- makePSOCKcluster(2) # parallel::detectCores()
# registerDoParallel(cl)
# # Show the number of workers
# foreach::getDoParWorkers()
# # Load the required packages on every parallel worker
# clusterEvalQ(cl,
# {library(tidymodels)
# library(treesnip)
# library(catboost)
# })
# Prefer the tidymodels functions when names conflict.
tidymodels_prefer()
# ----------------------------------------------------------------------------------------
# Load the preprocessed data
t1<-proc.time()
higgs<- fread("D:/temp/data/HIGGS/HIGGS.csv", header=FALSE, encoding="UTF-8")
higgs$V1<-as.factor(higgs$V1)
t2<-proc.time()
cat(t2-t1)
# 17.41 16.25 34.02 NA NA
names(higgs)
# Split into training and test sets
t1<-proc.time()
set.seed(2023)
higgs_split <- initial_split(higgs, prop = 0.90)
higgs_train <- training(higgs_split)
higgs_test <- testing(higgs_split)
t2<-proc.time()
cat(t2-t1)
# 6.63 0.55 7.2 NA NA
# ----------------------------------------------------------------------------------------
# Bayesian optimization
# Recipe parameters, main model parameters and engine-specific parameters can all be tuned.
# Define the recipe: model formula and preprocessing
higgs_rec<-
  recipe(V1 ~ ., data = higgs_train) %>%
  # Normalize the numeric predictors
  step_normalize(all_numeric_predictors())
# Define the model: CatBoost, with the parameters to tune; task_type = 'GPU' selects the GPU.
cat_spec <-
boost_tree(mtry=tune(), tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune()) %>%
set_engine('catboost', subsample = tune("subsample"), task_type = 'GPU') %>% #
set_mode('classification')
# Define the workflow
cat_wflow <-
workflow() %>%
add_model(cat_spec) %>%
add_recipe(higgs_rec)
# Set the bounds of all parameters.
cat_param <- cat_wflow %>%
  extract_parameter_set_dials() %>%
  update(learn_rate = threshold(c(0.01,0.5))) %>%
  update(trees = trees(c(500,1000))) %>%
  update(tree_depth = tree_depth(c(5,15))) %>%
  update(mtry = mtry(c(3,6))) %>%
  update(subsample = threshold(c(0.5,1)))
# Check that all parameter bounds are now set
cat_param
# Inspect the range of each parameter
cat_param %>% extract_parameter_dials("trees")
cat_param %>% extract_parameter_dials("min_n")
cat_param %>% extract_parameter_dials("tree_depth")
cat_param %>% extract_parameter_dials("learn_rate")
cat_param %>% extract_parameter_dials("mtry")
cat_param %>% extract_parameter_dials("subsample")
# On a large dataset multi-fold cross-validation takes too long, so validate on a single bootstrap resample to speed up training.
#higgs_folds <- vfold_cv(higgs_train, v = 5)
higgs_folds <- bootstraps(higgs_train, times = 1)
gc()
# Run Bayesian optimization
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
cat_res_bo <-
cat_wflow %>%
tune_bayes(
resamples = higgs_folds,
# metrics = metric_set(recall, precision, f_meas, accuracy, kap,roc_auc, sens, spec)
metrics = metric_set(accuracy, roc_auc, precision,),
initial = 10,
param_info = cat_param,
iter = 100,
control = ctrl,
# tune_bayes() was patched locally to add a force_gc argument (not in the stock tune package),
# which optionally forces garbage collection before each training run in the loop.
force_gc = TRUE
)
t2<-proc.time()
cat(t2-t1)
# 9435.55 269.2 9085.21 NA NA
# Plot the progress of the Bayesian optimization
autoplot(cat_res_bo, type = "performance", metric="roc_auc")
# Show the best models by each metric
show_best(cat_res_bo, metric="precision")
show_best(cat_res_bo, metric="accuracy")
show_best(cat_res_bo, metric="roc_auc")
# Select the model with the best roc_auc
select_best(cat_res_bo, metric="roc_auc")
# Store the best tuning result
cat_param_best<- select_best(cat_res_bo, metric="roc_auc")
# Plug the best parameters back into the workflow
cat_wflow_bo <-
  cat_wflow %>%
  finalize_workflow(cat_param_best)
cat_wflow_bo
# Train the model on the full training set with the best parameters
t1<-proc.time()
# Free memory first; otherwise training may fail because memory cannot be allocated.
# With garbage collection added to the Bayesian optimization function above, such failures should be avoidable.
gc()
cat_fit_bo<- cat_wflow_bo %>% fit(higgs_train)
t2<-proc.time()
cat(t2-t1)
# 647.2 183.99 507.11 NA NA
# Test set
# Predictions
# https://parsnip.tidymodels.org/reference/predict.model_fit.html
# https://yardstick.tidymodels.org/reference/roc_auc.html
t1<-proc.time()
higgs_test_bo <- predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "prob") %>%
bind_cols(predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "class"))
t2<-proc.time()
cat(t2-t1)
# 67.8 0.39 5.83 NA NA
# Attach the true values
higgs_test_bo <- bind_cols(higgs_test_bo, higgs_test %>% select(V1))
higgs_metrics <- metric_set(precision, accuracy)
higgs_metrics(higgs_test_bo, truth = V1, estimate = .pred_class)
roc_auc(
higgs_test_bo,
truth = V1,
estimate=.pred_0,
options = list(smooth = TRUE)
)
> show_best(cat_res_bo, metric="precision")
# A tibble: 5 x 13
mtry trees min_n tree_depth learn_rate subsample .metric .estimator mean n std_err .config .iter
<int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
1 4 993 25 15 0.198 0.602 precision binary 0.703 1 NA Iter9 9
2 5 918 14 15 0.294 0.627 precision binary 0.702 1 NA Iter1 1
3 5 931 5 15 0.175 0.703 precision binary 0.701 1 NA Iter12 12
4 4 981 18 15 0.153 0.522 precision binary 0.701 1 NA Iter10 10
5 3 988 21 15 0.153 0.945 precision binary 0.701 1 NA Iter4 4
> show_best(cat_res_bo, metric="accuracy")
# A tibble: 5 x 13
mtry trees min_n tree_depth learn_rate subsample .metric .estimator mean n std_err .config .iter
<int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
1 4 997 16 15 0.118 0.685 accuracy binary 0.754 1 NA Iter18 18
2 3 985 36 15 0.124 0.583 accuracy binary 0.753 1 NA Iter17 17
3 5 966 8 15 0.128 0.869 accuracy binary 0.753 1 NA Iter11 11
4 5 989 3 15 0.117 0.816 accuracy binary 0.753 1 NA Iter15 15
5 4 981 18 15 0.153 0.522 accuracy binary 0.753 1 NA Iter10 10
> show_best(cat_res_bo, metric="roc_auc")
# A tibble: 5 x 13
mtry trees min_n tree_depth learn_rate subsample .metric .estimator mean n std_err .config .iter
<int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
1 6 986 24 14 0.117 0.558 roc_auc binary 0.849 1 NA Iter16 16
2 5 957 11 15 0.109 0.839 roc_auc binary 0.849 1 NA Iter14 14
3 4 997 16 15 0.118 0.685 roc_auc binary 0.849 1 NA Iter18 18
4 5 989 3 15 0.117 0.816 roc_auc binary 0.849 1 NA Iter15 15
5 3 985 36 15 0.124 0.583 roc_auc binary 0.848 1 NA Iter17 17
> select_best(cat_res_bo, metric="roc_auc")
# A tibble: 1 x 7
mtry trees min_n tree_depth learn_rate subsample .config
<int> <int> <int> <int> <dbl> <dbl> <chr>
1 6 986 24 14 0.117 0.558 Iter16
> higgs_metrics(higgs_test_bo, truth = V1, estimate = .pred_class)
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.698
2 accuracy binary 0.755
> roc_auc(
+ higgs_test_bo,
+ truth = V1,
+ estimate=.pred_0,
+ options = list(smooth = TRUE)
+ )
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.854
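At this point neither the CPU nor the GPU is heavily loaded (under 30%), so the hardware is not yet fully used.
Note that this is multi-process parallelism: the processes do not share data and communicate with the parent process through the doParallel package, so there are as many copies of the data as there are processes, which is why the memory demand is high. CatBoost in CPU mode, used below, supports multithreading (within one parent process); threads can share a single copy of the data, so the memory overhead is small. However, later tests show that the nthread parameter has no effect in CatBoost GPU mode, which apparently does not support multithreading, so threads cannot be added on top of multiple processes for further speedup. The only way to make full use of the GPU is therefore to add memory.
Next, compare training and prediction times on the GPU and the CPU. CatBoost's fit() supports multithreaded parallelism, which the treesnip wrapper exposes through the nthread parameter. Compared at the same thread count, the GPU is without doubt much faster (and it is in practice), because there is an extra GPU helping with the computation. My laptop has 8 physical cores and 16 logical cores; the CPU is maxed out at 12 threads. Testing shows that nthread has no effect on the GPU, and unlike Bayesian optimization, fit() has no option to start multiple processes; it is single-process. See the references.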
library(tidymodels)
library(kableExtra)
library(tidyr)
# This version of treesnip supports classification.
# remotes::install_github("Glemhel/treesnip", INSTALL_opts = c("--no-multiarch"))
# library(catsnip)
library(treesnip)
library(data.table)
# For GPU training on a large dataset, test minimal 2-way parallelism.
# All operating systems: register the parallel backend and load the required packages on every worker.
# # https://github.com/tidymodels/tune/issues/157
# https://curso-r.github.io/treesnip/articles/parallel-processing.html
library(doParallel)
# cl <- makePSOCKcluster(parallel::detectCores())
cl <- makePSOCKcluster(12) # CPU fit
# cl <- makePSOCKcluster(2) # GPU fit
registerDoParallel(cl)
# Show the number of workers
foreach::getDoParWorkers()
# Load the required packages on every parallel worker
clusterEvalQ(cl,
{library(tidymodels)
library(treesnip)
library(catboost)
})
# Prefer the tidymodels functions when names conflict.
tidymodels_prefer()
# ----------------------------------------------------------------------------------------
# Load the preprocessed data
t1<-proc.time()
higgs<- fread("D:/temp/data/HIGGS/HIGGS.csv", header=FALSE, encoding="UTF-8")
higgs$V1<-as.factor(higgs$V1)
t2<-proc.time()
cat(t2-t1)
# 17.41 16.25 34.02 NA NA
names(higgs)
# Split into training and test sets
t1<-proc.time()
set.seed(2023)
higgs_split <- initial_split(higgs, prop = 0.90)
higgs_train <- training(higgs_split)
higgs_test <- testing(higgs_split)
t2<-proc.time()
cat(t2-t1)
# 6.63 0.55 7.2 NA NA
# Compare CPU and GPU performance with one good parameter set ----------------------------
# https://curso-r.github.io/treesnip/articles/parallel-processing.html
# Define the recipe: model formula and preprocessing
higgs_rec<-
  recipe(V1 ~ ., data = higgs_train) %>%
  # Normalize the numeric predictors
  step_normalize(all_numeric_predictors())
# Define the model: CatBoost, with the parameters to tune
cat_spec <-
boost_tree(mtry=tune(), tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune()) %>%
#set_engine('catboost', subsample = tune("subsample"), task_type = 'GPU', nthread = 2) %>% # GPU
set_engine('catboost', subsample = tune("subsample"), task_type = 'CPU', nthread = 12) %>% # CPU
set_mode('classification')
# Define the workflow
cat_wflow <-
workflow() %>%
add_model(cat_spec) %>%
add_recipe(higgs_rec)
# Build the best parameter set by hand
cat_param_best<-
tibble(
mtry = 6,
trees = 986,
min_n = 24,
tree_depth = 14,
learn_rate = 0.117 ,
subsample = 0.558
)
# Plug the best parameters back into the workflow
cat_wflow_bo <-
  cat_wflow %>%
  finalize_workflow(cat_param_best)
# Train the model on the full training set with the best parameters
t1<-proc.time()
# fit() does not run in parallel; both runs compare a single process.
cat_fit_bo<- cat_wflow_bo %>% fit(higgs_train)
t2<-proc.time()
cat(t2-t1)
# GPU, single thread: 650.52 183.77 511.34 NA NA
# CPU, 12 threads: 65252.86 2728.51 6305.28 NA NA
# CPU, 12 threads: 15980.84 672.2 1944.65 NA NA
# Generate test-set predictions and performance metrics
t1<-proc.time()
higgs_test_bo <- predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "prob")
t2<-proc.time()
cat(t2-t1)
#GPU 36.42 0.07 2.69 NA NA
#CPU 40.11 0.9 3.35 NA NA
higgs_test_bo <- bind_cols(higgs_test_bo, higgs_test %>% select(V1))
roc_auc(
higgs_test_bo,
truth = V1,
estimate=.pred_0,
options = list(smooth = TRUE)
)
#GPU 85.4
#CPU 85.4
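Because this parameter set requires 986 training iterations, training is slow: 6305.28 seconds with 12 CPU threads versus 511.34 seconds on the GPU, about 12 times faster. The GPU's advantage is clearly demonstrated here (and with the hardware only lightly loaded). Prediction times are about the same; the speedup is mainly in training. The larger the dataset and the more iterations, the more pronounced the GPU's advantage.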
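The ratios can be checked directly from the elapsed times recorded in the script comments. The sketch below is plain arithmetic on those logged numbers; the function name speedup is just an illustrative helper.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

# Elapsed wall-clock seconds taken from the proc.time() output in the script comments.
train_seconds = {"cpu_12_threads": 6305.28, "gpu": 511.34}
predict_seconds = {"cpu": 3.35, "gpu": 2.69}

train_speedup = speedup(train_seconds["cpu_12_threads"], train_seconds["gpu"])
predict_speedup = speedup(predict_seconds["cpu"], predict_seconds["gpu"])
print(round(train_speedup, 2), round(predict_speedup, 2))
```

Training comes out at roughly 12.3 times faster on the GPU, while prediction differs by only about a quarter, which matches the observation that the gain is concentrated in training.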
Reference: "When to Choose CatBoost Over XGBoost or LightGBM [Practical Guide]", which covers the main parameters controlling overfitting and training speed, as well as a comparison of the algorithms.