Realizing the Advantage of GPU Computing Power for Machine Learning Algorithms

The advantage of GPU computing power has already been demonstrated thoroughly in deep learning. For applications in the tax domain, see my articles 《升級(jí)HanLP並使用GPU後端識(shí)別發(fā)票貨物勞務(wù)名稱》, 《HanLP識(shí)別發(fā)票貨物勞務(wù)名稱之三 GPU加速》 and 《外一篇:深度學(xué)習(xí)之VGG16模型雪豹識(shí)別》. HanLP uses the TensorFlow and PyTorch deep learning frameworks; interested vendors can also try their own frameworks.
Those articles all run on Python. In R, TensorFlow and Keras have corresponding interface packages (the backend still runs on Python), see 《R語(yǔ)言深度學(xué)習(xí)》 (Deep Learning with R). More recently, R-native deep learning frameworks have appeared as well, such as Torch for R and the native Apache MXNet; I have not run the latter two yet and may try them when time permits.
For installing and using GPUs on Linux, see my series of articles on Jianshu.
In traditional machine learning applications, which are mainly classification and regression, some algorithm implementations also try to use GPU computing power to improve performance. In my earlier articles 《墨爾本房?jī)r(jià)回歸模型(Python)》 and 《用Tidy Models實(shí)現(xiàn)墨爾本房?jī)r(jià)回歸模型(R)》, the three GBDT (gradient boosted decision tree) implementations widely recognized on Kaggle as world-class — XGBoost, LightGBM and CatBoost — all support running on the GPU. This raises both a question and an opportunity: to explore whether, and under what conditions, the advantage of GPU computing power can be realized in this area. It is a question of real practical value: whether the many GPUs on cloud platforms, PCs and laptops can be fully utilized is an important criterion when choosing an algorithm implementation and a technical route. Others have already made it work, and there are plenty of examples online showing the GPU advantage for classification and regression on large datasets, so the possibility is certain; the problem is finding the conditions under which it can be realized in one's own application scenario, which requires actual measurement.
In the Melbourne housing price regression example, actual measurements show that both the Python and the R implementations are slower on the GPU (Nvidia GeForce RTX 2060 Max-Q, 1920 CUDA cores) than on the CPU (Intel Core i7, 8 cores / 16 logical cores). The cause needs a closer look: are the GPU parameters set incorrectly, is it a property of the dataset itself, or is this simply the limit of the hardware? Answering that clarifies the conditions for realizing the advantage of GPU computing power in real applications.
一谭溉、R語(yǔ)言測(cè)試
I have recently been writing introductory articles on Tidy Models, so let me first describe the situation in R and then in Python; the conclusion is the same.
1、The XGBoost algorithm.
The open-source XGBoost framework is developed mainly by the University of Washington. The default CRAN build of XGBoost does not support GPUs; you need to install the release from its GitHub page, which provides pre-compiled Windows and Linux builds (currently version 1.7.3.1) that you simply download and install. At run time, just add the parameter tree_method="gpu_hist".

set_engine('xgboost', tree_method="gpu_hist")
# -----------------------------------------------------------------------------------------
library(tidymodels)
library(kableExtra)
library(tidyr)
# All operating systems: register parallel processing
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores())
registerDoParallel(cl)
# Prefer tidymodels functions when names clash.
tidymodels_prefer()

# Outlier threshold: 30
threshold<- 30

# ----------------------------------------------------------------------------------------
# Load the preprocessed data
melbourne<- read.csv("D:/temp/data/Melbourne_housing/Melbourne_housing_pre.csv")
# Filter missing values
# Error: Missing data in columns: BuildingArea.
# 47 obs.
missing <- filter(melbourne, BuildingArea==0)
melbourne <- filter(melbourne, BuildingArea!=0)

# Split into training and test sets
set.seed(2023)
melbourne_split <- initial_split(melbourne, prop = 0.80)
melbourne_train <- training(melbourne_split)
melbourne_test  <-  testing(melbourne_split)

# ----------------------------------------------------------------------------------------------------
# Bayesian optimization
# Recipe parameters, main model parameters and engine-specific parameters can all be tuned.

# Define the recipe: regression formula and preprocessing
melbourne_rec<-
  recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
         + Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
  # Normalize the numeric variables
  step_normalize(all_numeric_predictors())

# Define the model: XGB; declare the parameters to tune; tree_method="gpu_hist" uses the GPU.
xgb_spec <-
  boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune(), loss_reduction = tune(), sample_size = tune(), stop_iter = tune()) %>%
  set_engine('xgboost', tree_method="gpu_hist") %>%
  set_mode('regression')

# Define the workflow
xgb_wflow <- 
  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(melbourne_rec)

# The boundaries of all parameters are already determined.
xgb_param <- xgb_wflow %>%
  extract_parameter_set_dials() %>%
  update(learn_rate = threshold(c(0.01,0.5))) %>%
  update(trees = trees(c(500,1000))) %>%
  update(tree_depth = tree_depth(c(5,15))) %>%
  update(sample_size = threshold(c(0.5,1))) %>%
  finalize(melbourne_train)

xgb_param
# Inspect the parameter boundaries; all are determined
xgb_param %>% extract_parameter_dials("trees")
xgb_param %>% extract_parameter_dials("min_n")
xgb_param %>% extract_parameter_dials("tree_depth")
xgb_param %>% extract_parameter_dials("learn_rate")
xgb_param %>% extract_parameter_dials("loss_reduction")
xgb_param %>% extract_parameter_dials("sample_size")
xgb_param %>% extract_parameter_dials("stop_iter")

melbourne_folds <- vfold_cv(melbourne, v = 5)

# Run the Bayesian optimization
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
xgb_res_bo <-
  xgb_wflow %>%
  tune_bayes(
    resamples = melbourne_folds,
    metrics = metric_set(rsq, rmse, mae),
    initial = 10,
    param_info = xgb_param,
    iter = 100,
    control = ctrl
  )
t2<-proc.time()
XGBoost training the model on the GPU
cat(t2-t1)
# CPU 3014 1.99 3989.22 NA NA
# GPU 5892.78 4.16 8416.28 NA NA

As you can see, the GPU run takes more than twice as long. Watching the traffic during training, the volume of data copied between the GPU and the CPU is considerable.
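
To separate the raw training cost from the tuning overhead, a single fit can also be timed directly with the native xgboost R API. Below is a minimal sketch, assuming the melbourne_train data frame and the predictor columns from the recipe above are available; the time_fit helper is illustrative only and not part of the original test.

library(xgboost)

# Predictors from the recipe above (assumed numeric columns in melbourne_train).
predictors <- c("Year", "YearBuilt", "Distance", "Lattitude", "Longtitude",
                "BuildingArea", "Rooms", "Bathroom", "Car",
                "Type_h", "Type_t", "Type_u")
dtrain <- xgb.DMatrix(data = as.matrix(melbourne_train[, predictors]),
                      label = melbourne_train$LogPrice)

# Time one fit with identical parameters on the CPU ("hist") and the GPU ("gpu_hist").
time_fit <- function(method) {
  t0 <- proc.time()
  xgb.train(params = list(objective = "reg:squarederror",
                          tree_method = method,
                          max_depth = 8, eta = 0.1),
            data = dtrain, nrounds = 500, verbose = 0)
  (proc.time() - t0)[["elapsed"]]
}

time_fit("hist")      # CPU
time_fit("gpu_hist")  # GPU

If the gap persists here too, the overhead lies in the algorithm's GPU code path itself (including host-device copies) rather than in the tidymodels tuning loop.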

2赊舶、LightGBM算法。
LightGBM is developed by Microsoft. The default CRAN install likewise lacks GPU support; the GPU build must be compiled from source, see 《Installation Guide: Build GPU Version》 and the LightGBM R-package GitHub page for details. After compiling the GPU build of LightGBM, run build_r.R in the project root to package and install the GPU-enabled lightgbm R package. The GPU build defaults to the OpenCL API, which Nvidia also supports; to compile the CUDA-specific API, see 《Installation Guide: Build CUDA Version》. When testing on Python I used version 3.2.1.99 compiled with CMake + VS Build Tools. Looking at the release notes, the latest version is currently 3.3.4, which mainly adds R 4.2 compatibility; versions after 3.2.1 contain no major updates to GPU support, so I keep testing with 3.2.1.99 and will upgrade later if needed. That documentation provides a simple test example of LightGBM's native R API; on a small dataset the CPU is clearly faster than the GPU.

Rscript build_r.R --use-gpu

See the LightGBM parameter documentation: under the OpenCL API two parameters identify the GPU vendor and device, gpu_platform_id and gpu_device_id. They can be inspected with the tool GPU Caps Viewer, as shown in the figure below, but numbering in LightGBM starts from 0, so subtract 1 when referencing them. For example, my laptop has an integrated Intel GPU listed as platform 1 and the Nvidia GPU as platform 2; when referenced in the R program, gpu_platform_id is therefore 2-1=1 and gpu_device_id is 1-1=0.

Viewing the list of GPUs in the system
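
The same device selection can also be exercised through LightGBM's native R API outside tidymodels, which makes the off-by-one numbering easy to verify. A minimal sketch, assuming the GPU-enabled lightgbm build described above and the melbourne_train data frame from the previous section:

library(lightgbm)

predictors <- c("Year", "YearBuilt", "Distance", "Lattitude", "Longtitude",
                "BuildingArea", "Rooms", "Bathroom", "Car",
                "Type_h", "Type_t", "Type_u")
dtrain <- lgb.Dataset(data = as.matrix(melbourne_train[, predictors]),
                      label = melbourne_train$LogPrice)

# GPU Caps Viewer lists the Nvidia card as platform 2 / device 1,
# so after subtracting 1 the ids passed to LightGBM are 1 and 0.
params <- list(objective = "regression",
               device = "gpu",
               gpu_platform_id = 1,
               gpu_device_id = 0,
               learning_rate = 0.1)
bst <- lgb.train(params = params, data = dtrain, nrounds = 500, verbose = -1)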

The identical code for loading the data and so on is not repeated; specifying device="gpu" and the related parameters enables the GPU.

set_engine('lightgbm', device="gpu", gpu_platform_id=1, gpu_device_id = 0) 
# Provide parsnip interface support for LightGBM
library(bonsai)

# Bayesian optimization
# Recipe parameters, main model parameters and engine-specific parameters can all be tuned.

# Define the recipe: regression formula and preprocessing
melbourne_rec<-
  recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
         + Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
  # Normalize the numeric variables
  step_normalize(all_numeric_predictors())

# Define the model: LightGBM; declare the parameters to tune
lgbm_spec <-
  boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune(),
             loss_reduction = tune(), sample_size = tune(), mtry=tune()) %>%
  # set_engine('lightgbm') %>%
  # There is an integrated Intel GPU with gpu_platform_id=0, gpu_device_id=0; the discrete Nvidia GPU has gpu_platform_id=1
  set_engine('lightgbm', device="gpu", gpu_platform_id=1, gpu_device_id = 0) %>%
  set_mode('regression')

# Define the workflow
lgbm_wflow <- 
  workflow() %>% 
  add_model(lgbm_spec) %>% 
  add_recipe(melbourne_rec)

# The boundary of the mtry parameter is not fully determined; finalize() determines it.
lgbm_param <- lgbm_wflow %>%
  extract_parameter_set_dials() %>%
  update(learn_rate = threshold(c(0.01,0.5))) %>%
  update(trees = trees(c(500,1000))) %>%
  update(tree_depth = tree_depth(c(5,15))) %>%
  update(mtry = mtry(c(3,6))) %>%
  update(sample_size = threshold(c(0.5,1))) %>%
  finalize(melbourne_train)

# Inspect the parameter boundaries; all are determined
lgbm_param %>% extract_parameter_dials("trees")
lgbm_param %>% extract_parameter_dials("min_n")
lgbm_param %>% extract_parameter_dials("tree_depth")
lgbm_param %>% extract_parameter_dials("learn_rate")
lgbm_param %>% extract_parameter_dials("loss_reduction")
lgbm_param %>% extract_parameter_dials("sample_size")
lgbm_param %>% extract_parameter_dials("mtry")

melbourne_folds <- vfold_cv(melbourne, v = 5)

# Run the Bayesian optimization
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
lgbm_res_bo <-
  lgbm_wflow %>%
  tune_bayes(
    resamples = melbourne_folds,
    metrics = metric_set(rsq, rmse, mae),
    initial = 10,
    param_info = lgbm_param,
    iter = 100,
    control = ctrl
  )
t2<-proc.time()
cat(t2-t1)
#CPU 4760.83 2.64 5503.5 NA NA
#GPU 5834.04 5.57 8285.5 NA NA

As you can see, here too the CPU is nearly twice as fast as the GPU.


LightGBM training on the GPU

3百姓、CatBoost算法渊额。
CatBoost is an open-source GBDT framework developed by the Russian search engine Yandex. Its pre-compiled builds for every operating system all support the GPU and can be downloaded from the latest-release section of the project page; the current latest version is 1.1.1. To use the GPU, just add the parameter task_type = 'GPU'; see the parameter documentation.

# Provide parsnip interface support for catboost
library(treesnip)

# Define the recipe: regression formula and preprocessing
# 'Year','YearBuilt','Distance','Lattitude','Longtitude','Propertycount',
# 'Landsize','BuildingArea', 'Rooms','Bathroom', 'Car','Type_h','Type_t','Type_u'
melbourne_rec<-
  recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
         + Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
  #step_log(BuildingArea, base = 10) %>%
  # Normalize the numeric variables
  step_normalize(all_numeric_predictors())

# Define the model: CatBoost
cat_model<-
  boost_tree(trees = 1000, learn_rate=0.05) %>%
  set_engine("catboost", 
             loss_function = "RMSE", 
             eval_metric='RMSE',
             task_type = 'GPU'           # CatBoost on the GPU runs less efficiently than the CPU here; possibly the dataset is not large enough.
  )  %>%
  set_mode("regression")

# Define the workflow
cat_wflow <- 
  workflow() %>% 
  add_model(cat_model) %>% 
  add_recipe(melbourne_rec)

# Train the model
t1<-proc.time()
cat_fit <- fit(cat_wflow, melbourne_train)
t2<-proc.time()
cat(t2-t1)
# CPU  2.42 0.07 2.68 NA NA
# GPU  12.77 3.44 12.78 NA NA

CatBoost is roughly five times slower on the GPU, so I have not yet run the 100-iteration Bayesian optimization; during 5-fold cross-validation, however, both its CPU and GPU were nearly maxed out.


CatBoost training on the GPU
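
To check whether the slowdown comes from the tidymodels/treesnip wrapper or from CatBoost itself, the native catboost R API can be timed directly as well. A minimal sketch, again assuming the melbourne_train data frame and the predictor columns from the recipe; the time_fit helper is illustrative only:

library(catboost)

predictors <- c("Year", "YearBuilt", "Distance", "Lattitude", "Longtitude",
                "BuildingArea", "Rooms", "Bathroom", "Car",
                "Type_h", "Type_t", "Type_u")
pool <- catboost.load_pool(data = melbourne_train[, predictors],
                           label = melbourne_train$LogPrice)

# Time one training run per device with otherwise identical parameters.
time_fit <- function(device) {
  t0 <- proc.time()
  catboost.train(pool, params = list(loss_function = "RMSE",
                                     iterations = 1000,
                                     learning_rate = 0.05,
                                     task_type = device,
                                     logging_level = "Silent"))
  (proc.time() - t0)[["elapsed"]]
}

time_fit("CPU")
time_fit("GPU")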

二、Tests in Python
1按灶、XGBoost算法症革。
The Python build of XGBoost is a pre-compiled binary that already supports the GPU; just install it with pip.

pip install xgboost

For the Bayesian optimization, training on the GPU again only requires adding the parameter tree_method='gpu_hist'. The Bayesian optimization implementation in Python may differ from the one in R: its Gaussian process is very fast and probably evaluates only one set of candidate parameters per iteration (R generates several thousand), so Python needs 1000 iterations while R iterates only 100. Switching between CPU and GPU mode only requires updating the Bayesian optimization cost function f_xgb().
The part common to all algorithms: loading packages and data.

# Load common packages
# Ignore Warnings 
import warnings
warnings.filterwarnings('ignore')

# Basic Imports 
import numpy as np
import pandas as pd
import time

# Preprocessing
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Metrics 
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Model Tuning 
from hyperopt import fmin, tpe, hp, Trials
from hyperopt.fmin import generate_trials_to_calculate
    
# Load the data, split into training and test sets, standardize
# 9015
df_NN = pd.read_csv("D:/temp/data/Melbourne_housing/Melbourne_housing_pre.csv",  encoding="utf-8")

X=df_NN[['Year','YearBuilt','Distance','Lattitude','Longtitude','Propertycount',
          'Landsize','BuildingArea', 'Rooms','Bathroom', 'Car','Type_h','Type_t','Type_u']]
y=df_NN['LogPrice']
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size = .20, random_state=42)

train_X2 = train_X.copy()
valid_X2 = valid_X.copy()

# Data standardization
mean = train_X.mean(axis=0)
train_X -= mean
std = train_X.std(axis=0)
train_X /= std
valid_X -= mean
valid_X /= std

XGBoost:

# ML Models
from xgboost import XGBRegressor 

# Define the parameter search space; narrowing the value ranges makes the search much faster
space_xgb = {
    'max_bin': hp.choice('max_bin', range(8, 128)),                  # CPU 50-501 GPU 8-128
    'max_depth': hp.choice('max_depth', range(3, 11)),    
    'n_estimators': hp.choice('n_estimators', range(100, 1001)),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),    
    'subsample': hp.uniform('subsample', 0.5, 0.99),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.99),
    'reg_alpha': hp.uniform('reg_alpha', 0, 5),                       # lambda_l1
    'reg_lambda': hp.uniform('reg_lambda', 0, 3),                     # lambda_l2
    'gamma': hp.uniform('gamma',0.0, 10),                             # min_split_loss, min_split_gain
    'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
}

# Define the cost function
def f_xgb(params):
    # Set extra_trees=True to avoid overfitting
     # CPU 4.96s/trial
    # xgb = XGBRegressor(objective ='reg:squarederror', seed = 0,verbosity=0, **params) 
    # GPU 8.68s/trial
    xgb = XGBRegressor(tree_method='gpu_hist', objective ='reg:squarederror', seed = 0,verbosity=0,**params)   
    #xgb_model = xgb.fit(train_X, train_y)
    #acc = xgb_model.score(valid_X,valid_y)    
    # acc = cross_val_score(xgb, train_X, train_y).mean()              # CPU
    acc = cross_val_score(xgb, train_X, train_y, n_jobs=6).mean()  # GPU
    return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{
                                        'max_bin':4,                                 # default 256
                                        'max_depth':5,                               # default 6
                                        'n_estimators':578,                          # default 100
                                        'learning_rate':0.05508679239402551,         # default 0.3
                                        'subsample':0.8429852720715357,              # default 1.0
                                        'colsample_bytree':0.8413894273173292,       # default 1.0
                                        'reg_alpha': 0.809791155072757,              # default 0.0
                                        'reg_lambda':1.4490119256389808,             # default 1.0
                                        'gamma':0.008478702584417519,                # default 0.0                                        
                                        'min_child_weight':24.524635200338793,       # default 1
                                        }])

t1 = time.time()  
# GPU: 1000trial [2:24:41,  8.68s/trial, best loss: -0.9080128034320879]                           
best_params = fmin(f_xgb, space_xgb, algo=tpe.suggest, max_evals=999, trials=trials)
t2 = time.time()
# 8681.310757875443
print("Time elapsed: ", t2-t1)

print('best:')
print(best_params)

The Python build of XGBoost is also nearly twice as slow training on the GPU as on the CPU.


Python XGBoost training on the GPU

2巧娱、LightGBM算法。
For installing the Python build, see the LightGBM python-package page; I upgraded to the latest 3.3.4. This install option uses the default OpenCL API, which is what is tested here.

pip install lightgbm --install-option=--gpu

To install the CUDA-specific build on Windows, the Visual Studio build environment must be set up first, because pip calls it to compile.

pip install lightgbm --install-option=--cuda

For the Bayesian optimization, add a few parameters when calling lightgbm: device='gpu', gpu_platform_id=1, gpu_device_id=0. Note that the valid range of max_bin differs between CPU and GPU; on the GPU an incorrect setting causes an index out of range error. The common code is not repeated.

# ML Models
from lightgbm import LGBMRegressor 
    
# --------------------------------------------------------------------------------------------------
# Auto search for better hyper parameters with hyperopt, only need to give a range
# Reference: https://www.pythonf.cn/read/6998
#            https://lightgbm.readthedocs.io/en/latest/Parameters.html
#            https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#deal-with-over-fitting
#            https://lightgbm.readthedocs.io/en/latest/GPU-Performance.html
# Dealing with overfitting

#     Use a smaller number of histogram bins: max_bin
#     Use a smaller number of leaves: num_leaves
#     Use min_child_samples (min_data_in_leaf) and min_child_weight (= min_sum_hessian_in_leaf)
#     Use bagging via subsample (bagging_fraction) and subsample_freq (= bagging_freq)
#     Use feature sub-sampling via colsample_bytree (feature_fraction)
#     Use more training data
#     Use regularization via reg_alpha (lambda_l1), reg_lambda (lambda_l2) and min_split_gain (min_gain_to_split)
#     Try max_depth to avoid growing overly deep trees
#     Try extra_trees
#     Try increasing path_smooth

# trials = generate_trials_to_calculate([{'max_bin':63-8,               # default CPU 255 GPU 63
#                                         'max_depth':5-3,              # default -1
#                                         'num_leaves':31-20,           # default 31
#                                         'min_child_samples':20-10,    # default 20
#                                         'subsample_freq':1-1,         # default 1
#                                         'n_estimators':6000-1000,     # default 10
#                                         'learning_rate':0.01,         # default 0.1
#                                         'subsample':0.75,             # default 1.0
#                                         'colsample_bytree':0.8,       # default 1.0
#                                         'lambda_l1':0.0,              # default 0.0
#                                         'lambda_l2':0.0,              # default 0.0
#                                         'min_child_weight':0.001,     # default 0.001
#                                         'min_split_gain':0.0,         # default 0.0
#                                         #'path_smooth':0.0            # default 0.0
#                                         }])
# Narrowing the parameter value ranges makes the search much faster
space_lgbm = {
    'max_bin': hp.choice('max_bin', range(8, 128)),                  # CPU 50-501 GPU 8-128
    'max_depth': hp.choice('max_depth', range(3, 31)),    
    'num_leaves': hp.choice('num_leaves', range(10, 256)),
    'min_child_samples': hp.choice('min_child_samples', range(10, 51)), 
    'subsample_freq': hp.choice('subsample_freq', range(1, 6)),      
    'n_estimators': hp.choice('n_estimators', range(500, 6001)),
    'learning_rate': hp.uniform('learning_rate', 0.005, 0.15),    
    'subsample': hp.uniform('subsample', 0.5, 0.99),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.99),
    'reg_alpha': hp.uniform('reg_alpha', 0, 5),                       # lambda_l1
    'reg_lambda': hp.uniform('reg_lambda', 0, 3),                     # lambda_l2
    'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
    'min_split_gain': hp.uniform('min_split_gain',0.0, 1),
    #'path_smooth': hp.uniform('path_smooth',0.0, 3)
}
def f_lgbm(params):
    # Set extra_trees=True to avoid overfitting
    # lgbm = LGBMRegressor(seed=0,verbose=-1, **params)                 # CPU 4.96s/trial
    lgbm = LGBMRegressor(device='gpu', gpu_platform_id=1, gpu_device_id = 0, num_threads =3, **params)    # GPU 65.93s/trial
    #lgb_model = lgbm.fit(train_X, train_y)
    #acc = lgb_model.score(valid_X,valid_y)
    # acc = cross_val_score(lgbm, train_X, train_y).mean()             # CPU
    acc = cross_val_score(lgbm, train_X, train_y, n_jobs=6).mean()  # GPU
    return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'max_bin':63,                # default CPU 255 GPU 63
                                        'max_depth':17,                             # default -1
                                        'num_leaves':12,                            # default 31
                                        'min_child_samples':14,                     # default 20
                                        'subsample_freq':0,                         # default 1
                                        'n_estimators':2647,                        # default 10
                                        'learning_rate':0.0203187560767722,         # default 0.1
                                        'subsample':0.788703175392162,              # default 1.0
                                        'colsample_bytree':0.5203150334508861,      # default 1.0
                                        'reg_alpha': 0.988139501870491,             # default 0.0
                                        'reg_lambda':2.789779486137205,             # default 0.0
                                        'min_child_weight':21.813225361674828,      # default 0.001
                                        'min_split_gain':0.00039636685518264865,    # default 0.0
                                        #'path_smooth':0.0                          # default 0.0
                                        }])

t1 = time.time()  
# 1000trial [5:23:09, 19.39s/trial, best loss: -0.9082183160929432] CPU  
# 1000trial [1:22:39,  4.96s/trial, best loss: -0.9079837941918502] CPU 
# 1000trial [1:02:28,  3.75s/trial, best loss: -0.9080477825539048] CPU
best_params = fmin(f_lgbm, space_lgbm, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
print(best_params)
Python LightGBM training on the GPU

With 100 iterations it runs at 15.09 s/trial: the CPU is maxed out, the Nvidia GPU is above half load, memory consumption is low on this small dataset, and there is some data-copy traffic between GPU and CPU, which should be normal. Again the CPU build, at 3.28 s/trial, is more than four times faster.

3、The CatBoost algorithm.
The pip install supports the GPU by default:

pip install catboost

To use the GPU with CatBoost you only need to specify task_type='GPU'. However, when tuning with Bayesian optimization, a few parameters exist only in the CPU build and are not supported on the GPU: random_strength, subsample and rsm. Also note that the valid range of border_count differs between the GPU and CPU builds. The common code is again not repeated; see above.

# ML Models
from catboost import CatBoostRegressor

# Auto search for better hyper parameters with hyperopt, only need to give a range
# Reference: https://github.com/talperetz/hyperspace/tree/master/GBDTs
#            https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list
#            https://catboost.ai/docs/concepts/parameter-tuning.html
#            https://affine.ai/catboost-a-new-game-of-machine-learning/
'''
https://catboost.ai/en/docs/concepts/speed-up-training
Speeding up the training
1. iterations, worked
2. learning_rate, worked
2. boosting_type, Ordered, Plain,  not worked
3. bootstrap_type, Bayesian, Bernoulli, MVS, Poisson, not worked
4. subsample, not worked
   This parameter can be used if one of the following bootstrap types is selected:
    Poisson   Bernoulli    MVS
5. one_hot_max_size, One-hot encoding
6. rsm, colsample_bylevel, Random subspace method
7.leaf_estimation_iterations, worked, set to 1.
   Try setting the value to "1" or "5" to speed up the training on datasets with a small number of features.
8. max_ctr_complexity, worked, 0 or 2 to speed up trainning.
   This parameter can affect the training time only if the dataset contains categorical features.
9. border_count, worked, set to less.
10.Reusing quantized datasets in Python, not applyable to cross_val_score()
11.Golden features. If the dataset has a feature, which is a strong predictor of the result, the
   pre-quantisation of this feature may decrease the information that the model can get from it. 
   It is recommended to use an increased number of borders (1024) for this feature.
   per_float_feature_quantization=['0:border_count=1024', '1:border_count=1024']
'''
# default values
# trials = generate_trials_to_calculate([{'border_count':254-150,       # default CPU 254 GPU 128
#                                         'iterations':1000-500,        # default 1000
#                                         'depth': 6-2,                 # default 6
#                                         'random_strength':1.0,        # default 1.0, CPU only
#                                         'learning_rate': 0.03,        # default 0.03
#                                         'subsample':0.8,              # default 0.8
#                                         'l2_leaf_reg': 3.0,           # default 3
#                                         'rsm':0.8,                    # default 1.0  CPU only
#                                         'fold_len_multiplier':2.0,    # default 2.0
#                                         'bagging_temperature':1.0     # default 1.0
#                                         }])

# Narrowing the parameter value ranges makes the search much faster
space_cat = {'border_count': hp.choice('border_count', range(8, 128)), # CPU 150-351 GPU 8-128
             'iterations': hp.choice('iterations', range(500, 1501)),    
             'depth': hp.choice('depth', range(2, 10)),  
             #'random_strength': hp.uniform('random_strength', 1, 20),           
             'learning_rate': hp.uniform('learning_rate', 0.005, 0.15), 
             # 'subsample': hp.uniform('subsample', 0.5, 1),    
             'l2_leaf_reg': hp.uniform('l2_leaf_reg', 1, 100),
             # 'rsm': hp.uniform('rsm', 0.5, 0.99),                        # colsample_bylevel
             'fold_len_multiplier': hp.uniform('fold_len_multiplier', 1.0, 10.0),
             'bagging_temperature': hp.uniform('bagging_temperature', 0.0, 1.0) }

def f_cat(params):
    # cat = CatBoostRegressor(task_type='CPU', random_seed=0,
    cat = CatBoostRegressor(task_type='GPU', random_seed=0,                         
        # boosting_type='Plain', bootstrap_type = 'Bayesian', max_ctr_complexity=1,
        one_hot_max_size=3, 
        leaf_estimation_iterations=1,
        #per_float_feature_quantization=['3:border_count=1024', '4:border_count=1024'], # Golden features: lat, long
        verbose=False, **params)  # CPU 13.05s/trial
    acc = cross_val_score(cat, train_X, train_y,  n_jobs=3).mean()     
    return -acc

# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'border_count':112,                         # default CPU 254 GPU 128
                                        'iterations':989,                           # default 1000
                                        'depth': 4,                                 # default 6
                                        #'random_strength':6.6489521372262645,       # default 1.0, CPU only
                                        'learning_rate': 0.07811835381238333,       # default 0.03
                                        #'subsample':0.9484820488113903,             # default 0.8
                                        'l2_leaf_reg': 8.070279328038293,           # default 3
                                        #'rsm':0.7188098046587024,                   # default 1.0  CPU only
                                        'fold_len_multiplier': 6.034216410528531,   # default 2.0
                                        'bagging_temperature':0.47787665340753926   # default 1.0
                                        }])

t1 = time.time()  
# 1000trial [50:28,  3.03s/trial, best loss: -0.905859099632395]                         
best_params = fmin(f_cat, space_cat, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)

print('best:')
Python CatBoost training on the GPU

For such a small dataset, having the GPU maxed out, the CPU above half load, memory maxed out and 350 seconds per trial is not normal; some parameters are probably still set incorrectly, or the dataset needs some additional preprocessing.


Python CatBoost training on the CPU

In contrast, the CPU build maxes out the CPU, memory consumption is modest, and 7.46 s/trial is quite decent.

三、The official LightGBM example
The official LightGBM HIGGS classification example claims the GPU should give more than a 3x speedup; on my laptop the performance is roughly on par (the GPU does not run at full load, because GPU builds of GBDT algorithms actually move only part of the computation onto the GPU while much of it still runs on the CPU, so typically the CPU is maxed out and the GPU is not). This suggests a hardware-capability limit: the NVIDIA GeForce RTX 2060 Max-Q in the Lenovo Legion Y9000X is not powerful enough. The dataset has 11 million rows and 28 variables, about 7.5 GB uncompressed, which is not small, so this test should be a useful reference. 9.9 million rows go to the training set and 1.1 million rows (10%) to the validation set. See the reference material for the data download.
The part common to all algorithms: loading the data.

# -*- coding: utf-8 -*-
"""
Created on Thu Sep 23 15:58:22 2021

@author: Jean
"""
'''
This is a classification problem to distinguish between a signal process 
which produces Higgs bosons and a background process which does not.
The first column is the class label (1 for signal, 0 for background), 
followed by the 28 features (21 low-level features then 7 high-level features): 
    lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi, 
    jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag,
    jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag, 
    m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb.  
'''
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

t1 = time.time()
# Load dataset
df = pd.read_csv("D:/temp/data/HIGGS/HIGGS.csv",  encoding="utf-8", header=None)
# Target column changed to int
df.iloc[:,0] = df.iloc[:,0].astype(int)
t2 = time.time()
# 50.2845721244812
print(t2-t1)
df.shape
df.head(2)

t1 = time.time()
X = df.iloc[:,1:]
y = df.iloc[:,0]
t1 = time.time()
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size = .10, random_state=2023)
t2 = time.time()
print(t2-t1)
# 5.838041305541992

Compare CPU and GPU performance at different iteration counts.

# Set objective=regression to change to a regression problem
import lightgbm as lgb
# create dataset for lightgbm
dtrain = lgb.Dataset(train_X, train_y)
dvalid = lgb.Dataset(valid_X, valid_y, reference=dtrain)

t_cpu = []; t_nvidia = []; t_intel=[]
a_cpu = []; a_nvidia = []; a_intel=[]

t3 = time.time()
for num_iterations in [50,100,150,200]:

    # CPU --------------------------------------------------------------------------------
    params = {'objective':'binary',
              'num_iterations':num_iterations,
              'max_bin': 63,
              'num_leaves': 255,
              'learning_rate': 0.1,
              'tree_learner': 'serial',
              'task': 'train',
              'is_training_metric': 'false',
              'min_data_in_leaf': 1,
              'min_sum_hessian_in_leaf': 100,
              'ndcg_eval_at': [1, 3, 5, 10],
              'device': 'cpu'
              }
    
    t0 = time.time()
    gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                    valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
    t1 = time.time()
    t = round(t1-t0,2)
    t_cpu.append(t)
    # 50: 46.00722551345825 100: 138.17840361595154 150: 195.13047289848328
    print('cpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
    # AUC 0.8207000864285819 0.8304736031418638 0.8353184609588433
    auc_score = roc_auc_score(valid_y,y_pred)
    a_cpu.append(round(auc_score,4))
    print(auc_score)
    
    # NVIDIA GeForce RTX 2060 with Max-Q Design ------------------------------------------
    params = {'objective':'binary',
              'num_iterations':num_iterations,          
              'max_bin': 63,
              'num_leaves': 255,
              'learning_rate': 0.1,
              'tree_learner': 'serial',
              'task': 'train',
              'is_training_metric': 'false',
              'min_data_in_leaf': 1,
              'min_sum_hessian_in_leaf': 100,
              'ndcg_eval_at': [1, 3, 5, 10],
              'device': 'gpu',
              'gpu_platform_id': 1,
              'gpu_device_id': 0
    }
    
    t0 = time.time()
    gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                    valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
    t1 = time.time()
    t = round(t1-t0,2)
    t_nvidia.append(t)

    # 50: 54.93808197975159 100: 103.01487278938293  150: 146.14963364601135
    print('gpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
    # AUC 0.8207000821252757 0.8304736011279031 0.8353184579727403
    auc_score = roc_auc_score(valid_y,y_pred)
    a_nvidia.append(round(auc_score,4))
    print(auc_score)
    
    # Intel(R) UHD Graphics ------------------------------------------------------------
    params = {'objective':'binary',
              'num_iterations':num_iterations,          
              'max_bin': 63,
              'num_leaves': 255,
              'learning_rate': 0.1,
              'tree_learner': 'serial',
              'task': 'train',
              'is_training_metric': 'false',
              'min_data_in_leaf': 1,
              'min_sum_hessian_in_leaf': 100,
              'ndcg_eval_at': [1, 3, 5, 10],
              'device': 'gpu'
              }
    
    
    t0 = time.time()
    gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                    valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
    t1 = time.time()
    t = round(t1-t0,2)
    t_intel.append(t)
    
    # 62.83425784111023
    print('gpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
    # AUC 0.8207000820323747
    auc_score = roc_auc_score(valid_y,y_pred)
    a_intel.append(round(auc_score,4))    
    print(auc_score)

t4 = time.time()
print('Total elapse time: {}'.format(t4-t3))

Plotting.

perf_t = pd.DataFrame({"iterations":[50,100,150,200],"cpu":t_cpu,"Nvidia":t_nvidia,"Intel":t_intel})
perf_a = pd.DataFrame({"iterations":[50,100,150,200],"cpu":a_cpu,"Nvidia":a_nvidia,"Intel":a_intel})
perf_a["cpu"] =  perf_a["cpu"]*100
perf_a["Nvidia"] =  perf_a["Nvidia"]*100
perf_a["Intel"] =  perf_a["Intel"]*100

iterations =  [50,100,150,200]

import matplotlib.pyplot as plt 

plt.rcParams["font.sans-serif"]=["SimHei"] #設(shè)置字體
plt.rcParams["axes.unicode_minus"]=False #正常顯示負(fù)號(hào)
fig,ax1 = plt.subplots()
ax2 = ax1.twinx()           # create a twin y-axis sharing the x-axis
ax1.plot(iterations,t_cpu,'b', label="CPU")
ax1.plot(iterations,t_nvidia,'g', label="Nvidia")
ax1.plot(iterations,t_intel,'r', label="Intel")
ax1.legend(loc="upper left")

ax2.plot(iterations,perf_a["cpu"],"b--", label="CPU")
ax2.plot(iterations,perf_a["Nvidia"],"g--", label="Nvidia")
ax2.plot(iterations,perf_a["Intel"] ,"r--", label="Intel")
ax2.legend(loc="lower right") 
ax1.set_xlabel('迭代次數(shù)')    # x-axis label (iteration count)
ax1.set_ylabel('時(shí)間(秒)')   # left y-axis label (time in seconds)
ax2.set_ylabel('AUC(%)')   # right y-axis label
plt.show()

Testing on the CPU, num_iterations=50 takes about 48 seconds. The first run has to load the data and takes longer; use the second run as the reference (note: these screenshots are all from runs that train on the full 11 million rows).

[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.191587 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
cpu version elapse time: 47.76462769508362

Python LightGBM training on the HIGGS dataset on the CPU

Testing on the Nvidia GPU, num_iterations=50 takes about 60 seconds; as the iteration count grows, say beyond 100, the GPU becomes faster than the CPU.

[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] Using requested OpenCL platform 1 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 2060 with Max-Q Design, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.480376 secs. 0 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
gpu version elapse time: 59.27095651626587
Python LightGBM training on the HIGGS dataset on the Nvidia GPU; the GPU load is not high

The Intel GPU integrated in the CPU takes about 70 seconds:

[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] Using GPU Device: Intel(R) UHD Graphics, Vendor: Intel(R) Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.262090 secs. 0 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
gpu version elapse time: 70.56596112251282
Python LightGBM training on the HIGGS dataset on the integrated GPU

LightGBM performance curves

The figure above shows that as num_iterations increases (50, 100, 150, 200), the Nvidia GPU (green) gradually overtakes the CPU (blue); at num_iterations=200 it is already 50% faster, and the solid red line shows that by then even the integrated Intel GPU is faster than the CPU. The dashed lines show that AUC also rises gradually with the iteration count; with identical parameters the accuracy is the same on every device, so the three dashed lines coincide. This shows that the advantage of GPU computing power can be verified in this example: when higher accuracy is needed, more training iterations are required, and that is when the GPU accelerates training. In other words, the dataset must be large and the iteration count high for the advantage to show.

四奏甫、XGBoost上跑HIGGS數(shù)據(jù)集分類(lèi)
The summary thread discussing XGBoost parameters for the Kaggle HIGGS competition does not report a good AUC. I train the model from scratch with Bayesian optimization starting from the default values; see the XGBoost parameter documentation. The data-loading code is not repeated.

from xgboost import XGBClassifier
space_xgb = {
    'max_bin': hp.choice('max_bin', range(50, 512)),                  # CPU 50-501
    'max_depth': hp.choice('max_depth', range(3, 11)),    
    'n_estimators': hp.choice('n_estimators', range(100, 1001)),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.5),    
    'subsample': hp.uniform('subsample', 0.5, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'reg_alpha': hp.uniform('reg_alpha', 0, 5),                       # lambda_l1
    'reg_lambda': hp.uniform('reg_lambda', 0, 3),                     # lambda_l2
    'gamma': hp.uniform('gamma',0.0, 10),                             # min_split_loss, min_split_gain
    'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
}

def f_xgb(params):
    # xgb = XGBClassifier(objective ='binary:logistic', use_label_encoder=False, seed = 2023,\
    #                     nthread=-1, verbosity=0, **params)  # CPU 
    xgb = XGBClassifier(tree_method='gpu_hist', objective ='binary:logistic', use_label_encoder=False,\
                        nthread=-1, seed = 2023, verbosity=0,**params)  # GPU    
    xgb_model = xgb.fit(train_X, train_y)
    acc = xgb_model.score(valid_X,valid_y)    
    # acc = cross_val_score(xgb, train_X, train_y).mean()            # CPU
    # acc = cross_val_score(xgb, train_X, train_y, n_jobs=6).mean()  # GPU
    return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'max_bin':256-50,                      # default 256
                                        'max_depth':6-3,                       # default 6 [0,∞]
                                        'n_estimators':200-100,                # default 10
                                        'learning_rate':0.3,                   # default 0.3 [0,1]
                                        'subsample':1.0,                       # default 1.0 (0,1]
                                        'colsample_bytree':1.0,                # default 1.0 (0,1]
                                        'reg_alpha':0,                         # default 0.0
                                        'reg_lambda':1.0,                      # default 1.0
                                        'gamma':0,                             # default 0.0 [0,∞]                                      
                                        'min_child_weight':1                   # default 1 [0,∞]
                                        }])
t1 = time.time()  
best_params = fmin(f_xgb, space_xgb, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
XGBoost-HIGGS-GPU model training; CPU load is not high, the GPU occasionally above half

First evaluate with the parameters found after 10 training trials. Although this parameter set's AUC is not high, the GPU speedup is already quite noticeable; the plotting code is not repeated either.


XGBoost-HIGGS-GPU already shows a significant speedup at low iteration counts
t_cpu = []; t_nvidia = []
a_cpu = []; a_nvidia = [] 

t3 = time.time()
# for num_iterations in [50]:
for num_iterations in [50,100,150,200]:
    # num_iterations =689
    # CPU --------------------------------------------------------------------------------
    params = {'objective':'binary:logistic',
              'max_bin':286,
              'n_estimators':num_iterations,
              'learning_rate': 0.3359071085471539,
              'max_depth': 4,
              'min_child_weight':6.4817419839798385,
              'colsample_bytree': 0.7209249276177966,
              'subsample': 0.5532140686826488,
              'reg_alpha': 2.2793074958255986,
              'reg_lambda': 2.4142485681002315,
              'gamma' :2.9324177415122934,
              'nthread': -1,              
              'tree_method': 'hist'
              }
           
    t0 = time.time()
    xgb = XGBClassifier(random_state =2023, use_label_encoder=False, **params)                         
    xgb_model = xgb.fit(train_X, train_y)
    t1 = time.time()
    t = round(t1-t0,2)
    t_cpu.append(t)
    
    print('cpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = xgb_model.predict(valid_X)
    auc_score = roc_auc_score(valid_y,y_pred)
    a_cpu.append(round(auc_score,4))
    print(auc_score)
    
    # NVIDIA GeForce RTX 2060 with Max-Q Design ------------------------------------------
    params = {'objective':'binary:logistic',
              'max_bin':286,
              'n_estimators':num_iterations,
              'learning_rate': 0.3359071085471539,
              'max_depth': 4,
              'min_child_weight':6.4817419839798385,
              'colsample_bytree': 0.7209249276177966,
              'subsample': 0.5532140686826488,
              'reg_alpha': 2.2793074958255986,
              'reg_lambda': 2.4142485681002315,
              'gamma' :2.9324177415122934,
              'nthread': -1,              
              'tree_method': 'gpu_hist'
              }
    
    t0 = time.time()
    xgb = XGBClassifier(random_state =2023, use_label_encoder=False, **params)                         
    xgb_model = xgb.fit(train_X, train_y)
    t1 = time.time()
    t = round(t1-t0,2)
    t_nvidia.append(t)
    
    print('gpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = xgb_model.predict(valid_X)
    auc_score = roc_auc_score(valid_y,y_pred)
    a_nvidia.append(round(auc_score,4))
    print(auc_score)    

t4 = time.time()
print('Total elapse time: {}'.format(t4-t3))

五解虱、CatBoost上跑HIGGS數(shù)據(jù)集分類(lèi)
There are already two examples of GPU acceleration in Python above, so this time I want to try R.
The Higgs Boson Machine Learning Challenge is a Kaggle competition that ended eight years ago, with 1,784 participating teams. There is an XGBoost implementation here, with the related discussion in that thread. Since no good AUC was reported, I decided to test training a model from scratch in R with Bayesian optimization, to see whether a better parameter combination can be found. On a dataset this large, training is quite time-consuming, and the GPU speedup is fairly significant.
Just generating the first 10 parameter sets for the Bayesian optimization takes over an hour (tidymodels' Gaussian process requires more initial parameter sets than there are parameters, whereas Python can start from one; tidymodels generates several thousand candidate combinations per iteration, which takes longer but needs fewer iterations, while Python generates only one per iteration and needs many more — a notable difference between the two platforms' Gaussian process implementations), and training a model with each parameter set takes roughly 5–10 minutes. To speed things up, I use bootstraps() resampling with a single resample instead of 5-fold cross-validation, and iterate only 10 times.
However, five of those runs failed because the required memory could not be allocated. Searching online shows this is a known issue when XGBoost runs on the GPU: it does not proactively release the memory occupied by the previous training run, see that thread. There are some workarounds in Python, see the Memory Usage section of 《XGBoost GPU Support》 and the thread 《How do I free all memory on GPU in XGBoost》. So I switched to testing with CatBoost instead.
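
On the R side, the closest workaround I know of is the one applied further below: make sure the previous booster is no longer referenced and force garbage collection before each new training run, so the external pointers holding GPU memory can be released. A minimal sketch of the idea (this is what the force_gc hack in the later tune_bayes() call does before every fit); the gc_fit helper name is just illustrative:

# Force garbage collection before each GPU training run; this only helps if the
# previous fit object has already been overwritten or removed with rm().
gc_fit <- function(wflow, data) {
  invisible(gc())       # reclaim memory still held by the previous booster
  fit(wflow, data)
}
# cat_fit <- gc_fit(cat_wflow_bo, higgs_train)
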
CatBoost's multi-fold cross-validation and grid search do support GPU parallelism: the figure below shows 5-fold cross-validation with 2 processes via the doParallel package on Windows, with both the CPU and the GPU under fairly high load. During Bayesian optimization, however, if parallelism is enabled, parsnip complains that it cannot find a CatBoost implementation of the boost_tree algorithm; presumably the pkgs argument does not take effect, so the worker processes never load the treesnip and catboost packages.

CatBoost-HIGGS-GPU: 5-fold cross-validation with 2 parallel processes

Following that thread, preloading the required packages for each worker process in the parallel cluster solves the problem.

# For large-dataset training on the GPU, verify minimal 2-way parallelism.
# All operating systems: register parallel processing and load the required packages for each worker.
# # https://github.com/tidymodels/tune/issues/157 
library(doParallel)
cl <- makePSOCKcluster(2)   # parallel::detectCores()
registerDoParallel(cl)
# Show how many workers there are
foreach::getDoParWorkers()
# Load the required packages for each parallel worker
clusterEvalQ(cl, 
             {library(tidymodels)
              library(treesnip)
              library(catboost)
})

CatBoost-HIGGS-GPU: Bayesian optimization of 6 parameters with 2 parallel processes.


CatBoost-HIGGS-GPU Bayesian optimization with parallel processing

After running overnight, however, the main process ultimately failed to establish the connection to the worker processes to read the results. It was also much slower than a single process: generating the 10 initial points (i.e. the first 10 training runs) takes just over an hour in a single process, whereas the two-process run needed a whole night (presumably because with this many parameters there is not enough memory; see below).

Forced gc():  0.1  Seconds.

>  Generating a set of 10 initial parameter results
Error in unserialize(socklist[[n]]) : error reading from connection
> t2<-proc.time()
> cat(t2-t1)
17.41 18.23 20864.81 NA NA

Next I tried LightGBM and found that it can run two worker processes in parallel on the GPU. Tuning a single parameter with Bayesian optimization, such as tree_depth in the figure below, runs to completion. But with more parameters, say 7, it runs out of memory, causing heavy swap disk I/O that effectively deadlocks the program. LightGBM maxes out the CPU and memory, while the GPU peaks at around 30% with 2-way parallelism. Reportedly a single training process needs roughly 3x the memory footprint of the data, so two processes need more than 6x; my laptop's 24 GB of RAM is still not enough for multi-process parallel training on the HIGGS dataset.


LightGBM-HIGGS-GPU with two parallel workers works; tuning one parameter, the GPU load is still not high

LightGBM Bayesian tuning of a single parameter, with 2 grid-search initial values and 10 iterations, found better parameters in 5 of them; the effect is noticeable.

Forced gc():  0.2  Seconds.

-- Iteration 4 -----------------------------------------------------------------------------------------------------

i Current best:     roc_auc=0.7369 (@iter 2)
i Gaussian process model
! Gaussian process model: X should be in range (0, 1)
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.046
i Estimating performance
√ Estimating performance
<3 Newest results:  roc_auc=0.8011

Forced gc():  0.3  Seconds.

-- Iteration 5 -----------------------------------------------------------------------------------------------------

i Current best:     roc_auc=0.8011 (@iter 4)
i Gaussian process model
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.0999
i Estimating performance
√ Estimating performance
<3 Newest results:  roc_auc=0.8116

Then I tried CatBoost on the GPU again with 2 parallel processes tuning a single parameter, and it ran to completion, averaging roughly 530 seconds per training round. The tests show that for large datasets the memory really has to be big enough and the number of tuned parameters cannot be too large. :)

-- Iteration 10 ----------------------------------------------------------------------------------------------------

i Current best:     roc_auc=0.8295 (@iter 3)
i Gaussian process model
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.0999
i Estimating performance
√ Estimating performance
(x) Newest results: roc_auc=0.8293
> t2<-proc.time()
> cat(t2-t1)
105.55 5711.49 10611.05 NA NA

For a better AUC, multiple parameters need to be tuned, so finally I ran the 6-parameter tuning of CatBoost on the GPU in a single process; only the first 50 iterations were actually run here, which already took a whole night.

library(tidymodels)
library(kableExtra)
library(tidyr)
# This version of treesnip supports classification.
# remotes::install_github("Glemhel/treesnip", INSTALL_opts = c("--no-multiarch"))
# library(catsnip)
library(treesnip)
library(data.table)
# For large-dataset training on the GPU, minimal 2-way parallelism was verified; one parameter works, but several parameters cannot finish because memory is insufficient.
# All operating systems: register parallel processing and load the required packages for each worker.
# # https://github.com/tidymodels/tune/issues/157 
# library(doParallel)
# cl <- makePSOCKcluster(2)   # parallel::detectCores()
# registerDoParallel(cl)
# # Show how many workers there are
# foreach::getDoParWorkers()
# # Load the required packages for each parallel worker
# clusterEvalQ(cl, 
#              {library(tidymodels)
#                library(treesnip)
#                library(catboost)
#              })

# Prefer tidymodels functions when names clash.
tidymodels_prefer()

# ----------------------------------------------------------------------------------------
# Load the preprocessed data
t1<-proc.time()
higgs<- fread("D:/temp/data/HIGGS/HIGGS.csv", header=FALSE, encoding="UTF-8")
higgs$V1<-as.factor(higgs$V1)
t2<-proc.time()
cat(t2-t1)
# 17.41 16.25 34.02 NA NA
names(higgs)

# Split into training and test sets
t1<-proc.time()
set.seed(2023)
higgs_split <- initial_split(higgs, prop = 0.90)
higgs_train <- training(higgs_split)
higgs_test  <-  testing(higgs_split)
t2<-proc.time()
cat(t2-t1)
# 6.63 0.55 7.2 NA NA

# ----------------------------------------------------------------------------------------
# Bayesian optimization
# Recipe parameters, main model parameters and engine-specific parameters can all be tuned.

# Define the recipe: classification formula and preprocessing
higgs_rec<-
  recipe(V1 ~ ., data = higgs_train) %>%
  # Normalize the numeric variables
  step_normalize(all_numeric_predictors())

# Define the model: CatBoost; declare the parameters to tune; task_type = 'GPU' uses the GPU.
cat_spec <-
  boost_tree(mtry=tune(), tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune()) %>%
  set_engine('catboost', subsample = tune("subsample"), task_type = 'GPU') %>%    # 
  set_mode('classification')

# Define the workflow
cat_wflow <- 
  workflow() %>% 
  add_model(cat_spec) %>% 
  add_recipe(higgs_rec)

# The boundaries of all parameters are already determined.
cat_param <- cat_wflow %>%
  extract_parameter_set_dials() %>%
  update(learn_rate = threshold(c(0.01,0.5))) %>%
  update(trees = trees(c(500,1000))) %>%
  update(tree_depth = tree_depth(c(5,15))) %>%
  update(mtry = mtry(c(3,6))) %>%
  update(subsample = threshold(c(0.5,1)))

# Inspect the parameter boundaries; all are determined
cat_param

# Inspect the parameter boundaries; all are determined
cat_param %>% extract_parameter_dials("trees")
cat_param %>% extract_parameter_dials("min_n")
cat_param %>% extract_parameter_dials("tree_depth")
cat_param %>% extract_parameter_dials("learn_rate")
cat_param %>% extract_parameter_dials("mtry")
cat_param %>% extract_parameter_dials("subsample")  

# For a dataset this large, multi-fold cross-validation takes too long; use bootstraps resampling with a single resample to speed up training.
#higgs_folds <- vfold_cv(higgs_train, v = 5)
higgs_folds <- bootstraps(higgs_train, times = 1)

gc()

# Run the Bayesian optimization
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
cat_res_bo <-
  cat_wflow %>%
  tune_bayes(
    resamples = higgs_folds,
    # metrics = metric_set(recall, precision, f_meas, accuracy, kap,roc_auc, sens, spec)
    metrics = metric_set(accuracy, roc_auc, precision,),  
    initial = 10,
    param_info = cat_param,
    iter = 100,
    control = ctrl,
    # tune_bayes() was hacked to add a force_gc argument, so garbage collection can optionally be forced before each training run in the iterations.
    force_gc = TRUE
  )
t2<-proc.time()
cat(t2-t1)
# 9435.55 269.2 9085.21 NA NA

# Plot to inspect the effect of the Bayesian optimization
autoplot(cat_res_bo, type = "performance", metric="roc_auc")

# Show the best-performing models
show_best(cat_res_bo, metric="precision")
show_best(cat_res_bo, metric="accuracy")
show_best(cat_res_bo, metric="roc_auc")


# Select the best model
select_best(cat_res_bo, metric="roc_auc")
# Read the best tuning result directly
cat_param_best<- select_best(cat_res_bo, metric="roc_auc")

# Finalize the workflow with the best parameters
cat_wflow_bo <-
  cat_wflow %>%
  finalize_workflow(cat_param_best)
cat_wflow_bo

# Train the model on the full training set with the best parameters
t1<-proc.time()
# Collect garbage first, otherwise training may fail because memory cannot be allocated;
# if the Bayesian optimization above includes this garbage-collection mechanism, such failures should be avoidable.
gc()
cat_fit_bo<- cat_wflow_bo %>% fit(higgs_train)
t2<-proc.time()
cat(t2-t1)
#  647.2 183.99 507.11 NA NA

# Test set
# Predictions
# https://parsnip.tidymodels.org/reference/predict.model_fit.html
# https://yardstick.tidymodels.org/reference/roc_auc.html
t1<-proc.time()
higgs_test_bo <- predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "prob") %>%
  bind_cols(predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "class"))
t2<-proc.time()
cat(t2-t1)
#  67.8 0.39 5.83 NA NA

# Bind the true values
higgs_test_bo <- bind_cols(higgs_test_bo, higgs_test %>% select(V1))
higgs_metrics <- metric_set(precision, accuracy)
higgs_metrics(higgs_test_bo, truth = V1, estimate = .pred_class)
roc_auc(
  higgs_test_bo,
  truth = V1,
  estimate=.pred_0,
  options = list(smooth = TRUE)
)
> show_best(cat_res_bo, metric="precision")
# A tibble: 5 x 13
   mtry trees min_n tree_depth learn_rate subsample .metric   .estimator  mean     n std_err .config .iter
  <int> <int> <int>      <int>      <dbl>     <dbl> <chr>     <chr>      <dbl> <int>   <dbl> <chr>   <int>
1     4   993    25         15      0.198     0.602 precision binary     0.703     1      NA Iter9       9
2     5   918    14         15      0.294     0.627 precision binary     0.702     1      NA Iter1       1
3     5   931     5         15      0.175     0.703 precision binary     0.701     1      NA Iter12     12
4     4   981    18         15      0.153     0.522 precision binary     0.701     1      NA Iter10     10
5     3   988    21         15      0.153     0.945 precision binary     0.701     1      NA Iter4       4
> show_best(cat_res_bo, metric="accuracy")
# A tibble: 5 x 13
   mtry trees min_n tree_depth learn_rate subsample .metric  .estimator  mean     n std_err .config .iter
  <int> <int> <int>      <int>      <dbl>     <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>   <int>
1     4   997    16         15      0.118     0.685 accuracy binary     0.754     1      NA Iter18     18
2     3   985    36         15      0.124     0.583 accuracy binary     0.753     1      NA Iter17     17
3     5   966     8         15      0.128     0.869 accuracy binary     0.753     1      NA Iter11     11
4     5   989     3         15      0.117     0.816 accuracy binary     0.753     1      NA Iter15     15
5     4   981    18         15      0.153     0.522 accuracy binary     0.753     1      NA Iter10     10
> show_best(cat_res_bo, metric="roc_auc")
# A tibble: 5 x 13
   mtry trees min_n tree_depth learn_rate subsample .metric .estimator  mean     n std_err .config .iter
  <int> <int> <int>      <int>      <dbl>     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>   <int>
1     6   986    24         14      0.117     0.558 roc_auc binary     0.849     1      NA Iter16     16
2     5   957    11         15      0.109     0.839 roc_auc binary     0.849     1      NA Iter14     14
3     4   997    16         15      0.118     0.685 roc_auc binary     0.849     1      NA Iter18     18
4     5   989     3         15      0.117     0.816 roc_auc binary     0.849     1      NA Iter15     15
5     3   985    36         15      0.124     0.583 roc_auc binary     0.848     1      NA Iter17     17
> select_best(cat_res_bo, metric="roc_auc")
# A tibble: 1 x 7
   mtry trees min_n tree_depth learn_rate subsample .config
  <int> <int> <int>      <int>      <dbl>     <dbl> <chr>  
1     6   986    24         14      0.117     0.558 Iter16 
> higgs_metrics(higgs_test_bo, truth = V1, estimate = .pred_class)
# A tibble: 2 x 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.698
2 accuracy  binary         0.755
> roc_auc(
+   higgs_test_bo,
+   truth = V1,
+   estimate=.pred_0,
+   options = list(smooth = TRUE)
+ )
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.854

At this point neither the CPU nor the GPU load is high, under 30%; the hardware's capability is not yet fully exploited.

CatBoost-HIGGS-GPU single-process training

Note that this kind of parallelism is multi-process: the processes do not share data and communicate with the parent process through the doParallel package, so there are as many copies of the data as there are processes and the memory requirement is large. Below, CatBoost in CPU mode supports multi-threading (within the same parent process), where threads can share a single copy of the data, so the memory overhead is small. However, later tests show that the nthread parameter has no effect in CatBoost GPU mode, which apparently does not support multi-threading, i.e. you cannot add multi-threaded acceleration on top of multi-process runs. So to fully exploit the GPU's processing power, the only option is to add memory.
Next, compare the time spent in training and prediction on the GPU versus the CPU. CatBoost's fit() supports multi-threaded parallelism, set via the nthread parameter after the treesnip wrapping. Compared at the same thread count, the GPU is without question much faster (and in fact it is), since an extra GPU assists the computation. My laptop has 8 physical cores / 16 logical cores, and the CPU maxes out with 12 threads. Testing shows the nthread parameter has no effect on the GPU, and fit() has no option to spawn multiple processes the way the Bayesian optimization does; it is single-process. See the reference material.

library(tidymodels)
library(kableExtra)
library(tidyr)
# This version of treesnip supports classification.
# remotes::install_github("Glemhel/treesnip", INSTALL_opts = c("--no-multiarch"))
# library(catsnip)
library(treesnip)
library(data.table)
# For large-dataset training on the GPU, verify minimal 2-way parallelism.
# All operating systems: register parallel processing and load the required packages for each worker.
# # https://github.com/tidymodels/tune/issues/157 
# https://curso-r.github.io/treesnip/articles/parallel-processing.html
library(doParallel)
# cl <- makePSOCKcluster(parallel::detectCores()) 
cl <- makePSOCKcluster(12)   # CPU fit
# cl <- makePSOCKcluster(2)   # GPU fit
registerDoParallel(cl)
# Show how many workers there are
foreach::getDoParWorkers()
# Load the required packages for each parallel worker
clusterEvalQ(cl,
             {library(tidymodels)
               library(treesnip)
               library(catboost)
             })

# Prefer tidymodels functions when names clash.
tidymodels_prefer()

# ----------------------------------------------------------------------------------------
# Load the preprocessed data
t1<-proc.time()
higgs<- fread("D:/temp/data/HIGGS/HIGGS.csv", header=FALSE, encoding="UTF-8")
higgs$V1<-as.factor(higgs$V1)
t2<-proc.time()
cat(t2-t1)
# 17.41 16.25 34.02 NA NA
names(higgs)

# Split into training and test sets
t1<-proc.time()
set.seed(2023)
higgs_split <- initial_split(higgs, prop = 0.90)
higgs_train <- training(higgs_split)
higgs_test  <-  testing(higgs_split)
t2<-proc.time()
cat(t2-t1)
# 6.63 0.55 7.2 NA NA

# Compare CPU and GPU performance with a good set of parameters ----------------------------------------------------
# https://curso-r.github.io/treesnip/articles/parallel-processing.html
# Define the recipe: classification formula and preprocessing
higgs_rec<-
  recipe(V1 ~ ., data = higgs_train) %>%
  # Normalize the numeric variables
  step_normalize(all_numeric_predictors())

# Define the model: CatBoost; declare the parameters to tune
cat_spec <-
  boost_tree(mtry=tune(), tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune()) %>%
  #set_engine('catboost', subsample = tune("subsample"), task_type = 'GPU', nthread = 2) %>%  # GPU
  set_engine('catboost', subsample = tune("subsample"), task_type = 'CPU', nthread = 12) %>%  # CPU
  set_mode('classification')

# Define the workflow
cat_wflow <- 
  workflow() %>% 
  add_model(cat_spec) %>% 
  add_recipe(higgs_rec)

# Construct the best parameters
cat_param_best<-
  tibble(
    mtry = 6,
    trees = 986,
    min_n = 24,
    tree_depth = 14,
    learn_rate = 0.117 ,
    subsample =  0.558
  )

# Finalize the workflow with the best parameters
cat_wflow_bo <-
  cat_wflow %>%
  finalize_workflow(cat_param_best)

# Train the model on the full training set with the best parameters
t1<-proc.time()
# fit() does not run in parallel; all comparisons here are single-process.
cat_fit_bo<- cat_wflow_bo %>% fit(higgs_train)
t2<-proc.time()

cat(t2-t1)
# GPU single thread 650.52 183.77 511.34 NA NA
# CPU 12 threads 65252.86 2728.51 6305.28 NA NA
# CPU 12 threads 15980.84 672.2 1944.65 NA NA
# Generate training/test predictions and performance data

t1<-proc.time()
higgs_test_bo <- predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "prob")
t2<-proc.time()
cat(t2-t1)
#GPU 36.42 0.07 2.69 NA NA
#CPU 40.11 0.9 3.35 NA NA

higgs_test_bo <- bind_cols(higgs_test_bo, higgs_test %>% select(V1))
roc_auc(
  higgs_test_bo,
  truth = V1,
  estimate=.pred_0,
  options = list(smooth = TRUE)
)
#GPU 85.4
#CPU 85.4
CatBoost-HIGGS-CPU 12-thread training; the CPU is maxed out.

??因?yàn)檫@組參數(shù)訓(xùn)練要迭代986次吧碾,比較慢,CPU 12線程跑要6305.28秒飞蚓,GPU是511.34秒滤港,12倍,GPU算力的優(yōu)越性已經(jīng)得到充分的體現(xiàn)了(并且硬件的負(fù)荷不高)趴拧。預(yù)測(cè)都差不多溅漾,主要的加速在訓(xùn)練,數(shù)據(jù)集越大著榴,迭代的次數(shù)越多添履,GPU算力的優(yōu)越性越明顯。

Reference: 《When to Choose CatBoost Over XGBoost or LightGBM [Practical Guide]》, which covers the main parameters controlling overfitting and training speed, as well as comparisons among the algorithms.
