Team learning with Datawhale: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12281978.0.0.6802593a2HCrSE&postId=95460
- Linear regression model:
  - what linear regression requires of the features;
  - handling long-tailed distributions;
  - understanding the linear regression model;
- Model validation:
  - evaluation metrics and objective functions;
  - cross-validation;
  - leave-one-out validation;
  - validation for time-series problems;
  - plotting learning curves;
  - plotting validation curves;
- Embedded feature selection:
  - Lasso regression;
  - Ridge regression;
  - decision trees;
- Model comparison:
  - common linear models;
  - common non-linear models;
- Hyperparameter tuning:
  - greedy tuning;
  - grid search;
  - Bayesian optimization;
4.3 Related theory: introductions and recommendations
Since the underlying algorithm theory is too long to cover here, this section recommends some blog posts and textbooks for beginners.
4.3.1 Linear regression
https://zhuanlan.zhihu.com/p/49480391
4.3.2 Decision trees
https://zhuanlan.zhihu.com/p/65304798
4.3.3 GBDT
https://zhuanlan.zhihu.com/p/45145899
4.3.4 XGBoost
https://zhuanlan.zhihu.com/p/86816771
4.3.5 LightGBM
https://zhuanlan.zhihu.com/p/89360721
4.3.6 Recommended textbooks:
- 《機器學(xué)習(xí)》 (Machine Learning, Zhou Zhihua) https://book.douban.com/subject/26708119/
- 《統(tǒng)計學(xué)習(xí)方法》 (Statistical Learning Methods, Li Hang) https://book.douban.com/subject/10590856/
- 《Python大戰(zhàn)機器學(xué)習(xí)》 https://book.douban.com/subject/26987890/
- 《面向機器學(xué)習(xí)的特征工程》 (Feature Engineering for Machine Learning) https://book.douban.com/subject/26826639/
- 《數(shù)據(jù)科學(xué)家訪談錄》 https://book.douban.com/subject/30129410/
4.4 Code examples
4.4.1 Loading the data
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
The reduce_mem_usage function downcasts each column to the smallest suitable dtype, reducing the DataFrame's memory footprint:
def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # downcast integers to the smallest width whose range fits the data
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # downcast floats likewise; beware that float16 keeps only ~3 significant digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # store object columns as pandas categoricals
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
output
Memory usage of dataframe is 59.22 MB
Memory usage after optimization is: 15.75 MB
Decreased by 73.4%
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]
continuous_feature_names
output
['SaleID',
'bodyType',
'fuelType',
'gearbox',
'kilometer',
'name',
'notRepairedDamage',
'offerType',
'power',
'seller',
'train',
'v_0',
'v_1',
'v_10',
'v_11',
'v_12',
'v_13',
'v_14',
'v_2',
'v_3',
'v_4',
'v_5',
'v_6',
'v_7',
'v_8',
'v_9',
'used_time',
'city',
'brand_amount',
'brand_price_max',
'brand_price_median',
'brand_price_min',
'brand_price_sum',
'brand_price_std',
'brand_price_average',
'power_bin']
4.4.2 Linear regression & five-fold cross-validation & simulating the real business setting
# drop rows with missing values, replace the '-' placeholder with 0, and reset the index
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]
train_X = train[continuous_feature_names]
train_y = train['price']
4.4.2 - 1 Simple modeling
from sklearn.linear_model import LinearRegression
# note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, drop it and standardize the features separately (e.g. with StandardScaler)
model = LinearRegression(normalize=True)
model = model.fit(train_X, train_y)
Inspect the intercept and weights (coef) of the trained linear regression model:
print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
output
intercept:-110670.68276350443
[('v_6', 3367064.3416419127),
('v_8', 700675.5609398747),
('v_9', 170630.27723219778),
('v_7', 32322.66193202066),
('v_12', 20473.670796956518),
('v_3', 17868.079541495277),
('v_11', 11474.93899667472),
('v_13', 11261.764560013775),
('v_10', 2683.920090636247),
('gearbox', 881.8225039248076),
('fuelType', 363.9042507217425),
('bodyType', 189.60271012071334),
('city', 44.94975120521506),
('power', 28.553901616753308),
('brand_price_median', 0.5103728134078945),
('brand_price_std', 0.4503634709263306),
('brand_amount', 0.1488112039506502),
('brand_price_max', 0.0031910186703142654),
('SaleID', 5.355989919859153e-05),
('seller', 1.7292331904172897e-05),
('offerType', -3.609340637922287e-06),
('train', -8.841510862112045e-06),
('brand_price_sum', -2.1750068681876416e-05),
('name', -0.0002980012713070219),
('used_time', -0.0025158943328975666),
('brand_price_average', -0.4049048451011582),
('brand_price_min', -2.2467753486897433),
('power_bin', -34.42064411727693),
('v_14', -274.78411807745636),
('kilometer', -372.8975266607184),
('notRepairedDamage', -495.1903844630634),
('v_0', -2045.0549573528133),
('v_5', -11022.986240572642),
('v_4', -15121.731109855522),
('v_2', -26098.299920511883),
('v_1', -45556.18929722599)]
Analyzing the results
from matplotlib import pyplot as plt
# randomly pick 50 rows for the scatter plots below
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
Plot feature v_9 against the label: the model's predictions (blue points) differ markedly from the true labels (black points), and some predictions are below zero, indicating that the model has problems.
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()
The predicted price is obviously different from the true price
The plots show that the label (price) follows a long-tailed distribution, which is unfavorable for modeling: many models assume the error term is normally distributed, and a long-tailed target violates that assumption. See https://blog.csdn.net/Noob_daniel/article/details/76087829. Roughly speaking, such models work best when the (transformed) target is close to normally distributed.
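One way to quantify the long tail is to compare the skewness of the label before and after a log transform (a minimal sketch; scipy is assumed to be available):
from scipy import stats

# skewness near 0 means roughly symmetric; a large positive value means a long right tail
print('skewness of price:          ', stats.skew(train_y))
print('skewness of log(price + 1): ', stats.skew(np.log(train_y + 1)))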
import seaborn as sns
print('It is clear that price follows a typical long-tailed distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
# note: distplot is deprecated in seaborn >= 0.11; use sns.histplot(..., kde=True) there
sns.distplot(train_y)
plt.subplot(1, 2, 2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])
It is clear that price follows a typical long-tailed distribution
Here we apply a log(x+1) transform to the label to bring it closer to a normal distribution:
train_y_ln = np.log(train_y + 1)
print('The transformed price looks close to a normal distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.distplot(train_y_ln)
plt.subplot(1, 2, 2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
The transformed price looks close to a normal distribution
model = model.fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
output
intercept:18.750749465596538
[('v_9', 8.052409900567598),
('v_5', 5.764236596653164),
('v_12', 1.6182081236790842),
('v_1', 1.479831058297348),
('v_11', 1.1669016563604946),
('v_13', 0.9404711296033863),
('v_7', 0.7137273083565177),
('v_3', 0.6837875771083749),
('v_0', 0.00850051800990644),
('power_bin', 0.00849796930289138),
('gearbox', 0.007922377278334032),
('fuelType', 0.006684769706830405),
('bodyType', 0.004523520092703996),
('power', 0.0007161894205359879),
('brand_price_min', 3.334351114751914e-05),
('brand_amount', 2.8978797042783603e-06),
('brand_price_median', 1.2571172872987556e-06),
('brand_price_std', 6.659176363456033e-07),
('brand_price_max', 6.194956307516954e-07),
('brand_price_average', 5.999345965077057e-07),
('SaleID', 2.11941700396436e-08),
('seller', 3.923616986867273e-11),
('train', -1.5702994460298214e-11),
('offerType', -2.2708945834892802e-11),
('brand_price_sum', -1.5126504215939166e-10),
('name', -7.015512588894369e-08),
('used_time', -4.122479372349444e-06),
('city', -0.0022187824810420242),
('v_14', -0.004234223418117319),
('kilometer', -0.013835866226884267),
('notRepairedDamage', -0.27027942349846545),
('v_4', -0.8315701200994575),
('v_2', -0.9470842241619425),
('v_10', -1.6261466689774864),
('v_8', -40.343007487616305),
('v_6', -238.79036385506956)]
Visualizing again, the predictions are now close to the true values, with no anomalies.
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.exp(model.predict(train_X.loc[subsample_index])), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price looks reasonable after the np.log transform')
plt.show()
The predicted price looks reasonable after the np.log transform
4.4.2 - 2 Five-fold cross-validation
When training model parameters, people usually split the full dataset into three parts (as with the MNIST training set): a training set (train_set), a validation set (valid_set), and a test set (test_set). This split exists to safeguard training quality. The test set is the easy part: data that never participates in training and is used only to measure final performance. The roles of the training and validation sets involve the ideas below.
In practice, a trained model usually fits its training data quite well, but fits data outside the training set considerably worse. We therefore hold out a portion of the data that takes no part in training and use it to test the parameters learned from the rest, giving a relatively objective estimate of how well those parameters generalize. This idea is called cross-validation.
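Before using scikit-learn's helper below, it may help to see what five-fold cross-validation does by hand. A minimal sketch with KFold (the shuffle here is an assumption; cross_val_score's default KFold for regression does not shuffle):
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# each of the 5 folds takes one turn as the held-out validation set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(train_X):
    m = LinearRegression().fit(train_X.iloc[train_idx], train_y_ln.iloc[train_idx])
    fold_scores.append(mean_absolute_error(train_y_ln.iloc[val_idx],
                                           m.predict(train_X.iloc[val_idx])))
print(np.mean(fold_scores))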
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
def log_transfer(func):
    # wrap a metric so it is evaluated on log-transformed labels, making the
    # raw-label model's score comparable with the log-label model's score
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
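For instance, wrapping mean_absolute_error gives a metric that scores predictions in log space (a small usage note relying only on the decorator above):
log_mae = log_transfer(mean_absolute_error)
# equivalent to mean_absolute_error(np.log(y), np.log(yhat)) for positive predictions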
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
output
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 1.0s finished
Using the linear regression model, five-fold cross-validation on the untransformed label gives an error of about 1.36:
print('AVG:', np.mean(scores))
output
AVG: 1.3658023920314064
Using the linear regression model, five-fold cross-validation on the log-transformed label gives an error of about 0.19:
scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
output
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 1.1s finished
print('AVG:', np.mean(scores))
output
AVG: 0.19325301837047434
scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores
4.4.2 - 3 Simulating the real business setting
In reality, however, we cannot see the future. On time-dependent data, five-fold cross-validation can therefore give an unrealistically optimistic picture: using 2018 used-car prices to predict 2017 prices is clearly unreasonable. So we can also split the dataset chronologically. Here we take the earliest 4/5 of the samples as the training set and the most recent 1/5 as the validation set; the result turns out to be close to the five-fold cross-validation score. (A ready-made scikit-learn splitter for this scenario is sketched after the code below.)
import datetime
sample_feature = sample_feature.reset_index(drop=True)
# split point: the earliest 4/5 of the rows train, the most recent 1/5 validates
split_point = len(sample_feature) // 5 * 4
train = sample_feature.loc[:split_point].dropna()
val = sample_feature.loc[split_point:].dropna()
train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)
model = model.fit(train_X, train_y_ln)
mean_absolute_error(val_y_ln, model.predict(val_X))
output
0.19577667270300955
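scikit-learn also ships a ready-made splitter for this scenario. A minimal sketch with TimeSeriesSplit (assuming, as above, that the rows are already in chronological order):
from sklearn.model_selection import TimeSeriesSplit

# every fold trains on an initial segment and validates on the segment right after it,
# so the model never sees the "future" during training
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = []
for train_idx, val_idx in tscv.split(train_X):
    m = LinearRegression().fit(train_X.iloc[train_idx], train_y_ln.iloc[train_idx])
    ts_scores.append(mean_absolute_error(train_y_ln.iloc[val_idx],
                                         m.predict(train_X.iloc[val_idx])))
print(np.mean(ts_scores))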
4.4.2 - 4 Plotting learning curves and validation curves
from sklearn.model_selection import learning_curve, validation_curve
?learning_curve  # view the docstring in IPython/Jupyter
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes,
        scoring=make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()  # draw the grid
    # shade one standard deviation around each mean curve
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)
4.4.3 Comparing multiple models
train = sample_feature[continuous_feature_names + ['price']].dropna()
train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)
4.4.3 - 1 Linear models & embedded feature selection
This section assumes the reader is already familiar with overfitting, model complexity, and regularization; otherwise, consult the following links:
- An intuitive explanation of overfitting: https://www.zhihu.com/question/32246256/answer/55320482
- Model complexity and generalization: http://yangyingming.com/article/434/
- An intuitive view of regularization: https://blog.csdn.net/jinping_shi/article/details/52433975
In filter and wrapper feature-selection methods, the feature selection step is clearly separated from model training. Embedded feature selection, by contrast, happens automatically during training. The most common embedded approaches are L1 and L2 regularization; adding them to linear regression yields Lasso regression and Ridge regression, respectively (the two objectives are sketched below).
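For reference, in scikit-learn's formulation (with alpha the regularization strength), the two models minimize:

Ridge:  min_w ||y - Xw||_2^2 + alpha * ||w||_2^2
Lasso:  min_w (1 / (2 * n_samples)) * ||y - Xw||_2^2 + alpha * ||w||_1

The L1 penalty pushes some coefficients exactly to zero, which is why Lasso performs feature selection, while the L2 penalty only shrinks coefficients toward zero.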
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
models = [LinearRegression(),
          Ridge(),
          Lasso()]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                             scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
Ridge is finished
Lasso is finished
Compare the performance of the three methods:
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
# pass x/y as keywords; newer seaborn versions no longer accept them positionally
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
intercept:18.750417086915647
L2 regularization tends to make the weights as small as possible, yielding a model whose parameters are all fairly small. Such a model is generally considered simpler, adapts better across datasets, and is less prone to overfitting. Intuitively, in a linear regression equation with large weights, a tiny shift in the inputs changes the output dramatically; with small enough weights, even a larger shift barely matters. In more formal terms, the model is robust to perturbations.
model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
intercept:4.671709787512927
L1 regularization produces a sparse weight vector and can therefore be used for feature selection. As the figure below shows, power and used_time emerge as the most important features.
model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
intercept:8.672182402894254
Beyond regularized linear models, decision trees also perform a kind of embedded feature selection: when splits are chosen by information gain (entropy) or the Gini index, features selected earlier in the tree are the more important ones. The feature importance scores of XGBoost and LightGBM (exposed as feature_importances_ in their sklearn APIs) are computed on this basis.
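As a quick illustration, a minimal sketch reading the split-based importances from a LightGBM model fitted on the same data (the n_estimators value here is an arbitrary choice):
from lightgbm.sklearn import LGBMRegressor

# tree ensembles expose per-feature importance through the sklearn API
lgb_model = LGBMRegressor(n_estimators=100).fit(train_X, train_y_ln)
importances = pd.Series(lgb_model.feature_importances_, index=continuous_feature_names)
print(importances.sort_values(ascending=False).head(10))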
4.4.3 - 2 Non-linear models
Besides linear models, there are many commonly used non-linear models. Space does not permit explaining each one here, so we simply compare a selection of popular models against the linear baseline.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100),
          XGBRegressor(n_estimators=100, objective='reg:squarederror'),
          LGBMRegressor(n_estimators=100)]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                             scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
DecisionTreeRegressor is finished
RandomForestRegressor is finished
GradientBoostingRegressor is finished
MLPRegressor is finished
XGBRegressor is finished
LGBMRegressor is finished
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
|   | LinearRegression | DecisionTreeRegressor | RandomForestRegressor | GradientBoostingRegressor | MLPRegressor | XGBRegressor | LGBMRegressor |
|---|---|---|---|---|---|---|---|
| cv1 | 0.191642 | 0.184566 | 0.136266 | 0.168626 | 124.299426 | 0.168698 | 0.141159 |
| cv2 | 0.194986 | 0.187029 | 0.139693 | 0.171905 | 257.886236 | 0.172258 | 0.143363 |
| cv3 | 0.192737 | 0.184839 | 0.136871 | 0.169553 | 236.829589 | 0.168604 | 0.142137 |
| cv4 | 0.195329 | 0.182605 | 0.138689 | 0.172299 | 130.197264 | 0.172474 | 0.143461 |
| cv5 | 0.194450 | 0.186626 | 0.137420 | 0.171206 | 268.090236 | 0.170898 | 0.141921 |
The random forest model achieves the best score on every fold. (The MLPRegressor's huge errors are unsurprising: neural networks are sensitive to feature scale, and these features were not standardized.)
4.4.4 Hyperparameter tuning
Here we introduce three common tuning approaches:
- greedy search http://www.reibang.com/p/ab89df9759c8
- grid search https://blog.csdn.net/weixin_43172660/article/details/83032029
- Bayesian optimization https://blog.csdn.net/linxid/article/details/81189154
## candidate parameter values for LightGBM:
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []
4.4.4 - 1 Greedy tuning
Tune one hyperparameter at a time, fixing each at its best value so far before moving to the next; this implicitly assumes the parameters affect the score roughly independently.
best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                                    scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0],
                          num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                                    scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x: x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                                    scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
# 0.143 is the untuned LGBM score from the model-comparison table above
sns.lineplot(x=['0_initial', '1_tuning_obj', '2_tuning_leaves', '3_tuning_depth'],
             y=[0.143, min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])
4.4.4 - 2 Grid search
from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
# tune against the same log-transformed target used for evaluation below
clf = clf.fit(train_X, train_y_ln)
clf.best_params_
{'max_depth': 15, 'num_leaves': 55, 'objective': 'regression'}
model = LGBMRegressor(objective='regression',
                      num_leaves=55,
                      max_depth=15)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
0.13754820909576437
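Note that GridSearchCV above optimizes the estimator's default score (R² for regressors). To select parameters by MAE directly, a scorer can be passed in; a minimal sketch (greater_is_better=False tells scikit-learn to negate the error internally):
clf = GridSearchCV(LGBMRegressor(), parameters, cv=5,
                   scoring=make_scorer(mean_absolute_error, greater_is_better=False))
clf = clf.fit(train_X, train_y_ln)
print(clf.best_params_)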
4.4.4 - 3 Bayesian optimization
from bayes_opt import BayesianOptimization
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective='regression_l1',
                      num_leaves=int(num_leaves),
                      max_depth=int(max_depth),
                      subsample=subsample,
                      min_child_samples=int(min_child_samples)),
        X=train_X, y=train_y_ln, verbose=0, cv=5,
        scoring=make_scorer(mean_absolute_error)
    ).mean()
    # BayesianOptimization maximizes its target, so return 1 - MAE (larger is better)
    return 1 - val
rf_bo = BayesianOptimization(
    rf_cv,
    {
        'num_leaves': (2, 100),
        'max_depth': (2, 100),
        'subsample': (0.1, 1),
        'min_child_samples': (2, 100)
    }
)
rf_bo.maximize()
# convert the maximized target back into the best MAE found
1 - rf_bo.max['target']
0.1305927066054554
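rf_bo.max also stores the parameter values that achieved this score; a short usage note for refitting with them:
best = rf_bo.max['params']
model = LGBMRegressor(objective='regression_l1',
                      num_leaves=int(best['num_leaves']),
                      max_depth=int(best['max_depth']),
                      subsample=best['subsample'],
                      min_child_samples=int(best['min_child_samples']))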