Team Learning for Data Mining: Modeling and Parameter Tuning

Learning together with the DataWhale team: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12281978.0.0.6802593a2HCrSE&postId=95460

  1. Linear regression model:
    • Feature requirements of linear regression;
    • Handling long-tailed distributions;
    • Understanding the linear regression model;
  2. Model performance validation:
    • Evaluation functions and objective functions;
    • Cross-validation;
    • Leave-one-out validation;
    • Validation for time-series problems;
    • Plotting learning curves;
    • Plotting validation curves;
  3. Embedded feature selection:
    • Lasso regression;
    • Ridge regression;
    • Decision trees;
  4. Model comparison:
    • Common linear models;
    • Common non-linear models;
  5. Model tuning:
    • Greedy tuning;
    • Grid search tuning;
    • Bayesian tuning;

4.3 Related theory: introductions and recommended reading

Because a full treatment of the underlying algorithms would be lengthy, this article recommends some blog posts and textbooks for beginners to study from.

4.3.1 Linear regression model

https://zhuanlan.zhihu.com/p/49480391

4.3.2 Decision tree model

https://zhuanlan.zhihu.com/p/65304798

4.3.3 GBDT model

https://zhuanlan.zhihu.com/p/45145899

4.3.4 XGBoost model

https://zhuanlan.zhihu.com/p/86816771

4.3.5 LightGBM model

https://zhuanlan.zhihu.com/p/89360721

4.3.6 Recommended textbooks:

4.4 Code examples

4.4.1 Loading the data

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

The reduce_mem_usage function downcasts column data types to reduce the DataFrame's memory footprint.

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() 
    print('Memory usage of dataframe is {:.2f} bytes'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() 
    print('Memory usage after optimization is: {:.2f} bytes'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))

output

Memory usage of dataframe is 62099672.00 bytes
Memory usage after optimization is: 16520303.00 bytes
Decreased by 73.4%
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]
continuous_feature_names

output

['SaleID',
 'bodyType',
 'fuelType',
 'gearbox',
 'kilometer',
 'name',
 'notRepairedDamage',
 'offerType',
 'power',
 'seller',
 'train',
 'v_0',
 'v_1',
 'v_10',
 'v_11',
 'v_12',
 'v_13',
 'v_14',
 'v_2',
 'v_3',
 'v_4',
 'v_5',
 'v_6',
 'v_7',
 'v_8',
 'v_9',
 'used_time',
 'city',
 'brand_amount',
 'brand_price_max',
 'brand_price_median',
 'brand_price_min',
 'brand_price_sum',
 'brand_price_std',
 'brand_price_average',
 'power_bin']

4.4.2 Linear regression & five-fold cross-validation & simulating the real business scenario

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']

4.4.2 - 1 Simple modeling

from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)  # note: the normalize argument was removed in scikit-learn 1.2; on newer versions drop it and scale features separately (e.g. StandardScaler)
model = model.fit(train_X, train_y)

Inspect the intercept and weights (coef) of the trained linear regression model.

print('intercept:'+ str(model.intercept_))

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

output

intercept:-110670.68276350443
[('v_6', 3367064.3416419127),
 ('v_8', 700675.5609398747),
 ('v_9', 170630.27723219778),
 ('v_7', 32322.66193202066),
 ('v_12', 20473.670796956518),
 ('v_3', 17868.079541495277),
 ('v_11', 11474.93899667472),
 ('v_13', 11261.764560013775),
 ('v_10', 2683.920090636247),
 ('gearbox', 881.8225039248076),
 ('fuelType', 363.9042507217425),
 ('bodyType', 189.60271012071334),
 ('city', 44.94975120521506),
 ('power', 28.553901616753308),
 ('brand_price_median', 0.5103728134078945),
 ('brand_price_std', 0.4503634709263306),
 ('brand_amount', 0.1488112039506502),
 ('brand_price_max', 0.0031910186703142654),
 ('SaleID', 5.355989919859153e-05),
 ('seller', 1.7292331904172897e-05),
 ('offerType', -3.609340637922287e-06),
 ('train', -8.841510862112045e-06),
 ('brand_price_sum', -2.1750068681876416e-05),
 ('name', -0.0002980012713070219),
 ('used_time', -0.0025158943328975666),
 ('brand_price_average', -0.4049048451011582),
 ('brand_price_min', -2.2467753486897433),
 ('power_bin', -34.42064411727693),
 ('v_14', -274.78411807745636),
 ('kilometer', -372.8975266607184),
 ('notRepairedDamage', -495.1903844630634),
 ('v_0', -2045.0549573528133),
 ('v_5', -11022.986240572642),
 ('v_4', -15121.731109855522),
 ('v_2', -26098.299920511883),
 ('v_1', -45556.18929722599)]

Analyzing the result

from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

Plot a scatter of feature v_9 against the label. The figure shows that the model's predictions (blue points) differ noticeably in distribution from the true labels (black points), and some predicted values are even below 0, indicating that our model has some problems.

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()
The predicted price is obviously different from the true price
數(shù)據(jù)挖掘之建模調(diào)參_20_1.png

From the plots we find that the label (price) follows a long-tailed distribution, which is unfavorable for modeling and prediction. The reason is that many models assume the error term is normally distributed, and long-tailed data violate this assumption. Reference: https://blog.csdn.net/Noob_daniel/article/details/76087829. Generally speaking, it helps if the features are roughly normally distributed, and the target should be as well.

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])
It is clear to see the price shows a typical exponential distribution
數(shù)據(jù)挖掘之建模調(diào)參_22_2.png

Here we apply a log(x+1) transform to the label to bring it closer to a normal distribution.

train_y_ln = np.log(train_y + 1)
import seaborn as sns
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
The transformed price seems like normal distribution
數(shù)據(jù)挖掘之建模調(diào)參_25_2.png
model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

output

intercept:18.750749465596538
[('v_9', 8.052409900567598),
 ('v_5', 5.764236596653164),
 ('v_12', 1.6182081236790842),
 ('v_1', 1.479831058297348),
 ('v_11', 1.1669016563604946),
 ('v_13', 0.9404711296033863),
 ('v_7', 0.7137273083565177),
 ('v_3', 0.6837875771083749),
 ('v_0', 0.00850051800990644),
 ('power_bin', 0.00849796930289138),
 ('gearbox', 0.007922377278334032),
 ('fuelType', 0.006684769706830405),
 ('bodyType', 0.004523520092703996),
 ('power', 0.0007161894205359879),
 ('brand_price_min', 3.334351114751914e-05),
 ('brand_amount', 2.8978797042783603e-06),
 ('brand_price_median', 1.2571172872987556e-06),
 ('brand_price_std', 6.659176363456033e-07),
 ('brand_price_max', 6.194956307516954e-07),
 ('brand_price_average', 5.999345965077057e-07),
 ('SaleID', 2.11941700396436e-08),
 ('seller', 3.923616986867273e-11),
 ('train', -1.5702994460298214e-11),
 ('offerType', -2.2708945834892802e-11),
 ('brand_price_sum', -1.5126504215939166e-10),
 ('name', -7.015512588894369e-08),
 ('used_time', -4.122479372349444e-06),
 ('city', -0.0022187824810420242),
 ('v_14', -0.004234223418117319),
 ('kilometer', -0.013835866226884267),
 ('notRepairedDamage', -0.27027942349846545),
 ('v_4', -0.8315701200994575),
 ('v_2', -0.9470842241619425),
 ('v_10', -1.6261466689774864),
 ('v_8', -40.343007487616305),
 ('v_6', -238.79036385506956)]

Visualizing again, the predictions are now close to the true values and no anomalies (such as negative prices) appear.

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')  # np.expm1 inverts the log(x+1) transform
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()
The predicted price seems normal after np.log transforming
數(shù)據(jù)挖掘之建模調(diào)參_28_1.png

4.4.2 - 2 Five-fold cross-validation

When training model parameters on a training set, you will often see the full dataset split into three parts (for example with the MNIST handwriting dataset): a training set (train_set), a validation set (valid_set) and a test set (test_set). This split is made deliberately to guarantee the quality of training. The test set is easy to understand: it is data that never takes part in training and is only used to measure final performance. The training and validation sets relate to the idea below.

In practice, a trained model usually fits the training set quite well, but its fit on data outside the training set is often much less satisfactory. Therefore we normally do not train on all of the data; instead we hold out a portion (which does not take part in training) and use it to evaluate the parameters learned from the training set, giving a relatively objective measure of how well those parameters generalize to data outside the training set. This idea is called cross-validation.
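
Conceptually, k-fold cross-validation is equivalent to the following manual loop (a minimal sketch for illustration; the cross_val_score helper used below wraps this up for us):

from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# Split the data into 5 folds; each fold takes a turn as the held-out validation set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_maes = []
for train_idx, val_idx in kf.split(train_X):
    fold_model = LinearRegression().fit(train_X.iloc[train_idx], train_y_ln.iloc[train_idx])
    pred = fold_model.predict(train_X.iloc[val_idx])
    fold_maes.append(mean_absolute_error(train_y_ln.iloc[val_idx], pred))
print('AVG:', np.mean(fold_maes))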

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer
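# log_transfer wraps a metric so that it is evaluated on log-transformed values;
# this makes the MAE on the raw-price model comparable to the log-label results below.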
def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))

output

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.0s finished

Five-fold cross-validation of the linear regression model on the data with the untransformed label (error 1.36):

print('AVG:', np.mean(scores))

output

AVG: 1.3658023920314064

Five-fold cross-validation of the linear regression model on the data with the log-transformed label (error 0.19):

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))

output

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.1s finished
print('AVG:', np.mean(scores))

output

AVG: 0.19325301837047434
scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores

4.4.2 - 3 Simulating the real business scenario

In reality, however, we cannot see into the future. On time-dependent datasets, five-fold cross-validation can actually give an unrealistic picture: predicting 2017 used-car prices from 2018 data is clearly unreasonable. We can therefore also split the dataset by time order. In this example we take the earliest 4/5 of the samples as the training set and the most recent 1/5 as the validation set; the final result does not differ much from five-fold cross-validation.

import datetime
sample_feature = sample_feature.reset_index(drop=True)
# after splitting the data into five parts, this is the point that separates off the most recent fifth
split_point = len(sample_feature) // 5 * 4
train = sample_feature.loc[:split_point].dropna()
val = sample_feature.loc[split_point:].dropna()

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)
model = model.fit(train_X, train_y_ln)
mean_absolute_error(val_y_ln, model.predict(val_X))

output

0.19577667270300955
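
scikit-learn also provides a TimeSeriesSplit cross-validator that automates this kind of forward-in-time split (a minimal sketch, assuming the rows are already ordered by time):

from sklearn.model_selection import TimeSeriesSplit

# Each split trains on an earlier chunk and validates on the chunk that follows it.
ts_scores = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(train_X):
    m = LinearRegression().fit(train_X.iloc[train_idx], train_y_ln.iloc[train_idx])
    ts_scores.append(mean_absolute_error(train_y_ln.iloc[val_idx], m.predict(train_X.iloc[val_idx])))
print('AVG:', np.mean(ts_scores))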

4.4.2 - 4 Plotting learning curves and validation curves

from sklearn.model_selection import learning_curve, validation_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )):  
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training example')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis=1)  
    train_scores_std = np.std(train_scores, axis=1)  
    test_scores_mean = np.mean(test_scores, axis=1)  
    test_scores_std = np.std(test_scores, axis=1)  
    plt.grid()  # show grid  
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha=0.1,  
                     color="r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha=0.1,  
                     color="g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',  
             label="Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",  
             label="Cross-validation score")  
    plt.legend(loc="best")  
    return plt  
plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)  
數(shù)據(jù)挖掘之建模調(diào)參_52_1.png
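
The validation_curve helper imported above works the same way, except that it varies a single hyperparameter instead of the training-set size. A minimal sketch using Ridge regression, with an arbitrary alpha range chosen just for illustration:

from sklearn.linear_model import Ridge

# Score (MAE) as a function of the Ridge penalty strength alpha.
alpha_range = np.logspace(-3, 2, 6)
train_scores, test_scores = validation_curve(Ridge(), train_X[:1000], train_y_ln[:1000],
                                             param_name='alpha', param_range=alpha_range,
                                             cv=5, scoring=make_scorer(mean_absolute_error))
plt.plot(alpha_range, test_scores.mean(axis=1), 'o-', label='Cross-validation MAE')
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('MAE')
plt.legend(loc='best')
plt.show()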

4.4.3 Comparing multiple models

train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)

4.4.3 - 1 Linear models & embedded feature selection

This section assumes the reader already understands concepts such as overfitting, model complexity and regularization; otherwise, please consult related materials or the recommended links first.

In filter and wrapper feature-selection methods, feature selection is clearly separated from training the learner. Embedded feature selection, by contrast, selects features automatically during training. The most common embedded methods are L1 and L2 regularization; adding these two penalties to a linear regression model turns it into Lasso regression (L1) and Ridge regression (L2) respectively.
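
For reference, the two penalized least-squares objectives can be written (up to constant scaling factors, which differ slightly between implementations) as

  Ridge (L2):  \min_w \|Xw - y\|_2^2 + \alpha \|w\|_2^2
  Lasso (L1):  \min_w \|Xw - y\|_2^2 + \alpha \|w\|_1

where \alpha controls the regularization strength.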

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
models = [LinearRegression(),
          Ridge(),
          Lasso()]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
Ridge is finished
Lasso is finished

Comparing the performance of the three methods

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(abs(model.coef_), continuous_feature_names)
intercept:18.750417086915647
數(shù)據(jù)挖掘之建模調(diào)參_63_2.png

L2 regularization tends to make the weights as small as possible during fitting, finally producing a model in which all parameters are fairly small. A model with small parameters is generally considered simpler, adapts better to different datasets, and to some extent avoids overfitting. Intuitively, for a linear regression equation with very large parameters, even a tiny shift in the data causes a large change in the result; if the parameters are small enough, a somewhat larger shift in the data still has little effect on the result. The more technical way to say this is that the model is robust to perturbations.

model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(abs(model.coef_), continuous_feature_names)
intercept:4.671709787512927
數(shù)據(jù)挖掘之建模調(diào)參_65_2.png
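
To see the shrinkage effect directly, we can refit Ridge with increasing penalty strength and watch the average coefficient magnitude drop (a minimal sketch; the alpha values are arbitrary choices for illustration):

# Stronger L2 penalty (larger alpha) -> smaller coefficients on average.
for alpha in [0.1, 1, 10, 100]:
    ridge = Ridge(alpha=alpha).fit(train_X, train_y_ln)
    print('alpha={:>6}: mean |coef| = {:.6f}'.format(alpha, np.mean(np.abs(ridge.coef_))))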

L1 regularization helps produce a sparse weight vector, which can then be used for feature selection. As the figure below shows, the power and used_time features are the most important.

model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(abs(model.coef_), continuous_feature_names)
intercept:8.672182402894254
數(shù)據(jù)挖掘之建模調(diào)參_67_2.png

Beyond this, when a decision tree chooses split nodes by information entropy or the Gini index, the features chosen earlier for splitting are also more important; this is likewise a form of feature selection. The feature importance scores (feature_importances_) of XGBoost and LightGBM models are computed on this basis.
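
As a quick illustration (a minimal sketch using the LightGBM scikit-learn API; split-based importances are exposed through the feature_importances_ attribute):

from lightgbm.sklearn import LGBMRegressor

# Fit a LightGBM model and list the features it splits on most often.
lgb_model = LGBMRegressor(n_estimators=100).fit(train_X, train_y_ln)
importance = pd.Series(lgb_model.feature_importances_, index=continuous_feature_names)
print(importance.sort_values(ascending=False).head(10))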

4.4.3 - 2 Non-linear models

Besides linear models, there are many commonly used non-linear models, listed below. Space is limited, so we will not explain each one's theory; we simply pick several common models and compare their performance with the linear model.

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100), 
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
DecisionTreeRegressor is finished
RandomForestRegressor is finished
GradientBoostingRegressor is finished
MLPRegressor is finished
XGBRegressor is finished
LGBMRegressor is finished
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
     LinearRegression  DecisionTreeRegressor  RandomForestRegressor  GradientBoostingRegressor  MLPRegressor  XGBRegressor  LGBMRegressor
cv1          0.191642               0.184566               0.136266                   0.168626    124.299426      0.168698       0.141159
cv2          0.194986               0.187029               0.139693                   0.171905    257.886236      0.172258       0.143363
cv3          0.192737               0.184839               0.136871                   0.169553    236.829589      0.168604       0.142137
cv4          0.195329               0.182605               0.138689                   0.172299    130.197264      0.172474       0.143461
cv5          0.194450               0.186626               0.137420                   0.171206    268.090236      0.170898       0.141921

We can see that the random forest model achieves the best result in every fold.

4.4.4 Model tuning

Here we introduce three commonly used tuning methods:

## LGBM parameter search space:

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

4.4.4 - 1 Greedy tuning

Greedy tuning optimizes one hyperparameter at a time: fix the best value found for the current parameter, then move on to tune the next one.

best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
    
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
    
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
sns.lineplot(x=['0_initial','1_turning_obj','2_turning_leaves','3_turning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])
數(shù)據(jù)挖掘之建模調(diào)參_81_1.png

4.4.4 - 2 Grid search tuning

Grid search exhaustively evaluates every combination in the parameter grid with cross-validation and keeps the best one.

from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
clf = clf.fit(train_X, train_y_ln)
clf.best_params_
{'max_depth': 15, 'num_leaves': 55, 'objective': 'regression'}
model = LGBMRegressor(objective='regression',
                          num_leaves=55,
                          max_depth=15)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
0.13754820909576437

4.4.4 - 3 Bayesian tuning

Bayesian optimization fits a probabilistic surrogate model to the results of previous trials and uses it to choose the next hyperparameter values to evaluate, usually needing far fewer trials than grid search.

from bayes_opt import BayesianOptimization
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
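    # BayesianOptimization maximizes its objective, so return 1 - MAE;
    # the MAE itself is recovered later as 1 - rf_bo.max['target'].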
    return 1 - val
rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)
rf_bo.maximize()
1 - rf_bo.max['target']
0.1305927066054554
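
The best hyperparameters found by the optimizer can then be read back and used to fit a final model (a minimal sketch; bayes_opt stores them in rf_bo.max['params'], and the integer-valued parameters need to be cast back to int):

best_params = rf_bo.max['params']
final_model = LGBMRegressor(objective='regression_l1',
                            num_leaves=int(best_params['num_leaves']),
                            max_depth=int(best_params['max_depth']),
                            subsample=best_params['subsample'],
                            min_child_samples=int(best_params['min_child_samples']))
np.mean(cross_val_score(final_model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                        scoring=make_scorer(mean_absolute_error)))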