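Model Tuning: Tuning a Random Forest on the Breast Cancer Dataset

I. Dataset

The breast cancer dataset bundled with scikit-learn (load_breast_cancer).

II. Model selection

The breast cancer dataset poses a binary classification problem; we pick a random forest classifier and tune it.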

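III. Tuning workflow

1) Fit a simple baseline model and see how it performs on the dataset
2) Tune n_estimators
3) Tune max_depth
4) Tune min_samples_leaf
5) Tune min_samples_split
6) Tune max_features
7) Tune criterion
8) Settle on the best parameter combination

IV. Step-by-step tuning

1) Import the libraries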

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

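2) Inspect the dataset

data=load_breast_cancer() # load the dataset (returns a dict-like Bunch)
data.keys()               # available fields; a Bunch has no .info() method
data.data.shape           # shape of the feature matrix
data.target.shape         # shape of the label vector
data.target               # the labels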
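As a complement to the shape checks above, the following minimal sketch prints the sample count, the feature count, and the number of samples per class. The variable names (X, y, class_counts) are illustrative and not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()  # dict-like Bunch, not a DataFrame

X, y = data.data, data.target
n_samples, n_features = X.shape

# Count how many samples carry each class label
labels, counts = np.unique(y, return_counts=True)
class_counts = {str(data.target_names[l]): int(c) for l, c in zip(labels, counts)}

print("samples:", n_samples, "features:", n_features)
print("class counts:", class_counts)
```

Knowing the class balance helps when reading the accuracy figures later on, since accuracy is easier to interpret when neither class dominates heavily.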

3) Fit a simple baseline model and see how it performs on the dataset

rfc=RandomForestClassifier(n_estimators=100,random_state=90)
score_pre=cross_val_score(rfc,data.data,data.target,cv=10).mean()
score_pre

score_pre comes out as 0.9666925935528475.
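The mean alone hides how much the ten folds disagree. The sketch below keeps the per-fold scores and reports their standard deviation. With 569 samples and cv=10, each fold holds roughly 57 samples, so one misclassified sample moves a fold's accuracy by about 0.018. That is worth keeping in mind when later steps compare differences in the third decimal place. The name fold_scores is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
rfc = RandomForestClassifier(n_estimators=100, random_state=90)

# One accuracy value per fold instead of only the mean
fold_scores = cross_val_score(rfc, data.data, data.target, cv=10)

print("per-fold accuracy:", fold_scores.round(4))
print(f"mean={fold_scores.mean():.4f}  std={fold_scores.std():.4f}")
```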

4) Tune n_estimators

scorel=[]
for i in range(1,201,10):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)
    score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),(scorel.index(max(scorel))*10)+1)
plt.figure(figsize=[20,5])
plt.plot(range(1,201,10),scorel)
plt.show()

Result:

The printed score and the learning curve show that, on this coarse grid, accuracy peaks at n_estimators=41, reaching 0.9684480598046841.
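The index arithmetic used to recover the best n_estimators (index*10+1) only works for that exact range. A slightly more general sketch, under the assumption that the candidates are kept in a list, looks up the winner by position instead. The grid below is deliberately short and uses cv=5 so that it runs quickly; it is not the grid used in this walkthrough.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
candidates = list(range(1, 62, 20))  # 1, 21, 41, 61: a short illustrative grid

scores = []
for n in candidates:
    rfc = RandomForestClassifier(n_estimators=n, random_state=90)
    scores.append(cross_val_score(rfc, data.data, data.target, cv=5).mean())

# Look up the best candidate by position rather than by index arithmetic
best_idx = int(np.argmax(scores))
best_n, best_score = candidates[best_idx], scores[best_idx]
print(best_n, best_score)
```

The same pattern carries over unchanged when the candidate list changes, which avoids off-by-one mistakes when the search range is narrowed.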

Next, narrow the range and look at n_estimators from 35 to 44 (range(35,45) excludes the upper bound).

scorel=[]
for i in range(35,45):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)
    score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),([*range(35,45)][scorel.index(max(scorel))]))
plt.figure(figsize=[20,5])
plt.plot(range(35,45,1),scorel)
plt.show()

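Result:

Tuning n_estimators has a clear effect: accuracy rose by about 0.0035 right away, with n_estimators=39 (the value used in all later steps) scoring 0.9719568317345088. Next we move on to grid search and tune the remaining parameters one at a time, to see how the complexity versus generalization-error view can guide parameter changes and improve accuracy.

5) Tune max_depth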

param_grid={'max_depth':[*np.arange(1,20,1)]}

rfc=RandomForestClassifier(n_estimators=39,random_state=90)
GS=GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)
GS.best_params_
GS.best_score_

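Result:

Grid search returns max_depth=11 as the best value, with a best score of 0.9718804920913884.

Here is the problem: compared with step 4, limiting max_depth lowered the accuracy instead of raising it. Tree-based models such as random forests are prone to overfitting, so reducing model complexity should in principle improve accuracy. Here, however, reducing the maximum depth made accuracy worse, which indicates that the model currently sits on the left side of the complexity versus generalization-error curve, that is, to the left of the error minimum. This depends on the dataset itself, but it may also be that the n_estimators we chose is already large for this dataset and has pulled the model toward the generalization-error minimum.

When the model is on the left side of the curve, we need options that increase model complexity (more variance, less bias), so max_depth should be as large as possible and min_samples_leaf and min_samples_split as small as possible. In practice this means that, apart from max_features, there is nothing left to tune, because max_depth, min_samples_leaf and min_samples_split are pruning parameters that only reduce complexity. At this point we can predict that we are already very close to the model's ceiling and that it probably cannot improve much further.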
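One way to look at where the model sits on the complexity curve is to compare training accuracy with cross-validated accuracy as max_depth grows. The sketch below uses scikit-learn's validation_curve for that. The depth values and cv=5 are illustrative choices made to keep the run short, not settings from this walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

data = load_breast_cancer()
depths = [1, 2, 4, 8, 16]  # illustrative depths, from very shallow to deep

# Rows correspond to depths, columns to CV folds
train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_estimators=39, random_state=90),
    data.data, data.target,
    param_name="max_depth", param_range=depths, cv=5,
)

train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for d, tr, te in zip(depths, train_mean, test_mean):
    print(f"max_depth={d:2d}  train={tr:.4f}  cv={te:.4f}")
```

If the cross-validated score keeps rising or stays flat while depth increases, that is consistent with the reading above that restricting depth does not help on this dataset. A cross-validated score that falls while the training score keeps rising would instead point to overfitting.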

6) Tune max_features

grid_param={'max_features':np.arange(5,30)}

rfc=RandomForestClassifier(n_estimators=39,random_state=90)
GS=GridSearchCV(rfc,grid_param,cv=10)
GS.fit(data.data,data.target)
GS.best_params_
GS.best_score_

Result:

Grid search returns max_features=5 as the best value, with a best score of 0.9718804920913884, so the accuracy is again slightly lower than the 0.9719568317345088 from step 4.
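A caveat when comparing this figure with the earlier cross_val_score mean: 0.9718804920913884 is exactly 553/569, that is, total correct predictions divided by total samples. A sample-weighted average over folds produces such a value, which is how older scikit-learn releases of GridSearchCV averaged by default (an assumption about the version used here). cross_val_score(...).mean(), in contrast, is an unweighted mean of per-fold accuracies. The sketch below uses made-up fold counts to show that the two averages can differ slightly when fold sizes are unequal.

```python
# Hypothetical per-fold results as (correct predictions, fold size).
# These numbers are invented for illustration only.
folds = [(56, 57), (55, 57), (57, 57), (54, 56)]

# Unweighted: average the per-fold accuracies
fold_acc = [correct / size for correct, size in folds]
unweighted = sum(fold_acc) / len(fold_acc)

# Sample-weighted: total correct divided by total samples
weighted = sum(c for c, _ in folds) / sum(n for _, n in folds)

print(f"unweighted mean of folds: {unweighted:.6f}")
print(f"sample-weighted average:  {weighted:.6f}")

# The grid-search score reported above matches a ratio over all 569 samples
ratio = 553 / 569
print(ratio)
```

If that is what happened here, part of the small gap between the step 4 score and the grid-search scores may come from the averaging method rather than from the parameter change.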

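Grid search returned the smallest max_features in the grid, so accuracy falls as max_features increases. In other words, pushing the model to the right (toward higher complexity) increased the generalization error. Earlier we pushed it to the left with max_depth, and now to the right with max_features, and the generalization error went up both times. This suggests that the model already sits at the generalization-error minimum and has reached its predictive ceiling, with nothing left for the parameters to influence. The remaining error is determined by noise, and variance and bias no longer have a role to play.

V. Tuning complete: the best parameter combination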
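best_params_ and best_score_ show only the winner. To check the claim that accuracy falls as max_features rises, the whole profile can be read from cv_results_. The following sketch does this on a short illustrative grid with cv=5; the grid values and the name rows are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

data = load_breast_cancer()
grid = {"max_features": [5, 10, 20, 30]}  # short illustrative grid

gs = GridSearchCV(RandomForestClassifier(n_estimators=39, random_state=90), grid, cv=5)
gs.fit(data.data, data.target)

# Collect (parameter value, mean CV score, std across folds) for every candidate
rows = [
    (params["max_features"], mean, std)
    for params, mean, std in zip(
        gs.cv_results_["params"],
        gs.cv_results_["mean_test_score"],
        gs.cv_results_["std_test_score"],
    )
]
for value, mean, std in rows:
    print(f"max_features={value:2d}  mean={mean:.4f}  std={std:.4f}")
```

Comparing the gaps between candidates with the fold-to-fold standard deviation gives a sense of whether the ranking is meaningful or within noise.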

RandomForestClassifier(n_estimators=39,random_state=90)

Accuracy before tuning: 0.9666925935528475 (96.67%)
Accuracy after tuning: 0.9719568317345088 (97.20%)
Accuracy gain: 0.0052642381816613 (+0.53 percentage points)
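The arithmetic behind these three figures can be reproduced directly from the two scores reported above. As an extra, the sketch also expresses the gain as a share of the original error, a framing added here for illustration.

```python
score_before = 0.9666925935528475  # baseline, n_estimators=100
score_after = 0.9719568317345088   # tuned, n_estimators=39

# Absolute gain in accuracy
gain = score_after - score_before

# Gain relative to the error the baseline model still made
relative_error_reduction = gain / (1 - score_before)

print(f"absolute gain: {gain:.16f}")
print(f"gain in percentage points: {gain * 100:.2f}")
print(f"share of baseline error removed: {relative_error_reduction:.1%}")
```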

·································································································································································
Full code:

# Import the libraries
from sklearn.datasets import load_breast_cancer     # breast cancer dataset loader
from sklearn.ensemble import RandomForestClassifier # random forest (ensemble) classifier
from sklearn.model_selection import cross_val_score # cross-validation
from sklearn.model_selection import GridSearchCV    # grid search
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

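# Dataset overview
data=load_breast_cancer()   # load the dataset (returns a dict-like Bunch)
data.keys()                 # available fields (a Bunch has no .info() method)
data.data.shape             # shape of the feature matrix
data.target.shape           # shape of the label vector
data.target                 # the labels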


# Fit a simple baseline model and see how it performs on the dataset
rfc=RandomForestClassifier(n_estimators=100,random_state=90)      # instantiate
score_pre=cross_val_score(rfc,data.data,data.target,cv=10).mean() # 10-fold cross-validation
score_pre

# Tune n_estimators
scorel=[]
for i in range(1,201,10):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)  # fit and score for n_estimators = 1, 11, ..., 191
    score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),(scorel.index(max(scorel))*10)+1)
plt.figure(figsize=[20,5])  # plot the learning curve
plt.plot(range(1,201,10),scorel)
plt.show()

scorel=[]
for i in range(35,45):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)  # fit and score for n_estimators = 35, 36, ..., 44
    score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),([*range(35,45)][scorel.index(max(scorel))]))
plt.figure(figsize=[20,5])   # plot the learning curve
plt.plot(range(35,45,1),scorel)
plt.show()

# Tune max_depth: grid search for the best value
param_grid={'max_depth':[*np.arange(1,20,1)]} # parameter and candidate values (1 to 19)
rfc=RandomForestClassifier(n_estimators=39,random_state=90) # instantiate
GS=GridSearchCV(rfc,param_grid,cv=10) # grid search with 10-fold cross-validation
GS.fit(data.data,data.target)  # fit
GS.best_params_   # best parameter value
GS.best_score_    # best score

# Tune max_features: grid search for the best value
grid_param={'max_features':np.arange(5,30)} # parameter and candidate values (5 to 29)
rfc=RandomForestClassifier(n_estimators=39,random_state=90) # instantiate
GS=GridSearchCV(rfc,grid_param,cv=10) # grid search with 10-fold cross-validation
GS.fit(data.data,data.target)  # fit
GS.best_params_  # best parameter value
GS.best_score_   # best score