Python 2.7
IDE PyCharm 5.0.3
scikit-learn 0.18.1
Preface
Just being cheeky here, please don't come after me: there is no theoretical proof behind this, only model tests showing that it really does work, and the waiting time drops to roughly one tenth of the original. Thrilling, isn't it? Hahaha.
Principle
Basic idea: first locate the promising region with a coarse search, then refine it, and refine it again; the search window stretches and shrinks, hence "Flexible". Below this method is referred to as FCV.
If you are not sure what CV is, see @MrLevo520--總結(jié):Bias(偏差),Error(誤差),Variance(方差)及CV(交叉驗(yàn)證)
Pseudocode
The principle is easy to follow, so let's go straight to the pseudocode. I was too lazy to type it out, so see the hand-written notes (Appendix 1).
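For readers who prefer text, here is a minimal sketch of the same loop for a single parameter, in Python 2 style. The helper grid_best only stands in for one GridSearchCV pass and is not part of the original code; the real two-parameter implementation is in the "FCV code" section below.

def fcv(train_x, train_y, low, high, step, grid_best):
    # one coarse grid-search pass over the current window
    best = grid_best(train_x, train_y, range(low, high, step))
    # centre the next window on the winner; the extra step/2 is the elastic coefficient (0.5*Step)
    low, high = best - step - step / 2, best + step + step / 2
    step = step / 2  # halve the step for the next, finer pass
    if step > 0:
        return fcv(train_x, train_y, low, high, step, grid_best)
    return best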
FCV timing test
Taking GBDT as an example, I tested with n_estimators from 190 to 300, max_depth from 2 to 9, and CV=3.
Plain GridSearchCV fitted 110×7×3 = 2310 times, taking 1842 min (about 30.7 hours), and found the best parameters n_estimators=289, max_depth=3.
FCV fitted 345 times in total, ran 166 min (about 2.8 hours), and found n_estimators=256, max_depth=3.
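As a sanity check on those fit counts (my own arithmetic, assuming GridSearchCV's default cv=3 in scikit-learn 0.18 and no clipping at the range boundaries): the plain grid evaluates 110 values of n_estimators × 7 values of max_depth × 3 folds = 2310 fits, while FCV runs four shrinking passes of 11×7×3 = 231, 6×2×3 = 36, 7×2×3 = 42 and 6×2×3 = 36 fits, i.e. 231 + 36 + 42 + 36 = 345 fits in total.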
That is roughly an 11x difference in time. What about the quality of the result? See the CV scores below.
FCV effectiveness test
I compared GBDT, RF, XGBoost and SVM with cross validation; within each algorithm all other parameters were kept the same.
GBDT results
# shared imports for the comparison snippets below; train_data / train_label
# are loaded elsewhere (see the FCV code section)
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

clf1 = GradientBoostingClassifier(max_depth=3, n_estimators=289)   # best of plain grid search
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=5)
print score1
-------------------------------------
clf2 = GradientBoostingClassifier(max_depth=3, n_estimators=256)   # best of FCV
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=5)
print score2
------------------------------------
# cross-validation scores of the two methods
# plain grid search, CV=5: [ 0.79807692 0.82038835 0.80684597 0.76108374 0.78163772]
# FCV, CV=5: [ 0.79567308 0.82038835 0.799511 0.76847291 0.78411911]
---------------------------------------
# plain grid search, CV=10: [ 0.83333333 0.78571429 0.84615385 0.7961165 0.81067961 0.80097087 0.77227723 0.78109453 0.785 0.74111675]
# FCV, CV=10: [ 0.85238095 0.78095238 0.85096154 0.7961165 0.81553398 0.7961165 0.76732673 0.79104478 0.795 0.75126904]
XGBoost results
clf1 = XGBClassifier(max_depth=6, n_estimators=200)   # best of plain grid search
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=5)
print score1
clf2 = XGBClassifier(max_depth=4, n_estimators=292)   # best of FCV
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=5)
print score2
-----------------------------
# plain grid search, CV=5: [ 0.79086538 0.83737864 0.80929095 0.79310345 0.7866005 ]
# FCV, CV=5: [ 0.80288462 0.84466019 0.8190709 0.79064039 0.78163772]
RF results
Note: because of RF's inherent randomness (both the samples and the features are drawn randomly), the result is not stable even under cross validation. Running the same program as a multi-process job on a server and on my laptop gave best values of 247 and 253 respectively, so don't expect them to agree exactly. (As if GBDT and XGBoost weren't tree ensembles too 0.o)
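If you want the comparison below to be repeatable run after run, one option (not used in the original experiment) is to pin the seed; random_state is a standard scikit-learn parameter:

# pinning the seed makes the RF scores reproducible; the value 42 is arbitrary
clf = RandomForestClassifier(n_estimators=253, random_state=42)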
clf1 = RandomForestClassifier(n_estimators=205)   # best of plain grid search
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=5)
print score1
clf2 = RandomForestClassifier(n_estimators=253)   # best of FCV
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=5)
print score2
---------------
# plain grid search, CV=5: [ 0.77884615 0.82038835 0.78239609 0.79310345 0.74937965]
# FCV, CV=5: [ 0.75721154 0.83495146 0.79217604 0.79310345 0.75186104]
SVM results
clf1 = SVC(kernel='rbf', C=16, gamma=0.18)        # best of the default search
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=5)
print score1
clf2 = SVC(kernel='rbf', C=94.75, gamma=0.17)     # best of FCV
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=5)
print score2
# default grid search over (2^-8, 2^8): [ 0.88468992 0.88942774 0.88487805 0.8856305 0.8745098 ]
# FCV search: [ 0.90116279 0.9185257 0.90146341 0.90713587 0.89509804]
FCV advantages
- It splits the search range, which greatly improves speed; compared with plain grid search the speedup is roughly a factor of Step, and the wider the search range, the bigger the gain.
- The elastic coefficient can be tuned to your own data set (default 0.5*Step); choosing it well reduces the chance of missing good values (a concrete example follows this list).
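As a concrete reading of the elastic coefficient (my interpretation of the code below, not a separate feature): with Step=10 and the default coefficient 0.5, the next window is centred on the current best value and spans best ± (10 + 5) = best ± 15 instead of best ± 10, so an optimum sitting just outside the coarse grid cell is less likely to be skipped; the step for that window is then halved to 5.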
FCV drawbacks
- There is some chance of getting stuck in a local optimum.
- It takes some experience to choose sensible range bounds and step sizes (Step).
FCV code
Using GBDT as the example (I rewrote the RF version as a multi-process one), searching two parameters at once. The concept is exactly the same as above, so if you followed the above, this code should pose no problem.
# -*- coding: utf-8 -*-
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
import LoadSolitData  # the author's own data-loading module


def gradient_boosting_classifier(train_x, train_y, Nminedge, Nmaxedge, Nstep, Dminedge, Dmaxedge, Dstep):
    # one GridSearchCV pass over the current window of n_estimators / max_depth
    model = GradientBoostingClassifier()
    param_grid = {'n_estimators': [i for i in range(Nminedge, Nmaxedge, Nstep)],
                  'max_depth': [j for j in range(Dminedge, Dmaxedge, Dstep)]}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print para, val
    return best_parameters["n_estimators"], best_parameters["max_depth"]


def FlexSearch(train_x, train_y, Nminedge, Nmaxedge, Nstep, Dminedge, Dmaxedge, Dstep):
    # recursively shrink the window around the current best values and halve the n_estimators step
    NmedEdge, DmedEdge = gradient_boosting_classifier(train_x, train_y, Nminedge, Nmaxedge, Nstep,
                                                      Dminedge, Dmaxedge, Dstep)
    Nminedge = NmedEdge - Nstep - Nstep / 2  # the extra Nstep/2 is the elastic coefficient (0.5*Step)
    Nmaxedge = NmedEdge + Nstep + Nstep / 2
    Dminedge = DmedEdge - Dstep
    Dmaxedge = DmedEdge + Dstep
    Nstep = Nstep / 2
    print "Current bestPara:N-%.2f,D-%.2f" % (NmedEdge, DmedEdge)
    print "Next Range:N-%d-%d,D-%d-%d" % (Nminedge, Nmaxedge, Dminedge, Dmaxedge)
    print "Next step:N-%.2f,D-%.2f" % (Nstep, Dstep)
    if Nstep > 0:
        return FlexSearch(train_x, train_y, Nminedge, Nmaxedge, Nstep, Dminedge, Dmaxedge, Dstep)
    else:
        print "The bestPara:N-%.2f,D-%.2f" % (NmedEdge, DmedEdge)
        return NmedEdge, DmedEdge


if __name__ == '__main__':
    # load your own data here; I do it in a separate helper module
    train_data, train_label, test_data, test_label = LoadSolitData.TrainSize(1, 0.2, [i for i in range(1, 14)], "outclass")
    NmedEdge, DmedEdge = FlexSearch(train_data, train_label, 190, 300, 10, 2, 9, 1)
    Classifier = GradientBoostingClassifier(n_estimators=NmedEdge, max_depth=DmedEdge)
    Classifier.fit(train_data, train_label)
    predict_label = Classifier.predict(test_data)
    report = metrics.classification_report(test_label, predict_label)
    print report
What's More
Combining multi-processing with FCV pushes the speed up even further. I'll tentatively name this the MFCV method.
If you are not sure what multi-processing and multi-threading are, see
@MrLevo520--Python小白帶小白初涉多進(jìn)程
@MrLevo520--Python小白帶小白初涉多線程
MFCV principle
On top of FCV, let gap = (maxedge - minedge) / k, where k is the number of processes. Cut the big range into k chunks, run FCV on each chunk in its own process to get k local optima, and then run one more round of CV over those k candidates to pick the final winner.
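A minimal sketch of the chunking step, using the same numbers as the MFCV code below (minedge=180, maxedge=330, k=3):

# split [180, 330) into k contiguous chunks; each chunk gets its own FCV process
minedge, maxedge, k = 180, 330, 3
gap = (maxedge - minedge) / k        # 50
chunks = [(minedge + (i - 1) * gap, minedge + i * gap) for i in range(1, k + 1)]
print chunks                         # [(180, 230), (230, 280), (280, 330)]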
MFCV advantages
- Even faster, thanks to concurrent processes.
- Because each chunk runs its own FCV and the chunk winners are then compared with one more round of CV, it partly avoids FCV's risk of falling into a local optimum.
MFCV drawbacks
- Unlike FCV, it currently only supports multi-process search over a single parameter; I haven't yet figured out how to do this for two parameters, and will report back if I do.
MFCV timing test
Since only a single parameter is supported, I used RF as the example: n_estimators from 190 to 330, CV=10. Plain GridSearchCV took 97.2 min, FCV took 18 min, and MFCV took 14 min. The total number of fits here is small, so the difference is not dramatic, but imagine what happens when a single parameter spans a very wide range...
MFCV effectiveness test
clf1 = RandomForestClassifier(n_estimators=205)   # best of plain grid search
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=10)
print score1
clf2 = RandomForestClassifier(n_estimators=239)   # best of MFCV
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=10)
print score2
# plain grid search, CV=10: [ 0.84169884 0.85328185 0.85907336 0.84912959 0.82101167 0.8540856 0.84015595 0.87968442 0.84980237 0.83003953]
# MFCV, CV=10: [ 0.84169884 0.86100386 0.85521236 0.86266925 0.82101167 0.85019455 0.83625731 0.87771203 0.85177866 0.83201581]
Compared with the plain method, the scores are essentially on par, while the speed is close to an order of magnitude better than plain grid search, so I consider it worth adopting.
MFCV code
#! /usr/bin/python
# -*- coding: utf-8 -*-
from multiprocessing import Process
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
import Confusion_Show as cs   # the author's own confusion-matrix plotting helper
import LoadSolitData          # the author's own data-loading module


def write2txt(data, storename):
    # each subprocess writes its best value to its own text file
    f = open(storename, "w")
    f.write(str(data))
    f.close()


def V_candidate(candidatelist, train_x, train_y):
    # final round: cross-validate the k chunk winners against each other
    model = RandomForestClassifier()
    param_grid = {'n_estimators': [i for i in candidatelist]}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1, cv=10)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print para, val
    return best_parameters["n_estimators"]


def random_forest_classifier(train_x, train_y, minedge, maxedge, step):
    # one GridSearchCV pass over the current window of n_estimators
    model = RandomForestClassifier()
    param_grid = {'n_estimators': [i for i in range(minedge, maxedge, step)]}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1, cv=10)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print para, val
    return best_parameters["n_estimators"]


def FlexSearch(train_x, train_y, minedge, maxedge, step, storename):
    # recursively shrink the window around the current best value and halve the step
    mededge = random_forest_classifier(train_x, train_y, minedge, maxedge, step)
    minedge = mededge - step - step / 2   # the extra step/2 is the elastic coefficient
    maxedge = mededge + step + step / 2
    step = step / 2
    print "Current bestPara:", mededge
    print "Next Range:%d-%d" % (minedge, maxedge)
    print "Next step:", step
    if step > 0:
        return FlexSearch(train_x, train_y, minedge, maxedge, step, storename)
    elif step == 0:
        print "The bestPara:", mededge
        write2txt(mededge, "RF_%s.txt" % storename)


def Calc(classifier):
    # train on the training set, evaluate on the held-out test set
    classifier.fit(train_data, train_label)
    predict_label = classifier.predict(test_data)
    predict_label_prob = classifier.predict_proba(test_data)
    total_cor_num = 0.0
    dictTotalLabel = {}
    dictCorrectLabel = {}
    for label_i in range(len(predict_label)):
        if predict_label[label_i] == test_label[label_i]:
            total_cor_num += 1
            if predict_label[label_i] not in dictCorrectLabel:
                dictCorrectLabel[predict_label[label_i]] = 0
            dictCorrectLabel[predict_label[label_i]] += 1.0
        if test_label[label_i] not in dictTotalLabel:
            dictTotalLabel[test_label[label_i]] = 0
        dictTotalLabel[test_label[label_i]] += 1.0
    accuracy = metrics.accuracy_score(test_label, predict_label) * 100
    kappa_score = metrics.cohen_kappa_score(test_label, predict_label)
    average_accuracy = 0.0
    label_num = 0
    for key_i in dictTotalLabel:
        try:
            average_accuracy += (dictCorrectLabel[key_i] / dictTotalLabel[key_i]) * 100
            label_num += 1
        except KeyError:
            # no correct prediction for this class; skip it
            pass
    average_accuracy = average_accuracy / label_num
    result = "OA:%.4f;AA:%.4f;KAPPA:%.4f" % (accuracy, average_accuracy, kappa_score)
    print result
    report = metrics.classification_report(test_label, predict_label)
    print report
    cm = metrics.confusion_matrix(test_label, predict_label)
    cs.ConfusionMatrixPng(cm, ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13'])


if __name__ == '__main__':
    # load your own data set here
    train_data, train_label, test_data, test_label = LoadSolitData.TrainSize(1, 0.5, [i for i in range(1, 14)], "outclass")
    minedge = 180
    maxedge = 330
    step = 10
    gap = (maxedge - minedge) / 3
    subprocess = []
    for i in range(1, 4):  # three worker processes, one per chunk
        p = Process(target=FlexSearch,
                    args=(train_data, train_label, minedge + (i - 1) * gap, minedge + i * gap, step, i))
        subprocess.append(p)
    for j in subprocess:
        j.start()
    for k in subprocess:
        k.join()
    candidatelist = []
    for i in range(1, 4):
        # collect the best value found by each subprocess
        with open("RF_%s.txt" % i) as f:
            candidatelist.append(int(f.readlines()[0].strip()))
    print "candidatelist:", candidatelist
    best_n_estimators = V_candidate(candidatelist, train_data, train_label)
    print "best_n_estimators", best_n_estimators
    Classifier = RandomForestClassifier(n_estimators=best_n_estimators)
    Calc(Classifier)
Summary
Use MFCV for single-parameter tuning and FCV for multi-parameter tuning. The CV scores show that the results match plain exhaustive grid search, while the speed improvement is substantial, and the wider the search range, the more obvious the gain. ISO9001-certified, enjoy with confidence..... Yes, the algorithm is very simple, but who knows, it might catch on some day, so all rights reserved; please credit this post when reposting. (Need I mention how an idea as simple as Bootstrap came to dominate statistics? BTW, the idea behind GridSearch is pretty simple too.)
Appendix 1 -- FCV hand-written notes
Hand-written sketch 1
Hand-written sketch 2
Appendix 2
If you also want to try XGBoost, which is an excellent classifier, I strongly recommend tuning it with FCV: thanks to OpenMP, XGBoost automatically uses every CPU core for parallel training, so the moment it starts the machine goes into roast-chicken mode, and keeping the number of fits during tuning as low as possible really matters!
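If the roast-chicken mode bothers you, one knob worth knowing (not used in my experiments) is the thread count of the sklearn wrapper; depending on your xgboost version the argument is nthread or n_jobs:

from xgboost import XGBClassifier
# cap the number of OpenMP threads so the machine stays responsive while tuning
clf = XGBClassifier(max_depth=4, n_estimators=292, nthread=2)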
For how to install XGBoost, see @MrLevo520--解決:win10_x64 xgboost python安裝所遇到問(wèn)題
Acknowledgements
@abcjennifer--python并行調(diào)參——scikit-learn grid_search
@MrLevo520--解決:win10_x64 xgboost python安裝所遇到問(wèn)題
@wzmsltw--XGBoost-Python完全調(diào)參指南-參數(shù)解釋篇
@u010414589--xgboost 調(diào)參經(jīng)驗(yàn)
@MrLevo520--總結(jié):Bias(偏差)牺弹,Error(誤差)浦马,Variance(方差)及CV(交叉驗(yàn)證)
@MrLevo520--Python小白帶小白初涉多進(jìn)程
@MrLevo520--Python小白帶小白初涉多線程