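What is Grid Search?
Grid Search is a hyperparameter tuning technique based on exhaustive search: loop over every candidate combination of parameter values, try each one, and keep the combination that performs best. The principle is the same as finding the maximum value in an array. (Why is it called grid search? Take a model with two parameters: if parameter a has 3 candidate values and parameter b has 4, listing every combination gives a 3*4 table. Each cell of the table is one grid point, and the loop visits and evaluates the cells one by one, hence the name grid search.)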
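The 3*4 table described above can be sketched in a few lines. This is only an illustration: the candidate values and the `toy_score` function are made up here and stand in for "train a model and measure it".

```python
from itertools import product

# Hypothetical candidate values: parameter a has 3 options, parameter b has 4.
a_values = [0.1, 1, 10]
b_values = [1, 2, 3, 4]

# Every (a, b) pair is one cell of the 3 x 4 grid.
grid = list(product(a_values, b_values))
print(len(grid))  # 12


def toy_score(a, b):
    # Illustrative stand-in for training and scoring a model.
    return -(a - 1) ** 2 - (b - 3) ** 2


# Visiting every cell and keeping the best one is the same idea as
# finding the maximum value in an array.
best_a, best_b = max(grid, key=lambda cell: toy_score(*cell))
print(best_a, best_b)  # 1 3
```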
Simple Grid Search
Take the tuning of 2 parameters as an example:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,random_state=0)
print("Size of training set:{} size of testing set:{}".format(X_train.shape[0],X_test.shape[0]))

#### grid search start
best_score = 0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)  # train one model for each parameter combination
        svm.fit(X_train,y_train)
        score = svm.score(X_test,y_test)
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma':gamma,'C':C}
#### grid search end

print("Best score:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
Output:
Size of training set:112 size of testing set:38
Best score:0.97
Best parameters:{'gamma': 0.001, 'C': 100}
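The problem:
After the original dataset is split into a training set and a test set, the test set is used both to tune the parameters and to measure how good the model is. As a result, the final score is better than the model's real performance, because the test set has already influenced the model during tuning, whereas our goal is to apply the trained model to unseen data.
The solution:
Split the training set once more, into a training set and a validation set. The original data is then divided into three parts: training set, validation set, and test set. The training set is used to fit the model, the validation set to tune the parameters, and the test set to measure how well the final model performs.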
X_trainval,X_test,y_trainval,y_test = train_test_split(iris.data,iris.target,random_state=0)
X_train,X_val,y_train,y_val = train_test_split(X_trainval,y_trainval,random_state=1)
print("Size of training set:{} size of validation set:{} size of testing set:{}".format(X_train.shape[0],X_val.shape[0],X_test.shape[0]))

best_score = 0.0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)
        svm.fit(X_train,y_train)
        score = svm.score(X_val,y_val)  # tune on the validation set, not the test set
        if score > best_score:
            best_score = score
            best_parameters = {'gamma':gamma,'C':C}

svm = SVC(**best_parameters)  # build a new model with the best parameters
svm.fit(X_trainval,y_trainval)  # retrain on training + validation data; more data usually helps
test_score = svm.score(X_test,y_test)  # final evaluation on the untouched test set
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Best score on test set:{:.2f}".format(test_score))
Output:
Size of training set:84 size of validation set:28 size of testing set:38
Best score on validation set:0.96
Best parameters:{'gamma': 0.001, 'C': 10}
Best score on test set:0.92
However, the final result of this simple grid search depends heavily on how the initial data happens to be split. To reduce the influence of chance, we use cross-validation.
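One way to see this dependence on the split is to hold the parameters fixed and change only the random seed of the training/validation split. The sketch below assumes the parameters gamma=0.001 and C=10 and an arbitrary choice of five seeds; the exact scores will depend on the scikit-learn version, so none are quoted here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

val_scores = []
for seed in range(5):  # assumption: five different splits are enough to illustrate
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, random_state=seed)
    svm = SVC(gamma=0.001, C=10).fit(X_train, y_train)
    val_scores.append(svm.score(X_val, y_val))

# Same model, same parameters; only the split differs.
print(val_scores)
```

If the printed scores differ noticeably from seed to seed, a single validation split is a noisy basis for choosing parameters, which is the motivation for averaging over several folds.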
Grid Search with Cross Validation
from sklearn.model_selection import cross_val_score

best_score = 0.0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)
        scores = cross_val_score(svm,X_trainval,y_trainval,cv=5)  # 5-fold cross-validation
        score = scores.mean()  # average over the 5 folds
        if score > best_score:
            best_score = score
            best_parameters = {"gamma":gamma,"C":C}

svm = SVC(**best_parameters)
svm.fit(X_trainval,y_trainval)
test_score = svm.score(X_test,y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))
Output:
Best score on validation set:0.97
Best parameters:{'gamma': 0.01, 'C': 100}
Score on testing set:0.97
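Cross-validation is often combined with grid search as the way to evaluate each parameter setting; the combination is called grid search with cross validation. sklearn provides the GridSearchCV class for this purpose. It implements fit, predict, score and other methods, so it can be used like any other estimator. Calling fit does two things: (1) it searches for the best parameters; (2) it instantiates an estimator with those best parameters and, by default, refits it on the whole training set.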
from sklearn.model_selection import GridSearchCV

# list the parameters to tune together with their candidate values
param_grid = {"gamma":[0.001,0.01,0.1,1,10,100],
              "C":[0.001,0.01,0.1,1,10,100]}
print("Parameters:{}".format(param_grid))

grid_search = GridSearchCV(SVC(),param_grid,cv=5)  # instantiate a GridSearchCV object
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,random_state=10)
grid_search.fit(X_train,y_train)  # search for the best parameters, then refit a new SVC estimator with them
print("Test set score:{:.2f}".format(grid_search.score(X_test,y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
# best_score_ is the mean cross-validation score of the best parameters, not a training-set score
print("Best cross-validation score:{:.2f}".format(grid_search.best_score_))
Output:
Parameters:{'gamma': [0.001, 0.01, 0.1, 1, 10, 100], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}
Test set score:0.97
Best parameters:{'C': 10, 'gamma': 0.1}
Best cross-validation score:0.98
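Beyond the single best result, a fitted GridSearchCV object also keeps the per-candidate scores. The following is a minimal sketch of how they might be inspected, reusing the same data split and grid as above; printing only the top three candidates is an arbitrary choice made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=10)

param_grid = {"gamma": [0.001, 0.01, 0.1, 1, 10, 100],
              "C": [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# best_estimator_ is the SVC refit on all of X_train with the best parameters.
print(grid_search.best_estimator_)

# cv_results_ holds one entry per candidate; sort by mean cross-validation score.
results = grid_search.cv_results_
order = results["mean_test_score"].argsort()[::-1]
for i in order[:3]:
    print(results["params"][i], round(results["mean_test_score"][i], 3))
```

Looking at the runners-up as well as the winner can show whether the best score sits on a broad plateau or on a narrow peak at the edge of the grid.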
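The common drawback of grid search is that it is time-consuming: the more parameters there are and the more candidate values each one has, the longer it takes. So in practice, one usually searches a wide range coarsely first and then refines around the promising region.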
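The coarse-then-fine strategy could look roughly like the sketch below. The `refine` helper, the half-decade window, and the five points per parameter are all assumptions made for this illustration, not a recommendation from sklearn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=10)

# Stage 1: coarse search, one candidate per decade.
coarse_grid = {"gamma": [0.001, 0.01, 0.1, 1, 10, 100],
               "C": [0.001, 0.01, 0.1, 1, 10, 100]}
coarse_search = GridSearchCV(SVC(), coarse_grid, cv=5).fit(X_train, y_train)


def refine(best_value, n_points=5):
    # Hypothetical helper: log-spaced values within half a decade
    # on either side of the coarse winner.
    center = np.log10(best_value)
    return np.logspace(center - 0.5, center + 0.5, n_points)


# Stage 2: finer search around the coarse winner only.
fine_grid = {name: refine(value)
             for name, value in coarse_search.best_params_.items()}
fine_search = GridSearchCV(SVC(), fine_grid, cv=5).fit(X_train, y_train)

print("Coarse best:", coarse_search.best_params_, coarse_search.best_score_)
print("Fine best:", fine_search.best_params_, fine_search.best_score_)
```

The second stage evaluates 25 candidates instead of the several hundred that a uniformly fine grid over the whole range would need.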
In summary
- Grid Search: a tuning method that exhaustively searches a list of candidate parameter values, trains a model for each combination, and picks the best parameters. Its main drawback follows directly: it is time-consuming.
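To put a number on the cost, the model fits required by the grid used in this article can be counted directly. The third parameter added at the end (`kernel` with three candidates) is hypothetical and only there to show how the count grows.

```python
from math import prod

param_grid = {"gamma": [0.001, 0.01, 0.1, 1, 10, 100],
              "C": [0.001, 0.01, 0.1, 1, 10, 100]}
cv = 5

# One candidate per cell of the grid, one fit per candidate per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv
print(n_candidates, n_fits)  # 36 180

# Hypothetical third parameter with 3 candidate values: the cost triples.
param_grid["kernel"] = ["rbf", "poly", "sigmoid"]
n_fits_3 = prod(len(values) for values in param_grid.values()) * cv
print(n_fits_3)  # 540
```

The count is the product of the candidate-list lengths times the number of folds, so every extra parameter multiplies the total rather than adding to it.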