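Hyperopt is a third-party Python library for hyperparameter optimization. I recently found out that it can use MongoDB to run evaluations in parallel, so I looked into it a bit and am writing down what I learned to share it.
I won't cover installing Mongo here; just follow the linked instructions.
Once it is installed, start mongo and run the official demo to see how it behaves: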
import math
from hyperopt import fmin, tpe, hp
from hyperopt.mongoexp import MongoTrials
trials = MongoTrials('mongo://localhost:1234/foo_db/jobs', exp_key='exp1')
best = fmin(math.sin, hp.uniform('x', -2, 2), trials=trials, algo=tpe.suggest, max_evals=10)
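In the code above, a MongoTrials object is instantiated and assigned to the variable trials. Its first argument points to the mongo process, the 'foo_db' database, and the 'jobs' collection. 'exp_key' identifies the experiment. (If you change this argument, hyperopt treats it as a new experiment and reruns the search instead of reading results from the database.)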
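To make the parts of that connection string concrete, here is a small sketch that splits it using only the standard library. The helper name split_mongo_url is made up for illustration, and this is not how hyperopt parses the string internally; it only shows which piece is the host, the port, the database, and the collection.

```python
from urllib.parse import urlsplit


def split_mongo_url(url):
    """Split 'mongo://host:port/db/collection' into its four parts.

    Hypothetical helper for illustration only; not hyperopt's own parser.
    """
    parts = urlsplit(url)
    # The path looks like '/foo_db/jobs': database first, then collection.
    db, collection = parts.path.strip("/").split("/", 1)
    return parts.hostname, parts.port, db, collection


host, port, db, collection = split_mongo_url("mongo://localhost:1234/foo_db/jobs")
print(host, port, db, collection)
```

Note that the worker command further down only takes the host, port, and database part (localhost:1234/foo_db), without the collection name.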
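When you actually run the demo, fmin blocks. This is because MongoTrials makes fmin run asynchronously: when a new search point (a parameter combination) is proposed, fmin does not evaluate the objective function itself but waits for another process to do that work for it.
The hyperopt-mongo-worker script does exactly that job. Open a new shell and enter: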
hyperopt-mongo-worker --mongo=localhost:1234/foo_db --poll-interval=0.1
The first argument is the mongo address and the second is the polling interval. Since the demo is so simple, we get an optimal value of x very quickly.
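The division of labor described above (the driver proposes points and blocks, a worker polls for pending jobs and evaluates them) can be mimicked with a toy sketch built on the standard library. Everything here is an assumption made for illustration: the in-memory queue stands in for the 'jobs' collection, a thread stands in for the worker process, and the points are fixed up front, whereas the real fmin proposes them step by step.

```python
import math
import queue
import threading

jobs = queue.Queue()        # stands in for the 'jobs' collection in MongoDB
results = {}                # job id -> loss, filled in by the worker
done = threading.Event()


def worker(poll_interval=0.1):
    """Poll for pending jobs and evaluate the objective for each one."""
    while not done.is_set():
        try:
            job_id, x = jobs.get(timeout=poll_interval)
        except queue.Empty:
            continue  # nothing to do yet, poll again
        results[job_id] = math.sin(x)
        jobs.task_done()


def driver(points):
    """Enqueue the search points, then block until every one is evaluated."""
    for job_id, x in enumerate(points):
        jobs.put((job_id, x))
    jobs.join()  # blocks here, much like fmin does while no worker is running
    return min(results, key=results.get)


t = threading.Thread(target=worker, daemon=True)
t.start()
points = [-2.0, -1.5, 0.0, 1.0, 2.0]
best_id = driver(points)
done.set()
t.join()
print(points[best_id], results[best_id])
```

If the worker thread were never started, driver would hang on jobs.join(), which is the same symptom as running the demo without hyperopt-mongo-worker.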
But the demo above is too simple; we want to replace math.sin with a model of our own. Take a random forest as an example:
from hyperopt import fmin, hp, rand
from hyperopt.mongoexp import MongoTrials
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# train_x and train_y are assumed to have been loaded before this point.

def randomforest(args):
    class_weight = args['class_weight']
    criterion = args['criterion']
    min_impurity_split = args['min_impurity_split']
    n_estimators = args['n_estimators']
    min_samples_leaf = args['min_samples_leaf']
    min_samples_split = args['min_samples_split']
    estim = RandomForestClassifier(
        n_estimators=n_estimators,
        class_weight=class_weight,
        criterion=criterion,
        min_impurity_decrease=min_impurity_split,
        min_samples_leaf=min_samples_leaf,
        min_samples_split=min_samples_split
    )
    y_pred = cross_val_predict(estim, train_x, train_y, cv=3)
    metric = f1_score(train_y, y_pred)
    # fmin minimizes, so return the negated score
    return -metric

space = {
    'class_weight': hp.choice('class_weight', [None, 'balanced']),
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    'min_impurity_split': hp.lognormal('min_impurity_split', 1e-10, 1e-4)*1e-7,
    'min_samples_leaf': hp.randint('min_samples_leaf', 10)+1,
    'min_samples_split': hp.randint('min_samples_split', 10)+1,
    'n_estimators': hp.randint('n_estimators', 950)+50
}

if __name__ == '__main__':
    trials = MongoTrials('mongo://localhost:1234/foo_db/jobs', exp_key='exp2')
    best = fmin(fn=randomforest, space=space, algo=rand.suggest, max_evals=100, trials=trials)
    print(best)
Unfortunately this fails with an AttributeError: the worker process cannot find randomforest in its own __main__ module.
AttributeError: Can't get attribute 'randomforest' on <module '__main__' from ...hyperopt-mongo-worker
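Judging from the message, which is the one the standard pickle module raises, the objective function seems to be sent to the worker as a pickle. The sketch below is not taken from hyperopt; it only demonstrates the general pickle behavior that would explain the error: a function is pickled as a reference (module name plus function name), not as code, so the receiving process has to be able to import that module.

```python
import math
import pickle

# A pickled function is just a reference: the module name and the function name.
payload = pickle.dumps(math.sin)
print(payload)

# Unpickling looks the name up again in the named module.
restored = pickle.loads(payload)
print(restored is math.sin)


def objective(x):
    return x ** 2


# For a function defined in the script being run, the module is '__main__'.
# In the worker process, '__main__' is the hyperopt-mongo-worker script
# instead, which has no attribute with this name.
print(objective.__module__, objective.__qualname__)
```

This is why moving the objective into a module that both processes can import by name makes the lookup succeed.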
After some googling I found a workaround suggested by other users: first move the objective function into a separate script, for example:
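# hyperopt_model.py
# -*- coding: utf-8 -*-
import pandas as pd
from hyperopt import hp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# First column is the label, the remaining columns are the features
df = pd.read_csv('xxxxx.csv', header=0)
y, X = df[df.columns[0]], df[df.columns[1:]]

def randomforest(args):
    n_estimators = args['n_estimators']
    criterion = args['criterion']
    max_features = args['max_features']
    min_impurity_split = args['min_impurity_split']
    min_samples_leaf = args['min_samples_leaf']
    min_samples_split = args['min_samples_split']
    class_weight = args['class_weight']
    clf = RandomForestClassifier(
        class_weight=class_weight,
        criterion=criterion,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        # min_impurity_split was removed from scikit-learn; use min_impurity_decrease
        min_impurity_decrease=min_impurity_split,
        min_samples_split=min_samples_split,
        n_estimators=n_estimators,
        random_state=1
    )
    y_pred = cross_val_predict(clf, X, y, cv=3)
    metric = accuracy_score(y, y_pred)
    # fmin minimizes, so return the negated score
    return -metric

# The driver script reads hyperopt_model.space, so the search space lives here too.
space = {
    'class_weight': hp.choice('class_weight', [None, 'balanced']),
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    # Assumed entry: the original space has no max_features range
    'max_features': hp.choice('max_features', ['sqrt', 'log2']),
    'min_impurity_split': hp.lognormal('min_impurity_split', 1e-10, 1e-4)*1e-7,
    'min_samples_leaf': hp.randint('min_samples_leaf', 10)+1,
    'min_samples_split': hp.randint('min_samples_split', 10)+1,
    'n_estimators': hp.randint('n_estimators', 950)+50
}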
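Save this script as hyperopt_model.py and add the directory that contains it to PYTHONPATH, so that the worker process can import it:
export PYTHONPATH="${PYTHONPATH}:<directory containing hyperopt_model.py>"
Then change the driver script from earlier so that it imports the objective from that module: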
import hyperopt_model
from hyperopt import fmin, rand
from hyperopt.mongoexp import MongoTrials

if __name__ == '__main__':
    trials = MongoTrials('mongo://localhost:1234/foo_db/jobs', exp_key='exp2')
    best = fmin(fn=hyperopt_model.randomforest, space=hyperopt_model.space, algo=rand.suggest, max_evals=100, trials=trials)
    print(best)
After that, running hyperopt-mongo-worker again works fine, and the total running time dropped by roughly 50%.
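If the worker still reports that it cannot find the module, a quick sanity check is to ask Python whether the module is importable from the current environment. The sketch below uses a throwaway module with a made-up name (demo_objective_module) in a temporary directory so that it runs anywhere; the helper is_importable is hypothetical, not part of hyperopt.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path


def is_importable(module_name):
    """Return True if module_name can be found on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


# Demonstration with a throwaway module; 'demo_objective_module' is a made-up name.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_objective_module.py").write_text(
        "def objective(x):\n    return x * x\n"
    )
    before = is_importable("demo_objective_module")
    sys.path.append(tmp)  # comparable to putting the directory on PYTHONPATH
    importlib.invalidate_caches()
    after = is_importable("demo_objective_module")
    sys.path.remove(tmp)

print(before, after)
```

In practice one would call is_importable('hyperopt_model') from a Python session started in the same shell that launches hyperopt-mongo-worker, since that is the environment the worker inherits.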
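I also tried managing these two processes from a single script with a process pool (code below), but there are still some errors I have not been able to resolve. If anyone knows a better way, please let me know. Thanks!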
# coding: utf-8
import sys
import logging
import hyperopt.mongoexp
import hyperopt_model
from multiprocessing import Pool, Process
from hyperopt import fmin, rand
from hyperopt.mongoexp import MongoTrials

def task1():
    # Worker process: evaluates the points that fmin proposes
    logging.basicConfig(stream=sys.stderr, level=logging.INFO)
    print('task1 running')
    sys.exit(hyperopt.mongoexp.main_worker())

def task2(msg):
    # Driver process: runs the search and waits for the worker's results
    trials = MongoTrials('mongo://localhost:1234/foo_db/jobs', exp_key='exp3')
    best = fmin(fn=hyperopt_model.randomforest, space=hyperopt_model.space, algo=rand.suggest, max_evals=100, trials=trials)
    print(msg)
    print('task2 is running')
    return best

if __name__ == '__main__':
    pool = Pool(processes=4)
    p = Process(target=task1)
    p.start()
    ret = pool.apply_async(task2, args=(1,))
    pool.close()
    pool.join()
    p.join()
    print('processes done, result:')
    print(ret.get())
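One alternative that might be worth trying is to start the worker as a separate command through subprocess instead of calling main_worker() inside a multiprocessing process. The sketch below is untested against a real MongoDB setup, and it is not known whether it avoids the errors mentioned above. The helper names (build_worker_cmd, start_workers, stop_workers) are made up; only the command and its two flags come from the shell command used earlier.

```python
import subprocess


def build_worker_cmd(mongo="localhost:1234/foo_db", poll_interval=0.1):
    """Build the same command line that was typed into the shell earlier."""
    return [
        "hyperopt-mongo-worker",
        f"--mongo={mongo}",
        f"--poll-interval={poll_interval}",
    ]


def start_workers(n_workers=2):
    """Start n_workers worker processes.

    Requires hyperopt to be installed and a mongod listening on the address.
    """
    return [subprocess.Popen(build_worker_cmd()) for _ in range(n_workers)]


def stop_workers(workers):
    """Terminate the workers once the search has finished."""
    for proc in workers:
        proc.terminate()
        proc.wait()


# Only print the command here; starting workers needs a running MongoDB.
print(" ".join(build_worker_cmd()))
```

The driver script would then call workers = start_workers() before fmin(...) and stop_workers(workers) after it returns. Because each worker is an ordinary command, it inherits PYTHONPATH from the environment and imports hyperopt_model the same way as when started by hand.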