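Preface
Today I came across an article from AI前線 (AI Frontline), "Google BigQuery ML goes live: machine learning for people who only know SQL". As it happens, I have been actively promoting MLSQL, which is part of StreamingPro.
So today I will compare the two products.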
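About StreamingPro
StreamingPro is a data platform built on Spark, and MLSQL is an algorithm platform built on StreamingPro. With MLSQL you can use a SQL-like syntax to handle data ETL, algorithm training, model deployment, and the rest of a full ML pipeline. MLSQL merges the data platform and the algorithm platform, so you can get all of this done in one place.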
How it runs
MLSQL supports both Run as Application and Run as Service. Run as Service is simple, and you can try it directly on your own machine: Five Minute Quick Tutorial
BigQuery ML is a cloud product. Judging from the outside, it should also count as Run as Service.
Syntax and usage
BigQuery ML trains a model like this:
CREATE OR REPLACE MODEL flights.arrdelay
OPTIONS
(model_type='linear_reg', labels=['arr_delay']) AS
SELECT
arr_delay,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
BigQuery ML also extends standard SQL by adding new keywords, but overall it stays within the original shape of SQL syntax.
To do the same thing in MLSQL:
select arr_delay, carrier, origin, dest, dep_delay,
taxi_out, distance from db.table
as lrCorpus;
train lrCorpus as LogisticRegressor.`/tmp/linear_regression_model`
where inputCol="features"
and labelCol="label"
;
Likewise, MLSQL extends and changes SQL, and for model training the changes go further. After training completes, you can load the result data to see how it went. The output looks roughly like this:
+--------------------+--------+--------------------+-------------------+-------+-------------+-------------+--------------------+
| modelPath|algIndex| alg| score| status| startTime| endTime| trainParams|
+--------------------+--------+--------------------+-------------------+-------+-------------+-------------+--------------------+
|/tmp/william/tmp/...| 1|org.apache.spark....|-1.9704115113779945|success|1532659750073|1532659757320|Map(ratingCol -> ...|
|/tmp/william/tmp/...| 0|org.apache.spark....|-1.8446490919033698|success|1532659757327|1532659760394|Map(ratingCol -> ...|
+--------------------+--------+--------------------+-------------------+-------+-------------+-------------+--------------------+
For prediction, the BigQuery ML syntax is:
SELECT * FROM ML.PREDICT(MODEL flights.arrdelay,
(
SELECT
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance,
arr_delay AS actual_arr_delay
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
LIMIT 10))
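In BigQuery ML, passing the model name to ML.PREDICT is enough to invoke the corresponding prediction function. In MLSQL it takes two steps:
First, register the model. This gives you a function (pa_lr_predict here); the name is up to you.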
register LogisticRegressor.`/tmp/linear_regression_model` as pa_lr_predict options
modelVersion="1" ;
Then you can use it:
select pa_lr_predict(features) from lrCorpus limit 10 as predict_result;
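The register-then-call pattern can be tried out without Spark. Below is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the SQL engine. It is not how MLSQL implements register; the table contents, the weights, and the two-column signature are all made up for illustration.

```python
import sqlite3

# Hypothetical "trained model": fixed weights for a toy linear predictor.
WEIGHTS = (0.5, 2.0)
BIAS = 1.0


def predict(dep_delay, taxi_out):
    """Score one row with the toy linear model."""
    return BIAS + WEIGHTS[0] * dep_delay + WEIGHTS[1] * taxi_out


def build_connection():
    """Create an in-memory table and register the model as a SQL function."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lrCorpus (dep_delay REAL, taxi_out REAL)")
    conn.executemany(
        "INSERT INTO lrCorpus VALUES (?, ?)",
        [(10.0, 3.0), (0.0, 5.0), (-4.0, 1.0)],  # made-up sample rows
    )
    # Step 1: "register" the model under a name of our choosing.
    conn.create_function("pa_lr_predict", 2, predict)
    return conn


def run_prediction(conn):
    """Step 2: call the registered function from plain SQL."""
    rows = conn.execute(
        "SELECT pa_lr_predict(dep_delay, taxi_out) FROM lrCorpus LIMIT 10"
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(run_prediction(build_connection()))
```

The point of the two-step design is that once the model has a function name, every later query can use it like any other SQL function, without knowing where the model files live.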
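Integration with the data platform
BigQuery ML also lets you run complex data processing in SQL, so it does a good job of preparing data for models. MLSQL supports very complex data processing as well.
Beyond algorithms
"Data processing models" and SQL functions
It is worth mentioning that MLSQL ships a large number of "data processing models" and SQL functions. For example, to turn text data into tf-idf vectors, a single statement is enough: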
-- convert the text field into a tf/idf vector; a custom dictionary is supported
train orginal_text_corpus as TfIdfInPlace.`/tmp/tfidfinplace`
where inputCol="content"
-- tokenizer settings
and ignoreNature="true"
and dicPaths="...."
-- path to the stop-word list
and stopWordPath="/tmp/tfidf/stopwords"
-- path to the high-priority word list
and priorityDicPath="/tmp/tfidf/prioritywords"
-- weight multiplier for high-priority words
and priority="5.0"
-- n-gram settings
and nGram="2,3"
-- split setting: tokenize using the value of split as the delimiter
and split=""
;
-- the content field in table lwys_corpus_with_featurize is now a vector
load parquet.`/tmp/tfidf/data`
as lwys_corpus_with_featurize;
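For readers who have not worked with tf-idf before, here is a rough pure-Python sketch of what this kind of transformation computes: stop-word removal, n-gram expansion, then a tf-idf weight per term. It is only an illustration and not the TfIdfInPlace implementation. The function names are my own, whitespace tokenization replaces real word segmentation, the priority-word weighting is left out, and the smoothed idf form log((N+1)/(df+1)) is an assumption on my part.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(text, stop_words=frozenset(), ngram_sizes=(2, 3)):
    """Tokenize on whitespace, drop stop words, then append n-gram terms."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    terms = list(tokens)
    for n in ngram_sizes:
        terms.extend(ngrams(tokens, n))
    return terms


def tfidf(docs, stop_words=frozenset(), ngram_sizes=(2, 3)):
    """Return one sparse {term: weight} dict per document."""
    term_lists = [featurize(d, stop_words, ngram_sizes) for d in docs]
    n_docs = len(term_lists)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vectors.append(
            {t: c * math.log((n_docs + 1) / (df[t] + 1)) for t, c in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    for vec in tfidf(["spark sql rocks", "spark ml rocks"], ngram_sizes=(2,)):
        print(vec)
```

Terms that occur in every document get a weight of zero, while terms unique to one document get the largest weights, which is what makes the resulting vectors useful as model input.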
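Support for custom algorithms
Besides the algorithms already implemented in MLSQL, you can write your own in Python scripts. The PythonAlg module currently supports SKlearn, Tensorflow, Xgboost, Fasttext, and many other Python frameworks. Tensorflow additionally supports Cluster mode. For details, see MLSQL custom algorithms
Deployment
Both BigQuery ML and MLSQL let you call prediction directly from SQL. MLSQL also supports deploying a model as an API service. The steps are very simple: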
- Run StreamingPro in local (single-machine) mode.
- Register the model through the API or through configuration
register NaiveBayes.`/tmp/bayes_model` as bayes_predict;
- Call the prediction endpoint
http://127.0.0.1:9003/model/predict?pipeline=bayes_predict&data=[[1,2,3...]]&dataType=vector
MLSQL supports end-to-end deployment that reuses the whole data processing flow. For more, see MLSQL deployment
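When calling the endpoint from a program, the query string should be URL-encoded rather than concatenated by hand. The following is a small sketch of building such a URL in Python. The helper name build_predict_url is my own, and I am assuming the service accepts a percent-encoded JSON array in the data parameter, as the example URL above suggests.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_predict_url(base, pipeline, rows, data_type="vector"):
    """Build the prediction URL with properly encoded query parameters.

    base      -- scheme, host and port, e.g. "http://127.0.0.1:9003"
    pipeline  -- the function name chosen at register time
    rows      -- list of feature vectors, sent as a compact JSON array
    """
    query = urlencode({
        "pipeline": pipeline,
        "data": json.dumps(rows, separators=(",", ":")),
        "dataType": data_type,
    })
    return f"{base}/model/predict?{query}"


if __name__ == "__main__":
    url = build_predict_url("http://127.0.0.1:9003", "bayes_predict", [[1, 2, 3]])
    print(url)
    # Round-trip check: decode the query string back into its parts.
    print(parse_qs(urlsplit(url).query))
```

The resulting URL can then be fetched with any HTTP client, for example urllib.request.urlopen, once the service is running.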
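Model version management
Set keepVersion="true" at training time and every run keeps the previous version. For details, see model version management
Running multiple algorithms or parameter groups in parallel
If the algorithm itself is already distributed, MLSQL runs multiple parameter groups one after another. For example: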
train data as ALSInPlace.`/tmp/als` where
-- first parameter group
`fitParam.0.maxIter`="5"
and `fitParam.0.regParam` = "0.01"
and `fitParam.0.userCol` = "userId"
and `fitParam.0.itemCol` = "movieId"
and `fitParam.0.ratingCol` = "rating"
-- second parameter group
and `fitParam.1.maxIter`="1"
and `fitParam.1.regParam` = "0.1"
and `fitParam.1.userCol` = "userId"
and `fitParam.1.itemCol` = "movieId"
and `fitParam.1.ratingCol` = "rating"
-- compute rmse
and evaluateTable="test"
and ratingCol="rating"
-- recommend items for each user, 10 recommendations
and `userRec` = "10"
-- recommend users for each item, 10 recommendations
-- and `itemRec` = "10"
and coldStartStrategy="drop"
;
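This is a collaborative filtering recommendation algorithm. The user has configured two parameter groups, and because the algorithm itself is distributed, the two groups run serially. The next example uses PythonAlg with sklearn instead: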
-- train sklearn model
train data as PythonAlg.`${modelPath}`
-- specify the location of the training script
where pythonScriptPath="${sklearnTrainPath}"
-- kafka params for log
and `kafkaParam.bootstrap.servers`="${kafkaDomain}"
and `kafkaParam.topic`="test"
and `kafkaParam.group_id`="g_test-2"
and `kafkaParam.userName`="pi-algo"
-- distribute training data, so the python training script can read
and enableDataLocal="true"
and dataLocalFormat="json"
-- sklearn params
-- use SVC
and `fitParam.0.moduleName`="sklearn.svm"
and `fitParam.0.className`="SVC"
and `fitParam.0.featureCol`="features"
and `fitParam.0.labelCol`="label"
and `fitParam.0.class_weight`="balanced"
and `fitParam.0.verbose`="true"
and `fitParam.1.moduleName`="sklearn.naive_bayes"
and `fitParam.1.className`="GaussianNB"
and `fitParam.1.featureCol`="features"
and `fitParam.1.labelCol`="label"
and `fitParam.1.class_weight`="balanced"
and `fitParam.1.labelSize`="26"
-- python env
and `systemParam.pythonPath`="python"
and `systemParam.pythonParam`="-u"
and `systemParam.pythonVer`="2.7";
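The example above runs two algorithms, SVC and GaussianNB, in parallel. Neither algorithm can run distributed on its own, so MLSQL lets you run them side by side.
Summary
BigQuery ML is only one part of the Google BigQuery service, so comparing the two is somewhat unfair. MLSQL combines the data platform and the algorithm platform into one: you can do ETL, streaming, and algorithm work on it, all with a single SQL syntax. MLSQL also provides a large set of practical "data processing models" and SQL functions, which help a great deal in both training and prediction. They let the same preprocessing logic be reused at training time and at prediction time with almost no extra development, which makes end-to-end deployment possible and lowers costs for the business.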
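To make the serial-versus-parallel distinction concrete, here is a small Python sketch of the idea: group the `fitParam.N.*` options by their index N, then either run the groups one by one or hand them to a thread pool. This is only an illustration of the scheduling idea and not MLSQL's actual code. The names group_fit_params and run_groups, and the train_fn callback, are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_fit_params(options):
    """Collect `fitParam.<index>.<name>` options into one dict per index."""
    groups = defaultdict(dict)
    for key, value in options.items():
        parts = key.split(".", 2)
        if len(parts) == 3 and parts[0] == "fitParam" and parts[1].isdigit():
            groups[int(parts[1])][parts[2]] = value
    # Return the groups ordered by index; other options are ignored.
    return [groups[i] for i in sorted(groups)]


def run_groups(groups, train_fn, distributed):
    """Run train_fn once per parameter group.

    distributed=True  -- the algorithm already uses the whole cluster,
                         so the groups run one after another.
    distributed=False -- each run is single-node, so run them concurrently.
    """
    if distributed:
        return [train_fn(group) for group in groups]
    with ThreadPoolExecutor(max_workers=len(groups) or 1) as pool:
        return list(pool.map(train_fn, groups))


if __name__ == "__main__":
    options = {
        "fitParam.0.moduleName": "sklearn.svm",
        "fitParam.0.className": "SVC",
        "fitParam.1.moduleName": "sklearn.naive_bayes",
        "fitParam.1.className": "GaussianNB",
        "enableDataLocal": "true",
    }
    groups = group_fit_params(options)
    # Stand-in for real training: just report which class would be trained.
    print(run_groups(groups, lambda g: g["className"], distributed=False))
```

Either way the results come back in parameter-group order, so they can be matched to the algIndex column in the training result table.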