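Google BigQuery ML vs. StreamingPro MLSQL

Preface

Today I came across an article from AI前線 (AI Frontline), "Google BigQuery ML Goes Live: Machine Learning for People Who Only Know SQL!". I happen to be actively promoting StreamingPro's MLSQL,
so today let's compare the two products.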

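Introduction to StreamingPro

StreamingPro is a data platform built on Spark, and MLSQL is an algorithm platform built on StreamingPro. With MLSQL you can use SQL-like statements to cover the whole ML pipeline: data ETL, algorithm training, and model deployment. MLSQL merges the data platform and the algorithm platform, so you can get all of this done in one place.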

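How They Run

MLSQL supports both Run as Application and Run as Service. Running MLSQL as a service is simple, and you can try it directly on your own machine: Five Minute Quick Tutorial
BigQuery ML, by contrast, is a cloud product. Judging from the outside, it is presumably also Run as Service.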

Syntax and Usage

In BigQuery ML, you train a model like this:

CREATE OR REPLACE MODEL flights.arrdelay
OPTIONS
 (model_type='linear_reg', labels=['arr_delay']) AS
SELECT
 arr_delay,
 carrier,
 origin,
 dest,
 dep_delay,
 taxi_out,
 distance
FROM
 `cloud-training-demos.flights.tzcorr`
WHERE
 arr_delay IS NOT NULL

BigQuery ML also extends standard SQL syntax with new keywords, but overall it keeps the original shape of SQL.

To accomplish the same thing in MLSQL:

select arr_delay, carrier, origin, dest, dep_delay,
taxi_out, distance from db.table 
as lrCorpus;

train lrCorpus as LogisticRegressor.`/tmp/linear_regression_model`
where inputCol="features"
and labelCol="label"
;

Likewise, MLSQL extends and changes SQL, and for model training the changes are larger. Once training finishes, you can load the output data to inspect the results, which look something like this:

+--------------------+--------+--------------------+-------------------+-------+-------------+-------------+--------------------+
|           modelPath|algIndex|                 alg|              score| status|    startTime|      endTime|         trainParams|
+--------------------+--------+--------------------+-------------------+-------+-------------+-------------+--------------------+
|/tmp/william/tmp/...|       1|org.apache.spark....|-1.9704115113779945|success|1532659750073|1532659757320|Map(ratingCol -> ...|
|/tmp/william/tmp/...|       0|org.apache.spark....|-1.8446490919033698|success|1532659757327|1532659760394|Map(ratingCol -> ...|
+--------------------+--------+--------------------+-------------------+-------+-------------+-------------+--------------------+

For prediction, the BigQuery ML syntax is:

SELECT * FROM ML.PREDICT(MODEL flights.arrdelay,
(
SELECT
 carrier,
 origin,
 dest,
 dep_delay,
 taxi_out,
 distance,
 arr_delay AS actual_arr_delay
FROM
 `cloud-training-demos.flights.tzcorr`
WHERE
 arr_delay IS NOT NULL
LIMIT 10))

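With ML.PREDICT, specifying the model name is enough to invoke the corresponding prediction function. In MLSQL, it takes two steps:

First register the model, which gives you a function (pa_lr_predict). You choose the name yourself.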

register LogisticRegressor.`/tmp/linear_regression_model` as pa_lr_predict options
modelVersion="1" ;

Then you can use it:

select pa_lr_predict(features) from lrCorpus limit 10 as predict_result;

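Integration with the Data Platform

BigQuery ML also supports complex data processing in SQL, so it is well suited to preparing data for models. MLSQL likewise supports very complex data processing.

Beyond Algorithms

"Data Processing Models" and SQL Functions

It is worth mentioning that MLSQL provides a large number of "data processing models" and SQL functions. For example, to convert text data into TF-IDF vectors, a single statement is enough: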

-- convert a text field into a tf/idf vector; a custom dictionary is supported
train orginal_text_corpus as TfIdfInPlace.`/tmp/tfidfinplace`
where inputCol="content"
-- tokenizer settings
and ignoreNature="true"
and dicPaths="...."
-- stop word path
and stopWordPath="/tmp/tfidf/stopwords"
-- high-priority word path
and priorityDicPath="/tmp/tfidf/prioritywords"
-- weight multiplier for high-priority words
and priority="5.0"
-- ngram settings
and nGram="2,3"
-- split setting: tokenize using the value of split as the delimiter
and split=""
;

-- the content field of the lwys_corpus_with_featurize table is now a vector
load parquet.`/tmp/tfidf/data` 
as lwys_corpus_with_featurize;
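
To make the idea of a TF-IDF vector concrete, here is a minimal pure-Python sketch. It is not the TfIdfInPlace implementation: it assumes already-tokenized documents, uses raw term counts for TF, and uses the smoothed IDF form log((N + 1) / (df + 1)). The function name tf_idf and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumption, see the lead-in).
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    return vectors


# Hypothetical, already-tokenized sample documents.
docs = [["spark", "sql", "ml"], ["spark", "streaming"], ["sql", "sql", "etl"]]
vectors = tf_idf(docs)
```

Terms that appear in fewer documents (such as "ml") get a higher weight than terms shared across documents (such as "spark"), which is the property the downstream model relies on.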

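Support for Custom Algorithms

Besides the algorithms already implemented in MLSQL, you can implement your own with Python scripts. The PythonAlg module currently supports many Python algorithm frameworks, including SKlearn, Tensorflow, Xgboost, and Fasttext. Tensorflow additionally supports Cluster mode. For details, see the MLSQL custom algorithm documentation.

Deployment

Both BigQuery ML and MLSQL let you use prediction directly in SQL. MLSQL also supports deploying a model as an API service. The procedure is very simple: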

Run StreamingPro in single-machine mode.
Register the algorithm model through the API or through configuration: register NaiveBayes.`/tmp/bayes_model` as bayes_predict;
Call the prediction endpoint:

http://127.0.0.1:9003/model/predict?pipeline=bayes_predict&data=[[1,2,3...]]&dataType=vector

MLSQL supports end-to-end deployment that reuses the entire data processing flow. For more, see the MLSQL deployment documentation.
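
When calling the prediction endpoint from a script, the brackets and commas in the data parameter need to be percent-encoded. The following is a minimal sketch using only the Python standard library. The endpoint path and the parameter names (pipeline, data, dataType) come from the URL in this article; the helper name build_predict_url and the sample vector are made up for illustration.

```python
import json
from urllib.parse import urlencode


def build_predict_url(base_url, pipeline, vectors):
    """Build a percent-encoded prediction URL for a registered function."""
    query = urlencode({
        "pipeline": pipeline,
        # Compact JSON so the payload matches the [[1,2,3]] shape in the article.
        "data": json.dumps(vectors, separators=(",", ":")),
        "dataType": "vector",
    })
    return base_url + "?" + query


url = build_predict_url(
    "http://127.0.0.1:9003/model/predict",
    "bayes_predict",
    [[1, 2, 3]],  # hypothetical feature vector
)
print(url)
```

The resulting URL can be passed to any HTTP client. The shape of the response is not described in this article, so the sketch stops at building the request.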

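Model Version Management

Set keepVersion="true" when training, and each run will keep the previous version. For details, see the model version management documentation.

Running Multiple Algorithms / Parameter Groups in Parallel

If the algorithm itself is already distributed, MLSQL runs multiple parameter groups sequentially. For example: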

train data as ALSInPlace.`/tmp/als` where
-- first parameter group
`fitParam.0.maxIter`="5"
and `fitParam.0.regParam` = "0.01"
and `fitParam.0.userCol` = "userId"
and `fitParam.0.itemCol` = "movieId"
and `fitParam.0.ratingCol` = "rating"
-- second parameter group
and `fitParam.1.maxIter`="1"
and `fitParam.1.regParam` = "0.1"
and `fitParam.1.userCol` = "userId"
and `fitParam.1.itemCol` = "movieId"
and `fitParam.1.ratingCol` = "rating"
-- compute RMSE
and evaluateTable="test"
and ratingCol="rating"
-- recommend items for each user, 10 recommendations
and `userRec` = "10"
-- recommend users for each item, 10 recommendations
-- and `itemRec` = "10"
and coldStartStrategy="drop"
;

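This is a collaborative filtering recommendation algorithm. The user has configured two parameter groups, and because the algorithm itself is distributed, the two groups run serially.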
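
The `fitParam.<index>.<name>` naming convention can be read as a flat encoding of one parameter dictionary per run. The sketch below shows one way to group such keys in Python. It only illustrates the convention and is not how StreamingPro parses it; group_fit_params is a hypothetical helper.

```python
from collections import defaultdict


def group_fit_params(params):
    """Group flat `fitParam.<index>.<name>` keys into one dict per index."""
    groups = defaultdict(dict)
    for key, value in params.items():
        # Split into at most three parts so dotted names stay intact.
        parts = key.split(".", 2)
        if len(parts) == 3 and parts[0] == "fitParam" and parts[1].isdigit():
            groups[int(parts[1])][parts[2]] = value
    # Return the groups ordered by their index.
    return [groups[i] for i in sorted(groups)]


params = {
    "fitParam.0.maxIter": "5",
    "fitParam.0.regParam": "0.01",
    "fitParam.1.maxIter": "1",
    "fitParam.1.regParam": "0.1",
    "evaluateTable": "test",  # not a fitParam key, so it is ignored here
}
fit_groups = group_fit_params(params)
print(fit_groups)
```

Each element of fit_groups then corresponds to one training run with its own hyperparameters.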

-- train sklearn model
train data as PythonAlg.`${modelPath}` 

-- specify the location of the training script 
where pythonScriptPath="${sklearnTrainPath}"

-- kafka params for log
and `kafkaParam.bootstrap.servers`="${kafkaDomain}"
and `kafkaParam.topic`="test"
and `kafkaParam.group_id`="g_test-2"
and `kafkaParam.userName`="pi-algo"
-- distribute training data, so the python training script can read 
and  enableDataLocal="true"
and  dataLocalFormat="json"

-- sklearn params
-- use SVC
and `fitParam.0.moduleName`="sklearn.svm"
and `fitParam.0.className`="SVC"
and `fitParam.0.featureCol`="features"
and `fitParam.0.labelCol`="label"
and `fitParam.0.class_weight`="balanced"
and `fitParam.0.verbose`="true"

and `fitParam.1.moduleName`="sklearn.naive_bayes"
and `fitParam.1.className`="GaussianNB"
and `fitParam.1.featureCol`="features"
and `fitParam.1.labelCol`="label"
and `fitParam.1.class_weight`="balanced"
and `fitParam.1.labelSize`="26"

-- python env
and `systemParam.pythonPath`="python"
and `systemParam.pythonParam`="-u"
and `systemParam.pythonVer`="2.7";

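The example above, by contrast, runs two algorithms, SVC and GaussianNB, in parallel. Because neither algorithm can run in a distributed fashion on its own, MLSQL lets you run the two side by side.

Summary

BigQuery ML is only one part of the Google BigQuery service, so comparing MLSQL against it is admittedly somewhat unfair. MLSQL merges the data platform and the algorithm platform into one: on it you can do ETL, streaming, and algorithm work, all with a single SQL syntax. MLSQL also provides a large number of practical "data processing models" and SQL functions, which help a great deal in both training and prediction. They let the data preprocessing logic be reused between training and prediction with essentially no extra development, which enables end-to-end deployment and reduces costs for the business.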
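
The scheduling rule described in this section (serial runs for distributed algorithms, parallel runs for single-node ones) can be sketched in a few lines of Python. This is an illustration of the idea only, not StreamingPro's scheduler. The names run_groups and fake_train are made up, and a thread pool stands in for whatever mechanism MLSQL actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def run_groups(fit_groups, train_one, distributed):
    """Train once per parameter group and return results in group order."""
    if distributed:
        # Each run already uses the whole cluster, so run them one by one.
        return [train_one(group) for group in fit_groups]
    # Single-node runs do not compete for the cluster, so run them side by side.
    workers = max(1, len(fit_groups))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_one, fit_groups))


def fake_train(group):
    # Hypothetical stand-in for a real training call.
    return group["className"] + " trained"


groups = [{"className": "SVC"}, {"className": "GaussianNB"}]
results = run_groups(groups, fake_train, distributed=False)
print(results)
```

Either branch returns results in the same order as the parameter groups, so the caller can match each result to its index.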