一、Introduction to PredictionIO
Apache PredictionIO is an incubating open source Machine Learning Server that lets developers and data scientists create predictive engines for any machine learning task. In the official words:
Apache PredictionIO (incubating) is an open source Machine Learning Server built on
top of a state-of-the-art open source stack for developers and data scientists
to create predictive engines for any machine learning task.
PredictionIO uses Spark as its computing engine and MySQL, HBase + Elasticsearch, or PostgreSQL as data storage, and ships with commonly used engine templates:
1、Recommenders. Integrates Spark MLlib's collaborative filtering algorithm; suitable for personalized recommendation in e-commerce, news, and video.
2、Classification. Integrates Spark MLlib's Naive Bayes algorithm; provides text classification, prediction of a user's conversion probability in the current session, churn prediction, and similar services.
3、NLP. Mainly for sentiment analysis.
4、Other engines such as regression and clustering are also provided.
Once we pick an engine template and import user behavior data into the database, PredictionIO handles the complex parts such as training and modeling for us, and exposes a RESTful API that returns prediction results.
二、Setting Up the PredictionIO Server
At the setup stage the official documentation matters most; it is more useful than any other material on the web. When you set up the server for the first time, follow the official documentation. Important things are worth repeating three times: official docs! Official docs! Official docs!
I also ran into all kinds of problems during setup, and in the end I rebuilt my previously working Spark environment exactly as the official documentation requires. So this article is little more than a translation of the official documentation plus the points to watch out for.
1、Environment requirements
The versions below matter; they are at least guaranteed to work. Other versions are not guaranteed, and getting the Spark version to play nicely with the versions of the other languages and tools is painful.
Apache Spark 1.6.3 for Hadoop 2.6
JDK 1.8
and one of the following:
PostgreSQL 9.1
or
MySQL 5.1
or
Apache HBase 0.98.5
Elasticsearch 1.7.6
2、Other notes
Scala version: the official site says 2.10.6, but I installed 2.10.5, because the Scala bundled with Spark 1.6.3 for Hadoop 2.6 is 2.10.5; it works without problems.
Download the PredictionIO tarball from the official site (I installed 0.11.0) and extract it:
tar zxvf PredictionIO-0.11.0-incubating.tar.gz
cd PredictionIO-0.11.0-incubating/bin/
You need to run the install script (the official docs do not mention this):
./install.sh
To build with support for Scala 2.10.5, Spark 1.6.3, and Elasticsearch 5.3.0:
./make-distribution.sh -Dscala.version=2.10.5 -Dspark.version=1.6.3 -Delasticsearch.version=5.3.0
Install the dependencies. If you let PredictionIO download dependencies (such as Spark) into the vendors directory itself, create that directory first; if the dependencies are installed separately, vendors is not needed.
mkdir PredictionIO-0.11.0-incubating/vendors
Install one of the databases:
MySQL 5.1 / PostgreSQL 9.1 / HBase 0.98.5 + Elasticsearch 1.7.6
Configure the dependency environment parameters:
cd PredictionIO-0.11.0-incubating/conf/
vi pio-env.sh
Configure Spark and the database JDBC drivers; if you do not have the driver jar files, download them yourself:
SPARK_HOME=/Users/jiazhaopu/program/spark-1.6.3
POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar
To keep things simple I use MySQL as storage, but the config file defaults to PostgreSQL.
Change the Storage Repositories section to:
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=MYSQL
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=MYSQL
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=MYSQL
# Storage Data Sources
# PostgreSQL Default Settings
# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL
# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
# PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
# PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio
# MySQL Example
PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://127.0.0.1:3306/pio
PIO_STORAGE_SOURCES_MYSQL_USERNAME=root
PIO_STORAGE_SOURCES_MYSQL_PASSWORD=root
# PIO_STORAGE_SOURCES_MYSQL_URL specifies the database host and database name, so the database must exist first; here it is the pio database.
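Creating the database from the mysql client works fine; below is a minimal Python sketch of the same step, assuming the pymysql package is installed and the root/root credentials configured above (adjust to your own setup):

# Sketch: create the pio database referenced by PIO_STORAGE_SOURCES_MYSQL_URL.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="root")
try:
    with conn.cursor() as cur:
        cur.execute("CREATE DATABASE IF NOT EXISTS pio")
    conn.commit()
finally:
    conn.close()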
At this point the basic PredictionIO service is set up; next, add PredictionIO to your environment variables:
vi /etc/profile
export PATH=$PATH:/Users/jiazhaopu/program/spark-1.6.3/bin
export PATH=$PATH:/Users/jiazhaopu/program/mongodb-osx-x86_64-enterprise-3.4.9/bin
export PATH=$PATH:/usr/local/mysql/bin
export PATH=$PATH:/Users/jiazhaopu/program/apache-predictionio-0.11.0-incubating/PredictionIO/bin
export PATH=$PATH:/Users/jiazhaopu/program/scala-2.10.5/bin
With the environment variables configured, start all services with:
pio-start-all
If you are using PostgreSQL or MySQL, run the following command to start the PredictionIO Event Server:
pio eventserver &
The following log output:
[INFO] [HttpListener] Bound to /0.0.0.0:7070
[INFO] [EventServerActor] Bound received. EventServer is ready
means the Event Server started successfully.
Stop all services with:
pio-stop-all
Opening http://localhost:7070 should return:
{"status":"alive"}
Run pio status to check the overall setup:
[INFO] [Storage$] Verifying Model Data Backend (Source: MYSQL)...
[INFO] [Storage$] Verifying Event Data Backend (Source: MYSQL)...
[INFO] [Storage$] Test writing to Event Store (App Id 0)...
[INFO] [Management$] Your system is all ready to go.
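If you prefer to check from code rather than the console, here is a small sketch using the requests package (my own choice, not something the official docs require) that asks the Event Server for its status:

# Sketch: the Event Server answers {"status":"alive"} on its root path.
import requests

resp = requests.get("http://localhost:7070")
print(resp.json())  # expected: {"status": "alive"}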
Run jps -l to see which JVM processes are running:
54305 sun.tools.jps.Jps
10546 org.jetbrains.jps.cmdline.Launcher
28594 org.apache.predictionio.tools.console.Console
738
54164 org.apache.predictionio.tools.console.Console
774 org.jetbrains.idea.maven.server.RemoteMavenServer
28649 org.apache.spark.deploy.SparkSubmit
三坠敷、部署模板引擎
1、下載模板引擎
這里下載的是一個(gè)數(shù)據(jù)推薦引擎射富,推薦原理是:
收集用戶買了哪些商品膝迎,用戶給商品打分這兩項(xiàng)數(shù)據(jù),通過協(xié)同過濾算法訓(xùn)練出用戶喜好模型胰耗,推薦用戶還會(huì)買哪些商品限次。
git clone https://github.com/apache/incubator-predictionio-template-recommender.git MyRecommendation
$ cd MyRecommendation
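The template's DataSource and Algorithm are written in Scala, but the training step is conceptually just MLlib's ALS collaborative filtering over (user, item, rating) triples. Here is a rough PySpark sketch of the same idea; the file path, SparkContext setup, and hyperparameters are illustrative assumptions, not the template's actual code:

# Conceptual sketch of what `pio train` does for this template: ALS on ratings.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-sketch")

# sample_movielens_data.txt lines look like "userId::itemId::rating"
lines = sc.textFile("data/sample_movielens_data.txt")
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

# Train a latent-factor model via alternating least squares
model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)

# Recommend 4 items for user 1
print(model.recommendProducts(1, 4))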
2、Create an App ID and Access Key
pio app new MyApp1
You will see output like the following:
[INFO] [App$] Initialized Event Store for this app ID: 1.
[INFO] [App$] Created new app:
[INFO] [App$] Name: MyApp1
[INFO] [App$] ID: 1
[INFO] [App$] Access Key: 3mZWDzci2D5YsqAnqNnXH9SB6Rg3dsTBs8iHkK6X2i54IQsIZI1eEeQQyMfs7b3F
Note down the App ID, Access Key, and Name; they are needed on the data-collection server and in the code.
Running pio app list shows the existing apps:
[INFO] [App$] Name | ID | Access Key | Allowed Event(s)
[INFO] [App$] MyApp1 | 1 | 3mZWDzci2D5YsqAnqNnXH9SB6Rg3dsTBs8iHkK6X2i54IQsIZI1eEeQQyMfs7b3F | (all)
[INFO] [App$] MyApp2 | 2 | io5lz6Eg4m3Xe4JZTBFE13GMAf1dhFl6ZteuJfrO84XpdOz9wRCrDU44EUaYuXq5 | (all)
[INFO] [App$] Finished listing 2 app(s).
3赠群、收集數(shù)據(jù)
PredictionIO提供了收集數(shù)據(jù)的接口羊始,在啟動(dòng)了Prediction服務(wù)后可以調(diào)用它,這里需要用到Access Key查描。
$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
"event" : "rate",
"entityType" : "user",
"entityId" : "u0",
"targetEntityType" : "item",
"targetEntityId" : "i0",
"properties" : {
"rating" : 5
},
"eventTime" : "2014-11-02T09:39:45.618-08:00"
}'
The template code also ships a Python batch-import script. Two event types are used here: a user buying an item and a user rating an item.
import predictionio
client = predictionio.EventClient(
access_key=<ACCESS KEY>,
url=<URL OF EVENTSERVER>,
threads=5,
qsize=500
)
# a user rates an item
client.create_event(
event="rate",
entity_type="user",
entity_id=<USER ID>,
target_entity_type="item",
target_entity_id=<ITEM ID>,
properties= { "rating" : float(<RATING>) }
)
# a user buys an item
client.create_event(
event="buy",
entity_type="user",
entity_id=<USER ID>,
target_entity_type="item",
target_entity_id=<ITEM ID>
)
The official documentation provides sample data:
https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_movielens_data.txt
Download the data into the template's data directory and run the import script:
cd MyRecommendation
$ curl https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_movielens_data.txt --create-dirs -o data/sample_movielens_data.txt
$ python data/import_eventserver.py --access_key $ACCESS_KEY
You will see:
Importing data...
1501 events are imported.
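To spot-check that the events really landed in the Event Server, you can also read a few of them back through the Event API. A sketch using the requests package; YOUR_ACCESS_KEY is a placeholder for the key created earlier:

# Sketch: list a few imported events via GET /events.json.
import requests

ACCESS_KEY = "YOUR_ACCESS_KEY"  # from `pio app new MyApp1`
resp = requests.get(
    "http://localhost:7070/events.json",
    params={"accessKey": ACCESS_KEY, "limit": 5},
)
for event in resp.json():
    print(event["event"], event["entityId"], event.get("targetEntityId"))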
4、Deploy your engine
Edit engine.json in the template code and change appName to the app you created:
...
"datasource": {
"params" : {
"appName": "MyApp1"
}
},
...
Build your engine:
pio build --verbose
這里可能會(huì)出現(xiàn)如下錯(cuò)誤
[INFO] [Engine$] If the path above is incorrect, this process will fail.
[INFO] [Engine$] Uber JAR disabled. Making sure lib/pio-assembly-0.11.0-incubating.jar is absent.
[INFO] [Engine$] Going to run: /Users/jiazhaopu/program/apache-predictionio-0.11.0-incubating/PredictionIO/sbt/sbt package assemblyPackageDependency in /Users/jiazhaopu/workspace/incubator-predictionio-template-recommender
[ERROR] [Engine$] Downloading sbt launcher for 0.13.15:
[ERROR] [Engine$] From http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.15/sbt-launch.jar
[ERROR] [Engine$] To /Users/jiazhaopu/.sbt/launchers/0.13.15/sbt-launch.jar
This happens because downloading sbt-launch.jar failed. You can download sbt-launch.jar yourself and put it in the corresponding location; download URL:
http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.15/sbt-launch.jar
In the end you will see:
[INFO] [Console$] Your engine is ready for training.
The engine is now ready for training.
5钝侠、訓(xùn)練數(shù)據(jù)
執(zhí)行以下命令開始訓(xùn)練數(shù)據(jù)
pio train
Because Spark is a memory-based computing framework, training can fail if memory and the number of parallel tasks are not configured well, especially on a PC without much memory; you may hit an OutOfMemoryError or a StackOverflowError.
I ran into the StackOverflowError.
This PC has a 4-core processor and 16 GB of RAM.
The cause was that Spark's memory and parallelism parameters were misconfigured.
The final working configuration:
#spark-defaults.conf
spark.eventLog.enabled true
spark.driver.memory 512M # driver memory
spark.driver.maxResultSize 1g # max total size of results returned to the driver
spark.driver.extraJavaOptions -Xss32m -XX:PermSize=128M -XX:MaxPermSize=512M
spark.executor.memory 10g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="three"
#spark.sql.shuffle.partitions 4
spark.default.parallelism 800
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.default.parallelism is the number of parallel tasks, usually set somewhere between 500 and 1000; on my machine both 500 and 1000 still produced a StackOverflowError.
spark.serializer org.apache.spark.serializer.KryoSerializer sets the serializer, which also matters.
#spark-env.sh
export SPARK_WORKER_MEMORY=14g
export SPARK_EXECUTOR_INSTANCES=1
export SPARK_EXECUTOR_CORES=4
export SCALA_HOME=/Users/jiazhaopu/program/scala-2.10.5
SPARK_WORKER_MEMORY: worker memory, as large as you can afford.
SPARK_EXECUTOR_INSTANCES: number of executor instances; set it to 1 on a single machine.
SPARK_EXECUTOR_CORES: cores per executor, equal to the number of processor cores.
SPARK_EXECUTOR_INSTANCES * SPARK_EXECUTOR_CORES = number of processor cores.
When training completes you will see:
[INFO] [CoreWorkflow$] Training completed successfully.
Deploy the engine:
$ pio deploy
On successful deployment:
[INFO] [HttpListener] Bound to /0.0.0.0:8000
[INFO] [MasterActor] Bind successful. Ready to serve.
By default the engine binds to http://localhost:8000/ ; open the link and take a look.
Now you can call the API to get recommendation results.
Top 4 recommendations for user 1:
$ curl -H "Content-Type: application/json" \
-d '{ "user": "1", "num": 4 }' http://localhost:8000/queries.json
The returned JSON, sorted by score in descending order:
{
"itemScores":[
{"item":"22","score":4.072304374729956},
{"item":"62","score":4.058482414005789},
{"item":"75","score":4.046063009943821},
{"item":"68","score":3.8153661512945325}
]
}
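The same query can also be sent from Python with the SDK's EngineClient (a short sketch; the URL assumes the default deploy port 8000 used above):

# Sketch: ask the deployed engine for the top 4 items for user "1".
import predictionio

engine_client = predictionio.EngineClient(url="http://localhost:8000")
print(engine_client.send_query({"user": "1", "num": 4}))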
That's it: the PredictionIO recommendation service is fully set up. Thanks for the applause.
Below are a few more details.
1、Training
Spark is memory-based computing, so you would expect it to be memory-hungry, but in practice memory usage stayed low, probably because the dataset is small; CPU usage, however, was high, peaking above 90%.
2貌嫡、部署引擎
3比驻、數(shù)據(jù)存儲(chǔ)
為了簡單,為用了mysql存儲(chǔ)訓(xùn)練數(shù)據(jù)岛抄,predictionIO生成了
mysql> show tables;
+------------------------------+
| Tables_in_pio |
+------------------------------+
| pio_event_1 |
| pio_meta_accesskeys |
| pio_meta_apps |
| pio_meta_channels |
| pio_meta_engineinstances |
| pio_meta_evaluationinstances |
| pio_model_models |
+------------------------------+
7 rows in set (0.00 sec)
pio_event_1 stores the training events:
mysql> select * from pio_event_1 limit 10;
+----------------------------------+-------+------------+----------+------------------+----------------+----------------+---------------------+---------------+------+------+---------------------+------------------+
| id | event | entityType | entityId | targetEntityType | targetEntityId | properties | eventTime | eventTimeZone | tags | prId | creationTime | creationTimeZone |
+----------------------------------+-------+------------+----------+------------------+----------------+----------------+---------------------+---------------+------+------+---------------------+------------------+
| 00476c09f05240b1b46e519af98bbc5b | buy | user | 11 | item | 30 | {} | 2017-09-28 15:52:27 | UTC | NULL | NULL | 2017-09-28 15:52:27 | UTC |
| 004db334651f4aa6b074a91a879f90cf | buy | user | 8 | item | 69 | {} | 2017-09-28 15:52:27 | UTC | NULL | NULL | 2017-09-28 15:52:27 | UTC |
| 0063564f4ab945a9ae72c62654511526 | rate | user | 18 | item | 75 | {"rating":1.0} | 2017-09-28 15:52:29 | UTC | NULL | NULL | 2017-09-28 15:52:29 | UTC |
| 007b8e2e6b9a422dbb3080649d59bd82 | buy | user | 11 | item | 36 | {} | 2017-09-28 15:52:27 | UTC | NULL | NULL | 2017-09-28 15:52:27 | UTC |
| 009d6156c48f44a89eec9d2aa6c3736e | buy | user | 4 | item | 98 | {} | 2017-09-28 15:52:26 | UTC | NULL | NULL | 2017-09-28 15:52:26 | UTC |
| 00ca2326e9414b0f8ad35d824dbe2edd | rate | user | 22 | item | 19 | {"rating":1.0} | 2017-09-28 15:52:29 | UTC | NULL | NULL | 2017-09-28 15:52:29 | UTC |
| 00cc79d0dc20466b907cc58eb040d151 | rate | user | 3 | item | 47 | {"rating":1.0} | 2017-09-28 15:52:25 | UTC | NULL | NULL | 2017-09-28 15:52:25 | UTC |
| 014315797d564a94abc463dbd6e1e96b | rate | user | 17 | item | 60 | {"rating":1.0} | 2017-09-28 15:52:28 | UTC | NULL | NULL | 2017-09-28 15:52:28 | UTC |
| 01660b9b40fe411eae4f6fdf993082d4 | rate | user | 16 | item | 30 | {"rating":2.0} | 2017-09-28 15:52:28 | UTC | NULL | NULL | 2017-09-28 15:52:28 | UTC |
| 01b0a512e1aa40bf823fd01ba7712741 | buy | user | 1 | item | 54 | {} | 2017-09-28 15:52:25 | UTC | NULL | NULL | 2017-09-28 15:52:25 | UTC |
+----------------------------------+-------+------------+----------+------------------+----------------+----------------+---------------------+---------------+------+------+---------------------+------------------+
10 rows in set (0.00 sec)
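If you want to poke at the stored events directly, here is a quick sketch (again assuming pymysql and the credentials from pio-env.sh) that counts buy and rate events in pio_event_1:

# Sketch: count events by type in the pio_event_1 table.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="root", db="pio")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT event, COUNT(*) FROM pio_event_1 GROUP BY event")
        for event, count in cur.fetchall():
            print(event, count)
finally:
    conn.close()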