This post is mainly a translation, with my own notes, of
https://www.ctolib.com/IBM-elasticsearch-spark-recommender.html
together with a cleaned-up version of its source code.
Building a recommender system with Apache Spark & Elasticsearch
Setup
- Install Elasticsearch
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.3.0.tar.gz
$ tar xfz elasticsearch-5.3.0.tar.gz
- Install the Elasticsearch vector scoring plugin
$ cd elasticsearch-5.3.0
$ ./bin/elasticsearch-plugin install https://github.com/MLnick/elasticsearch-vector-scoring/releases/download/v5.3.0/elasticsearch-vector-scoring-5.3.0.zip
- Start Elasticsearch
$ ./bin/elasticsearch
Confirm that the vector scoring plugin has been loaded.
- Install the Elasticsearch Python client
$ pip install elasticsearch
- Download the connector between Spark and Elasticsearch
$ wget http://download.elastic.co/hadoop/elasticsearch-hadoop-5.3.0.zip
$ unzip elasticsearch-hadoop-5.3.0.zip
- Download Spark
$ tar xfz spark-2.4.5-bin-hadoop2.7.tgz
- Install numpy
$ pip install numpy
- Download the training data
$ cd data
$ wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
$ unzip ml-latest-small.zip
- Install and start the notebook
Note: the notebook environment requires Python 2.7 or 3.x (tested with 2.7.11 and 3.6.1).
$ pip install tmdbsimple
$ pip install jupyter
$ PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./spark-2.4.5-bin-hadoop2.7/bin/pyspark --driver-memory 2g --driver-class-path ./elasticsearch-hadoop-5.3.0/dist/elasticsearch-spark-20_2.11-5.3.0.jar
- Download the example notebook
[https://github.com/mindis/elasticsearch-spark-recommender-demo/blob/master/notebooks/elasticsearch-spark-recommender.ipynb](https://github.com/mindis/elasticsearch-spark-recommender-demo/blob/master/notebooks/elasticsearch-spark-recommender.ipynb)
- Open the example in the notebook
Walkthrough of the example
- Logic diagram
image.png
- Change the training data path
image.png
- Train the user and movie vectors
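# ratings_from_es: the ratings DataFrame prepared earlier in the notebook
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col
# rank=10 gives every user and every movie a 10-dimensional factor vector
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", regParam=0.1, rank=10, seed=42)
model = als.fit(ratings_from_es)
# Inspect the first few learned factor vectors
model.userFactors.show(5)
model.itemFactors.show(5)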
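ALS learns one latent factor vector per user and one per movie, and the predicted rating for a pair is the dot product of the two vectors. The following is a minimal pure-Python sketch of that idea, not the Spark implementation. The helper name `predict_rating` and the 3-dimensional factors are made up for illustration (the model above uses rank=10).

```python
def predict_rating(user_factor, item_factor):
    """Predicted rating = dot product of a user vector and a movie vector."""
    if len(user_factor) != len(item_factor):
        raise ValueError("factor vectors must have the same length")
    return sum(u * v for u, v in zip(user_factor, item_factor))


# Made-up factors for illustration only; real ones would come from
# model.userFactors and model.itemFactors.
user = [1.0, 0.5, 2.0]
movie = [2.0, 1.0, 0.5]
score = predict_rating(user, movie)
print(score)
```

A larger dot product means the model expects this user to rate this movie higher.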
- Write the vectors to Elasticsearch
movie_vectors.write.format("es") \
    .option("es.mapping.id", "id") \
    .option("es.write.operation", "update") \
    .save("demo/movies", mode="append")
user_vectors.write.format("es") \
    .option("es.mapping.id", "id") \
    .option("es.write.operation", "update") \
    .save("demo/users", mode="append")
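Before the write, each factor list has to be serialized into a form the vector scoring plugin can read. The sketch below assumes the plugin expects a space-separated string of `index|value` tokens (a delimited-payload format); that format and the helper name `convert_vector` are assumptions here, so check the plugin README before relying on them.

```python
def convert_vector(values):
    """Serialize a list of floats as 'index|value' tokens joined by spaces.

    Assumed format for the vector scoring plugin; verify against its README.
    """
    return " ".join("%d|%s" % (i, v) for i, v in enumerate(values))


# Made-up 3-dimensional vector for illustration.
encoded = convert_vector([0.1, -0.25, 3.0])
print(encoded)
```

Keeping the position in each token lets the vector be rebuilt in the right order at query time.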
- Query similar movies
display_similar(2628, num=5)
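Conceptually, a similar-movie lookup ranks every other movie by how close its factor vector is to the query movie's vector. The sketch below assumes cosine similarity is the measure and does the ranking in plain Python rather than inside Elasticsearch. The names `cosine_similarity`, `most_similar` and the toy 2-dimensional factors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, item_factors, num=5):
    """Return the num (movie_id, similarity) pairs closest to query_id."""
    query = item_factors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in item_factors.items()
        if other_id != query_id  # skip the query movie itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


# Toy factors for illustration only.
toy_factors = {
    1: [1.0, 0.0],
    2: [2.0, 0.0],
    3: [0.0, 1.0],
    4: [1.0, 1.0],
}
top = most_similar(1, toy_factors, num=2)
print(top)
```

Cosine similarity ignores vector length, so movies 1 and 2 above count as identical in direction even though their magnitudes differ.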
- Notes
The original example file contains an error in the following part; change it to the following:
image.png