Combining TensorFlow with Spark is exciting, but the first decision is which language to use for the machine-learning code. The four main options are Python, Java, Scala, and R. Python has the lowest barrier to entry, while Java and Scala are harder for most people to handle, so we will start with the Python toolchain.
First, assume you already have a Mac Pro with:
- Python 3.5 or 3.6 installed
- JDK 8 (avoid JDK 9 for now; it still has many problems)
- TensorFlow and PySpark installed locally
- Spark 2.1 and Hadoop 2.8.1 installed locally via Homebrew
To wire Spark and TensorFlow together, you need the spark-deep-learning library. Clone William's fork rather than the original Databricks repo: he extended it with new features, and some of the methods used later exist only in his fork, for example
from sparkdl import TextEstimator
from sparkdl.transformers.easy_feature import EasyFeature
Both of these are his own extensions and are very handy.
https://github.com/allwefantasy/spark-deep-learning
mkdir sparkdl && cd sparkdl
git clone https://github.com/allwefantasy/spark-deep-learning.git . # note the trailing dot
git checkout release
spark-deep-learning alone is not enough: you also need tensorframes, the bridge through which TensorFlow interacts with Spark's underlying data structures. spark-deep-learning itself depends on tensorframes.
https://github.com/databricks/tensorframes/
git clone https://github.com/databricks/tensorframes.git
If you try to install these two packages directly with pip, you will find that they are not in the PyPI repository, so you have to pip-install them as local packages. That is not hard, and fortunately William gave me some tips. Installing a local package with pip boils down to these steps:
1. Create a setup.py file. You can use William's as a reference:
   https://github.com/allwefantasy/spark-deep-learning/blob/release/python/setup.py
2. In setup.py, configure the package's attributes, file paths, and other metadata.
3. Run the build; the key command is
   python setup.py bdist_wheel
   which generates a binary wheel named packageName-version-py3.whl.
4. Enter the generated dist directory and run in a terminal:
   pip install packageName-version-py3.whl   # [anaconda]
   pip3 install packageName-version-py3.whl  # [python 3.6]
5. Verify the installation with pip list and pip3 list.
6. import the package to check that the module can actually be used.
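The wheel filename in step 3 simply encodes the metadata you wrote in setup.py, following the `{name}-{version}-{python tag}-{abi tag}-{platform tag}.whl` convention from PEP 427. As a quick sanity check, here is a minimal sketch (the `parse_wheel_name` helper is my own illustration, not part of any of these projects) that splits a wheel filename back into its fields:

```python
def parse_wheel_name(filename):
    """Split a simple wheel filename into its metadata fields.

    Wheel names follow {name}-{version}-{python tag}-{abi tag}-{platform tag}.whl
    (PEP 427), which is why the name and version from setup.py show up
    verbatim in the generated file. (Build-tagged wheels have a sixth
    field; this sketch ignores that case.)
    """
    stem = filename[: -len(".whl")]
    name, version, py_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": py_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }

print(parse_wheel_name("sparkdl-0.2.2-py3-none-any.whl"))
```

If the name or version printed here does not match what you expect, fix setup.py before installing.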
Note for step 2: the directory containing setup.py must also contain the package's Python source files. Otherwise the whl file will still be generated, but the resulting package is an empty shell that cannot be used. Also, when writing setup.py, make sure the package name and version match the real ones; otherwise, even though the real package is installed, other packages that depend on it will not be able to find it.
sparkdl's setup.py:
import codecs
import os
from setuptools import setup, find_packages
# See this web page for explanations:
# https://hynek.me/articles/sharing-your-labor-of-love-pypi-quick-and-dirty/
PACKAGES = ["sparkdl"]
KEYWORDS = ["spark", "deep learning", "distributed computing", "machine learning"]
CLASSIFIERS = [
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Natural Language :: English",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Topic :: Scientific/Engineering",
]
# Project root
ROOT = os.path.abspath(os.path.dirname(__file__))
setup(
name="sparkdl",
description="Integration tools for running deep learning on Spark",
license="Apache 2.0",
url="https://github.com/allwefantasy/spark-deep-learning",
version="0.2.2",
author="Joseph Bradley",
author_email="joseph@databricks.com",
maintainer="Tim Hunter",
maintainer_email="timhunter@databricks.com",
keywords=KEYWORDS,
packages=find_packages(),
classifiers=CLASSIFIERS,
zip_safe=False,
include_package_data=True
)
tensorframes' setup.py:
import codecs
import os
from setuptools import setup, find_packages
# See this web page for explanations:
# https://hynek.me/articles/sharing-your-labor-of-love-pypi-quick-and-dirty/
PACKAGES = ["tensorframes"]
KEYWORDS = ["spark", "deep learning", "distributed computing", "machine learning"]
CLASSIFIERS = [
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Natural Language :: English",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Topic :: Scientific/Engineering",
]
# Project root
ROOT = os.path.abspath(os.path.dirname(__file__))
setup(
name="tensorframes",
description="Integration tools for running deep learning on Spark",
license="Apache 2.0",
url="https://github.com/databricks/tensorframes",
version="0.2.9",
author="Joseph Bradley",
author_email="joseph@databricks.com",
maintainer="Tim Hunter",
maintainer_email="timhunter@databricks.com",
keywords=KEYWORDS,
packages=find_packages(),
classifiers=CLASSIFIERS,
zip_safe=False,
include_package_data=True
)
然后 先 安裝 sparkdl 哄孤,進入 spark-deeplearning 目錄 打開Terminal
cd ./python && python setup.py bdist_wheel && cd dist
pip install sparkdl-0.2.2-py3-none-any.whl
pip3 install sparkdl-0.2.2-py3-none-any.whl
When installing tensorframes, note that the python directory at the root of the tensorframes repo does not contain the corresponding source files. You must copy the two directories under ./src/main/python/, namely tensorframes and tensorframes_snippets, into the root-level python directory first; otherwise tensorframes will install but be unusable. Also make sure pyspark is installed beforehand, or the package will not work either.
cp -r ./src/main/python/* ./python/  # -r is needed because these are directories
cd ./python && python setup.py bdist_wheel && cd dist
pip install tensorframes-0.2.9-py3-none-any.whl
pip3 install tensorframes-0.2.9-py3-none-any.whl
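The copy step above can also be scripted in Python, which is handy if you rebuild often. A minimal sketch, assuming the tensorframes repo layout described above (the `stage_sources` helper is my own illustration, not part of tensorframes):

```python
import shutil
from pathlib import Path


def stage_sources(repo_root, packages=("tensorframes", "tensorframes_snippets")):
    """Copy package sources from src/main/python into the root-level
    python directory, so that bdist_wheel actually bundles them."""
    repo_root = Path(repo_root)
    src = repo_root / "src" / "main" / "python"
    dst = repo_root / "python"
    for pkg in packages:
        # dirs_exist_ok requires Python 3.8+; on older interpreters,
        # remove the destination directory first.
        shutil.copytree(src / pkg, dst / pkg, dirs_exist_ok=True)
    return dst
```

Run it from the repo root before `python setup.py bdist_wheel`, and the wheel will contain real modules instead of an empty shell.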
For reference, here is William's one-liner:
cd ./python && python setup.py bdist_wheel && cd dist && pip uninstall sparkdl && pip install ./sparkdl-0.2.2-py2-none-any.whl && cd ..
然后我們在pycharm里就可以愉快的使用了