Requirements
The main program is split into multiple submodules for easier reuse: util.py, module1.py, module2.py, and main.py.
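For illustration, a minimal sketch of how the pieces might fit together; the helper names below (load_data, transform, save) are hypothetical, not from the original project:

```python
# main.py -- minimal sketch; helper names are hypothetical
import util
import module1
import module2

if __name__ == "__main__":
    data = util.load_data()           # hypothetical helper in util.py
    result = module1.transform(data)  # hypothetical helper in module1.py
    module2.save(result)              # hypothetical helper in module2.py
```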
Solution
Since main.py depends on util.py, module1.py, and module2.py, those modules must first be packed into a single .zip file and shipped to YARN via spark-submit's --py-files option; only then can main.py import them. The command is as follows:
$ spark-submit --master=yarn --deploy-mode=cluster --jars elasticsearch-hadoop-5.3.1.jar --py-files deps.zip main.py
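The deps.zip archive can be built with any zip tool; the sketch below (a hypothetical build_deps.py, not part of the original post) uses Python's zipfile module and keeps the .py files at the top level of the archive, which --py-files requires (see the references at the end):

```python
# build_deps.py -- a minimal sketch for packaging the submodules
import zipfile

with zipfile.ZipFile("deps.zip", "w") as zf:
    for name in ("util.py", "module1.py", "module2.py"):
        # arcname defaults to the given path, so each .py file
        # lands at the top level of the zip
        zf.write(name)
```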
Oozie spark-action
Append the corresponding **--py-files** option inside <spark-opts> (deps.zip must already be present at the HDFS path referenced below):
<spark-opts>
${OTHER_OPTS} --py-files hdfs://${HDFS_HOST}:${HDFS_PORT}/${DEP_PATH}/deps.zip
</spark-opts>
Software versions
Tested OK with Spark 1.6.2 and Oozie 4.2.
References
Third party packages: If you require a third party package to be installed on the executors, do so when setting up the node (for example with pip).
Scripts: If you require custom scripts, create a *.zip (or *.egg) file containing all of your dependencies and ship it to the executors using the --py-files command line option. Two things to look out for:
- Make sure that you also include ‘nested dependencies’.
- Make sure that your *.py files are at the top level of the *.zip file.