Reference: configuring a local standalone PySpark
https://www.cnblogs.com/jackchen-Net/p/6667205.html#_label3
Create a file named pyspark.pth under site-packages and add the line D:\apps\spark\spark-2.2.1-bin-hadoop2.7\python to it.
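If you prefer to script this step, here is a minimal sketch (it assumes site.getsitepackages() points at the interpreter you use for Spark; adjust the Spark path to your own install):
import os
import site
# Sketch only: write pyspark.pth into this interpreter's site-packages directory
site_packages = site.getsitepackages()[-1]
pth_file = os.path.join(site_packages, "pyspark.pth")
with open(pth_file, "w") as f:
    f.write(r"D:\apps\spark\spark-2.2.1-bin-hadoop2.7\python" + "\n")
print("wrote", pth_file)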
Submitting a .py file fails with an error:
ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
This means the Hadoop executable (winutils) is missing, because we never installed Hadoop.
Also refer to this article for the Hadoop installation (remember, only Hadoop itself):
http://blog.csdn.net/yaoqiwaimai/article/details/59114881
winutils.exe chmod 777 c:\tmp\hive (create the folder first, then run the command)
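The same step can also be done from Python; this is just a sketch and assumes winutils.exe ends up under D:\apps\hadoop\bin (replace with your own Hadoop bin folder):
import os
import subprocess
hive_tmp = r"C:\tmp\hive"
if not os.path.isdir(hive_tmp):
    os.makedirs(hive_tmp)  # create the folder first
# Assumed location of winutils.exe; point this at your own Hadoop bin folder
subprocess.check_call([r"D:\apps\hadoop\bin\winutils.exe", "chmod", "777", hive_tmp])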
Do not run spark-shell afterwards (that is the Scala shell; our goal is pyspark).
The winutils.exe version was wrong (the Hadoop version here is 2.7.3), so download it again (leave your email if you need it). Unzip hadoop.dll-and-winutils.exe-for-hadoop2.7.3-on-windows_X64-master.zip and copy all of its files into Hadoop's bin folder.
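A sketch of that copy step in Python; both paths below are placeholders for wherever you unzipped the archive and your own Hadoop bin folder:
import glob
import os
import shutil
# Placeholder paths: extraction folder of the zip and HADOOP_HOME\bin
src = r"D:\downloads\hadoop.dll-and-winutils.exe-for-hadoop2.7.3-on-windows_X64-master"
dst = r"D:\apps\hadoop\bin"
for path in glob.glob(os.path.join(src, "*")):
    if os.path.isfile(path):
        shutil.copy(path, dst)  # copies winutils.exe, hadoop.dll, etc.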
Submitting the script again still gives an error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
Switching from JDK 9 back to JDK 8 made the problem go away (a self-inflicted pit).
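A quick sanity check for which JVM Spark will pick up (assumes java is on PATH and/or JAVA_HOME is set):
import os
import subprocess
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))  # should point at a JDK 8 install
subprocess.call(["java", "-version"])              # should print e.g. java version "1.8.0_xxx"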
Too many log messages: in SPARK_HOME's conf folder, copy log4j.properties.template, drop the .template suffix, and in the copy change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console.
Test script for spark-submit:
# encoding: utf-8
import os
import sys
# os.environ['SPARK_HOME'] = r"D:\apps\spark\spark-2.2.1-bin-hadoop2.7"
# os.environ['HADOOP_HOME'] = r"D:\apps\spark\"
# You might need to enter your local IP
# os.environ['SPARK_LOCAL_IP']="192.168.2.138"
# Path for pyspark and py4j
sys.path.append(r"D:\apps\spark\spark-2.2.1-bin-hadoop2.7\python")
# sys.path.append(r"D:\apps\spark\spark-2.2.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip")  # not needed after pip install py4j
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
sc = SparkContext(master='local')
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
print("統(tǒng)計結(jié)果", words.count())