1. Versions
spark-2.4.4-bin-hadoop2.7.tgz
hadoop-2.7.7.tar.gz
scala-2.11.12.tgz
jdk-8u391-linux-x64.tar.gz
python3.7 (Python 3.8 and later are not compatible with Spark 2.4; Spark 2.4 requires Python 3.5+; see the pip sketch after this list)
pyspark==3.4.2
findspark==2.0.1
nebula-graph==3.4.0
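The Python dependencies can be installed with pip; a minimal sketch, assuming a Python 3.7 environment is already active (nebula-graph above is read here as the NebulaGraph server version rather than a pip package, so it is not installed this way):
python3.7 -m pip install pyspark==3.4.2 findspark==2.0.1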
2. Installation
Set up passwordless SSH login
ssh-keygen -t rsa
ssh-copy-id root@node03
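A quick check that key-based login works (it should print the hostname without prompting for a password):
ssh root@node03 hostname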
tar zxvf spark-2.4.4-bin-hadoop2.7.tgz -C /usr/local
tar zxvf hadoop-2.7.7.tar.gz -C /usr/local
tar zxvf scala-2.11.12.tgz -C /usr/local
tar zxvf jdk-8u391-linux-x64.tar.gz -C /usr/local
vim /etc/profile
export JAVA_HOME=/usr/local/jdk1.8.0_391
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export SPARK_HOME=/usr/local/spark-2.4.4-bin-hadoop2.7
export SCALA_HOME=/usr/local/scala-2.11.12
export HADOOP_HOME=/usr/local/hadoop-2.7.7
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
export PATH=$PATH:$MAVEN_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
source /etc/profile
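After reloading the profile, the toolchain can be sanity-checked; the reported versions should match section 1:
java -version
scala -version
hadoop version
spark-submit --version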
3. Configure Hadoop and Spark
3.1 Hadoop (cd /usr/local/hadoop-2.7.7/)
etc/hadoop/slaves
node03
Add to etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/local/jdk1.8.0_391
etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node03:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop-2.7.7/data</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>node03:50090</value>
    </property>
</configuration>
etc/hadoop/mapred-site.xml (copied from etc/hadoop/mapred-site.xml.template)
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Add the following to start-all.sh, stop-all.sh, start-dfs.sh, stop-dfs.sh, start-yarn.sh, and stop-yarn.sh under the Hadoop sbin directory:
HDFS_DATANODE_USER=root
YARN_RESOURCEMANAGER_USER=root
HDFS_NAMENODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
YARN_NODEMANAGER_USER=root
3.2 Spark (/usr/local/spark-2.4.4-bin-hadoop2.7)
conf/slaves (copied from conf/slaves.template)
node03
Add to conf/spark-env.sh (copied from conf/spark-env.sh.template):
export SPARK_MASTER_HOST=node03
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/
4. Startup
# Format HDFS (only needed before the first start)
hdfs namenode -format
/usr/local/hadoop-2.7.7/sbin/start-dfs.sh
/usr/local/hadoop-2.7.7/sbin/start-yarn.sh
/usr/local/spark-2.4.4-bin-hadoop2.7/sbin/start-all.sh
[root@node03 ~]# jps
29265 DataNode
21826 Master
100405 Jps
25269 NodeManager
29430 SecondaryNameNode
29591 ResourceManager
21898 Worker
29135 NameNode
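With all daemons running, the web UIs should respond as well; a quick check from the shell (assuming curl is available), using the ports listed below:
curl -s -o /dev/null -w "%{http_code}\n" http://node03:50070   # HDFS NameNode UI
curl -s -o /dev/null -w "%{http_code}\n" http://node03:8088    # YARN ResourceManager UI
curl -s -o /dev/null -w "%{http_code}\n" http://node03:8080    # Spark Master UI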
Spark ports:
1) Port for viewing the jobs of the currently running spark-shell: 4040
2) Spark Master internal communication port: 7077 (comparable to Hadoop's 8020/9000 port; see the SparkPi example after this list)
3) Spark Standalone Master web UI port: 8080 (comparable to the Hadoop YARN web UI port 8088 for viewing running jobs)
4) Spark history server port: 18080 (comparable to the Hadoop history server port 19888)
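As a smoke test of the standalone cluster, the SparkPi example bundled with Spark can be submitted to the master on port 7077 (the example jar path is assumed from the standard spark-2.4.4-bin-hadoop2.7 layout):
spark-submit --master spark://node03:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 100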
Hadoop ports:
1. HDFS:
NameNode: default ports 8020 (RPC) and 50070 (web UI).
DataNode: default ports 50010 (data transfer), 50020 (IPC), and 50075 (web UI).
SecondaryNameNode: default port 50090 (web UI).
2. YARN:
ResourceManager: default ports 8032 (RPC) and 8088 (web UI).
NodeManager: default port 8042 (web UI).
3. MapReduce:
JobHistoryServer: default ports 10020 (RPC) and 19888 (web UI).
4. HBase:
HMaster: default ports 16000 (RPC) and 16010 (web UI).
RegionServer: default ports 16020 (RPC) and 16030 (web UI).
5. ZooKeeper:
Default port 2181.
NameNode: maintains the namespace of the Hadoop file system
8021 JobTracker: coordinates MapReduce jobs (Hadoop 1.x)
DataNode: stores data blocks and serves them to clients
50060 TaskTracker: runs MapReduce tasks (Hadoop 1.x)
Secondary NameNode: periodically merges the HDFS edit log and ships the result back to the NameNode
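The listening ports can also be checked locally; a sketch assuming the ss tool from iproute2 is installed:
ss -lntp | grep -E ':9000|:50070|:8088|:7077|:8080'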
5. pyspark
Build nebula-spark-connector-3.0.0.jar; when packaging finishes, the jar is under nebula-spark-connector/nebula-spark-connector/target
$ git clone https://github.com/vesoft-inc/nebula-spark-connector.git -b v3.0.0
$ cd nebula-spark-connector/nebula-spark-connector
$ mvn clean package -Dmaven.test.skip=true -Dgpg.skip -Dmaven.javadoc.skip=true
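The PySpark script below looks for the jar at /root/nebula-spark-connector-3.0.0.jar (that location is an assumption taken from the script), so copy it there after packaging:
cp target/nebula-spark-connector-3.0.0.jar /root/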
Load nebula-spark-connector-3.0.0.jar from Python 3.7, read Nebula vertices and edges, and save them to HDFS:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Local Spark session with the Nebula Spark Connector jar on the classpath
spark = SparkSession.builder.master("local[*]").config(
    "spark.jars", "/root/nebula-spark-connector-3.0.0.jar").config(
    "spark.driver.extraClassPath", "/root/nebula-spark-connector-3.0.0.jar").appName(
    "nebula-connector").getOrCreate()

# Read the "user" tag (vertices) from the "player" space
df = spark.read.format("com.vesoft.nebula.connector.NebulaDataSource").option(
    "type", "vertex").option(
    "spaceName", "player").option(
    "label", "user").option(
    "returnCols", "name,age").option(
    "metaAddress", "<nebula-meta-ip>:9559").option(
    "partitionNumber", 60).load()

# Alternative: read the "relation" tag (vertices)
# df = spark.read.format("com.vesoft.nebula.connector.NebulaDataSource").option(
#     "type", "vertex").option(
#     "spaceName", "player").option(
#     "label", "relation").option(
#     "returnCols", "create_date,type,sub_type,values").option(
#     "metaAddress", "<nebula-meta-ip>:9559").option(
#     "partitionNumber", 60).load()
# df.show(n=2)
# df.write.format("csv").save("./relation", header=True)

# Alternative: read the "has" edges
# df = spark.read.format("com.vesoft.nebula.connector.NebulaDataSource").option(
#     "type", "edge").option(
#     "spaceName", "player").option(
#     "label", "has").option(
#     "returnCols", "sdate,rtype").option(
#     "metaAddress", "<nebula-meta-ip>:9559").option(
#     "partitionNumber", 60).load()

df.show(n=2)
# Write the result to HDFS as CSV
df.write.format("csv").save("hdfs://node03:9000/frame", header=True)
spark.stop()
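Before merging, make sure the target directory used below exists in HDFS (assuming /input has not been created yet):
hdfs dfs -mkdir -p /input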
# The output is split into small part files (about 1.3 MB each by default); merge the HDFS files
hdfs dfs -cat /frame/part-* | hdfs dfs -copyFromLocal - /input/frame.csv
# Delete the temporary directory
hdfs dfs -rmr /frame
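Optionally verify the merged file (a sketch; the listing and contents depend on the data):
hdfs dfs -ls /input
hdfs dfs -cat /input/frame.csv | head -n 5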