The following is based on Hadoop 2.7 + Spark 2.4, set up from a Mac, with three virtual machines (one master, two workers) forming the cluster.
0. Groundwork
0.1 hosts
Edit the file:
vim /etc/hosts
192.168.165.130 hadoop03
192.168.165.129 hadoop02
192.168.165.128 hadoop01
:wq
Distribute it to the other nodes:
scp /etc/hosts root@hadoop01:/etc/
scp /etc/hosts root@hadoop02:/etc/
scp /etc/hosts root@hadoop03:/etc/
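A quick sanity check (my suggestion, not part of the original steps) that each node now resolves its peers:
ping -c 1 hadoop02
ping -c 1 hadoop03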
0.2 Passwordless SSH login
- Generate an SSH key pair for the root user
cd ~/.ssh
ssh-keygen -t rsa
Press Enter through all the prompts.
- Mutual authentication
# Append your own public key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Make a host-tagged copy
cp ~/.ssh/id_rsa.pub ~/.ssh/id_rsa.pub.01
# Send it to the other nodes
scp ~/.ssh/id_rsa.pub.01 root@hadoop02:~/.ssh/
# After every node has exchanged keys this way, append all the public keys to the authorization file (01 shown as the example)
cat id_rsa.pub.0* >> ~/.ssh/authorized_keys
Copy the public keys between every pair of machines that need passwordless login, then append them to the authorized_keys file.
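A minimal check that the exchange worked, run from any node:
ssh root@hadoop02 hostname
# should print hadoop02 without prompting for a password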
0.3 Install Java
Download -> extract -> copy into place
0.4 Install Scala
Download -> extract -> copy into place
0.5 Install Maven
Download -> extract -> copy into place
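Once the environment variables from section 1.5 are in place, a quick version check confirms each tool, e.g.:
java -version
scala -version
mvn -v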
0.6 Install ZooKeeper
- Download -> extract -> copy into place
- Configuration file:
cp ./conf/zoo_sample.cfg ./conf/zoo.cfg
vim ./conf/zoo.cfg
#dataDir=/tmp/zookeeper
dataDir=/usr/local/zookeeper/data
dataLogDir=/usr/local/zookeeper/logs
server.1=hadoop01:2888:3888
server.2=hadoop02:2888:3888
server.3=hadoop03:2888:3888
myid
In dataDir, create a myid file containing the X from the corresponding server.X line, as sketched below.
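A minimal sketch, assuming the dataDir configured above; run the matching line on each node:
# on hadoop01; write 2 on hadoop02 and 3 on hadoop03
mkdir -p /usr/local/zookeeper/data
echo 1 > /usr/local/zookeeper/data/myid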
Start
./bin/zkServer.sh start
Verify
[root@localhost logs]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: follower
1. Install Hadoop
1.1 Download the Hadoop binary package, or the source package (to build yourself)
Download URL: http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
1.2 Extract
tar -zxvf ./hadoop-2.7.7.tar.gz
1.3 Edit the configuration files (high availability)
All of the following files are under /Users/pengjunzhe/Downloads/hadoop-2.7.7/etc/hadoop.
If you do not need high availability, the configuration work is much simpler; plenty of tutorials cover that case, so it is not repeated here.
1.3.1 hadoop-env.sh
# The java implementation to use.
# export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/local/jdk
1.3.2 core-site.xml
<!-- Default file system; with HA enabled this must point at the nameservice defined in hdfs-site.xml -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://ns1</value>
</property>
<!-- Directory for files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
<!-- ZooKeeper quorum addresses -->
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
</property>
1.3.3 hdfs-site.xml
<!-- Set the HDFS nameservice to ns1; must match fs.defaultFS in core-site.xml -->
<property>
<name>dfs.nameservices</name>
<value>ns1</value>
</property>
<!-- ns1 has two NameNodes under it, nn1 and nn2 -->
<property>
<name>dfs.ha.namenodes.ns1</name>
<value>nn1,nn2</value>
</property>
<!-- HTTP address of nn1 -->
<property>
<name>dfs.namenode.http-address.ns1.nn1</name>
<value>hadoop01:50070</value>
</property>
<!-- RPC address of nn1 -->
<property>
<name>dfs.namenode.rpc-address.ns1.nn1</name>
<value>hadoop01:9000</value>
</property>
<!-- HTTP address of nn2 -->
<property>
<name>dfs.namenode.http-address.ns1.nn2</name>
<value>hadoop02:50070</value>
</property>
<!-- RPC address of nn2 -->
<property>
<name>dfs.namenode.rpc-address.ns1.nn2</name>
<value>hadoop02:9000</value>
</property>
<!-- Where the NameNode metadata (edit log) is stored on the JournalNodes -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop01:8485;hadoop02:8485;hadoop03:8485/ns1</value>
</property>
<!-- Local disk path where the JournalNode stores its data -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/hadoop/journal</value>
</property>
<!-- Enable automatic failover between NameNodes -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- Failover proxy provider implementation -->
<property>
<name>dfs.client.failover.proxy.provider.ns1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing method -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- sshfence requires passwordless SSH -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<!-- Max number of failed disks a DataNode tolerates before shutting down; default 0 -->
<!--<property>-->
<!--<name>dfs.datanode.failed.volumes.tolerated</name>-->
<!--<value>2</value>-->
<!--</property>-->
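Once HDFS is running (section 1.9), the HA state of each NameNode can be checked from the command line, e.g.:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2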
1.3.4 mapred-site.xml
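In the stock 2.7.7 distribution this file ships only as a template, so copy it first:
cp mapred-site.xml.template mapred-site.xml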
<!-- History Server settings -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop01:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop01:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.joblist.cache.size</name>
<value>20000</value>
</property>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>128</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx128m -Xms64m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx128m -Xms64m</value>
</property>
<property>
<name>mapreduce.client.submit.file.replication</name>
<value>20</value>
</property>
1.3.5 yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Enable ResourceManager HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- Logical cluster ID for RM HA -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yarn-ha</value>
</property>
<!-- Enable automatic RM failover -->
<property>
<name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- Enable RM state recovery so running applications survive an RM restart -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<!-- The two nodes that run a ResourceManager -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<!-- When an application does not specify a queue, use the user name as its queue name -->
<property>
<name>yarn.scheduler.fair.user-as-default-queue</name>
<value>true</value>
</property>
<!-- How RM state is stored -->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>hadoop01</value>
</property>
<!-- RM1 web UI address -->
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>${yarn.resourcemanager.hostname.rm1}:8088</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>hadoop02</value>
</property>
<!-- RM2 web UI address -->
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>${yarn.resourcemanager.hostname.rm2}:8088</value>
</property>
<!-- Memory available to each NodeManager, in MB -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>
<!-- Total vcores available to each NodeManager -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>3</value>
</property>
<!-- ZooKeeper connection addresses -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
</property>
<!-- Local paths where the NodeManager stores task logs -->
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///usr/local/hadoop/data1/yarn/log,file:///usr/local/hadoop/data2/yarn/log</value>
</property>
<!-- Total memory used by the ApplicationMaster, in MB -->
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>512</value>
</property>
<!-- Minimum physical memory a single container can request -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<!-- Maximum physical memory a single container can request -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- Minimum vcores a single container can request -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- Maximum vcores a single container can request -->
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- How long aggregated logs are retained, in seconds -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>3600</value>
</property>
<!-- Where aggregated logs are stored -->
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/usr/local/hadoop/data/yarn-logs</value>
</property>
<!-- Use the Fair Scheduler -->
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
<name>yarn.scheduler.fair.allocation.file</name>
<value>/usr/local/hadoop/etc/hadoop/fair-scheduler.xml</value>
</property>
</configuration>
1.3.5.0 fair-scheduler.xml
<allocations>
<pool name="group1">
<maxResources>50000 mb, 10 vcores</maxResources>
<maxRunningApps>10</maxRunningApps>
<weight>1.0</weight>
<schedulingPolicy>fair</schedulingPolicy>
</pool>
<pool name="group2">
<maxResources>80000 mb, 20 vcores</maxResources>
<maxRunningApps>20</maxRunningApps>
<weight>1.0</weight>
<schedulingPolicy>fair</schedulingPolicy>
</pool>
<userMaxAppsDefault>99</userMaxAppsDefault>
<queuePlacementPolicy>
<rule name="promaryGroup" create="false" />
<rule name="secondaryGroupExistingQueue" craete="false" />
<rule name="reject" />
</queuePlacementPolicy>
</allocations>
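After YARN is up, the pools defined above should appear on the scheduler page at http://hadoop01:8088/cluster/scheduler, or can be listed from the command line:
mapred queue -list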
1.3.6 slaves
hadoop01
hadoop02
hadoop03
1.4 Distribute
(The later steps assume the tree ends up at /usr/local/hadoop on each node; rename or adjust paths as needed.)
scp -r ./hadoop-2.7.7/ root@hadoop01:/usr/local/hadoop/
scp -r ./hadoop-2.7.7/ root@hadoop02:/usr/local/hadoop/
scp -r ./hadoop-2.7.7/ root@hadoop03:/usr/local/hadoop/
1.5 Set environment variables
vim ~/.bashrc
export JAVA_HOME=/usr/local/jdk
export MAVEN_HOME=/usr/local/maven
export SCALA_HOME=/usr/local/scala
export ZK_HOME=/usr/local/zookeeper
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$MAVEN_HOME/bin:$SCALA_HOME/bin:$ZK_HOME/bin:$HADOOP_HOME/bin
source ~/.bashrc
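A quick check that the new PATH took effect:
hadoop version
# should report Hadoop 2.7.7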
1.6 Start the JournalNodes (all VMs)
Note that hadoop-daemons.sh (plural) launches the daemon on every host listed in slaves, so running it once on hadoop01 covers all three nodes.
/usr/local/hadoop/sbin/hadoop-daemons.sh start journalnode
Verify:
jps
9324 JournalNode
9373 Jps
1.7 Initialize HDFS (run on hadoop01)
hdfs namenode -format
Formatting creates a directory according to hadoop.tmp.dir in core-site.xml; scp that directory to the other NameNode host (hadoop02 here).
scp -r /usr/local/hadoop/tmp/ root@hadoop02:/usr/local/hadoop/
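Alternatively (the standard HA bootstrap, if you prefer it to copying the directory by hand), run this on hadoop02:
hdfs namenode -bootstrapStandby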
1.8 Format ZK (run on hadoop01)
hdfs zkfc -formatZK
1.9 Start HDFS and YARN (run on hadoop01)
./sbin/start-all.sh
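Note that in Hadoop 2.7 start-yarn.sh only launches the ResourceManager on the node it runs on, so the standby RM on hadoop02 usually has to be started by hand:
/usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager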
1.10 Verify the installation and startup
1.10.1 Check processes
jps
11776 QuorumPeerMain
15840 DataNode
16026 JournalNode
16524 Jps
15742 NameNode
16206 DFSZKFailoverController
16414 NodeManager
1.10.2 Check the RMs
yarn rmadmin -getServiceState rm1
active
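The other ResourceManager should report standby:
yarn rmadmin -getServiceState rm2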
1.10.3 Check the web UIs
- Standby NameNode
http://hadoop01:50070/dfshealth.html#tab-overview
- Active NameNode
http://hadoop02:50070/dfshealth.html#tab-overview
(one of the two NameNodes will be active and the other standby; which is which depends on the ZKFC election)
- Active RM
http://hadoop01:8088
1.10.4 Run an example job
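A classic smoke test, using the examples jar that ships with 2.7.7:
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar pi 5 10
# the job should finish with an estimate of Pi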
2. Install Spark
2.1 Download the Spark binary package, or the source package (to build yourself)
Download URL: http://spark.apache.org/downloads.html
2.2 Extract
tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz
2.3 Edit the configuration files
2.3.1 spark-env.sh
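spark-env.sh ships only as a template in the stock distribution; copy it before editing:
cp ./conf/spark-env.sh.template ./conf/spark-env.sh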
export JAVA_HOME=/usr/local/jdk
export SPARK_MASTER_HOST=hadoop01
export SPARK_MASTER_PORT=7077
2.3.2 slaves
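This file also ships as a template; copy it first, then list the worker hosts:
cp ./conf/slaves.template ./conf/slaves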
hadoop01
hadoop02
hadoop03
2.4 Distribute
Copy the directory to every node:
scp -r ./spark-2.4.0-bin-hadoop2.7 root@hadoop01:/usr/local/spark
scp -r ./spark-2.4.0-bin-hadoop2.7 root@hadoop02:/usr/local/spark
scp -r ./spark-2.4.0-bin-hadoop2.7 root@hadoop03:/usr/local/spark
2.5 Set environment variables
vim ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
:wq
source ~/.bashrc
2.6 Start
./sbin/start-all.sh
2.7 Verify
- hadoop01 should run both a Master and a Worker; the other hosts run only a Worker.
jps
22741 Worker
22668 Master
- Web management UI
http://hadoop01:8080
- Run an example job:
./bin/run-example SparkPi 10
# the output should contain a line like
Pi is roughly 3.1385551385551387
3. Three ways to run Spark
3.1 run-example
./bin/run-example SparkPi 10
3.2 spark-submit
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop01:7077 \
--driver-memory 512m \
--executor-memory 512m \
--total-executor-cores 2 \
/usr/local/spark/examples/jars/spark-examples_2.11-2.4.0.jar \
100
3.3 spark-shell
spark-shell \
--master spark://hadoop01:7077
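Once connected, a one-line job confirms the shell can reach the cluster; here it is fed via a heredoc so the whole check stays scriptable (a sketch, relying on spark-shell executing statements from stdin):
./bin/spark-shell --master spark://hadoop01:7077 <<'EOF'
sc.parallelize(1 to 100).count()
EOF
# should print a line like: res0: Long = 100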