Spark's built-in parallel and distributed computing makes it an excellent platform for big data analysis. Many operations that used to be done in a database can now be processed directly in Spark, and very quickly, far faster than set operations inside a database, so I decided to dive into Spark.
The first step is setting up the Spark environment. A single-machine Spark environment is quite simple to set up, and there are many options: Docker, a virtual machine, or simply downloading and extracting the distribution.
The environment setup mainly covers the following aspects:
1星掰、SSH環(huán)境多望,Spark有很多種部署方式嫩舟,local、standalone怀偷、集群家厌,都需要SSH免登陸設(shè)置,SSH的免登陸設(shè)置只需要查找ssh-keygen證書免登陸設(shè)置就可以查到椎工,如果是單機(jī)疹蛉,要確保單機(jī)SSH不需要密碼官地,如果是集群,要確保集群間SSH不需要密碼(此處個(gè)人認(rèn)為應(yīng)該是master和slaves之間即可,不需要slaves之間的配置思恐。有人配置過的可以請(qǐng)明示一下L帷@植骸U殷荨)。
2八千、解壓你的Spark:spark-2.3.1-bin-hadoop2.7.tgz吗讶,當(dāng)然有需要的也會(huì)連帶安裝Hadoop(解壓hadoop-2.7.7.tar),往往Hadoop和Spark版本是需要對(duì)應(yīng)的恋捆,不然會(huì)出錯(cuò)照皆。
3. Java. Newer versions of Spark require JDK 1.8 or above, preferably the vendor's JDK installed from the rpm package.
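A quick sanity check that the JDK is installed and recent enough:
java -version   # should report 1.8.0 or later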
4愤钾、配置環(huán)境瘟滨,如~/.bashrc,spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh環(huán)境能颁。
export JAVA_HOME=/usr/local/java/jdk1.8.0_192
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export HADOOP_HOME=/usr/local/hadoop-2.7.7
export SPARK_HOME=/usr/local/spark-2.3.1-bin-hadoop2.7
export PATH=${JAVA_HOME}/bin:${JRE_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}
Then reload the bash environment:
source ~/.bashrc
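To confirm the variables took effect, these commands should all print version information:
java -version
hadoop version
spark-submit --version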
In addition, you need to set up Spark's runtime environment by editing spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh:
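If conf/spark-env.sh does not exist yet, create it from the template shipped with the distribution before adding the exports below:
cp /usr/local/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh.template /usr/local/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh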
export JAVA_HOME=/usr/local/java/jdk1.8.0_192
export HADOOP_HOME=/usr/local/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/usr/local/spark-2.3.1-bin-hadoop2.7
export SPARK_LIBRARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_HOST=127.0.0.1
export SPARK_LOCAL_DIRS=/usr/local/spark-2.3.1-bin-hadoop2.7
export SPARK_YARN_USER_ENV="CLASSPATH=/usr/local/hadoop-2.7.7/etc/hadoop"
5. Let Hadoop (YARN) use Spark's shuffle service by copying the shuffle jar into YARN's classpath:
cp /usr/local/spark-2.3.1-bin-hadoop2.7/yarn/spark-2.3.1-yarn-shuffle.jar /usr/local/hadoop-2.7.7/share/hadoop/yarn/lib
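A quick check that the jar landed where YARN can see it (the spark_shuffle service configured below will not start without it):
ls /usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/spark-*-yarn-shuffle.jar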
6伙菊、配置hadoop設(shè)置/hadoop-2.7.7/etc/hadoop下的core-site.xml败玉,hdfs-site.xml,mapred_site.xml镜硕,yarn-site.xml运翼。詳細(xì)如下:
core-site.xml
<configuration>
  <!-- Default filesystem: the HDFS NameNode address -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://127.0.0.1:8020</value>
  </property>
  <!-- Base directory for temporary data -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop-2.7.7/data/tmp</value>
  </property>
  <!-- ZooKeeper quorum (only used for HA setups) -->
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>127.0.0.1:2181</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <!-- Replication factor for HDFS blocks (1 for a single node) -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- NameNode metadata directory -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop-2.7.7/dfs/name</value>
  </property>
  <!-- DataNode data directory -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop-2.7.7/dfs/data</value>
  </property>
  <!-- Number of NameNode RPC handler threads -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
  <!-- Enable the WebHDFS REST API -->
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <!-- Run MapReduce jobs on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <final>true</final>
  </property>
  <!-- JobHistory server RPC address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>127.0.0.1:10020</value>
  </property>
  <!-- JobHistory web UI address -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>127.0.0.1:19888</value>
  </property>
</configuration>
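Note that start-all.sh does not launch the JobHistory server; to get the history UI on port 19888, start it separately:
/usr/local/hadoop-2.7.7/sbin/mr-jobhistory-daemon.sh start historyserver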
yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <!-- Auxiliary services run by the NodeManager: MapReduce and Spark shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <!-- Handler classes for the two shuffle services -->
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
  <!-- NodeManager log directory -->
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/usr/local/hadoop-2.7.7/logs</value>
  </property>
  <!-- ResourceManager hostname -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>127.0.0.1</value>
  </property>
  <!-- ResourceManager RPC address -->
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>127.0.0.1:8032</value>
  </property>
  <!-- ResourceManager web UI address -->
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>127.0.0.1:8088</value>
  </property>
  <!-- With JDK 1.8, disable the physical/virtual memory checks so containers are not killed for exceeding memory limits -->
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
7. Startup. First, format the NameNode (in Docker I skipped the formatting at first, and nothing would work until I did it):
/usr/local/hadoop-2.7.7/bin/hadoop namenode -format
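In Hadoop 2.x the hadoop namenode form is deprecated; the equivalent current command is:
/usr/local/hadoop-2.7.7/bin/hdfs namenode -format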
Then start Hadoop:
/usr/local/hadoop-2.7.7/sbin/start-all.sh
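Once start-all.sh returns, jps should show the Hadoop daemons; on a single node I would expect roughly:
jps
# NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager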
Next, start Spark:
/usr/local/spark-2.3.1-bin-hadoop2.7/sbin/start-all.sh
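jps should now additionally show the Master and Worker processes, and the Spark master web UI should respond at http://127.0.0.1:8080. As a minimal smoke test, assuming the examples jar shipped with the 2.3.1 distribution (adjust the jar name if yours differs):
/usr/local/spark-2.3.1-bin-hadoop2.7/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://127.0.0.1:7077 \
  /usr/local/spark-2.3.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.1.jar 10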
8六剥、關(guān)閉,首先關(guān)閉spark
/usr/local/spark-2.3.1-bin-hadoop2.7/sbin/stop-all.sh
Then stop Hadoop:
/usr/local/hadoop-2.7.7/sbin/stop-all.sh