What is Spark dynamic allocation?
It allows Spark to add or remove executors dynamically based on the workload at runtime.
-
The chart below shows the resources Spark actually uses versus the resources allocated under static resource allocation.
-
The chart below shows the resources Spark actually uses versus the resources allocated under dynamic resource allocation.
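To make the scaling behaviour concrete, here is a rough Python sketch of the request policy Spark documents: while tasks remain backlogged, the number of executors requested doubles each round, bounded by the configured maximum. This is a simplified illustration, not Spark's actual scheduler code.

```python
def executors_requested(rounds, min_execs=1, max_execs=8):
    """Simulate Spark's exponential executor ramp-up while a task
    backlog persists: requests go 1, 2, 4, ... capped at max_execs."""
    target = min_execs
    history = []
    for _ in range(rounds):
        history.append(target)
        target = min(target * 2, max_execs)
    return history

print(executors_requested(5))  # [1, 2, 4, 8, 8]
```

When the backlog clears and executors sit idle past the idle timeout, they are released again, which is what produces the close fit between used and allocated resources in the dynamic chart.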
When should Spark dynamic allocation be used?
-
Any job that performs large shuffle fetches. Because executors can be removed at any time, the external shuffle service is required so that shuffle files outlive the executors that wrote them.
Steps to enable Spark dynamic allocation
1. Set environment variables
$ export HADOOP_CONF_DIR=/hadoop/hadoop-2.7XX/etc/hadoop
$ export HADOOP_HOME=/hadoop/hadoop-2.7XX
$ export SPARK_HOME=/hadoop/spark-2.4.0-bin-hadoop2.7
$ hds=(`cat ${HADOOP_CONF_DIR}/slaves` 'namenode1' 'namenode2')
# hds holds the hostnames of every machine in the hadoop cluster
# remember to unset hds at the end
2. YARN configuration
-
Back up yarn-site.xml
$ for i in ${hds[@]} ; do echo $i ; ssh $i "cp ${HADOOP_CONF_DIR}/yarn-site.xml ${HADOOP_CONF_DIR}/yarn-site.xml.pre_spark_shuffle.bak" ; done ;
$ for i in ${hds[@]} ; do echo $i ; ssh $i "ls ${HADOOP_CONF_DIR} | grep pre_spark_shuffle.bak" ; done ;
-
Modify yarn-site.xml
$ more ${HADOOP_CONF_DIR}/yarn-site.xml | grep -B 1 -A 2 "aux-services"
-
output
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
-
Broadcast yarn-site.xml
$ for i in ${hds[@]} ; do echo $i ; scp ${HADOOP_CONF_DIR}/yarn-site.xml ${i}:${HADOOP_CONF_DIR}/ ; done ;
$ for i in ${hds[@]} ; do echo $i ; ssh $i "cat ${HADOOP_CONF_DIR}/yarn-site.xml | grep -B 1 -A 2 'aux-services' " ; done ;
-
Check heapsize
$ more ${HADOOP_CONF_DIR}/yarn-env.sh | grep "YARN_HEAPSIZE"
-
output
YARN_HEAPSIZE=2000 # adjust for your environment
-
Check yarn classpath
$ yarn classpath | sed -r 's/:/\n/g'
$ more ${HADOOP_CONF_DIR}/yarn-site.xml | grep "yarn.application.classpath"
# if this finds nothing, YARN uses the default classpath, which includes $HADOOP_HOME/share/hadoop/yarn/
-
Check yarn shuffle jar
$ find ${SPARK_HOME} -iname "*yarn-shuffle.jar" # expected result: spark-2.4.0-yarn-shuffle.jar
-
Copy yarn shuffle jar
$ for i in ${hds[@]} ; do echo $i ; scp `find ${SPARK_HOME} -iname "*yarn-shuffle.jar"` ${i}:$HADOOP_HOME/share/hadoop/yarn/ ; done ;
$ for i in ${hds[@]} ; do echo $i ; ssh $i "ls -ltr $HADOOP_HOME/share/hadoop/yarn/ | grep shuffle" ; done ;
-
Restart yarn
$ bash $HADOOP_HOME/sbin/stop-yarn.sh
$ bash $HADOOP_HOME/sbin/start-yarn.sh
-
Check that the YARN daemons are running
$ for i in ${hds[@]} ; do echo $i ; ssh $i ". /etc/profile ; jps" | grep -i manager ; done;
3. Spark configuration
-
spark-defaults.conf
$ more ${SPARK_HOME}/conf/spark-defaults.conf
-
and add the following entries:
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         4
spark.dynamicAllocation.executorIdleTimeout  60
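As a quick sanity check on entries in this format, here is a small illustrative Python sketch (not part of Spark) that parses spark-defaults.conf-style key/value lines and verifies the two switches are on and the executor bounds are consistent:

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf-style lines (whitespace-separated
    key/value pairs) into a dict, skipping blanks and comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split(None, 1)
        conf[key] = value.strip()
    return conf

conf = parse_spark_defaults("""
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         4
spark.dynamicAllocation.executorIdleTimeout  60
""")

# sanity checks: both switches on, and minExecutors <= maxExecutors
assert conf["spark.dynamicAllocation.enabled"] == "true"
assert conf["spark.shuffle.service.enabled"] == "true"
assert int(conf["spark.dynamicAllocation.minExecutors"]) <= \
       int(conf["spark.dynamicAllocation.maxExecutors"])
```

Note that spark.shuffle.service.enabled must be true whenever dynamic allocation is on: it tells executors to hand shuffle data to the external YarnShuffleService configured in step 2.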