(The cluster-setup part draws on kiwenlau/hadoop-cluster-docker, but the base environment there is Ubuntu while I'm using CentOS 7 here, so I stepped in plenty of pits of my own!)
Table of Contents
一搂橙、編輯Hadoop運行環(huán)境中的配置文件
二孙乖、使用Dockerfile制作Hadoop的鏡像
三、激情的排坑之旅
四份氧、最終的文件內(nèi)容
- config/core-site.xml
- config/hadoop-env.sh
- config/hdfs-site.xml
- config/mapred-site.xml
- config/run-wordcount.sh
- config/slaves
- config/ssh_config
- config/start-hadoop.sh
- config/yarn-site.xml
- Dockerfile
- start-container.sh
- stop-container.sh
- remove-container.sh
- resize-cluster.sh
Appendix: Plain Command-Line Version
一唯袄、編輯Hadoop運行環(huán)境中的配置文件
- Create the folders and files
First, create a folder to hold everything, add a config subfolder, and create a few empty files.
$ mkdir -p hadoop-docker/config
$ cd hadoop-docker
$ touch Dockerfile start-container.sh config/ssh_config config/start-hadoop.sh config/run-wordcount.sh
- Copy in the 64-bit Hadoop build
Copy the pre-built 64-bit Hadoop tarball into the current directory. (For how to build 64-bit Hadoop, see here.)
- Copy out Hadoop's stock configuration files
Unpack the 64-bit Hadoop tarball and copy a few of its configuration files out to modify (typing them all from scratch on the command line would be exhausting!).
$ export version=2.7.3
$ tar -xzvf hadoop-$version.tar.gz
$ cp hadoop-$version/etc/hadoop/core-site.xml config/core-site.xml
$ cp hadoop-$version/etc/hadoop/hadoop-env.sh config/hadoop-env.sh
$ cp hadoop-$version/etc/hadoop/hdfs-site.xml config/hdfs-site.xml
$ cp hadoop-$version/etc/hadoop/mapred-site.xml.template config/mapred-site.xml
$ cp hadoop-$version/etc/hadoop/yarn-site.xml config/yarn-site.xml
- Edit the configuration file: ssh_config
Hadoop nodes communicate with each other over SSH, so configure ssh_config for password-less login. Use vi to edit ssh_config in the config folder and add the following:
Host localhost
  StrictHostKeyChecking no
Host 0.0.0.0
  StrictHostKeyChecking no
Host hadoop-*
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
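Before blaming Hadoop later, it's worth checking that password-less login actually works once the containers are up; a quick test from the master (hostname assumed to match the slaves file):
$ ssh -o BatchMode=yes hadoop-slave1 hostname
hadoop-slave1
BatchMode=yes makes ssh fail immediately instead of prompting for a password, so an error here points at the keys or this config file rather than at Hadoop.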
- Edit the configuration file: core-site.xml
Use vi to edit core-site.xml in the config folder and add the following inside the configuration element:
<!-- Specify the NameNode address -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000/</value>
</property>
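Once the cluster is running you can confirm this setting was picked up; a quick sanity check (assuming Hadoop's bin directory is on PATH, as the Dockerfile arranges later):
$ hdfs getconf -confKey fs.defaultFS
hdfs://hadoop-master:9000/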
- Edit the configuration file: hdfs-site.xml
Use vi to edit hdfs-site.xml in the config folder and add the following inside the configuration element:
<!-- Specify the HDFS NameNode storage directory -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///root/hdfs/namenode</value>
<description>NameNode directory for namespace and transaction logs storage.</description>
</property>
<!-- Specify the HDFS DataNode storage directory -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///root/hdfs/datanode</value>
<description>DataNode directory</description>
</property>
<!-- Specify the number of replicas HDFS keeps of each block -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
- Edit the configuration file: mapred-site.xml
Use vi to edit mapred-site.xml in the config folder and add the following inside the configuration element:
<!-- Tell Hadoop to run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
- Edit the configuration file: yarn-site.xml
Use vi to edit yarn-site.xml in the config folder and add the following inside the configuration element:
<!-- NodeManagers fetch data via shuffle -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- Specify the address of YARN's boss, the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
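After start-hadoop.sh has run, a quick way to confirm the NodeManagers actually registered with this ResourceManager is the yarn CLI; both slaves should show up in RUNNING state:
$ yarn node -list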
- Edit the environment script: hadoop-env.sh
Use vi to edit hadoop-env.sh in the config folder, find the JAVA_HOME setting, and change it to the following (leave everything else untouched):
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
(Heh, this line actually has a problem; more on that when we get to the pitfalls!)
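Incidentally, instead of guessing where yum put the JDK, you can resolve it from inside a running container; a small sketch (the printed path depends on the exact package version installed):
$ readlink -f $(which java)
$ dirname $(dirname $(readlink -f $(which java)))
The second command strips the trailing /bin/java, giving a usable JAVA_HOME candidate.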
- Edit the slave-node list: slaves
Use vi to edit slaves in the config folder and list the slave hostnames:
hadoop-slave1
hadoop-slave2
A script introduced later can regenerate this file automatically from the desired slave count.
- Edit the Hadoop startup script: start-hadoop.sh
Use vi to edit start-hadoop.sh in the config folder; it runs the commands that start Hadoop on the master:
#!/bin/bash
echo -e "\n"
$HADOOP_HOME/sbin/start-dfs.sh
echo -e "\n"
$HADOOP_HOME/sbin/start-yarn.sh
echo -e "\n"
- Edit the script that runs the WordCount demo: run-wordcount.sh
Use vi to edit run-wordcount.sh in the config folder; it runs Hadoop's introductory WordCount example:
#!/bin/bash
# test the hadoop cluster by running wordcount
# create input files
mkdir input
echo "Hello Docker" >input/file2.txt
echo "Hello Hadoop" >input/file1.txt
# create input directory on HDFS
hadoop fs -mkdir -p input
# put input files to HDFS
hdfs dfs -put ./input/* input
# run wordcount
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.3-sources.jar org.apache.hadoop.examples.WordCount input output
# print the input files
echo -e "\ninput file1.txt:"
hdfs dfs -cat input/file1.txt
echo -e "\ninput file2.txt:"
hdfs dfs -cat input/file2.txt
# print the output of wordcount
echo -e "\nwordcount output:"
hdfs dfs -cat output/part-r-00000
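For reference, with the two input files above, a successful run should end with output roughly like this (WordCount emits key-count pairs sorted by key):
wordcount output:
Docker	1
Hadoop	1
Hello	2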
二络它、使用Dockerfile制作Hadoop的鏡像
Hadoop鏡像中到相關配置文件和腳本都寫好了,開始編輯Dockerfile并制作Hadoop的鏡像歪赢。
使用vi打開Dockerfile,開始編輯其中的內(nèi)容单料。
- Add the base image and basic metadata
The base image here is a CentOS 7 environment with systemd enabled as the init manager; see the earlier article (Building 64-bit Hadoop with Docker) for how to create it.
# Base image
FROM centos7-systemd
# Image maintainer (fill in your own info)
MAINTAINER "you" <your@email.here>
# Working directory
WORKDIR /root
- Install the software the runtime needs
Install the Java JDK and OpenSSH.
RUN yum update -y && \
    yum install -y java-1.7.0-openjdk \
    openssh-server
(This actually still isn't enough; again, more when we get to the pitfalls.)
- Copy in Hadoop and install it
An environment variable is used here so the version is easy to change later.
# Copy in Hadoop
ENV HADOOP_VERSION=2.7.3
COPY hadoop-$HADOOP_VERSION.tar.gz /root/hadoop-$HADOOP_VERSION.tar.gz
# Install it
RUN tar -xzvf hadoop-$HADOOP_VERSION.tar.gz && \
    mv hadoop-$HADOOP_VERSION /usr/local/hadoop && \
    rm hadoop-$HADOOP_VERSION.tar.gz
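A side note on this step: Docker's ADD instruction auto-extracts a local tar.gz, so the same result could be had more compactly; a sketch of that alternative (the version here sticks with COPY + RUN, which keeps each step explicit):
# ADD auto-extracts a local tarball into the destination directory
ADD hadoop-$HADOOP_VERSION.tar.gz /usr/local/
RUN mv /usr/local/hadoop-$HADOOP_VERSION /usr/local/hadoop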
- Set environment variables
Set JAVA_HOME and HADOOP_HOME.
ENV JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
ENV HADOOP_HOME=/usr/local/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
(A variable is missing here, and JAVA_HOME has a problem too; see the pitfalls section...)
- Set up password-less SSH login
RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Copy the configuration files into Hadoop
COPY config/* /tmp/
RUN mv /tmp/ssh_config ~/.ssh/config && \
    mv /tmp/hadoop-env.sh /usr/local/hadoop/etc/hadoop/hadoop-env.sh && \
    mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \
    mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
    mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
    mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
    mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves && \
    mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
    mv /tmp/run-wordcount.sh ~/run-wordcount.sh
RUN chmod +x ~/start-hadoop.sh && \
    chmod +x ~/run-wordcount.sh && \
    chmod +x $HADOOP_HOME/sbin/start-dfs.sh && \
    chmod +x $HADOOP_HOME/sbin/start-yarn.sh
(There's a permissions problem lurking here as well; pitfalls section...)
- Set up the nodes
Create the directories and format HDFS.
RUN mkdir -p ~/hdfs/namenode && \
    mkdir -p ~/hdfs/datanode && \
    mkdir $HADOOP_HOME/logs
RUN $HADOOP_HOME/bin/hdfs namenode -format
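If you're curious whether the format step really baked metadata into the image, peek inside a running container; a quick check (the file names below are what a 2.7.x format typically writes and may vary):
$ ls /root/hdfs/namenode/current
VERSION  fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid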
- Start sshd when the container launches
The CentOS 7 base image has systemd enabled as the init daemon, so the ssh service is started here via systemctl.
CMD [ "sh", "-c", "systemctl start sshd; bash"]
With the Dockerfile done, add one more script that launches a Hadoop cluster with a specified number of nodes:
#!/bin/bash
# Default to 3 nodes (one master, two slaves)
N=${1:-3}
# Start the Hadoop master container
sudo docker rm -f hadoop-master &> /dev/null
echo "start hadoop-master container..."
sudo docker run -itd \
--net=hadoop \
-p 50070:50070 \
-p 8088:8088 \
--name hadoop-master \
--hostname hadoop-master \
hadoop-docker &> /dev/null
# Start the Hadoop slave containers
i=1
while [ $i -lt $N ]
do
sudo docker rm -f hadoop-slave$i &> /dev/null
echo "start hadoop-slave$i container..."
sudo docker run -itd \
--net=hadoop \
--name hadoop-slave$i \
--hostname hadoop-slave$i \
hadoop-docker &> /dev/null
i=$(( $i + 1 ))
done
# Drop into a shell in the hadoop-master container
sudo docker exec -it hadoop-master bash
(There are problems here too; later...)
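One prerequisite worth stating explicitly: --net=hadoop assumes a Docker bridge network named hadoop already exists, which this script does not create. A one-time setup plus a usage sketch:
$ sudo docker network create --driver=bridge hadoop   # one-time, before the first run
$ ./start-container.sh       # default: 1 master + 2 slaves
$ ./start-container.sh 5     # 1 master + 4 slaves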
Alright, everything is ready, let's run it! ...... Huh, what just happened?! ......
三躬厌、激情的排坑之旅
果然沒有一切順利的,從弄鏡像開始就出問題了竞帽,一一排查解決吧烤咧!想直接看最后的正確內(nèi)容可以直接看【四、最終的文件內(nèi)容】或者【附:命令行純凈版】抢呆。
- Building the image: hdfs, command not found
$ docker build -t hadoop-docker .
Error: hdfs command not found. Reading the error output more carefully, the real message was:
libexec/hdfs-config.sh: No such file or directory
Some digging showed an environment variable was missing; add one line to the environment-variable block of the Dockerfile:
ENV HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
- Starting Hadoop: JAVA_HOME, no such directory
Rebuilt, and this time the build succeeded. Launch the full set of cluster containers with the script (the default 3 nodes):
$ ./start-container.sh
After dropping into the hadoop-master shell, start Hadoop with the script:
$ ./start-hadoop.sh
Error: /usr/lib/jvm/java-1.7.0-openjdk: No such file or directory
That path was just a guess based on yum install locations reported online, so it's probably wrong. Since we're already inside the container, let's just go look:
$ ls /usr/lib/jvm
java-1.7.0-openjdk-1.7.0.131-2.6.9.0.el7_3.x86_64  jre  jre-1.7.0  jre-1.7.0-openjdk
jre-1.7.0-openjdk-1.7.0.131-2.6.9.0.el7_3.x86_64   jre-openjdk
Ugh... so the directory name carries the full patch version. Does that mean the path changes whenever the package is updated, and I'd have to install first, check the exact version number, then rebuild every time?! Since this is only a runtime environment anyway, I decided to try the jre symlink instead: change JAVA_HOME to jre-1.7.0-openjdk in both the Dockerfile and config/hadoop-env.sh.
- Starting Hadoop: ssh, connection failed
Rebuilt and started Hadoop again; still an error: ssh could not connect. A quick test showed the ssh service wasn't running at all. Starting it directly:
$ systemctl start sshd
Error: Failed to get D-Bus connection
There are plenty of reports of this error online; it's apparently a notorious bug with CentOS 7 under Docker. The eventual fix is to modify the docker run commands that start the master and slave nodes in our start-container.sh, adding the following:
sudo docker run -itd \
        --privileged -e "container=docker" -v /sys/fs/cgroup:/sys/fs/cgroup \
        --net=hadoop \
        -p 50070:50070 \
        -p 8088:8088 \
        --name hadoop-master \
        --hostname hadoop-master \
        hadoop-docker &> /dev/null \
        /usr/sbin/init
(The second line, --privileged -e "container=docker" -v /sys/fs/cgroup:/sys/fs/cgroup, and the final line, /usr/sbin/init, are the two new additions.)
This makes the container run /usr/sbin/init at startup, which in turn brings up the D-Bus service.
- Starting Hadoop: ssh, bad permissions
Started the containers again, started Hadoop again, and got yet another error:
Bad owner or permissions on ~/.ssh/config
The fix turned up online as well: set that config file's permissions to 600 at image-build time, i.e. add one line to the Dockerfile:
RUN chmod 600 ~/.ssh/config
- Starting Hadoop: ssh, unable to connect
Rebuilt, restarted the containers, restarted Hadoop, and ssh still refuses to connect!
So I had only installed the server, not the client; no wonder nothing could connect! Change the package installation in the Dockerfile to:
RUN yum update -y && \
    yum install -y java-1.7.0-openjdk \
    openssh-server \
    openssh-clients
- Starting Hadoop: hdfs, command not found
Still not done! This time the cursed "hdfs command not found" was back. I stayed stuck for ages on the hint about the missing hdfs-config.sh file, until another careful read of the output revealed a different error:
which: command not found
Wait... which is a command here? A quick check confirmed it: the which utility simply wasn't installed!! Modify the package installation in the Dockerfile once more:
RUN yum update -y && \
    yum install -y java-1.7.0-openjdk \
    openssh-server \
    openssh-clients \
    which
Rebuild... start the containers again... start Hadoop again... and it finally works normally...
Run WordCount:
$ ./run-wordcount.sh
That works correctly too!
四须妻、最終的文件內(nèi)容
這里列出全部文件的內(nèi)容仔蝌,覺得麻煩也可以直接訪問項目hadoop-centos-docker。
config/core-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Specify the NameNode address -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000/</value>
</property>
</configuration>
config/hadoop-env.sh
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk
# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
if [ "$HADOOP_CLASSPATH" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
else
export HADOOP_CLASSPATH=$f
fi
done
# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""
# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"
# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol. This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
# Where log files are stored. $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""
###
# Advanced Users Only!
###
# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
config/hdfs-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Specify the HDFS NameNode storage directory -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///root/hdfs/namenode</value>
<description>NameNode directory for namespace and transaction logs storage.</description>
</property>
<!-- Specify the HDFS DataNode storage directory -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///root/hdfs/datanode</value>
<description>DataNode directory</description>
</property>
<!-- Specify the number of replicas HDFS keeps of each block -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
config/mapred-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Tell Hadoop to run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
config/run-wordcount.sh
#!/bin/bash
# test the hadoop cluster by running wordcount
# create input files
mkdir input
echo "Hello Docker" >input/file2.txt
echo "Hello Hadoop" >input/file1.txt
# create input directory on HDFS
hadoop fs -mkdir -p input
# put input files to HDFS
hdfs dfs -put ./input/* input
# run wordcount
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.3-sources.jar org.apache.hadoop.examples.WordCount input output
# print the input files
echo -e "\ninput file1.txt:"
hdfs dfs -cat input/file1.txt
echo -e "\ninput file2.txt:"
hdfs dfs -cat input/file2.txt
# print the output of wordcount
echo -e "\nwordcount output:"
hdfs dfs -cat output/part-r-00000
config/slaves
hadoop-slave1
hadoop-slave2
config/ssh_config
Host localhost
StrictHostKeyChecking no
Host 0.0.0.0
StrictHostKeyChecking no
Host hadoop-*
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
config/start-hadoop.sh
#!/bin/bash
echo -e "\n"
$HADOOP_HOME/sbin/start-dfs.sh
echo -e "\n"
$HADOOP_HOME/sbin/start-yarn.sh
echo -e "\n"
config/yarn-site.xml
<?xml version="1.0"?>
<configuration>
<!-- NodeManagers fetch data via shuffle -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- Specify the address of YARN's boss, the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
</configuration>
Dockerfile
# Base image
FROM centos7-systemd
# Image maintainer (fill in your own info)
MAINTAINER "you" <your@email.here>
# Working directory
WORKDIR /root
# Install packages
RUN yum update -y && \
yum install -y java-1.7.0-openjdk \
openssh-server \
openssh-clients \
which
# Copy in Hadoop
ENV HADOOP_VERSION=2.7.3
COPY hadoop-$HADOOP_VERSION.tar.gz /root/hadoop-$HADOOP_VERSION.tar.gz
# Install it
RUN tar -xzvf hadoop-$HADOOP_VERSION.tar.gz && \
mv hadoop-$HADOOP_VERSION /usr/local/hadoop && \
rm hadoop-$HADOOP_VERSION.tar.gz
# Set environment variables
ENV JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk
ENV HADOOP_HOME=/usr/local/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
ENV HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
# Password-less ssh login
RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Copy config files and set permissions
COPY config/* /tmp/
RUN mv /tmp/ssh_config ~/.ssh/config && \
mv /tmp/hadoop-env.sh /usr/local/hadoop/etc/hadoop/hadoop-env.sh && \
mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \
mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves && \
mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
mv /tmp/run-wordcount.sh ~/run-wordcount.sh
RUN chmod 600 ~/.ssh/config && \
chmod +x ~/start-hadoop.sh && \
chmod +x ~/run-wordcount.sh && \
chmod +x $HADOOP_HOME/sbin/start-dfs.sh && \
chmod +x $HADOOP_HOME/sbin/start-yarn.sh
# Format the NameNode
RUN mkdir -p ~/hdfs/namenode && \
mkdir -p ~/hdfs/datanode && \
mkdir $HADOOP_HOME/logs
RUN $HADOOP_HOME/bin/hdfs namenode -format
# Start the ssh service after the container launches
CMD [ "sh", "-c", "systemctl start sshd; bash"]
start-container.sh
#!/bin/bash
# Default to 3 nodes (one master, two slaves)
N=${1:-3}
# Start the Hadoop master container
sudo docker rm -f hadoop-master &> /dev/null
echo "start hadoop-master container..."
sudo docker run -itd \
--privileged -e "container=docker" -v /sys/fs/cgroup:/sys/fs/cgroup \
--net=hadoop \
-p 50070:50070 \
-p 8088:8088 \
--name hadoop-master \
--hostname hadoop-master \
hadoop-docker &> /dev/null \
/usr/sbin/init
# Start the Hadoop slave containers
i=1
while [ $i -lt $N ]
do
sudo docker rm -f hadoop-slave$i &> /dev/null
echo "start hadoop-slave$i container..."
sudo docker run -itd \
--privileged -e "container=docker" -v /sys/fs/cgroup:/sys/fs/cgroup \
--net=hadoop \
--name hadoop-slave$i \
--hostname hadoop-slave$i \
hadoop-docker &> /dev/null \
/usr/sbin/init
i=$(( $i + 1 ))
done
# Drop into a shell in the hadoop-master container
sudo docker exec -it hadoop-master bash
stop-container.sh
A new script that stops all the master and slave containers.
#!/bin/bash
# Default to 3 nodes (one master, two slaves)
N=${1:-3}
# Stop the Hadoop master container
sudo docker container stop hadoop-master
echo "stop hadoop-master container..."
# Stop the Hadoop slave containers
i=1
while [ $i -lt $N ]
do
echo "stop hadoop-slave$i container..."
sudo docker container stop hadoop-slave$i
i=$(( $i+1 ))
done
remove-container.sh
Failed builds leave behind some <none> images in Docker, and their dependencies make them a pain to clean up; this script collects cleanup commands gathered from around the web into one place. A boon for anyone with cleanliness compulsions!
#!/bin/bash
# Default to the none images
name=${1:-none}
# Remove exited containers and the matching images
docker ps -a | grep "Exited" | awk '{print $1 }' |xargs docker stop
docker ps -a | grep "Exited" | awk '{print $1 }' |xargs docker rm
docker images| grep $name | awk '{print $3 }' |xargs docker rmi
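If your Docker is recent enough (1.13+), the built-in prune commands cover much the same ground; a simpler alternative:
$ docker container prune -f    # remove all stopped containers
$ docker image prune -f        # remove dangling (<none>) images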
resize-cluster.sh
A script that changes the slave-node count and rebuilds the image.
#!/bin/bash
# N is the node number of hadoop cluster
N=$1
if [ $# = 0 ]
then
echo "Please specify the node number of hadoop cluster!"
exit 1
fi
# change slaves file
i=1
rm config/slaves
while [ $i -lt $N ]
do
echo "hadoop-slave$i" >> config/slaves
((i++))
done
echo -e "\nrebuild docker hadoop image!\n"
# rebuild hadoop image
sudo docker build -t hadoop-docker .
# clear none image
./remove-container.sh
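Putting the scripts together, resizing to five nodes and relaunching would look like:
$ ./resize-cluster.sh 5
$ ./start-container.sh 5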
Appendix: Plain Command-Line Version
[xxx@localhost ~]$ mkdir -p hadoop-docker/config
[xxx@localhost ~]$ cd hadoop-docker
[xxx@localhost hadoop-docker]$ touch Dockerfile start-container.sh config/ssh_config config/start-hadoop.sh config/run-wordcount.sh
[xxx@localhost hadoop-docker]$ export version=2.7.3
[xxx@localhost hadoop-docker]$ cp ../hadoop-src/hadoop-$version-src/hadoop-dist/target/hadoop-$version.tar.gz hadoop-$version.tar.gz
[xxx@localhost hadoop-docker]$ tar -xzvf hadoop-$version.tar.gz
[xxx@localhost hadoop-docker]$ cp hadoop-$version/etc/hadoop/core-site.xml config/core-site.xml
[xxx@localhost hadoop-docker]$ cp hadoop-$version/etc/hadoop/hadoop-env.sh config/hadoop-env.sh
[xxx@localhost hadoop-docker]$ cp hadoop-$version/etc/hadoop/hdfs-site.xml config/hdfs-site.xml
[xxx@localhost hadoop-docker]$ cp hadoop-$version/etc/hadoop/mapred-site.xml.template config/mapred-site.xml
[xxx@localhost hadoop-docker]$ cp hadoop-$version/etc/hadoop/yarn-site.xml config/yarn-site.xml
[xxx@localhost hadoop-docker]$ vi config/core-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Specify the NameNode address -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000/</value>
</property>
</configuration>
~
~
~
[xxx@localhost hadoop-docker]$ vi config/hadoop-env.sh
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk
# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
if [ "$HADOOP_CLASSPATH" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
else
export HADOOP_CLASSPATH=$f
fi
done
# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""
# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"
# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol. This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
# Where log files are stored. $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""
###
# Advanced Users Only!
###
# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
~
~
~
[xxx@localhost hadoop-docker]$ vi config/hdfs-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Specify the HDFS NameNode storage directory -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///root/hdfs/namenode</value>
<description>NameNode directory for namespace and transaction logs storage.</description>
</property>
<!-- Specify the HDFS DataNode storage directory -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///root/hdfs/datanode</value>
<description>DataNode directory</description>
</property>
<!-- Specify the number of replicas HDFS keeps of each block -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
~
~
~
[xxx@localhost hadoop-docker]$ vi config/mapred-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Tell Hadoop to run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
~
~
~
[xxx@localhost hadoop-docker]$ vi config/run-wordcount.sh
#!/bin/bash
# test the hadoop cluster by running wordcount
# create input files
mkdir input
echo "Hello Docker" >input/file2.txt
echo "Hello Hadoop" >input/file1.txt
# create input directory on HDFS
hadoop fs -mkdir -p input
# put input files to HDFS
hdfs dfs -put ./input/* input
# run wordcount
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.3-sources.jar org.apache.hadoop.examples.WordCount input output
# print the input files
echo -e "\ninput file1.txt:"
hdfs dfs -cat input/file1.txt
echo -e "\ninput file2.txt:"
hdfs dfs -cat input/file2.txt
# print the output of wordcount
echo -e "\nwordcount output:"
hdfs dfs -cat output/part-r-00000
~
~
~
[xxx@localhost hadoop-docker]$ vi config/slaves
hadoop-slave1
hadoop-slave2
~
~
~
[xxx@localhost hadoop-docker]$ vi config/ssh_config
Host localhost
StrictHostKeyChecking no
Host 0.0.0.0
StrictHostKeyChecking no
Host hadoop-*
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
~
~
~
[xxx@localhost hadoop-docker]$ vi config/start-hadoop.sh
#!/bin/bash
echo -e "\n"
$HADOOP_HOME/sbin/start-dfs.sh
echo -e "\n"
$HADOOP_HOME/sbin/start-yarn.sh
echo -e "\n"
~
~
~
[xxx@localhost hadoop-docker]$ vi config/yarn-site.xml
<?xml version="1.0"?>
<configuration>
<!-- NodeManagers fetch data via shuffle -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- Specify the address of YARN's boss, the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
</configuration>
~
~
~
[xxx@localhost hadoop-docker]$ vi Dockerfile
# Base image
FROM centos7-systemd
# Image maintainer (fill in your own info)
MAINTAINER "you" <your@email.here>
# Working directory
WORKDIR /root
# Install packages
RUN yum update -y && \
yum install -y java-1.7.0-openjdk \
openssh-server \
openssh-clients \
which
# Copy in Hadoop
ENV HADOOP_VERSION=2.7.3
COPY hadoop-$HADOOP_VERSION.tar.gz /root/hadoop-$HADOOP_VERSION.tar.gz
# Install it
RUN tar -xzvf hadoop-$HADOOP_VERSION.tar.gz && \
mv hadoop-$HADOOP_VERSION /usr/local/hadoop && \
rm hadoop-$HADOOP_VERSION.tar.gz
# Set environment variables
ENV JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk
ENV HADOOP_HOME=/usr/local/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
ENV HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
# Password-less ssh login
RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Copy config files and set permissions
COPY config/* /tmp/
RUN mv /tmp/ssh_config ~/.ssh/config && \
mv /tmp/hadoop-env.sh /usr/local/hadoop/etc/hadoop/hadoop-env.sh && \
mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \
mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves && \
mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
mv /tmp/run-wordcount.sh ~/run-wordcount.sh
RUN chmod 600 ~/.ssh/config && \
chmod +x ~/start-hadoop.sh && \
chmod +x ~/run-wordcount.sh && \
chmod +x $HADOOP_HOME/sbin/start-dfs.sh && \
chmod +x $HADOOP_HOME/sbin/start-yarn.sh
# Format the NameNode
RUN mkdir -p ~/hdfs/namenode && \
mkdir -p ~/hdfs/datanode && \
mkdir $HADOOP_HOME/logs
RUN $HADOOP_HOME/bin/hdfs namenode -format
# Start the ssh service after the container launches
CMD [ "sh", "-c", "systemctl start sshd; bash"]
~
~
~
[xxx@localhost hadoop-docker]$ vi start-container.sh
#!/bin/bash
# Default to 3 nodes (one master, two slaves)
N=${1:-3}
# Start the Hadoop master container
sudo docker rm -f hadoop-master &> /dev/null
echo "start hadoop-master container..."
sudo docker run -itd \
--privileged -e "container=docker" -v /sys/fs/cgroup:/sys/fs/cgroup \
--net=hadoop \
-p 50070:50070 \
-p 8088:8088 \
--name hadoop-master \
--hostname hadoop-master \
hadoop-docker &> /dev/null \
/usr/sbin/init
# Start the Hadoop slave containers
i=1
while [ $i -lt $N ]
do
sudo docker rm -f hadoop-slave$i &> /dev/null
echo "start hadoop-slave$i container..."
sudo docker run -itd \
--privileged -e "container=docker" -v /sys/fs/cgroup:/sys/fs/cgroup \
--net=hadoop \
--name hadoop-slave$i \
--hostname hadoop-slave$i \
hadoop-docker &> /dev/null \
/usr/sbin/init
i=$(( $i + 1 ))
done
# Drop into a shell in the hadoop-master container
sudo docker exec -it hadoop-master bash
~
~
~
[xxx@localhost hadoop-docker]$ sudo docker build -t hadoop-docker .
[xxx@localhost hadoop-docker]$ ./start-container.sh
@hadoop-master[root@hadoop-master ~]# ./start-hadoop.sh
@hadoop-master[root@hadoop-master ~]# ./run-wordcount.sh
@hadoop-master[root@hadoop-master ~]# exit