Setting Up a Hadoop Cluster on Alibaba Cloud ECS Servers
Introduction
Hadoop is an open-source framework for distributed computing. Its core components are the Hadoop Distributed File System (HDFS) and the MapReduce computation model, with MapReduce jobs in Hadoop scheduled by the YARN resource-management component. Google's three most famous distributed-systems results are Bigtable, the Google File System, and MapReduce; HDFS is an open-source implementation of the Google File System, and Hadoop MapReduce is an open-source implementation of Google's MapReduce. This article describes how to set up a Hadoop cluster on several Alibaba Cloud ECS servers. To make the configuration below easier to follow, here is a brief sketch of the HDFS architecture: an HDFS cluster consists of one NameNode, which manages the distributed file system and stores its metadata (the file-system namespace), and a number of DataNodes, which store the actual file data. The NameNode is therefore also called the master node, and the DataNodes are sometimes called slave nodes. (This tutorial is largely based on [1].)
Steps for Setting Up the Hadoop Cluster
Purchasing ECS Servers on Alibaba Cloud
The first step is to purchase the ECS servers that will run Hadoop (note that all servers must be in the same availability zone). Since my application needs a large amount of memory, I chose the ecs.r5.2xlarge instance type, with 64 GB of RAM and 8 vCPU cores per server, and bought 9 instances in total. You can choose each server's hostname during purchase; for convenience I set the hostname of the server that will act as the Hadoop NameNode (master node) to hadoop-master, and named the remaining servers, which will act as DataNodes (slave nodes), "hadoop-slave001" through "hadoop-slave008".
Configuring Passwordless SSH Within the Cluster
Adding a hadoop User
First, create a user named hadoop on every machine. For convenience I gave the hadoop account sudo privileges on all machines; the required commands are shown below (run them as root):
useradd hadoop
passwd hadoop
usermod -aG wheel hadoop
After creating the hadoop account from the root account, log out of root and log in as hadoop; all subsequent operations are performed as the hadoop user.
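A quick way to confirm the new account really has sudo rights (a minimal sanity check of my own; note that the wheel group grants sudo on CentOS, whereas Debian-based systems use the sudo group instead):
su - hadoop
sudo whoami   # should print "root" once the wheel membership is in effect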
Next, configure the /etc/hosts file on every machine (sudo required). In this file, add a mapping from every hostname in the cluster to its private IP address (note: the private IP, not the public one). After the edit the file should look like this:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
[private IP of hadoop-master] hadoop-master
[private IP of hadoop-slave001] hadoop-slave001
[private IP of hadoop-slave002] hadoop-slave002
[private IP of hadoop-slave003] hadoop-slave003
[private IP of hadoop-slave004] hadoop-slave004
[private IP of hadoop-slave005] hadoop-slave005
[private IP of hadoop-slave006] hadoop-slave006
[private IP of hadoop-slave007] hadoop-slave007
[private IP of hadoop-slave008] hadoop-slave008
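A quick sanity check (my own addition, assuming the hostnames above): run the loop below from any node; every host should resolve and respond:
for host in hadoop-master $(seq -f "hadoop-slave%03g" 1 8); do
    ping -c 1 -W 2 "$host" > /dev/null && echo "$host OK" || echo "$host unreachable"
done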
Setting Up SSH Key-Based Login
Hadoop needs passwordless SSH between the machines in order to communicate, so the next step is to set up key-based login. First, generate a local SSH key pair on every machine:
ssh-keygen -b 4096
Then copy the public key to all of the machines:
ssh-copy-id hadoop@hadoop-master
ssh-copy-id hadoop@hadoop-slave001
ssh-copy-id hadoop@hadoop-slave002
ssh-copy-id hadoop@hadoop-slave003
ssh-copy-id hadoop@hadoop-slave004
ssh-copy-id hadoop@hadoop-slave005
ssh-copy-id hadoop@hadoop-slave006
ssh-copy-id hadoop@hadoop-slave007
ssh-copy-id hadoop@hadoop-slave008
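The same commands in loop form, plus a passwordless-login check (a convenience sketch assuming the hostnames above; with BatchMode=yes, ssh fails instead of prompting, so any host that still asks for a password shows up as an error):
for host in hadoop-master $(seq -f "hadoop-slave%03g" 1 8); do
    ssh-copy-id "hadoop@$host"                     # still asks for the password once per host
done
for host in hadoop-master $(seq -f "hadoop-slave%03g" 1 8); do
    ssh -o BatchMode=yes "hadoop@$host" hostname   # prints the remote hostname if key login works
done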
Installing Hadoop
Installing the JDK
Hadoop runs on the JVM, so the Java development kit (JDK) needs to be installed first, as shown below:
sudo yum update
sudo yum install java-1.8.0-openjdk-devel
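To verify the installation (the exact update number may differ):
java -version   # should report something like: openjdk version "1.8.0_212"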
Downloading the Prebuilt Hadoop Binaries
Prebuilt Hadoop binaries can be found at https://hadoop.apache.org/releases.html. The application I need to run was developed on Hadoop 2, so I downloaded version 2.7.7. If you download a 3.x release instead, parts of the configuration below differ slightly (for example, the slaves configuration file of 2.x is named workers in 3.x, and the HDFS web UI listens on a different port).
The commands to download and unpack the Hadoop 2.7.7 binaries are:
cd ~
wget https://www-us.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar xvf hadoop-2.7.7.tar.gz
mv hadoop-2.7.7 hadoop
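Before any configuration, the unpacked distribution can already be exercised directly as a quick check:
~/hadoop/bin/hadoop version   # the first line should read: Hadoop 2.7.7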
Configuring Environment Variables
Next, to make commands such as hdfs convenient to invoke, add the Hadoop directories to the PATH environment variable:
vim ~/.bash_profile
Then add the following line before the "export PATH" line:
PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH
and save and exit the file.
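The new PATH only takes effect in a fresh login shell; to apply it to the current session:
source ~/.bash_profile
which hdfs   # should print /home/hadoop/hadoop/bin/hdfs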
Configuring Hadoop
First, Hadoop needs to be told where the JDK is installed. To find the location, run:
update-alternatives --display java
This prints output like the following, where xxx/bin/java is the location of the java binary and xxx is the JDK directory; on this machine the directory turns out to be "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre".
java - status is auto.
link currently points to /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/java
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/java - family java-1.8.0-openjdk.x86_64 priority 1800212
slave jre: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre
slave jre_exports: /usr/lib/jvm-exports/jre-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64
slave jjs: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/jjs
slave keytool: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/keytool
slave orbd: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/orbd
slave pack200: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/pack200
slave rmid: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/rmid
slave rmiregistry: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/rmiregistry
slave servertool: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/servertool
slave tnameserv: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/tnameserv
slave policytool: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/policytool
slave unpack200: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/unpack200
slave java.1.gz: /usr/share/man/man1/java-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave jjs.1.gz: /usr/share/man/man1/jjs-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave keytool.1.gz: /usr/share/man/man1/keytool-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave orbd.1.gz: /usr/share/man/man1/orbd-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave pack200.1.gz: /usr/share/man/man1/pack200-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave rmid.1.gz: /usr/share/man/man1/rmid-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave rmiregistry.1.gz: /usr/share/man/man1/rmiregistry-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave servertool.1.gz: /usr/share/man/man1/servertool-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave tnameserv.1.gz: /usr/share/man/man1/tnameserv-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave policytool.1.gz: /usr/share/man/man1/policytool-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave unpack200.1.gz: /usr/share/man/man1/unpack200-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
Current `best' version is /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/java.
Next, open ~/hadoop/etc/hadoop/hadoop-env.sh for editing, find the line "export JAVA_HOME=${JAVA_HOME}", and replace it with "export JAVA_HOME={the JDK directory just found}". The directory may differ between machines; on this machine the line becomes "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre".
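A scripted alternative (a sketch of my own, assuming the java on the PATH is the JDK you want to use): resolve the real path behind the java alternatives symlink and patch hadoop-env.sh in place, which is convenient when the JDK path differs between machines:
JAVA_DIR=$(dirname "$(dirname "$(readlink -f "$(which java)")")")   # e.g. .../java-1.8.0-openjdk-.../jre
sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=$JAVA_DIR|" ~/hadoop/etc/hadoop/hadoop-env.sh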
Next, configure the address of the NameNode (the entry point of the HDFS file system, which holds its metadata) by editing "~/hadoop/etc/hadoop/core-site.xml". (fs.default.name is the older, deprecated spelling of fs.defaultFS; both work in Hadoop 2.x.) The edited file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>
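Once this file is in place, you can confirm which address the Hadoop client actually resolves (getconf reads the same configuration files):
hdfs getconf -confKey fs.defaultFS   # expected: hdfs://hadoop-master:9000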
Next, configure where the NameNode and the DataNodes store their data on each machine's local disk by editing "~/hadoop/etc/hadoop/hdfs-site.xml". The edited result is shown below. Note that dfs.replication is set to 1 here, so every HDFS block is stored on a single DataNode only; for data you cannot afford to lose, a replication factor of 2 or 3 is safer:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Next, configure YARN. First run:
cd ~/hadoop/etc/hadoop
mv mapred-site.xml.template mapred-site.xml
Then edit the file "~/hadoop/etc/hadoop/mapred-site.xml" so that it reads:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>yarn</value>
</property>
</configuration>
Then edit "~/hadoop/etc/hadoop/yarn-site.xml" so that it reads:
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Next, configure the list of slave nodes: open "~/hadoop/etc/hadoop/slaves" and edit it to contain:
hadoop-slave001
hadoop-slave002
hadoop-slave003
hadoop-slave004
hadoop-slave005
hadoop-slave006
hadoop-slave007
hadoop-slave008
Now copy the configured Hadoop directory to every slave machine:
cd ~
scp -r hadoop hadoop-slave001:~
scp -r hadoop hadoop-slave002:~
scp -r hadoop hadoop-slave003:~
scp -r hadoop hadoop-slave004:~
scp -r hadoop hadoop-slave005:~
scp -r hadoop hadoop-slave006:~
scp -r hadoop hadoop-slave007:~
scp -r hadoop hadoop-slave008:~
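Equivalently, as a loop (the trailing & sends each copy to the background so the eight transfers run in parallel, and wait blocks until they all finish):
cd ~
for host in $(seq -f "hadoop-slave%03g" 1 8); do
    scp -r hadoop "$host":~ &
done
wait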
Next, format the HDFS file system on the master node:
hdfs namenode -format
HDFS can now be started:
start-dfs.sh
Use the jps command to check whether HDFS is running properly. On hadoop-master, the output of "jps" should look like this (the process IDs will differ):
hadoop@hadoop-master ~> jps
5536 SecondaryNameNode
5317 NameNode
5691 Jps
while running jps on a hadoop-slave should give:
[hadoop@hadoop-slave001 ~]$ jps
16753 Jps
16646 DataNode
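Beyond jps, it is worth checking the NameNode's own view of the cluster; the report should list all eight DataNodes as live:
hdfs dfsadmin -report   # the summary should contain: Live datanodes (8)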
To shut HDFS down, use:
stop-dfs.sh
Next, start YARN by running:
start-yarn.sh
If YARN started correctly, running "jps" again should show an additional process named "ResourceManager" on hadoop-master and an additional process named "NodeManager" on each hadoop-slave.
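The ResourceManager can also list the NodeManagers that have registered with it:
yarn node -list   # should report Total Nodes:8, each in RUNNING state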
To shut YARN down:
stop-yarn.sh
Testing That Hadoop Was Installed Correctly
Finally, run a simple example (counting the words in a few text files) to check that Hadoop runs properly on the cluster.
First, create a home directory on HDFS:
hdfs dfs -mkdir /home
hdfs dfs -mkdir /home/hadoop
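Equivalently, both directory levels can be created in a single call with the -p flag:
hdfs dfs -mkdir -p /home/hadoop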
Download the data set and copy it onto HDFS with the put command:
hdfs dfs -mkdir /home/hadoop/books
cd ~
mkdir books
cd books
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8
hdfs dfs -put alice.txt holmes.txt frankenstein.txt /home/hadoop/books
Inspect the data set on HDFS:
hdfs dfs -ls /home/hadoop/books
hdfs dfs -cat /home/hadoop/books/alice.txt
Use the word-count example that ships with Hadoop to count all the words in the data set:
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount "/home/hadoop/books/*" /home/hadoop/output
If everything goes well, the output files appear under the /home/hadoop/output/ directory on HDFS:
hadoop@hadoop-master ~> hdfs dfs -ls /home/hadoop/output
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2019-05-28 14:59 /home/hadoop/output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 789726 2019-05-28 14:59 /home/hadoop/output/part-r-00000
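The actual per-word counts are in part-r-00000; to peek at the first few lines:
hdfs dfs -cat /home/hadoop/output/part-r-00000 | head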
With that, the basic setup of the Hadoop cluster is complete.
References
[1] https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/
If this article helped you and you would like to offer a little support, here is my Bitcoin address:
My bitcoin address: 3KsqM8tef5XJ9jPvWGEVXyJNpvyLLsrPZj