Introduction to Hadoop
Hadoop: the Adam and Eve of the open-source big-data world.
At its core are the HDFS storage system and the MapReduce distributed computing framework.
HDFS
The idea is to chop large data into blocks;
each block is replicated three times and placed on three separate commodity machines, so that three usable copies are always kept as backups for one another. When the data is needed, reading any one of the replicas is enough.
The nodes that store the data are called datanodes (the storage cells); the node that manages the datanodes is called the namenode (the overseer).
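To make this concrete, here is a minimal sketch of writing and inspecting a file through the standard org.apache.hadoop.fs.FileSystem client API (the path /demo/hello.txt and the explicit replication factor 3 are purely illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");

        // The client writes one logical file; HDFS splits it into blocks and,
        // with a replication factor of 3, keeps three copies of every block
        // on different datanodes. The namenode only tracks the metadata.
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
            out.writeUTF("hello hdfs");
        }

        // A read asks the namenode where the blocks live, then pulls each
        // block from any one of its replicas.
        System.out.println("replication = " + fs.getFileStatus(file).getReplication());
    }
}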
MapReduce
The idea is to split a big task into chunks and process them first (Map), then merge the partial results (Reduce). Both the splitting and the merging run in parallel across multiple servers, which is what brings out the power of the cluster. The hard part is decomposing a task into the split-and-merge shape that the MapReduce model expects, and deciding what the intermediate <k,v> inputs and outputs should be.
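To make the <k,v> flow concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API (class names are my own illustration, not the example shipped with Hadoop): the map phase emits <word, 1> for every word it sees, and the reduce phase sums the 1s for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
    // Map: split each input line into words and emit <word, 1>.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: all 1s for the same word arrive together; sum them up.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}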
Introduction to single-node Hadoop
For anyone learning how Hadoop works or developing on Hadoop, setting up a Hadoop system is a must. But:
- configuring it is a serious headache, and many people give up partway through;
- you may simply have no servers to use.
This article describes a configuration-free way to install and use a single-node Hadoop, so you can quickly run Hadoop examples to support learning, development, and testing.
The only prerequisite is a Linux virtual machine on your laptop with Docker installed in it.
Installation
Pull the sequenceiq/hadoop-docker:2.7.0 image with Docker and run it.
[root@bogon ~]# docker pull sequenceiq/hadoop-docker:2.7.0
2.7.0: Pulling from sequenceiq/hadoop-docker
860d0823bcab: Pulling fs layer
e592c61b2522: Pulling fs layer
When the download succeeds, the output ends with:
Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0
Startup
[root@bogon ~]# docker run -it --privileged=true sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
Starting sshd: [ OK ]
Starting namenodes on [b7a42f79339c]
b7a42f79339c: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-b7a42f79339c.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-b7a42f79339c.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-b7a42f79339c.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-b7a42f79339c.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-b7a42f79339c.out
Once startup succeeds, the shell drops you straight into the Hadoop container, so there is no need to run docker exec. Inside the container, go to /usr/local/hadoop/sbin and run ./start-all.sh followed by ./mr-jobhistory-daemon.sh start historyserver, as follows:
bash-4.1# cd /usr/local/hadoop/sbin
bash-4.1# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [b7a42f79339c]
b7a42f79339c: namenode running as process 128. Stop it first.
localhost: datanode running as process 219. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 402. Stop it first.
starting yarn daemons
resourcemanager running as process 547. Stop it first.
localhost: nodemanager running as process 641. Stop it first.
bash-4.1# ./mr-jobhistory-daemon.sh start historyserver
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-b7a42f79339c.out
Hadoop is up, and it was that simple.
If you wonder how painful a real distributed deployment is, just count the configuration files! I once watched a seasoned Hadoop hand spend a whole morning on a freshly provisioned server whose hostname contained a hyphen ("-"), and the environment still would not come up.
Running the bundled example
Go back to the Hadoop home directory and run the example program:
bash-4.1# cd /usr/local/hadoop
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
20/07/05 22:34:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/05 22:34:43 INFO input.FileInputFormat: Total input paths to process : 31
20/07/05 22:34:43 INFO mapreduce.JobSubmitter: number of splits:31
20/07/05 22:34:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594002714328_0001
20/07/05 22:34:44 INFO impl.YarnClientImpl: Submitted application application_1594002714328_0001
20/07/05 22:34:45 INFO mapreduce.Job: The url to track the job: http://b7a42f79339c:8088/proxy/application_1594002714328_0001/
20/07/05 22:34:45 INFO mapreduce.Job: Running job: job_1594002714328_0001
20/07/05 22:35:04 INFO mapreduce.Job: Job job_1594002714328_0001 running in uber mode : false
20/07/05 22:35:04 INFO mapreduce.Job: map 0% reduce 0%
20/07/05 22:37:59 INFO mapreduce.Job: map 11% reduce 0%
20/07/05 22:38:05 INFO mapreduce.Job: map 12% reduce 0%
When the MapReduce computation finishes, it prints output like the following:
20/07/05 22:55:26 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=291
FILE: Number of bytes written=230541
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=569
HDFS: Number of bytes written=197
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5929
Total time spent by all reduces in occupied slots (ms)=8545
Total time spent by all map tasks (ms)=5929
Total time spent by all reduce tasks (ms)=8545
Total vcore-seconds taken by all map tasks=5929
Total vcore-seconds taken by all reduce tasks=8545
Total megabyte-seconds taken by all map tasks=6071296
Total megabyte-seconds taken by all reduce tasks=8750080
Map-Reduce Framework
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=132
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=291
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=159
CPU time spent (ms)=1280
Physical memory (bytes) snapshot=303452160
Virtual memory (bytes) snapshot=1291390976
Total committed heap usage (bytes)=136450048
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=437
File Output Format Counters
Bytes Written=197
View the result with the hdfs command:
bash-4.1# bin/hdfs dfs -cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
Explaining the example
grep here is a MapReduce program that counts regular-expression matches in the input: it extracts every substring that matches the pattern, together with its number of occurrences.
Unlike the shell's grep, which prints each whole matching line, this program outputs only the matched substring itself:
grep input output 'dfs[a-z.]+'
The regular expression dfs[a-z.]+ means the string must start with dfs, followed by one or more characters, each of which is either a lowercase letter or a literal dot (inside the character class, . matches only a dot, not any character).
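As a rough illustration of what the map phase extracts from a line (plain java.util.regex here, not the actual Hadoop Grep source; the sample line is made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepRegexSketch {
    public static void main(String[] args) {
        // The same pattern passed to the grep example above.
        Pattern pattern = Pattern.compile("dfs[a-z.]+");
        String line = "<name>dfs.replication</name>";

        // Conceptually the map phase emits <match, 1> for every match it finds;
        // the reduce phase then sums the counts per matched string.
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            System.out.println(m.group() + "\t1");   // prints: dfs.replication   1
        }
    }
}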
The input is every file under the input directory:
bash-4.1# ls -lrt
total 48
-rw-r--r--. 1 root root 690 May 16 2015 yarn-site.xml
-rw-r--r--. 1 root root 5511 May 16 2015 kms-site.xml
-rw-r--r--. 1 root root 3518 May 16 2015 kms-acls.xml
-rw-r--r--. 1 root root 620 May 16 2015 httpfs-site.xml
-rw-r--r--. 1 root root 775 May 16 2015 hdfs-site.xml
-rw-r--r--. 1 root root 9683 May 16 2015 hadoop-policy.xml
-rw-r--r--. 1 root root 774 May 16 2015 core-site.xml
-rw-r--r--. 1 root root 4436 May 16 2015 capacity-scheduler.xml
The results are written to output.
The computation flow is as follows.
The slight twist is that the example chains two MapReduce jobs: the first counts the regex matches, and the second sorts the results by occurrence count. Developers are free to combine map and reduce stages however they like, as long as each stage's output lines up with the next stage's input.
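Here is a minimal sketch of that kind of chaining in a driver (class and directory names are my own, and the mapper/reducer wiring is omitted): the first job's output directory simply becomes the second job's input directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("input");
        Path temp = new Path("grep-temp");   // intermediate directory linking the two jobs
        Path output = new Path("output");

        // Job 1: count the matches (mapper/reducer classes would be set here).
        Job count = Job.getInstance(conf, "grep-search");
        count.setJarByClass(ChainedJobsSketch.class);
        FileInputFormat.addInputPath(count, input);
        FileOutputFormat.setOutputPath(count, temp);
        if (!count.waitForCompletion(true)) {
            System.exit(1);                  // stop if the first job failed
        }

        // Job 2: sort by occurrence count; its input is job 1's output.
        Job sort = Job.getInstance(conf, "grep-sort");
        sort.setJarByClass(ChainedJobsSketch.class);
        FileInputFormat.addInputPath(sort, temp);
        FileOutputFormat.setOutputPath(sort, output);
        System.exit(sort.waitForCompletion(true) ? 0 : 1);
    }
}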
The web management UIs
Hadoop and related components expose web-based management UIs on the following ports:
Port | Purpose |
---|---|
50070 | Hadoop NameNode UI |
50075 | Hadoop DataNode UI |
50090 | Hadoop SecondaryNameNode port |
50030 | JobTracker monitoring port |
50060 | TaskTracker port |
8088 | YARN job monitoring port |
60010 | HBase HMaster monitoring UI |
60030 | HBase HRegionServer port |
8080 | Spark monitoring UI |
4040 | Spark job UI |
Adding command-line parameters
The docker run command needs port mappings added before the management UIs can be reached from outside the container:
docker run -it --privileged=true -p 50070:50070 -p 8088:8088 -p 50075:50075 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
After running this command, you can browse the UIs from the host machine; of course, if your Linux VM has a browser, you can use that too. My Linux VM has no GUI, so I view them from the host.
50070: Hadoop NameNode UI
50075: Hadoop DataNode UI
8088: YARN job monitoring UI
Both finished and running MapReduce jobs can be viewed on port 8088; the screenshots above show two jobs, grep and wordcount.
A few gotchas
1. ./sbin/mr-jobhistory-daemon.sh start historyserver must be run, otherwise jobs fail during execution with
20/06/29 21:18:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Call From 87a4217b9f8a/172.17.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2. ./start-all.sh must be run, otherwise you get errors like
Unknown Job job_1592960164748_0001
3. The docker run command must include --privileged=true (as a docker option, i.e. before the image name), otherwise jobs fail with java.io.IOException: Job status not available
4. Note that Hadoop will not overwrite an existing result directory by default, so running the example above a second time reports an error; delete ./output first (for example with bin/hdfs dfs -rm -r output), or just switch the output directory name to output01. A programmatic alternative is sketched after this list.
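If you prefer to clear the result directory from your driver code instead, here is a minimal sketch using the HDFS FileSystem API (the directory name output matches the example above; the class name is my own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("output");

        // Remove the old result directory (recursively) so the next job can create it again.
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
    }
}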
Summary
The approach in this article gets a working Hadoop installation at very low cost, which helps with learning, understanding, development, and testing. To run your own Hadoop program, package it as a jar, upload it to the share/hadoop/mapreduce/ directory, and execute
bin/hadoop jar share/hadoop/mapreduce/yourtest.jar
to run it and observe the results. (If the jar's manifest does not specify a main class, also append the fully qualified main class name and any input/output arguments to that command.)