Download site for CDH-related software: http://archive.cloudera.com/cdh5/cdh/5/
cdh-5.7.0
When choosing CDH versions for a production or test environment, always pick components whose CDH tail version number is the same.
http://hadoop.apache.org/
For Apache top-level projects, the site is projectname.apache.org:
Hadoop: hadoop.apache.org
Hive: hive.apache.org
Spark: spark.apache.org
HBase: hbase.apache.org
Why do so many companies choose Hadoop as their big data platform solution?
1) The source code is open source
2) Active community with many contributors (the same is true of Spark)
3) It covers every aspect of distributed storage and computation:
Flume for data ingestion
Spark/MR/Hive for data processing
HDFS/HBase for data storage
4) It has been proven in the enterprise
HDFS architecture
1 Master (NameNode/NN) with N Slaves (DataNode/DN)
(the same master/slave pattern is used by HDFS, YARN, and HBase)
A file is split into multiple blocks
blocksize: 128M
130M ==> 2 blocks: 128M and 2M
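Once a file is actually on HDFS you can see how it was split (a quick check, assuming HDFS is already running; the path is only an example):
# list the blocks (and their locations) that make up a file
hdfs fsck /input/wc/hello.txt -files -blocks -locations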
NN:
1) Responds to client requests
2) Manages metadata (file names, replication factor, which DNs each block is stored on)
DN:
1) Stores the data blocks that make up users' files
2) Periodically sends heartbeats to the NN, reporting itself, all of its block information, and its health status
A typical deployment has a dedicated machine that runs only the NameNode software.
Each of the other machines in the cluster runs one instance of the DataNode software.
The architecture does not preclude running multiple DataNodes on the same machine
but in a real deployment that is rarely the case.
NameNode + N DataNodes
Recommendation: deploy the NN and the DNs on separate nodes
replication factor: the number of copies kept of each block
All blocks in a file except the last block are the same size
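A small sketch of inspecting and changing these values from the command line (requires a running HDFS; the path is only an example):
# show the replication factor (%r) and block size in bytes (%o) of a file
hdfs dfs -stat "replication=%r blocksize=%o" /input/wc/hello.txt
# change the replication factor of an existing file to 2 and wait until it takes effect
hdfs dfs -setrep -w 2 /input/wc/hello.txt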
Software directory layout
hadoop/hadoop
/home/hadoop
software: the downloaded installation packages
app: the installation directories of all software
data: the test data used throughout the course
source: software source code, e.g. Spark
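These directories do not exist by default; a one-time setup as the hadoop user might look like this:
# create the working directories under /home/hadoop
mkdir -p ~/software ~/app ~/data ~/source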
Hadoop environment setup
1) Download Hadoop
http://archive.cloudera.com/cdh5/cdh/5/
2.6.0-cdh5.7.0
wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.0.tar.gz
2) Install the JDK
Download it
Extract it into the app directory: tar -zxvf jdk-7u51-linux-x64.tar.gz -C ~/app/
Verify the installation: cd ~/app/jdk1.7.0_51/bin && ./java -version
It is recommended to add the bin directory to the system environment variables (~/.bash_profile):
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_51
export PATH=$JAVA_HOME/bin:$PATH
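After appending the two export lines, reload the profile so the new JAVA_HOME takes effect in the current shell, then re-check:
source ~/.bash_profile
echo $JAVA_HOME
java -version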
3) Machine settings
hostname: hadoop001
Change the hostname (on CentOS 6 these lines live in /etc/sysconfig/network):
NETWORKING=yes
HOSTNAME=hadoop001
Set the IP-to-hostname mapping in /etc/hosts:
192.168.199.200 hadoop001
127.0.0.1 localhost
Passwordless SSH login (this step can be skipped, but then you have to type the password by hand every time you restart the Hadoop processes)
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
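The cp above assumes an RSA key pair already exists under ~/.ssh. A minimal sketch of the whole sequence (press Enter at the prompts to accept the defaults and an empty passphrase):
# generate the key pair, authorize it for the local account, and test the login
ssh-keygen -t rsa
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh hadoop001 date    # should print the date without asking for a password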
4) Edit the Hadoop configuration files under ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop
hadoop-env.sh:
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_51
core-site.xml:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop001:8020</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/app/tmp</value>
</property>
hdfs-site.xml:
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
5) Format HDFS
Note: run this only the very first time; if you format on every start, the data on HDFS is wiped out
bin/hdfs namenode -format
6) Start HDFS
sbin/start-dfs.sh
Verify it started successfully:
jps
DataNode
SecondaryNameNode
NameNode
Browser:
http://hadoop001:50070/
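Besides the web UI, a quick smoke test from the shell also confirms HDFS is usable (hello.txt under ~/data is an assumed sample file, reused later for the wordcount job; prefix the commands with bin/ if $HADOOP_HOME/bin is not on the PATH):
hdfs dfs -mkdir -p /input/wc
hdfs dfs -put ~/data/hello.txt /input/wc/
hdfs dfs -ls /input/wc
hdfs dfs -text /input/wc/hello.txt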
7) Stop HDFS
sbin/stop-dfs.sh
YARN architecture
1 RM (ResourceManager) + N NM (NodeManager)
ResourceManager responsibilities: only one RM is active per cluster; it handles resource management and scheduling for the whole cluster
1) Handles client requests (submitting/killing applications)
2) Starts/monitors the ApplicationMaster (one AM per job)
3) Monitors the NMs
4) Allocates and schedules the system's resources
NodeManager: there are N of them in a cluster; each is responsible for the resource management and usage of a single node and for the tasks running on it
1) Periodically reports the node's resource usage and the state of each container to the RM
2) Receives and executes the RM's commands to start/stop containers
3) Manages the resources and tasks of its own node
ApplicationMaster: one per application/job; manages that application
1) Splits the input data
2) Requests resources (containers) from the RM on behalf of the application and assigns them to its internal tasks
3) Communicates with the NMs to start/stop tasks; tasks run inside containers
4) Monitors tasks and handles their fault tolerance
Container:
Describes what a task runs with: CPU, memory, environment variables
YARN execution flow
1) The user submits a job to YARN
2) The RM allocates the first container for the job (to run the AM)
3) The RM communicates with the corresponding NM and asks it to launch the application's AM in that container
4) The AM first registers with the RM, then requests resources for the individual tasks and monitors how they run
5) The AM polls the RM over the RPC protocol to request and receive resources
6) Once the AM has obtained resources, it communicates with the corresponding NMs and asks them to launch the tasks
7) The NMs launch the tasks of our job
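Once YARN is up (see the setup below), this flow can be watched from the command line; these are standard YARN client commands, and the application id is only a placeholder:
# list the applications currently known to the RM
yarn application -list
# fetch the aggregated logs of a finished application (requires log aggregation to be enabled)
yarn logs -applicationId application_1234567890123_0001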
YARN environment setup
mapred-site.xml:
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
yarn-site.xml:
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
Start YARN: sbin/start-yarn.sh
Verify it started successfully
jps
ResourceManager
NodeManager
web: http://hadoop001:8088
Stop YARN: sbin/stop-yarn.sh
Submit an MR job to run on YARN: wordcount (wc)
/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar
hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /input/wc/hello.txt /output/wc/
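When the job finishes, the result is written as part files under the output directory; part-r-00000 is the usual name of the first reducer's output:
hadoop fs -ls /output/wc/
hadoop fs -cat /output/wc/part-r-00000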
當(dāng)我們?cè)俅螆?zhí)行該作業(yè)時(shí)譬挚,會(huì)報(bào)錯(cuò):
FileAlreadyExistsException:
Output directory hdfs://hadoop001:8020/output/wc already exists
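MapReduce refuses to overwrite an existing output directory, so remove it (or pick a new output path) before re-running:
hadoop fs -rm -r /output/wc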
Hive's underlying execution engines: MapReduce, Tez, Spark
Hive on MapReduce
Hive on Tez
Hive on Spark
Compression: GZIP, LZO, Snappy, BZIP2, ...
Storage formats: TextFile, SequenceFile, RCFile, ORC, Parquet (see the sketch after this list)
UDF: user-defined functions
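As a small illustration, a table's on-disk format is chosen when it is created; a minimal sketch (logs_parquet is a hypothetical table name, and this assumes the Hive installation described below):
create table logs_parquet(line string) stored as parquet;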
Hive environment setup
1) Download Hive: http://archive.cloudera.com/cdh5/cdh/5/
wget http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.7.0.tar.gz
2) Extract
tar -zxvf hive-1.1.0-cdh5.7.0.tar.gz -C ~/app/
3) Configure
System environment variables (~/.bash_profile):
export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.7.0
export PATH=$HIVE_HOME/bin:$PATH
First install a MySQL instance (yum install xxx).
hive-site.xml:
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/sparksql?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
</property>
4) Copy the MySQL JDBC driver to $HIVE_HOME/lib/
5) Start Hive: $HIVE_HOME/bin/hive
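For step 4, the copy might look like this (the exact jar name and where you saved it are assumptions; adjust them to the driver you downloaded):
cp ~/software/mysql-connector-java-5.1.27-bin.jar $HIVE_HOME/lib/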
Create a table
CREATE TABLE table_name
  [(col_name data_type [COMMENT col_comment])]
create table hive_wordcount(context string);
Load data into a Hive table
LOAD DATA LOCAL INPATH 'filepath' INTO TABLE tablename
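A concrete load for the wordcount table above might be (the local file path is an assumption; hello.txt should contain tab-separated words):
load data local inpath '/home/hadoop/data/hello.txt' into table hive_wordcount;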
select word, count(1) from hive_wordcount lateral view explode(split(context,'\t')) wc as word group by word;
lateral view explode(): splits each row into multiple rows on the given delimiter
Once a Hive QL statement is submitted, it is compiled into an MR job and runs on YARN
create table emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
create table dept(
deptno int,
dname string,
location string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
load data local inpath '/home/hadoop/data/emp.txt' into table emp;
load data local inpath '/home/hadoop/data/dept.txt' into table dept;
Count the number of people in each department:
select deptno, count(1) from emp group by deptno;
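To show department names instead of numbers, the two tables can be joined; a sketch (assumes emp.txt and dept.txt were loaded as above with matching deptno values):
select d.dname, count(1) from emp e join dept d on e.deptno = d.deptno group by d.dname;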