1竖席、知識點(diǎn)回顧
hive數(shù)據(jù)庫是hdfs上的文件夾耘纱,表也是文件夾,表里的數(shù)據(jù)是文件
hive建表
create table 表名(字段1 類型1,字段2 類型2……)
row format delimited fields terminated by '字段分隔符';
為一鍵啟動集群中的zookeeper毕荐,以下提供參考腳本:
#vim my-zkServer.sh 添加如下內(nèi)容:
/opt/zookeeper-3.4.10/bin/zkServer.sh $1
ssh hdp2 "source /etc/profile;/opt/zookeeper-3.4.10/bin/zkServer.sh $1"
ssh hdp3 "source /etc/profile;/opt/zookeeper-3.4.10/bin/zkServer.sh $1"
#source /etc/profile 的目的是將profile中的JAVA_HOME路徑提升為全局變量才能到相應(yīng)的服務(wù)器啟動具體操作:
#賦予權(quán)限 chmod +x my-zkServer.sh
#一鍵啟動zookeeper集群:./my-zkServer.sh start
#一鍵停止zookeeper集群:./my-zkServer.sh stop
2束析、測試階段可以把集群運(yùn)行環(huán)境切換為本地運(yùn)行 yarn -> local
<!-- vim mapred-site.xml -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>local</value>
</property>
</configuration>
<!-- 則之后提交job運(yùn)行mr不需要啟動yarn -->
mapreduce.framework.name設(shè)置為local,則不會使用YARN集群來分配資源东跪,在本地節(jié)點(diǎn)執(zhí)行畸陡。在本地模式運(yùn)行的任務(wù),無法發(fā)揮集群的優(yōu)勢虽填。注:在web UI是查看不到本地模式運(yùn)行的任務(wù)。
3曹动、hive以服務(wù)方式運(yùn)行:
啟動服務(wù):./hive --service hiveserver2 &
查看端口:netstat -tunl
hiveserver端口:10000
4斋日、beeline客戶端連接hiveserver
bin/beeline
!connect jdbc:hive2://hdp1:10000
root
回車(hive的默認(rèn)不認(rèn)證用戶名密碼)
5、建表墓陈,內(nèi)部表和外部表
hive建表語句
建內(nèi)部表
create table t_student(id string,name string,age int,classNo string)
row format delimited
fields terminated by ',';
建外部表
create external table t_a(id string,name string)
row format delimited fields terminated by ','
location '/ainput';
刪除表
drop table xx
刪除內(nèi)部表是連同表結(jié)構(gòu)和數(shù)據(jù)一起刪除
刪除外部表只刪除表結(jié)構(gòu)恶守,不刪除存儲在hdfs上的數(shù)據(jù)
hive導(dǎo)入數(shù)據(jù)的語句
從hiverserver本地導(dǎo)入,注意不是beeline客戶端的本地
load data local inpath '/root/b.dat' into table t_b;
從hdfs上導(dǎo)入數(shù)據(jù)(移動)
load data inpath '/datafromhdfs/a.dat' into table t_a;
6贡必、笛卡爾積 join
笛卡爾積定義:
設(shè)A和B是兩個(gè)集合兔港,存在一個(gè)集合,它的元素是用A中元素為第一個(gè)元素仔拟,B中元素為第二個(gè)元素構(gòu)成的有序二元組衫樊。稱它為集合A和B的笛卡爾積集,記為A×B。即
A×B = {(a,b)|a∈A,b∈B}
例 A={1,2}
? B={a,b,c}
? A×B = {(1,a),(1,b),(1,c),(2,a),(2,b),(2,c)}
join例子:
a.dat
a,1
b,2
c,3
d,4
b.dat
a,xx
b,yy
d,zz
e,pp
創(chuàng)建表t_a 科侈,t_b與以上數(shù)據(jù)對應(yīng)并導(dǎo)入數(shù)據(jù)载佳。
create table t_a(id string,name string)
row format delimited fields terminated by ',';
create table t_b(id string,name string)
row format delimited fields terminated by ',';
load data local inpath '/root/a.dat' into table t_a;
load data local inpath '/root/b.dat' into table t_b;
6.1 笛卡爾積(內(nèi)連接):
select t_a.*,t_b.* from t_a inner join t_b;
+---------+-----------+---------+-----------+--+
| t_a.id | t_a.name | t_b.id | t_b.name |
+---------+-----------+---------+-----------+--+
| a | 1 | a | xx |
| b | 2 | a | xx |
| c | 3 | a | xx |
| d | 4 | a | xx |
| a | 1 | b | yy |
| b | 2 | b | yy |
| c | 3 | b | yy |
| d | 4 | b | yy |
| a | 1 | d | zz |
| b | 2 | d | zz |
| c | 3 | d | zz |
| d | 4 | d | zz |
| a | 1 | e | pp |
| b | 2 | e | pp |
| c | 3 | e | pp |
| d | 4 | e | pp |
+---------+-----------+---------+-----------+--+
6.2 帶條件的笛卡爾積
select t_a.*,t_b.* from t_a inner join t_b on t_a.id = t_b.id;
等同于:
select t_a.*,t_b.* from t_a inner join t_b where t_a.id = t_b.id
+---------+-----------+---------+-----------+--+
| t_a.id | t_a.name | t_b.id | t_b.name |
+---------+-----------+---------+-----------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| d | 4 | d | zz |
+---------+-----------+---------+-----------+--+
6.3 左外連接:左表的數(shù)據(jù)一定顯示,右表的數(shù)據(jù)有就顯示臀栈,沒有就不顯示
select t_a.*,t_b.* from t_a left outer join t_b;
6.3 帶條件的左外連接
select t_a.*,t_b.* from t_a left outer join t_b on t_a.id=t_b.id;
+---------+-----------+---------+-----------+--+
| t_a.id | t_a.name | t_b.id | t_b.name |
+---------+-----------+---------+-----------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| c | 3 | NULL | NULL |
| d | 4 | d | zz |
+---------+-----------+---------+-----------+--+
6.4 右外連接:右表的數(shù)據(jù)一定顯示蔫慧,左表的數(shù)據(jù)有就顯示,沒有就不顯示
select t_a.*,t_b.* from t_a right outer join t_b on t_a.id = t_b.id;
+---------+-----------+---------+-----------+--+
| t_a.id | t_a.name | t_b.id | t_b.name |
+---------+-----------+---------+-----------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| d | 4 | d | zz |
| NULL | NULL | e | pp |
+---------+-----------+---------+-----------+--+
6.5 全外連接:左右表的數(shù)據(jù)都顯示权薯,對應(yīng)不上的就為空
select t_a.*,t_b.* from t_a full outer join t_b on t_a.id=t_b.id;
+---------+-----------+---------+-----------+--+
| t_a.id | t_a.name | t_b.id | t_b.name |
+---------+-----------+---------+-----------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| c | 3 | NULL | NULL |
| d | 4 | d | zz |
| NULL | NULL | e | pp |
+---------+-----------+---------+-----------+--+
6.6 左半連接 相當(dāng)于指定條件的內(nèi)連接姑躲,但是只顯示左邊的表數(shù)據(jù)
select a.*from t_a a left semi join t_b b on a.id = b.id;
+-------+---------+--+
| a.id | a.name |
+-------+---------+--+
| a | 1 |
| b | 2 |
| d | 4 |
+-------+---------+--+
7、分區(qū)表盟蚣、PV肋联、UV
例如有這些數(shù)據(jù)
vim access.log.0804
192.168.33.3,http://www.sina.com/stu,2017-08-04 15:30:20
192.168.33.3,http://www.sina.com/teach,2017-08-04 15:35:20
192.168.33.4,http://www.sina.com/stu,2017-08-04 15:30:20
192.168.33.4,http://www.sina.com/job,2017-08-04 16:30:20
192.168.33.5,http://www.sina.com/job,2017-08-04 15:40:20
vim access.log.0805
192.168.33.3,http://www.sina.com/stu,2017-08-05 15:30:20
192.168.44.3,http://www.sina.com/teach,2017-08-05 15:35:20
192.168.33.44,http://www.sina.com/stu,2017-08-05 15:30:20
192.168.33.46,http://www.sina.com/job,2017-08-05 16:30:20
192.168.33.55,http://www.sina.com/job,2017-08-05 15:40:20
vim access.log.0806
192.168.133.3,http://www.sina.com/register,2017-08-06 15:30:20
192.168.111.3,http://www.sina.com/register,2017-08-06 15:35:20
192.168.34.44,http://www.sina.com/pay,2017-08-06 15:30:20
192.168.33.46,http://www.sina.com/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.sina.com/job,2017-08-06 15:40:20
192.168.33.46,http://www.sina.com/excersize,2017-08-06 16:30:20
192.168.33.25,http://www.sina.com/job,2017-08-06 15:40:20
192.168.33.36,http://www.sina.com/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.sina.com/job,2017-08-06 15:40:20
分區(qū)意義:
創(chuàng)建分區(qū)表:
create table t_access0(ip string,url string,access_time string)
row format delimited fields terminated by ',';
create table t_access(ip string,url string,access_time string)
partitioned by (day string)
row format delimited fields terminated by ',';
把數(shù)據(jù)導(dǎo)入分區(qū)表中:
load data local inpath '/root/access.log.0804' into table t_access partition(day='0804');
查看表分區(qū)信息:
show partitions t_access
7.1 求pv,每天pv
select url,count(1) cnts from t_access group by url;
select day,url,count(1) cnts from t_access group by day,url;
7.2 求uv, 每天uv
select url,count(distinct ip) cnts from t_access group by url;
select url,count(distinct(ip)) cnts from t_access group by url;
select day,url,count(distinct(ip)) cnts from t_access group by day,url;
7.3 求每個(gè)ip訪問同一個(gè)頁面的記錄中刁俭,最晚的一條
+--------------------------------+----------------+----------------------+--+
| url | ip | _c2 |
+--------------------------------+----------------+----------------------+--+
| http://www.sina.com/excersize | 192.168.33.36 | 2017-08-06 16:30:20 |
| http://www.sina.com/excersize | 192.168.33.46 | 2017-08-06 16:30:20 |
| http://www.sina.com/job | 192.168.33.25 | 2017-08-06 15:40:20 |
| http://www.sina.com/job | 192.168.33.4 | 2017-08-04 16:30:20 |
| http://www.sina.com/job | 192.168.33.46 | 2017-08-05 16:30:20 |
| http://www.sina.com/job | 192.168.33.5 | 2017-08-04 15:40:20 |
| http://www.sina.com/job | 192.168.33.55 | 2017-08-06 15:40:20 |
| http://www.sina.com/pay | 192.168.34.44 | 2017-08-06 15:30:20 |
| http://www.sina.com/register | 192.168.111.3 | 2017-08-06 15:35:20 |
| http://www.sina.com/register | 192.168.133.3 | 2017-08-06 15:30:20 |
| http://www.sina.com/stu | 192.168.33.3 | 2017-08-05 15:30:20 |
| http://www.sina.com/stu | 192.168.33.4 | 2017-08-04 15:30:20 |
| http://www.sina.com/stu | 192.168.33.44 | 2017-08-05 15:30:20 |
| http://www.sina.com/teach | 192.168.33.3 | 2017-08-04 15:35:20 |
| http://www.sina.com/teach | 192.168.44.3 | 2017-08-05 15:35:20 |
+--------------------------------+----------------+----------------------+--+
7.4 求8月6號的pv橄仍,uv
7.5 每天,pv排序
+-------+--------------------------------+-------+--+
| day | url | cnts |
+-------+--------------------------------+-------+--+
| 0806 | http://www.sina.com/job | 3 |
| 0806 | http://www.sina.com/excersize | 3 |
| 0806 | http://www.sina.com/register | 2 |
| 0805 | http://www.sina.com/stu | 2 |
| 0805 | http://www.sina.com/job | 2 |
| 0804 | http://www.sina.com/stu | 2 |
| 0804 | http://www.sina.com/job | 2 |
| 0806 | http://www.sina.com/pay | 1 |
| 0805 | http://www.sina.com/teach | 1 |
| 0804 | http://www.sina.com/teach | 1 |
+-------+--------------------------------+-------+--+
8牍戚、order by侮繁、distributed by、sort by
hive里的order by是全局排序,order by 是全排序如孝,所有的數(shù)據(jù)會發(fā)送給一個(gè)reduceTask進(jìn)行處理宪哩,在數(shù)據(jù)量大的時(shí)候,reduce就會超負(fù)荷
distribute by先把數(shù)據(jù)分發(fā)到各個(gè)reduce中第晰,然后sort by在各個(gè)reduce中進(jìn)行局部排序
cluster by mid 等于 distribute by mid sort by mid
cluster by后面不能跟desc锁孟,asc,默認(rèn)的只能升序
9茁瘦、Hive MR參數(shù)設(shè)置
--設(shè)置最大的reduce啟動個(gè)數(shù)
set hive.exec.reducers.max=10;
--設(shè)置reduce的啟動個(gè)數(shù)
set mapreduce.job.reduce=3