前提
最近在學習hive,碰到了級聯(lián)求和的問題.經(jīng)過一番思考學習,現(xiàn)在做些學習筆記.
需求
原始數(shù)據(jù)表
訪客 | 月份 | 訪問次數(shù) |
---|---|---|
A | 2015-01 | 5 |
A | 2015-01 | 15 |
B | 2015-01 | 5 |
A | 2015-01 | 8 |
B | 2015-01 | 25 |
A | 2015-01 | 5 |
A | 2015-02 | 4 |
A | 2015-02 | 6 |
B | 2015-02 | 10 |
B | 2015-02 | 5 |
根據(jù)上面的數(shù)據(jù)表輸出每個用戶每個月份的訪問次數(shù),并且每個月統(tǒng)計總的訪問次數(shù).最后的輸出格式如下
需要輸出報表
訪客 | 月份 | 月訪問總計 | 累計訪問總計 |
---|---|---|---|
A | 2015-01 | 33 | 33 |
A | 2015-02 | 10 | 43 |
B | 2015-01 | 30 | 30 |
B | 2015-02 | 15 | 45 |
實現(xiàn)步驟
1.創(chuàng)建hive表
create table t_access_times(username string,month string,cnt int)
row format delimited fields terminated by ',';
2.準備數(shù)據(jù) access.log
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
3.加載數(shù)據(jù)到表中
load data local inpath '/home/hadoop/access.log' into table t_access_times;
4.自join方式
- 先求每個用戶每個月的訪問總次數(shù)
+-----------+----------+---------+--+
| username | month | cnt |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+
- 將月總次數(shù)表 自己連接自己(自join)
select A.*,B.* FROM
(select username,month,sum(cnt) as cnt from t_access_times group by username,month) A
inner join
(select username,month,sum(cnt) as cntfrom t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month;
+-------------+----------+-----------+-------------+----------+--------
| A.username | A.month | A.cnt| B.username | B.month | B.cnt |
+-------------+----------+-----------+-------------+----------+--------
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+--------
剛開始這里不是很明白為什么加上where B.month >= A.month
的條件,這樣有什么意義?其實這是為后面的統(tǒng)計做準備.
現(xiàn)在來講講這個自join是怎么產生這樣的數(shù)據(jù)的.
hive的表連接我沒有研究過,這里暫時我用mysql的連接來舉例說明.我姑且認為它們的實現(xiàn)原理的是一樣的.
- 從表A中讀入一行數(shù)據(jù)R;
- 從數(shù)據(jù)行R中,取出username字段和where條件到B表中去查找;
- 在B表中找到滿足條件的行,跟R組成一行,作為結果集的一部分;
- 重復執(zhí)行步驟1到3,直到表A的末尾循環(huán)結束;
在這里,兩個表都做了一次全表掃描,所以總的掃描行數(shù)是 4 + 4 = 8;
內存中的判斷次數(shù)是 4 * 4 = 16;
- 最終的sql語句
select A.username,A.month,max(A.cnt) as cnt,sum(B.cnt) as accumulate
from
(select username,month,sum(cnt) as cntfrom t_access_times group by username,month) A
inner join
(select username,month,sum(cnt) as cntfrom t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month
group by A.username,A.month
order by A.username,A.month;
--最終結果為:
+-------------+----------+---------+-------------+--+
| A.username | A.month | cnt| accumulate |
+-------------+----------+---------+-------------+--+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
+-------------+----------+---------+-------------+--+
5.窗口函數(shù)
還有一種方式也可以實現(xiàn)需求,那就是窗口函數(shù)
select
t.username,
t.month,
t.cnt,
sum(t.cnt) over(partition by t.username order by t.username,
t.month rows between unbounded preceding and current row) as accumlate
from(
select
username,month,
sum(cnt) as cnt
from t_access_times group by username,month) t
;
--最終結果為:
+-------------+----------+---------+-------------+--+
| A.username | A.month | cnt| accumulate |
+-------------+----------+---------+-------------+--+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
+-------------+----------+---------+-------------+--+