Hive分析函數(shù)和窗口函數(shù)
在Hive 0.11之后支持的,掃描多個(gè)輸入的行計(jì)算每行的結(jié)果困曙。通常和OVER,PARTITION BY, ORDER BY, WINDOWING配合使用恰矩。和傳統(tǒng)的分組結(jié)果不一樣记盒,傳統(tǒng)的結(jié)果每組中只有一個(gè)結(jié)果。分析函數(shù)的結(jié)果會(huì)出現(xiàn)多次外傅,和每條記錄都連接輸出纪吮。
語法形式如下:
Function(arg1,....argn) OVER([PARTITION BY<...>] [ORDER BY<...>] [window_clause])
窗口函數(shù)
函數(shù)名 | 說明 |
---|---|
FIRST_VALUE | 取出分組內(nèi)排序后俩檬,截止到當(dāng)前行,第一個(gè)值 |
LAST_VALUE | 取出分組內(nèi)排序后碾盟,截止到當(dāng)前行棚辽,最后一個(gè)值 |
LEAD(col, n, DEFAULT) | 用于統(tǒng)計(jì)窗口內(nèi)往下第n行的值。第一個(gè)參數(shù)為列名巷疼,第二個(gè)參數(shù)為往下第n行(可選晚胡,默認(rèn)為1)灵奖,第三個(gè)參數(shù)為默認(rèn)值(當(dāng)往下第n行為NULL時(shí)嚼沿,取默認(rèn)值) |
LAG(col,n,DEFAULT) | 與lead相反,用于統(tǒng)計(jì)窗口內(nèi)往下第n個(gè)值瓷患。第一個(gè)參數(shù)為列名骡尽,第二個(gè)參數(shù)為往上第n行(可選,默認(rèn)為1) |
OVER從句
- 使用標(biāo)準(zhǔn)的聚合函數(shù)COUNT,SUM,MIN,MAX,AVG
- 使用PARTITION BY語句擅编,使用一個(gè)或者多個(gè)原始數(shù)據(jù)類型的列
- 使用PARTITION BY與ORDER BY語句攀细,使用一個(gè)或者多個(gè)數(shù)據(jù)類型的分區(qū)或者拍序列
- 使用窗口規(guī)范,窗口規(guī)范支持一下格式:
(ROW | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN [num] PRECEDING AND (UNBOUNDED | [num]) FOLLOWING
當(dāng)ORDER BY后面缺少窗口從句條件爱态,窗口規(guī)范默認(rèn)是
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
當(dāng)ORDER BY和窗口從句都缺失谭贪,窗口規(guī)范默認(rèn)是:
ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
分析函數(shù)
函數(shù) | 說明 |
---|---|
ROW_NUMBER() | 從1開始,按照順序锦担,生成分組內(nèi)記錄的序列俭识,比如,按照pv降序排列洞渔,生成分組內(nèi)每天的pv名次套媚,ROW_NUMBER()的應(yīng)用 場景非常多,再比如磁椒,獲取分組內(nèi)排序第一的記錄堤瘤,獲取一個(gè)session中的第一條refer等 |
RANK() | 生成數(shù)據(jù)項(xiàng)在分組中的排名,排名相等會(huì)在名次中留下空位 |
DENSE_RANK() | 生成數(shù)據(jù)項(xiàng)在分組中的排名浆熔,排名相等會(huì)在名次中不會(huì)留下空位 |
CUME_DIST() | 小于等于當(dāng)前值的行數(shù)除以分組內(nèi)總行數(shù)本辐。比如,統(tǒng)計(jì)小于等于當(dāng)前薪水的人數(shù)所占總?cè)藬?shù)的比例 |
PERCENT_RANK() | 分組內(nèi)當(dāng)前行的RANK值-1/分組內(nèi) 總行數(shù)-1 |
NTILE(n) | 用于將分組數(shù)據(jù)按照順序切分成n片医增,返回當(dāng)前切片值师郑,如果切片不均勻,默認(rèn)增加第一個(gè)切片的分布调窍。NTILE不支持ROWS BETWEEN,比如NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PERCEDING AND CURRENT ROW) |
Hive2.1.0及以后版本支持distinct
在聚合函數(shù)(sum, count, avg)中支持distinct宝冕,但是在order by或者 窗口限制中不支持。
conut(distinct a) over(partition by c)
Hive2.1.0以后支持在OVER從句中支持聚合函數(shù)
select rank() over(order by sum(b))
Hive2.2.0中在使用ORDER BY和窗口限制時(shí)支持distinct
count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)
通過實(shí)例深入理解窗口函數(shù)和分析函數(shù)
COUNT邓萨、SUM地梨、MIN菊卷、MAX、AVG案例分析
## 創(chuàng)建數(shù)據(jù)表
create table orders(
user_id string,
device_id string,
user_type string,
price float,
sales int);
## 添加數(shù)據(jù)orders.txt
zhangsa test1 new 67.1 2
lisi test2 old 43.32 1
wanger test3 new 88.88 3
liliu test4 new 66.0 1
tom test5 new 54.32 1
tomas test6 old 77.77 2
tomson test7 old 88.44 3
tom1 test8 new 56.55 6
tom2 test9 new 88.88 5
tom3 test10 new 66.66 5
## 開窗函數(shù)案例
select
user_id,
user_type,
sales,
-- 默認(rèn)從起點(diǎn)到當(dāng)前所有重復(fù)行
sum(sales) over(partition by user_type order by sales asc) as sales_1,
-- 從起點(diǎn)到當(dāng)前所有重復(fù)行與sales_1結(jié)果相同
sum(sales) over(partition by user_type order by sales asc range between unbounded preceding and current row) as sales_2,
-- 從起點(diǎn)到當(dāng)前行宝剖,結(jié)果與sale_1結(jié)果不同
sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) as sales_3,
-- 當(dāng)前行加上往前3行
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) as sales_4,
-- 當(dāng)前范圍往上加3行
sum(sales) over(partition by user_type order by sales asc range between 3 preceding and current row) as sales_5,
-- 當(dāng)前行+往前3行+往后1行
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) as sales_6,
--
sum(sales) over(partition by user_type order by sales asc range between 3 preceding and 1 following) as sales_7,
-- 當(dāng)前行+之后所有行
sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) as sales_8,
--
sum(sales) over(partition by user_type order by sales asc range between current row and unbounded following) as sales_9,
-- 分組內(nèi)所有行
sum(sales) over(partition by user_type) as sales_10
from
orders
order by
user_type,
sales,
user_id;
##上述查詢結(jié)果如下:
| user_id | user_type | sales | sales_1 | sales_2 | sales_3 | sales_4 | sales_5 | sales_6 | sales_7 | sales_8 | sales_9 | sales_10 |
|----------|------------|--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----------|
| liliu | new | 1 | 2 | 2 | 2 | 2 | 2 | 4 | 4 | 22 | 23 | 23 |
| tom | new | 1 | 2 | 2 | 1 | 1 | 2 | 2 | 4 | 23 | 23 | 23 |
| zhangsa | new | 2 | 4 | 4 | 4 | 4 | 4 | 7 | 7 | 21 | 21 | 23 |
| wanger | new | 3 | 7 | 7 | 7 | 7 | 7 | 12 | 7 | 19 | 19 | 23 |
| tom2 | new | 5 | 17 | 17 | 17 | 15 | 15 | 21 | 21 | 11 | 16 | 23 |
| tom3 | new | 5 | 17 | 17 | 12 | 11 | 15 | 16 | 21 | 16 | 16 | 23 |
| tom1 | new | 6 | 23 | 23 | 23 | 19 | 19 | 19 | 19 | 6 | 6 | 23 |
| lisi | old | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 6 | 6 | 6 |
| tomas | old | 2 | 3 | 3 | 3 | 3 | 3 | 6 | 6 | 5 | 5 | 6 |
| tomson | old | 3 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 3 | 3 | 6 |
注意
- 結(jié)果和ORDER BY相關(guān),默認(rèn)為升序
- 如果不指定ROWS BETWEEN,默認(rèn)為從起點(diǎn)到當(dāng)前行;
- 如果不指定ORDER BY洁闰,則將分組內(nèi)所有值累加;
- PRECEDING:往前
- FOLLOWING:往后
- CURRENT ROW:當(dāng)前行
- UNBOUNDED:無界限(起點(diǎn)或終點(diǎn))
- UNBOUNDED PRECEDING:表示從前面的起點(diǎn)
- UNBOUNDED FOLLOWING:表示到后面的終點(diǎn)
- 其他COUNT、AVG万细,MIN扑眉,MAX,和SUM用法一樣赖钞。
FIRST_VALUE和LAST_VALUE案例分析
select
user_id,
user_type,
sales,
ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
orders
order by
user_type,
sales;
##上述查詢結(jié)果如下:
| user_id | user_type | sales | row_num | max_sales_user | min_sales_user | curr_last_min_user | curr_last_max_user |
| ------- | --------- | ----- | ------- | -------------- | -------------- | ------------------ | ------------------ |
| tom | new | 1 | 1 | tom1 | tom | tom | liliu |
| liliu | new | 1 | 2 | tom1 | tom | tom | liliu |
| zhangsa | new | 2 | 3 | tom1 | tom | zhangsa | zhangsa |
| wanger | new | 3 | 4 | tom1 | tom | wanger | wanger |
| tom3 | new | 5 | 5 | tom1 | tom | tom3 | tom2 |
| tom2 | new | 5 | 6 | tom1 | tom | tom3 | tom2 |
| tom1 | new | 6 | 7 | tom1 | tom | tom1 | tom1 |
| lisi | old | 1 | 1 | tomson | lisi | lisi | lisi |
| tomas | old | 2 | 2 | tomson | lisi | tomas | tomas |
| tomson | old | 3 | 3 | tomson | lisi | tomson | tomson |
LEAD與LAG
select
user_id,
device_id,
sales,
ROW_NUMBER() OVER(ORDER BY sales) AS row_num,
lead(device_id) over (order by sales) as default_after_one_line,
lag(device_id) over (order by sales) as default_before_one_line,
lead(device_id,2) over (order by sales) as after_two_line,
lag(device_id,2,'abc') over (order by sales) as before_two_line
from
orders
order by
sales;
上述查詢結(jié)果如下
| user_id | device_id | sales | row_num | default_after_one_line | default_before_one_line | after_two_line | before_two_line |
|----------|------------|--------|----------|-------------------------|--------------------------|-----------------|------------------|
| lisi | test2 | 1 | 3 | test6 | test4 | test1 | test5 |
| liliu | test4 | 1 | 2 | test2 | test5 | test6 | abc |
| tom | test5 | 1 | 1 | test4 | NULL | test2 | abc |
| zhangsa | test1 | 2 | 5 | test7 | test6 | test3 | test2 |
| tomas | test6 | 2 | 4 | test1 | test2 | test7 | test4 |
| wanger | test3 | 3 | 7 | test10 | test7 | test9 | test1 |
| tomson | test7 | 3 | 6 | test3 | test1 | test10 | test6 |
| tom2 | test9 | 5 | 9 | test8 | test10 | NULL | test3 |
| tom3 | test10 | 5 | 8 | test9 | test3 | test8 | test7 |
| tom1 | test8 | 6 | 10 | NULL | test9 | NULL | test10 |
RANK腰素、ROW_NUMBER、DENSE_RANK
select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
orders;
##上述查詢結(jié)果如下
| user_id | user_type | sales | r | rn | dr |
| ------- | --------- | ----- | --- | --- | --- |
| tom1 | new | 6 | 1 | 1 | 1 |
| tom3 | new | 5 | 2 | 2 | 2 |
| tom2 | new | 5 | 2 | 3 | 2 |
| wanger | new | 3 | 4 | 4 | 3 |
| zhangsa | new | 2 | 5 | 5 | 4 |
| tom | new | 1 | 6 | 6 | 5 |
| liliu | new | 1 | 6 | 7 | 5 |
| tomson | old | 3 | 1 | 1 | 1 |
| tomas | old | 2 | 2 | 2 | 2 |
| lisi | old | 1 | 3 | 3 | 3 |
NTILE
select
user_type,sales,
--分組內(nèi)將數(shù)據(jù)分成2片
NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
--分組內(nèi)將數(shù)據(jù)分成3片
NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
--分組內(nèi)將數(shù)據(jù)分成4片
NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
--將所有數(shù)據(jù)分成4片
NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
orders
order by
user_type,
sales;
##上述查詢結(jié)果如下
| user_type | sales | nt2 | nt3 | nt4 | all_nt4 |
| --------- | ----- | --- | --- | --- | ------- |
| new | 1 | 1 | 1 | 1 | 1 |
| new | 1 | 1 | 1 | 1 | 1 |
| new | 2 | 1 | 1 | 2 | 2 |
| new | 3 | 1 | 2 | 2 | 3 |
| new | 5 | 2 | 2 | 3 | 4 |
| new | 5 | 2 | 3 | 3 | 3 |
| new | 6 | 2 | 3 | 4 | 4 |
| old | 1 | 1 | 1 | 1 | 1 |
| old | 2 | 1 | 2 | 2 | 2 |
| old | 3 | 2 | 3 | 3 | 2 |
求取sale前20%的用戶ID
select
user_id
from
(
select
user_id,
NTILE(5) OVER(ORDER BY sales desc) AS nt
from
orders
)A
where nt=1;
## 結(jié)果如下
+----------+
| user_id |
+----------+
| tom1 |
| tom3 |
+----------+
CUME_DIST雪营、PERCENT_RANK
select
user_id,user_type,sales,
--沒有partition,所有數(shù)據(jù)均為1組
CUME_DIST() OVER(ORDER BY sales) AS cd1,
--按照user_type進(jìn)行分組
CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
from
orders;
## 上述結(jié)果如下
+----------+------------+--------+------+----------------------+--+
| user_id | user_type | sales | cd1 | cd2 |
+----------+------------+--------+------+----------------------+--+
| liliu | new | 1 | 0.3 | 0.2857142857142857 |
| tom | new | 1 | 0.3 | 0.2857142857142857 |
| zhangsa | new | 2 | 0.5 | 0.42857142857142855 |
| wanger | new | 3 | 0.7 | 0.5714285714285714 |
| tom2 | new | 5 | 0.9 | 0.8571428571428571 |
| tom3 | new | 5 | 0.9 | 0.8571428571428571 |
| tom1 | new | 6 | 1.0 | 1.0 |
| lisi | old | 1 | 0.3 | 0.3333333333333333 |
| tomas | old | 2 | 0.5 | 0.6666666666666666 |
| tomson | old | 3 | 0.7 | 1.0 |
+----------+------------+--------+------+----------------------+--+
select
user_type,sales,
--分組內(nèi)總行數(shù)
SUM(1) OVER(PARTITION BY user_type) AS s,
--RANK值
RANK() OVER(ORDER BY sales) AS r,
PERCENT_RANK() OVER(ORDER BY sales) AS pr,
--分組內(nèi)
PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
from
orders;
## 上述結(jié)果如下
+------------+--------+----+-----+---------------------+---------------------+--+
| user_type | sales | s | r | pr | prg |
+------------+--------+----+-----+---------------------+---------------------+--+
| new | 1 | 7 | 1 | 0.0 | 0.0 |
| new | 1 | 7 | 1 | 0.0 | 0.0 |
| new | 2 | 7 | 4 | 0.3333333333333333 | 0.3333333333333333 |
| new | 3 | 7 | 6 | 0.5555555555555556 | 0.5 |
| new | 5 | 7 | 8 | 0.7777777777777778 | 0.6666666666666666 |
| new | 5 | 7 | 8 | 0.7777777777777778 | 0.6666666666666666 |
| new | 6 | 7 | 10 | 1.0 | 1.0 |
| old | 1 | 3 | 1 | 0.0 | 0.0 |
| old | 2 | 3 | 4 | 0.3333333333333333 | 0.5 |
| old | 3 | 3 | 6 | 0.5555555555555556 | 1.0 |
+------------+--------+----+-----+---------------------+---------------------+--+
參考博文
http://blog.csdn.net/scgaliguodong123_/article/details/60135385