一:前言
根據(jù)官網(wǎng)的介紹,hive推出的窗口函數(shù)功能是對hive sql的功能增強蒸苇,確實目前用于離線數(shù)據(jù)分析邏輯日趨復(fù)雜,很多場景都需要用到。以下就是對hive窗口函數(shù)的一個總結(jié)附上案例钧唐。
二:理解下什么是WINDOW子句(靈活控制窗口的子集)
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:當(dāng)前行
UNBOUNDED:起點(一般結(jié)合PRECEDING,F(xiàn)OLLOWING使用)
UNBOUNDED PRECEDING 表示該窗口最前面的行(起點)
UNBOUNDED FOLLOWING:表示該窗口最后面的行(終點)
比如說:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(表示從起點到當(dāng)前行)
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING(表示往前2行到往后1行)
ROWS BETWEEN 2 PRECEDING AND 1 CURRENT ROW(表示往前2行到當(dāng)前行)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING(表示當(dāng)前行到終點)
官網(wǎng)有一段話列出了哪些窗口函數(shù)是不支持window子句的匠襟,如下圖所示:
三:準(zhǔn)備需要演示的數(shù)據(jù)
insert overwrite table dw_tmp.window_function_temp
select
split(detail,',')[0] as uname
,split(detail,',')[1] as create_time
,split(detail,',')[2] as pv
from
(
select
concat('測試用戶,2019-10-02,7
#測試用戶,2019-10-05,4
#測試用戶,2019-10-07,5
#測試用戶,2019-10-03,6
#測試用戶,2019-10-04,3
#測試用戶,2019-10-01,3
#測試用戶,2019-10-06,4') as ct_str
) t
lateral view explode(split(ct_str,'#')) t2 as detail;
四:Windowing functions
1.LEAD(col,n,DEFAULT) 用于統(tǒng)計窗口內(nèi)往下第n行值第一個參數(shù)為列名钝侠,第二個參數(shù)為往下第n行(可選,默認為1酸舍,不可為負數(shù))帅韧,第三個參數(shù)為默認值(當(dāng)往下第n行為NULL時候,取默認值啃勉,如不指定忽舟,則為NULL)
2.LAG(col,n,DEFAULT) 用于統(tǒng)計窗口內(nèi)往上第n行值第一個參數(shù)為列名,第二個參數(shù)為往上第n行(可選淮阐,默認為1叮阅,不可為負數(shù)),第三個參數(shù)為默認值(當(dāng)往上第n行為NULL時候泣特,取默認值浩姥,如不指定,則為NULL)
select
uname
,create_time
,pv
,lead(pv,1,-9999) over (partition by uname order by create_time) as lead_1_pv
,lag(pv,1,-9999) over (partition by uname order by create_time) as lag_1_pv
from dw_tmp.window_function_temp;
3.FIRST_VALUE取分組內(nèi)排序后状您,截止到當(dāng)前行及刻,第一個值,這最多需要兩個參數(shù)竞阐。第一個參數(shù)是您想要第一個值的列缴饭,第二個(可選)參數(shù)必須是false默認為布爾值的布爾值。如果設(shè)置為true骆莹,則跳過空值颗搂。
4.LAST_VALUE取分組內(nèi)排序后,截止到當(dāng)前行,最后一個值幕垦,這最多需要兩個參數(shù)丢氢。第一個參數(shù)是您想要第一個值的列傅联,第二個(可選)參數(shù)必須是false默認為布爾值的布爾值。如果設(shè)置為true疚察,則跳過空值蒸走。
select
uname
,create_time
,pv
,first_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as first_value_pv
,last_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as last_value_pv
from dw_tmp.window_function_temp;
讓我們加上window子句來觀察一下變化,雖然FIRST_VALUE和LAST_VALUE不常于與window子句結(jié)合使用貌嫡。
select
uname
,create_time
,pv
,first_value(pv) over (partition by uname order by create_time) as first_value_pv
,first_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as window_first_value_pv
,last_value(pv) over (partition by uname order by create_time) as last_value_pv
,last_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as window_last_value_pv
from dw_tmp.window_function_temp;
五:aggregates functions
1.COUNT
2.SUM
3.MIN
4.MAX
5.AVG
目前支持這五種帶有聚合意義的窗口函數(shù)比驻,以常用SUM舉例。
select
uname
,create_time
,pv
,SUM(pv) over (partition by uname order by create_time) as sum_pv_1 --默認情況
,SUM(pv) over (partition by uname order by create_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_pv_2 --表示從起點到當(dāng)前行
,SUM(pv) over (partition by uname) as sum_pv_3 --表示窗口內(nèi)所有行
,SUM(pv) over (partition by uname order by create_time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum_pv_4 --表示起點到終點
,SUM(pv) over (partition by uname order by create_time ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) as sum_pv_5 --表示前2行到后面1行
from dw_tmp.window_function_temp;
從結(jié)果當(dāng)中其實可以得到結(jié)論岛抄,默認情況就是從起點到當(dāng)前行别惦,不帶order by語句其實就是表示窗口內(nèi)全部行都參與聚合處理,這里其實還有其他用法夫椭,讀者可以自行嘗試一下掸掸。
六:Analytics functions
1.ROW_NUMBER
從1開始,按照順序蹭秋,生成分組內(nèi)記錄的序列,row_number()的值不會存在重復(fù),當(dāng)排序的值相同時,按照表中記錄的順序進行排列;通常用于獲取分組內(nèi)排序第一的記錄;獲取一個session中的第一條refer等扰付。
2.RANK
生成數(shù)據(jù)項在分組中的排名,排名相等會在名次中留下空位仁讨。
3.DENSE_RANK
生成數(shù)據(jù)項在分組中的排名羽莺,排名相等會在名次中不會留下空位。
4.CUME_DIST
CUME_DIST 小于等于當(dāng)前值的行數(shù)/分組內(nèi)總行數(shù)
5.PERCENT_RANK
PERCENT_RANK 分組內(nèi)當(dāng)前行的RANK值-1/分組內(nèi)總行數(shù)-1
6.NTILE
NTILE(n) 用于將分組數(shù)據(jù)按照順序切分成n片陪竿,返回當(dāng)前切片值禽翼,如果切片不均勻,默認增加第一個切片的分布族跛。NTILE不支持ROWS BETWEEN
以上是帶有分析功能的窗口函數(shù)闰挡,使用的頻率沒有上面兩類高,但是也是需要掌握的礁哄。
我們先對1-3三種分析窗口函數(shù)進行演示
select
uname
,create_time
,pv
,ROW_NUMBER() over (partition by uname order by pv) as row_number_pv_1
,RANK() over (partition by uname order by pv) as row_number_pv_2
,DENSE_RANK() over (partition by uname order by pv) as row_number_pv_3
from dw_tmp.window_function_temp;
第4-5種:
select
uname
,create_time
,pv
,CUME_DIST() over (partition by uname order by pv) as CUME_DIST_pv_
,PERCENT_RANK() over (partition by uname order by pv) as PERCENT_RANK_pv_
from dw_tmp.window_function_temp;
第六種:NTILE
select
uname
,create_time
,pv
,NTILE(2) over (partition by uname order by pv) as NTILE_pv_1
,NTILE(3) over (partition by uname order by pv) as NTILE_pv_2
,NTILE(4) over (partition by uname order by pv) as NTILE_pv_3
from dw_tmp.window_function_temp;