Hive分析窗口函數(shù)(二) NTILE,ROW_NUMBER,RANK,DENSE_RANK
- 數(shù)據(jù)準(zhǔn)備
cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,2
cookie2,2018-04-11,3
cookie2,2018-04-12,5
cookie2,2018-04-13,6
cookie2,2018-04-14,3
cookie2,2018-04-15,9
cookie2,2018-04-16,7
CREATE TABLE itzkx_t2 (
cookieid string,
createtime string, --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
加載數(shù)據(jù):
load data local inpath '/export/hivedata/itzkx_t2.dat' into table itzkx_t2;
- NTILE
背景:
? 有時(shí)會(huì)有這樣的需求:如果數(shù)據(jù)排序后分為三部分,業(yè)務(wù)人員只關(guān)心其中的一部分,如何將這中間的三分之一數(shù)據(jù)拿出來呢?NTILE函數(shù)即可以滿足愤估。
ntile可以看成是:把有序的數(shù)據(jù)集合平均分配到指定的數(shù)量(num)個(gè)桶中, 將桶號(hào)分配給每一行呛谜。
如果不能平均分配沽瘦,則優(yōu)先分配較小編號(hào)的桶,并且各個(gè)桶中能放的行數(shù)最多相差1凄硼。
語法是:ntile (num) over ([partition_clause] order_by_clause) as xxx
然后可以根據(jù)桶號(hào)孕讳,選取前或后 n分之幾的數(shù)據(jù)匠楚。
數(shù)據(jù)會(huì)完整展示出來,只是給相應(yīng)的數(shù)據(jù)打標(biāo)簽厂财;具體要取幾分之幾的數(shù)據(jù)芋簿,需要再嵌套一層根據(jù)標(biāo)簽取出。
NTILE不支持ROWS BETWEEN蟀苛,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM itzkx_t2
ORDER BY cookieid,createtime;
比如益咬,統(tǒng)計(jì)一個(gè)cookie逮诲,pv數(shù)最多的前1/3的天
SELECT
cookieid,
createtime,
pv,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM itzkx_t2;
其中rn = 1 的記錄帜平,就是我們想要的結(jié)果
- ROW_NUMBER
ROW_NUMBER() 從1開始,按照順序梅鹦,生成分組內(nèi)記錄的序列
SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM itzkx_t2;
- RANK 和 DENSE_RANK
RANK() 生成數(shù)據(jù)項(xiàng)在分組中的排名裆甩,排名相等會(huì)在名次中留下空位
DENSE_RANK() 生成數(shù)據(jù)項(xiàng)在分組中的排名,排名相等會(huì)在名次中不會(huì)留下空位
SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM itzkx_t2
WHERE cookieid = 'cookie1';