目標(biāo):
1说铃、掌握 sum()蛇受、avg()等用于累計計算的聚合函數(shù),學(xué)會對行數(shù)的限制(移動計算)丹墨;
2、掌握 row_number(),rank()嬉愧、dense_rank()用于排序的函數(shù)贩挣;
3、掌握 ntile()用于分組查詢的函數(shù)没酣;
4王财、掌握 lag()、lead()偏移分析函數(shù)
窗口函數(shù)(window function):
與聚合函數(shù)類似裕便,但是窗口函數(shù)是每一行數(shù)據(jù)都生成一個結(jié)果绒净,聚合函數(shù)可以將多行數(shù)據(jù)按照規(guī)定聚合為一行,一般來說聚合后的行數(shù)要少于聚合前的行數(shù)偿衰,但是有時我們想要既顯示聚合前的數(shù)據(jù)挂疆,又要顯示聚合后的數(shù)據(jù),這時便引入了窗口函數(shù)下翎,窗口函數(shù)是在 select 時執(zhí)行的缤言,位于 order by 之前。
一视事、累計計算窗口函數(shù)(相當(dāng)于計算聚合的聚合胆萧?二次聚合?)
1俐东、sum(...) over(...)
在日常工作中跌穗,經(jīng)常遇到計算截止某月或某天的累計數(shù)值订晌,在Excel可以通過函數(shù)來實(shí)現(xiàn),
在HiveSQL里蚌吸,可以利用窗口函數(shù)實(shí)現(xiàn)。
1)2018年每月的支付總額和當(dāng)年累計支付總額
select a.month
,a.total_amount
,sum(a.total_amount) over(order by a.month)
from
( select month(dt) as month,sum(pay_amount) as total_amount
from user_trade
where year(dt)=2018
group by month(dt)
)a
2)對2017年和2018年公司的支付總額按月度累計進(jìn)行分析套利,按年度進(jìn)行匯總
寫法一:
select a.year
,a.month
,a.amount
,sum(a.amount) over(partition by a.year order by a.month)
from
(select substr(trade_time,1,4) as year
,substr(trade_time,6,2) as month
,sum(amount)as amount
from trade_2017
group by substr(trade_time,1,4),substr(trade_time,6,2)
union all
select substr(trade_time,1,4) as year
,substr(trade_time,6,2) as month
,sum(amount) as amount
from trade_2018
group by substr(trade_time,1,4),substr(trade_time,6,2)
)a
寫法二:
select a.year
,a.month
,a.pay_amount
,sum(a.pay_amount) over(partition by a.year order by a.month)
from
(select year(dt)as year
,month(dt) as month
,sum(pay_amount) as pay_amount
from user_trade
where year(dt) in (2017,2018)
group by year(dt),month(dt)
)a
說明:1、over中的 partition by 起到分組的作用验辞;
2、order by 按照什么順序進(jìn)行累加喊衫,升序ASC跌造、降序DESC族购,默認(rèn)升序
3、正確的分組是非常重要的寝杖,partition by 后面的字段是需要累計計算的區(qū)域违施,需要仔細(xì)理解
2、avg(...) over(...):移動平均
(計算三日留存瑟幕、七日留存磕蒲、三十日留存等方式可以使用這個函數(shù)只盹。)
3)對2018年每個月的近三個月進(jìn)行移動的求平均支付金額
select a.month
,a.amount
,avg(a.amount) over(order by a.month rows between 2 preceding and current row)
from
(select month(dt) as month
,sum(pay_amount) as amount
from user_trade
where year(dt)=2018
group by month(dt)
)a
注意:如果使用聚合函數(shù),則必須要在group by里使用殖卑,窗口函數(shù)使用時不用group by,
但是窗口函數(shù)必須要在子查詢里配合使用聚合函數(shù)才能得出正確的結(jié)果
3许起、語法總結(jié)
sum(...A...) over(partition by ...B... order by ...C... rows between ...D1... and ...D2...)
avg(...A...) over(partition by ...B... order by ...C... rows between ...D1... and ...D2...)
A:需要被加工的字段名稱
B:分組的字段名稱
C:排序的字段名稱
D:計算的行數(shù)范圍
rows between unbounded preceding and current row
包括本行和之前所有的行
rows between current row and unbounded following
包括本行和之后所有的行
rows between 3 preceding and current row
包括本行以內(nèi)和前三行
rows between 3 preceding and 1 following
從前三行到下一行(5行)
拓展:
max(...A...) over(partition by ...B... order by ...C... rows between ...D1... and ...D2...)
min(...A...) over(partition by ...B... order by ...C... rows between ...D1... and ...D2...)
二菩鲜、分區(qū)排序窗口函數(shù)(row_number()、rank()睦袖、dense_rank())
用法:這三個函數(shù)的作用都是返回相應(yīng)規(guī)則的排序序號,由于排序函數(shù)不是二次聚合計算伦乔,因此不一定要使用子查詢
row_number() over(partition by ...A... order by ...B...)
rank() over(partition by ...A... order by ...B...)
dense_rank() over(partition by ...A... order by ...B...)
A:分組的字段名稱
B:排序的字段名稱
注意:這3個函數(shù)的括號內(nèi)是不加任何字段名稱的!
row_number:為查詢出來的每一行生成一個序號爱只,依次排序且不會重復(fù);
rank和dense_rank:在各個分組內(nèi)恬试,rank()是跳躍排序疯暑,有兩個第一名時接下來就是第三名,
dense_rank()是連續(xù)排序妇拯,有兩個第一名時,仍然跟著第二名仗嗦。
4)2019年1月,用戶購買商品品類數(shù)量的排名
select user_name
,count(distinct goods_category) as goods_num
,row_number() over(order by count(distinct goods_category)) as row_number
,rank() over(order by count(distinct goods_category)) as rank
,dense_rank() over(order by count(distinct goods_category)) as dense_rank
from user_trade
where substr(dt,1,7)='2019-01'
group by user_name
注意:由于substr()函數(shù)截取的日期是字符串格式甘凭,所以要加引號稀拐,用日期函數(shù)截取的日期才可以不加引號丹弱。
5)選出2019年支付金額排名在第10、20蹈矮、30名的用戶
select a.user_name,a.amount,a.row_number
from
(select user_name
,sum(pay_amount) as amount
,row_number() over(order by sum(pay_amount) desc)as row_number
from user_trade
where year(dt)=2019
group by user_name
)a
where a.row_number in(10,20,30)
三鸣驱、切片排序窗口函數(shù)
ntile(n) over(...)
ntile(n) over(partition by ...A... order by ...B...)
n:切分的片數(shù)
A:分組的字段名稱
B:排序的字段名稱
ntile(n):用于將分組數(shù)據(jù)按照順序切分成n片,返回當(dāng)前切片值北滥。
ntile不支持rows between闸翅,比如
ntile(2) over(partition by ... order by ... rows between 3 preceding and current row)
如果切片數(shù)據(jù)不均勻,則前面的組分得的數(shù)據(jù)較多坚冀。
6)將2019年1月的支付用戶,按照支付金額分成5組
select user_name
,sum(pay_amount) as amount
,ntile(5) over(order by sum(pay_amount) desc)as ntile
from user_trade
where substr(dt,1,7)='2019-01'
group by user_name
7)選出2019年退款金額排名前10%的用戶
select a.user_name,a.amount,a.ntile
from
(select user_name
,sum(refund_amount) amount
,ntile(10) over(order by sum(refund_amount) desc)as ntile
from user_refund
where year(dt)=2019
group by user_name
)a
where ntile=1
四司训、偏移分析窗口函數(shù)
說明:Lag和Lead分析函數(shù)可以在同一次查詢中取出同一字段的前N行數(shù)據(jù)(Lag)和后N行的數(shù)據(jù)(Lead)作為獨(dú)立的列。
在實(shí)際應(yīng)用當(dāng)中勾徽,若要用到取今天和昨天的某字段差值時统扳,Lag和Lead函數(shù)的應(yīng)用就顯得尤為重要。
當(dāng)然吹由,這種操作可以用表的自連接實(shí)現(xiàn)盯腌,但是Lag和Lead與 left join、 right join等自連接相比腕够,效率更高,SQL語句更簡潔玫荣。
lag(exp_str,offset,defval) over(partition by ... order by ...)
lead(exp_str,offset,defval) over(partition by ... order by ...)
exp_str是字段名稱大诸;
offset表示偏移量,即是上一個或上N個的值资柔,假設(shè)當(dāng)前行在表中排在第5行,offset為3辙芍,
則表示我們所要找的數(shù)據(jù)行就是表中的第2行(即5-3=2)羹与,offset默認(rèn)值為1。
defval默認(rèn)值纵搁,當(dāng)這兩個函數(shù)取上N/下N個值時,在表中從當(dāng)前行位置向前數(shù)N行已經(jīng)超出了
表的范圍時徘层,lag()函數(shù)將defval這個參數(shù)作為函數(shù)的返回值,若每月指定默認(rèn)值惑灵,則返回NULL。
lead()函數(shù)也是一樣的道理佩憾。在數(shù)學(xué)運(yùn)算中干花,一般都是要給一個默認(rèn)值才不會出錯。
8)支付時間間隔超過100天的用戶數(shù)(這一次購買距離下一次購買的時間抡驼?肿仑,注意datediff函數(shù)是日期大的在前面)
寫法一:
select count(distinct a.user_name)
from
(select user_name
,dt
,lead(dt,1,dt) over(partition by user_name order by dt) lead_dt
from user_trade
where dt>'0'
)a
where datediff(a.lead_dt,dt)>100
寫法二:
select count(distinct a.user_name)
from
(select user_name
,dt
,lag(dt,1,dt) over(partition by user_name order by dt) lag_dt
from user_trade
where dt>'0'
)a
where datediff(dt,a.lag_dt)>100
9)每個城市,不同性別尤慰,2018年支付金額最高的TOP3用戶
select *
from(
select
b.city
,b.sex
,a.user_name
,a.amount
,row_number()over(partition by b.city,b.sex order by a.amount desc) rank
from
(select user_name
,sum(pay_amount) amount
from user_trade
where year(dt)=2018
group by user_name)a
left join
(select user_name,city,sex
from user_info)b
on a.user_name=b.user_name)c
where c.rank<=3
步驟總結(jié):
1伟端、首先篩選出每個用戶和每個用戶總的消費(fèi)金額;
2责蝠、對兩個表進(jìn)行連接提取需要的字段;
3齿拂、對連接后的表進(jìn)行二次聚合計算肴敛,計算出不同城市、性別的金額排名值朋;
4巩搏、對二次聚合計算的表進(jìn)行條件篩選提取
10)每個手機(jī)品牌退款金額前25%的用戶
"phonebrand":"iphone X"
select *
from
(
select b.phonebrand
,a.user_name
,a.amount
,ntile(4) over(partition by b.phonebrand order by a.amount desc) ntile
from
(select user_name,sum(refund_amount) amount
from user_refund
where dt>'0'
group by user_name)a
left join
(select user_name,extra2["phonebrand"]as phonebrand
from user_info
)b
on a.user_name=b.user_name
)c
where c.ntile=1
步驟總結(jié):
1贯底、首先篩選出每個用戶和每個用戶的總退款金額撒强;
2笙什、對兩個表進(jìn)行連接提取需要的字段;
3琐凭、對連接后的表進(jìn)行按手機(jī)品牌內(nèi)分組;
4胚吁、對分組后的表進(jìn)行條件篩選提取