做流量數(shù)據(jù)分析的小伙伴經(jīng)常會(huì)遇到流量數(shù)據(jù)多維度匯總的問題露筒,比如運(yùn)營(yíng)希望看到分城市,業(yè)務(wù)線敌卓,用戶類型慎式,頻道,一級(jí)類目和二級(jí)類目等維度的交叉數(shù)據(jù)趟径,因?yàn)檎wUV并不等于各維度UV的加和瘪吏,所有整體UV需要單獨(dú)計(jì)算,最后將計(jì)算結(jié)果和分維度的計(jì)算結(jié)果union起來(lái)蜗巧,這樣代碼會(huì)特別長(zhǎng)掌眠,舉一個(gè)簡(jiǎn)單例子:
用戶日志表為analyst.user_lot,device_id為設(shè)備ID,pingdao為各頻道名稱幕屹,計(jì)算整體和各頻道的PV和UV代碼如下:
select
date as event_date
,'all' as pingdao
,count(distinct device_id) as uv
,count(device_id) as pv
from analyst.user_log
where date='20200529'
group by date,'all'
union all
select
date as event_date
,pingao
,count(distinct device_id) as uv
,count(device_id) as pv
from analyst.user_log
where date='20200529'
group by date,pingao
運(yùn)行結(jié)果:(這里為了簡(jiǎn)化蓝丙,假設(shè)只有三個(gè)頻道搜索,banner和商家推薦)
event_date | pingao | uv | pv |
---|---|---|---|
20200529 | all | 500 | 6000 |
20200529 | 搜索 | 300 | 3000 |
20200529 | banner | 200 | 2000 |
20200529 | 商家推薦 | 100 | 1000 |
如果我們要維度特別多的話望拖,我們的代碼會(huì)特別長(zhǎng)(例如有3個(gè)維度渺尘,每個(gè)維度又2個(gè)值,我們需要union all六段代碼说敏,才能看到所有維度的匯總和明細(xì)數(shù)據(jù)),hive提供了一個(gè)grouping_id set() 函數(shù)來(lái)解決以上問題鸥跟,例如以上問題代碼可以簡(jiǎn)化為:
select
date as event_date
,pingdao
,Grouping_ID as group_id
,count(distinct device_id) as uv
,count(device_id) as pv
from analyst.user_log
where date='20200529'
group by date,pingdao
group by sets(date,(date,pingdao))
運(yùn)行結(jié)果:(這里為了簡(jiǎn)化,假設(shè)只有三個(gè)頻道:搜索盔沫,banner和商家推薦)
event_date | pingao | group_id | uv | pv |
---|---|---|---|---|
20200529 | /N | 1 | 500 | 6000 |
20200529 | 搜索 | 0 | 300 | 3000 |
20200529 | banner | 0 | 200 | 2000 |
20200529 | 商家推薦 | 0 | 100 | 1000 |
按照group by后面的變量排列順序锌雀,grouping sets()的組合中出現(xiàn)的變量賦值0蚂夕,未出現(xiàn)的變量賦值1迅诬,這樣會(huì)生成一個(gè)二進(jìn)制數(shù)字腋逆,最后將二進(jìn)制數(shù)字轉(zhuǎn)化成十進(jìn)制后的數(shù)字賦值給group_id變量,例如上面代碼運(yùn)行邏輯:
(date,null)二進(jìn)制數(shù)值:01 → 十進(jìn)制數(shù)字為1
(date,pingdao) 二進(jìn)制數(shù)值:00 →十進(jìn)制數(shù)字為0