1涂圆、具體問題
在數(shù)據(jù)分析師面試中,經(jīng)常會遇到手寫代碼的問題币叹,比如下面這道面試題润歉,已知一個表格有三個字段,第一個是店鋪shop颈抚,第二個是銷售日期saledday踩衩,第三個是銷售額gmv。求連續(xù)三天有銷售記錄的店鋪贩汉。本文提供的是hive語法的解法驱富。
shop | saleday | gmv |
---|---|---|
A | 2017-10-11 | 300 |
A | 2017-10-12 | 200 |
A | 2017-10-13 | 100 |
A | 2017-10-15 | 100 |
B | 2017-10-11 | 400 |
B | 2017-10-12 | 200 |
B | 2017-10-13 | 300 |
C | 2017-10-11 | 350 |
C | 2017-10-15 | 400 |
C | 2017-10-16 | 200 |
C | 2017-10-17 | 100 |
D | 2017-10-13 | 500 |
D | 2017-10-14 | 600 |
E | 2017-10-14 | 600 |
E | 2017-10-15 | 500 |
2、解決方案
2.1 方法1 join兩次
第一種方法比較容易想到匹舞,就是將表進行自連接褐鸥,將表連續(xù)往后推移兩天,如果都當前日期赐稽、當前日期的后一天叫榕、當前日期的后兩天該店鋪都有銷售記錄,那么這個店鋪就是連續(xù)三天有銷售記錄的店鋪姊舵。test_table為表名晰绎,將表自連接兩次,連接字段為店鋪名字和時間括丁,第二個表的銷售日期和第一個表的銷售日期差1天寒匙,第三個表的銷售日期的和第一個表的銷售日期相差2天。
select t1.shop as shop1
,t1.saleday as saleday1
,t1.gmv as gmv1
,t2.shop as shop2
,t2.saleday as saleday2
,t2.gmv as gmv2
,t3.shop as shop3
,t3.saleday as saleday3
,t3.gmv as gmv3
from test_table t1
left join test_table t2
on t1.shop =t2.shop
and replace(t1.saleday,'-','')=to_char(dateadd(to_date(t2.saleday,'yyyy-mm-dd'),-1,'dd'),'yyyymmdd')
left join test_table t3
on t1.shop=t3.shop
and replace(t1.saleday,'-','')=to_char(dateadd(to_date(t3.saleday,'yyyy-mm-dd'),-2,'dd'),'yyyymmdd')
上面的代碼中t1為主表躏将,左連接t2和t3锄弱,連接字段是店鋪名和日期,這里用到dateadd函數(shù)用來做日期的加減祸憋,用replace函數(shù)改變?nèi)掌诘母袷交嵯堋=Y(jié)果如下表所示:
shop1 | saleday1 | gmv1 | shop2 | saleday2 | gmv2 | shop3 | saleday3 | gmv3 |
---|---|---|---|---|---|---|---|---|
A | 2017/10/11 | 300 | A | 2017/10/12 | 200 | A | 2017/10/13 | 100 |
B | 2017/10/11 | 400 | B | 2017/10/12 | 200 | B | 2017/10/13 | 300 |
A | 2017/10/12 | 200 | A | 2017/10/13 | 100 | \N | \N | \N |
B | 2017/10/12 | 200 | B | 2017/10/13 | 300 | \N | \N | \N |
D | 2017/10/13 | 500 | D | 2017/10/14 | 600 | \N | \N | \N |
E | 2017/10/14 | 600 | E | 2017/10/15 | 500 | \N | \N | \N |
C | 2017/10/15 | 400 | C | 2017/10/16 | 200 | C | 2017/10/17 | 100 |
C | 2017/10/16 | 200 | C | 2017/10/17 | 100 | \N | \N | \N |
A | 2017/10/13 | 100 | \N | \N | \N | A | 2017/10/15 | 100 |
C | 2017/10/11 | 350 | \N | \N | \N | \N | \N | \N |
B | 2017/10/13 | 300 | \N | \N | \N | \N | \N | \N |
D | 2017/10/14 | 600 | \N | \N | \N | \N | \N | \N |
A | 2017/10/15 | 100 | \N | \N | \N | \N | \N | \N |
E | 2017/10/15 | 500 | \N | \N | \N | \N | \N | \N |
C | 2017/10/17 | 100 | \N | \N | \N | \N | \N | \N |
從上表可以看出如果某個店鋪當前銷售日期的后一天沒有銷售記錄,后面就是NULL蚯窥,所以我們下一步就是要把三天都有銷售記錄的店鋪名字取出來掸鹅,直接用where限制一下條件就好。
select distinct t1.shop
from test_table t1
left join test_table t2
on t1.shop =t2.shop
and replace(t1.saleday,'-','')=to_char(dateadd(to_date(t2.saleday,'yyyy-mm-dd'),-1,'dd'),'yyyymmdd')
left join test_table t3
on t1.shop=t3.shop
and replace(t1.saleday,'-','')=to_char(dateadd(to_date(t3.saleday,'yyyy-mm-dd'),-2,'dd'),'yyyymmdd')
where t1.gmv>0 and t2.gmv>0 and t3.gmv>0
結(jié)果如下:
shop |
---|
A |
B |
C |
2.2 方法2 lead函數(shù)
第一種方法比較容易想到拦赠,也好理解巍沙,但是書寫起來比較麻煩,要join兩次荷鼠,hive中提供了一些窗口函數(shù)可以使用句携,比如lead函數(shù)可以獲取結(jié)果集中,按一定排序所排列的當前行的相鄰后面若干行的某個行的某個列(不用結(jié)果集的自關(guān)聯(lián))允乐,這邊首先要將銷售日期字段升序排列矮嫉,具體代碼如下:
select t1.shop as shop1,t1.saleday as saleday1,t1.gmv as gmv1
,lead(t1.shop,1,null) over (partition by t1.shop order by saleday) as shop2
,lead(t1.saleday,1,null) over (partition by t1.shop order by saleday) as saleday2
,lead(t1.gmv,1,null) over (partition by t1.shop order by saleday) as gmv2
,lead(t1.shop,2,null) over (partition by t1.shop order by saleday) as shop3
,lead(t1.saleday,2,null) over (partition by t1.shop order by saleday) as saleday3
,lead(t1.gmv,2,null) over (partition by t1.shop order by saleday) as gmv3
from test_table t1
lead(t1.shop,1,null) over (partition by t1.shop order by saleday) 代表數(shù)據(jù)按照店鋪名字分組削咆,按照銷售日期升序排列,整個表格往后移一行的shop列取出來蠢笋。結(jié)果如表下:
shop1 | saleday1 | gmv1 | shop2 | saleday2 | gmv2 | shop3 | saleday3 | gmv3 |
---|---|---|---|---|---|---|---|---|
A | 2017-10-11 | 300 | A | 2017-10-12 | 200 | A | 2017-10-13 | 100 |
A | 2017-10-12 | 200 | A | 2017-10-13 | 100 | A | 2017-10-15 | 100 |
A | 2017-10-13 | 100 | A | 2017-10-15 | 100 | \N | \N | \N |
A | 2017-10-15 | 100 | \N | \N | \N | \N | \N | \N |
B | 2017-10-11 | 400 | B | 2017-10-12 | 200 | B | 2017-10-13 | 300 |
B | 2017-10-12 | 200 | B | 2017-10-13 | 300 | \N | \N | \N |
B | 2017-10-13 | 300 | \N | \N | \N | \N | \N | \N |
C | 2017-10-11 | 350 | C | 2017-10-15 | 400 | C | 2017-10-16 | 200 |
C | 2017-10-15 | 400 | C | 2017-10-16 | 200 | C | 2017-10-17 | 100 |
C | 2017-10-16 | 200 | C | 2017-10-17 | 100 | \N | \N | \N |
C | 2017-10-17 | 100 | \N | \N | \N | \N | \N | \N |
D | 2017-10-13 | 500 | D | 2017-10-14 | 600 | \N | \N | \N |
D | 2017-10-14 | 600 | \N | \N | \N | \N | \N | \N |
E | 2017-10-14 | 600 | E | 2017-10-15 | 500 | \N | \N | \N |
E | 2017-10-15 | 500 | \N | \N | \N | \N | \N | \N |
從上表可以看出某些店鋪的銷售日期不是連續(xù)的拨齐,比如店鋪A銷售日期2017-10-13后面是2017-10-15,所以接下來我們要限制一下時間間隔昨寞,還有g(shù)mv不為空瞻惋。具體代碼如下:
select distinct t2.shop1
from
(
select t1.shop as shop1,t1.saleday as saleday1,t1.gmv as gmv1
,lead(t1.shop,1,null) over (partition by t1.shop order by saleday) as shop2
,lead(t1.saleday,1,null) over (partition by t1.shop order by saleday) as saleday2
,lead(t1.gmv,1,null) over (partition by t1.shop order by saleday) as gmv2
,lead(t1.shop,2,null) over (partition by t1.shop order by saleday) as shop3
,lead(t1.saleday,2,null) over (partition by t1.shop order by saleday) as saleday3
,lead(t1.gmv,2,null) over (partition by t1.shop order by saleday) as gmv3
from test_table t1
)t2
where t2.gmv2>0 and t2.gmv3>0
and replace(t2.saleday1,'-','')=to_char(dateadd(to_date(t2.saleday2,'yyyy-mm-dd'),-1,'dd'),'yyyymmdd')
and replace(t2.saleday1,'-','')=to_char(dateadd(to_date(t2.saleday3,'yyyy-mm-dd'),-2,'dd'),'yyyymmdd')
上述代碼運行一下就可以得到連續(xù)三天有銷售記錄的店鋪,結(jié)果如下:
shop |
---|
A |
B |
C |
2.3 方法3 新思路
最后一種方法比較難想到援岩,如果我們把銷售日期中的幾號取出來熟史,然后再分組按照銷售日期降序排列,如果將兩者相加窄俏,連續(xù)的天數(shù)的和應(yīng)該是一樣的蹂匹。代碼如下:
select *
,substr(saleday, 9, 2) as day
,row_number() over (partition by shop order by saleday desc) as rank
,cast(substr(saleday,9,2) as int) + row_number() over (partition by shop order by saleday desc) as plus
from test_table
運行結(jié)果如下:
shop | saleday | gmv | day | rank | plus |
---|---|---|---|---|---|
A | 2017-10-15 | 100 | 15 | 1 | 16 |
A | 2017-10-13 | 100 | 13 | 2 | 15 |
A | 2017-10-12 | 200 | 12 | 3 | 15 |
A | 2017-10-11 | 300 | 11 | 4 | 15 |
B | 2017-10-13 | 300 | 13 | 1 | 14 |
B | 2017-10-12 | 200 | 12 | 2 | 14 |
B | 2017-10-11 | 400 | 11 | 3 | 14 |
C | 2017-10-17 | 100 | 17 | 1 | 18 |
C | 2017-10-16 | 200 | 16 | 2 | 18 |
C | 2017-10-15 | 400 | 15 | 3 | 18 |
C | 2017-10-11 | 350 | 11 | 4 | 15 |
D | 2017-10-14 | 600 | 14 | 1 | 15 |
D | 2017-10-13 | 500 | 13 | 2 | 15 |
E | 2017-10-15 | 500 | 15 | 1 | 16 |
E | 2017-10-14 | 600 | 14 | 2 | 16 |
plus列中如果數(shù)字相等則表示改店鋪有連續(xù)銷售記錄,如果連續(xù)三天凹蜈,則有三個數(shù)字相等限寞,如果連續(xù)四天則有四個數(shù)字相等,所以接下來我們只要根據(jù)店鋪名字字段和plus字段分組仰坦,找到相同數(shù)據(jù)記錄超過3行的店鋪名字履植,具體代碼如下:
select t1.shop
from
(
select *
,substr(saleday, 9, 2) as day
,row_number() over (partition by shop order by saleday desc) as rank
,cast(substr(saleday,9,2) as int) + row_number() over (partition by shop order by saleday desc) as plus
from test_table
)t1
group by t1.shop,t1.plus
having count(*)>=3
運行結(jié)果如下:
shop |
---|
A |
B |
C |
方法3易于拓展,可以拓展到多天悄晃,不光連續(xù)3天玫霎,4天5天也可以,但是有個問題妈橄,如果跨月這個方法就會不準庶近,如果原始數(shù)據(jù)存在跨月的情況,那建議采用方法1或者方法2眷蚓。