引用: 2020學(xué)霸批拼多多數(shù)據(jù)分析筆試總結(jié)
拼多多2020學(xué)霸批數(shù)據(jù)分析師筆試 —— SQL整理
參考這個(gè)里邊的題目 自己試著做一下 看看自己sql的水平怎么樣
第一部分:SQL查詢(xún)題3道
問(wèn)題一:
(考點(diǎn):活動(dòng)運(yùn)營(yíng)數(shù)據(jù)分析)
表1——訂單表orders,大概字段有(user_id‘用戶(hù)編號(hào)’, order_pay‘訂單金額’ , order_time‘下單時(shí)間’)秕噪。(圖中字段名有些差異)
表2——活動(dòng)報(bào)名表act_apply斤斧,大概字段有(act_id‘活動(dòng)編號(hào)’, user_id‘報(bào)名用戶(hù)’,act_time‘報(bào)名時(shí)間’)
要求:
統(tǒng)計(jì)每個(gè)活動(dòng)對(duì)應(yīng)所有用戶(hù)在報(bào)名后產(chǎn)生的總訂單金額,總訂單數(shù)谣拣。(每個(gè)用戶(hù)限報(bào)一個(gè)活動(dòng),題干默認(rèn)用戶(hù)報(bào)名后產(chǎn)生的訂單均為參加活動(dòng)的訂單)募寨。
統(tǒng)計(jì)每個(gè)活動(dòng)從開(kāi)始后到當(dāng)天(考試日)平均每天產(chǎn)生的訂單數(shù),活動(dòng)開(kāi)始時(shí)間定義為最早有用戶(hù)報(bào)名的時(shí)間森缠。(涉及到時(shí)間的數(shù)據(jù)類(lèi)型均為:datetime)拔鹰。
解答
第一問(wèn)
select A.act_id, sum(O.order_pay) as total_pay, count(O.user_id) as order_count
from orders as O
left join act_apply as A
on A.user_id = O.user_id
group by A.act_id
存在的問(wèn)題: 少考慮了一個(gè)因素 訂單時(shí)間應(yīng)當(dāng)超過(guò)活動(dòng)時(shí)間 且別名起成中文會(huì)好一些
select A.act_id as '活動(dòng)編號(hào)', sum(O.order_pay) as '總金額', count(O.user_id) as '訂單數(shù)'
from orders as O
left join act_apply as A
on A.user_id = O.user_id
where A.act_time <= O.order_time
group by A.act_id;
第二問(wèn)
先選出每個(gè)活動(dòng)開(kāi)始的時(shí)間
select A.act_id, min(A.act_time) as 'act_start'
from act_apply as A
group by A.act_id;
第一問(wèn)可以得到總訂單數(shù) 兩表連接
select tmp1.act_id as '活動(dòng)編號(hào)', tmp1.order_num/datediff(now(), tmp2.'act_start' ) as '活動(dòng)開(kāi)始后平均每天下單數(shù)'
from (select A.act_id , count(O.user_id) as 'order_num'
from orders as O
left join act_apply as A
on A.user_id = O.user_id
where A.act_time <= O.order_time
group by A.act_id) as tmp1
join (select A.act_id, min(A.act_time) as 'act_start'
from act_apply as A
group by A.act_id) as tmp2
on tmp1.act_id = tmp2.act_id
整體思路就是 兩個(gè)子查詢(xún) 第一個(gè)先查出每個(gè)活動(dòng)的訂單數(shù)量 第二個(gè)查出每個(gè)訂單的開(kāi)始日期 兩表按照活動(dòng)id連接 每個(gè)訂單的數(shù)量除以 當(dāng)前日期與開(kāi)始日期的差值即可
(應(yīng)該是對(duì)的)
后邊發(fā)現(xiàn)自己的寫(xiě)法寫(xiě)復(fù)雜了 其實(shí)第二個(gè)子查詢(xún)是沒(méi)必要的
select A.act_id as '活動(dòng)編號(hào)', count(O.user_id)/datediff(now(), min(A.act_time)) as '活動(dòng)開(kāi)始后平均每天下單數(shù)'
from orders as O
left join act_apply as A
on A.user_id = O.user_id
where A.act_time <= O.order_time
group by A.act_id
作者的答案 其實(shí)也寫(xiě)復(fù)雜了但是可以學(xué)習(xí)一下窗口函數(shù)
SELECT act_id as '活動(dòng)編號(hào)', COUNT(*)/DATEDIFF(NOW(),act_start) AS '活動(dòng)開(kāi)始后平均每天下單數(shù)'
FROM orders a
LEFT JOIN
(SELECT user_id ,act_id ,act_time, min(act_time) over(PARTITION by act_id) as 'act_start'
FROM act_apply
) b
ON a.user_id=b.user_id
WHERE order_time>=act_time
GROUP BY act_id,act_start
這里用到了一個(gè)over函數(shù) 具體用法在后邊進(jìn)行了補(bǔ)充
這個(gè)的思路與第一問(wèn)的思路一致 只不過(guò)對(duì)于活動(dòng)表給出了每個(gè)活動(dòng)的開(kāi)始日期(使用over比較方便 否則需要group by min 然后再做一次連接) 然后兩表連接 分組計(jì)數(shù)除以時(shí)間差即可
需要注意的是 我也覺(jué)得最后一行GROUP BY act_id,act_start 可以改為 GROUP BY act_id 但是作者給出了解釋:
如果group by 沒(méi)有act_start的話(huà) sql不允許select語(yǔ)句使用它的~是聚合函數(shù)的限制
補(bǔ)充
https://kknews.cc/code/66jnqzv.html
OVER的定義
OVER用于為行定義一個(gè)窗口,它對(duì)一組值進(jìn)行操作贵涵,不需要使用GROUP BY子句對(duì)數(shù)據(jù)進(jìn)行分組列肢,能夠在同一行中同時(shí)返回基礎(chǔ)行的列和聚合列。
OVER的語(yǔ)法
OVER ( [ PARTITION BY column ] [ ORDER BY culumn ] )
PARTITION BY 子句進(jìn)行分組宾茂;
ORDER BY 子句進(jìn)行排序瓷马。
窗口函數(shù)OVER()指定一組行,開(kāi)窗函數(shù)計(jì)算從窗口函數(shù)輸出的結(jié)果集中各行的值跨晴。
開(kāi)窗函數(shù)不需要使用GROUP BY就可以對(duì)數(shù)據(jù)進(jìn)行分組欧聘,還可以同時(shí)返回基礎(chǔ)行的列和聚合列。
OVER的用法
OVER開(kāi)窗函數(shù)必須與聚合函數(shù)或排序函數(shù)一起使用端盆,聚合函數(shù)一般指SUM(),MAX(),MIN,COUNT(),AVG()等常見(jiàn)函數(shù)怀骤。排序函數(shù)一般指RANK(),ROW_NUMBER(),DENSE_RANK(),NTILE()等。
具體例子可以參考以上鏈接
問(wèn)題二:
(考點(diǎn):用戶(hù)行為路徑分析)
表1——用戶(hù)行為表tracking_log焕妙,大概字段有(user_id‘用戶(hù)編號(hào)’,opr_id‘操作編號(hào)’,log_time‘操作時(shí)間’)
要求:
1.計(jì)算每天的訪客數(shù)和他們的平均操作次數(shù)蒋伦。
2.統(tǒng)計(jì)每天符合以下條件的用戶(hù)數(shù):A操作之后是B操作,AB操作必須相鄰焚鹊。
解答
問(wèn)題1:
先選出每天的訪客數(shù)和每天的操作次數(shù)
select T.log_time, count(distinct T.user_id) as '訪客數(shù)', count(T.opr_id) as '總操作次數(shù)'
from tracking_log as T
group by T.log_time
子查詢(xún)
select tmp.log_time, tmp.'訪客數(shù)', tmp.'總操作次數(shù)'/tmp.'訪客數(shù)' as '平均操作次數(shù)'
from (select T.log_time, count(distinct T.user_id) as '訪客數(shù)', count(T.opr_id) as '總操作次數(shù)'
from tracking_log as T
group by T.log_time) as tmp;
好像又寫(xiě)復(fù)雜了痕届。。 不用子查詢(xún)就可以完成
select T.log_time, count(distinct T.user_id) as '訪客數(shù)', count(T.opr_id)/count(distinct T.user_id) as '平均操作次數(shù)'
from tracking_log as T
group by T.log_time
參考答案
SELECT b.date , COUNT(b.user_id) as '訪客數(shù)' , AVG(op) AS '平均操作次數(shù)'
FROM
(SELECT user_id , COUNT(opr_type) as 'op', CONVERT(log_time,date) as 'date' FROM tracking_log
GROUP BY user_id,date) b
GROUP BY b.date
思路:用group嵌套寺旺,剛好可以把人數(shù)和人次分解開(kāi)
子查詢(xún)對(duì)日期和user_id進(jìn)行分組 得到每天每個(gè)人的操作次數(shù)
然后對(duì)日期進(jìn)行分組 統(tǒng)計(jì)user_id的數(shù)量即為每天的訪客數(shù) 對(duì)每天每個(gè)人的操作次數(shù)取平均即為平均操作次數(shù)
需要注意的點(diǎn):
MySQL 的CAST()和CONVERT()函數(shù)可用來(lái)獲取一個(gè)類(lèi)型的值爷抓,并產(chǎn)生另一個(gè)類(lèi)型的值。兩者具體的語(yǔ)法如下:
CAST(value as type);
CONVERT(value, type);
這里的log_time需要用 CONVERT(log_time,date) as 'date' 處理
而且mysql的一個(gè)特點(diǎn)是group by 中的字段 是可以使用 select中的別名的
MySQL extends standard SQL to permit noncolumn expressions in GROUP BY clauses and considers the preceding statement valid.
修改自己的答案
select CONVERT(T.log_time,date) , count(distinct T.user_id) as '訪客數(shù)', count(T.opr_id)/count(distinct T.user_id) as '平均操作次數(shù)'
from tracking_log as T
group by CONVERT(T.log_time,date)
問(wèn)題2:
刷過(guò)leetcode之后這個(gè)題還是比較簡(jiǎn)單的 就是寫(xiě)起來(lái)有些復(fù)雜
先對(duì)時(shí)間排序?qū)τ诓煌膗ser_id創(chuàng)建按時(shí)間排序的rank
select T.user_id阻塑, CONVERT(T.log_time,date) as date蓝撇, T. opr_id,
@rank:= if(T.user_id = @pre_id, @rank+1, 1),
@pre_id:= T.user_id
from tracking_log as T, (select @rank:=0, @pre_id:=NULL) as init
-- Mysql,order by 后跟 別名 表達(dá)式 均可
order by T.user_id, CONVERT(T.log_time,date) asc;
然后兩個(gè)上表進(jìn)行連接 連接條件是user_id相等 tmp1.rank = tmp2.rank-1 (其實(shí)就是對(duì)于同一個(gè)user_id通過(guò)rank后后一天進(jìn)行連接)
然后篩選tmp1.opr_id = 'A' tmp2.opr_id = 'B'
最后對(duì)日期分組計(jì)數(shù)即可 可能一天中用戶(hù)先A后B多次 但只需要統(tǒng)計(jì)1次 因此需要去重
select tmp1.date, count(distinct tmp1.user_id) as 'A-B路徑用戶(hù)計(jì)數(shù)'
from (select T.user_id陈莽, CONVERT(T.log_time,date) as date渤昌, T. opr_id,
@rank:= if(T.user_id = @pre_id, @rank+1, 1) as rank,
@pre_id:= T.user_id
from tracking_log as T, (select @rank:=0, @pre_id:=NULL) as init
order by T.user_id, CONVERT(T.log_time,date) asc) as tmp1
join (select T.user_id走搁, CONVERT(T.log_time,date) as date独柑, T. opr_id,
@rank:= if(T.user_id = @pre_id, @rank+1, 1) as rank,
@pre_id:= T.user_id
from tracking_log as T, (select @rank:=0, @pre_id:=NULL) as init
order by T.user_id, CONVERT(T.log_time,date) asc) as tmp2
on tmp1.rank = tmp2.rank - 1 and tmp1.user_id = tmp2.user_id
where tmp1.opr_id = 'A' and tmp2.opr_id = 'B'
group by tmp1.date;
參考答案
ELECT a.Date , COUNT(*) as 'A-B路徑用戶(hù)計(jì)數(shù)'
FROM(
SELECT DISTINCT user_id as 'User',opr_type as '1st', CONVERT(log_time,date) as 'Date', lead(opr_type,1)
over(PARTITION by user_id,CONVERT(log_time,date) ORDER BY log_time) as '2nd'
FROM tracking_log
) a
WHERE a.1st = 'A' and a.2nd ='B'
GROUP BY a.Date
確實(shí)利用窗口函數(shù)要簡(jiǎn)單很多
對(duì)user_id,和時(shí)間分組 按時(shí)間排序 取出后一個(gè)操作opr_id的寫(xiě)法是這樣的
lag(1) 后移一項(xiàng) lead(1)前移一項(xiàng) 這里應(yīng)該用 lead
select lead(T.opr_id) over( PARTITION by T.user_id,convert(T.log_time,date) order by T.log_time asc)
from tracking_log as T
整體選出是這樣的 這里也需要去重
SELECT DISTINCT T.user_id as 'User', T.opr_type as '1st', CONVERT(T.log_time,date) as 'Date', lead(T.opr_type,1)
over(PARTITION by T.user_id, CONVERT(T.log_time,date) ORDER BY CONVERT(T.log_time,date) ASC) as '2nd'
FROM tracking_log as T
最后剩下就是分組篩選 然后計(jì)數(shù)了
問(wèn)題三
(考點(diǎn):用戶(hù)新增留存分析)
表1——用戶(hù)登陸表user_log私植,大概字段有(user_id‘用戶(hù)編號(hào)’忌栅,log_time‘登陸時(shí)間’)
要求:
1.每天新增用戶(hù)數(shù),以及他們第2天曲稼、30天的回訪比例
參考答案:
(找出每個(gè)用戶(hù)第一次登陸時(shí)間索绪,再聚合時(shí)間得到每一天新增用戶(hù)湖员,時(shí)間要聚合到天)
解答
先找出每個(gè)用戶(hù)的最小日期即為首次登陸的時(shí)間
select T.user_id, min(log_time) as first_time
from tracking_log as T
group by T.user_id
又要再篩選 用一下窗口函數(shù)試一下吧
select T.user_id, min(log_time) over(PARTITION by convert(T.user_id, date)) as first_time
from tracking_log as T
然后對(duì)first_time進(jìn)行分組,統(tǒng)計(jì)去重的user_id個(gè)數(shù)即為當(dāng)天的新用戶(hù)數(shù)量
select tmp.first_time, count(distinct tmp.user_id) as new_count
from (select T.user_id, min(log_time) over(PARTITION by convert(T.user_id, date)) as first_time
from tracking_log as T) as tmp
group by tmp.first_time
第二天的回訪數(shù)量怎么求
登陸時(shí)間與首次時(shí)間的差值為1就可以進(jìn)行篩選
select tmp1.first_time, count(distinct tmp1.user_id) as 2nd_back
from (select T.user_id, min(log_time) over(PARTITION by convert(T.user_id, date)) as first_time, convert(T.user_id, date) as date
from tracking_log as T) as tmp1
where datediff(tmp1.date, tmp1.first_time) = 1
group by tmp1.first_time
同理 一個(gè)月后的回訪
select tmp2.first_time, count(distinct tmp2.user_id) as 30nd_back
from (select T.user_id, min(log_time) over(PARTITION by convert(T.user_id, date)) as first_time, convert(T.user_id, date) as date
from tracking_log as T) as tmp2
where datediff(tmp2.date, tmp2.first_time) = 29
group by tmp2.first_time
以上三表進(jìn)行連接 左連接
select a.first_time as '日期', a.new_count as '新增用戶(hù)',
concat(round(100*b.2nd_back/a.new_count, 2), '%') as '第2天回訪率'
concat(round(100*b.30nd_back/a.new_count, 2), '%') as '第30天回訪率'
from (select tmp.first_time, count(distinct tmp.user_id) as new_count
from (select T.user_id, min(log_time) over(PARTITION by convert(T.user_id, date)) as first_time
from tracking_log as T) as tmp
group by tmp.first_time) as a
left join (select tmp1.first_time, count(distinct tmp1.user_id) as 2nd_back
from (select T.user_id, min(log_time) over(PARTITION by convert(T.user_id, date)) as first_time, convert(T.user_id, date) as date
from tracking_log as T) as tmp1
where datediff(tmp1.date, tmp1.first_time) = 1
group by tmp1.first_time) as b
on a.first_time = b.first_time
left join (select tmp2.first_time, count(distinct tmp2.user_id) as 30nd_back
from (select T.user_id, min(log_time) over(PARTITION by convert(T.user_id, date)) as first_time, convert(T.user_id, date) as date
from tracking_log as T) as tmp2
where datediff(tmp2.date, tmp2.first_time) = 29
group by tmp2.first_time) as c
on a.first_time = c.first_time
第二部分:計(jì)算題1道
問(wèn)題四:
(考點(diǎn):貝葉斯公式的應(yīng)用)
已知A,B廠生產(chǎn)的產(chǎn)品的次品率分別是1%和2%瑞驱,現(xiàn)在由A,B產(chǎn)品分別占60%娘摔、40%的樣品中隨機(jī)抽一件,若取到的是次品唤反,求此次品是B廠生產(chǎn)的概率凳寺。
第三部分:綜合分析題1道
問(wèn)題五:
(考點(diǎn):產(chǎn)品更新決策的數(shù)據(jù)支持)
某網(wǎng)站優(yōu)化了商品詳情頁(yè),現(xiàn)在新舊兩個(gè)版本同時(shí)運(yùn)行彤侍,新版頁(yè)面覆蓋了10%的用戶(hù)肠缨,舊版覆蓋90%的用戶(hù)。現(xiàn)在需要了解盏阶,新版頁(yè)面是否能夠提高商品詳情頁(yè)到支付頁(yè)的轉(zhuǎn)化率怜瞒,并決定是否要覆蓋舊版,你能為決策提供哪些信息般哼,需要收集哪些指標(biāo),給出統(tǒng)計(jì)方法及過(guò)程惠窄。
這個(gè)是A/B test 的問(wèn)題 先學(xué)習(xí)一下 再返回來(lái)看這個(gè)問(wèn)題