hive 日常數(shù)據(jù)需求(盡可能展示窗口函數(shù)的使用)
題目一:每個(gè)用戶截止到每月為止的最大交易金額和累計(jì)到該月的總交易金額,結(jié)果數(shù)據(jù)格式如下
#先將每個(gè)人每個(gè)月的消費(fèi)進(jìn)行聚合
SELECT customer_key,DATE_FORMAT(create_date,"yyyy-MM") AS umonth,
COUNT(1) AS ucount,SUM(unit_price) AS sum_price
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,"yyyy-MM")
ORDER BY umonth
#隨后通過窗口函數(shù)屁使,求出當(dāng)前月為止最大消費(fèi)以及累計(jì)消費(fèi)
SELECT customer_key,
umonth,
ucount,
MAX(sum_price) over(PARTITION BY customer_key ORDER BY umonth ROWS BETWEEN unbounded preceding AND current ROW) AS current_max,
SUM(sum_price) over(PARTITION BY customer_key ORDER BY umonth ROWS BETWEEN unbounded preceding AND current ROW) AS current_sum
FROM( #每個(gè)人每月消費(fèi)
SELECT customer_key,DATE_FORMAT(create_date,"yyyy-MM") AS umonth,
COUNT(1) AS ucount,SUM(unit_price) AS sum_price
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,"yyyy-MM")
ORDER BY umonth) a
LIMIT 10;
筆記:
1.窗口函數(shù)邊界:
2.hive中流济,取年月的兩種方式
題目二:計(jì)算用戶的回購(gòu)率和復(fù)購(gòu)率
復(fù)購(gòu)率: 當(dāng)前月份購(gòu)買2次及以上的客戶占所有客戶比例
回購(gòu)率:當(dāng)前月份購(gòu)買且上個(gè)月份也購(gòu)買的客戶占當(dāng)月所有月份客戶比例
思路:
復(fù)購(gòu)率
1、對(duì)當(dāng)月(2月份)的客戶分組技即,計(jì)數(shù)購(gòu)買次數(shù)
2著洼、篩選購(gòu)買次數(shù)為2以上的,認(rèn)為是復(fù)購(gòu)群體
SELECT
umonth,
COUNT(1) AS `總客戶數(shù)`,
SUM(IF(buys >=2,1,0)) AS `復(fù)購(gòu)客戶`,
SUM(IF(buys >=2,1,0))/COUNT(1) AS `復(fù)購(gòu)率`
FROM ( #按月份統(tǒng)計(jì)客戶消費(fèi)次數(shù)情況
SELECT customer_key,
DATE_FORMAT(create_date,"yyyy-MM") umonth,
COUNT(1) buys
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,"yyyy-MM")
) a
GROUP BY umonth
LIMIT 10;
回購(gòu)率
1、篩選當(dāng)月及上月部分
2身笤、利用客戶id進(jìn)行當(dāng)月連上月豹悬,推薦左連
3、對(duì)同一條客戶id均有購(gòu)買記錄的液荸,認(rèn)為是回購(gòu)群體
SELECT
a.umonth,
SUM(IF(a.cons >=2,1,0))/COUNT(1) AS `復(fù)購(gòu)率`,
COUNT(b.customer_key)/COUNT(a.customer_key) AS `回購(gòu)率`
FROM
(SELECT customer_key,
DATE_FORMAT(create_date,'yyyy-MM') umonth,
COUNT(1) cons
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,'yyyy-MM')
)a
LEFT JOIN
(SELECT customer_key,
DATE_FORMAT(create_date,'yyyy-MM') umonth,
COUNT(1) cons
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,'yyyy-MM')
) b
ON a.customer_key = b.customer_key
AND CONCAT(a.umonth,'-01') =add_months(CONCAT(b.umonth,'-01'),1)
GROUP BY a.umonth
LIMIT 10;
注意:
1.連接條件中瞻佛,條件1為客戶key字段,表示同一個(gè)客戶娇钱;條件2為兩個(gè)月的信息連接(本月日期 = 上月日期 +1個(gè)月)
2.ADD_MONTHS:別忘了有個(gè)'s'伤柄,add_months只能對(duì)完整日期格式進(jìn)行月份加減(add_months(‘yyyy-MM-dd hh:mm:ss’,N)文搂,add_months(‘yyyy-MM-dd’适刀,N)),所以這里先用concat給'年月'增加一個(gè)'日'字段细疚,其中N為正則增加蔗彤,N為負(fù)則減少
題目三:求用戶最先購(gòu)買的兩種子類別產(chǎn)品信息,要求拼接成 “用戶號(hào)-產(chǎn)品1-產(chǎn)品2” 的形式
第一步:根據(jù)用戶分組疯兼,求出用戶購(gòu)買的產(chǎn)品及順序
cpzl_zw1是cpzl_zw的上一個(gè)購(gòu)買的產(chǎn)品
SELECT customer_key,cpzl_zw,
row_number() over(PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw,1,0) over(PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
-
lag() over():
lag(參數(shù)1然遏,參數(shù)2,參數(shù)3)
參數(shù)1:表中列名 參數(shù)2:往下偏移多少位 參數(shù)3:超出行數(shù)時(shí)默認(rèn)設(shè)置值
lag 往往和over結(jié)合使用
如:lag(參數(shù)1吧彪,參數(shù)2待侵,參數(shù)3)over(order by 列)
第二步:選擇相鄰兩個(gè)產(chǎn)品不同的記錄
SELECT
customer_key,
cpzl_zw,
order_num,
cpzl_zw1
FROM (
SELECT
customer_key,
cpzl_zw,
row_number() OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw, 1, 0) OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
) a
WHERE cpzl_zw != cpzl_zw1
第三步:為相鄰的產(chǎn)品添加排序(即上面lag函數(shù)添加的產(chǎn)品)
若相鄰產(chǎn)品的排序?yàn)?,則表示第三件產(chǎn)品姨裸,題目要求前兩件產(chǎn)品秧倾,則可以根據(jù)小于3進(jìn)行篩選
WITH tmp1 AS (
SELECT
customer_key,
cpzl_zw,
order_num,
cpzl_zw1
FROM (
SELECT
customer_key,
cpzl_zw,
row_number() OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw, 1, 0) OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
) a
WHERE cpzl_zw != cpzl_zw1)
SELECT customer_key,cpzl_zw,order_num,cpzl_zw1,
row_number() over(PARTITION BY customer_key ORDER BY order_num ASC) AS order_num1
FROM tmp1
-
with as():類似于mysql中的視圖功能暫時(shí)引用版(as 括號(hào)后為暫時(shí)引用的表),必須在結(jié)束后使用select進(jìn)行查詢否則報(bào)錯(cuò)傀缩,且多個(gè)with as用逗號(hào)隔開那先。
第四步:數(shù)據(jù)拼接
WITH tmp1 AS (
SELECT
customer_key,
cpzl_zw,
order_num,
cpzl_zw1
FROM (
SELECT
customer_key,
cpzl_zw,
row_number() OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw, 1, 0) OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
) a
WHERE cpzl_zw != cpzl_zw1)
,tmp2 AS (
SELECT customer_key,cpzl_zw,order_num,cpzl_zw1,
row_number() over(PARTITION BY customer_key ORDER BY order_num ASC) AS order_num1
FROM tmp1)
SELECT
CONCAT(customer_key,'-',CONCAT_WS('-',collect_set(cpzl_zw))) AS z1
FROM tmp2
WHERE order_num1 < 3
GROUP BY customer_key;
-
concat_ws:如何指定參數(shù)之間的分隔符
使用函數(shù)CONCAT_WS()。使用語法為:CONCAT_WS(separator,str1,str2,…)
CONCAT_WS() 代表 CONCAT With Separator 赡艰,是CONCAT()的特殊形式售淡。第一個(gè)參數(shù)是其它參數(shù)的分隔符。分隔符的位置放在要連接的兩個(gè)字符串之間慷垮。分隔符可以是一個(gè)字符串揖闸,也可以是其它參數(shù)。如果分隔符為 NULL料身,則結(jié)果為 NULL汤纸。函數(shù)會(huì)忽略任何分隔符參數(shù)后的 NULL 值。但是CONCAT_WS()不會(huì)忽略任何空字符串芹血。 (然而會(huì)忽略所有的 NULL)贮泞。 - collect_set():
collect_set
常用于聚合后的字段處理
個(gè)人認(rèn)為第二步可以省略楞慈,collect_set自帶去重功能,有點(diǎn)多余了
WITH tmp AS (
SELECT customer_key,cpzl_zw,
row_number() over(PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw,1,0) over(PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
),tmp2 AS(
SELECT customer_key,cpzl_zw,order_num,cpzl_zw1,
row_number() over(PARTITION BY customer_key ORDER BY order_num ASC) order_num1
FROM tmp)
SELECT
CONCAT(customer_key,'-',CONCAT_WS('-',collect_set(cpzl_zw))) AS z1
FROM tmp2
WHERE order_num1 < 3
GROUP BY customer_key;
題目四:統(tǒng)計(jì)各個(gè)省份所屬城市下最受歡迎的Top 3產(chǎn)品和其銷量(不能出現(xiàn)有null)
#多次利用with as
#方法一:
SELECT chinese_city,product_key,cons,c.`rank`
FROM (
WITH tmp AS(
SELECT sales_order_key,product_key,chinese_city
FROM ods_sales_orders a
RIGHT JOIN ods_customer b
ON a.customer_key = b.customer_key)
,tmp2 AS (
SELECT chinese_city,product_key,COUNT(sales_order_key) AS cons
FROM tmp
GROUP BY chinese_city,product_key
)
SELECT chinese_city,product_key,cons,
row_number() over(PARTITION BY chinese_city ORDER BY cons DESC) AS `rank`
FROM tmp2
) c
WHERE c.`rank` <= 3
LIMIT 100;
#方法二:
SELECT chinese_city,product_key,cons,`rank`
FROM(
SELECT chinese_city,product_key,cons,
row_number() over(PARTITION BY chinese_city ORDER BY cons DESC) AS `rank`
FROM (
SELECT b.chinese_city,a.product_key,COUNT(sales_order_key) AS cons
FROM ods_sales_orders a
RIGHT JOIN ods_customer b
ON a.customer_key = b.customer_key
GROUP BY b.chinese_city,a.product_key
) c
) d
WHERE `rank` <= 3
AND d.chinese_city != 'null'
LIMIT 100;
步驟:
第一步:將兩表進(jìn)行連接(customer為主表)
第二步:按城市啃擦、產(chǎn)品進(jìn)行聚合分組抖部,求出銷量
第三步:窗口函數(shù)對(duì)銷量進(jìn)行排序
第四步:where進(jìn)行篩選(where不能直接對(duì)窗口函數(shù)結(jié)果進(jìn)行篩選,所以要再次使用表子連接)
題目五:商品的銷售數(shù)量top10议惰,排名需考慮并列排名的情況
提示:只用到訂單表 ods_sales_orders慎颗,并列排序 dense_rank 窗口函數(shù)
SELECT product_key,cons,`rank`
FROM(
SELECT product_key,cons,dense_rank() over(ORDER BY cons DESC) AS `rank`
FROM (
SELECT product_key,COUNT(1) AS cons
FROM ods_sales_orders
GROUP BY product_key) a)b
WHERE b.`rank` <= 10;
題目六:計(jì)算累計(jì)和(統(tǒng)計(jì)2019年1-12月的累積銷量,即1月為1月份的值言询,2月為1俯萎、2月份值的和,3月為1运杭、2夫啊、3月份的和,12月為1-12月份值的和)
SELECT yearmonth,ROUND(sum_price,2),
ROUND(SUM(sum_price) over(ORDER BY yearmonth ASC),2) AS cumsum
FROM(
SELECT
DATE_FORMAT(create_date,"yyyy-MM") AS yearmonth,SUM(unit_price) AS sum_price
FROM ods_sales_orders
WHERE YEAR(create_date) = '2019'
GROUP BY DATE_FORMAT(create_date,"yyyy-MM")
) a;
題目七:計(jì)算客戶平均購(gòu)買一次商品的間隔時(shí)間
SELECT customer_key,AVG(diff_date) AS avg_buy_period
FROM(
SELECT
customer_key,
create_date,
lead(create_date,1) over(PARTITION BY customer_key ORDER BY create_date ASC) next_date,
DATEDIFF(lead(create_date,1) over(PARTITION BY customer_key ORDER BY create_date ASC),create_date) AS diff_date
FROM ods_sales_orders) a
GROUP BY customer_key
HAVING avg_buy_period IS NOT NULL;
- lead() over():作用于lag相同辆憔,方向相反撇眯;Lag和Lead分析函數(shù)可以在同一次查詢中取出同一字段的前N行的數(shù)據(jù)(Lag)和后N行的數(shù)據(jù)(Lead)作為獨(dú)立的列。
題目八:查詢最近前20%時(shí)間的訂單信息
SELECT *
FROM(
SELECT *, ntile(5) OVER(ORDER BY create_date DESC) `grouping`
FROM ods_sales_orders ) a
WHERE a.`grouping` = 1
LIMIT 10;
- ntile(n):用于將數(shù)據(jù)分成n組