數(shù)據(jù)和題目源:數(shù)據(jù)蛙http://www.reibang.com/u/1f32f227da5f
建表:使用hive sql的CTAS 快速建表語法復制表到自己的數(shù)據(jù)庫
create table frog_qing.ods_sales_orders as select * from adventure_ods.ods_sales_orders;
create table frog_qing.ods_cutomer as select * from adventure_ods.ods_customer;
-
ods_sales_orders表的列說明如下:
-
ods_cutomer表的列說明如下:
題目匯總
- 題目一:每個用戶截止到每月為止的最大交易金額和累計到該月的總交易金額汞舱,結(jié)果數(shù)據(jù)格式如下
- 題目二:計算用戶的回購率和復購率
- 題目三:求用戶號對應不同的產(chǎn)品]
- 題目四:統(tǒng)計各個省份所屬城市下最受歡迎的Top 3產(chǎn)品和其銷量(不能出現(xiàn)有null)
- 題目五:商品的銷售數(shù)量top10伍纫,排名需考慮并列排名的情況
- 題目六:計算累計和(統(tǒng)計2019年1-12月的累積銷量,即1月為1月份的值兵拢,2月為1翻斟、2月份值的和,3月為1说铃、2访惜、3月份的和,12月為1-12月份值的和
- 題目七計算客戶平均購買一次商品的間隔時間
- 題目八:查詢最近前20%時間的訂單信息
- 每個用戶截止到每月為止的最大交易金額和累計到該月的總交易金額腻扇,結(jié)果數(shù)據(jù)格式如下
1.1 先將日期轉(zhuǎn)為By 年/月格式,按照客戶和年月進行分組
1.2 計算分組后的最大交易金額和累計總交易金額
select
b.customer_key,
b.umonth,
b.ucount,
b.current_max,
sum(b.sum_price) over(partition by b.customer_key order by b.umonth) as current_sum
from (
select
customer_key,
date_format(create_date, 'YYYY-MM-01') as umonth,
count(sales_order_key) as ucount,
max(unit_price) as current_max,
sum(unit_price) as sum_price
from ods_sales_orders
group by customer_key,date_format(create_date, 'YYYY-MM-01')
) b
limit 50;
- 計算用戶的回購率和復購率
2.1 復購率债热,一個月內(nèi)購買了多次的用戶占總用戶的比例
select
a.umonth,
count(a.ucount) as user_c,
sum(if(a.ucount>1, 1, 0))/count(ucount) as fugou
from (
select
date_format(create_date, 'YYYY-MM-01') AS umonth,
count(customer_key) as ucount
from ods_sales_orders
group by date_format(create_date, 'YYYY-MM-01'),customer_key
) a
group by a.umonth
limit 20;
2.2 回購率,當前月份購買且下個月份也購買的客戶占當月所有月份客戶比例幼苛,思路:現(xiàn)將當月數(shù)據(jù)和下個月數(shù)據(jù)進行連接窒篱,連接后按月分組計算回購人數(shù)和比例
select
a.umonth,
count(a.customer_key) as pay_user,
count(b.customer_key) as fg_user,
concat((count(b.customer_key)/count(a.customer_key))*100,'%') as fg_rate
from (
select
date_format(create_date, 'YYYY-MM-01') as umonth,
customer_key
from ods_sales_orders a
group by date_format(create_date, 'YYYY-MM-01'),customer_key) a
left join (
select
date_format(create_date, 'YYYY-MM-01') as umonth,
customer_key
from ods_sales_orders a
group by date_format(create_date, 'YYYY-MM-01'),customer_key) b
on a.customer_key=b.customer_key
and add_months(a.umonth,1)=b.umonth
group by a.umonth;
- 求用戶號對應不同的產(chǎn)品,理解:求出用戶第一次和第二次購買的產(chǎn)品
注意點:
3.1.1 lead()窗口函數(shù)求出窗口的后一個值,當為Null值時,使用where函數(shù)判斷是否相等時會過濾掉這些null值墙杯,而我們需要保留這些null值配并,使用nvl()將null值轉(zhuǎn)換,進行保留
3.1.2 collect_set(字段)根據(jù)某個字段分組后高镐,把分在一組的數(shù)據(jù)進行去重溉旋,去重后的數(shù)據(jù)按分組形成集合,默認分隔符','
3.1.3 concat_ws('-',collect_set(xx)) 可將分組后的數(shù)據(jù)使用指定分隔符鏈接
- 方法一嫉髓、按照數(shù)據(jù)蛙答案修改,篩選出來購買大于兩次的用戶观腊,且購買產(chǎn)品不同的用戶和對應產(chǎn)品
with tmp as (
select
a.customer_key,
a.cpzl_zw,
a.rk,
a.second_cpzl,
row_number() over(partition by customer_key order by rk) as rn
from (
select
customer_key,
cpzl_zw,
row_number() over(partition by customer_key order by create_date) as rk,
lead(cpzl_zw,1, null) over(partition by customer_key order by create_date) as second_cpzl
from ods_sales_orders
) a
where a.cpzl_zw != a.second_cpzl
)
select concat(customer_key,'-',cpzl_zw, '-',second_cpzl),count(1)
from tmp
where tmp.rn < 2
limit 20;
- 方法二、利用collect_set特性算行,保留了購買一次客戶的產(chǎn)品和兩次購買相同產(chǎn)品的客戶
select
a.customer_key,
count(1),
concat(a.customer_key, '-',concat_ws('-', collect_set(a.cpzl_zw))) as product
from (
select
customer_key,
cpzl_zw,
row_number() over(partition by customer_key order by create_date) as rn
from ods_sales_orders
) a
where a.rn <3
group by a.customer_key
limit 20;
- 統(tǒng)計各個省份所屬城市下最受歡迎的Top 3產(chǎn)品和其銷量(不能出現(xiàn)有null)
4.1 求出每個城市各產(chǎn)品的銷量
select
b.chinese_city,
a.cpzl_zw,
count(a.sales_order_key) as sales
from ods_sales_orders a
left join ods_cutomer b
on a.customer_key = b.customer_key
group by b.chinese_city,a.cpzl_zw
limit 20;
4.2 求出按城市按產(chǎn)品銷量的排名
select
c.chinese_city,
c.cpzl_zw,
c.sales,
row_number() over(partition by c.chinese_city order by c.sales desc) rn
from (
select
b.chinese_city,
a.cpzl_zw,
count(a.sales_order_key) as sales
from ods_sales_orders a
left join ods_cutomer b
on a.customer_key = b.customer_key
group by b.chinese_city,a.cpzl_zw
) c
where c.chinese_city is not null and c.chinese_city != 'null'
limit 20;
4.3 篩選每個城市top3的產(chǎn)品
select
d.chinese_city,
d.cpzl_zw,
d.sales,
d.rn
from (
select
c.chinese_city,
c.cpzl_zw,
c.sales,
row_number() over(partition by c.chinese_city order by c.sales desc) rn
from (
select
b.chinese_city,
a.cpzl_zw,
count(a.sales_order_key) as sales
from ods_sales_orders a
left join ods_cutomer b
on a.customer_key = b.customer_key
group by b.chinese_city,a.cpzl_zw
) c
where c.chinese_city is not null and c.chinese_city != 'null'
) d
where d.rn <4;
- 商品的銷售數(shù)量top10梧油,排名需考慮并列排名的情況
5.1 計算每個產(chǎn)品的銷量
select
cpzl_zw,
count(1) as sales
from ods_sales_orders
group by cpzl_zw
limit 20;
5.2 按銷量進行排名,篩選前10銷量
DENSE_RANK() 生成數(shù)據(jù)項在分組中的排名州邢,相同值序號一樣儡陨,不跳過下個序號
select *
from (
select
a.cpzl_zw,
a.sales,
dense_rank() over(order by a.sales) as rn
from (
select
cpzl_zw,
count(1) as sales
from ods_sales_orders
group by cpzl_zw
) a
) b
where b.rn <=10;
- 計算累計和(統(tǒng)計2019年1-12月的累積銷量,即1月為1月份的值量淌,2月為1迄委、2月份值的和,3月為1类少、2、3月份的和渔扎,12月為1-12月份值的和)
6.1 計算每個月的銷量
select
date_format(create_date, 'YYYY-MM-01') as month_date,
COUNT(sales_order_key) as sale_amount
from ods_sales_orders
group by date_format(create_date, 'YYYY-MM-01');
6.2 計算累計銷量
ORDER BY create_date 按月份對查詢讀取的記錄進行排序硫狞,就是窗口范圍內(nèi)的排序
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW 定義起點和終點,UNBOUNDED PRECEDING 為起點晃痴,表明從第一行開始, CURRENT ROW為默認值残吩,就是這一句等于:ROWS UNBOUNDED PRECEDING PRECEDING:在前 N 行的意思,F(xiàn)OLLOWING:在后 N 行的意思倘核。
select
a.month_date,
a.sale_amount,
sum(a.sale_amount) over(order by month_date asc ORDER BY create_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sale_amount_consum
from (
select
date_format(create_date, 'YYYY-MM-01') as month_date,
COUNT(sales_order_key) as sale_amount
from ods_sales_orders
group by date_format(create_date, 'YYYY-MM-01')
) a;
- 計算客戶平均購買一次商品的間隔時間
7.1 得出用戶兩次購買的時間泣侮,使用lead()函數(shù)
LEAD(col,n,DEFAULT) 用于統(tǒng)計窗口內(nèi)往下第n行值,第一個參數(shù)為列名紧唱,第二個參數(shù)為往下第n行(可選活尊,默認為1),第三個參數(shù)為默認值(當往下第n行為NULL時候漏益,取默認值蛹锰,如不指定,則為NULL)所以使用LEAD (時間,1)绰疤,取客戶的時間铜犬,按照客戶編號分組再按時間排序,所以使用over(partition by 客戶編號 order by 時間)
select
customer_key,
create_date,
lead(create_date,1,null) over(partition by customer_key order by create_date) as next_date
from ods_sales_orders
limit 20;
7.2 計算兩次購買時間差,再計算出每個客戶的平均時間差
select
a.customer_key,
avg(datediff(a.next_date, a.create_date)) as avg_days
from (
select
customer_key,
create_date,
lead(create_date,1,'2019-01-01') over(partition by customer_key order by create_date) as next_date
from ods_sales_orders
) a
where datediff(a.next_date, a.create_date)>0
group by a.customer_key limit 20;
- 查詢最近前20%即前1/5時間的訂單信息
8.1 將時間按降序分為5組,使用nitle()窗口函數(shù)將日期降序分位5組
select
*,
ntile(5) over(order by create_date desc) gid
from ods_sales_orders limit 10;
8.2 取出時間分組為第一組的訂單信息
select *
from (
select
*,
ntile(5) over(order by create_date desc) gid
from ods_sales_orders
) a
where a.gid=1;