背景
日常工作中有許多數(shù)據(jù)處理需求需要解決,在此之間作箍,獲得需求,用hive實現(xiàn)需求前硫,最終完成任務(wù)胞得。
題目
數(shù)據(jù)源在:hive中的adventure_ods庫的ods_sales_orders表
表名 | 表注釋 | 字段 | 字段注釋 | |
---|---|---|---|---|
ods_sales_orders | 訂單明細(xì)表 | sales_order_key | 訂單主鍵 | 一個訂單表示銷售一個產(chǎn)品 |
ods_sales_orders | 訂單明細(xì)表 | create_date | 訂單日期 | |
ods_sales_orders | 訂單明細(xì)表 | customer_key | 客戶編號 | |
ods_sales_orders | 訂單明細(xì)表 | product_key | 產(chǎn)品編號 | |
ods_sales_orders | 訂單明細(xì)表 | english_product_name | 產(chǎn)品名 | |
ods_sales_orders | 訂單明細(xì)表 | cpzl_zw | 產(chǎn)品子類 | |
ods_sales_orders | 訂單明細(xì)表 | cplb_zw | 產(chǎn)品類別 | |
ods_sales_orders | 訂單明細(xì)表 | unit_price | 產(chǎn)品單價 |
找出前前幾行看看:
題目一:每個用戶截止到每月為止的最大交易金額和累計到該月的總交易金額,結(jié)果數(shù)據(jù)格式如下
方法1屹电,先對日期匯總
先針對日期做窗口匯總阶剑,然后group by+max 干掉每個月里面的不是第一天的那種'小數(shù)據(jù)'.
select
customer_key,
substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as month_count,
count() over(partition by customer_key) as total_count,
last_value(create_date) over(partition by customer_key) as last_id,
max(unit_price) over(partition by customer_key order by create_date) max_until_this_time,
sum(unit_price) over(partition by customer_key order by create_date) cumsum_until_this_time
from ods_sales_orders;
這時候只是多了冗余數(shù)據(jù),去掉即可.怎么去掉呢危号?我們先把這個結(jié)果存起來牧愁。CTAS。
create table temp_table as
(select
customer_key,
substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as month_count,
count() over(partition by customer_key) as total_count,
last_value(create_date) over(partition by customer_key) as last_id,
max(unit_price) over(partition by customer_key order by create_date) max_until_this_time,
sum(unit_price) over(partition by customer_key order by create_date) cumsum_until_this_time
from ods_sales_orders
);
這時候我們就可以去掉冗余了外莲,使用max函數(shù)干掉小的那種猪半。
select customer_key, umonth, month_count,
max(max_until_this_time) as current_max,
max(cumsum_until_this_time) as current_sum
from temp_table
group by customer_key, umonth, month_count;
方法二,先對月份做匯總
select
customer_key, substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as ucount,
max(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as max_this_month,
sum(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as sum_this_month
from ods_sales_orders;
這時候有重復(fù)偷线,用group by 去重磨确。
select customer_key, umonth, ucount, max_this_month, sum_this_month
from (
select
customer_key, substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as ucount,
max(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as max_this_month,
sum(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as sum_this_month
from ods_sales_orders)mubiao
group by customer_key, umonth, ucount, max_this_month, sum_this_month;
-- 保存為臨時表
create table temp_tt as (
select customer_key, umonth, ucount, max_this_month, sum_this_month
from (
select
customer_key, substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as ucount,
max(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as max_this_month,
sum(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as sum_this_month
from ods_sales_orders)mubiao
group by customer_key, umonth, ucount, max_this_month, sum_this_month);
這時候我們就可以計算截止當(dāng)前的最大和累計了。
select customer_key, umonth, ucount,
cast(max(max_this_month) over(partition by customer_key order by umonth) as decimal(10,2)) as current_max,
cast(sum(sum_this_month) over(partition by customer_key order by umonth) as decimal(10,2)) as current_sum
from temp_tt;
customer_key | umonth(當(dāng)月) | ucount(月訂單量) | current_max(最大交易金額) | current_sum(累計該月總交易金額) |
---|---|---|---|---|
11009 | 2018-12 | 1 | 53.99 | 53.99 |
1358999 | 2019-2 | 1 | 28.99 | 28.99 |
1358999 | 2019-4 | 1 | 69.99 | 98.98 |
1359000 | 2019-1 | 1 | 2294.99 | 2294.99 |
1359002 | 2019-11 | 1 | 8.99 | 8.99 |
1359003 | 2020-1 | 1 | 1120.49 | 1120.49 |
1359005 | 2019-2 | 1 | 782.99 | 782.99 |
1359009 | 2019-1 | 1 | 2384.07 | 2384.07 |
1359014 | 2019-1 | 1 | 69.99 | 69.99 |
1359014 | 2019-2 | 1 | 69.99 | 94.98 |
思路:
1.對數(shù)據(jù)按照客戶及其年-月分組
2.分組后就每月銷售金額之和
3.使用窗口函數(shù)淋昭,對每個客戶不同月份分組求最大值(max)和累計值(sum)
題目二:計算用戶的回購率和復(fù)購率
復(fù)購率
復(fù)購率: 當(dāng)前月份購買2次及以上的客戶占所有客戶比例,以2月為例進(jìn)行計算
- 先增加購買一個本月購買次數(shù)的列
select customer_key, count(customer_key) as count_this_month from
ods_sales_orders
where year(create_date)=2019 and month(create_date)=2
group by customer_key;
下面就好算了俐填。
select concat("", cast(sum(if(count_this_month>1,1,0))*100/count(*) as decimal(14,4)), "%")
from (
select customer_key, count(customer_key) as count_this_month from
ods_sales_orders
where year(create_date)=2019 and month(create_date)=2
group by customer_key
) mubiao ;
回購率
回購率:當(dāng)前月份購買且上個月份也購買的客戶占當(dāng)月所有月份客戶比例
1.先得到當(dāng)前月份有哪些用戶
select customer_key,month(create_date) as month_2_count
from ods_sales_orders
where month(create_date)=2
group by customer_key, month(create_date); -- 2月份的用戶
select customer_key,month(create_date) as month_1_count
from ods_sales_orders
where month(create_date)=1
group by customer_key, month(create_date); -- 1月份的用戶
上面可以分別得到2,1月份每個customer的購買情況翔忽,left_join 篩選非NULL行就可以得到上個月是否買了東西英融。
select sum(if (month_1_count is not null, 1, 0)) / count(*)
from (select customer_key,month(create_date) as month_2_count
from ods_sales_orders
where month(create_date)=2
group by customer_key, month(create_date)) a
left join (select customer_key,month(create_date) as month_1_count
from ods_sales_orders
where month(create_date)=1
group by customer_key, month(create_date)) b on a.customer_key=b.customer_key;
思路:
復(fù)購率
- 1、對當(dāng)月(2月份)的客戶分組歇式,計數(shù)購買次數(shù)
- 2驶悟、篩選購買次數(shù)為2以上的,認(rèn)為是復(fù)購群體
回購率
- 1材失、篩選當(dāng)月及上月部分
- 2痕鳍、利用客戶id進(jìn)行當(dāng)月連上月,推薦左連
- 3、對同一條客戶id均有購買記錄的笼呆,認(rèn)為是回購群體
題目三:求用戶號對應(yīng)不同的產(chǎn)品
用戶號 | 產(chǎn)品 | 購買時間 |
---|---|---|
1 | A | 2019-12-23 |
1 | B | 2019-12-23 |
2 | C | 2019-12-23 |
2 | A | 2019-12-24 |
2 | B | 2019-12-23 |
解答
先找?guī)讞l看看數(shù)據(jù)的樣子熊响。
select customer_key, create_date, product_key from ods_sales_orders limit 100;
其實就是怕連續(xù)買了兩個產(chǎn)品1,所以需要去重诗赌,對于同一個產(chǎn)品只保留最前面的那一次購買汗茄,分組然后min時間即可。
select customer_key, product_key, min(create_date) as create_date
from ods_sales_orders
group by customer_key, product_key;
with temp as (
select customer_key, product_key, min(create_date) as create_date
from ods_sales_orders
group by customer_key, product_key
)
select
*,
row_number() over(partition by customer_key order by create_date) as product_index_by_time
from temp;
這時候就可以篩選了铭若,選出其中序號<3的那些訂單們洪碳。
with temp as (
select customer_key, product_key, min(create_date) as create_date
from ods_sales_orders
group by customer_key, product_key
) , -- end of temp table1
temp_with_index as(
select
*,
row_number() over(partition by customer_key order by create_date) as product_index_by_time
from temp) --end of temp table2
select concat(customer_key, '-', concat_ws('-', collect_set(product_key)))
from temp_with_index
where product_index_by_time < 3
group by customer_key;
當(dāng)然也可以改為算子類:
with temp as (
select customer_key, cpzl_zw, min(create_date) as create_date
from ods_sales_orders
group by customer_key, cpzl_zw
) , -- end of temp table1
temp_with_index as(
select
*,
row_number() over(partition by customer_key order by create_date) as product_index_by_time
from temp) --end of temp table2
select concat(customer_key, '-', concat_ws('-', collect_set(cpzl_zw)))
from temp_with_index
where product_index_by_time < 3
group by customer_key;
要求輸出例子**:用戶號-產(chǎn)品1-產(chǎn)品2
例如:1-A-B (按先后時間順序,同時不限定)
參考:http://www.reibang.com/p/90d0657c0218
思路:
- 1.利用窗口函數(shù)叼屠,對用戶號分組瞳腌,按時間對產(chǎn)品進(jìn)行排序
- 2.利用左連或其他方法拼接,篩選排序順序為1镜雨、2的
- 3.用concat或者其他函數(shù)拼接獲得結(jié)果
題目四:統(tǒng)計各個省份所屬城市下最受歡迎的Top 3產(chǎn)品和其銷量(不能出現(xiàn)有null)
-
先看看城市和顧客的從屬關(guān)系
1592970242408 關(guān)聯(lián)一下嫂侍,給每個訂單確定一個城市。
select
a.*,
b.chinese_city
from ods_sales_orders a inner join ods_customer b
on a.customer_key = b.customer_key
where product_key = 214;
with temp as
(select
product_key,
b.chinese_city,
count(*)
from ods_sales_orders a inner join ods_customer b
on (a.customer_key = b.customer_key)
and (chinese_city is not NULL) and (product_key is not NULL)
group by chinese_city, product_key
)
select * from temp;
截圖里面有個null冷离,好奇嘛吵冒?原來他用的是字符串null
-- 匯總每個城市里面的產(chǎn)品降序
with temp as
(select
b.chinese_city,
product_key,
count(*) as pro_count
from ods_sales_orders a inner join ods_customer b
on (a.customer_key = b.customer_key)
and (chinese_city != 'null') and (product_key != 'null')
group by chinese_city, product_key
)
select *,
row_number() over(partition by chinese_city order by pro_count desc) as rn
from temp;
下面考慮只要top3.
with temp as
(select
b.chinese_city,
product_key,
count(*) as pro_count
from ods_sales_orders a inner join ods_customer b
on (a.customer_key = b.customer_key)
and (chinese_city != 'null') and (product_key != 'null')
group by chinese_city, product_key
),
temp_rn as(
select *,
row_number() over(partition by chinese_city order by pro_count desc) as rn
from temp)
select * from temp_rn
where rn < 4;
題目五:商品的銷售數(shù)量top10亿汞,排名需考慮并列排名的情況
select product_key,
count(*) as num
from ods_sales_orders
group by product_key;
with temp as(
select product_key,
count(*) as num
from ods_sales_orders
group by product_key
),
temp_with_rn as
(
select
*,
rank() over(order by num desc) as rn
from temp
)
select * from temp_with_rn
where rn < 11;
題目六:計算累計和(統(tǒng)計2019年1-12月的累積銷量瞭空,即1月為1月份的值,2月為1疗我、2月份值的和咆畏,3月為1、2吴裤、3月份的和旧找,12月為1-12月份值的和)
先統(tǒng)計每個月的和
select substr(create_date, 1, 7) as umonth,
sum(unit_price) as umonth_total
from ods_sales_orders
where year(create_date)=2019
group by substr(create_date, 1, 7);
累加和的計算
with temp as(
select substr(create_date, 1, 7) as umonth,
sum(unit_price) as umonth_total
from ods_sales_orders
where year(create_date)=2019
group by substr(create_date, 1, 7)
)
select *,
cast(sum(umonth_total) over(order by umonth) as decimal(12, 2)) as cumsum
from temp;
題目七:計算客戶平均購買一次商品的間隔時間
先記錄一下上一次的購買時間
select customer_key,
lag(create_date) over(partition by customer_key order by create_date) as prev_date
from ods_sales_orders;
記錄兩次之間的時間間隔:
with temp as
(select customer_key, create_date,
lag(create_date) over(partition by customer_key order by create_date) as prev_date
from ods_sales_orders)
select customer_key, create_date, prev_date,
datediff(create_date, prev_date) as gap_date
from temp
where prev_date is not NULL;
with temp as
(select customer_key, create_date,
lag(create_date) over(partition by customer_key order by create_date) as prev_date
from ods_sales_orders),
temp_with_diff as
(
select customer_key, create_date, prev_date,
datediff(create_date, prev_date) as gap_date
from temp
where prev_date is not NULL
)
select customer_key, avg(gap_date) as avg_gap
from temp_with_diff
group by customer_key;
題目八:查詢最近前20%時間的訂單信息
提示:使用ntile 函數(shù)記錄條目在哪個分段。
select *,
ntile(5) over(order by create_date) as tile_index
from ods_sales_orders;
with temp as
(
select *,
ntile(5) over(order by create_date) as tile_index
from ods_sales_orders
)
select *
from temp
where tile_index = 2; -- 20%-40%