User behavior data: collected via event tracking (埋點).
Business interaction data: data produced by business processes (logins, orders, users, products, payments, etc.), usually stored in a relational DB.
0. Create the gmall database
1. Create the ODS layer
- Raw data layer (external table): ods_start_log
- Event log table: ods_event_log
Create partitioned tables whose input format is LZO, whose output is text, and whose rows can be parsed as JSON:
drop table if exists ods_start_log;
CREATE EXTERNAL TABLE `ods_start_log`(`line` string)
PARTITIONED BY (`dt` string)
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_start_log';
Because the raw data is stored on HDFS in LZO format, the Hive table's input format is LZO;
because each record is a JSON string that still needs further parsing, the output format is text.
ODS-layer load scripts: in production these usually run between 00:30 and 01:00 each day.
Note:
[ -n "$var" ] tests whether the variable's value is non-empty
-- non-empty value: returns true
-- empty value: returns false
2. Create the DWD layer
DWD-layer parsing:
Clean the ODS-layer data (drop nulls, dirty records, and out-of-range values; switch from row storage to columnar storage; change the compression format).
2.1 Create the base detail tables (two in total: start/event)
The detail tables hold the detail data converted from the ODS-layer raw tables.
Base detail table analysis
Start/event log base detail table:
drop table if exists dwd_base_start_log;
CREATE EXTERNAL TABLE `dwd_base_start_log`(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`event_name` string,
`event_json` string,
`server_time` string)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_base_start_log/';
The storage format is Parquet; event_name and event_json hold the event name and the full event payload.
Here the raw log is split out one-to-one: to flatten it we need a custom UDF and UDTF.
Custom UDF (one row in, one row out):
parses the common fields:
BaseFieldUDF extends UDF
: renders the common fields as a "\t"-separated string
Custom UDTF (one row in, many rows out):
splits the JSON array in the et field into event_name and event_json
EventJsonUDTF extends GenericUDTF
: see http://www.reibang.com/p/49a11951eb30
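The UDF/UDTF pair is implemented in Java in the original project; purely as a hedged sketch of the logic they perform, here is a Python version (the sample log line, its `server_time|json` layout, and all field values are invented for illustration):

```python
import json

# Order of common fields, mirroring the 'mid,uid,vc,...' argument to base_analizer.
COMMON_KEYS = ["mid", "uid", "vc", "vn", "l", "sr", "os", "ar",
               "md", "ba", "sv", "g", "hw", "t", "nw", "ln", "la"]

def base_analizer(line: str, keys) -> str:
    """UDF sketch: pull the common fields out of one log line and join them
    with '\t'; append the raw 'et' array (ops) and the server time."""
    server_time, _, payload = line.partition("|")
    record = json.loads(payload)
    cm = record.get("cm", {})
    fields = [str(cm.get(k, "")) for k in keys]
    fields.append(json.dumps(record.get("et", [])))  # index 17: ops
    fields.append(server_time)                       # index 18: server_time
    return "\t".join(fields)

def flat_analizer(ops: str):
    """UDTF sketch: explode the 'et' JSON array into (event_name, event_json) rows."""
    for event in json.loads(ops):
        yield event["en"], json.dumps(event)

line = '1549691462000|{"cm":{"mid":"m1","uid":"u1"},"et":[{"en":"comment","kv":{"action":"1"}}]}'
parsed = base_analizer(line, COMMON_KEYS).split("\t")
events = list(flat_analizer(parsed[17]))
```

This is also why the SQL below indexes `split(...)[17]` as `ops` and `[18]` as `server_time`: the UDF emits the 17 common fields first, then the event array, then the server timestamp.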
Parse the start log into the base detail table. First set dynamic partitioning to non-strict mode:
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_base_start_log
PARTITION (dt)
select
mid_id,
user_id,
version_code,
version_name,
lang,
source ,
os ,
area ,
model ,
brand ,
sdk_version ,
gmail ,
height_width ,
app_time ,
network ,
lng ,
lat ,
event_name ,
event_json ,
server_time ,
dt
from
(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time,
dt
from ods_start_log where dt='2019-02-10' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name, event_json;
The fields are as follows:
DWD-layer detail tables:
get_json_object(event_json,'$.kv.action') action
get_json_object(event_json,'$.kv.newsid') newsid
get_json_object(event_json,'$.kv.place') place
get_json_object(event_json,'$.kv.extend1') extend1
1. Product click table
2. Product detail page table
3. Product list page table
4. Ad table
5. Notification table
6. Foreground-active table (event_name='active_foreground')
7. Background-active table (event_name='active_background')
8. Comment table (event_name='comment')
9. Favorites table
10. Like table
11. Start log table
12. Error log table: dwd_error_log
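Each event table unpacks its payload with `get_json_object(event_json, '$.kv.…')` as shown above. A rough Python equivalent for simple `$.a.b` paths (the sample payload is invented):

```python
import json

def get_json_object(doc: str, path: str):
    """Minimal sketch of Hive's get_json_object for simple '$.a.b' paths."""
    node = json.loads(doc)
    for key in path.lstrip("$.").split("."):
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node

event_json = '{"ett":"1549687500","en":"news","kv":{"action":"1","newsid":"n42","place":"2"}}'
action = get_json_object(event_json, "$.kv.action")
newsid = get_json_object(event_json, "$.kv.newsid")
```

As in Hive, a path that matches nothing (e.g. `$.kv.extend1` here) yields NULL rather than an error.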
DWD detail-table load scripts usually run between 00:30 and 01:00.
Business background:
Business terminology: http://www.reibang.com/p/9dccbaa8e42c
Built-in functions:
collect_set aggregates values from different rows of the same group into one set;
hive (gmall)> select * from stud;
stud.name stud.area stud.course stud.score
zhang3 bj math 88
li4 bj math 99
wang5 sh chinese 92
zhao6 sh chinese 54
tian7 bj chinese 91
hive(gmall)>select course,collect_set(area),avg(score)
from stud
group by course;
chinese ["sh","bj"] 79.0
math ["bj"] 93.5
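The collect_set + avg query above can be mimicked in plain Python to check the expected output:

```python
from collections import defaultdict

# The stud sample table from above: (name, area, course, score).
rows = [
    ("zhang3", "bj", "math", 88),
    ("li4", "bj", "math", 99),
    ("wang5", "sh", "chinese", 92),
    ("zhao6", "sh", "chinese", 54),
    ("tian7", "bj", "chinese", 91),
]

areas = defaultdict(set)    # collect_set(area): duplicates collapse into a set
scores = defaultdict(list)  # inputs to avg(score)
for _, area, course, score in rows:
    areas[course].add(area)
    scores[course].append(score)

result = {c: (areas[c], sum(scores[c]) / len(scores[c])) for c in areas}
```

Note that collect_set, like the Python set here, deduplicates within the group, which is why "bj" appears only once for chinese.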
Date functions:
(1) date_format (format a date)
>select date_format('2019-02-10','yyyy-MM');
2019-02
(2) date_add (add or subtract days)
>select date_add('2019-02-10',-1);
2019-02-09
(3) next_day
Monday of the following week:
>select next_day('2019-02-12','MO');
2019-02-18
Monday of the current week:
>select date_add(next_day('2019-02-12','MO'),-7);
2019-02-11
(4) last_day (last day of the current month)
>select last_day('2019-02-10');
2019-02-28
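The four date functions can be reproduced with Python's datetime to verify the sample outputs above (a sketch of the semantics, not Hive's actual implementation):

```python
from datetime import date, timedelta

def date_format(d: str, fmt: str) -> str:
    # Only the 'yyyy-MM' case from the example is handled here.
    return d[:7] if fmt == "yyyy-MM" else d

def date_add(d: str, days: int) -> str:
    return (date.fromisoformat(d) + timedelta(days=days)).isoformat()

def next_day(d: str, dow: str) -> str:
    targets = {"MO": 0, "TU": 1, "WE": 2, "TH": 3, "FR": 4, "SA": 5, "SU": 6}
    cur = date.fromisoformat(d)
    # Hive returns the first matching weekday strictly after d.
    ahead = (targets[dow] - cur.weekday() - 1) % 7 + 1
    return (cur + timedelta(days=ahead)).isoformat()

def last_day(d: str) -> str:
    # Jump safely into the next month, then step back to its first day minus one.
    nxt = date.fromisoformat(d).replace(day=28) + timedelta(days=4)
    return (nxt.replace(day=1) - timedelta(days=1)).isoformat()
```

The `-7` trick for "Monday of the current week" works because next_day is strictly forward-looking: it always lands on the following Monday, so subtracting a week brings you back to this week's Monday.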
Requirement: active users
DWS layer
Daily active device detail: dws_uv_detail_day
Weekly active device detail: dws_uv_detail_wk
(mid monday_date sunday_date)
insert overwrite table dws_uv_detail_wk partition (wk_dt)
Monthly active device detail: dws_uv_detail_mn
DWS-layer load scripts usually run between 00:30 and 01:00 each day.
ADS layer
Goal: counts of devices active on the current day, week, and month
ads_uv_count
Create the table:
drop table if exists ads_uv_count;
create external table ads_uv_count(
`dt` string COMMENT 'statistics date',
`day_count` bigint COMMENT 'daily active count',
`wk_count` bigint COMMENT 'weekly active count',
`mn_count` bigint COMMENT 'monthly active count',
`is_weekend` string COMMENT 'Y/N weekend flag, used to pick the final weekly row',
`is_monthend` string COMMENT 'Y/N month-end flag, used to pick the final monthly row'
) COMMENT 'daily active device counts'
stored as parquet
location '/warehouse/gmall/ads/ads_uv_count_day/';
Requirement: new users
DWS layer (daily new device detail)
Daily new device detail table: dws_new_mid_day
(key fields: mid_id create_date)
Load data:
insert into table dws_new_mid_day
select
ud.mid_id,
ud.user_id ,
ud.version_code ,
ud.version_name ,
ud.lang ,
ud.source,
ud.os,
ud.area,
ud.model,
ud.brand,
ud.sdk_version,
ud.gmail,
ud.height_width,
ud.app_time,
ud.network,
ud.lng,
ud.lat,
'2019-02-10'
from dws_uv_detail_day ud left join dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt='2019-02-10' and nm.mid_id is null;
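The left join ... where nm.mid_id is null pattern above is an anti-join: a device counts as new only if it is active today and has never appeared in dws_new_mid_day. The underlying set logic, with made-up device ids:

```python
# Devices active today (dws_uv_detail_day, dt = today)
active_today = {"mid001", "mid002", "mid003"}
# Devices already recorded as new on some earlier day (dws_new_mid_day)
known_new_mids = {"mid002"}

# left join ... where nm.mid_id is null  ==  set difference
new_today = active_today - known_new_mids
```

Because the join key is mid_id, a device that was first seen weeks ago never re-enters the new-device table even if it is active again today.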
ADS layer (daily new device count table):
drop table if exists `ads_new_mid_count`;
create table `ads_new_mid_count`(
create_date String comment 'creation date',
new_mid_count BigInt comment 'new device count'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_mid_count/';
Load data:
insert into table ads_new_mid_count
select create_date,count(*)
from dws_new_mid_day
where create_date='2019-02-10'
group by create_date;
Requirement: user retention
Retention: users newly added during some period who come back and keep using the product after some time.
DWS (daily retained-user detail table dws_user_retention_day):
Partitioned and saved by day; each day, compute the retention detail for the previous n days.
drop table if exists `dws_user_retention_day`;
create table `dws_user_retention_day`
(
`mid_id` string COMMENT 'unique device id',
`user_id` string COMMENT 'user id',
`version_code` string COMMENT 'app version code',
`version_name` string COMMENT 'app version name',
`lang` string COMMENT 'system language',
`source` string COMMENT 'channel id',
`os` string COMMENT 'Android version',
`area` string COMMENT 'region',
`model` string COMMENT 'phone model',
`brand` string COMMENT 'phone brand',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT 'screen height and width',
`app_time` string COMMENT 'timestamp when the client log was produced',
`network` string COMMENT 'network mode',
`lng` string COMMENT 'longitude',
`lat` string COMMENT 'latitude',
`create_date` string comment 'device first-seen date',
`retention_day` int comment 'days retained as of the current date'
) COMMENT 'daily user retention detail'
PARTITIONED BY ( `dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_retention_day/';
Load data:
On 2019-02-11, compute the retention detail for users added the day before (count(*) then yields the retained count):
insert overwrite table dws_user_retention_day partition(dt = "2019-02-10")
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
1 retention_day
from dws_uv_detail_day ud
join dws_new_mid_day nm
on ud.mid_id=nm.mid_id
where ud.dt='2019-02-11'                       -- still active on 02-11
and nm.create_date=date_add('2019-02-11',-1); -- devices first seen on 02-10
In the same way you can compute the retention detail for 2, 3, 4, ... up to n days back, then union all the results and insert them into the daily retained-user detail table.
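Day-1 retention above is just the intersection of "devices first seen on 02-10" and "devices active on 02-11"; count(*) over that inner join is the retained count. A toy check with invented ids:

```python
new_on_0210 = {"m1", "m2", "m3", "m4"}  # dws_new_mid_day, create_date = 2019-02-10
active_on_0211 = {"m2", "m4", "m9"}     # dws_uv_detail_day, dt = 2019-02-11

retained = new_on_0210 & active_on_0211  # the inner join on mid_id
retention_count = len(retained)          # count(*) in the ADS query
```

Devices active on 02-11 but created on some other day (m9 here) fall out of the join, which is exactly what the create_date filter enforces.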
ADS-layer retention counts:
Create the table (daily retention count table ads_user_retention_day_count)
drop table if exists `ads_user_retention_day_count`;
create table `ads_user_retention_day_count`
(
`create_date` string comment 'device first-seen date',
`retention_day` int comment 'days retained as of the current date',
retention_count bigint comment 'retained count'
) COMMENT 'daily user retention counts'
stored as parquet
location '/warehouse/gmall/ads/ads_user_retention_day_count/';
Load data:
insert into table ads_user_retention_day_count
select
create_date,
retention_day,
count(*) retention_count
from dws_user_retention_day
where dt='2019-02-11'
group by create_date,retention_day;
Retention ratio (create table ads_user_retention_day_rate):
create table `ads_user_retention_day_rate`
(
`stat_date` string comment 'statistics date',
`create_date` string comment 'device first-seen date',
`retention_day` int comment 'days retained as of the current date',
`retention_count` bigint comment 'retained count',
`new_mid_count` string comment 'new device count on that day',
`retention_ratio` decimal(10,2) comment 'retention rate'
) COMMENT 'daily user retention rates'
stored as parquet
location '/warehouse/gmall/ads/ads_user_retention_day_rate/';
Load data:
insert into table ads_user_retention_day_rate
select
'2019-02-11' ,
ur.create_date,
ur.retention_day,
ur.retention_count ,
nc.new_mid_count,
ur.retention_count/nc.new_mid_count*100
from
(
select
create_date,
retention_day,
count(*) retention_count
from `dws_user_retention_day`
where dt='2019-02-11'
group by create_date,retention_day
) ur join ads_new_mid_count nc on nc.create_date=ur.create_date;
Online-education project (instructor Li Guolong's seven-day retention rate):
create table tmp.seven_days_retained_analysis_${day}(
register_day INT,
zero_interval_retained_rate DOUBLE,
one_interval_retained_rate DOUBLE,
two_interval_retained_rate DOUBLE,
three_interval_retained_rate DOUBLE,
four_interval_retained_rate DOUBLE,
five_interval_retained_rate DOUBLE,
six_interval_retained_rate DOUBLE,
dt INT
)row format delimited fields terminated by '\t';
SQL implementation:
-- 1. Registrations of all users over the last 7 days
select
uid,
dt as register_day,
event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
and event_key = "registerAccount"
-- 2. Daily active user list over the last 7 days
select
uid,
dt as active_day,
max(event_time) as event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
group by uid,dt
-- 3. Join the two, producing uid, register_day, active_day, and interval (days between activity and registration)
select
t1.uid,
t1.register_day,
t2.active_day,
datediff(from_unixtime(t2.event_time,"yyyy-MM-dd"),from_unixtime(t1.event_time,"yyyy-MM-dd")) as day_interval
from(
select uid,dt as register_day,event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
and event_key = "registerAccount") t1
left join(
select uid,dt as active_day,max(event_time) as event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
group by uid,dt) t2
on t1.uid = t2.uid
-- 4. Build a temp table of retained-user counts from the query above
drop table if exists tmp.user_retained_${startDay}_${endDay};
create table if not exists tmp.user_retained_${startDay}_${endDay}
as
select register_day,day_interval,count(1) as retained
from (
select
t1.uid,t1.register_day,t2.active_day,
datediff(from_unixtime(t2.event_time,"yyyy-MM-dd"),from_unixtime(t1.event_time,"yyyy-MM-dd")) as day_interval
from(
select uid,dt as register_day,event_time from dwd.user_behavior where dt between ${startDay} and ${endDay} and event_key = "registerAccount") t1
left join(
select uid,dt as active_day,max(event_time) as event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
group by uid,dt) t2
on t1.uid = t2.uid) tmp
group by register_day,day_interval;
Result:
20190402 0 27000
20190402 1 19393
20190402 2 14681
20190402 3 9712
20190402 4 5089
20190402 5 1767
20190402 6 1775
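Dividing each retained count above by the day-0 cohort (27000 registrations on 20190402) gives the rates the next step computes; a quick arithmetic check:

```python
register_cnt = 27000  # registrations on 20190402, the day-0 cohort
retained = {0: 27000, 1: 19393, 2: 14681, 3: 9712, 4: 5089, 5: 1767, 6: 1775}

# round(retained / register_cnt, 4), as in the SQL
rates = {k: round(v / register_cnt, 4) for k, v in retained.items()}
```

Day 6 being slightly higher than day 5 here (1775 vs 1767) is plausible sample noise, not an error: intervals are independent snapshots, not a monotone funnel.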
-- 5. Compute the 7-day retention rate
drop table if exists tmp.user_retained_rate_${startDay}_${endDay};
create table if not exists tmp.user_retained_rate_${startDay}_${endDay}
as
select register_day,day_interval,
round(retained / register_cnt,4) as retained_rate,
current_dt from (
select t1.register_day,t1.day_interval,t1.retained,t2.register_cnt,${endDay} as current_dt
from
(select register_day,day_interval,retained
from tmp.user_retained_${startDay}_${endDay}) t1
left join
(select dt,count(1) as register_cnt
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
and event_key = "registerAccount"
group by dt) t2
on t1.register_day = t2.dt
group by t1.register_day,t1.day_interval,t1.retained,t2.register_cnt) tmp2;
-- 6. Pivot rows to columns
insert overwrite table tmp.seven_days_retained_analysis_${day}
select
register_day,
max(case when day_interval = 0 then retained_rate else 0 end) as zero_interval_retained_rate,
max(case when day_interval = 1 then retained_rate else 0 end) as one_interval_retained_rate,
max(case when day_interval = 2 then retained_rate else 0 end) as two_interval_retained_rate,
max(case when day_interval = 3 then retained_rate else 0 end) as three_interval_retained_rate,
max(case when day_interval = 4 then retained_rate else 0 end) as four_interval_retained_rate,
max(case when day_interval = 5 then retained_rate else 0 end) as five_interval_retained_rate,
max(case when day_interval = 6 then retained_rate else 0 end) as six_interval_retained_rate,
current_dt
from tmp.user_retained_rate_${startDay}_${endDay}
group by register_day,current_dt;
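The max(case when ...) pattern above is a standard row-to-column pivot: each (register_day, day_interval) row becomes one column of the output row. In Python it reduces to keying a dict by interval (sample rates invented):

```python
# (register_day, day_interval, retained_rate, current_dt)
rows = [
    (20190402, 0, 1.0, 20190408),
    (20190402, 1, 0.7183, 20190408),
    (20190402, 2, 0.5437, 20190408),
]

pivoted = {}
for register_day, interval, rate, current_dt in rows:
    row = pivoted.setdefault(register_day, {"dt": current_dt})
    row[f"interval_{interval}"] = rate  # one column per interval
```

The SQL needs max() only because group by emits one row per register_day while each case expression is non-zero on exactly one input row.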