- 介紹
- 實(shí)踐應(yīng)用
介紹
概念
? Slowly Changing Dimensions
? 在現(xiàn)實(shí)世界中,維度的屬性并不是靜態(tài)的巩掺,它會(huì)隨著時(shí)間的流失發(fā)生緩慢的變化偏序。這種隨時(shí)間發(fā)生變化的維度我們一般稱之為緩慢變化維,并且把處理維度表的歷史變化信息的問(wèn)題稱為處理緩慢變化維的問(wèn)題胖替,有時(shí)也簡(jiǎn)稱為處理SCD的問(wèn)題研儒。
應(yīng)用場(chǎng)景
? 支持對(duì)于歷史數(shù)據(jù)的追溯,Inmon對(duì)于數(shù)倉(cāng)的定義已成為公認(rèn)独令。而數(shù)倉(cāng)是這么考慮的:在企業(yè)管理和決策中面向主題的端朵、集成的、與時(shí)間相關(guān)的记焊、不可修改的數(shù)據(jù)集合逸月。
處理方式
如何使用
-- 通過(guò)對(duì)開始和結(jié)束時(shí)間的限定栓撞,獲取指定時(shí)間點(diǎn)數(shù)據(jù)情況
select name
from tableA
where begin_time<='2019-06-02' and end_time>='2019-06-02'
存在的問(wèn)題
-
數(shù)據(jù)重復(fù)
處理數(shù)據(jù)時(shí)遍膜,總會(huì)有各種各樣的情況導(dǎo)致數(shù)據(jù)更新失敗或錯(cuò)誤。這個(gè)時(shí)候瓤湘,需要對(duì)應(yīng)的腳本支持?jǐn)?shù)據(jù)重跑瓢颅。在緩慢變化維中,由于保留歷史數(shù)據(jù)弛说,如何保證重跑數(shù)據(jù)時(shí)保障數(shù)據(jù)正確性挽懦、不會(huì)大的變化。
-
存儲(chǔ)與性能
緩慢變化維記錄整個(gè)歷史數(shù)據(jù)木人,數(shù)據(jù)會(huì)以增量的方式進(jìn)行擴(kuò)張信柿。隨著時(shí)間推移冀偶,數(shù)據(jù)就會(huì)越來(lái)越多。查詢的效率也會(huì)隨之面臨挑戰(zhàn)渔嚷。
-
應(yīng)用的邊界
數(shù)據(jù)的變化是不可避免的进鸠,在處理數(shù)據(jù)時(shí)哪些是必須要使用,那么可用可不用形病。如何選擇
實(shí)踐應(yīng)用
基礎(chǔ)數(shù)據(jù)
處理方式a
? 目標(biāo) 記錄數(shù)據(jù)變化軌跡客年,支持?jǐn)?shù)據(jù)重跑
-- 更新已有數(shù)據(jù)
insert overwrite table temp_db.temp_scd_checking
select
a.id
,a.name
,a.money
,a.status
,a.begin_time
,case when b.id is not null and a.end_time='9999-09-09' then date_sub('${dealDate}',1) else a.end_time end as end_time
from temp_db.temp_scd_checking as a
left join (
select
id
from temp_db.temp_scd_source_data
where p_dt='${dealDate}'
) as b on a.id=b.id
;
-- 插入新數(shù)據(jù)
insert overwrite table temp_db.temp_scd_checking
select
case when a.id is not null then a.id else b.id end as id
,case when a.id is not null then a.name else b.name end as name
,case when a.id is not null then a.money else b.money end as money
,case when a.id is not null then a.status else b.status end as status
,case when a.id is not null then '${dealDate}' else b.begin_time end as begin_time
,case when a.id is not null then '9999-09-09' else b.end_time end as end_time
from (
select
id
,name
,money
,status
,p_dt
from temp_db.temp_scd_source_data
where p_dt='${dealDate}'
) as a
full join (
select
*
from temp_db.temp_scd_checking
where end_time='9999-09-09'
) as b on a.id=b.id
-- 歷史去除重復(fù)(by: sunhj)
union all
select
id
,name
,money
,status
,begin_time
,end_time
from temp_db.temp_scd_checking
where end_time<>'9999-09-09'
group by
id
,name
,money
,status
,begin_time
,end_time
;
結(jié)果
重跑結(jié)果(0602-0604)
處理方式b
目標(biāo) 方式a中處理效果太差,時(shí)間漠吻、腳本都太長(zhǎng)量瓜。進(jìn)行優(yōu)化
-- 支持歷史重刷 (by sunhj)
insert overwrite table temp_db.temp_scd_checking
select
a.id
,a.name
,a.money
,a.status
,a.begin_time
,case when b.id is not null and a.end_time='9999-09-09' then date_sub('${dealDate}',1) else a.end_time end as end_time
from
(
select
id
,name
,money
,status
,begin_time
,case when begin_time=date_sub('${dealDate}',1) then '9999-09-09' else end_time end as end_time
from temp_db.temp_scd_checking
where begin_time<'${dealDate}'
) as a
left join (
select
id
from temp_db.temp_scd_source_data
where p_dt='${dealDate}'
) as b on a.id=b.id
;
-- 更新數(shù)據(jù) (by sunhj)
insert overwrite table temp_db.temp_scd_checking
select
case when a.id is not null then a.id else b.id end as id
,case when a.id is not null then a.name else b.name end as name
,case when a.id is not null then a.money else b.money end as money
,case when a.id is not null then a.status else b.status end as status
,case when a.id is not null then a.begin_time else b.begin_time end as begin_time
,case when a.id is not null then a.end_time else b.end_time end as end_time
from (
select
id
,name
,money
,status
,'${dealDate}' as begin_time
,'9999-09-09' as end_time
from temp_db.temp_scd_source_data
where p_dt='${dealDate}'
) as a
full join temp_db.temp_scd_checking as b on a.id=b.id and b.end_time='9999-09-09'
;
結(jié)果略.
優(yōu)劣對(duì)比
處理方式a | 處理方式b | |
---|---|---|
代碼簡(jiǎn)潔與可讀性 | 高 | 低 |
執(zhí)行效率 | 高 | 低(基于hive语淘,增加近80%的時(shí)間延遲) |
數(shù)據(jù)重跑 | 支持 | 支持 |
重跑臟數(shù)據(jù) | 無(wú) | 有 |
總結(jié): 建議方式b處理