正巧大后天就要考Data Warehouse(數(shù)據(jù)倉庫)了勉吻,就憑借自己寫一點(diǎn)基礎(chǔ)的知識權(quán)當(dāng)做一個(gè)考前的復(fù)習(xí)作用,以及作為一個(gè)基礎(chǔ)的入門介紹吧
1. Motivation:
首先要明確為啥需要data warehouse奉瘤,以及究竟什么是data warehouse亲族。
沒有數(shù)據(jù)倉庫時(shí)种呐,我們需要直接從業(yè)務(wù)數(shù)據(jù)庫中取數(shù)據(jù)來做分析狠半。業(yè)務(wù)數(shù)據(jù)庫主要是為業(yè)務(wù)操作服務(wù)噩死,雖然可以用于分析,但需要做很多額外的調(diào)整神年,在我看來已维,主要有以下幾個(gè)問題:結(jié)構(gòu)復(fù)雜,數(shù)據(jù)臟亂已日,難以理解垛耳,缺少歷史,大規(guī)模查詢緩慢。(OLTP) Online Transaction Processing
上面的問題艾扮,都可以通過一個(gè)建設(shè)良好的數(shù)據(jù)倉庫來解決址否。
業(yè)務(wù)數(shù)據(jù)庫是面向操作的云头,主要服務(wù)于業(yè)務(wù)產(chǎn)品和開發(fā)。而數(shù)據(jù)倉庫則是面向分析的柑土,主要服務(wù)于我們分析人員逆济。評價(jià)數(shù)據(jù)倉庫做的好不好酌予,就看我們分析師用得爽不爽。因此奖慌,數(shù)據(jù)倉庫從產(chǎn)品設(shè)計(jì)開始抛虫,就一直是站在分析師的立場上考慮的,致力于解決使用業(yè)務(wù)數(shù)據(jù)進(jìn)行分析帶來的種種弊端简僧。(OLAP) Online Analytical Processing
What is data warehouse:
拿William H.Inmon’s 話來定義:
Data warehouse is subject-oriented, time varying, integrated, non volatile collection of data in support of management's decision analyse.
Subject-oriented: DW organised by several subject determined by requirements. 這句話我所理解的是因?yàn)閿?shù)據(jù)倉庫本書是沒有數(shù)據(jù)的建椰,所有數(shù)據(jù)來源都是來自不同的數(shù)據(jù)庫,那么在建立的時(shí)候數(shù)據(jù)來源自然就由用戶query時(shí)的需求所決定
integrated: the content results from the integration of data from multiple sources
time-varying: keeps track of data changes so that reports show evolution over time
non-volatile: new data can be added, but data is±never deleted nor updated. 實(shí)際上這也是olap和oltp主要的區(qū)別之一
現(xiàn)在市面上可以使用數(shù)據(jù)倉庫主要通過以下產(chǎn)品:
Oracle岛马,SAP棉姐,IBM,SQL server啦逆,teradata
open resource的有:
talend
下面這張圖就詳解的介紹了olap和oltp的區(qū)別:
可以簡單的把oltp理解成原始的數(shù)據(jù)庫伞矩,olap理解成數(shù)據(jù)倉庫
2.數(shù)據(jù)倉庫的基本架構(gòu)和基本組成與模型:
由于傳統(tǒng)的數(shù)據(jù)庫中用的一般都是3NF的表,在BI顯得過于復(fù)雜夏志,故需要新的模型來支持data warehouse中的data
我們實(shí)際上利用Multi-dimensional model乃坤,由于它比較簡單易懂,就是根據(jù)多個(gè)dimension將多個(gè)關(guān)系組成一個(gè)立方體(維度可以大于等于3)的形式沟蔑。
在multi-dimensional model中的基本組成成分由下列幾種:
Cube:指組成的那個(gè)立方體
Cell:交錯(cuò)的小方格
Fact:不為空的cell
Measure:不為空的cell的數(shù)值
Dimension:簡單的講就是組成cube的各個(gè)表
Facts:has a granularity = level of detail
主要分為兩種fact:event(針對某個(gè)的時(shí)間或者地點(diǎn))湿诊,snapshot(強(qiáng)調(diào)在一定的時(shí)期范圍內(nèi))
Dimension:是有層次之分的,比如有一個(gè)表中包含(Paris溉贿,Berlin枫吧,Lyon,London)可以對其做rollup使之聚類成為country層次的dimension, 本次均假設(shè)是線性的dimension宇色,有個(gè)最大的層次ALL
Measure:有三種measure: additive 可加的意思上該measure在任何dimension均可相加并且有實(shí)際意義九杂,semi-additive在某些維度可以相加有實(shí)際意義,non-additive 在任何維度均不可相加
對于multi-dimensional model有以下幾種可能的操作形式:
ROLLUP(Cube,Dimension to Level, AggFunction(Measure)) 即剛才所說的向上聚合的過程
DRILLDOWN(Cube,Dimension to Level) 從高層次擴(kuò)展到detail的過程
SLICE(Cube,Dimension, Level=value)
DICE(Cube,), with: boolean combination
3.數(shù)據(jù)倉庫的基本存儲形式
主要用三種形式來表達(dá)之前提到過的cube:
1.ROLAP:用relational database來存儲數(shù)據(jù)?
ROLAP可以較好的支持現(xiàn)有的數(shù)據(jù)庫宣蠕,利用relational database來存儲cube例隆,可以fit memory,只存儲非空的cell抢蚀,可以支持語義镀层,比較容易維護(hù)
2.MOLAP:用array來存儲數(shù)據(jù)
比較容易得到某個(gè)特定的數(shù)據(jù),可以使用不在relational database下其他指令,沒必要存儲每個(gè)點(diǎn)的坐標(biāo)
3.HOLAP:結(jié)合以上兩種
由于ROLAP有對于原本就有的數(shù)據(jù)庫的支持唱逢,主要介紹以下ROLAP下的幾種不同類型的表格
1.Star Schema:
由一個(gè)fact table以及多個(gè)dimension table組成
fact table:由事實(shí)和dimension table 的FK組成吴侦,在star schema 的形式下dimension table里不允許再出現(xiàn)FK
2.Snowflake Schema:
由一個(gè)fact table 以及多個(gè)dimension table組成
此時(shí)與star schema最大的區(qū)別就在于對于dimension table下可以存放字結(jié)構(gòu)的FK,被連接的dimension table 被稱為outrigger dimension 支撐向量
3. Starflake Schema:
結(jié)合了上述兩種schema的表現(xiàn)形式
4. Galaxy:
相當(dāng)于是star schema+多個(gè)fact table
可以存儲arregate之后的結(jié)果
4 Fact table 與dimension attribute
在fact table中要避免dimension table 的FK為null
these nulls would automatically cause a referential integrity violation
在fact table中measure 的值可以為空
fact table有以下幾種形式:
transaction fact tables
periodic snapshot fact table
accumulating snapshot fact table
factless fact table
aggregate fact table
consolidated fact table
Role-playing dimension:a dimension that participates several times in a fact table
意思就是某個(gè)dimension table 里面有多個(gè)FK坞古,這些FK均可以連接到同一個(gè)Fact Table里备韧。fact Table有多個(gè)FK可以指向同一個(gè)表
在dimension table中要避免簡單的Y/N,最好寫上具體意思
5. SCD
當(dāng)我們需要改變屬于OLAP的表格時(shí)痪枫,我們需要update數(shù)據(jù)织堂。在DW中一般有以下幾種方式:
type1: overwrite,直接覆蓋舊數(shù)據(jù)
type2: with version
add column (effective奶陈,expiration易阳,indicator)
insert the new value
我們新加入一行,這一行數(shù)據(jù)的surrogate key是不同的吃粒,但是natural key是與原數(shù)據(jù)一致潦俺,同時(shí)加入3列,用來指明該數(shù)據(jù)生效和失效的日期声搁,并指明現(xiàn)在正在使用的數(shù)據(jù)對象的狀態(tài)是effective還是expire
type3: with version
replace the old one黑竞,add a column(last version)
在這里我們將新加入一列用來指明之前的舊數(shù)據(jù)
6.query in DW
當(dāng)我們實(shí)際在DW進(jìn)行query查詢操作時(shí),以oracle為例:
with clause to do the recursive query:可以支持遞歸查詢疏旨,第一行用來init 數(shù)據(jù)很魂,union all之后的部分可以當(dāng)作不斷遞歸擴(kuò)展的部分
“Group By”從字面意義上理解就是根據(jù)“By”指定的規(guī)則對數(shù)據(jù)進(jìn)行分組,所謂的分組就是將一個(gè)“數(shù)據(jù)集”劃分成若干個(gè)“小區(qū)域”檐涝,然后針對若干個(gè)“小區(qū)域”進(jìn)行數(shù)據(jù)處理遏匆。
rollup(a,b): (a,b),(a),() 屬于group by的一部分,數(shù)據(jù)將會有(a,b),(a),()三個(gè)層次的區(qū)別谁榜。最后一個(gè)即對所有total層次的統(tǒng)計(jì)
cube(a,b):(a,b),(a),(b),()類似與rollup幅聘,但是增加了對b對象支持
grouping sets:可以講想要的維度對象加入到其中
grouping(balba): 1 when the value is null, else 0 可以用來檢測現(xiàn)在的層次的狀態(tài),歸根結(jié)底還是為了避免出現(xiàn)null的情況做出的努力
grouping_ID(a,b,c): use the binary to store the situation. 當(dāng)有多個(gè)元素時(shí)窃植,grouping_ID出現(xiàn)的將會是以二進(jìn)制的形式帝蒿。 以a b c為例,1 0 0即當(dāng)a層次為null巷怜,將會顯示6
Group_ID: whether it is replicated or not and how many times it shows
顯示數(shù)據(jù)重復(fù)次數(shù)
in the window function:
when the window function is a aggregated function then it must need group by operation
window function over(partition by.. [order by ...]):值得注意的是葛超,當(dāng)window function是arrogate的時(shí)候,一定需要加上group by( partition的對象)
rank() over (partition by.. order by ...): 從醒铀堋(1)到大绣张; 若有desc則是從大(1)到小排序,注意此時(shí)的order by是必須的关带,很好理解侥涵,沒有order就沒辦法進(jìn)行rank了
DenseRank():不允許出現(xiàn)rank排名有跳過的現(xiàn)象(1 2 2 3 4)
window function over(...... [row 2 preceding]):向上看2行
window function over(...... [range condition]):向上看logic offset?
LAG(expression, offset, default_value): 對每行數(shù)據(jù)進(jìn)行expression中的操作,并利用offset中距離向上看,如果向上看的對象不存在則用default value代替
LEAD:一樣芜飘,換成向下看
7. Partition
divide data (table, index) into pieces that can be manipulated independently.
partition的好處:
allows to handle very large relation
allows to parallelism
filter the partition when query optimisation
ez to maintenance
這是在物理的層面做優(yōu)化要考慮的
何時(shí)考慮采用partition:
table size >2G more than 1M rows
historical data read only
multi media storages
partition主要分成兩種 vertical partition horizontal partition务豺。
horizontal partition主要分成以下三種基本類(它們可以自由組合): hash, range, list
Hash:?
好處: balance the partition, on non-temporal attribute, distribute the data on the physical storage
怎么選擇hash時(shí)的對象:choose the column that is almost unique
怎么選擇hash的值:take the power of 2 as number
in the partition wise join
Update:
DISABLE/ENABLE ROW MOVEMENT
row movement disabled (default): update fails if it would result in row migration
Enable row movement: updating the key may cause a row to move to another partition
Range
MAX VALUE can be used to specify the top range, otherwise: implicit integrity constraint on the table.
LIST:
Does not support multi-column partition keys.
Interval partitioning:
first by range
then use interval(...):
必須要明確定義第一行數(shù)據(jù)
Partition by reference:
PARTITIONBYREFERENCE(sales_item_fk)
-- requires named referential constraint toward partitioned table
Partitioning on virtual column
The other operation possible:
ADD
DROP: the fast way to remove large volume data, delete the index first, otherwise it will degrade the performance?
TRUNCATE: only delete the row data, the partition remains
SPLIT
MERGE
Partition exchange
Partition pruning:
RANGE, LISTIN, =, LIKE, range (<,. . . )
HASH IN, =
Partition-wise joins:
Parallelism within a query (principle):split the query in multiple subqueries over distinct parts of the data; process subqueries in parallel.
full: partition key on both table with the same (#partition in hash)
partial: (table 1 has the partition key, table 2 will (re)partitioned based on the reference table partition.)
8.Index
B-tree
B+-tree: 在range queries中效率比較高
B樹和B+樹的主要區(qū)別在于:B-tree could store the data in the node, in the B+ only the leaves could store the data
一個(gè)小例子:
imagine if there is table with k attribute and n rows
then use B+ tree will be need 64*k*n
bitmap是適合DW的一種索引方式
create a bitmap for the attribute
它的好處在于,good at the boolean operation嗦明,節(jié)省空間不需要存儲ROW ID可以fit 在main memory
in that case it will be 3*k*n(3 is the coordinate of attribute k)
bitmap on join 只需要找到相同的join key再對相應(yīng)的對象進(jìn)行建立bitmap就可以了
STAR QUERY是非吵迥兀快的,因?yàn)樗萰oin需要有選擇(select)操作的對象招狸,再join無法select的大對象
在使用時(shí)必須要:STAR_TRANSFORMATION_ENABLED = true.
when using the star query, it need all the fk in based on bitmap
in the clustered-index:
Creating an index-organized table:
ORGANIZATION INDEX 指明這是index organized的情況
MAPPING TABLE // creates a table mapping logical rowids-- the mapping table is necessary to support bitmap indexes
Multidimensional clusting中也分為兩種:
linear order
there is also something like interleaved order: 意味著以Z字形進(jìn)行存儲數(shù)據(jù),例如對坐標(biāo)(2邻薯,1裙戏,0)對象進(jìn)行建立索引時(shí)將會為(100010)2=34
cluster的建立可以減少IO操作,無須像index那么的存儲空間厕诡,可以提高壓縮的程度
9. Bloom filter
to decide whether the element is in the table or not
Array A: mbit
Hash function
1.map(element in S) using the several hash function into A
2. A[h(s)]=0 then it doesn't in the S
else it is possible
improve the accuracy, A has more bits, making more hash functions..
10.Column-oriented DB
merge the dictionary, sorted it
replace the old one using the new dictionary to replace.
After the merge,
different buffer is empty but with the valid vector as 1