Data Warehouse簡介與入門

正巧大后天就要考Data Warehouse（數(shù)據(jù)倉庫）了勉吻，就憑借自己寫一點(diǎn)基礎(chǔ)的知識權(quán)當(dāng)做一個(gè)考前的復(fù)習(xí)作用，以及作為一個(gè)基礎(chǔ)的入門介紹吧

1. Motivation：

首先要明確為啥需要data warehouse奉瘤，以及究竟什么是data warehouse亲族。

沒有數(shù)據(jù)倉庫時(shí)种呐，我們需要直接從業(yè)務(wù)數(shù)據(jù)庫中取數(shù)據(jù)來做分析狠半。業(yè)務(wù)數(shù)據(jù)庫主要是為業(yè)務(wù)操作服務(wù)噩死，雖然可以用于分析，但需要做很多額外的調(diào)整神年，在我看來已维，主要有以下幾個(gè)問題：結(jié)構(gòu)復(fù)雜，數(shù)據(jù)臟亂已日，難以理解垛耳，缺少歷史，大規(guī)模查詢緩慢。（OLTP） Online Transaction Processing

上面的問題艾扮，都可以通過一個(gè)建設(shè)良好的數(shù)據(jù)倉庫來解決址否。

業(yè)務(wù)數(shù)據(jù)庫是面向操作的云头，主要服務(wù)于業(yè)務(wù)產(chǎn)品和開發(fā)。而數(shù)據(jù)倉庫則是面向分析的柑土，主要服務(wù)于我們分析人員逆济。評價(jià)數(shù)據(jù)倉庫做的好不好酌予，就看我們分析師用得爽不爽。因此奖慌，數(shù)據(jù)倉庫從產(chǎn)品設(shè)計(jì)開始抛虫，就一直是站在分析師的立場上考慮的，致力于解決使用業(yè)務(wù)數(shù)據(jù)進(jìn)行分析帶來的種種弊端简僧。（OLAP） Online Analytical Processing

What is data warehouse：

拿William H.Inmon’s 話來定義：

Data warehouse is subject-oriented, time varying, integrated, non volatile collection of data in support of management's decision analyse.

Subject-oriented: DW organised by several subject determined by requirements. 這句話我所理解的是因?yàn)閿?shù)據(jù)倉庫本書是沒有數(shù)據(jù)的建椰，所有數(shù)據(jù)來源都是來自不同的數(shù)據(jù)庫，那么在建立的時(shí)候數(shù)據(jù)來源自然就由用戶query時(shí)的需求所決定

integrated: the content results from the integration of data from multiple sources

time-varying: keeps track of data changes so that reports show evolution over time

non-volatile: new data can be added, but data is±never deleted nor updated. 實(shí)際上這也是olap和oltp主要的區(qū)別之一

現(xiàn)在市面上可以使用數(shù)據(jù)倉庫主要通過以下產(chǎn)品：

Oracle岛马，SAP棉姐，IBM，SQL server啦逆，teradata

open resource的有：

talend

下面這張圖就詳解的介紹了olap和oltp的區(qū)別：

可以簡單的把oltp理解成原始的數(shù)據(jù)庫伞矩，olap理解成數(shù)據(jù)倉庫

2.數(shù)據(jù)倉庫的基本架構(gòu)和基本組成與模型：

由于傳統(tǒng)的數(shù)據(jù)庫中用的一般都是3NF的表，在BI顯得過于復(fù)雜夏志，故需要新的模型來支持data warehouse中的data

我們實(shí)際上利用Multi-dimensional model乃坤，由于它比較簡單易懂，就是根據(jù)多個(gè)dimension將多個(gè)關(guān)系組成一個(gè)立方體（維度可以大于等于3）的形式沟蔑。

在multi-dimensional model中的基本組成成分由下列幾種：

Cube：指組成的那個(gè)立方體

Cell：交錯(cuò)的小方格

Fact：不為空的cell

Measure：不為空的cell的數(shù)值

Dimension：簡單的講就是組成cube的各個(gè)表

Facts：has a granularity = level of detail

主要分為兩種fact：event（針對某個(gè)的時(shí)間或者地點(diǎn)）湿诊，snapshot（強(qiáng)調(diào)在一定的時(shí)期范圍內(nèi)）

Dimension：是有層次之分的，比如有一個(gè)表中包含（Paris溉贿，Berlin枫吧，Lyon，London）可以對其做rollup使之聚類成為country層次的dimension, 本次均假設(shè)是線性的dimension宇色，有個(gè)最大的層次ALL

Measure：有三種measure： additive 可加的意思上該measure在任何dimension均可相加并且有實(shí)際意義九杂，semi-additive在某些維度可以相加有實(shí)際意義，non-additive 在任何維度均不可相加

對于multi-dimensional model有以下幾種可能的操作形式：

ROLLUP(Cube,Dimension to Level, AggFunction(Measure)) 即剛才所說的向上聚合的過程

DRILLDOWN(Cube,Dimension to Level) 從高層次擴(kuò)展到detail的過程

SLICE(Cube,Dimension, Level=value)

DICE(Cube,), with: boolean combination

3.數(shù)據(jù)倉庫的基本存儲形式

主要用三種形式來表達(dá)之前提到過的cube：

1.ROLAP：用relational database來存儲數(shù)據(jù)?

ROLAP可以較好的支持現(xiàn)有的數(shù)據(jù)庫宣蠕，利用relational database來存儲cube例隆，可以fit memory，只存儲非空的cell抢蚀，可以支持語義镀层，比較容易維護(hù)

2.MOLAP：用array來存儲數(shù)據(jù)

比較容易得到某個(gè)特定的數(shù)據(jù)，可以使用不在relational database下其他指令，沒必要存儲每個(gè)點(diǎn)的坐標(biāo)

3.HOLAP：結(jié)合以上兩種

由于ROLAP有對于原本就有的數(shù)據(jù)庫的支持唱逢，主要介紹以下ROLAP下的幾種不同類型的表格

1.Star Schema：

由一個(gè)fact table以及多個(gè)dimension table組成

fact table：由事實(shí)和dimension table 的FK組成吴侦，在star schema 的形式下dimension table里不允許再出現(xiàn)FK

2.Snowflake Schema：

由一個(gè)fact table 以及多個(gè)dimension table組成

此時(shí)與star schema最大的區(qū)別就在于對于dimension table下可以存放字結(jié)構(gòu)的FK，被連接的dimension table 被稱為outrigger dimension 支撐向量

3. Starflake Schema：

結(jié)合了上述兩種schema的表現(xiàn)形式

4. Galaxy：

相當(dāng)于是star schema+多個(gè)fact table

可以存儲arregate之后的結(jié)果

4 Fact table 與dimension attribute

在fact table中要避免dimension table 的FK為null

these nulls would automatically cause a referential integrity violation

在fact table中measure 的值可以為空

fact table有以下幾種形式：

transaction fact tables

periodic snapshot fact table

accumulating snapshot fact table

factless fact table

aggregate fact table

consolidated fact table

Role-playing dimension：a dimension that participates several times in a fact table

意思就是某個(gè)dimension table 里面有多個(gè)FK坞古，這些FK均可以連接到同一個(gè)Fact Table里备韧。fact Table有多個(gè)FK可以指向同一個(gè)表

在dimension table中要避免簡單的Y／N，最好寫上具體意思

5. SCD

當(dāng)我們需要改變屬于OLAP的表格時(shí)痪枫，我們需要update數(shù)據(jù)织堂。在DW中一般有以下幾種方式：

type1: overwrite，直接覆蓋舊數(shù)據(jù)

type2: with version

add column （effective奶陈，expiration易阳，indicator）

insert the new value

我們新加入一行，這一行數(shù)據(jù)的surrogate key是不同的吃粒，但是natural key是與原數(shù)據(jù)一致潦俺，同時(shí)加入3列，用來指明該數(shù)據(jù)生效和失效的日期声搁，并指明現(xiàn)在正在使用的數(shù)據(jù)對象的狀態(tài)是effective還是expire

type3: with version

replace the old one黑竞，add a column(last version)

在這里我們將新加入一列用來指明之前的舊數(shù)據(jù)

6.query in DW

當(dāng)我們實(shí)際在DW進(jìn)行query查詢操作時(shí)，以oracle為例：

with clause to do the recursive query：可以支持遞歸查詢疏旨，第一行用來init 數(shù)據(jù)很魂，union all之后的部分可以當(dāng)作不斷遞歸擴(kuò)展的部分

“Group By”從字面意義上理解就是根據(jù)“By”指定的規(guī)則對數(shù)據(jù)進(jìn)行分組，所謂的分組就是將一個(gè)“數(shù)據(jù)集”劃分成若干個(gè)“小區(qū)域”檐涝，然后針對若干個(gè)“小區(qū)域”進(jìn)行數(shù)據(jù)處理遏匆。

rollup(a,b): (a,b),(a),() 屬于group by的一部分，數(shù)據(jù)將會有(a,b),(a),()三個(gè)層次的區(qū)別谁榜。最后一個(gè)即對所有total層次的統(tǒng)計(jì)

cube(a,b):(a,b),(a),(b),()類似與rollup幅聘，但是增加了對b對象支持

grouping sets：可以講想要的維度對象加入到其中

grouping(balba): 1 when the value is null, else 0 可以用來檢測現(xiàn)在的層次的狀態(tài)，歸根結(jié)底還是為了避免出現(xiàn)null的情況做出的努力

grouping_ID(a,b,c): use the binary to store the situation. 當(dāng)有多個(gè)元素時(shí)窃植，grouping_ID出現(xiàn)的將會是以二進(jìn)制的形式帝蒿。以a b c為例，1 0 0即當(dāng)a層次為null巷怜，將會顯示6

Group_ID: whether it is replicated or not and how many times it shows

顯示數(shù)據(jù)重復(fù)次數(shù)

in the window function:

when the window function is a aggregated function then it must need group by operation

window function over(partition by.. [order by ...])：值得注意的是葛超，當(dāng)window function是arrogate的時(shí)候，一定需要加上group by（ partition的對象）

rank() over (partition by.. order by ...): 從醒铀堋（1）到大绣张；若有desc則是從大（1）到小排序，注意此時(shí)的order by是必須的关带，很好理解侥涵，沒有order就沒辦法進(jìn)行rank了

DenseRank():不允許出現(xiàn)rank排名有跳過的現(xiàn)象（1 2 2 3 4）

window function over(...... [row 2 preceding]):向上看2行

window function over(...... [range condition]):向上看logic offset?

LAG（expression, offset, default_value): 對每行數(shù)據(jù)進(jìn)行expression中的操作，并利用offset中距離向上看，如果向上看的對象不存在則用default value代替

LEAD：一樣芜飘，換成向下看

7. Partition

divide data (table, index) into pieces that can be manipulated independently.

partition的好處：

allows to handle very large relation

allows to parallelism

filter the partition when query optimisation

ez to maintenance

這是在物理的層面做優(yōu)化要考慮的

何時(shí)考慮采用partition：

table size >2G more than 1M rows

historical data read only

multi media storages

partition主要分成兩種 vertical partition horizontal partition务豺。

horizontal partition主要分成以下三種基本類（它們可以自由組合）： hash, range, list

Hash:?

好處: balance the partition, on non-temporal attribute, distribute the data on the physical storage

怎么選擇hash時(shí)的對象：choose the column that is almost unique

怎么選擇hash的值：take the power of 2 as number

in the partition wise join

Update:

DISABLE/ENABLE ROW MOVEMENT

row movement disabled (default): update fails if it would result in row migration

Enable row movement: updating the key may cause a row to move to another partition

Range

MAX VALUE can be used to specify the top range, otherwise: implicit integrity constraint on the table.

LIST:

Does not support multi-column partition keys.

Interval partitioning:

first by range

then use interval(...):

必須要明確定義第一行數(shù)據(jù)

Partition by reference:

PARTITIONBYREFERENCE(sales_item_fk)

-- requires named referential constraint toward partitioned table

Partitioning on virtual column

The other operation possible:

ADD

DROP: the fast way to remove large volume data, delete the index first, otherwise it will degrade the performance?

TRUNCATE: only delete the row data, the partition remains

SPLIT

MERGE

Partition exchange

Partition pruning:

RANGE, LISTIN, =, LIKE, range (<,. . . )

HASH IN, =

Partition-wise joins:

Parallelism within a query (principle):split the query in multiple subqueries over distinct parts of the data; process subqueries in parallel.

full: partition key on both table with the same (#partition in hash)

partial: (table 1 has the partition key, table 2 will (re)partitioned based on the reference table partition.)

8.Index

B-tree

B+-tree: 在range queries中效率比較高

B樹和B+樹的主要區(qū)別在于：B-tree could store the data in the node, in the B+ only the leaves could store the data

一個(gè)小例子：

imagine if there is table with k attribute and n rows

then use B+ tree will be need 64*k*n

bitmap是適合DW的一種索引方式

create a bitmap for the attribute

它的好處在于，good at the boolean operation嗦明，節(jié)省空間不需要存儲ROW ID可以fit 在main memory

in that case it will be 3*k*n(3 is the coordinate of attribute k)

bitmap on join 只需要找到相同的join key再對相應(yīng)的對象進(jìn)行建立bitmap就可以了

STAR QUERY是非吵迥兀快的，因?yàn)樗萰oin需要有選擇（select）操作的對象招狸，再join無法select的大對象

在使用時(shí)必須要：STAR_TRANSFORMATION_ENABLED = true.

when using the star query, it need all the fk in based on bitmap

in the clustered-index:

Creating an index-organized table：

ORGANIZATION INDEX 指明這是index organized的情況

MAPPING TABLE // creates a table mapping logical rowids-- the mapping table is necessary to support bitmap indexes

Multidimensional clusting中也分為兩種：

linear order

there is also something like interleaved order: 意味著以Z字形進(jìn)行存儲數(shù)據(jù)，例如對坐標(biāo)（2邻薯，1裙戏，0）對象進(jìn)行建立索引時(shí)將會為（100010）2=34

cluster的建立可以減少IO操作，無須像index那么的存儲空間厕诡，可以提高壓縮的程度

9. Bloom filter

to decide whether the element is in the table or not

Array A: mbit

Hash function

1.map(element in S) using the several hash function into A

2. A[h(s)]=0 then it doesn't in the S

else it is possible

improve the accuracy, A has more bits, making more hash functions..

10.Column-oriented DB

merge the dictionary, sorted it

replace the old one using the new dictionary to replace.

After the merge,

different buffer is empty but with the valid vector as 1

最后編輯于：2017.11.27 02:22:15

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末累榜，一起剝皮案震驚了整個(gè)濱河市，隨后出現(xiàn)的幾起案子灵嫌，更是在濱河造成了極大的恐慌壹罚，老刑警劉巖，帶你破解...
沈念sama閱讀 219,110評論 6贊 508
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件寿羞，死亡現(xiàn)場離奇詭異猖凛，居然都是意外死亡，警方通過查閱死者的電腦和手機(jī)绪穆，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 93,443評論 3贊 395
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門辨泳，熙熙樓的掌柜王于貴愁眉苦臉地迎上來，“玉大人玖院，你說我怎么就攤上這事菠红。” “怎么了难菌？”我有些...
開封第一講書人閱讀 165,474評論 0贊 356
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵试溯，是天一觀的道長。經(jīng)常有香客問我郊酒，道長遇绞，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 58,881評論 1贊 295
?港島之戀（遺憾婚禮）
正文為了忘掉前任猎塞，我火速辦了婚禮试读，結(jié)果婚禮上，老公的妹妹穿的比我還像新娘荠耽。我一直安慰自己钩骇，他們只是感情好，可當(dāng)我...
茶點(diǎn)故事閱讀 67,902評論 6贊 392
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著倘屹，像睡著了一般银亲。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上纽匙，一...
開封第一講書人閱讀 51,698評論 1贊 305
城市分裂傳說
那天务蝠，我揣著相機(jī)與錄音，去河邊找鬼烛缔。笑死馏段，一個(gè)胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的践瓷。我是一名探鬼主播院喜，決...
沈念sama閱讀 40,418評論 3贊 419
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼，長吁一口氣：“原來是場噩夢啊……” “哼晕翠！你這毒婦竟也來了喷舀？” 一聲冷哼從身側(cè)響起，我...
開封第一講書人閱讀 39,332評論 0贊 276
萬榮殺人案實(shí)錄
序言：老撾萬榮一對情侶失蹤淋肾，失蹤者是張志新（化名）和其女友劉穎硫麻，沒想到半個(gè)月后，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體樊卓，經(jīng)...
沈念sama閱讀 45,796評論 1贊 316
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡拿愧，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,968評論 3贊 337
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了碌尔。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片赶掖。...
茶點(diǎn)故事閱讀 40,110評論 1贊 351
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡，死狀恐怖七扰，靈堂內(nèi)的尸體忽然破棺而出奢赂，到底是詐尸還是另有隱情，我是刑警寧澤颈走，帶...
沈念sama閱讀 35,792評論 5贊 346
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布膳灶，位于F島的核電站，受9級特大地震影響立由，放射性物質(zhì)發(fā)生泄漏轧钓。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,455評論 3贊 331
男人毒藥：我在死后第九天來索命
文/蒙蒙一锐膜、第九天我趴在偏房一處隱蔽的房頂上張望毕箍。院中可真熱鬧，春花似錦道盏、人聲如沸而柑。這莊子的主人今日做“春日...
開封第一講書人閱讀 32,003評論 0贊 22
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽媒咳。三九已至粹排，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間涩澡，已是汗流浹背顽耳。一陣腳步聲響...
開封第一講書人閱讀 33,130評論 1贊 272
情欲美人皮
我被黑心中介騙來泰國打工，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留妙同，地道東北人射富。一個(gè)月前我還...
沈念sama閱讀 48,348評論 3贊 373
代替公主和親
正文我出身青樓，卻偏偏與公主長得像粥帚，于是被迫代替她去往敵國和親辉浦。傳聞我的和親對象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 45,047評論 2贊 355