COMP9318_WEEK3

聲明：由于本人也是處于學(xué)習(xí)階段，有些理解可能并不深刻军俊，甚至?xí)y帶一定錯(cuò)誤砍濒，因此請(qǐng)以批判的態(tài)度來(lái)進(jìn)行閱讀淋肾，如有錯(cuò)誤，請(qǐng)留言或直接聯(lián)系本人爸邢。

本周內(nèi)容參照J(rèn)iawei.Han&Micheline.Kamber&Jian.Pei, DATA MINING: Concepts and Techniques, Third Edition. 版本的部分內(nèi)容樊卓。

本周內(nèi)容：1) Logical Model; 2) Query Language; 3) Physical Model and Query Processing Technologies; 4) Materialized Cuboids and Efficient Computing Cuboids

關(guān)鍵詞：Star Schema; Snowflake Schema; Fact Collection; Normalization; Denormalization; SQL; MDX; ROLAP; MOLAP; Bitmap; Join Index; Arbitrary selections; Coarse-grain Aggregations; Top-down Approach; Bottom-up Approach

問(wèn)題一，什么是Logical Model杠河？它的實(shí)現(xiàn)方式有哪些碌尔？
什么是Logical Model(Logical Data Model)浇辜？
Wikipedia解釋：A logical data model or logical schema is a data model of a specific problem domain expressed independently of a particular database management product or storage technology (physical data model) but in terms of data structures such as relational tables and columns, object-oriented classes, or XML tags.

Data Warehouse的Logical Model有兩種主要實(shí)現(xiàn)方式：
1）relational DB technology:
1.1）Star schema,
1.2）Snowflake schema,
1.3）Fact constellation
2）multidimensional technology:
2.1）Just as multidimensional data cube

問(wèn)題二，什么是Star schema唾戚？
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.（DATA MINING P.114）

Star Schema

在這里柳洋，SALES是Fact Table，其他的是Dimension Table叹坦。

那么這里的star schema是怎么由universal schema轉(zhuǎn)變過(guò)來(lái)的呢熊镣？
其實(shí)這里的star schema是由universal schema經(jīng)過(guò)normalization轉(zhuǎn)化而來(lái)，具體有：
1）Each dimension is represented by a dimension-table
1.1）LOCATION (location_key, store, street_address, city, state, country, region)
1.2）dimension tables are not normalized
2）Transactions are described through a fact-table
2.1）each tuple consists of a pointer to each of the dimension-tables (foreign-key) and a list of measures (例如上圖中的募书，units_sold; amount)

使用Star schema有什么好處呢绪囱？
Facts and dimensions are clearly depicted
1）dimension tables are relatively static, data is loaded (append mostly) into fact table(s)
2）easy to comprehend (and write queries)

問(wèn)題三，什么是Snowflake schema莹捡？
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.(DATA MINING P.114)

Snowflak Schema

因此鬼吵，我們可知，Snowflake schema只要經(jīng)過(guò)Denormalization就可以變回Star schema.

問(wèn)題四篮赢，什么是Fact Collection齿椅？
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.(DATA MINING P.116)

Fact Collection

由上圖可以得到Fact Collection是由多個(gè)fact table共享它們相同的dimension table 的內(nèi)容，可以達(dá)到降低冗余的需求启泣。

名詞解釋：
Normalization：Database normalization, or simply normalization, is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).(Wikipedia)(Tips:可以思考relational database中的1NF->2NF->3NF等等的轉(zhuǎn)換)
為什么要在relational database中使用normalization呢涣脚？
1）節(jié)省空間。例如寥茫，street1：Bark Street涩澡；City1：Kingsford；Stae1：NSW坠敷。Street2：Harry stree妙同；City1：Kingsford；Stae1：NSW膝迎。納悶我們將City進(jìn)行normalization粥帚，即city_key指向city、stae限次，然后我們不同的street就可以共用這一city_key芒涡，以達(dá)到減少redundancy的目的。
2）便于update卖漫。任然是上面的例子费尽，如果Bark Street現(xiàn)在劃歸為Randwick，那么只需要將city_key的指向指為Randwick所屬的那個(gè)city_key就可以了羊始。

Denormalization：Denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.[1][2] It is often motivated by performance or scalability in relational database software needing to carry out very large numbers of read operations. Denormalization should not be confused with Unnormalized form. Databases/tables must first be normalized to efficiently denormalize them.(Wikipedia)
還可參照以下更為通俗的解釋：https://medium.com/@katedoesdev/normalized-vs-denormalized-databases-210e1d67927d

問(wèn)題五旱幼，Data Warehouse的query language 有哪些？
1）Using relational DB technology: SQL (with extensions such as CUBE/PIVOT/UNPIVOT)
2）Using multidimensional technology: MDX

query language

上圖中突委，左邊的是SQL柏卤，右邊的是MDX冬三。相較而言，MDX更為簡(jiǎn)單缘缚，因?yàn)樗恼Z(yǔ)句更加直觀簡(jiǎn)潔勾笆。它們所共同表達(dá)的意思就是：Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category, Measures.amnt)

問(wèn)題六，Data Warehouse 的層次結(jié)構(gòu)模型有什么桥滨？
Data Warehouse architecture分為3層：1）最底層窝爪，Data Warehouse Server; 2）中間層，OLAP Server; 3）最高層齐媒，F(xiàn)ront-end Tools

Data Warehouse 的層次結(jié)構(gòu)模型

我們著重介紹中間層：OLAP Server酸舍。
OLAP Server主要采用兩種方式：
1）ROLAP(which is used in relational database technology)
2）MOLAP(which is used in multidimensional technology)
還有一種混合的OLAP Server，HOLAP

Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. The DSS server of Microstrategy, for example, adopts the ROLAP approach.(DATA MINING P.135)

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data136 Chapter 3 Data Warehouse and OLAP Technology: An Overview cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored (Chapter 4). Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser subcubes are identified and stored as array structures, whereas sparse subcubes employ compression technology for efficient storage utilization.(DATA MINING P.135)

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid OLAP server.(DATA MINING P.136)

image.png

例如上圖里初，就是利用HOLAP來(lái)進(jìn)行數(shù)據(jù)存放及處理，base cuboid我們可以理解為star schema中的fact table忽舟，它使用ROLAP來(lái)進(jìn)行存儲(chǔ)處理双妨，因?yàn)閞elational database query更加方便；然而下面的可以用MOLAP來(lái)進(jìn)行存儲(chǔ)處理叮阅，這樣使得數(shù)據(jù)存儲(chǔ)更加立體刁品，索引起來(lái)更加方便。

問(wèn)題七浩姥，當(dāng)我們我們針對(duì)OLAP中不同數(shù)據(jù)情況時(shí)挑随，我們?cè)摬捎煤畏N所應(yīng)方式來(lái)方便query呢？
1）Selection on low-cardinality attributes勒叠，Bitmap Index(BI)是個(gè)不錯(cuò)的選擇兜挨。
BI on dimension tables
1.1) Index on an attribute (column) with low distinct values
1.2) Each distinct values, v, is associated with a n-bit vector (n = #rows)
1.2.1) The i-th bit is set if the i-th row of the table has the value v for the indexed column
1.3) Multiple BIs can be efficiently combined to enable optimized scan of the table

Bitmap Index

這里將region作為”distinct values”，在Asia行時(shí)眯分，讀出C1,C3的Region為Asia拌汇，所以bitmap 的第1，3列表為1弊决。以此類推噪舀。如果需要加入新元素，C6,C7,C8......Cn飘诗，則只需要將bitmap后面增加n-5個(gè)column即可与倡。（記住，data warehouse一般只能增加數(shù)據(jù)昆稿，一般不可刪除修改數(shù)據(jù)或）
Bitmap的優(yōu)點(diǎn)：1）提高存儲(chǔ)空間利用率纺座；2）提高運(yùn)行速率

2）Selection on high-cardinality attributes，Join Indices是個(gè)不錯(cuò)的選擇
2.1）Join index relates the values of the dimensions of a star schema to rows in the fact table.
2.1.1）a join index on city maintains for each distinct city a list of ROW-IDs of the tuples recording the sales in the city
2.2）Join indices can span multiple dimensions OR
2.2.1）can be implemented as bitmap indexes (per dimension)
2.2.2）use bit-op for multiple-joins

Join Index

我們可以將Kingsford在fact table Sales中的Row ID記錄在Kingsford的join index table（黃色區(qū)域）下溉潭，以方便我們索引Kingsford 比驻；同時(shí)该溯，還可以為Time中的year= ‘2017’建立join index table。假如别惦，Kingsford join index table[R102, R117, R118, R124], ‘2017’join index table[R111, R117, R 119, R124]狈茉，那么當(dāng)我們需要查找位于Kingsford 的2017年的sales數(shù)據(jù)則可以索引join index table得到R117和R124符合條件。

3）Arbitrary selections on Dimensions

Arbitrary selections

可以使用正則來(lái)篩選掸掸。

（以下是Relational Database 和 Data Warehouse之間的數(shù)據(jù)檢索的差別）

image.png

Star Query

上面的語(yǔ)句就是Query氯庆，與traditional relational DB相同，Sales與select出來(lái)的Time相join扰付，然后和select出來(lái)的Time相join堤撵，然后和select出來(lái)的Time相join。請(qǐng)注意羽莺，這里的join只能是binary combine实昨，而且是fact table with dimension table only（因?yàn)閐imension table間沒(méi)有common foreign key，因此無(wú)法combine盐固，然而fact table中存儲(chǔ)了所有的foreign key）荒给。（binary combine比較耗費(fèi)時(shí)間。）

Star Join

在Data Warehouse中我們使用上面star join這種方法刁卜，先將select出的Time, Customer, Loc進(jìn)行Cross Join生成笛卡兒積志电，再join fact table（因?yàn)閒act table太大了所以最后join）。例如例子所給蛔趴，我們將select出的3組數(shù)據(jù)cross join得到8個(gè)tuple挑辆，然后得到找出想要的2個(gè)tuple，接著通過(guò)scan fact table中的數(shù)據(jù)來(lái)對(duì)比出符合這2個(gè)tuple中3個(gè)foreign key的tuples（fact table中的tuples）孝情。這樣做相較于star query鱼蝉，只需要再最后join fact table，而再star query中箫荡，需要在3次join中與fact table相join蚀乔，因此star join更加高效。

4）Coarse-grain Aggregations

即使Join Index可以幫助我們清除大量的無(wú)用信息菲茬，但是留下的信息仍然非常龐大吉挣，如果我們直接進(jìn)行aggregation的話那會(huì)消耗非常多的資源。于是需要pre-compute, 我們采用cuboid來(lái)幫助最后Group By婉弹，語(yǔ)句如下

image.png

我們將Time聚合為Year睬魂，customer聚合為Type，Loc聚合為City镀赌，以此來(lái)形成一個(gè)Cuboid氯哮，方便。當(dāng)然商佛，需要根據(jù)實(shí)際情況需求來(lái)形成cuboid喉钢。

問(wèn)題八姆打，How to store the materialized cuboids? How to compute the cuboids efficiently?
有4種方式：
1）ROLAP（Store all data into one single table）

image.png

當(dāng)需要用到某些cuboids來(lái)應(yīng)對(duì)query時(shí)，可以使用selection語(yǔ)句進(jìn)行查詢（這里的（store, product）, (store), (product),()都是獨(dú)立的cuboid）肠虽。

2）Top-down Approach

Top-down Approach

在這里幔戏，晶格圖中最上層是所有dimension的detail，往下走是一步步group by税课。例如A是product闲延，B是quarter，那么Cuboid AB表示(product, quarter), Cuboid A表示第三層的(product),Cuboid B表示第三層的(quarter). Cuboid A中韩玩，是以product為Group by的垒玲，所以tuple1(A: 1; B: 7; C: 100), tuple2(A: 2; B: 4; C: 50)。所以group by和computing是從頂向下的找颓。

3）Bottom-up Approach

Bottom-up Approach

自底向上合愈，從A, B, C, D各自group by，再到AB等等間group by直到最后達(dá)到ABCD聚合击狮。
這里介紹的計(jì)算方式是Recursive佛析，具體見PPT60

4）MOLAP
詳見PPT69

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末，一起剝皮案震驚了整個(gè)濱河市帘不，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌杨箭，老刑警劉巖寞焙，帶你破解...
沈念sama閱讀 206,968評(píng)論 6贊 482
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場(chǎng)離奇詭異互婿，居然都是意外死亡捣郊，警方通過(guò)查閱死者的電腦和手機(jī)，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,601評(píng)論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門慈参，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)呛牲，“玉大人，你說(shuō)我怎么就攤上這事驮配∧锢” “怎么了？”我有些...
開封第一講書人閱讀 153,220評(píng)論 0贊 344
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵壮锻，是天一觀的道長(zhǎng)琐旁。經(jīng)常有香客問(wèn)我，道長(zhǎng)猜绣，這世上最難降的妖魔是什么灰殴？我笑而不...
開封第一講書人閱讀 55,416評(píng)論 1贊 279
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮掰邢，結(jié)果婚禮上牺陶，老公的妹妹穿的比我還像新娘伟阔。我一直安慰自己，他們只是感情好掰伸，可當(dāng)我...
茶點(diǎn)故事閱讀 64,425評(píng)論 5贊 374
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開白布皱炉。她就那樣靜靜地躺著，像睡著了一般碱工。火紅的嫁衣襯著肌膚如雪娃承。梳的紋絲不亂的頭發(fā)上，一...
開封第一講書人閱讀 49,144評(píng)論 1贊 285
城市分裂傳說(shuō)
那天怕篷，我揣著相機(jī)與錄音历筝，去河邊找鬼。笑死廊谓，一個(gè)胖子當(dāng)著我的面吹牛梳猪，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播蒸痹，決...
沈念sama閱讀 38,432評(píng)論 3贊 401
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼春弥，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼！你這毒婦竟也來(lái)了叠荠？” 一聲冷哼從身側(cè)響起匿沛，我...
開封第一講書人閱讀 37,088評(píng)論 0贊 261
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤，失蹤者是張志新（化名）和其女友劉穎榛鼎，沒(méi)想到半個(gè)月后逃呼，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 43,586評(píng)論 1贊 300
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡者娱，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 36,028評(píng)論 2贊 325
?白月光啟示錄
正文我和宋清朗相戀三年抡笼，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片黄鳍。...
茶點(diǎn)故事閱讀 38,137評(píng)論 1贊 334
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡推姻，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出框沟，到底是詐尸還是另有隱情藏古，我是刑警寧澤，帶...
沈念sama閱讀 33,783評(píng)論 4贊 324
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布忍燥，位于F島的核電站校翔，受9級(jí)特大地震影響，放射性物質(zhì)發(fā)生泄漏灾前。R本人自食惡果不足惜防症，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 39,343評(píng)論 3贊 307
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧蔫敲，春花似錦饲嗽、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 30,333評(píng)論 0贊 19
一樁弒父案貌虾，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)。三九已至裙犹，卻和暖如春尽狠，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背叶圃。一陣腳步聲響...
開封第一講書人閱讀 31,559評(píng)論 1贊 262
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工袄膏，沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留，地道東北人掺冠。一個(gè)月前我還...
沈念sama閱讀 45,595評(píng)論 2贊 355
代替公主和親
正文我出身青樓沉馆，卻偏偏與公主長(zhǎng)得像，于是被迫代替她去往敵國(guó)和親德崭。傳聞我的和親對(duì)象是個(gè)殘疾皇子斥黑，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 42,901評(píng)論 2贊 345

COMP9318_WEEK3

推薦閱讀更多精彩內(nèi)容