聲明:由于本人也是處于學(xué)習(xí)階段,有些理解可能并不深刻军俊,甚至?xí)y帶一定錯(cuò)誤砍濒,因此請(qǐng)以批判的態(tài)度來(lái)進(jìn)行閱讀淋肾,如有錯(cuò)誤,請(qǐng)留言或直接聯(lián)系本人爸邢。
本周內(nèi)容參照J(rèn)iawei.Han&Micheline.Kamber&Jian.Pei, DATA MINING: Concepts and Techniques, Third Edition. 版本的部分內(nèi)容樊卓。
本周內(nèi)容:1) Logical Model; 2) Query Language; 3) Physical Model and Query Processing Technologies; 4) Materialized Cuboids and Efficient Computing Cuboids
關(guān)鍵詞:Star Schema; Snowflake Schema; Fact Collection; Normalization; Denormalization; SQL; MDX; ROLAP; MOLAP; Bitmap; Join Index; Arbitrary selections; Coarse-grain Aggregations; Top-down Approach; Bottom-up Approach
問(wèn)題一,什么是Logical Model杠河?它的實(shí)現(xiàn)方式有哪些碌尔?
什么是Logical Model(Logical Data Model)浇辜?
Wikipedia解釋:A logical data model or logical schema is a data model of a specific problem domain expressed independently of a particular database management product or storage technology (physical data model) but in terms of data structures such as relational tables and columns, object-oriented classes, or XML tags.
Data Warehouse的Logical Model有兩種主要實(shí)現(xiàn)方式:
1)relational DB technology:
1.1)Star schema,
1.2)Snowflake schema,
1.3)Fact constellation
2)multidimensional technology:
2.1)Just as multidimensional data cube
問(wèn)題二,什么是Star schema唾戚?
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.(DATA MINING P.114)
在這里柳洋,SALES是Fact Table, 其他的是Dimension Table叹坦。
那么這里的star schema是怎么由universal schema轉(zhuǎn)變過(guò)來(lái)的呢熊镣?
其實(shí)這里的star schema是由universal schema經(jīng)過(guò)normalization轉(zhuǎn)化而來(lái),具體有:
1)Each dimension is represented by a dimension-table
1.1)LOCATION (location_key, store, street_address, city, state, country, region)
1.2)dimension tables are not normalized
2)Transactions are described through a fact-table
2.1)each tuple consists of a pointer to each of the dimension-tables (foreign-key) and a list of measures (例如上圖中的募书,units_sold; amount)
使用Star schema有什么好處呢绪囱?
Facts and dimensions are clearly depicted
1)dimension tables are relatively static, data is loaded (append mostly) into fact table(s)
2)easy to comprehend (and write queries)
問(wèn)題三,什么是Snowflake schema莹捡?
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.(DATA MINING P.114)
因此鬼吵,我們可知,Snowflake schema只要經(jīng)過(guò)Denormalization就可以變回Star schema.
問(wèn)題四篮赢,什么是Fact Collection齿椅?
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.(DATA MINING P.116)
由上圖可以得到Fact Collection是由多個(gè)fact table共享它們相同的dimension table 的內(nèi)容,可以達(dá)到降低冗余的需求启泣。
名詞解釋:
Normalization:Database normalization, or simply normalization, is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).(Wikipedia)(Tips:可以思考relational database中的1NF->2NF->3NF等等的轉(zhuǎn)換)
為什么要在relational database中使用normalization呢涣脚?
1)節(jié)省空間。例如寥茫,street1:Bark Street涩澡;City1:Kingsford;Stae1:NSW坠敷。Street2:Harry stree妙同;City1:Kingsford;Stae1:NSW膝迎。納悶我們將City進(jìn)行normalization粥帚,即city_key指向city、stae限次,然后我們不同的street就可以共用這一city_key芒涡,以達(dá)到減少redundancy的目的。
2)便于update卖漫。任然是上面的例子费尽,如果Bark Street現(xiàn)在劃歸為Randwick,那么只需要將city_key的指向指為Randwick所屬的那個(gè)city_key就可以了羊始。
Denormalization:Denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.[1][2] It is often motivated by performance or scalability in relational database software needing to carry out very large numbers of read operations. Denormalization should not be confused with Unnormalized form. Databases/tables must first be normalized to efficiently denormalize them.(Wikipedia)
還可參照以下更為通俗的解釋:https://medium.com/@katedoesdev/normalized-vs-denormalized-databases-210e1d67927d
問(wèn)題五旱幼,Data Warehouse的query language 有哪些?
1)Using relational DB technology: SQL (with extensions such as CUBE/PIVOT/UNPIVOT)
2)Using multidimensional technology: MDX
上圖中突委,左邊的是SQL柏卤,右邊的是MDX冬三。相較而言,MDX更為簡(jiǎn)單缘缚,因?yàn)樗恼Z(yǔ)句更加直觀簡(jiǎn)潔勾笆。它們所共同表達(dá)的意思就是:Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category, Measures.amnt)
問(wèn)題六,Data Warehouse 的層次結(jié)構(gòu)模型有什么桥滨?
Data Warehouse architecture分為3層:1)最底層窝爪,Data Warehouse Server; 2)中間層,OLAP Server; 3)最高層齐媒,F(xiàn)ront-end Tools
我們著重介紹中間層:OLAP Server酸舍。
OLAP Server主要采用兩種方式:
1)ROLAP(which is used in relational database technology)
2)MOLAP(which is used in multidimensional technology)
還有一種混合的OLAP Server,HOLAP
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. The DSS server of Microstrategy, for example, adopts the ROLAP approach.(DATA MINING P.135)
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data136 Chapter 3 Data Warehouse and OLAP Technology: An Overview cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored (Chapter 4). Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser subcubes are identified and stored as array structures, whereas sparse subcubes employ compression technology for efficient storage utilization.(DATA MINING P.135)
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid OLAP server.(DATA MINING P.136)
例如上圖里初,就是利用HOLAP來(lái)進(jìn)行數(shù)據(jù)存放及處理,base cuboid我們可以理解為star schema中的fact table忽舟,它使用ROLAP來(lái)進(jìn)行存儲(chǔ)處理双妨,因?yàn)閞elational database query更加方便;然而下面的可以用MOLAP來(lái)進(jìn)行存儲(chǔ)處理叮阅,這樣使得數(shù)據(jù)存儲(chǔ)更加立體刁品,索引起來(lái)更加方便。
問(wèn)題七浩姥,當(dāng)我們我們針對(duì)OLAP中不同數(shù)據(jù)情況時(shí)挑随,我們?cè)摬捎煤畏N所應(yīng)方式來(lái)方便query呢?
1)Selection on low-cardinality attributes勒叠,Bitmap Index(BI)是個(gè)不錯(cuò)的選擇兜挨。
BI on dimension tables
1.1) Index on an attribute (column) with low distinct values
1.2) Each distinct values, v, is associated with a n-bit vector (n = #rows)
1.2.1) The i-th bit is set if the i-th row of the table has the value v for the indexed column
1.3) Multiple BIs can be efficiently combined to enable optimized scan of the table
這里將region作為”distinct values”,在Asia行時(shí)眯分,讀出C1,C3的Region為Asia拌汇,所以bitmap 的第1,3列表為1弊决。以此類推噪舀。如果需要加入新元素,C6,C7,C8......Cn飘诗,則只需要將bitmap后面增加n-5個(gè)column即可与倡。(記住,data warehouse一般只能增加數(shù)據(jù)昆稿,一般不可刪除修改數(shù)據(jù)或)
Bitmap的優(yōu)點(diǎn):1)提高存儲(chǔ)空間利用率纺座;2)提高運(yùn)行速率
2)Selection on high-cardinality attributes,Join Indices是個(gè)不錯(cuò)的選擇
2.1)Join index relates the values of the dimensions of a star schema to rows in the fact table.
2.1.1)a join index on city maintains for each distinct city a list of ROW-IDs of the tuples recording the sales in the city
2.2)Join indices can span multiple dimensions OR
2.2.1)can be implemented as bitmap indexes (per dimension)
2.2.2)use bit-op for multiple-joins
我們可以將Kingsford在fact table Sales中的Row ID記錄在Kingsford的join index table(黃色區(qū)域)下溉潭,以方便我們索引Kingsford 比驻;同時(shí)该溯,還可以為Time中的year= ‘2017’建立join index table。假如别惦,Kingsford join index table[R102, R117, R118, R124], ‘2017’join index table[R111, R117, R 119, R124]狈茉,那么當(dāng)我們需要查找位于Kingsford 的2017年的sales數(shù)據(jù)則可以索引join index table得到R117和R124符合條件。
3)Arbitrary selections on Dimensions
可以使用正則來(lái)篩選掸掸。
(以下是Relational Database 和 Data Warehouse之間的數(shù)據(jù)檢索的差別)
上面的語(yǔ)句就是Query氯庆,與traditional relational DB相同,Sales與select出來(lái)的Time相join扰付,然后和select出來(lái)的Time相join堤撵,然后和select出來(lái)的Time相join。請(qǐng)注意羽莺,這里的join只能是binary combine实昨,而且是fact table with dimension table only(因?yàn)閐imension table間沒(méi)有common foreign key,因此無(wú)法combine盐固,然而fact table中存儲(chǔ)了所有的foreign key)荒给。(binary combine比較耗費(fèi)時(shí)間。)
在Data Warehouse中我們使用上面star join這種方法刁卜,先將select出的Time, Customer, Loc進(jìn)行Cross Join生成笛卡兒積志电,再join fact table(因?yàn)閒act table太大了所以最后join)。例如例子所給蛔趴,我們將select出的3組數(shù)據(jù)cross join得到8個(gè)tuple挑辆,然后得到找出想要的2個(gè)tuple,接著通過(guò)scan fact table中的數(shù)據(jù)來(lái)對(duì)比出符合這2個(gè)tuple中3個(gè)foreign key的tuples(fact table中的tuples)孝情。這樣做相較于star query鱼蝉,只需要再最后join fact table,而再star query中箫荡,需要在3次join中與fact table相join蚀乔,因此star join更加高效。
4)Coarse-grain Aggregations
我們將Time聚合為Year睬魂,customer聚合為Type,Loc聚合為City镀赌,以此來(lái)形成一個(gè)Cuboid氯哮,方便。當(dāng)然商佛,需要根據(jù)實(shí)際情況需求來(lái)形成cuboid喉钢。
問(wèn)題八姆打,How to store the materialized cuboids? How to compute the cuboids efficiently?
有4種方式:
1)ROLAP(Store all data into one single table)
當(dāng)需要用到某些cuboids來(lái)應(yīng)對(duì)query時(shí),可以使用selection語(yǔ)句進(jìn)行查詢(這里的(store, product), (store), (product),()都是獨(dú)立的cuboid)肠虽。
2)Top-down Approach
在這里幔戏,晶格圖中最上層是所有dimension的detail,往下走是一步步group by税课。例如A是product闲延,B是quarter,那么Cuboid AB表示(product, quarter), Cuboid A表示第三層的(product),Cuboid B表示第三層的(quarter). Cuboid A中韩玩,是以product為Group by的垒玲,所以tuple1(A: 1; B: 7; C: 100), tuple2(A: 2; B: 4; C: 50)。所以group by和computing是從頂向下的找颓。
3)Bottom-up Approach
自底向上合愈,從A, B, C, D各自group by,再到AB等等間group by直到最后達(dá)到ABCD聚合击狮。
這里介紹的計(jì)算方式是Recursive佛析,具體見PPT60
4)MOLAP
詳見PPT69