OLTP & OLAP
OLTP: Online Transaction Processing 聯(lián)機(jī)事務(wù)處理
OLAP: 聯(lián)機(jī)分析技術(shù)( On-Line Analytical Processing)
OLTP vs. OLAP | Transactional Databases vs. Analytical Databases
OLTP
OLTP數(shù)據(jù)庫分類
關(guān)系數(shù)據(jù)庫
幾乎所有的數(shù)據(jù)庫管理系統(tǒng)都配備了一個(gè)開放式數(shù)據(jù)庫連接(ODBC)驅(qū)動程序,令各個(gè)數(shù)據(jù)庫之間得以互相集成妒貌。
非關(guān)系型數(shù)據(jù)庫(NoSQL)
Apache Cassandra(為Facebook所使用[7]):高度可擴(kuò)展[8]
LevelDB(Google)
MySQL & MongoDB
https://en.wikipedia.org/wiki/MongoDB
https://en.wikipedia.org/wiki/MySQL
特點(diǎn)比較
https://medium.com/@rsk.saikrishna/when-to-use-mongodb-rather-than-mysql-d03ceff2e922
MySQL: The SQL Relational Database
The following are some MySQL benefits and strengths:
Maturity: MySQL is an extremely established database, meaning that there’s a huge community, extensive testing and quite a bit of stability.
Compatibility: MySQL is available for all major platforms, including Linux, Windows, Mac, BSD and Solaris. It also has connectors to languages like Node.js, Ruby, C#, C++, Java, Perl, Python and PHP, meaning that it’s not limited to SQL query language.
Cost-effective: The database is open source and free.
Replicable: The MySQL database can be replicated across multiple nodes, meaning that the workload can be reduced and the scalability and availability of the application can be increased.
Data Sharding: While sharding cannot be done on most SQL databases, it can be done on MySQL servers. This is both cost-effective and good for business.
MongoDB: The NoSQL Non-Relational Database
The following are some of MongoDB benefits and strengths:
? Dynamic schema: As mentioned, this gives you flexibility to change your data schema without modifying any of your existing data.
? Scalability: MongoDB is horizontally scalable, which helps reduce the workload and scale your business with ease.
- Manageability: The database doesn’t require a database administrator. Since it is fairly user-friendly in this way, it can be used by both developers and administrators.
? Speed: It’s high-performing for simple queries. ? Flexibility: You can add new columns or fields on MongoDB without affecting existing rows or application performance.
適合場景
Reasons to Use a SQL Database
Not every database fits every business need. That’s why many companies rely on both relational and non-relational databases for different tasks. Although NoSQL databases have gained popularity for their speed and scalability, there are still situations in which a highly structured SQL database might be preferable. Two reasons why you might consider a SQL database are:
- You need ACID compliancy (Atomicity, Consistency, Isolation, Durability). ACID compliancy reduces anomalies and protects the integrity of your database. It does this by defining exactly how transactions interact with the database, which is not the case with NoSQL databases, which have a primary goal of flexibility and speed, rather than 100% data integrity.
- Your data is structured and unchanging:If your business is not growing exponentially, there may be no reason to use a system designed to support a variety of data types and high traffic volume.
Reasons to Use a NoSQL Database
To prevent the database from becoming a system-wide bottleneck, especially in high volume environments, NoSQL databases perform in a way that relational databases cannot.
The following features are driving the popularity of NoSQL databases like MongoDB, Couch DB, Cassandra, and HBase:
- Storing large volumes of data without structure. A NoSQL database doesn’t limit storable data types. Plus, you can add new types as business needs change.
- Using cloud computing and storage. Cloud-based storage is a great solution, but it requires data to be easily spread across multiple servers for scaling. Using affordable hardware on-site for testing and then for production in the cloud is what NoSQL databases are designed for.
- Rapid development. If you are developing using modern agile methodologies, a relational database will slow you down. A NoSQL database doesn’t require the level of preparation typically needed for relational databases.
OLAP
OLAP技術(shù)
https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers
選擇適合你的開源 OLAP 引擎 - 微信公眾號:數(shù)據(jù)社 - OSCHINA - 中文開源技術(shù)交流社區(qū)
目前市面上主流的開源OLAP引擎包含不限于:Hive毕箍、Hawq、Presto、Kylin跛梗、Impala寻馏、Sparksql棋弥、Druid、Clickhouse诚欠、Greeplum等顽染。
它們各自有各自的特點(diǎn),我們將其分組:
Hive轰绵,Hawq粉寞,Impala - 基于SQL on Hadoop
Presto和Spark SQL類似 - 基于內(nèi)存解析SQL生成執(zhí)行計(jì)劃
Kylin - 用空間換時(shí)間,預(yù)計(jì)算
Druid - 一個(gè)支持?jǐn)?shù)據(jù)的實(shí)時(shí)攝入
ClickHouse - OLAP領(lǐng)域的Hbase左腔,單表查詢性能優(yōu)勢巨大
Greenpulm - OLAP領(lǐng)域的Postgresql
Hive & ClickHouse
對比:ClickHouse vs. Hive vs. Impala Comparison
典型的大數(shù)據(jù)分析架構(gòu) --> ClickHouse
Hive
https://en.wikipedia.org/wiki/Apache_Hive
Hive是一個(gè)數(shù)據(jù)倉庫基礎(chǔ)工具在Hadoop中用來處理結(jié)構(gòu)化數(shù)據(jù)唧垦。它架構(gòu)在Hadoop之上,總歸為大數(shù)據(jù)液样,并使得查詢和分析方便振亮。并提供簡單的sql查詢功能,可以將sql語句轉(zhuǎn)換為MapReduce任務(wù)進(jìn)行運(yùn)行鞭莽。
那么MapReduce又是什么坊秸?
MapReduce
Big Data & Hadoop: MapReduce Framework | EduPristine
ClickHouse
https://tech.bytedance.net/articles/6853622128044146696#heading3
https://tech.bytedance.net/articles/6908282140950167565
最快開源OLAP引擎澎怒!ClickHouse在頭條的技術(shù)演進(jìn)-InfoQ
Clickhouse的前世今生和優(yōu)缺點(diǎn)_Xlucas的博客-CSDN博客_clickhouse優(yōu)點(diǎn)缺點(diǎn)
列式數(shù)據(jù)庫
列式數(shù)據(jù)庫更適合于OLAP場景(對于大多數(shù)查詢而言褒搔,處理速度至少提高了100倍),下面詳細(xì)解釋了原因(通過圖片更有利于直觀理解):
行式
列式
看到差別了么喷面?下面將詳細(xì)介紹為什么會發(fā)生這種情況星瘾。
clickhouse基于列式進(jìn)行存儲,支持?jǐn)?shù)據(jù)壓縮惧辈,我們都知道死相,查詢IO的耗時(shí)操作主要有尋道時(shí)間和定位扇區(qū)時(shí)間和讀取時(shí)間,我們應(yīng)該盡可能減少尋道的時(shí)間咬像,所以順序?qū)懭氲膶懭肽芰κ潜入S機(jī)寫入大得多算撮,同時(shí)生宛,順序讀取同一個(gè)文件比隨機(jī)讀取可以減少磁盤的調(diào)度次數(shù),如果基于行來存儲肮柜,當(dāng)讀取多個(gè)行時(shí)候就需要多次尋道時(shí)間陷舅,如果改為列式存儲(一列一個(gè)文件),將大大減少IO的讀寫時(shí)間审洞。
在基于列式存儲之上莱睁,文件的數(shù)量大大減少,每一個(gè)列式存儲文件大小更大芒澜,因?yàn)榭梢愿咝У倪M(jìn)行數(shù)據(jù)壓縮仰剿,減少數(shù)據(jù)存儲量
向量化執(zhí)行引擎
在原本的查詢邏輯中,當(dāng)有多個(gè)數(shù)據(jù)到達(dá)CPU的時(shí)候痴晦,通常是串行作業(yè)南吮,一個(gè)寄存器處理多個(gè)數(shù)據(jù),串行執(zhí)行誊酌,CPU大部分時(shí)間都在遍歷查詢操作樹部凑,并沒有真正的去處理數(shù)據(jù),因而CPU利用率不高碧浊,處理數(shù)據(jù)的效率不高涂邀。如果是一批數(shù)據(jù)都是執(zhí)行相同的邏輯,那么可以基于SIMD執(zhí)行對數(shù)據(jù)并行執(zhí)行箱锐。
clickhouse 表引擎
clickhouse支持多種表引擎比勉,不同的表引擎支持不同的功能和特性,其中MergeTree表引擎是所有引擎的基礎(chǔ)驹止,其他的表引擎都是在此之上加上新的特征浩聋,
MergeTree表引擎 提供了數(shù)據(jù)分區(qū)、數(shù)據(jù)副本等功能
ReplacingMergeTree表引擎 提供了根據(jù)主鍵刪除重復(fù)數(shù)據(jù)功能
SummingMergeTree表引擎 支持按照建自動聚合數(shù)據(jù)
AggregatingMergeTree表引擎幢哨,預(yù)先對需要聚合的數(shù)據(jù)做預(yù)聚合并存儲
CollapsingMergeTree表引擎赡勘,通過新增一行(以增代刪)實(shí)現(xiàn)行級別粒度的刪除數(shù)據(jù),新增一行數(shù)據(jù)設(shè)置sign = 1捞镰,通過再次新增一行數(shù)據(jù)設(shè)置sign = -1 代表數(shù)據(jù)已刪除闸与,但要求sign = -1 的行數(shù)據(jù)要在待刪除數(shù)據(jù)之后。
VersionedCollapsingMergeTree表引擎通過版本號的機(jī)制實(shí)現(xiàn)了AggregatingMergeTree表引擎寫入順序的局限性岸售,對寫入順序無限制践樱。
MergeTree存儲結(jié)構(gòu)
Patition:表示分區(qū)目錄,只有當(dāng)數(shù)據(jù)進(jìn)入新的分區(qū)時(shí)才會創(chuàng)建凸丸。
checksums.txt:對分區(qū)文件數(shù)量和數(shù)量hash的保存拷邢。
count.txt:記錄數(shù)據(jù)的總行數(shù)。
primary.idx:一級索引屎慢,用于構(gòu)建稀疏索引瞭稼。
Column.bin: 列文件忽洛,每一列一個(gè)文件用于存儲某一列的數(shù)據(jù)信息。
Column.mrk:用于記錄稀疏索引到數(shù)據(jù)的具體偏移量环肘。
partition.dat:如果設(shè)置分區(qū)鍵會記錄分區(qū)最終表達(dá)式
minmax.idx:分區(qū)字段對應(yīng)原始數(shù)據(jù)的最小和最大值
索引生成以及查找流程
primary.idx用來存儲一級索引欲虚,也是稀疏索引,其中每一個(gè)索引代表的是一段數(shù)據(jù)悔雹,一批數(shù)據(jù)而并不是每一個(gè)數(shù)據(jù)复哆,因而稀疏索引可以使用較少的數(shù)據(jù)量來表示數(shù)據(jù),其中定義一批數(shù)據(jù)的量級在于參數(shù)index_granularity(通常值為8192)
這里假設(shè)使用countID作為主鍵腌零,那么索引的保存格式就是 countID+countID+...(要保持盡可能的緊湊)
而索引的查詢是基于遞歸的查詢區(qū)間 如果不存在次區(qū)間梯找,則直接剪枝優(yōu)化掉 如果存在此區(qū)間,就會判斷區(qū)間的數(shù)據(jù)量長度是否大于8益涧,如果大于8則拆分成8個(gè)小區(qū)間接下來遞歸查詢锈锤,類似于遞歸的八分查詢。
數(shù)據(jù)塊壓縮
每一列的數(shù)據(jù)都存儲在column.bin文件中饰躲,但并非是常規(guī)的把數(shù)據(jù)全部寫入bin文件牙咏,而是根據(jù)一定的批量對數(shù)據(jù)壓縮后作為一個(gè)整體壓縮到bin文件中臼隔,其中批量壓縮就需要有一個(gè)數(shù)據(jù)的批量標(biāo)記嘹裂,用于標(biāo)記壓縮前壓縮后的大小,其中就包含頭文件和壓縮數(shù)據(jù)摔握。 頭文件主要表達(dá)為 壓縮方法+壓縮后大小+壓縮前大小寄狼。
其中壓縮方式更是支持zstd LZ4等壓縮格式。
支持長字符串轉(zhuǎn)枚舉氨淌、差量信息壓縮等泊愧。