-
Learning resources
- 1. ClickHouse documentation (Chinese):
https://clickhouse.com/docs/zh
- 2. ClickHouse official site:
https://clickhouse.com/
- 3. ClickHouse on GitHub:
https://github.com/ClickHouse/ClickHouse
- 4. DBeaver download:
https://dbeaver.io/download/
-
Installation and deployment
Pull the server image:
docker pull yandex/clickhouse-server
Pull the client image: docker pull yandex/clickhouse-client
Start the server: docker run -d --name clickhouse-server --ulimit nofile=262144:262144 yandex/clickhouse-server
Start the client and connect it to the server: docker run -it --rm --link clickhouse-server:clickhouse-server yandex/clickhouse-client --host clickhouse-server
Common client options:
clickhouse-client
--host, -h : server hostname, default localhost
--port : connection port, default 9000
--user, -u : username, default default
--password : password, empty by default
--query, -q : query to run in non-interactive mode
--database, -d : database to operate on, default default
--multiline, -m : allow multi-line statements; by default the client treats Enter as the end of a statement, this option enables multi-line input
--format, -f : output results in the specified default format, e.g. CSV (comma-separated)
--time, -t : in non-interactive mode, print the query execution time
--stacktrace : print a stack trace when an exception occurs
--config-file : configuration file name
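Putting several of these options together, a one-off query in non-interactive mode might look like this (a sketch assuming a server reachable with the default host, port, and user listed above):

```
clickhouse-client -h localhost --port 9000 -u default \
  --time --format CSV \
  -q "SELECT name FROM system.databases"
```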
Stop clickhouse-server:
wudy.yu@wudyyudeMacBook-Pro ~ % docker stop clickhouse-server
clickhouse-server
Start with explicit ports and mounted volumes:
docker run -d --name clickhouse-server --ulimit nofile=262144:262144 -p 8123:8123 -p 9000:9000 -p 9009:9009 --privileged=true -v /Users/wudy.yu/docker-volumn/clickhouse/log:/var/log/clickhouse-server -v /Users/wudy.yu/docker-volumn/clickhouse/data:/var/lib/clickhouse yandex/clickhouse-server
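The same ports and mounts can equally be expressed as a docker-compose file (a sketch; the host paths simply mirror the command above):

```yaml
# docker-compose.yml - sketch equivalent of the docker run command above
version: "3"
services:
  clickhouse-server:
    image: yandex/clickhouse-server
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # native TCP interface
      - "9009:9009"   # interserver communication port
    volumes:
      - /Users/wudy.yu/docker-volumn/clickhouse/log:/var/log/clickhouse-server
      - /Users/wudy.yu/docker-volumn/clickhouse/data:/var/lib/clickhouse
```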
-
Basic operations
1. Create a database
ec430d220d73 :) CREATE DATABASE wudy
CREATE DATABASE wudy
Query id: d7f76e59-3fb6-4404-b482-f69657d3756f
Ok.
0 rows in set. Elapsed: 0.055 sec.
ec430d220d73 :) show databases
SHOW DATABASES
Query id: 0e8ac65d-f551-41a8-a816-960edfd0d7f9
┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default │
│ information_schema │
│ system │
│ wudy │
└────────────────────┘
5 rows in set. Elapsed: 0.026 sec.
2. Create a table
CREATE TABLE wudy.clickstream
(
`customer_id` String,
`time_stamp` Date,
`click_event_type` String,
`page_code` FixedString(20),
`source_id` UInt64
)
ENGINE = MergeTree() ORDER BY time_stamp
Query id: f8699293-3182-4d82-bb6b-557270d4de28
Ok.
0 rows in set. Elapsed: 0.064 sec.
3. Insert data
ec430d220d73 :) INSERT INTO wudy.clickstream VALUES('wudy.yu', '2023-07-09', 'create_order', 'home_enter', 800820082)
INSERT INTO wudy.clickstream FORMAT Values
Query id: fad68be9-39da-4076-a493-f5fedfeebbea
Ok.
1 rows in set. Elapsed: 0.110 sec.
ec430d220d73 :) INSERT INTO wudy.clickstream VALUES('peter.li', '2023-07-10', 'add_to_cart', 'home_enter', 812820097)
INSERT INTO wudy.clickstream FORMAT Values
Query id: bb05cff2-0fd6-4aaa-8864-ce1023ab9455
Ok.
1 rows in set. Elapsed: 0.074 sec.
ec430d220d73 :) INSERT INTO wudy.clickstream VALUES('timo.peng', '2023-12-12', 'add_to_cart', 'product_center', 832821831)
INSERT INTO wudy.clickstream FORMAT Values
Query id: d5789224-720d-4db0-b357-cb786e3e85c3
Ok.
1 rows in set. Elapsed: 0.051 sec.
ec430d220d73 :) INSERT INTO wudy.clickstream VALUES('tony.zhang', '2023-11-11', 'create_order', 'order_center', 812651931)
INSERT INTO wudy.clickstream FORMAT Values
Query id: ccd03fef-67c8-47d3-9d58-0496b75a96d6
Ok.
1 rows in set. Elapsed: 0.055 sec.
4. Query data
ec430d220d73 :) SELECT * FROM wudy.clickstream WHERE time_stamp > '2023-01-01'
SELECT *
FROM wudy.clickstream
WHERE time_stamp > '2023-01-01'
Query id: 3443d7f9-d7cc-4a50-96ad-2212981c0ba4
┌─customer_id─┬─time_stamp─┬─click_event_type─┬─page_code──┬─source_id─┐
│ peter.li │ 2023-07-10 │ add_to_cart │ home_enter │ 812820097 │
└─────────────┴────────────┴──────────────────┴────────────┴───────────┘
┌─customer_id─┬─time_stamp─┬─click_event_type─┬─page_code────┬─source_id─┐
│ tony.zhang │ 2023-11-11 │ create_order │ order_center │ 812651931 │
└─────────────┴────────────┴──────────────────┴──────────────┴───────────┘
┌─customer_id─┬─time_stamp─┬─click_event_type─┬─page_code──────┬─source_id─┐
│ timo.peng │ 2023-12-12 │ add_to_cart │ product_center │ 832821831 │
└─────────────┴────────────┴──────────────────┴────────────────┴───────────┘
┌─customer_id─┬─time_stamp─┬─click_event_type─┬─page_code──┬─source_id─┐
│ wudy.yu │ 2023-07-09 │ create_order │ home_enter │ 800820082 │
└─────────────┴────────────┴──────────────────┴────────────┴───────────┘
4 rows in set. Elapsed: 0.095 sec.
ec430d220d73 :) DESCRIBE wudy.clickstream
DESCRIBE TABLE wudy.clickstream
Query id: 460a3ee3-433d-4833-b934-5c6a796ec7ea
┌─name─────────────┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ customer_id │ String │ │ │ │ │ │
│ time_stamp │ Date │ │ │ │ │ │
│ click_event_type │ String │ │ │ │ │ │
│ page_code │ FixedString(20) │ │ │ │ │ │
│ source_id │ UInt64 │ │ │ │ │ │
└──────────────────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
5 rows in set. Elapsed: 0.019 sec.
// working with the billflow table
CREATE TABLE wudy.billflow
(
`id` Int64,
`batch_no` UUID,
`project_id` String,
`status` Enum8('unpaid' = 1, 'frozen' = 2, 'paid' = 3),
`create_time` DateTime,
`require_amount` Decimal(12, 8),
`del_flag` Bool
)
ENGINE = MergeTree() ORDER BY create_time
Query id: 1ceeeb80-43f4-43ad-99a6-944bbf1d4549
Ok.
0 rows in set. Elapsed: 0.057 sec.
ec430d220d73 :) DESCRIBE wudy.billflow
DESCRIBE TABLE wudy.billflow
Query id: 4fde98cd-77c7-4394-a745-859f5f6945a7
┌─name───────────┬─type──────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id │ Int64 │ │ │ │ │ │
│ batch_no │ UUID │ │ │ │ │ │
│ project_id │ String │ │ │ │ │ │
│ status │ Enum8('unpaid' = 1, 'frozen' = 2, 'paid' = 3) │ │ │ │ │ │
│ create_time │ DateTime │ │ │ │ │ │
│ require_amount │ Decimal(12, 8) │ │ │ │ │ │
│ del_flag │ Bool │ │ │ │ │ │
└────────────────┴───────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
7 rows in set. Elapsed: 0.018 sec.
ec430d220d73 :) INSERT INTO wudy.billflow VALUES ('20230710123456789', generateUUIDv4(), 'd7whdr3jd83yr8hde73dh37hfd', 'unpaid', '2100-01-01 11:11:11', 123.45678, TRUE)
INSERT INTO wudy.billflow FORMAT Values
Query id: c4eef0f0-cac6-4f46-a225-fc60d8918b4e
Ok.
1 rows in set. Elapsed: 0.150 sec.
ec430d220d73 :) INSERT INTO wudy.billflow VALUES ('20230710223457611', generateUUIDv4(), 'f6343234h4543m4543n45678n88', 'frozen', '2100-02-02 12:12:12', 888.99999, FALSE)
INSERT INTO wudy.billflow FORMAT Values
Query id: 7a1abdc0-ce27-43fc-99d9-b08cd5db2558
Ok.
1 rows in set. Elapsed: 0.066 sec.
ec430d220d73 :) INSERT INTO wudy.billflow VALUES ('20230711871524984', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2024-12-02 07:07:07', 666.77777, TRUE)
INSERT INTO wudy.billflow FORMAT Values
Query id: e9a91a8e-594a-4b15-abad-8b7b0bab2bcd
Connecting to clickhouse-server:9000 as user default.
Connected to ClickHouse server version 22.1.3 revision 54455.
Ok.
1 rows in set. Elapsed: 0.112 sec.
-
Build a bill test table
Primary key | SQL |
---|---|
1 | insert into wudy.bill_data values(1,101,'202308191052001','custom','201108003118','wudy.yu','new','10086','msp',99.99, '2023-08-19 22:57:00'); |
2 | insert into wudy.bill_data values(2,102,'202308191108001','custom','201108003119','張磊','new','30045','msp',66.66, '2023-08-19 23:08:11'); |
3 | insert into wudy.bill_data values(3,103,'202308191109002','custom','201108003120','彭夢(mèng)婷','renew','10011','msp',43.88, '2023-08-20 10:01:23'); |
4 | insert into wudy.bill_data values(4,104,'202308201001001','project','201108003121','孫鐵','change','20071','tenant',20.88, '2023-08-20 10:26:45'); |
Inside the ClickHouse container you can see 4 partition directories.
The computed total amount is 231.41.
-
Deduplication with the ReplacingMergeTree engine
There are two kinds of deduplication:
physical deduplication (duplicate rows are actually deleted) and query-time deduplication (the physical data is left untouched, but duplicates are filtered out of the query result)
Insert duplicate rows (孫鐵): every field is identical except version and require_amount
| Primary key | SQL |
| ------------- |:-------------:|
| 4 | insert into wudy.bill_data values(4,105,'202308201001001','project','201108003121','孫鐵','change','20071','tenant',105.88, '2023-08-20 10:26:45'); |
| 4 | insert into wudy.bill_data values(4,106,'202308201001001','project','201108003121','孫鐵','change','20071','tenant',106.88, '2023-08-20 10:26:45'); |
Approach 1: query deduplication with final. The query result is deduplicated, but the physical data is not (partition parts are not merged).
Approach 2: query deduplication with argMax. The query result is deduplicated, but the physical data is not (partition parts are not merged).
Approach 3: a plain query. Neither the query result nor the physical data is deduplicated (partition parts are not merged).
The final and argMax examples above were run against a local table, where final and argMax produce identical results. If, however, you experiment with a distributed table and the two rows land on different data shards (note: shards, not data partitions), the results of final and argMax diverge: final returns non-deduplicated results, because final can only deduplicate queries over the local table and cannot deduplicate data across shards, whereas argMax still returns deduplicated results. argMax compares its second argument, the version, to pick out the newest row and thereby filter out duplicates; it works by pulling each shard's data into memory on a single shard for the comparison, so it supports cross-shard deduplication.
None of the three query styles above deduplicates the physical data; the physical table still contains 6 rows.
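The difference between final and argMax on a distributed table can be sketched in plain Python (a simulation of the semantics described above, not ClickHouse code; the shard layout and row tuples are illustrative assumptions):

```python
# Simulate ReplacingMergeTree dedup semantics for FINAL vs argMax.
# Rows are (id, version, require_amount); duplicates share an id.

def final_dedup(shards):
    """FINAL deduplicates within each shard only (local-table behavior)."""
    result = []
    for shard in shards:
        latest = {}
        for rid, version, amount in shard:
            if rid not in latest or version > latest[rid][1]:
                latest[rid] = (rid, version, amount)
        result.extend(latest.values())
    return result

def argmax_dedup(shards):
    """argMax pulls all shards together and keeps the max-version row per id."""
    latest = {}
    for shard in shards:
        for rid, version, amount in shard:
            if rid not in latest or version > latest[rid][1]:
                latest[rid] = (rid, version, amount)
    return list(latest.values())

# the two 孫鐵 duplicates landing on different shards
shard_a = [(4, 105, 105.88)]
shard_b = [(4, 106, 106.88)]

print(final_dedup([shard_a, shard_b]))   # both rows survive: no cross-shard dedup
print(argmax_dedup([shard_a, shard_b]))  # only the max-version row survives
```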
Insert one more 孫鐵 row in which the id and version fields are the same and only the create time differs:
| Primary key | SQL |
| ------------- |:-------------:|
| 4 | insert into wudy.bill_data values(4,106,'202308201001001','project','201108003121','孫鐵','change','20071','tenant',106.88, '2023-08-21 20:12:23'); |
Manually trigger a partition merge: optimize table bill_data final;
-
Pros and cons of ReplacingMergeTree:
- 1. The primary key serves as the uniqueness key for identifying duplicates, and inserting rows with the same primary key is allowed.
- 2. Duplicate removal is triggered when partition parts are merged, but the merge timing is nondeterministic, so queries may still return duplicates before the merge happens; the data is eventually deduplicated. A merge can be forced with optimize, but that causes heavy reads and writes on the data and is not recommended in production.
- 3. Duplicates are removed per data partition: when parts of the same partition are merged, duplicates within that partition are deleted, while duplicates across different partitions are not.
- 4. Query-time deduplication via final or argMax yields correct query results whether or not the data has been merged.
-
Best practices for ReplacingMergeTree:
- 1. Plain select queries: for offline queries without tight latency requirements, rely on ClickHouse's automatic merging, but make sure records for the same business entity land in the same data partition, and, for distributed tables, in the same shard. This is the most efficient and most resource-friendly query style.
- 2. final queries: suitable for real-time queries. final deduplicates locally, so rows with the same primary key must land in the same shard, though not necessarily in the same data partition. This style is somewhat slower and costs some performance compared with a plain select, but if the where clause hits the primary index, secondary indexes, or the partition key well, the efficiency is entirely acceptable.
- 3. argMax queries: also usable for real-time queries, with the loosest requirements; it deduplicates no matter how you query. Because of how it is implemented, however, it is much slower and consumes a lot of resources, so it is not recommended. Section 9.4.3 later compares it with final using load-test data.
-
5. Comparison with MySQL
| ClickHouse | MySQL | Notes |
| --- | --- | --- |
| UInt8 | UNSIGNED TINYINT | |
| Int8 | TINYINT | |
| UInt16 | UNSIGNED SMALLINT | |
| Int16 | SMALLINT | |
| UInt32 | UNSIGNED INT, UNSIGNED MEDIUMINT | |
| Int32 | INT, MEDIUMINT | |
| UInt64 | UNSIGNED BIGINT | |
| Int64 | BIGINT | |
| Float32 | FLOAT | |
| Float64 | DOUBLE | |
| Date | DATE | |
| DateTime | DATETIME, TIMESTAMP | |
| FixedString | CHAR | |
| String | VARCHAR, BLOB, TEXT | |
| Decimal32(3) | DECIMAL | Decimal32: up to 9 significant digits, 3 of them after the decimal point (e.g. 123456.789); Decimal64: up to 18 digits; Decimal128: up to 38 digits |
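As a quick sanity check on those precision limits, the digit counts can be verified with plain Python (ordinary arithmetic, not ClickHouse itself):

```python
# Digit-count limits behind ClickHouse Decimal32/64/128
limits = {"Decimal32": 9, "Decimal64": 18, "Decimal128": 38}

value = 123456.789  # fits Decimal32(3): 9 digits total, 3 fractional
digits = len(str(value).replace(".", "").lstrip("-"))
print(digits <= limits["Decimal32"])  # True
```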
Modifying data:
1. Early versions of ClickHouse did not support mutable data.
2. Transactions are not supported; prefer batched operations and avoid frequent, small modifications.
3. Deletes and updates are asynchronous: the statement returns immediately, but the mutation may not have finished yet.
Caveat: every update or delete discards the target data's original part and rebuilds a new one.
ec430d220d73 :) ALTER TABLE wudy.billflow UPDATE require_amount = 222.33333 WHERE id = '20230710123456789'
ALTER TABLE wudy.billflow
UPDATE require_amount = 222.33333 WHERE id = '20230710123456789'
Query id: 00f3a410-47c5-4295-9398-3eafc06b9121
Ok.
0 rows in set. Elapsed: 0.062 sec.
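Since mutations are asynchronous, completion can be confirmed through the system.mutations table. A sketch reusing the billflow table above (the DELETE predicate is purely illustrative):

```sql
-- DELETE is also a mutation and returns immediately
ALTER TABLE wudy.billflow DELETE WHERE del_flag = 1;

-- check whether the mutation has actually finished (is_done = 1)
SELECT mutation_id, command, is_done
FROM system.mutations
WHERE database = 'wudy' AND table = 'billflow';
```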
-
6. ClickHouse sharding and partitioning
-
7. ClickHouse aggregate function combinators
Combinators let you extend and mix aggregations to fit a variety of data structures and simplify queries, without making structural changes to the data.
How to use combinators
1. Common aggregate functions
sumIf, sumArrayIf
2. Adding conditions to an aggregation
Sometimes we need to aggregate only the data matching a specific condition. Instead of a WHERE clause, we can use the If combinator and pass the condition as the last argument of the combined function.
3. Building test data
- bill table data
CREATE TABLE wudy.billflow
(
`id` Int64,
`batch_no` UUID,
`project_id` String,
`status` Enum8('unpaid' = 1, 'frozen' = 2, 'paid' = 3),
`create_time` DateTime,
`modify_time` DateTime,
`require_amount` Decimal(12, 8),
`del_flag` Bool
)
ENGINE = MergeTree() ORDER BY create_time
Insert the following rows manually:
INSERT INTO wudy.billflow VALUES ('20230710123456789', generateUUIDv4(), 'd7whdr3jd83yr8hde73dh37hfd', 'unpaid', '2023-01-01 11:11:11','2023-01-01 11:11:11', 50, TRUE)
INSERT INTO wudy.billflow VALUES ('20230710223457611', generateUUIDv4(), 'f6343234h4543m4543n45678n88', 'frozen', '2023-02-02 12:12:12', '2023-02-02 12:12:12',100, FALSE)
INSERT INTO wudy.billflow VALUES ('20230711871524984', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2023-12-02 07:07:07', '2023-12-02 07:09:17',120, TRUE)
INSERT INTO wudy.billflow VALUES ('20231230871526714', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2023-12-30 20:45:13', '2023-12-30 20:45:23',130, TRUE)
INSERT INTO wudy.billflow VALUES ('20231230877152494', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2023-12-30 21:50:19','2023-12-30 21:53:50', 140, TRUE)
INSERT INTO wudy.billflow VALUES ('20231231123456789', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'unpaid', '2023-12-31 01:26:13','2023-12-31 01:26:13', 200, TRUE)
INSERT INTO wudy.billflow VALUES ('20231231223456789', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'unpaid', '2023-12-31 02:26:13', '2023-12-31 02:26:13',220, TRUE)
INSERT INTO wudy.billflow VALUES ('20231231323456789', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'unpaid', '2023-12-31 03:26:13', '2023-12-31 03:26:13',240, TRUE)
INSERT INTO wudy.billflow VALUES ('20231231343456789', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'frozen', '2023-12-31 04:26:13', '2023-12-31 04:26:13',250, TRUE)
4. Suppose we want the total amount, but only for bills whose status is paid:
SELECT sumIf(require_amount, status = 'paid')
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: 81af7f66-bb33-44d8-b195-6cb9a4d03422
┌─sumIf(require_amount, equals(status, 'paid'))─┐
│ 390 │
└───────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.064 sec.
We can also get the total amount of the bills whose modify time is more than one minute later than their create time:
SELECT sumIf(require_amount, (status = 'paid') AND (modify_time > (create_time + toIntervalMinute(1)))) AS project_total_amount
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: 43d30c11-40a2-402d-9946-d1b3a15889ad
┌─project_total_amount─┐
│ 260 │
└──────────────────────┘
The main advantage of the If combinator over a standard WHERE clause is the ability to compute several sums over different subsets at once, and any available aggregate function can be used in combined form, such as countIf(), avgIf(), or quantileIf(). Combining these, we can aggregate over multiple conditions and functions in a single query:
SELECT
countIf((status = 'paid') AND (modify_time > (create_time + toIntervalMinute(1)))) AS project_paid_bill_count,
sumIf(require_amount, (status = 'paid') AND (modify_time > (create_time + toIntervalMinute(1)))) AS project_paid_total_amount,
countIf(status = 'unpaid') AS project_unpaid_bill_count,
avgIf(require_amount, status = 'unpaid') AS project_unpaid_avg_amount
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: e056b6e1-b1b6-45d4-944f-63b75a78a14b
┌─project_paid_bill_count─┬─project_paid_total_amount─┬─project_unpaid_bill_count─┬─project_unpaid_avg_amount─┐
│ 2 │ 260 │ 3 │ 220 │
└─────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────┘
1 rows in set. Elapsed: 0.018 sec.
5. Aggregating only unique entries
Counting unique entries is a common need. ClickHouse can do this with COUNT(DISTINCT col) (identical to uniqExact), or with uniq() when an estimated (but faster) value is good enough. Sometimes, though, we want to use only the unique values of a column inside other aggregate functions. For that, there is the Distinct combinator:
// insert one more bill containing a duplicate amount
INSERT INTO wudy.billflow VALUES ('20231230877851342', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2023-12-30 21:55:19','2023-12-30 21:55:50', 140, TRUE)
SELECT
countDistinct(toHour(create_time)) AS hours,
avgDistinct(toHour(create_time)) AS avg_hour,
sumDistinct(require_amount) AS sum_amount
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: 7b91264e-4252-4134-ad4e-abd3a9f7e5a7
┌─hours─┬──────────avg_hour─┬─sum_amount─┐
│ 7 │ 8.285714285714286 │ 1300 │
└───────┴───────────────────┴────────────┘
1 rows in set. Elapsed: 0.016 sec.
- 6. Combining Distinct and If
Since combinators can be combined with each other, we can use the previous ones together in the avgDistinctIf function for more advanced logic:
SELECT avgDistinctIf(toHour(create_time), require_amount > 200) AS avg_hour
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: 561b03c3-1291-439c-8452-e658df3147e2
┌─avg_hour─┐
│ 3 │
└──────────┘
1 rows in set. Elapsed: 0.018 sec.
- 7. Controlling the aggregate value for empty results
Aggregate functions react differently when the result set contains no data: for example, count() returns 0 while avg() produces nan.
We can control this behavior with the OrDefault() and OrNull() combinators. Both change the value the aggregate function returns when the data set is empty:
- OrDefault() returns the function's default value instead of nan
- OrNull() returns NULL (and also changes the return type to Nullable)
SELECT
count(),
countOrNull(),
countOrDefault(),
avg(require_amount),
avgOrDefault(require_amount),
sumOrNull(require_amount)
FROM billflow
WHERE project_id = '123456789'
Query id: 69cef755-6012-4222-9e65-992ccbf0a1fc
┌─count()─┬─countOrNull()─┬─countOrDefault()─┬─avg(require_amount)─┬─avgOrDefault(require_amount)─┬─sumOrNull(require_amount)─┐
│ 0 │ ᴺᵁᴸᴸ │ 0 │ nan │ 0 │ ᴺᵁᴸᴸ │
└─────────┴───────────────┴──────────────────┴─────────────────────┴──────────────────────────────┴───────────────────────────┘
1 rows in set. Elapsed: 0.011 sec.
countOrNull() returns NULL instead of 0, and avgOrDefault() returns 0 instead of nan.
- 8. Using them with other combinators
Like all other combinators, OrNull() and OrDefault() can be combined with different combinators for more advanced logic:
SELECT
sumIfOrNull(require_amount, status = 'paid') AS sum_paid_amount,
countIfDistinctOrNull(require_amount, status = 'paid') AS paid_distinct
FROM billflow
WHERE require_amount > 120
Query id: 616f57e1-7a4e-4833-8b84-8473f5ccc968
┌─sum_paid_amount─┬─paid_distinct─┐
│ 410 │ 2 │
└─────────────────┴───────────────┘
1 rows in set. Elapsed: 0.012 sec.
We used the sumIfOrNull() combined function to sum the paid bills while returning NULL on an empty set. The countIfDistinctOrNull() function counts distinct require_amount values, but only over the rows that satisfy the given condition.
- 9. Aggregating arrays
ClickHouse's Array type is popular with its users because it adds a lot of flexibility to table structures. To operate on Array columns efficiently, ClickHouse provides a set of array functions, and to simplify aggregating over the Array type it provides the Array() combinator, which applies a given aggregate function to all the values inside an array column instead of to the arrays themselves.
The following table stores per-section reading data for articles. When a user reads an article, the sections read are saved in the sections array column and the corresponding reading times in the times column:
CREATE TABLE article_reads
(
`time` DateTime,
`article_id` UInt32,
`sections` Array(UInt16),
`times` Array(UInt16),
`user_id` UInt32
)
ENGINE = MergeTree
ORDER BY (article_id, time)
INSERT INTO wudy.article_reads VALUES ('2023-12-30 21:55:19', 10, [16,18,7,21,23,22,11,19,9,8], [82,96,294,253,292,66,44,256,222,86], 32578)
INSERT INTO wudy.article_reads VALUES ('2023-12-31 11:12:19', 10, [21,8],[30,176], 11086)
INSERT INTO wudy.article_reads VALUES ('2023-12-31 13:22:33', 10, [24,11,23,9],[178,177,172,105], 12345)
We use the uniqArray() function to count the unique sections read per article, and avgArray() to get the average time per section:
SELECT
article_id,
uniqArray(sections) AS sections_read,
round(avgArray(times)) AS time_per_section
FROM wudy.article_reads
GROUP BY article_id
Query id: f0f4ac7f-b41d-4b4b-9041-ac2d65dc42ae
┌─article_id─┬─sections_read─┬─time_per_section─┐
│ 10 │ 11 │ 158 │
└────────────┴───────────────┴──────────────────┘
1 rows in set. Elapsed: 0.045 sec.
We can use the minArray() and maxArray() functions to get the minimum and maximum reading times across all articles:
SELECT
minArray(times),
maxArray(times)
FROM wudy.article_reads
Query id: 7dc5c2ed-f809-4ccd-94c6-9cf1d995b361
┌─minArray(times)─┬─maxArray(times)─┐
│ 30 │ 294 │
└─────────────────┴─────────────────┘
1 rows in set. Elapsed: 0.009 sec.
We can also use the groupUniqArray() function with the Array() combinator to get the list of sections read for each article:
SELECT
article_id,
groupUniqArrayArray(sections)
FROM wudy.article_reads
GROUP BY article_id
Query id: c8169602-4b80-401a-83e8-1ea8482b2576
┌─article_id─┬─groupUniqArrayArray(sections)───┐
│ 10 │ [16,11,8,24,9,23,19,7,21,18,22] │
└────────────┴─────────────────────────────────┘
1 rows in set. Elapsed: 0.010 sec.
Another popular function is any(), which returns an arbitrary column value under aggregation; it can also be combined with Array:
SELECT
article_id,
anyArray(sections)
FROM wudy.article_reads
GROUP BY article_id
Query id: 06101fab-b72d-4d9f-a171-514217258b80
┌─article_id─┬─anyArray(sections)─┐
│ 10 │ 16 │
└────────────┴────────────────────┘
1 rows in set. Elapsed: 0.007 sec.
Using Array with other combinators
- The Array combinator can be used together with any other combinator:
SELECT
article_id,
sumArrayIfOrNull(times, length(sections) > 8)
FROM wudy.article_reads
GROUP BY article_id
Query id: ef46503f-d703-4eb4-b93d-c873c2d3a98a
┌─article_id─┬─sumArrayOrNullIf(times, greater(length(sections), 8))─┐
│ 10 │ 1691 │
└────────────┴───────────────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.045 sec.
We used the sumArrayIfOrNull() function to compute the total reading time for articles where more than eight sections were read. Note that NULL is returned for articles with eight or fewer sections read, because we also used the OrNull() combinator:
INSERT INTO wudy.article_reads VALUES ('2023-12-31 13:22:33', 11, [56,15,13,9],[247,81,22,5], 12345)
SELECT
article_id,
sumArrayIfOrNull(times, length(sections) > 8)
FROM wudy.article_reads
GROUP BY article_id
Query id: fee32e91-2030-4979-9364-7c3ab147f1ce
┌─article_id─┬─sumArrayOrNullIf(times, greater(length(sections), 8))─┐
│ 11 │ ᴺᵁᴸᴸ │
│ 10 │ 1691 │
└────────────┴───────────────────────────────────────────────────────┘
2 rows in set. Elapsed: 0.026 sec.
Combining array functions with combinators lets us handle even more advanced cases:
SELECT
article_id,
countArray(arrayFilter(x -> (x > 120), times)) AS sections_engaged
FROM wudy.article_reads
GROUP BY article_id
Query id: 9df392d7-e5b4-4391-89ad-3a29c824e29c
┌─article_id─┬─sections_engaged─┐
│ 11 │ 1 │
│ 10 │ 9 │
└────────────┴──────────────────┘
2 rows in set. Elapsed: 0.028 sec.
We first use the arrayFilter function to drop every value under 120 seconds from the times array, then use countArray to count the filtered times per article.
- 10. Aggregating maps
Another powerful type in ClickHouse is Map. As with arrays, we can use the Map() combinator to apply aggregations to this type:
CREATE TABLE page_loads
(
`time` DateTime,
`url` String,
`params` Map(String, UInt32)
)
ENGINE = MergeTree
ORDER BY (url, time)
INSERT INTO wudy.page_loads VALUES ('2023-12-31 11:12:19', 'http://baidu.com', {'load_speed':100, 'scroll_depth':59})
INSERT INTO wudy.page_loads VALUES ('2024-01-01 13:14:15', 'http://ifeng.com', {'load_speed':200, 'scroll_depth':87})
INSERT INTO wudy.page_loads VALUES ('2024-01-01 13:14:15', 'http://google.com', {'scroll_depth':87})
We can apply the Map() combinator to the sum() and avg() functions to get the total load time and the average scroll depth:
SELECT
sumMap(params)['load_speed'] AS total_load_time,
avgMap(params)['scroll_depth'] AS average_scroll
FROM page_loads
Query id: 6fcd08b5-bf09-4b2f-93be-82d66366a0cb
┌─total_load_time─┬────average_scroll─┐
│ 300 │ 77.66666666666667 │
└─────────────────┴───────────────────┘
1 rows in set. Elapsed: 0.021 sec.
The Map() combinator can also be used together with other combinators:
SELECT sumMapIf(params, url = 'http://google.com')['scroll_depth'] AS average_scroll
FROM page_loads
Query id: 93fc7e18-498e-4e03-9628-e29f3e189dc7
┌─average_scroll─┐
│ 87 │
└────────────────┘
1 rows in set. Elapsed: 0.024 sec.
- 11. Aggregating corresponding array values
Another way to handle array columns is to aggregate the corresponding values of two arrays, producing another array. This is useful for vectorized data such as vectors or matrices, and is done with the ForEach() combinator:
CREATE TABLE article_reads_2
(
`time` DateTime,
`article_id` UInt32,
`sections` Array(UInt16),
`times` Array(UInt16),
`user_id` UInt32
)
ENGINE = MergeTree
ORDER BY (article_id, time)
INSERT INTO wudy.article_reads_2 VALUES ('2023-12-31 11:12:19', 10, [21,8],[30,176], 11086);
INSERT INTO wudy.article_reads_2 VALUES ('2024-01-01 11:12:19', 10, [33,10],[10,190,200], 11086);
SELECT * FROM article_reads_2
Query id: 9902fc4d-27b7-4631-94c4-6cd977e65236
┌────────────────time─┬─article_id─┬─sections─┬─times────────┬─user_id─┐
│ 2024-01-01 11:12:19 │ 10 │ [33,10] │ [10,190,200] │ 11086 │
└─────────────────────┴────────────┴──────────┴──────────────┴─────────┘
┌────────────────time─┬─article_id─┬─sections─┬─times────┬─user_id─┐
│ 2023-12-31 11:12:19 │ 10 │ [21,8] │ [30,176] │ 11086 │
└─────────────────────┴────────────┴──────────┴──────────┴─────────┘
2 rows in set. Elapsed: 0.022 sec.
373168478a29 :) SELECT avgForEach(times) FROM article_reads_2;
SELECT avgForEach(times) FROM article_reads_2
Query id: b2c28a3f-aaa6-4c19-9f99-5f07c12ec243
┌─avgForEach(times)─┐
│ [20,183,200] │
└───────────────────┘
1 rows in set. Elapsed: 0.019 sec.
This asks ClickHouse to compute the average of the first elements of all the arrays and put it into the first element of the result array, then repeat the same operation for the second and third elements.
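The element-wise semantics can be mimicked in a few lines of Python (a sketch of what avgForEach computes, not ClickHouse code):

```python
def avg_for_each(rows):
    """Average position i across all arrays; shorter arrays simply
    stop contributing beyond their length (mirroring avgForEach)."""
    width = max(len(r) for r in rows)
    out = []
    for i in range(width):
        vals = [r[i] for r in rows if i < len(r)]
        out.append(sum(vals) / len(vals))
    return out

# the two `times` arrays inserted above
print(avg_for_each([[30, 176], [10, 190, 200]]))  # [20.0, 183.0, 200.0]
```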
Using ForEach with other combinators is also supported:
// Example 1
SELECT avgForEachIf(times, user_id = '11086')
FROM article_reads_2
Query id: 58a97c3b-d947-4220-912a-6ee396902536
┌─avgForEachIf(times, equals(user_id, '11086'))─┐
│ [20,183,200] │
// Example 2
373168478a29 :) SELECT avgForEachIfOrDefault(times, user_id='11086') FROM article_reads_2;
Query id: 158f3b31-79cd-489f-a794-608213f7e32c
┌─avgForEachIfOrDefault(times, equals(user_id, '11086'))─┐
│ [20,183,200] │
└────────────────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.021 sec.
// Example 3
373168478a29 :) SELECT avgForEachIfOrDefault(times, user_id='32578') FROM article_reads_2;
Query id: c1a62211-34d4-415d-8de8-e803b17a48a6
┌─avgForEachIfOrDefault(times, equals(user_id, '32578'))─┐
│ [] │
└────────────────────────────────────────────────────────┘
- 12. Working with aggregation states
ClickHouse lets you work with intermediate aggregation states instead of final values. Suppose we need to count unique values, but we do not want to store the values themselves (they take up space). In that case, we can use the State() combinator with the uniq() function to save the intermediate aggregation state, and later use the Merge() combinator to compute the actual value:
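A minimal sketch of the State/Merge pair, reusing the article_reads table from the array section (the daily_uniques table name is hypothetical):

```sql
-- store the intermediate state of uniq() instead of the raw user_ids
CREATE TABLE wudy.daily_uniques
(
    `date` Date,
    `uniq_users` AggregateFunction(uniq, UInt32)
)
ENGINE = AggregatingMergeTree
ORDER BY date;

INSERT INTO wudy.daily_uniques
SELECT toDate(time) AS date, uniqState(user_id) AS uniq_users
FROM wudy.article_reads
GROUP BY date;

-- Merge() turns the stored states back into the final value
SELECT date, uniqMerge(uniq_users) AS unique_users
FROM wudy.daily_uniques
GROUP BY date;
```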
-
8. ClickHouse materialized views
A materialized view is a special kind of trigger: when data is inserted, it stores the result of running a SELECT query over that data into a target table.
Example 1: create a materialized view for a common query over a large data set
// create the wikistat table
CREATE TABLE wudy.wikistat
(
`time` DateTime CODEC(Delta(4), ZSTD(1)),
`project` LowCardinality(String),
`subproject` LowCardinality(String),
`path` String,
`hits` UInt64
)
ENGINE = MergeTree
ORDER BY (path, time);
// prepare the data
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:10:00', '201108003118','201108003119','www.baidu.com', 1023);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:20:00', '201108003118','201108003119','www.baidu.com', 987);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:30:00', '201108003118','201108003119','www.baidu.com', 769);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:40:00', '201108003118','201108003119','www.ifeng.com', 2000);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:00:00', '201108003118','201108003119','www.ifeng.com', 1201);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:10:00', '201108003118','201108003119','www.baidu.com', 1641);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:30:00', '201108003118','201108003119','www.ifeng.com', 1562);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:30:01', '201108003118','201108003119','www.baidu.com', 911);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:22:00', '201108003118','201108003101','www.baidu.com', 987);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:31:00', '201108003118','201108003102','www.baidu.com', 769);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:44:05', '201108003118','201108003103','www.ifeng.com', 2000);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:03:07', '201108003118','201108003104','www.ifeng.com', 1201);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:35:02', '201108003118','201108003105','www.baidu.com', 1641);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:31:00', '201108003118','201108003106','www.ifeng.com', 1562);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:38:01', '201108003118','201108003107','www.baidu.com', 911);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:20:00', '201108003122','201108003112','www.baidu.com', 987);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:30:00', '201108003122','201108003144','www.baidu.com', 769);
INSERT INTO wudy.wikistat VALUES('2020-10-02 11:40:00', '201108003122','201108003123','www.ifeng.com', 2000);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:00:00', '201108003122','201108003133','www.ifeng.com', 121);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:10:00', '201108003101','201108003142','www.baidu.com', 577);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:30:00', '201108003101','201108003107','www.ifeng.com', 224);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:30:01', '201108003101','201108003105','www.baidu.com', 2091);
INSERT INTO wudy.wikistat VALUES('2020-10-03 11:20:00', '201108003122','201108003138','www.baidu.com', 987);
INSERT INTO wudy.wikistat VALUES('2020-10-03 11:30:00', '201108003122','201108003138','www.baidu.com', 769);
INSERT INTO wudy.wikistat VALUES('2020-10-04 11:40:00', '201108003122','201108003112','www.ifeng.com', 2000);
INSERT INTO wudy.wikistat VALUES('2020-10-04 12:00:00', '201108003122','201108003127','www.ifeng.com', 121);
INSERT INTO wudy.wikistat VALUES('2020-10-04 13:10:00', '201108003101','201108003118','www.baidu.com', 577);
INSERT INTO wudy.wikistat VALUES('2020-10-03 12:30:00', '201108003101','201108003119','www.ifeng.com', 224);
INSERT INTO wudy.wikistat VALUES('2020-10-03 13:30:01', '201108003101','201108003120','www.baidu.com', 2091);
Suppose we frequently query the most popular projects for a given date:
373168478a29 :) SELECT
project,
sum(hits) AS h
FROM wikistat
WHERE date(time) = '2020-10-01'
GROUP BY project
ORDER BY h DESC
LIMIT 10
Query id: a3dccadd-2797-413a-88c7-82a920f3b38a
┌─project──────┬────h─┐
│ 201108003118 │ 8535 │
│ 201108003122 │ 1756 │
└──────────────┴──────┘
2 rows in set. Elapsed: 0.014 sec.
If this query is slow, we run many queries like it, and we need millisecond-level performance from ClickHouse, we can create a materialized view for it:
// wikistat_top_project is the name of the table that stores the materialized view's data
// it uses the SummingMergeTree engine (which merges asynchronously), because we want to sum the hits value per date/project
373168478a29 :) CREATE TABLE wikistat_top_project
(
`date` Date,
`project` LowCardinality(String),
`hit` UInt32
)
ENGINE = SummingMergeTree
ORDER BY (date, project);
// wikistat_top_project_mv is the name of the materialized view itself (the trigger)
// the query after AS is what builds the materialized view
373168478a29 :) CREATE MATERIALIZED VIEW wikistat_top_project_mv TO wikistat_top_project AS
SELECT
date(time) AS date,
project,
sum(hits) AS hits
FROM wikistat
GROUP BY
date,
project
Query id: c9df59fe-2024-4ab5-b9d5-9c10c38194d9
Ok.
0 rows in set. Elapsed: 0.010 sec.
Note: we can create any number of materialized views, but each new one adds storage overhead, so keep the total reasonable, e.g. no more than 10 materialized views per table.
Now we populate the materialized view's target table using the same query as for the wikistat table:
373168478a29 :) INSERT INTO wikistat_top_project SELECT
date(time) AS date,
project,
sum(hits) AS hits
FROM wikistat
GROUP BY
date,
project
Query id: 7c0c3c47-7c6d-427a-8fcf-8d78c0c9c6ab
Ok.
0 rows in set. Elapsed: 0.018 sec.
1. Querying the materialized view
373168478a29 :) SELECT
project,
sum(hit) AS hits
FROM wikistat_top_project
WHERE date = '2020-10-01'
GROUP BY project
ORDER BY hits DESC
LIMIT 10;
Query id: 1d625f2d-bf7a-44aa-9f61-fa489871083d
┌─project──────┬─hits─┐
│ 201108003118 │ 8535 │
│ 201108003122 │ 1756 │
└──────────────┴──────┘
2 rows in set. Elapsed: 0.010 sec.
Insert one more row matching the materialized view's query, and notice that the data returned from the materialized view's target table does not change (likely because the view's SELECT aliases the sum as hits while the target column is named hit, so the materialized rows never populate it):
373168478a29 :) INSERT INTO wudy.wikistat VALUES('2020-10-01 15:20:00', '201108003122','201108003136','www.baidu.com', 1000);
373168478a29 :) SELECT
project,
sum(hit) AS hits
FROM wikistat_top_project
WHERE date = '2020-10-01'
GROUP BY project
ORDER BY hits DESC
LIMIT 10
Query id: e8865895-75f2-4fb1-8d47-2108fc6676f5
┌─project──────┬─hits─┐
│ 201108003118 │ 8535 │
│ 201108003122 │ 1756 │
└──────────────┴──────┘
2 rows in set. Elapsed: 0.072 sec.
2. Managing materialized views
Use a SHOW TABLES query to list materialized views:
373168478a29 :) SHOW TABLES LIKE 'wikistat_top_project_mv'
Query id: 58033394-3567-457e-8e84-b578efe4a4d2
┌─name────────────────────┐
│ wikistat_top_project_mv │
└─────────────────────────┘
1 rows in set. Elapsed: 0.008 sec.
A materialized view can be dropped with DROP TABLE, but that only deletes the trigger itself:
DROP TABLE wikistat_top_project_mv
If the target table is no longer needed, remember to drop it as well:
DROP TABLE wikistat_top_project
3. Getting the on-disk size of a materialized view
The metadata of materialized view tables is stored in the system database:
373168478a29 :) SELECT formatReadableSize(total_bytes) AS total_bytes_on_disk
FROM system.tables
WHERE table = 'wikistat_top_project'
Query id: a1b5808d-4cb7-4786-9880-613025c30e13
┌─total_bytes_on_disk─┐
│ 326.00 B │
└─────────────────────┘
1 rows in set. Elapsed: 0.009 sec.
Speeding up queries with materialized views
Materialized views are a way to improve query performance. In analytical queries, common aggregations go beyond the sum() used in the earlier examples. SummingMergeTree is well suited to computing summed data, but more advanced aggregations can be computed with the AggregatingMergeTree engine.
Suppose we frequently run queries of the following kind:
SELECT
toDate(time) AS date,
min(hits) AS min_hits_per_hour,
max(hits) AS max_hits_per_hour,
avg(hits) AS avg_hits_per_hour
FROM wikistat
WHERE project = '201108003122'
GROUP BY date
Query id: bc6c89ce-1ba6-40be-811d-3122ef4e88af
// per-day minimum, maximum, and average hourly hits for the given project:
┌───────date─┬─min_hits_per_hour─┬─max_hits_per_hour─┬─avg_hits_per_hour─┐
│ 2020-10-01 │ 769 │ 1000 │ 939 │
│ 2020-10-02 │ 121 │ 2000 │ 1060.5 │
│ 2020-10-03 │ 769 │ 987 │ 878 │
│ 2020-10-04 │ 121 │ 2000 │ 1060.5 │
└────────────┴───────────────────┴───────────────────┴───────────────────┘
We can use a materialized view to store these aggregation results for faster retrieval, defining them with state combinators. State combinators tell ClickHouse to save the internal aggregation state rather than the final aggregated value. This allows aggregations to be performed without keeping every record with its raw values. The approach is simple: use the *State() functions when creating the materialized view, then use the corresponding *Merge() functions at query time to obtain the correct aggregated result.
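The reason states must be stored instead of finished values can be sketched outside ClickHouse. Below is a minimal Python sketch (the `AvgState` class and its method names are invented for illustration, not ClickHouse APIs): an avg state keeps a (sum, count) pair, so merging partial states from different insert blocks yields the exact global average, which averaging already-finished averages would not.

```python
# Sketch of *State()/*Merge() semantics (illustrative, not ClickHouse API).
# An "avg state" keeps (sum, count) instead of the finished average, so
# partial states from different inserts/parts can be merged exactly.

class AvgState:
    def __init__(self, total=0, count=0):
        self.total = total
        self.count = count

    @classmethod
    def of(cls, values):          # like avgState(hits) over one insert block
        return cls(sum(values), len(values))

    def merge(self, other):       # like avgMerge() combining stored states
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self):           # the final aggregated value
        return self.total / self.count

# Two insert blocks for the same (project, date) key:
block1 = AvgState.of([769, 1000])        # avg of this block alone: 884.5
block2 = AvgState.of([121, 2000, 987])   # avg of this block alone: 1036.0

merged = block1.merge(block2)
print(merged.finalize())                 # 975.4, the exact avg of all five values
print((884.5 + 1036.0) / 2)              # 960.25: averaging averages is wrong
```

This is why the target table must use the AggregateFunction type rather than a plain numeric column: the column holds the mergeable state, not a number.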
We will use the min, max, and avg states. In the new materialized view's target table we store aggregation states instead of values, using the AggregateFunction type:
CREATE TABLE wikistat_daily_summary
(
`project` String,
`date` Date,
`min_hits_per_hour` AggregateFunction(min, UInt64),
`max_hits_per_hour` AggregateFunction(max, UInt64),
`avg_hits_per_hour` AggregateFunction(avg, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (project, date)
CREATE MATERIALIZED VIEW wikistat_daily_summary_mv TO wikistat_daily_summary AS
SELECT
project,
toDate(time) AS date,
minState(hits) AS min_hits_per_hour,
maxState(hits) AS max_hits_per_hour,
avgState(hits) AS avg_hits_per_hour
FROM wikistat
GROUP BY
project,
date
現(xiàn)在柿冲,讓我們?yōu)樗畛鋽?shù)據(jù):
INSERT INTO wikistat_daily_summary SELECT
project,
toDate(time) AS date,
minState(hits) AS min_hits_per_hour,
maxState(hits) AS max_hits_per_hour,
avgState(hits) AS avg_hits_per_hour
FROM wikistat
GROUP BY
project,
date
At query time, we use the corresponding Merge combinators to retrieve the values:
SELECT
date,
minMerge(min_hits_per_hour) AS min_hits_per_hour,
maxMerge(max_hits_per_hour) AS max_hits_per_hour,
avgMerge(avg_hits_per_hour) AS avg_hits_per_hour
FROM wikistat_daily_summary
WHERE project = '201108003122'
GROUP BY date
Query id: 7ac5bb0e-94b9-4ee1-a85b-ebd39823073a
┌───────date─┬─min_hits_per_hour─┬─max_hits_per_hour─┬─avg_hits_per_hour─┐
│ 2020-10-01 │ 769 │ 1000 │ 939 │
│ 2020-10-02 │ 121 │ 2000 │ 1060.5 │
│ 2020-10-03 │ 769 │ 987 │ 878 │
│ 2020-10-04 │ 121 │ 2000 │ 1060.5 │
└────────────┴───────────────────┴───────────────────┴───────────────────┘
Note: any aggregate function can be used with the State/Merge combinators as part of an aggregating materialized view.
Validating and filtering data
Another popular use of materialized views is processing data immediately after insertion; data validation is a good example.
Suppose we want to filter out every path containing unwanted characters before saving rows to the result table:
373168478a29 :) SELECT count(*) FROM wikistat
┌─count()─┐
│ 31 │
└─────────┘
1 rows in set. Elapsed: 0.021 sec.
373168478a29 :) SELECT count(*) FROM wikistat WHERE NOT match(path, 'baidu')
┌─count()─┐
│ 13 │
└─────────┘
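The match(path, 'baidu') predicate above is an unanchored regular-expression search over the string (ClickHouse uses RE2 syntax), so the two counts split the rows by whether the path contains 'baidu' anywhere. A rough Python equivalent using re.search, with sample paths taken from the inserts later in this section:

```python
import re

# match(path, 'baidu') in ClickHouse is an unanchored regex search;
# Python's re.search behaves the same way for this simple pattern.
paths = ["www.baidu.com", "www.ifeng.com", "www.huawei.com", "www.csdn.com"]

matching = [p for p in paths if re.search("baidu", p)]
non_matching = [p for p in paths if not re.search("baidu", p)]

print(len(matching), len(non_matching))  # 1 3
```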
To implement validation filtering we need two tables: one that receives all the data, and one that holds only the clean data. The materialized view's target table plays the role of the final table with clean data, while the source table is transient. We could delete data from the source table via a TTL, as in the previous section, or change the table's engine to Null, which stores no data at all (the data will then exist only in the materialized view's target table):
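The resulting pipeline can be sketched in a few lines of Python (the names `insert_into_src` and `clean_table` are invented for illustration): the Null-engine source retains nothing and merely forwards each inserted block through the view's filter into the clean table.

```python
import re

# Sketch of the Null-engine + materialized-view pipeline (names invented):
# the source table keeps nothing; each inserted block is pushed through the
# view's WHERE filter, and only the surviving rows land in the clean table.

clean_table = []          # plays the role of wikistat_clean

def insert_into_src(rows):
    # The materialized view's SELECT ... WHERE match(path, 'baidu'):
    clean_table.extend(r for r in rows if re.search("baidu", r["path"]))
    # Null engine: the source itself retains no rows.

insert_into_src([
    {"time": "2020-10-05 12:00:00", "path": "www.ifeng.com", "hits": 121},
    {"time": "2020-10-05 13:10:00", "path": "www.baidu.com", "hits": 577},
])

print(len(clean_table))   # 1: only the baidu row survived
```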
373168478a29 :) CREATE TABLE wudy.wikistat_src
(
`time` DateTime CODEC(Delta(4), ZSTD(1)),
`project` LowCardinality(String),
`subproject` LowCardinality(String),
`path` String,
`hits` UInt64
)
ENGINE = Null
Query id: f2673c20-4615-408c-b290-6bd843661562
Ok.
0 rows in set. Elapsed: 0.015 sec.
Now, let's create a materialized view with the data-validation query:
373168478a29 :) CREATE TABLE wikistat_clean AS wikistat
Query id: 9fd5efeb-feec-4311-927a-0ddad298ebc0
Ok.
0 rows in set. Elapsed: 0.040 sec.
373168478a29 :) CREATE MATERIALIZED VIEW wikistat_clean_mv TO wikistat_clean AS
SELECT *
FROM wikistat_src
WHERE match(path, 'baidu')
Query id: 881d04a0-2a7e-4a81-bbfd-7365103b1465
Ok.
0 rows in set. Elapsed: 0.012 sec.
When we insert data, wikistat_src stays empty:
INSERT INTO wudy.wikistat_src VALUES('2020-10-05 12:00:00', '201108003122','201108003127','www.ifeng.com', 121);
INSERT INTO wudy.wikistat_src VALUES('2020-10-05 13:10:00', '201108003101','201108003118','www.baidu.com', 577);
INSERT INTO wudy.wikistat_src VALUES('2020-10-05 12:30:00', '201108003101','201108003119','www.huawei.com', 224);
INSERT INTO wudy.wikistat_src VALUES('2020-10-05 13:30:01', '201108003101','201108003120','www.csdn.com', 2091);
373168478a29 :) SELECT count(*) FROM wikistat_src
Query id: 1f94547a-d805-4a67-8c18-ec6e900e57bf
┌─count()─┐
│ 0 │
└─────────┘
1 rows in set. Elapsed: 0.022 sec.
但是邮丰,我們的 wikistat_clean 物化表現(xiàn)在只有有效的行:
373168478a29 :) SELECT * FROM wikistat_clean
Query id: 7c81c628-8fa2-4235-b35c-1c94b0013232
┌────────────────time─┬─project──────┬─subproject───┬─path──────────┬─hits─┐
│ 2020-10-05 13:10:00 │ 201108003101 │ 201108003118 │ www.baidu.com │ 577 │
└─────────────────────┴──────────────┴──────────────┴───────────────┴──────┘
Insert one more row that satisfies the match condition:
373168478a29 :) INSERT INTO wudy.wikistat_src VALUES('2020-10-05 14:40:00', '201108003122','201108003101','www.baidu.com', 888);
INSERT INTO wudy.wikistat_src FORMAT Values
Query id: 9f127a73-d3ea-43fa-9cdb-6332305e72bd
Ok.
1 rows in set. Elapsed: 0.026 sec.
373168478a29 :) SELECT * FROM wikistat_clean
Query id: 58571aa3-ed3a-4c5e-84b2-c321f062df8c
┌────────────────time─┬─project──────┬─subproject───┬─path──────────┬─hits─┐
│ 2020-10-05 14:40:00 │ 201108003122 │ 201108003101 │ www.baidu.com │ 888 │
└─────────────────────┴──────────────┴──────────────┴───────────────┴──────┘
┌────────────────time─┬─project──────┬─subproject───┬─path──────────┬─hits─┐
│ 2020-10-05 13:10:00 │ 201108003101 │ 201108003118 │ www.baidu.com │ 577 │
└─────────────────────┴──────────────┴──────────────┴───────────────┴──────┘
2 rows in set. Elapsed: 0.024 sec.