-
學(xué)習(xí)資料
- 1.ClickHouse中文官網(wǎng):
https://clickhouse.com/docs/zh
- 2.ClickHouse官網(wǎng)
https://clickhouse.com/
- 3.ClickHouse GitHub
https://github.com/ClickHouse/ClickHouse
- 4.DBeaver 地址
https://dbeaver.io/download/
-
安裝部署
拉取服務(wù)端鏡像:
docker pull yandex/clickhouse-server
拉取客戶端端鏡像:docker pull yandex/clickhouse-client
啟動(dòng)Server端:docker run -d --name clickhouse-server --ulimit nofile=262144:262144 yandex/clickhouse-server
啟動(dòng)Client并連接到Server端:docker run -it --rm --link clickhouse-server:clickhouse-server yandex/clickhouse-client --host clickhouse-server
客戶端常用參數(shù):
clickhouse-client
--host, -h :服務(wù)端host名稱栗竖,默認(rèn) localhost
--port :連接端口,默認(rèn)9000
--user, -u :用戶名拦耐,默認(rèn) default
--password :密碼,默認(rèn)空
--query, -q :非交互模式下的查詢語(yǔ)句
--database, -d :默認(rèn)當(dāng)前操作的數(shù)據(jù)庫(kù),默認(rèn)default
--multiline, -m :允許多行語(yǔ)句查詢谴返,在clickhouse中默認(rèn)回車(chē)即為sql結(jié)束结榄,可使用該參數(shù)多行輸入
--format, -f :使用指定的默認(rèn)格式輸出結(jié)果 csv,以逗號(hào)分隔
--time, -t :非交互模式下會(huì)打印查詢執(zhí)行的時(shí)間
--stacktrace :出現(xiàn)異常會(huì)打印堆棧跟蹤信息
--config-file :配置文件名稱
停止clickhouse-server
wudy.yu@wudyyudeMacBook-Pro ~ % docker stop clickhouse-server
clickhouse-server
指定端口、掛載目錄啟動(dòng)
docker run -d --name clickhouse-server --ulimit nofile=262144:262144 -p 8123:8123 -p 9000:9000 -p 9009:9009 --privileged=true -v /Users/wudy.yu/docker-volumn/clickhouse/log:/var/log/clickhouse-server -v /Users/wudy.yu/docker-volumn/clickhouse/data:/var/lib/clickhouse yandex/clickhouse-server
-
基本操作
1.創(chuàng)建數(shù)據(jù)庫(kù)
ec430d220d73 :) CREATE DATABASE wudy
CREATE DATABASE wudy
Query id: d7f76e59-3fb6-4404-b482-f69657d3756f
Ok.
0 rows in set. Elapsed: 0.055 sec.
ec430d220d73 :) show databases
SHOW DATABASES
Query id: 0e8ac65d-f551-41a8-a816-960edfd0d7f9
┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default │
│ information_schema │
│ system │
│ wudy │
└────────────────────┘
5 rows in set. Elapsed: 0.026 sec.
2.創(chuàng)建數(shù)據(jù)表
CREATE TABLE wudy.clickstream
(
`customer_id` String,
`time_stamp` Date,
`click_event_type` String,
`page_code` FixedString(20),
`source_id` UInt64
)
ENGINE = MergeTree() ORDER BY time_stamp
Query id: f8699293-3182-4d82-bb6b-557270d4de28
Ok.
0 rows in set. Elapsed: 0.064 sec.
3.插入數(shù)據(jù)
ec430d220d73 :) INSERT INTO wudy.clickstream VALUES('wudy.yu', '2023-07-09', 'create_order', 'home_enter', 800820082)
INSERT INTO wudy.clickstream FORMAT Values
Query id: fad68be9-39da-4076-a493-f5fedfeebbea
Ok.
1 rows in set. Elapsed: 0.110 sec.
ec430d220d73 :) INSERT INTO wudy.clickstream VALUES('peter.li', '2023-07-10', 'add_to_cart', 'home_enter', 812820097)
INSERT INTO wudy.clickstream FORMAT Values
Query id: bb05cff2-0fd6-4aaa-8864-ce1023ab9455
Ok.
1 rows in set. Elapsed: 0.074 sec.
ec430d220d73 :) INSERT INTO wudy.clickstream VALUES('timo.peng', '2023-12-12', 'add_to_cart', 'product_center', 832821831)
INSERT INTO wudy.clickstream FORMAT Values
Query id: d5789224-720d-4db0-b357-cb786e3e85c3
Ok.
1 rows in set. Elapsed: 0.051 sec.
ec430d220d73 :) INSERT INTO wudy.clickstream VALUES('tony.zhang', '2023-11-11', 'create_order', 'order_center', 812651931)
INSERT INTO wudy.clickstream FORMAT Values
Query id: ccd03fef-67c8-47d3-9d58-0496b75a96d6
Ok.
1 rows in set. Elapsed: 0.055 sec.
4.查詢數(shù)據(jù)
ec430d220d73 :) SELECT * FROM wudy.clickstream WHERE time_stamp > '2023-01-01'
SELECT *
FROM wudy.clickstream
WHERE time_stamp > '2023-01-01'
Query id: 3443d7f9-d7cc-4a50-96ad-2212981c0ba4
┌─customer_id─┬─time_stamp─┬─click_event_type─┬─page_code──┬─source_id─┐
│ peter.li │ 2023-07-10 │ add_to_cart │ home_enter │ 812820097 │
└─────────────┴────────────┴──────────────────┴────────────┴───────────┘
┌─customer_id─┬─time_stamp─┬─click_event_type─┬─page_code────┬─source_id─┐
│ tony.zhang │ 2023-11-11 │ create_order │ order_center │ 812651931 │
└─────────────┴────────────┴──────────────────┴──────────────┴───────────┘
┌─customer_id─┬─time_stamp─┬─click_event_type─┬─page_code──────┬─source_id─┐
│ timo.peng │ 2023-12-12 │ add_to_cart │ product_center │ 832821831 │
└─────────────┴────────────┴──────────────────┴────────────────┴───────────┘
┌─customer_id─┬─time_stamp─┬─click_event_type─┬─page_code──┬─source_id─┐
│ wudy.yu │ 2023-07-09 │ create_order │ home_enter │ 800820082 │
└─────────────┴────────────┴──────────────────┴────────────┴───────────┘
4 rows in set. Elapsed: 0.095 sec.
ec430d220d73 :) DESCRIBE wudy.clickstream
DESCRIBE TABLE wudy.clickstream
Query id: 460a3ee3-433d-4833-b934-5c6a796ec7ea
┌─name─────────────┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ customer_id │ String │ │ │ │ │ │
│ time_stamp │ Date │ │ │ │ │ │
│ click_event_type │ String │ │ │ │ │ │
│ page_code │ FixedString(20) │ │ │ │ │ │
│ source_id │ UInt64 │ │ │ │ │ │
└──────────────────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
5 rows in set. Elapsed: 0.019 sec.
// 操作billFlow表
CREATE TABLE wudy.billflow
(
`id` Int64,
`batch_no` UUID,
`project_id` String,
`status` Enum8('unpaid' = 1, 'frozen' = 2, 'paid' = 3),
`create_time` DateTime,
`require_amount` Decimal(12, 8),
`del_flag` Bool
)
ENGINE = MergeTree() ORDER BY create_time
Query id: 1ceeeb80-43f4-43ad-99a6-944bbf1d4549
Ok.
0 rows in set. Elapsed: 0.057 sec.
ec430d220d73 :) DESCRIBE wudy.billflow
DESCRIBE TABLE wudy.billflow
Query id: 4fde98cd-77c7-4394-a745-859f5f6945a7
┌─name───────────┬─type──────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id │ Int64 │ │ │ │ │ │
│ batch_no │ UUID │ │ │ │ │ │
│ project_id │ String │ │ │ │ │ │
│ status │ Enum8('unpaid' = 1, 'frozen' = 2, 'paid' = 3) │ │ │ │ │ │
│ create_time │ DateTime │ │ │ │ │ │
│ require_amount │ Decimal(12, 8) │ │ │ │ │ │
│ del_flag │ Bool │ │ │ │ │ │
└────────────────┴───────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
7 rows in set. Elapsed: 0.018 sec.
ec430d220d73 :) INSERT INTO wudy.billflow VALUES ('20230710123456789', generateUUIDv4(), 'd7whdr3jd83yr8hde73dh37hfd', 'unpaid', '2100-01-01 11:11:11', 123.45678, TRUE)
INSERT INTO wudy.billflow FORMAT Values
Query id: c4eef0f0-cac6-4f46-a225-fc60d8918b4e
Ok.
1 rows in set. Elapsed: 0.150 sec.
ec430d220d73 :) INSERT INTO wudy.billflow VALUES ('20230710223457611', generateUUIDv4(), 'f6343234h4543m4543n45678n88', 'frozen', '2100-02-02 12:12:12', 888.99999, FALSE)
INSERT INTO wudy.billflow FORMAT Values
Query id: 7a1abdc0-ce27-43fc-99d9-b08cd5db2558
Ok.
1 rows in set. Elapsed: 0.066 sec.
ec430d220d73 :) INSERT INTO wudy.billflow VALUES ('20230711871524984', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2024-12-02 07:07:07', 666.77777, TRUE)
INSERT INTO wudy.billflow FORMAT Values
Query id: e9a91a8e-594a-4b15-abad-8b7b0bab2bcd
Connecting to clickhouse-server:9000 as user default.
Connected to ClickHouse server version 22.1.3 revision 54455.
Ok.
1 rows in set. Elapsed: 0.112 sec.
-
構(gòu)造賬單測(cè)試表
主鍵 | SQL |
---|---|
1 | insert into wudy.bill_data values(1,101,'202308191052001','custom','201108003118','wudy.yu','new','10086','msp',99.99, '2023-08-19 22:57:00'); |
2 | insert into wudy.bill_data values(2,102,'202308191108001','custom','201108003119','張磊','new','30045','msp',66.66, '2023-08-19 23:08:11'); |
3 | insert into wudy.bill_data values(3,103,'202308191109002','custom','201108003120','彭夢(mèng)婷','renew','10011','msp',43.88, '2023-08-20 10:01:23'); |
4 | insert into wudy.bill_data values(4,104,'202308201001001','project','201108003121','孫鐵','change','20071','tenant',20.88, '2023-08-20 10:26:45'); |
image.png
進(jìn)入clickhouse容器可以看到有4個(gè)分區(qū)文件:
image.png
計(jì)算出統(tǒng)計(jì)的總價(jià): 231.41
image.png
-
ReplacingMergeTree引擎去重
去重分為:
物理去重
(重復(fù)的數(shù)據(jù)直接被刪除掉)梢莽、查詢?nèi)ブ?/code>(不處理物理數(shù)據(jù)萧豆,但是查詢結(jié)果是已經(jīng)將重復(fù)數(shù)據(jù)過(guò)濾掉的)
插入重復(fù)數(shù)據(jù)(孫鐵): 除了version、require_amount外昏名,其他字段都一樣
| 主鍵 | SQL |
| ------------- |:-------------:|
| 4 |insert into wudy.bill_data values(4,105,'202308201001001','project','201108003121','孫鐵','change','20071','tenant',105.88, '2023-08-20 10:26:45');
|
| 4 |insert into wudy.bill_data values(4,106,'202308201001001','project','201108003121','孫鐵','change','20071','tenant',106.88, '2023-08-20 10:26:45');
|方式一:
final
去重查詢:查詢結(jié)果已去重
涮雷,物理數(shù)據(jù)未去重(未合并分區(qū)文件)
image.png
方式二:
argMax
方式去重查詢:查詢結(jié)果已去重
,物理數(shù)據(jù)未去重(未合并分區(qū)文件)
image.png
方式三:普通查詢:
查詢結(jié)果未去重
轻局,物理數(shù)據(jù)未去重(未合并分區(qū)文件)
image.png
上訴
final
和argMax
查詢示例都是基于本地表
做的操作洪鸭,final和argMax在結(jié)果上沒(méi)有差異样刷,但是如果基于分布式表
進(jìn)行試驗(yàn),兩條數(shù)據(jù)落在了不同數(shù)據(jù)分片(注意這里不是數(shù)據(jù)分區(qū))卿嘲,那么final和argMax的結(jié)果將會(huì)產(chǎn)生差異:final
的結(jié)果將是未去重
的颂斜,原因是final只能對(duì)本地表做去重查詢,不能對(duì)跨分片的數(shù)據(jù)進(jìn)行去重查詢拾枣,但是argMax
的結(jié)果是去重
的沃疮。argMax是通過(guò)比較第二參數(shù)version的大小,來(lái)取出我們要查詢的最新數(shù)據(jù)來(lái)達(dá)到過(guò)濾掉重復(fù)數(shù)據(jù)的目的梅肤,其原理是將每個(gè)Shard的數(shù)據(jù)摟到同一個(gè)Shard的內(nèi)存中進(jìn)行比較計(jì)算司蔬,所以支持跨分片的去重
上訴三種查詢方式均未對(duì)物理數(shù)據(jù)去重, 物理數(shù)據(jù)依然是6條,如下圖:
image.png
再插入一條孫鐵數(shù)據(jù), id姨蝴、version字段都一樣俊啼, 只有創(chuàng)建時(shí)間不同
| 主鍵 | SQL |
| ------------- |:-------------:|
| 4 |insert into wudy.bill_data values(4,106,'202308201001001','project','201108003121','孫鐵','change','20071','tenant',106.88, '2023-08-21 20:12:23');
|
image.png
image.png
手動(dòng)觸發(fā)合并分區(qū):optimize table bill_data final;
-
ReplacingMergeTree優(yōu)缺點(diǎn):
- 1.使用主鍵作為判斷重復(fù)數(shù)據(jù)的唯一鍵,支持插入相同主鍵數(shù)據(jù)左医。
在合并分區(qū)的時(shí)候會(huì)觸發(fā)刪除重復(fù)數(shù)據(jù)的邏輯授帕。但是合并的時(shí)機(jī)不確定,所以在- 2.查詢的時(shí)候可能會(huì)有重復(fù)數(shù)據(jù)浮梢,但是最終會(huì)去重跛十。可以手動(dòng)調(diào)用optimize秕硝,但是會(huì)引發(fā)對(duì)數(shù)據(jù)大量的讀寫(xiě)芥映,不建議生產(chǎn)使用。
- 3.以數(shù)據(jù)分區(qū)為單位刪除重復(fù)數(shù)據(jù)远豺,當(dāng)分區(qū)合并時(shí)奈偏,同一分區(qū)內(nèi)的重復(fù)數(shù)據(jù)會(huì)被刪除,不同分區(qū)的重復(fù)數(shù)據(jù)不會(huì)被刪除躯护。
- 4.可以通過(guò)final惊来,argMax方式做查詢?nèi)ブ兀@種方式無(wú)論有沒(méi)有做過(guò)數(shù)據(jù)合并榛做,都可以得到正確的查詢結(jié)果唁盏。
-
ReplacingMergeTree最佳使用方案:
- 普通select查詢:對(duì)時(shí)效不高的離線查詢可以采用ClickHouse自動(dòng)合并配合,但是需要保證同一業(yè)務(wù)單據(jù)落在同一個(gè)數(shù)據(jù)分區(qū)检眯,分布式表也需要保證在同一個(gè)分片(Shard)厘擂,這是一種最高效,最節(jié)省計(jì)算資源的查詢方式锰瘸。
- 2.final方式查詢:對(duì)于實(shí)時(shí)查詢可以使用final刽严,final是本地去重,需要保證同一主鍵數(shù)據(jù)落在同一個(gè)分片(Shard),但是不需要落在同一個(gè)數(shù)據(jù)分區(qū)舞萄,這種方式效率次之眨补,但是與普通select相比會(huì)消耗一些性能,如果where條件對(duì)主鍵索引倒脓,二級(jí)索引撑螺,分區(qū)字段命中的比較好的話效率也可以完全可以使用。
- 3.argMax方式查詢:對(duì)于實(shí)時(shí)查詢可以使用argMax崎弃,argMax的使用要求最低甘晤,咋查都能去重,但是由于它的實(shí)現(xiàn)方式饲做,效率會(huì)低很多线婚,也很消耗性能,不建議使用盆均。后面9.4.3會(huì)配合壓測(cè)數(shù)據(jù)與final進(jìn)行對(duì)比
-
5.與MYSQL對(duì)比
ClickHouse | MySQL | 說(shuō)明 |
---|---|---|
UInt8 | UNSIGNED TINYINT | |
Int8 | TINYINT | |
Int16 | UNSIGNED SMALLINT | |
Int16 | SMALLINT | |
UInt32 | UNSIGNED INT, UNSIGNED MEDIUMINT | |
Int32 | INT, MEDIUMINT | |
UInt64 | UNSIGNED BIGINT | |
Int64 | BIGINT | |
Float32 | FLOAT | |
Float64 | DOUBLE | |
Date | DATE | |
DATETIME | DATETIME, TIMESTAMP | |
FixedString | Char | |
String | VARCHAR, BLOB, TEXT | |
Decimal32(3) | Decimal | Decimal32:表示最大位數(shù)9位龄坪,小數(shù)部分3位 (123456.789), Decimal64:表示最大位數(shù)18位陷揪,Decimal128:表示最大位數(shù)38位 |
修改數(shù)據(jù):
1.早期的ClickHouse不支持可變數(shù)據(jù)
2.不支持事物释树,建議批量操作削祈,避免高頻率小數(shù)據(jù)量的修改
3.刪除和更新是一個(gè)異步的操作,語(yǔ)句提交立即返回但不一定已經(jīng)完成注意事項(xiàng):每次更新或者刪除肮砾,會(huì)廢棄目標(biāo)數(shù)據(jù)的原有分區(qū)奏候,而重建新分區(qū)
ec430d220d73 :) ALTER TABLE wudy.billflow UPDATE require_amount = 222.33333 WHERE id = '20230710123456789'
ALTER TABLE wudy.billflow
UPDATE require_amount = 222.33333 WHERE id = '20230710123456789'
Query id: 00f3a410-47c5-4295-9398-3eafc06b9121
Ok.
0 rows in set. Elapsed: 0.062 sec.
-
6.ClickHouse分片分區(qū)
-
7.ClickHouse聚合組合器
組合器允許擴(kuò)展和混合聚合,以解決各種數(shù)據(jù)結(jié)構(gòu)的需求,并簡(jiǎn)化查詢唇敞,同時(shí)避免對(duì)數(shù)據(jù)進(jìn)行結(jié)構(gòu)性更改。
如何使用組合器
1.常見(jiàn)的聚合函數(shù)
sumIf
咒彤、sumArrayIf
image.png
2.在聚合中的增加條件
我們有時(shí)需要根據(jù)特定條件對(duì)數(shù)據(jù)進(jìn)行聚合疆柔。我們就可以使用If組合器,并將條件指定為組合函數(shù)的最后一個(gè)參數(shù)镶柱,而不是使用WHERE子句旷档。
image.png
3.構(gòu)造測(cè)試數(shù)據(jù)
- 賬單表數(shù)據(jù)
CREATE TABLE wudy.billflow
(
`id` Int64,
`batch_no` UUID,
`project_id` String,
`status` Enum8('unpaid' = 1, 'frozen' = 2, 'paid' = 3),
`create_time` DateTime,
`modify_time` DateTime,
`require_amount` Decimal(12, 8),
`del_flag` Bool
)
ENGINE = MergeTree() ORDER BY create_time
手動(dòng)插入如下數(shù)據(jù):
INSERT INTO wudy.billflow VALUES ('20230710123456789', generateUUIDv4(), 'd7whdr3jd83yr8hde73dh37hfd', 'unpaid', '2023-01-01 11:11:11','2023-01-01 11:11:11', 50, TRUE)
INSERT INTO wudy.billflow VALUES ('20230710223457611', generateUUIDv4(), 'f6343234h4543m4543n45678n88', 'frozen', '2023-02-02 12:12:12', '2023-02-02 12:12:12',100, FALSE)
INSERT INTO wudy.billflow VALUES ('20230711871524984', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2023-12-02 07:07:07', '2023-12-02 07:09:17',120, TRUE)
INSERT INTO wudy.billflow VALUES ('20231230871526714', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2023-12-30 20:45:13', '2023-12-30 20:45:23',130, TRUE)
INSERT INTO wudy.billflow VALUES ('20231230877152494', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2023-12-30 21:50:19','2023-12-30 21:53:50', 140, TRUE)
INSERT INTO wudy.billflow VALUES ('20231231123456789', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'unpaid', '2023-12-31 01:26:13','2023-12-31 01:26:13', 200, TRUE)
INSERT INTO wudy.billflow VALUES ('20231231223456789', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'unpaid', '2023-12-31 02:26:13', '2023-12-31 02:26:13',220, TRUE)
INSERT INTO wudy.billflow VALUES ('20231231323456789', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'unpaid', '2023-12-31 03:26:13', '2023-12-31 03:26:13',240, TRUE)
INSERT INTO wudy.billflow VALUES ('20231231343456789', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'frozen', '2023-12-31 04:26:13', '2023-12-31 04:26:13',250, TRUE)
4.假設(shè)我們想獲得總金額,但僅當(dāng)賬單狀態(tài)為
已支付
SELECT sumIf(require_amount, status = 'paid')
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: 81af7f66-bb33-44d8-b195-6cb9a4d03422
┌─sumIf(require_amount, equals(status, 'paid'))─┐
│ 390 │
└───────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.064 sec.
我們還可以獲取根更新時(shí)間比成創(chuàng)建時(shí)間晚一分鐘的賬單總金額
SELECT sumIf(require_amount, (status = 'paid') AND (modify_time > (create_time + toIntervalMinute(1)))) AS project_total_amount
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: 43d30c11-40a2-402d-9946-d1b3a15889ad
┌─project_total_amount─┐
│ 260 │
└──────────────────────┘
使用條件If相對(duì)于標(biāo)準(zhǔn)WHERE子句的主要優(yōu)勢(shì)在于能夠?yàn)椴煌淖泳溆?jì)算多個(gè)總和,還可以在組合函數(shù)中使用任何可用的聚合函數(shù)歇拆,如
countIf()
鞋屈、avgIf()
或quantileIf()
- 任何函數(shù)。結(jié)合這些功能故觅,我們可以在單個(gè)請(qǐng)求中根據(jù)多個(gè)條件和函數(shù)進(jìn)行聚合
SELECT
countIf((status = 'paid') AND (modify_time > (create_time + toIntervalMinute(1)))) AS project_paid_bill_count,
sumIf(require_amount, (status = 'paid') AND (modify_time > (create_time + toIntervalMinute(1)))) AS project_paid_total_amount,
countIf(status = 'unpaid') AS project_unpaid_bill_count,
avgIf(require_amount, status = 'unpaid') AS project_unpaid_avg_amount
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: e056b6e1-b1b6-45d4-944f-63b75a78a14b
┌─project_paid_bill_count─┬─project_paid_total_amount─┬─project_unpaid_bill_count─┬─project_unpaid_avg_amount─┐
│ 2 │ 260 │ 3 │ 220 │
└─────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────┘
1 rows in set. Elapsed: 0.018 sec.
5.僅對(duì)唯一條目進(jìn)行聚合
計(jì)算唯一條目的數(shù)量是一種常見(jiàn)情況厂庇。ClickHouse有幾種方法可以使用COUNT(DISTINCT col)(與uniqExact相同),或當(dāng)估計(jì)(但更快)的值足夠時(shí)输吏,使用uniq()來(lái)執(zhí)行此操作权旷。但是,我們可能希望在不同的聚合函數(shù)中使用在列中的唯一值贯溅。為此拄氯,可以使用Distinct組合器:
image.png
// 再插入一條存在重復(fù)數(shù)據(jù)的賬單
INSERT INTO wudy.billflow VALUES ('20231230877851342', generateUUIDv4(), 'p53674b45bb4nn243b334534j342', 'paid', '2023-12-30 21:55:19','2023-12-30 21:55:50', 140, TRUE)
SELECT
countDistinct(toHour(create_time)) AS hours,
avgDistinct(toHour(create_time)) AS avg_hour,
sumDistinct(require_amount) AS sum_amount
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: 7b91264e-4252-4134-ad4e-abd3a9f7e5a7
┌─hours─┬──────────avg_hour─┬─sum_amount─┐
│ 7 │ 8.285714285714286 │ 1300 │
└───────┴───────────────────┴────────────┘
1 rows in set. Elapsed: 0.016 sec.
- 6.結(jié)合 Distinct 和 If
由于組合器可以互相結(jié)合躲查,可以使用前面的組合器和avgDistinctIf函數(shù)來(lái)處理更高級(jí)的邏輯:
SELECT avgDistinctIf(toHour(create_time), require_amount > 200) AS avg_hour
FROM billflow
WHERE project_id = 'p53674b45bb4nn243b334534j342'
Query id: 561b03c3-1291-439c-8452-e658df3147e2
┌─avg_hour─┐
│ 3 │
└──────────┘
1 rows in set. Elapsed: 0.018 sec.
- 7.控制空結(jié)果的聚合值
聚合函數(shù)對(duì)結(jié)果集不包含數(shù)據(jù)的情況有不同的反應(yīng)。例如译柏,count()將返回0镣煮,而avg()將產(chǎn)生nan值
我們可以使用OrDefault()
和OrNull()
組合器來(lái)控制此行為。兩者都會(huì)更改在數(shù)據(jù)集為空時(shí)使用的聚合函數(shù)的返回值
OrDefault()
將返回函數(shù)的默認(rèn)值而不是nanOrNull()
將返回NULL(并且還將更改返回類(lèi)型為Nullable)
SELECT
count(),
countOrNull(),
countOrDefault(),
avg(require_amount),
avgOrDefault(require_amount),
sumOrNull(require_amount)
FROM billflow
WHERE project_id = '123456789'
Query id: 69cef755-6012-4222-9e65-992ccbf0a1fc
┌─count()─┬─countOrNull()─┬─countOrDefault()─┬─avg(require_amount)─┬─avgOrDefault(require_amount)─┬─sumOrNull(require_amount)─┐
│ 0 │ ???? │ 0 │ nan │ 0 │ ???? │
└─────────┴───────────────┴──────────────────┴─────────────────────┴──────────────────────────────┴───────────────────────────┘
1 rows in set. Elapsed: 0.011 sec.
countOrNull()
將返回NULL而不是0鄙麦,avgOrDefault()
會(huì)返回0而不是nan
- 8.與其他組合器一起使用
與所有其他組合器一樣典唇,OrNull()和OrDefault()可以與不同的組合器一起使用以進(jìn)行更高級(jí)的邏輯
SELECT
sumIfOrNull(require_amount, status = 'paid') AS sum_paid_amount,
countIfDistinctOrNull(require_amount, status = 'paid') AS paid_distinct
FROM billflow
WHERE require_amount > 120
Query id: 616f57e1-7a4e-4833-8b84-8473f5ccc968
┌─sum_paid_amount─┬─paid_distinct─┐
│ 410 │ 2 │
└─────────────────┴───────────────┘
1 rows in set. Elapsed: 0.012 sec.
我們使用了
sumIfOrNull()
組合函數(shù)來(lái)已支付的賬單,并在空集上返回NULL黔衡。countIfDistinctOrNull()
函數(shù)計(jì)算不同的require_amount值蚓聘,但僅適用于滿足指定條件的行
- 9.聚合數(shù)組
ClickHouse
的Array
類(lèi)型在其用戶中很受歡迎,因?yàn)樗鼮楸斫Y(jié)構(gòu)帶來(lái)了很多靈活性盟劫。為了有效地操作Array列夜牡,ClickHouse提供了一組數(shù)組函數(shù)。為了簡(jiǎn)化對(duì)Array類(lèi)型的聚合侣签,ClickHouse提供了Array()組合器
塘装。這些將給定的聚合函數(shù)應(yīng)用于數(shù)組列中的所有值,而不是數(shù)組本身:
image.png
以下表格用于存儲(chǔ)文章里各個(gè)片段的閱讀數(shù)據(jù)影所。當(dāng)用戶閱讀一篇文章時(shí)蹦肴,我們將閱讀的片段保存到sections數(shù)組列中,并將相關(guān)的閱讀時(shí)間保存到times列中
CREATE TABLE article_reads
(
`time` DateTime,
`article_id` UInt32,
`sections` Array(UInt16),
`times` Array(UInt16),
`user_id` UInt32
)
ENGINE = MergeTree
ORDER BY (article_id, time)
INSERT INTO wudy.article_reads VALUES ('2023-12-30 21:55:19', 10, [16,18,7,21,23,22,11,19,9,8], [82,96,294,253,292,66,44,256,222,86], 32578)
INSERT INTO wudy.article_reads VALUES ('2023-12-31 11:12:19', 10, [21,8],[30,176], 11086)
INSERT INTO wudy.article_reads VALUES ('2023-12-31 13:22:33', 10, [24,11,23,9],[178,177,172,105], 12345)
我們使用uniqArray()函數(shù)計(jì)算每篇文章閱讀的唯一片段(
section
)數(shù)量猴娩,再使用avgArray()
得到每個(gè) 片段的平均時(shí)間
SELECT
article_id,
uniqArray(sections) AS sections_read,
round(avgArray(times)) AS time_per_section
FROM wudy.article_reads
GROUP BY article_id
Query id: f0f4ac7f-b41d-4b4b-9041-ac2d65dc42ae
┌─article_id─┬─sections_read─┬─time_per_section─┐
│ 10 │ 11 │ 158 │
└────────────┴───────────────┴──────────────────┘
1 rows in set. Elapsed: 0.045 sec.
我們可以使用
minArray()
和maxArray()
函數(shù)獲取所有文章的最小和最大閱讀時(shí)間
SELECT
minArray(times),
maxArray(times)
FROM wudy.article_reads
Query id: 7dc5c2ed-f809-4ccd-94c6-9cf1d995b361
┌─minArray(times)─┬─maxArray(times)─┐
│ 30 │ 294 │
└─────────────────┴─────────────────┘
1 rows in set. Elapsed: 0.009 sec.
我們還可以使用
groupUniqArray()
函數(shù)與Array()
組合器獲取每篇文章的閱讀部分列表
SELECT
article_id,
groupUniqArrayArray(sections)
FROM wudy.article_reads
GROUP BY article_id
Query id: c8169602-4b80-401a-83e8-1ea8482b2576
┌─article_id─┬─groupUniqArrayArray(sections)───┐
│ 10 │ [16,11,8,24,9,23,19,7,21,18,22] │
└────────────┴─────────────────────────────────┘
1 rows in set. Elapsed: 0.010 sec.
另一個(gè)流行的函數(shù)是any()阴幌,它在聚合下返回任何列值,并且還可以與Array組合:
SELECT
article_id,
anyArray(sections)
FROM wudy.article_reads
GROUP BY article_id
Query id: 06101fab-b72d-4d9f-a171-514217258b80
┌─article_id─┬─anyArray(sections)─┐
│ 10 │ 16 │
└────────────┴────────────────────┘
1 rows in set. Elapsed: 0.007 sec.
使用Array與其他組合器
- Array組合器可以與任何其他組合器一起使用:
SELECT
article_id,
sumArrayIfOrNull(times, length(sections) > 8)
FROM wudy.article_reads
GROUP BY article_id
Query id: ef46503f-d703-4eb4-b93d-c873c2d3a98a
┌─article_id─┬─sumArrayOrNullIf(times, greater(length(sections), 8))─┐
│ 10 │ 1691 │
└────────────┴───────────────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.045 sec.
使用
sumArrayIfOrNull()
函數(shù)來(lái)計(jì)算閱讀了超過(guò)八個(gè)部分的文章的總時(shí)間卷中。請(qǐng)注意矛双,對(duì)于閱讀未超過(guò)八個(gè)部分的文章,將返回NULL蟆豫,因?yàn)槲覀冞€使用了OrNull()
組合器
INSERT INTO wudy.article_reads VALUES ('2023-12-31 13:22:33', 11, [56,15,13,9],[247,81,22,5], 12345)
SELECT
article_id,
sumArrayIfOrNull(times, length(sections) > 8)
FROM wudy.article_reads
GROUP BY article_id
Query id: fee32e91-2030-4979-9364-7c3ab147f1ce
┌─article_id─┬─sumArrayOrNullIf(times, greater(length(sections), 8))─┐
│ 11 │ ???? │
│ 10 │ 1691 │
└────────────┴───────────────────────────────────────────────────────┘
2 rows in set. Elapsed: 0.026 sec.
我們將array函數(shù)與組合器一起使用议忽,我們甚至可以處理更高級(jí)的情況:
SELECT
article_id,
countArray(arrayFilter(x -> (x > 120), times)) AS sections_engaged
FROM wudy.article_reads
GROUP BY article_id
Query id: 9df392d7-e5b4-4391-89ad-3a29c824e29c
┌─article_id─┬─sections_engaged─┐
│ 11 │ 1 │
│ 10 │ 9 │
└────────────┴──────────────────┘
2 rows in set. Elapsed: 0.028 sec.
我們首先使用arrayFilter函數(shù)過(guò)濾times數(shù)組,以刪除所有小于120秒的值十减。然后栈幸,我們使用countArray來(lái)計(jì)算每篇文章的過(guò)濾時(shí)間
- 10.聚合映射
ClickHouse中另一種強(qiáng)大的類(lèi)型是Map。與數(shù)組一樣帮辟,我們可以使用Map()組合器將聚合應(yīng)用于此類(lèi)型
CREATE TABLE page_loads
(
`time` DateTime,
`url` String,
`params` Map(String, UInt32)
)
ENGINE = MergeTree
ORDER BY (url, time)
INSERT INTO wudy.page_loads VALUES ('2023-12-31 11:12:19', 'http://baidu.com', {'load_speed':100, 'scroll_depth':59})
INSERT INTO wudy.page_loads VALUES ('2024-01-01 13:14:15', 'http://ifeng.com', {'load_speed':200, 'scroll_depth':87})
INSERT INTO wudy.page_loads VALUES ('2024-01-01 13:14:15', 'http://google.com', {'scroll_depth':87})
我們可以使用Map()組合器來(lái)進(jìn)行sum()和avg()函數(shù)速址,以獲得總加載時(shí)間和平均滾動(dòng)深度
SELECT
sumMap(params)['load_speed'] AS total_load_time,
avgMap(params)['scroll_depth'] AS average_scroll
FROM page_loads
Query id: 6fcd08b5-bf09-4b2f-93be-82d66366a0cb
┌─total_load_time─┬────average_scroll─┐
│ 300 │ 77.66666666666667 │
└─────────────────┴───────────────────┘
1 rows in set. Elapsed: 0.021 sec.
Map()組合器還可以與其他組合器一起使用:
SELECT sumMapIf(params, url = 'http://google.com')['scroll_depth'] AS average_scroll
FROM page_loads
Query id: 93fc7e18-498e-4e03-9628-e29f3e189dc7
┌─average_scroll─┐
│ 87 │
└────────────────┘
1 rows in set. Elapsed: 0.024 sec.
- 11.聚合相應(yīng)數(shù)組值
理數(shù)組列的另一種方法是從兩個(gè)數(shù)組中聚合相應(yīng)的值。這將導(dǎo)致另一個(gè)數(shù)組织阅。這可以用于矢量化數(shù)據(jù)(如矢量或矩陣)壳繁,通過(guò)ForEach()組合器實(shí)現(xiàn):
image.png
CREATE TABLE article_reads_2
(
`time` DateTime,
`article_id` UInt32,
`sections` Array(UInt16),
`times` Array(UInt16),
`user_id` UInt32
)
ENGINE = MergeTree
ORDER BY (article_id, time)
INSERT INTO wudy.article_reads_2 VALUES ('2023-12-31 11:12:19', 10, [21,8],[30,176], 11086);
INSERT INTO wudy.article_reads_2 VALUES ('2024-01-01 11:12:19', 10, [33,10],[10,190,200], 11086);
SELECT * FROM article_reads_2
Query id: 9902fc4d-27b7-4631-94c4-6cd977e65236
┌────────────────time─┬─article_id─┬─sections─┬─times────────┬─user_id─┐
│ 2024-01-01 11:12:19 │ 10 │ [33,10] │ [10,190,200] │ 11086 │
└─────────────────────┴────────────┴──────────┴──────────────┴─────────┘
┌────────────────time─┬─article_id─┬─sections─┬─times────┬─user_id─┐
│ 2023-12-31 11:12:19 │ 10 │ [21,8] │ [30,176] │ 11086 │
└─────────────────────┴────────────┴──────────┴──────────┴─────────┘
2 rows in set. Elapsed: 0.022 sec.
373168478a29 :) SELECT avgForEach(times) FROM article_reads_2;
SELECT avgForEach(times) FROM article_reads_2
Query id: b2c28a3f-aaa6-4c19-9f99-5f07c12ec243
┌─avgForEach(times)─┐
│ [20,183,200] │
└───────────────────┘
1 rows in set. Elapsed: 0.019 sec.
這將要求ClickHouse計(jì)算所有坐標(biāo)數(shù)組的第一個(gè)元素的平均值,并將其放入結(jié)果數(shù)組的第一個(gè)元素中。然后對(duì)第二個(gè)和第三個(gè)元素重復(fù)相同的操作
還支持與其他組合器一起使用:
// 案例1
SELECT avgForEachIf(times, user_id = '11086')
FROM article_reads_2
Query id: 58a97c3b-d947-4220-912a-6ee396902536
┌─avgForEachIf(times, equals(user_id, '11086'))─┐
│ [20,183,200] │
// 案例2
373168478a29 :) SELECT avgForEachIfOrDefault(times, user_id='11086') FROM article_reads_2;
Query id: 158f3b31-79cd-489f-a794-608213f7e32c
┌─avgForEachIfOrDefault(times, equals(user_id, '11086'))─┐
│ [20,183,200] │
└────────────────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.021 sec.
// 案例3
373168478a29 :) SELECT avgForEachIfOrDefault(times, user_id='32578') FROM article_reads_2;
Query id: c1a62211-34d4-415d-8de8-e803b17a48a6
┌─avgForEachIfOrDefault(times, equals(user_id, '32578'))─┐
│ [] │
└────────────────────────────────────────────────────────┘
- 12.使用聚合狀態(tài)
ClickHouse允許使用
中間聚合狀態(tài)
而不是結(jié)果值闹炉。假設(shè)我們需要計(jì)算唯一值的計(jì)數(shù)蒿赢,但我們不想保存值本身(因?yàn)樗加每臻g)。在這種情況下渣触,我們可以使用uniq()
函數(shù)的State()組合器
來(lái)保存中間聚合狀態(tài)羡棵,然后使用Merge()
組合器來(lái)計(jì)算實(shí)際值:
-
8.ClickHouse物化視圖
物化視圖是一種特殊的觸發(fā)器,當(dāng)數(shù)據(jù)被插入時(shí)嗅钻,它將數(shù)據(jù)上執(zhí)行 SELECT 查詢的結(jié)果存儲(chǔ)為到一個(gè)目標(biāo)表中
image.png
案例1:
為常用的大數(shù)據(jù)量查詢創(chuàng)建物化視圖
// 創(chuàng)建`wikistat`表
CREATE TABLE wudy.wikistat
(
`time` DateTime CODEC(Delta(4), ZSTD(1)),
`project` LowCardinality(String),
`subproject` LowCardinality(String),
`path` String,
`hits` UInt64
)
ENGINE = MergeTree
ORDER BY (path, time);
// 準(zhǔn)備數(shù)據(jù)
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:10:00', '201108003118','201108003119','www.baidu.com', 1023);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:20:00', '201108003118','201108003119','www.baidu.com', 987);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:30:00', '201108003118','201108003119','www.baidu.com', 769);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:40:00', '201108003118','201108003119','www.ifeng.com', 2000);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:00:00', '201108003118','201108003119','www.ifeng.com', 1201);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:10:00', '201108003118','201108003119','www.baidu.com', 1641);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:30:00', '201108003118','201108003119','www.ifeng.com', 1562);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:30:01', '201108003118','201108003119','www.baidu.com', 911);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:22:00', '201108003118','201108003101','www.baidu.com', 987);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:31:00', '201108003118','201108003102','www.baidu.com', 769);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:44:05', '201108003118','201108003103','www.ifeng.com', 2000);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:03:07', '201108003118','201108003104','www.ifeng.com', 1201);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:35:02', '201108003118','201108003105','www.baidu.com', 1641);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:31:00', '201108003118','201108003106','www.ifeng.com', 1562);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:38:01', '201108003118','201108003107','www.baidu.com', 911);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:20:00', '201108003122','201108003112','www.baidu.com', 987);
INSERT INTO wudy.wikistat VALUES('2020-10-01 11:30:00', '201108003122','201108003144','www.baidu.com', 769);
INSERT INTO wudy.wikistat VALUES('2020-10-02 11:40:00', '201108003122','201108003123','www.ifeng.com', 2000);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:00:00', '201108003122','201108003133','www.ifeng.com', 121);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:10:00', '201108003101','201108003142','www.baidu.com', 577);
INSERT INTO wudy.wikistat VALUES('2020-10-02 12:30:00', '201108003101','201108003107','www.ifeng.com', 224);
INSERT INTO wudy.wikistat VALUES('2020-10-02 13:30:01', '201108003101','201108003105','www.baidu.com', 2091);
INSERT INTO wudy.wikistat VALUES('2020-10-03 11:20:00', '201108003122','201108003138','www.baidu.com', 987);
INSERT INTO wudy.wikistat VALUES('2020-10-03 11:30:00', '201108003122','201108003138','www.baidu.com', 769);
INSERT INTO wudy.wikistat VALUES('2020-10-04 11:40:00', '201108003122','201108003112','www.ifeng.com', 2000);
INSERT INTO wudy.wikistat VALUES('2020-10-04 12:00:00', '201108003122','201108003127','www.ifeng.com', 121);
INSERT INTO wudy.wikistat VALUES('2020-10-04 13:10:00', '201108003101','201108003118','www.baidu.com', 577);
INSERT INTO wudy.wikistat VALUES('2020-10-03 12:30:00', '201108003101','201108003119','www.ifeng.com', 224);
INSERT INTO wudy.wikistat VALUES('2020-10-03 13:30:01', '201108003101','201108003120','www.baidu.com', 2091);
假設(shè)我們經(jīng)常查詢某個(gè)日期最受歡迎的項(xiàng)目:
373168478a29 :) SELECT
project,
sum(hits) AS h
FROM wikistat
WHERE date(time) = '2020-10-01'
GROUP BY project
ORDER BY h DESC
LIMIT 10
Query id: a3dccadd-2797-413a-88c7-82a920f3b38a
┌─project──────┬────h─┐
│ 201108003118 │ 8535 │
│ 201108003122 │ 1756 │
└──────────────┴──────┘
2 rows in set. Elapsed: 0.014 sec.
如果上面這個(gè)查詢耗時(shí)較長(zhǎng)皂冰,且我們有大量這樣的查詢,并且我們需要ClickHouse提供毫秒級(jí)性能养篓,我們可以為這個(gè)查詢創(chuàng)建一個(gè)物化視圖:
// wikistat_top_project 是我們要用來(lái)保存物化視圖的表的名稱
// 使用了SummingMergeTree表引擎(SummingMergeTree引擎是異步)秃流,因?yàn)槲覀兿M麨槊總€(gè)date/project匯總hits值
373168478a29 :) CREATE TABLE wikistat_top_project
(
`date` Date,
`project` LowCardinality(String),
`hit` UInt32
)
ENGINE = SummingMergeTree
ORDER BY (date, project);
// wikistat_top_project_mv 是物化視圖本身(觸發(fā)器)的名稱
// AS 后面的內(nèi)容是構(gòu)建物化視圖的查詢
373168478a29 :) CREATE MATERIALIZED VIEW wikistat_top_project_mv TO wikistat_top_project AS
SELECT
date(time) AS date,
project,
sum(hits) AS hits
FROM wikistat
GROUP BY
date,
project
Query id: c9df59fe-2024-4ab5-b9d5-9c10c38194d9
Ok.
0 rows in set. Elapsed: 0.010 sec.
注意: 我們可以創(chuàng)建任意數(shù)量的
物化視圖
,但每一個(gè)新的物化視圖都是額外的存儲(chǔ)負(fù)擔(dān)柳弄,因此保持總數(shù)合理舶胀,即每個(gè)表下的物化視圖數(shù)目控制在10個(gè)以內(nèi)
現(xiàn)在,我們使用與 wikistat 表相同的查詢來(lái)填充物化視圖的目標(biāo)表
373168478a29 :) INSERT INTO wikistat_top_project SELECT
date(time) AS date,
project,
sum(hits) AS hits
FROM wikistat
GROUP BY
date,
project
Query id: 7c0c3c47-7c6d-427a-8fcf-8d78c0c9c6ab
Ok.
0 rows in set. Elapsed: 0.018 sec.
1.查詢物化視圖
373168478a29 :) SELECT
project,
sum(hit) AS hits
FROM wikistat_top_project
WHERE date = '2020-10-01'
GROUP BY project
ORDER BY hits DESC
LIMIT 10;
Query id: 1d625f2d-bf7a-44aa-9f61-fa489871083d
┌─project──────┬─hits─┐
│ 201108003118 │ 8535 │
│ 201108003122 │ 1756 │
└──────────────┴──────┘
2 rows in set. Elapsed: 0.010 sec.
再插入一條符合物化視圖條件的數(shù)據(jù)碧注,會(huì)發(fā)現(xiàn)物化視圖查詢出來(lái)的數(shù)據(jù)未發(fā)生改變
373168478a29 :) INSERT INTO wudy.wikistat VALUES('2020-10-01 15:20:00', '201108003122','201108003136','www.baidu.com', 1000);
373168478a29 :) SELECT
project,
sum(hit) AS hits
FROM wikistat_top_project
WHERE date = '2020-10-01'
GROUP BY project
ORDER BY hits DESC
LIMIT 10
Query id: e8865895-75f2-4fb1-8d47-2108fc6676f5
┌─project──────┬─hits─┐
│ 201108003118 │ 8535 │
│ 201108003122 │ 1756 │
└──────────────┴──────┘
2 rows in set. Elapsed: 0.072 sec.
2.管理物化視圖
使用
SHOW TABLES
查詢列出物化視圖
373168478a29 :) SHOW TABLES LIKE 'wikistat_top_project_mv'
Query id: 58033394-3567-457e-8e84-b578efe4a4d2
┌─name────────────────────┐
│ wikistat_top_project_mv │
└─────────────────────────┘
1 rows in set. Elapsed: 0.008 sec.
可以使用
DROP TABLE
刪除物化視圖嚣伐,但這只會(huì)刪除觸發(fā)器本身
DROP TABLE wikistat_top_project_mv
如果不再需要目標(biāo)表,請(qǐng)記得也將其刪除:
DROP TABLE wikistat_top_project
3.獲取物化視圖在磁盤(pán)上的大小
物化視圖表的元數(shù)據(jù)都存儲(chǔ)在system數(shù)據(jù)庫(kù)中
373168478a29 :)SELECT formatReadableSize(total_bytes) AS total_bytes_on_disk
FROM system.tables
WHERE table = 'wikistat_top_project'
Query id: a1b5808d-4cb7-4786-9880-613025c30e13
┌─total_bytes_on_disk─┐
│ 326.00 B │
└─────────────────────┘
1 rows in set. Elapsed: 0.009 sec.
使用物化視圖加速查詢
物化視圖是一種提高查詢性能的方法萍丐。對(duì)于分析查詢轩端,常見(jiàn)的聚合操作不僅僅是前面示例中展示的
sum()
。SummingMergeTree
非常適用于計(jì)算匯總數(shù)據(jù)
逝变,但還有更高級(jí)的聚合可以使用AggregatingMergeTree引擎
進(jìn)行計(jì)算
假設(shè)我們經(jīng)常執(zhí)行以下類(lèi)型的查詢:
SELECT
toDate(time) AS date,
min(hits) AS min_hits_per_hour,
max(hits) AS max_hits_per_hour,
avg(hits) AS avg_hits_per_hour
FROM wikistat
WHERE project = '201108003122'
GROUP BY date
Query id: bc6c89ce-1ba6-40be-811d-3122ef4e88af
// 查詢給定項(xiàng)目的每日點(diǎn)擊量的月最小值基茵、最大值和平均值:
┌───────date─┬─min_hits_per_hour─┬─max_hits_per_hour─┬─avg_hits_per_hour─┐
│ 2020-10-01 │ 769 │ 1000 │ 939 │
│ 2020-10-02 │ 121 │ 2000 │ 1060.5 │
│ 2020-10-03 │ 769 │ 987 │ 878 │
│ 2020-10-04 │ 121 │ 2000 │ 1060.5 │
└────────────┴───────────────────┴───────────────────┴───────────────────┘
使用
物化視圖
存儲(chǔ)這些聚合結(jié)果以便更快地檢索。使用狀態(tài)組合器(state combinators)
定義聚合結(jié)果壳影。狀態(tài)組合器
要求ClickHouse
保存內(nèi)部聚合狀態(tài)
耿导,而不是最終的聚合結(jié)果。這允許使用聚合操作态贤,而無(wú)需保存帶有原始值的所有記錄。這種方法很簡(jiǎn)單 - 我們?cè)趧?chuàng)建物化視圖時(shí)使用*State()
函數(shù)醋火,然后在查詢時(shí)使用其對(duì)應(yīng)的*Merge()
函數(shù)獲取正確的聚合結(jié)果
image.png
我們將使用 min 悠汽、 max 和 avg 狀態(tài)。在新物化視圖的目標(biāo)表中芥驳,我們將使用
AggregateFunction
類(lèi)型存儲(chǔ)聚合狀態(tài)而不是值
CREATE TABLE wikistat_daily_summary
(
`project` String,
`date` Date,
`min_hits_per_hour` AggregateFunction(min, UInt64),
`max_hits_per_hour` AggregateFunction(max, UInt64),
`avg_hits_per_hour` AggregateFunction(avg, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (project, date)
CREATE MATERIALIZED VIEW wikistat_daily_summary_mv TO wikistat_daily_summary AS
SELECT
project,
toDate(time) AS date,
minState(hits) AS min_hits_per_hour,
maxState(hits) AS max_hits_per_hour,
avgState(hits) AS avg_hits_per_hour
FROM wikistat
GROUP BY
project,
date
現(xiàn)在柿冲,讓我們?yōu)樗畛鋽?shù)據(jù):
INSERT INTO wikistat_daily_summary SELECT
project,
toDate(time) AS date,
minState(hits) AS min_hits_per_hour,
maxState(hits) AS max_hits_per_hour,
avgState(hits) AS avg_hits_per_hour
FROM wikistat
GROUP BY
project,
date
在查詢時(shí),我們使用相應(yīng)的
Merge
組合器來(lái)檢索值:
SELECT
date,
minMerge(min_hits_per_hour) AS min_hits_per_hour,
maxMerge(max_hits_per_hour) AS max_hits_per_hour,
avgMerge(avg_hits_per_hour) AS avg_hits_per_hour
FROM wikistat_daily_summary
WHERE project = '201108003122'
GROUP BY date
Query id: 7ac5bb0e-94b9-4ee1-a85b-ebd39823073a
┌───────date─┬─min_hits_per_hour─┬─max_hits_per_hour─┬─avg_hits_per_hour─┐
│ 2020-10-01 │ 769 │ 1000 │ 939 │
│ 2020-10-02 │ 121 │ 2000 │ 1060.5 │
│ 2020-10-03 │ 769 │ 987 │ 878 │
│ 2020-10-04 │ 121 │ 2000 │ 1060.5 │
└────────────┴───────────────────┴───────────────────┴───────────────────┘
注意:
任何聚合函數(shù)都可以作為一個(gè)聚合物化視圖的一部分與State/Merge組合器一起使用
驗(yàn)證和過(guò)濾數(shù)據(jù)
使用物化視圖的另一個(gè)流行的示例是在插入后立即處理數(shù)據(jù)兆旬。數(shù)據(jù)驗(yàn)證就是一個(gè)很好的例子
image.png
假設(shè)我們想要濾掉所有包含不需要的符號(hào)的path假抄,再保存到結(jié)果表中
373168478a29 :) SELECT count(*) FROM wikistat
┌─count()─┐
│ 31 │
└─────────┘
1 rows in set. Elapsed: 0.021 sec.
373168478a29 :) SELECT count(*) FROM wikistat WHERE NOT match(path, 'baidu')
┌─count()─┐
│ 13 │
└─────────┘
為了實(shí)現(xiàn)驗(yàn)證過(guò)濾,我們需要兩個(gè)表 - 一個(gè)帶有所有數(shù)據(jù)的表和一個(gè)只帶有干凈數(shù)據(jù)的表。物化視圖的目標(biāo)表將扮演一個(gè)只帶有干凈數(shù)據(jù)的最終表的角色宿饱,源表將是暫時(shí)的熏瞄。我們可以根據(jù)
TTL
從源表中刪除數(shù)據(jù),就像我們?cè)谏弦还?jié)中所做的那樣谬以,或者將此表的引擎更改為Null强饮,該引擎不存儲(chǔ)任何數(shù)據(jù)(數(shù)據(jù)只會(huì)存儲(chǔ)在物化視圖中):
373168478a29 :) CREATE TABLE wudy.wikistat_src
(
`time` DateTime CODEC(Delta(4), ZSTD(1)),
`project` LowCardinality(String),
`subproject` LowCardinality(String),
`path` String,
`hits` UInt64
)
ENGINE = Null
Query id: f2673c20-4615-408c-b290-6bd843661562
Ok.
0 rows in set. Elapsed: 0.015 sec.
現(xiàn)在,讓我們使用數(shù)據(jù)驗(yàn)證查詢創(chuàng)建一個(gè)物化視圖:
373168478a29 :) CREATE TABLE wikistat_clean AS wikistat
Query id: 9fd5efeb-feec-4311-927a-0ddad298ebc0
Ok.
0 rows in set. Elapsed: 0.040 sec.
373168478a29 :) CREATE MATERIALIZED VIEW wikistat_clean_mv TO wikistat_clean AS
SELECT *
FROM wikistat_src
WHERE match(path, 'baidu')
Query id: 881d04a0-2a7e-4a81-bbfd-7365103b1465
Ok.
0 rows in set. Elapsed: 0.012 sec.
當(dāng)我們插入數(shù)據(jù)時(shí)为黎, wikistat_src 將保持為空:
INSERT INTO wudy.wikistat_src VALUES('2020-10-05 12:00:00', '201108003122','201108003127','www.ifeng.com', 121);
INSERT INTO wudy.wikistat_src VALUES('2020-10-05 13:10:00', '201108003101','201108003118','www.baidu.com', 577);
INSERT INTO wudy.wikistat_src VALUES('2020-10-05 12:30:00', '201108003101','201108003119','www.huawei.com', 224);
INSERT INTO wudy.wikistat_src VALUES('2020-10-05 13:30:01', '201108003101','201108003120','www.csdn.com', 2091);
373168478a29 :) SELECT count(*) FROM wikistat_src
Query id: 1f94547a-d805-4a67-8c18-ec6e900e57bf
┌─count()─┐
│ 0 │
└─────────┘
1 rows in set. Elapsed: 0.022 sec.
但是邮丰,我們的 wikistat_clean 物化表現(xiàn)在只有有效的行:
373168478a29 :) SELECT * FROM wikistat_clean
Query id: 7c81c628-8fa2-4235-b35c-1c94b0013232
┌────────────────time─┬─project──────┬─subproject───┬─path──────────┬─hits─┐
│ 2020-10-05 13:10:00 │ 201108003101 │ 201108003118 │ www.baidu.com │ 577 │
└─────────────────────┴──────────────┴──────────────┴───────────────┴──────┘
再次插入一條符合match條件的數(shù)據(jù)
373168478a29 :) INSERT INTO wudy.wikistat_src VALUES('2020-10-05 14:40:00', '201108003122','201108003101','www.baidu.com', 888);
INSERT INTO wudy.wikistat_src FORMAT Values
Query id: 9f127a73-d3ea-43fa-9cdb-6332305e72bd
Ok.
1 rows in set. Elapsed: 0.026 sec.
373168478a29 :) SELECT * FROM wikistat_clean
Query id: 58571aa3-ed3a-4c5e-84b2-c321f062df8c
┌────────────────time─┬─project──────┬─subproject───┬─path──────────┬─hits─┐
│ 2020-10-05 14:40:00 │ 201108003122 │ 201108003101 │ www.baidu.com │ 888 │
└─────────────────────┴──────────────┴──────────────┴───────────────┴──────┘
┌────────────────time─┬─project──────┬─subproject───┬─path──────────┬─hits─┐
│ 2020-10-05 13:10:00 │ 201108003101 │ 201108003118 │ www.baidu.com │ 577 │
└─────────────────────┴──────────────┴──────────────┴───────────────┴──────┘
2 rows in set. Elapsed: 0.024 sec.