今天的主題酸休,不用寫序言骂租,看上圖就懂
如何用SQL的方式操作一個(gè)文件
先舉例
- 想知道history命令里,敲得最多的是哪個(gè)雨席?
- 當(dāng)然菩咨,sed,awk完全可以做到陡厘,但是抽米,如果用SQL怎么做?
history | awk '{print $1"\t"$2}' | clickhouse-client \
--query="SELECT shell, count() AS c FROM history \
GROUP BY shell ORDER BY c DESC limit 10 " \
--external --file=- --name=history \
--structure='id UInt16, shell String' -h 127.0.0.1
ls 390
cd 243
clickhouse-client 173
du 67
vim 57
htop 42
cat 28
history 27
tailf 25
mysql 24
原理
- ClickHouse支持把一個(gè)外部文件糙置,加載到內(nèi)部的一個(gè)臨時(shí)表中云茸,對(duì)這個(gè)臨時(shí)表進(jìn)行SQL化操作
格式
--external --file=... [--name=...] [--format=...] [--types=...|--structure=...]
-
--external
表示這個(gè)操作是外部文件的 -
--file=...
指定一個(gè)文件,如果是標(biāo)準(zhǔn)輸入谤饭,則寫-
-
[--name=...]
表名标捺,如果忽略,默認(rèn)給_data
-
[--format=...]
列分隔符揉抵,默認(rèn)是TabSeparated
- `[--types=...|--structure=...] 這句不解釋了亡容,看上面的例子就好了
再來(lái)一個(gè)測(cè)試
- 為了模擬一個(gè)有意義的場(chǎng)景,我們選了ClickHouse的system.parts這個(gè)表冤今,里面記錄的是ClickHouse的分區(qū)信息闺兢,表結(jié)構(gòu)如下
partition: 201709
name: 20170903_20170905_2_2963928_22
replicated: 0
active: 1
marks: 23372
rows: 191456971
bytes: 93294984484
modification_time: 2017-09-05 23:37:33
remove_time: 0000-00-00 00:00:00
refcount: 2
min_date: 2017-09-03
max_date: 2017-09-05
min_block_number: 2
max_block_number: 2963928
level: 22
primary_key_bytes_in_memory: 93488
primary_key_bytes_in_memory_allocated: 196608
database: xx
table: xx
engine: MergeTree
- 我們導(dǎo)出一份數(shù)據(jù),作為測(cè)試文件
- 默認(rèn)導(dǎo)出的文件是tab分割
clickhouse-client -h 127.0.0.1 -m -d system -q "select * from parts " > test.sql
- 目標(biāo)SQL
- 找某個(gè)表的分區(qū)數(shù)據(jù)戏罢,即有幾個(gè)分區(qū)屋谭,分區(qū)文件多大
SELECT
partition,
count() AS number_of_parts,
formatReadableSize(sum(bytes)) AS sum_size
FROM system.parts
WHERE active AND (database = 'xxxx') AND (table = 'xxxx_msg')
GROUP BY partition
ORDER BY partition ASC
- 文件SQL
root@10.xxxx:/root # wc -l test.sql
11991 test.sql
root@10.xxxx:/root # clickhouse-client \
--query="SELECT partition, count() AS number_of_parts, \
formatReadableSize(sum(bytes)) AS sum_size FROM parts \
WHERE active AND (database = 'xxxx') AND (table = 'xxxx_msg') \
GROUP BY partition ORDER BY partition ASC ;" \
--external --file=test.sql --name=parts \
--structure='partition UInt16,name String,replicated UInt16,active UInt16,marks UInt16,rows UInt16,bytes UInt16,modification_time String,remove_time String,refcount UInt16,min_date String,max_date String,min_block_number UInt16,max_block_number UInt16,level UInt16,primary_key_bytes_in_memory UInt16,primary_key_bytes_in_memory_allocated UInt16,database String,table String,engine String' \
-h 127.0.0.1
201709 36 1.68 TiB
201710 26 1.42 TiB
201711 30 1.42 TiB
201712 31 963.07 GiB
注意事項(xiàng)
- 文件操作雖然方便,但是官方文檔也提到了龟糕,如果是特別大的文件桐磁,還是不要這么玩了
- 另外,這個(gè)文件SQL其實(shí)還是要依賴ClickHouse-Server的讲岁,如果你沒(méi)有啟動(dòng)Server我擂,玩不了的哦~