I. clickhouse-copier
1. What is it?
clickhouse-copier is an official tool for copying data between clusters. It relies on ZooKeeper for coordination, which makes it suitable for cross-cluster data migration.
2. How to use it?
Suppose we want to copy table_dis from cluster1 [IP1, IP2, IP3] into cluster2 [IP4, IP5]. table_dis is a Distributed table whose underlying MergeTree table is table_local.
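For context, the relationship between the two source tables can be sketched as DDL. This is a minimal sketch with a hypothetical database `db` and hypothetical columns `id`/`val` (the real schema will differ):

```sql
-- Hypothetical local MergeTree table, present on every cluster1 node
CREATE TABLE db.table_local
(
    id  UInt64,
    val String
)
ENGINE = MergeTree()
ORDER BY id;

-- Distributed table that routes reads/writes to table_local across cluster1
CREATE TABLE db.table_dis AS db.table_local
ENGINE = Distributed(cluster1, db, table_local, rand());
```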
(1) zk.xml
<yandex>
    <logger>
        <level>trace</level>
        <size>100M</size>
        <count>3</count>
    </logger>
    <zookeeper>
        <node index="1">
            <host>[ZK-IP]</host>
            <port>[ZK-PORT]</port>
        </node>
    </zookeeper>
</yandex>
Create this as zk.xml; the copier uses it to find ZooKeeper.
(2) schema.xml
Replace the bracketed variables according to your environment.
<yandex>
    <!-- Configuration of clusters as in an ordinary server config -->
    <remote_servers>
        <source_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>[IP1]</host>
                    <port>[TCP PORT]</port>
                </replica>
            </shard>
        </source_cluster>
        <destination_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>[IP4]</host>
                    <port>[TCP PORT]</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>[IP5]</host>
                    <port>[TCP PORT]</port>
                </replica>
            </shard>
        </destination_cluster>
    </remote_servers>
    <max_workers>2</max_workers>
    <settings_pull>
        <readonly>1</readonly>
    </settings_pull>
    <settings_push>
        <readonly>0</readonly>
    </settings_push>
    <settings>
        <connect_timeout>3</connect_timeout>
        <!-- Sync insert is set forcibly, leave it here just in case. -->
        <insert_distributed_sync>1</insert_distributed_sync>
    </settings>
    <!-- Copying tasks description.
         You can specify several table tasks in the same task description (in the same ZooKeeper node); they will be performed
         sequentially.
    -->
    <tables>
        <!-- A table task; copies one table. -->
        <table_hits>
            <!-- Source cluster name (from the <remote_servers/> section) and tables in it that should be copied -->
            <!-- cluster_pull and cluster_push do not need to be changed -->
            <cluster_pull>source_cluster</cluster_pull>
            <database_pull>[DATABASE]</database_pull>
            <table_pull>[table_dis]</table_pull>
            <!-- Destination cluster name and tables in which the data should be inserted -->
            <cluster_push>destination_cluster</cluster_push>
            <database_push>[DATABASE]</database_push>
            <table_push>[table_local]</table_push>
            <!-- Sharding key used to insert data into the destination cluster -->
            <sharding_key>rand()</sharding_key>
            <engine>[ENGINE SYNTAX]</engine>
        </table_hits>
    </tables>
</yandex>
- [IP1] & [table_dis]: we chose to run clickhouse-copier on IP1, so IP1 is the host listed here. Since source_cluster contains only one server, the query simply goes through the Distributed table. An alternative is to list all of IP1, IP2, and IP3 in source_cluster and put the local table, table_local, in the table_pull tag.
- In destination_cluster we list every node of the target cluster, and table_push points at the local table, so the rows read from the Distributed table on IP1 get spread across the nodes of cluster2.
- [ENGINE SYNTAX]: in our tests we had to create the required tables manually on every destination node, and the engine clause still has to be specified here. This looks like a bug.
- The remaining parameters (e.g. max_workers) can be adjusted as needed to increase parallelism.
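Because the destination tables have to be created manually, the same DDL must be run on each node of cluster2 before starting the copier. A minimal sketch with a hypothetical database and columns; the ENGINE clause used here is also what goes into the &lt;engine&gt; tag of schema.xml:

```sql
-- Run on both IP4 and IP5 (hypothetical schema; ON CLUSTER would also work,
-- but only if cluster2 is defined in the destination servers' own config)
CREATE TABLE db.table_local
(
    id  UInt64,
    val String
)
ENGINE = MergeTree()
ORDER BY id;
```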
3. Execute
Run on IP1:
clickhouse-copier --daemon --config zk.xml --task-path /[ZK-CLICKHOUSE-PATH/NODE] --base-dir /[PATH] --task-upload-force true --task-file schema.xml
| Parameter | Meaning |
| --- | --- |
| --daemon | run in the background |
| /ZK-CLICKHOUSE-PATH/NODE | the ZooKeeper node the copier uses; the ZK-CLICKHOUSE-PATH node must be created manually in advance, while NODE does not |
| PATH | a local directory used for logs |
| --task-upload-force | whether to upload schema.xml to the ZooKeeper node on every run; defaults to false |
4彩郊、性能
測(cè)試在3臺(tái)的集群1000w條的數(shù)據(jù),寫入2臺(tái)的集群中蚪缀,耗時(shí)在30s
二秫逝、remote table
INSERT INTO [destination-table] SELECT * FROM remote('remote-table-ip', 'origin-table-name', 'user', 'passwd')
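A slightly fuller sketch of this approach (host, port, database, and table names are placeholders): read everything from the source table over the native TCP port and write it into a pre-created local table on the destination node:

```sql
-- Run on the destination node; 9000 is ClickHouse's default native TCP port
INSERT INTO db.table_local
SELECT *
FROM remote('remote-table-ip:9000', 'db', 'origin-table-name', 'user', 'passwd');
```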
III. Comparison
- The remote-table approach is simpler; the copier approach depends on ZooKeeper and its configuration is more involved.
- The copier supports multi-process copying, which improves copy throughput.
- The copier can shard the source table across multiple destination nodes (evenly via rand(), by hash, etc.); the remote-table approach is one-to-one.