Background
When errors turn up in the business side's data, the data has to be rerun. Since the business side does not use a collapsing MergeTree table (CollapsingMergeTree), the old data must be deleted first and the job rerun to write the new, correct data.
This pattern had been working fine and had never caused problems, but recently we found that issuing an ALTER statement against this table fails with a ZooKeeper Connection Loss error, while the same ALTER statements against other tables do not hit this error.
This article walks through how the problem was located and what the root cause turned out to be; we also hope readers will join the discussion and suggest better solutions.
Problem analysis
Problem description
ClickHouse version: 20.9.3.45
Table schema:
CREATE TABLE default.business_table
(
    createTime DateTime,
    appid int,
    totalCount bigint
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/business_table', '{replica}');
The ALTER statement and the corresponding error:
alter table default.business_table delete where toYYYYMMDDHH(createTime) =2022012020 and appid=1;
ERROR 999 (00000): Code: 999, e.displayText() = Coordination::Exception: Connection loss (version 20.9.3.45 (official build))
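As an aside (this was not part of the original troubleshooting), when an ALTER ... DELETE fails like this it is worth confirming whether the mutation was registered at all; the system.mutations table records each mutation together with its last failure reason:
select mutation_id, command, is_done, latest_fail_reason
from system.mutations
where database = 'default' and table = 'business_table';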
Locating the problem
First we checked the ClickHouse error log, which contains the relevant stack trace:
2022.02.10 11:17:51.706169 [ 34045 ] {} <Error> executeQuery: Code: 999, e.displayText() = Coordination::Exception: Connection loss (version 20.9.3.45 (official build)) (from 127.0.0.1:48554) (in query: alter table default.business_table delete where toYYYYMMDDHH(createTime) =2022012020 and appid=1), Stack trace (when copying this message, always include the lines below):
0. Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x18e1b360 in /usr/bin/clickhouse
1. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0xe736dad in /usr/bin/clickhouse
2. Coordination::Exception::Exception(Coordination::Error) @ 0x16887dad in /usr/bin/clickhouse
3. ? @ 0x168991b0 in /usr/bin/clickhouse
4. DB::EphemeralLocksInAllPartitions::EphemeralLocksInAllPartitions(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, zkutil::ZooKeeper&) @ 0x161dca16 in /usr/bin/clickhouse
5. DB::StorageReplicatedMergeTree::EphemeralLocksInAllPartitions(DB::MutationCommands const&, DB::Context const&) @ 0x1609c636 in /usr/bin/clickhouse
6. DB::InterpreterAlterQuery::execute() @ 0x15ab5126 in /usr/bin/clickhouse
Then we checked the ZooKeeper error log:
2022-02-10 11:17:51,680 [myid:90] - WARN [NIOWorkerThread-30:NIOServerCnxn@373] - Close of session 0x5a02260902470005
java.io.IOException: Len error 1190892
at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:541)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Next we did a rough comparison of table sizes in the system tables; the table with the problem is currently the largest.
mysql> select count() from system.parts where table='business_table' and active =1 ;
+---------+
| count() |
+---------+
|    8108 |
+---------+
1 row in set (0.04 sec)
As shown above, the table has a very large number of active data parts.
Analyzing the ZooKeeper log, ZooKeeper treated the request sent by the client as invalid (the Len error above: a declared length of 1,190,892 bytes exceeds what it allows) and proactively closed the ClickHouse connection. The ClickHouse stack trace in turn shows that the connection was lost while a ZooKeeper operation was in progress.
Now let's look for the root cause at the code level. When ClickHouse executes an ALTER and the resulting mutation changes partition data, it has to lock the affected partitions; each lock is taken by creating an ephemeral node under that partition's subdirectory in ZooKeeper, as the code below shows:
EphemeralLocksInAllPartitions::EphemeralLocksInAllPartitions(
    const String & block_numbers_path, const String & path_prefix, const String & temp_path,
    zkutil::ZooKeeper & zookeeper_)
    : zookeeper(&zookeeper_)
{
    std::vector<String> holders;
    while (true)
    {
        ......
        Coordination::Requests lock_ops;
        // One create request is appended for every affected partition; the requests are never split into smaller batches.
        for (size_t i = 0; i < partitions.size(); ++i)
        {
            String partition_path_prefix = block_numbers_path + "/" + partitions[i] + "/" + path_prefix;
            lock_ops.push_back(zkutil::makeCreateRequest(
                partition_path_prefix, holders[i], zkutil::CreateMode::EphemeralSequential));
        }
        lock_ops.push_back(zkutil::makeCheckRequest(block_numbers_path, partitions_stat.version));
        Coordination::Responses lock_responses;
        // The problem is here: all lock requests are submitted to ZooKeeper as a single multi operation.
        Coordination::Error rc = zookeeper->tryMulti(lock_ops, lock_responses);
        if (rc == Coordination::Error::ZBADVERSION)
        {
            LOG_TRACE(&Poco::Logger::get("EphemeralLocksInAllPartitions"), "Someone has inserted a block in a new partition while we were creating locks. Retry.");
            continue;
        }
        else if (rc != Coordination::Error::ZOK)
            throw Coordination::Exception(rc);
ClickHouse makes heavy use of batched operations when accessing ZooKeeper. In the partition-locking code above, it submits the lock requests for all affected partitions to ZooKeeper in one batch. ZooKeeper's wire protocol uses jute, and jute's default maximum packet size is 1 MB; for details, see the earlier post on the pitfalls of writing more than 1 MB of data to ZooKeeper.
The problem on the ClickHouse side is that it does not split this batch into smaller packets: it merges the requests for all affected partitions and sends them to ZooKeeper in a single call, so the request can grow beyond ZooKeeper's maximum packet size (the 1,190,892-byte request in the ZooKeeper log is larger than the 1,048,576-byte default), and the connection is torn down.
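Since each lock corresponds to one affected partition, the number of distinct partitions is what actually determines how large this batched request gets. We only recorded the active part count earlier, but a query along these lines (illustrative; partition is a standard column of system.parts) would show the partition count directly:
select count(distinct partition) as partition_count
from system.parts
where table = 'business_table' and active = 1;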
Why does this have to be submitted as one batch? If anyone knows the exact reason, please share it; my understanding is that ClickHouse probably needs a transaction-like guarantee here, since a ZooKeeper multi operation is atomic: either every partition lock is created or none of them is.
Solving the problem
Knowing the root cause, the first idea was to raise jute's default maximum packet size (the jute.maxbuffer Java system property), which can be done in ZooKeeper's own configuration. However, when we checked ClickHouse's ZooKeeper-related settings, the only tunable parameters are essentially the host, port and session timeout; there is no knob for the jute packet size, so this path is basically a dead end. A test in which only the ZooKeeper-side parameter was changed and ZooKeeper restarted indeed did not succeed.
The second idea was to limit the range of data affected by the ALTER DELETE. The original ALTER statement already specifies a time range, but ClickHouse apparently does not use that condition to prune partitions, and reading the source code confirmed there is no such logic. The latest ClickHouse documentation, however, shows that the DELETE statement can be restricted to a partition:
ALTER TABLE [db.]table DELETE [IN PARTITION partition_id] WHERE filter_expr
This syntax is not supported in version 20.9.3.45, so in the end we upgraded ClickHouse to 21.X.X.X and had the business side switch to DELETE IN PARTITION, which resolved the problem for now. If you have a better solution, please leave a comment so we can discuss it.
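For reference, here is a sketch of how the rerun's delete can be scoped after the upgrade. The simplified schema above does not show a PARTITION BY clause, so the partition value below is only an assumption (it presumes the production table is partitioned by toYYYYMMDDHH(createTime)); substitute the table's real partition key:
alter table default.business_table delete in partition 2022012020
where toYYYYMMDDHH(createTime) = 2022012020 and appid = 1;
Scoping the mutation with IN PARTITION means ClickHouse only has to lock that one partition instead of every partition in the table, which keeps the batched ZooKeeper request well under the 1 MB limit.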