【Problem Description】
Luminous release, v12.2.13.
- On a cluster that had sat idle for about a year, inspection found many OSDs with abnormally high meta usage.
- When these OSDs are restarted, they hang for a long time in BlueFS::_replay().
- Some of them eventually manage to start; others abort during startup with a log checksum error.
- Replay time is measured in hours, and the larger the disk, the longer it takes.
- OSDs with a dedicated DB partition have never shown this unbounded log growth, while OSDs without a separate block.db hit it with high probability.
- In the end the only option was to destroy and recreate the OSDs, but after the cluster sat idle for a while, the problem came back.
Let's start the analysis from the kv commit thread in BlueStore.cc:
void BlueStore::_kv_sync_thread()
{
dout(10) << __func__ << " start" << dendl;
std::unique_lock<std::mutex> l(kv_lock);
assert(!kv_sync_started);
bool bluefs_do_check_balance = false;
kv_sync_started = true;
kv_cond.notify_all();
while (true) {
assert(kv_committing.empty());
if (kv_queue.empty() &&
((deferred_done_queue.empty() && deferred_stable_queue.empty()) ||
!deferred_aggressive) &&
(bluefs_do_check_balance == false)) {
When the cluster is idle, all of the queues above are empty and bluefs_do_check_balance starts out false, so we take the if branch; if no new kv writes arrive within the default 1-second wait, bluefs_do_check_balance is set to true.
if (kv_stop)
break;
dout(20) << __func__ << " sleep" << dendl;
std::cv_status status = kv_cond.wait_for(l,
std::chrono::milliseconds(int64_t(cct->_conf->bluestore_bluefs_balance_interval * 1000)));
dout(20) << __func__ << " wake" << dendl;
if (status == std::cv_status::timeout) {
bluefs_do_check_balance = true;
}
On the next iteration, bluefs_do_check_balance is now true, so we fall into the else branch below (a lot of irrelevant code is omitted and replaced with ...):
} else {
deque<TransContext*> kv_submitting;
deque<DeferredBatch*> deferred_done, deferred_stable;
uint64_t aios = 0, costs = 0;
... ...
// we will use one final transaction to force a sync
KeyValueDB::Transaction synct = db->get_transaction();
The line above matters: db->get_transaction() creates a fresh transaction, which will later be used to force a sync, ensuring that everything committed in this loop iteration reaches disk.
The flow is roughly as follows:
... ...
for (auto txc : kv_committing) {
if (txc->state == TransContext::STATE_KV_QUEUED) {
txc->log_state_latency(logger, l_bluestore_state_kv_queued_lat);
int r = cct->_conf->bluestore_debug_omit_kv_commit ? 0 : db->submit_transaction(txc->t);
assert(r == 0);
txc->state = TransContext::STATE_KV_SUBMITTED;
... ...
} else {
assert(txc->state == TransContext::STATE_KV_SUBMITTED);
txc->log_state_latency(logger, l_bluestore_state_kv_queued_lat);
}
... ...
}
... ...
PExtentVector bluefs_gift_extents;
if (bluefs &&
after_flush - bluefs_last_balance >
cct->_conf->bluestore_bluefs_balance_interval) {
bluefs_last_balance = after_flush;
int r = _balance_bluefs_freespace(&bluefs_gift_extents);
assert(r >= 0);
if (r > 0) {
for (auto& p : bluefs_gift_extents) {
bluefs_extents.insert(p.offset, p.length);
}
bufferlist bl;
::encode(bluefs_extents, bl);
dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
<< bluefs_extents << std::dec << dendl;
synct->set(PREFIX_SUPER, "bluefs_extents", bl);
}
}
... ...
The flow above, roughly: for each transaction in the kv_committing queue, commit it with db->submit_transaction(); then evaluate BlueFS's free space, and when there is too much of it, reclaim space and update the bluefs_extents record; the updated bluefs_extents record is serialized and attached to the synct transaction obtained earlier, with the expectation that it will be committed and synced along with synct. Next:
// submit synct synchronously (block and wait for it to commit)
int r = cct->_conf->bluestore_debug_omit_kv_commit ? 0 : db->submit_transaction_sync(synct);
assert(r == 0);
Note that this calls db->submit_transaction_sync(), whereas the business IO earlier was committed with db->submit_transaction(); the difference between the two will be examined shortly.
Reviewing the flow above, when the cluster is idle with no client IO, each periodic cycle does the following:
- KeyValueDB::Transaction synct = db->get_transaction(); // always executed
- db->submit_transaction(txc->t); // never executed, since no new transactions arrive
- _balance_bluefs_freespace(&bluefs_gift_extents); // executed periodically
- synct->set(PREFIX_SUPER, "bluefs_extents", bl); // executed rarely: the synct commits below may gradually grow BlueFS usage, while log compaction then frees space, pushing the bluefs free ratio high enough to trigger the space-reclaim path in the previous step
- db->submit_transaction_sync(synct); // always executed
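The idle-cycle behaviour above can be sketched with a minimal, self-contained model (plain C++, not Ceph code; all names and the structure are invented for illustration):

```cpp
#include <cassert>
#include <deque>

// Minimal model of BlueStore::_kv_sync_thread() on an idle cluster:
// kv_queue stays empty, every condition-variable wait times out, and
// each "balance" cycle still commits one (empty) synct transaction
// synchronously.
struct Model {
  std::deque<int> kv_queue;          // business transactions (none when idle)
  bool bluefs_do_check_balance = false;
  int submit_calls = 0;              // models db->submit_transaction()
  int submit_sync_calls = 0;         // models db->submit_transaction_sync(synct)

  void run_idle_cycles(int cycles) {
    for (int i = 0; i < cycles; ) {
      if (kv_queue.empty() && !bluefs_do_check_balance) {
        // kv_cond.wait_for(...) times out after the balance interval
        bluefs_do_check_balance = true;
      } else {
        for (int t : kv_queue) { (void)t; ++submit_calls; }  // never runs when idle
        // _balance_bluefs_freespace() would run here
        ++submit_sync_calls;   // the empty synct is still committed synchronously
        bluefs_do_check_balance = false;
        ++i;
      }
    }
  }
};
```

Running a few cycles shows the asymmetry: submit_transaction() is never called, but submit_transaction_sync() fires on every cycle.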
This narrows the search: the trigger for the unbounded BlueFS growth must be at one of these two points:
- _balance_bluefs_freespace(&bluefs_gift_extents);
- db->submit_transaction_sync(synct);
When the problem occurs, a restarting OSD loops for a long time in BlueFS::_replay(bool noop), so we can infer that the affected OSDs have an enormous BlueFS log to replay.
Let's analyze _balance_bluefs_freespace first. The only call in it that may touch the disk is:
bluefs->reclaim_blocks(bluefs_shared_bdev, reclaim, &extents);
Here bluefs_shared_bdev is BDEV_DB when there is a dedicated DB device, and BDEV_SLOW otherwise.
int BlueFS::reclaim_blocks(unsigned id, uint64_t want,
PExtentVector *extents)
{
std::unique_lock<std::mutex> l(lock);
dout(1) << __func__ << " bdev " << id
<< " want 0x" << std::hex << want << std::dec << dendl;
assert(id < alloc.size());
assert(alloc[id]);
int64_t got = alloc[id]->allocate(want, alloc_size[id], 0, extents);
ceph_assert(got != 0);
if (got < 0) {
derr << __func__ << " failed to allocate space to return to bluestore"
<< dendl;
alloc[id]->dump();
return got;
}
for (auto& p : *extents) {
block_all[id].erase(p.offset, p.length);
block_total[id] -= p.length;
log_t.op_alloc_rm(id, p.offset, p.length);
}
flush_bdev();
int r = _flush_and_sync_log(l);
assert(r == 0);
if (logger)
logger->inc(l_bluefs_reclaim_bytes, got);
dout(1) << __func__ << " bdev " << id << " want 0x" << std::hex << want
<< " got " << *extents << dendl;
return 0;
}
In our scenario the extents this function ends up with are most likely empty, so log_t is not updated.
It then calls _flush_and_sync_log(); let's see what that does:
int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>& l,
uint64_t want_seq,
uint64_t jump_to)
{
while (log_flushing) {
dout(10) << __func__ << " want_seq " << want_seq
<< " log is currently flushing, waiting" << dendl;
assert(!jump_to);
log_cond.wait(l);
}
if (want_seq && want_seq <= log_seq_stable) {
dout(10) << __func__ << " want_seq " << want_seq << " <= log_seq_stable "
<< log_seq_stable << ", done" << dendl;
assert(!jump_to);
return 0;
}
if (log_t.empty() && dirty_files.empty()) {
dout(10) << __func__ << " want_seq " << want_seq
<< " " << log_t << " not dirty, dirty_files empty, no-op" << dendl;
assert(!jump_to);
return 0;
}
... ...
dirty_files is empty, and by this point log_t.empty() should also hold, so the function returns right away without doing anything.
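That guard can be reduced to a toy function (illustrative only, not the real implementation) that makes the no-op behaviour explicit:

```cpp
#include <cassert>

// Toy model of the early-return guard in BlueFS::_flush_and_sync_log():
// when log_t is empty and there are no dirty files, nothing touches disk.
int flush_and_sync_log_model(bool log_t_empty, bool dirty_files_empty,
                             int *log_writes) {
  if (log_t_empty && dirty_files_empty)
    return 0;                 // no-op: the on-disk log is untouched
  ++*log_writes;              // the real code would serialize log_t and fsync it
  return 0;
}
```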
At this point, _balance_bluefs_freespace(&bluefs_gift_extents) is ruled out as the cause of the unbounded log growth.
That leaves only one suspect:
- db->submit_transaction_sync(synct);
The problem lies in BlueFS::sync_metadata(). As analyzed above, log_t is most likely empty, so the function's early-return check on log_t prevents the compaction check further down from ever executing, and log compaction is never triggered.
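The control flow can be sketched as follows. This is a simplified model of the pre-fix behaviour, not the verbatim Luminous source; the point is only that the early return gates the compaction check:

```cpp
#include <cassert>

// Simplified model of the pre-12.2.14 BlueFS::sync_metadata() control flow:
// the early return on an empty log_t means the _should_compact_log() check
// is never even consulted, no matter how large the on-disk log has grown.
struct SyncModel {
  bool log_t_empty = true;           // idle cluster: almost always true
  bool log_needs_compaction = true;  // e.g. the on-disk log is already huge
  int compactions = 0;

  void sync_metadata() {
    if (log_t_empty)
      return;                        // <-- the problematic early return
    // _flush_and_sync_log(...) would run here
    if (log_needs_compaction)        // models _should_compact_log()
      ++compactions;
  }
};
```

Calling sync_metadata() any number of times with an empty log_t performs zero compactions; only a non-empty log_t ever reaches the compaction branch.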
There is a community PR fixing this, included in v12.2.14:
https://github.com/ceph/ceph/pull/34876
A new question naturally follows: surely a non-empty log_t comes along sooner or later to trigger the compaction?
Tracing the call chain, in rocksdb's DBImpl implementation there are only two situations that trigger BlueRocksDirectory::Fsync():
- on the first write request with woptions.sync == true
- when switching to a new memtable
What if, when these two situations fire, the transactions being written are empty? Then it is quite possible for BlueRocksDirectory::Fsync() to be called while log_t contains nothing at all.
The PR below was not accepted upstream (the release was end of life), but I believe it does plug the problem at its source:
https://github.com/ceph/ceph/pull/36108
To sum up, the picture is now fairly clear:
- When the cluster has no client load, the periodic db->submit_transaction_sync calls commit empty transactions.
- On the first commit and on memtable switches, BlueRocksDirectory::Fsync() is triggered, but because log_t is empty, log compaction is never performed.
- Steps 1 and 2 repeat over months, so the WAL file's log keeps growing without ever being compacted, and the log space grows without bound.
- When the log grows to a limiting size, around 500 GB, the log becomes corrupted (the exact cause is still unclear).
- At that point, a restarting OSD goes through an extremely long replay and ultimately fails on the corrupted log, so the OSD cannot start.
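Putting the pieces together, the runaway growth can be demonstrated with a toy simulation (entirely illustrative; the sizes and names are invented, not measured from Ceph):

```cpp
#include <cassert>
#include <cstdint>

// Toy simulation of the failure mode: every idle cycle appends a little
// to the on-disk BlueFS log (the periodic empty sync), but compaction is
// gated on a non-empty in-memory log_t, which never happens while idle.
struct GrowthSim {
  uint64_t log_bytes = 0;
  uint64_t compact_threshold = 1 << 20;  // invented figure
  bool log_t_empty = true;               // idle cluster: log_t has nothing
  int compactions = 0;

  void idle_cycle() {
    log_bytes += 4096;                   // each empty sync still appends (invented size)
    if (!log_t_empty && log_bytes > compact_threshold) {
      log_bytes = 0;                     // compaction would shrink the log
      ++compactions;
    }
  }
};
```

Even after the log is far past the compaction threshold, the compaction branch never runs; on a real OSD this continues until the log reaches hundreds of gigabytes.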
【Solution】
I believe, and this is only my belief, that applying the following two PRs together solves the problem:
- Give compact log a chance to always run: https://github.com/ceph/ceph/pull/34876
- Avoid committing empty transactions as far as possible: https://github.com/ceph/ceph/pull/36108