An Analysis of the Data Directories in Apache BookKeeper
Data that has to be persisted to disk
- Journals
  - The journal files hold what amounts to BookKeeper's transaction log, or write-ahead log. Before any update to a ledger takes effect, a description of that update is first persisted to the journal file.
  - BookKeeper has a dedicated sync thread that rolls the journal file based on the size of the current journal file;
- EntryLogFile
  - The file that stores the actual data. On a write, entry data is first cached in an in-memory buffer and then flushed to the EntryLogFile in batches;
  - By default, the data of all ledgers is aggregated and written sequentially into the same EntryLog file, which avoids random disk writes;
- Index files
  - Because the entries of all ledgers go into the same EntryLog file, a mapping from ledgerId + entryId to file offset is maintained to speed up reads. This mapping is cached in memory and is called the IndexCache;
  - When the IndexCache reaches its capacity limit, the sync thread flushes it to a file;
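- LastLogMark
  - As described above, both EntryLog data and Index data are first cached in memory and then flushed to disk periodically when certain conditions are met. This leaves a window between the moment data is in memory and the moment it is persisted. If the BookKeeper process crashes inside that window, it has to recover from the journal files after restart, and the LastLogMark records the journal position from which recovery starts;
  - The LastLogMark itself lives in memory. Its value is updated after the IndexCache has been flushed to disk, and it is also persisted to a disk file periodically so that the BookKeeper process can read it at startup and recover from the journal;
  - Once a LastLogMark has been persisted to disk, all Index and EntryLog data before it is already on disk, so the journal data before that LastLogMark can be deleted.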
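To make the interplay between the journal, the entry log, the index, and the LastLogMark concrete, here is a minimal toy model in Java. It is not BookKeeper code: the class `ToyBookie` and all of its methods are hypothetical names made up for illustration, in-memory lists stand in for files, and a single `flush()` stands in for the sync thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of journal-first writes and replay from a persisted mark (not BookKeeper code). */
final class ToyBookie {
    /** One journal record describing a single entry write. */
    static final class JournalRecord {
        final long ledgerId;
        final long entryId;
        final String payload;

        JournalRecord(long ledgerId, long entryId, String payload) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.payload = payload;
        }
    }

    // "Durable" state: survives the simulated crash.
    private final List<JournalRecord> journal = new ArrayList<>();
    private final List<String> entryLog = new ArrayList<>();
    private final Map<String, Integer> indexFile = new HashMap<>();
    private int lastLogMark = 0; // journal position where replay starts

    // Volatile state: lost on crash.
    private List<String> writeBuffer = new ArrayList<>();
    private Map<String, Integer> indexCache = new HashMap<>();

    private static String key(long ledgerId, long entryId) {
        return ledgerId + ":" + entryId;
    }

    /** Journal the update first, then apply it to the in-memory buffers. */
    void addEntry(long ledgerId, long entryId, String payload) {
        journal.add(new JournalRecord(ledgerId, entryId, payload));
        apply(ledgerId, entryId, payload);
    }

    private void apply(long ledgerId, long entryId, String payload) {
        int offset = entryLog.size() + writeBuffer.size(); // position in the shared entry log
        writeBuffer.add(payload);
        indexCache.put(key(ledgerId, entryId), offset);
    }

    /** Flush buffers to "disk", then advance the mark to the end of the journal. */
    void flush() {
        entryLog.addAll(writeBuffer);
        writeBuffer.clear();
        indexFile.putAll(indexCache);
        indexCache.clear();
        lastLogMark = journal.size();
    }

    /** Drop volatile state, then replay journal records from the mark onward. */
    void crashAndRecover() {
        writeBuffer = new ArrayList<>();
        indexCache = new HashMap<>();
        for (JournalRecord r : journal.subList(lastLogMark, journal.size())) {
            apply(r.ledgerId, r.entryId, r.payload);
        }
    }

    /** Number of leading journal records that are safe to delete. */
    int trimmableJournalRecords() {
        return lastLogMark;
    }

    /** Look up an entry through the index; returns null if the entry is unknown. */
    String read(long ledgerId, long entryId) {
        String k = key(ledgerId, entryId);
        Integer offset = indexCache.containsKey(k) ? indexCache.get(k) : indexFile.get(k);
        if (offset == null) {
            return null;
        }
        return offset < entryLog.size()
                ? entryLog.get(offset)
                : writeBuffer.get(offset - entryLog.size());
    }
}
```

In this sketch, writing one entry, calling `flush()`, writing a second entry, and then calling `crashAndRecover()` leaves both entries readable: the first comes from the flushed entry log, the second is rebuilt by replaying the journal from the mark. Only the records before the mark are reported as trimmable, which mirrors the rule in the last bullet.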
Tuning the layout of the on-disk data directories
- Put the journal, entry log, and index on different disks where possible to avoid I/O contention;
- The journal is best written to a fast disk such as an SSD.
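As a quick sanity check for such a layout, the sketch below loads a properties snippet and verifies that no directory is shared between the three kinds of data. The key names `journalDirectories`, `ledgerDirectories`, and `indexDirectories` are what we understand the bookie server configuration to use, so check them against your BookKeeper version. The mount paths and the class `DirLayoutCheck` are hypothetical. Note that the check only compares paths; whether those paths really sit on different physical disks is still up to the operator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

/** Checks that journal, ledger, and index directories do not overlap (illustrative only). */
final class DirLayoutCheck {
    // Assumed configuration keys; verify them against your BookKeeper version.
    private static final String[] DIR_KEYS = {
        "journalDirectories", "ledgerDirectories", "indexDirectories"
    };

    /** Returns true when every configured directory path appears only once. */
    static boolean usesDistinctDirs(Properties conf) {
        Set<Path> seen = new HashSet<>();
        for (String key : DIR_KEYS) {
            for (String dir : conf.getProperty(key, "").split(",")) {
                String trimmed = dir.trim();
                if (trimmed.isEmpty()) {
                    continue; // key not set, or a trailing comma
                }
                if (!seen.add(Paths.get(trimmed).normalize())) {
                    return false; // same path used twice
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical mount points: journal on an SSD, the rest on separate disks.
        String sample = String.join("\n",
            "journalDirectories=/mnt/ssd0/bk-journal",
            "ledgerDirectories=/mnt/hdd0/bk-ledgers,/mnt/hdd1/bk-ledgers",
            "indexDirectories=/mnt/hdd2/bk-index");
        Properties conf = new Properties();
        conf.load(new StringReader(sample));
        System.out.println("distinct dirs: " + usesDistinctDirs(conf));
    }
}
```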
How the files are updated after data is written
- Flow chart
data-flow1.png
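Monitoring directory usage
- A directory used for writes can be in one of three states:
  - Writable;
  - Writable, but with remaining space below the configured warning threshold;
  - Not writable because it is full. Once GC has cleaned up part of the data, it can become writable again;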
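The three states can be sketched as a simple classification of a directory's usage ratio against two thresholds. This is an illustration under our own assumptions, not BookKeeper's implementation: `DirStateSketch`, `DirState`, and `classify` are hypothetical names, and the threshold values a caller passes in are whatever the deployment configures.

```java
import java.io.File;

/** Classifies a directory by its usage ratio (illustrative, not BookKeeper code). */
final class DirStateSketch {
    enum DirState { WRITABLE, ALMOST_FULL, FULL }

    /**
     * @param usage         used fraction of the disk, between 0 and 1
     * @param warnThreshold usage above this value means "almost full"
     * @param fullThreshold usage above this value means "full"
     */
    static DirState classify(float usage, float warnThreshold, float fullThreshold) {
        if (usage > fullThreshold) {
            return DirState.FULL;
        }
        if (usage > warnThreshold) {
            return DirState.ALMOST_FULL;
        }
        return DirState.WRITABLE;
    }

    /** Used fraction of the partition that holds {@code dir}. */
    static float usageOf(File dir) {
        long total = dir.getTotalSpace();
        if (total <= 0) {
            throw new IllegalArgumentException("cannot read disk space for " + dir);
        }
        return 1f - (float) dir.getUsableSpace() / (float) total;
    }
}
```

For example, `classify(usageOf(new File("/some/ledger/dir")), 0.90f, 0.95f)` would report the state of that directory for an assumed warning threshold of 90% and an assumed full threshold of 95%.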
BookKeeper has to monitor the space usage of these directories continuously, which is done by the LedgerDirsMonitor class. Below we walk through its check method; the explanatory comments are written inside the function body.
private void check(final LedgerDirsManager ldm) {
    // Multiple storage paths can be configured for Index, EntryLog, and Journal;
    // each kind has its own LedgerDirsManager.
    // First fetch the usage of the dirs managed by this one.
    final ConcurrentMap<File, Float> diskUsages = ldm.getDiskUsages();
    try {
        // Get the directories that are currently writable
        List<File> writableDirs = ldm.getWritableLedgerDirs();
        // Check all writable dirs disk space usage.
        // Loop over the currently writable dirs, check their remaining capacity,
        // and update diskUsages, handling the exceptional cases along the way:
        // 1. reading the dir fails: call back diskFailed
        // 2. remaining capacity is below the warning threshold but the dir is
        //    still writable: call back diskAlmostFull
        // 3. the dir is no longer writable: call ldm.addToFilledDirs
        for (File dir : writableDirs) {
            try {
                diskUsages.put(dir, diskChecker.checkDir(dir));
            } catch (DiskErrorException e) {
                LOG.error("Ledger directory {} failed on disk checking : ", dir, e);
                // Notify disk failure to all listeners
                for (LedgerDirsListener listener : ldm.getListeners()) {
                    listener.diskFailed(dir);
                }
            } catch (DiskWarnThresholdException e) {
                diskUsages.compute(dir, (d, prevUsage) -> {
                    if (null == prevUsage || e.getUsage() != prevUsage) {
                        LOG.warn("Ledger directory {} is almost full : usage {}", dir, e.getUsage());
                    }
                    return e.getUsage();
                });
                for (LedgerDirsListener listener : ldm.getListeners()) {
                    listener.diskAlmostFull(dir);
                }
            } catch (DiskOutOfSpaceException e) {
                diskUsages.compute(dir, (d, prevUsage) -> {
                    if (null == prevUsage || e.getUsage() != prevUsage) {
                        LOG.error("Ledger directory {} is out-of-space : usage {}", dir, e.getUsage());
                    }
                    return e.getUsage();
                });
                // Notify disk full to all listeners
                ldm.addToFilledDirs(dir);
            }
        }
        // Let's get NoWritableLedgerDirException without waiting for the next iteration
        // in case we are out of writable dirs
        // otherwise for the duration of {interval} we end up in the state where
        // bookie cannot get writable dir but considered to be writable
        // After refreshing the state of all previously writable dirs,
        // check whether any writable dir is left; if none is, this call throws.
        ldm.getWritableLedgerDirs();
    } catch (NoWritableLedgerDirException e) {
        LOG.warn("LedgerDirsMonitor check process: All ledger directories are non writable");
        boolean highPriorityWritesAllowed = true;
        try {
            // disk check can be frequent, so disable 'loggingNoWritable' to avoid log flooding.
            ldm.getDirsAboveUsableThresholdSize(minUsableSizeForHighPriorityWrites, false);
        } catch (NoWritableLedgerDirException e1) {
            highPriorityWritesAllowed = false;
        }
        // Reaching this point means no writable dir is left; call back allDisksFull
        for (LedgerDirsListener listener : ldm.getListeners()) {
            listener.allDisksFull(highPriorityWritesAllowed);
        }
    }
    List<File> fullfilledDirs = new ArrayList<File>(ldm.getFullFilledLedgerDirs());
    boolean makeWritable = ldm.hasWritableLedgerDirs();
    // When bookie is in READONLY mode, i.e there are no writableLedgerDirs:
    // - Update fullfilledDirs disk usage.
    // - If the total disk usage is below DiskLowWaterMarkUsageThreshold
    //   add fullfilledDirs back to writableLedgerDirs list if their usage is < conf.getDiskUsageThreshold.
    try {
        if (!makeWritable) {
            // Overall used-capacity ratio across all ledger dirs
            float totalDiskUsage = diskChecker.getTotalDiskUsage(ldm.getAllLedgerDirs());
            if (totalDiskUsage < conf.getDiskLowWaterMarkUsageThreshold()) {
                makeWritable = true;
            } else {
                LOG.debug(
                    "Current TotalDiskUsage: {} is greater than LWMThreshold: {}."
                        + " So not adding any filledDir to WritableDirsList",
                    totalDiskUsage, conf.getDiskLowWaterMarkUsageThreshold());
            }
        }
        // Update all full-filled disk space usage
        // A dir that was not writable may become writable again if GC has
        // removed some data; that is what is checked here.
        for (File dir : fullfilledDirs) {
            try {
                diskUsages.put(dir, diskChecker.checkDir(dir));
                if (makeWritable) {
                    ldm.addToWritableDirs(dir, true);
                }
            } catch (DiskErrorException e) {
                // Notify disk failure to all the listeners
                for (LedgerDirsListener listener : ldm.getListeners()) {
                    listener.diskFailed(dir);
                }
            } catch (DiskWarnThresholdException e) {
                diskUsages.put(dir, e.getUsage());
                // the full-filled dir become writable but still above the warn threshold
                if (makeWritable) {
                    ldm.addToWritableDirs(dir, false);
                }
            } catch (DiskOutOfSpaceException e) {
                // the full-filled dir is still full-filled
                diskUsages.put(dir, e.getUsage());
            }
        }
    } catch (IOException ioe) {
        LOG.error("Got IOException while monitoring Dirs", ioe);
        for (LedgerDirsListener listener : ldm.getListeners()) {
            listener.fatalError();
        }
    }
}
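The second half of check is the part that is easiest to misread, so here is a small standalone sketch of just that decision. When no writable dir is left, full dirs are handed back only if the total usage has dropped below the low-water mark, and even then only the dirs whose own usage is no longer above the full threshold. `RecoverySketch` and `dirsToReopen` are hypothetical names, directories are plain strings, and the thresholds are caller-supplied assumptions. None of this is BookKeeper API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Models which full dirs become writable again (illustrative, not BookKeeper code). */
final class RecoverySketch {
    /**
     * @param fullDirUsage    usage ratio of every dir currently marked full
     * @param hasWritableDirs whether any writable dir still exists
     * @param totalUsage      overall usage ratio across all dirs
     * @param lowWaterMark    total usage must drop below this when nothing is writable
     * @param fullThreshold   a dir above this usage stays full
     * @return the dirs that may be marked writable again
     */
    static List<String> dirsToReopen(Map<String, Float> fullDirUsage, boolean hasWritableDirs,
                                     float totalUsage, float lowWaterMark, float fullThreshold) {
        // With writable dirs left, full dirs are re-checked individually.
        // With none left, the low-water mark has to be satisfied first.
        boolean makeWritable = hasWritableDirs || totalUsage < lowWaterMark;
        List<String> reopened = new ArrayList<>();
        if (!makeWritable) {
            return reopened;
        }
        for (Map.Entry<String, Float> e : fullDirUsage.entrySet()) {
            if (e.getValue() <= fullThreshold) {
                reopened.add(e.getKey()); // GC freed enough space in this dir
            }
        }
        return reopened;
    }
}
```

The low-water mark acts as hysteresis: without it, a read-only bookie whose usage hovers right at the full threshold could flip between read-only and writable on every check interval.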