BlockBasedTable
RocksDB用SST文件(Sorted Sequence Table)來存儲用戶寫入的數(shù)據(jù). 文件中key是排好序的, 所以對key的查找操作可以用二分查找完成.
BlockBasedTable是SST文件的默認(rèn)格式.
SST文件結(jié)構(gòu)
<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block] (see section: "filter" Meta Block)
[meta block 2: stats block] (see section: "properties" Meta Block)
[meta block 3: compression dictionary block] (see section: "compression dictionary" Meta Block)
...
[meta block K: future extended block] (we may add more meta blocks in the future)
[metaindex block]
[index block]
[Footer] (fixed size; starts at file_size - sizeof(Footer))
<end_of_file>
- SST文件的開頭是key/value對按序排列, 分配在連續(xù)的block中.默認(rèn)的block大小為4k.
- 在data block后面是一些meta block.
- meta index block記錄了每個meta block的偏移和長度. key是meta block的名字, value的類型是 BlockHandle, 其定義如下
class BlockHandle {
public:
BlockHandle();
BlockHandle(uint64_t offset, uint64_t size);
...
private:
uint64_t offset_;
uint64_t size_;
}
- index block記錄了每個data block的索引. key是一個string, 它的值大于或等于被索引的block的最后一個key, 小于下一個data block的第一個key.
value是data block對應(yīng)的BlockHandle. - 文件的末尾寫入了一個固定長度的footer,其格式如下
metaindex_handle: char[p]; // Block handle for metaindex
index_handle: char[q]; // Block handle for index
padding: char[40-p-q]; // zeroed bytes to make fixed length
// (40==2*BlockHandle::kMaxEncodedLength)
magic: fixed64; // 0x88e241b785f4cff7 (little-endian)
類結(jié)構(gòu)分析
類職責(zé)說明
- TableBuilder
可以認(rèn)為, 一個Table就是一個SST文件, 只不過Table并不會把整個SST文件的內(nèi)容持有, 而是當(dāng)寫滿一個block, 就會flush到SST文件中.
TableBuilder就定義了構(gòu)建一個Table(SST File)的結(jié)構(gòu), 主要是Add接口, 接收調(diào)用者傳進來的kv. Finish接口在數(shù)據(jù)寫完之后, 將后續(xù)的meta block, index block等寫入在data block后面. - BlockBasedTableBuilder
實現(xiàn)了TableBuilder接口. 并定義了如何寫下block的方法, 同時實現(xiàn)了將block插入到壓縮cache中的私有方法. - TableFactory
工廠類接口, 用來創(chuàng)建指定的TableReader和TableBuilder對象. - BlockBasedTableFactory
創(chuàng)建BlockBasedTableReader和BlockBasedTableBuilder. - BlockBuilder
用來構(gòu)造Block的對象, 可復(fù)用. 當(dāng)一個block構(gòu)造完成, flush到sst文件中, 九調(diào)用Reset方法, 清空buffer和成員變量, 繼續(xù)構(gòu)造下一個Block.
類圖中, SstFileWriter只是使用TableBuilder的一個調(diào)用者, 還有其他依賴TableBuilder的調(diào)用這沒有畫出. 并不是本篇重點介紹的對象.
Block的構(gòu)造過程分析
用戶在打開DB時饵沧,傳入table相關(guān)選項蜡吧,由table_factory持有籍凝。
std::vector<ColumnFamilyDescriptor> cf_descs;
cf_descs.push_back({kDefaultColumnFamilyName, ColumnFamilyOptions()});
// initialize column families options
std::unique_ptr<CompactionFilter> compaction_filter;
compaction_filter.reset(new DummyCompactionFilter());
cf_descs[0].options.table_factory.reset(NewBlockBasedTableFactory(bbt_opts));
cf_descs[0].options.compaction_filter = compaction_filter.get();
s = DB::Open(Options(db_opt, cf_descs[0].options), kDBPath, &db);
assert(s.ok());
NewBlockBasedTableFactory函數(shù)祝峻,創(chuàng)建一個BlockBasedTableFactory對象
TableFactory* NewBlockBasedTableFactory(
const BlockBasedTableOptions& _table_options) {
return new BlockBasedTableFactory(_table_options);
}
BlockBasedTableFactory實現(xiàn)了TableFactory接口鹏秋,作為一個工廠類,對用戶提供了創(chuàng)建TableReader和TableBuilder等接口铸屉。我們以TableBuilder舉例瘸味。
TableBuilder* BlockBasedTableFactory::NewTableBuilder(
const TableBuilderOptions& table_builder_options, uint32_t column_family_id,
WritableFileWriter* file) const {
auto table_builder = new BlockBasedTableBuilder(
table_builder_options.ioptions, table_options_,
table_builder_options.internal_comparator,
table_builder_options.int_tbl_prop_collector_factories, column_family_id,
file, table_builder_options.compression_type,
table_builder_options.compression_opts,
table_builder_options.compression_dict,
table_builder_options.skip_filters,
table_builder_options.column_family_name);
return table_builder;
}
TableBuilder主要用于寫sst files,使用該接口有四個地方
- When flushing memtable to a level-0 output file, it creates a table
builder (In DBImpl::WriteLevel0Table(), by calling BuildTable())
- During compaction, it gets the builder for writing compaction output
files in DBImpl::OpenCompactionOutputFile().
- When recovering from transaction logs, it creates a table builder to
write to a level-0 output file (In DBImpl::WriteLevel0TableForRecovery,
by calling BuildTable()) - When running Repairer, it creates a table builder to convert logs to
SST files (In Repairer::ConvertLogToTable() by calling BuildTable())
類圖中SSTFileWriter對TableBuilder的引用就是其中一個場景挤庇。這里直接引用了該接口的注釋钞速。
如類圖所示,BlockBasedTableBuilder繼承了TableBuilder接口嫡秕,對外提供了Add玉工、Finish、Abandon等接口淘菩,并持有一個Rep內(nèi)部類,包裝了一些選項屠升,目標(biāo)文件指針潮改,和另外一個主要的結(jié)構(gòu)BlockBuilder。
BlockBasedTableBuilder的Add方法實現(xiàn):
- 判斷key的類型腹暖,不同類型有不同的處理方法
- 判斷當(dāng)前的key是否比上一個key大汇在,保證有序
- 判斷是否需要flush當(dāng)前的block到文件中,并清空data block
- 插入到data block中
部分實現(xiàn)代碼如下:
void BlockBasedTableBuilder::Add(const Slice& key, const Slice& value) {
Rep* r = rep_;
...
if (r->props.num_entries > 0) {
assert(r->internal_comparator.Compare(key, Slice(r->last_key)) > 0);
}
...
auto should_flush = r->flush_block_policy->Update(key, value);
if (should_flush) {
assert(!r->data_block.empty());
Flush();
...
r->last_key.assign(key.data(), key.size());
r->data_block.Add(key, value);
r->props.num_entries++;
r->props.raw_key_size += key.size();
...
r->data_block
成員類型為BlockBuilder脏答,這一層就是"BlockBasedTable"的底層糕殉,即負(fù)責(zé)構(gòu)造block的類。
- 前綴壓縮:BlockBuilder使用前綴壓縮來保存數(shù)據(jù)殖告,以節(jié)省空間阿蝶。
- restart point:并不是所有的kv都使用前綴壓縮,而是有一個分界點黄绩,每當(dāng)使用前綴壓縮保存了K個key羡洁,下一個kv就不適用前綴壓縮,而是保存整個key爽丹,它的offset就是一個restart point筑煮。restart point保存在一個數(shù)組中,寫在block的尾部粤蝎,用來做二分查找真仲。
- value的保存緊跟在對應(yīng)的key的后面。
一個典型的block的結(jié)構(gòu)如下:
| shared_bytes (varint32) | unshared_bytes(varint32) | value_length(varint32) | key_delta(unshared_bytes) 差異部分key | value(char[value_length]) |
... // n個上面的結(jié)構(gòu)
| restarts(uint32[num_restarts]) | num_restarts(uint32) | // block尾部
// 當(dāng)shared_bytes = 0 時初澎,代表一個restart point秸应。
BlockBuilder持有一個成員last_key_
,保存上一個Add的key,用來與當(dāng)前的key計算相同的前綴長度灸眼。
BlockBuilder的Add邏輯如下:
- 判斷是否需要restart point
- 如果不需要restart point卧檐,將當(dāng)前插入的key與前一個key比較前綴,得到可以壓縮的前綴長度焰宣。
- 得到所有需要的數(shù)據(jù)后霉囚,按照一個entry的格式,append到buffer中匕积。
計算相同前綴的長度為Slice的一個成員方法盈罐,Slice是key和value的類型,是對string的一層封裝闪唆。實現(xiàn)如下
inline size_t Slice::difference_offset(const Slice& b) const {
size_t off = 0;
const size_t len = (size_ < b.size_) ? size_ : b.size_;
for (; off < len; off++) {
if (data_[off] != b.data_[off]) break;
}
return off;
}
Add方法的主要邏輯實現(xiàn)如下
void BlockBuilder::Add(const Slice& key, const Slice& value) {
...
size_t shared = 0; // number of bytes shared with prev key
if (counter_ >= block_restart_interval_) {
// Restart compression
restarts_.push_back(static_cast<uint32_t>(buffer_.size()));
estimate_ += sizeof(uint32_t);
counter_ = 0;
if (use_delta_encoding_) {
// Update state
last_key_.assign(key.data(), key.size());
}
} else if (use_delta_encoding_) {
shared = key.difference_offset(last_key_piece);
...
last_key_.assign(key.data(), key.size());
}
const size_t non_shared = key.size() - shared;
const size_t curr_size = buffer_.size();
// Add "<shared><non_shared><value_size><key_delta><value>" to buffer_
PutVarint32Varint32Varint32(&buffer_, static_cast<uint32_t>(shared),
static_cast<uint32_t>(non_shared),
static_cast<uint32_t>(value.size()));
// Add string delta to buffer_ followed by value
buffer_.append(key.data() + shared, non_shared);
buffer_.append(value.data(), value.size());
}
當(dāng)向一個block中加入了若干個kv盅粪,由r->flush_block_policy
來決定是否調(diào)用BlockBasedTableBuilder的Flush
方法將當(dāng)前的block寫入到文件中,并清空block悄蕾,重新再用票顾。
默認(rèn)的fush block策略定義在FlushBlockBySizePolicy
中,即根據(jù)block的已寫入大小來決定刷盤帆调,接口為Update
奠骄,默認(rèn)的block大小為4K。
virtual bool Update(const Slice& key,
const Slice& value) override {
// it makes no sense to flush when the data block is empty
if (data_block_builder_.empty()) {
return false;
}
auto curr_size = data_block_builder_.CurrentSizeEstimate();
// Do flush if one of the below two conditions is true:
// 1) if the current estimated size already exceeds the block size,
// 2) block_size_deviation is set and the estimated size after appending
// the kv will exceed the block size and the current size is under the
// the deviation.
return curr_size >= block_size_ || BlockAlmostFull(key, value);
}
當(dāng)需要flush block時番刊,調(diào)用Flush
方法含鳞,F(xiàn)lush方法調(diào)用了WriteBlock方法。
void BlockBasedTableBuilder::Flush() {
...
WriteBlock(&r->data_block, &r->pending_handle, true /* is_data_block */);
...
}
WriteBlock會首先調(diào)用data_block的Finish()
方法芹务,將start points append到buffer_中蝉绷,設(shè)置block的標(biāo)志位finished_ = true
WriteBlock(block->Finish(), handle, is_data_block);
Slice BlockBuilder::Finish() {
// Append restart array
for (size_t i = 0; i < restarts_.size(); i++) {
PutFixed32(&buffer_, restarts_[i]);
}
PutFixed32(&buffer_, static_cast<uint32_t>(restarts_.size()));
finished_ = true;
return Slice(buffer_);
}
然后會按下面的格式將一個block寫入目標(biāo)文件中
| block_data(uint8[n]) | type(uint8) | crc(uint32)|
- 對block中的數(shù)據(jù)進行壓縮,如果打開了校驗功能枣抱,這時會立刻進行一次解壓熔吗,和原始數(shù)據(jù)進行一次比較,查看壓縮算法是否出錯沃但。
- 將壓縮(或者由于data太大而沒有壓縮)的數(shù)據(jù)寫入到文件中磁滚。
- 計算checksum
- 插入到壓縮cache中
部分實現(xiàn)如下
void BlockBasedTableBuilder::WriteBlock(const Slice& raw_block_contents,
BlockHandle* handle,
bool is_data_block) {
...
block_contents = CompressBlock(raw_block_contents, r->compression_opts,
&type, r->table_options.format_version,
compression_dict, &r->compressed_output);
...
WriteRawBlock(block_contents, type, handle);
}
void BlockBasedTableBuilder::WriteRawBlock(const Slice& block_contents,
CompressionType type,
BlockHandle* handle) {
...
r->status = r->file->Append(block_contents);
...
auto crc = crc32c::Value(block_contents.data(), block_contents.size());
crc = crc32c::Extend(crc, trailer, 1); // Extend to cover block type
EncodeFixed32(trailer_without_type, crc32c::Mask(crc));
...
r->status = r->file->Append(Slice(trailer, kBlockTrailerSize));
if (r->status.ok()) {
r->status = InsertBlockInCache(block_contents, type, handle);
}
...
}
在data block寫完之后, 會在BlockBasedTableBuilder
的Finish方法中,將后續(xù)的meta blocks, meta index block, index block和footer等寫入到文件中.
以上就是block的構(gòu)建過程.
小結(jié)
通過對BlockBasedTable的類圖分析, 和block構(gòu)建過程, 對RocksDB的數(shù)據(jù)存儲方式有了一個清楚的認(rèn)識. 其中一些meta blocks和TableReader等更多的內(nèi)容隨著我們對RocksDB更多的分析, 也會逐個分析到.
參考資料:
https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format