9.1 基本原理
版本信息有什么用?先來簡要說明三個類的具體用途:
- Version:代表了某一時刻的數(shù)據(jù)庫版本信息儿礼,版本信息的主要內(nèi)容是當前各個Level的SSTable數(shù)據(jù)文件列表莺禁。
- VersionSet:維護了一份Version列表石洗,包含當前Alive的所有Version信息循诉,列表中第一個代表數(shù)據(jù)庫的當前版本贤斜。
-
VersionEdit:表示Version之間的變化拷淘,相當于delta 增量各墨,表示有增加了多少文件,刪除了文件启涯。Version0 +VersionEdit-->Version1贬堵。VersionEdit會保存到MANIFEST文件中,當做數(shù)據(jù)恢復時就會從MANIFEST文件中讀出來重建數(shù)據(jù)结洼。那么黎做,何時會觸發(fā)版本變遷呢?Compaction松忍。
有了上面的描述蒸殿,再來看版本信息到底有什么用呢?
如果還不能給出答案鸣峭,將上述三個類當作一個整體宏所,再來看Version類組到底包含了哪些信息:
-
運行信息
1.1 運行期各種遞增ID值:log number(log編號)、next file number(下一個文件編號)摊溶、last sequence(單條write操作遞增該編號爬骤,可認為是版本號)、prev log number(目前已棄用)莫换。
1.2 比較器名稱 -
數(shù)據(jù)庫元信息
2.1 各Level的SSTable文件列表
2.2 SSTable緩存 -
Compaction信息
3.1 Compaction Pointer
3.2 通過Seek觸發(fā)Compaction信息(文件名霞玄、Level);通過Compaction觸發(fā)Compaction信息(score浓镜、level)
關于版本信息到底有什么用這個話題暫時先放一放溃列,來看具體類劲厌。
9.2 VersionSet
VersionSet維護了一份Version列表膛薛,包含當前Alive的所有Version信息,列表中第一個代表數(shù)據(jù)庫的當前版本补鼻。
VersionSet類只有一個實例哄啄,在DBImpl(數(shù)據(jù)庫實現(xiàn)類)類中,維護所有活動的Version對象风范,來看VersionSet的所有語境咨跌。
9.2.1 數(shù)據(jù)庫啟動時
通過Current文件加載Manifset文件,讀取Manifest文件完成版本信息恢復硼婿。
Status VersionSet::Recover(bool *save_manifest)
{
......
//從current file中讀取mainfest文件名
// Read "CURRENT" file, which contains a pointer to the current manifest file
std::string current;
Status s = ReadFileToString(env_, CurrentFileName(dbname_), ¤t);
if (!s.ok())
{
return s;
}
if (current.empty() || current[current.size() - 1] != '\n')
{
return Status::Corruption("CURRENT file does not end with newline");
}
current.resize(current.size() - 1);
//打開mainfest
std::string dscname = dbname_ + "/" + current;
SequentialFile *file;
s = env_->NewSequentialFile(dscname, &file);
if (!s.ok())
{
return s;
}
bool have_log_number = false;
bool have_prev_log_number = false;
bool have_next_file = false;
bool have_last_sequence = false;
uint64_t next_file = 0;
uint64_t last_sequence = 0;
uint64_t log_number = 0;
uint64_t prev_log_number = 0;
Builder builder(this, current_);
{
LogReporter reporter;
reporter.status = &s;
log::Reader reader(file, &reporter, true /*checksum*/, 0 /*initial_offset*/);
Slice record;
std::string scratch;
//依次讀取manifest中的VersionEdit信息锌半,構建VersionSet
while (reader.ReadRecord(&record, &scratch) && s.ok())
{
VersionEdit edit;
s = edit.DecodeFrom(record);
if (s.ok())
{
//Comparator不一致時,返回錯誤信息
if (edit.has_comparator_ &&
edit.comparator_ != icmp_.user_comparator()->Name())
{
s = Status::InvalidArgument(
edit.comparator_ + " does not match existing comparator ",
icmp_.user_comparator()->Name());
//實際上寇漫,這里可以直接break
}
}
//構建當前Version
if (s.ok())
{
builder.Apply(&edit);
}
......
}
}
delete file;
file = NULL;
......
if (s.ok())
{
Version *v = new Version(this);
builder.SaveTo(v);
//計算下次執(zhí)行壓縮的Level
// Install recovered version
Finalize(v);
AppendVersion(v);
manifest_file_number_ = next_file;
next_file_number_ = next_file + 1;
last_sequence_ = last_sequence;
log_number_ = log_number;
prev_log_number_ = prev_log_number;
// See if we can reuse the existing MANIFEST file.
if (ReuseManifest(dscname, current))
{
// No need to save new manifest
}
else
{
*save_manifest = true;
}
}
return s;
}
Recover通過Manifest恢復VersionSet及Current Version信息刊殉,恢復完畢后Alive的Version列表中僅包含當Current Version對象殉摔。
9.2.2 Compaction時
Compaction(壓縮)應該是LevelDB中最為復雜的功能,它需要Version類組的深度介入记焊。來看VersionSet中所有和Compaction相關的接口聲明逸月。
// Apply *edit to the current version to form a new descriptor that
// is both saved to persistent state and installed as the new
// current version. Will release *mu while actually writing to the file.
// REQUIRES: *mu is held on entry.
// REQUIRES: no other thread concurrently calls LogAndApply()
Status LogAndApply(VersionEdit* edit, port::Mutex* mu);
// Pick level and inputs for a new compaction.
// Returns NULL if there is no compaction to be done.
// Otherwise returns a pointer to a heap-allocated object that
// describes the compaction. Caller should delete the result.
Compaction* PickCompaction();
// Return a compaction object for compacting the range [begin,end] in
// the specified level. Returns NULL if there is nothing in that
// level that overlaps the specified range. Caller should delete
// the result.
Compaction* CompactRange(
int level,
const InternalKey* begin,
const InternalKey* end);
// Create an iterator that reads over the compaction inputs for "*c".
// The caller should delete the iterator when no longer needed.
Iterator* MakeInputIterator(Compaction* c);
// Returns true iff some level needs a compaction.
bool NeedsCompaction() const {
Version* v = current_;
return (v->compaction_score_ >= 1) || (v->file_to_compact_ != NULL);
}
// Add all files listed in any live version to *live.
// May also mutate some internal state.
void AddLiveFiles(std::set<uint64_t>* live);
數(shù)據(jù)庫的讀、寫操作都可能觸發(fā)Compaction遍膜,通過調(diào)用NeedCompaction判定是否需要執(zhí)行Compaction碗硬,如需Compaction則調(diào)用PickCompaction獲取Compactoin信息。
其他幾個方法也和Compaction操作相關瓢颅,其中LogAndApply非常重要恩尾,它將VersionEdit應用于Current Version、VersoinEdit持久化到Manifest文件挽懦、將新的Version做為Current Version特笋。
Status VersionSet::LogAndApply(VersionEdit *edit, port::Mutex *mu)
{
if (edit->has_log_number_)
{
assert(edit->log_number_ >= log_number_);
assert(edit->log_number_ < next_file_number_);
}
else
{
edit->SetLogNumber(log_number_);
}
if (!edit->has_prev_log_number_)
{
edit->SetPrevLogNumber(prev_log_number_);
}
edit->SetNextFile(next_file_number_);
edit->SetLastSequence(last_sequence_);
//1. New Version = Current Version + VersionEdit
Version *v = new Version(this);
{
Builder builder(this, current_);
builder.Apply(edit);
builder.SaveTo(v);
}
//2. 重新計算Compaction Level\Compaction Score
Finalize(v);
//3. 打開數(shù)據(jù)庫時,創(chuàng)建新的Manifest并保存當前版本信息
// Initialize new descriptor log file if necessary by creating
// a temporary file that contains a snapshot of the current version.
std::string new_manifest_file;
Status s;
if (descriptor_log_ == NULL)
{
// No reason to unlock *mu here since we only hit this path in the
// first call to LogAndApply (when opening the database).
assert(descriptor_file_ == NULL);
new_manifest_file = DescriptorFileName(dbname_, manifest_file_number_);
edit->SetNextFile(next_file_number_);
s = env_->NewWritableFile(new_manifest_file, &descriptor_file_);
if (s.ok())
{
descriptor_log_ = new log::Writer(descriptor_file_);
//當前版本信息
s = WriteSnapshot(descriptor_log_);
}
}
//4. 保存增量信息巾兆,即VersionEdit信息
// Unlock during expensive MANIFEST log write
{
mu->Unlock();
// Write new record to MANIFEST log
if (s.ok())
{
std::string record;
edit->EncodeTo(&record);
s = descriptor_log_->AddRecord(record);
if (s.ok())
{
s = descriptor_file_->Sync();
}
if (!s.ok())
{
Log(options_->info_log, "MANIFEST write: %s\n", s.ToString().c_str());
}
}
// If we just created a new descriptor file, install it by writing a
// new CURRENT file that points to it.
if (s.ok() && !new_manifest_file.empty())
{
s = SetCurrentFile(env_, dbname_, manifest_file_number_);
}
mu->Lock();
}
//5. 將新的版本添加到Alive版本列表猎物,并將其做為Current Version
// Install the new version
if (s.ok())
{
AppendVersion(v);
log_number_ = edit->log_number_;
prev_log_number_ = edit->prev_log_number_;
}
else
{
delete v;
if (!new_manifest_file.empty())
{
delete descriptor_log_;
delete descriptor_file_;
descriptor_log_ = NULL;
descriptor_file_ = NULL;
env_->DeleteFile(new_manifest_file);
}
}
return s;
}
9.2.3 讀取數(shù)據(jù)時
LevelDB通過VersionSet中的TableCache對象完成數(shù)據(jù)讀取。
TableCache是SSTable的緩存類角塑,NewIterator方法通過傳入指定的文件編號返回該文件的Iterator供外部使用蔫磨。
class TableCache {
public:
TableCache(const std::string& dbname, const Options* options, int entries);
~TableCache();
// Return an iterator for the specified file number (the corresponding
// file length must be exactly "file_size" bytes). If "tableptr" is
// non-NULL, also sets "*tableptr" to point to the Table object
// underlying the returned iterator, or NULL if no Table object underlies
// the returned iterator. The returned "*tableptr" object is owned by
// the cache and should not be deleted, and is valid for as long as the
// returned iterator is live.
Iterator* NewIterator(const ReadOptions& options,
uint64_t file_number,
uint64_t file_size,
Table** tableptr = NULL);
// Evict any entry for the specified file number
void Evict(uint64_t file_number);
private:
Env* const env_;
const std::string dbname_;
const Options* options_;
Cache* cache_;
};
緩存機制主要通過Cache對象實現(xiàn),關于Cache的備忘下節(jié)會講圃伶。
9.3 Version
Version維護了一份當前版本的SSTable的元數(shù)據(jù)堤如,其對外暴露的接口大部分也和元數(shù)據(jù)相關:
void GetOverlappingInputs(
int level,
const InternalKey* begin, // NULL means before all keys
const InternalKey* end, // NULL means after all keys
std::vector<FileMetaData*>* inputs);
// Returns true iff some file in the specified level overlaps
// some part of [*smallest_user_key,*largest_user_key].
// smallest_user_key==NULL represents a key smaller than all keys in the DB.
// largest_user_key==NULL represents a key largest than all keys in the DB.
bool OverlapInLevel(int level,
const Slice* smallest_user_key,
const Slice* largest_user_key);
// Return the level at which we should place a new memtable compaction
// result that covers the range [smallest_user_key,largest_user_key].
int PickLevelForMemTableOutput(const Slice& smallest_user_key,
const Slice& largest_user_key);
int NumFiles(int level) const { return files_[level].size(); }
還有兩個數(shù)據(jù)庫讀取操作相關的方法Get、UpdateStats窒朋,來看Get:
Status Version::Get(const ReadOptions& options,
const LookupKey& k,
std::string* value,
GetStats* stats)
{
Slice ikey = k.internal_key();
Slice user_key = k.user_key();
const Comparator* ucmp = vset_->icmp_.user_comparator();
Status s;
stats->seek_file = NULL;
stats->seek_file_level = -1;
FileMetaData* last_file_read = NULL;
int last_file_read_level = -1;
// We can search level-by-level since entries never hop across
// levels. Therefore we are guaranteed that if we find data
// in an smaller level, later levels are irrelevant.
std::vector<FileMetaData*> tmp;
FileMetaData* tmp2;
//1. 查找包含指定Key的所有文件
for (int level = 0; level < config::kNumLevels; level++) {
size_t num_files = files_[level].size();
if (num_files == 0) continue;
// Get the list of files to search in this level
FileMetaData* const* files = &files_[level][0];
if (level == 0) { //1.1 Level-0可能存在多個文件均包含該Key
// Level-0 files may overlap each other. Find all files that
// overlap user_key and process them in order from newest to oldest.
tmp.reserve(num_files);
for (uint32_t i = 0; i < num_files; i++) {
FileMetaData* f = files[i];
if (ucmp->Compare(user_key, f->smallest.user_key()) >= 0 &&
ucmp->Compare(user_key, f->largest.user_key()) <= 0) {
tmp.push_back(f);
}
}
if (tmp.empty()) continue;
std::sort(tmp.begin(), tmp.end(), NewestFirst); //將文件按更新順序排列
files = &tmp[0];
num_files = tmp.size();
}
else { //1.2 Level-0之上搀罢,一個Key只可能存在于一個文件中
// Binary search to find earliest index whose largest key >= ikey.
uint32_t index = FindFile(vset_->icmp_, files_[level], ikey);
if (index >= num_files) {
files = NULL;
num_files = 0;
}
else {
tmp2 = files[index];
if (ucmp->Compare(user_key, tmp2->smallest.user_key()) < 0) {
// All of "tmp2" is past any data for user_key
files = NULL;
num_files = 0;
}
else {
files = &tmp2;
num_files = 1;
}
}
}
//2. 遍歷所有文件,查找Key值數(shù)據(jù)侥猩。
for (uint32_t i = 0; i < num_files; ++i) {
if (last_file_read != NULL && stats->seek_file == NULL) {
// We have had more than one seek for this read. Charge the 1st file.
stats->seek_file = last_file_read;
stats->seek_file_level = last_file_read_level;
}
FileMetaData* f = files[i];
last_file_read = f;
last_file_read_level = level;
//2.1 SSTable迭代器
Iterator* iter = vset_->table_cache_->NewIterator(
options,
f->number,
f->file_size);
iter->Seek(ikey); //2.2 查找指定Key
const bool done = GetValue(iter, user_key, value, &s); //2.3 Get Value
if (!iter->status().ok()) {
s = iter->status();
delete iter;
return s;
}
else {
delete iter;
if (done) {
return s;
}
}
}
}
return Status::NotFound(Slice()); // Use an empty error message for speed
}
9.4 VersionEdit
版本建變化除運行期編號修改外榔至,最主要的是SSTable文件的增刪信息。當Compaction執(zhí)行時欺劳,必然會出現(xiàn)部分SSTable無效被移除唧取,合并生成的新SSTable被加入到數(shù)據(jù)庫中。VersionEdit提供AddFile划提、DeleteFile完成變更標識枫弟。
VersionEdit提供的另外一個主要功能接口聲明如下:
void VersionEdit::EncodeTo(std::string* dst) const {
//1. 序列化比較器
if (has_comparator_) {
PutVarint32(dst, kComparator);
PutLengthPrefixedSlice(dst, comparator_);
}
//2. 序列化運行期編號信息
if (has_log_number_) {
PutVarint32(dst, kLogNumber);
PutVarint64(dst, log_number_);
}
if (has_prev_log_number_) {
PutVarint32(dst, kPrevLogNumber);
PutVarint64(dst, prev_log_number_);
}
if (has_next_file_number_) {
PutVarint32(dst, kNextFileNumber);
PutVarint64(dst, next_file_number_);
}
if (has_last_sequence_) {
PutVarint32(dst, kLastSequence);
PutVarint64(dst, last_sequence_);
}
//3. 序列化Compact Pointer
for (size_t i = 0; i < compact_pointers_.size(); i++) {
PutVarint32(dst, kCompactPointer);
PutVarint32(dst, compact_pointers_[i].first); // level
PutLengthPrefixedSlice(dst, compact_pointers_[i].second.Encode());
}
//4. 序列化本次版本變化的SSTable文件列表
for (DeletedFileSet::const_iterator iter = deleted_files_.begin();
iter != deleted_files_.end();
++iter) {
PutVarint32(dst, kDeletedFile);
PutVarint32(dst, iter->first); // level
PutVarint64(dst, iter->second); // file number
}
for (size_t i = 0; i < new_files_.size(); i++) {
const FileMetaData& f = new_files_[i].second;
PutVarint32(dst, kNewFile);
PutVarint32(dst, new_files_[i].first); // level
PutVarint64(dst, f.number);
PutVarint64(dst, f.file_size);
PutLengthPrefixedSlice(dst, f.smallest.Encode());
PutLengthPrefixedSlice(dst, f.largest.Encode());
}
}
9.5 總結
回到最開始的問題:版本信息由什么用?
- 版本信息記錄了運行期一組編號信息鹏往,該信息被序列化到Manifest文件中淡诗,當數(shù)據(jù)庫再次打開時可恢復至上一次的運行狀態(tài)。
- 版本信息記錄了SSTable信息,包括每個文件所屬的層級韩容、大小绪爸、編號(名稱等);Version類組提供了查詢SSTable信息功能宙攻,如每層文件的列表奠货、數(shù)量;同時數(shù)據(jù)庫的Get方法中如需通過文件查找key值數(shù)據(jù)時座掘,也由Version類組完成递惋。最后,SSTable的緩存機制也有Version類組提供溢陪。
- 版本信息提供了Compaction支持萍虽。
每個LevelDB有一個Current File,Current File內(nèi)唯一的信息為:當前數(shù)據(jù)庫的Manifest文件名形真。Manifest中包含了上次運行后全部的版本信息杉编,LevelDB通過Manifest文件恢復版本信息。
LevelDB的版本信息為富語義功能組咆霜,它所包含的信息已經(jīng)大大超出了版本定義本身邓馒。如果將Version類封裝為結構體、VersionSet僅僅為Version列表蛾坯、VersionEdit也是單純的結構數(shù)據(jù)光酣,再為上述結構提供多套功能類應該更為合理。目前來看脉课,這應當算作LevelDB實現(xiàn)的一處臭味救军。
轉載請注明:【隨安居士】http://www.reibang.com/p/27e48eae656d