LevelDB Source Code Study: log

Every write must be written successfully to the log file before it is applied to the memtable. This has two main benefits:

  1. Random write I/O is turned into appends, which greatly improves disk write speed;
  2. In-memory data is not lost when the node crashes, because it can be recovered from the log.

Format

The log file contents are a sequence of 32KB blocks.  The only
exception is that the tail of the file may contain a partial block.

Each block consists of a sequence of records:
   block := record* trailer?
   record :=
    checksum: uint32    // crc32c of type and data[] ; little-endian
    length: uint16      // little-endian
    type: uint8     // One of FULL, FIRST, MIDDLE, LAST
    data: uint8[length]

A record never starts within the last six bytes of a block (since it
won't fit).  Any leftover bytes here form the trailer, which must
consist entirely of zero bytes and must be skipped by readers.

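In other words, the log file is made up of consecutive 32KB blocks, each block is made up of consecutive records, and a record has the following format:
| CRC (4 bytes) | Length (2 bytes) | Type (1 byte) | Data (Length bytes) |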
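To make the byte layout concrete, here is a minimal sketch that packs and unpacks the 7-byte header in little-endian order. The names HeaderFields, EncodeHeader and DecodeHeader are hypothetical helpers for illustration, not part of LevelDB, and no checksum is computed here; the crc field is simply stored as given.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed header: 4-byte CRC + 2-byte length + 1-byte type.
constexpr size_t kHeaderSize = 4 + 2 + 1;

// Hypothetical struct for illustration only (not a LevelDB type).
struct HeaderFields {
  uint32_t crc;     // checksum value as stored on disk
  uint16_t length;  // number of payload bytes that follow the header
  uint8_t type;     // FULL, FIRST, MIDDLE or LAST
};

// Serialize the fields into buf[0..6], least significant byte first.
inline void EncodeHeader(const HeaderFields& h, unsigned char* buf) {
  buf[0] = static_cast<unsigned char>(h.crc & 0xff);
  buf[1] = static_cast<unsigned char>((h.crc >> 8) & 0xff);
  buf[2] = static_cast<unsigned char>((h.crc >> 16) & 0xff);
  buf[3] = static_cast<unsigned char>((h.crc >> 24) & 0xff);
  buf[4] = static_cast<unsigned char>(h.length & 0xff);
  buf[5] = static_cast<unsigned char>(h.length >> 8);
  buf[6] = h.type;
}

// Parse the fields back out of a 7-byte header.
inline HeaderFields DecodeHeader(const unsigned char* buf) {
  HeaderFields h;
  h.crc = static_cast<uint32_t>(buf[0]) |
          (static_cast<uint32_t>(buf[1]) << 8) |
          (static_cast<uint32_t>(buf[2]) << 16) |
          (static_cast<uint32_t>(buf[3]) << 24);
  h.length = static_cast<uint16_t>(buf[4] | (buf[5] << 8));
  h.type = buf[6];
  return h;
}
```

Because the length field is only two bytes, a single physical record can carry at most 65535 payload bytes, which is comfortably more than a 32KB block can hold.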

There are four record types:

  • FULL: the log record contains a complete user record;
  • FIRST: the log record is the first fragment of a user record;
  • MIDDLE: the log record is an interior fragment of a user record;
  • LAST: the log record is the last fragment of a user record.

Here is an example from the documentation:

Example: consider a sequence of user records:
   A: length 1000
   B: length 97270
   C: length 8000
A will be stored as a FULL record in the first block.

B will be split into three fragments: first fragment occupies the rest
of the first block, second fragment occupies the entirety of the
second block, and the third fragment occupies a prefix of the third
block.  This will leave six bytes free in the third block, which will
be left empty as the trailer.

C will be stored as a FULL record in the fourth block.
[Figure: schematic diagram of the block layout]

Note: a log record is at least 7 bytes long (the header alone). If fewer than 7 bytes remain in a block, they are filled with zero bytes. A log record of exactly 7 bytes carries no user data.
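The A/B/C example can be checked with a small layout calculation. The sketch below only tracks sizes and does not write any bytes; PlanFragments and Fragment are hypothetical names introduced for illustration, and the splitting rule follows the format description quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum RecordType { kFullType = 1, kFirstType = 2, kMiddleType = 3, kLastType = 4 };

constexpr size_t kBlockSize = 32768;
constexpr size_t kHeaderSize = 7;

// Hypothetical description of one physical record (illustration only).
struct Fragment {
  size_t record;    // index of the user record this fragment belongs to
  RecordType type;  // FULL / FIRST / MIDDLE / LAST
  size_t block;     // index of the block that holds the fragment
  size_t length;    // payload bytes in the fragment
};

// Compute how a sequence of user records would be split across blocks.
inline std::vector<Fragment> PlanFragments(
    const std::vector<size_t>& record_lengths) {
  std::vector<Fragment> out;
  size_t block = 0;
  size_t block_offset = 0;
  for (size_t r = 0; r < record_lengths.size(); ++r) {
    size_t left = record_lengths[r];
    bool begin = true;
    do {
      // Fewer than 7 bytes left: they become the trailer, start a new block.
      if (kBlockSize - block_offset < kHeaderSize) {
        ++block;
        block_offset = 0;
      }
      const size_t avail = kBlockSize - block_offset - kHeaderSize;
      const size_t fragment_length = left < avail ? left : avail;
      const bool end = (left == fragment_length);
      const RecordType type = (begin && end) ? kFullType
                              : begin        ? kFirstType
                              : end          ? kLastType
                                             : kMiddleType;
      out.push_back({r, type, block, fragment_length});
      block_offset += kHeaderSize + fragment_length;
      left -= fragment_length;
      begin = false;
    } while (left > 0);
  }
  return out;
}
```

For lengths 1000, 97270 and 8000 this yields five fragments. B is split into 31754, 32761 and 32755 bytes, so the third block ends at offset 7 + 32755 = 32762, leaving the six-byte trailer mentioned in the documentation.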

log_format.h

namespace leveldb {
namespace log {
enum RecordType {
  // Zero is reserved for preallocated files
  kZeroType = 0,
  kFullType = 1,
  // For fragments
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4
};
static const int kMaxRecordType = kLastType;
static const int kBlockSize = 32768;
// Header is checksum (4 bytes), length (2 bytes), type (1 byte).
static const int kHeaderSize = 4 + 2 + 1;
}  // namespace log
}  // namespace leveldb

Writer

First, the Writer class:

class Writer {
 public:
  // Create a writer that will append data to "*dest".
  // "*dest" must be initially empty.
  // "*dest" must remain live while this Writer is in use.
  explicit Writer(WritableFile* dest);

  // Create a writer that will append data to "*dest".
  // "*dest" must have initial length "dest_length".
  // "*dest" must remain live while this Writer is in use.
  Writer(WritableFile* dest, uint64_t dest_length);

  ~Writer();

  Status AddRecord(const Slice& slice);

 private:
  WritableFile* dest_;
  int block_offset_;       // Current offset in block

  // crc32c values for all supported record types.  These are
  // pre-computed to reduce the overhead of computing the crc of the
  // record type stored in the header.
  uint32_t type_crc_[kMaxRecordType + 1];

  Status EmitPhysicalRecord(RecordType type, const char* ptr, size_t length);

  // No copying allowed
  Writer(const Writer&);
  void operator=(const Writer&);
};

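The class is simple. Its only public interface is AddRecord, which takes a Slice. Because the set of RecordType values is fixed, the member array type_crc_ stores the CRC32C of each RecordType byte, precomputed for efficiency.
The implementation of AddRecord is analyzed next.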
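The idea behind type_crc_ can be sketched as follows. This uses a plain bitwise CRC32C written for illustration; it is not LevelDB's optimized crc32c implementation, and the names Crc32cExtend, Crc32cValue and TypeCrcTable are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr int kMaxRecordType = 4;

// Bitwise CRC32C (Castagnoli polynomial, reflected form 0x82F63B78).
// Continues a CRC previously returned by this function over more bytes.
inline uint32_t Crc32cExtend(uint32_t crc, const char* data, size_t n) {
  crc = ~crc;
  for (size_t i = 0; i < n; i++) {
    crc ^= static_cast<uint8_t>(data[i]);
    for (int k = 0; k < 8; k++) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// CRC32C of a whole buffer.
inline uint32_t Crc32cValue(const char* data, size_t n) {
  return Crc32cExtend(0, data, n);
}

// Hypothetical stand-in for the Writer's type_crc_ member: the CRC of each
// one-byte record type, computed once up front.
struct TypeCrcTable {
  uint32_t type_crc[kMaxRecordType + 1];
  TypeCrcTable() {
    for (int i = 0; i <= kMaxRecordType; i++) {
      const char t = static_cast<char>(i);
      type_crc[i] = Crc32cValue(&t, 1);
    }
  }
};
```

The benefit is that the checksum of "type byte followed by payload" can be obtained by extending the cached value over the payload alone, so the type byte never has to be copied in front of the data.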

AddRecord

1. First, take the data pointer and length from the Slice, and initialize begin = true to mark the start of a record.

const char* ptr = slice.data();
size_t left = slice.size();
bool begin = true;

2. Then enter a do {} while loop that runs until a write fails or all of the data has been written.

  • Check whether the space remaining in the current block is smaller than kHeaderSize. If so, pad the remainder with zeros and reset the block offset.
const int leftover = kBlockSize - block_offset_;
    assert(leftover >= 0);
    if (leftover < kHeaderSize) {
      // Switch to a new block
      if (leftover > 0) {
        // Fill the trailer (literal below relies on kHeaderSize being 7)
        assert(kHeaderSize == 7);
        dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
      }
      block_offset_ = 0;
    }
  • Compute the space available in the block and the length that can be written in this iteration
const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
const size_t fragment_length = (left < avail) ? left : avail;
  • Determine the record type
RecordType type;
    const bool end = (left == fragment_length);
    if (begin && end) {
      type = kFullType;
    } else if (begin) {
      type = kFirstType;
    } else if (end) {
      type = kLastType;
    } else {
      type = kMiddleType;
    }
  • Call EmitPhysicalRecord to append the record, then update the pointer, the remaining length, and the begin flag.
s = EmitPhysicalRecord(type, ptr, fragment_length);
ptr += fragment_length;
left -= fragment_length;
begin = false;

EmitPhysicalRecord

This is where the data is actually written.

  • First build the header and append it to the log (here n is the payload length and t is the record type)
  // Format the header
  char buf[kHeaderSize];
  buf[4] = static_cast<char>(n & 0xff);
  buf[5] = static_cast<char>(n >> 8);
  buf[6] = static_cast<char>(t);

  // Compute the crc of the record type and the payload.
  uint32_t crc = crc32c::Extend(type_crc_[t], ptr, n);
  crc = crc32c::Mask(crc);                 // Adjust for storage
  EncodeFixed32(buf, crc);
  // Write the header and the payload
  Status s = dest_->Append(Slice(buf, kHeaderSize));
  • Write the payload, Flush, and update the current offset within the block
if (s.ok()) {
    s = dest_->Append(Slice(ptr, n));
    if (s.ok()) {
      s = dest_->Flush();
    }
  }
  block_offset_ += kHeaderSize + n;
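The two steps above can be sketched against an in-memory std::string instead of a WritableFile. This is an illustration under stated assumptions: the CRC is a simple bitwise CRC32C rather than LevelDB's implementation, the Mask/Unmask rotation and the kMaskDelta constant are written from memory of LevelDB's util/crc32c.h and should be checked against the source, and AppendPhysicalRecord and VerifyPhysicalRecord are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr size_t kHeaderSize = 7;
constexpr uint32_t kMaskDelta = 0xa282ead8u;  // assumed to match util/crc32c.h

// Bitwise CRC32C (reflected polynomial 0x82F63B78); illustration only.
inline uint32_t Crc32cExtend(uint32_t crc, const char* data, size_t n) {
  crc = ~crc;
  for (size_t i = 0; i < n; i++) {
    crc ^= static_cast<uint8_t>(data[i]);
    for (int k = 0; k < 8; k++) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Rotate right by 15 bits and add a constant before storing the CRC.
inline uint32_t Mask(uint32_t crc) {
  return ((crc >> 15) | (crc << 17)) + kMaskDelta;
}

// Inverse of Mask.
inline uint32_t Unmask(uint32_t masked) {
  const uint32_t rot = masked - kMaskDelta;
  return (rot >> 17) | (rot << 15);
}

// Append header + payload for one physical record to *dest.
inline void AppendPhysicalRecord(std::string* dest, uint8_t type,
                                 const std::string& payload) {
  const size_t n = payload.size();
  assert(n <= 0xffff);  // length must fit in two bytes
  const char t = static_cast<char>(type);
  // CRC covers the type byte followed by the payload.
  uint32_t crc = Crc32cExtend(Crc32cExtend(0, &t, 1), payload.data(), n);
  crc = Mask(crc);
  char buf[kHeaderSize];
  for (int i = 0; i < 4; i++) {
    buf[i] = static_cast<char>((crc >> (8 * i)) & 0xff);
  }
  buf[4] = static_cast<char>(n & 0xff);
  buf[5] = static_cast<char>(n >> 8);
  buf[6] = t;
  dest->append(buf, kHeaderSize);
  dest->append(payload);
}

// Check the stored CRC of a buffer holding exactly one physical record.
inline bool VerifyPhysicalRecord(const std::string& rec) {
  if (rec.size() < kHeaderSize) return false;
  const unsigned char* h = reinterpret_cast<const unsigned char*>(rec.data());
  uint32_t stored = 0;
  for (int i = 0; i < 4; i++) stored |= static_cast<uint32_t>(h[i]) << (8 * i);
  const size_t length = h[4] | (static_cast<size_t>(h[5]) << 8);
  if (rec.size() != kHeaderSize + length) return false;
  return Unmask(stored) == Crc32cExtend(0, rec.data() + 6, 1 + length);
}
```

Note that the checksum starts at byte 6 of the header (the type byte) and runs through the payload, which is the same range ReadPhysicalRecord checks with header + 6 and 1 + length.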

Reader

First, the member variables of the Reader class (most are commented, so they need no detailed explanation):

  SequentialFile* const file_;
  Reporter* const reporter_;
  bool const checksum_;
  char* const backing_store_;
  Slice buffer_;
  bool eof_;   // Last Read() indicated EOF by returning < kBlockSize

  // Offset of the last record returned by ReadRecord.
  uint64_t last_record_offset_;
  // Offset of the first location past the end of buffer_.
  uint64_t end_of_buffer_offset_;

  // Offset at which to start looking for the first record to return
  uint64_t const initial_offset_;

  // True if we are resynchronizing after a seek (initial_offset_ > 0). In
  // particular, a run of kMiddleType and kLastType records can be silently
  // skipped in this mode
  bool resyncing_;
  // Extend record types with the following special values
  enum {
    kEof = kMaxRecordType + 1,
    // Returned whenever we find an invalid physical record.
    // Currently there are three situations in which this happens:
    // * The record has an invalid CRC (ReadPhysicalRecord reports a drop)
    // * The record is a 0-length record (No drop is reported)
    // * The record is below constructor's initial_offset (No drop is reported)
    kBadRecord = kMaxRecordType + 2
  };

Two classes are worth noting:

Reporter: used to report errors as they occur
SequentialFile: used to read data from the log file

The Reader class exposes two public methods, ReadRecord and LastRecordOffset. ReadRecord is the important one, and it is analyzed below:

  • Jump to the position given by the caller's initial_offset and start reading the log file from there. The jump is a direct call to SequentialFile's Skip interface. The caller-supplied initial_offset has to be adjusted first; both the adjustment and the jump are implemented in SkipToInitialBlock.
 if (last_record_offset_ < initial_offset_) {
    if (!SkipToInitialBlock()) {
      return false;
    }
  }

The logic of SkipToInitialBlock is as follows:

bool Reader::SkipToInitialBlock() {
  // Compute the offset within the block, then round down to the start of the block to read from
  size_t offset_in_block = initial_offset_ % kBlockSize;
  uint64_t block_start_location = initial_offset_ - offset_in_block;
  // If the offset falls in the trailer area at the end of the block, no complete record can start there; skip to the next block
  if (offset_in_block > kBlockSize - 6) {
    offset_in_block = 0;
    block_start_location += kBlockSize;
  }
  // Set the read offset
  end_of_buffer_offset_ = block_start_location;
  // Skip to start of first block that can contain the initial record
  if (block_start_location > 0) {
    Status skip_status = file_->Skip(block_start_location);
    if (!skip_status.ok()) {
      ReportDrop(block_start_location, skip_status);
      return false;
    }
  }
  return true;
}
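The offset adjustment can be isolated as a pure function, which makes the boundary behavior easy to check. BlockStartFor is a hypothetical name; the arithmetic copies the condition from SkipToInitialBlock as shown above and leaves out the file Skip call.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kBlockSize = 32768;

// Return the file offset of the first block that can contain a record
// starting at or after initial_offset (same arithmetic as above).
inline uint64_t BlockStartFor(uint64_t initial_offset) {
  const uint64_t offset_in_block = initial_offset % kBlockSize;
  uint64_t block_start = initial_offset - offset_in_block;
  // Offsets this close to the end of the block cannot start a record.
  if (offset_in_block > kBlockSize - 6) {
    block_start += kBlockSize;
  }
  return block_start;
}
```

With the condition written as `> kBlockSize - 6`, an in-block offset of 32762 still maps to the current block, while 32763 and above move on to the next one.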
  • Set up some state before entering the while loop
  bool in_fragmented_record = false;  // whether a kFirstType fragment has been seen
  // Record offset of the logical record that we're reading
  // 0 is a dummy value to make compilers happy
  uint64_t prospective_record_offset = 0;
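  • Enter the while (true) loop, which runs until a kLastType or kFullType record has been read or the end of the file is reached. A read error does not end the loop. The error is reported and reading continues until a user record is read successfully or the end of the file is reached.
    • A physical record is first read with ReadPhysicalRecord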
unsigned int Reader::ReadPhysicalRecord(Slice* result) {
  while (true) {
    if (buffer_.size() < kHeaderSize) {
      // If the end of the file has not been reached, clear the buffer and read more data
      if (!eof_) {
        buffer_.clear();
        Status status = file_->Read(kBlockSize, &buffer_, backing_store_);
        end_of_buffer_offset_ += buffer_.size();
        if (!status.ok()) {
          buffer_.clear();
          ReportDrop(kBlockSize, status);
          eof_ = true;
          return kEof;
        } else if (buffer_.size() < kBlockSize) {  // fewer bytes read than requested (kBlockSize): end of file reached
          eof_ = true;
        }
        continue;
      } else {
        buffer_.clear();
        return kEof;
      }
    }
    // Parse the record header
    const char* header = buffer_.data();
    const uint32_t a = static_cast<uint32_t>(header[4]) & 0xff;
    const uint32_t b = static_cast<uint32_t>(header[5]) & 0xff;
    const unsigned int type = header[6];
    const uint32_t length = a | (b << 8);
    // The length runs past the end of the buffer: report an error
    if (kHeaderSize + length > buffer_.size()) {
      size_t drop_size = buffer_.size();
      buffer_.clear();
      if (!eof_) {
        ReportCorruption(drop_size, "bad record length");
        return kBadRecord;
      }
      return kEof;
    }
    
    if (type == kZeroType && length == 0) {  // skip zero-length kZeroType records without reporting a drop
      buffer_.clear();
      return kBadRecord;
    }

    // Check the CRC
    if (checksum_) {
      uint32_t expected_crc = crc32c::Unmask(DecodeFixed32(header));
      uint32_t actual_crc = crc32c::Value(header + 6, 1 + length);
      if (actual_crc != expected_crc) {  // on a mismatch, report corruption
        size_t drop_size = buffer_.size();
        buffer_.clear();
        ReportCorruption(drop_size, "checksum mismatch");
        return kBadRecord;
      }
    }

    buffer_.remove_prefix(kHeaderSize + length);

    if (end_of_buffer_offset_ - buffer_.size() - kHeaderSize - length <
        initial_offset_) {
      result->clear();
      return kBadRecord;
    }
    *result = Slice(header + kHeaderSize, length);
    return type;
  }
}
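  • Next, examine the record that was read
    • While resyncing after a seek, a kMiddleType or kLastType record encountered at the start is clearly incomplete, so it is simply discarded and reading continues with the records that follow.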
if (resyncing_) {
      if (record_type == kMiddleType) {
        continue;
      } else if (record_type == kLastType) {
        resyncing_ = false;
        continue;
      } else {
        resyncing_ = false;
      }
    }
  • Take the appropriate action for each type (the logic is straightforward)
switch (record_type) {
      case kFullType:
        if (in_fragmented_record) {
          if (scratch->empty()) {
            in_fragmented_record = false;
          } else {
            ReportCorruption(scratch->size(), "partial record without end(1)");
          }
        }
        prospective_record_offset = physical_record_offset;
        scratch->clear();
        *record = fragment;
        last_record_offset_ = prospective_record_offset;
        return true;

      case kFirstType:
        if (in_fragmented_record) {
          if (scratch->empty()) {
            in_fragmented_record = false;
          } else {
            ReportCorruption(scratch->size(), "partial record without end(2)");
          }
        }
        prospective_record_offset = physical_record_offset;
        scratch->assign(fragment.data(), fragment.size());
        in_fragmented_record = true;
        break;

      case kMiddleType:
        if (!in_fragmented_record) {
          ReportCorruption(fragment.size(),
                           "missing start of fragmented record(1)");
        } else {
          scratch->append(fragment.data(), fragment.size());
        }
        break;

      case kLastType:
        if (!in_fragmented_record) {
          ReportCorruption(fragment.size(),
                           "missing start of fragmented record(2)");
        } else {
          scratch->append(fragment.data(), fragment.size());
          *record = Slice(*scratch);
          last_record_offset_ = prospective_record_offset;
          return true;
        }
        break;

      case kEof:
        if (in_fragmented_record) {
          scratch->clear();
        }
        return false;

      case kBadRecord:
        if (in_fragmented_record) {
          ReportCorruption(scratch->size(), "error in middle of record");
          in_fragmented_record = false;
          scratch->clear();
        }
        break;

      default: {
        char buf[40];
        snprintf(buf, sizeof(buf), "unknown record type %u", record_type);
        ReportCorruption(
            (fragment.size() + (in_fragmented_record ? scratch->size() : 0)),
            buf);
        in_fragmented_record = false;
        scratch->clear();
        break;
      }
    }
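The core of the switch above is a small state machine that glues fragments back into user records. The sketch below is a simplified illustration, not the Reader's actual code: it works on an in-memory list of already-parsed fragments, ignores kEof, kBadRecord and resyncing, and counts dropped fragments instead of calling ReportCorruption. Reassemble and PhysicalRecord are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum RecordType { kFullType = 1, kFirstType = 2, kMiddleType = 3, kLastType = 4 };

// One parsed physical record: its type and its payload.
using PhysicalRecord = std::pair<RecordType, std::string>;

// Rebuild user records from fragments. *dropped counts the places where the
// real Reader would report corruption.
inline std::vector<std::string> Reassemble(
    const std::vector<PhysicalRecord>& input, size_t* dropped) {
  std::vector<std::string> out;
  std::string scratch;  // accumulates FIRST/MIDDLE/LAST payloads
  bool in_fragmented_record = false;
  *dropped = 0;
  for (const auto& rec : input) {
    switch (rec.first) {
      case kFullType:
        if (in_fragmented_record) ++*dropped;  // partial record without end
        in_fragmented_record = false;
        scratch.clear();
        out.push_back(rec.second);
        break;
      case kFirstType:
        if (in_fragmented_record) ++*dropped;  // partial record without end
        scratch = rec.second;
        in_fragmented_record = true;
        break;
      case kMiddleType:
        if (!in_fragmented_record) {
          ++*dropped;  // missing start of fragmented record
        } else {
          scratch += rec.second;
        }
        break;
      case kLastType:
        if (!in_fragmented_record) {
          ++*dropped;  // missing start of fragmented record
        } else {
          scratch += rec.second;
          out.push_back(scratch);
          scratch.clear();
          in_fragmented_record = false;
        }
        break;
    }
  }
  return out;
}
```

For example, the sequence FULL "a", FIRST "b1", MIDDLE "b2", LAST "b3", MIDDLE "x", FIRST "c1", FULL "d" produces the records "a", "b1b2b3" and "d", with two drops: the stray MIDDLE and the unfinished record that started with "c1".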