Original article: https://github.com/facebook/rocksdb/wiki/Basic-Operations
Basic operations
The <code>rocksdb</code> library provides a persistent key-value store. Keys and values are arbitrary byte arrays. The keys are ordered within the key-value store according to a user-specified comparator function.
Opening A Database
A <code>rocksdb</code> database has a name which corresponds to a file system directory. All of the contents of the database are stored in this directory. The following example shows how to open a database, creating it if necessary:
#include <cassert>
#include "rocksdb/db.h"
rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;
rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db);
assert(status.ok());
...
If you want to raise an error if the database already exists, add the following line before the <code>rocksdb::DB::Open</code> call:
options.error_if_exists = true;
If you are porting code from <code>leveldb</code> to <code>rocksdb</code>, you can convert your <code>leveldb::Options</code> object to a <code>rocksdb::Options</code> object using <code>rocksdb::LevelDBOptions</code>, which has the same functionality as <code>leveldb::Options</code>:
#include "rocksdb/utilities/leveldb_options.h"
rocksdb::LevelDBOptions leveldb_options;
leveldb_options.option1 = value1;
leveldb_options.option2 = value2;
...
rocksdb::Options options = rocksdb::ConvertOptions(leveldb_options);
RocksDB Options
Users can choose to always set option fields explicitly in code, as shown above. Alternatively, you can set them through a string-to-string map or an option string. See [[Option String and Option Map]].
Some options can be changed dynamically while the DB is running. For example:
rocksdb::Status s;
s = db->SetOptions({{"write_buffer_size", "131072"}});
assert(s.ok());
s = db->SetDBOptions({{"max_background_flushes", "2"}});
assert(s.ok());
RocksDB automatically keeps options used in the database in OPTIONS-xxxx files under the DB directory. Users can choose to preserve the option values after DB restart by extracting options from these option files. See [[RocksDB Options File]].
Status
You may have noticed the <code>rocksdb::Status</code> type above. Values of this type are returned by most functions in <code>rocksdb</code> that may encounter an error. You can check if such a result is ok, and also print an associated error message:
rocksdb::Status s = ...;
if (!s.ok()) std::cerr << s.ToString() << std::endl;
Closing A Database
When you are done with a database, there are two ways to close it gracefully:
- Simply delete the database object. This will release all the resources that were held while the database was open. However, if any error is encountered when releasing any of the resources, for example an error when closing the info_log file, that error will be lost.
- Call <code>DB::Close()</code>, followed by deleting the database object. <code>DB::Close()</code> returns a <code>Status</code>, which can be examined to determine if there were any errors. Regardless of errors, <code>DB::Close()</code> will release all resources and is irreversible.
Example:
... open the db as described above ...
... do something with db ...
delete db;
Or
... open the db as described above ...
... do something with db ...
Status s = db->Close();
... log status ...
delete db;
Reads
The database provides <code>Put</code>, <code>Delete</code>, <code>Get</code>, and <code>MultiGet</code> methods to modify/query the database. For example, the following code moves the value stored under key1 to key2.
std::string value;
rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key1, &value);
if (s.ok()) s = db->Put(rocksdb::WriteOptions(), key2, value);
if (s.ok()) s = db->Delete(rocksdb::WriteOptions(), key1);
Right now, value size must be smaller than 4GB.
RocksDB also allows [[Single Delete]] which is useful in some special cases.
Each <code>Get</code> results in at least one memcpy from the source to the value string. If the source is in the block cache, you can avoid the extra copy by using a <code>PinnableSlice</code>.
rocksdb::PinnableSlice pinnable_val;
rocksdb::Status s = db->Get(rocksdb::ReadOptions(), db->DefaultColumnFamily(), key1, &pinnable_val);
The source will be released once <code>pinnable_val</code> is destructed or <code>Reset()</code> is invoked on it. Read more in the PinnableSlice blog post: http://rocksdb.org/blog/2017/08/24/pinnableslice.html
When reading multiple keys from the database, <code>MultiGet</code> can be used. There are two variations of <code>MultiGet</code>: (1) read multiple keys from a single column family in a more performant manner, i.e., it can be faster than calling <code>Get</code> in a loop; and (2) read keys across multiple column families consistent with each other.
For example,
std::vector<Slice> keys;
std::vector<PinnableSlice> values;
std::vector<Status> statuses;
for ... {
keys.emplace_back(key);
}
values.resize(keys.size());
statuses.resize(keys.size());
db->MultiGet(ReadOptions(), cf, keys.size(), keys.data(), values.data(), statuses.data());
To avoid the overhead of memory allocations, the <code>keys</code>, <code>values</code>, and <code>statuses</code> above can be of type <code>std::array</code> on the stack, or any other type that provides contiguous storage.
Or
std::vector<ColumnFamilyHandle*> column_families;
std::vector<Slice> keys;
std::vector<std::string> values;
for ... {
keys.emplace_back(key);
column_families.emplace_back(column_family);
}
values.resize(keys.size());
std::vector<Status> statuses = db->MultiGet(ReadOptions(), column_families, keys, &values);
For a more in-depth discussion of performance benefits of using MultiGet, see [[MultiGet Performance]].
Writes
Atomic Updates
Note that if the process dies after the Put of key2 but before the delete of key1, the same value may be left stored under multiple keys. Such problems can be avoided by using the <code>WriteBatch</code> class to atomically apply a set of updates:
#include "rocksdb/write_batch.h"
...
std::string value;
rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key1, &value);
if (s.ok()) {
rocksdb::WriteBatch batch;
batch.Delete(key1);
batch.Put(key2, value);
s = db->Write(rocksdb::WriteOptions(), &batch);
}
The <code>WriteBatch</code> holds a sequence of edits to be made to the database, and these edits within the batch are applied in order. Note that we called <code>Delete</code> before <code>Put</code> so that if <code>key1</code> is identical to <code>key2</code>, we do not end up erroneously dropping the value entirely.
Apart from its atomicity benefits, <code>WriteBatch</code> may also be used to speed up bulk updates by placing lots of individual mutations into the same batch.
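The ordering rule described above (edits in a batch are applied in order, so <code>Delete</code> goes before <code>Put</code>) can be tried out without RocksDB. The following is a minimal stand-alone sketch, assuming C++17; <code>ToyStore</code>, <code>ToyBatch</code>, and <code>MoveValue</code> are hypothetical names for illustration and are not part of the RocksDB API. It only models in-order application of edits, not atomicity or durability.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not RocksDB types.
using ToyStore = std::map<std::string, std::string>;

class ToyBatch {
 public:
  void Put(const std::string& key, const std::string& value) {
    edits_.emplace_back(key, value);
  }
  void Delete(const std::string& key) {
    edits_.emplace_back(key, std::nullopt);
  }
  // Applies the recorded edits to the store in the order they were added.
  void ApplyTo(ToyStore& store) const {
    for (const auto& [key, value] : edits_) {
      if (value.has_value()) {
        store[key] = *value;
      } else {
        store.erase(key);
      }
    }
  }

 private:
  // nullopt marks a deletion; a present value marks a put.
  std::vector<std::pair<std::string, std::optional<std::string>>> edits_;
};

// Moves the value under `from` to `to`; returns false if `from` is absent.
bool MoveValue(ToyStore& store, const std::string& from, const std::string& to) {
  auto it = store.find(from);
  if (it == store.end()) return false;
  const std::string value = it->second;  // copy before mutating the store
  ToyBatch batch;
  batch.Delete(from);    // delete first ...
  batch.Put(to, value);  // ... so that from == to still keeps the value
  batch.ApplyTo(store);
  assert(store.count(to) == 1);  // the value must survive the move
  return true;
}
```

Swapping the two batch calls in <code>MoveValue</code> would drop the value whenever <code>from</code> and <code>to</code> are the same key, which is the failure the text warns about.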
Synchronous Writes
By default, each write to <code>rocksdb</code> is asynchronous: it returns after pushing the write from the process into the operating system. The transfer from operating system memory to the underlying persistent storage happens asynchronously. The <code>sync</code> flag can be turned on for a particular write to make the write operation not return until the data being written has been pushed all the way to persistent storage. (On Posix systems, this is implemented by calling either <code>fsync(...)</code> or <code>fdatasync(...)</code> or <code>msync(..., MS_SYNC)</code> before the write operation returns.)
rocksdb::WriteOptions write_options;
write_options.sync = true;
db->Put(write_options, ...);
Non-sync Writes
With non-sync writes, RocksDB only buffers the WAL write in the OS buffer or in an internal buffer (when <code>options.manual_wal_flush = true</code>). Non-sync writes are often much faster than synchronous writes. The downside is that a crash of the machine may cause the last few updates to be lost. Note that a crash of just the writing process (i.e., not a reboot) will not cause any loss since even when <code>sync</code> is false, an update is pushed from the process memory into the operating system before it is considered done.
Non-sync writes can often be used safely. For example, when loading a large amount of data into the database you can handle lost updates by restarting the bulk load after a crash. A hybrid scheme is also possible where <code>DB::SyncWAL()</code> is called by a separate thread.
We also provide a way to completely disable the Write Ahead Log for a particular write. If you set <code>write_options.disableWAL</code> to true, the write will not go to the log at all and may be lost in the event of a process crash.
RocksDB by default uses <code>fdatasync()</code> to sync files, which might be faster than fsync() in certain cases. If you want to use fsync(), you can set <code>Options::use_fsync</code> to true. You should set this to true on filesystems like ext3 that can lose files after a reboot.
Advanced
For more information about write performance optimizations and factors influencing performance, see [[Pipelined Write]] and [[Write Stalls]].
Concurrency
A database may only be opened by one process at a time. The <code>rocksdb</code> implementation acquires a lock from the operating system to prevent misuse. Within a single process, the same <code>rocksdb::DB</code> object may be safely shared by multiple concurrent threads. I.e., different threads may write into or fetch iterators or call <code>Get</code> on the same database without any external synchronization (the rocksdb implementation will automatically do the required synchronization). However other objects (like Iterator and WriteBatch) may require external synchronization. If two threads share such an object, they must protect access to it using their own locking protocol. More details are available in the public header files.
Merge operators
Merge operators provide efficient support for read-modify-write operations.
More on the interface and implementation can be found in:
- [[Merge Operator | Merge-Operator]]
- [[Merge Operator Implementation | Merge-Operator-Implementation]]
- Get Merge Operands
Iteration
The following example demonstrates how to print all (key, value) pairs in a database.
rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next()) {
std::cout << it->key().ToString() << ": " << it->value().ToString() << std::endl;
}
assert(it->status().ok()); // Check for any errors found during the scan
delete it;
The following variation shows how to process just the keys in the range <code>[start, limit)</code>:
for (it->Seek(start);
it->Valid() && it->key().ToString() < limit;
it->Next()) {
...
}
assert(it->status().ok()); // Check for any errors found during the scan
You can also process entries in reverse order. (Caveat: reverse iteration may be somewhat slower than forward iteration.)
for (it->SeekToLast(); it->Valid(); it->Prev()) {
...
}
assert(it->status().ok()); // Check for any errors found during the scan
This is an example of processing entries in range (limit, start] in reverse order from one specific key:
for (it->SeekForPrev(start);
it->Valid() && it->key().ToString() > limit;
it->Prev()) {
...
}
assert(it->status().ok()); // Check for any errors found during the scan
See [[SeekForPrev]].
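The two range scans above can be mimicked on an ordered <code>std::map</code>, which makes the boundary behavior easy to check. This is a minimal sketch under the assumption that <code>Seek(start)</code> positions at the first key greater than or equal to <code>start</code> and <code>SeekForPrev(start)</code> positions at the last key less than or equal to <code>start</code>; <code>ToyDb</code>, <code>KeysForward</code>, and <code>KeysBackward</code> are hypothetical names, not RocksDB functions.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an ordered key-value store; not a RocksDB type.
using ToyDb = std::map<std::string, std::string>;

// Keys in [start, limit), ascending. lower_bound plays the role of Seek(start).
std::vector<std::string> KeysForward(const ToyDb& db, const std::string& start,
                                     const std::string& limit) {
  assert(start <= limit);
  std::vector<std::string> out;
  for (auto it = db.lower_bound(start); it != db.end() && it->first < limit; ++it) {
    out.push_back(it->first);
  }
  return out;
}

// Keys in (limit, start], descending. Stepping back from upper_bound(start)
// lands on the last key <= start, which plays the role of SeekForPrev(start).
std::vector<std::string> KeysBackward(const ToyDb& db, const std::string& start,
                                      const std::string& limit) {
  assert(limit <= start);
  std::vector<std::string> out;
  auto it = db.upper_bound(start);  // first key > start
  while (it != db.begin()) {
    --it;                           // now at a key <= start
    if (!(it->first > limit)) break;
    out.push_back(it->first);
  }
  return out;
}
```

Note how the forward scan includes <code>start</code> and excludes <code>limit</code>, while the backward scan includes <code>start</code> and excludes <code>limit</code> from the other side.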
For explanation of error handling, different iterating options and best practice, see [[Iterator]].
For implementation details, see Iterator's Implementation.
Snapshots
Snapshots provide consistent read-only views over the entire state of the key-value store. <code>ReadOptions::snapshot</code> may be non-NULL to indicate that a read should operate on a particular version of the DB state.
If <code>ReadOptions::snapshot</code> is NULL, the read will operate on an implicit snapshot of the current state.
Snapshots are created by the DB::GetSnapshot() method:
rocksdb::ReadOptions options;
options.snapshot = db->GetSnapshot();
... apply some updates to db ...
rocksdb::Iterator* iter = db->NewIterator(options);
... read using iter to view the state when the snapshot was created ...
delete iter;
db->ReleaseSnapshot(options.snapshot);
Note that when a snapshot is no longer needed, it should be released using the DB::ReleaseSnapshot interface. This allows the implementation to get rid of state that was being maintained just to support reading as of that snapshot.
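To see why a snapshot forces the store to keep extra state, it can help to look at a toy multi-version map. The sketch below is an assumption-laden illustration, not a description of RocksDB internals: <code>ToyVersionedStore</code> is hypothetical, a snapshot is modeled as a plain sequence number, and old versions are never garbage collected. It requires C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy store for illustration; RocksDB's real design is more involved.
class ToyVersionedStore {
 public:
  using Snapshot = std::uint64_t;  // a snapshot is just a sequence number here

  void Put(const std::string& key, const std::string& value) {
    versions_[key].emplace_back(++last_seq_, value);
  }

  // Captures the current state; later writes get larger sequence numbers.
  Snapshot GetSnapshot() const { return last_seq_; }

  // Returns the newest value written at or before `snapshot`, if any.
  std::optional<std::string> Get(const std::string& key, Snapshot snapshot) const {
    assert(snapshot <= last_seq_);
    auto it = versions_.find(key);
    if (it == versions_.end()) return std::nullopt;
    std::optional<std::string> result;
    for (const auto& [seq, value] : it->second) {  // stored in ascending seq order
      if (seq > snapshot) break;
      result = value;
    }
    return result;
  }

  // Reads the latest state, like a read with no explicit snapshot.
  std::optional<std::string> Get(const std::string& key) const {
    return Get(key, last_seq_);
  }

 private:
  std::uint64_t last_seq_ = 0;
  std::map<std::string, std::vector<std::pair<std::uint64_t, std::string>>> versions_;
};
```

In this toy model every old version is retained forever. A real store can discard versions that no live snapshot can see, which is why releasing a snapshot that is no longer needed matters.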
Slice
The return values of the <code>it->key()</code> and <code>it->value()</code> calls above are instances of the <code>rocksdb::Slice</code> type. <code>Slice</code> is a simple structure that contains a length and a pointer to an external byte array. Returning a <code>Slice</code> is a cheaper alternative to returning a <code>std::string</code> since we do not need to copy potentially large keys and values. In addition, <code>rocksdb</code> methods do not return null-terminated C-style strings since <code>rocksdb</code> keys and values are allowed to contain '\0' bytes.
C++ strings and null-terminated C-style strings can be easily converted to a Slice:
rocksdb::Slice s1 = "hello";
std::string str("world");
rocksdb::Slice s2 = str;
A Slice can be easily converted back to a C++ string:
std::string str = s1.ToString();
assert(str == std::string("hello"));
Be careful when using Slices since it is up to the caller to ensure that the external byte array into which the Slice points remains live while the Slice is in use. For example, the following is buggy:
rocksdb::Slice slice;
if (...) {
std::string str = ...;
slice = str;
}
Use(slice);
When the <code>if</code> statement goes out of scope, <code>str</code> will be destroyed and the backing storage for <code>slice</code> will disappear.
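One way to avoid the dangling reference is to declare the owning string in the same scope as the view. The sketch below uses <code>std::string_view</code> (C++17) as an analogy for <code>rocksdb::Slice</code>, since both are a pointer plus a length that own no data; <code>Use</code> and <code>SafeUse</code> are hypothetical helpers, and the same pattern should carry over to <code>Slice</code>.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// std::string_view stands in for rocksdb::Slice: a non-owning pointer + length.
// This is an analogy for illustration, not RocksDB code.
std::size_t Use(std::string_view slice) { return slice.size(); }

std::size_t SafeUse(bool condition) {
  std::string str;         // owner declared in the enclosing scope
  std::string_view slice;  // empty view by default
  if (condition) {
    str = "some key bytes";
    slice = str;           // view into storage that outlives the if block
  }
  // The view is either empty or points at the still-living owner.
  assert(slice.empty() || slice.data() == str.data());
  return Use(slice);       // str is still alive here
}
```

The only change from the buggy version is where <code>str</code> is declared: its lifetime now covers every use of <code>slice</code>.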
Transactions
RocksDB now supports multi-operation transactions. See [[Transactions]].
Comparators
The preceding examples used the default ordering function for keys, which orders bytes lexicographically. You can, however, supply a custom comparator when opening a database. For example, suppose each database key consists of two numbers and we should sort by the first number, breaking ties by the second number. First, define a proper subclass of <code>rocksdb::Comparator</code> that expresses these rules:
class TwoPartComparator : public rocksdb::Comparator {
 public:
  // Three-way comparison function:
  // if a < b: negative result
  // if a > b: positive result
  // else: zero result
  int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
    int a1, a2, b1, b2;
    ParseKey(a, &a1, &a2);
    ParseKey(b, &b1, &b2);
    if (a1 < b1) return -1;
    if (a1 > b1) return +1;
    if (a2 < b2) return -1;
    if (a2 > b2) return +1;
    return 0;
  }
  // Ignore the following methods for now:
  const char* Name() const override { return "TwoPartComparator"; }
  void FindShortestSeparator(std::string*, const rocksdb::Slice&) const override { }
  void FindShortSuccessor(std::string*) const override { }
};
Now create a database using this custom comparator:
TwoPartComparator cmp;
rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;
options.comparator = &cmp;
rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db);
...
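The comparator above relies on a <code>ParseKey</code> helper that the text does not define. Here is one possible stand-alone sketch of it, under a loudly stated assumption: keys are two 32-bit signed integers stored back to back in native byte order. The layout, <code>EncodeKey</code>, and <code>CompareTwoPart</code> are hypothetical, and <code>std::string_view</code> (C++17) is used in place of <code>rocksdb::Slice</code> so that the example compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// Assumed key layout (not specified in the text above): two 32-bit signed
// integers stored back to back in native byte order, 8 bytes in total.
std::string EncodeKey(std::int32_t first, std::int32_t second) {
  std::string key(8, '\0');
  std::memcpy(&key[0], &first, 4);
  std::memcpy(&key[4], &second, 4);
  return key;
}

void ParseKey(std::string_view key, std::int32_t* first, std::int32_t* second) {
  assert(key.size() == 8);
  std::memcpy(first, key.data(), 4);
  std::memcpy(second, key.data() + 4, 4);
}

// Same three-way logic as TwoPartComparator::Compare, on string_view.
int CompareTwoPart(std::string_view a, std::string_view b) {
  std::int32_t a1, a2, b1, b2;
  ParseKey(a, &a1, &a2);
  ParseKey(b, &b1, &b2);
  if (a1 < b1) return -1;
  if (a1 > b1) return +1;
  if (a2 < b2) return -1;
  if (a2 > b2) return +1;
  return 0;
}
```

With this encoding the default bytewise order would not match numeric order (for example, the bytes of -1 sort after the bytes of 1), which is the kind of case where a custom comparator is needed.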
Column Families
[[Column Families]] provide a way to logically partition the database. Users can provide atomic writes of multiple keys across multiple column families and read a consistent view from them.
Bulk Load
You can use [[Creating and Ingesting SST files]] to bulk load a large amount of data directly into the DB with minimal impact on live traffic.
Backup and Checkpoint
Backup allows users to create periodic incremental backups in a remote file system (think about HDFS or S3) and recover from any of them.
[[Checkpoints]] provides the ability to take a snapshot of a running RocksDB database in a separate directory. Files are hardlinked, rather than copied, if possible, so it is a relatively lightweight operation.
I/O
By default, RocksDB's I/O goes through the operating system's page cache. Setting a [[Rate Limiter]] can limit the rate at which RocksDB issues file writes, to make room for read I/Os.
Users can also choose to bypass operating system's page cache, using Direct I/O.
See [[IO]] for more details.
Backwards compatibility
The result of the comparator's <code>Name</code> method is attached to the database when it is created, and is checked on every subsequent database open. If the name changes, the <code>rocksdb::DB::Open</code> call will fail. Therefore, change the name if and only if the new key format and comparison function are incompatible with existing databases, and it is ok to discard the contents of all existing databases.
You can however still gradually evolve your key format over time with a little bit of pre-planning. For example, you could store a version number at the end of each key (one byte should suffice for most uses).
When you wish to switch to a new key format (e.g., adding an optional third part to the keys processed by <code>TwoPartComparator</code>):
(a) keep the same comparator name
(b) increment the version number for new keys
(c) change the comparator function so it uses the version numbers found in the keys to decide how to interpret them.
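Steps (a) to (c) can be sketched as a comparison function that reads a trailing version byte and decodes each key accordingly. Everything in this example is an assumption made for illustration: the layout (native-endian 32-bit parts followed by one version byte), the helper names, and the choice that a missing third part reads as 0, so a version 1 key and a version 2 key with a zero third part compare as equal. It uses <code>std::string_view</code> (C++17) rather than <code>rocksdb::Slice</code> to stay self-contained.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// Assumed layout (hypothetical): N native-endian int32 parts followed by one
// version byte. Version 1 keys have two parts, version 2 keys have three.
std::string EncodeV1(std::int32_t a, std::int32_t b) {
  std::string key(9, '\0');
  std::memcpy(&key[0], &a, 4);
  std::memcpy(&key[4], &b, 4);
  key[8] = 1;
  return key;
}

std::string EncodeV2(std::int32_t a, std::int32_t b, std::int32_t c) {
  std::string key(13, '\0');
  std::memcpy(&key[0], &a, 4);
  std::memcpy(&key[4], &b, 4);
  std::memcpy(&key[8], &c, 4);
  key[12] = 2;
  return key;
}

// Decodes either version into three parts; a missing third part reads as 0.
std::array<std::int32_t, 3> Decode(std::string_view key) {
  assert(!key.empty());
  const int version = key.back();
  const std::size_t parts = (version == 1) ? 2 : 3;
  assert(key.size() == parts * 4 + 1);
  std::array<std::int32_t, 3> out{0, 0, 0};
  for (std::size_t i = 0; i < parts; ++i) {
    std::memcpy(&out[i], key.data() + i * 4, 4);
  }
  return out;
}

// Three-way comparison that works across both key versions.
int CompareVersioned(std::string_view a, std::string_view b) {
  const auto pa = Decode(a);
  const auto pb = Decode(b);
  if (pa < pb) return -1;  // std::array compares lexicographically
  if (pb < pa) return +1;
  return 0;
}
```

Because old and new keys are both understood by the same function, the comparator name can stay unchanged and existing databases remain readable.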
MemTable and Table factories
By default, we keep the data in memory in a skiplist memtable and the data on disk in the table format described in RocksDB Table Format.
Since one of the goals of RocksDB is to have different parts of the system easily pluggable, we support different implementations of both memtable and table format. You can supply your own memtable factory by setting <code>Options::memtable_factory</code> and your own table factory by setting <code>Options::table_factory</code>. For available memtable factories, please refer to <code>rocksdb/memtablerep.h</code> and for table factories to <code>rocksdb/table.h</code>. These features are both in active development, so please be wary of any API changes that might break your application going forward.
You can also read more about memtables [[here|MemTable]].
Performance
Start with [[Setup Options and Basic Tuning]]. For more information about RocksDB performance, see the "Performance" section in the sidebar on the right side.
Block size
<code>rocksdb</code> groups adjacent keys together into the same block and such a block is the unit of transfer to and from persistent storage. The default block size is approximately 4096 uncompressed bytes. Applications that mostly do bulk scans over the contents of the database may wish to increase this size. Applications that do a lot of point reads of small values may wish to switch to a smaller block size if performance measurements indicate an improvement. There isn't much benefit in using blocks smaller than one kilobyte, or larger than a few megabytes. Also note that compression will be more effective with larger block sizes. To change the block size, set <code>BlockBasedTableOptions::block_size</code>.
Write buffer
<code>Options::write_buffer_size</code> specifies the amount of data to build up in memory before converting to a sorted on-disk file. Larger values increase performance, especially during bulk loads. Up to max_write_buffer_number write buffers may be held in memory at the same time, so you may wish to adjust this parameter to control memory usage. Also, a larger write buffer will result in a longer recovery time the next time the database is opened.
A related option is <code>Options::max_write_buffer_number</code>, which is the maximum number of write buffers that are built up in memory. The default is 2, so that when one write buffer is being flushed to storage, new writes can continue to the other write buffer. The flush operation is executed in a [[Thread Pool]].
<code>Options::min_write_buffer_number_to_merge</code> is the minimum number of write buffers that will be merged together before writing to storage. If set to 1, then all write buffers are flushed to L0 as individual files and this increases read amplification because a get request has to check all of these files. Also, an in-memory merge may result in writing less data to storage if there are duplicate records in each of these individual write buffers. Default: 1
Compression
Each block is individually compressed before being written to persistent storage. Compression is on by default since the default compression method is very fast, and is automatically disabled for incompressible data. In rare cases, applications may want to disable compression entirely, but should only do so if benchmarks show a performance improvement:
rocksdb::Options options;
options.compression = rocksdb::kNoCompression;
... rocksdb::DB::Open(options, name, ...) ....
[[Dictionary Compression]] is also available.
Cache
The contents of the database are stored in a set of files in the filesystem and each file stores a sequence of compressed blocks. If <code>table_options.block_cache</code> is non-NULL, it is used to cache frequently used uncompressed block contents. We use the operating system's file cache to cache our raw data, which is compressed. So the file cache acts as a cache for compressed data.
#include "rocksdb/cache.h"
rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = rocksdb::NewLRUCache(100 * 1048576); // 100MB uncompressed cache
rocksdb::Options options;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
rocksdb::DB* db;
rocksdb::DB::Open(options, name, &db);
... use the db ...
delete db;
When performing a bulk read, the application may wish to disable caching so that the data processed by the bulk read does not end up displacing most of the cached contents. A per-iterator option can be used to achieve this:
rocksdb::ReadOptions options;
options.fill_cache = false;
rocksdb::Iterator* it = db->NewIterator(options);
for (it->SeekToFirst(); it->Valid(); it->Next()) {
...
}
You can also disable the block cache by setting <code>table_options.no_block_cache</code> to true.
See [[Block Cache]] for more details.
Key Layout
Note that the unit of disk transfer and caching is a block. Adjacent keys (according to the database sort order) will usually be placed in the same block. Therefore the application can improve its performance by placing keys that are accessed together near each other and placing infrequently used keys in a separate region of the key space.
For example, suppose we are implementing a simple file system on top of <code>rocksdb</code>. The types of entries we might wish to store are:
filename -> permission-bits, length, list of file_block_ids
file_block_id -> data
We might want to prefix <code>filename</code> keys with one letter (say '/') and the <code>file_block_id</code> keys with a different letter (say '0') so that scans over just the metadata do not force us to fetch and cache bulky file contents.
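The prefix scheme above can be checked on an ordered <code>std::map</code>, which sorts string keys bytewise like the default ordering. This is a minimal sketch; <code>ToyDb</code>, <code>MetadataKey</code>, <code>BlockKey</code>, and <code>ListFilenames</code> are hypothetical names, and the stored values are placeholders.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an ordered key-value store; not a RocksDB type.
using ToyDb = std::map<std::string, std::string>;

// Prefix choices follow the text: '/' for metadata, '0' for block data.
std::string MetadataKey(const std::string& filename) {
  assert(!filename.empty());
  return "/" + filename;
}

std::string BlockKey(const std::string& block_id) {
  assert(!block_id.empty());
  return "0" + block_id;
}

// Scans only the metadata region by seeking to the prefix and stopping at
// the first key that no longer starts with it.
std::vector<std::string> ListFilenames(const ToyDb& db) {
  const std::string prefix = "/";
  std::vector<std::string> names;
  for (auto it = db.lower_bound(prefix);
       it != db.end() && it->first.compare(0, prefix.size(), prefix) == 0; ++it) {
    names.push_back(it->first.substr(prefix.size()));
  }
  return names;
}
```

Since '/' sorts immediately before '0' in ASCII, all metadata keys form one contiguous run ahead of the block keys, so the scan never touches the bulky block entries.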
Filters
Because of the way <code>rocksdb</code> data is organized on disk, a single <code>Get()</code> call may involve multiple reads from disk. The optional <code>FilterPolicy</code> mechanism can be used to reduce the number of disk reads substantially.
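#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/table.h"
rocksdb::Options options;
rocksdb::BlockBasedTableOptions bbto;
bbto.filter_policy.reset(rocksdb::NewBloomFilterPolicy(
    10 /* bits_per_key */,
    false /* use_block_based_builder*/));
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
rocksdb::DB* db;
rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db);
assert(status.ok());
... use the database ...
delete db;
// bbto.filter_policy is a shared_ptr, so the policy is released automatically
// and must not be deleted by hand.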
The preceding code associates a [[Bloom Filter | RocksDB-Bloom-Filter]] based filtering policy with the database. Bloom filter based filtering relies on keeping some number of bits of data in memory per key (in this case 10 bits per key, since that is the argument we passed to <code>NewBloomFilterPolicy</code>). This filter will reduce the number of unnecessary disk reads needed for <code>Get()</code> calls by a factor of approximately 100. Increasing the bits per key will lead to a larger reduction at the cost of more memory usage. We recommend that applications whose working set does not fit in memory and that do a lot of random reads set a filter policy.
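The reduction by a factor of roughly 100 is in line with the textbook false-positive estimate for a Bloom filter, (1 - e^(-k/b))^k for b bits per key and k probes. The sketch below evaluates that formula. It is the standard approximation, not RocksDB's actual filter implementation, and both function names are hypothetical.

```cpp
#include <cmath>

// Probe count that minimizes the false-positive rate in the textbook model:
// k = bits_per_key * ln(2), rounded to the nearest integer (at least 1).
int OptimalProbes(double bits_per_key) {
  int k = static_cast<int>(std::lround(bits_per_key * std::log(2.0)));
  return k < 1 ? 1 : k;
}

// Textbook estimate of the false-positive rate: (1 - e^(-k / b))^k.
double BloomFalsePositiveRate(double bits_per_key, int num_probes) {
  double fraction_set =
      1.0 - std::exp(-static_cast<double>(num_probes) / bits_per_key);
  return std::pow(fraction_set, num_probes);
}
```

With 10 bits per key and 7 probes the estimate comes out a little under 1%, so only about one lookup in a hundred for an absent key would still go to disk. Doubling the bits per key lowers the estimate further, at the cost of twice the filter memory.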
If you are using a custom comparator, you should ensure that the filter policy you are using is compatible with your comparator. For example, consider a comparator that ignores trailing spaces when comparing keys. The policy returned by <code>NewBloomFilterPolicy</code> must not be used with such a comparator. Instead, the application should provide a custom filter policy that also ignores trailing spaces.
For example:
class CustomFilterPolicy : public rocksdb::FilterPolicy {
 private:
  const rocksdb::FilterPolicy* builtin_policy_;
 public:
  CustomFilterPolicy()
      : builtin_policy_(rocksdb::NewBloomFilterPolicy(10, false)) {}
  ~CustomFilterPolicy() { delete builtin_policy_; }
  const char* Name() const { return "IgnoreTrailingSpacesFilter"; }
  void CreateFilter(const rocksdb::Slice* keys, int n, std::string* dst) const {
    // Use builtin bloom filter code after removing trailing spaces.
    // RemoveTrailingSpaces is an application-supplied helper, not part of rocksdb.
    std::vector<rocksdb::Slice> trimmed(n);
    for (int i = 0; i < n; i++) {
      trimmed[i] = RemoveTrailingSpaces(keys[i]);
    }
    // Hand the whole trimmed array to the builtin policy.
    builtin_policy_->CreateFilter(trimmed.data(), n, dst);
  }
  bool KeyMayMatch(const rocksdb::Slice& key, const rocksdb::Slice& filter) const {
    // Use builtin bloom filter code after removing trailing spaces
    return builtin_policy_->KeyMayMatch(RemoveTrailingSpaces(key), filter);
  }
};
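The class above relies on a <code>RemoveTrailingSpaces</code> helper that the text does not define. One possible shape is sketched below, using <code>std::string</code> as a stand-in for <code>rocksdb::Slice</code> so that it compiles on its own; a real version would more likely return a shorter <code>Slice</code> over the same bytes instead of copying. Both function names are assumptions. The second function models the comparator's notion of equality, which is the property the filter has to respect: keys that compare equal must be fed to the filter in identical form.

```cpp
#include <string>

// Hypothetical helper: returns the key without any trailing ' ' characters.
std::string RemoveTrailingSpaces(const std::string& key) {
  std::string::size_type last = key.find_last_not_of(' ');
  // npos means the key is empty or consists only of spaces.
  return last == std::string::npos ? std::string() : key.substr(0, last + 1);
}

// Models the equality test of a comparator that ignores trailing spaces.
bool EqualIgnoringTrailingSpaces(const std::string& a, const std::string& b) {
  return RemoveTrailingSpaces(a) == RemoveTrailingSpaces(b);
}
```

Only trailing spaces are dropped; leading and interior spaces still distinguish keys, matching the comparator described in the text.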
Advanced applications may provide a filter policy that does not use a Bloom filter but summarizes a set of keys by some other mechanism. See <code>rocksdb/filter_policy.h</code> for details.
Checksums
<code>rocksdb</code> associates checksums with all data it stores in the file system. There are two separate controls provided over how aggressively these checksums are verified:
<ul>
<li>
<code>ReadOptions::verify_checksums</code> forces checksum verification of all data that is read from the file system on behalf of a particular read. This is on by default.
<li> <code>Options::paranoid_checks</code> may be set to true before opening a database to make the database implementation raise an error as soon as it detects an internal corruption. Depending on which portion of the database has been corrupted, the error may be raised when the database is opened, or later by another database operation. By default, paranoid checking is on.
</ul>
Checksum verification can also be manually triggered by calling <code>DB::VerifyChecksum()</code>. This API walks through all the SST files in all levels for all column families, and for each SST file, verifies the checksum embedded in the metadata and data blocks. At present, it is only supported for the BlockBasedTable format. The files are verified serially, so the API call may take a significant amount of time to finish. This API can be useful for proactive verification of data integrity in a distributed system, for example, where a new replica can be created if the database is found to be corrupt.
If a database is corrupted (perhaps it cannot be opened when paranoid checking is turned on), the <code>rocksdb::RepairDB</code> function may be used to recover as much of the data as possible.
Compaction
RocksDB continually rewrites existing data files. This removes stale versions of keys and keeps the data structure optimal for reads.
The information about compaction has been moved to [[Compaction]]. Users do not need to know the internals of compaction before operating RocksDB.
Approximate Sizes
The <code>GetApproximateSizes</code> method can be used to get the approximate number of bytes of file system space used by one or more key ranges.
rocksdb::Range ranges[2];
ranges[0] = rocksdb::Range("a", "c");
ranges[1] = rocksdb::Range("x", "z");
uint64_t sizes[2];
db->GetApproximateSizes(ranges, 2, sizes);
The preceding call will set <code>sizes[0]</code> to the approximate number of bytes of file system space used by the key range <code>[a..c)</code> and <code>sizes[1]</code> to the approximate number of bytes used by the key range <code>[x..z)</code>.
Environment
All file operations (and other operating system calls) issued by the <code>rocksdb</code> implementation are routed through a <code>rocksdb::Env</code> object. Sophisticated clients may wish to provide their own <code>Env</code> implementation to get better control. For example, an application may introduce artificial delays in the file IO paths to limit the impact of <code>rocksdb</code> on other activities in the system.
class SlowEnv : public rocksdb::Env {
  ... implementation of the Env interface ...
};
SlowEnv env;
rocksdb::Options options;
options.env = &env;
rocksdb::Status s = rocksdb::DB::Open(options, ...);
Porting
<code>rocksdb</code> may be ported to a new platform by providing platform specific implementations of the types/methods/functions exported by <code>rocksdb/port/port.h</code>. See <code>rocksdb/port/port_example.h</code> for more details.
In addition, the new platform may need a new default <code>rocksdb::Env</code> implementation. See <code>rocksdb/util/env_posix.h</code> for an example.
Manageability
Tuning an application efficiently is much easier with access to usage statistics. You can collect those statistics by setting <code>Options::table_properties_collectors</code> or <code>Options::statistics</code>. For more information, refer to <code>rocksdb/table_properties.h</code> and <code>rocksdb/statistics.h</code>. These should not add significant overhead to your application, and we recommend exporting them to other monitoring tools. See [[Statistics]]. You can also profile single requests using [[Perf Context and IO Stats Context]]. Users can register an [[EventListener]] to receive callbacks for some internal events.
Purging WAL files
By default, old write-ahead logs are deleted automatically when they fall out of scope and the application no longer needs them. There are options that let the user archive the logs and then delete them lazily, either in TTL fashion or based on a size limit.
The options are <code>Options::WAL_ttl_seconds</code> and <code>Options::WAL_size_limit_MB</code>. Here is how they can be used:
<ul>
<li>
If both are set to 0, logs will be deleted as soon as possible and will never get into the archive.
<li>
If <code>WAL_ttl_seconds</code> is 0 and <code>WAL_size_limit_MB</code> is not 0, WAL files will be checked every 10 minutes, and if the total size is greater than <code>WAL_size_limit_MB</code>, they will be deleted starting with the earliest until the size limit is met. All empty files will be deleted.
<li>
If <code>WAL_ttl_seconds</code> is not 0 and <code>WAL_size_limit_MB</code> is 0, WAL files will be checked every <code>WAL_ttl_seconds / 2</code> seconds, and those older than <code>WAL_ttl_seconds</code> will be deleted.
<li>
If both are not 0, WAL files will be checked every 10 minutes and both checks will be performed, with the TTL check first.
</ul>
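The four cases above can be summarized as a small decision function. This is only a model of the rules as listed here, not RocksDB's own code; the struct and function names are hypothetical, and the 10 minute interval is taken from the text.

```cpp
#include <cstdint>

// Hypothetical summary of how old WAL files are handled for a given setting.
struct WalPurgePlan {
  bool archive;                  // false: logs are deleted at once, never archived
  uint64_t check_interval_secs;  // how often archived logs are examined (0 if none)
  bool apply_ttl;                // delete files older than WAL_ttl_seconds
  bool apply_size_limit;         // delete oldest files until under WAL_size_limit_MB
};

WalPurgePlan PlanWalPurge(uint64_t wal_ttl_seconds, uint64_t wal_size_limit_mb) {
  const uint64_t kTenMinutes = 10 * 60;
  if (wal_ttl_seconds == 0 && wal_size_limit_mb == 0) {
    return {false, 0, false, false};
  }
  if (wal_ttl_seconds == 0) {
    return {true, kTenMinutes, false, true};
  }
  if (wal_size_limit_mb == 0) {
    return {true, wal_ttl_seconds / 2, true, false};
  }
  // Both set: checked every 10 minutes, TTL check first, then the size check.
  return {true, kTenMinutes, true, true};
}
```

For instance, a TTL of one hour with no size limit yields a check every 30 minutes that applies only the TTL rule.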
Other Information
To set up RocksDB options:
- Set Up Options And Basic Tuning
- Some detailed Tuning Guide
Details about the <code>rocksdb</code> implementation may be found in the following documents:
- RocksDB Overview and Architecture
- Format of an immutable Table file
- <a href="log_format.txt">Format of a log file</a>