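A code walkthrough of the image creation process. Along the way I found that I do not understand the librados aio mechanism, or the way functions registered through cls get called, very well; I will write a separate post about them when I have time.
Overview
First, walk through the flow once to see image creation at a high level.
Initialize rbd and create an image.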
ceph osd pool create rbd 32
rbd pool init rbd
rbd create --size 1024 rbd/testimage
List the objects that now exist:
rados ls -p rbd
rbd_directory
rbd_id.testimage
rbd_info
rbd_object_map.105d2ae8944a
rbd_header.105d2ae8944a
The rbd objects in a pool fall into two categories.
The first category: rbd metadata objects for the whole pool
1. rbd_directory: exists in every pool and records all images in that pool. Its omap holds the name and id of every image. Each image has two entries: the first has key id_<image id> and value image name; the second has key name_<image name> and value image id.
rados listomapvals rbd_directory -p rbd
id_105d2ae8944a
value (13 bytes) :
00000000 09 00 00 00 74 65 73 74 69 6d 61 67 65 |....testimage|
0000000d
name_testimage
value (16 bytes) :
00000000 0c 00 00 00 31 30 35 64 32 61 65 38 39 34 34 61 |....105d2ae8944a|
00000010
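The values in the dump above look like a 4-byte little-endian length followed by the raw string bytes. Below is a minimal sketch of decoding one such value and of building the two key names. The helper names decode_string, dir_key_for_id and dir_key_for_name are made up for illustration and are not part of librbd.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Decode a string laid out as in the dump above: a 4-byte little-endian
// length, then that many raw bytes.
std::string decode_string(const std::vector<uint8_t>& raw) {
  if (raw.size() < 4) {
    throw std::runtime_error("buffer too short for length prefix");
  }
  uint32_t len = static_cast<uint32_t>(raw[0]) |
                 (static_cast<uint32_t>(raw[1]) << 8) |
                 (static_cast<uint32_t>(raw[2]) << 16) |
                 (static_cast<uint32_t>(raw[3]) << 24);
  if (raw.size() - 4 < len) {
    throw std::runtime_error("buffer shorter than declared length");
  }
  return std::string(raw.begin() + 4, raw.begin() + 4 + len);
}

// Key names of the two directory entries kept for one image
// (hypothetical helpers, named for this example only).
std::string dir_key_for_id(const std::string& id) { return "id_" + id; }
std::string dir_key_for_name(const std::string& name) { return "name_" + name; }
```

Feeding it the 13 bytes of the id_105d2ae8944a value yields the image name, which matches the ASCII column of the hexdump.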
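2. rbd_info: normally its content is "overwrite validated". Things differ for v1 images; ignored for now.
The second category: metadata objects of a single image
The comment in the source describes them as follows: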
/* New-style rbd image 'testimage' consists of objects
* rbd_id.testimage - id of image
* rbd_header.<id> - image metadata
* rbd_object_map.<id> - optional image object map
* rbd_data.<id>.00000000
* rbd_data.<id>.00000001
* ... - data
*/
The earlier rados ls output, however, only shows the first three. To speed up image creation and save space, data objects are allocated only when they are first used.
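To make the lazy allocation concrete: a write only touches the data object that covers its byte range. The sketch below maps an image offset to an object index, assuming the default layout where each object holds one contiguous 2^order-byte range (no custom striping). ObjectLocation and locate are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstdint>

struct ObjectLocation {
  uint64_t object_no;  // index of the data object
  uint64_t offset;     // byte offset inside that object
};

// Map an image byte offset to a data object, assuming default striping
// (each object holds one contiguous 2^order-byte range of the image).
ObjectLocation locate(uint64_t image_offset, uint8_t order) {
  assert(order < 64);
  const uint64_t object_size = uint64_t{1} << order;
  return {image_offset >> order, image_offset & (object_size - 1)};
}
```

With order 22, a write at byte 4194309 falls into object 1 at offset 5, so only that one data object would need to exist afterwards.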
1. rbd_id.testimage: called the image's id object (id_obj); its content is the id of the image.
2. rbd_header.105d2ae8944a: called the image's header object (head_obj); its omap stores the image's metadata.
rados listomapvals rbd_header.105d2ae8944a -p rbd
create_timestamp - creation time
value (8 bytes) :
00000000 17 25 96 5a a7 bc 94 35 |.%.Z...5|
00000008
features - enabled features
value (8 bytes) :
00000000 3d 00 00 00 00 00 00 00 |=.......|
00000008
object_prefix - name prefix of the data objects
value (25 bytes) :
00000000 15 00 00 00 72 62 64 5f 64 61 74 61 2e 31 30 35 |....rbd_data.105|
00000010 64 32 61 65 38 39 34 34 61 |d2ae8944a|
00000019
order - size of each data object, as a power of two (0x16 = 22, so 4 MiB)
value (1 bytes) :
00000000 16 |.|
00000001
size - image size
value (8 bytes) :
00000000 00 00 00 40 00 00 00 00 |...@....|
00000008
snap_seq - the latest snapshot sequence number in use
value (8 bytes) :
00000000 00 00 00 00 00 00 00 00 |........|
00000008
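If snapshots have been created, snapshot-related key/value pairs also appear in the omap; they are not covered here.
3. rbd_object_map.105d2ae8944a: supports the object map feature and is created when object map is enabled.
Code
Some code is omitted; this does not affect reading.
Image creation starts at the create function in librbd.cc, which calls create in internal.cc. Note that there are several versions of create. They differ mainly in how many options the caller specifies; they all end up in the same implementation.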
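The numeric values in the header dump are little-endian integers. The sketch below decodes them and derives a few facts from this particular image. The bit-to-name table follows the RBD feature bit positions as I understand them (layering is bit 0, data-pool is bit 7); treat it as an assumption and check it against your Ceph version. decode_le64, feature_names and object_count are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode an 8-byte little-endian integer as shown in the omap dump.
uint64_t decode_le64(const std::vector<uint8_t>& raw) {
  assert(raw.size() == 8);
  uint64_t v = 0;
  for (int i = 7; i >= 0; --i) {
    v = (v << 8) | raw[i];
  }
  return v;
}

// Translate feature bits into names (assumed bit positions, see above).
std::vector<std::string> feature_names(uint64_t features) {
  static const char* kNames[] = {"layering",   "striping",  "exclusive-lock",
                                 "object-map", "fast-diff", "deep-flatten",
                                 "journaling", "data-pool"};
  std::vector<std::string> out;
  for (int bit = 0; bit < 8; ++bit) {
    if (features & (uint64_t{1} << bit)) out.push_back(kNames[bit]);
  }
  return out;
}

// Number of data objects needed to cover `size` bytes at 2^order bytes each.
uint64_t object_count(uint64_t size, uint8_t order) {
  const uint64_t object_size = uint64_t{1} << order;
  return (size + object_size - 1) / object_size;
}
```

For the dump above, size decodes to 1073741824 bytes (the 1024 MB passed to rbd create), which at order 22 corresponds to 256 potential data objects, and features 0x3d has five bits set.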
/*
io_ctx is created through librados and connects to the corresponding pool in rados
librados::IoCtx io_ctx;
rados.ioctx_create(pool_name.c_str(), io_ctx);
name is the name of the image to create
size is the image size
order determines the size of each rados object backing the image; the default is 4MB, i.e. 1<<22
*/
int RBD::create(IoCtx& io_ctx, const char *name, uint64_t size, int *order)
{
int r = librbd::create(io_ctx, name, size, order);
return r;
}
create in internal.cc
int create(librados::IoCtx& io_ctx, const char *imgname, uint64_t size,
int *order)
{
uint64_t order_ = *order;
ImageOptions opts;
int r = opts.set(RBD_IMAGE_OPTION_ORDER, order_);
assert(r == 0);
// forward to the full overload
r = create(io_ctx, imgname, "", size, opts, "", "", false);
int r1 = opts.get(RBD_IMAGE_OPTION_ORDER, &order_);
assert(r1 == 0);
*order = order_;
return r;
}
// the real implementation
int create(IoCtx& io_ctx, const std::string &image_name,
const std::string &image_id, uint64_t size,
ImageOptions& opts,
const std::string &non_primary_global_image_id,
const std::string &primary_mirror_uuid,
bool skip_mirror_enable)
{
// prepare the image id; generate one if none was given
std::string id(image_id);
if (id.empty()) {
id = util::generate_image_id(io_ctx);
}
CephContext *cct = (CephContext *)io_ctx.cct();
ldout(cct, 10) << __func__ << " name=" << image_name << ", "
<< "id= " << id << ", "
<< "size=" << size << ", opts=" << opts << dendl;
// prepare the image format; fall back to the default if it is not set
uint64_t format;
if (opts.get(RBD_IMAGE_OPTION_FORMAT, &format) != 0)
format = cct->_conf->get_val<int64_t>("rbd_default_format");
bool old_format = format == 1;
// make sure it doesn't already exist, in either format
int r = detect_format(io_ctx, image_name, NULL, NULL);
if (r != -ENOENT) {
if (r) {
lderr(cct) << "Could not tell if " << image_name << " already exists"
<< dendl;
return r;
}
lderr(cct) << "rbd image " << image_name << " already exists" << dendl;
return -EEXIST;
}
// prepare order; fall back to the default if it is not set
uint64_t order = 0;
if (opts.get(RBD_IMAGE_OPTION_ORDER, &order) != 0 || order == 0) {
order = cct->_conf->get_val<int64_t>("rbd_default_order");
}
r = image::CreateRequest<>::validate_order(cct, order);
if (r < 0) {
return r;
}
// create the image according to its format; the old format exists only for backward compatibility and is not examined here
if (old_format) {
r = create_v1(io_ctx, image_name.c_str(), size, order);
} else {
// the thread pool and queue used by ceph; ContextWQ is a queue with asynchronous callbacks
// a task placed in it runs in the thread pool and, once done, invokes the user-implemented callback (Context::finish())
ThreadPool *thread_pool;
ContextWQ *op_work_queue;
ImageCtx::get_thread_pool_instance(cct, &thread_pool, &op_work_queue);
C_SaferCond cond;
// new a CreateRequest object; the template parameter takes its default, ImageCtx
// the constructor extracts all the parameters it needs, listed below:
/*
name
id
size
features
order
stripe_unit
stripe_count
journal_order
journal_splay_width
journal_pool
data_pool
*/
image::CreateRequest<> *req = image::CreateRequest<>::create(
io_ctx, image_name, id, size, opts, non_primary_global_image_id,
primary_mirror_uuid, skip_mirror_enable, op_work_queue, &cond);
// entry point that starts the operation
req->send();
// wait for req to complete
r = cond.wait();
}
int r1 = opts.set(RBD_IMAGE_OPTION_ORDER, order);
assert(r1 == 0);
return r;
}
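The callback convention used above (a Context whose complete() runs the user's finish() and then frees the object) can be sketched in a few lines. This is a simplified, single-threaded stand-in and not ceph's implementation: WorkQueue is a hypothetical queue drained by hand, and StoreResult only imitates the role C_SaferCond plays, without any blocking.

```cpp
#include <cassert>
#include <deque>

// Simplified stand-in for ceph's Context: complete() runs the
// user-supplied finish() once and then frees the object.
class Context {
 public:
  virtual ~Context() = default;
  void complete(int r) {
    finish(r);
    delete this;
  }

 protected:
  virtual void finish(int r) = 0;
};

// Callback that stores the result where the caller can read it
// (hypothetical; the real C_SaferCond also lets the caller block in wait()).
class StoreResult : public Context {
 public:
  explicit StoreResult(int* out) : m_out(out) {}

 protected:
  void finish(int r) override { *m_out = r; }

 private:
  int* m_out;
};

// Hypothetical single-threaded work queue: queued contexts are completed
// in FIFO order when drain() is called.
class WorkQueue {
 public:
  void queue(Context* ctx, int r) { m_items.push_back({ctx, r}); }
  void drain() {
    while (!m_items.empty()) {
      Item item = m_items.front();
      m_items.pop_front();
      item.ctx->complete(item.r);
    }
  }

 private:
  struct Item {
    Context* ctx;
    int r;
  };
  std::deque<Item> m_items;
};
```

The point of the pattern is that whoever queues a context never learns the result directly; it arrives later through finish(), which is why create() above has to wait on cond before it can return r.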
CreateRequest.h/cc define the actual implementation of the create operation. Here is the state diagram first; the code that follows executes in the same order as the diagram.
/**
* @verbatim
*
*                                  <start> . . . . > . . . . .
*                                     |                      .
*                                     v                      .
*                               VALIDATE POOL                v (pool validation
*                                     |                      .  disabled)
*                                     v                      .
*                             VALIDATE OVERWRITE             .
*                                     |                      .
*                                     v                      .
* (error: bottom up)           CREATE ID OBJECT. . < . . . . .
*  _______<_______                    |
* |               |                   v
* |               |        ADD IMAGE TO DIRECTORY
* |               |                /  |
* |       REMOVE ID OBJECT<-------/   v
* |               |          NEGOTIATE FEATURES (when using default features)
* |               |                   |
* |               |                   v         (stripingv2 disabled)
* |               |              CREATE IMAGE. . . . > . . . .
* v               |                /  |                      .
* |       REMOVE FROM DIR<--------/   v                      .
* |               |         SET STRIPE UNIT COUNT            .
* |               |                /  |  \ . . . . . > . . . .
* |       REMOVE HEADER OBJ<------/   v                     /. (object-map
* |               |\           OBJECT MAP RESIZE . . < . . * v  disabled)
* |               | \              /  |  \ . . . . . > . . . .
* |               |  *<-----------/   v                     /. (journaling
* |               |            FETCH MIRROR MODE . . < . . * v  disabled)
* |               |                /  |                      .
* |     REMOVE OBJECT MAP<--------/   v                      .
* |               |\            JOURNAL CREATE               .
* |               | \              /  |                      .
* v               |  *<------------/  v                      .
* |               |          MIRROR IMAGE ENABLE             .
* |               |                /  |                      .
* |        JOURNAL REMOVE*<-------/   |                      .
* |                                   v                      .
* |_____________>___________________<finish> . . . . < . . . .
*
* @endverbatim
*/
The functions corresponding to each step of the state diagram are:
- send(): validates the parameters and starts the flow
- validate_pool(): checks whether rbd_directory exists
- validate_overwrite(): checks that rbd_info exists and what it contains (the overwrite support check); skipped here
- create_id_object(): creates the rbd_id.<image name> object and sets its content to the image id
- add_image_to_directory(): adds the image name and id to the omap of rbd_directory
- negotiate_features(): adjusts the features parameter
- create_image(): creates the rbd_header.<image id> object and stores the metadata in its omap
- set_stripe_unit_count(): writes stripe_unit and stripe_count to the header omap
- object_map_resize(): sizes the object map in rbd_object_map.<image id> to object_count entries with the initial object_state
- fetch_mirror_mode(): mirror feature, skipped for now
- journal_create(): journal feature, skipped for now
- mirror_image_enable(): mirror feature, skipped for now
- complete(): finishes the flow
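The "error: bottom up" column of the diagram amounts to undoing, most recent first, whatever the earlier steps created. Here is a minimal sketch of that idea as a table-driven loop. It is not how CreateRequest is written (the real code chains asynchronous callbacks), the step and undo names are only labels taken from the diagram, and the mirror and journal steps are left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Step {
  std::string name;  // forward state, e.g. "create_image"
  std::string undo;  // cleanup state, empty if there is nothing to undo
};

// Run the steps in order. `fail_at` names the step that should fail
// (empty for success). Returns the trace of states that were entered.
std::vector<std::string> run(const std::vector<Step>& steps,
                             const std::string& fail_at) {
  std::vector<std::string> trace;
  for (std::size_t i = 0; i < steps.size(); ++i) {
    trace.push_back(steps[i].name);
    if (steps[i].name != fail_at) continue;
    // Failure: undo the earlier steps, most recent first ("bottom up").
    for (std::size_t j = i; j-- > 0;) {
      if (!steps[j].undo.empty()) trace.push_back(steps[j].undo);
    }
    break;
  }
  trace.push_back("complete");
  return trace;
}

// Steps and their cleanup counterparts, read off the state diagram.
std::vector<Step> create_steps() {
  return {{"create_id_object", "remove_id_object"},
          {"add_image_to_directory", "remove_from_dir"},
          {"negotiate_features", ""},
          {"create_image", "remove_header_object"},
          {"set_stripe_unit_count", ""},
          {"object_map_resize", "remove_object_map"}};
}
```

Calling run(create_steps(), "create_image") visits remove_from_dir and then remove_id_object before complete, which is the path the diagram draws for a failed CREATE IMAGE.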
The detailed code of these functions follows.
send is the entry point of the state machine. It validates the parameters. On error it calls complete, which eventually invokes the callback implemented through Context::finish() to handle the error. If nothing is wrong it calls validate_pool() and enters the next state.
template<typename I>
void CreateRequest<I>::send() {
ldout(m_cct, 20) << dendl;
// validate the parameters
int r = validate_features(m_cct, m_features, m_force_non_primary);
if (r < 0) {
complete(r);
return;
}
r = validate_order(m_cct, m_order);
if (r < 0) {
complete(r);
return;
}
r = validate_striping(m_cct, m_order, m_stripe_unit, m_stripe_count);
if (r < 0) {
complete(r);
return;
}
r = validate_data_pool(m_cct, m_ioctx, m_features, m_data_pool,
&m_data_pool_id);
if (r < 0) {
complete(r);
return;
}
if (((m_features & RBD_FEATURE_OBJECT_MAP) != 0) &&
(!validate_layout(m_cct, m_size, m_layout))) {
complete(-EINVAL);
return;
}
// enter the next state
validate_pool();
}
validate_pool checks whether the rbd_directory object exists
template<typename I>
void CreateRequest<I>::validate_pool() {
// decide whether to skip the validate_pool stage
if (!m_cct->_conf->get_val<bool>("rbd_validate_pool")) {
create_id_object();
return;
}
// wrap handle_validate_pool in an AioCompletion so it runs as the callback when aio_operate completes
// handle_validate_pool calls validate_overwrite to enter the next state
using klass = CreateRequest<I>;
librados::AioCompletion *comp =
create_rados_callback<klass, &klass::handle_validate_pool>(this);
librados::ObjectReadOperation op;
op.stat(NULL, NULL, NULL);
m_outbl.clear();
// stat the rbd_directory object to find out whether it exists
int r = m_ioctx.aio_operate(RBD_DIRECTORY, comp, &op, &m_outbl);
assert(r == 0);
comp->release();
}
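The create_rados_callback<klass, &klass::handle_validate_pool>(this) call is what ties each state to its handler. Its real implementation is not shown in this post; the sketch below only mimics its shape, using a hypothetical make_callback helper that takes the handler as a template parameter and returns a std::function instead of an AioCompletion.

```cpp
#include <cassert>
#include <functional>

// Bind a member function `void T::method(int)` to an object and return
// a plain callable taking the result code. Hypothetical helper that only
// mimics the shape of create_rados_callback<T, &T::method>(obj).
template <typename T, void (T::*Method)(int)>
std::function<void(int)> make_callback(T* obj) {
  return [obj](int r) { (obj->*Method)(r); };
}

// Toy request with one asynchronous step and its handler.
struct ToyRequest {
  int last_result = 1;
  bool advanced = false;

  void handle_step(int r) {
    last_result = r;
    advanced = (r == 0);  // only move to the next state on success
  }
};
```

Passing the handler as a template argument means each state gets its own callback type at compile time, so the state machine needs no switch on a "current state" variable.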
template<typename I>
void CreateRequest<I>::handle_validate_pool(int r) {
ldout(m_cct, 20) << "r=" << r << dendl;
if (r == 0) {
validate_overwrite();
return;
} else if ((r < 0) && (r != -ENOENT)) {
lderr(m_cct) << "failed to stat RBD directory: " << cpp_strerror(r)
<< dendl;
complete(r);
return;
}
// allocate a self-managed snapshot id if this a new pool to force
// self-managed snapshot mode
// This call is executed just once per (fresh) pool, hence we do not
// try hard to make it asynchronous (and it's pretty safe not to cause
// deadlocks).
uint64_t snap_id;
r = m_ioctx.selfmanaged_snap_create(&snap_id);
if (r == -EINVAL) {
lderr(m_cct) << "pool not configured for self-managed RBD snapshot support"
<< dendl;
complete(r);
return;
} else if (r < 0) {
lderr(m_cct) << "failed to allocate self-managed snapshot: "
<< cpp_strerror(r) << dendl;
complete(r);
return;
}
r = m_ioctx.selfmanaged_snap_remove(snap_id);
if (r < 0) {
// we've already switched to self-managed snapshots -- no need to
// error out in case of failure here.
ldout(m_cct, 10) << "failed to release self-managed snapshot " << snap_id
<< ": " << cpp_strerror(r) << dendl;
}
validate_overwrite();
}
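The handler above has three outcomes depending on the stat result. A compact way to see the branching is shown below; PoolState and classify_stat_result are hypothetical names, and the mapping simply mirrors the if/else chain in handle_validate_pool.

```cpp
#include <cassert>
#include <cerrno>

enum class PoolState {
  Initialized,  // rbd_directory exists: go straight to validate_overwrite
  Fresh,        // rbd_directory missing: first-time pool setup is needed
  Error         // any other failure aborts the request
};

// Mirror the branching of handle_validate_pool on the stat result.
PoolState classify_stat_result(int r) {
  if (r == 0) return PoolState::Initialized;
  if (r < 0 && r != -ENOENT) return PoolState::Error;
  return PoolState::Fresh;
}
```

Only the Fresh case reaches the selfmanaged_snap_create / selfmanaged_snap_remove pair, which is why the source comment says that code runs just once per new pool.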
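validate_overwrite checks the content of the rbd_info object to confirm that the data pool supports overwrites (see the comment in the code). It is not central to image creation and can be skimmed.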
template <typename I>
void CreateRequest<I>::validate_overwrite() {
...
// handle_validate_overwrite is the callback for aio_operate
// handle_validate_overwrite calls create_id_object to enter the next state
using klass = CreateRequest<I>;
librados::AioCompletion *comp =
create_rados_callback<klass, &klass::handle_validate_overwrite>(this);
librados::ObjectReadOperation op;
op.read(0, 0, nullptr, nullptr);
m_outbl.clear();
// read the rbd_info object to find out whether it exists
int r = m_data_io_ctx.aio_operate(RBD_INFO, comp, &op, &m_outbl);
assert(r == 0);
comp->release();
}
template <typename I>
void CreateRequest<I>::handle_validate_overwrite(int r) {
ldout(m_cct, 20) << "r=" << r << dendl;
bufferlist bl;
bl.append("overwrite validated");
// if rbd_info exists and its content is "overwrite validated", go straight to the next state
if (r == 0 && m_outbl.contents_equal(bl)) {
create_id_object();
return;
} else if ((r < 0) && (r != -ENOENT)) {
lderr(m_cct) << "failed to read RBD info: " << cpp_strerror(r) << dendl;
complete(r);
return;
}
// rbd_info is absent or not yet validated: run the overwrite check below (not examined in detail)
// validate the pool supports overwrites. We cannot use rbd_directory
// since the v1 images store the directory as tmap data within the object.
ldout(m_cct, 10) << "validating overwrite support" << dendl;
bufferlist initial_bl;
initial_bl.append("validate");
r = m_data_io_ctx.write(RBD_INFO, initial_bl, initial_bl.length(), 0);
if (r >= 0) {
r = m_data_io_ctx.write(RBD_INFO, bl, bl.length(), 0);
}
if (r == -EOPNOTSUPP) {
lderr(m_cct) << "pool missing required overwrite support" << dendl;
complete(-EINVAL);
return;
} else if (r < 0) {
lderr(m_cct) << "failed to validate overwrite support: " << cpp_strerror(r)
<< dendl;
complete(r);
return;
}
create_id_object();
}
create_id_object creates the rbd_id.<image name> object
template<typename I>
void CreateRequest<I>::create_id_object() {
ldout(m_cct, 20) << dendl;
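// create an ObjectWriteOperation
librados::ObjectWriteOperation op;
// create the object (exclusive: fail if it already exists)
op.create(true);
// through the cls client, add a call to the set_id function registered on the osd
// it sets the content of the object targeted by op to image_id,
// i.e. the content of rbd_id.<image name> becomes the image id
cls_client::set_id(&op, m_image_id);
// handle_create_id_object is the callback invoked once aio_operate completes
// handle_create_id_object calls add_image_to_directory to enter the next state
using klass = CreateRequest<I>;
librados::AioCompletion *comp =
create_rados_callback<klass, &klass::handle_create_id_object>(this);
// Question.
// The rbd_id object was already set up through cls above, so what is this call for? Further verification? Merely triggering the callback?
// Or are the earlier calls not executed directly, so that aio_operate is needed to trigger them? I lean toward the latter.
// (That is how ObjectWriteOperation works: the calls above only append steps to op, and aio_operate submits them.)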
int r = m_ioctx.aio_operate(m_id_obj, comp, &op);
assert(r == 0);
comp->release();
}
template<typename I>
void CreateRequest<I>::handle_create_id_object(int r) {
ldout(m_cct, 20) << "r=" << r << dendl;
if (r < 0) {
lderr(m_cct) << "error creating RBD id object: " << cpp_strerror(r)
<< dendl;
complete(r);
return;
}
add_image_to_directory();
}
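The question raised in create_id_object comes down to deferred execution: calls on the operation object only record steps, and nothing happens until the operation is submitted against an object name. The toy model below illustrates that, using an in-memory map in place of a pool. ToyWriteOp, Store and set_content are invented for this sketch; librados does the real work on the OSD and reports the result through the AioCompletion.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Store = std::map<std::string, std::string>;  // object name -> content

// Toy write operation: each call only records a step. Nothing touches
// the store until operate() is called.
class ToyWriteOp {
 public:
  void create(bool exclusive) {
    m_steps.push_back([exclusive](Store& s, const std::string& oid) -> int {
      if (s.count(oid)) return exclusive ? -EEXIST : 0;
      s[oid] = "";
      return 0;
    });
  }
  void set_content(const std::string& data) {
    m_steps.push_back([data](Store& s, const std::string& oid) -> int {
      s[oid] = data;
      return 0;
    });
  }
  // Apply all recorded steps to `oid`; commit only if every step succeeds.
  int operate(Store& store, const std::string& oid) const {
    Store scratch = store;
    for (const auto& step : m_steps) {
      int r = step(scratch, oid);
      if (r < 0) return r;
    }
    store = scratch;
    return 0;
  }

 private:
  std::vector<std::function<int(Store&, const std::string&)>> m_steps;
};
```

Submitting the same operation twice shows why op.create(true) matters: the second attempt fails with -EEXIST and leaves the stored id untouched, which protects an existing image from being overwritten.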
add_image_to_directory adds the image's id and name to the rbd_directory object
template<typename I>
void CreateRequest<I>::add_image_to_directory() {
ldout(m_cct, 20) << dendl;
// through the cls client, add a call to the dir_add_image function registered on the osd,
// which adds two key/value pairs to the omap of rbd_directory
librados::ObjectWriteOperation op;
cls_client::dir_add_image(&op, m_image_name, m_image_id);
using klass = CreateRequest<I>;
librados::AioCompletion *comp =
create_rados_callback<klass, &klass::handle_add_image_to_directory>(this);
int r = m_ioctx.aio_operate(RBD_DIRECTORY, comp, &op);
assert(r == 0);
comp->release();
}
template<typename I>
void CreateRequest<I>::handle_add_image_to_directory(int r) {
ldout(m_cct, 20) << "r=" << r << dendl;
if (r < 0) {
lderr(m_cct) << "error adding image to directory: " << cpp_strerror(r)
<< dendl;
m_r_saved = r;
remove_id_object();
// stop here so the rollback path does not also continue with negotiate_features()
return;
}
negotiate_features();
}
negotiate_features limits the default feature set to what the server supports
template<typename I>
void CreateRequest<I>::negotiate_features() {
if (!m_negotiate_features) {
create_image();
return;
}
ldout(m_cct, 20) << dendl;
librados::ObjectReadOperation op;
// request all features supported by the server
cls_client::get_all_features_start(&op);
using klass = CreateRequest<I>;
librados::AioCompletion *comp =
create_rados_callback<klass, &klass::handle_negotiate_features>(this);
// execute op and trigger the callback
m_outbl.clear();
int r = m_ioctx.aio_operate(RBD_DIRECTORY, comp, &op, &m_outbl);
assert(r == 0);
comp->release();
}
template<typename I>
void CreateRequest<I>::handle_negotiate_features(int r) {
ldout(m_cct, 20) << "r=" << r << dendl;
uint64_t all_features;
if (r >= 0) {
bufferlist::iterator it = m_outbl.begin();
// decode the returned features into all_features
r = cls_client::get_all_features_finish(&it, &all_features);
}
if (r < 0) {
ldout(m_cct, 10) << "error retrieving server supported features set: "
<< cpp_strerror(r) << dendl;
} else if ((m_features & all_features) != m_features) {
m_features &= all_features;
ldout(m_cct, 10) << "limiting default features set to server supported: "
<< m_features << dendl;
}
create_image();
}
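The negotiation itself is one bitwise AND. A minimal sketch of the same check in isolation, where limit_features is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Restrict the requested feature bits to those the server reports.
// Returns true when something had to be dropped.
bool limit_features(uint64_t supported, uint64_t* features) {
  assert(features != nullptr);
  if ((*features & supported) == *features) return false;
  *features &= supported;
  return true;
}
```

For instance, requesting 0x7d against a server that reports 0x3d drops bit 6 and leaves 0x3d. Note that in the original handler a failure to fetch the server's feature set is only logged, and creation carries on with the features unchanged.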
create_image creates the rbd_header.<image id> object and writes the metadata into its omap
template<typename I>
void CreateRequest<I>::create_image() {
ldout(m_cct, 20) << dendl;
assert(m_data_pool.empty() || m_data_pool_id != -1);
// build the name prefix of the data objects
ostringstream oss;
oss << RBD_DATA_PREFIX;
if (m_data_pool_id != -1) {
oss << stringify(m_ioctx.get_id()) << ".";
}
oss << m_image_id;
if (oss.str().length() > RBD_MAX_BLOCK_NAME_PREFIX_LENGTH) {
lderr(m_cct) << "object prefix '" << oss.str() << "' too large" << dendl;
complete(-EINVAL);
return;
}
librados::ObjectWriteOperation op;
op.create(true);
// through the function registered with cls, create the rbd_header object and set the values in its omap
cls_client::create_image(&op, m_size, m_order, m_features, oss.str(),
m_data_pool_id);
using klass = CreateRequest<I>;
librados::AioCompletion *comp =
create_rados_callback<klass, &klass::handle_create_image>(this);
int r = m_ioctx.aio_operate(m_header_obj, comp, &op);
assert(r == 0);
comp->release();
}
template<typename I>
void CreateRequest<I>::handle_create_image(int r) {
ldout(m_cct, 20) << "r=" << r << dendl;
if (r < 0) {
lderr(m_cct) << "error writing header: " << cpp_strerror(r) << dendl;
m_r_saved = r;
remove_from_dir();
return;
}
set_stripe_unit_count();
}
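The prefix built at the top of create_image is what ends up in the object_prefix key of the header omap. The sketch below reproduces just that string logic. make_object_prefix is a hypothetical helper, "rbd_data." is taken from the omap dump earlier in this post, and max_len stands in for RBD_MAX_BLOCK_NAME_PREFIX_LENGTH, whose actual value is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Build the data object prefix the way create_image() does above.
// `max_len` stands in for RBD_MAX_BLOCK_NAME_PREFIX_LENGTH (value assumed
// by the caller). Returns false if the prefix would be too long.
bool make_object_prefix(const std::string& image_id, int64_t data_pool_id,
                        int64_t meta_pool_id, std::size_t max_len,
                        std::string* out) {
  std::ostringstream oss;
  oss << "rbd_data.";
  if (data_pool_id != -1) {
    // A separate data pool is in use: embed the metadata pool id.
    oss << meta_pool_id << ".";
  }
  oss << image_id;
  if (oss.str().length() > max_len) return false;
  *out = oss.str();
  return true;
}
```

Without a separate data pool the result for this post's image is rbd_data.105d2ae8944a, matching the dump. With one, the id of the pool holding the metadata is inserted, presumably so that images from different pools sharing one data pool cannot collide.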
The code of the following functions is omitted for now; their flow is similar.
set_stripe_unit_count
object_map_resize
fetch_mirror_mode
journal_create
mirror_image_enable
Finally the complete function is called; on the success path the argument passed in is 0
template<typename I>
void CreateRequest<I>::complete(int r) {
if (r == 0) {
ldout(m_cct, 20) << "done." << dendl;
}
// release the io context of the data pool
m_data_io_ctx.close();
// invoke the completion callback of the CreateRequest to finish
m_on_finish->complete(r);
delete this;
}