參考資料:《Ceph 設(shè)計原理與實(shí)現(xiàn)》
CRUSH(Controlled Replication Under Scalable Hashing)靖榕,是一種基于哈希的數(shù)據(jù)分布算法。以數(shù)據(jù)唯一標(biāo)識符夜只、當(dāng)前存儲集群的拓?fù)錂C(jī)構(gòu)以及數(shù)據(jù)備份策略作為 CRUSH 輸入九榔,可以隨時隨地通過計算獲取數(shù)據(jù)所在的底層存儲設(shè)備位置并直接與其通信芳室,從而避免查表操作,實(shí)現(xiàn)去中心化和高度并發(fā)呈枉。
1 straw 與 starw2
straw 算法將所有元素比作吸管趁尼,針對指定輸入,為每個元素隨機(jī)計算一個長度猖辫,最后從中選擇長度最長的那個元素作為結(jié)果輸出酥泞。這個過程被稱為抽簽,對應(yīng)元素長度成為簽長啃憎。針對硬件差異芝囤,引入了權(quán)重,讓權(quán)重大的元素更容易被抽中辛萍,使得數(shù)據(jù)再異構(gòu)存儲網(wǎng)絡(luò)中也能合理分布悯姊。
straw 算法舉例:
1)假定當(dāng)前集合中一共包含 n 個元素:(e1, e2, ..., en)
2)向集合中添加新元素:(e1, e2, ..., en, en+1)
3)針對任意輸入 x,加入 en+1 之前贩毕,分別計算每個元素簽長并假定其中最大值為 dmax:(d1, d2, ..., dn)
4)因?yàn)樾略?en+1 的簽長計算只和自身編號和權(quán)重有關(guān)悯许,所以可以獨(dú)立計算出簽長:dn+1
5)straw 算法總是選擇最大的簽長作為最終結(jié)果,所以:
如果dn+1 > dmax辉阶,那么 x 將被重新映射到新元素 en+1先壕;反之瘩扼,對 x 已有的映射結(jié)果無影響。
添加一個元素启上,straw 算法會隨機(jī)地將一些原有元素中的數(shù)據(jù)重新映射至新加入的元素中邢隧;同理店印,刪除一個元素冈在,straw 算法會將該元素中全部數(shù)據(jù)隨機(jī)地重新映射至其他元素中。因此按摘,無論添加或者刪除元素包券,都不回導(dǎo)致數(shù)據(jù)在第三方元素遷移,只涉及遷出元素和遷入元素炫贤。
# straw 算法偽代碼溅固,input 輸入,r 隨機(jī)因子兰珍,
max_x = -1
max_item = -1
for each item:
x = hash(input, r)
x = x * item_straw
if x > max_x:
max_x = x
max_item = item
return max_item
# item_straw 通過權(quán)重計算得到
reverse = rearrange all weights in reverse order #逆序排列所有權(quán)重
straw = -1
weight_diff_prev_total = 0
for each item:
item_straw = straw * 0x10000
weight_diff_prev = (reverse[current_item] - reverse[prev_item]) * items_remain
weight_diff_prev_total += weight_diff_prev
weight_diff_next = (reverse[next_item] - reverse[current_item]) * items_remain
scale = weight_diff_prev_total / (weight_diff_prev_total + weight_diff_next)
stram *= pow(1/scale, 1/items_reamain)
item_straw 不但取決于每個元素的自身權(quán)重侍郭,而且也和集合當(dāng)中所有其他元素的權(quán)重相關(guān)。從而導(dǎo)致每次有元素加入當(dāng)前集合或者從當(dāng)前集合中刪除時掠河,會引起不相關(guān)的數(shù)據(jù)遷移亮元。
strew2 算法中不考慮和其他元素之間的聯(lián)系,僅關(guān)注自身權(quán)重唠摹,得到了新 的算法爆捞。
max_x = -1
max_item = -1
for each item:
x = hash(input, r)
x = ln(x / 65536) / weight
if x > max_x:
max_x = x
max_item = item
return max_item
x 在執(zhí)行 hash 算法后,結(jié)果必然落在[0, 65535]之間勾拉,因此 ln(x / 65536x) 結(jié)果為復(fù)數(shù)煮甥,將其除以自身權(quán)重后,則權(quán)重越大藕赞,得到的 x 結(jié)果越大(x < 0)成肘。體現(xiàn)了權(quán)重對于每個元素抽簽結(jié)果的正反饋?zhàn)饔谩?/p>
2 crush 簡介
Crush 算法需要輸入三個參數(shù):對象、cluster map(集群拓?fù)浣Y(jié)構(gòu)) 和 placement rule(數(shù)據(jù)分布策略)斧蜕。一般而言双霍,placement rule 不會輕易改動,cluster map 類似現(xiàn)實(shí)中的服務(wù)器地址惩激,當(dāng)這兩個參數(shù)都不變動的時候店煞,每次輸入一個對象,計算得到的結(jié)果都是確定的风钻。又因?yàn)椴捎昧?hash 算法顷蟀,所以每個硬盤被選中的概率也大約一致。從而骡技,可以保證數(shù)據(jù)在整個集群中均勻分布鸣个。
上圖一個簡化的 cluster map 羞反,每個葉子節(jié)點(diǎn)都是真實(shí)的最小物理存儲設(shè)備(例如磁盤),成為 device囤萤;所有的中間節(jié)點(diǎn)統(tǒng)稱為 bucket昼窗,每個 bucket 可以是一些 devices 的集合,也可以是低一級的 buckets 集合涛舍;根節(jié)點(diǎn)成為 root澄惊,是整個集群的入口。每個節(jié)點(diǎn)都擁有唯一的數(shù)字 ID 和類型富雅,以標(biāo)識其在集群中所處的位置和層級掸驱,但是只有葉子節(jié)點(diǎn),才擁有非負(fù) ID没佑,表明其是承載數(shù)據(jù)的最終設(shè)備毕贼。上一級節(jié)點(diǎn)權(quán)重是其所有孩子節(jié)點(diǎn)的權(quán)重之和。
這里給出 cluster map 常見的節(jié)點(diǎn)類型蛤奢,根據(jù)規(guī)模從小到大排列鬼癣;osd、host啤贩、chassis待秃、rack、row瓜晤、pdu锥余、pod、room痢掠、datacenter驱犹、region、root足画。
placement rule 用來完成數(shù)據(jù)映射雄驹。每條 palcement rule 可以包含多個操作,這些操作共有3種類型:take淹辞、select医舆、emit。
- take
take 從cluster map 選擇指定編號的 bucket(某個特定的 bucket)象缀,并以此作為后續(xù)步驟的輸入蔬将。例如系統(tǒng)默認(rèn)的 placement rule 總是以 cluster map 中的 root 節(jié)點(diǎn)作為輸入開始執(zhí)行的。- select
select 從輸入的 bucket 當(dāng)中隨機(jī)選擇指定類型和數(shù)量的條目(items)央星。Ceph 當(dāng)前支持兩種備份策略——副本和糾刪碼霞怀,相應(yīng)的有兩種 select 算法——firstn 和 indep。- emit
emit 輸出最終選擇結(jié)果給上級調(diào)用者并返回莉给。
下圖以 firstn 為例展示了 select 從指定的 bucket 當(dāng)中查找指定數(shù)量條目的過程毙石。
(1)在當(dāng)前開始查找的 bucket 下選擇一個 item
這里就用到我們上述提到的選擇算法廉沮,例如 straw,用于從對應(yīng)的 bucket 中計算出合適的條目徐矩。輸入為對象的特征標(biāo)識符 x 和隨機(jī)因子 r (r 實(shí)際上是作為哈希函數(shù)的種子)滞时。為了防止死循環(huán),還需要對每個副本過程中的嘗試次數(shù)進(jìn)行限制滤灯,稱為全局嘗試次數(shù)(choose_total_tries)坪稽。
(2)沖突
沖突指選中的條目已經(jīng)存在于輸出條目列表之中。
(3)OSD過載(或失效)
- 集群規(guī)模較小力喷、集群整體容量有限刽漂,導(dǎo)致集群 PG 總數(shù)有限。
- crush 算法本身缺陷——crush 的基本選擇算法中弟孟,以 straw2 為例,每次選擇都是計算單個條目被選中的獨(dú)立概率样悟,但是 ceph 中的多副本策略要求選出多個條目拂募。所以從原理上 crush 就無法處理好多副本模式下的副本均勻分布問題。
這些因素導(dǎo)致在真實(shí)的 Ceph 集群中窟她,特別是在異構(gòu)集群中陈症,很容易出現(xiàn)磁盤數(shù)據(jù)分布不均勻的問題。所以磁盤的權(quán)重是動態(tài)變化的震糖,并且除了可以根據(jù)容量來計算真實(shí)權(quán)重外录肯,Ceph 還設(shè)置了一個 reweight 權(quán)重。當(dāng)選中一個 osd 后吊说, 還會根據(jù)輸入量 x 和 OSD 編號 hash 計算一次论咏,當(dāng)結(jié)果小于 OSD reweight 時,才會真正選擇此 osd颁井。
由上面給出的過載測試流程圖可以看出厅贪,reweight 設(shè)置得越高,通過測試的概率越高(若 reweight > 0x10000雅宾,則通過概率100%)养涮,反之亦然。在實(shí)際應(yīng)用中眉抬,通過降低過載 OSD 或者增加空閑 OSD 的 reweight 都可以觸發(fā)數(shù)據(jù)在 OSD 之間重新分布贯吓。
過載測試的另一個好處:對 OSD 暫時失效和 OSD 永久刪除的場景進(jìn)行區(qū)分。如果時暫時失效蜀变,只需要把 reweight 設(shè)為0悄谐,避免引發(fā) Ceph 集群大的波動。
初始時昏苏,Ceph 將每個 OSD 的 reweight 都設(shè)置為 0x10000尊沸,過載測試失效威沫。
可以通過命令行獲取修改 crush map 配置文件。
- 獲取 crush map
ceph osd getcrushmap -o {compiled-crushmap-filename}
- 反編譯 crush map
crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
cat crushmapdecompliedbywq
# begin crush map
tunable choose_local_tries 0 # 已廢棄洼专,為做向后兼容設(shè)為0
tunable choose_local_fallback_tries 0 # 已廢棄棒掠,為做向后兼容設(shè)為0
tunable choose_total_tries 50 # 選擇 bucket 最大嘗試次數(shù),默認(rèn)值 50
tunable chooseleaf_descend_once 1 # 已廢棄屁商,為做向后兼容設(shè)為1
tunable chooseleaf_vary_r 1 #
tunable chooseleaf_stable 1 # 避免一些不必要的 pg 遷移
tunable straw_calc_version 1 # starw 算法版本烟很,為向后兼容設(shè)為1
tunable allowed_bucket_algs 54 # 允許使用的 bucket 選擇算法,54 代表 straw2 算法
# devices
# 每一個最末端的的物理設(shè)備蜡镶,也叫葉子節(jié)點(diǎn)就叫device雾袱,可以對磁盤進(jìn)行智能識別為 hdd ssd nvme類型。
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
# types
# type 是可以自定義的, 是bucket的類型官还,一般設(shè)計為層級結(jié)構(gòu)芹橡,編號必須為正整數(shù)。
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
# 所有的中間節(jié)點(diǎn)就叫做bucket望伦,bucket可以是一些devices的集合也可以是低一級的buckets的集合, 根節(jié)點(diǎn)稱為root是整個集群的入口林说, bucket的id必須是負(fù)數(shù)且唯一,一個bucket在crush map 實(shí)際存儲位置是 buckets[-1-(bucket id)]屯伞。
host localhost {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.032
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.011
item osd.1 weight 0.011
item osd.2 weight 0.011
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 0.032
alg straw2
hash 0 # rjenkins1
item localhost weight 0.032
}
# rules
# placement rule
rule replicated_rule {
id 0 # id
type replicated # 類型 [replicated|erasure]
min_size 1 # 如果副本數(shù)小于這個數(shù)值腿箩,就不會應(yīng)用這條rule
max_size 10 # 如果副本數(shù)大于這個數(shù)值,就不會應(yīng)用這條rule
step take default # crush規(guī)則的入口劣摇,一般是類型為root的bucket
step choose firstn 0 type osd # 分為choose和chooseleaf兩種珠移,num代表選擇的數(shù)量,type是預(yù)期的bucket類型末融。
step emit # 輸出結(jié)果
}
# end crush map
這里單獨(dú)把 placement rule 提出來钧惧,下面給出了編寫方法。
rule <rulename> {
ruleset <ruleset>
type [replicated|erasure]
min_size <min-size>
max_size <max-size>
step take <bucket-name>
step select [choose|chooseleaf] [firstn|indep] <num> type <bucket-type>
step emit
}
placement rule 執(zhí)行流程如下:
take 操作選擇一個 bucket滑潘,一般是 root 類型的 bucket垢乙。
choose 操作由不同的選擇方式,其輸入都是上一步的輸出语卤。
a) choose firstn 深度優(yōu)先選擇出 num 個類型為 bucket-type 個的子 bucket追逮。
b) chooseleaf 先選擇出 num 個類型為 bucket-type 個子 bucket,然后遞歸到葉子節(jié)點(diǎn)粹舵。
- 如果 num == 0钮孵,num 為 pool 設(shè)置的副本數(shù)。
- 如果 0 < num < pool.size(pool 設(shè)置的副本數(shù))眼滤,那么選出 num 個巴席。
- 如果 num < 0,選出 pool.size - |num| 個诅需。
firstn 和 indep 都是深度優(yōu)先算法。主要區(qū)別在于:如果 num 為4,如果無法選出4個結(jié)果時拣帽,firstn 返回[1,2分衫,4],而 indep 會返回[1般此,2蚪战,CRUSH_ITEM_NONE,4]铐懊。一般選擇 firstn 模式邀桑。
- 編譯 crush map
crushtool -c {decompiled-crush-map-filename} -o {compiled-crush-map-filename}
- 測試 crush map
crushtool -i {compiled-crush-map-filename} --test --min-x 0 --max-x {nums} --num-rep {nums} --ruleset --show_mappings
- 注入集群,使之生效
ceph osd setcrushmap -i {compiled-crush-map-filename}
3 _calc_target()
_calc_target() 是源碼中計算主 osd 目標(biāo)方法科乎,大致流程如下:
- 根據(jù) poolid 獲取 pool 信息壁畸,包括 type、crush id喜喂、pg 數(shù)量等瓤摧。
- 判斷是否強(qiáng)制重發(fā),標(biāo)志位:force_resend
- 判斷是否有緩存池玉吁,若有,則更新 pool 信息為緩存池信息腻异,若沒有进副,則不操作。
- 根據(jù)發(fā)送對象的信息(name悔常,key影斑,namespce)和 poolid,來計算 pg 信息机打,得到關(guān)鍵參數(shù) m_seed矫户。
- 根據(jù) pg 信息,使用 crush 算法計算對象發(fā)送到哪組 osd 以及主 osd残邀。
- 根據(jù)本次發(fā)送的目標(biāo)是否和上一次發(fā)送的一致(對象和 pool 都一樣才為一致)以及 any_change 參數(shù)皆辽,重置 force_resend。
- 讀取 osd 狀態(tài)(CEPH_OSDMAP_PAUSERD芥挣、CEPH_OSDMAP_PAUSEWR)驱闷,判斷是否暫停發(fā)送。
- 判斷是否合法變更空免,標(biāo)志位:legacy_change空另。
- 根據(jù) pool 的 pg 數(shù)量是否變化,若變化蹋砚,則 split_or_merge 標(biāo)志位置真扼菠。
- 更新 op_target_t 參數(shù)摄杂,包括發(fā)送主 osd 以及 osd 組,成功返回 RECALC_OP_TARGET_NEED_RESEND循榆。
以上過程涉及兩次選擇計算析恢,第一次是選擇 pg,第二次是選擇 osd冯痢,接下來將分別介紹這兩個選擇函數(shù) osdmap->object_locator_to_pg() 和 osdmap->pg_to_up_acting_osds()氮昧。
int Objecter::_calc_target(op_target_t *t, Connection *con, bool any_change)
{
//設(shè)置讀寫標(biāo)志
bool is_read = t->flags & CEPH_OSD_FLAG_READ;
bool is_write = t->flags & CEPH_OSD_FLAG_WRITE;
//獲取 osdmap epoch
t->epoch = osdmap->get_epoch();
//獲取指定 poolid 的 pool 信息
const pg_pool_t *pi = osdmap->get_pg_pool(t->base_oloc.pool);
if (!pi) {
t->osd = -1;
return RECALC_OP_TARGET_POOL_DNE;
}
ldout(cct,30) << __func__ << " base pi " << pi
<< " pg_num " << pi->get_pg_num() << dendl;
// 第一次更新 force_recend
bool force_resend = false;
if (osdmap->get_epoch() == pi->last_force_op_resend) {
if (t->last_force_resend < pi->last_force_op_resend) {
t->last_force_resend = pi->last_force_op_resend;
force_resend = true;
} else if (t->last_force_resend == 0) {
force_resend = true;
}
}
//替換為緩存池
// apply tiering
t->target_oid = t->base_oid;
t->target_oloc = t->base_oloc;
if ((t->flags & CEPH_OSD_FLAG_IGNORE_OVERLAY) == 0) {
if (is_read && pi->has_read_tier())
t->target_oloc.pool = pi->read_tier;
if (is_write && pi->has_write_tier())
t->target_oloc.pool = pi->write_tier;
pi = osdmap->get_pg_pool(t->target_oloc.pool);
if (!pi) {
t->osd = -1;
return RECALC_OP_TARGET_POOL_DNE;
}
}
//計算 pg
pg_t pgid;
if (t->precalc_pgid) {
ceph_assert(t->flags & CEPH_OSD_FLAG_IGNORE_OVERLAY);
ceph_assert(t->base_oid.name.empty()); // make sure this is a pg op
ceph_assert(t->base_oloc.pool == (int64_t)t->base_pgid.pool());
pgid = t->base_pgid;
} else {
//具體計算 pg 函數(shù)入口
int ret = osdmap->object_locator_to_pg(t->target_oid, t->target_oloc,
pgid);
if (ret == -ENOENT) {
t->osd = -1;
return RECALC_OP_TARGET_POOL_DNE;
}
}
ldout(cct,20) << __func__ << " target " << t->target_oid << " "
<< t->target_oloc << " -> pgid " << pgid << dendl;
ldout(cct,30) << __func__ << " target pi " << pi
<< " pg_num " << pi->get_pg_num() << dendl;
t->pool_ever_existed = true;
int size = pi->size;
int min_size = pi->min_size;
unsigned pg_num = pi->get_pg_num();
unsigned pg_num_pending = pi->get_pg_num_pending();
int up_primary, acting_primary;
vector<int> up, acting;
// cursh 算法
osdmap->pg_to_up_acting_osds(pgid, &up, &up_primary,
&acting, &acting_primary);
bool sort_bitwise = osdmap->test_flag(CEPH_OSDMAP_SORTBITWISE);
bool recovery_deletes = osdmap->test_flag(CEPH_OSDMAP_RECOVERY_DELETES);
unsigned prev_seed = ceph_stable_mod(pgid.ps(), t->pg_num, t->pg_num_mask);
pg_t prev_pgid(prev_seed, pgid.pool());
//第二次更新 force_resend
if (any_change && PastIntervals::is_new_interval(
t->acting_primary,
acting_primary,
t->acting,
acting,
t->up_primary,
up_primary,
t->up,
up,
t->size,
size,
t->min_size,
min_size,
t->pg_num,
pg_num,
t->pg_num_pending,
pg_num_pending,
t->sort_bitwise,
sort_bitwise,
t->recovery_deletes,
recovery_deletes,
prev_pgid)) {
force_resend = true;
}
//更新 should_be_paused、legacy_change浦楣、split_or_merge
bool unpaused = false;
bool should_be_paused = target_should_be_paused(t);
if (t->paused && !should_be_paused) {
unpaused = true;
}
t->paused = should_be_paused;
bool legacy_change =
t->pgid != pgid ||
is_pg_changed(
t->acting_primary, t->acting, acting_primary, acting,
t->used_replica || any_change);
bool split_or_merge = false;
if (t->pg_num) {
split_or_merge =
prev_pgid.is_split(t->pg_num, pg_num, nullptr) ||
prev_pgid.is_merge_source(t->pg_num, pg_num, nullptr) ||
prev_pgid.is_merge_target(t->pg_num, pg_num);
}
//重置 op_target_t
if (legacy_change || split_or_merge || force_resend) {
t->pgid = pgid;
t->acting = acting; // osd 組
t->acting_primary = acting_primary; // 主 osd
t->up_primary = up_primary;
t->up = up;
t->size = size;
t->min_size = min_size;
t->pg_num = pg_num;
t->pg_num_mask = pi->get_pg_num_mask();
t->pg_num_pending = pg_num_pending;
osdmap->get_primary_shard(
pg_t(ceph_stable_mod(pgid.ps(), t->pg_num, t->pg_num_mask), pgid.pool()),
&t->actual_pgid);
t->sort_bitwise = sort_bitwise;
t->recovery_deletes = recovery_deletes;
ldout(cct, 10) << __func__ << " "
<< " raw pgid " << pgid << " -> actual " << t->actual_pgid
<< " acting " << acting
<< " primary " << acting_primary << dendl;
t->used_replica = false;
if (acting_primary == -1) {
t->osd = -1;
} else {
int osd;
bool read = is_read && !is_write;
//讀操作袖肥,會從幾個acting osd 組中,隨機(jī)選一個振劳,作為讀取目標(biāo) osd
if (read && (t->flags & CEPH_OSD_FLAG_BALANCE_READS)) {
int p = rand() % acting.size();
if (p)
t->used_replica = true;
osd = acting[p];
ldout(cct, 10) << " chose random osd." << osd << " of " << acting
<< dendl;
} else if (read && (t->flags & CEPH_OSD_FLAG_LOCALIZE_READS) &&
acting.size() > 1) {
// look for a local replica. prefer the primary if the
// distance is the same.
int best = -1;
int best_locality = 0;
for (unsigned i = 0; i < acting.size(); ++i) {
int locality = osdmap->crush->get_common_ancestor_distance(
cct, acting[i], crush_location);
ldout(cct, 20) << __func__ << " localize: rank " << i
<< " osd." << acting[i]
<< " locality " << locality << dendl;
if (i == 0 ||
(locality >= 0 && best_locality >= 0 &&
locality < best_locality) ||
(best_locality < 0 && locality >= 0)) {
best = i;
best_locality = locality;
if (i)
t->used_replica = true;
}
}
ceph_assert(best >= 0);
osd = acting[best];
} else {
osd = acting_primary;
}
// 目標(biāo) osd 為 acting_primary osd
t->osd = osd;
}
}
if (legacy_change || unpaused || force_resend) {
return RECALC_OP_TARGET_NEED_RESEND;
}
if (split_or_merge &&
(osdmap->require_osd_release >= CEPH_RELEASE_LUMINOUS ||
HAVE_FEATURE(osdmap->get_xinfo(acting_primary).features,
RESEND_ON_SPLIT))) {
return RECALC_OP_TARGET_NEED_RESEND;
}
return RECALC_OP_TARGET_NO_ACTION;
}
3.1 osdmap->object_locator_to_pg()
OSDMap::object_locator_to_pg() 輸入?yún)?shù)有3個:
- object_t:這是一個結(jié)構(gòu)體椎组,主要參數(shù)只有一個:name 對象名稱。
- object_locator_t:對象定位器历恐,其主要參數(shù):pool id寸癌,key,namespace弱贼,hash(選擇哪種 hash 算法)蒸苇。
- pg_t:要計算的目標(biāo) pg,其主要參數(shù):pool id吮旅,m_seed(hash 種子數(shù)溪烤,用于計算 osd)。
pg_t 中的 pool id 就是 loc 中的 pool id庇勃,實(shí)際需要計算的只有 m_seed檬嘀。根據(jù)下面給出的源碼可知,當(dāng) loc.hash 大于0時责嚷,會把 loc.hash 直接當(dāng)成 m_seed鸳兽,構(gòu)造 pg_t。當(dāng) loc.hash 小于0時罕拂,調(diào)用 map_to_pg() 方法計算 m_seed揍异。
int OSDMap::object_locator_to_pg(
const object_t& oid, const object_locator_t& loc, pg_t &pg) const
{
if (loc.hash >= 0) {
if (!get_pg_pool(loc.get_pool())) {
return -ENOENT;
}
pg = pg_t(loc.hash, loc.get_pool());
return 0;
}
return map_to_pg(loc.get_pool(), oid.name, loc.key, loc.nspace, &pg);
}
strcut pg_t{
...
//構(gòu)造函數(shù)
pg_t(ps_t seed, uint64_t pool) :
m_pool(pool), m_seed(seed) {}
...
}
根據(jù) map_to_pg() 方法可知,ps(m_seed)是由 loc 中的 key 或者 name 配合 namespace 通過 pool->hash_key() 計算出來的聂受。繼續(xù)深入 hash_key() 可以得到調(diào)用鏈:hash_key() -> ceph_str_hash() -> ceph_str_hash_rjenkins()/ceph_str_hash_linux() 蒿秦。一般選擇 rjenkins hash 算法。
int OSDMap::map_to_pg(
int64_t poolid,
const string& name,
const string& key,
const string& nspace,
pg_t *pg) const
{
// calculate ps (placement seed)
const pg_pool_t *pool = get_pg_pool(poolid);
if (!pool)
return -ENOENT;
ps_t ps;
if (!key.empty())
ps = pool->hash_key(key, nspace);
else
ps = pool->hash_key(name, nspace);
*pg = pg_t(ps, poolid);
return 0;
}
通過分析 rjenkins hash 算法蛋济,發(fā)現(xiàn)無論輸入為多少位的字符串棍鳖,輸出總是定長的無符號32位數(shù)字。通過循環(huán)拆分,它把前12位字符渡处,分別拆成3個32位數(shù)字a镜悉、b、c医瘫。方法是:把字符串的每一個字符挨個轉(zhuǎn)成8位的數(shù)字(char 占1 byte侣肄,8 bit),并循環(huán)把第1醇份、2稼锅、3、4個字符拼成32位數(shù)字a僚纷,第5矩距、6、7怖竭、8個字符拼成數(shù)字b锥债,第9、10痊臭、11哮肚、12個字符拼成數(shù)字c,通過 mix() 攪拌函數(shù)充分?jǐn)嚢鑑广匙、b允趟、c。12個字符位一組鸦致,循環(huán)處理輸入的字符串拼窥,并與之前的a、b蹋凝、c相加。若長度不足12总棵,則按順序向前推鳍寂。如:長度位5,只處理前5個字符情龄,前4個字符拼成a迄汛,最后一個拼成b,c為0骤视,執(zhí)行 mix() 函數(shù)鞍爱。最終輸出結(jié)果為c,c就是 hash 種子 m_seed专酗。
#define mix(a, b, c) \
do { \
a = a - b; a = a - c; a = a ^ (c >> 13); \
b = b - c; b = b - a; b = b ^ (a << 8); \
c = c - a; c = c - b; c = c ^ (b >> 13); \
a = a - b; a = a - c; a = a ^ (c >> 12); \
b = b - c; b = b - a; b = b ^ (a << 16); \
c = c - a; c = c - b; c = c ^ (b >> 5); \
a = a - b; a = a - c; a = a ^ (c >> 3); \
b = b - c; b = b - a; b = b ^ (a << 10); \
c = c - a; c = c - b; c = c ^ (b >> 15); \
} while (0)
unsigned ceph_str_hash_rjenkins(const char *str, unsigned length)
{
const unsigned char *k = (const unsigned char *)str;
__u32 a, b, c; /* the internal state */
__u32 len; /* how many key bytes still need mixing */
/* Set up the internal state */
len = length;
a = 0x9e3779b9; /* the golden ratio; an arbitrary value */
b = a;
c = 0; /* variable initialization of internal state */
/* handle most of the key */
while (len >= 12) {
a = a + (k[0] + ((__u32)k[1] << 8) + ((__u32)k[2] << 16) +
((__u32)k[3] << 24));
b = b + (k[4] + ((__u32)k[5] << 8) + ((__u32)k[6] << 16) +
((__u32)k[7] << 24));
c = c + (k[8] + ((__u32)k[9] << 8) + ((__u32)k[10] << 16) +
((__u32)k[11] << 24));
mix(a, b, c);
k = k + 12;
len = len - 12;
}
/* handle the last 11 bytes */
c = c + length;
switch (len) { /* all the case statements fall through */
case 11:
c = c + ((__u32)k[10] << 24);
case 10:
c = c + ((__u32)k[9] << 16);
case 9:
c = c + ((__u32)k[8] << 8);
/* the first byte of c is reserved for the length */
case 8:
b = b + ((__u32)k[7] << 24);
case 7:
b = b + ((__u32)k[6] << 16);
case 6:
b = b + ((__u32)k[5] << 8);
case 5:
b = b + k[4];
case 4:
a = a + ((__u32)k[3] << 24);
case 3:
a = a + ((__u32)k[2] << 16);
case 2:
a = a + ((__u32)k[1] << 8);
case 1:
a = a + k[0];
/* case 0: nothing left to add */
}
mix(a, b, c);
return c;
}
3.2 osdmap->pg_to_up_acting_osds()
pg_to_up_acting_osds() 函數(shù)可以根據(jù)輸入的 pg_t 計算出 up osd 組睹逃、主 up osd、acting osd 組、主 acting osd沉填。注意疗隶,其中 acting 系列才是最終發(fā)送的 osd 目標(biāo)。
通過分析 _pg_to_up_acting_osds() 方法翼闹,發(fā)現(xiàn)計算出最終的的 up osd 組和主 up osd斑鼻,主要經(jīng)過以下5步驟:
- _pg_to_raw_osds():根據(jù) pg 位置進(jìn)行 crush 運(yùn)算,計算出一組 up osd 保存在 raw 中猎荠。
- _apply_unmap():啟用 upmap坚弱,人為指定 pg 到 osd 的映射,得到一組新的 osd关摇。用于均衡 osd 數(shù)據(jù)分布荒叶。
- _raw_to_up_osds():把 raw 數(shù)組轉(zhuǎn)移到 up 數(shù)組,其中保存 osd 組拒垃。
- _pick_primary():從 up 數(shù)組中選出一個主 osd停撞。
- _apply_primary_affinity():啟用 up_primary ,重新選出主 osd悼瓮。
其中 _pg_to_up_acting_osds() 實(shí)現(xiàn)將 pg 通過 crush 算法映射到 osd戈毒,根據(jù) crush map 的規(guī)則計算出一組 osd,是 crush 算法的入口横堡,也是 ceph 的最核心部分埋市。下面將主要介紹此函數(shù),另外四個函數(shù)不做具體分析命贴。
void pg_to_up_acting_osds(pg_t pg, vector<int> *up, int *up_primary,
vector<int> *acting, int *acting_primary) const {
_pg_to_up_acting_osds(pg, up, up_primary, acting, acting_primary);
}
void OSDMap::_pg_to_up_acting_osds(
const pg_t& pg, vector<int> *up, int *up_primary,
vector<int> *acting, int *acting_primary,
bool raw_pg_to_pg) const
{
...
if (_acting.empty() || up || up_primary) {
_pg_to_raw_osds(*pool, pg, &raw, &pps);
_apply_upmap(*pool, pg, &raw);
_raw_to_up_osds(*pool, raw, &_up);
_up_primary = _pick_primary(_up);
_apply_primary_affinity(pps, *pool, &_up, &_up_primary);
...
}
_pg_to_raw_osds() 二次計算了 hash 種子道宅,通過 raw_pg_to_pps() 方法。具體步驟胸蛛,先根據(jù)之前計算的 m_seed 和 b_mask(可以認(rèn)為等于 pg 數(shù)減一)取余(通過 ceph_stable_mod() 函數(shù))污茵,即把 m_seed 定位到所有的 pg 中的一個,得到一個小于等于 pg 數(shù)量的值葬项,然后再把這個值和 poolid 進(jìn)行 rjenkins1 hash 運(yùn)算泞当,得到新的 hash 種子,并命名為 pps (placement ps)民珍。
crush->find_rule() 獲取 crush map 的編號襟士。可以設(shè)置多組 crush map嚷量,并指定實(shí)際使用的 map陋桂。
crush->do_rule() 根據(jù) pps,crush map蝶溶,osd 比重嗜历,計算出 pg 到一組 osd 的映射。
_remove_nonexistent_osds() 刪除不存在的 osd。
void OSDMap::_pg_to_raw_osds(
const pg_pool_t& pool, pg_t pg,
vector<int> *osds,
ps_t *ppps) const
{
// map to osds[]
//獲取新的 hash 種子
ps_t pps = pool.raw_pg_to_pps(pg); // placement ps
unsigned size = pool.get_size();
//獲取 crushmap
// what crush rule?
int ruleno = crush->find_rule(pool.get_crush_rule(), pool.get_type(), size);
if (ruleno >= 0)
//計算 osd 組
crush->do_rule(ruleno, pps, *osds, size, osd_weight, pg.pool());
_remove_nonexistent_osds(pool, *osds);
if (ppps)
*ppps = pps;
}
//取模運(yùn)算秸脱,相當(dāng)于:return (x % bmask) < b ? (x % bmask) : (x % (bamsk / 2));
static inline int ceph_stable_mod(int x, int b, int bmask)
{
//注意:x
if ((x & bmask) < b)
return x & bmask;
else
return x & (bmask >> 1);
}
do_rule() 中調(diào)用了 crush_do_rule()落包,這是真正實(shí)現(xiàn) crush 算法的地方。
template<typename WeightVector>
void do_rule(int rule, int x, vector<int>& out, int maxout,
const WeightVector& weight,
uint64_t choose_args_index) const {
int rawout[maxout];
char work[crush_work_size(crush, maxout)];
crush_init_workspace(crush, work);
crush_choose_arg_map arg_map = choose_args_get_with_fallback(
choose_args_index);
//計算 osd 組摊唇,結(jié)果保存在 rawout咐蝇,numrep 為
int numrep = crush_do_rule(crush, rule, x, rawout, maxout, &weight[0],
weight.size(), work, arg_map.args);
if (numrep < 0)
numrep = 0;
out.resize(numrep);
for (int i=0; i<numrep; i++)
out[i] = rawout[i];
}
3.2.1 crush_do_rule()
/**
* crush_do_rule - calculate a mapping with the given input and rule
* @map: the crush_map
* @ruleno: the rule id
* @x: hash input
* @result: pointer to result vector
* @result_max: maximum result size
* @weight: weight vector (for map leaves) // 葉子節(jié)點(diǎn)就是osd
* @weight_max: size of weight vector
* @cwin: Pointer to at least map->working_size bytes of memory or NULL.
*/
int crush_do_rule(const struct crush_map *map,
int ruleno, int x, int *result, int result_max,
const __u32 *weight, int weight_max,
void *cwin, const struct crush_choose_arg *choose_args)
{
int result_len;
struct crush_work *cw = cwin;
int *a = (int *)((char *)cw + map->working_size);
int *b = a + result_max;
int *c = b + result_max;
int *w = a;
int *o = b;
int recurse_to_leaf; // 是否遞歸到葉子節(jié)點(diǎn)
int wsize = 0;
int osize; // 當(dāng)前step 選擇出來的結(jié)果數(shù)量
int *tmp;
const struct crush_rule *rule;
__u32 step;
int i, j;
int numrep;
int out_size;
/*
* the original choose_total_tries value was off by one (it
* counted "retries" and not "tries"). add one.
* crush map 文件中的choose_total_tries變量是重試的次數(shù),所以總次數(shù)需要+1
*/
int choose_tries = map->choose_total_tries + 1;
int choose_leaf_tries = 0;
/*
* the local tries values were counted as "retries", though,
* and need no adjustment
*/
int choose_local_retries = map->choose_local_tries;
int choose_local_fallback_retries = map->choose_local_fallback_tries;
int vary_r = map->chooseleaf_vary_r;
int stable = map->chooseleaf_stable;
if ((__u32)ruleno >= map->max_rules) {
dprintk(" bad ruleno %d\n", ruleno);
return 0;
}
rule = map->rules[ruleno];
result_len = 0;
// 這里開始循環(huán)執(zhí)行rule的每一步
for (step = 0; step < rule->len; step++) {
int firstn = 0; // 是否使用 firstn 深度優(yōu)先算法
const struct crush_rule_step *curstep = &rule->steps[step];
switch (curstep->op) {
case CRUSH_RULE_TAKE: // 當(dāng)op 為 take的時候是沒有arg2的
// 判斷參數(shù)是否正確巷查,bucket是否存在
if ((curstep->arg1 >= 0 &&
curstep->arg1 < map->max_devices) ||
(-1-curstep->arg1 >= 0 &&
-1-curstep->arg1 < map->max_buckets && // 這里可以看出 bucket的id 是有順序的有序,從-1開始-n,存儲在map中是0至于n-1,
map->buckets[-1-curstep->arg1])) { // The bucket found at __buckets[i]__ must have a crush_bucket.id == -1-i
w[0] = curstep->arg1; // arg1 就是bucket id岛请, 就是root 的id 旭寿,作為下一step開始的點(diǎn)
wsize = 1;
} else {
dprintk(" bad take value %d\n", curstep->arg1);
}
break;
// CRUSH_RULE_SET_* 相關(guān)的參數(shù)都是用來設(shè)置crush 參數(shù)的
case CRUSH_RULE_SET_CHOOSE_TRIES:
if (curstep->arg1 > 0)
choose_tries = curstep->arg1;
break;
case CRUSH_RULE_SET_CHOOSELEAF_TRIES:
if (curstep->arg1 > 0)
choose_leaf_tries = curstep->arg1;
break;
case CRUSH_RULE_SET_CHOOSE_LOCAL_TRIES:
if (curstep->arg1 >= 0)
choose_local_retries = curstep->arg1;
break;
case CRUSH_RULE_SET_CHOOSE_LOCAL_FALLBACK_TRIES:
if (curstep->arg1 >= 0)
choose_local_fallback_retries = curstep->arg1;
break;
case CRUSH_RULE_SET_CHOOSELEAF_VARY_R:
if (curstep->arg1 >= 0)
vary_r = curstep->arg1;
break;
case CRUSH_RULE_SET_CHOOSELEAF_STABLE:
if (curstep->arg1 >= 0)
stable = curstep->arg1;
break;
case CRUSH_RULE_CHOOSELEAF_FIRSTN:
case CRUSH_RULE_CHOOSE_FIRSTN:
firstn = 1;
/* fall through */
case CRUSH_RULE_CHOOSELEAF_INDEP:
case CRUSH_RULE_CHOOSE_INDEP:
if (wsize == 0)
break;
// 帶有CHOOSELEAF的操作都是要遞歸到子節(jié)點(diǎn)的
recurse_to_leaf =
curstep->op ==
CRUSH_RULE_CHOOSELEAF_FIRSTN ||
curstep->op ==
CRUSH_RULE_CHOOSELEAF_INDEP;
/* reset output */
osize = 0; // osize 當(dāng)前step已經(jīng)選出來的數(shù)量
for (i = 0; i < wsize; i++) {
int bno; // bucket id
numrep = curstep->arg1; // 這個numrep 是要選擇的個數(shù),可能為負(fù)數(shù)
if (numrep <= 0) {
numrep += result_max;
if (numrep <= 0)
continue;
}
j = 0;
/* make sure bucket id is valid */
bno = -1 - w[i];
if (bno < 0 || bno >= map->max_buckets) {
// w[i] is probably CRUSH_ITEM_NONE
dprintk(" bad w[i] %d\n", w[i]);
continue;
}
if (firstn) { // 如果使用的是 firstn 深度優(yōu)先算法
int recurse_tries;
if (choose_leaf_tries)
recurse_tries =
choose_leaf_tries;
else if (map->chooseleaf_descend_once) // 這里一直都是設(shè)置為1的崇败,因?yàn)闀斐梢恍┻吔鐔栴}
recurse_tries = 1;
else
recurse_tries = choose_tries;
osize += crush_choose_firstn(
map,
cw,
map->buckets[bno],
weight, weight_max,
x, numrep,
curstep->arg2,
o+osize, j,
result_max-osize,
choose_tries,
recurse_tries,
choose_local_retries,
choose_local_fallback_retries,
recurse_to_leaf,
vary_r,
stable,
c+osize,
0,
choose_args);
} else {
out_size = ((numrep < (result_max-osize)) ?
numrep : (result_max-osize));
crush_choose_indep(
map,
cw,
map->buckets[bno],
weight, weight_max,
x, out_size, numrep,
curstep->arg2,
o+osize, j,
choose_tries,
choose_leaf_tries ?
choose_leaf_tries : 1,
recurse_to_leaf,
c+osize,
0,
choose_args);
osize += out_size;
}
}
if (recurse_to_leaf)
/* copy final _leaf_ values to output set */
memcpy(o, c, osize*sizeof(*o));
/* swap o and w arrays */
tmp = o;
o = w;
w = tmp; // 上一step輸出的結(jié)果盅称,作為下一step的開始,在上一步選擇好的基礎(chǔ)上在進(jìn)行下一步的選擇
wsize = osize;
break;
case CRUSH_RULE_EMIT:
for (i = 0; i < wsize && result_len < result_max; i++) {
result[result_len] = w[i];
result_len++;
}
wsize = 0;
break;
default:
dprintk(" unknown op %d at step %d\n",
curstep->op, step);
break;
}
}
return result_len;
}
可以看到后室,choose 步驟有兩個函數(shù)缩膝,分別是用于復(fù)制算法的 crush_choose_firstn() 和 糾刪碼算法的 crush_choose_indep(),這里主要介紹 choose_firstn岸霹。crush_do_rule() 中實(shí)際就是解析了 crushmap 中的 rule 規(guī)則疾层,并根據(jù)其設(shè)置的參數(shù)、步驟贡避,按部就班的執(zhí)行命令痛黎,算出 osd 組。
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step choose firstn 0 type osd
step emit
}
crush_choose_firstn() 調(diào)用 crush_bucket_choose() 選擇需要的副本數(shù)刮吧,并對選擇出來的 OSD 做了相關(guān)的沖突檢查湖饱,如果沖突檢查失效或者過載,繼續(xù)選擇新的 OSD杀捻,這里有個最大重試次數(shù):local_retries琉历。
/**
* crush_choose_firstn - choose numrep distinct items of given type
* @map: the crush_map
* @bucket: the bucket we are choose an item from
* @x: crush input value
* @numrep: the number of items to choose
* @type: the type of item to choose
* @out: pointer to output vector
* @outpos: our position in that vector
* @out_size: size of the out vector
* @tries: number of attempts to make
* @recurse_tries: number of attempts to have recursive chooseleaf make
* @local_retries: localized retries
* @local_fallback_retries: localized fallback retries
* @recurse_to_leaf: true if we want one device under each item of given type (chooseleaf instead of choose)
* @stable: stable mode starts rep=0 in the recursive call for all replicas
* @vary_r: pass r to recursive calls
* @out2: second output vector for leaf items (if @recurse_to_leaf)
* @parent_r: r value passed from the parent
*/
static int crush_choose_firstn(const struct crush_map *map,
struct crush_work *work,
const struct crush_bucket *bucket,
const __u32 *weight, int weight_max,
int x, int numrep, int type,
int *out, int outpos,
int out_size,
unsigned int tries,
unsigned int recurse_tries,
unsigned int local_retries,
unsigned int local_fallback_retries,
int recurse_to_leaf,
unsigned int vary_r,
unsigned int stable,
int *out2,
int parent_r,
const struct crush_choose_arg *choose_args)
{
int rep; // 計數(shù)器,用來記錄已經(jīng)選擇的數(shù)量
unsigned int ftotal, flocal;
int retry_descent, retry_bucket, skip_rep;
const struct crush_bucket *in = bucket;
int r;
int i;
int item = 0;
int itemtype;
int collide, reject;
int count = out_size;
dprintk("CHOOSE%s bucket %d x %d outpos %d numrep %d tries %d \
recurse_tries %d local_retries %d local_fallback_retries %d \
parent_r %d stable %d\n",
recurse_to_leaf ? "_LEAF" : "",
bucket->id, x, outpos, numrep,
tries, recurse_tries, local_retries, local_fallback_retries,
parent_r, stable);
for (rep = stable ? 0 : outpos; rep < numrep && count > 0 ; rep++) {
/* keep trying until we get a non-out, non-colliding item */
ftotal = 0; // fail total 失敗的總次數(shù)
skip_rep = 0; // 是否跳過這一次選擇
do {
retry_descent = 0;
in = bucket; /* initial bucket */
/* choose through intervening buckets */
flocal = 0; // 當(dāng)前bucket的選擇重試的次數(shù)水醋,局部重試次數(shù)
do {
collide = 0; // 判斷是否有沖撞
retry_bucket = 0;
r = rep + parent_r; // 隨機(jī)因子r
/* r' = r + f_total */
r += ftotal; // 如果選擇失敗,這里要加上失敗次數(shù)再進(jìn)行重試
/* bucket choose */
if (in->size == 0) {
reject = 1;
goto reject;
}
if (local_fallback_retries > 0 &&
flocal >= (in->size>>1) &&
flocal > local_fallback_retries)
item = bucket_perm_choose( // 這是一個后備選擇算法彪置,會記錄之前沖突過的item拄踪,觸發(fā)的條件比較苛刻
in, work->work[-1-in->id],
x, r);
else
item = crush_bucket_choose( // 這里從輸入的bucket中選擇一個item 出來,item 就是bucket的id 號
in, work->work[-1-in->id],
x, r,
(choose_args ? &choose_args[-1-in->id] : 0),
outpos);
if (item >= map->max_devices) { // 如果選出來的item id 比 devices個數(shù)還大肯定是錯誤的
dprintk(" bad item %d\n", item);
skip_rep = 1;
break;
}
/* desired type? */
if (item < 0) // bucket id 都是小于0的拳魁,如果不是那選出來的就是osd
itemtype = map->buckets[-1-item]->type;
else
itemtype = 0; // 不然的話就是osd 類型
dprintk(" item %d type %d\n", item, itemtype);
/* keep going? */
if (itemtype != type) { // 如果選出來的bucket type 跟預(yù)期的bucket type不一樣
if (item >= 0 ||
(-1-item) >= map->max_buckets) {
dprintk(" bad item type %d\n", type);
skip_rep = 1;
break;
}
in = map->buckets[-1-item]; // 將剛剛找到的bucket作為下一次查找的輸入(遞歸)
retry_bucket = 1; // 重新選擇
continue;
}
// 到這一步證明找到的是目標(biāo)類型的bucket或者osd惶桐,跟已經(jīng)找到的進(jìn)行對比,如果沖突那么需要重新查找
/* collision? */
for (i = 0; i < outpos; i++) {
if (out[i] == item) {
collide = 1; // 判斷選擇的是否沖突
break;
}
}
reject = 0;
if (!collide && recurse_to_leaf) { // 如果選出來的bucket不沖突,并且需要遞歸到葉節(jié)點(diǎn)osd
if (item < 0) { // 如果是bucket類型的
int sub_r;
if (vary_r)
sub_r = r >> (vary_r-1);
else
sub_r = 0;
if (crush_choose_firstn(
map,
work,
map->buckets[-1-item], // 注意這里入口變成了剛剛選出來的bucket
weight, weight_max,
x, stable ? 1 : outpos+1, 0,
out2, outpos, count,
recurse_tries, 0,
local_retries,
local_fallback_retries,
0,
vary_r,
stable,
NULL,
sub_r,
choose_args) <= outpos)
/* didn't get leaf */
reject = 1;
} else { // osd
/* we already have a leaf! */
out2[outpos] = item; // 這個是應(yīng)用在需要遞歸到葉子節(jié)點(diǎn)的輸出
}
}
if (!reject && !collide) {
/* out? */
if (itemtype == 0)
reject = is_out(map, weight, // 進(jìn)行osd reweight 的再次過濾
weight_max,
item, x);
}
reject:
if (reject || collide) { // 如果‘沖突‘或者‘故障‘了姚糊,那就重新查找隨機(jī)因子 r 遞增
ftotal++;
flocal++;
if (collide && flocal <= local_retries) // 如果再當(dāng)前bucket下重試次數(shù)還達(dá)到上限local_retries
/* retry locally a few times */
retry_bucket = 1;
else if (local_fallback_retries > 0 &&
flocal <= in->size + local_fallback_retries)
/* exhaustive bucket search */
retry_bucket = 1;
else if (ftotal < tries)
/* then retry descent */
retry_descent = 1;
else
/* else give up */
skip_rep = 1;
dprintk(" reject %d collide %d "
"ftotal %u flocal %u\n",
reject, collide, ftotal,
flocal);
}
} while (retry_bucket); // 在當(dāng)前bucket下重試選擇(局部重試)贿衍,每一次都從頭開始是很消耗資源的
} while (retry_descent); // 從最開始的bucket處開始重新選擇(從頭開始)
if (skip_rep) {
dprintk("skip rep\n");
continue;
}
dprintk("CHOOSE got %d\n", item);
out[outpos] = item;
outpos++;
count--;
#ifndef __KERNEL__
if (map->choose_tries && ftotal <= map->choose_total_tries)
map->choose_tries[ftotal]++;
#endif
}
dprintk("CHOOSE returns %d\n", outpos);
return outpos;
}
crush_bucket_choose() 是選擇 bucket 的算法,有 uniform救恨、list贸辈、tree、straw肠槽、straw2 這5種擎淤。一般都是 starw2 算法。
static int crush_bucket_choose(const struct crush_bucket *in,
struct crush_work_bucket *work,
int x, int r,
const struct crush_choose_arg *arg,
int position)
{
dprintk(" crush_bucket_choose %d x=%d r=%d\n", in->id, x, r);
BUG_ON(in->size == 0);
switch (in->alg) {
case CRUSH_BUCKET_UNIFORM:
return bucket_uniform_choose(
(const struct crush_bucket_uniform *)in,
work, x, r);
case CRUSH_BUCKET_LIST:
return bucket_list_choose((const struct crush_bucket_list *)in,
x, r);
case CRUSH_BUCKET_TREE:
return bucket_tree_choose((const struct crush_bucket_tree *)in,
x, r);
case CRUSH_BUCKET_STRAW:
return bucket_straw_choose(
(const struct crush_bucket_straw *)in,
x, r);
case CRUSH_BUCKET_STRAW2:
return bucket_straw2_choose(
(const struct crush_bucket_straw2 *)in,
x, r, arg, position);
default:
dprintk("unknown bucket %d alg %d\n", in->id, in->alg);
return in->items[0];
}
}
bucket_straw2_choose() 的輸入?yún)?shù)中秸仙,x 為 pgid嘴拢,r 為副本個數(shù)。具體實(shí)現(xiàn)如下:
- generate_exponential_distribution():對每個 item 計算 hash 值寂纪,并進(jìn)行 ln 運(yùn)算席吴,把得到的結(jié)果除以自身權(quán)重 weight。
- hight_draw:比較 draw 值捞蛋,并記錄最大的 draw 值所對應(yīng)的 bucket孝冒。
static int bucket_straw2_choose(const struct crush_bucket_straw2 *bucket,
int x, int r, const struct crush_choose_arg *arg,
int position)
{
unsigned int i, high = 0;
__s64 draw, high_draw = 0;
__u32 *weights = get_choose_arg_weights(bucket, arg, position);
__s32 *ids = get_choose_arg_ids(bucket, arg);
for (i = 0; i < bucket->h.size; i++) {
dprintk("weight 0x%x item %d\n", weights[i], ids[i]);
if (weights[i]) {
draw = generate_exponential_distribution(bucket->h.hash, x, ids[i], r, weights[i]);
} else {
draw = S64_MIN;
}
if (i == 0 || draw > high_draw) {
high = i;
high_draw = draw;
}
}
return bucket->h.items[high];
}
is_out() 通過 osd reweight 再次過濾,具體步驟上文有介紹襟交,這里不再贅述迈倍。
static int is_out(const struct crush_map *map,
const __u32 *weight, int weight_max,
int item, int x)
{
if (item >= weight_max)
return 1;
if (weight[item] >= 0x10000)
return 0;
if (weight[item] == 0)
return 1;
if ((crush_hash32_2(CRUSH_HASH_RJENKINS1, x, item) & 0xffff)
< weight[item])
return 0;
return 1;
}