ceph osd crush reweight vs. ceph osd reweight

osd crush weight
osd weight
Modifying the crushmap with crushtool
test

When we run ceph osd tree, the output contains both a WEIGHT column and a REWEIGHT column. What exactly are they?

[root@xt7 ceph]# ceph osd tree
// The second column is the osd crush weight; the second-to-last column is the osd weight
ID  WEIGHT   TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY
-13  2.66554 root metadata
-14  1.00401     host xt7-metadata
 23  1.00000         osd.23             up  1.00000          1.00000
-15  1.05763     host xt6-metadata
 11  1.00000         osd.11             up  1.00000          1.00000
-16  0.60390     host xt8-metadata
 35  1.00000         osd.35             up  1.00000          1.00000
-12        0 root default
-11        0     host xt7-default
-10        0     host xt6-default
 -9        0     host xt8-default
 -8  2.90688 root ssd
 -7  0.79999     host xt7-ssd
 14  0.79999         osd.14             up  1.00000          1.00000
 -6  0.90689     host xt6-ssd
  2  1.00000         osd.2              up  1.00000          1.00000
 -5  1.20000     host xt8-ssd
 26  1.20000         osd.26             up  1.00000          1.00000
 -4 30.99991 root hdd
 -3  8.99994     host xt7-hdd
 12  0.79999         osd.12             up  1.00000          1.00000
 13  0.79999         osd.13             up  1.00000          1.00000
 15  0.79999         osd.15             up  1.00000          1.00000
 16  0.79999         osd.16             up  1.00000          1.00000
 17  0.79999         osd.17             up  0.70000          1.00000
 18  1.00000         osd.18             up  1.00000          1.00000
 19  1.00000         osd.19             up  1.00000          1.00000
 20  1.00000         osd.20             up  1.00000          1.00000
 21  1.00000         osd.21             up  1.00000          1.00000
 22  1.00000         osd.22             up  1.00000          1.00000
 -2 10.00000     host xt6-hdd
 0  1.00000         osd.0              up  1.00000          1.00000
 1  1.00000         osd.1              up  1.00000          1.00000
 3  1.00000         osd.3              up  1.00000          1.00000
 4  1.00000         osd.4              up  1.00000          1.00000
 5  1.00000         osd.5              up  1.00000          1.00000
 6  1.00000         osd.6              up  1.00000          1.00000
 7  1.00000         osd.7              up  1.00000          1.00000
 8  1.00000         osd.8              up  1.00000          1.00000
 9  1.00000         osd.9              up  1.00000          1.00000
 10  1.00000         osd.10             up  1.00000          1.00000
 -1 11.99997     host xt8-hdd
 24  1.20000         osd.24             up  1.00000          1.00000
 25  1.20000         osd.25             up  1.00000          1.00000
 27  1.20000         osd.27             up  1.00000          1.00000
 28  1.20000         osd.28             up  1.00000          1.00000
 29  1.20000         osd.29             up  1.00000          1.00000
 30  1.20000         osd.30             up  1.00000          1.00000
 31  1.20000         osd.31             up  1.00000          1.00000
 32  1.20000         osd.32             up  1.00000          1.00000
 33  1.20000         osd.33             up  1.00000          1.00000
 34  1.20000         osd.34             up  1.00000          1.00000

osd crush weight

The crush weight is really the bucket item weight. The Ceph documentation describes bucket item weights as follows:

Weighting Bucket Items
Ceph expresses bucket weights as doubles, which allows for fine weighting. A weight is the relative difference between device capacities. We recommend using 1.00 as the relative weight for a 1TB storage device. In such a scenario, a weight of 0.5 would represent approximately 500GB, and a weight of 3.00 would represent approximately 3TB. Higher level buckets have a weight that is the sum total of the leaf items aggregated by the bucket.
A bucket item weight is one dimensional, but you may also calculate your item weights to reflect the performance of the storage drive. For example, if you have many 1TB drives where some have relatively low data transfer rate and the others have a relatively high data transfer rate, you may weight them differently, even though they have the same capacity (e.g., a weight of 0.80 for the first set of drives with lower total throughput, and 1.20 for the second set of drives with higher total throughput).

“ceph osd crush reweight” sets the CRUSH weight of the OSD. This weight is an arbitrary value (generally the size of the disk in TB or something) and controls how much data the system tries to allocate to the OSD.

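In short, the weight expresses the capacity of a device: 1 TB corresponds to 1.00 and 500 GB to 0.5. A bucket's weight is the sum of its item weights, so changing an item weight changes the bucket weight; in other words, changing osd.X affects its host. Adjusting the crush weight immediately triggers PG reassignment and data migration. This value is normally set according to the OSD's capacity right after the OSD has been initialized.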

Command:
ceph osd crush reweight osd.1 1.2
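
As a rough illustration of the convention above (1.00 per TB, bucket weight equal to the sum of its item weights), here is a minimal Python sketch. The helper names are made up for this example and are not part of Ceph; the OSD weights are copied from the xt7-hdd host in the ceph osd tree output above.

```python
from typing import Sequence

# Hypothetical helpers for illustration only; not part of Ceph.


def crush_weight_for_capacity(capacity_gb: float) -> float:
    """Suggested CRUSH weight using the 1.00-per-TB convention (1 TB taken as 1000 GB)."""
    return capacity_gb / 1000.0


def bucket_weight(item_weights: Sequence[float]) -> float:
    """A bucket's weight is the sum of the weights of the items it contains."""
    return sum(item_weights)


# OSD crush weights under host xt7-hdd, taken from the tree output above.
xt7_hdd_items = [0.79999] * 5 + [1.0] * 5

print(crush_weight_for_capacity(500))   # 0.5
print(bucket_weight(xt7_hdd_items))     # close to the 8.99994 shown for xt7-hdd
```

The small difference from the value printed by ceph osd tree comes from rounding in the displayed weights.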

osd weight

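The osd weight takes a value between 0 and 1. Unlike the crush weight, ceph osd reweight does not affect the host. When an OSD is marked out of the cluster, its osd weight is set to 0; when it is marked in again, it is set to 1.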

“ceph osd reweight” sets an override weight on the OSD. This value is in the range 0 to 1, and forces CRUSH to re-place (1-weight) of the data that would otherwise live on this drive. It does *not* change the weights assigned to the buckets above the OSD, and is a corrective measure in case the normal CRUSH distribution isn’t working out quite right. (For instance, if one of your OSDs is at 90% and the others are at 50%, you could reduce this weight to try and compensate for it.)

Note that ‘ceph osd reweight’ is not a persistent setting. When an OSD gets marked out, the osd weight will be set to 0. When it gets marked in again, the weight will be changed to 1.
Because of this ‘ceph osd reweight’ is a temporary solution. You should only use it to keep your cluster running while you’re ordering more hardware.

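Changing the osd weight also triggers immediate PG reassignment: roughly (USE_DATA * (1 - weight)) of the data is re-placed elsewhere and migrated online. It is typically lowered temporarily when an OSD is near full or full, to buy time until disks are added and the cluster is expanded (deleting unused data is another common approach).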
Command:
ceph osd reweight 1 0.7
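
The (USE_DATA * (1 - weight)) estimate can be written out as a tiny Python function. This is only a back-of-the-envelope sketch: the function name and the sample numbers are hypothetical, and a real cluster will deviate from it because CRUSH moves whole PGs rather than arbitrary amounts of data.

```python
def data_to_replace(used_mb: float, osd_weight: float) -> float:
    """Approximate amount of data CRUSH re-places elsewhere for a given override weight."""
    if not 0.0 <= osd_weight <= 1.0:
        raise ValueError("osd weight must be between 0 and 1")
    return used_mb * (1.0 - osd_weight)


# Hypothetical OSD holding 1000 MB, reweighted to 0.75: about a quarter should move away.
print(data_to_replace(1000, 0.75))  # 250.0
```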

Modifying the crushmap with crushtool:

Get the current crushmap:
ceph osd getcrushmap -o crushmap.bin

Show the settings and replica count of a pool:
ceph osd dump | grep '^pool 0'

Decompile the crushmap:
crushtool -d crushmap.bin -o crushmap.txt

Edit it:
vim crushmap.txt

host xt1-hdd {
    id -1       # do not change unnecessarily
    # weight 2.500
    alg straw
    hash 0  # rjenkins1
    item osd.1 weight 1.500  # change these two values; note that only the item weights can be modified here
    item osd.0 weight 1.500
}

Compile the new crushmap:
crushtool -c crushmap.txt -o crushmap-new.bin

Apply the new CRUSH map to the Ceph cluster:
ceph osd setcrushmap -i crushmap-new.bin

ceph osd df --cluster xtao
ID WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
 2 1.00000  1.00000  7668M  959M  6709M 12.52 1.29 166
 3 1.00000  0.70000  7668M  528M  7140M  6.89 0.71  90
 1 1.50000  1.00000  7668M  920M  6748M 12.00 1.23 141
 0 1.50000  1.00000  7668M  572M  7096M  7.46 0.77 132
              TOTAL 30675M 2980M 27695M  9.72
MIN/MAX VAR: 0.71/1.29  STDDEV: 2.53

The change has indeed taken effect.
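
If the same edit has to be repeated, the manual vim step could be scripted. The Python sketch below rewrites an item weight in the decompiled text. set_item_weight is a hypothetical helper written for this example, and it assumes the item lines look exactly like the ones in the host block above. Its output would still have to be compiled with crushtool -c and applied with ceph osd setcrushmap.

```python
import re


def set_item_weight(crushmap_text: str, item: str, weight: float) -> str:
    """Return crushmap_text with every 'item <item> weight <x>' line set to the new weight."""
    pattern = re.compile(
        r"^(\s*item\s+" + re.escape(item) + r"\s+weight\s+)[0-9.]+",
        flags=re.MULTILINE,
    )
    # A function replacement avoids ambiguity between the group reference and the digits.
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{weight:.3f}", crushmap_text)
    if count == 0:
        raise KeyError(f"item {item!r} not found in crushmap text")
    return new_text


# Sample text modelled on the host block above (weights here are illustrative).
sample = """host xt1-hdd {
    id -1
    alg straw
    hash 0  # rjenkins1
    item osd.1 weight 1.000
    item osd.0 weight 1.000
}
"""

print(set_item_weight(sample, "osd.1", 1.5))
```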

My test 1: ceph osd crush reweight

Before the test:

ceph health detail --cluster xtao
HEALTH_OK

ceph pg dump pgs_brief --cluster xtao | less
pg_stat     state        up   up_primary   acting  acting_primary
9.6b     active+clean    [3,0]     3         [3,0]       3
10.68    active+clean    [0,3]     0         [0,3]       0
10.1f   active+clean  [2,1]    2       [2,1]   2

ceph osd df --cluster xtao
ID WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
 2 1.00000  1.00000  7668M  719M  6949M  9.38 0.97 119
 3 1.00000  1.00000  7668M  754M  6914M  9.84 1.02 137
 1 1.00000  1.00000  7668M  646M  7022M  8.42 0.87 121
 0 1.00000  1.00000  7668M  836M  6832M 10.91 1.13 135
              TOTAL 30675M 2956M 27719M  9.64
MIN/MAX VAR: 0.87/1.13  STDDEV: 0.89

sh /ws/dump_pg.sh
pool :    9    10    | SUM
--------------------------------
osd.0    71    64    | 135
osd.1    57    64    | 121
osd.2    63    56    | 119
osd.3    65    72    | 137
--------------------------------
SUM :    256    256    |
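
The script /ws/dump_pg.sh is not shown in this post. A roughly equivalent per-OSD, per-pool PG count could be sketched in Python as below, by parsing the lines printed by ceph pg dump pgs_brief. The function name and the parsing rules are assumptions based on the column layout shown above; this is not the original script.

```python
import re
from collections import defaultdict

# Matches lines such as "10.1f   active+clean  [2,1]    2       [2,1]   2"
# and captures the pool id and the contents of the "up" set.
PG_LINE = re.compile(r"^\s*(\d+)\.[0-9a-f]+\s+\S+\s+\[([\d,]*)\]")


def count_pgs(pgs_brief_output: str) -> dict:
    """Return {osd_id: {pool_id: number_of_pgs}} based on each PG's 'up' set."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in pgs_brief_output.splitlines():
        match = PG_LINE.match(line)
        if not match:
            continue  # header or unrelated line
        pool_id = int(match.group(1))
        for osd in filter(None, match.group(2).split(",")):
            counts[int(osd)][pool_id] += 1
    return {osd: dict(pools) for osd, pools in counts.items()}


# The three sample lines shown earlier in this test.
sample = """pg_stat     state        up   up_primary   acting  acting_primary
9.6b     active+clean    [3,0]     3         [3,0]       3
10.68    active+clean    [0,3]     0         [0,3]       0
10.1f   active+clean  [2,1]    2       [2,1]   2
"""

for osd, pools in sorted(count_pgs(sample).items()):
    print(f"osd.{osd}", pools)
```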

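Take osd.1 as an example. It currently holds 121 PGs. We will watch how osd.1 and pg 10.1f change; pg 10.1f currently sits on [2,1], and one of the objects in it is: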

ceph osd map cephfs_data 10000002213.00000015 --cluster xtao
osdmap e1148 pool 'cephfs_data' (10) object '10000002213.00000015' -> pg 10.553f4b1f (10.1f) -> up ([2,1], p2) acting ([2,1], p2)

ls /var/lib/ceph/osd/xtao-1/current/10.1f_head/10000002213.00000015__head_553F4B1F__a
-rw-r--r-- 1 root root 4.0M 7月  19 21:36 10000002213.00000015__head_553F4B1F__a

Now start the experiment:
Change the crush weight of osd.1 to 1.5

ceph osd crush reweight osd.1 1.5 --cluster xtao
reweighted item id 1 name 'osd.1' to 1.5 in crush map

ceph health detail --cluster xtao
HEALTH_ERR 35 pgs are stuck inactive for more than 300 seconds; 1 pgs backfill_wait; 14 pgs degraded; 9 pgs peering; 35 pgs stuck inactive; 38 pgs stuck unclean; recovery 5/912 objects degraded (0.548%); recovery 4/912 objects misplaced (0.439%)
pg 10.9 is stuck inactive for 371.069851, current state activating+remapped, last acting [3,0]

pg 9.4d is stuck inactive for 424.559393, current state activating+remapped, last acting [1,2]
pg 10.1f is stuck inactive for 371.326352, current state activating+remapped, last acting [2,1]
pg 9.25 is stuck inactive for 583.209478, current state activating, last acting [1,2]
pg 10.1f is stuck unclean for 371.326495, current state activating+remapped, last acting [2,1]
pg 9.4d is stuck unclean for 424.559540, current state activating+remapped, last acting [1,2]
pg 9.7d is stuck unclean for 387.550744, current state activating+remapped, last acting [3,1]
pg 9.4c is stuck unclean for 378.600928, current state activating+remapped, last acting [3,0]
...
pg 10.2a is peering, acting [1,3]
pg 9.29 is activating+degraded, acting [3,1]
pg 10.2e is peering, acting [1,2]
pg 10.32 is activating+degraded, acting [1,3]
pg 10.39 is peering, acting [1,2]
pg 9.3a is activating+degraded, acting [2,1]
pg 10.6f is peering, acting [1,2]
pg 10.7d is active+degraded, acting [0,2]
pg 9.f is activating+degraded, acting [2,1]
pg 9.c is activating+degraded, acting [3,1]
recovery 5/912 objects degraded (0.548%)
recovery 4/912 objects misplaced (0.439%)

The state of pg 10.1f changed to activating+remapped.
After remapping and peering have completed:

ceph pg dump pgs_brief --cluster xtao | less
pg_stat     state         up   up_primary   acting  acting_primary
10.1f   active+clean  [0,2]    0       [0,2]      0

The acting set of pg 10.1f has changed: [2,1] --> [0,2]

ceph osd df --cluster xtao
ID WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
 2 1.00000  1.00000  7668M  716M  6952M  9.34 0.97 118
 3 1.00000  1.00000  7668M  768M  6900M 10.02 1.04 138
 1 1.50000  1.00000  7668M  912M  6756M 11.90 1.23 162
 0 1.00000  1.00000  7668M  568M  7100M  7.41 0.77  94
              TOTAL 30675M 2964M 27711M  9.66
MIN/MAX VAR: 0.77/1.23  STDDEV: 1.60

The amount of data on osd.1 increased: 646M --> 912M


sh /ws/dump_pg.sh
pool :    9    10    | SUM
--------------------------------
osd.0    48    46    | 94
osd.1    80    82    | 162
osd.2    62    56    | 118
osd.3    66    72    | 138
--------------------------------
SUM :    256    256    |

The number of PGs on osd.1 increased: 121 --> 162
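
The shift in PG counts is roughly what the weights predict. As a sanity check, the Python sketch below computes each OSD's expected share of PGs in proportion to its crush weight. It ignores the host level of the hierarchy and the pseudo-random nature of CRUSH placement, both of which are simplifying assumptions. The observed counts are taken from the dump_pg.sh output above.

```python
def expected_pg_counts(crush_weights: dict, total_pg_replicas: int) -> dict:
    """Split total_pg_replicas across OSDs in proportion to their crush weights."""
    total_weight = sum(crush_weights.values())
    return {
        osd: total_pg_replicas * weight / total_weight
        for osd, weight in crush_weights.items()
    }


# Crush weights after the reweight, and the observed PG counts from the table above.
weights = {"osd.0": 1.0, "osd.1": 1.5, "osd.2": 1.0, "osd.3": 1.0}
observed = {"osd.0": 94, "osd.1": 162, "osd.2": 118, "osd.3": 138}

expected = expected_pg_counts(weights, sum(observed.values()))
for osd in sorted(weights):
    print(f"{osd}: expected about {expected[osd]:.1f}, observed {observed[osd]}")
```

With these numbers osd.1 would be expected to hold about a third of the 512 PG replicas (roughly 171), against 162 observed.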

My test 2: ceph osd reweight

Continuing from the result above:

ceph osd reweight 3 0.7 --cluster xtao

ceph pg dump pgs_brief --cluster xtao |less
pg_stat     state         up      up_primary      acting  acting_primary
 9.6b    active+clean    [3,0]        3            [3,0]        3

ceph health detail --cluster xtao
HEALTH_ERR 48 pgs are stuck inactive for more than 300 seconds; 5 pgs degraded; 34 pgs peering; 48 pgs stuck inactive
pg 9.39 is stuck inactive for 46841.599332, current state remapped+peering, last acting [3,1]
pg 10.55 is stuck inactive for 47017.191571, current state activating+degraded, last acting [0,2]
pg 9.3e is stuck inactive for 46817.170840, current state remapped+peering, last acting [1,3]
pg 9.7 is stuck inactive for 46841.604976, current state activating+remapped, last acting [0,3]
pg 9.43 is stuck inactive for 46774.034149, current state remapped+peering, last acting [3,1]
pg 10.73 is stuck inactive for 46970.796402, current state remapped+peering, last acting [1,3]
pg 10.46 is stuck inactive for 47000.825895, current state remapped+peering, last acting [1,3]
pg 10.17 is stuck inactive for 47057.230433, current state activating+degraded, last acting [0,2]
pg 10.3c is stuck inactive for 47007.838808, current state remapped+peering, last acting [1,3]
pg 9.44 is stuck inactive for 46841.600278, current state remapped+peering, last acting [3,1]
pg 9.75 is stuck inactive for 46770.993348, current state remapped+peering, last acting [3,1]
pg 9.4b is stuck inactive for 1194759.272957, current state activating+degraded, last acting [0,2]
pg 9.48 is stuck inactive for 46841.600050, current state activating, last acting [0,2]
pg 10.4b is stuck inactive for 46817.057872, current state remapped+peering, last acting [3,1]
pg 9.8 is stuck inactive for 46841.602502, current state remapped+peering, last acting [1,3]
pg 9.6b is stuck inactive for 935534.199493, current state activating+remapped, last acting [0,3]
pg 9.61 is stuck inactive for 46841.598176, current state activating+remapped, last acting [0,3]

ceph osd df --cluster xtao
ID WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
 2 1.00000  1.00000  7668M  962M  6706M 12.56 1.29 166
 3 1.00000  0.70000  7668M  528M  7140M  6.89 0.71  90
 1 1.50000  1.00000  7668M  917M  6751M 11.97 1.23 159
 0 1.00000  1.00000  7668M  566M  7102M  7.39 0.76  97
              TOTAL 30675M 2975M 27700M  9.70
MIN/MAX VAR: 0.71/1.29  STDDEV: 2.56

The amount of data on osd.3 decreased: 768M --> 528M, i.e. to 68.75% of what it was before.

sh /ws/dump_pg.sh
dumped all in format plain

pool :    9    10    | SUM
--------------------------------
osd.0    49    48    | 97
osd.1    79    80    | 159
osd.2    88    78    | 166
osd.3    40    50    | 90
--------------------------------
SUM :    256    256    |

The number of PGs on osd.3 decreased as well: 138 --> 90
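
To compare the result with the 0.7 override weight, the before/after ratios can be computed directly. A small Python check, using only the numbers reported above (the helper name is made up for this example):

```python
def remaining_fraction(before: float, after: float) -> float:
    """Fraction of the original amount that is still on the OSD."""
    return after / before


data_ratio = remaining_fraction(768, 528)   # MB used on osd.3 before and after
pg_ratio = remaining_fraction(138, 90)      # PG count on osd.3 before and after

print(f"data: {data_ratio:.2%}")  # 68.75%
print(f"pgs:  {pg_ratio:.2%}")
```

Both ratios land close to the 0.7 target. An exact match should not be expected, since CRUSH moves whole PGs and PGs differ in size.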

Reference: http://cephnotes.ksperis.com/blog/2013/12/09/ceph-osd-reweight

Last edited: 2018.07.23 10:30:26
