The health status of the cluster I set up with 3 OSD nodes was often stuck in the WARN state: replicas was set to 3, there were more than 3 OSDs, and not much data was stored, yet ceph -s reported active+undersized+degraded instead of the expected HEALTH_OK. The problem bothered me for quite a while; knowing little about Ceph, I could not find a solution until I recently mailed the community list and got an answer [1].
What the PG states mean
The abnormal PG states are explained in [2]; the meanings of undersized and degraded are worth recording here:
undersized
The placement group has fewer copies than the configured pool replication level.
degraded
Ceph has not replicated some objects in the placement group the correct number of times yet.
These two states usually appear together. Roughly speaking, some PGs do not have as many replicas as configured, and the same goes for part of the objects inside those PGs. Take a look at the PG details:
ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized
pg 17.58 is stuck unclean for 61033.947719, current state active+undersized+degraded, last acting [2,0]
pg 17.16 is stuck unclean for 61033.948201, current state active+undersized+degraded, last acting [0,2]
pg 17.58 is stuck undersized for 61033.343824, current state active+undersized+degraded, last acting [2,0]
pg 17.16 is stuck undersized for 61033.327566, current state active+undersized+degraded, last acting [0,2]
pg 17.58 is stuck degraded for 61033.343835, current state active+undersized+degraded, last acting [2,0]
pg 17.16 is stuck degraded for 61033.327576, current state active+undersized+degraded, last acting [0,2]
pg 17.16 is active+undersized+degraded, acting [0,2]
pg 17.58 is active+undersized+degraded, acting [2,0]
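To dig deeper into a single problem PG than ceph health detail goes, ceph pg <pgid> query dumps the PG's full peering information. A minimal check (JSON output omitted here) would be:
$ ceph pg 17.58 query
# in the JSON output, "state" should read active+undersized+degraded and
# the "up"/"acting" arrays should list only two OSDs, matching the [2,0] above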
Solution
Although the configured number of copies is 3, PG 17.58 and PG 17.16 have only two copies each, placed on OSD 0 and OSD 2.
The root cause is that the disks backing our OSDs are not homogeneous, so each OSD ends up with a different weight, and Ceph does not handle such heterogeneous OSDs very well; as a result, some PGs cannot satisfy the configured number of replicas.
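Before blaming the weights, it is also worth ruling out a plain pool misconfiguration. A quick sanity check could look like this (the pool name rbd is only an example; substitute your own pool):
$ ceph osd pool get rbd size
size: 3
$ ceph osd pool get rbd min_size
min_size: 2
If size is already 3 and the replicas still cannot all be placed, the problem lies with CRUSH and the OSD weights rather than with the pool settings.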
The OSD status tree:
ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.89049 root default
-2 1.81360 host ceph3
2 1.81360 osd.2 up 1.00000 1.00000
-3 0.44969 host ceph4
3 0.44969 osd.3 up 1.00000 1.00000
-4 3.62720 host ceph1
0 1.81360 osd.0 up 1.00000 1.00000
1 1.81360 osd.1 up 1.00000 1.00000
The fix is to build another OSD whose capacity matches the other nodes. Can the sizes deviate a little? My guess is that some deviation is acceptable within a certain range. After the rebuild, the OSD tree looks like this:
$ ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 7.25439 root default
-2 1.81360 host ceph3
2 1.81360 osd.2 up 1.00000 1.00000
-3 0 host ceph4
-4 3.62720 host ceph1
0 1.81360 osd.0 up 1.00000 1.00000
1 1.81360 osd.1 up 1.00000 1.00000
-5 1.81360 host ceph2
3 1.81360 osd.3 up 1.00000 1.00000
The OSD on node ceph4 was removed (its empty host bucket remains with weight 0), and another OSD node, ceph2, was added.
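For reference, the usual sequence for retiring the old OSD and creating an equally sized one on another host is roughly the following; the OSD id 3, the disk /dev/sdb and the ceph-deploy command are assumptions here, so adjust them to your own deployment tooling:
$ ceph osd out 3
$ systemctl stop ceph-osd@3        # run this on the host that carries osd.3
$ ceph osd crush remove osd.3
$ ceph auth del osd.3
$ ceph osd rm 3
# then create a new OSD of the right size on ceph2, for example:
$ ceph-deploy osd create ceph2:/dev/sdb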
$ ceph -s
cluster 20ab1119-a072-4bdf-9402-9d0ce8c256f4
health HEALTH_OK
monmap e2: 2 mons at {ceph2=192.168.17.21:6789/0,ceph4=192.168.17.23:6789/0}
election epoch 26, quorum 0,1 ceph2,ceph4
osdmap e599: 4 osds: 4 up, 4 in
flags sortbitwise,require_jewel_osds
pgmap v155011: 100 pgs, 1 pools, 18628 bytes data, 1 objects
1129 MB used, 7427 GB / 7428 GB avail
100 active+clean
In addition, to meet the HA requirement the OSDs have to be spread across different nodes. With a copy count of 3, three OSD hosts are needed to carry those OSDs; if the three OSDs sit on only two hosts, the "active+undersized+degraded" state can still appear.
The official documentation puts it this way:
This, combined with the default CRUSH failure domain, ensures that replicas or erasure code shards are separated across hosts and a single host failure will not affect availability.
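One way to confirm that the failure domain of your replicated rule really is host is to dump the CRUSH rule and look at the chooseleaf step. On a Jewel cluster with the default rule name, the trimmed output could look roughly like this (the rule name replicated_ruleset is an assumption):
$ ceph osd crush rule dump replicated_ruleset
...
    "steps": [
        { "op": "take", "item": -1, "item_name": "default" },
        { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
        { "op": "emit" }
    ]
With "type": "host", each of the 3 replicas must land on a different host, which is exactly why three OSD hosts with enough capacity are needed.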
If my understanding is wrong, I would appreciate being corrected.
[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg47070.html
[2] http://docs.ceph.com/docs/master/rados/operations/pg-states/
Reposted from: https://blog.csdn.net/chenwei8280/article/details/80785595