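Background
An unexpected power outage left two OSDs down in each of the three failure domains that hold the three replicas. The situation was critical and had to be repaired on site to prevent data loss.
Ceph version: 0.94.10
The cluster ended up in this state because two OSDs were down in each of the three domains.
Cause: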
In certain cases, the ceph-osd Peering process can run into problems, preventing a PG from becoming active and usable.
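In other words, peering was blocked.
Peering itself is a very complex process; I intend to write it up separately later if needed.
Note:
Apart from the PGs in the failure domains above, every other PG went from down back to a normal state during data recovery. The two PGs we need to care about are 2.9a4 and 2.eac. (These two PGs were actually stale right after the power loss; they only became down after our operations staff inadvertently brought one OSD back up.)
So the problem we really have to solve is the stale PGs, which means bringing the relevant OSDs back up.
Solution
Finding the OSD
Commands: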
ceph-objectstore-tool --data-path xxx --journal-path xxx --op list-ops
ceph tell <pgid> query
ceph pg <pgid> query
In our case the problem was osd.514.
The log from the failed start of osd.514:
-3> 2019-07-09 22:13:40.493486 7f97a6d81880 20 read_log coll 2.9a4_head log_oid 2/9a4//head
-2> 2019-07-09 22:13:40.493565 7f97a6d81880 20 read_log 231404'10412575 (231404'10412574) modify 2/2d7009a4/rbd_data.d0dbe05a5f008d.0000000000004532/head by client.156138418.0:1761183652 2019-07-04 03:44:14.804189
-1> 2019-07-09 22:13:40.493582 7f97a6d81880 20 read_log 233319'10412574 (100460'2793103) modify 2/c03f99a4/rbd_data.c2639d7eb9c025.000000000000a802/head by client.232205693.0:3303 2019-07-05 07:43:03.907875
0> 2019-07-09 22:13:40.495397 7f97a6d81880 -1 osd/PGLog.cc: In function 'static void PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, std::set<std::basic_string<char> >*)' thread 7f97a6d81880 time 2019-07-09 22:13:40.493592
osd/PGLog.cc: 911: FAILED assert(last_e.version.version < e.version.version)
ceph version 0.94.10.1 (c5ce8260cade179b7dd358a340351c4029e239c1)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbdf665]
2: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x1a38) [0x7751e8]
3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x34f) [0x7f852f]
4: (OSD::load_pgs()+0xa99) [0x6bd539]
5: (OSD::init()+0x181a) [0x6c10da]
6: (main()+0x29dd) [0x64854d]
7: (__libc_start_main()+0xf5) [0x7f97a411daf5]
8: /usr/bin/ceph-osd() [0x661f39]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
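The failure occurs in read_log, so we looked at that part of the Ceph code.
The assertion fails at exactly this point: each log entry being read must have a version strictly greater than that of the previous entry.
Root cause analysis:
During the unexpected power loss (the spinning disks lost power abruptly), part of the pglog on this OSD was not written correctly.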
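The ordering that the assertion enforces can be checked directly against the read_log lines in the OSD log. The following is a minimal sketch, not part of Ceph: the function name and the regular expression are assumptions made for illustration, and only the version half of each epoch'version pair is compared, as in the failed assert.

```python
import re

# Matches the "epoch'version" token that directly follows "read_log".
ENTRY_RE = re.compile(r"read_log (\d+)'(\d+) ")

def find_version_regressions(lines):
    """Return (previous, current) pairs whose version does not increase."""
    regressions = []
    last = None
    for line in lines:
        m = ENTRY_RE.search(line)
        if not m:
            continue  # e.g. the "read_log coll ..." header line
        current = (int(m.group(1)), int(m.group(2)))
        # Same condition as the assert: last.version < current.version
        if last is not None and not last[1] < current[1]:
            regressions.append((last, current))
        last = current
    return regressions

# The three read_log lines from the osd.514 log above (trailing fields trimmed).
log_lines = [
    "20 read_log coll 2.9a4_head log_oid 2/9a4//head",
    "20 read_log 231404'10412575 (231404'10412574) modify 2/2d7009a4/rbd_data.d0dbe05a5f008d.0000000000004532/head",
    "20 read_log 233319'10412574 (100460'2793103) modify 2/c03f99a4/rbd_data.c2639d7eb9c025.000000000000a802/head",
]

for prev, cur in find_version_regressions(log_lines):
    print("out of order: %d'%d followed by %d'%d" % (prev + cur))
```

On these lines the check reports that 231404'10412575 is followed by 233319'10412574, which is the pair that trips the assertion.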
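Fix:
Delete the object affected by the corrupted pglog.
1. Use the ceph-kvstore-tool utility.
The version shipped with this release has no rm subcommand (it is only available from the L release, Luminous, onward). Because the tools are backward compatible, we borrowed the L-release build: copy ceph-kvstore-tool to /usr/bin/ on the target machine, and copy /usr/lib64/ceph/libceph-common.so and /usr/lib64/ceph/libceph-common.so.0 to the corresponding directory.
2. Use ceph-objectstore-tool.
Deleting the log entry requires its prefix and key, so we added code to the read-log path used by ceph-objectstore-tool to print that information. The patch is as follows: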
diff --git a/src/os/DBObjectMap.cc b/src/os/DBObjectMap.cc
index b856849..7946763 100644
--- a/src/os/DBObjectMap.cc
+++ b/src/os/DBObjectMap.cc
@@ -250,12 +250,14 @@ int DBObjectMap::DBObjectMapIteratorImpl::init()
}
ObjectMap::ObjectMapIterator DBObjectMap::get_iterator(
- const ghobject_t &oid)
+ const ghobject_t &oid, std::ostream *out)
{
MapHeaderLock hl(this, oid);
Header header = lookup_map_header(hl, oid);
if (!header)
return ObjectMapIterator(new EmptyIteratorImpl());
+ if (out)
+ *out << "header seq: " << header_key(header->seq) << std::endl;
DBObjectMapIterator iter = _get_iterator(header);
iter->hlock.swap(hl);
return iter;
diff --git a/src/os/DBObjectMap.h b/src/os/DBObjectMap.h
index de80d6f..7ec43b0 100644
--- a/src/os/DBObjectMap.h
+++ b/src/os/DBObjectMap.h
@@ -219,7 +219,7 @@ public:
int list_objects(vector<ghobject_t> *objs ///< [out] objects
);
- ObjectMapIterator get_iterator(const ghobject_t &oid);
+ ObjectMapIterator get_iterator(const ghobject_t &oid, std::ostream *out = NULL);
static const string USER_PREFIX;
static const string XATTR_PREFIX;
diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc
index e0afbd0..bce7d6d 100644
--- a/src/os/FileStore.cc
+++ b/src/os/FileStore.cc
@@ -4747,7 +4747,10 @@ ObjectMap::ObjectMapIterator FileStore::get_omap_iterator(coll_t c,
if (r < 0)
return ObjectMap::ObjectMapIterator();
}
- return object_map->get_iterator(hoid);
+ ostringstream oss;
+ ObjectMap::ObjectMapIterator oiter = object_map->get_iterator(hoid, &oss);
+ dout(0) << "nyao: " << " " << oss.str() << dendl;
+ return oiter;
}
int FileStore::_collection_hint_expected_num_objs(coll_t c, uint32_t pg_num,
diff --git a/src/os/ObjectMap.h b/src/os/ObjectMap.h
index 86f9e3e..27de1ad 100644
--- a/src/os/ObjectMap.h
+++ b/src/os/ObjectMap.h
@@ -150,7 +150,7 @@ public:
virtual ~ObjectMapIteratorImpl() {}
};
typedef ceph::shared_ptr<ObjectMapIteratorImpl> ObjectMapIterator;
- virtual ObjectMapIterator get_iterator(const ghobject_t &oid) {
+ virtual ObjectMapIterator get_iterator(const ghobject_t &oid, std::ostream *oss = NULL) {
return ObjectMapIterator();
}
diff --git a/src/osd/PGLog.cc b/src/osd/PGLog.cc
index a34903f..11fd555 100644
--- a/src/osd/PGLog.cc
+++ b/src/osd/PGLog.cc
@@ -19,7 +19,8 @@
#include "PG.h"
#include "SnapMapper.h"
#include "../include/unordered_map.h"
-
+//#include "os/DBObjectMap.h"
+//#include "os/KeyValueDB.h"
#define dout_subsys ceph_subsys_osd
static coll_t META_COLL("meta");
@@ -906,8 +907,10 @@ void PGLog::read_log(ObjectStore *store, coll_t pg_coll,
pg_log_entry_t e;
e.decode_with_checksum(bp);
dout(20) << "read_log " << e << dendl;
- oss<<e.get_key_name();
- oss<<" ";
+ dout(20) << e.get_key_name() << dendl;
+
+ //DBObjectMap* dbobjectmap= new DBObjectMap(new KeyValueDB());
+
//oss<<"hello_world ";
if (!log.log.empty()) {
pg_log_entry_t last_e(log.log.back());
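This gave us the prefix and key. We then copied the compiled binary to the target machine; the tool links against some libraries, so those had to be copied over as well.
3. Delete the entry.
Before deleting, use the get command to save the relevant key first, in case the wrong one is removed.
After that, osd.514 started normally.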
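The back-up-then-delete order in step 3 can be sketched as a small helper that only builds the two command lines. This is an illustration under assumptions: it relies on the Luminous-era argument order `ceph-kvstore-tool <backend> <store path> <command> ...`, it assumes a leveldb-backed FileStore omap, and the path, prefix, key, and backup file name in the usage lines are hypothetical placeholders rather than values from this incident.

```python
import shlex

def backup_then_remove_cmds(omap_path, prefix, key, backup_file, backend="leveldb"):
    """Build (get, rm) argument lists; nothing is executed here."""
    base = ["ceph-kvstore-tool", backend, omap_path]
    # "get <prefix> <key> out <file>" saves the raw value before it is removed.
    get_cmd = base + ["get", prefix, key, "out", backup_file]
    # "rm <prefix> <key>" deletes exactly one key under the prefix.
    rm_cmd = base + ["rm", prefix, key]
    return get_cmd, rm_cmd

# Hypothetical placeholder values; substitute the prefix and key printed by
# the patched read-log code.
get_cmd, rm_cmd = backup_then_remove_cmds(
    "/var/lib/ceph/osd/ceph-514/current/omap",  # assumed FileStore omap path
    "PREFIX_PLACEHOLDER",
    "KEY_PLACEHOLDER",
    "/root/pglog_entry.bak",
)
for cmd in (get_cmd, rm_cmd):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands instead of running them leaves room to check the prefix and key against the tool's list output, with the OSD stopped, before anything is removed.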
4. Query the relevant PGs.
Following the hints in the query output, mark the relevant OSDs as lost:
ceph osd lost <osd id> --yes-i-really-mean-it
Restart osd.514; the cluster returned to normal.
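The hints mentioned in step 4 appear in the recovery_state section of the pg query output. Below is a minimal sketch of extracting them, assuming the JSON field names peering_blocked_by, osd, and comment as emitted by Hammer-era releases. The sample document is hand-written for illustration, uses a made-up OSD id, and is not output captured from this cluster.

```python
import json

def peering_blockers(pg_query):
    """Return (osd_id, comment) for each OSD that blocks peering."""
    blockers = []
    for state in pg_query.get("recovery_state", []):
        for item in state.get("peering_blocked_by", []):
            blockers.append((item["osd"], item.get("comment", "")))
    return blockers

# Illustrative sample only; osd 7 is a hypothetical id.
sample = json.loads("""
{
  "state": "down+peering",
  "recovery_state": [
    {
      "name": "Started/Primary/Peering",
      "blocked": "peering is blocked due to down osds",
      "down_osds_we_would_probe": [7],
      "peering_blocked_by": [
        {"osd": 7, "current_lost_at": 0,
         "comment": "starting or marking this osd lost may let us proceed"}
      ]
    },
    {"name": "Started"}
  ]
}
""")

for osd_id, comment in peering_blockers(sample):
    print("osd.%d: %s" % (osd_id, comment))
```

In practice the input would come from saving the output of `ceph pg <pgid> query` to a file and loading it with json.load.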
Post-mortem: reproducing the PG states
pg stale:
stale - The placement group status has not been updated by a ceph-osd, indicating that all nodes storing this placement group may be down.
Prepare a simple three-node Ceph environment.
Step 1: Create an rbd pool and write some data into it with fio.
Step 2: Bring the relevant OSDs down and use ceph-objectstore-tool to remove one PG; this must be done on all three OSDs.
Step 3: ceph pg deep-scrub <pgid>
pg down:
1. Prepare a simple three-node Ceph environment.
Step 1: Bring one node down.
Step 2: Write data to the cluster with fio.
Step 3: Bring the remaining two nodes down.
Step 4: Bring the node that went down first back up.
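Why these four steps end in pg down can be illustrated with a toy model of the check peering performs over past intervals: for every interval in which writes may have happened, at least one OSD from that interval's acting set must be up now, or have been marked lost. This is a simplified sketch for replicated pools, not Ceph's actual implementation; the function name and the one-OSD-per-node layout (osd 0, 1, 2) are assumptions.

```python
def pg_is_down(rw_intervals, up_osds, lost_osds=frozenset()):
    """rw_intervals: acting sets of past intervals that may have accepted writes."""
    for acting in rw_intervals:
        # An OSD counts as accounted for if it is up, or if the operator has
        # declared it lost (accepting that its writes may be gone).
        accounted = [o for o in acting if o in up_osds or o in lost_osds]
        if acting and not accounted:
            return True  # nobody can vouch for this interval's writes
    return False

# Steps 1-4 above, assuming one OSD per node:
intervals = [
    [0, 1, 2],  # before step 1: all three nodes serve the PG
    [1, 2],     # after step 1: node 0 is down, fio writes land on 1 and 2
]

print(pg_is_down(intervals, up_osds={0}))                    # after step 4
print(pg_is_down(intervals, up_osds={0}, lost_osds={1, 2}))  # after 'ceph osd lost'
print(pg_is_down(intervals, up_osds={0, 1}))                 # one later writer is back
```

The first call models the state after step 4: only the node that missed the fio writes is up, so the second interval has no witness and the PG stays down. Marking the missing OSDs lost, or starting one of them, lets peering proceed.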
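2. Clock skew can also cause pg down.
Step 1: Prepare an environment with the osd tree shown in the original figure (not included here).
Step 2: Set the noout flag: ceph osd set noout
Step 3: Bring osd.0, osd.2, and osd.3 down.
Step 4: Set the clock on host n2 to the year 2014: date -s 2014-14-12
Step 5: Start osd.0 and osd.2.
Result:
If the cluster recovers once osd.3 is finally started, this shows that as long as two replicas have the correct time, the PG can reach agreement.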