Overview
Let's quote a passage directly from the SBC-4 document, since the original wording is very clear:
Any medium has the potential for medium defects that cause data to be lost. Therefore, physical blocks and/or logical blocks may contain additional information that allows the detection of changes to the logical block data caused by medium defect or other phenomena. The additional information may also allow the logical block data to be reconstructed following the detection of such a change (e.g., ECC bytes).
A medium defect causes:
a) a recovered error if the device server is able to read or write a logical block within the logical unit’s recovery limits; or
b) an unrecovered error if the device server is unable to read or write a logical block within the logical unit’s recovery limits,
where the logical unit’s recovery limits are:
a) specified in the Read-Write Error Recovery mode page (see 6.5.10);
b) specified in the Verify Error Recovery mode page (see 6.5.11); or
c) vendor specific, if the device server does not implement the Read-Write Error Recovery mode page or the Verify Error Recovery mode page.
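The passage above also effectively defines the difference between a recovered error and an unrecovered error: if the contents of the logical block can be read or written (depending on the operation) within the specified limits, the error counts as recovered. If not, it is unrecovered. These recovery limits are the ones specified in the Read-Write Error Recovery mode page or the Verify Error Recovery mode page, or they are vendor specific (for example, a maximum retry count or a recovery time limit).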
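From the host side, the two outcomes are usually told apart by the sense key that comes back with the command. The following is a minimal sketch, assuming the sense buffer has already been obtained from somewhere (for example an SG_IO call, not shown here); the function names `parse_sense` and `classify` are made up for illustration, and treating every MEDIUM ERROR as "unrecovered" is a simplification.

```python
# Sense key values from SPC (only the two that matter here).
SENSE_KEY_RECOVERED_ERROR = 0x1
SENSE_KEY_MEDIUM_ERROR = 0x3


def parse_sense(sense: bytes):
    """Return (sense_key, asc, ascq) from fixed or descriptor format sense data."""
    if not sense:
        raise ValueError("empty sense buffer")
    response_code = sense[0] & 0x7F
    if response_code in (0x70, 0x71):  # fixed format
        if len(sense) < 14:
            raise ValueError("fixed format sense data too short")
        return sense[2] & 0x0F, sense[12], sense[13]
    if response_code in (0x72, 0x73):  # descriptor format
        if len(sense) < 4:
            raise ValueError("descriptor format sense data too short")
        return sense[1] & 0x0F, sense[2], sense[3]
    raise ValueError(f"unknown sense response code 0x{response_code:02x}")


def classify(sense: bytes) -> str:
    """Map sense data to 'recovered', 'unrecovered', or 'other' (hypothetical helper)."""
    key, _asc, _ascq = parse_sense(sense)
    if key == SENSE_KEY_RECOVERED_ERROR:
        return "recovered"
    if key == SENSE_KEY_MEDIUM_ERROR:
        return "unrecovered"
    return "other"


# Fixed format sample: sense key MEDIUM ERROR, ASC/ASCQ 0x11/0x00 (UNRECOVERED READ ERROR).
sample = bytes([0x70, 0, 0x03, 0, 0, 0, 0, 0x0A, 0, 0, 0, 0, 0x11, 0x00, 0, 0, 0, 0])
print(classify(sample))
```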
Bad block recovery
Support in the drive itself
Medium defects can lead to data loss. A drive may provide a recovery mechanism, or it may not. Ideally, the drive repairs bad blocks on its own by remapping the affected LBAs to good physical blocks. This mechanism is called automatic reassignment of defects. Whether a drive supports it can be checked in the Read-Write Error Recovery mode page.
> sginfo -a /dev/sdc
... # some output omitted
Read-Write Error Recovery mode page (0x1)
-----------------------------------------
AWRE 1
ARRE 1
TB 0
RC 0
EER 0
PER 0
DTE 0
DCR 0
Read Retry Count 1
Correction Span 0
Head Offset Count 0
Data Strobe Offset Count 0
Write Retry Count 1
Recovery Time Limit (ms) 0
... # some output omitted
AWRE: AWRE set to 0 means the drive shall not perform automatic write reassignment. AWRE set to 1 means that when the drive encounters a recovered or unrecovered error during a write, it attempts to reassign the bad block.
ARRE: ARRE set to 0 means the drive shall not perform automatic read reassignment. ARRE set to 1 means that when the drive encounters a recovered error during a read, it attempts to reassign the bad block. Note that reads differ from writes here: a read triggers reassignment only on a recovered error (and only when ARRE is 1). On an unrecovered error, a read is never reassigned automatically.
Write Retry Count: despite the name, this is not the number of times the write itself is retried. It is the maximum number of times the drive runs its recovery procedure when a write request fails.
Read Retry Count: the same as Write Retry Count, but for read requests.
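To make the fields above concrete, here is a minimal sketch that decodes the raw bytes of the Read-Write Error Recovery mode page (page code 0x01). It assumes the page bytes have already been extracted from a MODE SENSE response (header and block descriptors stripped); the byte offsets follow my reading of the SBC page layout, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RwErrorRecoveryPage:
    awre: bool
    arre: bool
    read_retry_count: int
    write_retry_count: int
    recovery_time_limit_ms: int


def decode_rw_error_recovery(page: bytes) -> RwErrorRecoveryPage:
    """Decode the 12-byte Read-Write Error Recovery mode page (hypothetical helper)."""
    if len(page) < 12:
        raise ValueError("mode page too short")
    if page[0] & 0x3F != 0x01:
        raise ValueError("not the Read-Write Error Recovery mode page")
    flags = page[2]  # AWRE ARRE TB RC EER PER DTE DCR, from bit 7 down to bit 0
    return RwErrorRecoveryPage(
        awre=bool(flags & 0x80),
        arre=bool(flags & 0x40),
        read_retry_count=page[3],
        write_retry_count=page[8],
        recovery_time_limit_ms=int.from_bytes(page[10:12], "big"),
    )


# Bytes chosen to match the sginfo output shown earlier (AWRE=1, ARRE=1, retry counts 1).
raw = bytes([0x01, 0x0A, 0xC0, 0x01, 0, 0, 0, 0, 0x01, 0, 0, 0])
print(decode_rw_error_recovery(raw))
```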
Handling read failures (Read with unrecovered Medium error)
When a read request hits an unrecovered error (for example, SCSI status CHECK CONDITION, sense key MEDIUM ERROR, ASC UNRECOVERED READ ERROR), the drive does not trigger automatic reassignment. The upper-layer application has to handle it explicitly:
a) If the application can regenerate the data (for example, by reconstructing it from the other drives in a RAID) and the AWRE bit is 1, the application can send a WRITE command with the data. That WRITE triggers automatic write reassignment.
b) If the application can regenerate the data and the AWRE bit is 0, the application must first send a REASSIGN BLOCKS command to assign a new physical block to the failed LBA, and then send a WRITE command with the data.
c) If the application cannot regenerate the data, it can still use the REASSIGN BLOCKS command to assign a new physical block to the LBA. Since the data cannot be regenerated, however, it is lost.
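The three cases can be summarized as a small decision function. This is only an illustrative sketch: `recovery_steps` is a hypothetical helper that returns the names of the SCSI commands to issue, in order, and does not talk to a device.

```python
from typing import List


def recovery_steps(can_regenerate: bool, awre: bool) -> List[str]:
    """Return the commands to issue after a read fails with an unrecovered error."""
    if can_regenerate and awre:
        # Case a): the WRITE itself triggers automatic write reassignment.
        return ["WRITE"]
    if can_regenerate:
        # Case b): reassign explicitly, then rewrite the regenerated data.
        return ["REASSIGN BLOCKS", "WRITE"]
    # Case c): the LBA becomes usable again, but the old data is lost.
    return ["REASSIGN BLOCKS"]


for regen in (True, False):
    for awre in (True, False):
        print(regen, awre, recovery_steps(regen, awre))
```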
Getting the defect lists
To keep track of bad blocks, the drive needs a list. In fact there are two: the PLIST (primary defect list) and the GLIST (grown defect list). The sginfo command can display both.
[root@node100 ~]# sginfo -d /dev/sdc
INQUIRY response (cmd: 0x12)
----------------------------
Device Type 0
Vendor: HGST
Product: HUS728T8TAL5204
Revision level: C414
>>> Unable to read primary (PLIST) defect data.
Defect Lists
------------
0 entries (0 bytes) in grown (GLIST) table.
Format (4) is: bytes from index [Cyl:Head:Off]
Offset -1 marks whole track as bad.
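What is the difference between the PLIST and the GLIST? Put simply, the PLIST records bad blocks found at the factory. These blocks are skipped when LBAs are mapped, so they do not affect performance. The GLIST records bad blocks found during use. The affected LBAs may be reassigned to other physical blocks, so accessing them does affect performance. For details, see the article "PLIST primary defect list and GLIST grown defect list".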
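For monitoring, the GLIST entry count is the number worth tracking, since a growing GLIST means the drive keeps finding new bad blocks. Below is a minimal sketch that pulls that count out of captured `sginfo -d` text. It assumes the output wording matches the sample shown above, which may differ between sginfo versions; `glist_entries` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches a line such as: "0 entries (0 bytes) in grown (GLIST) table."
_GLIST_RE = re.compile(
    r"(\d+)\s+entries\s+\((\d+)\s+bytes\)\s+in\s+grown\s+\(GLIST\)\s+table"
)


def glist_entries(sginfo_output: str) -> Optional[int]:
    """Return the GLIST entry count, or None if the expected line is missing."""
    match = _GLIST_RE.search(sginfo_output)
    return int(match.group(1)) if match else None


sample_output = """\
>>> Unable to read primary (PLIST) defect data.
Defect Lists
------------
0 entries (0 bytes) in grown (GLIST) table.
"""
print(glist_entries(sample_output))
```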
How to inject drive faults (in software)
The SBC document mentions that the WRITE LONG command can be used to create pseudo unrecovered errors.
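As a rough illustration, the sketch below only builds a WRITE LONG (10) CDB and does not send it anywhere. It assumes the layout as I read it in SBC (opcode 0x3F, the WR_UNCOR bit in byte 1 marking the block as uncorrectable, a 4-byte LBA, and a zero byte transfer length); check it against the specification and your drive before relying on it. `write_long10_cdb` is a hypothetical helper, and a marked block returns an error on read until it is rewritten, so this is for test devices only.

```python
import struct

WRITE_LONG_10 = 0x3F  # operation code
WR_UNCOR = 0x40       # byte 1, bit 6: mark the logical block as uncorrectable


def write_long10_cdb(lba: int, wr_uncor: bool = True) -> bytes:
    """Build a 10-byte WRITE LONG (10) CDB for the given LBA (illustrative only)."""
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("LBA does not fit in the 4-byte field of WRITE LONG (10)")
    flags = WR_UNCOR if wr_uncor else 0
    # opcode, flags, LBA (4 bytes), reserved, byte transfer length (2 bytes), control
    return struct.pack(">BBIBHB", WRITE_LONG_10, flags, lba, 0, 0, 0)


print(write_long10_cdb(0x1234).hex())
```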
References
SBC-4