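1. Disk layout and file system structure
After partitioning, a disk typically consists of the MBR, the MBR gap, and one or more partitions.
a. The MBR plus the MBR gap usually span the first 2048 sectors (1 MiB with 512-byte sectors), because modern partitioning tools start the first partition at sector 2048. This area holds the boot loader (GRUB, LILO, and so on) that starts the system.
b. The MBR itself is always 512 bytes: 446 (boot code area) + 64 (partition table) + 2 (fixed signature 55 AA).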
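The 446 + 64 + 2 split can be checked with a few lines of Python. This is a minimal sketch that works on a synthetic sector built in memory; the partition values (type 0x83, start sector 2048, 204800 sectors) are made up for illustration, and `parse_mbr` is a hypothetical helper name.

```python
import struct

def parse_mbr(sector: bytes) -> dict:
    """Split a 512-byte MBR into boot code, partition entries, and signature."""
    if len(sector) != 512:
        raise ValueError("an MBR is exactly 512 bytes")
    boot_code = sector[:446]        # boot loader code
    table = sector[446:510]         # 4 entries of 16 bytes each
    signature = sector[510:512]     # must be 55 AA
    partitions = []
    for i in range(4):
        entry = table[i * 16:(i + 1) * 16]
        # byte 0: boot flag, byte 4: partition type,
        # bytes 8-11: first LBA, bytes 12-15: sector count (little-endian)
        status, ptype, first_lba, sectors = struct.unpack_from("<B3xB3xII", entry)
        if ptype != 0:              # type 0 marks an unused slot
            partitions.append({"index": i, "bootable": status == 0x80,
                               "type": ptype, "first_lba": first_lba,
                               "sectors": sectors})
    return {"boot_code_len": len(boot_code),
            "valid": signature == b"\x55\xaa",
            "partitions": partitions}

# Synthetic MBR with one bootable Linux partition (illustrative values).
demo = bytearray(512)
demo[446:462] = struct.pack("<B3xB3xII", 0x80, 0x83, 2048, 204800)
demo[510:512] = b"\x55\xaa"
info = parse_mbr(bytes(demo))
print(info)
```

To inspect a real disk, the same function could be fed the first 512 bytes read from the device node, which requires root privileges.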
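After a partition is formatted, space at the start of the partition is reserved for the boot sector (typically 1024 bytes), and the rest is divided into block groups. The structure of a block group is shown in the figure below.
Functions of each part of a block group
Superblock
a. The superblock describes the basic properties of the file system: its starting position, the number and size of blocks and inodes, the features the file system supports, and so on.
b. The superblock is critical to the file system. The primary copy normally lives in the first block of block group 0, and several backups are stored in other block groups. If the superblock is damaged, the file system can no longer be recognized.
c. An example superblock follows: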
Filesystem volume name: <none>
Last mounted on: /root
Filesystem UUID: c55383bd-6336-4503-a8c8-a644f340bc4c
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 6430720
Block count: 25700608
Reserved block count: 1285029
Free blocks: 7497348
Free inodes: 5878745
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Reserved GDT blocks: 1018
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Thu Feb 18 10:40:42 2021
Last mount time: Wed Mar 23 19:12:09 2022
Last write time: Wed Mar 23 19:12:09 2022
Mount count: 172
Maximum mount count: -1
Last checked: Mon Nov 8 14:04:12 2021
Check interval: 0 (<none>)
Lifetime writes: 3056 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
First orphan inode: 1049286
Default directory hash: half_md4
Directory Hash Seed: 11a0c299-a897-4d08-a36e-5d72df2a9581
Journal backup: inode blocks
Checksum type: crc32c
Checksum: 0xe080d5c7
Journal features: journal_incompat_revoke journal_64bit journal_checksum_v3
Journal size: 256M
Journal length: 65536
Journal sequence: 0x013ce92a
Journal start: 48115
Journal checksum type: crc32c
Journal checksum: 0xf282bb35
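Several numbers in the dump above can be cross-checked against each other. The sketch below decodes a handful of superblock fields and derives the number of block groups. It runs on a synthetic buffer filled with the values from the dump rather than on a real device. The field offsets follow the ext4 on-disk layout; only the low 32 bits of the block count are read, which is an assumption that holds for this file system even though it has the 64bit feature.

```python
import struct

EXT_MAGIC = 0xEF53

def parse_superblock(sb: bytes) -> dict:
    """Decode a few little-endian fields from a 1024-byte ext superblock."""
    inodes_count, blocks_count_lo = struct.unpack_from("<II", sb, 0x00)
    first_data_block, log_block_size = struct.unpack_from("<II", sb, 0x14)
    (blocks_per_group,) = struct.unpack_from("<I", sb, 0x20)
    (inodes_per_group,) = struct.unpack_from("<I", sb, 0x28)
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT_MAGIC:
        raise ValueError("magic is not 0xEF53")
    return {"inodes_count": inodes_count,
            "blocks_count": blocks_count_lo,      # low 32 bits only
            "first_data_block": first_data_block,
            "block_size": 1024 << log_block_size,
            "blocks_per_group": blocks_per_group,
            "inodes_per_group": inodes_per_group}

def group_count(sb: dict) -> int:
    span = sb["blocks_count"] - sb["first_data_block"]
    return -(-span // sb["blocks_per_group"])     # ceiling division

# Synthetic superblock filled with the values from the dump above.
raw = bytearray(1024)
struct.pack_into("<II", raw, 0x00, 6430720, 25700608)
struct.pack_into("<II", raw, 0x14, 0, 2)          # first block 0; 1024 << 2 = 4096
struct.pack_into("<I", raw, 0x20, 32768)
struct.pack_into("<I", raw, 0x28, 8192)
struct.pack_into("<H", raw, 0x38, EXT_MAGIC)

sb = parse_superblock(bytes(raw))
groups = group_count(sb)
print(sb["block_size"], groups, groups * sb["inodes_per_group"])
```

The result is 785 block groups, and 785 x 8192 inodes per group gives the 6430720 inodes reported in the dump. Likewise 8192 inodes x 256 bytes / 4096 bytes per block gives the 512 inode blocks per group. On a real partition the primary superblock would be read from byte offset 1024.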
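Group descriptor table (GDT)
The GDT records where each block group starts and where the group's superblock copy, group descriptors, inode table, inode bitmap, block bitmap, and data blocks are located. An example follows: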
Group 0: (Blocks 1-8192) [ITABLE_ZEROED]
Checksum 0x2777, unused inodes 2004
Primary superblock at 1, Group descriptors at 2-4
Reserved GDT blocks at 5-260
Block bitmap at 261 (+260), Inode bitmap at 277 (+276)
Inode table at 293-544 (+292)
3854 free blocks, 2004 free inodes, 2 directories, 2004 unused inodes
Free blocks: 4339-8192
Free inodes: 13-2016
Group 1: (Blocks 8193-16384) [INODE_UNINIT, ITABLE_ZEROED]
Checksum 0x2c84, unused inodes 2016
Backup superblock at 8193, Group descriptors at 8194-8196
Reserved GDT blocks at 8197-8452
Block bitmap at 262 (bg #0 + 261), Inode bitmap at 278 (bg #0 + 277)
Inode table at 545-796 (bg #0 + 544)
7932 free blocks, 2016 free inodes, 0 directories, 2016 unused inodes
Free blocks: 8453-16384
Free inodes: 2017-4032
Group 2: (Blocks 16385-24576) [INODE_UNINIT, ITABLE_ZEROED]
Checksum 0x2336, unused inodes 2016
Block bitmap at 263 (bg #0 + 262), Inode bitmap at 279 (bg #0 + 278)
Inode table at 797-1048 (bg #0 + 796)
8192 free blocks, 2016 free inodes, 0 directories, 2016 unused inodes
Free blocks: 16385-24576
Free inodes: 4033-6048
Group 3: (Blocks 24577-32768) [INODE_UNINIT, ITABLE_ZEROED]
Checksum 0x5ca1, unused inodes 2016
Backup superblock at 24577, Group descriptors at 24578-24580
Reserved GDT blocks at 24581-24836
Block bitmap at 264 (bg #0 + 263), Inode bitmap at 280 (bg #0 + 279)
Inode table at 1049-1300 (bg #0 + 1048)
7932 free blocks, 2016 free inodes, 0 directories, 2016 unused inodes
Free blocks: 24837-32768
Free inodes: 6049-8064
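In the dump above, groups 0, 1 and 3 carry a superblock copy while group 2 does not. That matches the sparse_super rule, under which copies are kept only in groups 0 and 1 and in groups whose number is a power of 3, 5 or 7. The sketch below encodes that rule and also maps a block number back to its group. The helper names are hypothetical, and the block numbers are the ones from this example, which uses 1 KiB blocks with the first data block at 1.

```python
def is_power_of(n: int, base: int) -> bool:
    """True if n == base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1

def has_superblock_copy(group: int) -> bool:
    """sparse_super rule: groups 0, 1 and powers of 3, 5, 7 hold a copy."""
    if group in (0, 1):
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))

def group_of_block(block: int, first_data_block: int, blocks_per_group: int) -> int:
    return (block - first_data_block) // blocks_per_group

with_copy = [g for g in range(10) if has_superblock_copy(g)]
print(with_copy)
print(group_of_block(24577, first_data_block=1, blocks_per_group=8192))
```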
Block bitmap
A usage map for data blocks: each bit corresponds to one data block.
Inode bitmap
A usage map for inodes: each bit corresponds to one inode.
Inode table
An inode is a file's index entry. Each file has exactly one inode, which records the file's basic attributes and the data blocks associated with it. The inode data structure is shown below.
struct ext4_inode {
    __le16 i_mode; /* File mode */
    __le16 i_uid; /* Low 16 bits of Owner Uid */
    __le32 i_size_lo; /* Size in bytes */
    __le32 i_atime; /* Access time */
    __le32 i_ctime; /* Inode Change time */
    __le32 i_mtime; /* Modification time */
    __le32 i_dtime; /* Deletion Time */
    __le16 i_gid; /* Low 16 bits of Group Id */
    __le16 i_links_count; /* Links count */
    __le32 i_blocks_lo; /* Blocks count */
    __le32 i_flags; /* File flags */
    union {
        struct {
            __le32 l_i_version;
        } linux1;
        struct {
            __u32 h_i_translator;
        } hurd1;
        struct {
            __u32 m_i_reserved1;
        } masix1;
    } osd1; /* OS dependent 1 */
    __le32 i_block[EXT4_N_BLOCKS]; /* Pointers to blocks */
    __le32 i_generation; /* File version (for NFS) */
    __le32 i_file_acl_lo; /* File ACL */
    __le32 i_size_high;
    __le32 i_obso_faddr; /* Obsoleted fragment address */
    union {
        struct {
            __le16 l_i_blocks_high; /* were l_i_reserved1 */
            __le16 l_i_file_acl_high;
            __le16 l_i_uid_high; /* these 2 fields */
            __le16 l_i_gid_high; /* were reserved2[0] */
            __le16 l_i_checksum_lo; /* crc32c(uuid+inum+inode) LE */
            __le16 l_i_reserved;
        } linux2;
        struct {
            __le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
            __u16 h_i_mode_high;
            __u16 h_i_uid_high;
            __u16 h_i_gid_high;
            __u32 h_i_author;
        } hurd2;
        struct {
            __le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
            __le16 m_i_file_acl_high;
            __u32 m_i_reserved2[2];
        } masix2;
    } osd2; /* OS dependent 2 */
    __le16 i_extra_isize;
    __le16 i_checksum_hi; /* crc32c(uuid+inum+inode) BE */
    __le32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */
    __le32 i_mtime_extra; /* extra Modification time (nsec << 2 | epoch) */
    __le32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */
    __le32 i_crtime; /* File Creation time */
    __le32 i_crtime_extra; /* extra File Creation time (nsec << 2 | epoch) */
    __le32 i_version_hi; /* high 32 bits for 64-bit version */
    __le32 i_projid; /* Project ID */
};
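To see how the struct maps onto raw bytes, the sketch below decodes the first 112 bytes of an inode (up to and including i_size_high) with Python's struct module. It assumes the 15-entry i_block array, i.e. EXT4_N_BLOCKS = 15, and runs on a synthetic inode built in memory. The constant 0x80000 is EXT4_EXTENTS_FL; the helper names are hypothetical.

```python
import stat
import struct

# i_mode, i_uid, i_size_lo, i_atime, i_ctime, i_mtime, i_dtime, i_gid,
# i_links_count, i_blocks_lo, i_flags, osd1, i_block[15] (60 bytes),
# i_generation, i_file_acl_lo, i_size_high
INODE_HEAD = struct.Struct("<HHIIIIIHHIII60sIII")
EXT4_EXTENTS_FL = 0x80000

def parse_inode(raw: bytes) -> dict:
    (mode, uid, size_lo, _atime, _ctime, _mtime, _dtime, gid, links,
     _blocks_lo, flags, _osd1, _i_block, _gen, _acl_lo,
     size_high) = INODE_HEAD.unpack_from(raw)
    if stat.S_ISDIR(mode):
        kind = "dir"
    elif stat.S_ISREG(mode):
        kind = "file"
    else:
        kind = "other"
    return {"type": kind,
            "perm": oct(mode & 0o7777),
            "uid": uid, "gid": gid,               # low 16 bits only
            "links": links,
            "size": size_lo | (size_high << 32),
            "uses_extents": bool(flags & EXT4_EXTENTS_FL)}

# Synthetic 256-byte inode: regular file, mode 0644, 1234 bytes, one link.
raw = INODE_HEAD.pack(0o100644, 0, 1234, 0, 0, 0, 0, 0, 1,
                      8, EXT4_EXTENTS_FL, 0, b"", 0, 0, 0).ljust(256, b"\0")
parsed = parse_inode(raw)
print(INODE_HEAD.size, parsed)
```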
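Data block
Blocks that hold file data. Every block has the same fixed size and a unique number.
Basic principles of file system operation
Difference between regular files and directories
Regular files and directories are both files: each has a unique inode that serves as its index, and data blocks that hold its content. The difference lies in the content. The data blocks of a directory hold a table that maps the name of every file in the directory to its inode number.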
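The name-to-inode table in a directory block is a sequence of variable-length records: a 4-byte inode number, a 2-byte record length, a 1-byte name length, a 1-byte file type, then the name. The sketch below builds and parses such a block for the classic linear directory format. It ignores htree indexes and the checksum tail added by metadata_csum. The inode numbers 131 and 140 are invented for illustration, and the helper names are hypothetical.

```python
import struct

def pack_entry(inode: int, name: bytes, file_type: int, rec_len: int = 0) -> bytes:
    """Build one directory record; rec_len defaults to the minimum, 4-byte aligned."""
    minimum = (8 + len(name) + 3) & ~3
    rec_len = rec_len or minimum
    header = struct.pack("<IHBB", inode, rec_len, len(name), file_type)
    return header + name.ljust(rec_len - 8, b"\0")

def parse_dir_block(block: bytes) -> list:
    """Return (name, inode) pairs from a linear directory block."""
    entries, pos = [], 0
    while pos + 8 <= len(block):
        inode, rec_len, name_len, _ftype = struct.unpack_from("<IHBB", block, pos)
        if rec_len < 8:
            break                        # corrupt record, stop scanning
        if inode != 0:                   # inode 0 marks an unused record
            name = block[pos + 8:pos + 8 + name_len].decode()
            entries.append((name, inode))
        pos += rec_len
    return entries

# A 1024-byte block for a directory with inode 131 whose parent is the root (2).
# File type 2 = directory, 1 = regular file. The last record fills the block.
block = (pack_entry(131, b".", 2) + pack_entry(2, b"..", 2)
         + pack_entry(140, b"fstab", 1, rec_len=1024 - 24))
entries = parse_dir_block(block)
print(entries)
```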
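File creation process (the exact order may differ)
a. Allocate inode x and block y to the new file, and make x point to y.
b. Update the inode bitmap and the block bitmap.
c. In the data block of the new file's parent directory, add a record to the name-to-inode table.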
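The three steps above can be sketched with a toy in-memory model. This is not how ext4 is implemented: the bitmaps are plain Python lists, each file gets exactly one block, and `ToyFS` is a hypothetical class that only shows which structures each step touches.

```python
class ToyFS:
    def __init__(self, n_inodes: int = 16, n_blocks: int = 16):
        self.inode_bitmap = [False] * n_inodes
        self.block_bitmap = [False] * n_blocks
        self.inodes = {}     # inode number -> {"type": ..., "block": ...}
        self.blocks = {}     # block number -> file bytes or directory table
        self.root = self._alloc("dir", {})

    def _alloc(self, kind: str, content) -> int:
        ino = self.inode_bitmap.index(False)     # first free inode
        blk = self.block_bitmap.index(False)     # first free block
        self.inodes[ino] = {"type": kind, "block": blk}   # step a: x points to y
        self.blocks[blk] = content
        self.inode_bitmap[ino] = True            # step b: update both bitmaps
        self.block_bitmap[blk] = True
        return ino

    def create(self, parent_ino: int, name: str, data: bytes = b"") -> int:
        ino = self._alloc("file", data)
        parent_block = self.inodes[parent_ino]["block"]
        self.blocks[parent_block][name] = ino    # step c: add the directory record
        return ino

fs = ToyFS()
ino = fs.create(fs.root, "fstab", b"# example\n")
root_table = fs.blocks[fs.inodes[fs.root]["block"]]
print(ino, root_table, fs.inode_bitmap[:3], fs.block_bitmap[:3])
```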
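File read process
Reading a file starts at the root directory and walks down one level at a time. Taking /etc/fstab as an example:
a. Read the inode of / (the root directory's inode number is fixed, normally 2), find the blocks of /, read them, and look up the inode number of etc.
b. Read the inode of etc, find the blocks of etc/, and read them to get the inode number of fstab.
c. Read the inode of fstab to find the blocks of fstab.
d. Read the blocks of fstab to get its content.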
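The lookup of /etc/fstab described above can be traced with two small tables. Everything in this sketch is hypothetical apart from the root inode number 2: the inode numbers 131 and 140, the block numbers, and the file content are made up, and each inode is assumed to own a single block.

```python
ROOT_INO = 2

# inode number -> block number (one block per inode in this toy model)
inode_table = {2: 100, 131: 200, 140: 300}

# block number -> directory table (dict) or file content (bytes)
blocks = {
    100: {"etc": 131},                        # data block of /
    200: {"fstab": 140},                      # data block of /etc
    300: b"/dev/sda1 / ext4 defaults 0 1\n",  # data block of /etc/fstab
}

def read_file(path: str):
    """Walk the path from the root; return the content and the inodes visited."""
    ino = ROOT_INO
    trace = [ino]
    for name in (part for part in path.split("/") if part):
        directory = blocks[inode_table[ino]]  # inode -> directory block
        ino = directory[name]                 # directory block -> next inode
        trace.append(ino)
    return blocks[inode_table[ino]], trace

content, trace = read_file("/etc/fstab")
print(trace, content)
```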
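2. Introduction to the ext3/4 JBD journaling system
JBD (the journaling block device layer) is the feature ext3 added on top of ext2.
2.1 What the JBD journal is for
Suppose the system is running on an ext2 partition and is reading and writing the disk. Power is lost or the system crashes, and you have to force a reboot while data in the in-memory buffers has not yet been written to disk. After the reboot you may find that some data is lost, or even that file system metadata is lost, leaving the file system inconsistent and the partition unmountable. Running fsck may repair the partition, but it takes a long time.
JBD is designed for this situation. It cannot make crashes less likely; the problems it addresses are:
a. After an abnormal restart, keep the file system as complete and consistent as possible. Consistency here means that the metadata (superblock, group descriptor table, block bitmap, inode bitmap, inode table) and the data blocks agree with each other: for example, the block bitmap matches the blocks that are actually in use, and the mapping from the inode table to the data blocks is accurate.
b. When a damaged file system is repairable, shorten the repair time, mainly by repairing from the journal records.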
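2.2 How JBD works
a. Define atomic operations
Any system call that modifies the file system is usually broken down into a series of low-level operations on on-disk data structures. An atomic operation is a single low-level operation, or a group of them, that is not divided any further: it either completes entirely or does not happen at all, with no partially completed state. Allocating a disk block to a file is one example. It may require modifying one inode block, one block bitmap, up to three indirect blocks, the group descriptor block, and the superblock, at most 7 disk blocks in total. Treating the allocation as an atomic operation means that the changes to those 7 blocks either all succeed or all fail; there is no third outcome.
b. Group a series of atomic operations into a transaction
A journaling file system could treat every atomic operation as its own transaction, but that is inefficient. Instead, several atomic operations are grouped into one transaction, and the on-disk journal is managed in units of transactions, which makes reading and writing the journal more efficient.
c. Set aside disk space to store the transaction journal
The transactions built from atomic operations are written to the journal area. This log is the history of operations on disk data, and replaying it makes it possible to restore the data.
d. Track transaction completion through transaction states
A transaction goes through the following states:
Running: the transaction is in memory and can still accept new atomic operations. Only one transaction per journal can be in the running state at a time.
Locked: the transaction no longer accepts new atomic operations, but its existing atomic operations have not all finished. Once they have, the transaction moves to the next state.
Flush: all atomic operations in the transaction have finished, and the transaction is being written to the journal.
Commit: the transaction has been written to the journal. It then writes a commit block to indicate that its log is complete.
Finished: after the transaction has been written to the journal, it stays there until all of its blocks have been written to their actual locations on disk.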
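Points a to d above can be illustrated with a toy write-ahead log. This is a sketch of the general idea only, not of JBD's actual code: `ToyJournal` and its record tuples are hypothetical, and a crash is simulated by skipping the commit record. During recovery, only transactions whose commit record reached the log are applied, which is what makes each transaction all-or-nothing.

```python
class ToyJournal:
    def __init__(self):
        self.disk = {}    # block number -> data at its real on-disk location
        self.log = []     # journal records, in the order they were written
        self.seq = 0      # transaction sequence number

    def write_transaction(self, updates: dict, crash_before_commit: bool = False):
        """Log one transaction: a descriptor, the data blocks, then a commit record."""
        self.seq += 1
        self.log.append(("descriptor", self.seq, sorted(updates)))
        for blk in sorted(updates):
            self.log.append(("data", self.seq, (blk, updates[blk])))
        if crash_before_commit:
            return                       # simulate power loss before the commit block
        self.log.append(("commit", self.seq, None))

    def replay(self):
        """Recovery: apply data records of committed transactions, in log order."""
        committed = {seq for kind, seq, _ in self.log if kind == "commit"}
        for kind, seq, payload in self.log:
            if kind == "data" and seq in committed:
                blk, data = payload
                self.disk[blk] = data

journal = ToyJournal()
journal.write_transaction({5: "inode v1", 9: "bitmap v1"})
journal.write_transaction({5: "inode v2", 12: "data"}, crash_before_commit=True)
journal.replay()
print(journal.disk)
```

The second transaction leaves no trace on disk: block 5 keeps the value from the first transaction and block 12 is never written.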
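2.3 JBD transaction journal structure
As the figure above shows, a JBD journal consists of a superblock, descriptor blocks, data blocks, commit blocks, and revoke blocks.
a. Superblock (JFS_SUPERBLOCK): the journal superblock plays a role similar to the file system superblock; both organize and manage a region of disk space.
b. Descriptor block (JFS_DESCRIPTOR_BLOCK): a transaction starts with a descriptor block and ends with a commit block. The descriptor block describes the log blocks in the transaction, recording which disk block each logged block belongs to.
c. Data block: holds the logged contents of a disk block.
d. Commit block (JFS_COMMIT_BLOCK): marks the completion of a transaction.
e. Revoke block (JFS_REVOKE_BLOCK): when a transaction frees disk blocks, a revoke block is written to the journal to indicate that earlier logged operations on those disk blocks can be ignored.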
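Each of these metadata blocks starts with the same 12-byte header: a magic number, a block type, and a transaction sequence number, all stored big-endian. The sketch below decodes that header. The magic value 0xC03B3998 and the type codes 1 to 5 come from the JBD on-disk format; the function name is hypothetical and the input is a synthetic block.

```python
import struct

JBD_MAGIC = 0xC03B3998
BLOCK_TYPES = {1: "descriptor", 2: "commit", 3: "superblock v1",
               4: "superblock v2", 5: "revoke"}

def parse_journal_header(block: bytes):
    """Return (block type, sequence), or None for a block without the magic."""
    magic, blocktype, sequence = struct.unpack_from(">III", block)
    if magic != JBD_MAGIC:
        return None          # for example a logged data block
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence

commit_block = struct.pack(">III", JBD_MAGIC, 2, 51).ljust(1024, b"\0")
print(parse_journal_header(commit_block))
print(parse_journal_header(bytes(1024)))
```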
Viewing the JBD journal with debugfs
debugfs: logdump -a
Journal starts at block 1, transaction 51
Found expected sequence 51, type 1 (descriptor block) at block 1
Dumping descriptor block, sequence 51, at block 1:
FS block 293 logged at journal block 2 (flags 0x0)
FS block 277 logged at journal block 3 (flags 0x2)
FS block 2 logged at journal block 4 (flags 0x2)
FS block 294 logged at journal block 5 (flags 0x2)
FS block 4325 logged at journal block 6 (flags 0x2)
FS block 1 logged at journal block 7 (flags 0xa)
Found expected sequence 51, type 2 (commit block) at block 8
Found expected sequence 52, type 1 (descriptor block) at block 9
Dumping descriptor block, sequence 52, at block 9:
FS block 294 logged at journal block 10 (flags 0x8)
Found expected sequence 52, type 2 (commit block) at block 11
Found expected sequence 53, type 1 (descriptor block) at block 12
Dumping descriptor block, sequence 53, at block 12:
FS block 277 logged at journal block 13 (flags 0x0)
FS block 2 logged at journal block 14 (flags 0x2)
FS block 294 logged at journal block 15 (flags 0x2)
FS block 293 logged at journal block 16 (flags 0x2)
FS block 4325 logged at journal block 17 (flags 0x2)
FS block 262 logged at journal block 18 (flags 0xa)
Found expected sequence 53, type 2 (commit block) at block 19
Found expected sequence 54, type 1 (descriptor block) at block 20
Dumping descriptor block, sequence 54, at block 20:
FS block 294 logged at journal block 21 (flags 0x0)
FS block 4325 logged at journal block 22 (flags 0x2)
FS block 293 logged at journal block 23 (flags 0x2)
FS block 1 logged at journal block 24 (flags 0x2)
FS block 2 logged at journal block 25 (flags 0x2)
FS block 277 logged at journal block 26 (flags 0x2)
FS block 131105 logged at journal block 27 (flags 0xa)
Found expected sequence 54, type 2 (commit block) at block 28
Found expected sequence 55, type 1 (descriptor block) at block 29
Dumping descriptor block, sequence 55, at block 29:
FS block 135137 logged at journal block 30 (flags 0x0)
FS block 131105 logged at journal block 31 (flags 0x2)
FS block 1 logged at journal block 32 (flags 0x2)
FS block 131075 logged at journal block 33 (flags 0x2)
FS block 3 logged at journal block 34 (flags 0x2)
FS block 131089 logged at journal block 35 (flags 0xa)
Found expected sequence 55, type 2 (commit block) at block 36
Found expected sequence 56, type 1 (descriptor block) at block 37
Dumping descriptor block, sequence 56, at block 37:
FS block 131105 logged at journal block 38 (flags 0x0)
FS block 293 logged at journal block 39 (flags 0x2)
FS block 131089 logged at journal block 40 (flags 0x2)
FS block 3 logged at journal block 41 (flags 0x2)
FS block 135138 logged at journal block 42 (flags 0x2)
FS block 1 logged at journal block 43 (flags 0xa)
Found expected sequence 56, type 2 (commit block) at block 44
Found expected sequence 57, type 1 (descriptor block) at block 45
Dumping descriptor block, sequence 57, at block 45:
FS block 131105 logged at journal block 46 (flags 0x0)
FS block 131089 logged at journal block 47 (flags 0x2)
FS block 3 logged at journal block 48 (flags 0x2)
FS block 135138 logged at journal block 49 (flags 0xa)
Found expected sequence 57, type 2 (commit block) at block 50
Found expected sequence 58, type 1 (descriptor block) at block 51
Dumping descriptor block, sequence 58, at block 51:
FS block 262 logged at journal block 52 (flags 0x0)
FS block 2 logged at journal block 53 (flags 0x2)
FS block 131105 logged at journal block 54 (flags 0xa)
Found expected sequence 58, type 2 (commit block) at block 55
Found expected sequence 59, type 1 (descriptor block) at block 56
Dumping descriptor block, sequence 59, at block 56:
FS block 131105 logged at journal block 57 (flags 0x0)
FS block 135138 logged at journal block 58 (flags 0x2)
FS block 1 logged at journal block 59 (flags 0x2)
FS block 3 logged at journal block 60 (flags 0x2)
FS block 131089 logged at journal block 61 (flags 0xa)
Found expected sequence 59, type 2 (commit block) at block 62
Found sequence 36 (not 60) at block 63: end of journal.
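The flags column in this dump is a bit mask: 0x1 ESCAPE, 0x2 SAME_UUID, 0x4 DELETED, 0x8 LAST_TAG. So 0xa marks the last tag of a descriptor block that reuses the previous UUID. The sketch below groups the logged blocks by transaction and decodes the flags. It parses a few lines copied from the output above; the regular expressions are assumptions about this particular text format, not a stable interface of debugfs.

```python
import re

FLAGS = {0x1: "ESCAPE", 0x2: "SAME_UUID", 0x4: "DELETED", 0x8: "LAST_TAG"}
HEAD = re.compile(r"Dumping descriptor block, sequence (\d+)")
TAG = re.compile(r"FS block (\d+) logged at journal block (\d+) \(flags (0x[0-9a-f]+)\)")

def summarize(text: str) -> dict:
    """Map transaction sequence -> list of (FS block, [flag names])."""
    transactions, current = {}, None
    for line in text.splitlines():
        head = HEAD.search(line)
        if head:
            current = int(head.group(1))
            transactions[current] = []
            continue
        tag = TAG.search(line)
        if tag and current is not None:
            flags = int(tag.group(3), 16)
            names = [name for bit, name in FLAGS.items() if flags & bit]
            transactions[current].append((int(tag.group(1)), names))
    return transactions

sample = """\
Dumping descriptor block, sequence 52, at block 9:
FS block 294 logged at journal block 10 (flags 0x8)
Found expected sequence 52, type 2 (commit block) at block 11
Dumping descriptor block, sequence 58, at block 51:
FS block 262 logged at journal block 52 (flags 0x0)
FS block 2 logged at journal block 53 (flags 0x2)
FS block 131105 logged at journal block 54 (flags 0xa)
"""
summary = summarize(sample)
print(summary)
```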
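3. Debugging tools for ext2/3/4 file systems
mke2fs: creates an ext file system
dumpe2fs: displays the superblock and group descriptor table of a file system
tune2fs: adjusts file system parameters
e2fsck: checks and repairs a file system
debugfs: a powerful file system debugger; it can also dump the JBD journal
extundelete: recovers deleted files using the JBD journal
badblocks: scans a device for bad blocks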
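4. Thoughts on recovering file systems and data
a. When data has been deleted by mistake, unmount the disk immediately so that new data does not overwrite the old. Analyzing the JBD journal may then make it possible to recover the data; this is the principle extundelete relies on.
b. If the file system is damaged, try repairing it with fsck or e2fsck (for ext2/3/4, fsck is a front end that runs e2fsck). A full check of the whole file system is slow; when the journal is intact, e2fsck replays it first, which is much faster.
c. Is there any chance of recovering data when the file system structure itself has been destroyed?
This is the most painful and most complex case. From a purely theoretical standpoint it should be possible. The main questions to consider are:
i: The directory structure of the file system lives in the inode table and the data blocks. If these two areas can be located, most of the data should be recoverable.
ii: Locating the inode table: inode blocks are normally contiguous and of fixed size, and every inode has the same fixed layout. My guess is that they can be recognized by signature matching; once a contiguous run of inode blocks is found, the individual inodes can be located.
iii: In the classic layout (without flex_bg), the data blocks start where the inode table ends.
iv: The data block numbers must also be determined. If the numbering is off, the content read back is wrong and the error cascades, so it has to be exact.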
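The signature matching guessed at in point ii could look roughly like the sketch below. It is a heuristic built on assumptions, not a tested recovery method: it treats a 256-byte slot as a plausible inode when the file type bits are valid, the link count is non-zero, the deletion time is zero, and the timestamps lie in the past. The thresholds, the helper names, and the synthetic test image are all made up for illustration.

```python
import stat
import struct

VALID_TYPES = (stat.S_IFREG, stat.S_IFDIR, stat.S_IFLNK)

def looks_like_inode(raw: bytes, now: int) -> bool:
    """Heuristic check on the first 28 bytes of a candidate inode slot."""
    (mode, _uid, _size, atime, ctime, mtime, dtime,
     _gid, links) = struct.unpack_from("<HHIIIIIHH", raw)
    return (stat.S_IFMT(mode) in VALID_TYPES
            and 0 < links < 65000
            and dtime == 0
            and all(0 < t <= now for t in (atime, ctime, mtime)))

def scan(image: bytes, inode_size: int, now: int) -> list:
    """Return the aligned offsets whose content passes the heuristic."""
    return [off for off in range(0, len(image) - inode_size + 1, inode_size)
            if looks_like_inode(image[off:off + inode_size], now)]

NOW = 1_700_000_000       # fixed reference time for the demo
STAMP = 1_600_000_000     # plausible timestamp in the past

def fake_inode(mode: int) -> bytes:
    head = struct.pack("<HHIIIIIHH", mode, 0, 4096, STAMP, STAMP, STAMP, 0, 0, 1)
    return head.ljust(256, b"\0")

# Two empty slots followed by one directory inode and two file inodes.
image = bytes(512) + fake_inode(0o040755) + fake_inode(0o100644) + fake_inode(0o100644)
hits = scan(image, 256, NOW)
print(hits)
```

A run of consecutive hits at inode-size spacing would be a candidate inode table. On a real disk this check would produce false positives and would need further validation, for example against the inode checksum when metadata_csum is enabled.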
5. References
Introduction to the Ext3 file system and JDB
Journal block device (jbd) source code analysis
6. Open issues
Techniques for recovering the file system and its data after the file system structure has been destroyed