Linux Kernel Architecture in Depth: The Virtual File System (Introduction)
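In Linux, "everything is a file". Linux supports many file systems, such as ext2/3/4 and XFS. To support all of these types cleanly, Linux abstracts a virtual file system (VFS) layer that adapts flexibly to each concrete file system implementation. Its basic architecture is as follows:
As the architecture shows, all VFS operations have to run in kernel mode. Access to system storage and external devices is extremely complex, so these operations cannot be handed to user programs; otherwise the system would be very unstable.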
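File system types
- Disk-based file systems
The classic way of storing files on non-volatile media. These are the file systems most of us know, such as ext2/3/4 and FAT.
- Virtual file systems
Generated inside the kernel, they give user applications a way to communicate with the kernel. The best known is the proc file system. It does not store information on any hardware device; all of its information lives in memory, and per-process entries disappear when the process exits.
- Network file systems
These give access to data on other computers. A system call on the local machine still enters the kernel, but the file system forwards the request over the network and the remote machine carries it out. Clients such as NFS are implemented inside the kernel, while some other network file systems (sshfs, for example) are mounted through FUSE.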
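The common file model
The VFS defines a set of methods and abstractions, along with a uniform view of the objects (files) in a file system. Concrete implementations differ sharply underneath. What the VFS offers is a general superset: many of its operations are simply not needed by some file systems, for example the writepage operation in proc.
When handling files, kernel space and user space use different objects. In user space a file is identified by a "file descriptor" (FD), an integer that is valid only inside one process; two different processes can use the same FD number for unrelated files. In kernel space the structure behind an FD is struct file. One of its main members points to an address_space, the structure that actually exchanges data with the underlying device. File metadata is kept in a separate structure, the inode, which stores the link count, access times, version, backing device, owning superblock and so on, but not the file name. The file name lives in struct dentry, because names exist to index and manage inodes, and that is exactly the job of a dentry. Dentries in turn are reached through the super_block, whose s_root is the root dentry.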
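Below we look at each structure and how they relate, and then follow what happens in Linux between opening a file and writing to it.
VFS structures
inode
An inode manages a file's metadata, including permissions, access information, link information and storage device information. The corresponding operations mainly cover links, permissions and similar tasks. Its data structure is as follows:
For related background, see the inode reference.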
/*
* Keep mostly read-only and often accessed (especially for
* the RCU path lookup and 'stat' data) fields at the beginning
* of the 'struct inode'
*/
struct inode {
...
const struct inode_operations *i_op; // inode operations, specific to the file system
struct super_block *i_sb; // owning superblock
struct address_space *i_mapping; // address space, the part that actually talks to the device
...
/* Stat data, not accessed from path walking */
unsigned long i_ino; // inode number
/*
* Filesystems may only read i_nlink directly. They shall use the
* following functions for modification:
*
* (set|clear|inc|drop)_nlink
* inode_(inc|dec)_link_count
*/
union {
const unsigned int i_nlink;
unsigned int __i_nlink;
};
dev_t i_rdev;
loff_t i_size;
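struct timespec64 i_atime; // time of last access
struct timespec64 i_mtime; // time of last content modification
struct timespec64 i_ctime; // time of last inode (metadata) change, not the creation time
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short i_bytes; // remaining bytes not covered by the full units in i_blocks
u8 i_blkbits; // log2 of the block size
u8 i_write_hint;
blkcnt_t i_blocks; // space used by the file, in 512-byte units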
#ifdef __NEED_I_SIZE_ORDERED
seqcount_t i_size_seqcount;
#endif
/* Misc */
unsigned long i_state;
struct rw_semaphore i_rwsem;
unsigned long dirtied_when; /* jiffies of first dirtying */
unsigned long dirtied_time_when;
struct hlist_node i_hash;
struct list_head i_io_list; /* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *i_wb; /* the associated cgroup wb */
/* foreign inode detection, see wbc_detach_inode() */
int i_wb_frn_winner;
u16 i_wb_frn_avg_time;
u16 i_wb_frn_history;
#endif
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
struct list_head i_wb_list; /* backing dev writeback list */
union {
struct hlist_head i_dentry; // one inode can be used by several dentries (hard links)
struct rcu_head i_rcu;
};
atomic64_t i_version;
atomic_t i_count;
atomic_t i_dio_count;
atomic_t i_writecount;
#ifdef CONFIG_IMA
atomic_t i_readcount; /* struct files open RO */
#endif
const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
struct file_lock_context *i_flctx;
struct address_space i_data;
struct list_head i_devices;
union {
struct pipe_inode_info *i_pipe; // pipe
struct block_device *i_bdev; // block device
struct cdev *i_cdev; // character device
char *i_link; // cached target of a symbolic link
unsigned i_dir_seq; // directory sequence counter, used by parallel lookups
};
__u32 i_generation;
#ifdef CONFIG_FSNOTIFY
__u32 i_fsnotify_mask; /* all events this inode cares about */
struct fsnotify_mark_connector __rcu *i_fsnotify_marks;
#endif
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
struct fscrypt_info *i_crypt_info;
#endif
void *i_private; /* fs or device private pointer */
} __randomize_layout;
struct inode_operations {
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); // look up the name held in the dentry inside the directory inode
const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *); // return the target of a symbolic link so the VFS can follow it
int (*permission) (struct inode *, int);
struct posix_acl * (*get_acl)(struct inode *, int);
int (*readlink) (struct dentry *, char __user *,int);
int (*create) (struct inode *,struct dentry *, umode_t, bool);
int (*link) (struct dentry *,struct inode *,struct dentry *); // create a hard link
int (*unlink) (struct inode *,struct dentry *); // remove a hard link
int (*symlink) (struct inode *,struct dentry *,const char *); // create a symbolic link
int (*mkdir) (struct inode *,struct dentry *,umode_t); // create a directory with the given mode and the name in the dentry, and set up its inode
int (*rmdir) (struct inode *,struct dentry *); // remove a directory
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); // create a device file, named pipe or socket
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int); // move the file named by old_dentry in old_dir to new_dir under the name given by new_dentry
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
int (*update_time)(struct inode *, struct timespec64 *, int);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode);
int (*tmpfile) (struct inode *, struct dentry *, umode_t);
int (*set_acl)(struct inode *, struct posix_acl *, int);
} ____cacheline_aligned;
dentry
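A dentry mainly manages a file name and links a directory entry to all of its child entries.
dentry state
A dentry can be in one of three states: used, unused or negative.
used: associated with a valid inode and currently in use.
unused: associated with a valid inode, but the reference count is 0 and the dentry has not actually been freed yet.
negative: no inode is associated, either because the file was deleted or because no such file exists on the storage device.
dentry cache
Fetching the dentry for a path from disk on every lookup would be expensive, so an LRU cache is provided to speed lookups up. For example, to find the dentry for /usr/bin/java we first need the dentry for /, then usr, then bin, and so on until the end of the path. The dentries touched along the way are cached so that the next lookup is faster. The lookup procedure itself is analyzed in detail in the later discussion of lookup.
For more details, see the dentry reference.
Its data structure is as follows: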
struct dentry {
/* RCU lookup touched fields */
unsigned int d_flags; /* protected by d_lock */
seqcount_t d_seq; /* per dentry seqlock */
struct hlist_bl_node d_hash; /* lookup hash list */
struct dentry *d_parent; /* parent directory */
struct qstr d_name;
struct inode *d_inode; /* Where the name belongs to - NULL is
* negative */
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
/* Ref lookup also touches following */
struct lockref d_lockref; /* per-dentry lock and refcount */
const struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */
union {
struct list_head d_lru; /* LRU list */
wait_queue_head_t *d_wait; /* in-lookup ones only */
};
struct list_head d_child; /* child of parent list */
struct list_head d_subdirs; /* our children */
/*
* d_alias and d_rcu can share memory
*/
union {
struct hlist_node d_alias; /* inode alias list */
struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
struct rcu_head d_rcu;
} d_u;
} __randomize_layout;
struct dentry_operations {
int (*d_revalidate)(struct dentry *, unsigned int); // check whether a cached dentry is still valid
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *); // compute the hash of the dentry name
int (*d_compare)(const struct dentry *, // compare file names
unsigned int, const char *, const struct qstr *);
int (*d_delete)(const struct dentry *);
// called when the last reference is dropped; decides whether the now unused dentry stays in the cache
int (*d_init)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_prune)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *); // called when a dentry loses its inode, to release that inode
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *);
} ____cacheline_aligned;
super_block
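The superblock holds the parameters of the actual file system behind a mount point, including the block size, the maximum file size the file system can handle, the file system type and the backing storage device. (Note: in the overall architecture figure shown earlier, the superblock has a files member that points to all open files, but I could not find matching code in the structure below. Earlier kernels used that member in the umount logic to make sure every file had been closed. I do not yet know how newer kernels handle this and will add it once I find out.)
Superblock management mostly happens in the file system mount logic, which is analyzed in detail later together with the mount-related modules. The main job of the superblock is to manage inodes.
For details, see the superblock reference.
Its data structure is as follows: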
struct super_block {
struct list_head s_list; /* Keep this first */
dev_t s_dev; /* search index; _not_ kdev_t */
unsigned char s_blocksize_bits; // log2(block size in bytes)
unsigned long s_blocksize; // block size in bytes
loff_t s_maxbytes; /* Max file size */
struct file_system_type *s_type; // file system type
const struct super_operations *s_op; // superblock operations
const struct dquot_operations *dq_op;
const struct quotactl_ops *s_qcop;
const struct export_operations *s_export_op;
unsigned long s_flags;
unsigned long s_iflags; /* internal SB_I_* flags */
unsigned long s_magic;
struct dentry *s_root; // root dentry of this file system; path lookups inside the mount start here
struct rw_semaphore s_umount;
int s_count;
atomic_t s_active;
#ifdef CONFIG_SECURITY
void *s_security;
#endif
const struct xattr_handler **s_xattr;
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
const struct fscrypt_operations *s_cop;
#endif
struct hlist_bl_head s_roots; /* alternate root dentries for NFS */
struct list_head s_mounts; /* list of mounts; _not_ for fs use */
struct block_device *s_bdev;
struct backing_dev_info *s_bdi;
struct mtd_info *s_mtd;
struct hlist_node s_instances;
unsigned int s_quota_types; /* Bitmask of supported quota types */
struct quota_info s_dquot; /* Diskquota specific options */
struct sb_writers s_writers;
/*
* Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
* s_fsnotify_marks together for cache efficiency. They are frequently
* accessed and rarely modified.
*/
void *s_fs_info; /* Filesystem private info */
/* Granularity of c/m/atime in ns (cannot be worse than a second) */
u32 s_time_gran;
#ifdef CONFIG_FSNOTIFY
__u32 s_fsnotify_mask;
struct fsnotify_mark_connector __rcu *s_fsnotify_marks;
#endif
char s_id[32]; /* Informational name */
uuid_t s_uuid; /* UUID */
unsigned int s_max_links;
fmode_t s_mode;
/*
* The next field is for VFS *only*. No filesystems have any business
* even looking at it. You had been warned.
*/
struct mutex s_vfs_rename_mutex; /* Kludge */
/*
* Filesystem subtype. If non-empty the filesystem type field
* in /proc/mounts will be "type.subtype"
*/
char *s_subtype;
const struct dentry_operations *s_d_op; /* default d_op for dentries */
/*
* Saved pool identifier for cleancache (-1 means none)
*/
int cleancache_poolid;
struct shrinker s_shrink; /* per-sb shrinker handle */
/* Number of inodes with nlink == 0 but still referenced */
atomic_long_t s_remove_count;
/* Pending fsnotify inode refs */
atomic_long_t s_fsnotify_inode_refs;
/* Being remounted read-only */
int s_readonly_remount;
/* AIO completions deferred from interrupt context */
struct workqueue_struct *s_dio_done_wq;
struct hlist_head s_pins;
/*
* Owning user namespace and default context in which to
* interpret filesystem uids, gids, quotas, device nodes,
* xattrs and security labels.
*/
struct user_namespace *s_user_ns;
/*
* The list_lru structure is essentially just a pointer to a table
* of per-node lru lists, each of which has its own spinlock.
* There is no need to put them into separate cachelines.
*/
struct list_lru s_dentry_lru; // LRU list of unused dentries (dentry cache)
struct list_lru s_inode_lru; // LRU list of unused inodes (inode cache)
struct rcu_head rcu;
struct work_struct destroy_work;
struct mutex s_sync_lock; /* sync serialisation lock */
/*
* Indicates how deep in a filesystem stack this SB is
*/
int s_stack_depth;
/* s_inode_list_lock protects s_inodes */
spinlock_t s_inode_list_lock ____cacheline_aligned_in_smp;
struct list_head s_inodes; /* all inodes */
spinlock_t s_inode_wblist_lock;
struct list_head s_inodes_wb; /* writeback inodes */
} __randomize_layout;
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb); // allocate an inode on this superblock
void (*destroy_inode)(struct inode *); // free an inode of this superblock
void (*dirty_inode) (struct inode *, int flags); // called when the inode is marked dirty
int (*write_inode) (struct inode *, struct writeback_control *wbc);// write the inode back
int (*drop_inode) (struct inode *); // called when the last reference to the inode is dropped
void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *); // release the superblock on unmount
int (*sync_fs)(struct super_block *sb, int wait);
int (*freeze_super) (struct super_block *);
int (*freeze_fs) (struct super_block *);
int (*thaw_super) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *); // query file system statistics
int (*remount_fs) (struct super_block *, int *, char *); // remount
void (*umount_begin) (struct super_block *); // mainly used by NFS
// callbacks for displaying mount information
int (*show_options)(struct seq_file *, struct dentry *);
int (*show_devname)(struct seq_file *, struct dentry *);
int (*show_path)(struct seq_file *, struct dentry *);
int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
struct dquot **(*get_dquots)(struct inode *);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
long (*nr_cached_objects)(struct super_block *,
struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
};
address_space
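As noted above, the superblock manages inodes; the dentry manages file names, the mapping from a name to its inode, and directories; and the inode manages file metadata. But who handles the actual exchange between a file and a storage device such as a disk? How does written data get from memory back to disk, and how is data read from disk into memory? This is the main job of address_space: it synchronizes data between memory and the backing device. How it works is covered in detail in the discussion of the page cache.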
struct address_space {
struct inode *host; // owning inode, which gives access to the file metadata
struct xarray i_pages; // cached pages of the file
gfp_t gfp_mask; // memory allocation flags for these pages
atomic_t i_mmap_writable; // number of VM_SHARED mappings
struct rb_root_cached i_mmap; // tree of private and shared mmap mappings
struct rw_semaphore i_mmap_rwsem;
unsigned long nrpages; // number of pages currently cached
unsigned long nrexceptional;
pgoff_t writeback_index; // writeback starts here
const struct address_space_operations *a_ops; // address space operations
unsigned long flags; // error bits and flags
errseq_t wb_err; // most recent writeback error
spinlock_t private_lock;
struct list_head private_list;
void *private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc); // write back one page
int (*readpage)(struct file *, struct page *); // read one page into memory
/* Write back some dirty pages from this mapping. */
int (*writepages)(struct address_space *, struct writeback_control *); // write back dirty pages
/* Set a page dirty. Return true if this dirtied it */
int (*set_page_dirty)(struct page *page); // mark a page dirty
/*
* Reads in the requested pages. Unlike ->readpage(), this is
* PURELY used for read-ahead!.
*/
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned int, unsigned int);
int (*releasepage) (struct page *, gfp_t);
void (*freepage)(struct page *);
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
/*
* migrate the contents of a page to the specified target. If
* migrate_mode is MIGRATE_ASYNC, it must not block.
*/
int (*migratepage) (struct address_space *,
struct page *, struct page *, enum migrate_mode);
bool (*isolate_page)(struct page *, isolate_mode_t);
void (*putback_page)(struct page *);
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, unsigned long,
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
int (*error_remove_page)(struct address_space *, struct page *);
/* swapfile support */
int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
sector_t *span);
void (*swap_deactivate)(struct file *file);
};
file
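As mentioned earlier, a process in user space sees an integer fd, while the corresponding kernel structure is file. Every user-space operation on an fd is translated by a system call into an operation on the file.
For more details, see the file reference.
Its data structure is as follows: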
struct task_struct {
...
/* Filesystem information: */
struct fs_struct *fs; // root & pwd path
/* Open file information: */
struct files_struct *files; // opened files
/* Namespaces: */
struct nsproxy *nsproxy;
...
};
/*
* Open file table structure
*/
struct files_struct {
/*
* read mostly part
*/
atomic_t count; // reference count of the users sharing this table
bool resize_in_progress; // set while the fd table is being resized
wait_queue_head_t resize_wait;
struct fdtable __rcu *fdt; // pointer to the current fd table
struct fdtable fdtab; // embedded initial fd table
/*
* written part on a separate cache line in SMP
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
unsigned int next_fd; // hint for the next fd to try when allocating
unsigned long close_on_exec_init[1];
unsigned long open_fds_init[1];
unsigned long full_fds_bits_init[1];
struct file __rcu * fd_array[NR_OPEN_DEFAULT]; // initial array of open files
};
struct fdtable {
unsigned int max_fds; // current capacity of this table; it grows on demand up to the ulimit -n limit
struct file __rcu **fd; /* current fd array */
unsigned long *close_on_exec;
unsigned long *open_fds; // bitmap of fds in use
unsigned long *full_fds_bits;
struct rcu_head rcu;
};
struct file {
union {
struct llist_node fu_llist;
struct rcu_head fu_rcuhead;
} f_u;
struct path f_path; // path (mount and dentry)
struct inode *f_inode; /* cached value */
const struct file_operations *f_op; // file operations
/*
* Protects f_ep_links, f_flags.
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
enum rw_hint f_write_hint;
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
struct mutex f_pos_lock;
loff_t f_pos; // current file position
struct fown_struct f_owner; // owner that receives signals such as SIGIO for this file
const struct cred *f_cred;
struct file_ra_state f_ra;
u64 f_version;
#ifdef CONFIG_SECURITY
void *f_security;
#endif
/* needed for tty driver, and maybe others */
void *private_data;
#ifdef CONFIG_EPOLL
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct list_head f_ep_links;
struct list_head f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping; // address space
errseq_t f_wb_err;
} __randomize_layout;
struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int); // move the file position
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *); // map the file into virtual memory
unsigned long mmap_supported_flags;
int (*open) (struct inode *, struct file *); // called when the file is opened
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *); // apply an flock-style lock to the file
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **, void **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
loff_t, size_t, unsigned int);
loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
} __randomize_layout;
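VFS in practice
We now have a basic picture of the VFS architecture, but it is still too vague for a deeper understanding. So let us operate on a file ourselves in pseudocode to describe how the Linux kernel reads and writes files. We take writing as the example and walk through the whole flow:
Requirement: starting from scratch, write hello world into the file /testmount/testdir/testfile1.txt
The basic system call sequence is 1. mkdir 2. creat 3. open 4. write
The function calls behind mkdir run as follows: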
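rootInode = sb->s_root->d_inode;
testDirDentry = dentry("testdir");
rootInode->i_op->mkdir(rootInode, testDirDentry, 0777); // returns an int status, not the inode
testDirInode = testDirDentry->d_inode; // the new inode is attached to the dentry
The function calls behind creat run as follows:
testFileDentry = dentry("testfile1.txt");
testDirInode->i_op->create(testDirInode, testFileDentry, 0777, true);
testFileInode = testFileDentry->d_inode;
The open system call runs as follows:
testfile->f_op = testFileInode->i_fop; // the file takes its operations from the inode
testfile->f_op->open(testFileInode, testfile);
The write system call runs as follows:
pos = 0;
testfile->f_op->write(testfile, "hello world", len, &pos);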
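Detailed flow:
- Suppose we have a block disk device /dev/sda and format it with the ext2 file system. How a block device is formatted is described in the device management chapter.
- We mount the disk at /testmount. The kernel's mount code then registers the corresponding superblock. The details of mounting are left for a later installment.
- We want to write the file /testmount/testdir/testfile1.txt, so the kernel first looks up the dentries along the full path, and whatever does not exist yet has to be created together with its inode.
3.1 Use the full path to find the superblock of the matching mount point; the most precise match here is the sb mounted at /testmount.
3.2 From that sb, take the root dentry and its inode, and read the inode's contents from disk through its address_space. For a directory, the contents are the entries of all its children, from which the children of the root dentry are built. No entry for testdir is found, so a "no such directory" error is reported. The user then creates the directory, and the new information is written back to the device behind the inode. In the same way, testfile1.txt is created under /testdir and written back to the device behind the /testdir inode.
- With the inode found, the file is opened through the open system call. The process allocates a file descriptor with the help of next_fd in files_struct, a file object is created, the address_space of the inode is passed on to the file, and the file's f_op (taken from inode->i_fop) is called as f_op->open(inode, file). Finally the user-space fd is bound to that file object.
- After the open, all further reads and writes go through that fd, which inside the kernel means going through the corresponding file structure. To write hello world, the kernel calls file->f_op->write.
In fact file->f_op->write copies the bytes into the memory pages of the address_space, and the address_space writes them back to disk at a suitable time. This is the cache we usually talk about. We can also force the data out to storage with the fsync system call. The f_op functions carry the __user annotation, which says the data comes from a user-space address; it ends up in the page memory of the kernel's address_space. This is the familiar kernel copy. Later came the well-known zero-copy sendfile, which copies data directly between two fds and works only on page data inside the kernel, without a detour through the user address space.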
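Conclusion
That covers the basic flow of the VFS. Mounting a super_block and the actual read and write operations of address_space will be added later. address_space will be covered in detail together with the page cache and the block cache, because that part is particularly complex and tied to the concrete file system implementation; it will be presented alongside the ext2 file system.
Author: 淡泊寧靜_3652
Link: http://www.reibang.com/p/a98cb5519a50
Source: Jianshu (簡書)
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.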