Inside the Linux Kernel Architecture: The Virtual File System (Overview)

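In Linux, "everything is a file". Linux supports many filesystems, such as Ext2/3/4 and XFS. To support all of these filesystem types well, Linux abstracts a virtual filesystem (VFS) layer that adapts flexibly to each concrete filesystem implementation. The basic architecture is as follows: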


[Figure: virtual filesystem architecture]

As the figure shows, all VFS operations run in kernel mode. Access to storage and external devices is extremely complex, and these operations cannot be handed over to user code, otherwise the system would become very unstable.

Filesystem types

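  1. Disk-based filesystems
    The classic way of storing files on non-volatile media, that is, the familiar filesystems such as Ext2/3/4 and FAT.
  2. Virtual filesystems
    Generated inside the kernel, they give user applications a way to communicate with the kernel. The best known is the proc filesystem. It needs no storage on any kind of hardware: the information is produced from kernel memory, and the entries for a process disappear when the process exits.
  3. Network filesystems
    These filesystems give access to data on other computers. Local system calls still enter the kernel as usual; the filesystem implementation then sends the requests over the network to the remote machine, which carries them out. NFS, for example, has an in-kernel client, while some network filesystems such as sshfs are mounted through FUSE.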
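To make the three categories a little more concrete, here is a minimal user-space sketch that asks the kernel which filesystem backs a path, using statfs(2) and the magic numbers from linux/magic.h. The helper names fs_category and path_fs_category are made up for this illustration, and the mapping covers only a handful of well-known filesystems.

```c
#include <assert.h>
#include <sys/vfs.h>      /* statfs(2), Linux-specific */
#include <linux/magic.h>  /* PROC_SUPER_MAGIC, EXT2_SUPER_MAGIC, ... */

/* Hypothetical helper: map a few well-known magic numbers to a category. */
static const char *fs_category(unsigned long magic)
{
    switch (magic) {
    case EXT2_SUPER_MAGIC:   /* ext2, ext3 and ext4 share this value */
    case MSDOS_SUPER_MAGIC:  /* FAT */
        return "disk-based";
    case PROC_SUPER_MAGIC:
    case SYSFS_MAGIC:
        return "virtual";
    case NFS_SUPER_MAGIC:
        return "network";
    default:
        return "other";
    }
}

/* Returns 0 and stores the category, or -1 if statfs() fails. */
static int path_fs_category(const char *path, const char **category)
{
    struct statfs st;

    if (statfs(path, &st) != 0)
        return -1;
    *category = fs_category((unsigned long)st.f_type);
    return 0;
}
```

Calling path_fs_category("/proc", &c) on a typical system should report the virtual category, but the result depends on what is actually mounted at that path.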

The common file model

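The VFS defines a set of methods and abstractions, together with a uniform view of the objects (files) in a filesystem. The implementations behind them differ completely from one filesystem to another. What the VFS provides is a general superset; many of its operations are not needed by some filesystems, for example the writepage operation for proc.
When working with files, kernel space and user space use different objects. In user space a file is identified by a "file descriptor" (FD), an integer that is only valid within one process; two different processes can use the same FD number for unrelated files. The kernel data structure behind an FD is struct file. One of its main members, f_mapping, points to an address_space, the structure that actually exchanges data with the underlying device. File metadata is managed by another structure, the inode, which stores the link count, access times, version, backing device, owning superblock and so on, but not the file name. The name is stored in struct dentry, because names exist to index and manage inodes, and that is the dentry's job. The dentry tree is in turn reached from the super_block through its root dentry.
Below we look at each of these structures and how they relate, and then walk through what happens in Linux between opening a file and writing to it.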
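The split between descriptor and inode can be observed from user space. The sketch below (the function name two_fds_one_inode is invented for the example, and /tmp is assumed to be writable) opens the same temporary file twice and checks that the two descriptors are different integers that still lead to the same inode.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Opens one temporary file twice and reports whether the two descriptors
 * are distinct integers that nevertheless refer to the same inode.
 * Returns 1 if so, 0 if not, -1 on error.
 */
static int two_fds_one_inode(void)
{
    char path[] = "/tmp/vfs_fd_XXXXXX";
    struct stat s1, s2;
    int ok = -1;

    int fd1 = mkstemp(path);            /* creates and opens the file */
    if (fd1 < 0)
        return -1;
    int fd2 = open(path, O_RDONLY);     /* second, independent open */
    if (fd2 < 0) {
        close(fd1);
        unlink(path);
        return -1;
    }

    if (fstat(fd1, &s1) == 0 && fstat(fd2, &s2) == 0)
        ok = (fd1 != fd2) && (s1.st_ino == s2.st_ino) &&
             (s1.st_dev == s2.st_dev);

    close(fd1);
    close(fd2);
    unlink(path);
    return ok;
}
```

The descriptor is just an index into the per-process table, while st_dev and st_ino together identify the inode.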

VFS structures

[Figure: VFS structures]

inode

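The inode manages a file's metadata, including permissions, access information, link information and storage device information. Its operations mainly cover links, permissions and the like. Its data structure is as follows:
For more background, see inode.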

/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
    ...
    const struct inode_operations   *i_op; // inode operations, specific to the filesystem
    struct super_block  *i_sb; // superblock
    struct address_space    *i_mapping; // address space, the part that actually talks to the device
        ...
    /* Stat data, not accessed from path walking */
    unsigned long       i_ino; // inode number
    /*
     * Filesystems may only read i_nlink directly.  They shall use the
     * following functions for modification:
     *
     *    (set|clear|inc|drop)_nlink
     *    inode_(inc|dec)_link_count
     */
    union {
        const unsigned int i_nlink;
        unsigned int __i_nlink;
    };
    dev_t           i_rdev;
    loff_t          i_size;
    struct timespec64   i_atime; // last access time
    struct timespec64   i_mtime; // last data modification time
    struct timespec64   i_ctime; // last inode (status) change time, not the creation time
    spinlock_t          i_lock; /* i_blocks, i_bytes, maybe i_size */
    unsigned short      i_bytes; // bytes used beyond the full 512-byte units counted in i_blocks
    u8                  i_blkbits;       // log2 of the block size in bytes
    u8                  i_write_hint;
    blkcnt_t            i_blocks; // number of 512-byte blocks the file occupies

#ifdef __NEED_I_SIZE_ORDERED
    seqcount_t      i_size_seqcount;
#endif

    /* Misc */
    unsigned long       i_state;
    struct rw_semaphore i_rwsem;

    unsigned long       dirtied_when;   /* jiffies of first dirtying */
    unsigned long       dirtied_time_when;

    struct hlist_node   i_hash;
    struct list_head    i_io_list;  /* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
    struct bdi_writeback    *i_wb;      /* the associated cgroup wb */

    /* foreign inode detection, see wbc_detach_inode() */
    int         i_wb_frn_winner;
    u16         i_wb_frn_avg_time;
    u16         i_wb_frn_history;
#endif
    struct list_head    i_lru;      /* inode LRU list */
    struct list_head    i_sb_list;
    struct list_head    i_wb_list;  /* backing dev writeback list */
    union {
        struct hlist_head   i_dentry; // one inode may be referenced by several dentries (hard links)
        struct rcu_head i_rcu;
    };
    atomic64_t  i_version;
    atomic_t        i_count;
    atomic_t        i_dio_count;
    atomic_t        i_writecount;
#ifdef CONFIG_IMA
    atomic_t        i_readcount; /* struct files open RO */
#endif
    const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
    struct file_lock_context    *i_flctx;
    struct address_space    i_data;
    struct list_head    i_devices;
    union {
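        struct pipe_inode_info  *i_pipe; // pipe
        struct block_device *i_bdev; // block device
        struct cdev     *i_cdev;  // character device
        char            *i_link; // cached symlink target, when the filesystem keeps it in memory
        unsigned        i_dir_seq; // directory sequence counter used by parallel lookups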
    };
    __u32           i_generation;
#ifdef CONFIG_FSNOTIFY
    __u32           i_fsnotify_mask; /* all events this inode cares about */
    struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
#endif

#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
    struct fscrypt_info *i_crypt_info;
#endif
    void            *i_private; /* fs or device private pointer */
} __randomize_layout;
struct inode_operations {
    struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); // look up the name held in the dentry inside the directory inode
    const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *); // return the target of a symbolic link so the path walk can follow it
    int (*permission) (struct inode *, int);
    struct posix_acl * (*get_acl)(struct inode *, int);

    int (*readlink) (struct dentry *, char __user *,int);

    int (*create) (struct inode *,struct dentry *, umode_t, bool);
    int (*link) (struct dentry *,struct inode *,struct dentry *); // create a hard link
    int (*unlink) (struct inode *,struct dentry *); // remove a hard link (a name)
    int (*symlink) (struct inode *,struct dentry *,const char *); // create a symbolic link
    int (*mkdir) (struct inode *,struct dentry *,umode_t); // create the directory named by the dentry with the given mode and set up its inode
    int (*rmdir) (struct inode *,struct dentry *); // remove a directory
    int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); // create a special file (device node, FIFO or socket) from the mode and dev_t
    int (*rename) (struct inode *, struct dentry *,
            struct inode *, struct dentry *, unsigned int); // VFS to move the file specified by old_dentry from the old_dir directory to the directory new_dir, with the filename specified by new_dentry
    int (*setattr) (struct dentry *, struct iattr *);
    int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
    ssize_t (*listxattr) (struct dentry *, char *, size_t);
    int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
              u64 len);
    int (*update_time)(struct inode *, struct timespec64 *, int);
    int (*atomic_open)(struct inode *, struct dentry *,
               struct file *, unsigned open_flag,
               umode_t create_mode); 
    int (*tmpfile) (struct inode *, struct dentry *, umode_t);
    int (*set_acl)(struct inode *, struct posix_acl *, int);
} ____cacheline_aligned;
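The link and unlink operations above are what back hard links: several names, one inode, with i_nlink counting the names. A small user-space sketch of that relationship follows; hard_link_demo is a name chosen for this example, and it assumes /tmp is writable and supports hard links.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Creates a file and a hard link to it in a fresh temporary directory.
 * On success returns the link count seen through the second name and sets
 * *same_inode to 1 if both names share one inode number.
 * Returns -1 on error.
 */
static long hard_link_demo(int *same_inode)
{
    char dir[] = "/tmp/vfs_link_XXXXXX";
    char a[64], b[64];
    struct stat sa, sb;
    long nlink = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof a, "%s/a", dir);
    snprintf(b, sizeof b, "%s/b", dir);

    int fd = open(a, O_CREAT | O_WRONLY | O_EXCL, 0600);
    if (fd >= 0) {
        close(fd);
        /* link(2) adds a second name for the same inode */
        if (link(a, b) == 0 && stat(a, &sa) == 0 && stat(b, &sb) == 0) {
            *same_inode = (sa.st_ino == sb.st_ino);
            nlink = (long)sb.st_nlink;
        }
    }
    unlink(b);
    unlink(a);
    rmdir(dir);
    return nlink;
}
```

After link() succeeds, both names report the same st_ino and a link count of 2, which is the user-visible side of i_nlink.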

dentry

The dentry mainly manages file names and links a directory to all of its child entries.

dentry state

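A dentry can be in one of three states: used, unused, or negative.
used: associated with a valid inode and currently in use
unused: associated with a valid inode, but its reference count is 0; it stays in the cache and has not actually been freed yet
negative: no inode is associated with it, either because the file was deleted or because the name does not exist on the storage device at all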

dentry cache

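Resolving a path to its dentry from disk every time would be expensive, so an LRU cache is provided to speed up lookups. For example, to find the dentry for /usr/bin/java, the kernel first finds the dentry for /, then /usr, then /usr/bin, and so on until the end of the path. The dentries visited along the way are cached for later lookups. The lookup process itself is analyzed in detail later, in the discussion of lookup.
For more details, see dentry.
Its data structure is as follows: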

struct dentry {
    /* RCU lookup touched fields */
    unsigned int d_flags;       /* protected by d_lock */
    seqcount_t d_seq;       /* per dentry seqlock */
    struct hlist_bl_node d_hash;    /* lookup hash list */
    struct dentry *d_parent;    /* parent directory */
    struct qstr d_name;
    struct inode *d_inode;      /* Where the name belongs to - NULL is
                     * negative */
    unsigned char d_iname[DNAME_INLINE_LEN];    /* small names */

    /* Ref lookup also touches following */
    struct lockref d_lockref;   /* per-dentry lock and refcount */
    const struct dentry_operations *d_op;
    struct super_block *d_sb;   /* The root of the dentry tree */
    unsigned long d_time;       /* used by d_revalidate */
    void *d_fsdata;         /* fs-specific data */

    union {
        struct list_head d_lru;     /* LRU list */
        wait_queue_head_t *d_wait;  /* in-lookup ones only */
    };
    struct list_head d_child;   /* child of parent list */
    struct list_head d_subdirs; /* our children */
    /*
     * d_alias and d_rcu can share memory
     */
    union {
        struct hlist_node d_alias;  /* inode alias list */
        struct hlist_bl_node d_in_lookup_hash;  /* only for in-lookup ones */
        struct rcu_head d_rcu;
    } d_u;
} __randomize_layout;
struct dentry_operations {
    int (*d_revalidate)(struct dentry *, unsigned int); // check whether a cached dentry is still valid
    int (*d_weak_revalidate)(struct dentry *, unsigned int);
    int (*d_hash)(const struct dentry *, struct qstr *); // compute the hash of the name
    int (*d_compare)(const struct dentry *, // compare file names
            unsigned int, const char *, const struct qstr *);
    int (*d_delete)(const struct dentry *); 
                     // called when the reference count drops to 0; decides whether the now unused dentry stays cached
    int (*d_init)(struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_prune)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *); // called when the dentry loses its inode; releases the inode reference
    char *(*d_dname)(struct dentry *, char *, int);
    struct vfsmount *(*d_automount)(struct path *);
    int (*d_manage)(const struct path *, bool);
    struct dentry *(*d_real)(struct dentry *, const struct inode *);
} ____cacheline_aligned;

super_block

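The superblock holds the parameters of the actual filesystem behind a mount point, including the block size, the maximum file size the filesystem can handle, the filesystem type and the backing storage device. (Note: in the overall structure diagram above, the superblock has a files member pointing to all open files, but I could not find it in the structure below. It used to be used during umount to make sure every file had been closed; I do not know how newer kernels handle this and will add it once I find out.)
Superblock management mostly happens in the filesystem mount logic, which will be analyzed in detail in a later article on mounting. The main job of the superblock is to manage inodes.
For details, see superblock.
Its data structure is as follows: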

struct super_block {
    struct list_head    s_list;     /* Keep this first */
    dev_t           s_dev;      /* search index; _not_ kdev_t */
    unsigned char       s_blocksize_bits; // log2(block size in bytes)
    unsigned long       s_blocksize; // block size in bytes
    loff_t          s_maxbytes; /* Max file size */
    struct file_system_type *s_type; // filesystem type
    const struct super_operations   *s_op; // superblock operations
    const struct dquot_operations   *dq_op;
    const struct quotactl_ops   *s_qcop;
    const struct export_operations *s_export_op;
    unsigned long       s_flags;
    unsigned long       s_iflags;   /* internal SB_I_* flags */
    unsigned long       s_magic;
    struct dentry       *s_root; // root dentry of this filesystem; path lookups inside the mount start here
    struct rw_semaphore s_umount;
    int         s_count;
    atomic_t        s_active;
#ifdef CONFIG_SECURITY
    void                    *s_security;
#endif
    const struct xattr_handler **s_xattr;
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
    const struct fscrypt_operations *s_cop;
#endif
    struct hlist_bl_head    s_roots;    /* alternate root dentries for NFS */
    struct list_head    s_mounts;   /* list of mounts; _not_ for fs use */
    struct block_device *s_bdev;
    struct backing_dev_info *s_bdi;
    struct mtd_info     *s_mtd;
    struct hlist_node   s_instances;
    unsigned int        s_quota_types;  /* Bitmask of supported quota types */
    struct quota_info   s_dquot;    /* Diskquota specific options */

    struct sb_writers   s_writers;

    /*
     * Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
     * s_fsnotify_marks together for cache efficiency. They are frequently
     * accessed and rarely modified.
     */
    void            *s_fs_info; /* Filesystem private info */

    /* Granularity of c/m/atime in ns (cannot be worse than a second) */
    u32         s_time_gran;
#ifdef CONFIG_FSNOTIFY
    __u32           s_fsnotify_mask;
    struct fsnotify_mark_connector __rcu    *s_fsnotify_marks;
#endif

    char            s_id[32];   /* Informational name */
    uuid_t          s_uuid;     /* UUID */

    unsigned int        s_max_links;
    fmode_t         s_mode;

    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex;    /* Kludge */

    /*
     * Filesystem subtype.  If non-empty the filesystem type field
     * in /proc/mounts will be "type.subtype"
     */
    char *s_subtype;

    const struct dentry_operations *s_d_op; /* default d_op for dentries */

    /*
     * Saved pool identifier for cleancache (-1 means none)
     */
    int cleancache_poolid;

    struct shrinker s_shrink;   /* per-sb shrinker handle */

    /* Number of inodes with nlink == 0 but still referenced */
    atomic_long_t s_remove_count;

    /* Pending fsnotify inode refs */
    atomic_long_t s_fsnotify_inode_refs;

    /* Being remounted read-only */
    int s_readonly_remount;

    /* AIO completions deferred from interrupt context */
    struct workqueue_struct *s_dio_done_wq;
    struct hlist_head s_pins;

    /*
     * Owning user namespace and default context in which to
     * interpret filesystem uids, gids, quotas, device nodes,
     * xattrs and security labels.
     */
    struct user_namespace *s_user_ns;

    /*
     * The list_lru structure is essentially just a pointer to a table
     * of per-node lru lists, each of which has its own spinlock.
     * There is no need to put them into separate cachelines.
     */
    struct list_lru     s_dentry_lru; // dentry cache (LRU list)
    struct list_lru     s_inode_lru; // inode cache (LRU list)
    struct rcu_head     rcu;
    struct work_struct  destroy_work;

    struct mutex        s_sync_lock;    /* sync serialisation lock */

    /*
     * Indicates how deep in a filesystem stack this SB is
     */
    int s_stack_depth;

    /* s_inode_list_lock protects s_inodes */
    spinlock_t      s_inode_list_lock ____cacheline_aligned_in_smp;
    struct list_head    s_inodes;   /* all inodes */

    spinlock_t      s_inode_wblist_lock;
    struct list_head    s_inodes_wb;    /* writeback inodes */
} __randomize_layout;
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb); // allocate an inode for this superblock
    void (*destroy_inode)(struct inode *); // free an inode of this superblock
    void (*dirty_inode) (struct inode *, int flags); // called when the inode is marked dirty
    int (*write_inode) (struct inode *, struct writeback_control *wbc);// write the inode back to storage
    int (*drop_inode) (struct inode *); // called when the last reference is dropped; decides whether the inode is evicted
    void (*evict_inode) (struct inode *);
    void (*put_super) (struct super_block *);  // release the superblock on unmount
    int (*sync_fs)(struct super_block *sb, int wait); 
    int (*freeze_super) (struct super_block *);
    int (*freeze_fs) (struct super_block *);
    int (*thaw_super) (struct super_block *);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *); // report filesystem statistics
    int (*remount_fs) (struct super_block *, int *, char *); // remount
    void (*umount_begin) (struct super_block *); // mainly used by NFS
        // informational callbacks
    int (*show_options)(struct seq_file *, struct dentry *);
    int (*show_devname)(struct seq_file *, struct dentry *);
    int (*show_path)(struct seq_file *, struct dentry *);
    int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    struct dquot **(*get_dquots)(struct inode *);
#endif
    int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
    long (*nr_cached_objects)(struct super_block *,
                  struct shrink_control *);
    long (*free_cached_objects)(struct super_block *,
                    struct shrink_control *);
};
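The superblock keeps the block size twice, once in bytes (s_blocksize) and once as its base-2 logarithm (s_blocksize_bits). The sketch below shows that relationship with plain arithmetic and reads the block size a mounted filesystem reports through statvfs(3), which for many filesystems reflects the superblock value. Both function names are invented for this illustration.

```c
#include <assert.h>
#include <sys/statvfs.h>

/* Integer log2 for a power-of-two block size; returns -1 otherwise. */
static int block_bits(unsigned long block_size)
{
    int bits = 0;

    if (block_size == 0 || (block_size & (block_size - 1)) != 0)
        return -1;
    while ((1UL << bits) < block_size)
        bits++;
    return bits;
}

/* Reads the block size the filesystem at `path` reports to user space. */
static long fs_block_size(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return -1;
    return (long)sv.f_bsize;
}
```

For a filesystem with 4096-byte blocks, block_bits gives 12, so a byte offset can be turned into a block number with a shift instead of a division.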

address_space

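As described above, the superblock manages inodes, the dentry manages file names, the mapping from names to inodes and the directory hierarchy, and the inode manages file metadata. But who actually handles the exchange of file data with disks and other storage devices? How does written data get from memory back to disk, and how is data read from disk into memory? This is the main job of address_space: it synchronizes data between memory and the backing device. How it works is covered in detail in the article on the page cache.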

struct address_space {
    struct inode        *host; // owning inode, used to reach the file metadata
    struct xarray       i_pages; // cached pages of the file
    gfp_t           gfp_mask; // allocation flags used for the pages
    atomic_t        i_mmap_writable; // number of VM_SHARED mappings
    struct rb_root_cached   i_mmap; // tree of private and shared mmap mappings
    struct rw_semaphore i_mmap_rwsem;
    unsigned long       nrpages; // number of pages currently cached
    unsigned long       nrexceptional;
    pgoff_t         writeback_index; // writeback starts here
    const struct address_space_operations *a_ops; // address space operations
    unsigned long       flags; // error bits and other flags
    errseq_t        wb_err; // most recent writeback error
    spinlock_t      private_lock;
    struct list_head    private_list;
    void            *private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc); // write one page back
    int (*readpage)(struct file *, struct page *); // read one page of data into memory

    /* Write back some dirty pages from this mapping. */
    int (*writepages)(struct address_space *, struct writeback_control *); // write back dirty pages

    /* Set a page dirty.  Return true if this dirtied it */
    int (*set_page_dirty)(struct page *page); // mark a page dirty

    /*
     * Reads in the requested pages. Unlike ->readpage(), this is
     * PURELY used for read-ahead!.
     */
    int (*readpages)(struct file *filp, struct address_space *mapping,
            struct list_head *pages, unsigned nr_pages);

    int (*write_begin)(struct file *, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned flags,
                struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned copied,
                struct page *page, void *fsdata);

    /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
    sector_t (*bmap)(struct address_space *, sector_t);
    void (*invalidatepage) (struct page *, unsigned int, unsigned int);
    int (*releasepage) (struct page *, gfp_t);
    void (*freepage)(struct page *);
    ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
    /*
     * migrate the contents of a page to the specified target. If
     * migrate_mode is MIGRATE_ASYNC, it must not block.
     */
    int (*migratepage) (struct address_space *,
            struct page *, struct page *, enum migrate_mode);
    bool (*isolate_page)(struct page *, isolate_mode_t);
    void (*putback_page)(struct page *);
    int (*launder_page) (struct page *);
    int (*is_partially_uptodate) (struct page *, unsigned long,
                    unsigned long);
    void (*is_dirty_writeback) (struct page *, bool *, bool *);
    int (*error_remove_page)(struct address_space *, struct page *);

    /* swapfile support */
    int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                sector_t *span);
    void (*swap_deactivate)(struct file *file);
};
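One user-visible consequence of every file having a single address_space is that write(2) and a shared mmap(2) of the same file see the same cached pages. The following sketch writes through one path and reads through the other without any fsync in between. The function name write_then_mmap is hypothetical, and /tmp is assumed to be writable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Writes `msg` with write(2), then reads it back through a shared mapping
 * of the same file. Returns 1 if the mapping shows the written bytes,
 * 0 if not, -1 on error.
 */
static int write_then_mmap(const char *msg)
{
    char path[] = "/tmp/vfs_map_XXXXXX";
    size_t len = strlen(msg);
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                  /* the file lives until fd is closed */

    if (len > 0 && write(fd, msg, len) == (ssize_t)len) {
        /* map the same file; no fsync or re-read from disk is needed */
        char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            result = (memcmp(p, msg, len) == 0);
            munmap(p, len);
        }
    }
    close(fd);
    return result;
}
```

Whether and when those pages reach the disk is a separate question, decided by writeback or by an explicit fsync.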

file

As mentioned earlier, a process in user space sees an integer fd, while the corresponding kernel data structure is file. Every user-space operation on an fd is translated by a system call into an operation on the file.
For more details, see file.
Its data structure is as follows:

struct task_struct {
       ...
    /* Filesystem information: */
    struct fs_struct        *fs; // root & pwd path

    /* Open file information: */
    struct files_struct     *files; // opened files

    /* Namespaces: */
    struct nsproxy          *nsproxy;
        ...
};
/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
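    atomic_t count; // reference count: number of tasks sharing this table
    bool resize_in_progress; // set while the fd table is being expanded
    wait_queue_head_t resize_wait;

    struct fdtable __rcu *fdt; // pointer to the fd table currently in use
    struct fdtable fdtab; // embedded initial fd table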
  /*
   * written part on a separate cache line in SMP
   */
    spinlock_t file_lock ____cacheline_aligned_in_smp;
    unsigned int next_fd; // hint: where the search for the next free fd starts
    unsigned long close_on_exec_init[1];
    unsigned long open_fds_init[1];
    unsigned long full_fds_bits_init[1];
    struct file __rcu * fd_array[NR_OPEN_DEFAULT]; // embedded initial array of open files
};
struct fdtable {
    unsigned int max_fds; // current capacity of this table; it grows on demand up to the limit set by ulimit -n
    struct file __rcu **fd;      /* current fd array */
    unsigned long *close_on_exec;
    unsigned long *open_fds;  // bitmap of fds in use
    unsigned long *full_fds_bits;
    struct rcu_head rcu;
};
struct file {
    union {
        struct llist_node   fu_llist;
        struct rcu_head     fu_rcuhead;
    } f_u;
    struct path     f_path;  // path
    struct inode        *f_inode;    /* cached value */
    const struct file_operations    *f_op; // file operations
    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t      f_lock;
    enum rw_hint        f_write_hint;
    atomic_long_t   f_count;
    unsigned int        f_flags;
    fmode_t         f_mode;
    struct mutex        f_pos_lock;
    loff_t          f_pos; // current file offset
    struct fown_struct  f_owner; // owner that receives signals such as SIGIO for this file
    const struct cred   *f_cred;
    struct file_ra_state    f_ra;
    u64         f_version;
#ifdef CONFIG_SECURITY
    void            *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void            *private_data;

#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head    f_ep_links;
    struct list_head    f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space    *f_mapping; // address space
    errseq_t        f_wb_err;
} __randomize_layout;
struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int); // move the file offset
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
    int (*iterate) (struct file *, struct dir_context *);
    int (*iterate_shared) (struct file *, struct dir_context *);
    __poll_t (*poll) (struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *); // map the file into virtual memory
    unsigned long mmap_supported_flags;
    int (*open) (struct inode *, struct file *); // called when the file is opened
    int (*flush) (struct file *, fl_owner_t id);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int); 
    int (*flock) (struct file *, int, struct file_lock *); // lock a file
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **, void **);
    long (*fallocate)(struct file *file, int mode, loff_t offset,
              loff_t len);
    void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
    unsigned (*mmap_capabilities)(struct file *);
#endif
    ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
            loff_t, size_t, unsigned int);
    loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                   struct file *file_out, loff_t pos_out,
                   loff_t len, unsigned int remap_flags);
    int (*fadvise)(struct file *, loff_t, loff_t, int);
} __randomize_layout;
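Because f_pos lives in struct file rather than in the descriptor or the inode, two descriptors share an offset only when they share a file object. The sketch below contrasts dup(2), which gives a second descriptor for the same file object, with a second open(2), which creates a new one. The helper name offset_demo is made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Advances the offset of one descriptor by writing 5 bytes, then reports
 * the offsets seen through a dup()ed descriptor and through a separately
 * open()ed one. Returns 0 on success, -1 on error.
 */
static int offset_demo(off_t *dup_off, off_t *open_off)
{
    char path[] = "/tmp/vfs_file_XXXXXX";
    int rc = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int fd_dup = dup(fd);                 /* same struct file */
    int fd_new = open(path, O_RDONLY);    /* new struct file, same inode */

    if (fd_dup >= 0 && fd_new >= 0 && write(fd, "hello", 5) == 5) {
        *dup_off = lseek(fd_dup, 0, SEEK_CUR);
        *open_off = lseek(fd_new, 0, SEEK_CUR);
        rc = 0;
    }
    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_new >= 0)
        close(fd_new);
    close(fd);
    unlink(path);
    return rc;
}
```

The dup()ed descriptor follows the write and ends up at offset 5, while the separately opened one stays at 0.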

The VFS in practice

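This gives a basic understanding of the VFS architecture, but the picture is still too vague for a deeper understanding. So let us walk through a file operation in pseudocode to describe how the Linux kernel reads and writes files, using a write as the example:
Goal: write hello world to the file /testmount/testdir/testfile1.txt, starting at offset 0
The basic system call sequence is 1. mkdir 2. creat 3. open 4. write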
The calls behind mkdir look roughly like this:
rootInode = sb->s_root->d_inode;
testDirDentry = dentry("testdir")
rootInode->i_op->mkdir(rootInode, testDirDentry, 0777)  // returns a status code; the new inode is attached to the dentry
testDirInode = testDirDentry->d_inode
The calls behind creat look roughly like this:
testFileDentry = dentry("testfile1.txt")
testDirInode->i_op->create(testDirInode, testFileDentry, 0777, true)
testFileInode = testFileDentry->d_inode
The calls behind the open system call look roughly like this:
testfile->f_op = testFileInode->i_fop
testfile->f_op->open(testFileInode, testfile)
The calls behind the write system call look roughly like this:
pos = 0
testfile->f_op->write(testfile, "hello world", len, &pos)
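Seen from user space, the same sequence is just four system calls. The sketch below runs them under a fresh temporary directory that stands in for /testmount (an assumption, so the example does not need a mounted disk) and reads the data back to confirm it. The function name write_hello is invented for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Runs mkdir -> creat -> open -> write under a fresh temporary directory
 * and reads the data back into `out`.
 * Returns the number of bytes read back, or -1 on error.
 */
static ssize_t write_hello(char *out, size_t out_size)
{
    char base[] = "/tmp/vfs_demo_XXXXXX";
    char dir[64], file[96];
    const char msg[] = "hello world";
    size_t len = strlen(msg);
    ssize_t n = -1;

    if (mkdtemp(base) == NULL)
        return -1;
    snprintf(dir, sizeof dir, "%s/testdir", base);
    snprintf(file, sizeof file, "%s/testfile1.txt", dir);

    if (mkdir(dir, 0777) == 0) {                 /* 1. mkdir */
        int fd = creat(file, 0666);              /* 2. creat */
        if (fd >= 0) {
            close(fd);
            fd = open(file, O_RDWR);             /* 3. open */
            if (fd >= 0) {
                /* 4. write; the offset is 0 right after open */
                if (write(fd, msg, len) == (ssize_t)len && fsync(fd) == 0)
                    n = pread(fd, out, out_size, 0);
                close(fd);
            }
            unlink(file);
        }
        rmdir(dir);
    }
    rmdir(base);
    return n;
}
```

The fsync call is optional for correctness of the read-back; it is there to show where data would be forced out to storage.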
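The detailed flow:

  1. Suppose we have a block device /dev/sda and format it as an Ext2 filesystem. How a block device is formatted will be covered in the device management chapter.
  2. We mount the disk on /testmount. The kernel's mount code then sets up the corresponding superblock; how mounting works is left for a later article.
  3. To write /testmount/testdir/testfile1.txt, the kernel first has to look up the dentries along the full path, and the missing directory and file have to be created together with their inodes.
    3.1 Find the superblock of the mount point that covers the full path. The most specific match here is /testmount.
    3.2 From that superblock, take the root dentry and its inode. The directory contents are read from disk through the inode's address_space; for a directory, the stored data is the list of its entries, from which the child dentries of the root are built. No entry named testdir exists, so the lookup fails with a "no such file or directory" error. The user then creates the directory, and the new entry is written back to the device behind the parent inode. In the same way, testfile1.txt is created under testdir and written back to the device behind testdir's inode.
  4. With the inode in place, the file is opened through the open system call. The kernel allocates a file descriptor for the process (the search starts from next_fd in files_struct), creates a file object whose f_op comes from inode->i_fop and whose f_mapping comes from the inode's address_space, calls file->f_op->open(inode, file), and then associates the user-space fd with the file object.
  5. After the file is open, all further reads and writes go through the fd. Inside the kernel they operate on the corresponding file structure. To write hello world, for example, the kernel calls file->f_op->write.
    In fact, file->f_op->write copies the bytes into pages in memory that belong to the address_space, and they are written back to disk later at a suitable time. This is the cache we usually talk about; the fsync system call can be used to force the data to storage. The f_op signatures carry the __user annotation, which indicates that the buffer is a user-space address. The data ends up being copied into the pages of the address_space in kernel memory, which is the copy between user space and kernel space that people often mention. The well-known zero-copy sendfile came later: it transfers data directly between two fds, operating on pages inside the kernel, without a detour through the user address space.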
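To illustrate the last point, here is a minimal sketch that copies data from one file to another with sendfile(2), so the bytes never pass through a user-space buffer on the way. It assumes a Linux kernel recent enough to accept a regular file as the destination and a writable /tmp; sendfile_copy is a hypothetical name, and the final pread exists only to check the result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <unistd.h>

/*
 * Stores `msg` in a temporary source file, copies it to a second temporary
 * file with sendfile(2), and reads the copy back into `out`.
 * Returns the number of bytes sendfile transferred, or -1 on error.
 */
static ssize_t sendfile_copy(const char *msg, char *out, size_t out_size)
{
    char src[] = "/tmp/vfs_src_XXXXXX";
    char dst[] = "/tmp/vfs_dst_XXXXXX";
    size_t len = strlen(msg);
    ssize_t copied = -1;

    int in_fd = mkstemp(src);
    if (in_fd < 0)
        return -1;
    int out_fd = mkstemp(dst);
    if (out_fd < 0) {
        close(in_fd);
        unlink(src);
        return -1;
    }

    if (write(in_fd, msg, len) == (ssize_t)len) {
        off_t off = 0;                 /* read the source from its start */
        copied = sendfile(out_fd, in_fd, &off, len);
        if (copied >= 0 && pread(out_fd, out, out_size, 0) < 0)
            copied = -1;
    }
    close(in_fd);
    close(out_fd);
    unlink(src);
    unlink(dst);
    return copied;
}
```

A plain read/write loop would do the same job with two extra copies across the user/kernel boundary for every chunk.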

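Conclusion

That concludes the basic VFS flow. Mounting of the super_block and the actual read and write paths of address_space will be added later. address_space will be covered in detail together with the page cache and the buffer cache, since this area is particularly complex and tied to the concrete filesystem implementation; it will be introduced alongside the Ext2 filesystem.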
