Linux I/O Models and Related Techniques, Illustrated

Blocking I/O Model

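From the start, the Linux kernel has provided read and write as blocking operations.

When a client connects, a file descriptor for the connection appears in the process's file descriptor directory (/proc/<pid>/fd), for example fd 8 and fd 9 (0 is standard input, 1 is standard output, 2 is standard error).
When the application needs to read, it issues the read system call on fd 8. If the data has not arrived yet, the calling process or thread blocks and waits.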
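
The blocking behaviour described above can be sketched in C as follows. This is a minimal illustration, not code from the original article: a pipe stands in for the client connection, and the helper name blocking_read_demo is hypothetical. The child writes after a short delay, and the parent's read() does not return until that data arrives.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of bytes that read() delivered
 * after blocking, or -1 on error. A pipe stands in for a client socket. */
ssize_t blocking_read_demo(char *buf, size_t cap)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                      /* child: plays the "client" */
        const char msg[] = "hello";
        close(fds[0]);
        usleep(100 * 1000);              /* the data arrives 100 ms later */
        if (write(fds[1], msg, sizeof msg - 1) != (ssize_t)(sizeof msg - 1))
            _exit(1);
        _exit(0);
    }

    close(fds[1]);                       /* parent keeps only the read end */
    ssize_t n = read(fds[0], buf, cap);  /* blocks here until data arrives */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n;
}
```

While the parent sits in read() it uses no CPU, but it also cannot serve any other descriptor, which is why this model needs one thread or process per connection.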

man 2 read

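SYNOPSIS
       #include <unistd.h>
       ssize_t read(int fd, void *buf, size_t count);
DESCRIPTION
       read() attempts to read up to count bytes from file descriptor fd into the buffer starting at buf.
       If count is zero, read() returns 0 and does nothing else. If count is greater than SSIZE_MAX, the result is unspecified.
RETURN VALUE
       On success, the number of bytes read is returned (zero indicates end of file); this number is limited by the bytes remaining in the file. It is not an error if this number is smaller than the number of bytes
       requested; this may happen because fewer bytes are available right now (for example because we were close to end of file, or because we are reading from a pipe or a terminal), or because read() was
       interrupted by a signal. On error, -1 is returned and errno is set appropriately. In this case it is left unspecified whether the file position changes.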

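Problem

If many clients connect, say 1000, the application ends up with 1000 processes or threads blocked in read. This causes a performance problem:

The CPU keeps switching between them, so context-switch overhead grows, the share of time spent on actual I/O shrinks, and CPU capacity is wasted.
This led to non-blocking I/O.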

Non-blocking I/O Model

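The Linux kernel then provided non-blocking read and write. For a socket, this mode can be requested by passing the SOCK_NONBLOCK flag to socket().

The application no longer needs one thread per file descriptor. A single thread can poll all descriptors by calling read in a loop; if no data has arrived, read returns immediately (with -1 and errno set to EAGAIN or EWOULDBLOCK) instead of blocking.
If data is available, it can be dispatched to the business logic.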
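
A minimal C sketch of one polling attempt, under stated assumptions: the helper names set_nonblocking and try_read are hypothetical, and the example sets O_NONBLOCK with fcntl(2) on an existing descriptor rather than passing SOCK_NONBLOCK to socket(), so that it works on any descriptor.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch an existing descriptor to non-blocking mode.
 * For a new socket, socket(domain, type | SOCK_NONBLOCK, protocol) has the
 * same effect without the extra fcntl() calls. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: one polling attempt.
 * Returns the bytes read, 0 at end of file, -1 on a real error,
 * and -2 when no data is available yet. */
ssize_t try_read(int fd, void *buf, size_t cap)
{
    ssize_t n = read(fd, buf, cap);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;               /* nothing yet; the caller polls again later */
    return n;
}
```

A polling loop would call try_read on every descriptor in turn, which is exactly where the one-system-call-per-descriptor cost discussed next comes from.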

man 2 socket

Since  Linux  2.6.27, the type argument serves a second purpose: in addition to specifying a socket type, it may include the bitwise OR of any of the following values, to modify the behavior of
       socket():

       SOCK_NONBLOCK   Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor.  Using this flag saves extra calls to fcntl(2) to  achieve
                       the same result.

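This shows that since Linux 2.6.27 socket() can create a socket in non-blocking mode directly. On older kernels the same result requires a separate fcntl(2) call that sets O_NONBLOCK.

Problem

Likewise, with many client connections, say 1000, every polling pass triggers 1000 system calls, and the overhead of 1000 system calls is considerable.

Hence select.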

I/O Multiplexing Model - select

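The Linux kernel then provided select, which collapses those 1000 system calls into one. The polling now happens in kernel space.

select reports which fds are ready by rewriting the fd sets passed to it. The application then checks the watched fds with FD_ISSET and reads from the ready ones to do its business processing.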
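
A minimal C sketch of that flow, assuming a hypothetical helper wait_readable that watches an array of descriptors for readability. The names and the millisecond timeout parameter are illustrative choices, not part of the original article.

```c
#define _GNU_SOURCE
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for any of fds[0..n-1] to become
 * readable. On return ready[i] is 1 if fds[i] is readable, else 0.
 * Returns the number of ready descriptors, 0 on timeout, -1 on error. */
int wait_readable(const int *fds, int n, int *ready, int timeout_ms)
{
    fd_set rset;
    int maxfd = -1;

    FD_ZERO(&rset);
    for (int i = 0; i < n; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            return -1;                   /* select cannot watch this descriptor */
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
        ready[i] = 0;
    }

    struct timeval tv = {
        .tv_sec  = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };
    /* One system call covers every descriptor in the set. */
    int rc = select(maxfd + 1, &rset, NULL, NULL, &tv);
    if (rc <= 0)
        return rc;

    for (int i = 0; i < n; i++)          /* the caller still scans all watched fds */
        if (FD_ISSET(fds[i], &rset))
            ready[i] = 1;
    return rc;
}
```

Note that the fd set has to be rebuilt before every call, because select overwrites it with the result.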

man 2 select

SYNOPSIS
       #include <sys/select.h>
       int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
         
DESCRIPTION
       select() allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O operation (e.g., input possible). A file
       descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2), or a sufficiently small write(2)) without blocking.

       select() can monitor only file descriptors numbers that are less than FD_SETSIZE; poll(2) and epoll(7) do not have this limitation. See BUGS.

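As shown, select accepts multiple file descriptors and hands the polling over to the kernel.

Problem

Although 1000 system calls are reduced to one, that one call still has to pass all 1000 file descriptors as arguments every time, so the fd sets are copied between user space and the kernel on each call. This carries its own memory and copying overhead.

Hence epoll.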

I/O Multiplexing Model - epoll

man epoll
man 2 epoll_create
man 2 epoll_ctl
man 2 epoll_wait

epoll:

SYNOPSIS
       #include <sys/epoll.h>
             
DESCRIPTION
       The  epoll  API  performs  a  similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them.  The epoll API can be used either as an edge-triggered or a
       level-triggered interface and scales well to large numbers of watched file descriptors.

       The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be considered as a container for two lists:

       • The interest list (sometimes also called the epoll set): the set of file descriptors that the process has registered an interest in monitoring.

       • The ready list: the set of file descriptors that are "ready" for I/O.  The ready list is a subset of (or, more precisely, a set of references to) the file descriptors in  the  interest  list.
         The ready list is dynamically populated by the kernel as a result of I/O activity on those file descriptors.

epoll_create:

The kernel creates an epoll instance data structure and returns a file descriptor, epfd, that refers to it.

epoll_ctl:

Registers a file descriptor fd together with the events to monitor (an epoll_event), removes it, or modifies its monitored events.

SYNOPSIS
       #include <sys/epoll.h>
       int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

DESCRIPTION
       This system call is used to add, modify, or remove entries in the interest list of the epoll(7) instance referred to by the file descriptor epfd. It requests that the operation op be performed
       for the target file descriptor, fd.

       Valid values for the op argument are:
       EPOLL_CTL_ADD
              Add an entry to the interest list of the epoll file descriptor, epfd. The entry includes the file descriptor, fd, a reference to the corresponding open file description (see epoll(7)
              and open(2)), and the settings specified in event.
       EPOLL_CTL_MOD
              Change the settings associated with fd in the interest list to the new settings specified in event.
       EPOLL_CTL_DEL
              Remove (deregister) the target file descriptor fd from the interest list. The event argument is ignored and can be NULL (but see BUGS below).

epoll_wait:

Blocks until registered events occur, returns the number of ready events, and writes them into the epoll_event array supplied by the caller.
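
The calls above fit together roughly as in the following C sketch. The helper names watch_readable and collect_ready are hypothetical, and the example uses level-triggered EPOLLIN only. The epoll instance itself would be created by the caller, for example with epoll_create1(0).

```c
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: add fd to the interest list of epfd, asking to be
 * told when it becomes readable (level-triggered). Returns 0 or -1. */
int watch_readable(int epfd, int fd)
{
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = fd;                     /* handed back to us by epoll_wait */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: wait up to timeout_ms and copy the ready descriptors
 * into out[0..cap-1]. Returns how many were ready, 0 on timeout, -1 on error. */
int collect_ready(int epfd, int *out, int cap, int timeout_ms)
{
    struct epoll_event events[64];
    if (cap > 64)
        cap = 64;
    int n = epoll_wait(epfd, events, cap, timeout_ms);
    for (int i = 0; i < n; i++)
        out[i] = events[i].data.fd;      /* only ready fds come back; no full scan */
    return n;
}
```

Unlike select, the interest list is registered once with epoll_ctl and stays in the kernel, so each epoll_wait call passes only the output buffer.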

Further reading

https://www.zhihu.com/question/39792257
https://programmer.group/5dc6d7d3c6146.html

Other I/O Optimization Techniques

man 2 mmap
man 2 sendfile
man 2 fork

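mmap:

mmap finds a free range in the process's virtual address space and maps a file into it, so the file can be accessed without calling the read or write system calls. The goal is to map the file on disk into the process's virtual address space so that the process reads and writes the file directly. This removes a copy of the file data and makes access more efficient.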
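
A minimal C sketch of reading a file through a mapping instead of read(). The helper name read_via_mmap and its buffer-based interface are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: copy up to cap-1 bytes of the file at path into out
 * (NUL-terminated) by reading through a mapping instead of calling read().
 * Returns the number of bytes copied, or -1 on error or for an empty file. */
long read_via_mmap(const char *path, char *out, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0 || cap == 0) {
        close(fd);
        return -1;                       /* mmap rejects a zero-length mapping */
    }

    size_t len = (size_t)st.st_size;
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                           /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    size_t n = len < cap - 1 ? len : cap - 1;
    memcpy(out, p, n);                   /* plain memory access; page faults load the file */
    out[n] = '\0';
    munmap(p, len);
    return (long)n;
}
```

The memcpy here is only for returning the data to the caller; a real consumer would usually work on the mapped pages in place.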

Taking a read as an example:

[Figure: reading a file through mmap]

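An in-depth look at how mmap works, starting from three key questions: http://www.reibang.com/p/eece39beee20

Use cases

Kafka memory-maps its index files. Writes go into the mapped pages without an explicit copy from a user-space buffer into the kernel, and the kernel writes the dirty pages back to disk.

Likewise, Java's MappedByteBuffer is backed by mmap on Linux.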

sendfile:

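The sendfile system call transfers data directly between two file descriptors, entirely inside the kernel. This avoids copying the data between kernel buffers and user buffers, which makes it very efficient; the technique is known as zero-copy.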
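
A minimal C sketch of sending a whole file with sendfile. The helper name send_file_to is hypothetical, and error handling is reduced to the essentials (for instance, EINTR and non-blocking output descriptors are not handled).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send the whole file at path to out_fd (typically a
 * socket) without passing the data through a user-space buffer.
 * Returns the number of bytes sent, or -1 on error. */
long send_file_to(int out_fd, const char *path)
{
    int in_fd = open(path, O_RDONLY);
    if (in_fd == -1)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) == -1) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;                    /* sendfile advances this for us */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0)
            break;                       /* error, or the file shrank */
    }
    close(in_fd);
    return offset == st.st_size ? (long)offset : -1;
}
```

The loop is needed because sendfile may transfer fewer bytes than requested in a single call.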

Use cases

Take Kafka: when a consumer fetches messages, Kafka calls sendfile (FileChannel.transferTo in Java), so data is read from memory or from the data file inside the kernel and sent straight to the network card, without the two copies through user space. This is the so-called "zero-copy".

Web servers such as Tomcat, Nginx, and Apache also use sendfile when they send static resources over the network.

fork

man 2 fork

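There are three ways to create a child process:

fork: after the call, the child has its own pid and task_struct and receives a copy of the parent's resources. What is actually copied is mainly the pointers (page tables), not the contents of the parent's virtual memory. Parent and child run concurrently, and their variables are isolated from each other.

Linux today uses Copy-On-Write (COW). To reduce overhead, fork does not produce two separate copies at first, because at that moment most of the data is identical.
Copy-on-write postpones the real data copy. If a write does happen later, parent and child no longer hold the same data, so the affected page is copied and each process gets its own version. This lowers the cost of fork.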

NOTES
       Under  Linux,  fork()  is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique
       task structure for the child.

vfork: unlike fork, a child created with vfork shares the parent's address space. The child runs entirely in the parent's address space, so any change it makes to data there is visible to the parent. After vfork, the parent is suspended until the child calls exec or _exit.
clone: can be seen as a mix of fork and vfork. The caller sets the clone_flags argument to decide which resources are shared and which are copied. The CLONE_VFORK flag decides whether the parent blocks while the child runs: without it, parent and child run concurrently; with it, the parent is suspended until the child calls execve or _exit.
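
The isolation that fork provides can be observed with a small C sketch. The helper name fork_isolation_demo is hypothetical, and the child's view is passed back through its exit status, so the example only works for small values (the status carries 8 bits).

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, let the child overwrite *value, and report what
 * the child saw afterwards. Returns 0 on success, -1 on error.
 * Assumes *value + 100 fits in the 8-bit exit status. */
int fork_isolation_demo(int *value, int *seen_by_child)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        *value += 100;                   /* first write: the kernel copies this page for the child */
        _exit(*value & 0xff);            /* report the child's view via the exit status */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    *seen_by_child = WEXITSTATUS(status);
    return 0;                            /* *value in the parent is untouched */
}
```

With vfork instead of fork, the same write would land in the parent's memory, which is why a vfork child should do nothing but exec or _exit.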

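Summary
- Purpose of fork
  A process wants a copy of itself so that parent and child can execute different parts of the code concurrently.
  For example, Redis RDB persistence uses fork: the snapshot reflects an exact point in time, it is created quickly, and the parent keeps serving requests.
- Purpose of vfork
  A process created with vfork is mainly meant to call an exec function right away to run another program.
- Purpose of clone
  Selectively decide which resources parent and child share and which are copied.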
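
The vfork-then-exec pattern from the summary can be sketched in C as below. The helper name run_with_vfork is hypothetical; it takes an argv array whose first element is the full path of the program to run.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with vfork + execv and return the
 * program's exit status, or -1 on error. */
int run_with_vfork(char *const argv[])
{
    pid_t pid = vfork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* The child borrows the parent's address space: it should only call
         * exec or _exit, and must not return or modify the parent's data. */
        execv(argv[0], argv);
        _exit(127);                      /* reached only if execv failed */
    }

    /* The parent resumes once the child has called execv or _exit. */
    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, passing { "/bin/sh", "-c", "exit 3", NULL } runs the shell and returns its exit status.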

@SvenAugustus (https://my.oschina.net/langxSpirit)