Explicit block device plugging

http://lwn.net/Articles/438256/

Since the dawn of time, or for at least as long as I have been involved, the Linux kernel has deployed a concept called "plugging" on block devices. When I/O is queued to an empty device, that device enters a plugged state. This means that I/O isn't immediately dispatched to the low-level device driver; instead, it is held back by this plug. When a process is going to wait on the I/O to finish, the device is unplugged and request dispatching to the device driver is started. The idea behind plugging is to allow a buildup of requests to better utilize the hardware and to allow merging of sequential requests into one single larger request. The latter is an especially big win on most hardware; writing or reading bigger chunks of data at a time usually yields good improvements in bandwidth. With the release of the 2.6.39-rc1 kernel, block device plugging was drastically changed. Before we go into that, let's take a historic look at how plugging has evolved.

Back in the early days, plugging a device involved global state. This was before SMP scalability was an issue, and having global state made it easier to handle the unplugging. If a process was about to block for I/O, any plugged device was simply unplugged. This scheme persisted in pretty much the same form until the early versions of the 2.6 kernel, where it began to severely impact SMP scalability on I/O-heavy workloads.

In response to this problem, the plug state was turned into a per-device entity in 2004. This scaled well, but now you suddenly had no way to unplug all devices when going to sleep waiting for page I/O. This meant that the virtual memory subsystem had to be able to unplug the specific device that would be servicing page I/O. A special hack was added for this: <tt>sync_page()</tt> in <tt>struct address_space_operations</tt>; this hook would unplug the device of interest.

If you have a more complicated I/O setup with device mapper or RAID components, those layers would in turn unplug any lower-level device. The unplug event would thus percolate down the stack. Some heuristics were also added to auto-unplug the device if a certain depth of requests had been added, or if some period of time had passed before the unplug event was seen. With the asymmetric nature of plugging, where the device was automatically plugged but had to be explicitly unplugged, we've had our fair share of I/O stall bugs in the kernel. While crude, the auto-unplug would at least ensure that we would chug along if someone missed an unplug call after I/O submission.

With really fast devices hitting the market, once again plugging had become a scalability problem and hacks were again added to avoid this. Essentially we disabled plugging on solid-state devices that were able to do queueing. While plugging originally was a good win, it was time to reevaluate things. The asymmetric nature of the API was always ugly and a source of bugs, and the <tt>sync_page()</tt> hook was always hated by the memory management people. The time had come to rewrite the whole thing.

The primary use of plugging was to allow an I/O submitter to send down multiple pieces of I/O before handing it to the device. Instead of maintaining these I/O fragments as shared state in the device, a new on-stack structure was created to contain this I/O for a short period, allowing the submitter to build up a small queue of related requests. The state is now tracked in <tt>struct blk_plug</tt>, which is little more than a linked list and a <tt>should_sort</tt> flag informing <tt>blk_finish_plug()</tt> whether or not to sort this list before flushing the I/O. We'll come back to that later.

<pre style="overflow: visible; white-space: pre; color: rgb(0, 0, 0); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">struct blk_plug {
    unsigned long magic;
    struct list_head list;
    unsigned int should_sort;
};
</pre>

The <tt>magic</tt> member is a temporary addition to detect uninitialized use cases; it will eventually be removed. The new API for plugging is straightforward and simple to use:

<pre style="overflow: visible; white-space: pre; color: rgb(0, 0, 0); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;"> struct blk_plug plug;

blk_start_plug(&plug);
submit_batch_of_io();
blk_finish_plug(&plug);

</pre>

<tt>blk_start_plug()</tt> takes care of initializing the structure and tracking it inside the task structure of the current process. The latter is important to be able to automatically flush the queued I/O should the task end up blocking between the call to <tt>blk_start_plug()</tt> and <tt>blk_finish_plug()</tt>. If that happens, we want to ensure that pending I/O is sent off to the devices immediately. This is important from a performance perspective, but also to ensure that we don't deadlock. If the task is blocking for a memory allocation, memory management reclaim could end up wanting to free a page belonging to a request that is currently residing on our private plug. Similarly, the caller may itself end up waiting for some of the plugged I/O to finish. By flushing this list when the process goes to sleep, we avoid these types of deadlocks.
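The flush-on-sleep behavior can be illustrated with a small userspace model. This is only a sketch, not kernel code: every <tt>model_*</tt> name, the fixed-size array standing in for the linked list, and the <tt>dispatched</tt> counter standing in for the device driver are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; none of these names are kernel APIs. */
#define MODEL_MAX_PENDING 16

struct model_plug {
    int pending[MODEL_MAX_PENDING]; /* request ids held back by the plug */
    size_t count;
};

struct model_task {
    struct model_plug *plug; /* plug registered by the running task, or NULL */
    size_t dispatched;       /* requests handed to the "driver" so far */
};

/* Hand every held-back request to the driver and empty the plug. */
void model_flush(struct model_task *task)
{
    if (task->plug == NULL)
        return;
    task->dispatched += task->plug->count;
    task->plug->count = 0;
}

/* Initialize the on-stack plug and remember it in the task. */
void model_start_plug(struct model_task *task, struct model_plug *plug)
{
    plug->count = 0;
    task->plug = plug;
}

/* Queue a request on the task's plug, or dispatch it directly if unplugged. */
void model_submit(struct model_task *task, int request_id)
{
    struct model_plug *plug = task->plug;

    if (plug == NULL) {
        task->dispatched++;
        return;
    }
    /* Model detail: flush early so the fixed-size array never overflows. */
    if (plug->count == MODEL_MAX_PENDING)
        model_flush(task);
    plug->pending[plug->count++] = request_id;
}

/* Called when the task is about to sleep: pending I/O must not be stranded. */
void model_schedule_out(struct model_task *task)
{
    model_flush(task);
}

/* End the plugged section: flush what is left and unregister the plug. */
void model_finish_plug(struct model_task *task, struct model_plug *plug)
{
    assert(task->plug == plug);
    model_flush(task);
    task->plug = NULL;
}
```

In this model, two requests submitted after <tt>model_start_plug()</tt> stay on the plug until either <tt>model_schedule_out()</tt> or <tt>model_finish_plug()</tt> runs, which is the property that prevents the deadlocks described above.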

If <tt>blk_start_plug()</tt> is called and the task already has a plug structure registered, the new plug is simply ignored. This can happen in cases where the upper layers plug for submitting a series of I/O, and further down in the call chain someone else does the same. I/O submitted without the knowledge of the original plugger will thus end up on the originally assigned plug, and be flushed whenever the original caller ends the plug by calling <tt>blk_finish_plug()</tt>, or if some part of the call path goes to sleep or is scheduled out.
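The nesting rule can be sketched the same way. Again, this is a userspace model under stated assumptions: the <tt>nest_*</tt> names are hypothetical, and a simple counter replaces the request list.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; these names are illustrative, not kernel APIs. */
struct nest_plug {
    size_t held; /* number of requests currently held back on this plug */
};

struct nest_task {
    struct nest_plug *plug; /* the one plug the task is registered with */
    size_t dispatched;      /* requests handed to the "driver" so far */
};

void nest_start_plug(struct nest_task *task, struct nest_plug *plug)
{
    assert(task != NULL && plug != NULL);
    plug->held = 0;
    /* A nested plug is initialized but not registered: the outer one stays. */
    if (task->plug == NULL)
        task->plug = plug;
}

void nest_submit(struct nest_task *task)
{
    assert(task != NULL);
    if (task->plug != NULL)
        task->plug->held++;  /* lands on the registered (outermost) plug */
    else
        task->dispatched++;  /* no plug: dispatch immediately */
}

void nest_finish_plug(struct nest_task *task, struct nest_plug *plug)
{
    assert(task != NULL && plug != NULL);
    /* Flush whatever this particular plug holds; nothing for a nested plug. */
    task->dispatched += plug->held;
    plug->held = 0;
    /* Only the plug that was actually registered unregisters itself. */
    if (task->plug == plug)
        task->plug = NULL;
}
```

Finishing the inner plug dispatches nothing here, because all requests were collected on the outer plug; they go out only when the outer caller finishes.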

Since the plug state is now device agnostic, we may end up in a situation where multiple devices have pending I/O on this plug list. These may end up on the plug list in an interleaved fashion, potentially causing <tt>blk_finish_plug()</tt> to grab and release the related queue locks multiple times. To avoid this problem, a <tt>should_sort</tt> flag in the <tt>blk_plug</tt> structure is used to keep track of whether we have I/O belonging to more than one distinct queue pending. If we do, the list is sorted to group identical queues together. This scales better than grabbing and releasing the same locks multiple times.
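To see why sorting helps, here is a minimal userspace sketch that counts how many lock/unlock cycles a flush would need. It is a model, not the kernel implementation: the <tt>sort_*</tt> names, the integer queue ids, the array in place of a linked list, the use of <tt>qsort()</tt>, and the tie-break on sector are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace model only; these names are illustrative, not kernel APIs. */
#define SORT_MAX 32

struct sort_request {
    int queue_id; /* stands in for the request's queue pointer */
    long sector;
};

struct sort_plug {
    struct sort_request reqs[SORT_MAX];
    size_t count;
    int should_sort; /* set once a second queue shows up on the plug */
};

void sort_plug_init(struct sort_plug *plug)
{
    plug->count = 0;
    plug->should_sort = 0;
}

/* Add a request; returns 0 on success, -1 if the model's array is full. */
int sort_plug_add(struct sort_plug *plug, int queue_id, long sector)
{
    assert(plug != NULL);
    if (plug->count == SORT_MAX)
        return -1;
    /* Comparing with the newest entry is enough to notice a second queue. */
    if (plug->count > 0 && plug->reqs[plug->count - 1].queue_id != queue_id)
        plug->should_sort = 1;
    plug->reqs[plug->count].queue_id = queue_id;
    plug->reqs[plug->count].sector = sector;
    plug->count++;
    return 0;
}

/* One lock cycle is needed per run of consecutive same-queue requests. */
size_t sort_lock_cycles(const struct sort_plug *plug)
{
    size_t cycles = 0;

    for (size_t i = 0; i < plug->count; i++)
        if (i == 0 || plug->reqs[i].queue_id != plug->reqs[i - 1].queue_id)
            cycles++;
    return cycles;
}

/* Order by queue, then by sector so the result is deterministic. */
static int sort_compare(const void *a, const void *b)
{
    const struct sort_request *x = a;
    const struct sort_request *y = b;

    if (x->queue_id != y->queue_id)
        return (x->queue_id > y->queue_id) - (x->queue_id < y->queue_id);
    return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Sort if needed, "dispatch" everything, and report the lock cycles used. */
size_t sort_plug_flush(struct sort_plug *plug)
{
    size_t cycles;

    if (plug->should_sort)
        qsort(plug->reqs, plug->count, sizeof(plug->reqs[0]), sort_compare);
    cycles = sort_lock_cycles(plug);
    sort_plug_init(plug);
    return cycles;
}
```

With six requests alternating between two queues, the unsorted list would need six lock cycles while the sorted flush needs two. When only one queue is involved, the flag stays clear and the sort is skipped entirely.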

With this new scheme in place, the device need no longer be notified of unplug events. The queue <tt>unplug_fn()</tt> used to exist for this purpose alone; it has now been removed. For most drivers it is safe to just remove this hook and the related code. However, some drivers used plugging to delay I/O operations in response to resource shortages. One example of that was the SCSI midlayer; if we failed to map a new SCSI request due to a memory shortage, the queue was plugged to ensure that we would call back into the dispatch functions later on. Since this mechanism no longer exists, a similar API has been provided for such use cases. Drivers may now use <tt>blk_delay_queue()</tt> for this:

<pre style="overflow: visible; white-space: pre; color: rgb(0, 0, 0); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;"> blk_delay_queue(queue, delay_in_msecs);
</pre>

The block layer will re-invoke request queueing after the specified number of milliseconds have passed. It will be invoked from process context, just as it would have been with the unplug event. <tt>blk_delay_queue()</tt> honors the queue stopped state, so if <tt>blk_stop_queue()</tt> was called before <tt>blk_delay_queue()</tt>, or if it is called after the fact but before the delay has passed, the request handler will not be invoked. <tt>blk_delay_queue()</tt> must only be used for conditions where the caller doesn't necessarily know when that condition will change state. If resources internal to the driver cause it to need to halt operations for a while, it is more efficient to use <tt>blk_stop_queue()</tt> and <tt>blk_start_queue()</tt> to manage those directly.

These changes have been merged for the 2.6.39 kernel. While a few problems have been found (and fixed), it would appear that the plugging changes have been integrated without greatly disturbing Linus's calm development cycle.
