- 論文地址
- 閱讀方式
- Deep Residual Learning for Image Recognition
- 圖像識(shí)別的深度殘差學(xué)習(xí)
論文地址
https://arxiv.org/pdf/1512.03385.pdf
閱讀方式
本文采用原文、翻譯、記錄的排版。
筆者使用「如何閱讀深度學(xué)習(xí)論文」的方法進(jìn)行閱讀,文中標(biāo)注的 @1(第一步)、@2、@3、@4 分別表示在該步閱讀中的記錄和思考。
注:為了加深理解,大家可以參考「使用 TensorFlow 2 Keras 實(shí)現(xiàn) ResNet 網(wǎng)絡(luò)」一文實(shí)踐 ResNet 網(wǎng)絡(luò)。
Deep Residual Learning for Image Recognition
圖像識(shí)別的深度殘差學(xué)習(xí)
@1 本論文介紹深度殘差在圖像識(shí)別中的運(yùn)用,可以猜到深度殘差就是本論文的核心。
Abstract
摘要
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
更深的神經(jīng)網(wǎng)絡(luò)更難訓(xùn)練。我們提出了一種殘差學(xué)習(xí)框架來(lái)減輕網(wǎng)絡(luò)的訓(xùn)練,這些網(wǎng)絡(luò)比以前使用的網(wǎng)絡(luò)更深。我們明確地將層變?yōu)閷W(xué)習(xí)關(guān)於層輸入的殘差函數(shù),而不是學(xué)習(xí)未參考的函數(shù)。我們提供了全面的經(jīng)驗(yàn)證據(jù)說(shuō)明這些殘差網(wǎng)絡(luò)很容易優(yōu)化,並可以通過(guò)顯著增加深度來(lái)提高準(zhǔn)確性。在 ImageNet 數(shù)據(jù)集上我們?cè)u(píng)估了深度高達(dá) 152 層的殘差網(wǎng)絡(luò),比 VGG [41] 深 8 倍,但仍具有較低的複雜度。這些殘差網(wǎng)絡(luò)的集合在 ImageNet 測(cè)試集上取得了 3.57% 的錯(cuò)誤率。這個(gè)結(jié)果在 ILSVRC 2015 分類(lèi)任務(wù)上贏得了第一名。我們也在 CIFAR-10 上分析了 100 層和 1000 層的殘差網(wǎng)絡(luò)。
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
對(duì)於許多視覺(jué)識(shí)別任務(wù)而言,表示的深度是至關(guān)重要的。僅由於我們非常深的表示,我們便在 COCO 目標(biāo)檢測(cè)數(shù)據(jù)集上得到了 28% 的相對(duì)提高。深度殘差網(wǎng)絡(luò)是我們向 ILSVRC 和 COCO 2015 競(jìng)賽提交的基礎(chǔ),我們也贏得了 ImageNet 檢測(cè)任務(wù)、ImageNet 定位任務(wù)、COCO 檢測(cè)和 COCO 分割任務(wù)的第一名。
@1 摘要中指出更深的神經(jīng)網(wǎng)絡(luò)更難訓(xùn)練,而作者提出的深度殘差網(wǎng)絡(luò)可以解決這個(gè)問(wèn)題,從而可以通過(guò)顯著增加深度來(lái)提高準(zhǔn)確性。並且,深度殘差網(wǎng)絡(luò)在幾次大賽中都獲得了第一名的成績(jī)。
1 Introduction
1 簡(jiǎn)介
Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the “l(fā)evels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.
深度卷積神經(jīng)網(wǎng)絡(luò)[22, 21]造就了圖像分類(lèi)[21, 50, 40]的一系列突破。深度網(wǎng)絡(luò)自然地將低/中/高級(jí)特徵[50]和分類(lèi)器以端到端多層方式進(jìn)行集成,特徵的"級(jí)別"可以通過(guò)堆疊層的數(shù)量(深度)來(lái)豐富。最近的證據(jù)[41, 44]顯示網(wǎng)絡(luò)深度至關(guān)重要,在具有挑戰(zhàn)性的 ImageNet 數(shù)據(jù)集[36]上領(lǐng)先的結(jié)果[41, 44, 13, 16]都采用了"非常深"[41]的模型,深度從 16 層[41]到 30 層[16]之間。許多其它重要的視覺(jué)識(shí)別任務(wù)[8, 12, 7, 32, 27]也從非常深的模型中得到了極大受益。
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with back-propagation [22].
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
在深度重要性的推動(dòng)下,出現(xiàn)了一個(gè)問(wèn)題:學(xué)習(xí)更好的網(wǎng)絡(luò)是否像堆疊更多的層一樣容易?回答這個(gè)問(wèn)題的一個(gè)障礙是梯度消失/爆炸[1, 9]這個(gè)眾所周知的問(wèn)題,它從一開(kāi)始就阻礙了收斂。然而,這個(gè)問(wèn)題通過(guò)標(biāo)準(zhǔn)初始化[23, 9, 37, 13]和中間標(biāo)準(zhǔn)化層[16]在很大程度上已經(jīng)解決,這使得數(shù)十層的網(wǎng)絡(luò)能通過(guò)具有反向傳播的隨機(jī)梯度下降(SGD)開(kāi)始收斂。
圖 1. 具有 20 層和 56 層"普通"網(wǎng)絡(luò)的 CIFAR-10 上的訓(xùn)練誤差(左)和測(cè)試誤差(右)。更深的網(wǎng)絡(luò)具有更高的訓(xùn)練誤差,從而具有更高的測(cè)試誤差。ImageNet 上的類(lèi)似現(xiàn)象如圖 4 所示。
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
當(dāng)更深的網(wǎng)絡(luò)能夠開(kāi)始收斂時(shí),暴露了一個(gè)退化問(wèn)題:隨著網(wǎng)絡(luò)深度的增加,準(zhǔn)確率達(dá)到飽和(這可能並不奇怪)然後迅速下降。意外的是,這種下降不是由過(guò)擬合引起的,並且在適當(dāng)?shù)纳疃饶P蜕咸砑痈嗟膶訒?huì)導(dǎo)致更高的訓(xùn)練誤差,正如[11, 42]中報(bào)告的那樣,並且由我們的實(shí)驗(yàn)完全證實(shí)。圖 1 顯示了一個(gè)典型的例子。
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
(訓(xùn)練準(zhǔn)確率的)退化表明不是所有的系統(tǒng)都同樣容易優(yōu)化。讓我們考慮一個(gè)較淺的架構(gòu)及其對(duì)應(yīng)的更深架構(gòu),即在其上添加更多的層。對(duì)這個(gè)更深的模型存在一個(gè)構(gòu)造解:添加的層是恒等映射,其他層是從已學(xué)習(xí)的較淺模型中拷貝的。這種構(gòu)造解的存在表明,較深的模型不應(yīng)該產(chǎn)生比其對(duì)應(yīng)的較淺模型更高的訓(xùn)練誤差。但是實(shí)驗(yàn)表明,我們目前手頭的求解器無(wú)法找到與構(gòu)造解相當(dāng)或更好的解(或無(wú)法在可行的時(shí)間內(nèi)做到)。
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
Figure 2. Residual learning: a building block.
在本文中,我們通過(guò)引入深度殘差學(xué)習(xí)框架解決了退化問(wèn)題。我們明確地讓這些層擬合殘差映射,而不是希望每幾個(gè)堆疊的層直接擬合期望的基礎(chǔ)映射。形式上,將期望的基礎(chǔ)映射表示為 H(x),我們讓堆疊的非線性層擬合另一個(gè)映射 F(x) := H(x) - x。原始的映射重寫(xiě)為 F(x) + x。我們假設(shè)殘差映射比原始的、未參考的映射更容易優(yōu)化。在極端情況下,如果一個(gè)恒等映射是最優(yōu)的,那麼將殘差置為零比通過(guò)一堆非線性層來(lái)擬合恒等映射更容易。
圖 2. 殘差學(xué)習(xí):構(gòu)建塊。
The formulation of F(x) + x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
公式 F(x) + x 可以通過(guò)帶有"快捷連接"的前饋神經(jīng)網(wǎng)絡(luò)(圖 2)來(lái)實(shí)現(xiàn)。快捷連接[2, 34, 49]是那些跳過(guò)一層或更多層的連接。在我們的案例中,快捷連接簡(jiǎn)單地執(zhí)行恒等映射,並將其輸出添加到堆疊層的輸出上(圖 2)。恒等快捷連接既不增加額外的參數(shù)也不增加計(jì)算複雜度。整個(gè)網(wǎng)絡(luò)仍然可以由帶有反向傳播的 SGD 進(jìn)行端到端的訓(xùn)練,並且可以使用公共庫(kù)(例如 Caffe [19])輕鬆實(shí)現(xiàn),而無(wú)需修改求解器。
We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
我們?cè)?ImageNet [36]上進(jìn)行了綜合實(shí)驗(yàn)來(lái)顯示退化問(wèn)題並評(píng)估我們的方法。我們發(fā)現(xiàn):1)我們極深的殘差網(wǎng)絡(luò)易於優(yōu)化,但當(dāng)深度增加時(shí),對(duì)應(yīng)的"簡(jiǎn)單"網(wǎng)絡(luò)(簡(jiǎn)單堆疊層)表現(xiàn)出更高的訓(xùn)練誤差;2)我們的深度殘差網(wǎng)絡(luò)可以從大大增加的深度中輕鬆獲得準(zhǔn)確性收益,生成的結(jié)果實(shí)質(zhì)上比以前的網(wǎng)絡(luò)更好。
Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.
CIFAR-10 數(shù)據(jù)集[20]上也顯示出類(lèi)似的現(xiàn)象,這表明優(yōu)化的困難以及我們方法的效果不僅僅是針對(duì)一個(gè)特定的數(shù)據(jù)集。我們?cè)谶@個(gè)數(shù)據(jù)集上展示了成功訓(xùn)練的超過(guò) 100 層的模型,並探索了超過(guò) 1000 層的模型。
On the ImageNet classification dataset [36], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [41]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.
在 ImageNet 分類(lèi)數(shù)據(jù)集[36]上,我們通過(guò)非常深的殘差網(wǎng)絡(luò)獲得了很好的結(jié)果。我們的 152 層殘差網(wǎng)絡(luò)是 ImageNet 上迄今最深的網(wǎng)絡(luò),同時(shí)還具有比 VGG 網(wǎng)絡(luò)[41]更低的複雜度。我們的模型集合在 ImageNet 測(cè)試集上有 3.57% 的 top-5 錯(cuò)誤率,並在 ILSVRC 2015 分類(lèi)比賽中獲得了第一名。極深的表示在其它識(shí)別任務(wù)中也有極好的泛化性能,並帶領(lǐng)我們進(jìn)一步贏得了第一名:包括 ILSVRC & COCO 2015 競(jìng)賽中的 ImageNet 檢測(cè)、ImageNet 定位、COCO 檢測(cè)和 COCO 分割。堅(jiān)實(shí)的證據(jù)表明殘差學(xué)習(xí)準(zhǔn)則是通用的,並且我們期望它適用於其它的視覺(jué)和非視覺(jué)問(wèn)題。
@2 從簡(jiǎn)介部分可以了解到,梯度消失/爆炸問(wèn)題已在很大程度上被標(biāo)準(zhǔn)初始化和中間標(biāo)準(zhǔn)化層解決,但更深的網(wǎng)絡(luò)仍面臨退化問(wèn)題(準(zhǔn)確率飽和後迅速下降),並且這種退化不是由過(guò)擬合引起的。作者提出通過(guò)深度殘差學(xué)習(xí)(恒等映射、快捷連接)來(lái)解決這個(gè)退化問(wèn)題,並且既不增加額外的參數(shù)也不增加計(jì)算複雜度,使得網(wǎng)絡(luò)易於優(yōu)化,提高了泛化性能。同時(shí),作者在多個(gè)數(shù)據(jù)集上的實(shí)踐也表明殘差學(xué)習(xí)準(zhǔn)則是通用的,不局限於特定的數(shù)據(jù)集,也不一定局限於視覺(jué)問(wèn)題。
2 Related Work
2 相關(guān)工作
Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.
殘差表示。在圖像識(shí)別中,VLAD [18]是一種通過(guò)關(guān)於字典的殘差向量進(jìn)行編碼的表示形式,F(xiàn)isher 矢量[30]可以表示為 VLAD 的概率版本[18]。它們都是圖像檢索和圖像分類(lèi)[4, 48]中強(qiáng)大的淺層表示。對(duì)於矢量量化,編碼殘差矢量[17]被證明比編碼原始矢量更有效。
In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45, 46], which relies on variables that represent residual vectors between two scales. It has been shown [3, 45, 46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
在低級(jí)視覺(jué)和計(jì)算機(jī)圖形學(xué)中,為了求解偏微分方程(PDE),廣泛使用的 Multigrid 方法[3]將系統(tǒng)重構(gòu)為多個(gè)尺度上的子問(wèn)題,其中每個(gè)子問(wèn)題負(fù)責(zé)較粗尺度和較細(xì)尺度之間的殘差解。Multigrid 的替代方法是層次化基預(yù)處理[45, 46],它依賴於表示兩個(gè)尺度之間殘差向量的變量。已經(jīng)證明[3, 45, 46],這些求解器比不了解解的殘差性質(zhì)的標(biāo)準(zhǔn)求解器收斂得更快。這些方法表明,良好的重構(gòu)或預(yù)處理可以簡(jiǎn)化優(yōu)化。
Shortcut Connections. Practices and theories that lead to shortcut connections [2, 34, 49] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [34, 49]. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39, 38, 31, 47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an “inception” layer is composed of a shortcut branch and a few deeper branches.
快捷連接。導(dǎo)致快捷連接[2, 34, 49]的實(shí)踐和理論已經(jīng)被研究了很長(zhǎng)時(shí)間。訓(xùn)練多層感知機(jī)(MLP)的早期實(shí)踐是添加一個(gè)線性層來(lái)連接網(wǎng)絡(luò)的輸入和輸出[34, 49]。在[44, 24]中,一些中間層直接連接到輔助分類(lèi)器,用於解決梯度消失/爆炸。論文[39, 38, 31, 47]提出了通過(guò)快捷連接實(shí)現(xiàn)層響應(yīng)、梯度和傳播誤差中心化的方法。在[44]中,一個(gè)"inception"層由一個(gè)快捷分支和一些更深的分支組成。
Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).
與我們同時(shí)進(jìn)行的工作,"highway networks" [42, 43]提出了帶有門(mén)控函數(shù)[15]的快捷連接。這些門(mén)是數(shù)據(jù)相關(guān)且有參數(shù)的,與我們不具有參數(shù)的恒等快捷連接相反。當(dāng)門(mén)控快捷連接"關(guān)閉"(接近零)時(shí),高速網(wǎng)絡(luò)中的層表示非殘差函數(shù)。相反,我們的公式總是學(xué)習(xí)殘差函數(shù);我們的恒等快捷連接永遠(yuǎn)不會(huì)關(guān)閉,所有的信息總是通過(guò),同時(shí)還有額外的殘差函數(shù)要學(xué)習(xí)。此外,高速網(wǎng)絡(luò)還沒(méi)有證實(shí)極度增加的深度(例如,超過(guò) 100 層)能帶來(lái)準(zhǔn)確性收益。
@3 作者指出他並不是殘差思想的第一個(gè)提出者,不過(guò)作者將其很好地運(yùn)用起來(lái)了。
3. Deep Residual Learning
3. 深度殘差學(xué)習(xí)
3.1. Residual Learning
3.1. 殘差學(xué)習(xí)
Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) - x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) - x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.
我們考慮 H(x) 作為幾個(gè)堆疊層(不必是整個(gè)網(wǎng)絡(luò))要擬合的基礎(chǔ)映射,x 表示這些層中第一層的輸入。假設(shè)多個(gè)非線性層可以漸近地近似複雜函數(shù),那麼這等價(jià)於假設(shè)它們可以漸近地近似殘差函數(shù),即 H(x) - x(假設(shè)輸入輸出是相同維度)。因此,我們明確讓這些層近似殘差函數(shù) F(x) := H(x) - x,而不是期望堆疊層近似 H(x)。因此原始函數(shù)變?yōu)?F(x) + x。盡管兩種形式應(yīng)該都能漸近地近似要求的函數(shù)(如假設(shè)),但學(xué)習(xí)的難易程度可能是不同的。
This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
關(guān)於退化問(wèn)題的反直覺(jué)現(xiàn)象激發(fā)了這種重構(gòu)(圖 1,左)。正如我們?cè)诤?jiǎn)介中討論的那樣,如果添加的層可以被構(gòu)建為恒等映射,更深模型的訓(xùn)練誤差應(yīng)該不大於它對(duì)應(yīng)的更淺版本。退化問(wèn)題表明,求解器通過(guò)多個(gè)非線性層來(lái)近似恒等映射可能有困難。通過(guò)殘差學(xué)習(xí)的重構(gòu),如果恒等映射是最優(yōu)的,求解器可能簡(jiǎn)單地將多個(gè)非線性層的權(quán)重推向零來(lái)接近恒等映射。
In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
Figure 7. Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3×3 layer, after BN and before nonlinearity. Top: the layers are shown in their original order. Bottom: the responses are ranked in descending order.
在實(shí)際情況下,恒等映射不太可能是最優(yōu)的,但是我們的重構(gòu)可能有助於對(duì)問(wèn)題進(jìn)行預(yù)處理。如果最優(yōu)函數(shù)比零映射更接近於恒等映射,則求解器應(yīng)該更容易找到關(guān)於恒等映射的擾動(dòng),而不是將該函數(shù)作為新函數(shù)來(lái)學(xué)習(xí)。我們通過(guò)實(shí)驗(yàn)(圖 7)顯示,學(xué)習(xí)到的殘差函數(shù)通常有較小的響應(yīng),這表明恒等映射提供了合理的預(yù)處理。
圖 7. 層響應(yīng)在 CIFAR-10 上的標(biāo)準(zhǔn)差(std)。這些響應(yīng)是每個(gè) 3×3 層的輸出,在 BN 之後、非線性之前。上面:以原始順序顯示層。下面:響應(yīng)按降序排列。
3.2. Identity Mapping by Shortcuts
3.2. 快捷恒等映射
We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:
y = F(x, {W_i}) + x     (1)
Here x and y are the input and output vectors of the layers considered. The function F(x, {W_i}) represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, F = W_2 σ(W_1 x) in which σ denotes ReLU [29] and the biases are omitted for simplifying notations. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y), see Fig. 2).
我們每隔幾個(gè)堆疊層采用殘差學(xué)習(xí)。構(gòu)建塊如圖 2 所示。在本文中我們考慮的構(gòu)建塊正式定義為:
y = F(x, {W_i}) + x     (1)
其中 x 和 y 是所考慮的層的輸入和輸出向量。函數(shù) F(x, {W_i}) 表示要學(xué)習(xí)的殘差映射。對(duì)於圖 2 中具有兩層的例子,F(xiàn) = W_2 σ(W_1 x),其中 σ 表示 ReLU [29],為了簡(jiǎn)化寫(xiě)法忽略了偏置項(xiàng)。F + x 操作通過(guò)快捷連接和逐元素相加來(lái)執(zhí)行。在相加之後我們采用第二個(gè)非線性(即 σ(y),見(jiàn)圖 2)。
The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).
方程(1)中的快捷連接既沒(méi)有引入額外的參數(shù),也沒(méi)有增加計(jì)算複雜度。這不僅在實(shí)踐中有吸引力,而且在簡(jiǎn)單網(wǎng)絡(luò)和殘差網(wǎng)絡(luò)的比較中也很重要。我們可以公平地比較同時(shí)具有相同參數(shù)數(shù)量、相同深度、寬度和計(jì)算成本的簡(jiǎn)單/殘差網(wǎng)絡(luò)(除了可忽略的逐元素加法之外)。
The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection W_s by the shortcut connections to match the dimensions:
y = F(x, {W_i}) + W_s x     (2)
We can also use a square matrix W_s in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus W_s is only used when matching dimensions.
方程(1)中 x 和 F 的維度必須是相等的。如果不是這種情況(例如,當(dāng)更改輸入/輸出通道時(shí)),我們可以通過(guò)快捷連接執(zhí)行線性投影 W_s 來(lái)匹配維度:
y = F(x, {W_i}) + W_s x     (2)
我們也可以在方程(1)中使用方陣 W_s。但是我們將通過(guò)實(shí)驗(yàn)表明,恒等映射足以解決退化問(wèn)題,並且是經(jīng)濟(jì)的,因此 W_s 僅在匹配維度時(shí)使用。
The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: y = W_1 x + x, for which we have not observed advantages.
Figure 5. A deeper residual function F for ImageNet. Left: a building block (on 56×56 feature maps) as in Fig. 3 for ResNet-34. Right: a “bottleneck” building block for ResNet-50/101/152.
圖 5. ImageNet 的深度殘差函數(shù) F。左:ResNet-34 的構(gòu)建塊(在 56×56 的特徵圖上),如圖 3。右:ResNet-50/101/152 的"bottleneck"構(gòu)建塊。
殘差函數(shù) F 的形式是靈活的。本文中的實(shí)驗(yàn)涉及具有兩層或三層的函數(shù) F(圖 5),也可以有更多的層。但如果 F 只有一層,方程(1)就類(lèi)似於線性層:y = W_1 x + x,我們沒(méi)有從中觀察到優(yōu)勢(shì)。
We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function F(x, {W_i}) can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.
我們還注意到,為了簡(jiǎn)單起見(jiàn),盡管上述符號(hào)是關(guān)於全連接層的,但它們同樣適用於卷積層。函數(shù) F(x, {W_i}) 可以表示多個(gè)卷積層。逐元素加法在兩個(gè)特徵圖上逐通道進(jìn)行。
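下面用 TensorFlow 2 Keras 給出公式(1)、(2)的一個(gè)最小示意(筆者自擬的草圖,並非論文官方實(shí)現(xiàn),residual_block 等命名為筆者假設(shè)):維度一致時(shí)走恒等快捷連接,維度變化時(shí)用 1×1 卷積充當(dāng)投影 W_s。

```python
import tensorflow as tf
from tensorflow.keras import layers


def residual_block(x, filters, stride=1):
    """公式(1)/(2)的示意:y = F(x, {W_i}) + (x 或 W_s x),相加後再接 ReLU。"""
    shortcut = x

    # F:兩個(gè) 3x3 卷積層,BN 放在卷積之後、激活之前
    out = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)
    out = layers.Conv2D(filters, 3, strides=1, padding="same", use_bias=False)(out)
    out = layers.BatchNormalization()(out)

    # 維度不一致時(shí)(通道數(shù)變化或步長(zhǎng)為 2),用 1x1 卷積作為投影 W_s,即公式(2)
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)

    # 逐元素相加,然後是第二個(gè)非線性 σ(y)
    out = layers.Add()([out, shortcut])
    return layers.ReLU()(out)


# 用法示意:在 56x56x64 的特徵圖上疊加一個(gè)殘差塊
inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, filters=64)
model = tf.keras.Model(inputs, outputs)
model.summary()
```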
3.3. Network Architectures
3.3. 網(wǎng)絡(luò)架構(gòu)
We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.
我們測(cè)試了各種簡(jiǎn)單/殘差網(wǎng)絡(luò),並觀察到了一致的現(xiàn)象。為了提供討論的實(shí)例,我們描述了 ImageNet 的兩個(gè)模型如下。
Plain Network. Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [41] (Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).
Figure 3. Example network architectures for ImageNet. Left: the VGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more details and other variants.
Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Down-sampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.
簡(jiǎn)單網(wǎng)絡(luò)。我們的簡(jiǎn)單網(wǎng)絡(luò)基準(zhǔn)(圖 3,中間)主要受到 VGG 網(wǎng)絡(luò)[41](圖 3,左)的哲學(xué)啟發(fā)。卷積層主要為 3×3 的濾波器,並遵循兩個(gè)簡(jiǎn)單的設(shè)計(jì)規(guī)則:(i)對(duì)於相同的輸出特徵圖尺寸,層具有相同數(shù)量的濾波器;(ii)如果特徵圖尺寸減半,則濾波器數(shù)量加倍,以便保持每層的時(shí)間複雜度。我們通過(guò)步長(zhǎng)為 2 的卷積層直接執(zhí)行下采樣。網(wǎng)絡(luò)以全局平均池化層和具有 softmax 的 1000 維全連接層結(jié)束。圖 3(中間)的加權(quán)層總數(shù)為 34。
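按照這兩條設(shè)計(jì)規(guī)則,可以大致寫(xiě)出 34 層簡(jiǎn)單網(wǎng)絡(luò)的骨架(筆者的 Keras 示意草圖,各 stage 的 3×3 卷積層數(shù)取 [6, 8, 12, 6],對(duì)應(yīng)後文殘差版的 [3, 4, 6, 3] 個(gè)兩層塊,僅供理解結(jié)構(gòu)):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(224, 224, 3))
# 第一層:7x7、64 個(gè)濾波器、步長(zhǎng) 2,之後接 3x3 最大池化
x = layers.Conv2D(64, 7, strides=2, padding="same", use_bias=False)(inputs)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
x = layers.MaxPooling2D(3, strides=2, padding="same")(x)

# 規(guī)則(i):同尺寸特徵圖用相同數(shù)量的濾波器;規(guī)則(ii):特徵圖減半時(shí)濾波器加倍
for stage, (num_layers, filters) in enumerate(zip([6, 8, 12, 6], [64, 128, 256, 512])):
    for i in range(num_layers):
        stride = 2 if (i == 0 and stage > 0) else 1  # 用步長(zhǎng) 2 的卷積直接下采樣
        x = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)

# 全局平均池化 + 1000 維 softmax 全連接層
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1000, activation="softmax")(x)
plain34 = tf.keras.Model(inputs, outputs)  # 加權(quán)層總數(shù):1 + (6+8+12+6) + 1 = 34
```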
圖 3. ImageNet 的網(wǎng)絡(luò)架構(gòu)例子。左:作為參考的 VGG-19 模型[41](196 億 FLOPs)。中:具有 34 個(gè)參數(shù)層的簡(jiǎn)單網(wǎng)絡(luò)(36 億 FLOPs)。右:具有 34 個(gè)參數(shù)層的殘差網(wǎng)絡(luò)(36 億 FLOPs)。虛線的快捷連接增加了維度。表 1 顯示了更多細(xì)節(jié)和其它變種。
表 1. ImageNet 架構(gòu)。構(gòu)建塊顯示在括號(hào)中(另見(jiàn)圖 5),以及構(gòu)建塊的堆疊數(shù)量。下采樣通過(guò)步長(zhǎng)為 2 的 conv3_1、conv4_1 和 conv5_1 執(zhí)行。
It is worth noticing that our model has fewer filters and lower complexity than VGG nets [41] (Fig. 3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).
值得注意的是,我們的模型與 VGG 網(wǎng)絡(luò)[41](圖 3,左)相比,有更少的濾波器和更低的複雜度。我們的 34 層基準(zhǔn)有 36 億 FLOPs(乘加),僅是 VGG-19(196 億 FLOPs)的 18%。
Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
殘差網(wǎng)絡(luò)。基於上述的簡(jiǎn)單網(wǎng)絡(luò),我們插入快捷連接(圖 3,右),將網(wǎng)絡(luò)轉(zhuǎn)換為其對(duì)應(yīng)的殘差版本。當(dāng)輸入和輸出具有相同的維度時(shí)(圖 3 中的實(shí)線快捷連接),可以直接使用恒等快捷連接(方程(1))。當(dāng)維度增加時(shí)(圖 3 中的虛線快捷連接),我們考慮兩個(gè)選項(xiàng):(A)快捷連接仍然執(zhí)行恒等映射,並額外填充零來(lái)增加維度,此選項(xiàng)不會(huì)引入額外的參數(shù);(B)使用方程(2)中的投影快捷連接來(lái)匹配維度(由 1×1 卷積完成)。對(duì)於這兩個(gè)選項(xiàng),當(dāng)快捷連接跨越兩種尺寸的特徵圖時(shí),均以步長(zhǎng) 2 執(zhí)行。
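選項(xiàng) A(零填充的恒等快捷連接)可以示意性地寫(xiě)成下面這樣(筆者自擬,非論文官方代碼):特徵圖減半時(shí)對(duì)快捷分支做步長(zhǎng)為 2 的取樣,並在通道維補(bǔ)零,因此不引入任何參數(shù)。

```python
import tensorflow as tf


def shortcut_option_a(x, out_channels):
    """選項(xiàng) A:恒等快捷連接 + 零填充,不引入額外參數(shù)。

    假設(shè)特徵圖尺寸減半、通道數(shù)增加(如 conv3_1、conv4_1、conv5_1 處)。
    """
    # 空間維度以步長(zhǎng) 2 取樣(無(wú)參數(shù))
    x = x[:, ::2, ::2, :]
    # 通道維補(bǔ)零到 out_channels
    pad = out_channels - x.shape[-1]
    return tf.pad(x, [[0, 0], [0, 0], [0, 0], [pad // 2, pad - pad // 2]])


# 用法示意:把 56x56x64 的特徵圖變?yōu)?28x28x128 的快捷分支
x = tf.random.normal([1, 56, 56, 64])
print(shortcut_option_a(x, 128).shape)  # (1, 28, 28, 128)
```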
3.4. Implementation
3.4. 實(shí)現(xiàn)
Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 × 10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16].
我們?cè)?ImageNet 上的實(shí)現(xiàn)遵循[21, 41]的實(shí)踐。調(diào)整圖像大小,使其較短的邊在 [256, 480] 之間隨機(jī)采樣,用於尺度增強(qiáng)[41]。224×224 的裁剪是從圖像或其水平翻轉(zhuǎn)中隨機(jī)采樣的,並逐像素減去均值[21]。使用了[21]中的標(biāo)準(zhǔn)顏色增強(qiáng)。按照[16],我們?cè)诿總€(gè)卷積之後和激活之前采用批量歸一化(BN)[16]。我們按照[13]的方法初始化權(quán)重,從零開(kāi)始訓(xùn)練所有的簡(jiǎn)單/殘差網(wǎng)絡(luò)。我們使用批大小為 256 的 SGD 方法。學(xué)習(xí)率從 0.1 開(kāi)始,當(dāng)誤差穩(wěn)定時(shí)學(xué)習(xí)率除以 10,模型最多訓(xùn)練 60×10^4 次迭代。我們使用的權(quán)重衰減為 0.0001,動(dòng)量為 0.9。根據(jù)[16]的實(shí)踐,我們不使用 dropout [14]。
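按照這些超參數(shù),TensorFlow 2 Keras 中的訓(xùn)練配置大致如下(筆者的示意草圖;這裡用 tf.keras.applications.ResNet50 代替論文模型,train_ds、val_ds 假設(shè)是已按上文做好增強(qiáng)的數(shù)據(jù)管線,"誤差穩(wěn)定時(shí)除以 10"用 ReduceLROnPlateau 近似):

```python
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None, classes=1000)  # 隨機(jī)初始化,從零訓(xùn)練

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    # 當(dāng)驗(yàn)證誤差進(jìn)入平臺(tái)期時(shí)把學(xué)習(xí)率除以 10
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5),
]

# 論文設(shè)置:mini-batch 256、權(quán)重衰減 1e-4(可在各卷積層加 L2 正則近似)、不使用 dropout
# model.fit(train_ds, validation_data=val_ds, epochs=120, callbacks=callbacks)
```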
In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).
在測(cè)試階段,為了進(jìn)行對(duì)比研究,我們采用標(biāo)準(zhǔn)的 10-crop 測(cè)試[21]。為了得到最好的結(jié)果,我們采用如[41, 13]中的全卷積形式,並在多個(gè)尺度上對(duì)分?jǐn)?shù)進(jìn)行平均(調(diào)整圖像大小,使短邊位於 {224, 256, 384, 480, 640} 中)。
@3 作者在本節(jié)先講了殘差網(wǎng)絡(luò)更好的理論依據(jù):原始函數(shù)和殘差函數(shù)學(xué)習(xí)的難易程度是不同的;
然後說(shuō)明了殘差函數(shù)的形式是靈活的,文中使用的是兩層或三層,並且不采用一層的(類(lèi)似於線性層);
緊接著通過(guò)對(duì)比 VGG、34-layer plain、34-layer residual 講解了網(wǎng)絡(luò)的結(jié)構(gòu);
最後講解了網(wǎng)絡(luò)的實(shí)現(xiàn)細(xì)節(jié)。
4. Experiments
4. 實(shí)驗(yàn)
4.1. ImageNet Classification
4.1. ImageNet 分類(lèi)
We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.
我們?cè)谟?1000 個(gè)類(lèi)別組成的 ImageNet 2012 分類(lèi)數(shù)據(jù)集[36]上對(duì)我們的方法進(jìn)行了評(píng)估。這些模型在 128 萬(wàn)張訓(xùn)練圖像上進(jìn)行訓(xùn)練,並在 5 萬(wàn)張驗(yàn)證圖像上進(jìn)行評(píng)估。我們也獲得了由測(cè)試服務(wù)器報(bào)告的在 10 萬(wàn)張測(cè)試圖像上的最終結(jié)果。我們?cè)u(píng)估了 top-1 和 top-5 錯(cuò)誤率。
Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.
簡(jiǎn)單網(wǎng)絡(luò)。我們首先評(píng)估 18 層和 34 層的簡(jiǎn)單網(wǎng)絡(luò)。34 層簡(jiǎn)單網(wǎng)絡(luò)如圖 3(中間)所示。18 層簡(jiǎn)單網(wǎng)絡(luò)是一種類(lèi)似的形式。有關(guān)詳細(xì)的體系結(jié)構(gòu),請(qǐng)參見(jiàn)表 1。
The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem -- the 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one.
Table 2. Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameter compared to their plain counterparts. Fig. 4 shows the training procedures.
表 2 中的結(jié)果表明,較深的 34 層簡(jiǎn)單網(wǎng)絡(luò)比較淺的 18 層簡(jiǎn)單網(wǎng)絡(luò)有更高的驗(yàn)證誤差。為了揭示原因,在圖 4(左)中,我們比較了訓(xùn)練過(guò)程中的訓(xùn)練/驗(yàn)證誤差。我們觀察到了退化問(wèn)題:雖然 18 層簡(jiǎn)單網(wǎng)絡(luò)的解空間是 34 層簡(jiǎn)單網(wǎng)絡(luò)解空間的子空間,但 34 層簡(jiǎn)單網(wǎng)絡(luò)在整個(gè)訓(xùn)練過(guò)程中具有較高的訓(xùn)練誤差。
表 2. ImageNet 驗(yàn)證集上的 Top-1 錯(cuò)誤率(%,10 個(gè)裁剪圖像測(cè)試)。相比於對(duì)應(yīng)的簡(jiǎn)單網(wǎng)絡(luò),ResNet 沒(méi)有額外的參數(shù)。圖 4 顯示了訓(xùn)練過(guò)程。
Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts.
圖 4. 在 ImageNet 上訓(xùn)練。細(xì)曲線表示訓(xùn)練誤差,粗曲線表示中心裁剪圖像的驗(yàn)證誤差。左:18 層和 34 層的簡(jiǎn)單網(wǎng)絡(luò)。右:18 層和 34 層的 ResNet。在本圖中,殘差網(wǎng)絡(luò)與對(duì)應(yīng)的簡(jiǎn)單網(wǎng)絡(luò)相比沒(méi)有額外的參數(shù)。
We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN [16], which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error. The reason for such optimization difficulties will be studied in the future.
Table 3. Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions.
我們認(rèn)為這種優(yōu)化困難不太可能是由梯度消失引起的。這些簡(jiǎn)單網(wǎng)絡(luò)使用 BN [16]訓(xùn)練,這保證了前向傳播的信號(hào)有非零的方差。我們還驗(yàn)證了反向傳播的梯度在 BN 下表現(xiàn)出健康的範(fàn)數(shù)。因此既不是前向信號(hào)消失,也不是反向信號(hào)消失。實(shí)際上,34 層簡(jiǎn)單網(wǎng)絡(luò)仍能取得有競(jìng)爭(zhēng)力的準(zhǔn)確率(表 3),這表明求解器在某種程度上仍在工作。我們推測(cè)深度簡(jiǎn)單網(wǎng)絡(luò)可能有指數(shù)級(jí)的低收斂速度,這影響了訓(xùn)練誤差的降低。這種優(yōu)化困難的原因?qū)⒃趯?lái)研究。
表 3. ImageNet 驗(yàn)證集上的錯(cuò)誤率(%,10 個(gè)裁剪圖像測(cè)試)。VGG-16 是基於我們的測(cè)試結(jié)果。ResNet-50/101/152 使用選項(xiàng) B,僅通過(guò)投影來(lái)增加維度。
Residual Networks. Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3 (right). In the first comparison (Table 2 and Fig. 4 right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameter compared to the plain counterparts.
殘差網(wǎng)絡(luò)。接下來(lái)我們?cè)u(píng)估 18 層和 34 層的殘差網(wǎng)絡(luò)(ResNet)。基準(zhǔn)架構(gòu)與上述的簡(jiǎn)單網(wǎng)絡(luò)相同,區(qū)別在於每對(duì) 3×3 濾波器都添加了快捷連接,如圖 3(右)所示。在第一次比較(表 2 和圖 4 右側(cè))中,我們對(duì)所有快捷連接都使用恒等映射,並用零填充來(lái)增加維度(選項(xiàng) A)。所以與對(duì)應(yīng)的簡(jiǎn)單網(wǎng)絡(luò)相比,它們沒(méi)有額外的參數(shù)。
We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalizable to the validation data. This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth.
我們從表 2 和圖 4 中可以看到三個(gè)主要的觀察結(jié)果。首先,在殘差學(xué)習(xí)下情況反過(guò)來(lái)了:34 層 ResNet 比 18 層 ResNet 更好(好 2.8%)。更重要的是,34 層 ResNet 顯示出相當(dāng)更低的訓(xùn)練誤差,並且可以泛化到驗(yàn)證數(shù)據(jù)。這表明在這種情況下,退化問(wèn)題得到了很好的解決,我們?cè)O(shè)法從增加的深度中獲得了準(zhǔn)確性收益。
Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4 right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems.
第二,與對(duì)應(yīng)的簡(jiǎn)單網(wǎng)絡(luò)相比,由於成功地減少了訓(xùn)練誤差(圖 4 右與左對(duì)比),34 層 ResNet 降低了 3.5% 的 top-1 錯(cuò)誤率(表 2)。這種比較證實(shí)了在極深系統(tǒng)中殘差學(xué)習(xí)的有效性。
Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs. left). When the net is “not overly deep” (18 layers here), the current SGD solver is still able to find good solutions to the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage.
最後,我們還注意到 18 層的簡(jiǎn)單/殘差網(wǎng)絡(luò)準(zhǔn)確率相當(dāng)(表 2),但 18 層 ResNet 收斂更快(圖 4 右與左對(duì)比)。當(dāng)網(wǎng)絡(luò)"不過(guò)度深"時(shí)(這裡是 18 層),目前的 SGD 求解器仍能在簡(jiǎn)單網(wǎng)絡(luò)中找到好的解。在這種情況下,ResNet 通過(guò)在早期階段提供更快的收斂簡(jiǎn)化了優(yōu)化。
Identity vs. Projection Shortcuts. We have shown that parameter-free, identity shortcuts help with training. Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.
恒等與投影快捷連接。我們已經(jīng)表明,無(wú)參數(shù)的恒等快捷連接有助於訓(xùn)練。接下來(lái)我們研究投影快捷連接(方程(2))。在表 3 中我們比較了三個(gè)選項(xiàng):(A)零填充快捷連接用來(lái)增加維度,所有的快捷連接都是無(wú)參數(shù)的(與表 2 和圖 4 右相同);(B)投影快捷連接用來(lái)增加維度,其它的快捷連接是恒等的;(C)所有的快捷連接都是投影。
Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A. We argue that this is because the zero-padded dimensions in A indeed have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below.
表 3 顯示,所有三個(gè)選項(xiàng)都比對(duì)應(yīng)的簡(jiǎn)單網(wǎng)絡(luò)好很多。選項(xiàng) B 比 A 略好。我們認(rèn)為這是因?yàn)?A 中零填充的維度確實(shí)沒(méi)有進(jìn)行殘差學(xué)習(xí)。選項(xiàng) C 比 B 稍好,我們把這歸因於許多(十三個(gè))投影快捷連接引入了額外參數(shù)。但 A/B/C 之間的細(xì)微差異表明,投影快捷連接對(duì)於解決退化問(wèn)題不是至關(guān)重要的。因此我們?cè)诒疚牡氖S嗖糠植辉偈褂眠x項(xiàng) C,以減少內(nèi)存/時(shí)間複雜度和模型大小。恒等快捷連接對(duì)於不增加下面介紹的瓶頸結(jié)構(gòu)的複雜度尤為重要。
Deeper Bottleneck Architectures. Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design. For each residual function F , we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.
更深的瓶頸結(jié)構(gòu)。接下來(lái)我們描述用於 ImageNet 的更深的網(wǎng)絡(luò)。出於對(duì)我們能承受的訓(xùn)練時(shí)間的考慮,我們將構(gòu)建塊修改為瓶頸設(shè)計(jì)。對(duì)於每個(gè)殘差函數(shù) F,我們使用 3 層堆疊而不是 2 層(圖 5)。這三層是 1×1、3×3 和 1×1 卷積,其中 1×1 層負(fù)責(zé)先減小然後增加(恢復(fù))維度,使 3×3 層成為具有較小輸入/輸出維度的瓶頸。圖 5 展示了一個(gè)示例,兩個(gè)設(shè)計(jì)具有相似的時(shí)間複雜度。
The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.
無(wú)參數(shù)的恒等快捷連接對(duì)於瓶頸架構(gòu)尤為重要。如果把圖 5(右)中的恒等快捷連接替換為投影,可以證明時(shí)間複雜度和模型大小都會(huì)加倍,因?yàn)榭旖葸B接連接到了兩個(gè)高維端。因此,恒等快捷連接為瓶頸設(shè)計(jì)帶來(lái)了更高效的模型。
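瓶頸塊的一個(gè) Keras 示意(筆者草圖,bottleneck_block 為筆者自擬的名字):1×1 先降維、3×3 在低維上計(jì)算、1×1 再升維,快捷連接保持恒等,只有維度不匹配時(shí)才用 1×1 投影。

```python
import tensorflow as tf
from tensorflow.keras import layers


def bottleneck_block(x, filters, stride=1):
    """瓶頸殘差塊:1x1 降維 -> 3x3 -> 1x1 升維(輸出通道為 4*filters)。"""
    shortcut = x
    out = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)    # 1x1 降維
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)
    out = layers.Conv2D(filters, 3, padding="same", use_bias=False)(out)  # 3x3 瓶頸
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)
    out = layers.Conv2D(4 * filters, 1, use_bias=False)(out)              # 1x1 升維(恢復(fù))
    out = layers.BatchNormalization()(out)

    # 恒等快捷連接連接兩個(gè)高維端;只有維度不匹配時(shí)才用 1x1 投影(選項(xiàng) B)
    if stride != 1 or x.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)

    out = layers.Add()([out, shortcut])
    return layers.ReLU()(out)


# 用法示意:56x56x256 的輸入經(jīng)過(guò)一個(gè)瓶頸塊,形狀不變
inputs = tf.keras.Input(shape=(56, 56, 256))
model = tf.keras.Model(inputs, bottleneck_block(inputs, filters=64))
```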
50-layer ResNet: We replace each 2-layer block in the 34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.
50 層 ResNet:我們用 3 層瓶頸塊替換 34 層網(wǎng)絡(luò)中的每一個(gè) 2 層塊,得到了一個(gè) 50 層的 ResNet(表 1)。我們使用選項(xiàng) B 來(lái)增加維度。該模型有 38 億 FLOPs。
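注:按表 1 的配置可以核對(duì)層數(shù):ResNet-50 的 conv2_x 到 conv5_x 四個(gè) stage 分別堆疊 3、4、6、3 個(gè)瓶頸塊,每個(gè)瓶頸塊有 3 個(gè)卷積層,再加上最開(kāi)始的 7×7 卷積層和最後的全連接層,共 1 + 3×(3+4+6+3) + 1 = 50 個(gè)加權(quán)層(快捷連接上的 1×1 投影不計(jì)入)。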
101-layer and 152-layer ResNet: We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).
101 層和 152 層 ResNet:我們通過(guò)使用更多的 3 層瓶頸塊來(lái)構(gòu)建 101 層和 152 層的 ResNet(表 1)。值得注意的是,盡管深度顯著增加,但 152 層 ResNet(113 億 FLOPs)仍然比 VGG-16/19 網(wǎng)絡(luò)(153/196 億 FLOPs)具有更低的複雜度。
The 50/101/152-layer ResNets are more accurate than the 34-layer ones by considerable margins (Table 3 and 4). We do not observe the degradation problem and thus enjoy significant accuracy gains from considerably increased depth. The benefits of depth are witnessed for all evaluation metrics (Table 3 and 4).
50/101/152 層 ResNet 比 34 層 ResNet 的準(zhǔn)確率要高得多(表 3 和表 4)。我們沒(méi)有觀察到退化問(wèn)題,因此可以從顯著增加的深度中獲得顯著的準(zhǔn)確性收益。所有評(píng)估指標(biāo)都能證明深度帶來(lái)的收益(表 3 和表 4)。
Comparisons with State-of-the-art Methods. In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets have achieved very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only with two 152-layer ones at the time of submitting). This leads to 3.57% top-5 error on the test set (Table 5). This entry won the 1st place in ILSVRC 2015.
Table 4. Error rates (%) of single-model results on the ImageNet validation set (except reported on the test set).
Table 5. Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server.
與最先進(jìn)的方法比較。在表 4 中,我們與以前最好的單一模型結(jié)果進(jìn)行比較。我們基準(zhǔn)的 34 層 ResNet 已經(jīng)取得了非常有競(jìng)爭(zhēng)力的準(zhǔn)確率。我們的 152 層 ResNet 具有 4.49% 的單模型 top-5 驗(yàn)證錯(cuò)誤率。這個(gè)單一模型的結(jié)果勝過(guò)以前所有的集成結(jié)果(表 5)。我們結(jié)合了六個(gè)不同深度的模型形成一個(gè)集成(在提交時(shí)僅有兩個(gè) 152 層的模型)。這在測(cè)試集上得到了 3.57% 的 top-5 錯(cuò)誤率(表 5)。這次提交在 ILSVRC 2015 中獲得了第一名。
表 4. 單一模型在 ImageNet 驗(yàn)證集上的錯(cuò)誤率(%)(除標(biāo)注 † 的是在測(cè)試集上報(bào)告的錯(cuò)誤率)。
表 5. 模型集成的錯(cuò)誤率(%)。top-5 錯(cuò)誤率是在 ImageNet 測(cè)試集上並由測(cè)試服務(wù)器報(bào)告的。
4.2. CIFAR-10 and Analysis
4.2. CIFAR-10 和分析
We conducted more studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks, but not on pushing the state-of-the-art results, so we intentionally use simple architectures as follows.
我們對(duì) CIFAR-10 數(shù)據(jù)集[20]進(jìn)行了更多的研究,該數(shù)據(jù)集包括 10 個(gè)類(lèi)別的 5 萬(wàn)張訓(xùn)練圖像和 1 萬(wàn)張測(cè)試圖像。我們展示了在訓(xùn)練集上進(jìn)行訓(xùn)練並在測(cè)試集上進(jìn)行評(píng)估的實(shí)驗(yàn)。我們的關(guān)注點(diǎn)在於極深網(wǎng)絡(luò)的行為,而不是推動(dòng)最先進(jìn)的結(jié)果,所以我們有意使用如下的簡(jiǎn)單架構(gòu)。
The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32, 16, 8} respectively, with 2n layers for each feature map size. The numbers of filters are {16, 32, 64} respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are totally 6n+2 stacked weighted layers. The following table summarizes the architecture:
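| output map size | 32×32 | 16×16 | 8×8 |
| :-- | :-- | :-- | :-- |
| # layers | 1+2n | 2n | 2n |
| # filters | 16 | 32 | 64 |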
When shortcut connections are used, they are connected to the pairs of 3×3 layers (totally 3n shortcuts). On this dataset we use identity shortcuts in all cases (i.e., option A), so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts.
簡(jiǎn)單/殘差架構(gòu)遵循圖 3(中/右)的形式。網(wǎng)絡(luò)輸入是 32×32 的圖像,並減去逐像素均值。第一層是 3×3 卷積。然後我們?cè)诖笮》謩e為 {32, 16, 8} 的特徵圖上使用帶有 3×3 卷積的共 6n 個(gè)堆疊層,每個(gè)特徵圖大小使用 2n 層。濾波器數(shù)量分別為 {16, 32, 64}。下采樣由步長(zhǎng)為 2 的卷積執(zhí)行。網(wǎng)絡(luò)以全局平均池化、一個(gè) 10 維全連接層和 softmax 結(jié)束。共有 6n+2 個(gè)堆疊的加權(quán)層。下表總結(jié)了這個(gè)架構(gòu):
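| 輸出特徵圖大小 | 32×32 | 16×16 | 8×8 |
| :-- | :-- | :-- | :-- |
| 層數(shù) | 1+2n | 2n | 2n |
| 濾波器數(shù)量 | 16 | 32 | 64 |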
當(dāng)使用快捷連接時(shí),它們連接到成對(duì)的 3×3 卷積層上(共 3n 個(gè)快捷連接)。在這個(gè)數(shù)據(jù)集上,我們?cè)谒星闆r下都使用恒等快捷連接(即選項(xiàng) A),因此我們的殘差模型與對(duì)應(yīng)的簡(jiǎn)單模型具有完全相同的深度、寬度和參數(shù)數(shù)量。
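按照上面的描述,6n+2 結(jié)構(gòu)的 CIFAR-10 ResNet 可以示意性地實(shí)現(xiàn)如下(筆者的 Keras 草圖,函數(shù)名為筆者自擬;快捷連接全部采用選項(xiàng) A 的零填充恒等連接,因此與對(duì)應(yīng)的簡(jiǎn)單網(wǎng)絡(luò)參數(shù)數(shù)量完全相同):

```python
import tensorflow as tf
from tensorflow.keras import layers


def pad_shortcut(x, filters):
    """選項(xiàng) A:步長(zhǎng) 2 取樣 + 通道補(bǔ)零,無(wú)參數(shù)。"""
    x = x[:, ::2, ::2, :]
    pad = filters - x.shape[-1]
    return tf.pad(x, [[0, 0], [0, 0], [0, 0], [pad // 2, pad - pad // 2]])


def basic_block(x, filters, downsample=False):
    stride = 2 if downsample else 1
    shortcut = pad_shortcut(x, filters) if downsample else x
    out = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)
    out = layers.Conv2D(filters, 3, padding="same", use_bias=False)(out)
    out = layers.BatchNormalization()(out)
    return layers.ReLU()(layers.Add()([out, shortcut]))


def cifar_resnet(n=3, num_classes=10):
    """6n+2 層:n=3 對(duì)應(yīng) ResNet-20,n=18 對(duì)應(yīng) ResNet-110。"""
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = layers.Conv2D(16, 3, padding="same", use_bias=False)(inputs)  # 第一層 3x3 卷積
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    for stage, filters in enumerate([16, 32, 64]):                    # 特徵圖 {32,16,8}
        for i in range(n):
            x = basic_block(x, filters, downsample=(stage > 0 and i == 0))
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = cifar_resnet(n=3)  # 20 層
```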
We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [13] and BN [16] but with no dropout. These models are trained with a mini-batch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.
我們使用 0.0001 的權(quán)重衰減和 0.9 的動(dòng)量,並采用[13]中的權(quán)重初始化和 BN [16],但不使用 dropout。這些模型在兩個(gè) GPU 上以 128 的 mini-batch 大小進(jìn)行訓(xùn)練。我們從 0.1 的學(xué)習(xí)率開(kāi)始,在 32k 和 48k 次迭代時(shí)將學(xué)習(xí)率除以 10,並在 64k 次迭代時(shí)終止訓(xùn)練,這是在 45k/5k 的訓(xùn)練/驗(yàn)證集劃分上確定的。我們按照[24]中的簡(jiǎn)單數(shù)據(jù)增強(qiáng)進(jìn)行訓(xùn)練:每邊填充 4 個(gè)像素,並從填充後的圖像或其水平翻轉(zhuǎn)中隨機(jī)采樣 32×32 的裁剪圖像。測(cè)試時(shí),我們只評(píng)估原始 32×32 圖像的單一視圖。
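這段數(shù)據(jù)增強(qiáng)可以用 tf.image 大致寫(xiě)成如下形式(示意;per_pixel_mean 為筆者假設(shè)的變量,需事先在訓(xùn)練集上統(tǒng)計(jì)得到):

```python
import tensorflow as tf


def augment(image, label, per_pixel_mean):
    """每邊補(bǔ) 4 個(gè)像素,隨機(jī) 32x32 裁剪,隨機(jī)水平翻轉(zhuǎn),並減去逐像素均值。"""
    image = tf.cast(image, tf.float32) - per_pixel_mean
    image = tf.pad(image, [[4, 4], [4, 4], [0, 0]], mode="CONSTANT")
    image = tf.image.random_crop(image, size=[32, 32, 3])
    image = tf.image.random_flip_left_right(image)
    return image, label


# 用法示意:train_ds 為 (image, label) 形式的 tf.data.Dataset
# train_ds = train_ds.map(lambda x, y: augment(x, y, per_pixel_mean)).batch(128)
```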
We compare n = {3, 5, 7, 9}, leading to 20, 32, 44, and 56-layer networks. Fig. 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see [42]), suggesting that such an optimization difficulty is a fundamental problem.
Figure 6. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.
我們比較了 n = {3, 5, 7, 9},得到了 20 層、32 層、44 層和 56 層的網(wǎng)絡(luò)。圖 6(左)顯示了簡(jiǎn)單網(wǎng)絡(luò)的行為。深度簡(jiǎn)單網(wǎng)絡(luò)受深度增加的影響,在變深時(shí)表現(xiàn)出了更高的訓(xùn)練誤差。這種現(xiàn)象類(lèi)似於 ImageNet(圖 4,左)和 MNIST(請(qǐng)看[42])上的現(xiàn)象,表明這種優(yōu)化困難是一個(gè)基本的問(wèn)題。
圖 6. 在 CIFAR-10 上訓(xùn)練。虛線表示訓(xùn)練誤差,粗線表示測(cè)試誤差。左:簡(jiǎn)單網(wǎng)絡(luò)。plain-110 的錯(cuò)誤率超過(guò) 60%,未展示。中間:ResNet。右:110 層和 1202 層的 ResNet。
Fig. 6 (middle) shows the behaviors of ResNets. Also similar to the ImageNet cases (Fig. 4, right), our ResNets manage to overcome the optimization difficulty and demonstrate accuracy gains when the depth increases.
圖 6(中)顯示了 ResNet 的行為。與 ImageNet 的情況類(lèi)似(圖 4,右),我們的 ResNet 設(shè)法克服了優(yōu)化困難,並隨著深度的增加展示了準(zhǔn)確性收益。
We further explore n = 18 that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging. So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle). It has fewer parameters than other deep and thin networks such as FitNet [35] and Highway [42] (Table 6), yet is among the state-of-the-art results (6.43%, Table 6).
Table 6. Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show “best (mean±std)” as in [43].
我們進(jìn)一步探索 n=18,得到了 110 層的 ResNet。在這種情況下,我們發(fā)現(xiàn) 0.1 的初始學(xué)習(xí)率對(duì)於開(kāi)始收斂來(lái)說(shuō)略微偏大。因此我們使用 0.01 的學(xué)習(xí)率來(lái)熱身訓(xùn)練,直到訓(xùn)練誤差低於 80%(大約 400 次迭代),然後回到 0.1 並繼續(xù)訓(xùn)練。學(xué)習(xí)計(jì)劃的剩餘部分與之前一樣。這個(gè) 110 層網(wǎng)絡(luò)收斂得很好(圖 6,中)。它與其它深而窄的網(wǎng)絡(luò)(例如 FitNet [35] 和 Highway [42])相比參數(shù)更少(表 6),但結(jié)果仍位列目前最好的結(jié)果之中(6.43%,表 6)。
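文中 ResNet-110 的學(xué)習(xí)率安排(先用 0.01 熱身約 400 次迭代,回到 0.1,再在 32k、48k 次迭代處除以 10)可以用 Keras 自帶的分段常數(shù)調(diào)度近似寫(xiě)成(示意):

```python
import tensorflow as tf

# 邊界按迭代步數(shù)(batch 數(shù))計(jì):400 步之前用 0.01 熱身,之後 0.1,在 32k、48k 處除以 10
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[400, 32000, 48000],
    values=[0.01, 0.1, 0.01, 0.001],
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
```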
表 6. 在 CIFAR-10 測(cè)試集上的分類(lèi)誤差。所有的方法都使用了數(shù)據(jù)增強(qiáng)。對(duì)於 ResNet-110,像論文[43]中那樣,我們運(yùn)行了 5 次並展示了"最好的(mean±std)"。
Analysis of Layer Responses. Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig. 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Sec.3.1) that the residual functions might be generally closer to zero than the non-residual functions. We also notice that the deeper ResNet has smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Fig. 7. When there are more layers, an individual layer of ResNets tends to modify the signal less.
層響應(yīng)分析。圖 7 顯示了層響應(yīng)的標(biāo)準(zhǔn)差(std)。這些響應(yīng)是每個(gè) 3×3 層的輸出,在 BN 之後、其它非線性(ReLU/加法)之前。對(duì)於 ResNet,該分析揭示了殘差函數(shù)的響應(yīng)強(qiáng)度。圖 7 顯示 ResNet 的響應(yīng)通常比其對(duì)應(yīng)的簡(jiǎn)單網(wǎng)絡(luò)的響應(yīng)更小。這些結(jié)果支持了我們的基本動(dòng)機(jī)(第 3.1 節(jié)):殘差函數(shù)通常可能比非殘差函數(shù)更接近零。我們還注意到,更深的 ResNet 具有更小的響應(yīng)幅度,如圖 7 中 ResNet-20、56 和 110 之間的比較所示。當(dāng)層數(shù)更多時(shí),ResNet 的單個(gè)層傾向於更少地修改信號(hào)。
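若想在自己的 Keras 模型上復(fù)現(xiàn)類(lèi)似的層響應(yīng)分析,一個(gè)簡(jiǎn)單的思路如下(筆者的示意草圖,假設(shè) model 是前面那種函數(shù)式 Keras 模型、x_batch 是一批測(cè)試圖像):取出每個(gè) BN 層的輸出(即 3×3 卷積 + BN 之後、非線性之前的響應(yīng)),在一批數(shù)據(jù)上計(jì)算標(biāo)準(zhǔn)差。

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


def layer_response_std(model, x_batch):
    """返回每個(gè) BN 層輸出在一批數(shù)據(jù)上的標(biāo)準(zhǔn)差,對(duì)應(yīng)圖 7 的逐層響應(yīng)強(qiáng)度。"""
    bn_outputs = [l.output for l in model.layers if isinstance(l, layers.BatchNormalization)]
    probe = tf.keras.Model(model.input, bn_outputs)  # 一次前向取得所有 BN 輸出
    return [float(np.std(o)) for o in probe(x_batch, training=False)]
```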
Figure 7. Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3×3 layer, after BN and before nonlinearity. Top: the layers are shown in their original order. Bottom: the responses are ranked in descending order.
圖 7. 層響應(yīng)在 CIFAR-10 上的標(biāo)準(zhǔn)差(std)。這些響應(yīng)是每個(gè) 3×3 層的輸出,在 BN 之後、非線性之前。上面:以原始順序顯示層。下面:響應(yīng)按降序排列。
Exploring Over 1000 layers. We explore an aggressively deep model of over 1000 layers. We set n = 200 that leads to a 1202-layer network, which is trained as described above. Our method shows no optimization difficulty, and this 10^3-layer network is able to achieve training error <0.1% (Fig. 6, right). Its test error is still fairly good (7.93%, Table 6).
探索超過(guò) 1000 層。我們探索一個(gè)超過(guò) 1000 層的非常深的模型。我們?cè)O(shè)置 n=200,得到了 1202 層的網(wǎng)絡(luò),其訓(xùn)練如上所述。我們的方法沒(méi)有顯示出優(yōu)化困難,這個(gè) 10^3 層的網(wǎng)絡(luò)能夠?qū)崿F(xiàn)小於 0.1% 的訓(xùn)練誤差(圖 6,右)。其測(cè)試誤差仍然相當(dāng)好(7.93%,表 6)。
But there are still open problems on such aggressively deep models. The testing result of this 1202-layer network is worse than that of our 110-layer network, although both have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M) for this small dataset. Strong regularization such as maxout [10] or dropout [14] is applied to obtain the best results ([10, 25, 24, 35]) on this dataset. In this paper, we use no maxout/dropout and just simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future.
但是,這種非常深的模型仍然存在開(kāi)放的問(wèn)題。這個(gè) 1202 層網(wǎng)絡(luò)的測(cè)試結(jié)果比我們 110 層網(wǎng)絡(luò)的更差,雖然兩者具有類(lèi)似的訓(xùn)練誤差。我們認(rèn)為這是因?yàn)檫^(guò)擬合。對(duì)於這個(gè)小型數(shù)據(jù)集,1202 層網(wǎng)絡(luò)可能是不必要的大(19.4M 參數(shù))。在這個(gè)數(shù)據(jù)集上,通常應(yīng)用強(qiáng)大的正則化,如 maxout [10] 或 dropout [14],來(lái)獲得最佳結(jié)果([10, 25, 24, 35])。在本文中,我們不使用 maxout/dropout,只是簡(jiǎn)單地通過(guò)設(shè)計(jì)深而窄的架構(gòu)來(lái)施加正則化,而不分散對(duì)優(yōu)化難點(diǎn)的關(guān)注。但結(jié)合更強(qiáng)的正則化可能會(huì)改善結(jié)果,我們將來(lái)會(huì)研究。
4.3. Object Detection on PASCAL and MS COCO
4.3. 在 PASCAL 和 MS COCO 上的目標(biāo)檢測(cè)
Our method has good generalization performance on other recognition tasks. Table 7 and 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 [5] and COCO [26]. We adopt Faster R-CNN [32] as the detection method. Here we are interested in the improvements of replacing VGG-16 [41] with ResNet-101. The detection implementation (see appendix) of using both models is the same, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO’s standard metric (mAP@[.5, .95]), which is a 28% relative improvement. This gain is solely due to the learned representations.
Table 7. Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also Table 10 and 11 for better results.
Table 8. Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also Table 9 for better results.
我們的方法在其他識(shí)別任務(wù)上有很好的泛化性能。表 7 和表 8 顯示了在 PASCAL VOC 2007 和 2012 [5]以及 COCO [26]上的目標(biāo)檢測(cè)基準(zhǔn)結(jié)果。我們采用 Faster R-CNN [32]作為檢測(cè)方法。在這裡,我們感興趣的是用 ResNet-101 替換 VGG-16 [41]帶來(lái)的改進(jìn)。使用這兩種模型的檢測(cè)實(shí)現(xiàn)(見(jiàn)附錄)是一樣的,所以收益只能歸因於更好的網(wǎng)絡(luò)。最顯著的是,在有挑戰(zhàn)性的 COCO 數(shù)據(jù)集上,COCO 的標(biāo)準(zhǔn)度量指標(biāo)(mAP@[.5, .95])增長(zhǎng)了 6.0%,相對(duì)改善了 28%。這種收益完全來(lái)自學(xué)習(xí)到的表示。
表 7. 在 PASCAL VOC 2007/2012 測(cè)試集上使用基準(zhǔn) Faster R-CNN 的目標(biāo)檢測(cè) mAP(%)。更好的結(jié)果另見(jiàn)表 10 和表 11。
表 8. 在 COCO 驗(yàn)證集上使用基準(zhǔn) Faster R-CNN 的目標(biāo)檢測(cè) mAP(%)。更好的結(jié)果另見(jiàn)表 9。
Based on deep residual nets, we won the 1st places in several tracks in ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The details are in the appendix.
基於深度殘差網(wǎng)絡(luò),我們?cè)?ILSVRC & COCO 2015 競(jìng)賽的幾個(gè)任務(wù)中獲得了第一名,分別是:ImageNet 檢測(cè)、ImageNet 定位、COCO 檢測(cè)和 COCO 分割。更多細(xì)節(jié)請(qǐng)看附錄。
@3 作者通過(guò) ImageNet 分類(lèi)、CIFAR-10 等實(shí)驗(yàn)和分析證實(shí)了:
- 深度簡(jiǎn)單網(wǎng)絡(luò)可能有指數(shù)級(jí)的低收斂速度;
- 投影快捷連接對(duì)於解決退化問(wèn)題不是至關(guān)重要的;
- ResNet 的層響應(yīng)比其對(duì)應(yīng)的簡(jiǎn)單網(wǎng)絡(luò)更小,即殘差函數(shù)通常比非殘差函數(shù)更接近零。
同時(shí)作者也探索了超過(guò) 1000 層的網(wǎng)絡(luò),並指出這種極深的模型仍然存在開(kāi)放的問(wèn)題(例如過(guò)擬合)。
總的來(lái)說(shuō),ResNet 在各個(gè)數(shù)據(jù)集和各項(xiàng)任務(wù)中表現(xiàn)都較好,可以用於替換一些舊的 backbone。
References
- [1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
- [2] C. M. Bishop. Neural networks for pattern recognition. Oxford university press, 1995.
- [3] W. L. Briggs, S. F. McCormick, et al. A Multigrid Tutorial. Siam, 2000.
- [4] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
- [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303–338, 2010.
- [6] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware cnn model. In ICCV, 2015.
- [7] R. Girshick. Fast R-CNN. In ICCV, 2015.
- [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- [9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- [10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
- [11] K. He and J. Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015.
- [12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
- [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
- [14] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
- [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- [17] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33, 2011.
- [18] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 2012.
- [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
- [20] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
- [21] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- [22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
- [23] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
- [24] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014.
- [25] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
- [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- [28] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
- [29] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
- [30] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
- [31] T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.
- [32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- [33] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. arXiv:1504.06066, 2015.
- [34] B. D. Ripley. Pattern recognition and neural networks. Cambridge university press, 1996.
- [35] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
- [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014.
- [37] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
- [38] N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
- [39] N. N. Schraudolph. Centering neural network gradient factors. In Neural Networks: Tricks of the Trade, pages 207–226. Springer, 1998.
- [40] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
- [41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [42] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
- [43] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. arXiv:1507.06228, 2015.
- [44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [45] R. Szeliski. Fast surface interpolation using hierarchical basis functions. TPAMI, 1990.
- [46] R. Szeliski. Locally adapted hierarchical basis preconditioning. In SIGGRAPH, 2006.
- [47] T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods: backpropagation learning with transformations in nonlinearities. In Neural Information Processing, 2013.
- [48] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
- [49] W. Venables and B. Ripley. Modern applied statistics with s-plus. 1999.
- [50] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.