翻譯論文匯總:https://github.com/SnailTyan/deep-learning-papers-translation
Very Deep Convolutional Networks for Large-Scale Image Recognition
ABSTRACT
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
摘要
在這項(xiàng)工作中,我們研究了卷積網(wǎng)絡(luò)深度在大規(guī)模圖像識(shí)別環(huán)境下對(duì)準(zhǔn)確性的影響。我們的主要貢獻(xiàn)是使用非常小的(3×3)卷積濾波器架構(gòu)對(duì)深度不斷增加的網(wǎng)絡(luò)進(jìn)行了全面評(píng)估,結(jié)果表明,將深度推進(jìn)到16–19個(gè)加權(quán)層可以顯著改進(jìn)現(xiàn)有技術(shù)的配置。這些發(fā)現(xiàn)是我們的ImageNet Challenge 2014提交的基礎(chǔ),我們的團(tuán)隊(duì)在定位和分類任務(wù)中分別獲得了第一名和第二名。我們還表明,我們的表示在其它數(shù)據(jù)集上泛化得很好,並取得了最好的結(jié)果。我們公開了兩個(gè)性能最好的ConvNet模型,以便進(jìn)一步研究深度視覺表示在計(jì)算機(jī)視覺中的應(yīng)用。
1 INTRODUCTION
Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014) which has become possible due to the large public image repositories, such as ImageNet (Deng et al., 2009), and high-performance computing systems, such as GPUs or large-scale distributed clusters (Dean et al., 2012). In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010) (the winner of ILSVRC-2011) to deep ConvNets (Krizhevsky et al., 2012) (the winner of ILSVRC-2012).
1 引言
卷積網(wǎng)絡(luò)(ConvNets)近來在大規(guī)模圖像和視頻識(shí)別方面取得了巨大成功(Krizhevsky等,2012;Zeiler&Fergus,2013;Sermanet等,2014;Simonyan&Zisserman,2014),這得益于大型公開圖像庫(kù)(例如ImageNet(Deng等,2009))以及高性能計(jì)算系統(tǒng)(例如GPU或大規(guī)模分布式集群(Dean等,2012))的出現(xiàn)。特別是,ImageNet大規(guī)模視覺識(shí)別挑戰(zhàn)賽(ILSVRC)(Russakovsky等,2014)在深度視覺識(shí)別架構(gòu)的進(jìn)步中發(fā)揮了重要作用,它已經(jīng)成為幾代大規(guī)模圖像分類系統(tǒng)的測(cè)試平臺(tái),從高維淺層特征編碼(Perronnin等,2010)(ILSVRC-2011的獲勝者)到深層ConvNets(Krizhevsky等,2012)(ILSVRC-2012的獲勝者)。
With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design —— its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers.
隨著ConvNets在計(jì)算機(jī)視覺領(lǐng)域的日益普及,為了達(dá)到更好的準(zhǔn)確性,人們進(jìn)行了許多嘗試來改進(jìn)Krizhevsky等人(2012)最初的架構(gòu)。例如,ILSVRC-2013中表現(xiàn)最佳的提交(Zeiler&Fergus,2013;Sermanet等,2014)使用了更小的感受窗口尺寸和更小的第一卷積層步長(zhǎng)。另一類改進(jìn)則是在整個(gè)圖像和多個(gè)尺度上對(duì)網(wǎng)絡(luò)進(jìn)行密集的訓(xùn)練和測(cè)試(Sermanet等,2014;Howard,2014)。在本文中,我們討論ConvNet架構(gòu)設(shè)計(jì)的另一個(gè)重要方面,即網(wǎng)絡(luò)的深度。為此,我們固定架構(gòu)的其它參數(shù),並通過添加更多的卷積層來穩(wěn)步增加網(wǎng)絡(luò)的深度,這是可行的,因?yàn)槲覀冊(cè)谒袑又卸际褂梅浅P〉模?×3)卷積濾波器。
As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of a relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models to facilitate further research.
因此,我們提出了精確得多的ConvNet架構(gòu),不僅可以在ILSVRC分類和定位任務(wù)上取得最佳的準(zhǔn)確性,而且還適用于其它的圖像識(shí)別數(shù)據(jù)集,即使作為相對(duì)簡(jiǎn)單流程的一部分使用(例如,用線性SVM對(duì)深度特征進(jìn)行分類而不進(jìn)行微調(diào)),也能獲得優(yōu)異的性能。我們發(fā)布了兩個(gè)表現(xiàn)最好的模型,以便進(jìn)一步研究。
The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.
本文的其余部分組織如下。在第2節(jié)中,我們描述了我們的ConvNet配置。第3節(jié)給出圖像分類訓(xùn)練和評(píng)估的細(xì)節(jié),第4節(jié)在ILSVRC分類任務(wù)上對(duì)各配置進(jìn)行了比較。第5節(jié)總結(jié)全文。為了完整起見,我們還在附錄A中描述並評(píng)估了我們的ILSVRC-2014目標(biāo)定位系統(tǒng),並在附錄B中討論了非常深的特征在其它數(shù)據(jù)集上的泛化。最后,附錄C列出了論文的主要修訂。
2 CONVNET CONFIGURATIONS
To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.
2 ConvNet配置
為了在公平的環(huán)境中衡量增加ConvNet深度所帶來的改進(jìn),我們所有的ConvNet層配置都遵循相同的原則,其靈感來自Ciresan等(2011)和Krizhevsky等人(2012)。在本節(jié)中,我們首先描述ConvNet配置的通用設(shè)計(jì)(第2.1節(jié)),然后詳細(xì)說明評(píng)估中使用的具體配置(第2.2節(jié))。最后,我們?cè)?.3節(jié)討論我們的設(shè)計(jì)選擇,並與現(xiàn)有技術(shù)進(jìn)行比較。
2.1 ARCHITECTURE
During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2.
在訓(xùn)練期間,我們的ConvNet的輸入是固定大小的224×224 RGB圖像。我們唯一的預(yù)處理是從每個(gè)像素中減去在訓(xùn)練集上計(jì)算的RGB均值。圖像通過一堆卷積(conv.)層,其中我們使用感受野非常小的濾波器:3×3(這是能捕獲左/右、上/下、中心概念的最小尺寸)。在其中一種配置中,我們還使用了1×1卷積濾波器,它可以看作輸入通道的線性變換(后接非線性)。卷積步長(zhǎng)固定為1個(gè)像素;卷積層輸入的空間填充要使卷積之后空間分辨率保持不變,即3×3卷積層的填充為1個(gè)像素?臻g池化由五個(gè)最大池化層完成,它們跟在部分卷積層之后(不是所有的卷積層之后都有最大池化)。最大池化在2×2像素窗口上進(jìn)行,步長(zhǎng)為2。
A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
卷積層堆疊(在不同架構(gòu)中具有不同的深度)之后是三個(gè)全連接(FC)層:前兩個(gè)各有4096個(gè)通道,第三個(gè)執(zhí)行1000類的ILSVRC分類,因此包含1000個(gè)通道(每個(gè)通道對(duì)應(yīng)一個(gè)類別)。最后一層是soft-max層。所有網(wǎng)絡(luò)中全連接層的配置都是相同的。
All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).
所有隱藏層都配備了修正線性單元(ReLU(Krizhevsky等,2012))非線性。我們注意到,我們的網(wǎng)絡(luò)(除了一個(gè))都不包含局部響應(yīng)歸一化(LRN)(Krizhevsky等,2012):正如第4節(jié)將要展示的,這種歸一化並不能提高在ILSVRC數(shù)據(jù)集上的性能,反而會(huì)增加內(nèi)存消耗和計(jì)算時(shí)間。在使用LRN的地方,其參數(shù)采用(Krizhevsky等,2012)中的設(shè)置。
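下面給出一個(gè)按上述描述搭建VGG風(fēng)格網(wǎng)絡(luò)的示意性草圖(假設(shè)使用PyTorch,而非論文所用的Caffe實(shí)現(xiàn);其中的配置列表和變量名僅作演示,具體層序列以論文表1為準(zhǔn)):

```python
import torch
import torch.nn as nn

def make_features(cfg, in_channels=3):
    # cfg中的數(shù)字表示3×3卷積的輸出通道數(shù),'M'表示2×2、步長(zhǎng)為2的最大池化
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # 3×3卷積,步長(zhǎng)1,填充1,以保持空間分辨率;后接ReLU
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, stride=1, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)

class VGGStyleNet(nn.Module):
    def __init__(self, cfg, num_classes=1000):
        super().__init__()
        self.features = make_features(cfg)
        # 三個(gè)全連接層:4096、4096、1000(輸入為224×224時(shí),最后的特征圖為512×7×7)
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)   # soft-max在交叉熵?fù)p失中隱式完成

# 示例:一個(gè)11層(8卷積+3全連接)的配置,僅作示意
cfg_a = [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M']
net = VGGStyleNet(cfg_a)
out = net(torch.randn(1, 3, 224, 224))   # 輸出形狀: (1, 1000)
```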
2.2 CONFIGURATIONS
The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.
Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as “conv⟨receptive field size⟩-⟨number of channels⟩”. The ReLU activation function is not shown for brevity.
2.2 配置
本文中評(píng)估的ConvNet配置在表1中列出,每列一個(gè)。下文中我們將用網(wǎng)絡(luò)的名稱(A–E)來指代它們。所有配置都遵循2.1節(jié)提出的通用設(shè)計(jì),僅在深度上不同:從網(wǎng)絡(luò)A中的11個(gè)加權(quán)層(8個(gè)卷積層和3個(gè)FC層)到網(wǎng)絡(luò)E中的19個(gè)加權(quán)層(16個(gè)卷積層和3個(gè)FC層)。卷積層的寬度(通道數(shù))相當(dāng)小,從第一層的64開始,每經(jīng)過一個(gè)最大池化層就擴(kuò)大2倍,直到達(dá)到512。
表1:ConvNet配置(以列顯示)。隨著更多層的加入,配置的深度從左(A)到右(E)遞增(添加的層以粗體顯示)。卷積層參數(shù)表示為“conv⟨感受野大小⟩-⟨通道數(shù)⟩”。為簡(jiǎn)潔起見,未顯示ReLU激活函數(shù)。
In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).
Table 2: Number of parameters (in millions).
在表2中,我們給出了每個(gè)配置的參數(shù)數(shù)量。盡管深度很大,我們網(wǎng)絡(luò)中的權(quán)重?cái)?shù)量並不多于卷積層更寬、感受野更大的較淺網(wǎng)絡(luò)的權(quán)重?cái)?shù)量((Sermanet等人,2014)中為144M個(gè)權(quán)重)。
表2:參數(shù)數(shù)量(單位:百萬(wàn))。
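作為對(duì)表2的補(bǔ)充說明,下面的小段Python代碼粗略估算某一配置的權(quán)重?cái)?shù)量(假設(shè)的層序列按表1的描述手工展開,忽略偏置,僅作示意):

```python
def count_weights(cfg, num_classes=1000):
    total, c_in = 0, 3
    for v in cfg:
        if v == 'M':                        # 最大池化層沒有權(quán)重
            continue
        total += 3 * 3 * c_in * v           # 3×3卷積核的權(quán)重?cái)?shù)
        c_in = v
    total += 512 * 7 * 7 * 4096             # FC1(輸入224×224時(shí)特征圖為512×7×7)
    total += 4096 * 4096                     # FC2
    total += 4096 * num_classes              # FC3
    return total

# 配置D(16個(gè)權(quán)重層)的層序列,按表1的描述手工展開(假設(shè))
cfg_d = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']
print(round(count_weights(cfg_d) / 1e6))     # 約138(百萬(wàn)),與表2報(bào)告的數(shù)量級(jí)一致
```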
2.3 DISCUSSION
Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al., 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field. So what have we gained by using, for instance, a stack of three 3 × 3 conv. layers instead of a single 7 × 7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has $C$ channels, the stack is parametrised by $3(3^2C^2)=27C^2$ weights; at the same time, a single 7 × 7 conv. layer would require $7^2C^2=49C^2$ parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
2.3 討論
我們的ConvNet配置與ILSVRC-2012(Krizhevsky等,2012)和ILSVRC-2013(Zeiler&Fergus,2013;Sermanet等,2014)比賽中表現(xiàn)最佳的提交所使用的配置有很大不同。我們沒有在第一個(gè)卷積層中使用相對(duì)較大的感受野(例如(Krizhevsky等人,2012)中步長(zhǎng)為4的11×11,或(Zeiler&Fergus,2013;Sermanet等,2014)中步長(zhǎng)為2的7×7),而是在整個(gè)網(wǎng)絡(luò)中使用非常小的3×3感受野,並在輸入的每個(gè)像素上(步長(zhǎng)為1)進(jìn)行卷積。很容易看出,兩個(gè)3×3卷積層的堆疊(中間沒有空間池化)具有5×5的有效感受野;三個(gè)這樣的層具有7×7的有效感受野。那么,用三個(gè)3×3卷積層的堆疊來替換單個(gè)7×7層,我們獲得了什么呢?首先,我們引入了三個(gè)非線性修正層而不是單個(gè),這使得決策函數(shù)更具判別性。其次,我們減少了參數(shù)數(shù)量:假設(shè)三層3×3卷積堆疊的輸入和輸出都有$C$個(gè)通道,則該堆疊共有$3(3^2C^2)=27C^2$個(gè)權(quán)重;而單個(gè)7×7卷積層需要$7^2C^2=49C^2$個(gè)參數(shù),即多出81%。這可以看作是對(duì)7×7卷積濾波器施加了正則化,迫使它們分解為3×3濾波器的組合(並在其間注入非線性)。
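下面用幾行Python演示上述推理:堆疊n個(gè)3×3、步長(zhǎng)為1的卷積層的有效感受野為n(3-1)+1,且三個(gè)3×3層的參數(shù)量為$27C^2$,而單個(gè)7×7層為$49C^2$(假設(shè)輸入輸出通道數(shù)均為C,僅作示意):

```python
def effective_rf(n, k=3):
    # n個(gè)k×k、步長(zhǎng)為1的卷積層堆疊的有效感受野
    return n * (k - 1) + 1

def params(n, k, c):
    # n個(gè)k×k卷積層的權(quán)重?cái)?shù)(輸入輸出均為c個(gè)通道,忽略偏置)
    return n * (k ** 2) * (c ** 2)

C = 256
print(effective_rf(2), effective_rf(3))        # 5 7
print(params(3, 3, C), params(1, 7, C))        # 27*C^2 與 49*C^2
print(params(1, 7, C) / params(3, 3, C) - 1)   # 約0.81,即多出約81%的參數(shù)
```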
The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1 × 1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1 × 1 conv. layers have recently been utilised in the “Network in Network” architecture of Lin et al. (2014).
加入1×1卷積層(配置C,表1)是在不影響卷積層感受野的情況下增加決策函數(shù)非線性的一種方式。儘管在我們的情形中1×1卷積本質(zhì)上是在相同維度空間上的線性投影(輸入和輸出通道的數(shù)量相同),但修正函數(shù)仍然引入了額外的非線性。應(yīng)當(dāng)注意,1×1卷積層最近已在Lin等人(2014)的“Network in Network”架構(gòu)中得到使用。
Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance. GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets(22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model is outperforming that of Szegedy et al. (2014) in terms of the single-network classification accuracy.
Ciresan等人(2011)以前使用過小尺寸的卷積濾波器,但是他們的網(wǎng)絡(luò)深度遠(yuǎn)低于我們的網(wǎng)絡(luò),並且沒有在大規(guī)模的ILSVRC數(shù)據(jù)集上進(jìn)行評(píng)估。Goodfellow等人(2014)在街道門牌號(hào)碼識(shí)別任務(wù)中采用了深層ConvNets(11個(gè)權(quán)重層),並表明增加深度帶來了更好的性能。GoogLeNet(Szegedy等,2014)是ILSVRC-2014分類任務(wù)中表現(xiàn)最好的提交之一,它與我們的工作相互獨(dú)立地開發(fā),但相似之處在于它同樣基于非常深的ConvNets(22個(gè)權(quán)重層)和小卷積濾波器(除了3×3,它們也使用了1×1和5×5卷積)。然而,它們的網(wǎng)絡(luò)拓?fù)浣Y(jié)構(gòu)比我們的更復(fù)雜,並且在前幾層中更激進(jìn)地降低特征圖的空間分辨率以減少計(jì)算量。正如第4.5節(jié)將要展示的,在單網(wǎng)絡(luò)分類精度方面,我們的模型優(yōu)于Szegedy等人(2014)的模型。
3 CLASSIFICATION FRAMEWORK
In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.
3 分類框架
在上一節(jié)中,我們介紹了網(wǎng)絡(luò)配置的細(xì)節(jié)。在本節(jié)中,我們將介紹分類ConvNet訓(xùn)練和評(píng)估的細(xì)節(jié)。
3.1 TRAINING
The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the L2 penalty multiplier set to $5·10^{-4}$) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to $10^{-2}$, and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required less epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
3.1 訓(xùn)練
ConvNet的訓(xùn)練過程大體上遵循Krizhevsky等人(2012)(除了從多尺度的訓(xùn)練圖像中采樣輸入裁剪圖像外,如后文所述)。也就是說,通過使用帶動(dòng)量的小批量梯度下降(基于反向傳播(LeCun等人,1989))優(yōu)化多項(xiàng)邏輯回歸目標(biāo)函數(shù)來進(jìn)行訓(xùn)練。批量大小設(shè)為256,動(dòng)量為0.9。訓(xùn)練通過權(quán)重衰減(L2懲罰乘子設(shè)定為$5·10^{-4}$)進(jìn)行正則化,並對(duì)前兩個(gè)全連接層使用丟棄正則化(丟棄率設(shè)定為0.5)。學(xué)習(xí)率初始設(shè)定為$10^{-2}$,當(dāng)驗(yàn)證集準(zhǔn)確率停止改善時(shí)除以10。學(xué)習(xí)率總共降低了3次,訓(xùn)練在37萬(wàn)次迭代(74個(gè)epoch)后停止。我們推測(cè),盡管與(Krizhevsky等,2012)相比我們的網(wǎng)絡(luò)參數(shù)更多、深度更大,但網(wǎng)絡(luò)只需要更少的epoch就可以收斂,這是由于(a)更大的深度和更小的卷積濾波器尺寸帶來的隱式正則化,以及(b)某些層的預(yù)初始化。
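下面是一個(gè)示意性的訓(xùn)練設(shè)置草圖(假設(shè)使用PyTorch;其中的佔(zhàn)位模型和佔(zhàn)位數(shù)據(jù)僅作演示,並非論文的實(shí)際實(shí)現(xiàn)),展示批量256、動(dòng)量0.9、權(quán)重衰減$5·10^{-4}$、初始學(xué)習(xí)率$10^{-2}$並按驗(yàn)證集準(zhǔn)確率除以10等超參數(shù)如何表達(dá):

```python
import torch
import torch.nn as nn

# 佔(zhàn)位模型與佔(zhàn)位數(shù)據(jù),僅用于演示超參數(shù)設(shè)置(論文中的模型見表1,批量大小為256)
net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1000))
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))

criterion = nn.CrossEntropyLoss()            # 多項(xiàng)邏輯回歸目標(biāo)(soft-max + 負(fù)對(duì)數(shù)似然)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
# 當(dāng)驗(yàn)證集準(zhǔn)確率停止提升時(shí)學(xué)習(xí)率乘以0.1(論文中總共降低了3次)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)

optimizer.zero_grad()
loss = criterion(net(images), labels)        # 一次小批量迭代
loss.backward()
optimizer.step()
scheduler.step(0.5)                          # 實(shí)際使用時(shí)傳入每個(gè)epoch的驗(yàn)證集準(zhǔn)確率
```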
The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and $10^{-2}$ variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
網(wǎng)絡(luò)權(quán)重的初始化很重要,因?yàn)樯疃染W(wǎng)絡(luò)中梯度不穩(wěn)定,不好的初始化可能會(huì)使學(xué)習(xí)停滯。為了規(guī)避這個(gè)問題,我們先訓(xùn)練配置A(表1),它足夠淺,可以用隨機(jī)初始化進(jìn)行訓(xùn)練。然后,在訓(xùn)練更深的架構(gòu)時(shí),我們用網(wǎng)絡(luò)A的層來初始化前四個(gè)卷積層和最后三個(gè)全連接層(中間層隨機(jī)初始化)。我們沒有降低預(yù)初始化層的學(xué)習(xí)率,允許它們?cè)趯W(xué)習(xí)過程中改變。對(duì)于隨機(jī)初始化(在需要的地方),我們從均值為0、方差為$10^{-2}$的正態(tài)分布中采樣權(quán)重,偏置初始化為零。值得注意的是,在論文提交之后,我們發(fā)現(xiàn)可以使用Glorot&Bengio(2010)的隨機(jī)初始化程序來初始化權(quán)重,而無需預(yù)訓(xùn)練。
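下面的小段代碼示意這一初始化策略(假設(shè)使用PyTorch;方差$10^{-2}$對(duì)應(yīng)標(biāo)準(zhǔn)差0.1,use_xavier=True時(shí)改用文中提到的Glorot & Bengio (2010)初始化,函數(shù)名僅為示意):

```python
import torch.nn as nn

def init_weights(m, use_xavier=False):
    """對(duì)卷積層和全連接層進(jìn)行初始化;用法: net.apply(init_weights)"""
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        if use_xavier:
            nn.init.xavier_uniform_(m.weight)             # Glorot & Bengio (2010)
        else:
            nn.init.normal_(m.weight, mean=0.0, std=0.1)  # 均值0、方差1e-2的正態(tài)分布
        if m.bias is not None:
            nn.init.zeros_(m.bias)                        # 偏置初始化為零
```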
To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.
為了獲得固定大小的224×224的ConvNet輸入圖像,我們從縮放后的訓(xùn)練圖像中隨機(jī)裁剪(每張圖像在每次SGD迭代時(shí)裁剪一次)。為了進(jìn)一步增強(qiáng)訓(xùn)練集,裁剪圖像還經(jīng)過了隨機(jī)水平翻轉(zhuǎn)和隨機(jī)RGB顏色偏移(Krizhevsky等,2012)。訓(xùn)練圖像的縮放在下面解釋。
Training image size. Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale). While the crop size is fixed to 224 × 224, in principle S can take on any value not less than 224: for $S = 224$ the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for $S \gg 224$ the crop will correspond to a small part of the image, containing a small object or an object part.
訓(xùn)練圖像大小。令S為經(jīng)過等比例縮放的訓(xùn)練圖像的最小邊,ConvNet的輸入從該圖像中裁剪得到(我們也將S稱為訓(xùn)練尺度)。雖然裁剪尺寸固定為224×224,但原則上S可以取不小于224的任何值:當(dāng)$S=224$時(shí),裁剪圖像將捕獲整張圖像的統(tǒng)計(jì)信息,完整地橫跨訓(xùn)練圖像的最小邊;當(dāng)$S \gg 224$時(shí),裁剪圖像只對(duì)應(yīng)圖像的一小部分,包含一個(gè)小目標(biāo)或目標(biāo)的一部分。
We consider two approaches for setting the training scale S. The first is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics). In our experiments, we evaluated models trained at two fixed scales: $S = 256$ (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and $S = 384$. Given a ConvNet configuration, we first trained the network using $S = 256$. To speed-up training of the $S = 384$ network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of $10^{-3}$.
我們考慮兩種方法來設(shè)置訓(xùn)練尺度S。第一種是固定S,對(duì)應(yīng)于單尺度訓(xùn)練(注意,采樣得到的裁剪圖像中的內(nèi)容仍然可以表示多尺度的圖像統(tǒng)計(jì)信息)。在我們的實(shí)驗(yàn)中,我們?cè)u(píng)估了以兩個(gè)固定尺度訓(xùn)練的模型:$S = 256$(已在現(xiàn)有技術(shù)中廣泛使用(Krizhevsky等人,2012;Zeiler&Fergus,2013;Sermanet等,2014))和$S = 384$。給定一個(gè)ConvNet配置,我們首先使用$S=256$來訓(xùn)練網(wǎng)絡(luò)。為了加速$S = 384$網(wǎng)絡(luò)的訓(xùn)練,我們用以$S = 256$預(yù)訓(xùn)練的權(quán)重進(jìn)行初始化,並使用較小的初始學(xué)習(xí)率$10^{-3}$。
The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range $[S_{min},S_{max}]$ (we used $S_{min} = 256$ and $S_{max} = 512$). Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed $S = 384$.
設(shè)置S的第二種方法是多尺度訓(xùn)練,其中每張訓(xùn)練圖像通過從某個(gè)範(fàn)圍$[S_{min}, S_{max}]$(我們使用$S_{min} = 256$和$S_{max} = 512$)中隨機(jī)采樣S來單獨(dú)縮放。由于圖像中的目標(biāo)可能具有不同的大小,在訓(xùn)練期間考慮到這一點(diǎn)是有益的。這也可以看作通過尺度抖動(dòng)進(jìn)行訓(xùn)練集增強(qiáng),即訓(xùn)練單個(gè)模型在很寬的尺度範(fàn)圍內(nèi)識(shí)別目標(biāo)。出于速度考慮,我們通過微調(diào)具有相同配置、以固定的$S = 384$預(yù)訓(xùn)練的單尺度模型的所有層,來訓(xùn)練多尺度模型。
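下面給出一個(gè)示意性的預(yù)處理草圖(假設(shè)使用torchvision;原文的隨機(jī)RGB顏色偏移基于Krizhevsky等(2012)的PCA方法,這里用簡(jiǎn)單的顏色抖動(dòng)近似代替,僅作演示):

```python
import random
from torchvision import transforms
from PIL import Image

S_MIN, S_MAX = 256, 512

def scale_jitter_transform():
    # 每次調(diào)用重新采樣一個(gè)訓(xùn)練尺度S,對(duì)應(yīng)于每張圖像單獨(dú)縮放
    S = random.randint(S_MIN, S_MAX)
    return transforms.Compose([
        transforms.Resize(S),                  # 等比例縮放,使最小邊為S
        transforms.RandomCrop(224),            # 隨機(jī)裁剪出224×224的輸入
        transforms.RandomHorizontalFlip(),     # 隨機(jī)水平翻轉(zhuǎn)
        transforms.ColorJitter(brightness=0.1, saturation=0.1),  # 顏色擾動(dòng)(近似)
        transforms.ToTensor(),
    ])

img = Image.new('RGB', (640, 480))             # 佔(zhàn)位圖像,僅作演示
tensor = scale_jitter_transform()(img)         # 形狀: (3, 224, 224)
```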
3.2 TESTING
At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale). We note that Q is not necessarily equal to the training scale S (as we will show in Sect. 4, using several values of Q for each S leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.
3.2 測(cè)試
在測(cè)試時(shí),給定訓(xùn)練好的ConvNet和一張輸入圖像,按以下方式進(jìn)行分類。首先,將圖像等比例縮放到預(yù)定義的最小邊長(zhǎng),記為Q(我們也將其稱為測(cè)試尺度)。我們注意到,Q不一定等于訓(xùn)練尺度S(正如第4節(jié)將要展示的,對(duì)每個(gè)S使用幾個(gè)不同的Q值會(huì)帶來性能提升)。然后,網(wǎng)絡(luò)以類似于(Sermanet等人,2014)的方式密集地應(yīng)用在縮放后的測(cè)試圖像上。也就是說,先把全連接層轉(zhuǎn)換成卷積層(第一個(gè)FC層轉(zhuǎn)換為7×7卷積層,最后兩個(gè)FC層轉(zhuǎn)換為1×1卷積層),再將得到的全卷積網(wǎng)絡(luò)應(yīng)用于整張(未裁剪的)圖像。其結(jié)果是一張類別得分圖,其通道數(shù)等于類別數(shù)量,空間分辨率可變,取決于輸入圖像的大小。最后,為了獲得固定大小的圖像類別得分向量,對(duì)類別得分圖在空間上取平均(求和池化)。我們還通過水平翻轉(zhuǎn)圖像來增強(qiáng)測(cè)試集;將原始圖像和翻轉(zhuǎn)圖像的soft-max類別后驗(yàn)進(jìn)行平均,以獲得圖像的最終得分。
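下面的草圖示意如何把三個(gè)全連接層轉(zhuǎn)換為卷積層以構(gòu)成全卷積網(wǎng)絡(luò)(假設(shè)使用PyTorch;這里用隨機(jī)初始化的層演示權(quán)重的重排方式,fc1/fc2/fc3與feat均為假設(shè)的佔(zhàn)位對(duì)象):

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(512 * 7 * 7, 4096)             # 訓(xùn)練好的三個(gè)FC層(此處用隨機(jī)權(quán)重代替)
fc2 = nn.Linear(4096, 4096)
fc3 = nn.Linear(4096, 1000)

conv1 = nn.Conv2d(512, 4096, kernel_size=7)    # 第一個(gè)FC層 -> 7×7卷積
conv2 = nn.Conv2d(4096, 4096, kernel_size=1)   # 后兩個(gè)FC層 -> 1×1卷積
conv3 = nn.Conv2d(4096, 1000, kernel_size=1)
conv1.weight.data = fc1.weight.data.view(4096, 512, 7, 7)
conv2.weight.data = fc2.weight.data.view(4096, 4096, 1, 1)
conv3.weight.data = fc3.weight.data.view(1000, 4096, 1, 1)
for conv, fc in [(conv1, fc1), (conv2, fc2), (conv3, fc3)]:
    conv.bias.data = fc.bias.data

head = nn.Sequential(conv1, nn.ReLU(inplace=True), conv2, nn.ReLU(inplace=True), conv3)
# 對(duì)更大的特征圖(來自未裁剪的測(cè)試圖像)應(yīng)用后得到可變分辨率的類別得分圖,
# 再在空間上取平均得到固定長(zhǎng)度的類別得分向量:
feat = torch.randn(1, 512, 9, 9)               # 佔(zhàn)位:假設(shè)測(cè)試圖像較大,特征圖為9×9
scores = head(feat).mean(dim=[2, 3])           # 形狀: (1, 1000)
```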
Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net. Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured. While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (5 × 5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014).
由于全卷積網(wǎng)絡(luò)被應(yīng)用在整張圖像上,所以在測(cè)試時(shí)不需要采樣多個(gè)裁剪圖像(Krizhevsky等,2012),后者需要對(duì)每個(gè)裁剪圖像重新計(jì)算網(wǎng)絡(luò),效率較低。同時(shí),如Szegedy等人(2014)所做的那樣,使用大量的裁剪圖像可以提高準(zhǔn)確度,因?yàn)榕c全卷積網(wǎng)絡(luò)相比,它對(duì)輸入圖像的采樣更精細(xì)。此外,由于卷積邊界條件不同,多裁剪圖像評(píng)估與密集評(píng)估是互補(bǔ)的:將ConvNet應(yīng)用于某個(gè)裁剪圖像時(shí),卷積得到的特征圖用零填充,而在密集評(píng)估時(shí),同一裁剪圖像的填充自然來自圖像的相鄰部分(由于卷積和空間池化),這大大增加了整個(gè)網(wǎng)絡(luò)的感受野,因此捕獲了更多的上下文。雖然我們認(rèn)為在實(shí)踐中,多裁剪圖像增加的計(jì)算時(shí)間並不足以抵消其潛在的精度收益,但作為參考,我們還在每個(gè)尺度使用50個(gè)裁剪圖像(5×5規(guī)則網(wǎng)格,2次翻轉(zhuǎn))來評(píng)估我們的網(wǎng)絡(luò),即3個(gè)尺度上共150個(gè)裁剪圖像,與Szegedy等人(2014)在4個(gè)尺度上使用的144個(gè)裁剪圖像相當(dāng)。
3.3 IMPLEMENTATION DETAILS
Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above). Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.
3.3 實(shí)現(xiàn)細(xì)節(jié)
我們的實(shí)現(xiàn)源自公開的C++ Caffe工具箱(Jia,2013)(2013年12月從其分支而來),但包含了一些重大修改,使我們能夠在安裝于單個(gè)系統(tǒng)中的多個(gè)GPU上進(jìn)行訓(xùn)練和評(píng)估,也能在多個(gè)尺度上(如上所述)對(duì)全尺寸(未裁剪)圖像進(jìn)行訓(xùn)練和評(píng)估。多GPU訓(xùn)練利用數(shù)據(jù)並行性,把每一批訓(xùn)練圖像分成幾個(gè)GPU子批次,在各個(gè)GPU上並行處理。計(jì)算出各GPU子批次的梯度之后,將它們平均,得到整個(gè)批次的梯度。梯度計(jì)算在各GPU之間是同步的,因此結(jié)果與在單個(gè)GPU上訓(xùn)練完全相同。
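下面用一個(gè)單進(jìn)程的小例子示意這種同步數(shù)據(jù)並行的等價(jià)性(僅為概念演示,並非論文的多GPU實(shí)現(xiàn);假設(shè)各子批次大小相等且損失取均值):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 10), torch.randint(0, 5, (8,))

# 在整個(gè)批次上一次性計(jì)算的梯度
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# 拆成4個(gè)相等的子批次,分別求梯度后取平均(模擬各GPU的同步梯度平均)
chunk_grads = []
for xc, yc in zip(x.chunk(4), y.chunk(4)):
    model.zero_grad()
    criterion(model(xc), yc).backward()
    chunk_grads.append(model.weight.grad.clone())
avg_grad = torch.stack(chunk_grads).mean(dim=0)

print(torch.allclose(full_grad, avg_grad, atol=1e-6))   # True:結(jié)果與單GPU訓(xùn)練一致
```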
While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.
雖然最近提出了更復(fù)雜的加速ConvNet訓(xùn)練的方法(Krizhevsky,2014),對(duì)網(wǎng)絡(luò)的不同層采用模型並行和數(shù)據(jù)並行,但我們發(fā)現(xiàn),概念上簡(jiǎn)單得多的方案在現(xiàn)成的4-GPU系統(tǒng)上,與使用單個(gè)GPU相比已經(jīng)能提供3.75倍的加速。在配備四個(gè)NVIDIA Titan Black GPU的系統(tǒng)上,訓(xùn)練單個(gè)網(wǎng)絡(luò)需要2–3周,具體取決于架構(gòu)。
4 CLASSIFICATION EXPERIMENTS
Dataset. In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges). The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and top-5 error. The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories.
4 分類實(shí)驗(yàn)
數(shù)據(jù)集。在本節(jié)中,我們給出上述ConvNet架構(gòu)在ILSVRC-2012數(shù)據(jù)集(用于ILSVRC 2012–2014挑戰(zhàn)賽)上取得的圖像分類結(jié)果。該數(shù)據(jù)集包括1000個(gè)類別的圖像,分為三組:訓(xùn)練集(130萬(wàn)張圖像)、驗(yàn)證集(5萬(wàn)張圖像)和測(cè)試集(類別標(biāo)簽未公開的10萬(wàn)張圖像)。分類性能使用兩個(gè)指標(biāo)來評(píng)估:top-1和top-5錯(cuò)誤率。前者是多類分類誤差,即被錯(cuò)誤分類的圖像比例;后者是ILSVRC使用的主要評(píng)估標(biāo)準(zhǔn),計(jì)算為真實(shí)類別不在前5個(gè)預(yù)測(cè)類別之中的圖像比例。
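下面的小段代碼示意top-1和top-5錯(cuò)誤率的計(jì)算方式(假設(shè)使用PyTorch;logits與labels為佔(zhàn)位數(shù)據(jù),僅作演示):

```python
import torch

def topk_error(logits, labels, k):
    topk = logits.topk(k, dim=1).indices              # 每張圖像得分最高的k個(gè)類別
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return 1.0 - correct.float().mean().item()        # 真實(shí)類別不在前k個(gè)預(yù)測(cè)中的比例

logits = torch.randn(4, 1000)                          # 佔(zhàn)位輸出
labels = torch.randint(0, 1000, (4,))
print(topk_error(logits, labels, 1), topk_error(logits, labels, 5))
```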
For the majority of experiments, we used the validation set as the test set. Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a “VGG” team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014).
在大多數(shù)實(shí)驗(yàn)中,我們使用驗(yàn)證集作為測(cè)試集。某些實(shí)驗(yàn)也在測(cè)試集上進(jìn)行,並作為“VGG”團(tuán)隊(duì)參加ILSVRC-2014競(jìng)賽(Russakovsky等,2014)的條目提交到了官方的ILSVRC服務(wù)器。
4.1 SINGLE SCALE EVALUATION
We begin with evaluating the performance of individual ConvNet models at a single scale with the layer configurations described in Sect. 2.2. The test image size was set as follows: $Q = S$ for fixed S, and $Q = 0.5(S_{min} + S_{max})$ for jittered $S ∈ [S_{min}, S_{max}]$. The results are shown in Table 3.
Table 3: ConvNet performance at a single test scale.
4.1 單尺度評(píng)估
我們首先評(píng)估單個(gè)ConvNet模型在單一尺度上的性能,其層配置如2.2節(jié)所述。測(cè)試圖像大小設(shè)置如下:對(duì)于固定的S,取$Q = S$;對(duì)于抖動(dòng)的$S ∈ [S_{min}, S_{max}]$,取$Q = 0.5(S_{min} + S_{max})$。結(jié)果如表3所示。
表3:在單一測(cè)試尺度上的ConvNet性能
First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B–E).
首先,我們注意到,使用局部響應(yīng)歸一化的A-LRN網(wǎng)絡(luò)相比不帶任何歸一化層的模型A並沒有改善。因此,我們?cè)谳^深的架構(gòu)(B–E)中不采用歸一化。
Second, we observe that the classification error decreases with the increased ConvNet depth: from 11 layers in A to 19 layers in E. Notably, in spite of the same depth, the configuration C (which contains three 1 × 1 conv. layers), performs worse than the configuration D, which uses 3 × 3 conv. layers throughout the network. This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C). The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets. We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters.
第二,我們觀察到分類誤差隨著ConvNet深度的增加而減小:從A中的11層到E中的19層。值得注意的是,盡管深度相同,包含三個(gè)1×1卷積層的配置C比在整個(gè)網(wǎng)絡(luò)中都使用3×3卷積的配置D表現(xiàn)更差。這表明,雖然額外的非線性確實(shí)有幫助(C優(yōu)于B),但使用具有非平凡感受野的卷積濾波器來捕獲空間上下文同樣重要(D優(yōu)于C)。當(dāng)深度達(dá)到19層時(shí),我們架構(gòu)的錯(cuò)誤率趨于飽和,但更深的模型也許對(duì)更大的數(shù)據(jù)集有益。我們還將網(wǎng)絡(luò)B與一個(gè)含五個(gè)5×5卷積層的淺層網(wǎng)絡(luò)進(jìn)行了比較,后者通過把B中的每對(duì)3×3卷積層替換為單個(gè)5×5卷積層得到(二者具有相同的感受野,如第2.3節(jié)所述)。測(cè)得的淺層網(wǎng)絡(luò)top-1錯(cuò)誤率比網(wǎng)絡(luò)B(在中心裁剪圖像上)高7%,這證實(shí)了使用小濾波器的深層網(wǎng)絡(luò)優(yōu)于使用較大濾波器的淺層網(wǎng)絡(luò)。
Finally, scale jittering at training time ($S ∈ [256; 512]$) leads to significantly better results than training on images with fixed smallest side ($S = 256$ or $S = 384$), even though a single scale is used at test time. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.
最后,即使在測(cè)試時(shí)只使用單一尺度,訓(xùn)練時(shí)的尺度抖動(dòng)($S∈[256; 512]$)也比在固定最小邊($S = 256$或$S = 384$)的圖像上訓(xùn)練得到明顯更好的結(jié)果。這證實(shí)了通過尺度抖動(dòng)進(jìn)行訓(xùn)練集增強(qiáng)確實(shí)有助于捕獲多尺度的圖像統(tǒng)計(jì)信息。
4.2 MULTI-SCALE EVALUATION
Having evaluated the ConvNet models at a single scale, we now assess the effect of scale jittering at test time. It consists of running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. Considering that a large discrepancy between training and testing scales leads to a drop in performance, the models trained with fixed S were evaluated over three test image sizes, close to the training one: $Q = \{S - 32, S, S + 32\}$. At the same time, scale jittering at training time allows the network to be applied to a wider range of scales at test time, so the model trained with variable $S ∈ [S_{min}; S_{max}]$ was evaluated over a larger range of sizes $Q = \{S_{min}, 0.5(S_{min} + S_{max}), S_{max}\}$.
4.2 多尺度評(píng)估
在單一尺度上評(píng)估了ConvNet模型之后,我們現(xiàn)在評(píng)估測(cè)試時(shí)尺度抖動(dòng)的影響。它是指在一張測(cè)試圖像的幾個(gè)縮放版本(對(duì)應(yīng)不同的Q值)上運(yùn)行模型,然后對(duì)得到的類別后驗(yàn)進(jìn)行平均。考慮到訓(xùn)練尺度和測(cè)試尺度之間的巨大差異會(huì)導(dǎo)致性能下降,用固定S訓(xùn)練的模型在接近訓(xùn)練尺度的三個(gè)測(cè)試圖像尺度上進(jìn)行評(píng)估:$Q = \{S - 32, S, S + 32\}$。同時(shí),訓(xùn)練時(shí)的尺度抖動(dòng)使得網(wǎng)絡(luò)在測(cè)試時(shí)可以應(yīng)用于更寬的尺度範(fàn)圍,因此用可變的$S ∈ [S_{min}; S_{max}]$訓(xùn)練的模型在更大的尺寸範(fàn)圍$Q = \{S_{min}, 0.5(S_{min} + S_{max}), S_{max}\}$上進(jìn)行評(píng)估。
The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale, shown in Table 3). As before, the deepest configurations (D and E) perform the best, and scale jittering is better than training with a fixed smallest side S. Our best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error (highlighted in bold in Table 4). On the test set, the configuration E achieves 7.3% top-5 error.
Table 4: ConvNet performance at multiple test scales.
表4中給出的結(jié)果表明,測(cè)試時(shí)的尺度抖動(dòng)帶來了更好的性能(與表3所示的在單一尺度上評(píng)估同一模型相比)。和之前一樣,最深的配置(D和E)表現(xiàn)最好,並且尺度抖動(dòng)優(yōu)于使用固定最小邊S的訓(xùn)練。我們?cè)隍?yàn)證集上的最佳單網(wǎng)絡(luò)性能為24.8%/7.5%的top-1/top-5錯(cuò)誤率(在表4中以粗體突出顯示)。在測(cè)試集上,配置E取得了7.3%的top-5錯(cuò)誤率。
表4:在多個(gè)測(cè)試尺度上的ConvNet性能
4.3 MULTI-CROP EVALUATION
In Table 5 we compare dense ConvNet evaluation with multi-crop evaluation (see Sect. 3.2 for details). We also assess the complementarity of the two evaluation techniques by averaging their soft-max outputs. As can be seen, using multiple crops performs slightly better than dense evaluation, and the two approaches are indeed complementary, as their combination outperforms each of them. As noted above, we hypothesize that this is due to a different treatment of convolution boundary conditions.
Table 5: ConvNet evaluation techniques comparison. In all experiments the training scale S was sampled from [256; 512], and three test scales Q were considered: {256, 384, 512}.
4.3 多裁剪圖像評(píng)估
在表5中,我們將密集的ConvNet評(píng)估與多裁剪圖像評(píng)估進(jìn)行比較(細(xì)節(jié)參見第3.2節(jié))。我們還通過平均兩者的soft-max輸出來評(píng)估這兩種評(píng)估技術(shù)的互補(bǔ)性。可以看出,使用多裁剪圖像的表現(xiàn)比密集評(píng)估略好,而且這兩種方法確實(shí)是互補(bǔ)的,因?yàn)樗鼈兊慕M合優(yōu)于其中任何一種。如上所述,我們推測(cè)這是由于對(duì)卷積邊界條件的處理不同。
表5:ConvNet評(píng)估技術(shù)比較。在所有實(shí)驗(yàn)中,訓(xùn)練尺度S從[256; 512]中采樣,並考慮了三個(gè)測(cè)試尺度Q:{256, 384, 512}。
4.4 CONVNET FUSION
Up until now, we evaluated the performance of individual ConvNet models. In this part of the experiments, we combine the outputs of several models by averaging their soft-max class posteriors. This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
4.4 卷積網(wǎng)絡(luò)融合
到目前為止,我們?cè)u(píng)估的是單個(gè)ConvNet模型的性能。在這部分實(shí)驗(yàn)中,我們通過對(duì)soft-max類別后驗(yàn)進(jìn)行平均來組合多個(gè)模型的輸出。由于模型之間的互補(bǔ)性,這提高了性能,並且在2012年(Krizhevsky等,2012)和2013年(Zeiler&Fergus,2013;Sermanet等,2014)ILSVRC的頂級(jí)提交中也使用了這種做法。
The results are shown in Table 6. By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers). The resulting ensemble of 7 networks has 7.3% ILSVRC test error. After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation. For reference, our best-performing single model achieves 7.1% error (model E, Table 5).
Table 6: Multiple ConvNet fusion results.
結(jié)果如表6所示。在提交ILSVRC時(shí),我們只訓(xùn)練了單尺度網(wǎng)絡(luò)以及一個(gè)多尺度模型D(僅對(duì)全連接層而不是所有層進(jìn)行微調(diào))。由此得到的7個(gè)網(wǎng)絡(luò)的組合取得了7.3%的ILSVRC測(cè)試誤差。提交之后,我們考慮只組合兩個(gè)表現(xiàn)最好的多尺度模型(配置D和E),使用密集評(píng)估時(shí)測(cè)試誤差降到7.0%,結(jié)合密集評(píng)估和多裁剪圖像評(píng)估時(shí)降到6.8%。作為參考,我們表現(xiàn)最佳的單個(gè)模型達(dá)到7.1%的誤差(模型E,表5)。
表6:多個(gè)卷積網(wǎng)絡(luò)融合結(jié)果
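下面的草圖示意通過平均soft-max類別后驗(yàn)來融合多個(gè)模型(假設(shè)使用PyTorch;這里用兩個(gè)隨機(jī)初始化的小模型作為佔(zhàn)位,僅作演示):

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, images):
    probs = [F.softmax(m(images), dim=1) for m in models]   # 每個(gè)模型的類別后驗(yàn)
    return torch.stack(probs).mean(dim=0)                   # 平均后驗(yàn)作為最終得分

# 佔(zhàn)位演示:用兩個(gè)隨機(jī)初始化的小模型代替實(shí)際的多尺度模型
models = [torch.nn.Linear(10, 1000), torch.nn.Linear(10, 1000)]
images = torch.randn(2, 10)
final_scores = ensemble_predict(models, images)             # 形狀: (2, 1000)
print(final_scores.argmax(dim=1))
```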
4.5 COMPARISON WITH THE STATE OF THE ART
Finally, we compare our results with the state of the art in Table 7. In the classification task of ILSVRC-2014 challenge (Russakovsky et al., 2014), our “VGG” team secured the 2nd place with 7.3% test error using an ensemble of 7 models. After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models.
Table 7: Comparison with the state of the art in ILSVRC classification. Our method is denoted as “VGG”. Only the results obtained without outside training data are reported.
4.5 與最新技術(shù)比較
最后,我們?cè)诒?中將我們的結(jié)果與目前的最好水平進(jìn)行比較。在ILSVRC-2014挑戰(zhàn)賽的分類任務(wù)(Russakovsky等,2014)中,我們的“VGG”團(tuán)隊(duì)使用7個(gè)模型的組合,以7.3%的測(cè)試誤差獲得了第二名。提交之后,我們使用2個(gè)模型的組合將錯(cuò)誤率降低到6.8%。
表7:在ILSVRC分類任務(wù)上與目前最好水平的比較。我們的方法記為“VGG”。僅報(bào)告未使用外部訓(xùn)練數(shù)據(jù)得到的結(jié)果。
As can be seen from Table 7, our very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it. This is remarkable, considering that our best result is achieved by combining just two models —— significantly less than used in most ILSVRC submissions. In terms of the single-net performance, our architecture achieves the best result (7.0% test error), outperforming a single GoogLeNet by 0.9%. Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.
從表7可以看出,我們的非常深的ConvNets顯著優(yōu)于在ILSVRC-2012和ILSVRC-2013競(jìng)賽中取得最佳成績(jī)的前一代模型。我們的結(jié)果與分類任務(wù)獲勝者(GoogLeNet,錯(cuò)誤率6.7%)相比也具有競(jìng)爭(zhēng)力,並且大大優(yōu)于ILSVRC-2013的獲勝提交Clarifai,后者使用外部訓(xùn)練數(shù)據(jù)時(shí)錯(cuò)誤率為11.2%,不使用外部數(shù)據(jù)時(shí)為11.7%⌒紤]到我們最好的結(jié)果僅通過組合兩個(gè)模型就已實(shí)現(xiàn),明顯少于大多數(shù)ILSVRC提交所用的模型數(shù)量,這是非常顯著的。在單網(wǎng)絡(luò)性能方面,我們的架構(gòu)取得了最好的結(jié)果(7.0%的測(cè)試誤差),超過單個(gè)GoogLeNet約0.9%。值得注意的是,我們並沒有偏離LeCun等人(1989)的經(jīng)典ConvNet架構(gòu),而是通過大幅增加深度對(duì)其加以改進(jìn)。
5 CONCLUSION
In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations.
5 結(jié)論
在這項(xiàng)工作中,我們?cè)u(píng)估了非常深的卷積網(wǎng)絡(luò)(最多19個(gè)權(quán)重層)在大規(guī)模圖像分類中的表現(xiàn)。結(jié)果證明,表示深度有利于分類精度,並且使用深度大大增加的傳統(tǒng)ConvNet架構(gòu)(LeCun等,1989;Krizhevsky等,2012)可以在ImageNet挑戰(zhàn)賽數(shù)據(jù)集上取得最佳性能。在附錄中,我們還展示了我們的模型可以很好地泛化到各種各樣的任務(wù)和數(shù)據(jù)集上,能夠匹敵或超越圍繞較淺圖像表示構(gòu)建的更復(fù)雜的識(shí)別流程。我們的結(jié)果再次證實(shí)了深度在視覺表示中的重要性。
ACKNOWLEDGEMENTS
This work was supported by ERC grant VisRec no. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
致謝
這項(xiàng)工作得到了ERC資助VisRec(編號(hào)228180)的支持。我們非常感謝NVIDIA公司為本研究捐贈(zèng)所用的GPU。
REFERENCES
Bell, S., Upchurch, P., Snavely, N., and Bala, K. Material recognition in the wild with the materials in context database. CoRR, abs/1412.0623, 2014.
Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC., 2014.
Cimpoi, M., Maji, S., and Vedaldi, A. Deep convolutional filter banks for texture recognition and segmentation. CoRR, abs/1411.6836, 2014.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. Flexible, high performance convolutional neural networks for image classification. In IJCAI, pp. 1237–1242, 2011.
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. Large scale distributed deep networks. In NIPS, pp. 1232–1240, 2012.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C., Winn, J., and Zisserman, A. The Pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE CVPR Workshop of Generative Model Based Vision, 2004.
Girshick, R. B., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524v5, 2014. Published in Proc. CVPR, 2014.
Gkioxari, G., Girshick, R., and Malik, J. Actions and attributes from wholes and parts. CoRR, abs/1412.2604, 2014.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, volume 9, pp. 249–256, 2010.
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In Proc. ICLR, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729v2, 2014.
Hoai, M. Regularized max pooling for image categorization. In Proc. BMVC., 2014.
Howard, A. G. Some improvements on deep convolutional neural network based image classification. In Proc. ICLR, 2014.
Jia, Y. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014.
Kiros, R., Salakhutdinov, R., and Zemel, R. S. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
Lin, M., Chen, Q., and Yan, S. Network in network. In Proc. ICLR, 2014.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. In Proc. CVPR, 2014.
Perronnin, F., Sánchez, J., and Mensink, T. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV, 2010.
Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. CNN Features off-the-shelf: an Astounding Baseline for Recognition. CoRR, abs/1403.6382, 2014.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In Proc. ICLR, 2014.
Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. CoRR, abs/1406.2199, 2014. Published in Proc. NIPS, 2014.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
Wei, Y., Xia, W., Huang, J., Ni, B., Dong, J., Zhao, Y., and Yan, S. CNN: Single-label to multi-label. CoRR, abs/1406.5726, 2014.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. Published in Proc. ECCV, 2014.