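Author: Tyan
Blog: noahsnail.com | CSDN | Jianshu
Statement: the author translates papers for study purposes only. In case of infringement, please contact the author to have the post removed. Thank you!
Collection of translated papers: https://github.com/SnailTyan/deep-learning-papers-translation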
Going Deeper with Convolutions
Abstract
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. Through a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer deep network, the quality of which is assessed in the context of classification and detection.
1. Introduction
In the last three years, our object classification and detection capabilities have dramatically improved due to advances in deep learning and convolutional networks [10]. One encouraging piece of news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. For example, the top entries in the ILSVRC 2014 competition used no new data sources besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate. On the object detection front, the biggest gains have not come from naive application of bigger and bigger deep networks, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. [6].
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms, especially their power and memory use, gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up being a purely academic curiosity, but could be put to real-world use, even on large datasets, at a reasonable cost.
In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in Network paper by Lin et al. [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first, in the sense that we introduce a new level of organization in the form of the “Inception module”, and second, in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al. [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, where it significantly outperforms the current state of the art.
2. Related Work
Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as ImageNet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19].
Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.
Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. In their model, additional 1 × 1 convolutional layers are added to the network, increasing its depth. We use this approach heavily in our architecture. However, in our setting, 1 × 1 convolutions have a dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks that would otherwise limit the size of our networks. This allows for increasing not just the depth, but also the width of our networks without a significant performance penalty.
Finally, the current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: utilizing low-level cues such as color and texture in order to generate object location proposals in a category-agnostic fashion, and using CNN classifiers to identify object categories at those locations. Such a two-stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the strong classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.
3. Motivation and High Level Considerations
The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes increasing both the depth (the number of network levels) and the width (the number of units at each level). This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This is a major bottleneck, as strongly labeled datasets are laborious and expensive to obtain, often requiring expert human raters to distinguish between various fine-grained visual categories such as those in ImageNet (even in the 1000-class ILSVRC subset), as shown in Figure 1.
Figure 1: Two distinct classes from the 1000 classes of the ILSVRC 2014 classification challenge. Domain knowledge is required to distinguish between these classes.
The other drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up close to zero), then much of the computation is wasted. As the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of performance.
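The quadratic growth mentioned above can be checked with a little arithmetic. The following is a minimal sketch, not taken from the paper: the function names, the 28×28 feature map, and the channel counts are illustrative assumptions, and the cost model counts only multiply-adds of stride-1, same-padded convolutions.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel):
    """Multiply-adds for one stride-1, same-padded convolution layer."""
    return height * width * in_channels * out_channels * kernel * kernel


def chained_cost(height, width, in_channels, filters, kernel=3):
    """Costs of two chained convolutions that each output `filters` channels."""
    first = conv_multiply_adds(height, width, in_channels, filters, kernel)
    second = conv_multiply_adds(height, width, filters, filters, kernel)
    return first, second


# Hypothetical sizes chosen only for illustration.
base_first, base_second = chained_cost(28, 28, 192, 128)
wide_first, wide_second = chained_cost(28, 28, 192, 256)

# The second layer has both more input and more output channels, so doubling
# the filter count multiplies its cost by four; the first layer only doubles.
first_layer_ratio = wide_first / base_first
second_layer_ratio = wide_second / base_second
print(first_layer_ratio, second_layer_ratio)
```

The second layer is where the quadratic term appears, because its input width is the first layer's output width.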
A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers by sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well-known Hebbian principle (neurons that fire together, wire together) suggests that the underlying idea is applicable even under less strict conditions, in practice.
Unfortunately, today’s computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off. The gap is widened yet further by the use of steadily improving and highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision-oriented machine learning systems utilize sparsity in the spatial domain just by virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, yet the trend changed back to full connections with [9] in order to further optimize parallel computation. Current state-of-the-art architectures for computer vision have uniform structure. The large number of filters and greater batch size allows for the efficient use of dense computation.
This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.
The Inception architecture started out as a case study for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, modest gains were observed early on when compared with reference networks based on [12]. With a bit of tuning the gap widened, and Inception proved to be especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly in separation, they turned out to be close to optimal locally. One must be cautious though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that have led to its construction. Making sure of this would require a much more thorough analysis and verification.
4. Architectural Details
The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggest a layer-by-layer construction where one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. Thus, we would end up with a lot of clusters concentrated in a single region, and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this decision was based more on convenience than on necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have an additional beneficial effect, too (see Figure 2(a)).
As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease. This suggests that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.
One big problem with the above modules, at least in this naive form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: the number of output filters of the pooling branch equals the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do so very inefficiently, leading to a computational blow-up within a few stages.
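To make the stage-to-stage growth concrete, here is a rough sketch of how the channel count of a naive module evolves when several are stacked. The helper names and the branch sizes (64, 128, 32) are assumptions chosen for illustration; the only fact taken from the text is that the pooling branch passes through every input channel and all branch outputs are concatenated.

```python
def naive_output_channels(in_channels, n1x1, n3x3, n5x5):
    """Concatenated width of a naive module: three conv branches plus pooling.

    The pooling branch has no filters of its own, so it contributes exactly
    `in_channels` channels to the concatenation.
    """
    return n1x1 + n3x3 + n5x5 + in_channels


def stack_naive_modules(in_channels, branch_filters, stages):
    """Channel width after each of `stages` stacked naive modules."""
    widths = [in_channels]
    for _ in range(stages):
        widths.append(naive_output_channels(widths[-1], *branch_filters))
    return widths


# Hypothetical configuration: 192 input channels, the same branches each stage.
widths = stack_naive_modules(192, (64, 128, 32), stages=4)
print(widths)  # [192, 416, 640, 864, 1088]
```

Because every 3×3 and 5×5 branch in the next stage convolves over this ever wider input, the cost per stage keeps rising even though the branch filter counts stay fixed.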
This leads to the second idea of the Inception architecture: judiciously reducing dimension wherever the computational requirements would otherwise increase too much. This is based on the success of embeddings: even low-dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form, and compressed information is harder to process. The representation should be kept sparse at most places (as required by the conditions of [2]), and the signals should be compressed only when they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation, making them dual-purpose. The final result is depicted in Figure 2(b).
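The saving from 1×1 reductions can be estimated by counting weights. The sketch below is illustrative only: the function names are made up, biases are ignored, and the filter counts (192 input channels; 64, 96/128, 16/32, 32) are assumed to be those of the inception (3a) row of the paper's Table 1, which is not reproduced in this section.

```python
def conv_weights(in_channels, out_channels, kernel):
    """Number of weights in one convolution layer (biases ignored)."""
    return in_channels * out_channels * kernel * kernel


def naive_module_weights(c_in, n1x1, n3x3, n5x5):
    """All three conv branches read the full input; pooling has no weights."""
    return (conv_weights(c_in, n1x1, 1)
            + conv_weights(c_in, n3x3, 3)
            + conv_weights(c_in, n5x5, 5))


def reduced_module_weights(c_in, n1x1, r3x3, n3x3, r5x5, n5x5, pool_proj):
    """1x1 reductions precede the 3x3 and 5x5 branches; pooling is projected."""
    return (conv_weights(c_in, n1x1, 1)
            + conv_weights(c_in, r3x3, 1) + conv_weights(r3x3, n3x3, 3)
            + conv_weights(c_in, r5x5, 1) + conv_weights(r5x5, n5x5, 5)
            + conv_weights(c_in, pool_proj, 1))


# Assumed inception (3a) configuration.
naive = naive_module_weights(192, 64, 128, 32)
reduced = reduced_module_weights(192, 64, 96, 128, 16, 32, 32)
print(naive, reduced)  # 387072 163328
```

Under these assumptions the reduced module needs well under half the weights of the naive one, and its concatenated output is 64 + 128 + 32 + 32 = 256 channels instead of 64 + 128 + 32 + 192 = 416.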
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.
A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.
The improved use of computational resources allows for increasing both the width of each stage and the number of stages without getting into computational difficulties. One can also utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it. We have found that all the available knobs and levers allow for a controlled balancing of computational resources, resulting in networks that are 3-10× faster than similarly performing networks with non-Inception architecture; however, this requires careful manual design at this point.
5. GoogLeNet
By the “GoogLeNet” name we refer to the particular incarnation of the Inception architecture used in our submission for the ILSVRC 2014 competition. We also used one deeper and wider Inception network with slightly superior quality, but adding it to the ensemble seemed to improve the results only marginally. We omit the details of that network, as empirical evidence suggests that the influence of the exact architectural parameters is relatively minor. Table 1 illustrates the most common instance of Inception used in the competition. This network (trained with different image patch sampling methods) was used for 6 out of the 7 models in our ensemble.
All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224 in the RGB color space with zero mean. “#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.
The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially those with a low memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. The exact number depends on how layers are counted by the machine learning infrastructure. The use of average pooling before the classifier is based on [12], although our implementation has an additional linear layer. The linear layer enables us to easily adapt our networks to other label sets; however, it is used mostly for convenience and we do not expect it to have a major effect. We found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%; however, the use of dropout remained essential even after removing the fully connected layers.
Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. The strong performance of shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, discrimination in the lower stages in the classifier was expected. This was thought to combat the vanishing gradient problem while providing regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded. Later control experiments have shown that the effect of the auxiliary networks is relatively minor (around 0.5%) and that only one of them is required to achieve the same effect.
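The way the auxiliary losses enter training can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the loss values are placeholder floats rather than real cross-entropy outputs, and only the 0.3 discount weight and the fact that the auxiliary heads are dropped at inference come from the text.

```python
AUX_WEIGHT = 0.3  # discount weight for auxiliary classifier losses (from the text)


def training_loss(main_loss, aux_losses, aux_weight=AUX_WEIGHT):
    """Total training loss: main loss plus discounted auxiliary losses."""
    return main_loss + aux_weight * sum(aux_losses)


def inference_prediction(main_scores, aux_scores=None):
    """At inference time the auxiliary heads are ignored entirely."""
    return max(range(len(main_scores)), key=lambda i: main_scores[i])


# Placeholder loss values for the main head and the (4a) and (4d) heads.
total = training_loss(2.0, [3.0, 1.0])
predicted = inference_prediction([0.1, 0.7, 0.2], aux_scores=[0.9, 0.05, 0.05])
print(total, predicted)
```

The auxiliary scores are accepted by `inference_prediction` only to show that they have no influence on the result.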
The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:
- An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a) stage, and 4×4×528 for the (4d) stage.
- A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
- A fully connected layer with 1024 units and rectified linear activation.
- A dropout layer with 70% ratio of dropped outputs.
- A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).
A schematic view of the resulting network is depicted in Figure 3.
Figure 3: GoogLeNet network with all the bells and whistles.
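The shapes in the list above can be checked with a short calculation. The sketch below assumes that the (4a) and (4d) modules output 14×14 feature maps and that the pooling is unpadded; neither is stated in this section, but both are consistent with the 4×4 result quoted above. Function and key names are hypothetical and biases are ignored.

```python
def pooled_size(size, window, stride):
    """Output length along one axis for an unpadded pooling window."""
    return (size - window) // stride + 1


def aux_head_summary(in_channels, spatial=14, num_classes=1000):
    """Pooled output shape and per-layer weight counts of one auxiliary head."""
    side = pooled_size(spatial, window=5, stride=3)
    flattened = side * side * 128  # after the 1x1 convolution with 128 filters
    weights = {
        "conv1x1": in_channels * 128,
        "fc": flattened * 1024,
        "classifier": 1024 * num_classes,
    }
    return (side, side, in_channels), weights


shape_4a, weights_4a = aux_head_summary(512)
shape_4d, weights_4d = aux_head_summary(528)
print(shape_4a, shape_4d)  # (4, 4, 512) (4, 4, 528)
print(weights_4a)
```

Under these assumptions most of the weights of each head sit in the 1024-unit fully connected layer, which may be one reason the head uses a 70% dropout ratio, although the text does not say so.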
6. Training Methodology
GoogLeNet networks were trained using the DistBelief [4] distributed machine learning system using a modest amount of model and data parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17] and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
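The schedule and the averaging step can be sketched as follows. This is an illustration only: the base learning rate is not given in the text, so `base_lr=0.01` below is a placeholder, and the Polyak averaging is written as a plain running mean of weight snapshots, which is one common form and not necessarily the variant the authors used.

```python
def learning_rate(epoch, base_lr, decay=0.04, every=8):
    """Step schedule: multiply the rate by (1 - decay) once every `every` epochs."""
    return base_lr * (1.0 - decay) ** (epoch // every)


class PolyakAverage:
    """Running mean of weight vectors seen so far (assumed averaging form)."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Incremental mean update, one entry per weight.
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]
        return self.mean


rates = [learning_rate(e, base_lr=0.01) for e in (0, 7, 8, 16)]
averager = PolyakAverage()
averager.update([1.0, 2.0])
averaged = averager.update([3.0, 4.0])
print(rates, averaged)
```

In this sketch the averaged weights, not the last iterate, would be the ones evaluated at inference time.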
Image sampling methods changed substantially over the months leading up to the competition, and already converged models were trained further with other options, sometimes in conjunction with changed hyperparameters such as dropout and the learning rate. Therefore, it is hard to give definitive guidance on the single most effective way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling patches of various sizes, whose size is distributed evenly between 8% and 100% of the image area, with aspect ratio constrained to the interval $[\frac {3} {4}, \frac {4} {3}]$. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of the training data.
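The patch-sampling prescription can be sketched as follows. The text fixes only the area range (8% to 100%) and the aspect-ratio interval; drawing the aspect ratio uniformly, retrying when a patch does not fit, and falling back to the whole image are assumptions of this sketch, and `sample_crop` is a hypothetical name.

```python
import math
import random


def sample_crop(width, height, rng, max_tries=10):
    """Return (x, y, w, h) of a random patch inside a width x height image."""
    for _ in range(max_tries):
        # Patch area drawn uniformly between 8% and 100% of the image area.
        area = rng.uniform(0.08, 1.0) * width * height
        # Aspect ratio (w / h) drawn from [3/4, 4/3]; uniform draw is assumed.
        ratio = rng.uniform(3 / 4, 4 / 3)
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    # Assumed fallback: use the whole image if no patch fit.
    return 0, 0, width, height


rng = random.Random(0)
for _ in range(3):
    print(sample_crop(640, 480, rng))
```

The sampled patch would then be resized to the network's input resolution before training.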
圖像采樣方法在過去幾個月的競賽中發(fā)生了重大變化慰丛,并且已收斂的模型在其他選項上進(jìn)行了訓(xùn)練笑跛,有時還結(jié)合著超參數(shù)的改變,例如丟棄和學(xué)習(xí)率揽祥。因此卓鹿,很難對訓(xùn)練這些網(wǎng)絡(luò)的最有效的單一方式給出明確指導(dǎo)。讓事情更復(fù)雜的是贤笆,受[8]的啟發(fā)蝇棉,一些模型主要是在相對較小的裁剪圖像進(jìn)行訓(xùn)練,其它模型主要是在相對較大的裁剪圖像上進(jìn)行訓(xùn)練芥永。然而篡殷,一個經(jīng)過驗證的方案在競賽后工作地很好,包括各種尺寸的圖像塊的采樣埋涧,它的尺寸均勻分布在圖像區(qū)域的8%——100%之間板辽,方向角限制為$[\frac {3} {4}, \frac {4} {3}]$之間奇瘦。另外,我們發(fā)現(xiàn)Andrew Howard[8]的光度扭曲對于克服訓(xùn)練數(shù)據(jù)成像條件的過擬合是有用的劲弦。
7. ILSVRC 2014 Classification Challenge Setup and Results
The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the ImageNet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top 5, regardless of its rank among them. The challenge uses the top-5 error rate for ranking purposes.
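The two metrics differ only in how many of the highest-scoring classes are checked against the ground truth. A minimal sketch, with made-up scores for three images over six classes (the data and the function name `top_k_error` are illustrative, not from the challenge):

```python
def top_k_error(scores, labels, k=5):
    """Fraction of images whose true label is not among the k best scores."""
    errors = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


scores = [
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],
    [0.05, 0.10, 0.15, 0.20, 0.25, 0.25],
]
labels = [1, 0, 0]

print("top-1 error:", top_k_error(scores, labels, k=1))
print("top-5 error:", top_k_error(scores, labels, k=5))
```

In this toy example the first image is wrong under top-1 but right under top-5, while the third image is wrong under both because its true class has the lowest score.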
7. ILSVRC 2014分類挑戰(zhàn)賽設(shè)置和結(jié)果
ILSVRC 2014分類挑戰(zhàn)賽包括將圖像分類到ImageNet層級中1000個葉子結(jié)點類別的任務(wù)耳标。訓(xùn)練圖像大約有120萬張,驗證圖像有5萬張邑跪,測試圖像有10萬張麻捻。每一張圖像與一個實際類別相關(guān)聯(lián),性能度量基于分類器預(yù)測的最高分呀袱。通常報告兩個數(shù)字:top-1準(zhǔn)確率贸毕,比較實際類別和第一個預(yù)測類別,top-5錯誤率夜赵,比較實際類別與前5個預(yù)測類別:如果圖像實際類別在top-5中明棍,則認(rèn)為圖像分類正確,不管它在top-5中的排名寇僧。挑戰(zhàn)賽使用top-5錯誤率來進(jìn)行排名摊腋。
We participated in the challenge with no external data used for training. In addition to the training techniques described above, we adopted a set of techniques during testing to obtain higher performance, which we describe next.
- We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, due to an oversight) and learning rate policies. They differed only in sampling methodologies and the randomized input image order.
- During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resized the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, and took the left, center and right squares of these resized images (for portrait images, the top, center and bottom squares). For each square, we then took the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This leads to 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).
- The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they led to performance inferior to that of simple averaging.
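The crop count and the final averaging step described in the list above can be sketched as follows. Box coordinates, the rounding of the longer side, and all names (`enumerate_crops`, `average_probabilities`) are assumptions of this sketch rather than details given in the text.

```python
SCALES = (256, 288, 320, 352)
CROP = 224


def enumerate_crops(width, height):
    """List (scale, square_box, crop, mirrored) descriptors for one image.

    Boxes are (left, top, right, bottom) in the resized image. `crop` is a
    224x224 box, or the string "resize" for the whole square rescaled to 224.
    """
    crops = []
    for s in SCALES:
        # Resize so the shorter side equals s, keeping the aspect ratio.
        if width >= height:  # landscape: left, center, right squares
            long_side = round(width * s / height)
            offsets = [(o, 0) for o in (0, (long_side - s) // 2, long_side - s)]
        else:  # portrait: top, center, bottom squares
            long_side = round(height * s / width)
            offsets = [(0, o) for o in (0, (long_side - s) // 2, long_side - s)]
        for ox, oy in offsets:
            square = (ox, oy, ox + s, oy + s)
            m = s - CROP
            # Four corners plus the center of the square.
            starts = [(0, 0), (m, 0), (0, m), (m, m), (m // 2, m // 2)]
            views = [(ox + dx, oy + dy, ox + dx + CROP, oy + dy + CROP)
                     for dx, dy in starts]
            views.append("resize")
            for view in views:
                for mirrored in (False, True):
                    crops.append((s, square, view, mirrored))
    return crops


def average_probabilities(prob_rows):
    """Element-wise mean of softmax vectors (one row per crop and model)."""
    n = len(prob_rows)
    return [sum(col) / n for col in zip(*prob_rows)]


print(len(enumerate_crops(500, 375)))
print(average_probabilities([[0.25, 0.75], [0.75, 0.25]]))
```

The loop structure mirrors the product 4×3×6×2: four scales, three squares per scale, six views per square, and two mirror states per view.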
我們參加競賽時沒有使用外部數(shù)據(jù)來訓(xùn)練。除了本文中前面提到的訓(xùn)練技術(shù)之外嘁傀,我們在獲得更高性能的測試中采用了一系列技巧兴蒸,描述如下。
- 我們獨立訓(xùn)練了7個版本的相同的GoogLeNet模型(包括一個更廣泛的版本)细办,并用它們進(jìn)行了整體預(yù)測橙凳。這些模型的訓(xùn)練具有相同的初始化(甚至具有相同的初始權(quán)重,由于監(jiān)督)和學(xué)習(xí)率策略笑撞。它們僅在采樣方法和隨機(jī)輸入圖像順序方面不同岛啸。
- 在測試中,我們采用比Krizhevsky等人[9]更積極的裁剪方法茴肥。具體來說坚踩,我們將圖像歸一化為四個尺度,其中較短維度(高度或?qū)挾龋┓謩e為256瓤狐,288瞬铸,320和352,取這些歸一化的圖像的左础锐,中嗓节,右方塊(在肖像圖片中,我們采用頂部郁稍,中心和底部方塊)赦政。對于每個方塊胜宇,我們將采用4個角以及中心224×224裁剪圖像以及方塊尺寸歸一化為224×224耀怜,以及它們的鏡像版本恢着。這導(dǎo)致每張圖像會得到4×3×6×2 = 144的裁剪圖像。前一年的輸入中财破,Andrew Howard[8]采用了類似的方法掰派,經(jīng)過我們實證驗證,其方法略差于我們提出的方案左痢。我們注意到靡羡,在實際應(yīng)用中,這種積極裁剪可能是不必要的俊性,因為存在合理數(shù)量的裁剪圖像后略步,更多裁剪圖像的好處會變得很微小(正如我們后面展示的那樣)定页。
- softmax概率在多個裁剪圖像上和所有單個分類器上進(jìn)行平均趟薄,然后獲得最終預(yù)測。在我們的實驗中典徊,我們分析了驗證數(shù)據(jù)的替代方法杭煎,例如裁剪圖像上的最大池化和分類器的平均,但是它們比簡單平均的性能略遜卒落。
In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.
在本文的其余部分羡铲,我們分析了有助于最終提交整體性能的多個因素。
Our final submission to the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking first among the participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about a 40% relative reduction compared to the previous year’s best approach (Clarifai), both of which used external data for training the classifiers. Table 2 shows the statistics of some of the top-performing approaches over the past 3 years.
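A relative reduction compares the drop in error with the earlier error, not with 100%. The sketch below uses only the numbers quoted in the paragraph above and inverts the formula to show which baseline error rates those percentages imply; the function names are hypothetical, and the implied baselines are back-calculated, not read from Table 2.

```python
def relative_reduction(old_error, new_error):
    """Fraction of the old error that was removed."""
    return (old_error - new_error) / old_error


def implied_baseline(new_error, reduction):
    """Old error rate implied by a new error and a relative reduction."""
    return new_error / (1.0 - reduction)


googlenet = 6.67  # top-5 error in percent, from the text

for name, reduction in (("SuperVision (2012)", 0.565), ("Clarifai (2013)", 0.40)):
    baseline = implied_baseline(googlenet, reduction)
    print(f"{name}: implied top-5 error of about {baseline:.1f}%")
```

Under this reading the quoted reductions correspond to earlier top-5 errors of roughly 15.3% and 11.1%.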
竞赛中我们的最终提交在验证集和测试集上得到了top-5 6.67%的错误率,在其它的参与者中排名第一。与2012年的SuperVision方法相比相对减少了56.5%,与前一年的最佳方法(Clarifai)相比相对减少了约40%,这两种方法都使用了外部数据训练分类器。表2显示了过去三年中一些表现最好的方法的统计。
We also analyze and report the performance of multiple testing choices in Table 3, by varying the number of models and the number of crops used when predicting an image. When we use one model, we choose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order not to overfit to the testing data statistics.
我們也分析報告了多種測試選擇的性能檐盟,當(dāng)預(yù)測圖像時通過改變表3中使用的模型數(shù)目和裁剪圖像數(shù)目褂萧。
8. ILSVRC 2014 Detection Challenge Setup and Results
The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the groundtruth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary. Results are reported using the mean average precision (mAP). The approach taken by GoogLeNet for detection is similar to the R-CNN by [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the selective search [20] approach with multi-box [5] predictions for higher object bounding box recall. In order to reduce the number of false positives, the superpixel size was increased by 2×. This halves the proposals coming from the selective search algorithm. We added back 200 region proposals coming from multi-box [5] resulting, in total, in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 GoogLeNets when classifying each region. This leads to an increase in accuracy from 40% to 43.9%. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.
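The matching criterion in the paragraph above (same class and at least 50% overlap by the Jaccard index) can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples in continuous coordinates; the official evaluation may treat pixel boundaries differently, and the function names are hypothetical.

```python
def jaccard_index(box_a, box_b):
    """Intersection area divided by union area of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_class, pred_box, true_class, true_box,
                         threshold=0.5):
    """A detection counts if the class matches and the overlap is enough."""
    return (pred_class == true_class
            and jaccard_index(pred_box, true_box) >= threshold)


truth = ("dog", (0, 0, 10, 10))
print(is_correct_detection("dog", (1, 1, 10, 10), *truth))
print(is_correct_detection("dog", (5, 0, 15, 10), *truth))
print(is_correct_detection("cat", (0, 0, 10, 10), *truth))
```

The second example overlaps the ground truth by only one third (50 units of intersection over 150 of union), so it would be counted as a false positive despite the correct class.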
8. ILSVRC 2014檢測挑戰(zhàn)賽設(shè)置和結(jié)果
ILSVRC檢測任務(wù)是為了在200個可能的類別中生成圖像中目標(biāo)的邊界框。如果檢測到的對象匹配的它們實際類別并且它們的邊界框重疊至少50%(使用Jaccard索引)葵萎,則將檢測到的對象記為正確导犹。無關(guān)的檢測記為假陽性且被懲罰。與分類任務(wù)相反羡忘,每張圖像可能包含多個對象或沒有對象谎痢,并且它們的尺度可能是變化的。報告的結(jié)果使用平均精度均值(mAP)卷雕。GoogLeNet檢測采用的方法類似于R-CNN[6]节猿,但用Inception模塊作為區(qū)域分類器進(jìn)行了增強(qiáng)。此外,為了更高的目標(biāo)邊界框召回率滨嘱,通過選擇搜索[20]方法和多箱[5]預(yù)測相結(jié)合改進(jìn)了區(qū)域生成步驟峰鄙。為了減少假陽性的數(shù)量,超分辨率的尺寸增加了2倍太雨。這將選擇搜索算法的區(qū)域生成減少了一半吟榴。我們總共補(bǔ)充了200個來自多盒結(jié)果的區(qū)域生成,大約60%的區(qū)域生成用于[6]囊扳,同時將覆蓋率從92%提高到93%吩翻。減少區(qū)域生成的數(shù)量,增加覆蓋率的整體影響是對于單個模型的情況平均精度均值增加了1%锥咸。最后狭瞎,等分類單個區(qū)域時,我們使用了6個GoogLeNets的組合搏予。這導(dǎo)致準(zhǔn)確率從40%提高到43.9%脚作。注意,與R-CNN相反缔刹,由于缺少時間我們沒有使用邊界框回歸球涛。
We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use convolutional networks. We report the official scores in Table 4 and common strategies for each team: the use of external data, ensemble models or contextual models. The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data the same way classification is used for pre-training. The GoogLeNet entry did not use the localization data for pretraining.
我們首先報告了最好檢測結(jié)果,并顯示了從第一版檢測任務(wù)以來的進(jìn)展校镐。與2013年的結(jié)果相比亿扁,準(zhǔn)確率幾乎翻了一倍。所有表現(xiàn)最好的團(tuán)隊都使用了卷積網(wǎng)絡(luò)鸟廓。我們在表4中報告了官方的分?jǐn)?shù)和每個隊伍的常見策略:使用外部數(shù)據(jù)从祝、集成模型或上下文模型。外部數(shù)據(jù)通常是ILSVRC12的分類數(shù)據(jù)引谜,用來預(yù)訓(xùn)練模型牍陌,后面在檢測數(shù)據(jù)集上進(jìn)行改善。一些團(tuán)隊也提到使用定位數(shù)據(jù)员咽。由于定位任務(wù)的邊界框很大一部分不在檢測數(shù)據(jù)集中毒涧,所以可以用該數(shù)據(jù)預(yù)訓(xùn)練一般的邊界框回歸器,這與分類預(yù)訓(xùn)練的方式相同贝室。GoogLeNet輸入沒有使用定位數(shù)據(jù)進(jìn)行預(yù)訓(xùn)練契讲。
In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and surprisingly only improves by 0.3 points with an ensemble of 3 models while the GoogLeNet obtains significantly stronger results with the ensemble.
在表5中,我們僅比較了單個模型的結(jié)果滑频。最好性能模型是Deep Insight的捡偏,令人驚訝的是3個模型的集合僅提高了0.3個點,而GoogLeNet在模型集成時明顯獲得了更好的結(jié)果峡迷。
9. Conclusions
Our results yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and narrower architectures.
9. 總結(jié)
我們的結(jié)果取得了可靠的證據(jù)银伟,即通過易獲得的密集構(gòu)造塊來近似期望的最優(yōu)稀疏結(jié)果是改善計算機(jī)視覺神經(jīng)網(wǎng)絡(luò)的一種可行方法。相比于較淺且較窄的架構(gòu),這個方法的主要優(yōu)勢是在計算需求適度增加的情況下有顯著的質(zhì)量收益彤避。
Our object detection work was competitive despite neither utilizing context nor performing bounding box regression, providing yet further evidence of the strengths of the Inception architecture.
我們的目標(biāo)檢測工作雖然沒有利用上下文傅物,也沒有執(zhí)行邊界框回歸,但仍然具有競爭力忠藤,這進(jìn)一步顯示了Inception架構(gòu)優(yōu)勢的證據(jù)挟伙。
For both classification and detection, it is expected that similar quality of results can be achieved by much more expensive non-Inception-type networks of similar depth and width. Still, our approach yields solid evidence that moving to sparser architectures is a feasible and useful idea in general. This suggests future work towards creating sparser and more refined structures in automated ways on the basis of [2], as well as on applying the insights of the Inception architecture to other domains.
對于分類和檢測楼雹,預(yù)期通過更昂貴的類似深度和寬度的非Inception類型網(wǎng)絡(luò)可以實現(xiàn)類似質(zhì)量的結(jié)果模孩。 然而,我們的方法取得了可靠的證據(jù)贮缅,即轉(zhuǎn)向更稀疏的結(jié)構(gòu)一般來說是可行有用的想法榨咐。這表明未來的工作將在[2]的基礎(chǔ)上以自動化方式創(chuàng)建更稀疏更精細(xì)的結(jié)構(gòu),以及將Inception架構(gòu)的思考應(yīng)用到其他領(lǐng)域谴供。
References
[1] Know your meme: We need to go deeper. http://knowyourmeme.com/memes/we-need-to-go-deeper. Accessed: 2014-09-15.
[2] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.
[3] Ü. V. Çatalyürek, C. Aykanat, and B. Uçar. On two-dimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM J. Sci. Comput., 32(2):656–683, Feb. 2010.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, NIPS, pages 1232–1240. 2012.
[5] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
[6] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, 2014.
[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[8] A. G. Howard. Some improvements on deep convolutional neural network based image classification. CoRR, abs/1312.5402, 2013.
[9] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[10] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, Dec. 1989.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[12] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[13] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992.
[14] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[15] T. Serre, L. Wolf, S. M. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007.
[16] F. Song and J. Dongarra. Scaling up matrix computations on shared-memory manycore systems with 1000 cpu cores. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, pages 333–342, New York, NY, USA, 2014. ACM.
[17] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, volume 28 of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.
[18] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors, NIPS, pages 2553–2561, 2013.
[19] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. CoRR, abs/1312.4659, 2013.
[20] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders. Segmentation as selective search for object recognition. In Proceedings of the 2011 International Conference on Computer Vision, ICCV ’11, pages 1879–1886, Washington, DC, USA, 2011. IEEE Computer Society.
[21] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, ECCV, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer, 2014.