文章作者:Tyan
博客:noahsnail.com | CSDN | 簡書
聲明:作者翻譯論文僅為學習,如有侵權請聯(lián)系作者刪除博文,謝謝!
翻譯論文匯總:https://github.com/SnailTyan/deep-learning-papers-translation
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Abstract
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features——using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
摘要
最先進的目標檢測網(wǎng)絡依靠區(qū)域提出算法來假設目標的位置。SPPnet[1]和Fast R-CNN[2]等研究已經(jīng)減少了這些檢測網(wǎng)絡的運行時間,使得區(qū)域提出計算成為一個瓶頸。在這項工作中,我們引入了一個區(qū)域提出網(wǎng)絡(RPN),該網(wǎng)絡與檢測網(wǎng)絡共享全圖像的卷積特征,從而使近乎零成本的區(qū)域提出成為可能。RPN是一個全卷積網(wǎng)絡,可以同時在每個位置預測目標邊界和目標分數(shù)。RPN經(jīng)過端到端的訓練,可以生成高質(zhì)量的區(qū)域提出,由Fast R-CNN用於檢測。我們將RPN和Fast R-CNN通過共享卷積特征進一步合并為一個單一的網(wǎng)絡——使用最近流行的具有“注意力”機制的神經(jīng)網(wǎng)絡術語,RPN組件告訴統(tǒng)一網(wǎng)絡在哪里尋找。對於非常深的VGG-16模型[3],我們的檢測系統(tǒng)在GPU上的幀率為5fps(包括所有步驟),同時在PASCAL VOC 2007、2012和MS COCO數(shù)據(jù)集上實現(xiàn)了最先進的目標檢測精度,而每張圖像只使用300個提出。在ILSVRC和COCO 2015競賽中,F(xiàn)aster R-CNN和RPN是多個賽道中獲得第一名的參賽作品的基礎。代碼已公開發(fā)布。
1. Introduction
Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.
1. 引言
目標檢測的最新進展是由區(qū)域提議方法(例如[4])和基於區(qū)域的卷積神經(jīng)網(wǎng)絡(R-CNN)[5]的成功驅(qū)動的。盡管在[5]中最初開發(fā)的基於區(qū)域的CNN計算成本很高,但是由於在各個提議之間共享卷積,其成本已經(jīng)大大降低了[1],[2]。忽略花費在區(qū)域提議上的時間,最新版本的Fast R-CNN[2]利用非常深的網(wǎng)絡[3]實現(xiàn)了接近實時的速率。現(xiàn)在,區(qū)域提議成為最先進檢測系統(tǒng)中測試階段的計算瓶頸。
Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.
區(qū)域提議方法通常依賴廉價的特征和簡練的推斷方案。選擇性搜索[4]是最流行的方法之一,它基於人工設計的低級特征貪婪地合并超像素。然而,與高效的檢測網(wǎng)絡[2]相比,選擇性搜索速度慢了一個數(shù)量級,在CPU實現(xiàn)中每張圖像需要2秒。EdgeBoxes[6]目前提供了提議質(zhì)量和速度之間的最佳權衡,每張圖像0.2秒。盡管如此,區(qū)域提議步驟仍然消耗與檢測網(wǎng)絡同樣多的運行時間。
One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.
有人可能會注意到,基於區(qū)域的快速CNN利用GPU,而在研究中使用的區(qū)域提議方法在CPU上實現(xiàn),使得運行時間比較不公平。加速區(qū)域提議計算的一個顯而易見的方法是將其在GPU上重新實現(xiàn)。這可能是一個有效的工程解決方案,但重新實現(xiàn)忽略了下游檢測網(wǎng)絡,因此錯過了共享計算的重要機會。
In this paper, we show that an algorithmic change——computing proposals with a deep convolutional neural network——leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).
在本文中,我們展示了一個算法上的改變——用深度卷積神經(jīng)網(wǎng)絡來計算區(qū)域提議——帶來了一個優(yōu)雅而有效的解決方案:在給定檢測網(wǎng)絡計算量的情況下,區(qū)域提議的計算接近零成本。為此,我們引入了新穎的區(qū)域提議網(wǎng)絡(RPN),它與最先進的目標檢測網(wǎng)絡[1],[2]共享卷積層。通過在測試時共享卷積,計算區(qū)域提議的邊際成本很小(例如,每張圖像10ms)。
Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task for generating detection proposals.
我們的觀察是,基於區(qū)域的檢測器(如Fast R-CNN)所使用的卷積特征映射也可以用於生成區(qū)域提議。在這些卷積特征之上,我們通過添加一些額外的卷積層來構(gòu)建RPN,這些卷積層同時在規(guī)則網(wǎng)格上的每個位置回歸區(qū)域邊界和目標分數(shù)。因此,RPN是一種全卷積網(wǎng)絡(FCN)[7],可以針對生成檢測提議的任務進行端到端的訓練。
RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel “anchor” boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.
Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.
RPN旨在有效地預測具有各種尺度和長寬比的區(qū)域提議。與使用圖像金字塔(圖1,a)或濾波器金字塔(圖1,b)的流行方法[8],[9],[1],[2]相比,我們引入了新的“錨”盒,作為多種尺度和長寬比下的參考。我們的方案可以被認為是回歸參考的金字塔(圖1,c),它避免了枚舉多種尺度或長寬比的圖像或濾波器。這個模型在使用單尺度圖像進行訓練和測試時表現(xiàn)良好,從而有利於運行速度。
圖1:解決多尺度和尺寸問題的不同方案。(a)構(gòu)建圖像和特征映射的金字塔,分類器在所有尺度上運行。(b)在特征映射上運行具有多種尺度/大小的濾波器金字塔。(c)我們在回歸函數(shù)中使用參考邊界框金字塔。
To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.
為了將RPN與Fast R-CNN[2]目標檢測網(wǎng)絡相結(jié)合,我們提出了一種訓練方案,在區(qū)域提議任務的微調(diào)和目標檢測的微調(diào)之間交替進行,同時保持提議固定。該方案收斂很快,並產(chǎn)生一個在兩個任務之間共享卷積特征的統(tǒng)一網(wǎng)絡。
We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time——the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).
我們在PASCAL VOC檢測基準數(shù)據(jù)集[11]上綜合評估了我們的方法,其中帶有RPN的Fast R-CNN產(chǎn)生的檢測精度優(yōu)於使用選擇性搜索的Fast R-CNN這一強基準。同時,我們的方法在測試時幾乎免除了選擇性搜索的所有計算負擔——區(qū)域提議的有效運行時間僅為10毫秒。使用[3]中昂貴的非常深的模型,我們的檢測方法在GPU上仍然具有5fps的幀率(包括所有步驟),因此在速度和準確性方面都是一個實用的目標檢測系統(tǒng)。我們還報告了在MS COCO數(shù)據(jù)集[12]上的結(jié)果,並使用COCO數(shù)據(jù)研究了在PASCAL VOC上的改進。代碼已在 https://github.com/shaoqingren/faster_rcnn(MATLAB)和 https://github.com/rbgirshick/py-faster-rcnn(Python)公開發(fā)布。
A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built in commercial systems such as at Pinterests [17], with user engagement improvements reported.
這份手稿的初步版本已在之前發(fā)表[10]。從那時起,RPN和Faster R-CNN的框架已被采用並推廣到其他方法中,如3D目標檢測[13]、基於部件的檢測[14]、實例分割[15]和圖像描述[16]。我們快速且有效的目標檢測系統(tǒng)也已被構(gòu)建到Pinterest[17]等商業(yè)系統(tǒng)中,並報告了用戶參與度的提升。
In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions. These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.
在ILSVRC和COCO 2015競賽中,F(xiàn)aster R-CNN和RPN是ImageNet檢測、ImageNet定位、COCO檢測和COCO分割賽道中多個第一名參賽作品[18]的基礎。RPN完全從數(shù)據(jù)中學習提議區(qū)域,因此可以輕松地從更深、更具表達能力的特征(例如[18]中采用的101層殘差網(wǎng)絡)中受益。Faster R-CNN和RPN也被這些比賽中的其他幾個領先參賽者所使用。這些結(jié)果表明,我們的方法不僅是一個實用且經(jīng)濟的解決方案,也是提高目標檢測精度的有效途徑。
2. RELATED WORK
Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).
2. 相關工作
目標提議。關於目標提議方法已有大量文獻。目標提議方法的綜合調(diào)查和比較可以在[19],[20],[21]中找到。廣泛使用的目標提議方法包括基於超像素分組的方法(例如選擇性搜索[4],CPMC[22],MCG[23])和基於滑動窗口的方法(例如窗口中的目標性[24],EdgeBoxes[6])。目標提議方法被作為獨立於檢測器的外部模塊采用(例如選擇性搜索[4]目標檢測器,R-CNN[5]和Fast R-CNN[2])。
Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple classspecific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the “single-box” fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.
用於目標檢測的深度網(wǎng)絡。R-CNN方法[5]端到端地訓練CNN,將提議區(qū)域分類為目標類別或背景。R-CNN主要起分類器的作用,它並不預測目標邊界(除了通過邊界框回歸進行細化)。其準確度取決於區(qū)域提議模塊的性能(參見[20]中的比較)。一些論文提出了使用深度網(wǎng)絡來預測目標邊界框的方法[25],[9],[26],[27]。在OverFeat方法[9]中,訓練一個全連接層來為假定只有單個目標的定位任務預測邊界框坐標。然後將全連接層變成卷積層,用於檢測多個特定類別的目標。MultiBox方法[26],[27]從一個網(wǎng)絡中生成區(qū)域提議,該網(wǎng)絡的最後一個全連接層同時預測多個與類別無關的邊界框,推廣了OverFeat的“單邊界框”方式。這些與類別無關的邊界框被用作R-CNN[5]的提議。與我們的全卷積方案相比,MultiBox提議網(wǎng)絡應用於單個圖像裁剪塊或多個大的圖像裁剪塊(例如224×224)。MultiBox在提議網(wǎng)絡和檢測網(wǎng)絡之間不共享特征。稍後我們會結(jié)合我們的方法更深入地討論OverFeat和MultiBox。與我們的工作同時,DeepMask方法[28]被開發(fā)用於學習分割提議。
Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.
卷積的共享計算[9],[1],[29],[7],[2]已經(jīng)越來越受到關注,因為它可以實現(xiàn)高效而準確的視覺識別。OverFeat論文[9]從圖像金字塔計算卷積特征,用於分類、定位和檢測。在共享卷積特征映射上的自適應大小池化(SPP)[1]被開發(fā)用於高效的基於區(qū)域的目標檢測[1],[30]和語義分割[29]。Fast R-CNN[2]實現(xiàn)了在共享卷積特征上端到端地訓練檢測器,並展現(xiàn)出令人信服的準確性和速度。
3. FASTER R-CNN
Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with “attention” [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.
Figure 2: Faster R-CNN is a single, unified network for object detection. The RPN module serves as the ‘a(chǎn)ttention’ of this unified network.
3. FASTER R-CNN
我們的目標檢測系統(tǒng)稱為Faster R-CNN,由兩個模塊組成。第一個模塊是用於提議區(qū)域的深度全卷積網(wǎng)絡,第二個模塊是使用這些提議區(qū)域的Fast R-CNN檢測器[2]。整個系統(tǒng)是一個單一、統(tǒng)一的目標檢測網(wǎng)絡(圖2)。使用最近流行的帶“注意力”[31]機制的神經(jīng)網(wǎng)絡術語,RPN模塊告訴Fast R-CNN模塊去哪里尋找。在第3.1節(jié)中,我們介紹區(qū)域提議網(wǎng)絡的設計和特性。在第3.2節(jié)中,我們開發(fā)了訓練這兩個共享特征的模塊的算法。
圖2:Faster R-CNN是一個單一、統(tǒng)一的目標檢測網(wǎng)絡。RPN模塊作為這個統(tǒng)一網(wǎng)絡的“注意力”。
3.1 Region Proposal Networks
A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.
3.1 區(qū)域提議網(wǎng)絡
區(qū)域提議網(wǎng)絡(RPN)以任意大小的圖像作為輸入,輸出一組矩形的目標提議,每個提議都有一個目標得分。我們用全卷積網(wǎng)絡[7]對這個過程進行建模,並將在本節(jié)中進行描述。因為我們的最終目標是與Fast R-CNN目標檢測網(wǎng)絡[2]共享計算,所以我們假設兩個網(wǎng)絡共享一組共同的卷積層。在我們的實驗中,我們研究了具有5個可共享卷積層的Zeiler和Fergus模型[32](ZF)和具有13個可共享卷積層的Simonyan和Zisserman模型[3](VGG-16)。
To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an $n × n$ spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers——a box-regression layer (reg) and a box-classification layer (cls). We use $n = 3$ in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).
Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.
為了生成區(qū)域提議,我們在最後一個共享卷積層輸出的卷積特征映射上滑動一個小網(wǎng)絡。這個小網(wǎng)絡以輸入卷積特征映射的$n×n$空間窗口作為輸入。每個滑動窗口被映射到一個低維特征(ZF為256維,VGG為512維,後面接ReLU[33])。這個特征被送入兩個並列的全連接層——一個邊界框回歸層(reg)和一個邊界框分類層(cls)。在本文中,我們使用$n=3$,注意輸入圖像上的有效感受野是很大的(ZF和VGG分別為171和228個像素)。圖3(左)在單個位置展示了這個小網(wǎng)絡。請注意,由於小網(wǎng)絡以滑動窗口方式運行,所有空間位置共享這些全連接層。這種架構(gòu)自然地由一個n×n卷積層后接兩個並列的1×1卷積層(分別用於reg和cls)來實現(xiàn)。
圖3:左:區(qū)域提議網(wǎng)絡(RPN)。右:在PASCAL VOC 2007測試集上使用RPN提議的檢測示例。我們的方法可以檢測各種尺度和長寬比的目標。
3.1.1 Anchors
At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as $k$. So the reg layer has $4k$ outputs encoding the coordinates of $k$ boxes, and the cls layer outputs $2k$ scores that estimate probability of object or not object for each proposal. The $k$ proposals are parameterized relative to $k$ reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding $k=9$ anchors at each sliding position. For a convolutional feature map of a size W × H (typically ~2,400), there are $WHk$ anchors in total.
3.1.1 錨點
在每個滑動窗口位置,我們同時預測多個區(qū)域提議,其中每個位置可能提議的最大數(shù)目記為$k$。因此,reg層具有$4k$個輸出,編碼$k$個邊界框的坐標;cls層輸出$2k$個分數(shù),估計每個提議是目標或不是目標的概率。這$k$個提議是相對於$k$個參考邊界框參數(shù)化的,我們把這些參考框稱為錨點(anchor)。錨點位於所討論的滑動窗口的中心,並與一個尺度和一個長寬比相關聯(lián)(圖3左)。默認情況下,我們使用3個尺度和3個長寬比,在每個滑動位置產(chǎn)生$k=9$個錨點。對於大小為W×H(通常約為2400)的卷積特征映射,總共有$WHk$個錨點。
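為了更直觀地說明錨點的生成方式,下面給出一個簡化的NumPy示意(譯者添加,非論文官方實現(xiàn),函數(shù)名均為自擬;其中尺度$128^2$、$256^2$、$512^2$像素與特征步長16取自后文3.3節(jié)的默認設置):

```python
import numpy as np

def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """生成以(0, 0)為中心的 k = len(scales)*len(ratios) 個基準錨點,
    每個錨點表示為 (x1, y1, x2, y2);r 為高寬比 h/w,面積保持約為 s*s。"""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(1.0 / r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])
    return np.array(anchors)                     # 形狀: (9, 4)

def shift_anchors(anchors, feat_w, feat_h, stride=16):
    """把基準錨點平移到特征映射的每個位置(對應原圖上間隔為stride的網(wǎng)格)。"""
    sx = (np.arange(feat_w) + 0.5) * stride
    sy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(sx, sy)
    shifts = np.stack([cx.ravel(), cy.ravel(), cx.ravel(), cy.ravel()], axis=1)
    all_anchors = anchors[None, :, :] + shifts[:, None, :]
    return all_anchors.reshape(-1, 4)            # 形狀: (W*H*k, 4)

if __name__ == "__main__":
    all_anchors = shift_anchors(base_anchors(), feat_w=63, feat_h=38)
    print(all_anchors.shape)   # (21546, 4),與后文“約1000×600的圖像對應約20000個錨點”一致
```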
Translation-Invariant Anchors
An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method. As a comparison, the MultiBox method [27] uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.
平移不變的錨點
我們的方法的一個重要特性是平移不變性,無論是就錨點而言,還是就計算相對於錨點的提議的函數(shù)而言。如果平移圖像中的目標,提議也應該平移,並且同樣的函數(shù)應該能夠在任一位置預測出該提議。我們的方法保證了這種平移不變特性。作為比較,MultiBox方法[27]使用k-means生成800個錨點,它們不是平移不變的。所以如果平移目標,MultiBox不能保證生成相同的提議。
The translation-invariant property also reduces the model size. MultiBox has a $(4+1)\times 800$-dimensional fully-connected output layer, whereas our method has a $(4+2)\times 9$-dimensional convolutional output layer in the case of $k=9$ anchors. As a result, our output layer has $2.8\times10^4$ parameters ($512\times(4+2)\times9$ for VGG-16), two orders of magnitude fewer than MultiBox's output layer that has $6.1\times10^6$ parameters ($1536\times(4+1)\times800$ for GoogleNet [34] in MultiBox [27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox. We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.
平移不變特性也減小了模型的大小。MultiBox有$(4+1)\times 800$維的全連接輸出層,而我們的方法在$k=9$個錨點的情況下只有$(4+2)\times 9$維的卷積輸出層。因此,我們的輸出層具有$2.8\times10^4$個參數(shù)(對於VGG-16為$512\times(4+2)\times9$),比MultiBox輸出層的$6.1\times10^6$個參數(shù)(對於MultiBox[27]中的GoogleNet[34]為$1536\times(4+1)\times800$)少了兩個數(shù)量級。如果考慮特征投影層,我們的提議層的參數(shù)仍然比MultiBox少一個數(shù)量級。我們期望我們的方法在PASCAL VOC等小數(shù)據(jù)集上有更小的過擬合風險。
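上述參數(shù)量可以直接驗算,下面是一段簡單的算術核對(僅為示意):

```python
# RPN輸出層(VGG-16, k=9個錨點): 512維特征映射到 (4+2)*9 維輸出
rpn_output_params = 512 * (4 + 2) * 9
print(rpn_output_params)                    # 27648,約為2.8e4

# MultiBox輸出層(GoogleNet, 800個錨點): 1536維特征映射到 (4+1)*800 維輸出
multibox_output_params = 1536 * (4 + 1) * 800
print(multibox_output_params)               # 6144000,約為6.1e6

print(multibox_output_params // rpn_output_params)   # 約222倍,即相差兩個數(shù)量級
```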
Multi-Scale Anchors as Regression References
Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models of different aspect ratios are trained separately using different filter sizes (such as 5×7 and 7×5). If this way is used to address multiple scales, it can be thought of as a “pyramid of filters” (Figure 1(b)). The second way is usually adopted jointly with the first way [8].
多尺度錨點作為回歸參考
我們的錨點設計提出了一個解決多尺度(和多長寬比)問題的新方案。如圖1所示,多尺度預測有兩種流行的方式。第一種方式基於圖像/特征金字塔,例如DPM[8]和基於CNN的方法[9],[1],[2]。圖像被縮放到多個尺度,並針對每個尺度計算特征映射(HOG[8]或深度卷積特征[9],[1],[2])(圖1(a))。這種方式通常是有用的,但非常耗時。第二種方式是在特征映射上使用多尺度(和/或多長寬比)的滑動窗口。例如,在DPM[8]中,使用不同的濾波器大小(例如5×7和7×5)分別訓練不同長寬比的模型。如果用這種方式來解決多尺度問題,可以把它看作一個“濾波器金字塔”(圖1(b))。第二種方式通常與第一種方式聯(lián)合采用[8]。
As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).
Table 8: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different settings of anchors. The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using 3 scales and 3 aspect ratios ($69.9%$) is the same as that in Table 3.
作為比較,我們基於錨點的方法建立在錨點金字塔之上,這更具成本效益。我們的方法參照多種尺度和長寬比的錨盒來分類和回歸邊界框。它只依賴單一尺度的圖像和特征映射,並使用單一尺寸的濾波器(特征映射上的滑動窗口)。我們通過實驗展示了這個方案在解決多尺度和多尺寸問題上的效果(表8)。
表8:Faster R-CNN在PASCAL VOC 2007測試集上使用不同錨點設置的檢測結(jié)果。網(wǎng)絡是VGG-16。訓練數(shù)據(jù)是VOC 2007 trainval。使用3個尺度和3個長寬比($69.9%$)的默認設置與表3中的相同。
Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.
由於這種基於錨點的多尺度設計,我們可以簡單地使用在單尺度圖像上計算的卷積特征,F(xiàn)ast R-CNN檢測器[2]也是這樣做的。多尺度錨點的設計是在不增加處理尺度的額外成本的情況下共享特征的關鍵組件。
3.1.2 Loss Function
For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
3.1.2 損失函數(shù)
為了訓練RPN,我們?yōu)槊總€錨點分配一個二值類別標簽(是目標或不是目標)。我們給兩類錨點分配正標簽:(i)與某個真實邊界框具有最高交並比(IoU)重疊的一個或多個錨點,或者(ii)與任意真實邊界框的IoU重疊超過0.7的錨點。注意,單個真實邊界框可以為多個錨點分配正標簽。通常第二個條件足以確定正樣本;但我們?nèi)匀徊捎玫谝粋€條件,因為在一些極少數(shù)情況下,第二個條件可能找不到正樣本。如果一個非正的錨點與所有真實邊界框的IoU都低於0.3,我們給它分配負標簽。既不是正樣本也不是負樣本的錨點對訓練目標函數(shù)沒有貢獻。
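下面用NumPy給出上述標簽分配規(guī)則的一個簡化示意(譯者添加的草稿,非官方實現(xiàn),函數(shù)名為自擬):

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """計算 (N, 4) 錨點與 (M, 4) 真實框之間的IoU矩陣,框格式為 (x1, y1, x2, y2)。"""
    ax1, ay1, ax2, ay2 = np.split(anchors, 4, axis=1)            # 各為 (N, 1)
    gx1, gy1, gx2, gy2 = gt_boxes[:, 0], gt_boxes[:, 1], gt_boxes[:, 2], gt_boxes[:, 3]
    iw = np.clip(np.minimum(ax2, gx2) - np.maximum(ax1, gx1), 0, None)
    ih = np.clip(np.minimum(ay2, gy2) - np.maximum(ay1, gy1), 0, None)
    inter = iw * ih                                              # (N, M)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    return inter / (area_a + area_g - inter)

def assign_labels(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """返回每個錨點的標簽: 1為正, 0為負, -1為忽略(不參與訓練)。"""
    iou = iou_matrix(anchors, gt_boxes)
    max_iou = iou.max(axis=1)
    labels = -np.ones(len(anchors), dtype=np.int64)
    labels[max_iou < neg_thresh] = 0       # 與所有真實框的IoU都低於0.3: 負
    labels[max_iou >= pos_thresh] = 1      # 條件(ii): 與某個真實框IoU超過0.7
    labels[iou.argmax(axis=0)] = 1         # 條件(i): 每個真實框IoU最高的錨點
    return labels
```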
With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [2]. Our loss function for an image is defined as:$$
L(\lbrace p_i \rbrace, \lbrace t_i \rbrace) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p^{*}_i) \\ + \lambda\frac{1}{N_{reg}}\sum_i p^{*}_i L_{reg}(t_i, t^{*}_i).
$$Here, $i$ is the index of an anchor in a mini-batch and $p_i$ is the predicted probability of anchor $i$ being an object. The ground-truth label $p^{*}_i$ is 1 if the anchor is positive, and is 0 if the anchor is negative. $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t^{*}_i$ is that of the ground-truth box associated with a positive anchor. The classification loss $L_{cls}$ is log loss over two classes (object vs not object). For the regression loss, we use $L_{reg}(t_i, t^{*}_i)=R(t_i - t^{*}_i)$ where $R$ is the robust loss function (smooth $L_1$) defined in [2]. The term $p^{*}_i L_{reg}$ means the regression loss is activated only for positive anchors ($p^{*}_i=1$) and is disabled otherwise ($p^{*}_i=0$). The outputs of the cls and reg layers consist of $\lbrace p_i \rbrace$ and $\lbrace t_i \rbrace$ respectively.
根據(jù)這些定義,我們按照Fast R-CNN[2]中的多任務損失來最小化目標函數(shù)。我們對一張圖像的損失函數(shù)定義為:$$
L(\lbrace p_i \rbrace, \lbrace t_i \rbrace) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p^{*}_i) \\ + \lambda\frac{1}{N_{reg}}\sum_i p^{*}_i L_{reg}(t_i, t^{*}_i).
$$其中,$i$是一個小批量數(shù)據(jù)中錨點的索引,$p_i$是錨點$i$作為目標的預測概率。如果錨點為正,真實標簽$p^{*}_i$為1;如果錨點為負,則為0。$t_i$是表示預測邊界框4個參數(shù)化坐標的向量,而$t^{*}_i$是與正錨點相關聯(lián)的真實邊界框的向量。分類損失$L_{cls}$是兩個類別(目標與非目標)上的對數(shù)損失。對於回歸損失,我們使用$L_{reg}(t_i, t^{*}_i)=R(t_i - t^{*}_i)$,其中$R$是在[2]中定義的魯棒損失函數(shù)(平滑$L_1$)。項$p^{*}_i L_{reg}$表示回歸損失僅對正錨點($p^{*}_i=1$)激活,否則($p^{*}_i=0$)被禁用。cls層和reg層的輸出分別由$\lbrace p_i \rbrace$和$\lbrace t_i \rbrace$組成。
The two terms are normalized by $N_{cls}$ and $N_{reg}$ and weighted by a balancing parameter $\lambda$. In our current implementation (as in the released code), the $cls$ term in Eqn.(1) is normalized by the mini-batch size (i.e., $N_{cls}=256$) and the $reg$ term is normalized by the number of anchor locations (i.e., $N_{reg} \sim 2,400$). By default we set $\lambda=10$, and thus both cls and reg terms are roughly equally weighted. We show by experiments that the results are insensitive to the values of $\lambda$ in a wide range (Table 9). We also note that the normalization as above is not required and could be simplified.
Table 9: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different values of $\lambda$ in Equation (1). The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using $\lambda = 10$ ($69.9%$) is the same as that in Table 3.
這兩項分別用$N_{cls}$和$N_{reg}$進行歸一化,並由一個平衡參數(shù)$\lambda$加權。在我們目前的實現(xiàn)中(如發(fā)布的代碼中),方程(1)中的$cls$項通過小批量數(shù)據(jù)的大小(即$N_{cls}=256$)進行歸一化,$reg$項根據(jù)錨點位置的數(shù)量(即$N_{reg}\sim 2,400$)進行歸一化。默認情況下,我們設置$\lambda=10$,因此cls和reg兩項的權重大致相等。我們通過實驗表明,結(jié)果在很寬的范圍內(nèi)對$\lambda$的取值不敏感(表9)。我們還注意到,上面的歸一化不是必需的,可以簡化。
表9:Faster R-CNN使用方程(1)中不同的$\lambda$值在PASCAL VOC 2007測試集上的檢測結(jié)果。網(wǎng)絡是VGG-16。訓練數(shù)據(jù)是VOC 2007 trainval。使用$\lambda = 10$($69.9%$)的默認設置與表3中的相同。
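結(jié)合式(1)與上述歸一化設置($N_{cls}=256$,$N_{reg}\sim 2,400$,$\lambda=10$),RPN的多任務損失可以用如下NumPy代碼示意(譯者添加的簡化草稿,非官方實現(xiàn)):

```python
import numpy as np

def smooth_l1(x):
    """[2]中定義的魯棒損失(平滑L1),逐元素計算。"""
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """p: (N,) 每個被采樣錨點預測為目標的概率; p_star: (N,) 0/1真實標簽;
    t, t_star: (N, 4) 預測與真實的參數(shù)化坐標。"""
    eps = 1e-7
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))  # 二分類對數(shù)損失
    l_reg = smooth_l1(t - t_star).sum(axis=1) * p_star                        # 僅對正錨點激活
    return l_cls.sum() / n_cls + lam * l_reg.sum() / n_reg
```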
For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:
$$
t_{\textrm{x}} = (x - x_{\textrm{a}})/w_{\textrm{a}},\quad
t_{\textrm{y}} = (y - y_{\textrm{a}})/h_{\textrm{a}},\\
t_{\textrm{w}} = \log(w / w_{\textrm{a}}), \quad
t_{\textrm{h}} = \log(h / h_{\textrm{a}}),\\
t^{*}_{\textrm{x}} = (x^{*} - x_{\textrm{a}})/w_{\textrm{a}},\quad
t^{*}_{\textrm{y}} = (y^{*} - y_{\textrm{a}})/h_{\textrm{a}},\\
t^{*}_{\textrm{w}} = \log(w^{*} / w_{\textrm{a}}),\quad
t^{*}_{\textrm{h}} = \log(h^{*} / h_{\textrm{a}}),
$$ where $x$, $y$, $w$, and $h$ denote the box's center coordinates and its width and height. Variables $x$, $x_{\textrm{a}}$, and $x^{*}$ are for the predicted box, anchor box, and ground-truth box respectively (likewise for $y, w, h$). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
對於邊界框回歸,我們采用[5]中的4個坐標參數(shù)化:$$
t_{\textrm{x}} = (x - x_{\textrm{a}})/w_{\textrm{a}},\quad
t_{\textrm{y}} = (y - y_{\textrm{a}})/h_{\textrm{a}},\\
t_{\textrm{w}} = \log(w / w_{\textrm{a}}), \quad
t_{\textrm{h}} = \log(h / h_{\textrm{a}}),\\
t^{*}_{\textrm{x}} = (x^{*} - x_{\textrm{a}})/w_{\textrm{a}},\quad
t^{*}_{\textrm{y}} = (y^{*} - y_{\textrm{a}})/h_{\textrm{a}},\\
t^{*}_{\textrm{w}} = \log(w^{*} / w_{\textrm{a}}),\quad
t^{*}_{\textrm{h}} = \log(h^{*} / h_{\textrm{a}}),
$$ 其中,$x$,$y$,$w$和$h$表示邊界框的中心坐標及其寬和高。變量$x$,$x_{\textrm{a}}$和$x^{*}$分別對應預測邊界框、錨盒和真實邊界框($y, w, h$同理)。這可以被看作是從錨盒到鄰近的真實邊界框的回歸。
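上述參數(shù)化對應的編碼(由錨盒與真實框計算$t^{*}$)與解碼(由預測的$t$和錨盒還原邊界框)可以示意如下(譯者添加的NumPy草稿,非官方實現(xiàn)):

```python
import numpy as np

def to_ctr(boxes):
    """(x1, y1, x2, y2) -> 中心坐標與寬高 (x, y, w, h)。"""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    x = boxes[:, 0] + 0.5 * w
    y = boxes[:, 1] + 0.5 * h
    return x, y, w, h

def encode(anchors, gt_boxes):
    """按文中公式計算 t* = (tx, ty, tw, th)。"""
    xa, ya, wa, ha = to_ctr(anchors)
    x, y, w, h = to_ctr(gt_boxes)
    return np.stack([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)], axis=1)

def decode(anchors, t):
    """由預測的 t 和錨盒還原出邊界框 (x1, y1, x2, y2)。"""
    xa, ya, wa, ha = to_ctr(anchors)
    x = t[:, 0] * wa + xa
    y = t[:, 1] * ha + ya
    w = np.exp(t[:, 2]) * wa
    h = np.exp(t[:, 3]) * ha
    return np.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], axis=1)
```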
Nevertheless, our method achieves bounding-box regression by a different manner from previous RoI-based (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of $k$ bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the $k$ regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.
然而,我們的方法以不同於之前基於RoI(感興趣區(qū)域)的方法[1],[2]的方式來實現(xiàn)邊界框回歸。在[1],[2]中,邊界框回歸是在從任意大小的RoI池化得到的特征上執(zhí)行的,並且回歸權重由所有區(qū)域大小共享。在我們的公式中,用於回歸的特征在特征映射上具有相同的空間大小(3×3)。為了應對不同的尺寸,我們學習一組$k$個邊界框回歸器。每個回歸器負責一個尺度和一個長寬比,這$k$個回歸器不共享權重。因此,得益於錨點的設計,即使特征具有固定的尺寸/尺度,仍然可以預測各種尺寸的邊界框。
3.1.3 Training RPNs
The RPN can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) [35]. We follow the “image-centric” sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.
3.1.3 訓練RPN
RPN可以通過反向傳播和隨機梯度下降(SGD)[35]進行端到端訓練。我們遵循[2]中“以圖像為中心”的采樣策略來訓練這個網(wǎng)絡。每個小批量數(shù)據(jù)都來自包含許多正、負樣本錨點的單張圖像。對所有錨點的損失函數(shù)進行優(yōu)化是可能的,但這樣會偏向負樣本,因為它們占主導地位。取而代之,我們在一張圖像中隨機采樣256個錨點來計算一個小批量數(shù)據(jù)的損失函數(shù),其中采樣的正、負錨點的比例最高為1:1。如果圖像中的正樣本少於128個,我們就用負樣本填充這個小批量數(shù)據(jù)。
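上述采樣策略的一個簡化示意如下(譯者添加,非官方實現(xiàn);這里約定標簽1為正、0為負、-1為忽略,與前面標簽分配示意中的約定一致):

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, pos_fraction=0.5, rng=np.random):
    """labels: 每個錨點的標簽。返回參與本次損失計算的錨點下標。"""
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    n_pos = min(len(pos), int(batch_size * pos_fraction))   # 最多采樣128個正錨點
    n_neg = min(len(neg), batch_size - n_pos)               # 正樣本不足時用負樣本補齊
    pos = rng.choice(pos, size=n_pos, replace=False)
    neg = rng.choice(neg, size=n_neg, replace=False)
    return np.concatenate([pos, neg])
```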
We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pre-training a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory [2]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 [37]. Our implementation uses Caffe [38].
我們通過從標準差為0.01的零均值高斯分布中抽取權重來隨機初始化所有新層。所有其他層(即共享卷積層)按照標準做法[5],用ImageNet分類的預訓練模型[36]來初始化。我們微調(diào)ZF網(wǎng)絡的所有層,而對VGG網(wǎng)絡只微調(diào)conv3_1及其之上的層以節(jié)省內(nèi)存[2]。在PASCAL VOC數(shù)據(jù)集上,前60k個小批量數(shù)據(jù)我們使用0.001的學習率,之後的20k個小批量數(shù)據(jù)使用0.0001。我們使用0.9的動量和0.0005的權重衰減[37]。我們的實現(xiàn)使用Caffe[38]。
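作為參考,上述超參數(shù)對應的帶動量與權重衰減的SGD更新規(guī)則可以寫成如下示意(譯者添加,僅說明更新公式,並非Caffe的實際實現(xiàn)):

```python
import numpy as np

def sgd_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """帶動量與權重衰減的SGD單步更新(示意):權重衰減等價於在梯度上加上L2正則項。"""
    grad = grad + weight_decay * w
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def learning_rate(iteration):
    """PASCAL VOC上的學習率計劃:前60k個mini-batch為0.001,之後20k個為0.0001。"""
    return 0.001 if iteration < 60000 else 0.0001
```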
3.2 Sharing Features for RPN and Fast R-CNN
Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).
3.2 RPN和Fast R-CNN共享特征
到目前為止,我們已經(jīng)描述了如何訓練用於生成區(qū)域提議的網(wǎng)絡,而沒有考慮將利用這些提議的基於區(qū)域的目標檢測CNN。對於檢測網(wǎng)絡,我們采用Fast R-CNN[2]。接下來我們介紹一些算法,用於學習由RPN和Fast R-CNN組成、具有共享卷積層的統(tǒng)一網(wǎng)絡(圖2)。
Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:
如果獨立訓練,RPN和Fast R-CNN將以不同的方式修改它們的卷積層。因此,我們需要開發(fā)一種允許在兩個網(wǎng)絡之間共享卷積層的技術,而不是學習兩個獨立的網(wǎng)絡。我們討論三種訓練具有共享特征的網(wǎng)絡的方法:
(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.
(一)交替訓練。在這個解決方案中,我們首先訓練RPN,並使用這些提議來訓練Fast R-CNN。然後用經(jīng)Fast R-CNN微調(diào)的網(wǎng)絡來初始化RPN,並迭代這個過程。這是本文所有實驗中使用的解決方案。
(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But this solution ignores the derivative w.r.t. the proposal boxes’ coordinates that are also network responses, so is approximate. In our experiments, we have empirically found this solver produces close results, yet reduces the training time by about $25-50%$ comparing with alternating training. This solver is included in our released Python code.
(二)近似聯(lián)合訓練。在這個解決方案中,RPN和Fast R-CNN網(wǎng)絡在訓練期間被合並成一個網(wǎng)絡,如圖2所示。在每次SGD迭代中,前向傳遞生成區(qū)域提議,在訓練Fast R-CNN檢測器時,這些提議被看作是固定的、預先計算好的提議。反向傳播照常進行,其中對於共享層,來自RPN損失和Fast R-CNN損失的反向傳播信號被合並。這個解決方案很容易實現(xiàn)。但是它忽略了關於提議框坐標(它們也是網(wǎng)絡的輸出)的導數(shù),因此是近似的。在我們的實驗中,我們通過實驗發(fā)現(xiàn)這個求解器產(chǎn)生了接近的結(jié)果,而與交替訓練相比,訓練時間減少了大約$25-50%$。這個求解器包含在我們發(fā)布的Python代碼中。
(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input. The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an “RoI warping” layer as developed in [15], which is beyond the scope of this paper.
(三)非近似聯(lián)合訓練。如上所述,由RPN預測的邊界框也是輸入的函數(shù)。Fast R-CNN中的RoI池化層[2]接受卷積特征以及預測的邊界框作為輸入,所以一個理論上有效的反向傳播求解器也應該包括關於邊界框坐標的梯度。在上述近似聯(lián)合訓練中,這些梯度被忽略了。在非近似的聯(lián)合訓練解決方案中,我們需要一個關於邊界框坐標可微的RoI池化層。這是一個非平凡的問題,可以通過[15]中提出的“RoI扭曲”(RoI warping)層給出解決方案,但這超出了本文的范圍。
4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.
四步交替訓練。在本文中,我們采用一種實用的四步訓練算法,通過交替優(yōu)化來學習共享特征。第一步,我們按照3.1.3節(jié)的描述訓練RPN。該網(wǎng)絡使用ImageNet預訓練模型進行初始化,並針對區(qū)域提議任務進行端到端的微調(diào)。第二步,我們使用第一步RPN生成的提議,用Fast R-CNN訓練一個單獨的檢測網(wǎng)絡。該檢測網(wǎng)絡同樣由ImageNet預訓練模型初始化。此時兩個網(wǎng)絡還不共享卷積層。第三步,我們用檢測器網(wǎng)絡來初始化RPN的訓練,但固定共享的卷積層,只微調(diào)RPN特有的層≡律迹現(xiàn)在這兩個網(wǎng)絡共享卷積層了。最後,保持共享卷積層固定,我們微調(diào)Fast R-CNN特有的層。這樣,兩個網(wǎng)絡共享相同的卷積層,形成一個統(tǒng)一的網(wǎng)絡。類似的交替訓練可以運行更多次迭代,但我們只觀察到微不足道的改進。
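四步交替訓練中各步的初始化與凍結(jié)關系可以用下面的Python示意來概括(譯者添加;train、generate_proposals等均為自擬的占位函數(shù),只用來說明各步之間的數(shù)據(jù)流,並不是可用於實際訓練的代碼):

```python
def imagenet_pretrained():
    """示意用的“網(wǎng)絡”:conv表示共享卷積層,head表示各自特有的層。"""
    return {"conv": "ImageNet預訓練的卷積層", "head": "隨機初始化的特有層"}

def train(net, task, proposals=None, freeze_conv=False):
    """占位函數(shù):示意一次微調(diào);freeze_conv=True時固定共享卷積層不更新。"""
    net = dict(net)
    net["head"] = "在" + task + "任務上微調(diào)過的特有層"
    if not freeze_conv:
        net["conv"] = "在" + task + "任務上微調(diào)過的卷積層"
    return net

def generate_proposals(rpn):
    return "由當前RPN生成的約2000個提議"

# 第一步:用ImageNet預訓練模型初始化,端到端微調(diào)RPN
rpn_1 = train(imagenet_pretrained(), "區(qū)域提議")
# 第二步:用第一步的提議訓練單獨的Fast R-CNN(同樣由ImageNet初始化,此時卷積層尚未共享)
det_2 = train(imagenet_pretrained(), "檢測", proposals=generate_proposals(rpn_1))
# 第三步:用檢測網(wǎng)絡的卷積層初始化RPN,固定共享卷積層,只微調(diào)RPN特有的層
rpn_3 = train({"conv": det_2["conv"], "head": rpn_1["head"]}, "區(qū)域提議", freeze_conv=True)
# 第四步:保持共享卷積層固定,只微調(diào)Fast R-CNN特有的層;此后兩個網(wǎng)絡共享同一組卷積層
det_4 = train({"conv": det_2["conv"], "head": det_2["head"]}, "檢測",
              proposals=generate_proposals(rpn_3), freeze_conv=True)
```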
3.3 Implementation Details
We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is $s = 600$ pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2]. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ~10 pixels on a typical PASCAL image before resizing (~500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.
3.3 實現(xiàn)細節(jié)
我們在單尺度圖像上訓練和測試區(qū)域提議網(wǎng)絡和目標檢測網(wǎng)絡[1],[2]。我們重新縮放圖像,使它們的短邊為$s=600$像素[2]。多尺度特征提取(使用圖像金字塔)可能會提高精度,但沒有表現(xiàn)出速度與精度的良好折衷[2]。在重新縮放後的圖像上,ZF和VGG網(wǎng)絡在最後一個卷積層上的總步長均為16個像素,因而在調(diào)整大小之前,在典型的PASCAL圖像(約500×375)上相當於約10個像素。即使是這樣大的步長也能提供良好的結(jié)果,盡管使用更小的步長可能會進一步提高精度。
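按“短邊縮放為$s=600$像素”的設定,縮放因子可以如下計算(譯者添加的簡化示意,未考慮實現(xiàn)中可能存在的其他約束):

```python
def rescale_shorter_side(width, height, s=600):
    """把圖像等比縮放,使短邊為s像素;返回縮放後的寬、高與縮放因子。"""
    scale = s / min(width, height)
    return round(width * scale), round(height * scale), scale

print(rescale_shorter_side(500, 375))   # 典型的PASCAL圖像 -> (800, 600, 1.6)
```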
For anchors, we use 3 scales with box areas of $128^2$, $256^2$, and $512^2$ pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle of the object is visible.
Table 1: the learned average proposal size for each anchor using the ZF net (numbers for $s = 600$).
對於錨點,我們使用3個尺度,對應的邊界框面積分別為$128^2$、$256^2$和$512^2$個像素,以及1:1、1:2和2:1三種長寬比。這些超參數(shù)並不是針對特定數(shù)據(jù)集精心挑選的,我們將在下一節(jié)中提供有關其作用的消融實驗。如前所述,我們的解決方案不需要圖像金字塔或濾波器金字塔來預測多個尺度的區(qū)域,從而節(jié)省了大量的運行時間。圖3(右)顯示了我們的方法在很大的尺度和長寬比范圍上的能力。表1顯示了使用ZF網(wǎng)絡時每個錨點學習到的平均提議大小。我們注意到,我們的算法允許做出比底層感受野更大的預測。這樣的預測並非不可能——如果只有目標的中間部分可見,仍然可以粗略地推斷出目標的范圍。
表1:使用ZF網(wǎng)絡時每個錨點學習到的平均提議大小($s=600$時的數(shù)值)。
The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical $1000 \times 600$ image, there will be roughly 20000 ($\approx 60 \times 40 \times 9$) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult to correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
跨越圖像邊界的錨盒需要小心處理。在訓練過程中,我們忽略所有跨越邊界的錨點,使它們不對損失產(chǎn)生貢獻。對於一張典型的$1000 \times 600$的圖像,總共大約有20000($\approx 60 \times 40 \times 9$)個錨點。忽略跨界錨點後,每張圖像約有6000個錨點用於訓練。如果在訓練中不忽略這些跨界的異常值,它們會在目標函數(shù)中引入大的、難以糾正的誤差項,導致訓練無法收斂。但在測試時,我們?nèi)匀粚⑷矸eRPN應用於整張圖像。這可能會產(chǎn)生跨越邊界的提議框,我們將它們裁剪到圖像邊界。
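訓練時丟棄越界錨點、測試時把提議框裁剪到圖像邊界,這兩步可以示意如下(譯者添加的NumPy草稿,非官方實現(xiàn)):

```python
import numpy as np

def inside_image(anchors, img_w, img_h):
    """訓練時使用:返回完全位於圖像內(nèi)部的錨點下標。"""
    keep = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
            (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h))
    return np.where(keep)[0]

def clip_boxes(boxes, img_w, img_h):
    """測試時使用:把跨越邊界的提議框裁剪到圖像范圍內(nèi)。"""
    boxes = boxes.copy()
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, img_w)
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, img_h)
    return boxes
```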
Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.
一些RPN提議彼此之間高度重疊。為了減少冗余,我們根據(jù)提議區(qū)域的cls分數(shù)對它們采用非極大值抑制(NMS)。我們將NMS的IoU閾值固定為0.7,這樣每張圖像留下大約2000個提議區(qū)域。正如我們將要展示的,NMS不會損害最終的檢測準確性,但會大大減少提議的數(shù)量。在NMS之後,我們使用排名前N的提議區(qū)域進行檢測。接下來,我們使用2000個RPN提議來訓練Fast R-CNN,但在測試時評估不同數(shù)量的提議。
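基於cls分數(shù)的非極大值抑制(IoU閾值0.7)可以用如下NumPy代碼示意(譯者添加的簡化實現(xiàn),非官方代碼):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7, top_n=2000):
    """按cls得分從高到低貪心保留,抑制與已保留框IoU超過閾值的框。"""
    order = scores.argsort()[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0 and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        iw = np.clip(np.minimum(boxes[i, 2], boxes[order[1:], 2]) -
                     np.maximum(boxes[i, 0], boxes[order[1:], 0]), 0, None)
        ih = np.clip(np.minimum(boxes[i, 3], boxes[order[1:], 3]) -
                     np.maximum(boxes[i, 1], boxes[order[1:], 1]), 0, None)
        iou = iw * ih / (areas[i] + areas[order[1:]] - iw * ih)
        order = order[1:][iou <= iou_thresh]
    return np.array(keep)
```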
4. EXPERIMENTS
4.1 Experiments on PASCAL VOC
We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [11]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results on the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the “fast” version of ZF net [32] that has 5 convolutional layers and 3 fully-connected layers, and the public VGG-16 model [3] that has 13 convolutional layers and 3 fully-connected layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than focusing on object proposal proxy metrics).
4. 實驗
4.1 PASCAL VOC上的實驗
我們在PASCAL VOC 2007檢測基準數(shù)據(jù)集[11]上全面評估了我們的方法。這個數(shù)據(jù)集包含20個目標類別、約5000張訓練評估(trainval)圖像和約5000張測試圖像。我們還提供了一些模型在PASCAL VOC 2012基準數(shù)據(jù)集上的結(jié)果。對於ImageNet預訓練網(wǎng)絡,我們使用具有5個卷積層和3個全連接層的ZF網(wǎng)絡[32]的“快速”版本,以及具有13個卷積層和3個全連接層的公開VGG-16模型[3]。我們主要評估檢測的平均精度均值(mAP),因為這才是目標檢測的實際度量指標(而不是關注目標提議的代理度量)。
Table 2 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [4], we generate about 2000 proposals by the “fast” mode. For EdgeBoxes (EB) [6], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of $58.7%$ and EB has an mAP of $58.6%$ under the Fast R-CNN framework. RPN with Fast R-CNN achieves competitive results, with an mAP of $59.9%$ while using up to 300 proposals. Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers’ cost (Table 5).
Table 2: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detectors are Fast R-CNN with ZF, but using various proposal methods for training and testing.
Table 5: Timing (ms) on a K40 GPU, except SS proposal is evaluated in a CPU. “Region-wise” includes NMS, pooling, fully-connected, and softmax layers. See our released code for the profiling of running time.
表2(頂部)顯示了使用各種區(qū)域提議方法進行訓練和測試的Fast R-CNN結(jié)果。這些結(jié)果使用ZF網(wǎng)絡。對於選擇性搜索(SS)[4],我們通過“快速”模式生成約2000個提議。對於EdgeBoxes(EB)[6],我們使用針對0.7 IoU調(diào)優(yōu)的默認EB設置生成提議。在Fast R-CNN框架下,SS的mAP為$58.7%$,EB的mAP為$58.6%$。RPN與Fast R-CNN取得了有競爭力的結(jié)果,在最多只使用300個提議的情況下mAP為$59.9%$。由於共享卷積計算,使用RPN得到的檢測系統(tǒng)比使用SS或EB快得多;更少的提議也降低了逐區(qū)域的全連接層的成本(表5)。
表2:PASCAL VOC 2007測試集上的檢測結(jié)果(在VOC 2007訓練評估集上進行了訓練)。檢測器是帶有ZF的Fast R-CNN,但使用各種提議方法進行訓練和測試。
表5:K40 GPU上的計時(ms),其中SS提議是在CPU中評估的。“逐區(qū)域”(Region-wise)包括NMS、池化、全連接和softmax層。運行時間的分析請參見我們發(fā)布的代碼。
Ablation Experiments on RPN. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing convolutional layers between the RPN and Fast R-CNN detection network. To do this, we stop after the second step in the 4-step training process. Using separate networks reduces the result slightly to $58.7%$ (RPN+ZF, unshared, Table 2). We observe that this is because in the third step when the detector-tuned features are used to fine-tune the RPN, the proposal quality is improved.
RPN上的消融實驗。為了研究RPN作為提議方法的表現(xiàn),我們進行了幾項消融研究。首先,我們展示RPN和Fast R-CNN檢測網(wǎng)絡之間共享卷積層的效果。為此,我們在四步訓練過程的第二步之後停止訓練。使用分離的網(wǎng)絡使結(jié)果略微下降到$58.7%$(RPN+ZF,非共享,表2)。我們觀察到,這是因為在第三步中,當使用經(jīng)檢測器調(diào)優(yōu)的特征來微調(diào)RPN時,提議質(zhì)量得到了改善。
Next, we disentangle the RPN’s influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model by using the 2000 SS proposals and ZF net. We fix this detector and evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation experiments, the RPN does not share features with the detector.
接下來,我們分析RPN對訓練Fast R-CNN檢測網(wǎng)絡的影響。為此,我們使用2000個SS提議和ZF網(wǎng)絡訓練一個Fast R-CNN模型。我們固定這個檢測器,通過改變測試時使用的提議區(qū)域來評估檢測的mAP。在這些消融實驗中,RPN不與檢測器共享特征。
Replacing SS with 300 RPN proposals at test-time leads to an mAP of $56.8%$. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.
在測試階段用300個RPN提議替換SS提議,得到了$56.8%$的mAP。mAP的損失是由於訓練/測試提議不一致。這個結(jié)果作為以下比較的基準。
Somewhat surprisingly, the RPN still leads to a competitive result ($55.1%$) when using the top-ranked 100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6000 RPN proposals (without NMS) has a comparable mAP ($55.2%$), suggesting NMS does not harm the detection mAP and may reduce false alarms.
有些令人驚訝的是,當在測試時使用排名最高的100個提議時,RPN仍然得到有競爭力的結(jié)果($55.1%$),這表明排名靠前的RPN提議是準確的。在另一個極端,使用排名靠前的6000個RPN提議(不做NMS)得到相當?shù)膍AP($55.2%$),這表明NMS不會損害檢測mAP,並且可能減少誤報。
Next, we separately investigate the roles of RPN’s cls and reg outputs by turning off either of them at test-time. When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly sample $N$ proposals from the unscored regions. The mAP is nearly unchanged with $N=1000$ ($55.8%$), but degrades considerably to $44.6%$ when $N=100$. This shows that the cls scores account for the accuracy of the highest ranked proposals.
接下來,我們通過在測試時分別關閉RPN的cls和reg輸出來研究它們各自的作用。當在測試時移除cls層(因此不使用NMS/排名)時,我們從未打分的區(qū)域中隨機采樣$N$個提議。當$N=1000$時,mAP幾乎沒有變化($55.8%$),但當$N=100$時會大幅下降到$44.6%$。這表明cls分數(shù)決定了排名最高的提議的準確性。
On the other hand, when the reg layer is removed at test-time (so the proposals become anchor boxes), the mAP drops to $52.1%$. This suggests that the high-quality proposals are mainly due to the regressed box bounds. The anchor boxes, though having multiple scales and aspect ratios, are not sufficient for accurate detection.
另一方面,當在測試時移除reg層(因此提議變?yōu)殄^盒)時,mAP下降到$52.1%$。這表明高質(zhì)量的提議主要歸功於回歸得到的邊界框。錨盒雖然具有多種尺度和長寬比,但不足以實現(xiàn)準確的檢測。
We also evaluate the effects of more powerful networks on the proposal quality of RPN alone. We use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves from $56.8%$ (using RPN+ZF) to $59.2%$ (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because proposals of RPN+ZF are competitive with SS (both are $58.7%$ when consistently used for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.
我們還單獨評估了更強大的網(wǎng)絡對RPN提議質(zhì)量的影響。我們使用VGG-16來訓練RPN,但仍然使用上述SS+ZF的檢測器。mAP從$56.8%$(使用RPN+ZF)提高到$59.2%$(使用RPN+VGG)。這是一個很有希望的結(jié)果,因為它表明RPN+VGG的提議質(zhì)量好於RPN+ZF。由於RPN+ZF的提議與SS具有競爭力(在一致地用於訓練和測試時都是$58.7%$),我們可以預期RPN+VGG比SS更好。下面的實驗驗證了這個假設。
Performance of VGG-16. Table 3 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the result is $68.5%$ for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than SS. Unlike SS that is pre-defined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is $69.9%$——better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval. The mAP is $73.2%$. Figure 5 shows some results on the PASCAL VOC 2007 test set. On the PASCAL VOC 2012 test set (Table 4), our method has an mAP of $70.4%$ trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Table 6 and Table 7 show the detailed numbers.
Table 3: Detection results on PASCAL VOC 2007 test set. The detector is Fast R-CNN and VGG-16. Training data: “07”: VOC 2007 trainval, “07+12”: union set of VOC 2007 trainval and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2000. ?: this number was reported in [2]; using the repository provided by this paper, this result is higher (68.1).
Table 4: Detection results on PASCAL VOC 2012 test set. The detector is Fast R-CNN and VGG-16. Training data: “07”: VOC 2007 trainval, “07++12”: union set of VOC 2007 trainval+test and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2000. ?: http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html. ?: http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html. §: http://host.robots.ox.ac.uk:8080/anonymous/XEDH10.html.
Table 6: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000. ${RPN}^*$ denotes the unsharing feature version.
Table 7: Results on PASCAL VOC 2012 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000.
Figure 5: Selected examples of object detection results on the PASCAL VOC 2007 test set using the Faster R-CNN system. The model is VGG-16 and the training data is 07+12 trainval ($73.2%$ mAP on the 2007 test set). Our method detects objects of a wide range of scales and aspect ratios. Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images. The running time for obtaining these results is 198ms per image, including all steps.
VGG-16的性能。表3顯示了VGG-16在提議和檢測上的結(jié)果。使用RPN+VGG,在特征不共享的情況下結(jié)果是$68.5%$,略高於SS基準。如上所示,這是因為RPN+VGG生成的提議比SS更準確。與預先定義好的SS不同,RPN是主動訓練的,並能從更好的網(wǎng)絡中受益。對於特征共享的版本,結(jié)果是$69.9%$——好於強的SS基準,而提議幾乎是零成本的。我們進一步在PASCAL VOC 2007 trainval和2012 trainval的聯(lián)合集上訓練RPN和檢測網(wǎng)絡,mAP是$73.2%$。圖5顯示了PASCAL VOC 2007測試集上的一些結(jié)果。在PASCAL VOC 2012測試集上(表4),我們的方法在VOC 2007 trainval+test和VOC 2012 trainval的聯(lián)合集上訓練,取得了$70.4%$的mAP。表6和表7給出了詳細的數(shù)值。
表3:PASCAL VOC 2007測試集上的檢測結(jié)果。檢測器是Fast R-CNN和VGG-16。訓練數(shù)據(jù):“07”:VOC 2007 trainval;“07+12”:VOC 2007 trainval和VOC 2012 trainval的聯(lián)合集。對於RPN,訓練時Fast R-CNN使用的提議數(shù)量為2000。?:這個數(shù)字在[2]中報道;使用本文提供的代碼倉庫,這個結(jié)果更高(68.1)。
表4:PASCAL VOC 2012測試集上的檢測結(jié)果。檢測器是Fast R-CNN和VGG-16。訓練數(shù)據(jù):“07”:VOC 2007 trainval;“07++12”:VOC 2007 trainval+test和VOC 2012 trainval的聯(lián)合集。對於RPN,訓練時Fast R-CNN使用的提議數(shù)量為2000。?:http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html。?:http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html。§:http://host.robots.ox.ac.uk:8080/anonymous/XEDH10.html。
表6:使用Fast R-CNN檢測器和VGG-16在PASCAL VOC 2007測試集上的結(jié)果。對於RPN,訓練時Fast R-CNN使用的提議數(shù)量為2000。${RPN}^*$表示不共享特征的版本。
表7:使用Fast R-CNN檢測器和VGG-16在PASCAL VOC 2012測試集上的結(jié)果。對於RPN,訓練時Fast R-CNN使用的提議數(shù)量為2000。
圖5:使用Faster R-CNN系統(tǒng)在PASCAL VOC 2007測試集上選出的目標檢測結(jié)果示例。模型是VGG-16,訓練數(shù)據(jù)是07+12 trainval(在2007測試集上mAP為$73.2%$)。我們的方法可以檢測各種尺度和長寬比的目標。每個輸出框都帶有一個類別標簽和一個[0, 1]之間的softmax分數(shù)。顯示這些圖像所用的分數(shù)閾值為0.6。得到這些結(jié)果的運行時間為每張圖像198ms,包括所有步驟。
In Table 5 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average about 1.5s), and Fast R-CNN with VGG-16 takes 320ms on 2000 SS proposals (or 223ms if using SVD on fully-connected layers [2]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the convolutional features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also lower, thanks to fewer proposals (300 per image). Our system has a frame-rate of 17 fps with the ZF net.
在表5中,我們總結(jié)了整個目標檢測系統(tǒng)的運行時間。SS需要1-2秒,取決於圖像內(nèi)容(平均約1.5s);使用VGG-16的Fast R-CNN在2000個SS提議上需要320ms(如果在全連接層上使用SVD[2],則為223ms)。我們使用VGG-16的系統(tǒng)在提議和檢測上總共需要198ms。在共享卷積特征的情況下,單獨的RPN只需要10ms來計算附加的層。由於提議較少(每張圖像300個),我們逐區(qū)域的計算量也更低。我們的系統(tǒng)在使用ZF網(wǎng)絡時幀率為17fps。
Sensitivities to Hyper-parameters. In Table 8 we investigate the settings of anchors. By default we use 3 scales and 3 aspect ratios ($69.9%$ mAP in Table 8). If using just one anchor at each position, the mAP drops by a considerable margin of $3-4%$. The mAP is higher if using 3 scales (with 1 aspect ratio) or 3 aspect ratios (with 1 scale), demonstrating that using anchors of multiple sizes as the regression references is an effective solution. Using just 3 scales with 1 aspect ratio ($69.8%$) is as good as using 3 scales with 3 aspect ratios on this dataset, suggesting that scales and aspect ratios are not disentangled dimensions for the detection accuracy. But we still adopt these two dimensions in our designs to keep our system flexible.
對超參數(shù)的敏感度。在表8中,我們研究錨點的設置。默認情況下,我們使用3個尺度和3個長寬比(表8中$69.9%$的mAP)。如果在每個位置只使用一個錨點,mAP會下降$3-4%$這樣相當大的幅度。如果使用3個尺度(搭配1個長寬比)或3個長寬比(搭配1個尺度),則mAP更高,這表明使用多種尺寸的錨點作為回歸參考是有效的解決方案。在這個數(shù)據(jù)集上,僅使用3個尺度搭配1個長寬比($69.8%$)與使用3個尺度搭配3個長寬比一樣好,這表明尺度和長寬比對檢測精度而言並不是相互獨立(解耦)的維度。但我們?nèi)匀辉谠O計中采用這兩個維度,以保持系統(tǒng)的靈活性。
In Table 9 we compare different values of $\lambda$ in Equation (1). By default we use $\lambda=10$ which makes the two terms in Equation (1) roughly equally weighted after normalization. Table 9 shows that our result is impacted just marginally (by $\sim 1%$) when $\lambda$ is within a scale of about two orders of magnitude (1 to 100). This demonstrates that the result is insensitive to $\lambda$ in a wide range.
在表9中,我們比較了公式(1)中$\lambda$的不同取值。默認情況下,我們使用$\lambda=10$,這使公式(1)中的兩項在歸一化之後權重大致相等。表9顯示,當$\lambda$在大約兩個數(shù)量級(1到100)的范圍內(nèi)變化時,我們的結(jié)果只受到很小的影響($\sim 1%$)。這表明結(jié)果在很寬的范圍內(nèi)對$\lambda$不敏感。
Analysis of Recall-to-IoU. Next we compute the recall of proposals at different IoU ratios with ground-truth boxes. It is noteworthy that the Recall-to-IoU metric is just loosely [19], [20], [21] related to the ultimate detection accuracy. It is more appropriate to use this metric to diagnose the proposal method than to evaluate it.
召回率與IoU的分析。接下來,我們計算提議與真實邊界框在不同IoU比率下的召回率。值得注意的是,召回率-IoU(Recall-to-IoU)度量與最終的檢測精度只是松散相關[19],[20],[21]。用這個指標來診斷提議方法比用它來評估提議方法更合適。
In Figure 4, we show the results of using 300, 1000, and 2000 proposals. We compare with SS and EB, and the N proposals are the top-N ranked ones based on the confidence generated by these methods. The plots show that the RPN method behaves gracefully when the number of proposals drops from 2000 to 300. This explains why the RPN has a good ultimate detection mAP when using as few as 300 proposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The recall of SS and EB drops more quickly than RPN when the proposals are fewer.
Figure 4: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set.
在圖4中,我們展示了使用300、1000和2000個提議的結(jié)果。我們與SS和EB進行比較,這N個提議是根據(jù)各方法產(chǎn)生的置信度排名前N的提議。圖中顯示,當提議數(shù)量從2000個降到300個時,RPN方法的表現(xiàn)依然平穩(wěn)。這解釋了為什麼RPN在只使用300個提議時仍具有良好的最終檢測mAP。正如我們之前分析的,這個特性主要歸因於RPN的cls項。當提議較少時,SS和EB的召回率比RPN下降得更快。
圖4:PASCAL VOC 2007測試集上的召回率與IoU重疊率的關系。
One-Stage Detection vs. Two-Stage Proposal + Detection. The OverFeat paper [9] proposes a detection method that uses regressors and classifiers on sliding windows over convolutional feature maps. OverFeat is a one-stage, class-specific detection pipeline, and ours is a two-stage cascade consisting of class-agnostic proposals and class-specific detections. In OverFeat, the region-wise features come from a sliding window of one aspect ratio over a scale pyramid. These features are used to simultaneously determine the location and category of objects. In RPN, the features are from square ($3\times 3$) sliding windows and predict proposals relative to anchors with different scales and aspect ratios. Though both methods use sliding windows, the region proposal task is only the first stage of Faster R-CNN —— the downstream Fast R-CNN detector attends to the proposals to refine them. In the second stage of our cascade, the region-wise features are adaptively pooled [1], [2] from proposal boxes that more faithfully cover the features of the regions. We believe these features lead to more accurate detections.
一階段檢測與兩階段提議+檢測。OverFeat論文[9]提出了一種在卷積特征映射的滑動窗口上使用回歸器和分類器的檢測方法。OverFeat是一個一階段、特定類別的檢測流程,而我們的是兩階段級聯(lián),由類別無關的提議和特定類別的檢測組成。在OverFeat中,區(qū)域特征來自尺度金字塔上單一長寬比的滑動窗口。這些特征被用來同時確定目標的位置和類別。在RPN中,特征來自正方形($3\times 3$)的滑動窗口,并相對于具有不同尺度和長寬比的錨點預測提議。雖然這兩種方法都使用滑動窗口,但區(qū)域提議任務只是Faster R-CNN的第一階段——下游的Fast R-CNN檢測器會關注這些提議并對其進行細化。在我們級聯(lián)的第二階段,區(qū)域特征是從更忠實地覆蓋區(qū)域特征的提議框中自適應地池化[1],[2]得到的。我們相信這些特征會帶來更準確的檢測結(jié)果。
To compare the one-stage and two-stage systems, we emulate the OverFeat system (and thus also circumvent other differences of implementation details) by one-stage Fast R-CNN. In this system, the “proposals” are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is trained to predict class-specific scores and regress box locations from these sliding windows. Because the OverFeat system adopts an image pyramid, we also evaluate using convolutional features extracted from 5 scales. We use those 5 scales as in [1], [2].
為了比較一階段和兩階段系統(tǒng),我們通過一階段的Fast R-CNN來模擬OverFeat系統(tǒng)(從而也規(guī)避了實現(xiàn)細節(jié)上的其他差異)。在這個系統(tǒng)中,“提議”是3個尺度(128,256,512)和3個長寬比(1:1,1:2,2:1)的密集滑動窗口。訓練Fast R-CNN來預測特定類別的分數(shù),并從這些滑動窗口回歸邊界框位置。由于OverFeat系統(tǒng)采用圖像金字塔,我們也使用從5個尺度提取的卷積特征進行評估。我們使用與[1],[2]中相同的5個尺度。
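To give a sense of what the one-stage baseline's dense "proposals" look like, the snippet below enumerates sliding windows of 3 scales and 3 aspect ratios on a regular grid. The stride of 16, the function name, and the window-count example are assumptions for illustration only.

```python
import numpy as np

def dense_sliding_windows(image_h, image_w, stride=16,
                          scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Dense, class-agnostic "proposals" for the one-stage baseline: every
    stride-th position gets one window per (scale, ratio) pair, with no
    learned proposal step in front of the detector."""
    boxes = []
    for cy in range(0, image_h, stride):
        for cx in range(0, image_w, stride):
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

# A 600x1000 feature of windows: about 38 * 63 * 9 ~ 21k, far more than 300 RPN proposals.
print(len(dense_sliding_windows(600, 1000)))
```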
Table 10 compares the two-stage system and two variants of the one-stage system. Using the ZF model, the one-stage system has an mAP of $53.9%$. This is lower than the two-stage system ($58.7%$) by $4.8%$. This experiment justifies the effectiveness of cascaded region proposals and object detection. Similar observations are reported in [2], [39], where replacing SS region proposals with sliding windows leads to $\sim 6%$ degradation in both papers. We also note that the one-stage system is slower as it has considerably more proposals to process.
Table 10: One-Stage Detection vs. Two-Stage Proposal + Detection. Detection results are on the PASCAL VOC 2007 test set using the ZF model and Fast R-CNN. RPN uses unshared features.
表10比較了兩階段系統(tǒng)和一階段系統(tǒng)的兩個變種。使用ZF模型,一階段系統(tǒng)的mAP為$53.9%$,比兩階段系統(tǒng)($58.7%$)低$4.8%$。這個實驗驗證了級聯(lián)區(qū)域提議和目標檢測的有效性。文獻[2],[39]中報道了類似的觀察結(jié)果,在這兩篇論文中,用滑動窗口取代SS區(qū)域提議都會導致$\sim 6%$的退化。我們還注意到,一階段系統(tǒng)更慢,因為它有多得多的提議需要處理。
表10:一階段檢測與兩階段提議+檢測。使用ZF模型和Fast R-CNN在PASCAL VOC 2007測試集上的檢測結(jié)果。RPN使用未共享的特征。
4.2 Experiments on MS COCO
We present more results on the Microsoft COCO object detection dataset [12]. This dataset involves 80 object categories. We experiment with the 80k images on the training set, 40k images on the validation set, and 20k images on the test-dev set. We evaluate the mAP averaged for $IoU \in [0.5:0.05:0.95]$ (COCO’s standard metric, simply denoted as mAP@[.5, .95]) and mAP@0.5 (PASCAL VOC’s metric).
4.2 在MS COCO上的實驗
我們在Microsoft COCO目標檢測數(shù)據(jù)集[12]上提供了更多的結(jié)果。這個數(shù)據(jù)集包含80個目標類別衰琐。我們用訓練集上的8萬張圖像也糊,驗證集上的4萬張圖像以及測試開發(fā)集上的2萬張圖像進行實驗。我們評估了$IoU \in [0.5:0.05:0.95]$的平均mAP(COCO標準度量碘耳,簡稱為mAP@[.5,.95])和mAP@0.5(PASCAL VOC度量)显设。
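As a quick illustration of the metric (not COCO's official evaluation code), the sketch below averages per-threshold AP values over IoU = 0.50:0.05:0.95; the dictionary of per-threshold APs is hypothetical input for one category.

```python
import numpy as np

def coco_style_map(ap_by_threshold):
    """COCO's primary metric averages AP over IoU thresholds 0.50:0.05:0.95;
    mAP@0.5 (the PASCAL VOC metric) is just the entry at threshold 0.5."""
    thresholds = np.arange(0.50, 1.00, 0.05)
    aps = np.array([ap_by_threshold[round(t, 2)] for t in thresholds])
    return aps.mean(), ap_by_threshold[0.5]

# Hypothetical per-threshold APs: AP decays linearly as the IoU threshold rises.
aps = {round(t, 2): max(0.0, 0.6 - (t - 0.5)) for t in np.arange(0.50, 1.00, 0.05)}
print(coco_style_map(aps))   # (mAP@[.5, .95], mAP@0.5)
```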
There are a few minor changes of our system made for this dataset. We train our models on an 8-GPU implementation, and the effective mini-batch size becomes 8 for RPN (1 per GPU) and 16 for Fast R-CNN (2 per GPU). The RPN step and Fast R-CNN step are both trained for 240k iterations with a learning rate of 0.003 and then for 80k iterations with 0.0003. We modify the learning rates (starting with 0.003 instead of 0.001) because the mini-batch size is changed. For the anchors, we use 3 aspect ratios and 4 scales (adding $64^2$), mainly motivated by handling small objects on this dataset. In addition, in our Fast R-CNN step, the negative samples are defined as those with a maximum IoU with ground truth in the interval of [0,0.5), instead of [0.1,0.5) used in [1], [2]. We note that in the SPPnet system [1], the negative samples in [0.1, 0.5) are used for network fine-tuning, but the negative samples in [0, 0.5) are still visited in the SVM step with hard-negative mining. But the Fast R-CNN system [2] abandons the SVM step, so the negative samples in [0,0.1) are never visited. Including these [0,0.1) samples improves mAP@0.5 on the COCO dataset for both Fast R-CNN and Faster R-CNN systems (but the impact is negligible on PASCAL VOC).
我們的系統(tǒng)針對這個數(shù)據(jù)集做了一些小的改動。我們在8 GPU的實現(xiàn)上訓練模型,RPN的有效小批量大小變?yōu)?(每個GPU 1個),F(xiàn)ast R-CNN的有效小批量大小變?yōu)?6(每個GPU 2個)。RPN步驟和Fast R-CNN步驟都先以0.003的學習率訓練24萬次迭代,再以0.0003的學習率訓練8萬次迭代。由于小批量大小發(fā)生了變化,我們相應修改了學習率(從0.003而不是0.001開始)。對于錨點,我們使用3個長寬比和4個尺度(增加了$64^2$),這主要是為了處理這個數(shù)據(jù)集上的小目標。此外,在我們的Fast R-CNN步驟中,負樣本定義為與真實邊界框的最大IoU在[0,0.5)區(qū)間內(nèi)的樣本,而不是[1],[2]中使用的[0.1,0.5)。我們注意到,在SPPnet系統(tǒng)[1]中,[0.1,0.5)中的負樣本用于網(wǎng)絡微調(diào),但[0,0.5)中的負樣本在帶有難例挖掘的SVM步驟中仍會被使用。但是Fast R-CNN系統(tǒng)[2]放棄了SVM步驟,所以[0,0.1)中的負樣本從未被使用。包含這些[0,0.1)區(qū)間的樣本后,F(xiàn)ast R-CNN和Faster R-CNN系統(tǒng)在COCO數(shù)據(jù)集上的mAP@0.5都有所改進(但對PASCAL VOC的影響可以忽略不計)。
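The change in the negative-sample interval can be sketched as a simple filter on each RoI's maximum IoU with ground truth. The function name `label_rois` and its arguments are hypothetical, and real training code would additionally subsample the two sets to fixed foreground/background ratios.

```python
import numpy as np

def label_rois(max_iou_with_gt, fg_thresh=0.5, bg_range=(0.0, 0.5)):
    """Split candidate RoIs into foreground / background by their maximum IoU
    with any ground-truth box. bg_range=(0.0, 0.5) is the COCO setting described
    here; (0.1, 0.5) reproduces the interval used in [1], [2]."""
    fg = np.where(max_iou_with_gt >= fg_thresh)[0]
    bg = np.where((max_iou_with_gt >= bg_range[0]) &
                  (max_iou_with_gt < bg_range[1]))[0]
    return fg, bg

max_iou = np.array([0.05, 0.2, 0.45, 0.6, 0.8])
print(label_rois(max_iou))                      # keeps the 0.05 RoI as background
print(label_rois(max_iou, bg_range=(0.1, 0.5))) # excludes it, as in [1], [2]
```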
The rest of the implementation details are the same as on PASCAL VOC. In particular, we keep using 300 proposals and single-scale ($s=600$) testing. The testing time is still about 200ms per image on the COCO dataset.
其余的實現(xiàn)細節(jié)與PASCAL VOC上相同。特別地,我們繼續(xù)使用300個提議和單尺度($s=600$)測試。在COCO數(shù)據(jù)集上,每張圖像的測試時間仍約為200ms。
In Table 11 we first report the results of the Fast R-CNN system [2] using the implementation in this paper. Our Fast R-CNN baseline has $39.3%$ mAP@0.5 on the test-dev set, higher than that reported in [2]. We conjecture that the reason for this gap is mainly due to the definition of the negative samples and also the changes of the mini-batch sizes. We also note that the mAP@[.5, .95] is just comparable.
Table 11: Object detection results (%) on the MS COCO dataset. The model is VGG-16.
在表11中,我們首先報告了使用本文實現(xiàn)的Fast R-CNN系統(tǒng)[2]的結(jié)果。我們的Fast R-CNN基準在test-dev集上有$39.3%$的mAP@0.5,比[2]中報告的更高。我們推測造成這種差距的原因主要在于負樣本的定義以及小批量大小的變化。我們還注意到,mAP@[.5, .95]只是大致相當。
表11:在MS COCO數(shù)據(jù)集上的目標檢測結(jié)果(%)帽衙。模型是VGG-16菜皂。
Next we evaluate our Faster R-CNN system. Using the COCO training set to train, Faster R-CNN has $42.1%$ mAP@0.5 and $21.5%$ mAP@[.5, .95] on the COCO test-dev set. This is $2.8%$ higher for mAP@0.5 and $2.2%$ higher for mAP@[.5, .95] than the Fast R-CNN counterpart under the same protocol (Table 11). This indicates that RPN performs excellently for improving the localization accuracy at higher IoU thresholds. Using the COCO trainval set to train, Faster R-CNN has $42.7%$ mAP@0.5 and $21.9%$ mAP@[.5, .95] on the COCO test-dev set. Figure 6 shows some results on the MS COCO test-dev set.
Figure 6: Selected examples of object detection results on the MS COCO test-dev set using the Faster R-CNN system. The model is VGG-16 and the training data is COCO trainval ($42.7%$ mAP@0.5 on the test-dev set). Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images. For each image, one color represents one object category in that image.
接下來我們評估我們的Faster R-CNN系統(tǒng)。使用COCO訓練集進行訓練,F(xiàn)aster R-CNN在COCO test-dev集上有$42.1%$的mAP@0.5和$21.5%$的mAP@[.5, .95]。與相同協(xié)議下的Fast R-CNN相比,mAP@0.5高$2.8%$,mAP@[.5, .95]高$2.2%$(表11)。這表明,RPN在提高較高IoU閾值下的定位精度方面表現(xiàn)出色。使用COCO trainval集進行訓練,F(xiàn)aster R-CNN在COCO test-dev集上有$42.7%$的mAP@0.5和$21.9%$的mAP@[.5, .95]。圖6顯示了MS COCO test-dev集上的一些結(jié)果。
圖6:使用Faster R-CNN系統(tǒng)在MS COCO test-dev集上選取的目標檢測結(jié)果示例。模型是VGG-16,訓練數(shù)據(jù)是COCO trainval(在test-dev集上為$42.7%$的mAP@0.5)。每個輸出框都與一個類別標簽和[0, 1]之間的softmax分數(shù)相關聯(lián)。顯示這些圖像使用了0.6的分數(shù)閾值。對于每張圖像,一種顏色表示該圖像中的一個目標類別。
Faster R-CNN in ILSVRC & COCO 2015 competitions. We have demonstrated that Faster R-CNN benefits more from better features, thanks to the fact that the RPN completely learns to propose regions by neural networks. This observation is still valid even when one increases the depth substantially to over 100 layers [18]. Only by replacing VGG-16 with a 101-layer residual net (ResNet-101) [18], the Faster R-CNN system increases the mAP from $41.5%/21.2%$ (VGG-16) to $48.4%/27.2%$ (ResNet-101) on the COCO val set. With other improvements orthogonal to Faster R-CNN, He et al. [18] obtained a single-model result of $55.7%/34.9%$ and an ensemble result of $59.0%/37.4%$ on the COCO test-dev set, which won the 1st place in the COCO 2015 object detection competition. The same system [18] also won the 1st place in the ILSVRC 2015 object detection competition, surpassing the second place by absolute $8.5%$. RPN is also a building block of the 1st-place winning entries in ILSVRC 2015 localization and COCO 2015 segmentation competitions, for which the details are available in [18] and [15] respectively.
在ILSVRC和COCO 2015比賽中的Faster R-CNN。我們已經(jīng)證明,由于RPN完全通過神經(jīng)網(wǎng)絡學習提議區(qū)域,F(xiàn)aster R-CNN能從更好的特征中獲得更多收益。即使將深度大幅增加到100層以上,這一觀察仍然有效[18]。僅用101層殘差網(wǎng)絡(ResNet-101)[18]代替VGG-16,F(xiàn)aster R-CNN系統(tǒng)就將COCO驗證集上的mAP從$41.5%/21.2%$(VGG-16)提高到$48.4%/27.2%$(ResNet-101)。結(jié)合與Faster R-CNN正交的其他改進,何等人[18]在COCO test-dev集上獲得了$55.7%/34.9%$的單模型結(jié)果和$59.0%/37.4%$的集成結(jié)果,在COCO 2015目標檢測競賽中獲得了第一名。同樣的系統(tǒng)[18]也在ILSVRC 2015目標檢測競賽中獲得了第一名,超過第二名絕對的$8.5%$。RPN也是ILSVRC 2015定位和COCO 2015分割競賽第一名獲勝方案的基石,詳情請分別參見[18]和[15]。
4.3 From MS COCO to PASCAL VOC
Large-scale data is of crucial importance for improving deep neural networks. Next, we investigate how the MS COCO dataset can help with the detection performance on PASCAL VOC.
4.3 從MS COCO到PASCAL VOC
大規(guī)模數(shù)據(jù)對改善深度神經(jīng)網(wǎng)絡至關重要兰迫。接下來信殊,我們調(diào)查MS COCO數(shù)據(jù)集如何幫助改進在PASCAL VOC上的檢測性能。
As a simple baseline, we directly evaluate the COCO detection model on the PASCAL VOC dataset, without fine-tuning on any PASCAL VOC data. This evaluation is possible because the categories on COCO are a superset of those on PASCAL VOC. The categories that are exclusive on COCO are ignored in this experiment, and the softmax layer is performed only on the 20 categories plus background. The mAP under this setting is $76.1%$ on the PASCAL VOC 2007 test set (Table 12). This result is better than that trained on VOC07+12 ($73.2%$) by a good margin, even though the PASCAL VOC data are not exploited.
作為一個簡單的基準,我們直接在PASCAL VOC數(shù)據(jù)集上評估COCO檢測模型,而不在任何PASCAL VOC數(shù)據(jù)上進行微調(diào)。這種評估之所以可行,是因為COCO的類別是PASCAL VOC類別的超集。在這個實驗中忽略COCO專有的類別,softmax層僅在20個類別加背景上執(zhí)行。在這種設置下,PASCAL VOC 2007測試集上的mAP為$76.1%$(表12)。即使沒有利用PASCAL VOC的數(shù)據(jù),這個結(jié)果也明顯好于在VOC07+12上訓練的模型($73.2%$)。
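A minimal sketch of this evaluation trick, assuming the detector outputs per-class logits with background at index 0: keep only background plus the 20 VOC categories and apply softmax over that subset. The COCO-to-VOC index map below is a placeholder; the real indices depend on the category ordering of the trained model.

```python
import numpy as np

# Hypothetical index map: position k holds the COCO class index of the k-th
# PASCAL VOC category (placeholder values, not the actual mapping).
VOC_TO_COCO = np.array([4, 1, 14, 8, 39, 5, 2, 15, 56, 19,
                        60, 16, 17, 3, 0, 58, 18, 57, 6, 62])

def restrict_to_voc(coco_logits):
    """Perform softmax over background + the 20 VOC categories only, ignoring
    the COCO-only categories (index 0 = background by assumption)."""
    keep = np.concatenate([[0], 1 + VOC_TO_COCO])
    z = coco_logits[keep] - coco_logits[keep].max()   # stable softmax
    e = np.exp(z)
    return e / e.sum()

logits = np.random.randn(81)                 # background + 80 COCO categories (assumed layout)
print(restrict_to_voc(logits).shape)         # (21,): background + 20 VOC categories
```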
Then we fine-tune the COCO detection model on the VOC dataset. In this experiment, the COCO model is in place of the ImageNet-pre-trained model (that is used to initialize the network weights), and the Faster R-CNN system is fine-tuned as described in Section 3.2. Doing so leads to $78.8%$ mAP on the PASCAL VOC 2007 test set. The extra data from the COCO set increases the mAP by $5.6%$. Table 6 shows that the model trained on COCO+VOC has the best AP for every individual category on PASCAL VOC 2007. Similar improvements are observed on the PASCAL VOC 2012 test set (Table 12 and Table 7). We note that the test-time speed of obtaining these strong results is still about 200ms per image.
Table 6: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000. $RPN^*$ denotes the unsharing feature version.
Table 12: Detection mAP (%) of Faster R-CNN on PASCAL VOC 2007 test set and 2012 test set using different training data. The model is VGG-16. “COCO” denotes that the COCO trainval set is used for training. See also Table 6 and Table 7.
Table 7: Results on PASCAL VOC 2012 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000.
然后我們在VOC數(shù)據(jù)集上對COCO檢測模型進行微調(diào)仲吏。在這個實驗中,COCO模型代替了ImageNet的預訓練模型(用于初始化網(wǎng)絡權重)蝌焚,F(xiàn)aster R-CNN系統(tǒng)按3.2節(jié)所述進行微調(diào)裹唆。這樣做在PASCAL VOC 2007測試集上可以達到$78.8%$的mAP。來自COCO集合的額外數(shù)據(jù)增加了$5.6%$的mAP只洒。表6顯示许帐,在PASCAL VOC 2007上,使用COCO+VOC訓練的模型在每個類別上具有最好的AP值毕谴。在PASCAL VOC 2012測試集(表12和表7)中也觀察到類似的改進成畦。我們注意到獲得這些強大結(jié)果的測試時間速度仍然是每張圖像200ms左右。
表6:Fast R-CNN檢測器和VGG-16在PASCAL VOC 2007測試集上的結(jié)果。對于RPN,F(xiàn)ast R-CNN訓練時的提議數(shù)量為2000。$RPN^*$表示不共享特征的版本。
表12:使用不同訓練數(shù)據(jù)的Faster R-CNN在PASCAL VOC 2007測試集和2012測試集上的檢測mAP(%)。模型是VGG-16。“COCO”表示使用COCO trainval數(shù)據(jù)集進行訓練。另見表6和表7。
表7:Fast R-CNN檢測器和VGG-16在PASCAL VOC 2012測試集上的結(jié)果。對于RPN棋弥,F(xiàn)ast R-CNN訓練時的提議數(shù)量為2000核偿。
5. CONCLUSION
We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.
5. 結(jié)論
我們已經(jīng)提出了RPN來生成高效,準確的區(qū)域提議顽染。通過與下游檢測網(wǎng)絡共享卷積特征宪祥,區(qū)域提議步驟幾乎是零成本的。我們的方法使統(tǒng)一的家乘,基于深度學習的目標檢測系統(tǒng)能夠以接近實時的幀率運行。學習到的RPN也提高了區(qū)域提議的質(zhì)量藏澳,從而提高了整體的目標檢測精度仁锯。
REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision (ECCV), 2014.
[2] R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015.
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
[4] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision (IJCV), 2013.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision (ECCV), 2014.
[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.
[9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in International Conference on Learning Representations (ICLR), 2014.
[10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Neural Information Processing Systems (NIPS), 2015.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” 2007.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in European Conference on Computer Vision (ECCV), 2014.
[13] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object detection in rgb-d images,” arXiv:1511.02300, 2015.
[14] J. Zhu, X. Chen, and A. L. Yuille, “DeePM: A deep part-based model for object detection and semantic part localization,” arXiv:1511.07131, 2015.
[15] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” arXiv:1512.04412, 2015.
[16] J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully convolutional localization networks for dense captioning,” arXiv:1511.07571, 2015.
[17] D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, “Human curation and convnets: Powering item-to-item recommendations on pinterest,” arXiv:1511.04003, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, 2015.
[19] J. Hosang, R. Benenson, and B. Schiele, “How good are detection proposals, really?” in British Machine Vision Conference (BMVC), 2014.
[20] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What makes for effective detection proposals?” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
[21] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra, “Object-Proposal Evaluation Protocol is ’Gameable’,” arXiv: 1505.05836, 2015.
[22] J. Carreira and C. Sminchisescu, “CPMC: Automatic object segmentation using constrained parametric min-cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[23] P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[24] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[25] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Neural Information Processing Systems (NIPS), 2013.
[26] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[27] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable, high-quality object detection,” arXiv:1412.1441 (v1), 2015.
[28] P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in Neural Information Processing Systems (NIPS), 2015.
[29] J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint object and stuff segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” arXiv:1504.06066, 2015.
[31] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Neural Information Processing Systems (NIPS), 2015.
[32] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014.
[33] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in International Conference on Machine Learning (ICML), 2010.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, 1989.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” in International Journal of Computer Vision (IJCV), 2015.
[37] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Neural Information Processing Systems (NIPS), 2012.
[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014.
[39] K. Lenc and A. Vedaldi, “R-CNN minus R,” in British Machine Vision Conference (BMVC), 2015.