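Author: Tyan
Blog: noahsnail.com | CSDN | Jianshu | Cloud+ Community
Disclaimer: The author translated this paper for study purposes only. If it infringes any rights, please contact the author to have the post removed. Thank you!
Collection of translated papers: https://github.com/SnailTyan/deep-learning-papers-translation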
SSD: Single Shot MultiBox Detector
Abstract
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has accuracy competitive with methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on an Nvidia Titan X, and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at: https://github.com/weiliu89/caffe/tree/ssd.
1. Introduction
Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection, all based on Faster R-CNN [2], albeit with deeper features such as [3]. While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications. Often detection speed for these approaches is measured in seconds per frame (SPF), and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS). There have been many attempts to build faster detectors by attacking each stage of the detection pipeline (see related work in Sec. 4), but so far, significantly increased speed comes only at the cost of significantly decreased detection accuracy.
This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP 74.3% on VOC2007 test, vs. Faster R-CNN 7 FPS with mAP 73.2% or YOLO 45 FPS with mAP 63.4%). The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. We are not the first to do this (cf. [4,5]), but by adding a series of improvements, we manage to increase the accuracy significantly over previous attempts. Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. With these modifications, especially using multiple layers for prediction at different scales, we can achieve high accuracy using relatively low resolution input, further increasing detection speed. While these contributions may seem small independently, we note that the resulting system improves accuracy on real-time detection for PASCAL VOC from 63.4% mAP for YOLO to 74.3% mAP for our SSD. This is a larger relative improvement in detection accuracy than that from the recent, very high-profile work on residual networks [3]. Furthermore, significantly improving the speed of high-quality detection can broaden the range of settings where computer vision is useful.
We summarize our contributions as follows:
- We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state-of-the-art for single shot detectors (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
- The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
- To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio.
- These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs. accuracy trade-off.
- Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches.
2. The Single Shot Detector (SSD)
This section describes our proposed SSD framework for detection (Sec. 2.1) and the associated training methodology (Sec. 2.2). Afterwards, Sec. 3 presents dataset-specific model details and experimental results.
2.1 Model
The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we will call the base network. We then add auxiliary structure to the network to produce detections with the following key features:
Multi-scale feature maps for detection. We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional model for predicting detections is different for each feature layer (cf. OverFeat [4] and YOLO [5], which operate on a single scale feature map).
Convolutional predictors for detection. Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. These are indicated on top of the SSD network architecture in Fig. 2. For a feature layer of size $m \times n$ with $p$ channels, the basic element for predicting parameters of a potential detection is a $3 \times 3 \times p$ small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the $m \times n$ locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default box position relative to each feature map location (cf. the architecture of YOLO [5], which uses an intermediate fully connected layer instead of a convolutional filter for this step).
Fig. 2: A comparison between two single shot detection models: SSD and YOLO [5]. Our SSD model adds several feature layers to the end of a base network, which predict the offsets to default boxes of different scales and aspect ratios and their associated confidences. SSD with a 300 × 300 input size significantly outperforms its 448 × 448 YOLO counterpart in accuracy on VOC2007 test while also improving the speed.
Default boxes and aspect ratios. We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of $k$ at a given location, we compute $c$ class scores and the $4$ offsets relative to the original default box shape. This results in a total of $(c+4)k$ filters that are applied around each location in the feature map, yielding $(c+4)kmn$ outputs for an $m\times n$ feature map. For an illustration of default boxes, please refer to Fig. 1. Our default boxes are similar to the anchor boxes used in Faster R-CNN [2]; however, we apply them to several feature maps of different resolutions. Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.
Fig. 1: SSD framework. (a) SSD only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, we evaluate a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. 8 × 8 and 4 × 4 in (b) and (c)). For each default box, we predict both the shape offsets and the confidences for all object categories $(c_1, c_2, \dots, c_p)$. At training time, we first match these default boxes to the ground truth boxes. For example, we have matched two default boxes with the cat and one with the dog, which are treated as positives and the rest as negatives. The model loss is a weighted sum between localization loss (e.g. Smooth L1 [6]) and confidence loss (e.g. Softmax).
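As a quick check on the counting argument above, the following minimal sketch evaluates $(c+4)k$ and $(c+4)kmn$ for one feature map. The function name and the example values (an 8 × 8 map, $k = 4$, and $c = 21$ for 20 VOC classes plus background) are illustrative assumptions, not settings stated in this passage.

```python
from typing import Tuple


def prediction_outputs(m: int, n: int, k: int, c: int) -> Tuple[int, int]:
    """Return (filters per location, total outputs) for one m x n feature map.

    k is the number of default boxes per location, c the number of class scores.
    """
    filters_per_location = (c + 4) * k          # c class scores + 4 offsets per box
    total_outputs = filters_per_location * m * n
    return filters_per_location, total_outputs


# Hypothetical example: 8 x 8 feature map, 4 default boxes, 21 class scores.
filters, outputs = prediction_outputs(m=8, n=8, k=4, c=21)
print(filters, outputs)  # 100 6400
```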
2.2 Training
The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Some version of this is also required for training in YOLO [5] and for the region proposal stage of Faster R-CNN [2] and MultiBox [7]. Once this assignment is determined, the loss function and back propagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies.
Matching strategy. During training we need to determine which default boxes correspond to a ground truth detection and train the network accordingly. For each ground truth box we are selecting from default boxes that vary over location, aspect ratio, and scale. We begin by matching each ground truth box to the default box with the best Jaccard overlap (as in MultiBox [7]). Unlike MultiBox, we then match default boxes to any ground truth with Jaccard overlap higher than a threshold (0.5). This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.
Note: Jaccard overlap is the same as intersection over union (IoU).
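The two-step matching rule can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' Caffe implementation: the corner-format boxes, the function names, and the choice to assign a default box to its single best-overlapping ground truth when several exceed the threshold are all assumptions made here.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed format


def jaccard(a: Box, b: Box) -> float:
    """Jaccard overlap (IoU) of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(defaults: List[Box], truths: List[Box], threshold: float = 0.5) -> Dict[int, int]:
    """Map default-box index -> ground-truth index for all positive matches."""
    if not defaults or not truths:
        return {}
    matches: Dict[int, int] = {}
    # Threshold rule: a default box is positive if its best overlap exceeds the threshold.
    for i, d in enumerate(defaults):
        j_best = max(range(len(truths)), key=lambda j: jaccard(d, truths[j]))
        if jaccard(d, truths[j_best]) > threshold:
            matches[i] = j_best
    # Best-overlap rule: every ground truth keeps its best default box,
    # applied last so that it overrides the threshold rule.
    for j, t in enumerate(truths):
        i_best = max(range(len(defaults)), key=lambda i: jaccard(defaults[i], t))
        matches[i_best] = j
    return matches


defaults = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 0.9, 0.9), (2.0, 2.0, 3.0, 3.0), (0.5, 0.5, 0.6, 0.6)]
truths = [(0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 2.4, 2.4)]
print(match(defaults, truths))
```

In this toy example the second ground truth overlaps its best default box by only about 0.16, yet it is still matched through the best-overlap rule, while the first ground truth collects two positives through the threshold rule.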
Training objective. The SSD training objective is derived from the MultiBox objective [7,8] but is extended to handle multiple object categories. Let $x_{ij}^p \in \lbrace 1,0 \rbrace$ be an indicator for matching the $i$-th default box to the $j$-th ground truth box of category $p$. In the matching strategy above, we can have $\sum_i x_{ij}^p \geq 1$. The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf): $$L(x, c, l, g) = \frac{1}{N}(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)) \tag{1}$$ where $N$ is the number of matched default boxes. If $N = 0$, we set the loss to 0. The localization loss is a Smooth L1 loss [6] between the predicted box ($l$) and the ground truth box ($g$) parameters. Similar to Faster R-CNN [2], we regress to offsets for the center ($cx, cy$) of the default bounding box ($d$) and for its width ($w$) and height ($h$).
$$
\begin{aligned}
L_{loc}(x,l,g) &= \sum_{i \in Pos}^N \sum_{m \in \lbrace cx, cy, w, h \rbrace} x_{ij}^k \, \mathrm{smooth}_{L1}(l_{i}^m - \hat{g}_j^m) \\
\hat{g}_j^{cx} &= (g_j^{cx} - d_i^{cx}) / d_i^w \quad \quad
\hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h \\
\hat{g}_j^{w} &= \log\Big(\frac{g_j^{w}}{d_i^{w}}\Big) \quad \quad
\hat{g}_j^{h} = \log\Big(\frac{g_j^{h}}{d_i^{h}}\Big)
\end{aligned}
\tag{2}
$$
The confidence loss is the softmax loss over multiple class confidences ($c$).
$$
L_{conf}(x, c) = - \sum_{i\in Pos}^N x_{ij}^p \log(\hat{c}_i^p) - \sum_{i\in Neg} \log(\hat{c}_i^0)\quad \text{where}\quad\hat{c}_i^p = \frac{\exp(c_i^p)}{\sum_p \exp(c_i^p)}
\tag{3}
$$
The weight term $\alpha$ is set to 1 by cross validation.
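To make the localization term concrete, here is a minimal sketch of the offset encoding in Eq. (2) and the per-box Smooth L1 sum. The Smooth L1 definition used (0.5x² for |x| < 1, |x| − 0.5 otherwise) is the one from Fast R-CNN [6] and is not spelled out in this text; the center-format tuples and function names are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple

CenterBox = Tuple[float, float, float, float]  # (cx, cy, w, h), assumed format


def encode(g: CenterBox, d: CenterBox) -> CenterBox:
    """Regression targets g_hat for ground truth g relative to default box d (Eq. 2)."""
    return (
        (g[0] - d[0]) / d[2],   # center x offset, normalized by default width
        (g[1] - d[1]) / d[3],   # center y offset, normalized by default height
        math.log(g[2] / d[2]),  # log width ratio
        math.log(g[3] / d[3]),  # log height ratio
    )


def smooth_l1(x: float) -> float:
    """Smooth L1 as defined in Fast R-CNN (assumed here)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def loc_loss_one(l: Sequence[float], g: CenterBox, d: CenterBox) -> float:
    """Localization loss for a single matched (default box, ground truth) pair."""
    g_hat = encode(g, d)
    return sum(smooth_l1(lm - gm) for lm, gm in zip(l, g_hat))


# A perfect prediction reproduces the encoded targets and has zero loss.
d = (0.5, 0.5, 0.2, 0.2)
g = (0.6, 0.5, 0.4, 0.2)
print(encode(g, d), loc_loss_one(encode(g, d), g, d))
```

Summing `loc_loss_one` over all positive matches gives $L_{loc}$; dividing the combined loss by the number of matches $N$ follows Eq. (1).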
Choosing scales and aspect ratios for default boxes. To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality because the lower layers capture more fine details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results. Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8 × 8 and 4 × 4) which are used in the framework. In practice, we can use many more with small computational overhead.
Feature maps from different levels within a network are known to have different (empirical) receptive field sizes [13]. Fortunately, within the SSD framework, the default boxes do not necessarily need to correspond to the actual receptive fields of each layer. We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use $m$ feature maps for prediction. The scale of the default boxes for each feature map is computed as: $$s_k = s_\text{min} + \frac{s_\text{max} - s_\text{min}}{m - 1} (k - 1),\quad k\in [1, m]$$ where $s_\text{min}$ is 0.2 and $s_\text{max}$ is 0.9, meaning the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9, and all layers in between are regularly spaced. We impose different aspect ratios for the default boxes, and denote them as $a_r \in \lbrace 1, 2, 3, \frac{1}{2}, \frac{1}{3} \rbrace$. We can compute the width ($w_k^a = s_k\sqrt{a_r}$) and height ($h_k^a = s_k / \sqrt{a_r}$) for each default box. For the aspect ratio of 1, we also add a default box whose scale is $s'_k = \sqrt{s_k s_{k+1}}$, resulting in 6 default boxes per feature map location. We set the center of each default box to $(\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|})$, where $|f_k|$ is the size of the $k$-th square feature map, $i, j\in [0, |f_k|)$. In practice, one can also design a distribution of default boxes to best fit a specific dataset. How to design the optimal tiling is an open question as well.
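The scale and shape formulas above translate directly into a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and because the text does not say what $s_{k+1}$ is for the last layer, the caller passes the next scale explicitly.

```python
import math
from typing import List, Sequence, Tuple


def scales(m: int, s_min: float = 0.2, s_max: float = 0.9) -> List[float]:
    """Scale s_k for k = 1..m, regularly spaced between s_min and s_max (m >= 2)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def box_shapes(
    s_k: float,
    s_next: float,
    ratios: Sequence[float] = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0),
) -> List[Tuple[float, float]]:
    """(width, height) of the default boxes at one location of the k-th feature map."""
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    s_prime = math.sqrt(s_k * s_next)  # extra box for aspect ratio 1
    shapes.append((s_prime, s_prime))
    return shapes


def centers(f_k: int) -> List[Tuple[float, float]]:
    """Default box centers for a square f_k x f_k feature map."""
    return [((i + 0.5) / f_k, (j + 0.5) / f_k) for i in range(f_k) for j in range(f_k)]


s = scales(6)
print([round(v, 2) for v in s])   # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
print(len(box_shapes(s[0], s[1])))  # 6
```

Every box built from $a_r$ has area $s_k^2$, since $w_k^a h_k^a = s_k^2$; only the shape changes with the aspect ratio.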
By combining predictions for all default boxes with different scales and aspect ratios from all locations of many feature maps, we have a diverse set of predictions, covering various input object sizes and shapes. For example, in Fig. 1, the dog is matched to a default box in the 4 × 4 feature map, but not to any default boxes in the 8 × 8 feature map. This is because those boxes have different scales and do not match the dog box, and therefore are considered as negatives during training.
Hard negative mining. After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This introduces a significant imbalance between the positive and negative training examples. Instead of using all the negative examples, we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and more stable training.
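The selection rule can be sketched as a sort followed by a cut-off. The function name and the list-of-losses input are assumptions for illustration; the per-box confidence losses are taken as already computed.

```python
from typing import List


def select_hard_negatives(neg_losses: List[float], num_pos: int, ratio: int = 3) -> List[int]:
    """Indices of the negatives kept for training, hardest (highest loss) first.

    At most ratio * num_pos negatives are kept, giving a negative:positive
    ratio of at most 3:1 with the default setting.
    """
    keep = min(len(neg_losses), ratio * num_pos)
    order = sorted(range(len(neg_losses)), key=lambda i: neg_losses[i], reverse=True)
    return order[:keep]


# Hypothetical confidence losses for seven negative default boxes, one positive match.
losses = [0.1, 2.0, 0.5, 1.5, 0.05, 0.7, 0.3]
print(select_hard_negatives(losses, num_pos=1))  # [1, 3, 5]
```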
Data augmentation. To make the model more robust to various input object sizes and shapes, each training image is randomly sampled by one of the following options:
- Use the entire original input image.
- Sample a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
- Randomly sample a patch.
The size of each sampled patch is [0.1, 1] of the original image size, and the aspect ratio is between $\frac {1} {2}$ and 2. We keep the overlapped part of the ground truth box if the center of it is in the sampled patch. After the aforementioned sampling step, each sampled patch is resized to a fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photometric distortions similar to those described in [14].
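The patch geometry and the box-retention rule can be sketched as below. This covers only those two pieces, not the minimum-overlap check, resizing, flipping, or photometric distortion. Interpreting "size" as a scale factor on the side lengths, working in normalized coordinates, and rejecting candidates that do not fit in the image are assumptions of this sketch, as are the function names.

```python
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # normalized (xmin, ymin, xmax, ymax), assumed format


def sample_patch(rng: random.Random) -> Box:
    """Draw one candidate patch with scale in [0.1, 1] and aspect ratio in [1/2, 2]."""
    while True:
        scale = rng.uniform(0.1, 1.0)
        ratio = rng.uniform(0.5, 2.0)   # width / height of the patch
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        if w <= 1.0 and h <= 1.0:       # reject candidates larger than the image
            x = rng.uniform(0.0, 1.0 - w)
            y = rng.uniform(0.0, 1.0 - h)
            return (x, y, x + w, y + h)


def crop_boxes(patch: Box, boxes: List[Box]) -> List[Box]:
    """Keep ground truth boxes whose center lies in the patch, clipped to the patch."""
    kept = []
    for b in boxes:
        cx, cy = (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
        if patch[0] <= cx <= patch[2] and patch[1] <= cy <= patch[3]:
            kept.append((max(b[0], patch[0]), max(b[1], patch[1]),
                         min(b[2], patch[2]), min(b[3], patch[3])))
    return kept


patch = (0.0, 0.0, 0.5, 0.5)
print(crop_boxes(patch, [(0.1, 0.1, 0.6, 0.3), (0.4, 0.4, 0.9, 0.9)]))
```

In the usage example the first box is kept and clipped because its center falls inside the patch, while the second is dropped even though it overlaps the patch.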
3. Experimental Results
Base network. Our experiments are all based on VGG16 [15], which is pre-trained on the ILSVRC CLS-LOC dataset [16]. Similar to DeepLab-LargeFOV [17], we convert fc6 and fc7 to convolutional layers, subsample parameters from fc6 and fc7, change pool5 from $2\times 2-s2$ to $3\times 3-s1$, and use the à trous algorithm [18] to fill the "holes". We remove all the dropout layers and the fc8 layer. We fine-tune the resulting model using SGD with initial learning rate $10^{-3}$, 0.9 momentum, 0.0005 weight decay, and batch size 32. The learning rate decay policy is slightly different for each dataset, and we will describe details later. The full training and testing code is built on Caffe [19] and is open source at: https://github.com/weiliu89/caffe/tree/ssd.
3.1 PASCAL VOC2007
On this dataset, we compare against Fast R-CNN [6] and Faster R-CNN [2] on VOC2007 test (4952 images). All methods fine-tune on the same pre-trained VGG16 network.
Figure 2 shows the architecture details of the SSD300 model. We use conv4_3, conv7 (fc7), conv8_2, conv9_2, conv10_2, and conv11_2 to predict both location and confidences. We set the default box with scale 0.1 on conv4_3. We initialize the parameters for all the newly added convolutional layers with the "xavier" method [20]. For conv4_3, conv10_2 and conv11_2, we only associate 4 default boxes at each feature map location, omitting aspect ratios of $\frac{1}{3}$ and 3. For all other layers, we put 6 default boxes as described in Sec. 2.2. Since, as pointed out in [12], conv4_3 has a different feature scale compared to the other layers, we use the L2 normalization technique introduced in [12] to scale the feature norm at each location in the feature map to 20 and learn the scale during back propagation. We use the $10^{-3}$ learning rate for 40k iterations, then continue training for 10k iterations with $10^{-4}$ and $10^{-5}$. When training on VOC2007 trainval, Table 1 shows that our low resolution SSD300 model is already more accurate than Fast R-CNN. When we train SSD on a larger $512\times 512$ input image, it is even more accurate, surpassing Faster R-CNN by 1.7% mAP. If we train SSD with more (i.e. 07+12) data, we see that SSD300 is already better than Faster R-CNN by 1.1% and that SSD512 is 3.6% better. If we take models trained on COCO trainval35k as described in Sec. 3.4 and fine-tune them on the 07+12 dataset with SSD512, we achieve the best results: 81.6% mAP.
Table 1: PASCAL VOC2007 test detection results. Both Fast and Faster R-CNN use input images whose minimum dimension is 600. The two SSD models have exactly the same settings except that they have different input sizes (300×300 vs. 512×512). It is obvious that larger input size leads to better results, and more data always helps. Data: "07": VOC2007 trainval, "07+12": union of VOC2007 and VOC2012 trainval. "07+12+COCO": first train on COCO trainval35k then fine-tune on 07+12.
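The per-layer default box counts given for SSD300 (4 boxes on conv4_3, conv10_2 and conv11_2, 6 on the others) can be totalled with a short sketch. The feature map side lengths (38, 19, 10, 5, 3, 1) are not stated in this passage; they are taken as an assumption from the SSD300 architecture figure of the paper.

```python
from typing import List, Tuple

# (layer name, feature map side length, default boxes per location).
# Side lengths are an assumption taken from the paper's architecture figure.
SSD300_LAYERS: List[Tuple[str, int, int]] = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]


def total_default_boxes(layers: List[Tuple[str, int, int]]) -> int:
    """Total number of default boxes over all square prediction layers."""
    return sum(size * size * k for _, size, k in layers)


for name, size, k in SSD300_LAYERS:
    print(f"{name}: {size * size * k}")
print(total_default_boxes(SSD300_LAYERS))  # 8732
```

Under these assumptions most of the boxes (5776 of 8732) come from conv4_3, the highest-resolution prediction layer.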
To understand the performance of our two SSD models in more detail, we used the detection analysis tool from [21]. Figure 3 shows that SSD can detect various object categories with high quality (large white area). The majority of its confident detections are correct. The recall is around 85-90%, and is much higher with "weak" (0.1 Jaccard overlap) criteria. Compared to R-CNN [22], SSD has less localization error, indicating that SSD can localize objects better because it directly learns to regress the object shape and classify object categories instead of using two decoupled steps. However, SSD has more confusions with similar object categories (especially for animals), partly because we share locations for multiple categories. Figure 4 shows that SSD is very sensitive to the bounding box size. In other words, it has much worse performance on smaller objects than bigger objects. This is not surprising because those small objects may not even have any information at the very top layers. Increasing the input size (e.g. from 300 × 300 to 512 × 512) can help improve detecting small objects, but there is still a lot of room to improve. On the positive side, we can clearly see that SSD performs really well on large objects. It is also very robust to different object aspect ratios because we use default boxes of various aspect ratios per feature map location.
Fig. 3: Visualization of performance for SSD512 on animals, vehicles, and furniture from VOC2007 test. The top row shows the cumulative fraction of detections that are correct (Cor) or false positive due to poor localization (Loc), confusion with similar categories (Sim), with others (Oth), or with background (BG). The solid red line reflects the change of recall with strong criteria (0.5 jaccard overlap) as the number of detections increases. The dashed red line is using the weak criteria (0.1 jaccard overlap). The bottom row shows the distribution of top-ranked false positive types.
Fig. 4: Sensitivity and impact of different object characteristics on VOC2007 test set using [21]. The plot on the left shows the effects of BBox Area per category, and the right plot shows the effect of Aspect Ratio. Key: BBox Area: XS=extra-small; S=small; M=medium; L=large; XL=extra-large. Aspect Ratio: XT=extra-tall/narrow; T=tall; M=medium; W=wide; XW=extra-wide.
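The "strong" (0.5) and "weak" (0.1) criteria used in this analysis are thresholds on jaccard overlap, i.e. intersection over union. The sketch below shows the computation for axis-aligned boxes. The `(xmin, ymin, xmax, ymax)` box layout, the function name, and the use of `>=` for the threshold comparison are assumptions made for illustration.

```python
def jaccard_overlap(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection shifted by half a box width and height from the ground truth.
detection = (1.0, 1.0, 3.0, 3.0)
ground_truth = (0.0, 0.0, 2.0, 2.0)
overlap = jaccard_overlap(detection, ground_truth)  # intersection 1, union 7
passes_strong = overlap >= 0.5
passes_weak = overlap >= 0.1
```

In this example the overlap is 1/7, so the detection counts as a match under the weak criterion but as a localization error under the strong one, which is the kind of case that separates the two red curves in Fig. 3.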
3.2 Model analysis
To understand SSD better, we carried out controlled experiments to examine how each component affects performance. For all the experiments, we use the same settings and input size (300 × 300), except for specified changes to the settings or component(s).
Data augmentation is crucial. Fast and Faster R-CNN use the original image and the horizontal flip to train. We use a more extensive sampling strategy, similar to YOLO [5]. Table 2 shows that we can improve $8.8%$ mAP with this sampling strategy. We do not know how much our sampling strategy will benefit Fast and Faster R-CNN, but they are likely to benefit less because they use a feature pooling step during classification that is relatively robust to object translation by design.
Table 2: Effects of various design choices and components on SSD performance.
More default box shapes is better. As described in Sec. 2.2, by default we use 6 default boxes per location. If we remove the boxes with $\frac {1} {3}$ and 3 aspect ratios, the performance drops by $0.6%$. By further removing the boxes with $\frac {1} {2}$ and 2 aspect ratios, the performance drops another $2.1%$. Using a variety of default box shapes seems to make the task of predicting boxes easier for the network.
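To make the box counts in this ablation concrete, the following sketch generates the default box shapes for one feature map location. It follows the rule from Sec. 2.2 of the paper (not reproduced in this section): width $s\sqrt{a}$ and height $s/\sqrt{a}$ for scale $s$ and aspect ratio $a$, plus one extra square box of scale $\sqrt{s_k s_{k+1}}$. The function name and the scale values 0.2 and 0.34 are illustrative assumptions.

```python
import math

def default_box_shapes(scale, next_scale, aspect_ratios):
    """Return (width, height) pairs, relative to the image, for one location."""
    shapes = [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]
    extra = math.sqrt(scale * next_scale)  # additional square box between two scales
    shapes.append((extra, extra))
    return shapes

# Full setting, then the two ablations described in the text.
six = default_box_shapes(0.2, 0.34, [1.0, 2.0, 0.5, 3.0, 1.0 / 3.0])
four = default_box_shapes(0.2, 0.34, [1.0, 2.0, 0.5])   # without 1/3 and 3
two = default_box_shapes(0.2, 0.34, [1.0])              # without 1/2 and 2 as well
```

All boxes generated from the same scale have the same area; only the shape changes, so removing aspect ratios reduces shape coverage without changing the sizes covered.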
Atrous is faster. As described in Sec. 3, we used the atrous version of a subsampled VGG16, following DeepLab-LargeFOV [17]. If we use the full VGG16, keeping pool5 with 2×2-s2 and not subsampling parameters from fc6 and fc7, and add conv5_3 for prediction, the result is about the same while the speed is about $20%$ slower.
Multiple output layers at different resolutions is better. A major contribution of SSD is using default boxes of different scales on different output layers. To measure the advantage gained, we progressively remove layers and compare results. For a fair comparison, every time we remove a layer, we adjust the default box tiling to keep the total number of boxes similar to the original (8732). This is done by stacking more scales of boxes on remaining layers and adjusting scales of boxes if needed. We do not exhaustively optimize the tiling for each setting. Table 3 shows a decrease in accuracy with fewer layers, dropping monotonically from 74.3 to 62.4. When we stack boxes of multiple scales on a layer, many are on the image boundary and need to be handled carefully. We tried the strategy used in Faster R-CNN [2], ignoring boxes which are on the boundary. We observe some interesting trends. For example, it hurts the performance by a large margin if we use very coarse feature maps (e.g. conv11_2 (1 × 1) or conv10_2 (3 × 3)). The reason might be that we do not have enough large boxes to cover large objects after the pruning. When we use primarily finer resolution maps, the performance starts increasing again because even after pruning a sufficient number of large boxes remains. If we only use conv7 for prediction, the performance is the worst, reinforcing the message that it is critical to spread boxes of different scales over different layers. Besides, since our predictions do not rely on ROI pooling as in [6], we do not have the collapsing bins problem in low-resolution feature maps [23]. The SSD architecture combines predictions from feature maps of various resolutions to achieve comparable accuracy to Faster R-CNN, while using lower resolution input images.
Table 3: Effects of using multiple output layers.
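The total of 8732 default boxes quoted above can be checked with a short calculation. The boxes per location (4 for conv4_3, conv10_2 and conv11_2, 6 elsewhere) and the 3 × 3 and 1 × 1 map sizes come from the text; the remaining feature map sides (38, 19, 10, 5) are taken from the paper's architecture figure and should be read as an assumption here.

```python
# (layer name, feature map side, default boxes per location)
SSD300_LAYERS = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]

def count_default_boxes(layers):
    """Sum side * side * boxes_per_location over all prediction layers."""
    return sum(side * side * per_location for _, side, per_location in layers)

total_boxes = count_default_boxes(SSD300_LAYERS)
per_layer = {name: side * side * n for name, side, n in SSD300_LAYERS}
```

The per-layer breakdown shows that conv4_3 alone contributes about two thirds of all boxes, which is why removing layers requires re-tiling the remaining ones to keep the total comparable.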
3.3 PASCAL VOC2012
We use the same settings as those used for our basic VOC2007 experiments above, except that we use VOC2012 trainval and VOC2007 trainval and test (21503 images) for training, and test on VOC2012 test (10991 images). We train the models with $10^{-3}$ learning rate for 60k iterations, then $10^{-4}$ for 20k iterations. Table 4 shows the results of our SSD300 and SSD512 model. We see the same performance trend as we observed on VOC2007 test. Our SSD300 improves accuracy over Fast/Faster R-CNN. By increasing the training and testing image size to 512 × 512, we are $4.5%$ more accurate than Faster R-CNN. Compared to YOLO, SSD is significantly more accurate, likely due to the use of convolutional default boxes from multiple feature maps and our matching strategy during training. When fine-tuned from models trained on COCO, our SSD512 achieves $80.0%$ mAP, which is $4.1%$ higher than Faster R-CNN.
Table 4: PASCAL VOC2012 test detection results. Fast and Faster R-CNN use images with minimum dimension 600, while the image size for YOLO is 448 × 448. Data: "07++12": union of VOC2007 trainval and test and VOC2012 trainval. "07++12+COCO": first train on COCO trainval35k then fine-tune on 07++12.
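The VOC2012 training schedule above is a simple step schedule. As a hedged sketch, it can be written as a list of stages and looked up by iteration; the function name and the `(num_iterations, learning_rate)` stage layout are illustrative assumptions rather than the authors' configuration format.

```python
def step_learning_rate(iteration, schedule):
    """Return the learning rate for a 0-based iteration.

    `schedule` is a list of (num_iterations, learning_rate) stages.
    """
    start = 0
    for num_iterations, rate in schedule:
        if iteration < start + num_iterations:
            return rate
        start += num_iterations
    raise ValueError("iteration is beyond the end of the schedule")

# 60k iterations at 1e-3, then 20k iterations at 1e-4, as stated in the text.
VOC2012_SCHEDULE = [(60_000, 1e-3), (20_000, 1e-4)]
total_iterations = sum(n for n, _ in VOC2012_SCHEDULE)
```

The same structure covers the other schedules in this section by changing the stage list.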
3.4 COCO
To further validate the SSD framework, we trained our SSD300 and SSD512 architectures on the COCO dataset. Since objects in COCO tend to be smaller than PASCAL VOC, we use smaller default boxes for all layers. We follow the strategy mentioned in Sec. 2.2, but now our smallest default box has a scale of 0.15 instead of 0.2, and the scale of the default box on conv4_3 is 0.07 (e.g. 21 pixels for a 300 × 300 image).
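The scale choices above can be turned into numbers with a small sketch. It assumes the linear spacing rule of Sec. 2.2 with a largest scale of 0.9 (a value from that section, and an assumption for the COCO setting), applied to the five layers after conv4_3; released implementations may round the scales differently.

```python
def linear_scales(s_min, s_max, num_layers):
    """Evenly spaced scales from s_min to s_max, inclusive."""
    if num_layers == 1:
        return [s_min]
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]

CONV4_3_SCALE = 0.07  # stated in the text for COCO
INPUT_SIZE = 300

# conv4_3 gets its own scale; the remaining layers are spaced from 0.15 upward.
coco_scales = [CONV4_3_SCALE] + linear_scales(0.15, 0.9, 5)
conv4_3_pixels = round(CONV4_3_SCALE * INPUT_SIZE)
```

The conv4_3 box side works out to 21 pixels for a 300 × 300 input, matching the figure quoted in the text.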
We use the trainval35k [24] for training. We first train the model with $10^{-3}$ learning rate for 160k iterations, and then continue training for 40k iterations with $10^{-4}$ and 40k iterations with $10^{-5}$. Table 5 shows the results on test-dev2015. Similar to what we observed on the PASCAL VOC dataset, SSD300 is better than Fast R-CNN in both mAP@0.5 and mAP@[0.5:0.95]. SSD300 has a similar mAP@0.75 as ION [24] and Faster R-CNN [25], but is worse in mAP@0.5. By increasing the image size to 512 × 512, our SSD512 is better than Faster R-CNN [25] in both criteria. Interestingly, we observe that SSD512 is $5.3%$ better in mAP@0.75, but is only $1.2%$ better in mAP@0.5. We also observe that it has much better AP ($4.8%$) and AR ($4.6%$) for large objects, but has relatively less improvement in AP ($1.3%$) and AR ($2.0%$) for small objects. Compared to ION, the improvement in AR for large and small objects is more similar ($5.4%$ vs. $3.9%$). We conjecture that Faster R-CNN is more competitive on smaller objects with SSD because it performs two box refinement steps, in both the RPN part and in the Fast R-CNN part. In Fig. 5, we show some detection examples on COCO test-dev with the SSD512 model.
Table 5: COCO test-dev2015 detection results.
Fig. 5: Detection examples on COCO test-dev with SSD512 model. We show detections with scores higher than 0.6. Each color corresponds to an object category.
3.5 Preliminary ILSVRC results
We applied the same network architecture we used for COCO to the ILSVRC DET dataset [16]. We train a SSD300 model using the ILSVRC2014 DET train and val1 as used in [22]. We first train the model with $10^{-3}$ learning rate for 320k iterations, and then continue training for 80k iterations with $10^{-4}$ and 40k iterations with $10^{-5}$. We can achieve 43.4 mAP on the val2 set [22]. Again, it validates that SSD is a general framework for high quality real-time detection.
3.6 Data Augmentation for Small Object Accuracy
Without a follow-up feature resampling step as in Faster R-CNN, the classification task for small objects is relatively hard for SSD, as demonstrated in our analysis (see Fig. 4). The data augmentation strategy described in Sec. 2.2 helps to improve the performance dramatically, especially on small datasets such as PASCAL VOC. The random crops generated by the strategy can be thought of as a “zoom in” operation and can generate many larger training examples. To implement a “zoom out” operation that creates more small training examples, we first randomly place an image on a canvas of 16× of the original image size filled with mean values before we do any random crop operation. Because we have more training images by introducing this new “expansion” data augmentation trick, we have to double the training iterations. We have seen a consistent increase of $2%-3%$ mAP across multiple datasets, as shown in Table 6. Specifically, Figure 6 shows that the new augmentation trick significantly improves the performance on small objects. This result underscores the importance of the data augmentation strategy for the final model accuracy.
Table 6: Results on multiple datasets when we add the image expansion data augmentation trick. $SSD300^{*}$ and $SSD512^{*}$ are the models that are trained with the new data augmentation.
Fig. 6: Sensitivity and impact of object size with new data augmentation on VOC2007 test set using [21]. The top row shows the effects of BBox Area per category for the original SSD300 and SSD512 model, and the bottom row corresponds to the $SSD300^{*}$ and $SSD512^{*}$ model trained with the new data augmentation trick. It is obvious that the new data augmentation trick helps detecting small objects significantly.
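The "zoom out" expansion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the image is a nested list of pixel values, the canvas is filled with a single `fill` value standing in for the mean, and "16×" is interpreted as area, i.e. at most 4× per side. That interpretation, the function names, and the uniform sampling of the ratio are all assumptions.

```python
import random

def expand_image(image, max_ratio=4.0, fill=0, rng=None):
    """Place `image` (a list of pixel rows) at a random spot on a larger canvas.

    Returns (canvas, left, top). `max_ratio` limits the per-side expansion;
    4.0 per side gives at most 16x the original area (assumed interpretation).
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    ratio = rng.uniform(1.0, max_ratio)
    new_h, new_w = int(height * ratio), int(width * ratio)
    top = rng.randint(0, new_h - height)
    left = rng.randint(0, new_w - width)
    canvas = [[fill] * new_w for _ in range(new_h)]
    for r, row in enumerate(image):
        canvas[top + r][left:left + width] = row  # copy one image row in place
    return canvas, left, top

def shift_box(box, left, top):
    """Move an (xmin, ymin, xmax, ymax) pixel box into canvas coordinates."""
    return (box[0] + left, box[1] + top, box[2] + left, box[3] + top)

image = [[1, 2], [3, 4]]
canvas, left, top = expand_image(image, fill=0, rng=random.Random(0))
moved = shift_box((0, 0, 2, 2), left, top)
```

After expansion the object occupies a smaller fraction of the canvas, so a subsequent random crop and resize yields smaller training objects than the original image would.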
An alternative way of improving SSD is to design a better tiling of default boxes so that its position and scale are better aligned with the receptive field of each position on a feature map. We leave this for future work.
3.7 Inference time
Considering the large number of boxes generated from our method, it is essential to perform non-maximum suppression (nms) efficiently during inference. By using a confidence threshold of 0.01, we can filter out most boxes. We then apply nms with jaccard overlap of 0.45 per class and keep the top 200 detections per image. This step costs about 1.7 msec per image for SSD300 and 20 VOC classes, which is close to the total time (2.4 msec) spent on all newly added layers. We measure the speed with batch size 8 using Titan X and cuDNN v4 with Intel Xeon E5-2667v3@3.20GHz.
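The post-processing steps described above (confidence filtering at 0.01, per-class nms at 0.45 jaccard overlap, top 200 per image) can be sketched as follows. This is a plain greedy nms written for readability, not the optimized implementation whose timing is reported; the detection tuple layout `(class_id, score, box)` and the function names are assumptions.

```python
def _iou(a, b):
    """Jaccard overlap of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_detections(detections, conf_threshold=0.01, nms_threshold=0.45, top_k=200):
    """Filter one image's detections, given as (class_id, score, box) tuples."""
    by_class = {}
    for class_id, score, box in detections:
        if score > conf_threshold:  # drop low-confidence boxes first
            by_class.setdefault(class_id, []).append((score, box))
    kept = []
    for class_id, candidates in by_class.items():
        candidates.sort(key=lambda c: c[0], reverse=True)
        selected = []
        for score, box in candidates:  # greedy: keep a box unless it overlaps a better one
            if all(_iou(box, other) <= nms_threshold for _, other in selected):
                selected.append((score, box))
        kept.extend((class_id, score, box) for score, box in selected)
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:top_k]

detections = [
    (0, 0.90, (0.0, 0.0, 10.0, 10.0)),
    (0, 0.80, (1.0, 1.0, 11.0, 11.0)),   # overlaps the first box heavily
    (0, 0.70, (20.0, 20.0, 30.0, 30.0)),
    (1, 0.60, (0.0, 0.0, 10.0, 10.0)),   # different class, not suppressed
    (1, 0.005, (5.0, 5.0, 15.0, 15.0)),  # below the confidence threshold
]
kept = filter_detections(detections)
```

Filtering by confidence before nms matters for speed: the pairwise overlap checks are the expensive part, and most of the 8732 boxes per class never reach them.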
Table 7 shows the comparison between SSD, Faster R-CNN [2], and YOLO [5]. Both our SSD300 and SSD512 methods outperform Faster R-CNN in both speed and accuracy. Although Fast YOLO [5] can run at 155 FPS, it has lower accuracy by almost $22%$ mAP. To the best of our knowledge, SSD300 is the first real-time method to achieve above $70%$ mAP. Note that about $80%$ of the forward time is spent on the base network (VGG16 in our case). Therefore, using a faster base network could even further improve the speed, which can possibly make the SSD512 model real-time as well.
Table 7: Results on Pascal VOC2007 test. SSD300 is the only real-time detection method that can achieve above $70%$ mAP. By using a larger input image, SSD512 outperforms all methods on accuracy while maintaining a close to real-time speed.
4. Related Work
There are two established classes of methods for object detection in images, one based on sliding windows and the other based on region proposal classification. Before the advent of convolutional neural networks, the state of the art for those two approaches, Deformable Part Model (DPM) [26] and Selective Search [1], had comparable performance. However, after the dramatic improvement brought on by R-CNN [22], which combines selective search region proposals and convolutional network based post-classification, region proposal object detection methods became prevalent.
The original R-CNN approach has been improved in a variety of ways. The first set of approaches improve the quality and speed of post-classification, since it requires the classification of thousands of image crops, which is expensive and time-consuming. SPPnet [9] speeds up the original R-CNN approach significantly. It introduces a spatial pyramid pooling layer that is more robust to region size and scale and allows the classification layers to reuse features computed over feature maps generated at several image resolutions. Fast R-CNN [6] extends SPPnet so that it can fine-tune all layers end-to-end by minimizing a loss for both confidences and bounding box regression, which was first introduced in MultiBox [7] for learning objectness.
The second set of approaches improve the quality of proposal generation using deep neural networks. In the most recent works like MultiBox [7,8], the Selective Search region proposals, which are based on low-level image features, are replaced by proposals generated directly from a separate deep neural network. This further improves the detection accuracy but results in a somewhat complex setup, requiring the training of two neural networks with a dependency between them. Faster R-CNN [2] replaces selective search proposals by ones learned from a region proposal network (RPN), and introduces a method to integrate the RPN with Fast R-CNN by alternating between fine-tuning shared convolutional layers and prediction layers for these two networks. This way region proposals are used to pool mid-level features and the final classification step is less expensive. Our SSD is very similar to the region proposal network (RPN) in Faster R-CNN in that we also use a fixed set of (default) boxes for prediction, similar to the anchor boxes in the RPN. But instead of using these to pool features and evaluate another classifier, we simultaneously produce a score for each object category in each box. Thus, our approach avoids the complication of merging RPN with Fast R-CNN and is easier to train, faster, and straightforward to integrate in other tasks.
Another set of methods, which are directly related to our approach, skip the proposal step altogether and predict bounding boxes and confidences for multiple categories directly. OverFeat [4], a deep version of the sliding window method, predicts a bounding box directly from each location of the topmost feature map after knowing the confidences of the underlying object categories. YOLO [5] uses the whole topmost feature map to predict both confidences for multiple categories and bounding boxes (which are shared for these categories). Our SSD method falls in this category because we do not have the proposal step but use the default boxes. However, our approach is more flexible than the existing methods because we can use default boxes of different aspect ratios on each feature location from multiple feature maps at different scales. If we only use one default box per location from the topmost feature map, our SSD would have similar architecture to OverFeat [4]; if we use the whole topmost feature map and add a fully connected layer for predictions instead of our convolutional predictors, and do not explicitly consider multiple aspect ratios, we can approximately reproduce YOLO [5].
5. Conclusions
This paper introduces SSD, a fast single-shot object detector for multiple categories. A key feature of our model is the use of multi-scale convolutional bounding box outputs attached to multiple feature maps at the top of the network. This representation allows us to efficiently model the space of possible box shapes. We experimentally validate that given appropriate training strategies, a larger number of carefully chosen default bounding boxes results in improved performance. We build SSD models with at least an order of magnitude more box predictions sampling location, scale, and aspect ratio, than existing methods [5,7]. We demonstrate that given the same VGG-16 base architecture, SSD compares favorably to its state-of-the-art object detector counterparts in terms of both accuracy and speed. Our SSD512 model significantly outperforms the state-of-the-art Faster R-CNN [2] in terms of accuracy on PASCAL VOC and COCO, while being 3× faster. Our real time SSD300 model runs at 59 FPS, which is faster than the current real time YOLO [5] alternative, while producing markedly superior detection accuracy.
Apart from its standalone utility, we believe that our monolithic and relatively simple SSD model provides a useful building block for larger systems that employ an object detection component. A promising future direction is to explore its use as part of a system using recurrent neural networks to detect and track objects in video simultaneously.
6. Acknowledgment
This work was started as an internship project at Google and continued at UNC. We would like to thank Alex Toshev for helpful discussions and are indebted to the Image Understanding and DistBelief teams at Google. We also thank Philip Ammirato and Patrick Poirson for helpful comments. We thank NVIDIA for providing GPUs and acknowledge support from NSF 1452851, 1446631, 1526367, 1533771.
References
[1] Uijlings, J.R., van de Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. IJCV (2013)
[2] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS. (2015)
[3] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
[4] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. In: ICLR. (2014)
[5] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR. (2016)
[6] Girshick, R.: Fast R-CNN. In: ICCV. (2015)
[7] Erhan, D., Szegedy, C., Toshev, A., Anguelov, D.: Scalable object detection using deep neural networks. In: CVPR. (2014)
[8] Szegedy, C., Reed, S., Erhan, D., Anguelov, D.: Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441 v3 (2015)
[9] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: ECCV. (2014)
[10] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015)
[11] Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: CVPR. (2015)
[12] Liu, W., Rabinovich, A., Berg, A.C.: ParseNet: Looking wider to see better. In: ICLR. (2016)
[13] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Object detectors emerge in deep scene cnns. In: ICLR. (2015)
[14] Howard, A.G.: Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402 (2013)
[15] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: NIPS. (2015)
[16] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge. IJCV (2015)
[17] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR. (2015)
[18] Holschneider, M., Kronland-Martinet, R., Morlet, J., Tchamitchian, P.: A real-time algorithm for signal analysis with the help of the wavelet transform. In: Wavelets. Springer (1990) 286–297
[19] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: MM. (2014)
[20] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS. (2010)
[21] Hoiem, D., Chodpathumwan, Y., Dai, Q.: Diagnosing error in object detectors. In: ECCV 2012. (2012)
[22] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR. (2014)
[23] Zhang, L., Lin, L., Liang, X., He, K.: Is faster r-cnn doing well for pedestrian detection. In: ECCV. (2016)
[24] Bell, S., Zitnick, C.L., Bala, K., Girshick, R.: Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In: CVPR. (2016)
[25] COCO: Common Objects in Context. http://mscoco.org/dataset/#detections-leaderboard (2016) [Online; accessed 25-July-2016].
[26] Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR. (2008)