文章作者:Tyan
博客：noahsnail.com | CSDN | 簡書
聲明：作者翻譯論文僅為學(xué)習(xí)，如有侵權(quán)請聯(lián)系作者刪除博文，謝謝！
翻譯論文匯總:https://github.com/SnailTyan/deep-learning-papers-translation
Deformable Convolutional Networks
Abstract
Convolutional neural networks (CNNs) are inherently limited to model geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the performance of our approach. For the first time, we show that learning dense spatial transformation in deep CNNs is effective for sophisticated vision tasks such as object detection and semantic segmentation. The code is released at https://github.com/msracver/Deformable-ConvNets.
摘要
卷積神經(jīng)網(wǎng)絡(luò)（CNN）由于其構(gòu)建模塊固定的幾何結(jié)構(gòu)，天然地局限于建模幾何變換。在這項(xiàng)工作中，我們引入了兩個(gè)新的模塊來提高CNN的變換建模能力，即可變形卷積和可變形RoI池化。兩者都基于這樣的想法：用額外的偏移量增廣模塊中的空間采樣位置，並從目標(biāo)任務(wù)中學(xué)習(xí)偏移量，而不需要額外的監(jiān)督。新模塊可以很容易地替換現(xiàn)有CNN中的普通模塊，並且可以通過標(biāo)準(zhǔn)的反向傳播很容易地進(jìn)行端到端訓(xùn)練，從而產(chǎn)生可變形卷積網(wǎng)絡(luò)。大量的實(shí)驗(yàn)驗(yàn)證了我們方法的性能。我們首次證明了在深度CNN中學(xué)習(xí)密集空間變換對于復(fù)雜的視覺任務(wù)（如目標(biāo)檢測和語義分割）是有效的。代碼發(fā)布在https://github.com/msracver/Deformable-ConvNets。
1. Introduction
A key challenge in visual recognition is how to accommodate geometric variations or model geometric transformations in object scale, pose, viewpoint, and part deformation. In general, there are two ways. The first is to build the training datasets with sufficient desired variations. This is usually realized by augmenting the existing data samples, e.g., by affine transformation. Robust representations can be learned from the data, but usually at the cost of expensive training and complex model parameters. The second is to use transformation-invariant features and algorithms. This category subsumes many well known techniques, such as SIFT (scale invariant feature transform) [42] and sliding window based object detection paradigm.
1. 引言
視覺識別中的一個(gè)關(guān)鍵挑戰(zhàn)是如何適應(yīng)目標(biāo)尺度、姿態(tài)、視點(diǎn)和部件變形中的幾何變化，或建模幾何變換。一般來說，有兩種方法。第一種是構(gòu)建具有足夠多期望變化的訓(xùn)練數(shù)據(jù)集，這通常通過擴(kuò)增現(xiàn)有的數(shù)據(jù)樣本來實(shí)現(xiàn)，例如仿射變換。魯棒的表示可以從數(shù)據(jù)中學(xué)習(xí)，但通常以昂貴的訓(xùn)練和復(fù)雜的模型參數(shù)為代價(jià)。第二種是使用變換不變的特征和算法。這一類包含了許多眾所周知的技術(shù)，如SIFT（尺度不變特征變換）[42]和基于滑動(dòng)窗口的目標(biāo)檢測范式。
There are two drawbacks in above ways. First, the geometric transformations are assumed fixed and known. Such prior knowledge is used to augment the data, and design the features and algorithms. This assumption prevents generalization to new tasks possessing unknown geometric transformations, which are not properly modeled. Second, hand-crafted design of invariant features and algorithms could be difficult or infeasible for overly complex transformations, even when they are known.
上述方法有兩個(gè)缺點(diǎn)。首先，幾何變換被假定是固定並且已知的。這樣的先驗(yàn)知識被用來擴(kuò)充數(shù)據(jù)，並設(shè)計(jì)特征和算法。這個(gè)假設(shè)阻礙了對具有未知幾何變換的新任務(wù)的泛化，因?yàn)檫@些變換沒有被正確地建模。其次，即使在變換已知的情況下，手工設(shè)計(jì)不變特征和算法對于過于復(fù)雜的變換也可能是困難的或不可行的。
Recently, convolutional neural networks (CNNs) [35] have achieved significant success for visual recognition tasks, such as image classification [31], semantic segmentation [41], and object detection [16]. Nevertheless, they still share the above two drawbacks. Their capability of modeling geometric transformations mostly comes from the extensive data augmentation, the large model capacity, and some simple hand-crafted modules (e.g., max-pooling [1] for small translation-invariance).
最近，卷積神經(jīng)網(wǎng)絡(luò)（CNN）[35]在圖像分類[31]、語義分割[41]和目標(biāo)檢測[16]等視覺識別任務(wù)中取得了顯著的成功。不過，它們?nèi)匀挥猩鲜鰞蓚€(gè)缺點(diǎn)。它們對幾何變換建模的能力主要來自大量的數(shù)據(jù)增強(qiáng)、大的模型容量以及一些簡單的手工設(shè)計(jì)模塊（例如，對小的平移具有不變性的最大池化[1]）。
In short, CNNs are inherently limited to model large, unknown transformations. The limitation originates from the fixed geometric structures of CNN modules: a convolution unit samples the input feature map at fixed locations; a pooling layer reduces the spatial resolution at a fixed ratio; a RoI (region-of-interest) pooling layer separates a RoI into fixed spatial bins, etc. There lacks internal mechanisms to handle the geometric transformations. This causes noticeable problems. For one example, the receptive field sizes of all activation units in the same CNN layer are the same. This is undesirable for high level CNN layers that encode the semantics over spatial locations. Because different locations may correspond to objects with different scales or deformation, adaptive determination of scales or receptive field sizes is desirable for visual recognition with fine localization, e.g., semantic segmentation using fully convolutional networks [41]. For another example, while object detection has seen significant and rapid progress [16, 52, 15, 47, 46, 40, 7] recently, all approaches still rely on the primitive bounding box based feature extraction. This is clearly sub-optimal, especially for non-rigid objects.
簡而言之，CNN本質(zhì)上局限于建模大型的、未知的變換。該限制源于CNN模塊的固定幾何結(jié)構(gòu)：卷積單元在固定位置對輸入特徵圖進(jìn)行采樣；池化層以一個(gè)固定的比例降低空間分辨率；一個(gè)RoI（感興趣區(qū)域）池化層把RoI分成固定的空間組塊，等等。網(wǎng)絡(luò)缺乏處理幾何變換的內(nèi)部機(jī)制。這會(huì)導(dǎo)致明顯的問題。舉一個(gè)例子，同一CNN層中所有激活單元的感受野大小是相同的。對于在空間位置上編碼語義的高層CNN層來說，這是不可取的。由于不同的位置可能對應(yīng)不同尺度或形變的目標(biāo)，所以對于需要精細(xì)定位的視覺識別來說，例如使用全卷積網(wǎng)絡(luò)的語義分割[41]，尺度或感受野大小的自適應(yīng)確定是理想的情況。又如，盡管最近目標(biāo)檢測已經(jīng)取得了顯著而迅速的進(jìn)展[16,52,15,47,46,40,7]，但所有方法仍然依賴于基于原始邊界框的特征提取。這顯然是次優(yōu)的，特別是對于非剛性目標(biāo)。
In this work, we introduce two new modules that greatly enhance CNNs’ capability of modeling geometric transformations. The first is deformable convolution. It adds 2D offsets to the regular grid sampling locations in the standard convolution. It enables free form deformation of the sampling grid. It is illustrated in Figure 1. The offsets are learned from the preceding feature maps, via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.
Figure 1: Illustration of the sampling locations in 3 × 3 standard and deformable convolutions. (a) regular sampling grid (green points) of standard convolution. (b) deformed sampling locations (dark blue points) with augmented offsets (light blue arrows) in deformable convolution. (c)(d) are special cases of (b), showing that the deformable convolution generalizes various transformations for scale, (anisotropic) aspect ratio and rotation.
在這項(xiàng)工作中，我們引入了兩個(gè)新的模塊，大大提高了CNN建模幾何變換的能力。首先是可變形卷積。它將2D偏移添加到標(biāo)準(zhǔn)卷積中的常規(guī)網(wǎng)格采樣位置上，使采樣網(wǎng)格可以自由形變，如圖1所示。偏移量通過附加的卷積層從前面的特徵圖中學(xué)習(xí)。因此，變形以局部、密集且自適應(yīng)的方式取決于輸入特徵。
圖1：3×3標(biāo)準(zhǔn)卷積和可變形卷積中采樣位置的示意圖。(a)標(biāo)準(zhǔn)卷積的規(guī)則采樣網(wǎng)格（綠點(diǎn)）。(b)可變形卷積中變形的采樣位置（深藍(lán)色點(diǎn)）及其附加的偏移量（淺藍(lán)色箭頭）。(c)(d)是(b)的特例，表明可變形卷積泛化了尺度、（各向異性的）長寬比和旋轉(zhuǎn)等各種變換。
The second is deformable RoI pooling. It adds an offset to each bin position in the regular bin partition of the previous RoI pooling [15, 7]. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.
第二個(gè)是可變形RoI池化。它為先前RoI池化[15,7]的常規(guī)bin劃分中的每個(gè)bin位置添加一個(gè)偏移量。類似地，偏移量從前面的特徵映射和RoI中學(xué)習(xí)，使得具有不同形狀的目標(biāo)能夠自適應(yīng)地進(jìn)行部件定位。
Both modules are light weight. They add small amount of parameters and computation for the offset learning. They can readily replace their plain counterparts in deep CNNs and can be easily trained end-to-end with standard back-propagation. The resulting CNNs are called deformable convolutional networks, or deformable ConvNets.
兩個(gè)模塊都是輕量級的。它們?yōu)槠茖W(xué)習(xí)只增加了少量的參數(shù)和計(jì)算。它們可以很容易地取代深層CNN中普通的對應(yīng)部分，並且可以很容易地通過標(biāo)準(zhǔn)的反向傳播進(jìn)行端到端的訓(xùn)練。所得到的CNN被稱為可變形卷積網(wǎng)絡(luò)，或可變形ConvNets。
Our approach shares similar high level spirit with spatial transform networks [26] and deformable part models [11]. They all have internal transformation parameters and learn such parameters purely from data. A key difference in deformable ConvNets is that they deal with dense spatial transformations in a simple, efficient, deep and end-to-end manner. In Section 3.1, we discuss in details the relation of our work to previous works and analyze the superiority of deformable ConvNets.
我們的方法與空間變換網(wǎng)絡(luò)[26]和可變形部件模型[11]在高層思想上類似。它們都有內(nèi)部的變換參數(shù)，並且純粹從數(shù)據(jù)中學(xué)習(xí)這些參數(shù)。可變形ConvNets的一個(gè)關(guān)鍵區(qū)別在于，它們以簡單、高效、深度且端到端的方式處理密集的空間變換。在3.1節(jié)中，我們詳細(xì)討論了我們的工作與以前工作的關(guān)系，並分析了可變形ConvNets的優(yōu)越性。
2. Deformable Convolutional Networks
The feature maps and convolution in CNNs are 3D. Both deformable convolution and RoI pooling modules operate on the 2D spatial domain. The operation remains the same across the channel dimension. Without loss of generality, the modules are described in 2D here for notation clarity. Extension to 3D is straightforward.
2. 可變形卷積網(wǎng)絡(luò)
CNN中的特徵映射和卷積是3D的。可變形卷積和RoI池化模塊都在2D空間域上運(yùn)行，在整個(gè)通道維度上的操作保持不變。在不失一般性的情況下，為了符號清晰，這些模塊在2D中描述。擴(kuò)展到3D很簡單。
2.1. Deformable Convolution
The 2D convolution consists of two steps: 1) sampling using a regular grid $\mathcal{R}$ over the input feature map $\mathbf{x}$; 2) summation of sampled values weighted by $\mathbf{w}$. The grid $\mathcal{R}$ defines the receptive field size and dilation. For example, $\mathcal{R}=\{(-1,-1),(-1,0),\dots,(0,1),(1,1)\}$ defines a $3\times 3$ kernel with dilation $1$.
2.1. 可變形卷積
2D卷積包含兩步：1）用規(guī)則的網(wǎng)格$\mathcal{R}$在輸入特徵映射$\mathbf{x}$上采樣；2）對$\mathbf{w}$加權(quán)的采樣值求和。網(wǎng)格$\mathcal{R}$定義了感受野的大小和擴(kuò)張。例如，$\mathcal{R}=\{(-1,-1),(-1,0),\dots,(0,1),(1,1)\}$定義了一個(gè)擴(kuò)張大小為$1$的$3\times 3$卷積核。
For each location $\mathbf{p}_0$ on the output feature map $\mathbf{y}$, we have
$$\mathbf{y}(\mathbf{p}_0)=\sum_{\mathbf{p}_n\in\mathcal{R}}\mathbf{w}(\mathbf{p}_n)\cdot \mathbf{x}(\mathbf{p}_0+\mathbf{p}_n), \tag{1}$$
where $\mathbf{p}_n$ enumerates the locations in $\mathcal{R}$.
對于輸出特徵映射$\mathbf{y}$上的每個(gè)位置$\mathbf{p}_0$，我們有
$$\mathbf{y}(\mathbf{p}_0)=\sum_{\mathbf{p}_n\in\mathcal{R}}\mathbf{w}(\mathbf{p}_n)\cdot \mathbf{x}(\mathbf{p}_0+\mathbf{p}_n), \tag{1}$$
其中$\mathbf{p}_n$枚舉了$\mathcal{R}$中的位置。
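下面用一小段NumPy代碼示意公式(1)：在輸出特徵圖的某個(gè)位置$\mathbf{p}_0$上，按網(wǎng)格$\mathcal{R}$采樣並加權(quán)求和（譯者補(bǔ)充的示意代碼，非論文原文；只考慮單通道、不做padding，變量名均為示意）。

```python
import numpy as np

def conv2d_at(x, w, p0, grid):
    """公式(1)：y(p0) = sum_{pn in R} w(pn) * x(p0 + pn)，只計(jì)算單個(gè)輸出位置"""
    y = 0.0
    for (dy, dx), weight in zip(grid, w):
        y += weight * x[p0[0] + dy, p0[1] + dx]
    return y

# 擴(kuò)張為1的3x3網(wǎng)格 R = {(-1,-1), (-1,0), ..., (0,1), (1,1)}
grid_R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
x = np.random.rand(8, 8).astype(np.float32)          # 單通道輸入特徵圖（示意）
w = np.random.rand(len(grid_R)).astype(np.float32)   # R 中每個(gè)采樣位置對應(yīng)的權(quán)重
print(conv2d_at(x, w, p0=(3, 3), grid=grid_R))
```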
In deformable convolution, the regular grid $\mathcal{R}$ is augmented with offsets $\{\Delta\mathbf{p}_n\,|\,n=1,\dots,N\}$, where $N=|\mathcal{R}|$. Eq.(1) becomes
$$\mathbf{y}(\mathbf{p}_0)=\sum_{\mathbf{p}_n\in\mathcal{R}}\mathbf{w}(\mathbf{p}_n)\cdot \mathbf{x}(\mathbf{p}_0+\mathbf{p}_n+\Delta\mathbf{p}_n). \tag{2}$$
在可變形卷積中，規(guī)則的網(wǎng)格$\mathcal{R}$通過偏移$\{\Delta\mathbf{p}_n\,|\,n=1,\dots,N\}$增廣，其中$N=|\mathcal{R}|$。方程(1)變?yōu)?$$\mathbf{y}(\mathbf{p}_0)=\sum_{\mathbf{p}_n\in\mathcal{R}}\mathbf{w}(\mathbf{p}_n)\cdot \mathbf{x}(\mathbf{p}_0+\mathbf{p}_n+\Delta\mathbf{p}_n). \tag{2}$$
Now, the sampling is on the irregular and offset locations $\mathbf{p}_n+\Delta\mathbf{p}_n$. As the offset $\Delta\mathbf{p}_n$ is typically fractional, Eq.(2) is implemented via bilinear interpolation as
$$\mathbf{x}(\mathbf{p})=\sum_{\mathbf{q}}G(\mathbf{q},\mathbf{p})\cdot \mathbf{x}(\mathbf{q}), \tag{3}$$
where $\mathbf{p}$ denotes an arbitrary (fractional) location ($\mathbf{p}=\mathbf{p}_0+\mathbf{p}_n+\Delta\mathbf{p}_n$ for Eq.(2)), $\mathbf{q}$ enumerates all integral spatial locations in the feature map $\mathbf{x}$, and $G(\cdot,\cdot)$ is the bilinear interpolation kernel. Note that $G$ is two dimensional. It is separated into two one dimensional kernels as
$$G(\mathbf{q},\mathbf{p})=g(q_x,p_x)\cdot g(q_y,p_y), \tag{4}$$
where $g(a,b)=\max(0,1-|a-b|)$. Eq.(3) is fast to compute as $G(\mathbf{q},\mathbf{p})$ is non-zero only for a few $\mathbf{q}$'s.
現(xiàn)在，采樣是在不規(guī)則且有偏移的位置$\mathbf{p}_n+\Delta\mathbf{p}_n$上。由于偏移$\Delta\mathbf{p}_n$通常是小數(shù)，方程(2)可以通過雙線性插值實(shí)現(xiàn)
$$\mathbf{x}(\mathbf{p})=\sum_{\mathbf{q}}G(\mathbf{q},\mathbf{p})\cdot \mathbf{x}(\mathbf{q}), \tag{3}$$
其中$\mathbf{p}$表示任意（小數(shù)）位置（公式(2)中$\mathbf{p}=\mathbf{p}_0+\mathbf{p}_n+\Delta\mathbf{p}_n$），$\mathbf{q}$枚舉了特徵映射$\mathbf{x}$中所有整數(shù)空間位置，$G(\cdot,\cdot)$是雙線性插值的核。注意$G$是二維的，它被分為兩個(gè)一維核
$$G(\mathbf{q},\mathbf{p})=g(q_x,p_x)\cdot g(q_y,p_y), \tag{4}$$
其中$g(a,b)=\max(0,1-|a-b|)$。方程(3)可以快速計(jì)算，因?yàn)?G(\mathbf{q},\mathbf{p})$僅對少數(shù)幾個(gè)$\mathbf{q}$是非零的。
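下面的NumPy代碼示意公式(2)–(4)的計(jì)算過程：每個(gè)采樣點(diǎn)加上（通常為小數(shù)的）偏移$\Delta\mathbf{p}_n$，再用雙線性插值核$g(a,b)=\max(0,1-|a-b|)$在小數(shù)位置取值（譯者補(bǔ)充的示意代碼，非論文的官方實(shí)現(xiàn)；僅處理單通道和單個(gè)輸出位置）。

```python
import numpy as np

def bilinear_sample(x, p):
    """公式(3)(4)：在特徵圖 x 的任意（小數(shù)）位置 p 取值。
    G(q,p) = g(qy,py)*g(qx,px)，只有 p 周圍至多 4 個(gè)整數(shù)位置的權(quán)重非零。"""
    H, W = x.shape
    py, px = p
    val = 0.0
    for qy in (int(np.floor(py)), int(np.floor(py)) + 1):
        for qx in (int(np.floor(px)), int(np.floor(px)) + 1):
            if 0 <= qy < H and 0 <= qx < W:   # 越界位置按 0 處理（示意）
                g = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                val += g * x[qy, qx]
    return val

def deform_conv_at(x, w, p0, grid, offsets):
    """公式(2)：y(p0) = sum_n w(pn) * x(p0 + pn + Δpn)，Δpn 可為小數(shù)"""
    y = 0.0
    for pn, weight, dp in zip(grid, w, offsets):
        p = (p0[0] + pn[0] + dp[0], p0[1] + pn[1] + dp[1])
        y += weight * bilinear_sample(x, p)
    return y

grid_R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
x = np.random.rand(8, 8).astype(np.float32)
w = np.random.rand(9).astype(np.float32)
offsets = 0.5 * np.random.randn(9, 2)   # 每個(gè)采樣點(diǎn)一個(gè) 2D 偏移（隨機(jī)示意，實(shí)際中由網(wǎng)絡(luò)學(xué)習(xí)）
print(deform_conv_at(x, w, (3, 3), grid_R, offsets))
```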
As illustrated in Figure 2, the offsets are obtained by applying a convolutional layer over the same input feature map. The convolution kernel is of the same spatial resolution and dilation as those of the current convolutional layer (e.g., also $3\times 3$ with dilation $1$ in Figure 2). The output offset fields have the same spatial resolution with the input feature map. The channel dimension $2N$ corresponds to $N$ 2D offsets. During training, both the convolutional kernels for generating the output features and the offsets are learned simultaneously. To learn the offsets, the gradients are back-propagated through the bilinear operations in Eq.(3) and Eq.(4). It is detailed in appendix A.
Figure 2: Illustration of 3 × 3 deformable convolution.
如圖2所示，通過在相同的輸入特徵映射上應(yīng)用卷積層來獲得偏移。該卷積核具有與當(dāng)前卷積層相同的空間分辨率和擴(kuò)張（例如，在圖2中也是擴(kuò)張為$1$的$3\times 3$卷積核）。輸出的偏移域與輸入特徵映射具有相同的空間分辨率。通道維度$2N$（注釋：偏移的通道維度，包括$x$方向和$y$方向兩個(gè)分量）對應(yīng)于$N$個(gè)2D偏移量。在訓(xùn)練過程中，同時(shí)學(xué)習(xí)用于生成輸出特徵的卷積核和偏移量。為了學(xué)習(xí)偏移量，梯度通過方程(3)和(4)中的雙線性運(yùn)算進(jìn)行反向傳播。詳見附錄A。
圖2:3×3可變形卷積的說明轨蛤。
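下面給出一個(gè)可變形卷積模塊的最小PyTorch示意：偏移量由同一輸入特徵圖上的一個(gè)額外卷積層預(yù)測（輸出$2N$個(gè)通道），並按論文初始化為零。這里假設(shè)借助torchvision提供的`deform_conv2d`算子完成雙線性采樣與加權(quán)求和；代碼為譯者補(bǔ)充的示意，並非論文的官方實(shí)現(xiàn)。

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, dilation=1):
        super().__init__()
        pad = dilation * (k - 1) // 2
        # 預(yù)測偏移量的卷積層：與當(dāng)前卷積層相同的核大小與擴(kuò)張，輸出 2N = 2*k*k 個(gè)通道
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=pad, dilation=dilation)
        nn.init.zeros_(self.offset_conv.weight)   # 論文：偏移分支零初始化
        nn.init.zeros_(self.offset_conv.bias)
        # 用于生成輸出特徵的常規(guī)卷積權(quán)重
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.pad, self.dilation = pad, dilation

    def forward(self, x):
        offset = self.offset_conv(x)               # (B, 2N, H, W)，與輸入同分辨率
        return deform_conv2d(x, offset, self.weight, self.bias,
                             padding=self.pad, dilation=self.dilation)

x = torch.randn(1, 64, 32, 32)
print(DeformableConv2d(64, 64)(x).shape)           # torch.Size([1, 64, 32, 32])
```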
2.2. Deformable RoI Pooling
RoI pooling is used in all region proposal based object detection methods [16, 15, 47, 7]. It converts an input rectangular region of arbitrary size into fixed size features.
2.2. 可變形RoI池化
在所有基于區(qū)域提出的目標(biāo)檢測方法中都使用了RoI池化[16,15,47,7]。它將任意大小的輸入矩形區(qū)域轉(zhuǎn)換為固定大小的特征虫埂。
RoI Pooling [15] Given the input feature map $\mathbf{x}$ and a RoI of size $w\times h$ and top-left corner $\mathbf{p}_0$, RoI pooling divides the RoI into $k\times k$ ($k$ is a free parameter) bins and outputs a $k\times k$ feature map $\mathbf{y}$. For the $(i,j)$-th bin ($0\le i,j<k$), we have
$$\mathbf{y}(i,j)=\sum_{\mathbf{p}\in bin(i,j)}\mathbf{x}(\mathbf{p}_0+\mathbf{p})/n_{ij}, \tag{5}$$
where $n_{ij}$ is the number of pixels in the bin. The $(i,j)$-th bin spans $\lfloor i\frac{w}{k}\rfloor\le p_x<\lceil (i+1)\frac{w}{k}\rceil$ and $\lfloor j\frac{h}{k}\rfloor\le p_y<\lceil (j+1)\frac{h}{k}\rceil$.
RoI池化[15]。給定輸入特徵映射$\mathbf{x}$、RoI的大小$w\times h$和左上角$\mathbf{p}_0$，RoI池化將RoI分到$k\times k$（$k$是一個(gè)自由參數(shù)）個(gè)組塊（bin）中，並輸出$k\times k$的特徵映射$\mathbf{y}$。對于第$(i,j)$個(gè)組塊（$0\le i,j<k$），我們有
$$\mathbf{y}(i,j)=\sum_{\mathbf{p}\in bin(i,j)}\mathbf{x}(\mathbf{p}_0+\mathbf{p})/n_{ij}, \tag{5}$$
其中$n_{ij}$是組塊中的像素?cái)?shù)量。第$(i,j)$個(gè)組塊的跨度為$\lfloor i\frac{w}{k}\rfloor\le p_x<\lceil (i+1)\frac{w}{k}\rceil$和$\lfloor j\frac{h}{k}\rfloor\le p_y<\lceil (j+1)\frac{h}{k}\rceil$。
Similarly as in Eq.(2), in deformable RoI pooling, offsets $\{\Delta\mathbf{p}_{ij}\,|\,0\le i,j<k\}$ are added to the spatial binning positions. Eq.(5) becomes
$$\mathbf{y}(i,j)=\sum_{\mathbf{p}\in bin(i,j)}\mathbf{x}(\mathbf{p}_0+\mathbf{p}+\Delta\mathbf{p}_{ij})/n_{ij}. \tag{6}$$
Typically, $\Delta\mathbf{p}_{ij}$ is fractional. Eq.(6) is implemented by bilinear interpolation via Eq.(3) and (4).
類似于方程(2)，在可變形RoI池化中，將偏移$\{\Delta\mathbf{p}_{ij}\,|\,0\le i,j<k\}$加到空間組塊的位置上。方程(5)變?yōu)?$$\mathbf{y}(i,j)=\sum_{\mathbf{p}\in bin(i,j)}\mathbf{x}(\mathbf{p}_0+\mathbf{p}+\Delta\mathbf{p}_{ij})/n_{ij}. \tag{6}$$
通常，$\Delta\mathbf{p}_{ij}$是小數(shù)。方程(6)通過方程(3)和(4)的雙線性插值來實(shí)現(xiàn)。
Figure 3 illustrates how to obtain the offsets. Firstly, RoI pooling (Eq.(5)) generates the pooled feature maps. From the maps, a fc layer generates the normalized offsets $\Delta\widehat{\mathbf{p}}_{ij}$, which are then transformed to the offsets $\Delta\mathbf{p}_{ij}$ in Eq.(6) by element-wise product with the RoI's width and height, as $\Delta\mathbf{p}_{ij}=\gamma\cdot\Delta\widehat{\mathbf{p}}_{ij}\circ(w,h)$. Here $\gamma$ is a pre-defined scalar to modulate the magnitude of the offsets. It is empirically set to $\gamma=0.1$. The offset normalization is necessary to make the offset learning invariant to RoI size. The fc layer is learned by back-propagation, as detailed in appendix A.
Figure 3: Illustration of 3 × 3 deformable RoI pooling.
圖3說明了如何獲得偏移量。首先，RoI池化（方程(5)）生成池化后的特徵映射。從特徵映射中，一個(gè)fc層產(chǎn)生歸一化偏移量$\Delta\widehat{\mathbf{p}}_{ij}$，然后通過與RoI的寬和高進(jìn)行逐元素相乘將其轉(zhuǎn)換為方程(6)中的偏移量$\Delta\mathbf{p}_{ij}$，即$\Delta\mathbf{p}_{ij}=\gamma\cdot\Delta\widehat{\mathbf{p}}_{ij}\circ(w,h)$。這里$\gamma$是一個(gè)預(yù)定義的標(biāo)量，用來調(diào)節(jié)偏移的大小，經(jīng)驗(yàn)上設(shè)定為$\gamma=0.1$。為了使偏移學(xué)習(xí)對RoI大小具有不變性，偏移歸一化是必要的。fc層通過反向傳播學(xué)習(xí)，詳見附錄A。
圖3:闡述3×3的可變形RoI池化。
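歸一化偏移到實(shí)際偏移的變換$\Delta\mathbf{p}_{ij}=\gamma\cdot\Delta\widehat{\mathbf{p}}_{ij}\circ(w,h)$可以用幾行代碼示意如下（譯者補(bǔ)充；RoI的寬高和$\Delta\widehat{\mathbf{p}}_{ij}$均為隨機(jī)示意值）。

```python
import numpy as np

gamma = 0.1                                  # 論文中經(jīng)驗(yàn)設(shè)定的標(biāo)量
k = 3
roi_w, roi_h = 120.0, 80.0                   # 某個(gè) RoI 的寬和高（示意）
dp_hat = 0.05 * np.random.randn(k, k, 2)     # fc 層輸出的歸一化偏移 Δp̂_ij
dp = gamma * dp_hat * np.array([roi_w, roi_h])   # 逐元素乘以 (w, h)，得到公式(6)中的 Δp_ij
print(dp.shape)                              # (3, 3, 2)
```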
Position-Sensitive (PS) RoI Pooling [7] It is fully convolutional and different from RoI pooling. Through a conv layer, all the input feature maps are firstly converted to $k^2$ score maps for each object class (totally $k^2(C+1)$ for $C$ object classes), as illustrated in the bottom branch in Figure 4. Without need to distinguish between classes, such score maps are denoted as $\{\mathbf{x}_{i,j}\}$, where $(i,j)$ enumerates all bins. Pooling is performed on these score maps. The output value for the $(i,j)$-th bin is obtained by summation from the one score map $\mathbf{x}_{i,j}$ corresponding to that bin. In short, the difference from RoI pooling in Eq.(5) is that a general feature map $\mathbf{x}$ is replaced by a specific position-sensitive score map $\mathbf{x}_{i,j}$.
Figure 4: Illustration of 3 × 3 deformable PS RoI pooling.
位置敏感（PS）的RoI池化[7]。它是全卷積的，不同于RoI池化。通過一個(gè)卷積層，所有的輸入特徵映射首先被轉(zhuǎn)換為每個(gè)目標(biāo)類的$k^2$個(gè)分?jǐn)?shù)映射（對于$C$個(gè)目標(biāo)類，總共$k^2(C+1)$個(gè)），如圖4的底部分支所示。不需要區(qū)分類別，這樣的分?jǐn)?shù)映射被表示為$\{\mathbf{x}_{i,j}\}$，其中$(i,j)$枚舉所有的組塊。池化是在這些分?jǐn)?shù)映射上進(jìn)行的。第$(i,j)$個(gè)組塊的輸出值是通過對該組塊對應(yīng)的分?jǐn)?shù)映射$\mathbf{x}_{i,j}$求和得到的。簡而言之，與方程(5)中RoI池化的區(qū)別在于，通用特徵映射$\mathbf{x}$被特定的位置敏感的分?jǐn)?shù)映射$\mathbf{x}_{i,j}$所取代。
圖4:闡述3×3的可變形PS RoI池化。
In deformable PS RoI pooling, the only change in Eq.(6) is that $\mathbf{x}$ is also modified to $\mathbf{x}_{i,j}$. However, the offset learning is different. It follows the "fully convolutional" spirit in [7], as illustrated in Figure 4. In the top branch, a conv layer generates the full spatial resolution offset fields. For each RoI (also for each class), PS RoI pooling is applied on such fields to obtain normalized offsets $\Delta\widehat{\mathbf{p}}_{ij}$, which are then transformed to the real offsets $\Delta\mathbf{p}_{ij}$ in the same way as in deformable RoI pooling described above.
在可變形PS RoI池化中，方程(6)中唯一的變化是$\mathbf{x}$也被修改為$\mathbf{x}_{i,j}$。但是，偏移學(xué)習(xí)是不同的。它遵循[7]中的“全卷積”精神，如圖4所示。在頂部分支中，一個(gè)卷積層生成完整空間分辨率的偏移量字段。對于每個(gè)RoI（也對于每個(gè)類），在這些字段上應(yīng)用PS RoI池化以獲得歸一化偏移量$\Delta\widehat{\mathbf{p}}_{ij}$，然后以上面可變形RoI池化中描述的相同方式將其轉(zhuǎn)換為實(shí)數(shù)偏移量$\Delta\mathbf{p}_{ij}$。
2.3. Deformable ConvNets
Both deformable convolution and RoI pooling modules have the same input and output as their plain versions. Hence, they can readily replace their plain counterparts in existing CNNs. In the training, these added conv and fc layers for offset learning are initialized with zero weights. Their learning rates are set to $\beta$ times ($\beta=1$ by default, and $\beta=0.01$ for the fc layer in Faster R-CNN) of the learning rate for the existing layers. They are trained via back propagation through the bilinear interpolation operations in Eq.(3) and Eq.(4). The resulting CNNs are called deformable ConvNets.
2.3. 可變形卷積網(wǎng)絡(luò)
可變形卷積和RoI池化模塊都具有與普通版本相同的輸入和輸出。因此，它們可以很容易地取代現(xiàn)有CNN中的普通版本。在訓(xùn)練中，這些添加的用于偏移學(xué)習(xí)的conv和fc層的權(quán)重被初始化為零。它們的學(xué)習(xí)率設(shè)置為現(xiàn)有層學(xué)習(xí)率的$\beta$倍（默認(rèn)$\beta=1$，F(xiàn)aster R-CNN中的fc層為$\beta=0.01$）。它們通過方程(3)和方程(4)中雙線性插值運(yùn)算的反向傳播進(jìn)行訓(xùn)練。由此產(chǎn)生的CNN稱為可變形ConvNets。
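下面用PyTorch示意本節(jié)的訓(xùn)練設(shè)置：新增的偏移預(yù)測層零初始化，並通過參數(shù)分組為其設(shè)置$\beta$倍于現(xiàn)有層的學(xué)習(xí)率（譯者補(bǔ)充的示意代碼，層的結(jié)構(gòu)和數(shù)值均為假設(shè)）。

```python
import torch
import torch.nn as nn

backbone = nn.Conv2d(64, 64, 3, padding=1)        # 代表網(wǎng)絡(luò)中已有的層（示意）
offset_conv = nn.Conv2d(64, 18, 3, padding=1)     # 新增的偏移預(yù)測層：2N = 2*3*3 = 18 個(gè)通道
nn.init.zeros_(offset_conv.weight)                # 偏移層權(quán)重初始化為零
nn.init.zeros_(offset_conv.bias)

base_lr, beta = 1e-3, 1.0                         # beta 為偏移層的學(xué)習(xí)率倍數(shù)
optimizer = torch.optim.SGD([
    {"params": backbone.parameters(), "lr": base_lr},
    {"params": offset_conv.parameters(), "lr": beta * base_lr},
], momentum=0.9)
```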
To integrate deformable ConvNets with the state-of-the-art CNN architectures, we note that these architectures consist of two stages. First, a deep fully convolutional network generates feature maps over the whole input image. Second, a shallow task specific network generates results from the feature maps. We elaborate the two steps below.
為了將可變形的ConvNets與最先進(jìn)的CNN架構(gòu)集成烟阐,我們注意到這些架構(gòu)由兩個(gè)階段組成搬俊。首先紊扬,深度全卷積網(wǎng)絡(luò)在整個(gè)輸入圖像上生成特征映射。其次唉擂,淺層任務(wù)專用網(wǎng)絡(luò)從特征映射上生成結(jié)果餐屎。我們詳細(xì)說明下面兩個(gè)步驟。
Deformable Convolution for Feature Extraction We adopt two state-of-the-art architectures for feature extraction: ResNet-101 [22] and a modifed version of Inception-ResNet [51]. Both are pre-trained on ImageNet [8] classification dataset.
特征提取的可變形卷積玩祟。我們采用兩種最先進(jìn)的架構(gòu)進(jìn)行特征提雀顾酢:ResNet-101[22]和Inception-ResNet[51]的修改版本。兩者都在ImageNet[8]分類數(shù)據(jù)集上進(jìn)行預(yù)訓(xùn)練空扎。
The original Inception-ResNet is designed for image recognition. It has a feature misalignment issue and problematic for dense prediction tasks. It is modified to fix the alignment problem [20]. The modified version is dubbed as “Aligned-Inception-ResNet” and is detailed in appendix B.
最初的Inception-ResNet是為圖像識別而設(shè)計(jì)的藏鹊。它有一個(gè)特征不對齊的問題,對于密集的預(yù)測任務(wù)是有問題的转锈。它被修改來解決對齊問題[20]盘寡。修改后的版本被稱為“Aligned-Inception-ResNet”,詳見附錄B.
Both models consist of several convolutional blocks, an average pooling and a 1000-way fc layer for ImageNet classification. The average pooling and the fc layers are removed. A randomly initialized 1 × 1 convolution is added at last to reduce the channel dimension to 1024. As in common practice [4, 7], the effective stride in the last convolutional block is reduced from 32 pixels to 16 pixels to increase the feature map resolution. Specifically, at the beginning of the last block, stride is changed from 2 to 1 (“conv5” for both ResNet-101 and Aligned-Inception-ResNet). To compensate, the dilation of all the convolution filters in this block (with kernel size > 1) is changed from 1 to 2.
兩種模型都由若干卷積塊、一個(gè)平均池化層和一個(gè)用于ImageNet分類的1000類全連接層組成。平均池化層和全連接層被移除，最后加入一個(gè)隨機(jī)初始化的1×1卷積，將通道維數(shù)降低到1024。與通常的做法[4,7]一樣，最后一個(gè)卷積塊的有效步長從32個(gè)像素減少到16個(gè)像素，以增加特徵映射的分辨率。具體來說，在最后一個(gè)塊的開始，步長從2變?yōu)?（ResNet-101和Aligned-Inception-ResNet的“conv5”）。為了進(jìn)行補(bǔ)償，該塊中所有核大小>1的卷積濾波器的擴(kuò)張從1改變?yōu)?。
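上述網(wǎng)絡(luò)修改（conv5步長2改1、用擴(kuò)張卷積補(bǔ)償、去掉平均池化和fc層、加1×1卷積降維到1024）可以用torchvision大致示意如下。這里假設(shè)使用較新版本torchvision的`resnet101`接口及其`replace_stride_with_dilation`參數(shù)；譯者補(bǔ)充的示意，非論文官方實(shí)現(xiàn)。

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# replace_stride_with_dilation=[False, False, True]：layer4（即 conv5）步長由 2 改為 1，
# 其中 3x3 卷積的擴(kuò)張改為 2，使整體有效步長從 32 降為 16
backbone = resnet101(weights=None, replace_stride_with_dilation=[False, False, True])
features = nn.Sequential(
    *list(backbone.children())[:-2],          # 去掉平均池化層和 1000 類 fc 層
    nn.Conv2d(2048, 1024, kernel_size=1),     # 隨機(jī)初始化的 1x1 卷積，把通道降到 1024
)

x = torch.randn(1, 3, 224, 224)
print(features(x).shape)                      # torch.Size([1, 1024, 14, 14])，有效步長為 16
```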
Optionally, deformable convolution is applied to the last few convolutional layers (with kernel size > 1). We experimented with different numbers of such layers and found 3 as a good trade-off for different tasks, as reported in Table 1.
Table 1: Results of using deformable convolution in the last 1, 2, 3, and 6 convolutional layers (of 3 × 3 filter) in ResNet-101 feature extraction network. For class-aware RPN, Faster R-CNN, and R-FCN, we report result on VOC 2007 test.
可選地,可變形卷積應(yīng)用于最后的幾個(gè)卷積層(核大小>1)夏哭。我們嘗試了不同數(shù)量的這樣的層检柬,發(fā)現(xiàn)3是不同任務(wù)的一個(gè)很好的權(quán)衡,如表1所示竖配。
表1:在ResNet-101特征提取網(wǎng)絡(luò)中的最后1個(gè)何址,2個(gè),3個(gè)和6個(gè)卷積層上(3×3濾波器)應(yīng)用可變形卷積的結(jié)果进胯。對于class-aware RPN用爪,F(xiàn)aster R-CNN和R-FCN,我們報(bào)告了在VOC 2007測試集上的結(jié)果胁镐。
Segmentation and Detection Networks A task specific network is built upon the output feature maps from the feature extraction network mentioned above.
分割和檢測網(wǎng)絡(luò)偎血。根據(jù)上述特征提取網(wǎng)絡(luò)的輸出特征映射構(gòu)建特定任務(wù)的網(wǎng)絡(luò)诸衔。
In the below, $C$ denotes the number of object classes.
在下面，$C$表示目標(biāo)類別的數(shù)量。
DeepLab [5] is a state-of-the-art method for semantic segmentation. It adds a 1 × 1 convolutional layer over the feature maps to generates (C + 1) maps that represent the per-pixel classification scores. A following softmax layer then outputs the per-pixel probabilities.
DeepLab[5]是最先進(jìn)的語義分割方法。它在特征映射上添加1×1卷積層以生成表示每個(gè)像素分類分?jǐn)?shù)的(C+1)個(gè)映射。然后隨后的softmax層輸出每個(gè)像素的概率持隧。
Category-Aware RPN is almost the same as the region proposal network in [47], except that the 2-class (object or not) convolutional classifier is replaced by a (C + 1)-class convolutional classifier. It can be considered as a simplified version of SSD [40].
除了用(C+1)類卷積分類器代替2類(目標(biāo)或非目標(biāo))卷積分類器外倔叼,Category-Aware RPN與[47]中的區(qū)域提出網(wǎng)絡(luò)幾乎是相同的。它可以被認(rèn)為是SSD的簡化版本[40]空郊。
Faster R-CNN [47] is the state-of-the-art detector. In our implementation, the RPN branch is added on the top of the conv4 block, following [47]. In the previous practice [22, 24], the RoI pooling layer is inserted between the conv4 and the conv5 blocks in ResNet-101, leaving 10 layers for each RoI. This design achieves good accuracy but has high per-RoI computation. Instead, we adopt a simplified design as in [38]. The RoI pooling layer is added at last. On top of the pooled RoI features, two fc layers of dimension 1024 are added, followed by the bounding box regression and the classification branches. Although such simplification (from 10 layer conv5 block to 2 fc layers) would slightly decrease the accuracy, it still makes a strong enough baseline and is not a concern in this work.
Faster R-CNN[47]是最先進(jìn)的檢測器诊霹。在我們的實(shí)現(xiàn)中,RPN分支被添加在conv4塊的頂部渣淳,遵循[47]脾还。在以前的實(shí)踐中[22,24],在ResNet-101的conv4和conv5塊之間插入了RoI池化層入愧,每個(gè)RoI留下了10層鄙漏。這個(gè)設(shè)計(jì)實(shí)現(xiàn)了很好的精確度,但是具有很高的每個(gè)RoI計(jì)算棺蛛。相反怔蚌,我們采用[38]中的簡化設(shè)計(jì)。RoI池化層在最后添加旁赊。在池化的RoI特征之上桦踊,添加了兩個(gè)1024維的全連接層,接著是邊界框回歸和分類分支终畅。雖然這樣的簡化(從10層conv5塊到2個(gè)全連接層)會(huì)稍微降低精確度籍胯,但它仍然具有足夠強(qiáng)的基準(zhǔn),在這項(xiàng)工作中不再關(guān)心离福。
Optionally, the RoI pooling layer can be changed to deformable RoI pooling.
可選地杖狼,可以將RoI池化層更改為可變形的RoI池化。
R-FCN [7] is another state-of-the-art detector. It has negligible per-RoI computation cost. We follow the original implementation. Optionally, its RoI pooling layer can be changed to deformable position-sensitive RoI pooling.
R-FCN[7]是另一種最先進(jìn)的檢測器妖爷。它的每個(gè)RoI計(jì)算成本可以忽略不計(jì)蝶涩。我們遵循原來的實(shí)現(xiàn)⌒跏叮可選地绿聘,其RoI池化層可以改變?yōu)?em>可變形的位置敏感的RoI池化。
3. Understanding Deformable ConvNets
This work is built on the idea of augmenting the spatial sampling locations in convolution and RoI pooling with additional offsets and learning the offsets from target tasks.
3. 理解可變形卷積網(wǎng)絡(luò)
這項(xiàng)工作以用額外的偏移量在卷積和RoI池中增加空間采樣位置次舌,并從目標(biāo)任務(wù)中學(xué)習(xí)偏移量的想法為基礎(chǔ)熄攘。
When the deformable convolution are stacked, the effect of composited deformation is profound. This is exemplified in Figure 5. The receptive field and the sampling locations in the standard convolution are fixed all over the top feature map (left). They are adaptively adjusted according to the objects’ scale and shape in deformable convolution (right). More examples are shown in Figure 6. Table 2 provides quantitative evidence of such adaptive deformation.
Figure 5: Illustration of the fixed receptive field in standard convolution (a) and the adaptive receptive field in deformable convolution (b), using two layers. Top: two activation units on the top feature map, on two objects of different scales and shapes. The activation is from a 3 × 3 filter. Middle: the sampling locations of the 3 × 3 filter on the preceding feature map. Another two activation units are highlighted. Bottom: the sampling locations of two levels of 3 × 3 filters on the preceding feature map. Two sets of locations are highlighted, corresponding to the highlighted units above.
Figure 6: Each image triplet shows the sampling locations ( red points in each image) in three levels of 3 × 3 deformable filters (see Figure 5 as a reference) for three activation units (green points) on the background (left), a small object (middle), and a large object (right), respectively.
Table 2: Statistics of effective dilation values of deformable convolutional filters on three layers and four categories. Similar as in COCO [39], we divide the objects into three categories equally according to the bounding box area. Small: area < pixels; medium:
< area <
; large: area >
pixels.
當(dāng)可變形卷積疊加時(shí),復(fù)合變形的影響是深遠(yuǎn)的垃它。這在圖5中舉例說明鲜屏。標(biāo)準(zhǔn)卷積中的感受野和采樣位置在頂部特征映射上是固定的(左)烹看。它們在可變形卷積中(右)根據(jù)目標(biāo)的尺寸和形狀進(jìn)行自適應(yīng)調(diào)整。圖6中顯示了更多的例子洛史。表2提供了這種自適應(yīng)變形的量化證據(jù)惯殊。
圖5:標(biāo)準(zhǔn)卷積(a)中的固定感受野和可變形卷積(b)中的自適應(yīng)感受野的圖示,使用兩層也殖。頂部:頂部特征映射上的兩個(gè)激活單元土思,在兩個(gè)不同尺度和形狀的目標(biāo)上。激活來自3×3濾波器忆嗜。中間:前一個(gè)特征映射上3×3濾波器的采樣位置己儒。另外兩個(gè)激活單元突出顯示。底部:前一個(gè)特征映射上兩個(gè)3×3濾波器級別的采樣位置捆毫。突出顯示兩組位置闪湾,對應(yīng)于上面突出顯示的單元。
圖6:每個(gè)圖像三元組在三級3×3可變形濾波器(參見圖5作為參考)中顯示了三個(gè)激活單元(綠色點(diǎn))分別在背景(左)绩卤、小目標(biāo)(中)和大目標(biāo)(右)上的采樣位置(每張圖像中的個(gè)紅色點(diǎn))途样。
表2:可變形卷積濾波器在三個(gè)卷積層和四個(gè)類別上的有效擴(kuò)張值的統(tǒng)計(jì)。與在COCO[39]中類似濒憋,我們根據(jù)邊界框區(qū)域?qū)⒛繕?biāo)平均分為三類何暇。小:面積<個(gè)像素凛驮;中等:
<面積<
裆站; 大:面積>
。
The effect of deformable RoI pooling is similar, as illustrated in Figure 7. The regularity of the grid structure in standard RoI pooling no longer holds. Instead, parts deviate from the RoI bins and move onto the nearby object foreground regions. The localization capability is enhanced, especially for non-rigid objects.
Figure 7: Illustration of offset parts in deformable (positive sensitive) RoI pooling in R-FCN [7] and 3 × 3 bins (red) for an input RoI (yellow). Note how the parts are offset to cover the non-rigid objects.
可變形RoI池化的效果是類似的黔夭,如圖7所示宏胯。標(biāo)準(zhǔn)RoI池化中網(wǎng)格結(jié)構(gòu)的規(guī)律不再成立。相反纠修,部分偏離RoI組塊并移動(dòng)到附近的目標(biāo)前景區(qū)域胳嘲。定位能力得到增強(qiáng)厂僧,特別是對于非剛性物體扣草。
圖7:R-FCN[7]中可變形(正敏感)RoI池化的偏移部分的示意圖和輸入RoI(黃色)的3x3個(gè)組塊(紅色)。請注意部件如何偏移以覆蓋非剛性物體颜屠。
3.1. In Context of Related Works
Our work is related to previous works in different aspects. We discuss the relations and differences in details.
3.1. 相關(guān)工作的背景
我們的工作與以前的工作在不同的方面有聯(lián)系辰妙。我們詳細(xì)討論聯(lián)系和差異。
Spatial Transform Networks (STN) [26] It is the first work to learn spatial transformation from data in a deep learning framework. It warps the feature map via a global parametric transformation such as affine transformation. Such warping is expensive and learning the transformation parameters is known difficult. STN has shown successes in small scale image classification problems. The inverse STN method [37] replaces the expensive feature warping by efficient transformation parameter propagation.
空間變換網(wǎng)絡(luò)(STN)[26]甫窟。這是在深度學(xué)習(xí)框架下從數(shù)據(jù)中學(xué)習(xí)空間變換的第一個(gè)工作密浑。它通過全局參數(shù)變換扭曲特征映射,例如仿射變換粗井。這種扭曲是昂貴的尔破,學(xué)習(xí)變換參數(shù)是困難的街图。STN在小規(guī)模圖像分類問題上取得了成功。反STN方法[37]通過有效的變換參數(shù)傳播來代替昂貴的特征扭曲懒构。
The offset learning in deformable convolution can be considered as an extremely light-weight spatial transformer in STN [26]. However, deformable convolution does not adopt a global parametric transformation and feature warping. Instead, it samples the feature map in a local and dense manner. To generate new feature maps, it has a weighted summation step, which is absent in STN.
可變形卷積中的偏移學(xué)習(xí)可以被認(rèn)為是STN中極輕的空間變換器[26]餐济。然而,可變形卷積不采用全局參數(shù)變換和特征扭曲胆剧。相反絮姆,它以局部密集的方式對特征映射進(jìn)行采樣。為了生成新的特征映射秩霍,它有加權(quán)求和步驟篙悯,STN中不存在。
Deformable convolution is easy to integrate into any CNN architectures. Its training is easy. It is shown effective for complex vision tasks that require dense (e.g., semantic segmentation) or semi-dense (e.g., object detection) predictions. These tasks are difficult (if not infeasible) for STN [26, 37].
可變形卷積很容易集成到任何CNN架構(gòu)中铃绒。它的訓(xùn)練很簡單鸽照。對于要求密集(例如語義分割)或半密集(例如目標(biāo)檢測)預(yù)測的復(fù)雜視覺任務(wù)來說,它是有效的颠悬。這些任務(wù)對于STN來說是困難的(如果不是不可行的話)[26,37]移宅。
Active Convolution [27] This work is contemporary. It also augments the sampling locations in the convolution with offsets and learns the offsets via back-propagation end-to-end. It is shown effective on image classification tasks.
主動(dòng)卷積[27]。這項(xiàng)工作是當(dāng)代的椿疗。它還通過偏移來增加卷積中的采樣位置漏峰,并通過端到端的反向傳播學(xué)習(xí)偏移量。它對于圖像分類任務(wù)是有效的届榄。
Two crucial differences from deformable convolution make this work less general and adaptive. First, it shares the offsets all over the different spatial locations. Second, the offsets are static model parameters that are learnt per task or per training. In contrast, the offsets in deformable convolution are dynamic model outputs that vary per image location. They model the dense spatial transformations in the images and are effective for (semi-)dense prediction tasks such as object detection and semantic segmentation.
與可變形卷積的兩個(gè)關(guān)鍵區(qū)別使得這個(gè)工作不那么一般和適應(yīng)浅乔。首先,它在所有不同的空間位置上共享偏移量铝条。其次靖苇,偏移量是每個(gè)任務(wù)或每次訓(xùn)練都要學(xué)習(xí)的靜態(tài)模型參數(shù)。相反班缰,可變形卷積中的偏移是每個(gè)圖像位置變化的動(dòng)態(tài)模型輸出贤壁。他們對圖像中的密集空間變換進(jìn)行建模,對于(半)密集的預(yù)測任務(wù)(如目標(biāo)檢測和語義分割)是有效的埠忘。
Effective Receptive Field [43] It finds that not all pixels in a receptive field contribute equally to an output response. The pixels near the center have much larger impact. The effective receptive field only occupies a small fraction of the theoretical receptive field and has a Gaussian distribution. Although the theoretical receptive field size increases linearly with the number of convolutional layers, a surprising result is that, the effective receptive field size increases linearly with the square root of the number, therefore, at a much slower rate than what we would expect.
有效的感受野[43]脾拆。它發(fā)現(xiàn),并不是感受野中的所有像素都貢獻(xiàn)平等的輸出響應(yīng)莹妒。中心附近的像素影響更大名船。有效感受野只占據(jù)理論感受野的一小部分,并具有高斯分布旨怠。雖然理論上的感受野大小隨卷積層數(shù)量線性增加渠驼,但令人驚訝的結(jié)果是,有效感受野大小隨著數(shù)量的平方根線性增加鉴腻,因此迷扇,感受野大小以比我們期待的更低的速率增加百揭。
This finding indicates that even the top layer’s unit in deep CNNs may not have large enough receptive field. This partially explains why atrous convolution [23] is widely used in vision tasks (see below). It indicates the needs of adaptive receptive field learning.
這一發(fā)現(xiàn)表明,即使是深層CNN的頂層單元也可能沒有足夠大的感受野蜓席。這部分解釋了為什么空洞卷積[23]被廣泛用于視覺任務(wù)(見下文)信峻。它表明了自適應(yīng)感受野學(xué)習(xí)的必要。
Deformable convolution is capable of learning receptive fields adaptively, as shown in Figure 5, 6 and Table 2.
可變形卷積能夠自適應(yīng)地學(xué)習(xí)感受野瓮床,如圖5盹舞,6和表2所示。
Atrous convolution [23] It increases a normal filter’s stride to be larger than 1 and keeps the original weights at sparsified sampling locations. This increases the receptive field size and retains the same complexity in parameters and computation. It has been widely used for semantic segmentation [41, 5, 54] (also called dilated convolution in [54]), object detection [7], and image classification [55].
空洞卷積[23]隘庄。它將正常濾波器的步長增加到大于1踢步,并保持稀疏采樣位置的原始權(quán)重。這增加了感受野的大小丑掺,并保持了相同的參數(shù)和計(jì)算復(fù)雜性获印。它已被廣泛用于語義分割[41,5,54](在[54]中也稱擴(kuò)張卷積),目標(biāo)檢測[7]和圖像分類[55]街州。
Deformable convolution is a generalization of atrous convolution, as easily seen in Figure 1 (c). Extensive comparison to atrous convolution is presented in Table 3.
Table 3: Evaluation of our deformable modules and atrous convolution, using ResNet-101.
可變形卷積是空洞卷積的推廣，如圖1(c)所示。表3給出了與空洞卷積的大量比較。
表3：我們的可變形模塊與空洞卷積的評估，使用ResNet-101。
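圖1(c)所示“空洞卷積是可變形卷積的特例”可以這樣理解：取固定偏移$\Delta\mathbf{p}_n=(d-1)\,\mathbf{p}_n$時(shí)，采樣位置$\mathbf{p}_n+\Delta\mathbf{p}_n=d\,\mathbf{p}_n$正好是擴(kuò)張為$d$的空洞卷積網(wǎng)格。下面的小段代碼驗(yàn)證這一點(diǎn)（譯者補(bǔ)充的示意）。

```python
grid_R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]    # 擴(kuò)張為 1 的 3x3 網(wǎng)格
d = 2                                                             # 目標(biāo)擴(kuò)張（空洞）大小
offsets = [((d - 1) * dy, (d - 1) * dx) for (dy, dx) in grid_R]   # 固定偏移 Δpn = (d-1)*pn
sampling = [(dy + ody, dx + odx) for (dy, dx), (ody, odx) in zip(grid_R, offsets)]
print(sampling)   # [(-2, -2), (-2, 0), ..., (2, 2)]，即擴(kuò)張為 2 的空洞卷積采樣位置
```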
Deformable Part Models (DPM) [11] Deformable RoI pooling is similar to DPM because both methods learn the spatial deformation of object parts to maximize the classification score. Deformable RoI pooling is simpler since no spatial relations between the parts are considered.
可變形部件模型(DPM)[11]面徽⊙薮裕可變形RoI池化與DPM類似,因?yàn)閮煞N方法都可以學(xué)習(xí)目標(biāo)部件的空間變形趟紊,以最大化分類得分氮双。由于不考慮部件之間的空間關(guān)系,所以可變形RoI池化更簡單霎匈。
DPM is a shallow model and has limited capability of modeling deformation. While its inference algorithm can be converted to CNNs [17] by treating the distance transform as a special pooling operation, its training is not end-to-end and involves heuristic choices such as selection of components and part sizes. In contrast, deformable ConvNets are deep and perform end-to-end training. When multiple deformable modules are stacked, the capability of modeling deformation becomes stronger.
DPM是一個(gè)淺層模型戴差,其建模變形能力有限。雖然其推理算法可以通過將距離變換視為一個(gè)特殊的池化操作轉(zhuǎn)換為CNN[17]铛嘱,但是它的訓(xùn)練不是端到端的暖释,而是涉及啟發(fā)式選擇,例如選擇組件和部件尺寸弄痹。相比之下饭入,可變形ConvNets是深層的并進(jìn)行端到端的訓(xùn)練。當(dāng)多個(gè)可變形模塊堆疊時(shí)肛真,建模變形的能力變得更強(qiáng)。
DeepID-Net [44] It introduces a deformation constrained pooling layer which also considers part deformation for object detection. It therefore shares a similar spirit with deformable RoI pooling, but is much more complex. This work is highly engineered and based on RCNN [16]. It is unclear how to adapt it to the recent state-of-the-art object detection methods [47, 7] in an end-to-end manner.
DeepID-Net[44]爽航。它引入了一個(gè)變形約束池化層蚓让,它也考慮了目標(biāo)檢測的部分變形乾忱。因此,它與可變形RoI池化共享類似的精神历极,但是要復(fù)雜得多窄瘟。這項(xiàng)工作是高度工程化并基于RCNN的[16]。目前尚不清楚如何以端對端的方式將其應(yīng)用于最近的最先進(jìn)目標(biāo)檢測方法[47,7]趟卸。
Spatial manipulation in RoI pooling Spatial pyramid pooling [34] uses hand crafted pooling regions over scales. It is the predominant approach in computer vision and also used in deep learning based object detection [21, 15].
RoI池化中的空間操作蹄葱。空間金字塔池化[34]在尺度上使用手工設(shè)計(jì)的池化區(qū)域锄列。它是計(jì)算機(jī)視覺中的主要方法图云,也用于基于深度學(xué)習(xí)的目標(biāo)檢測[21,15]。
Learning the spatial layout of pooling regions has received little study. The work in [28] learns a sparse subset of pooling regions from a large over-complete set. The large set is hand engineered and the learning is not end-to-end.
很少有學(xué)習(xí)池化區(qū)域空間布局的研究邻邮。[28]中的工作從一個(gè)大型的超完備集合中學(xué)習(xí)了池化區(qū)域一個(gè)稀疏子集竣况。大數(shù)據(jù)集是手工設(shè)計(jì)的并且學(xué)習(xí)不是端到端的。
Deformable RoI pooling is the first to learn pooling regions end-to-end in CNNs. While the regions are of the same size currently, extension to multiple sizes as in spatial pyramid pooling [34] is straightforward.
可變形RoI池化第一個(gè)在CNN中端到端地學(xué)習(xí)池化區(qū)域筒严。雖然目前這些區(qū)域的規(guī)模相同丹泉,但像空間金字塔池化[34]那樣擴(kuò)展到多種尺度很簡單。
Transformation invariant features and their learning There have been tremendous efforts on designing transformation invariant features. Notable examples include scale invariant feature transform (SIFT) [42] and ORB [49] (O for orientation). There is a large body of such works in the context of CNNs. The invariance and equivalence of CNN representations to image transformations are studied in [36]. Some works learn invariant CNN representations with respect to different types of transformations such as [50], scattering networks [3], convolutional jungles [32], and TI-pooling [33]. Some works are devoted for specific transformations such as symmetry [13, 9], scale [29], and rotation [53].
變換不變特征及其學(xué)習(xí)鸭蛙。在設(shè)計(jì)變換不變特征方面已經(jīng)進(jìn)行了巨大的努力摹恨。值得注意的例子包括尺度不變特征變換(SIFT)[42]和ORB[49](O為方向)。在CNN的背景下有大量這樣的工作娶视。CNN表示對圖像變換的不變性和等價(jià)性在[36]中被研究睬塌。一些工作學(xué)習(xí)關(guān)于不同類型的變換(如[50],散射網(wǎng)絡(luò)[3]歇万,卷積森林[32]和TI池化[33])的不變CNN表示揩晴。有些工作專門用于對稱性[13,9],尺度[29]和旋轉(zhuǎn)[53]等特定轉(zhuǎn)換贪磺。
As analyzed in Section 1, in these works the transformations are known a priori. The knowledge (such as parameterization) is used to hand craft the structure of feature extraction algorithm, either fixed in such as SIFT, or with learnable parameters such as those based on CNNs. They cannot handle unknown transformations in the new tasks.
如第一部分分析的那樣硫兰,在這些工作中,轉(zhuǎn)換是先驗(yàn)的寒锚。使用知識(比如參數(shù)化)來手工設(shè)計(jì)特征提取算法的結(jié)構(gòu)劫映,或者是像SIFT那樣固定的,或者用學(xué)習(xí)的參數(shù)刹前,如基于CNN的那些泳赋。它們無法處理新任務(wù)中的未知變換。
In contrast, our deformable modules generalize various transformations (see Figure 1). The transformation invariance is learned from the target task.
相反喇喉,我們的可變形模塊概括了各種轉(zhuǎn)換(見圖1)祖今。從目標(biāo)任務(wù)中學(xué)習(xí)變換的不變性。
Dynamic Filter [2] Similar to deformable convolution, the dynamic filters are also conditioned on the input features and change over samples. Differently, only the filter weights are learned, not the sampling locations like ours. This work is applied for video and stereo prediction.
動(dòng)態(tài)濾波器[2]。與可變形卷積類似，動(dòng)態(tài)濾波器也以輸入特徵為條件，並隨樣本變化。不同的是，它只學(xué)習(xí)濾波器權(quán)重，而不是像我們這樣學(xué)習(xí)采樣位置。這項(xiàng)工作被應(yīng)用于視頻和立體（雙目）預(yù)測。
Combination of low level filters Gaussian filters and its smooth derivatives [30] are widely used to extract low level image structures such as corners, edges, T-junctions, etc. Under certain conditions, such filters form a set of basis and their linear combination forms new filters within the same group of geometric transformations, such as multiple orientations in Steerable Filters [12] and multiple scales in [45]. We note that although the term deformable kernels is used in [45], its meaning is different from ours in this work.
低級濾波器的組合傲茄。高斯濾波器及其平滑導(dǎo)數(shù)[30]被廣泛用于提取低級圖像結(jié)構(gòu)毅访,如角點(diǎn),邊緣盘榨,T形接點(diǎn)等喻粹。在某些條件下,這些濾波器形成一組基较曼,并且它們的線性組合在同一組幾何變換中形成新的濾波器磷斧,例如Steerable Filters[12]中的多個(gè)方向和[45]中多尺度。我們注意到盡管[45]中使用了可變形內(nèi)核這個(gè)術(shù)語捷犹,但它的含義與我們在本文中的含義不同弛饭。
Most CNNs learn all their convolution filters from scratch. The recent work [25] shows that it could be unnecessary. It replaces the free form filters by weighted combination of low level filters (Gaussian derivatives up to 4-th order) and learns the weight coefficients. The regularization over the filter function space is shown to improve the generalization ability when training data are small.
大多數(shù)CNN從零開始學(xué)習(xí)所有的卷積濾波器。最近的工作[25]表明萍歉,這可能是沒必要的侣颂。它通過低階濾波器(高斯導(dǎo)數(shù)達(dá)4階)的加權(quán)組合來代替自由形式的濾波器,并學(xué)習(xí)權(quán)重系數(shù)枪孩。通過對濾波函數(shù)空間的正則化憔晒,可以提高訓(xùn)練小數(shù)據(jù)量時(shí)的泛化能力。
Above works are related to ours in that, when multiple filters, especially with different scales, are combined, the resulting filter could have complex weights and resemble our deformable convolution filter. However, deformable convolution learns sampling locations instead of filter weights.
上面的工作與我們有關(guān)蔑舞,當(dāng)多個(gè)濾波器拒担,尤其是不同尺度的濾波器組合時(shí),所得到的濾波器可能具有復(fù)雜的權(quán)重攻询,并且與我們的可變形卷積濾波器相似从撼。但是,可變形卷積學(xué)習(xí)采樣位置而不是濾波器權(quán)重钧栖。
4. Experiments
4.1. Experiment Setup and Implementation
Semantic Segmentation We use PASCAL VOC [10] and CityScapes [6]. For PASCAL VOC, there are 20 semantic categories. Following the protocols in [19, 41, 4], we use VOC 2012 dataset and the additional mask annotations in [18]. The training set includes 10, 582 images. Evaluation is performed on 1, 449 images in the validation set. For CityScapes, following the protocols in [5], training and evaluation are performed on 2, 975 images in the train set and 500 images in the validation set, respectively. There are 19 semantic categories plus a background category.
4. 實(shí)驗(yàn)
4.1. 實(shí)驗(yàn)設(shè)置和實(shí)現(xiàn)
語義分割低零。我們使用PASCAL VOC[10]和CityScapes[6]。對于PASCAL VOC拯杠,有20個(gè)語義類別掏婶。遵循[19,41,4]中的協(xié)議,我們使用VOC 2012數(shù)據(jù)集和[18]中的附加掩模注釋潭陪。訓(xùn)練集包含10,582張圖像雄妥。評估在驗(yàn)證集中的1,449張圖像上進(jìn)行最蕾。對于CityScapes,按照[5]中的協(xié)議茎芭,對訓(xùn)練數(shù)據(jù)集中的2,975張圖像和驗(yàn)證集中的500張圖像分別進(jìn)行訓(xùn)練和評估揖膜。有19個(gè)語義類別加上一個(gè)背景類別誓沸。
For evaluation, we use the mean intersection-over-union (mIoU) metric defined over image pixels, following the standard protocols [10, 6]. We use mIoU@V and mIoU@C for PASCAl VOC and Cityscapes, respectively.
為了評估梅桩,我們使用在圖像像素上定義的平均交集(mIoU)度量,遵循標(biāo)準(zhǔn)協(xié)議[10拜隧,6]宿百。我們在PASCAl VOC和Cityscapes上分別使用mIoU@V和mIoU@C。
In training and inference, the images are resized to have a shorter side of pixels for PASCAL VOC and
pixels for Cityscapes. In SGD training, one image is randomly sampled in each mini-batch. A total of 30k and 45k iterations are performed for PASCAL VOC and Cityscapes, respectively, with 8 GPUs and one mini-batch on each. The learning rates are
and
in the first
and the last
iterations, respectively.
在訓(xùn)練和推斷中洪添,PASCAL VOC中圖像的大小調(diào)整為較短邊有個(gè)像素垦页,Cityscapes較短邊有
個(gè)像素。在SGD訓(xùn)練中干奢,每個(gè)小批次數(shù)據(jù)中隨機(jī)抽取一張圖像痊焊。分別對PASCAL VOC和Cityscapes進(jìn)行30k和45k迭代,有8個(gè)GPU每個(gè)GPU上處理一個(gè)小批次數(shù)據(jù)忿峻。前
次迭代和后
次迭代的學(xué)習(xí)率分別設(shè)為
薄啥,
。
Object Detection We use PASCAL VOC and COCO [39] datasets. For PASCAL VOC, following the protocol in [15], training is performed on the union of VOC 2007 trainval and VOC 2012 trainval. Evaluation is on VOC 2007 test. For COCO, following the standard protocol [39], training and evaluation are performed on the 120k images in the trainval and the 20k images in the test-dev, respectively.
目標(biāo)檢測逛尚。我們使用PASCAL VOC和COCO[39]數(shù)據(jù)集垄惧。對于PASCAL VOC,按照[15]中的協(xié)議绰寞,對VOC 2007 trainval和VOC 2012 trainval的并集進(jìn)行培訓(xùn)到逊。評估是在VOC 2007測試集上。對于COCO滤钱,遵循標(biāo)準(zhǔn)協(xié)議[39]觉壶,分別對trainval中的120k張圖像和test-dev中的20k張圖像進(jìn)行訓(xùn)練和評估。
For evaluation, we use the standard mean average precision (mAP) scores [10, 39]. For PASCAL VOC, we report mAP scores using IoU thresholds at 0.5 and 0.7. For COCO, we use the standard COCO metric of mAP@[0.5:0.95], as well as mAP@0.5.
為了評估件缸,我們使用標(biāo)準(zhǔn)的平均精度均值(MAP)得分[10,39]铜靶。對于PASCAL VOC,我們使用0.5和0.7的IoU閾值報(bào)告mAP分?jǐn)?shù)停团。對于COCO旷坦,我們使用mAP@[0.5:0.95]的標(biāo)準(zhǔn)COCO度量,以及mAP@0.5佑稠。
In training and inference, the images are resized to have a shorter side of 600 pixels. In SGD training, one image is randomly sampled in each mini-batch. For class-aware RPN, 256 RoIs are sampled from the image. For Faster R-CNN and R-FCN, 256 and 128 RoIs are sampled for the region proposal and the object detection networks, respectively. bins are adopted in RoI pooling. To facilitate the ablation experiments on VOC, we follow [38] and utilize pre-trained and fixed RPN proposals for the training of Faster R-CNN and R-FCN, without feature sharing between the region proposal and the object detection networks. The RPN network is trained separately as in the first stage of the procedure in [47]. For COCO, joint training as in [48] is performed and feature sharing is enabled for training. A total of 30k and 240k iterations are performed for PASCAL VOC and COCO, respectively, on 8 GPUs. The learning rates are set as
and
in the first
and the last
iterations, respectively.
在訓(xùn)練和推斷中秒梅,圖像被調(diào)整為較短邊具有600像素。在SGD訓(xùn)練中舌胶,每個(gè)小批次中隨機(jī)抽取一張圖片捆蜀。對于class-aware RPN,從圖像中采樣256個(gè)RoI。對于Faster R-CNN和R-FCN辆它,對區(qū)域提出和目標(biāo)檢測網(wǎng)絡(luò)分別采樣256個(gè)和128個(gè)RoI誊薄。在ROI池化中采用的組塊。為了促進(jìn)VOC的消融實(shí)驗(yàn)锰茉,我們遵循[38]呢蔫,并且利用預(yù)訓(xùn)練的和固定的RPN提出來訓(xùn)練Faster R-CNN和R-FCN,而區(qū)域提出和目標(biāo)檢測網(wǎng)絡(luò)之間沒有特征共享飒筑。RPN網(wǎng)絡(luò)是在[47]中過程的第一階段單獨(dú)訓(xùn)練的片吊。對于COCO,執(zhí)行[48]中的聯(lián)合訓(xùn)練协屡,并且訓(xùn)練可以進(jìn)行特征共享俏脊。在8個(gè)GPU上分別對PASCAL VOC和COCO執(zhí)行30k次和240k次迭代。前
次迭代和后
次迭代的學(xué)習(xí)率分別設(shè)為
肤晓,
爷贫。
4.2. Ablation Study
Extensive ablation studies are performed to validate the efficacy and efficiency of our approach.
4.2. 消融研究
我們進(jìn)行了廣泛的消融研究來驗(yàn)證我們方法的功效性和有效性。
Deformable Convolution Table 1 evaluates the effect of deformable convolution using ResNet-101 feature extraction network. Accuracy steadily improves when more deformable convolution layers are used, especially for DeepLab and class-aware RPN. The improvement saturates when using 3 deformable layers for DeepLab, and 6 for others. In the remaining experiments, we use 3 in the feature extraction networks.
可變形卷積补憾。表1使用ResNet-101特征提取網(wǎng)絡(luò)評估可變形卷積的影響漫萄。當(dāng)使用更多可變形卷積層時(shí),精度穩(wěn)步提高余蟹,特別是DeepLab和class-aware RPN卷胯。當(dāng)DeepLab使用3個(gè)可變形層時(shí),改進(jìn)飽和威酒,其它的使用6個(gè)窑睁。在其余的實(shí)驗(yàn)中,我們在特征提取網(wǎng)絡(luò)中使用3個(gè)葵孤。
We empirically observed that the learned offsets in the deformable convolution layers are highly adaptive to the image content, as illustrated in Figure 5 and Figure 6. To better understand the mechanism of deformable convolution, we define a metric called effective dilation for a deformable convolution filter. It is the mean of the distances between all adjacent pairs of sampling locations in the filter. It is a rough measure of the receptive field size of the filter.
我們經(jīng)驗(yàn)地觀察到，可變形卷積層中學(xué)習(xí)到的偏移量對圖像內(nèi)容具有高度的自適應(yīng)性，如圖5和圖6所示。為了更好地理解可變形卷積的機(jī)制，我們?yōu)榭勺冃尉矸e濾波器定義了一個(gè)稱為有效擴(kuò)張的度量。它是濾波器中所有相鄰采樣位置對之間距離的平均值，是對濾波器感受野大小的粗略度量。
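“有效擴(kuò)張”的一種可能實(shí)現(xiàn)如下（譯者補(bǔ)充；論文未給出公式，“相鄰采樣點(diǎn)對”這里取為3×3網(wǎng)格上水平/垂直相鄰的點(diǎn)對，屬于譯者的假設(shè)）。

```python
import numpy as np

def effective_dilation(points):
    """points: (3, 3, 2)，可變形 3x3 卷積核的 9 個(gè)采樣位置。
    返回網(wǎng)格上水平/垂直相鄰采樣點(diǎn)對之間距離的平均值。"""
    dists = []
    for i in range(3):
        for j in range(3):
            if i + 1 < 3:
                dists.append(np.linalg.norm(points[i + 1, j] - points[i, j]))
            if j + 1 < 3:
                dists.append(np.linalg.norm(points[i, j + 1] - points[i, j]))
    return float(np.mean(dists))

# 未變形、擴(kuò)張為 1 的 3x3 網(wǎng)格：有效擴(kuò)張為 1；整體放大 2 倍后為 2
grid = np.stack(np.meshgrid(np.arange(3), np.arange(3), indexing="ij"), axis=-1).astype(float)
print(effective_dilation(grid))        # 1.0
print(effective_dilation(2.0 * grid))  # 2.0
```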
We apply the R-FCN network with 3 deformable layers (as in Table 1) on VOC 2007 test images. We categorize the deformable convolution filters into four classes: small, medium, large, and background, according to the ground truth bounding box annotation and where the filter center is. Table 2 reports the statistics (mean and std) of the effective dilation values. It clearly shows that: 1) the receptive field sizes of deformable filters are correlated with object sizes, indicating that the deformation is effectively learned from image content; 2) the filter sizes on the background region are between those on medium and large objects, indicating that a relatively large receptive field is necessary for recognizing the background regions. These observations are consistent in different layers.
我們在VOC 2007測試圖像上應(yīng)用R-FCN網(wǎng)絡(luò)压语,具有3個(gè)可變形層(如表1所示)学辱。根據(jù)真實(shí)邊界框標(biāo)注和濾波器中心的位置您访,我們將可變形卷積濾波器分為四類:小呻逆,中,大和背景驰吓。表2報(bào)告了有效擴(kuò)張值的統(tǒng)計(jì)(平均值和標(biāo)準(zhǔn)差)提完。它清楚地表明:1)可變形濾波器的感受野大小與目標(biāo)大小相關(guān)宋列,表明變形是從圖像內(nèi)容中有效學(xué)習(xí)到的; 2)背景區(qū)域上的濾波器大小介于中瞬矩,大目標(biāo)的濾波器之間茶鉴,表明一個(gè)相對較大的感受野是識別背景區(qū)域所必需的。這些觀察結(jié)果在不同層上是一致的景用。
The default ResNet-101 model uses atrous convolution with dilation 2 for the last three 3 × 3 convolutional layers (see Section 2.3). We further tried dilation values 4, 6, and 8 and reported the results in Table 3. It shows that: 1) accuracy increases for all tasks when using larger dilation values, indicating that the default networks have too small receptive fields; 2) the optimal dilation values vary for different tasks, e.g., 6 for DeepLab but 4 for Faster R-CNN; 3) deformable convolution has the best accuracy. These observations verify that adaptive learning of filter deformation is effective and necessary.
默認(rèn)的ResNet-101模型在最后的3個(gè)3×3卷積層使用擴(kuò)張為2的空洞卷積（見2.3節(jié)）。我們進(jìn)一步嘗試了擴(kuò)張值4，6和8，並在表3中報(bào)告了結(jié)果。它表明：1）當(dāng)使用較大的擴(kuò)張值時(shí)，所有任務(wù)的準(zhǔn)確度都會(huì)增加，表明默認(rèn)網(wǎng)絡(luò)的感受野太小；2）對于不同的任務(wù)，最佳擴(kuò)張值是不同的，例如，6用于DeepLab，4用于Faster R-CNN；3）可變形卷積具有最好的精度。這些觀察結(jié)果證明了濾波器變形的自適應(yīng)學(xué)習(xí)是有效和必要的。
Deformable RoI Pooling It is applicable to Faster R-CNN and R-FCN. As shown in Table 3, using it alone already produces noticeable performance gains, especially at the strict mAP@0.7 metric. When both deformable convolution and RoI Pooling are used, significant accuracy improvements are obtained.
可變形RoI池化。它適用于Faster R-CNN和R-FCN飞傀。如表3所示皇型,單獨(dú)使用它已經(jīng)產(chǎn)生了顯著的性能收益,特別是在嚴(yán)格的mAP@0.7度量標(biāo)準(zhǔn)下砸烦。當(dāng)同時(shí)使用可變形卷積和RoI池化時(shí)弃鸦,會(huì)獲得顯著準(zhǔn)確性改進(jìn)。
Model Complexity and Runtime Table 4 reports the model complexity and runtime of the proposed deformable ConvNets and their plain versions. Deformable ConvNets only add small overhead over model parameters and computation. This indicates that the significant performance improvement is from the capability of modeling geometric transformations, other than increasing model parameters.
Table 4: Model complexity and runtime comparison of deformable ConvNets and the plain counterparts, using ResNet-101. The overall runtime in the last column includes image resizing, network forward, and post-processing (e.g., NMS for object detection). Runtime is counted on a workstation with Intel E5-2650 v2 CPU and Nvidia K40 GPU.
模型復(fù)雜性和運(yùn)行時(shí)間幢痘。表4報(bào)告了所提出的可變形ConvNets及其普通版本的模型復(fù)雜度和運(yùn)行時(shí)間唬格。可變形ConvNets僅增加了很小的模型參數(shù)和計(jì)算量颜说。這表明顯著的性能改進(jìn)來自于建模幾何變換的能力购岗,而不是增加模型參數(shù)。
表4:使用ResNet-101的可變形ConvNets和對應(yīng)普通版本的模型復(fù)雜性和運(yùn)行時(shí)比較门粪。最后一列中的整體運(yùn)行時(shí)間包括圖像大小調(diào)整喊积,網(wǎng)絡(luò)前饋傳播和后處理(例如,用于目標(biāo)檢測的NMS)庄拇。運(yùn)行時(shí)間計(jì)算是在一臺(tái)配備了Intel E5-2650 v2 CPU和Nvidia K40 GPU的工作站上注服。
4.3. Object Detection on COCO
In Table 5, we perform extensive comparison between the deformable ConvNets and the plain ConvNets for object detection on COCO test-dev set. We first experiment using ResNet-101 model. The deformable versions of class-aware RPN, Faster R-CNN and R-FCN achieve mAP@[0.5:0.95] scores of ,
, and
respectively, which are
,
, and
relatively higher than their plain-ConvNets counterparts respectively. By replacing ResNet-101 by Aligned-Inception-ResNet in Faster R-CNN and R-FCN, their plain-ConvNet baselines both improve thanks to the more powerful feature representations. And the effective performance gains brought by deformable ConvNets also hold. By further testing on multiple image scales (the image shorter side is in [480, 576, 688, 864, 1200, 1400]) and performing iterative bounding box average [14], the mAP@[0.5:0.95] scores are increased to 37.5% for the deformable version of R-FCN. Note that the performance gain of deformable ConvNets is complementary to these bells and whistles.
Table 5: Object detection results of deformable ConvNets v.s. plain ConvNets on COCO test-dev set. M denotes multi-scale testing, and B denotes iterative bounding box average in the table.
4.3. COCO的目標(biāo)檢測
在表5中韭邓,我們在COCO test-dev數(shù)據(jù)集上對用于目標(biāo)檢測的可變形ConvNets和普通ConvNets進(jìn)行了廣泛的比較。我們首先使用ResNet-101模型進(jìn)行實(shí)驗(yàn)溶弟。class-aware RPN女淑,F(xiàn)aster CNN和R-FCN的可變形版本分別獲得了,
和
的mAP@[0.5:0.95]分?jǐn)?shù)辜御,分別比它們對應(yīng)的普通ConvNets相對高了
鸭你,
和
。通過在Faster R-CNN和R-FCN中用Aligned-Inception-ResNet取代ResNet-101擒权,由于更強(qiáng)大的特征表示袱巨,它們的普通ConvNet基線都得到了提高。而可變形ConvNets帶來的有效性能收益也是成立的碳抄。通過在多個(gè)圖像尺度上(圖像較短邊在[480,576,688,864,1200,1400]內(nèi))的進(jìn)一步測試愉老,并執(zhí)行迭代邊界框平均[14],對于R-FCN的可變形版本剖效,mAP@[0.5:0.95]分?jǐn)?shù)增加到了37.5%嫉入。請注意,可變形ConvNets的性能增益是對這些附加功能的補(bǔ)充璧尸。
表5:可變形ConvNets和普通ConvNets在COCO test-dev數(shù)據(jù)集上的目標(biāo)檢測結(jié)果咒林。在表中M表示多尺度測試,B表示迭代邊界框平均值爷光。
5. Conclusion
This paper presents deformable ConvNets, which is a simple, efficient, deep, and end-to-end solution to model dense spatial transformations. For the first time, we show that it is feasible and effective to learn dense spatial transformation in CNNs for sophisticated vision tasks, such as object detection and semantic segmentation.
5. 結(jié)論
本文提出了可變形ConvNets垫竞,它是一個(gè)簡單,高效蛀序,深度欢瞪,端到端的建模密集空間變換的解決方案。我們首次證明了在CNN中學(xué)習(xí)高級視覺任務(wù)(如目標(biāo)檢測和語義分割)中的密集空間變換是可行和有效的哼拔。
Acknowledgements
The Aligned-Inception-ResNet model was trained and investigated by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in unpublished work.
致謝
Aligned-Inception-ResNet模型由Kaiming He引有,Xiangyu Zhang,Shaoqing Ren和Jian Sun在未發(fā)表的工作中進(jìn)行了研究和訓(xùn)練倦逐。
References
[1] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In ICML, 2010. 1
[2] B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In NIPS, 2016. 6
[3] J. Bruna and S. Mallat. Invariant scattering convolution networks. TPAMI, 2013. 6
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015. 4, 7
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016. 4, 6, 7
[6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 7
[7] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016. 1, 2, 3, 4, 5, 6
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 4, 10
[9] S. Dieleman, J. D. Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016. 6
[10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010. 7
[11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010. 2, 6
[12] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. TPAMI, 1991. 6
[13] R. Gens and P. M. Domingos. Deep symmetry networks. In NIPS, 2014. 6
[14] S. Gidaris and N. Komodakis. Object detection via a multiregion & semantic segmentation-aware cnn model. In ICCV, 2015. 9
[15] R. Girshick. Fast R-CNN. In ICCV, 2015. 1, 2, 3, 6, 7
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 3, 6
[17] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. arXiv preprint arXiv:1409.5403, 2014. 6
[18] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011. 7
[19] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014. 7
[20] K. He, X. Zhang, S. Ren, and J. Sun. Aligned-inceptionresnet model, unpublished work. 4, 10
[21] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 6
[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 4, 10
[23] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. Wavelets: Time-Frequency Methods and Phase Space, page 289297, 1989. 6
[24] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016. 4
[25] J.-H. Jacobsen, J. van Gemert, Z. Lou, and A. W.M.Smeulders. Structured receptive fields in cnns. In CVPR, 2016. 6
[26] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015. 2, 5
[27] Y. Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017. 5
[28] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, 2012. 6
[29] A. Kanazawa, A. Sharma, and D. Jacobs. Locally scale-invariant convolutional neural networks. In NIPS, 2014. 6
[30] J. J. Koenderink and A. J. van Doom. Representation of local geometry in the visual system. Biological Cybernetics, 55(6):367–375, Mar. 1987. 6
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 1
[32] D. Laptev and J. M. Buhmann. Transformation-invariantcon-volutional jungles. In CVPR, 2015. 6
[33] D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys. Ti-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. arXiv preprint arXiv:1604.06318, 2016. 6
[34] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006. 6
[35] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 1995. 1
[36] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015. 6
[37] C.-H. Lin and S. Lucey. Inverse compositional spatial transformer networks. arXiv preprint arXiv:1612.03897, 2016. 5
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 4, 7
[39] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 7
[40] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. Ssd: Single shot multibox detector. In ECCV, 2016. 1, 4
[41] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 1, 6, 7
[42] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999. 1, 6
[43] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep convolutional neural networks. arXiv preprint arXiv:1701.04128, 2017. 6
[44] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, and X. Tang. Deepid-net: Deformable deep convolutional neural networks for object detection. In CVPR, 2015. 6
[45] P. Perona. Deformable kernels for early vision. TPAMI, 1995. 6
[46] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. 1
[47] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 1, 3, 4, 6, 7
[48] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 2016. 7
[49] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: an efficient alternative to sift or surf. In ICCV, 2011. 6
[50] K. Sohn and H. Lee. Learning invariant representations with local transformations. In ICML, 2012. 6
[51] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016. 4, 10
[52] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov. Scalable, high-quality object detection. arXiv:1412.1441v2, 2014. 1
[53] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. arXiv preprint arXiv:1612.04642, 2016. 6
[54] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. 6
[55] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In CVPR, 2017. 6