Early 2018
Paper: https://arxiv.org/pdf/1611.07709.pdf
Key Points
FCIS is the first end-to-end trainable framework for instance segmentation, another important advance in the segmentation field after FCN.
Instance segmentation differs from semantic segmentation. The number of objects to segment in an image is not fixed, and if each object gets its own label, the number of classes for the image is not fixed either, so the number of output channels cannot stay constant and the end-to-end training framework of FCN cannot be applied directly. A natural idea, then, is to first obtain a detection box for each object and extract the segmentation result within each box, which avoids the variable-class-count problem. For example, in the Faster R-CNN framework, one can add a segmentation branch for each ROI after the ROIs are extracted. This works, but leaves a latent problem: unstable labels. Imagine two people, A and B, standing so close together that each person's detection box inevitably contains some of the other. When we attend to A, the part of B inside the box is labeled background; when we attend to B, that same part is labeled foreground. To solve this, the paper adopts instance-sensitive score maps (first proposed in Instance-sensitive Fully Convolutional Networks), which implement end-to-end training for instance segmentation simply and effectively.
Concretely: the candidate box of an object is divided into an N×N grid, and the features for each grid cell come from a different channel of the feature maps.
Taking the figure above as an example, the segmentation output can be seen as split into 9 channels that separately learn the object's top-left, top, top-right, ..., bottom-right boundaries. This change breaks the object from a single whole into 9 parts, so that on any one feature map the labels of two adjacent objects are no longer connected (feature map 1 encodes the top-left boundary, and the two people's top-left boundaries do not touch). On every feature map, therefore, the two people remain distinguishable.
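To make the assembling step concrete, here is a minimal numpy sketch of it (the function name, the row-major cell order, and full-resolution score maps are my assumptions, not the paper's code):

    import numpy as np

    def assemble(score_maps, box, k=3):
        """Assemble a per-pixel score map for one ROI.

        score_maps: (k*k, H, W) array; map i holds scores for the i-th
                    of the k x k relative positions (row-major order).
        box: (x0, y0, x1, y1) ROI in pixel coordinates.
        Returns an (h, w) score map, copy-pasted cell by cell.
        """
        x0, y0, x1, y1 = box
        h, w = y1 - y0, x1 - x0
        out = np.empty((h, w), dtype=score_maps.dtype)
        # Split the ROI into k x k cells; cell (i, j) reads from map i*k + j.
        ys = np.linspace(0, h, k + 1).astype(int)
        xs = np.linspace(0, w, k + 1).astype(int)
        for i in range(k):
            for j in range(k):
                sl = np.s_[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                out[sl] = score_maps[i * k + j,
                                     y0 + ys[i]:y0 + ys[i + 1],
                                     x0 + xs[j]:x0 + xs[j + 1]]
        return out

Because cell (i, j) of every ROI always reads from map i*k + j, the same image pixel can receive different scores in different ROIs, which is exactly the translation-variant behavior a plain FCN lacks.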
As an analogy, suppose we have only a single person class: if two people stand shoulder to shoulder, they cannot be told apart. Now split the class into three, left hand, right hand, and torso, each represented by its own feature map. On each of those maps, the two people are distinguishable. To judge whether a candidate box contains a person, we simply go to the corresponding regions of the left-hand, right-hand, and torso maps, piece them together, and check whether they form a complete person.
Borrowing this idea, the paper proposes an end-to-end training framework for instance segmentation, shown in the figure above: a region proposal network supplies the ROIs, and the method above is applied to each ROI to produce its segmentation result.
Problem 1: because of the input requirements of fully-connected layers, every ROI must be warped to the same size, which costs a great deal of detail for objects at different scales (especially large ones). A fully-connected layer is essentially a matrix multiplication, so it needs fixed-size inputs to multiply against its fixed-size weight matrix. Every target region (ROI) therefore has to be enlarged or shrunk to a common size (this ROI pooling technique was introduced by Fast R-CNN, published at ICCV 2015, and incidentally also work from a Microsoft team). For separating foreground from background, a large ROI is shrunk before segmentation, and the resulting mask is then enlarged back to the original scale as the foreground. Operating on the ROI this way easily loses the object's fine details (a car's wheels, say, vanish during downscaling and do not come back when the mask is upscaled).
For problem 1, the paper puts it as follows:
First, the ROI pooling step loses spatial details due to feature warping and resizing, which however, is necessary to obtain a fixed-size representation for fc layers. Such distortion and fixed-size representation degrades the segmentation accuracy, especially for large objects.
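A minimal sketch of such fixed-size ROI max pooling (simplified from the Fast R-CNN operator; the helper name and the 14 × 14 default are illustrative):

    import numpy as np

    def roi_max_pool(feat, box, pool_size=14):
        """feat: (H, W) feature map; box: (x0, y0, x1, y1) in feature coords.
        Every ROI, large or small, is binned down to the same
        pool_size x pool_size grid -- the source of the detail loss."""
        x0, y0, x1, y1 = box
        ys = np.linspace(y0, y1, pool_size + 1).astype(int)
        xs = np.linspace(x0, x1, pool_size + 1).astype(int)
        out = np.zeros((pool_size, pool_size), feat.dtype)
        for i in range(pool_size):
            for j in range(pool_size):
                cell = feat[ys[i]:max(ys[i] + 1, ys[i + 1]),
                            xs[j]:max(xs[j] + 1, xs[j + 1])]
                out[i, j] = cell.max()
        return out

A 224 × 224 ROI keeps only one value out of every ~16 × 16 patch here, which is the detail loss the paper refers to.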
Problem 2: the fully-connected layers carry a huge number of parameters, and such a top-heavy architecture is prone to overfitting. As noted in my earlier post on parsing caffemodel files, if you extract the trainable parameters of a simple LeNet model, the last two fully-connected layers turn out to account for over 90% of the network's parameters. This also raises the cost of training and testing.
For problem 2, the paper puts it as follows:
Second, the fc layers over-parametrize the task, without using regularization of local weight sharing.
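The 90% figure is easy to check with back-of-the-envelope arithmetic, using the layer sizes of the standard Caffe LeNet example (20 and 50 conv filters of 5 × 5, then 800→500 and 500→10 fc layers):

    # Parameter count for the standard Caffe LeNet (weights + biases).
    conv1 = 20 * (1 * 5 * 5) + 20        #     520
    conv2 = 50 * (20 * 5 * 5) + 50       #  25,050
    ip1 = (50 * 4 * 4) * 500 + 500       # 400,500
    ip2 = 500 * 10 + 10                  #   5,010
    total = conv1 + conv2 + ip1 + ip2    # 431,080
    print((ip1 + ip2) / total)           # ~0.94: the two fc layers hold ~94%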
Problem 3: after the ROIs are proposed, the segmentation sub-task and the classification sub-task share no parameters. As the figure above shows, once the target regions are proposed, segmentation and classification each train their own fully-connected layers, which makes the architecture remarkably inefficient.
For problem 3, the paper puts it as follows:
Last, the per-ROI network computation in the last step is not shared among ROIs. As observed empirically, a considerably complex sub-network in the last step is necessary to obtain good accuracy. It is therefore slow for a large number of ROIs (typically hundreds or thousands of region proposals).
With these three problems in mind, let's look at how FCIS solves them.
For problem 1, ROI pooling is removed and replaced by assembling over the ROI region, which in essence is copy-and-paste.
For problem 2, the fully-connected layers are removed and replaced by a softmax classifier.
Finally, for problem 3, segmentation and classification use the same score maps.
This article draws on several online write-ups and is for study purposes only; please get in touch about any infringement.
Paper Excerpts
Abstract
We present the first fully convolutional end-to-end solution for instance-aware semantic segmentation task. It inherits all the merits of FCNs for semantic segmentation [29] and instance mask proposal [5]. It detects and segments the object instances jointly and simultaneously. By the introduction of position-sensitive inside/outside score maps, the underlying convolutional representation is fully shared between the two sub-tasks, as well as between all regions of interest. The proposed network is highly integrated and achieves state-of-the-art performance in both accuracy and efficiency. It wins the COCO 2016 segmentation competition by a large margin. Code would be released at https://github.com/daijifeng001/TA-FCN.
1. Introduction
Fully convolutional networks (FCNs) [29] have recently dominated the field of semantic image segmentation. An FCN takes an input image of arbitrary size, applies a series of convolutional layers, and produces per-pixel likelihood score maps for all semantic categories, as illustrated in Figure 1(a). Thanks to the simplicity, efficiency, and the local weight sharing property of convolution, FCNs provide an accurate, fast, and end-to-end solution for semantic segmentation.
However, conventional FCNs do not work for the instance-aware semantic segmentation task, which requires the detection and segmentation of individual object instances. The limitation is inherent. Because convolution is translation invariant, the same image pixel receives the same responses (thus classification scores) irrespective to its relative position in the context. However, instance-aware semantic segmentation needs to operate on region level, and the same pixel can have different semantics in different regions. This behavior cannot be modeled by a single FCN on the whole image. The problem is exemplified in Figure 2.
Certain translation-variant property is required to solve the problem. In a prevalent family of instance-aware semantic segmentation approaches [7, 16, 8], it is achieved by adopting different types of sub-networks in three stages: 1) an FCN is applied on the whole image to generate intermediate and shared feature maps; 2) from the shared feature maps, a pooling layer warps each region of interest (ROI) into fixed-size per-ROI feature maps [17, 12]; 3) one or more fully-connected (fc) layer(s) in the last network convert the per-ROI feature maps to per-ROI masks. Note that the translation-variant property is introduced in the fc layer(s) in the last step.
Such methods have several drawbacks. First, the ROI pooling step loses spatial details due to feature warping and resizing, which however, is necessary to obtain a fixed-size representation (e.g., 14 × 14 in [8]) for fc layers. Such distortion and fixed-size representation degrades the segmentation accuracy, especially for large objects. Second, the fc layers over-parametrize the task, without using regularization of local weight sharing. For example, the last fc layer has high dimensional 784-way output to estimate a 28 × 28 mask. Last, the per-ROI network computation in the last step is not shared among ROIs. As observed empirically, a considerably complex sub-network in the last step is necessary to obtain good accuracy [36, 9]. It is therefore slow for a large number of ROIs (typically hundreds or thousands of region proposals). For example, in the MNC method [8], which won the 1st place in COCO segmentation challenge 2015 [25], 10 layers in the ResNet-101 model [18] are kept in the per-ROI sub-network. The approach takes 1.4 seconds per image, where more than 80% of the time is spent on the last per-ROI step. These drawbacks motivate us to ask the question that, can we exploit the merits of FCNs for end-to-end instance-aware semantic segmentation?
Recently, a fully convolutional approach has been proposed for instance mask proposal generation [5]. It extends the translation invariant score maps in conventional FCNs to position-sensitive score maps, which are somewhat translation-variant. This is illustrated in Figure 1(b). The approach is only used for mask proposal generation and presents several drawbacks. It is blind to semantic categories and requires a downstream network for detection. The object segmentation and detection sub-tasks are separated and the solution is not end-to-end. It operates on square, fixed-size sliding windows (224 × 224 pixels) and adopts a time-consuming image pyramid scanning to find instances at different scales.
In this work, we propose the first end-to-end fully convolutional approach for instance-aware semantic segmentation. Dubbed FCIS, it extends the approach in [5]. The underlying convolutional representation and the score maps are fully shared for the object segmentation and detection sub-tasks, via a novel joint formulation with no extra parameters. The network structure is highly integrated and efficient. The per-ROI computation is simple, fast, and does not involve any warping or resizing operations. The approach is briefly illustrated in Figure 1(c). It operates on box proposals instead of sliding windows, enjoying the recent advances in object detection [34].
[5] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
Extensive experiments verify that the proposed approach is state-of-the-art in both accuracy and efficiency. It achieves significantly higher accuracy than the previous challenge winning method MNC [8] on the large-scale COCO dataset [25]. It wins the 1st place in COCO 2016 segmentation competition, outperforming the 2nd place entry by 12% in accuracy relatively. It is fast. The inference in COCO competition takes 0.24 seconds per image using ResNet-101 model [18] (Nvidia K40), which is 6× faster than MNC [8]. Code would be released at https://github.com/daijifeng001/TA-FCN.
2. Our Approach
2.1. Position-sensitive Score Map Parameterization
In FCNs [29], a classifier is trained to predict each pixel’s likelihood score of “the pixel belongs to some object category”. It is translation invariant and unaware of individual object instances. For example, the same pixel can be foreground on one object but background on another (adjacent) object. A single score map per-category is insufficient to distinguish these two cases.
To introduce translation-variant property, a fully convolutional solution is firstly proposed in [5] for instance mask proposal. It uses k^2 position-sensitive score maps that correspond to k × k evenly partitioned cells of objects. This is illustrated in Figure 1(b) (k = 3). Each score map has the same spatial extent of the original image (in a lower resolution, e.g., 16× smaller). Each score represents the likelihood of “the pixel belongs to some object instance at a relative position”. For example, the first map is for “at top left position” in Figure 1(b).
During training and inference, for a fixed-size square sliding window (224×224 pixels), its pixel-wise foreground likelihood map is produced by assembling (copy-paste) its k×k cells from the corresponding score maps. In this way, a pixel can have different scores in different instances as long as the pixel is at different relative positions in the instances.
How does FCIS run classification and segmentation in parallel? Through two kinds of score maps: inside score maps and outside score maps. The inside score map expresses how strongly a pixel inside an ROI scores as foreground. A pixel that lies inside an ROI and on the object (foreground) should score high on the inside map and low on the outside map; conversely, a pixel that lies inside an ROI but on the background should score low on the inside map and high on the outside map. For segmentation, one classifier over the two kinds of score maps separates foreground from background; for classification, combining the two kinds of score maps solves the categorization problem.
This has a further benefit: combining the two kinds of score maps makes it possible to weed out poorly detected ROIs.
首先鳖悠,對于每個ROI區(qū)域榜掌,將inside score maps和outside score maps中的小塊特征圖復(fù)制出來,拼接成為了ROI inside map和ROI outside map乘综。針對圖像分割任務(wù)憎账,直接對上述兩類map通過softmax分類器分類,得到ROI中的目標(biāo)前景區(qū)域(Mask)卡辰。針對圖像分類任務(wù)胞皱,將兩類map中的score逐像素取最大值,得到一個map九妈,然后再通過一個softmax分類器反砌,得到該ROI區(qū)域?qū)?yīng)的圖像類別。在完成圖像分類的同時萌朱,還順便驗證了ROI區(qū)域檢測是否合理于颖,具體做法是求取最大值得到的map的所有值的平均數(shù),如果該平均數(shù)大于某個閾值嚷兔,則該ROI檢測是合理的。 ? ?針對輸入圖像上的每一個像素點做入,有三種情況:第一種情況是inside score高冒晰,outside score低;則該像素點位于ROI中的目標(biāo)部分竟块。第二種情況是inside score低壶运,outside score高,則該像素點位于ROI中的背景部分浪秘。第三種情況是inside score和outside score都很低蒋情,那么該像素點不在任何一個ROI里面埠况。因此,我們在上一段中描述的棵癣,針對ROI inside map和ROI outside map中逐像素點取最大值得到的圖像:如果求平均后分?jǐn)?shù)還是很低辕翰,那么,我們可以斷定這個檢測區(qū)域是不合理的狈谊。如果求平均后分?jǐn)?shù)超過了某個閾值喜命,我們就通過softmax分類器求ROI的圖像類別,再通過softmax分類器求前景與背景河劝。
As shown in [5], the approach is state-of-the-art for the object mask proposal task. However, it is also limited by the task. Only a fixed-size square sliding window is used. The network is applied on multi-scale images to find object instances of different sizes. The approach is blind to the object categories. Only a separate “objectness” classification sub-network is used to categorize the window as object or background. For the instance-aware semantic segmentation task, a separate downstream network is used to further classify the mask proposals into object categories [5].
2.2. Joint Mask Prediction and Classification
For the instance-aware semantic segmentation task, not only [5], but also many other state-of-the-art approaches, such as SDS [15], Hypercolumn [16], CFM [7], MNC [8], and MultiPathNet [42], share a similar structure: two subnetworks are used for object segmentation and detection sub-tasks, separately and sequentially.
Apparently, the design choices in such a setting, e.g., the two networks’ structure, parameters and execution order, are kind of arbitrary. They can be easily made for convenience other than for fundamental considerations. We conjecture that the separated sub-network design may not fully exploit the tight correlation between the two tasks.
We enhance the “position-sensitive score map” idea to perform the object segmentation and detection sub-tasks jointly and simultaneously. The same set of score maps are shared for the two sub-tasks, as well as the underlying convolutional representation. Our approach brings no extra parameters and eliminates non essential design choices. We believe it can better exploit the strong correlation between the two sub-tasks.
Our approach is illustrated in Figure 1(c) and Figure 2. Given a region-of-interest (ROI), its pixel-wise score maps are produced by the assembling operation within the ROI. For each pixel in a ROI, there are two tasks: 1) detection: whether it belongs to an object bounding box at a relative position (detection+) or not (detection-); 2) segmentation: whether it is inside an object instance’s boundary (segmentation+) or not (segmentation-). A simple solution is to train two classifiers, separately. That’s exactly our baseline FCIS (separate score maps) in Table 1. In this case, the two classifiers are two 1 × 1 conv layers, each using just one task’s supervision.
Our joint formulation fuses the two answers into two scores: inside and outside. There are three cases: 1) high inside score and low outside score: detection+, segmentation+; 2) low inside score and high outside score: detection+, segmentation-; 3) both scores are low: detection-, segmentation-. The two scores answer the two questions jointly via softmax and max operations. For detection, we use max to differentiate cases 1)-2) (detection+) from case 3) (detection-). The detection score of the whole ROI is then obtained via average pooling over all pixels’ likelihoods (followed by a softmax operator across all the categories). For segmentation, we use softmax to differentiate cases 1) (segmentation+) from 2) (segmentation-), at each pixel. The foreground mask (in probabilities) of the ROI is the union of the per-pixel segmentation scores (for each category). Similarly, the two sets of scores are from two 1 × 1 conv layers. The inside/outside classifiers are trained jointly as they receive the back-propagated gradients from both segmentation and detection losses.
The approach has many desirable properties. All the per-ROI components (as in Figure 1(c)) do not have free parameters. The score maps are produced by a single FCN, without involving any feature warping, resizing or fc layers. All the features and score maps respect the aspect ratio of the original image. The local weight sharing property of FCNs is preserved and serves as a regularization mechanism. All per-ROI computation is simple (k^2 cell division, score map copying, softmax, max, average pooling) and fast, giving rise to a negligible per-ROI computation cost.
2.3. An End-to-End Solution
Figure 3 shows the architecture of our end-to-end solution. While any convolutional network architecture can be used [39, 40], in this work we adopt the ResNet model [18]. The last fully-connected layer for 1000-way classification is discarded. Only the previous convolutional layers are retained. The resulting feature maps have 2048 channels. On top of it, a 1 × 1 convolutional layer is added to reduce the dimension to 1024.
The input image passes through convolutional layers that extract preliminary features. From these features, one path goes through the RPN (Region Proposal Network) to propose ROI regions, while the other passes through further convolutional layers to produce 2 × (C+1) × k × k score maps: 2 stands for the inside and outside sets; C+1 for the C object categories plus one (unknown) background class; and k × k score maps per class (k = 3 in the example figure above). Each ROI, given in image coordinates, is projected onto the score maps at the network's feature stride of 16 (each side divided by 16). After assembling (which is really copy-and-paste), the k × k position-sensitive score maps of each ROI are merged into one per class and set, leaving 2 × (C+1) maps. Two branches then run in parallel. Branch one: take the pixel-wise max of each class's ROI inside and ROI outside maps to get C+1 maps, compute the mean of each, and compare it with the threshold; if it is above the threshold, the ROI is judged plausible and the scores are fed into a softmax classifier to obtain the image category, otherwise nothing further is done. Branch two: run C+1 softmax segmentations, obtaining a foreground/background split for every class, and then select the split of the class chosen by branch one.
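A shape walkthrough of this pipeline may help; a sketch with illustrative sizes (C = 80 as in COCO, k = 3 to match the figure; the axis layout is my assumption):

    import numpy as np

    C, k = 80, 3
    H, W = 38, 50                          # score map size at feature stride 16
    n_ch = 2 * (C + 1) * k * k             # 2 x (C+1) x k x k = 1458 channels
    score_maps = np.zeros((n_ch, H, W), dtype=np.float32)

    # View as inside/outside x category x cell position x H x W.
    score_maps = score_maps.reshape(2, C + 1, k * k, H, W)

    # An ROI in image pixels is projected onto the maps by the stride of 16.
    roi_img = (64, 32, 256, 224)               # (x0, y0, x1, y1)
    roi_fm = tuple(v // 16 for v in roi_img)   # (4, 2, 16, 14)

    # Assembling (see the earlier sketch) collapses the k*k axis cell by
    # cell, leaving 2 x (C+1) maps of the ROI's size for the two branches.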
In the original ResNet, the effective feature stride (the decrease in feature map resolution) at the top of the network is 32. This is too coarse for instance-aware semantic segmentation. To reduce the feature stride and maintain the field of view, the “hole algorithm” [3, 29] (algorithme à trous [30]) is applied. The stride in the first block of conv5 convolutional layers is decreased from 2 to 1. The effective feature stride is thus reduced to 16. To maintain the field of view, the “hole algorithm” is applied on all the convolutional layers of conv5 by setting the dilation as 2.
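A minimal numpy sketch of the dilation trick (illustrative only, not the paper's implementation):

    import numpy as np

    def dilated_conv2d(x, w, dilation=2):
        """Valid-mode 2-D convolution with a dilated kernel. A 3 x 3
        kernel with dilation 2 covers a 5 x 5 window, so halving the
        stride while dilating keeps the field of view unchanged."""
        kh, kw = w.shape
        eh = kh + (kh - 1) * (dilation - 1)   # effective kernel height
        ew = kw + (kw - 1) * (dilation - 1)   # effective kernel width
        H, W = x.shape
        out = np.zeros((H - eh + 1, W - ew + 1), x.dtype)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = (x[i:i + eh:dilation, j:j + ew:dilation] * w).sum()
        return out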
We use region proposal network (RPN) [34] to generate ROIs. For fair comparison with the MNC method [8], it is added on top of the conv4 layers in the same way. Note that RPN is also fully convolutional.
From the conv5 feature maps, 2k^2 × (C + 1) score maps are produced (C object categories, one background category, two sets of k^2 score maps per category, k = 7 by default in experiments) using a 1×1 convolutional layer. Over the score maps, each ROI is projected into a 16× smaller region. Its segmentation probability maps and classification scores over all the categories are computed as described in Section 2.2.
The two sets of k^2 score maps per category are precisely the inside and outside score maps.
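For COCO, where C = 80, the default k = 7 means the single 1×1 conv emits:

    C, k = 80, 7
    channels = 2 * k**2 * (C + 1)   # 7938 score maps from one 1x1 conv
    inside, outside = channels // 2, channels // 2   # 3969 maps in each set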
Following the modern object detection systems, bounding box (bbox) regression [13, 12] is used to refine the initial input ROIs. A sibling 1×1 convolutional layer with 4k^2 channels is added on the conv5 feature maps to estimate the bounding box shift in location and size. Below we discuss more details in inference and training.
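For the box refinement itself, the standard R-CNN box transform applies to the per-ROI shift estimate. A sketch under the assumption of an R-FCN-style readout (the 4k^2 maps assembled inside the ROI like the score maps above and averaged to four deltas; this reading is mine, not spelled out in the excerpt):

    import numpy as np

    def apply_box_deltas(box, deltas):
        """Standard R-CNN parameterization: deltas = (dx, dy, dw, dh),
        where dx, dy shift the center and dw, dh scale width/height."""
        x0, y0, x1, y1 = box
        w, h = x1 - x0, y1 - y0
        cx, cy = x0 + 0.5 * w, y0 + 0.5 * h
        cx, cy = cx + deltas[0] * w, cy + deltas[1] * h
        w, h = w * np.exp(deltas[2]), h * np.exp(deltas[3])
        return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

    # e.g. deltas = assembled_shift_maps.mean(axis=(1, 2)) per ROI, then:
    # refined = apply_box_deltas(roi, deltas)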
Inference For an input image, 300 ROIs with highest scores are generated from RPN. They pass through the bbox regression branch and give rise to another 300 ROIs. For each ROI, we get its classification scores and foreground mask (in probability) for all categories. Figure 2 shows an example. Non-maximum suppression (NMS) with an intersection-over-union (IoU) threshold 0.3 is used to filter out highly overlapping ROIs. The remaining ROIs are classified as the categories with highest classification scores. Their foreground masks are obtained by mask voting [8] as follows. For an ROI under consideration, we find all the ROIs (from the 600) with IoU scores higher than 0.5. Their foreground masks of the category are averaged on a per-pixel basis, weighted by their classification scores. The averaged mask is binarized as the output.
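A condensed sketch of this post-processing (illustrative helpers; boxes are (x0, y0, x1, y1) tuples, `scores` and `masks` are numpy arrays with the masks already warped onto a common image grid):

    import numpy as np

    def iou(a, b):
        """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / float(area(a) + area(b) - inter)

    def nms(boxes, scores, thresh=0.3):
        """Keep highest-scoring boxes, dropping any with IoU >= thresh."""
        keep = []
        for i in np.argsort(scores)[::-1]:
            if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
                keep.append(i)
        return keep

    def mask_vote(idx, boxes, scores, masks, thresh=0.5):
        """Average the masks of all ROIs with IoU > thresh against ROI idx,
        weighted by classification score, then binarize."""
        voters = [j for j in range(len(boxes))
                  if iou(boxes[idx], boxes[j]) > thresh]
        w = scores[voters][:, None, None]
        return (masks[voters] * w).sum(0) / w.sum() > 0.5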