Paper: FCOS: Fully Convolutional One-Stage Object Detection
Highlights:
- Builds the detector as a fully convolutional, per-pixel prediction network in the spirit of FCN, so that detection can be unified with other FCN-solvable vision tasks such as semantic segmentation
- Anchor-free and proposal-free, which removes the IoU computations between anchors/proposals and ground truth during training and, more importantly, all anchor-related hyper-parameters
- A simple backbone + neck + head detection framework
Drawbacks of anchor-based detectors:
- Detection performance is sensitive to the sizes, aspect ratios and number of anchors; in RetinaNet, varying these hyper-parameters affects AP by up to 4% on the COCO benchmark, so anchor-based detectors require careful tuning of these hyper-parameters
- Because the anchor scales and aspect ratios are fixed in advance, the detector has trouble handling object candidates with large shape variations
- To achieve a high recall rate, anchor-based detectors place anchors densely over the feature maps, which aggravates the imbalance between positive and negative samples during training and also adds considerable computation (the loss is usually based on IoU with the ground truth, so the more proposals the anchors generate, the more IoU computations are needed)
Method
Per-pixel regression prediction
Notation:
Each ground-truth bounding box is represented as $B_i = (x_0^{(i)},\, y_0^{(i)},\, x_1^{(i)},\, y_1^{(i)},\, c^{(i)})$,
where $(x_0^{(i)}, y_0^{(i)})$ is the top-left corner of the box,
$(x_1^{(i)}, y_1^{(i)})$ is the bottom-right corner,
and $c^{(i)}$ is the class of the box.
Let $F_i \in \mathbb{R}^{H \times W \times C}$ be the feature map at layer $i$ of the backbone CNN, with total stride $s$.
Each location $(x, y)$ on the feature map can be mapped back onto the input image in a one-to-one way as
$$\left(\left\lfloor \tfrac{s}{2} \right\rfloor + xs,\ \left\lfloor \tfrac{s}{2} \right\rfloor + ys\right),$$
which lies near the center of the receptive field of $(x, y)$.
Unlike anchor-based detectors, FCOS treats every location on the feature map as a training sample and regresses at it directly (i.e. pixel-level regression).
With the mapping above, if $(x, y)$ falls into any ground-truth bounding box, it is a positive training sample and its class label $c^*$ is that of the box; otherwise it is a negative sample with $c^* = 0$ (background).
Besides the class label, FCOS also regresses at each positive location a 4-d vector $\boldsymbol{t}^* = (l^*, t^*, r^*, b^*)$, the distances from the location to the four sides of the box.
If a location falls into box $B_i$, its regression targets are
$$l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y.$$
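As a minimal sketch of this target assignment (illustrative NumPy code, not the paper's or mmdetection's implementation; it also ignores the paper's rule of resolving locations that fall into several boxes by choosing the box with the smallest area):

```python
import numpy as np

def fcos_targets(h, w, stride, gt_boxes, gt_labels):
    """Per-pixel targets for one feature level.

    gt_boxes: (M, 4) array of (x0, y0, x1, y1); gt_labels: (M,) ints > 0.
    Returns a (h, w) label map (0 = background) and a (h, w, 4) map of
    (l*, t*, r*, b*) regression targets.
    """
    # Map every feature-map location back onto the input image:
    # (floor(s/2) + x*s, floor(s/2) + y*s).
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    cx = stride // 2 + xs * stride
    cy = stride // 2 + ys * stride

    labels = np.zeros((h, w), dtype=np.int64)
    ltrb = np.zeros((h, w, 4), dtype=np.float32)
    for (x0, y0, x1, y1), cls in zip(gt_boxes, gt_labels):
        inside = (cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)
        labels[inside] = cls
        # Distances from each positive location to the four box sides.
        ltrb[inside] = np.stack([cx - x0, cy - y0, x1 - cx, y1 - cy],
                                axis=-1)[inside]
    return labels, ltrb
```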
The authors attribute part of FCOS's advantage over anchor-based detectors to the fact that it uses as many foreground samples as possible to train the regressor, instead of keeping only the anchors whose IoU with a ground-truth box is high enough.
Network outputs
Matching the per-pixel training targets, the last layer of the network predicts an 80-d vector $\boldsymbol{p}$ of classification scores (one per COCO class) and a 4-d vector $\boldsymbol{t} = (l, t, r, b)$ of box regression values.
Instead of training a single multi-class classifier, FCOS trains $C$ binary classifiers.
Moreover, since the regression targets are always positive, $\exp(x)$ is applied on top of the regression branch to map any real number to $(0, \infty)$, a scale mapping in the same spirit as the one used in YOLO.
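For illustration, a hypothetical helper (not from the paper or mmdetection) showing how the raw outputs at N locations could be turned into class scores and boxes, with sigmoid for the C binary classifiers and exp for the positive distances:

```python
import torch

def decode_predictions(points, cls_logits, reg_out):
    """points: (N, 2) image-space (x, y); cls_logits: (N, C); reg_out: (N, 4)."""
    scores = cls_logits.sigmoid()   # C independent binary classifiers
    ltrb = reg_out.exp()            # map raw outputs to (0, inf)
    x, y = points[:, 0], points[:, 1]
    l, t, r, b = ltrb.unbind(dim=1)
    # A location (x, y) with distances (l, t, r, b) corresponds to the box
    # (x - l, y - t, x + r, y + b).
    boxes = torch.stack([x - l, y - t, x + r, y + b], dim=1)
    return boxes, scores
```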
Loss function
The total loss is the sum of a classification loss and a box regression loss:
$$L\big(\{\boldsymbol{p}_{x,y}\}, \{\boldsymbol{t}_{x,y}\}\big) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\big(\boldsymbol{p}_{x,y}, c^*_{x,y}\big) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}}\, L_{reg}\big(\boldsymbol{t}_{x,y}, \boldsymbol{t}^*_{x,y}\big),$$
where $L_{cls}$ is the focal loss,
$L_{reg}$ is the IoU loss between the predicted box and the ground-truth box, applied only at positive locations via the indicator $\mathbb{1}_{\{c^*_{x,y} > 0\}}$,
$N_{pos}$ is the number of positive samples,
and $\lambda$ is a balancing weight between the classification and regression terms.
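A minimal PyTorch sketch of this loss, assuming the per-location targets have already been assigned as above; it uses torchvision's sigmoid_focal_loss and a UnitBox-style -log(IoU) regression loss, and leaves out the center-ness term introduced later (names are illustrative, not mmdetection's):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def iou_loss(pred, target, eps=1e-6):
    """-log(IoU) for boxes given as (l, t, r, b) distances from the same point."""
    pl, pt, pr, pb = pred.unbind(dim=1)
    tl, tt, tr, tb = target.unbind(dim=1)
    pred_area = (pl + pr) * (pt + pb)
    target_area = (tl + tr) * (tt + tb)
    inter = (torch.min(pl, tl) + torch.min(pr, tr)) * \
            (torch.min(pt, tt) + torch.min(pb, tb))
    iou = inter / (pred_area + target_area - inter + eps)
    return -torch.log(iou + eps)

def fcos_loss(cls_logits, reg_pred, labels, reg_targets, num_classes, lam=1.0):
    """cls_logits: (N, C); reg_pred/reg_targets: (N, 4); labels: (N,), 0 = background."""
    pos = labels > 0
    n_pos = pos.sum().clamp(min=1).float()
    # One-hot targets for the C binary classifiers (drop the background column).
    cls_targets = F.one_hot(labels, num_classes + 1)[:, 1:].float()
    loss_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction='sum') / n_pos
    loss_reg = lam * iou_loss(reg_pred[pos], reg_targets[pos]).sum() / n_pos
    return loss_cls + loss_reg
```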
Multi-level prediction for higher recall
FCOS uses the multi-level feature maps of an FPN to improve recall.
As shown in the architecture figure, five levels of feature maps $\{P_3, P_4, P_5, P_6, P_7\}$ are used, with strides 8, 16, 32, 64 and 128 respectively.
The range of the regression targets is then bounded per level, so that each level is responsible for objects of a certain size.
If $\max(l^*, t^*, r^*, b^*) > m_i$ or $\max(l^*, t^*, r^*, b^*) < m_{i-1}$, the location is treated as a negative sample on level $i$ and does not take part in the box regression.
$m_2, m_3, m_4, m_5, m_6, m_7$ are set to $0, 64, 128, 256, 512$ and $\infty$, respectively.
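A small sketch of this level-assignment rule (the helper name is mine; the thresholds mirror the regress_ranges used by the mmdetection head shown below):

```python
import torch

REGRESS_RANGES = ((-1, 64), (64, 128), (128, 256), (256, 512), (512, float('inf')))

def assign_level(ltrb_targets):
    """ltrb_targets: (N, 4) tensor of (l*, t*, r*, b*).

    Returns an (N,) tensor with the FPN level index in {0, ..., 4}, chosen so
    that m_{i-1} < max(l*, t*, r*, b*) <= m_i.
    """
    max_side = ltrb_targets.max(dim=1).values
    level = torch.full_like(max_side, -1, dtype=torch.long)
    for i, (lo, hi) in enumerate(REGRESS_RANGES):
        level[(max_side > lo) & (max_side <= hi)] = i
    return level
```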
Center-ness branch
The authors observe that the basic approach above produces many low-quality predicted boxes at locations far away from the center of an object, which hurts detection performance, so they add a new single-layer branch, center-ness, to suppress them.
Given the regression targets $(l^*, t^*, r^*, b^*)$ of a location, its center-ness is defined as
$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}.$$
For a location far from the object center, one of these min/max ratios becomes very small, so the center-ness score is low, which effectively down-weights such low-quality boxes.
At test time, the center-ness is multiplied with the classification score to obtain the final score used to rank the predicted boxes, so the low-quality boxes are very likely to be filtered out by the final NMS step.
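For illustration, a rough sketch of this test-time re-scoring before NMS (torchvision's batched_nms is used here for brevity; this is an assumption, not necessarily what the original implementation calls):

```python
import torch
from torchvision.ops import batched_nms

def postprocess(boxes, cls_scores, centerness, score_thr=0.05, iou_thr=0.6):
    """boxes: (N, 4); cls_scores: (N, C) sigmoid scores; centerness: (N,) in (0, 1)."""
    # Final ranking score = classification score * center-ness.
    scores, labels = (cls_scores * centerness[:, None]).max(dim=1)
    keep = scores > score_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = batched_nms(boxes, scores, labels, iou_thr)
    return boxes[keep], scores[keep], labels[keep]
```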
Code reading
The code below is the FCOS head from the mmdetection framework.
The backbone and FPN follow the common setup (ResNet + FPN multi-scale prediction), so let us take a look at fcos_head.py.
# /mmdet/models/anchor_heads/fcos_head.py
# (imports as in the original mmdet v1.x file, trimmed to what this excerpt uses)
import torch
import torch.nn as nn

from ..builder import build_loss
from ..registry import HEADS
from ..utils import ConvModule, Scale

INF = 1e8


@HEADS.register_module
class FCOSHead(nn.Module):
    """
    Fully Convolutional One-Stage Object Detection head from [1]_.

    The FCOS head does not use anchor boxes. Instead bounding boxes are
    predicted at each pixel and a centerness measure is used to suppress
    low-quality predictions.

    References:
        .. [1] https://arxiv.org/abs/1904.01355

    Example:
        >>> self = FCOSHead(11, 7)
        >>> feats = [torch.rand(1, 7, s, s) for s in [4, 8, 16, 32, 64]]
        >>> cls_score, bbox_pred, centerness = self.forward(feats)
        >>> assert len(cls_score) == len(self.scales)
    """

    def __init__(self,
                 num_classes,
                 in_channels,
                 feat_channels=256,
                 stacked_convs=4,
                 # strides of the feature levels and the (l*, t*, r*, b*)
                 # regression range assigned to each level
                 strides=(4, 8, 16, 32, 64),
                 regress_ranges=((-1, 64), (64, 128), (128, 256), (256, 512),
                                 (512, INF)),
                 # three losses: focal loss for classification, IoU loss for
                 # box regression, binary cross-entropy for center-ness
                 loss_cls=dict(
                     type='FocalLoss',
                     use_sigmoid=True,
                     gamma=2.0,
                     alpha=0.25,
                     loss_weight=1.0),
                 loss_bbox=dict(type='IoULoss', loss_weight=1.0),
                 loss_centerness=dict(
                     type='CrossEntropyLoss',
                     use_sigmoid=True,
                     loss_weight=1.0),
                 conv_cfg=None,
                 norm_cfg=dict(type='GN', num_groups=32, requires_grad=True)):
        super(FCOSHead, self).__init__()
        self.num_classes = num_classes
        self.cls_out_channels = num_classes - 1
        self.in_channels = in_channels
        self.feat_channels = feat_channels
        self.stacked_convs = stacked_convs
        self.strides = strides
        self.regress_ranges = regress_ranges
        self.loss_cls = build_loss(loss_cls)
        self.loss_bbox = build_loss(loss_bbox)
        self.loss_centerness = build_loss(loss_centerness)
        self.conv_cfg = conv_cfg
        self.norm_cfg = norm_cfg
        self.fp16_enabled = False
        self._init_layers()

    def _init_layers(self):
        # each head has two branches (classification and box regression),
        # each a stack of `stacked_convs` (= 4) 3x3 conv layers
        self.cls_convs = nn.ModuleList()
        self.reg_convs = nn.ModuleList()
        for i in range(self.stacked_convs):
            chn = self.in_channels if i == 0 else self.feat_channels
            self.cls_convs.append(
                ConvModule(
                    chn,
                    self.feat_channels,
                    3,
                    stride=1,
                    padding=1,
                    conv_cfg=self.conv_cfg,
                    norm_cfg=self.norm_cfg,
                    bias=self.norm_cfg is None))
            self.reg_convs.append(
                ConvModule(
                    chn,
                    self.feat_channels,
                    3,
                    stride=1,
                    padding=1,
                    conv_cfg=self.conv_cfg,
                    norm_cfg=self.norm_cfg,
                    bias=self.norm_cfg is None))
        self.fcos_cls = nn.Conv2d(
            self.feat_channels, self.cls_out_channels, 3, padding=1)
        self.fcos_reg = nn.Conv2d(self.feat_channels, 4, 3, padding=1)
        self.fcos_centerness = nn.Conv2d(self.feat_channels, 1, 3, padding=1)
        self.scales = nn.ModuleList([Scale(1.0) for _ in self.strides])

    # forward pass for a single level: the classification, regression and
    # center-ness branches each produce their output maps
    def forward_single(self, x, scale):
        cls_feat = x
        reg_feat = x
        for cls_layer in self.cls_convs:
            cls_feat = cls_layer(cls_feat)
        cls_score = self.fcos_cls(cls_feat)
        centerness = self.fcos_centerness(cls_feat)
        for reg_layer in self.reg_convs:
            reg_feat = reg_layer(reg_feat)
        # scale the bbox_pred of different level
        # float to avoid overflow when enabling FP16
        bbox_pred = scale(self.fcos_reg(reg_feat)).float().exp()
        return cls_score, bbox_pred, centerness

    # definition of the center-ness target
    def centerness_target(self, pos_bbox_targets):
        # only calculate pos centerness targets, otherwise there may be nan
        left_right = pos_bbox_targets[:, [0, 2]]
        top_bottom = pos_bbox_targets[:, [1, 3]]
        centerness_targets = (
            left_right.min(dim=-1)[0] / left_right.max(dim=-1)[0]) * (
                top_bottom.min(dim=-1)[0] / top_bottom.max(dim=-1)[0])
        return torch.sqrt(centerness_targets)

    # (other methods of the original file, e.g. forward(), loss() and
    # get_bboxes(), are omitted in this excerpt)
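To connect the head back to the loss section above, here is a rough sketch of how its three outputs are consumed during training. This is not the head's actual loss() method (which also builds the per-level points and targets and handles batching); the flat_* arguments are assumed to be already concatenated over all levels and images:

```python
from mmdet.core import distance2bbox

def head_losses(head, flat_cls_scores, flat_bbox_preds, flat_centerness,
                flat_points, flat_labels, flat_bbox_targets):
    """Sketch: feed FCOSHead outputs into its three loss terms."""
    pos_inds = (flat_labels > 0).nonzero().squeeze(1)
    # Classification: focal loss over all locations, averaged by #positives
    # (+1 avoids dividing by zero when there are no positives).
    loss_cls = head.loss_cls(flat_cls_scores, flat_labels,
                             avg_factor=pos_inds.numel() + 1)
    # Regression: IoU loss on decoded boxes, weighted by the center-ness target.
    pos_points = flat_points[pos_inds]
    pos_bbox_targets = flat_bbox_targets[pos_inds]
    centerness_targets = head.centerness_target(pos_bbox_targets)
    loss_bbox = head.loss_bbox(
        distance2bbox(pos_points, flat_bbox_preds[pos_inds]),
        distance2bbox(pos_points, pos_bbox_targets),
        weight=centerness_targets,
        avg_factor=centerness_targets.sum())
    # Center-ness: binary cross-entropy against the sqrt-ratio target above.
    loss_centerness = head.loss_centerness(flat_centerness[pos_inds],
                                           centerness_targets)
    return loss_cls, loss_bbox, loss_centerness
```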
Summary
- Predicts boxes by per-pixel (center-point) regression in an FCN style
- Introduces the center-ness branch, which effectively suppresses low-quality boxes predicted far from the object center
- Moves towards a unified FCN-style framework for vision tasks
Reference
[1] Tian Z, Shen C, Chen H, et al. FCOS: Fully Convolutional One-Stage Object Detection[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 9627-9636.