Preface
An earlier post already introduced the SSD algorithm, and I think the principles were explained fairly clearly, but understanding an algorithm is never complete without going into the code. So this article builds on that SSD principle analysis and walks through the code. The post explaining the SSD principles is here: https://mp.weixin.qq.com/s/lXqobT45S1wz-evc7KO5DA. The SSD source code analyzed today comes from a very popular PyTorch implementation on GitHub with 3K+ stars: https://github.com/amdegroot/ssd.pytorch/
Network Structure
To map the code onto the SSD architecture more easily, we first show the SSD network structure, as in the figure below:
As you can see, the original SSD uses VGG-16 as its backbone. To see more clearly what SSD changes relative to VGG16, a Zhihu post drew a very clear diagram, which I borrow here; the original is at https://zhuanlan.zhihu.com/p/79854543. The comparison between the backbone (annotated with feature-map dimensions) and VGG16 is shown below:
Source Code Analysis
OK, let's now dissect SSD from the source code. Three things need to be understood: how the network is built, the anchors, and the loss function. Once those are clear, you essentially understand this codebase.
Building the Network
From the figure above we can clearly see that, with VGG16 as the backbone, the fully connected layers of VGG16 after conv5 are dropped and replaced by a 3x3x1024 convolution (conv6) and a 1x1x1024 convolution (conv7). The maxpooling layer in front of conv4-1 uses ceil_mode=True, which makes the output feature map 38x38. The maxpooling layer after conv5-3 uses kernel_size=3, stride=1, padding=1, so it performs no downsampling. Then, four extra convolution stages for multi-scale extraction are attached after fc7, which completes the SSD network. The code for the modified VGG16 is as follows, from ssd.py:
def vgg(cfg, i, batch_norm=False):
layers = []
in_channels = i
for v in cfg:
if v == 'M':
layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
elif v == 'C':
layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
else:
conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
if batch_norm:
layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
else:
layers += [conv2d, nn.ReLU(inplace=True)]
in_channels = v
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
layers += [pool5, conv6,
nn.ReLU(inplace=True), conv7, nn.ReLU(inplace=True)]
return layers
You can see this matches the figure above exactly. The conv7 obtained at the end of the code is the fc7 in the figure, and its feature map is 19x19x1024.
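As a quick sanity check on the modified backbone (a minimal sketch of my own; it assumes the repo root is on the path so that the vgg function and the '300' base config in ssd.py import cleanly), we can run a dummy 300x300 image through the layers and confirm that the conv4_3 output is 38x38 and the final conv7 output is 19x19:

import torch
from ssd import vgg  # the function shown above

# the '300' base config that also appears later in ssd.py
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
       512, 512, 512]
layers = vgg(cfg, 3)
x = torch.randn(1, 3, 300, 300)
for k, layer in enumerate(layers):
    x = layer(x)
    if k == 22:                               # the ReLU right after conv4_3
        print('conv4_3:', tuple(x.shape))     # (1, 512, 38, 38)
print('conv7  :', tuple(x.shape))             # (1, 1024, 19, 19)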
Now we can build the multi-scale extraction network that follows, i.e. the Extra Feature Layers in the structure diagram. We crop that part out of the diagram at the beginning to compare against the code. The implementation is as follows (also from ssd.py):
def add_extras(cfg, i, batch_norm=False):
# Extra layers added to VGG for feature scaling
layers = []
in_channels = i
    flag = False  # flag toggles kernel_size between 1 and 3
for k, v in enumerate(cfg):
if in_channels != 'S':
if v == 'S':
layers += [nn.Conv2d(in_channels, cfg[k + 1],
kernel_size=(1, 3)[flag], stride=2, padding=1)]
else:
layers += [nn.Conv2d(in_channels, v, kernel_size=(1, 3)[flag])]
flag = not flag
in_channels = v
return layers
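A similar sketch (again my own, assuming add_extras is importable from ssd.py and reusing the extras['300'] config that appears later) traces the spatial sizes produced by these extra layers: starting from the 19x19 conv7 output and applying each layer followed by ReLU, as the forward pass later does, we get the 10x10, 5x5, 3x3 and 1x1 feature maps:

import torch
import torch.nn.functional as F
from ssd import add_extras  # the function shown above

extra_layers = add_extras([256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256], 1024)
x = torch.randn(1, 1024, 19, 19)             # conv7 output
for k, layer in enumerate(extra_layers):
    x = F.relu(layer(x), inplace=True)
    if k % 2 == 1:                           # every second layer is a prediction source
        print(tuple(x.shape))
# (1, 512, 10, 10), (1, 256, 5, 5), (1, 256, 3, 3), (1, 256, 1, 1)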
Besides the modified VGG16 and the Extra Layers, the structure diagram also contains 6 horizontal arrows. These represent convolving the 6 feature maps of different scales to obtain the predicted boxes' regression (loc) and class (conf) information. Note that SSD treats background as a class too, so for the VOC dataset the number of classes is 20+1=21. The code for this part is:
def multibox(vgg, extra_layers, cfg, num_classes):
    loc_layers = []   # regression heads of the multi-scale branches
    conf_layers = []  # classification heads of the multi-scale branches
    # Part 1: sources taken from the vgg network, Conv4_3 (index 21) and Conv7 (index -2)
vgg_source = [21, -2]
for k, v in enumerate(vgg_source):
        # regression: boxes * 4 (coordinates)
loc_layers += [nn.Conv2d(vgg[v].out_channels,
cfg[k] * 4, kernel_size=3, padding=1)]
        # confidence: boxes * num_classes
conf_layers += [nn.Conv2d(vgg[v].out_channels,
cfg[k] * num_classes, kernel_size=3, padding=1)]
    # Part 2: cfg entries from the third one onward give the boxes per location; the source layers in extras are at indices 1, 3, 5, 7
for k, v in enumerate(extra_layers[1::2], 2):
loc_layers += [nn.Conv2d(v.out_channels, cfg[k]
* 4, kernel_size=3, padding=1)]
conf_layers += [nn.Conv2d(v.out_channels, cfg[k]
* num_classes, kernel_size=3, padding=1)]
return vgg, extra_layers, (loc_layers, conf_layers)
# test it with the code below
if __name__ == "__main__":
vgg, extra_layers, (l, c) = multibox(vgg(base['300'], 3),
add_extras(extras['300'], 1024),
[4, 6, 6, 6, 4, 4], 21)
print(nn.Sequential(*l))
print('---------------------------')
print(nn.Sequential(*c))
Running this in a Jupyter notebook prints:
'''
loc layers:
'''
Sequential(
(0): Conv2d(512, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Conv2d(1024, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2d(512, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): Conv2d(256, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
---------------------------
'''
conf layers:
'''
Sequential(
(0): Conv2d(512, 84, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Conv2d(1024, 126, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2d(512, 126, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): Conv2d(256, 126, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Conv2d(256, 84, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Conv2d(256, 84, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
Anchor Generation (the PriorBox Layer)
This was covered in the earlier post on SSD principles, but it is worth recalling. Starting from conv4_3 of the modified VGG16, SSD uses a total of 6 feature maps of different sizes, namely (38,38), (19,19), (10,10), (5,5), (3,3), (1,1), but the number of prior boxes (anchors) placed on each feature map differs. A prior box is specified by its scale and its aspect ratio. The scales follow

$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m-1}(k-1), \quad k \in [1, m]$$

where $m$ is the number of feature maps, here 5 because the anchors of the first feature map conv4_3 are set separately; $s_k$ is the scale of the prior boxes relative to the input image; and $s_{\min}$ and $s_{\max}$ are the minimum and maximum of that scale, taken as 0.2 and 0.9 in the paper.

For the first feature map the scale is set to $s_{\min}/2 = 0.1$, i.e. a size of $300 \times 0.1 = 30$. Plugging the remaining feature maps into the formula and mapping back to the 300x300 input (the implementation interpolates on integer percentages, so the values differ slightly from the exact linear formula), the other five feature maps get sizes 60, 111, 162, 213, 264. Altogether, the six feature maps use the sizes 30, 60, 111, 162, 213, 264. With the scales fixed, the widths and heights come from the aspect ratios, which the paper sets to $a_r \in \{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$; from the area and the aspect ratio the prior box width and height are

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}.$$
A few points worth noting:

- The $w_k^a$ and $h_k^a$ above are relative to the original image.
- By default, besides the 5 aspect ratios above, each feature map also gets one prior with scale $s_k' = \sqrt{s_k s_{k+1}}$ and $a_r = 1$, so every feature map has two square priors of aspect ratio 1 but different sizes. For the last feature map, a virtual $s_{m+1} = 315/300$ is used to compute $s_k'$.
- In the implementation, the conv4_3, conv10_2 and conv11_2 layers only use 4 priors each: the anchors with aspect ratios 3 and 1/3 are dropped.
- The prior boxes of each cell are centered at the cell's center, i.e. $\left(\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|}\right)$ with $i, j \in [0, |f_k|)$, where $|f_k|$ is the size of the feature map.
Looking at the anchor sizes, the earlier feature maps have smaller anchors, which is what makes them better for small objects. The total number of priors is num_priors = 38x38x4+19x19x6+10x10x6+5x5x6+3x3x4+1x1x4=8732.
The code that generates the prior boxes is as follows (from layers/functions/prior_box.py):
class PriorBox(object):
"""Compute priorbox coordinates in center-offset form for each source
feature map.
"""
def __init__(self, cfg):
super(PriorBox, self).__init__()
self.image_size = cfg['min_dim']
# number of priors for feature map location (either 4 or 6)
self.num_priors = len(cfg['aspect_ratios'])
self.variance = cfg['variance'] or [0.1]
self.feature_maps = cfg['feature_maps']
self.min_sizes = cfg['min_sizes']
self.max_sizes = cfg['max_sizes']
self.steps = cfg['steps']
self.aspect_ratios = cfg['aspect_ratios']
self.clip = cfg['clip']
self.version = cfg['name']
for v in self.variance:
if v <= 0:
raise ValueError('Variances must be greater than 0')
    def forward(self):
        mean = []
        # iterate over the multi-scale feature maps: [38, 19, 10, 5, 3, 1]
        for k, f in enumerate(self.feature_maps):
            # iterate over every cell of the feature map
            for i, j in product(range(f), repeat=2):
                # effective size of the k-th feature map, image_size / step
                f_k = self.image_size / self.steps[k]
                # center coordinates of every box
                cx = (j + 0.5) / f_k
                cy = (i + 0.5) / f_k
                # aspect_ratio == 1 produces two boxes
                # r==1, size = s_k, a square
                s_k = self.min_sizes[k]/self.image_size
                mean += [cx, cy, s_k, s_k]
                # r==1, size = sqrt(s_k * s_(k+1)), a square
                # rel size: sqrt(s_k * s_(k+1))
                s_k_prime = sqrt(s_k * (self.max_sizes[k]/self.image_size))
                mean += [cx, cy, s_k_prime, s_k_prime]
                # ratio != 1 produces rectangular boxes
                for ar in self.aspect_ratios[k]:
                    mean += [cx, cy, s_k*sqrt(ar), s_k/sqrt(ar)]
                    mean += [cx, cy, s_k/sqrt(ar), s_k*sqrt(ar)]
        # convert to a torch Tensor
        output = torch.Tensor(mean).view(-1, 4)
        # clip the output into [0, 1]
        if self.clip:
            output.clamp_(max=1, min=0)
        return output
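To see the layer in action (a sketch of my own; it assumes the repo root is on the path and that the voc config dictionary, which holds the feature_maps, min_sizes, max_sizes, steps and aspect_ratios discussed above, can be imported from the repo's data package):

import torch
from data import voc                         # voc config from data/config.py
from layers.functions.prior_box import PriorBox

priors = PriorBox(voc).forward()
print(priors.size())                          # torch.Size([8732, 4])
print(priors[0])                              # first conv4_3 prior, in (cx, cy, w, h) form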
The Complete Network
Combining the modified VGG16 introduced earlier, the extra layers, and the PriorBox strategy for generating anchors, we can write out the overall SSD structure as follows (code in ssd.py):
class SSD(nn.Module):
"""Single Shot Multibox Architecture
The network is composed of a base VGG network followed by the
added multibox conv layers. Each multibox layer branches into
1) conv2d for class conf scores
2) conv2d for localization predictions
3) associated priorbox layer to produce default bounding
boxes specific to the layer's feature map size.
See: https://arxiv.org/pdf/1512.02325.pdf for more details.
Args:
phase: (string) Can be "test" or "train"
size: input image size
base: VGG16 layers for input, size of either 300 or 500
extras: extra layers that feed to multibox loc and conf layers
head: "multibox head" consists of loc and conf conv layers
"""
def __init__(self, phase, size, base, extras, head, num_classes):
super(SSD, self).__init__()
self.phase = phase
self.num_classes = num_classes
        # choose the config (coco or voc)
self.cfg = (coco, voc)[num_classes == 21]
        # initialize the prior boxes
self.priorbox = PriorBox(self.cfg)
self.priors = Variable(self.priorbox.forward(), volatile=True)
self.size = size
# SSD network
        # backbone network
self.vgg = nn.ModuleList(base)
# Layer learns to scale the l2 normalized features from conv4_3
        # L2 normalization applied to the conv4_3 feature map
self.L2Norm = L2Norm(512, 20)
self.extras = nn.ModuleList(extras)
        # regression and classification heads
self.loc = nn.ModuleList(head[0])
self.conf = nn.ModuleList(head[1])
if phase == 'test':
self.softmax = nn.Softmax(dim=-1)
self.detect = Detect(num_classes, 0, 200, 0.01, 0.45)
def forward(self, x):
"""Applies network layers and ops on input image(s) x.
Args:
x: input image or batch of images. Shape: [batch,3,300,300].
Return:
Depending on phase:
test:
Variable(tensor) of output class label predictions,
confidence score, and corresponding location predictions for
each object detected. Shape: [batch,topk,7]
train:
list of concat outputs from:
1: confidence layers, Shape: [batch*num_priors,num_classes]
2: localization layers, Shape: [batch,num_priors*4]
3: priorbox layers, Shape: [2,num_priors*4]
"""
sources = list()
loc = list()
conf = list()
# apply vgg up to conv4_3 relu
        # VGG layers 0-22 end at the conv4_3 ReLU
for k in range(23):
x = self.vgg[k](x)
        # L2-normalize the conv4_3 output and keep it as the first source
s = self.L2Norm(x)
sources.append(s)
# apply vgg up to fc7
        # from conv4_3 to fc7 (conv7)
for k in range(23, len(self.vgg)):
x = self.vgg[k](x)
sources.append(x)
# apply extra layers and cache source layer outputs
        # extra layers
for k, v in enumerate(self.extras):
x = F.relu(v(x), inplace=True)
if k % 2 == 1:
                # every second extra-layer output is kept as a multi-scale source
sources.append(x)
# apply multibox head to source layers
        # multi-scale regression and classification heads
for (x, l, c) in zip(sources, self.loc, self.conf):
loc.append(l(x).permute(0, 2, 3, 1).contiguous())
conf.append(c(x).permute(0, 2, 3, 1).contiguous())
loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)
if self.phase == "test":
output = self.detect(
loc.view(loc.size(0), -1, 4), # loc preds
self.softmax(conf.view(conf.size(0), -1,
self.num_classes)), # conf preds
self.priors.type(type(x.data)) # default boxes
)
else:
output = (
                # loc output, size: (batch, 8732, 4)
                loc.view(loc.size(0), -1, 4),
                # conf output, size: (batch, 8732, 21)
                conf.view(conf.size(0), -1, self.num_classes),
                # all the prior boxes, size: (8732, 4)
                self.priors
)
return output
    # load model weights
def load_weights(self, base_file):
other, ext = os.path.splitext(base_file)
        if ext == '.pkl' or ext == '.pth':
print('Loading weights into state dict...')
self.load_state_dict(torch.load(base_file,
map_location=lambda storage, loc: storage))
print('Finished!')
else:
print('Sorry only .pth and .pkl files supported.')
Finally, for readability, everything is wrapped up once more; the code is as follows:
base = {
'300': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
512, 512, 512],
'512': [],
}
extras = {
'300': [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256],
'512': [],
}
mbox = {
'300': [4, 6, 6, 6, 4, 4], # number of boxes per feature map location
'512': [],
}
def build_ssd(phase, size=300, num_classes=21):
if phase != "test" and phase != "train":
print("ERROR: Phase: " + phase + " not recognized")
return
if size != 300:
print("ERROR: You specified size " + repr(size) + ". However, " +
"currently only SSD300 (size=300) is supported!")
return
    # call multibox to generate vgg, extras and the prediction heads
base_, extras_, head_ = multibox(vgg(base[str(size)], 3),
add_extras(extras[str(size)], 1024),
mbox[str(size)], num_classes)
return SSD(phase, size, base_, extras_, head_, num_classes)
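As a quick check (a minimal sketch, assuming the repo root is on the path and the old PyTorch version the repo targets, where the Variable/volatile calls above still run), we can build the network in train mode and feed a dummy batch to confirm the three output shapes described above:

import torch
from torch.autograd import Variable
from ssd import build_ssd

net = build_ssd('train', size=300, num_classes=21)
x = Variable(torch.randn(1, 3, 300, 300))   # dummy 300x300 RGB image
loc, conf, priors = net(x)
print(loc.size())      # torch.Size([1, 8732, 4])
print(conf.size())     # torch.Size([1, 8732, 21])
print(priors.size())   # torch.Size([8732, 4])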
Loss Analysis
The SSD loss has two parts, a localization loss and a classification (confidence) loss. The overall loss is

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where $N$ is the number of positive (matched) prior boxes, $c$ is the class confidence predictions, $l$ is the predicted box offsets for the priors (the network's outputs), and $g$ is the ground-truth box parameters. The localization loss is a Smooth L1 loss on the encoded coordinates (the encode step is described later). For the classification loss, hard negative mining is applied first so that negatives are sampled at roughly a 3:1 negative:positive ratio; the sampling works by sorting all the confidences of the batch by confidence error in descending order and keeping the top_k negatives. Written out as in the paper, the two terms are

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_p \exp(c_i^{p})}$$

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)$$
Implementation steps:

- Reshape all the conf in the batch, i.e. batch_conf = conf_data.view(-1, self.num_classes) in the code, so it can be sorted later.
- The larger the confidence error, the smaller the predicted confidence for the background class.
- Run log-softmax over all the confidences (all values are negative): the smaller the predicted confidence, the smaller (more negative) the log-softmax, i.e. the larger |logsoftmax|; so sort -logsoftmax in descending order and take the top_k negatives.
The log_sum_exp function used here is:
def log_sum_exp(x):
x_max = x.detach().max()
return torch.log(torch.sum(torch.exp(x-x_max), 1, keepdim=True))+x_max
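Why the maximum is subtracted first can be seen with a couple of lines (a quick sketch of my own, not from the repo): with large logits the naive form overflows while the shifted form does not.

import torch

x = torch.tensor([[1000.0, 1001.0, 1002.0]])                           # large logits overflow exp()
naive = torch.log(torch.sum(torch.exp(x), 1, keepdim=True))            # tensor([[inf]])
x_max = x.max()
stable = torch.log(torch.sum(torch.exp(x - x_max), 1, keepdim=True)) + x_max
print(naive, stable)                                                    # inf vs tensor([[1002.4076]])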
The per-prior classification loss conf_logP is then computed as:
conf_logP = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))
Computing it this way mainly improves the numerical stability of the log-softmax loss; the derivation is short:
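For one prior with logits $c = (c_0, \dots, c_{20})$ and target class $y$, the cross-entropy loss is

$$-\log\left(\frac{e^{c_y}}{\sum_j e^{c_j}}\right) = \log\sum_j e^{c_j} - c_y,$$

and the first term is evaluated stably by factoring out the largest logit,

$$\log\sum_j e^{c_j} = c_{\max} + \log\sum_j e^{c_j - c_{\max}},$$

which is exactly what log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1)) computes for every prior, with exp never overflowing.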
The complete implementation of the loss function, from layers/modules/multibox_loss.py:
class MultiBoxLoss(nn.Module):
"""SSD Weighted Loss Function
Compute Targets:
1) Produce Confidence Target Indices by matching ground truth boxes
with (default) 'priorboxes' that have jaccard index > threshold parameter
(default threshold: 0.5).
2) Produce localization target by 'encoding' variance into offsets of ground
truth boxes and their matched 'priorboxes'.
3) Hard negative mining to filter the excessive number of negative examples
that comes with using a large number of default bounding boxes.
(default negative:positive ratio 3:1)
Objective Loss:
L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
weighted by α which is set to 1 by cross val.
Args:
c: class confidences,
l: predicted boxes,
g: ground truth boxes
N: number of matched default boxes
See: https://arxiv.org/pdf/1512.02325.pdf for more details.
"""
def __init__(self, num_classes, overlap_thresh, prior_for_matching,
bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
use_gpu=True):
super(MultiBoxLoss, self).__init__()
self.use_gpu = use_gpu
self.num_classes = num_classes
self.threshold = overlap_thresh
self.background_label = bkg_label
self.encode_target = encode_target
self.use_prior_for_matching = prior_for_matching
self.do_neg_mining = neg_mining
self.negpos_ratio = neg_pos
self.neg_overlap = neg_overlap
self.variance = cfg['variance']
def forward(self, predictions, targets):
"""Multibox Loss
Args:
predictions (tuple): A tuple containing loc preds, conf preds,
and prior boxes from SSD net.
conf shape: torch.size(batch_size,num_priors,num_classes)
loc shape: torch.size(batch_size,num_priors,4)
priors shape: torch.size(num_priors,4)
targets (tensor): Ground truth boxes and labels for a batch,
shape: [batch_size,num_objs,5] (last idx is the label).
"""
loc_data, conf_data, priors = predictions
num = loc_data.size(0)# batch_size
priors = priors[:loc_data.size(1), :]
        num_priors = (priors.size(0))   # number of prior boxes
        num_classes = self.num_classes  # number of classes
# match priors (default boxes) and ground truth boxes
        # get the ground truth matched to each prior box
        # create loc_t and conf_t to hold the ground-truth box locations and classes
loc_t = torch.Tensor(num, num_priors, 4)
conf_t = torch.LongTensor(num, num_priors)
for idx in range(num):
            truths = targets[idx][:, :-1].data  # ground-truth box coordinates
            labels = targets[idx][:, -1].data   # ground-truth class labels
            defaults = priors.data              # prior box coordinates
            # match ground truth boxes to priors
match(self.threshold, truths, defaults, self.variance, labels,
loc_t, conf_t, idx)
if self.use_gpu:
loc_t = loc_t.cuda()
conf_t = conf_t.cuda()
# wrap targets
loc_t = Variable(loc_t, requires_grad=False)
conf_t = Variable(conf_t, requires_grad=False)
        # mask of all positive (matched) priors, shape [b, M]
pos = conf_t > 0
num_pos = pos.sum(dim=1, keepdim=True)
        # Localization loss, using Smooth L1
# shape[b,M]-->shape[b,M,4]
pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)  # predicted boxes for the positives
        loc_t = loc_t[pos_idx].view(-1, 4)     # encoded ground truth for the positives
        loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)  # Smooth L1 loss
        '''
        Hard negative mining:
        1. For all the conf in the batch, sort by confidence error in descending
           order (the smaller the predicted background confidence, the larger the error);
        2. Every negative is labeled as background, so compute logP with log-softmax:
           the larger logP, the lower the background probability and the larger the error;
        3. Take the top_k largest errors as negatives, keeping the
           positive:negative ratio close to 1:3.
        '''
# Compute max conf across batch for hard negative mining
# shape[b*M,num_classes]
batch_conf = conf_data.view(-1, self.num_classes)
        # per-prior confidence loss via the stable log-sum-exp, shape [b*M, 1]
loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))
# Hard Negative Mining
        loss_c[pos] = 0  # exclude the positives; everything left is a negative candidate
        loss_c = loss_c.view(num, -1)  # shape [b, M]
        # sorting twice gives each element's position (idx_rank) in the descending order
_, loss_idx = loss_c.sort(1, descending=True)
_, idx_rank = loss_idx.sort(1)
        # sample the negatives
        # number of positives per batch element, shape [b, 1]
num_pos = pos.long().sum(1, keepdim=True)
num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
        # take the top_k negatives, shape [b, M]
neg = idx_rank < num_neg.expand_as(idx_rank)
# Confidence Loss Including Positive and Negative Examples
# shape[b,M] --> shape[b,M,num_classes]
pos_idx = pos.unsqueeze(2).expand_as(conf_data)
neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        # gather the selected positive and negative samples (predictions and targets)
conf_p = conf_data[(pos_idx+neg_idx).gt(0)].view(-1, self.num_classes)
targets_weighted = conf_t[(pos+neg).gt(0)]
        # cross-entropy over the selected confidences
loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)
# Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        # number of positives
N = num_pos.data.sum()
loss_l /= N
loss_c /= N
return loss_l, loss_c
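The two consecutive sorts used above to pick the hardest negatives are easy to miss. A tiny standalone sketch (my own illustration, not repo code) shows how sorting twice yields each element's rank in the descending order:

import torch

loss_c = torch.tensor([[0.2, 0.9, 0.1, 0.5]])   # fake per-prior errors for one image
_, loss_idx = loss_c.sort(1, descending=True)   # indices of errors, largest first
_, idx_rank = loss_idx.sort(1)                  # rank of every element in that ordering
print(idx_rank)                                 # tensor([[2, 0, 3, 1]]): 0.9 has rank 0, 0.5 rank 1, ...
num_neg = 2
neg = idx_rank < num_neg                        # keep the 2 hardest negatives
print(neg)                                      # tensor([[False,  True, False,  True]])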
Prior Box Matching Strategy
One thing in the code above has not been covered yet: the match function. This is SSD's prior-box matching step. During training we first need to decide which prior box each ground truth in the image is matched to; the bounding box predicted from that prior will be responsible for predicting it. SSD matches priors and ground truths with two rules. First, for every ground truth in the image, the prior with the highest IOU is matched to it, which guarantees that every ground truth is matched to some prior. Second, among the remaining unmatched priors, if a prior's IOU with some ground truth exceeds a threshold (usually 0.5), that prior is also matched to this ground truth; the priors that are still unmatched are all negatives (if several ground truths have IOU above the threshold with one prior, the prior is matched only to the one with the highest IOU). The implementation is as follows, from layers/box_utils.py:
def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
"""把和每個(gè)prior box 有最大的IOU的ground truth box進(jìn)行匹配,
同時(shí)篡帕,編碼包圍框殖侵,返回匹配的索引,對(duì)應(yīng)的置信度和位置
Args:
threshold: IOU閾值赂苗,小于閾值設(shè)為背景
truths: ground truth boxes, shape[N,4]
priors: 先驗(yàn)框愉耙, shape[M,4]
variances: prior的方差, list(float)
labels: 圖片的所有類別贮尉,shape[num_obj]
loc_t: 用于填充encoded loc 目標(biāo)張量
conf_t: 用于填充encoded conf 目標(biāo)張量
idx: 現(xiàn)在的batch index
The matched indices corresponding to 1)location and 2)confidence preds.
"""
# jaccard index
    # compute the IOU
overlaps = jaccard(
truths,
point_form(priors)
)
# (Bipartite Matching)
    # [1, num_objects]: the prior box with the largest overlap for each ground truth box
best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)
    # [1, num_priors]: the ground truth box with the largest overlap for each prior box
best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)
best_truth_idx.squeeze_(0) #M
best_truth_overlap.squeeze_(0) #M
best_prior_idx.squeeze_(1) #N
best_prior_overlap.squeeze_(1) #N
    # ensure every ground truth box is matched to some prior by fixing its overlap to 2 (> threshold)
best_truth_overlap.index_fill_(0, best_prior_idx, 2) # ensure best prior
# TODO refactor: index best_prior_idx with long tensor
# ensure every gt matches with its prior of max overlap
    # ensure every ground truth is matched to the prior that has the largest IOU with it
    # use best_prior_idx to overwrite the corresponding entries of best_truth_idx
for j in range(best_prior_idx.size(0)):
best_truth_idx[best_prior_idx[j]] = j
    matches = truths[best_truth_idx]  # matched ground truth box for every prior, Shape: [M,4]
    conf = labels[best_truth_idx] + 1  # class label for every prior (shifted so 0 means background), Shape: [M]
    # set the class of priors with IOU < threshold to background, i.e. 0
conf[best_truth_overlap < threshold] = 0 # label as background
    # encode the boxes
loc = encode(matches, priors, variances)
    # save the matched loc and conf into loc_t and conf_t
loc_t[idx] = loc # [num_priors,4] encoded offsets to learn
conf_t[idx] = conf # [num_priors] top class label for each prior
Box Coordinate Conversion
The point_form function appeared above. What is it for? A bounding box can be represented in two ways:

- (cx, cy, w, h): the center coordinates plus width and height;
- (xmin, ymin, xmax, ymax): the top-left and bottom-right corners.

The code for this part is in layers/box_utils.py:
def point_form(boxes):
""" Convert prior_boxes to (xmin, ymin, xmax, ymax)
    Convert boxes from (cx, cy, w, h) to (xmin, ymin, xmax, ymax)
"""
return torch.cat((boxes[:, :2] - boxes[:, 2:]/2, # xmin, ymin
boxes[:, :2] + boxes[:, 2:]/2), 1) # xmax, ymax
def center_size(boxes):
    """ Convert prior_boxes to (cx, cy, w, h)
    Convert boxes from (xmin, ymin, xmax, ymax) to (cx, cy, w, h)
    """
    return torch.cat(((boxes[:, 2:] + boxes[:, :2])/2,  # cx, cy
                      boxes[:, 2:] - boxes[:, :2]), 1)  # w, h
IOU Computation
This part is straightforward. For two boxes, first take the element-wise maximum of the two top-left corners and the minimum of the two bottom-right corners, compute the intersection area from them, and finally divide the intersection by the corresponding union. The code is again in layers/box_utils.py:
def intersect(box_a, box_b):
""" We resize both tensors to [A,B,2] without new malloc:
[A,2] -> [A,1,2] -> [A,B,2]
[B,2] -> [1,B,2] -> [A,B,2]
Then we compute the area of intersect between box_a and box_b.
Args:
box_a: (tensor) bounding boxes, Shape: [A,4].
box_b: (tensor) bounding boxes, Shape: [B,4].
Return:
(tensor) intersection area, Shape: [A,B].
"""
A = box_a.size(0)
B = box_b.size(0)
    # bottom-right corner: take the minimum
max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2),
box_b[:, 2:].unsqueeze(0).expand(A, B, 2))
    # top-left corner: take the maximum
min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2),
box_b[:, :2].unsqueeze(0).expand(A, B, 2))
    # clamp negatives to 0; 0 means the boxes do not intersect
inter = torch.clamp((max_xy - min_xy), min=0)
return inter[:, :, 0] * inter[:, :, 1]
def jaccard(box_a, box_b):
"""Compute the jaccard overlap of two sets of boxes. The jaccard overlap
is simply the intersection over union of two boxes. Here we operate on
ground truth boxes and default boxes.
E.g.:
A ∩ B / A ∪ B = A ∩ B / (area(A) + area(B) - A ∩ B)
Args:
box_a: (tensor) Ground truth bounding boxes, Shape: [num_objects,4]
box_b: (tensor) Prior boxes from priorbox layers, Shape: [num_priors,4]
Return:
jaccard overlap: (tensor) Shape: [box_a.size(0), box_b.size(0)]
"""
inter = intersect(box_a, box_b)# A∩B
    # areas of box_a and box_b
area_a = ((box_a[:, 2]-box_a[:, 0]) *
(box_a[:, 3]-box_a[:, 1])).unsqueeze(1).expand_as(inter) # [A,B]#(N,)
area_b = ((box_b[:, 2]-box_b[:, 0]) *
(box_b[:, 3]-box_b[:, 1])).unsqueeze(0).expand_as(inter) # [A,B]#(M,)
union = area_a + area_b - inter
return inter / union # [A,B]
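A quick worked example (my own sketch, assuming jaccard is imported from layers/box_utils.py with the repo root on the path): two unit squares overlapping by half give IOU = 0.5 / 1.5 = 1/3.

import torch
from layers.box_utils import jaccard

box_a = torch.tensor([[0.0, 0.0, 1.0, 1.0]])   # unit square
box_b = torch.tensor([[0.5, 0.0, 1.5, 1.0]])   # same square shifted right by 0.5
print(jaccard(box_a, box_b))                    # tensor([[0.3333]])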
L2 Normalization
The conv4_3 feature map of VGG16 has size 38x38x512. Because it comes from an early layer, its norm is relatively large, so an L2 normalization is added to keep it comparable to the later detection layers. The L2 normalization is

$$\hat{x} = \frac{x}{\lVert x \rVert_2}, \qquad \lVert x \rVert_2 = \left(\sum_{i=1}^{d} |x_i|^2\right)^{1/2}$$

where the norm is taken over the $d = 512$ channels at each spatial position. Note that simply L2-normalizing a layer's input changes its scale and slows down learning, so a learnable scaling factor $\gamma_i$ is introduced; for each channel the result after normalization is

$$y_i = \gamma_i \hat{x}_i$$

and setting $\gamma$ to 10 or 20 usually works well. The code is from layers/modules/l2norm.py.
class L2Norm(nn.Module):
    '''
    The conv4_3 feature map is 38x38. It comes from an early layer and its norm is
    large, so an L2 Normalization is added to keep it comparable to the later
    detection layers; see ParseNet (covered in an earlier post) for details.
    '''
def __init__(self, n_channels, scale):
super(L2Norm, self).__init__()
self.n_channels = n_channels
self.gamma = scale or None
self.eps = 1e-10
        # turn a plain (non-trainable) Tensor into a trainable Parameter
self.weight = nn.Parameter(torch.Tensor(self.n_channels))
self.reset_parameters()
    # initialize the parameter
def reset_parameters(self):
nn.init.constant_(self.weight, self.gamma)
def forward(self, x):
        # compute the L2 norm of x over the channel dimension
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt() + self.eps  # shape [b,1,38,38]
        x = x / norm  # shape [b,512,38,38]
        # broadcast self.weight to shape [1,512,1,1], then scale as in the formula
        out = self.weight[None, ..., None, None] * x
return out
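A quick check of the layer's behavior (a sketch, assuming layers/modules/l2norm.py is importable from the repo root): after normalization and scaling, the channel-wise L2 norm at every spatial position should be close to the initial gamma of 20.

import torch
from layers.modules.l2norm import L2Norm

layer = L2Norm(512, 20)
x = torch.randn(1, 512, 38, 38)
out = layer(x)
norms = out.pow(2).sum(dim=1).sqrt()
print(norms.min().item(), norms.max().item())   # both approximately 20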
Location Encoding and Decoding
It was mentioned above that the coordinates used for the localization loss are the encoded ones. What does that mean? Following the paper, there is a transformation between the predicted boxes and the ground-truth boxes. First define some variables:

- prior box location: $d = (d^{cx}, d^{cy}, d^{w}, d^{h})$
- ground truth box location: $g = (g^{cx}, g^{cy}, g^{w}, g^{h})$
- variance: the coordinate variances of the prior boxes.

The encoding can then be written as

$$\hat{g}^{cx} = \frac{g^{cx} - d^{cx}}{d^{w} \cdot variance[0]}, \quad \hat{g}^{cy} = \frac{g^{cy} - d^{cy}}{d^{h} \cdot variance[0]}, \quad \hat{g}^{w} = \frac{\log(g^{w}/d^{w})}{variance[1]}, \quad \hat{g}^{h} = \frac{\log(g^{h}/d^{h})}{variance[1]}$$

and the decoding is the inverse:

$$g^{cx} = d^{cx} + \hat{g}^{cx} \, variance[0] \, d^{w}, \quad g^{cy} = d^{cy} + \hat{g}^{cy} \, variance[0] \, d^{h}, \quad g^{w} = d^{w} \exp(\hat{g}^{w} \, variance[1]), \quad g^{h} = d^{h} \exp(\hat{g}^{h} \, variance[1])$$
The corresponding code is in layers/box_utils.py:
def encode(matched, priors, variances):
"""Encode the variances from the priorbox layers into the ground truth boxes
we have matched (based on jaccard overlap) with the prior boxes.
Args:
matched: (tensor) Coords of ground truth for each prior in point-form
Shape: [num_priors, 4].
priors: (tensor) Prior boxes in center-offset form
Shape: [num_priors,4].
variances: (list[float]) Variances of priorboxes
Return:
encoded boxes (tensor), Shape: [num_priors, 4]
"""
# dist b/t match center and prior's center
g_cxcy = (matched[:, :2] + matched[:, 2:])/2 - priors[:, :2]
# encode variance
g_cxcy /= (variances[0] * priors[:, 2:])
# match wh / prior wh
g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
g_wh = torch.log(g_wh) / variances[1]
# return target for smooth_l1_loss
return torch.cat([g_cxcy, g_wh], 1) # [num_priors,4]
# Adapted from https://github.com/Hakuyume/chainer-ssd
def decode(loc, priors, variances):
"""Decode locations from predictions using priors to undo
the encoding we did for offset regression at train time.
Args:
loc (tensor): location predictions for loc layers,
Shape: [num_priors,4]
priors (tensor): Prior boxes in center-offset form.
Shape: [num_priors,4].
variances: (list[float]) Variances of priorboxes
Return:
decoded bounding box predictions
"""
boxes = torch.cat((
priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],
priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])), 1)
boxes[:, :2] -= boxes[:, 2:] / 2
boxes[:, 2:] += boxes[:, :2]
return boxes
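To convince ourselves that decode undoes encode (a minimal sketch assuming encode and decode are imported from layers/box_utils.py with the repo root on the path), encode a ground-truth box against a prior and decode it back:

import torch
from layers.box_utils import encode, decode

variances = [0.1, 0.2]
prior = torch.tensor([[0.5, 0.5, 0.2, 0.2]])   # prior in (cx, cy, w, h) form
gt = torch.tensor([[0.4, 0.4, 0.6, 0.7]])      # ground truth in (xmin, ymin, xmax, ymax) form
offsets = encode(gt, prior, variances)
print(decode(offsets, prior, variances))        # tensor([[0.4000, 0.4000, 0.6000, 0.7000]])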
后處理NMS
這部分我在上周的推文講過原理了敏簿,這里不再贅述了。這里IOU閾值取了0.5宣虾。不了解原理可以去看一下我的那篇推文惯裕,也給了源碼講解,地址是:https://mp.weixin.qq.com/s/orYMdwZ1VwwIScPmIiq5iA 绣硝。這部分的代碼也在layers/box_utils.py
里面蜻势。就不再拿代碼來贅述了。
The Detection Function
At test time, the loc and conf outputs are fed into the detect function, which runs NMS and produces the final results. The code is in layers/functions/detection.py:
class Detect(Function):
"""At test time, Detect is the final layer of SSD. Decode location preds,
apply non-maximum suppression to location predictions based on conf
scores and threshold to a top_k number of output predictions for both
confidence score and locations.
"""
def __init__(self, num_classes, bkg_label, top_k, conf_thresh, nms_thresh):
self.num_classes = num_classes
self.background_label = bkg_label
self.top_k = top_k
# Parameters used in nms.
self.nms_thresh = nms_thresh
if nms_thresh <= 0:
raise ValueError('nms_threshold must be non negative.')
self.conf_thresh = conf_thresh
self.variance = cfg['variance']
def forward(self, loc_data, conf_data, prior_data):
"""
Args:
            loc_data: predicted loc tensor, shape [b,M,4], e.g. [b, 8732, 4]
            conf_data: predicted confidences, shape [b,M,num_classes], e.g. [b, 8732, 21]
            prior_data: prior boxes, shape [M,4], e.g. [8732, 4]
"""
num = loc_data.size(0) # batch size
num_priors = prior_data.size(0)
        output = torch.zeros(num, self.num_classes, self.top_k, 5)  # initialize the output
conf_preds = conf_data.view(num, num_priors,
self.num_classes).transpose(2, 1)
        # decode the loc predictions into real bounding boxes
for i in range(num):
            # decode loc
decoded_boxes = decode(loc_data[i], prior_data, self.variance)
            # copy the conf of this batch element for nms
conf_scores = conf_preds[i].clone()
            # iterate over every class (skipping background)
for cl in range(1, self.num_classes):
                # filter out confidences below conf_thresh
c_mask = conf_scores[cl].gt(self.conf_thresh)
scores = conf_scores[cl][c_mask]
                # if nothing survives, move on to the next class
if scores.size(0) == 0:
continue
                # keep only the boxes whose conf passed the threshold
l_mask = c_mask.unsqueeze(1).expand_as(decoded_boxes)
boxes = decoded_boxes[l_mask].view(-1, 4)
# idx of highest scoring and non-overlapping boxes per class
# nms
ids, count = nms(boxes, scores, self.nms_thresh, self.top_k)
                # concatenate the outputs kept by nms
output[i, cl, :count] = \
torch.cat((scores[ids[:count]].unsqueeze(1),
boxes[ids[:count]]), 1)
flt = output.contiguous().view(num, -1, 5)
_, idx = flt[:, :, 0].sort(1, descending=True)
_, rank = idx.sort(1)
flt[(rank < self.top_k).unsqueeze(-1).expand_as(flt)].fill_(0)
return output
后記
SSD的核心代碼解析大概就到這里了醋寝,我覺得這個(gè)過程算法還算比較清晰了搞挣,不過SSD能夠表現(xiàn)較好的原因還和它的多種有效的數(shù)據(jù)增強(qiáng)方式有關(guān),之后我們有機(jī)會(huì)再來解析一下他的數(shù)據(jù)增強(qiáng)策略音羞。本文寫作的目錄參考了知乎https://zhuanlan.zhihu.com/p/79854543囱桨,看代碼和寫作以及理解一些細(xì)節(jié)大概花了一周時(shí)間,看到這里的同學(xué)不妨給我點(diǎn)個(gè)贊吧嗅绰。
歡迎關(guān)注我的微信公眾號(hào)GiantPadaCV舍肠,期待和你一起交流機(jī)器學(xué)習(xí),深度學(xué)習(xí)窘面,圖像算法翠语,優(yōu)化技術(shù),比賽及日常生活等财边。