Author: 心有寶寶人自圓
Notice: You are welcome to repost the images or text from this article; please credit the source.
Preface
Inspired by those who came before me, I decided to start writing articles to document what I have been learning.
I have read some articles and written some code before; I will slowly fill in those gaps later ??
For now, let me share what I have been studying recently.
Here I share my own understanding and takeaways; if anything is wrong or poorly understood, please point it out ??
This article is the follow-up to SSD: Single Shot MultiBox Detector, Part 1 - Paper Reading, and my attempt to fill in that gap......
Paper: SSD: Single Shot MultiBox Detector
Our goal: implement SSD with PyTorch ??
I am using python-3.6 + pytorch-1.3.0 + torchvision-0.4.1
Training set: VOC2007 trainval + VOC2012 trainval
Test set: VOC2007 test
The object categories are the following 20 classes + 1 background class
('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
'chair', 'cow', 'diningtable','dog', 'horse', 'motorbike', 'person',
'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')
The images below are detection results after training for 45 epochs, far fewer than the authors' 200+ epochs, but the results look decent (the main issue is that training takes quite some time ??). Some images from the test set are shown at random so you can judge the detection quality ??
0. A recap of the paper's key ideas
- single-shot vs two-stage: a typical two-stage model (the R-CNN family) follows the pipeline the SSD paper mentions: a large number of multi-scale region proposals, a CNN to extract features, a high-quality classifier, bounding-box regression, and so on. In short it suffers from the accuracy-speed trade-off, and its heavy computational cost makes it unsuitable for real-time object detection in the real world. SSD removes the most time-consuming steps, region proposal and resampling, and instead uses fixed anchor boxes built into the model, which lets us detect objects both quickly and accurately.
- Fixed anchor boxes (fixed bounding boxes, priors): in my earlier paper-reading post, most of the preparation work revolved around anchor boxes. Their design is crucial for training because they are turned into the ground-truth targets (offset + label). The anchors are fixed inside the SSD model beforehand (priors) and identified by (aspect ratio, scale). Since the anchors are tied to feature maps at different levels, higher-level maps use larger scales and lower-level maps smaller ones (predictions are made per prior).
- Multi-scale feature maps and predictors: SSD makes predictions on feature maps at several levels, appended after the truncated base net. Low-level maps mainly detect smaller objects and high-level maps larger ones; the predictor at each scale learns to detect objects of that scale. Because a pixel's receptive field is larger on higher-level maps, the convolution kernels can be kept small and of fixed size.
- Hard negative mining: training produces far more negatives (background) than positives, which severely unbalances the training data, so instead of using all negatives we explicitly keep only a fixed proportion of the negatives with the largest confidence loss when computing the loss.
- Non-maximum suppression: keep only the most confident predicted box and remove overlapping, redundant boxes.
The overall amount of work is still considerable; I will try to keep the comments clear ??
Remember to define the global variable:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
1. Starting from the anchor boxes (the fixed default boxes of the paper, hereafter Priors)
import matplotlib.pyplot as plt
def show_box(box, color):
"""
使用matplotlib展示邊界框
:param box: 邊界框等舔,(xmin, ymin, xmax, ymax)
:return: matplotlib.patches.Rectangle
"""
return plt.Rectangle(xy=(box[0], box[1]), width=box[2] - box[0], height=box[3] - box[1], fill=False,edgecolor=color, linewidth=2)
Generally speaking, objects (of whatever class) are scattered all over an image in all kinds of sizes. Probabilistically an object may appear anywhere, so the best we can do is discretize this probability space, so that at least we can produce some probability values...... ?? We therefore let the anchor boxes cover the whole feature map as densely as possible (a discretized probability space?).
Anchor boxes are prior, fixed boxes; together they represent the probability space of possible classes and approximate box locations. To stress that they are priors, we give them an English name: Prior.
1.1 Alright then, Priors
- These anchors are picked by hand, and their sizes and scales should match the training data; for the Priors to represent the probability space they have to be generated around every pixel
- As in the paper-reading post, low layers use smaller scales (to detect smaller objects) and high layers larger scales (larger objects). Because the scale is expressed as a fraction, it stays consistent when mapped from the feature map back to the original image.
The 6 priors at one location of the 10x10 feature map (the others are not drawn, it would be too cluttered)
(See the paper or my earlier article for the detailed procedure; only the key steps are marked here)
import math

def create_prior_boxes(widths: list, heights: list, scales: list, aspect_ratios: list) -> torch.Tensor:
"""
    Create prior boxes at every feature-map position, following the authors' method in the paper
    :param widths: widths of all feature maps used to create priors
    :param heights: heights of all feature maps used to create priors
    :param scales: scales of all feature maps used to create priors.
                   Note that each feature map has one specific scale
    :param aspect_ratios: aspect-ratio lists of all feature maps used to create priors.
                          Note that each feature map has a different number of ratios
    :return: priors' locations in center-size coordinates, a tensor of shape (8732, 4)
"""
prior_boxes = []
for i, (width, height, scale, ratios) in enumerate(zip(widths, heights, scales, aspect_ratios)):
for y in range(height):
for x in range(width):
# change cxcy to the center of pixel
# change cxcy in range 0 to 1
cx = (x + 0.5) / width
cy = (y + 0.5) / height
for ratio in ratios:
# all those params are proportional form(percent coordinates)
prior_width = scale * math.sqrt(ratio)
prior_height = scale / math.sqrt(ratio)
prior_boxes.append([cx, cy, prior_width, prior_height])
                    # For aspect ratio 1, also add a default box whose scale is sqrt(s_k * s_(k+1))
if ratio == 1:
try:
additional_scale = math.sqrt(scales[i] * scales[i + 1])
                        # for the last feature map there is no next scale to pair with, so fall back to 1
except IndexError:
additional_scale = 1
# ratio of 1 means scale is width and height
prior_boxes.append([cx, cy, additional_scale, additional_scale])
return torch.FloatTensor(prior_boxes).clamp_(0, 1).to(device) # (8732, 4) Note that they are percent coordinates
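For reference, this is how I build the priors. The feature-map sizes below are fixed by the SSD300 architecture, while the exact scales and aspect ratios follow a common PyTorch re-implementation rather than the paper's scale formula, so treat those numbers as an assumption:
fmap_dims = [38, 19, 10, 5, 3, 1]             # conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2
scales = [0.1, 0.2, 0.375, 0.55, 0.725, 0.9]  # assumed values, one per feature map
aspect_ratios = [[1., 2., 0.5],
                 [1., 2., 3., 0.5, 0.333],
                 [1., 2., 3., 0.5, 0.333],
                 [1., 2., 3., 0.5, 0.333],
                 [1., 2., 0.5],
                 [1., 2., 0.5]]
priors_cxcy = create_prior_boxes(fmap_dims, fmap_dims, scales, aspect_ratios)
print(priors_cxcy.shape)  # torch.Size([8732, 4])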
1.2 How Priors are represented
In the paper a Prior is written as (cx, cy, w, h): the center-size form. For programming convenience we sometimes also use the boundary form (xmin, ymin, xmax, ymax), so we need to convert between the two representations.
def xy_to_cxcy(xy: torch.Tensor) -> torch.Tensor:
"""
    Convert bounding boxes from the boundary form (xmin, ymin, xmax, ymax) to the center-size form (cx, cy, w, h)
    :param xy: bounding boxes in (xmin, ymin, xmax, ymax) form, a tensor of size (num_boxes, 4)
    :return: bounding boxes in (cx, cy, w, h) form, a tensor of size (num_boxes, 4)
"""
return torch.cat([(xy[:, 2:] + xy[:, :2] )/ 2, xy[:, 2:] - xy[:, :2]], dim=1)
def cxcy_to_xy(cxcy: torch.Tensor) -> torch.Tensor:
"""
    Convert bounding boxes from the center-size form (cx, cy, w, h) to the boundary form (xmin, ymin, xmax, ymax)
    :param cxcy: bounding boxes in (cx, cy, w, h) form, a tensor of size (n_boxes, 4)
    :return: bounding boxes in (xmin, ymin, xmax, ymax) form
"""
return torch.cat([cxcy[:, :2] - (cxcy[:, 2:] / 2), cxcy[:, :2] + (cxcy[:, 2:] / 2)], 1)
Note: as explained in the paper-reading post, for several reasons the Priors should be expressed with relative lengths (relative, i.e. normalized, coordinates).
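A quick sanity check with a made-up box (just a sketch) shows the two conversions are inverses of each other:
box_xy = torch.FloatTensor([[0.2, 0.3, 0.6, 0.7]])  # (xmin, ymin, xmax, ymax), fractional coordinates
box_cxcy = xy_to_cxcy(box_xy)                       # tensor([[0.4000, 0.5000, 0.4000, 0.4000]])
print(cxcy_to_xy(box_cxcy))                         # back to tensor([[0.2000, 0.3000, 0.6000, 0.7000]])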
1.3 Prior to ground truth
Obviously the priors are not real ground-truth information (they deviate from the true boundaries, carry no class label, and the ground truth of each prior is uncertain; we need to quantify all of this). We have to turn each prior's information into ground-truth targets in order to compute the loss (and we must also understand what exactly we are predicting and how the predictions are converted back into real predicted bounding boxes).
1.3.1 offset
The offsets are encoded as described in the paper-reading post:

$$\hat{g}_{cx} = \frac{c_x - \hat{c}_x}{\hat{w}},\qquad \hat{g}_{cy} = \frac{c_y - \hat{c}_y}{\hat{h}},\qquad \hat{g}_{w} = \log\frac{w}{\hat{w}},\qquad \hat{g}_{h} = \log\frac{h}{\hat{h}} \tag{1}$$

where $(c_x, c_y, w, h)$ is the ground-truth box's true location and $(\hat{c}_x, \hat{c}_y, \hat{w}, \hat{h})$ is the prior's true location.
In practice the encoded values are usually standardized once more with empirical parameters ("variances"), i.e.:

$$g_{cx} = \frac{c_x - \hat{c}_x}{\hat{w}\,\sigma_{xy}},\qquad g_{cy} = \frac{c_y - \hat{c}_y}{\hat{h}\,\sigma_{xy}},\qquad g_{w} = \frac{1}{\sigma_{wh}}\log\frac{w}{\hat{w}},\qquad g_{h} = \frac{1}{\sigma_{wh}}\log\frac{h}{\hat{h}} \tag{2}$$

where the empirical parameters are $\sigma_{xy} = 0.1$ and $\sigma_{wh} = 0.2$ (hence the factors 10 and 5 in the code below).
def cxcy_to_gcxgcy(cxcy: torch.Tensor, priors_cxcy: torch.Tensor) -> torch.Tensor:
"""
    Compute, from center-size inputs, the offsets of the target boxes w.r.t. the priors, encoded as in equation (2)
    The target boxes and the priors correspond one-to-one
    :param cxcy: bounding boxes in center-size form, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes in center-size form, a tensor of size (n_priors, 4)
    :return: encoded bounding boxes (offsets), a tensor of size (n_priors, 4)
"""
return torch.cat([(cxcy[:, :2] - priors_cxcy[:, :2]) / (priors_cxcy[:, 2:]) * 10,
torch.log(cxcy[:, 2:] / priors_cxcy[:, 2:]) * 5], 1)
To obtain the actual predicted bounding boxes we have to decode the process above (note: what the predictor actually outputs are the encoded offsets).
def gcxgcy_to_cxcy(gcxgcy: torch.Tensor, priors_cxcy: torch.Tensor) -> torch.Tensor:
"""
    Given the predicted offsets and the priors (one-to-one), decode the predicted bounding boxes in center-size form
    :param gcxgcy: encoded bounding boxes (i.e. offsets), e.g. the model output, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes in center-size form, a tensor of size (n_priors, 4)
    :return: decoded bounding boxes in center-size form, a tensor of size (n_priors, 4)
"""
    return torch.cat([gcxgcy[:, :2] / 10 * priors_cxcy[:, 2:] + priors_cxcy[:, :2],
torch.exp(gcxgcy[:, 2:] / 5) * priors_cxcy[:, 2:]], dim=1)
For this part, the ground-truth offsets only require cxcy to be the ground-truth boxes, but cxcy has to correspond one-to-one with the priors; this correspondence is exactly what we discuss next.
1.3.2 object class
0 stands for the background class and 1..n_classes for the object classes. The number and classes of objects differ from image to image, so we first assign an object to each prior and let that object's class determine the prior's class.
1.3.3 criterion
To assign classes to the priors we need a criterion that measures how well a prior matches a true bounding box.
The paper uses the Jaccard overlap (intersection over union, IoU).
The functions below compute it; note that the inputs are boxes in boundary form.
def find_intersection(set_1, set_2):
"""
Find the intersection of every box combination between two sets of boxes that are in boundary coordinates.
:param set_1: set 1, a tensor of dimensions (n1, 4)
:param set_2: set 2, a tensor of dimensions (n2, 4)
:return: intersection of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
"""
# PyTorch auto-broadcasts singleton dimensions
lower_bound = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0)) # (n1,n2,2)
upper_bound = torch.min(set_1[:, 2:].unsqueeze(1), set_2[:, 2:].unsqueeze(0)) # (n1,n2,2)
intersection_dims = torch.clamp(upper_bound - lower_bound, 0) # (n1, n2, 2)
return intersection_dims[:, :, 0] * intersection_dims[:, :, 1] # (n1, n2)
def find_jaccard_overlap(set_1, set_2):
"""
Find the Jaccard Overlap (IoU) of every box combination between two sets of boxes that are in boundary coordinates.
:param set_1: set 1, a tensor of dimensions (n1, 4)
:param set_2: set 2, a tensor of dimensions (n2, 4)
:return: Jaccard Overlap of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
"""
# Find intersections
intersection = find_intersection(set_1, set_2)
# Find areas of each box in both sets
areas_set_1 = (set_1[:, 2] - set_1[:, 0]) * (set_1[:, 3] - set_1[:, 1]) # (n1)
areas_set_2 = (set_2[:, 2] - set_2[:, 0]) * (set_2[:, 3] - set_2[:, 1]) # (n2)
# Find the union
# PyTorch auto-broadcasts singleton dimensions
union = areas_set_1.unsqueeze(1) + areas_set_2.unsqueeze(0) - intersection # (n1, n2)
return intersection / union # (n1, n2)
Suppose set_1 are the priors (8732, 4) and set_2 the true bounding boxes (n_objects_per_image, 4); we end up with a (8732, n_objects_per_image) tensor, i.e. the IoU of every prior with every object box in that image.
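A tiny hand-checkable example (my own sketch, not from the model):
a = torch.FloatTensor([[0., 0., 2., 2.]])  # a 2x2 box
b = torch.FloatTensor([[1., 1., 3., 3.],   # overlaps a in a 1x1 corner
                       [4., 4., 5., 5.]])  # no overlap with a
print(find_jaccard_overlap(a, b))          # tensor([[0.1429, 0.0000]]), i.e. 1 / (4 + 4 - 1)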
1.3.4 priors to ground truth
def label_prior(priors_cxcy, boxes, classes):
"""
    Assign a ground-truth target to every prior. Note that we do this for each image in a batch.
    Priors are fixed before training; boxes and classes come from the dataloader.
    :param priors_cxcy: the priors we created, in shape of (8732, 4); note that they are center-size and fractional coordinates
    :param boxes: tensor of the true objects' bounding boxes in the image; note that they are fractional coordinates
    :param classes: tensor of the true objects' class labels in the image
    :return: the class assigned to each prior (8732) and the encoded offsets of each prior (8732, 4)
"""
n_objects = boxes.size(0)
    # convert the priors from center-size form to boundary form for the IoU computation
    priors_xy = cxcy_to_xy(priors_cxcy)
    overlaps = find_jaccard_overlap(boxes, priors_xy)  # (n_objects, 8732)
    # for each prior, find the object with the largest overlap and assign that object to it (note: the object, not the class yet)
    overlap_per_prior, object_per_prior = overlaps.max(dim=0)  # (8732)
    # assigning purely by the largest overlap causes two problems:
    # 1. an object that is not the best match of any prior would never be assigned to a prior
    # 2. priors whose best overlap is below the threshold (0.5) should be assigned to the background class (class 0)
    # fix the first problem:
    _, prior_per_object = overlaps.max(dim=1)  # (n_objects), each value is the index of the best prior in (0, 8731)
    object_per_prior[prior_per_object] = torch.LongTensor(range(n_objects)).to(device)  # force the best prior of each object to be assigned to that object
    overlap_per_prior[prior_per_object] = 1.  # so that these priors survive the 0.5 threshold below
    # fix the second problem:
    class_per_prior = classes[object_per_prior]  # look up the true class label of the assigned object
    class_per_prior[overlap_per_prior < 0.5] = 0  # (8732)
    # encode, for every prior, the offset to the object box assigned to it (both in center-size form)
    offset_per_prior = cxcy_to_gcxgcy(xy_to_cxcy(boxes[object_per_prior]), priors_cxcy)  # (8732, 4)
return class_per_prior, offset_per_prior
Notice that every prior now corresponds to one ground-truth target; together they detect objects of different scales and positions.
label_prior() works on one image of a batch together with its ground-truth boxes and class labels (annotated in the xml files, from the dataloader); a simple for loop over the images of a batch gives the priors-to-ground-truth targets of each image, used for the loss computation (see 5.1).
2. Network architecture
The SSD network takes VGG-16 truncated before its FC layers as the base net, modifies some details of the base net and adds Conv6 and Conv7, and then appends extra convolutional layers after the base net.
(Note: for code readability, the SSD network is split into BaseNet and AuxiliaryConvolutions)
Because of its fully connected layers, the complete VGG-16 expects inputs of size (3, 224, 224); the author tweaks the network so it can take 300x300 inputs (the SSD300 model).
2.0 Conv4_3:
Propagating through vgg-16, the 300 x 300 input would reach Conv4_3 downsampled to 37 x 37, yet the size quoted here is 38 x 38. Only the pooling layers of vgg-16 downsample, so the difference comes from modifying maxpool3: the function computing its output size is switched from floor to ceiling.
self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
2.1 Maxpool5
Instead of reusing the original vgg-16 layer here, we use a maxpool with size=(3,3), stride=1, padding=1:
self.pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
2.2 Conv6 and Conv7: I hope I can explain this clearly enough ??
fc6-fc7: the (512, 7, 7) feature map is flattened and fed to fc6 (output 4096), then to fc7 (output 4096; the original classifier fc8 then maps these to the 1000 ImageNet classes). The author wants to build the kernels of Conv6 and Conv7 directly from the weights of fc6 and fc7.
2.2.1 First, let's sort out how convolutional layers and fully connected layers convert into each other
- Conv layer -> FC layer:
Conv to FC
As the figure shows, the weights of the equivalent FC layer form a sparse matrix taken from the kernel weights. Also, each pixel of an output channel of the feature map is the sum, over all in_channels, of the convolutions at the same position of the input (i.e. the shaded red cell is the sum of the per-layer results of convolving the stacked shaded blue cells (assuming several layers -_-) with several kernels), so out_channels controls the number of feature maps while in_channels and out_channels control the height and width of the FC weight matrix.
- FC layer -> conv layer: consider flattening the (512, 7, 7) input pixels -> 4096 outputs; the FC weight matrix is then (512*7*7, 4096).
Suppose the kernel has the same size as the image, i.e. (4096, 512, 7, 7). Following the convolution procedure, the result (within one output channel) is every pixel of every channel multiplied by its kernel weight and summed up, which is exactly the FC computation; the channel dimension now plays the role of the old feature dimension.
So conv6's kernel should be (4096, 512, 7, 7) and conv7's kernel should be (4096, 4096, 1, 1).
However, this still won't do ??: these filters are numerous and huge, and computationally very expensive, so the author subsamples the kernels.
2.2.2 Subsampling the convolution kernels
This step is actually very simple: we just subsample the kernel parameters along the out_channels, height and width dims.......
from collections.abc import Iterable
def decimate(tensor: torch.Tensor, m: Iterable) -> torch.Tensor:
"""
    Downsample some dimensions of a tensor; m lists the sampling interval for each dimension
    :param tensor: the tensor to be downsampled
    :param m: list of sampling intervals, one per dimension; None means that dimension is not downsampled
    :return: the downsampled tensor
"""
assert tensor.dim() == len(m)
for d in range(tensor.dim()):
if m[d] is not None:
tensor = tensor.index_select(dim=d, index=torch.arange(start=0, end=tensor.size(d), step=m[d]))
return tensor
The author sets the sampling rate of the height and width dims to 3 (keep one value in every three) and that of out_channels to 4, subsampling the original kernels.
We finally obtain the kernels of Conv6 and Conv7: (1024, 512, 3, 3) and (1024, 1024, 1, 1) respectively.
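A quick shape check of decimate() (a sketch; the values are random because only the shape matters here):
fc6_as_conv = torch.randn(4096, 512, 7, 7)               # fc6 weights reshaped into a conv kernel
conv6_weight = decimate(fc6_as_conv, m=[4, None, 3, 3])  # subsample out_channels by 4, height/width by 3
print(conv6_weight.shape)                                # torch.Size([1024, 512, 3, 3])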
2.2.3 Atrous convolution
Atrous convolution (dilated convolution, also known as convolution with holes......) really targets neighbouring pixels (neighbouring pixels usually carry largely redundant information). To get a larger receptive field without pooling and downsampling, we insert holes into the convolution's input sampling grid (pooling throws away image information, whereas atrous convolution does not: a given output pixel simply skips the neighbouring input pixels, and those "skipped" neighbours still get convolved with the kernel when other output pixels are computed...... enough talking, the picture says it better ??)
The figure comes from vdumoulin/conv_arithmetic (you are probably familiar with this series of figures; the shaded part is where the convolution is computed ??)
It is easy to see that every pixel of the input space is indeed used (nothing is discarded the way pooling discards it) while the receptive field is enlarged.
2.2.4 The atrous trick and the subsampled kernels
In the paper, conv6's output is still 19x19 and atrous convolution is used.
After the subsampling described above, the feature map that originally would have met the 7x7 kernel now meets a kernel with holes in it, so the natural choice would be to skip 3 pixels while convolving (dilation=3). The author's repository, however, uses dilation=6; this is probably because the modified maxpool5 no longer halves the output size, so the dilation has to be doubled.
self.conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6) # atrous convolution
self.conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
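To convince yourself that the spatial size is preserved, push a dummy tensor through conv6 (a sketch, assuming conv6 receives the 19 x 19 output of the modified maxpool5):
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
x = torch.randn(1, 512, 19, 19)  # output of the modified maxpool5
print(conv6(x).shape)            # torch.Size([1, 1024, 19, 19])
# output size = (19 + 2*6 - (6*(3-1) + 1)) + 1 = 19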
Next, update base_net with the weights and biases of the original fully connected layers:
# this part can be defined in class BaseNet as a function for init.
# get state_dict which only contains params
state_dict = base_net.state_dict() # base net is instance of BaseNet
pretrained_state_dict = torchvision.models.vgg16(pretrained=True).state_dict()
# fc6
conv_fc_weight = pretrained_state_dict['classifier.0.weight'].view(4096, 512, 7, 7) # (4096, 512, 7, 7)
conv_fc_bias = pretrained_state_dict['classifier.0.bias'] # (4096)
state_dict['conv6.weight'] = decimate(conv_fc_weight, m=[4, None, 3, 3])  # (1024, 512, 3, 3)
state_dict['conv6.bias'] = decimate(conv_fc_bias, m=[4])  # (1024)
# fc7: in the pretrained model, fc7 is simply named classifier.3
conv_fc7_weight = pretrained_state_dict['classifier.3.weight'].view(4096, 4096, 1, 1) # (4096, 4096, 1, 1)
conv_fc7_bias = pretrained_state_dict['classifier.3.bias'] # (4096)
state_dict['conv7.weight'] = decimate(conv_fc7_weight, m=[4, 4, None, None]) # (1024, 1024, 1, 1)
state_dict['conv7.bias'] = decimate(conv_fc7_bias, m=[4]) # (1024)
base_net.load_state_dict(state_dict)
......this headache-inducing part is finally over ??
2.3 The remaining auxiliary convolution layers:
These are all layers the author adds to extract features at larger scales; easy enough to understand, and the 1x1 conv layers have their uses (something like compressing the feature maps before extracting further features?) ??
class AuxiliaryConvolutions(nn.Module):
"""
Additional convolutions to produce higher-level feature maps.
"""
def __init__(self):
super(AuxiliaryConvolutions, self).__init__()
# Auxiliary convolutions on top of the VGG base
self.conv8_1 = nn.Conv2d(1024, 256, kernel_size=1, padding=0)
self.conv8_2 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)
self.conv9_1 = nn.Conv2d(512, 128, kernel_size=1, padding=0)
self.conv9_2 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
self.conv10_1 = nn.Conv2d(256, 128, kernel_size=1, padding=0)
self.conv10_2 = nn.Conv2d(128, 256, kernel_size=3, padding=0)
self.conv11_1 = nn.Conv2d(256, 128, kernel_size=1, padding=0)
self.conv11_2 = nn.Conv2d(128, 256, kernel_size=3, padding=0)
# Initialize convolutions' parameters
for c in self.children():
if isinstance(c, nn.Conv2d):
nn.init.xavier_normal_(c.weight)
nn.init.constant_(c.bias, 0.)
2.4 multi-level feature maps:
As the figure shows, the feature maps chosen for multi-scale prediction are conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2 (both low-level and high-level feature maps); we just return these feature maps from forward()
BaseNet: forward returns conv4_3_features, conv7_features
AuxiliaryConvolutions: forward returns conv8_2_features, conv9_2_features, conv10_2_features, conv11_2_features
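A minimal sketch of that forward pass (my own wording; I assume ReLU activations between these layers, as in the reference implementation):
import torch.nn.functional as F

# inside class AuxiliaryConvolutions
def forward(self, conv7_features):
    # conv7_features: (N, 1024, 19, 19), output of the base net's conv7
    x = F.relu(self.conv8_1(conv7_features))      # (N, 256, 19, 19)
    conv8_2_features = F.relu(self.conv8_2(x))    # (N, 512, 10, 10)
    x = F.relu(self.conv9_1(conv8_2_features))    # (N, 128, 10, 10)
    conv9_2_features = F.relu(self.conv9_2(x))    # (N, 256, 5, 5)
    x = F.relu(self.conv10_1(conv9_2_features))   # (N, 128, 5, 5)
    conv10_2_features = F.relu(self.conv10_2(x))  # (N, 256, 3, 3)
    x = F.relu(self.conv11_1(conv10_2_features))  # (N, 128, 3, 3)
    conv11_2_features = F.relu(self.conv11_2(x))  # (N, 256, 1, 1)
    return conv8_2_features, conv9_2_features, conv10_2_features, conv11_2_features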
2.5 predictor
The multi-level feature maps are fed into their own predictors, which predict offsets and classes separately; the predictors of all levels share a similar structure: kernel_size=3, padding=1.
Note that the predicted offsets are encoded w.r.t. the priors of that feature map (see 1.3), and the class predictor has to produce a score for every class.
def loc_predictor(in_channels, num_priors):
"""
    Box prediction layer: predicts 4 offsets for each prior at every pixel of the input space
    :param in_channels: number of channels of the input space
    :param num_priors: num_priors priors are generated around each position
    :return: the conv layer that predicts offsets
"""
return nn.Conv2d(in_channels, num_priors * 4, kernel_size=3, padding=1)
def cls_predictor(in_channels, num_priors, num_classes):
"""
    Class prediction layer: predicts a score for every class for each prior at every pixel of the input space
    It uses a convolution that keeps the input height and width, so output and input positions correspond one-to-one on the feature map
    :param in_channels: number of channels of the input space
    :param num_priors: num_priors priors are generated around each position
    :param num_classes: number of object classes
    :return: the conv layer for class prediction
"""
return nn.Conv2d(in_channels, num_priors * num_classes, kernel_size=3, padding=1)
Priors are generated at every pixel of a feature map, and the predictor's output keeps the w and h of its input, so every output pixel corresponds to an input pixel; naturally, the predicted offsets are the encoded offsets of the corresponding priors, with out_channels now acting as the feature dimension. Because different feature maps have different w, h and num_priors, we need to flatten the spatial dims of each output before concatenating them all. Class prediction follows the same idea as offset prediction; only the final feature dimension (output channels) differs.
- For training, the number of prediction elements taken from the chosen feature maps has to line up exactly with the number of priors (a one-to-one correspondence)
Finally the predictions of all feature maps are concatenated together.
class PredictionConvolution(nn.Module):
"""
Convolutions to predict class scores and bounding boxes
"""
def __init__(self, n_classes):
"""
:param n_class: number of different types of objects
"""
self.n_classes = n_classes
super(PredictionConvolution, self).__init__()
# Number of priors, as we showing before ,at per position in each feature map
n_boxes = {'conv4_3': 4,
'conv7': 6,
'conv8_2': 6,
'conv9_2': 6,
'conv10_2': 4,
'conv11_2': 4}
self.convs = ['conv4_3', 'conv7', 'conv8_2', 'conv9_2', 'conv10_2', 'conv11_2']
for name, ic in zip(self.convs, [512, 1024, 512, 256, 256, 256]):
setattr(self, 'cls_%s' % name, cls_predictor(ic, n_boxes[name], n_classes))
setattr(self, 'loc_%s' % name, loc_predictor(ic, n_boxes[name]))
# Initialize convolutions' parameters
for c in self.children():
if isinstance(c, nn.Conv2d):
nn.init.xavier_normal_(c.weight)
nn.init.constant_(c.bias, 0.)
def _apply(self, x: torch.Tensor, conv: nn.Conv2d, num_features: int):
"""
Apply forward calculation for each conv2d with respect to specific feature map
:param x: input tensor
        :param conv: the prediction convolution to apply
        :param num_features: size of the last output dim: 4 for the location predictor, n_classes for the class predictor
:return: locations and class scores
"""
x = conv(x).permute(0, 2, 3, 1).contiguous()
return x.view(x.size(0), -1, num_features)
def forward(self, *args):
# args are feature maps needed for prediction
assert len(args) == len(self.convs)
locs = []
classes_scores = []
for name, x in zip(self.convs, args):
classes_scores.append(self._apply(x, getattr(self, 'cls_%s' %name), self.n_classes))
locs.append(self._apply(x, getattr(self, 'loc_%s' % name), 4))
locs = torch.cat(locs, dim=1) # (N, 8732, 4)
classes_scores = torch.cat(classes_scores, dim=1) # (N, 8732, n_classes)
return locs, classes_scores
2.6 SSD300
Putting BaseNet, AuxiliaryConvolutions and PredictionConvolution together gives the SSD300 model, for example as sketched below.
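A minimal sketch of how I glue them together (assuming BaseNet.forward returns the conv4_3 and conv7 feature maps as described above; the conv4_3 L2-rescaling mentioned in the notes at the end is left out for brevity):
class SSD300(nn.Module):
    def __init__(self, n_classes):
        super(SSD300, self).__init__()
        self.n_classes = n_classes
        self.base = BaseNet()
        self.aux_convs = AuxiliaryConvolutions()
        self.pred_convs = PredictionConvolution(n_classes)
        # the fixed priors from 1.1, (8732, 4), kept on the model for loss computation and detection
        self.priors_cxcy = create_prior_boxes(fmap_dims, fmap_dims, scales, aspect_ratios)

    def forward(self, image):
        # image: (N, 3, 300, 300)
        conv4_3_features, conv7_features = self.base(image)
        conv8_2_features, conv9_2_features, conv10_2_features, conv11_2_features = self.aux_convs(conv7_features)
        locs, classes_scores = self.pred_convs(conv4_3_features, conv7_features, conv8_2_features,
                                               conv9_2_features, conv10_2_features, conv11_2_features)
        return locs, classes_scores  # (N, 8732, 4), (N, 8732, n_classes)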
3. Processing the training data
Data augmentation here touches not only the image itself but also the ground-truth boxes, so we cannot directly use the classes packaged in torchvision.transforms; we have to write them by hand ??
For the augmentations applied with the probability of 0.5 mentioned in the paper, simply check whether random.random() is smaller than 0.5, as in the snippet below.
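Concretely, something like this inside the transform pipeline (a sketch; flip() is the function defined in 3.2 below):
import random

if random.random() < 0.5:
    image, boxes = flip(image, boxes)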
3.1 Random crop
The random crop is the main data augmentation described in the paper
def random_crop(image: torch.Tensor, boxes: torch.Tensor, labels: torch.Tensor):
"""
    Random crop: helps the network learn objects at larger scales, but some objects may be cut away entirely
    :param image: image, a tensor of dimensions (3, original_h, original_w)
    :param boxes: ground-truth bounding boxes in boundary form, a tensor of dimensions (n_objects, 4)
    :param labels: ground-truth class labels, a tensor of dimensions (n_objects)
    :return: randomly cropped image, updated bounding boxes, updated class labels
"""
original_width = image.size(2)
original_height = image.size(1)
while True:
        # 'None' means no cropping; 0 means cropping with no overlap constraint; [.1, .3, .5, .7, .9] are the minimum overlaps described in the paper
min_overlap = random.choice([0., .1, .3, .5, .7, .9, None])
if min_overlap is None:
return image, boxes, labels
        # try the chosen minimum overlap up to 50 times (not mentioned in the paper but used in the author's repo); if none works, loop again and pick a new minimum overlap
for _ in range(50):
min_scale = 0.3
            # the paper samples patch scales in [.1, 1], but the author's repo uses [.3, 1]
            # random.uniform(a, b) -> closed interval [a, b]
new_width = int(original_width * random.uniform(min_scale, 1))
new_height = int(original_height * random.uniform(min_scale, 1))
            # the paper says the aspect ratio of the sampled patch should stay within [0.5, 2]
if not .5 <= new_height / new_width <= 2:
continue
            # pick where to crop
            # random.randint(a, b) -> closed interval [a, b]
left = random.randint(0, original_width - new_width)
top = random.randint(0, original_height - new_height)
right = left + new_width
bottom = top + new_height
crop_bounding = torch.FloatTensor([left, top, right, bottom])
            # compute the IoU between the crop and the ground-truth boxes
over_lap = find_jaccard_overlap(crop_bounding.unsqueeze(0), boxes).squeeze(0) # (n_objects)
# 論文中提及饱溢,與所有目標的交并比應該> min_overlap
if over_lap.max().item() < min_overlap:
continue
cropped_image = image[:, top:bottom, left:right]
            # criterion for keeping an object: is the center of its true bounding box inside the crop?
            box_centers = (boxes[:, :2] + boxes[:, 2:]) / 2.  # (n_objects, 2)
            center_in_cropped_image = (box_centers[:, 0] > left) * (box_centers[:, 0] < right) * (
                    box_centers[:, 1] > top) * (box_centers[:, 1] < bottom)  # (n_objects)
            # if not a single object's center lies inside the cropped image, try again
            if not center_in_cropped_image.any():
                continue
            # discard objects that fail the criterion
            new_boxes = boxes[center_in_cropped_image]
            new_labels = labels[center_in_cropped_image]
# 計算剪切后圖像中邊界框的位置
            # clip the true left/top edges to the crop's left/top edges (keep the larger of the two)
            new_boxes[:, :2] = torch.max(new_boxes[:, :2], crop_bounding[:2])
            new_boxes[:, :2] -= crop_bounding[:2]
            # clip the true right/bottom edges to the crop's right/bottom edges (keep the smaller of the two)
new_boxes[:, 2:] = torch.min(new_boxes[:, 2:], crop_bounding[2:])
new_boxes[:, 2:] -= crop_bounding[:2]
return cropped_image, new_boxes, new_labels
3.2 Horizontal flip
This one is easy; it is just that the ground-truth boxes, unlike the image itself, need some extra handling
def flip(image, boxes):
"""
Flip image horizontally.
    :param image: a PIL Image (torchvision functions are used, so a PIL Image is required)
    :param boxes: ground-truth bounding boxes in boundary form, a tensor of dimensions (n_objects, 4)
    :return: horizontally flipped image, updated bounding boxes
"""
# Flip image
new_image = torchvision.transforms.functional.hflip(image)
# Flip boxes
new_boxes = boxes
new_boxes[:, 0] = image.width - (boxes[:, 0] + 1)
new_boxes[:, 2] = image.width - (boxes[:, 2] + 1)
new_boxes = new_boxes[:, [2, 1, 0, 3]]
return new_image, new_boxes
3.3 Resize
The SSD300 model needs the training images resized to 300 x 300; we also convert the ground-truth boxes to fractional (percent) coordinates here
def resize(image, boxes, size=(300, 300), return_percent_coords=True):
"""
Resize image. For the SSD300, resize to (300, 300).
Since percent/fractional coordinates are calculated for the bounding boxes (w.r.t image dimensions) in this process,
you may choose to retain them.
:param image: image, a PIL Image
:param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
:param size: resize to specific size
:param return_percent_coords: whether to return new bounding box coordinates in form of percent coordinates
:return: resized image, updated bounding box coordinates (or fractional coordinates, in which case they remain the same)
"""
# Resize image
new_image = transforms.functional.resize(image, size)
# Resize bounding boxes
old_size = torch.FloatTensor([image.width, image.height, image.width, image.height]).unsqueeze(0)
# resize means percent coordinates will not change for only augment or shrink
new_boxes = boxes / old_size # percent coordinates means same even if different size
if not return_percent_coords:
new_size = torch.FloatTensor([size[0], size[1], size[0], size[1]]).unsqueeze(0)
new_boxes = new_boxes * new_size
return new_image, new_boxes
3.5 Expand
Since the model is not very good at detecting small-scale objects, we "zoom out" the training images here to strengthen detection of small objects
The overall procedure is quite similar to resize, except that the new canvas is larger, the original image is placed somewhere inside it, and the remaining blank area is filled
The recommended fill value is the per-channel mean (see 3.6)
Since the new image is larger than the original one, the ground-truth boxes only need to be shifted by [left offset, top offset, left offset, top offset]; a minimal sketch follows.
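I only sketch it here (assuming the image is already a (3, H, W) tensor, filler is the per-channel mean from 3.6, and max_scale=4 as in the common re-implementation):
def expand(image, boxes, filler, max_scale=4):
    # place the image on a larger canvas filled with the channel means ("zoom out"), to help detect small objects
    original_h = image.size(1)
    original_w = image.size(2)
    scale = random.uniform(1, max_scale)
    new_h = int(scale * original_h)
    new_w = int(scale * original_w)
    # canvas filled with the per-channel filler values
    new_image = torch.ones((3, new_h, new_w), dtype=torch.float) * torch.FloatTensor(filler).unsqueeze(1).unsqueeze(1)
    # drop the original image somewhere on the canvas
    left = random.randint(0, new_w - original_w)
    top = random.randint(0, new_h - original_h)
    new_image[:, top:top + original_h, left:left + original_w] = image
    # shift the boxes by [left offset, top offset, left offset, top offset]
    new_boxes = boxes + torch.FloatTensor([left, top, left, top]).unsqueeze(0)
    return new_image, new_boxes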
3.6 Normalization
The input is first scaled to [0, 1]; the pretrained model then also expects this normalized input to be standardized. This page shows the exact preprocessing required by the torchvision.models pretrained models.
mean = [0.485, 0.456, 0.406] # RGB channels
std = [0.229, 0.224, 0.225] # RGB channels
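In code this is simply (a sketch using torchvision.transforms.functional, applied after the image has been converted to a [0, 1] tensor):
import torchvision.transforms.functional as FT

image = FT.to_tensor(image)                      # PIL Image -> float tensor in [0, 1]
image = FT.normalize(image, mean=mean, std=std)  # standardize with the statistics above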
4. Dataset and DataLoader
For the Dataset we subclass torch.utils.data.Dataset ourselves and apply the processing of section 3 to the image, ground-truth boxes and class labels inside it
The Dataset returns the image, the ground-truth boxes and the class labels
However, a problem appears when the DataLoader assembles batches:
the number of objects differs between images, so boxes and labels have different lengths per image and cannot be stacked into batches directly
So we pass a function to the DataLoader's collate_fn= argument (just the function name) and let it assemble the output
def collate_fn(batch):
"""
This describes how to combine these tensors of different sizes. We use lists.
:param batch: an iterable of N sets from __getitem__()
    :return: a tensor of images and lists of varying-size tensors of bounding boxes and labels
"""
images = list()
boxes = list()
labels = list()
for b in batch:
images.append(b[0])
boxes.append(b[1])
labels.append(b[2])
images = torch.stack(images, dim=0)
    return images, boxes, labels  # tensor (N, 3, 300, 300), 2 lists of N tensors each
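The DataLoader is then built as usual; just remember to pass collate_fn (the batch size and worker count below are placeholders, use whatever your hardware allows):
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True,
                                           collate_fn=collate_fn, num_workers=4, pin_memory=True)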
5. Training
5.1 Loss Function
location_loss=torch.nn.L1Loss()
confidence_loss=nn.CrossEntropyLoss(reduction='none')
5.2 Hard negative mining
The negatives (background) vastly outnumber the positives in the training data, making it badly unbalanced, so we use hard negative mining here: keep only the negatives with the largest loss, so that positives : negatives = 1 : 3
def calculate_loss(priors_cxcy, pred_locs, pred_scores, boxes, labels, loc_loss, conf_loss, alpha=1):
"""
    Compute the MultiBox loss with hard negative mining
    :param priors_cxcy: priors in center-size form
    :param pred_locs: predicted offsets for one batch
    :param pred_scores: predicted class scores for one batch
    :param boxes: ground-truth bounding boxes, from a batch of the dataloader
    :param labels: ground-truth class labels, from a batch of the dataloader
    :param loc_loss: nn.L1Loss()
    :param conf_loss: nn.CrossEntropyLoss(reduction='none')
    :param alpha: weight of the location loss in the paper, 1 by default
    :return: the multibox loss, a scalar
"""
n_priors = priors_cxcy.size(0)
batch_size = pred_locs.size(0)
n_classes = pred_scores.size(2)
    assert n_priors == pred_locs.size(1) == pred_scores.size(1)
true_locs = torch.zeros((batch_size, n_priors, 4), dtype=torch.float).to(device) # (N, 8732, 4)
true_classes = torch.zeros((batch_size, n_priors), dtype=torch.long).to(device) # (N, 8732)
    # assign ground-truth targets to every prior, one image at a time
for i in range(batch_size):
cls, loc = label_prior(priors_cxcy, boxes[i], labels[i])
true_locs[i] = loc
true_classes[i] = cls
positive_priors = (true_classes != 0) # (N, 8732)
    # location loss: computed over positive (non-background) priors only
    loss_of_loc = loc_loss(pred_locs[positive_priors], true_locs[positive_priors])
    # confidence loss
    # as in the paper, pick negatives so that negatives : positives = 3 : 1
    n_hard_negative = 3 * positive_priors.sum(dim=1)  # (N)
    # first compute the confidence loss over all priors, positive and negative, so we don't have to track positions per image
    # CrossEntropyLoss(reduction='none') keeps the per-element losses instead of summing or averaging them
    loss_of_conf_all = conf_loss(pred_scores.view(-1, n_classes), true_classes.view(-1))  # (N * 8732)
    loss_of_conf_all = loss_of_conf_all.view(batch_size, n_priors)  # (N, 8732)
    # we already know the loss of every positive prior
    loss_of_conf_pos = loss_of_conf_all[positive_priors]  # (sum(n_positives))
    loss_of_conf_neg = loss_of_conf_all.clone()  # (N, 8732)
    loss_of_conf_neg[positive_priors] = 0  # (N, 8732), so positives can never be among the top n_hard_negatives
    loss_of_conf_neg, _ = loss_of_conf_neg.sort(dim=1, descending=True)  # sort the negative losses in descending order
    neg_ranks = torch.LongTensor(range(n_priors)).unsqueeze(0).expand_as(loss_of_conf_neg).to(device)  # (N, 8732), rank of each entry within its row
hard_negatives = (neg_ranks < n_hard_negative.unsqueeze(1)) # (N, 8732)
loss_of_conf_hard_neg = loss_of_conf_neg[hard_negatives] # (sum(n_hard_negatives)
# As in the paper, averaged over positive priors only, although computed over both positive and hard-negative priors
loss_of_conf = (loss_of_conf_pos.sum() + loss_of_conf_hard_neg.sum()) / positive_priors.sum().float() # (), scalar
# TOTAL LOSS
return loss_of_conf + alpha * loss_of_loc
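For completeness, a skeleton of the training loop (just a sketch: model is the SSD300 from 2.6 with its priors stored as model.priors_cxcy, and the SGD hyper-parameters are my own choice, not necessarily the authors' schedule):
model = SSD300(n_classes=21).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
location_loss = nn.L1Loss()
confidence_loss = nn.CrossEntropyLoss(reduction='none')
n_epochs = 45

for epoch in range(n_epochs):
    model.train()
    for images, boxes, labels in train_loader:
        images = images.to(device)
        boxes = [b.to(device) for b in boxes]
        labels = [l.to(device) for l in labels]
        pred_locs, pred_scores = model(images)  # (N, 8732, 4), (N, 8732, n_classes)
        loss = calculate_loss(model.priors_cxcy, pred_locs, pred_scores,
                              boxes, labels, location_loss, confidence_loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()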
6. Object detection
6.1 Non-maximum suppression
When finally detecting objects we do not want to output too many predicted boxes (at this point they overlap heavily), so we apply non-maximum suppression: boxes considered duplicates (IoU between two predictions above a given threshold) are removed and only the box with the highest confidence is kept
def none_max_suppress(priors_cxcy, pred_locs, pred_scores, min_score, max_overlap, top_k):
"""
    Perform non-maximum suppression
    :param priors_cxcy: priors in center-size form
    :param pred_locs: predicted offsets, output of the predictor
    :param pred_scores: predicted class scores, output of the predictor
    :param min_score: minimum score for a prediction to be kept
    :param max_overlap: maximum IoU above which a prediction is suppressed
    :param top_k: keep at most top_k detections
    :return: suppressed bounding boxes in boundary form, class labels and scores
"""
    batch_size = pred_locs.size(0)
    n_priors = priors_cxcy.size(0)
n_classes = pred_scores.size(2)
pred_scores = torch.softmax(pred_scores, dim=2) # (batch_size, n_priors, n_classes)
assert n_priors == pred_scores.size(1) == pred_locs.size(1)
boxes_all_image = []
scores_all_image = []
labels_all_image = []
for i in range(batch_size):
        # decode the predicted offsets into bounding boxes in boundary form
boxes = cxcy_to_xy(gcxgcy_to_cxcy(pred_locs[i], priors_cxcy)) # (n_priors, 4)
boxes_per_image = []
scores_per_image = []
labels_per_image = []
        for c in range(1, n_classes):
            class_scores = pred_scores[i, :, c]  # (8732)
            score_above_min = class_scores > min_score
            n_score_above_min = score_above_min.sum().item()
            if n_score_above_min == 0:
                continue
            # keep only predictions with score > min_score
            class_scores = class_scores[score_above_min]
            class_boxes = boxes[score_above_min]
            # sort by detection confidence
            class_scores, sorted_ind = class_scores.sort(dim=0, descending=True)  # (n_score_above_min)
            class_boxes = class_boxes[sorted_ind]  # (n_score_above_min, 4)
            # non-maximum suppression by IoU
            overlap = find_jaccard_overlap(class_boxes, class_boxes)  # (n_score_above_min, n_score_above_min)
            # mask recording whether each box is suppressed; 1 means suppressed
            suppress = torch.zeros((n_score_above_min), dtype=torch.uint8).to(device)
            for b_id in range(n_score_above_min):
                # skip boxes already marked as suppressed
                if suppress[b_id] == 1:
                    continue
                # suppress boxes whose IoU with the current box is > max_overlap, keeping previously suppressed boxes suppressed
                suppress = torch.max(suppress, (overlap[b_id] > max_overlap).byte())
                # never suppress the current box itself
                suppress[b_id] = 0
            # for each class, store only the unsuppressed predictions
            boxes_per_image.append(class_boxes[(1 - suppress).bool()])
            scores_per_image.append(class_scores[(1 - suppress).bool()])
            labels_per_image.append(torch.LongTensor([c] * (1 - suppress).sum().item()).to(device))
        # if no object of any class was kept in this image, store a single background placeholder
if len(labels_per_image) == 0:
boxes_per_image.append(torch.FloatTensor([0, 0, 1, 1]).to(device))
labels_per_image.append(torch.LongTensor([0]).to(device))
scores_per_image.append(torch.FloatTensor([0]).to(device))
boxes_per_image = torch.cat(boxes_per_image, dim=0) # (n_objects, 4)
scores_per_image = torch.cat(scores_per_image, dim=0) # (n_objects)
labels_per_image = torch.cat(labels_per_image, dim=0) # (n_objects)
n_object = boxes_per_image.size(0)
        # keep only the top_k objects, sorted by confidence
if n_object > top_k:
scores_per_image, sorted_ind = scores_per_image.sort(dim=0, descending=True)
scores_per_image = scores_per_image[:top_k]
boxes_per_image = boxes_per_image[sorted_ind][:top_k]
labels_per_image = labels_per_image[sorted_ind][:top_k]
boxes_all_image.append(boxes_per_image)
scores_all_image.append(scores_per_image)
labels_all_image.append(labels_per_image)
    return boxes_all_image, labels_all_image, scores_all_image  # three lists of length batch_size
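At inference time I feed the model output straight into the suppression above; the thresholds below are values commonly used for SSD300, so treat them as an assumption and tune as needed:
model.eval()
with torch.no_grad():
    pred_locs, pred_scores = model(images)  # images: (N, 3, 300, 300)
    det_boxes, det_labels, det_scores = none_max_suppress(model.priors_cxcy, pred_locs, pred_scores,
                                                          min_score=0.2, max_overlap=0.45, top_k=200)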
Extras: a few things worth noting
We concatenate the outputs of all feature maps into a single tensor. The conv4_3 feature map sits at a low level, so its feature values are much larger than those of the higher layers (downsampling shrinks the feature responses), so we can normalize the conv4_3 feature map (e.g. L2 normalization) and then rescale its responses by a factor the network learns on its own (see the sketch after these notes). I think Batch Normalization would work just as well.
Indexing a multi-dimensional tensor with a dtype=torch.bool or torch.uint8 tensor (uint8 indexing is deprecated from roughly 1.3.0 on) gives a flattened result (note: this is when the bool tensor matches the original tensor position for position; if it does not, the remaining dims are kept (even if only one sub-array is left), whereas slicing would squeeze a dim with only one element left), e.g.
x = torch.rand((2, 3, 4))
y = x > 0.5  # y in shape of (2, 3, 4), roughly half True and half False
print(x[y].shape)  # a flattened tensor, e.g. torch.Size([12]) when exactly half the values are > 0.5
Some tricks to speed up training:
torch.backends.cudnn.benchmark = True
Set pin_memory=True in the DataLoader to use page-locked (pinned) memory (it is never swapped out, which speeds things up); this needs enough memory. For more details see: https://blog.csdn.net/tfcy694/article/details/83270701
I did not use an eval function here to measure how well the model really does; mAP is a good choice. When saving the best model you can keep the parameters whenever the eval metric improves, and the same metric can be used to stop training early.
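Here is what the conv4_3 rescaling from the first note can look like (a sketch; the initial value of 20 follows the ParseNet/SSD convention):
# in SSD300.__init__: one learnable scale factor per conv4_3 channel
self.rescale_factors = nn.Parameter(torch.FloatTensor(1, 512, 1, 1))
nn.init.constant_(self.rescale_factors, 20.)

# in SSD300.forward, right after obtaining conv4_3_features (N, 512, 38, 38)
norm = conv4_3_features.pow(2).sum(dim=1, keepdim=True).sqrt()  # L2 norm over the channel dim, (N, 1, 38, 38)
conv4_3_features = conv4_3_features / norm * self.rescale_factors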
I am new at this, please follow along ??; everything here was typed by hand, which was not easy, and discussion is welcome
Please credit the source when reposting.