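YOLOv3 analysis
I have read the YOLO series of papers and have run the source packages, so I set aside some time to map my understanding of the paper onto the source code one-to-one and deepen that understanding. Paper: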
https://pjreddie.com/darknet/yolo/
The source I read is the MXNet gluon-cv implementation. Code: https://github.com/dmlc/gluon-cv
yolov3 network
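darknet53 has 53 layers in total. Dropping the final FC layer leaves 52 convolutions, which serve as the backbone network. The backbone is split into three stages with an FPN-like structure: convolutions 1-26 are stage1, convolutions 27-43 are stage2, and convolutions 44-52 are stage3. The shallower convolutions (26) have a smaller receptive field and are responsible for detecting small objects; the deeper convolutions (52) have a larger receptive field and detect large objects more easily. The graph of the whole network is at the end of the article.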
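As a small illustration of the three stages, the sketch below computes the feature-map size each stage produces for a square input. The strides 8, 16 and 32 are not stated in the text; they are read off the layer summary in this post (416 -> 52, 26, 13), and the name `feature_map_sizes` is made up for this example.

```python
# Strides inferred from the layer summary (416 -> 52 -> 26 -> 13); an assumption
# of this sketch, not a value taken from the gluon-cv source.
STAGE_STRIDES = {"stage1": 8, "stage2": 16, "stage3": 32}


def feature_map_sizes(input_size):
    """Return the side length of the square feature map produced by each stage."""
    return {name: input_size // stride for name, stride in STAGE_STRIDES.items()}


sizes = feature_map_sizes(416)
print(sizes)
```

The finest grid (stage1) is the one used for small objects, and the coarsest grid (stage3) is the one used for large objects.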
yolov3 summary
conv# layer# type filters size / stride input -> output BFLOPs
0 0 conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32 0.299 BFLOPs
1 1 conv 64 3 x 3 / 2 416 x 416 x 32 -> 208 x 208 x 64 1.595 BFLOPs
2 2 conv 32 1 x 1 / 1 208 x 208 x 64 -> 208 x 208 x 32 0.177 BFLOPs
3 3 conv 64 3 x 3 / 1 208 x 208 x 32 -> 208 x 208 x 64 1.595 BFLOPs
4 res 1 208 x 208 x 64 -> 208 x 208 x 64
4 5 conv 128 3 x 3 / 2 208 x 208 x 64 -> 104 x 104 x 128 1.595 BFLOPs
5 6 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BFLOPs
6 7 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BFLOPs
8 res 5 104 x 104 x 128 -> 104 x 104 x 128
7 9 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BFLOPs
8 10 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BFLOPs
11 res 8 104 x 104 x 128 -> 104 x 104 x 128
9 12 conv 256 3 x 3 / 2 104 x 104 x 128 -> 52 x 52 x 256 1.595 BFLOPs
10 13 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
11 14 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
15 res 12 52 x 52 x 256 -> 52 x 52 x 256
12 16 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
13 17 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
18 res 15 52 x 52 x 256 -> 52 x 52 x 256
14 19 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
15 20 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
21 res 18 52 x 52 x 256 -> 52 x 52 x 256
16 22 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
17 23 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
24 res 21 52 x 52 x 256 -> 52 x 52 x 256
18 25 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
19 26 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
27 res 24 52 x 52 x 256 -> 52 x 52 x 256
20 28 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
21 29 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
30 res 27 52 x 52 x 256 -> 52 x 52 x 256
22 31 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
23 32 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
33 res 30 52 x 52 x 256 -> 52 x 52 x 256
24 34 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
25 35 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
36 res 33 52 x 52 x 256 -> 52 x 52 x 256
fpn1---------------------------------------------------------
26 37 conv 512 3 x 3 / 2 52 x 52 x 256 -> 26 x 26 x 512 1.595 BFLOPs
27 38 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
28 39 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
40 res 37 26 x 26 x 512 -> 26 x 26 x 512
29 41 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
30 42 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
43 res 40 26 x 26 x 512 -> 26 x 26 x 512
31 44 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
32 45 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
46 res 43 26 x 26 x 512 -> 26 x 26 x 512
33 47 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
34 48 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
49 res 46 26 x 26 x 512 -> 26 x 26 x 512
35 50 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
36 51 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
52 res 49 26 x 26 x 512 -> 26 x 26 x 512
37 53 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
38 54 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
55 res 52 26 x 26 x 512 -> 26 x 26 x 512
39 56 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
40 57 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
58 res 55 26 x 26 x 512 -> 26 x 26 x 512
41 59 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
42 60 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
61 res 58 26 x 26 x 512 -> 26 x 26 x 512
fpn2------------------------------------------------------------
43 62 conv 1024 3 x 3 / 2 26 x 26 x 512 -> 13 x 13 x1024 1.595 BFLOPs
44 63 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
45 64 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
65 res 62 13 x 13 x1024 -> 13 x 13 x1024
46 66 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
47 67 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
68 res 65 13 x 13 x1024 -> 13 x 13 x1024
48 69 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
49 70 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
71 res 68 13 x 13 x1024 -> 13 x 13 x1024
50 72 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
51 73 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
74 res 71 13 x 13 x1024 -> 13 x 13 x1024
fpn3---------------------------------------------------------------
0 75 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
1 76 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
2 77 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
3 78 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
4 79 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
5 80 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
58 81 conv 75 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 75 0.026 BFLOPs
82 yolo
83 route 79
59 84 conv 256 1 x 1 / 1 13 x 13 x 512 -> 13 x 13 x 256 0.044 BFLOPs
85 upsample 2x 13 x 13 x 256 -> 26 x 26 x 256
86 route 85 61
60 87 conv 256 1 x 1 / 1 26 x 26 x 768 -> 26 x 26 x 256 0.266 BFLOPs
88 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
89 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
90 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
91 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
92 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
93 conv 75 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 75 0.052 BFLOPs
94 yolo
95 route 91
96 conv 128 1 x 1 / 1 26 x 26 x 256 -> 26 x 26 x 128 0.044 BFLOPs
97 upsample 2x 26 x 26 x 128 -> 52 x 52 x 128
98 route 97 36
99 conv 128 1 x 1 / 1 52 x 52 x 384 -> 52 x 52 x 128 0.266 BFLOPs
100 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
101 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
102 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
103 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
104 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
105 conv 75 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 75 0.104 BFLOPs
106 yolo
---------------------
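The BFLOPs column can be reproduced by hand, which is a handy check when reading the table. The sketch below assumes one multiply and one add per weight per output position; the function name `conv_bflops` is made up for this example, and the formula is inferred from the numbers in the table rather than taken from the darknet source.

```python
def conv_bflops(out_h, out_w, filters, kernel, in_channels):
    """Billions of floating-point operations for one convolution layer.

    Counts 2 operations (multiply + add) per weight per output position.
    """
    return 2 * out_h * out_w * filters * kernel * kernel * in_channels / 1e9


# Layer 0: 3x3 conv, 3 -> 32 channels, 416 x 416 output
first = conv_bflops(416, 416, 32, 3, 3)
# Layer 1: 3x3 stride-2 conv, 32 -> 64 channels, 208 x 208 output
down = conv_bflops(208, 208, 64, 3, 32)
# Layer 81: 1x1 conv, 1024 -> 75 channels, 13 x 13 output
head = conv_bflops(13, 13, 75, 1, 1024)

print(round(first, 3), round(down, 3), round(head, 3))
```

Rounded to three decimals these agree with the 0.299, 1.595 and 0.026 entries in the summary.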
YOLOOutputV3
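This covers everything from the output of the tip convolution block (conv + relu) to the inference-time reshape into detections.
For the last scale:
The convolution outputs 24 = (1+4+3)x3 channels: 3 is the number of classes, 4 describes one box, and 1 indicates whether an object is present. The final 3 = 6/2 is the number of anchors: the 6 anchor values are grouped in pairs, and each pair gives the width and height of one anchor.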
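The channel arithmetic can be written down as a one-line helper. This is only a sketch of the counting rule described above; `head_channels` is a name made up for this example.

```python
def head_channels(num_classes, num_anchors=3):
    """Number of output channels of the final 1x1 convolution of one scale."""
    # per anchor: 1 objectness value + 4 box values + one score per class
    return (1 + 4 + num_classes) * num_anchors


print(head_channels(3))   # the 3-class setup discussed in this post
print(head_channels(20))  # a 20-class setup
```

With 20 classes the same rule gives the 75 filters that appear in the `yolo` layers of the summary table.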
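yolo3.py -> YOLOOutputV3 -> line 123 ->
pred -> 24x169: (1+4+3)x3x13x13 <--> (objectness + coordinates + classes) x number of anchors x height x width
pred -> 169x3x8: the first axis is the feature-map position, the second axis is the anchor index, and the channels hold objectness, position and class scores
raw_box_centers -> 169x3x2: for each cell and each anchor, the relative center point
raw_box_scales -> 169x3x2: for each cell and each anchor, the scaling factor
objness -> 169x3x1: for each cell and each anchor, the objectness (confidence)
class_pred -> 169x3x3: for each cell and each anchor, the class probabilities
box_centers -> 169x3x2: for each cell and each anchor, the center of the corresponding box relative to the original image, with the offset added
box_scales -> 169x3x2: for each cell and each anchor, the width and height of the corresponding box relative to the original image; computed as the element-wise exponential (base e, 2.71) of raw_box_scales, multiplied by the anchor
class_score -> 169x3x3: for each cell and each anchor, the score of each class multiplied by the objectness, so classification and objectness are combined into one score
bbox -> 169x3x4: for each cell and each anchor, the coordinates of the corresponding box as top-left and bottom-right corners
offsets -> 169x1x2: the offset of each grid cell, x (0->12), y (0->12). The center predicted inside each cell is added to the offset of the cell's top-left corner and then multiplied by the stride (32), which turns the center from relative into absolute coordinates
anchor -> [[116,90],[156,198],[373,326]]: the size of each anchor. These are the three anchors of the last scale (13x13). Relative to the definition the anchor groups are reversed, so the deepest features are used to detect large objects. The three anchor groups defined by yolo3: anchors = [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]]
In training mode it returns bbox(1,507,4), raw_box_centers(1x169x3x2), raw_box_scales(1x169x3x2), objness(1x169x3x1), class_pred(1x169x3x3), anchor(1x1x3x2), offset(1x169x1x2)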
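The decoding steps listed above (offset, stride, exponential, anchor, corners) can be sketched for a single cell and anchor in plain Python. The sigmoid applied to the raw center is an assumption taken from the YOLOv3 formulation and is not spelled out in this post; `decode_box` and its parameter names are made up for this example and are not the gluon-cv API.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(raw_center, raw_scale, cell_xy, anchor_wh, stride):
    """Decode one raw prediction into (xmin, ymin, xmax, ymax) in input pixels."""
    # Center: sigmoid keeps the prediction inside the cell (assumption, see
    # lead-in), the cell index is the offset, and the stride maps it to pixels.
    cx = (sigmoid(raw_center[0]) + cell_xy[0]) * stride
    cy = (sigmoid(raw_center[1]) + cell_xy[1]) * stride
    # Size: element-wise exp of the raw scale, multiplied by the anchor (w, h).
    w = math.exp(raw_scale[0]) * anchor_wh[0]
    h = math.exp(raw_scale[1]) * anchor_wh[1]
    # Corners: top-left and bottom-right.
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A zero prediction in cell (6, 6) of the 13x13 grid with anchor [116, 90]
box = decode_box((0.0, 0.0), (0.0, 0.0), cell_xy=(6, 6), anchor_wh=(116, 90), stride=32)
print(box)
```

A raw prediction of all zeros lands exactly in the middle of its cell and has exactly the anchor's size, which is a convenient sanity check.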
For the other two scales
The other two scales return, respectively:
bbox(1,2028,4), raw_box_centers(1x676x3x2), raw_box_scales(1x676x3x2), objness(1x676x3x1), class_pred(1x676x3x3), anchor(1x1x3x2), offset(1x676x1x2)
bbox(1,8112,4), raw_box_centers(1x2704x3x2), raw_box_scales(1x2704x3x2), objness(1x2704x3x1), class_pred(1x2704x3x3), anchor(1x1x3x2), offset(1x2704x1x2)
Computed in the forward pass during training:
all_objectness (1x507x1) con (1x2028x1) con (1x8112x1)->(1x10647x1)
all_box_centers,(1x507x2) con (1x2028x2) con (1x8112x2)->(1x10647x2)
all_box_scales,(1x507x2) con (1x2028x2) con (1x8112x2)->(1x10647x2)
all_class_pred (1x507x3) con (1x2028x3) con (1x8112x3)->(1x10647x3)
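These are compared against the constructed labels to compute the loss and update the parameters. The cell height, cell width and number of anchors are all flattened into a single dimension.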
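The sizes of the flattened dimension follow directly from the grid sizes. A minimal sketch of that bookkeeping, with `predictions_per_scale` as a made-up helper name:

```python
def predictions_per_scale(grid, num_anchors=3):
    """Number of (cell, anchor) slots on a square grid of side `grid`."""
    return grid * grid * num_anchors


# 13x13, 26x26 and 52x52 grids, in the order they are concatenated
counts = [predictions_per_scale(g) for g in (13, 26, 52)]
total = sum(counts)
print(counts, total)
```

Each of the concatenated tensors therefore has one row per (cell, anchor) slot across all three scales.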
YOLODetectionBlockV3
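This block sits after feature extraction, between the feature extractor and the pred output, and is used for feature transformation, dimensionality reduction and so on. The source is in yolo3.py, class YOLODetectionBlockV3. Every stage is followed by one YOLODetectionBlock, with channel set to [512,256,128], so the number of output channels of the blocks decreases from one block to the next: [512,1024,512,1024,512,1024], [256,512,256,512,256,512], [128,256,128,256,128,256]. Each group has 6 convolutions. The output of the last convolution (tip) goes into output and is used for detection; the output of the 5th convolution goes through the transitions layer, is concatenated with the corresponding stage, and then enters the next YOLODetectionBlockV3.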
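The three channel lists follow one pattern, which the sketch below generates from the `channel` setting. The function name `detection_block_channels` is made up for this example; it only reproduces the numbers quoted above and does not build any layers.

```python
def detection_block_channels(channel):
    """Output channels of the six convolutions in one detection block."""
    # three repetitions of (channel, 2 * channel)
    return [channel, channel * 2] * 3


for c in (512, 256, 128):
    chans = detection_block_channels(c)
    # index 4 is the 5th convolution (feeds the transition),
    # index 5 is the 6th convolution (the tip, feeds the output)
    print(c, chans, "to transition:", chans[4], "tip:", chans[5])
```

The transition input always has `channel` channels while the tip has twice as many.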
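The transition between YOLODetectionBlockV3 blocks is just one convolution. After the convolution, one repeat along the height and one repeat along the width of the feature map perform the upsampling, and then a slice_like makes the result exactly the same size as the route so that the two can be concatenated.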
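To make the repeat-then-crop idea concrete, here is a pure-Python sketch on nested lists. It imitates the behaviour described above on a single 2-D channel and does not call MXNet; `upsample2x` and this `slice_like` are stand-ins written for this example.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list by repeating values."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat along the width
        out.append(wide)
        out.append(list(wide))                     # repeat along the height
    return out


def slice_like(fmap, ref):
    """Crop fmap (top-left aligned) to the height and width of ref."""
    h, w = len(ref), len(ref[0])
    return [row[:w] for row in fmap[:h]]


small = [[1, 2],
         [3, 4]]
route = [[0] * 3 for _ in range(3)]  # a 3x3 route: odd size, so a crop is needed

up = upsample2x(small)
aligned = slice_like(up, route)
print(up)
print(aligned)
```

For a 416 input the upsampled maps (13 -> 26, 26 -> 52) already match the routes, so the crop changes nothing; it only matters when a route has an odd size.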
loss
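[Formula images missing: three loss terms joined by "+", in order box center/scale loss, objectness loss, classification loss]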
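The overall idea: every anchor of every cell is compared with the label to compute the loss. A mask is derived from the label. For the center point and the scale, only the cells and anchors that contain an object contribute to the loss, and all other positions are ignored through a mask value of 0. The objectness of every cell and anchor, with or without an object, is used in the loss. Only cells that contain an object contribute to the classification loss. These correspond, in order, to the formulas above. For a given cell, a class that is predicted is 1; if the cell contains an object, is this position then necessarily 1? And since a cell has many anchors, is the classification not computed for every anchor? I did not fully understand this part of the source.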
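The masking idea can be sketched without any framework. The example below is a simplified stand-in, not the gluon-cv loss: it assumes binary cross-entropy for objectness and classes and squared error for center and scale, treats every (cell, anchor) slot independently, and uses made-up names (`masked_loss`, the dict keys).

```python
import math


def bce(p, t):
    """Binary cross-entropy for one probability p and one target t."""
    eps = 1e-12
    return -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))


def masked_loss(preds, labels):
    """Sum three loss terms over a flat list of (cell, anchor) slots.

    labels[i]["obj"] is 1 when the slot is responsible for an object; the same
    value is used as the mask for the center, scale and class terms.
    """
    obj_loss = box_loss = cls_loss = 0.0
    for p, t in zip(preds, labels):
        mask = t["obj"]
        # objectness: every slot contributes, with or without an object
        obj_loss += bce(p["obj"], t["obj"])
        # center + scale: only slots with an object (squared error is an assumption)
        box_loss += mask * sum(
            (a - b) ** 2
            for a, b in zip(p["center"] + p["scale"], t["center"] + t["scale"])
        )
        # classes: only slots with an object, one binary term per class
        cls_loss += mask * sum(bce(a, b) for a, b in zip(p["cls"], t["cls"]))
    return obj_loss, box_loss, cls_loss


preds = [
    {"obj": 0.9, "center": [0.5, 0.5], "scale": [0.1, 0.2], "cls": [0.8, 0.1, 0.1]},
    {"obj": 0.5, "center": [0.9, 0.9], "scale": [1.0, 1.0], "cls": [0.5, 0.5, 0.5]},
]
labels = [
    {"obj": 1, "center": [0.5, 0.5], "scale": [0.1, 0.2], "cls": [1, 0, 0]},
    {"obj": 0, "center": [0.0, 0.0], "scale": [0.0, 0.0], "cls": [0, 0, 0]},
]
obj_loss, box_loss, cls_loss = masked_loss(preds, labels)
print(obj_loss, box_loss, cls_loss)
```

In this sketch the second slot has no object, so its box and class predictions are ignored however wrong they are, while its objectness still adds to the loss.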
anchors
anchors = [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]],
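These should be widths and heights relative to 416. When training at 416, 320 or 608 they are presumably rescaled proportionally, and the label values should be rescaled proportionally as well.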
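The grouping of the flat anchor lists into pairs, and the reversed assignment to the three grids described earlier, can be sketched as follows. The helper name `pair_up` and the `grids` ordering are choices made for this example.

```python
ANCHORS = [[10, 13, 16, 30, 33, 23],
           [30, 61, 62, 45, 59, 119],
           [116, 90, 156, 198, 373, 326]]


def pair_up(flat):
    """Split a flat list of 6 values into 3 two-value anchors."""
    return [flat[i:i + 2] for i in range(0, len(flat), 2)]


# Grids listed from coarse to fine; the anchor groups are reversed to match,
# so the largest anchors go with the 13x13 grid.
grids = [13, 26, 52]
per_scale = {g: pair_up(a) for g, a in zip(grids, reversed(ANCHORS))}

for g in grids:
    print(g, per_scale[g])
```

The smallest anchors end up on the 52x52 grid, in line with the shallow features handling small objects.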
Overall network structure diagram