
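Having covered the paper and the overall approach in an earlier post (YOLO v3深入理解, "YOLO v3 In Depth"), let's now look at a code implementation. The original YOLO author's implementation is written in C; here I use a Keras + TensorFlow version. The code comes from experiencor's git project keras-yolo3. I have added some comments, and the annotated project is at keras-yolo3 + comments. Please point out any errors or omissions.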

Figure 1: Detecting a raccoon


Figure 2: Input -> output


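Referring to Figure 2 above, for an input image of, say, 416*416*3, the network outputs 13*13*3 + 26*26*3 + 52*52*3 = 10647 predicted boxes. We want these predicted boxes to reflect as accurately as possible where objects are, which class each object belongs to, and where its bounding box lies.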
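As a quick check of the arithmetic above, here is a minimal sketch that recomputes the number of predicted boxes. The helper name `count_predictions` is made up for illustration; the strides 32, 16 and 8 are the downsampling factors that turn a 416 input into the 13, 26 and 52 grids.

```python
def count_predictions(input_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Return the grid size of each output scale and the total number of predicted boxes."""
    # Each scale divides the input by its stride: 416/32=13, 416/16=26, 416/8=52.
    grids = [input_size // s for s in strides]
    # Every grid cell predicts one box per anchor.
    total = sum(g * g * anchors_per_cell for g in grids)
    return grids, total


grids, total = count_predictions()
print(grids, total)
```

Changing `input_size` to another multiple of 32 shows how the box count scales with the input resolution.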

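When building the label y (a tensor of 10647 predicted boxes * (4 + 1 + number of classes)), YOLO's design is as follows: for each object in the input image, whichever grid cell contains the center of the object's ground-truth box is responsible for predicting that object. However, because there are 3 scales and each grid cell has 3 prior (anchor) boxes, one object center corresponds to 9 prior boxes. In the end, only the prior box with the highest IOU against the ground-truth box is made responsible for the object (its confidence = 1); all other prior boxes are not responsible for it (confidence = 0). In the output vector of that prior box, the box position is set to the object's ground-truth box and the entry for the object's class is set to 1.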
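The assignment rule can be sketched roughly as follows. This is an illustration, not the code in generator.py: the function names are hypothetical, the anchor sizes are the COCO defaults from the YOLOv3 paper (the project reads its own from a config), and the mapping of anchor groups to strides 8/16/32 follows the paper and is an assumption here. The IOU compares shapes only, as if both boxes shared the same corner.

```python
# Assumed anchors (width, height) in input-image pixels: YOLOv3 paper defaults.
ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]
STRIDES = (8, 16, 32)  # assumed: anchor group 0 -> 52*52, 1 -> 26*26, 2 -> 13*13


def shape_iou(wh_a, wh_b):
    """IOU of two boxes aligned at the same corner, so only width and height matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(box_wh, anchors=ANCHORS):
    """Index of the anchor whose shape has the highest IOU with the ground-truth box."""
    ious = [shape_iou(box_wh, a) for a in anchors]
    return max(range(len(anchors)), key=ious.__getitem__)


def assign(center_xy, box_wh):
    """Return (scale index, anchor index within the scale, grid row, grid column)."""
    idx = best_anchor(box_wh)
    scale, anchor_in_scale = divmod(idx, 3)
    stride = STRIDES[scale]
    col = int(center_xy[0] // stride)  # grid cell containing the box center
    row = int(center_xy[1] // stride)
    return scale, anchor_in_scale, row, col


print(assign(center_xy=(208, 208), box_wh=(120, 95)))
```

Only the returned (scale, anchor, row, col) slot gets confidence 1 in the label; every other slot keeps confidence 0.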






Figure 3: Bounding-box prediction
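The bounding-box prediction in the figure above can be sketched in plain Python. This mirrors how the loss code in this post builds pred_xy and pred_wh (sigmoid offset plus cell index divided by the grid size, anchor times exp(t) divided by the input size) and how it derives wh_scale. The function names and the sample numbers are illustrative assumptions, not part of the project.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t_xywh, cell_xy, anchor_wh, grid_wh, net_wh):
    """Map raw outputs (t_x, t_y, t_w, t_h) to a box normalized to the input image."""
    t_x, t_y, t_w, t_h = t_xywh
    # Center: sigmoid offset inside the cell plus the cell index, divided by the grid size.
    b_x = (sigmoid(t_x) + cell_xy[0]) / grid_wh[0]
    b_y = (sigmoid(t_y) + cell_xy[1]) / grid_wh[1]
    # Size: anchor (in input-image pixels) scaled by exp(t), divided by the input size.
    b_w = anchor_wh[0] * math.exp(t_w) / net_wh[0]
    b_h = anchor_wh[1] * math.exp(t_h) / net_wh[1]
    return b_x, b_y, b_w, b_h


def wh_scale(b_w, b_h):
    """Weight for the position loss: the smaller the normalized box, the closer to 2."""
    return 2.0 - b_w * b_h


box = decode_box((0.0, 0.0, 0.0, 0.0), cell_xy=(6, 6), anchor_wh=(116, 90),
                 grid_wh=(13, 13), net_wh=(416, 416))
print(box)
print(wh_scale(box[2], box[3]))
```

With all raw outputs at zero, the box sits in the middle of its cell and has exactly the anchor's size, which makes the anchor's role as a starting guess easy to see.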





For how the training samples are built, see class BatchGenerator in generator.py.
For the loss computation, see call(self, x) in yolo.py.
The network structure is create_yolov3_model() in yolo.py.



    def call(self, x):
        """
        x = input_image, y_pred, y_true, true_boxes
        These are: the input image, the tensor output by YOLO, the label y (the tensor we want it to output), and all ground truth boxes in the input image.
        loss = box position xy loss + box position wh loss + box confidence loss + object class loss
        """
        # true_boxes corresponds to t_batch in BatchGenerator, shape=(batch, 1, 1, 1, max number of objects per image, 4 coordinates)
        # y_true corresponds to yolo_1/yolo_2/yolo_3 in BatchGenerator, i.e. the tensor for one feature map
        input_image, y_pred, y_true, true_boxes = x

        # adjust the shape of the y_predict [batch, grid_h, grid_w, 3, 4+1+nb_class]
        # shape=(batch, grid_h, grid_w, 3 anchors, 4 box coordinates + 1 confidence + number of classes)
        y_pred = tf.reshape(y_pred, tf.concat([tf.shape(y_pred)[:3], tf.constant([3, -1])], axis=0))
        # initialize the masks
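        # object_mask is the confidence (objectness) of every predicted box on one feature map. Here it comes from the label y_true, so it is 0 everywhere except for the anchors responsible for detecting an object.
        # shape = (batch, grid_h, grid_w, 3 anchors, 1 confidence)
        # y_true[..., 4] extracts the box confidence (in the last dimension, the first 4 entries are the box coordinates and the 5th is the confidence); expand_dims restores the dropped axis so the rank matches the original tensor.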
        object_mask     = tf.expand_dims(y_true[..., 4], 4)

        # the variable to keep track of number of batches processed
        batch_seen = tf.Variable(0.)        

        # compute grid factor and net factor
        # height and width of the feature map
        grid_h      = tf.shape(y_true)[1]
        grid_w      = tf.shape(y_true)[2]
        grid_factor = tf.reshape(tf.cast([grid_w, grid_h], tf.float32), [1,1,1,1,2])

        # height and width of the input image
        net_h       = tf.shape(input_image)[1]
        net_w       = tf.shape(input_image)[2]            
        net_factor  = tf.reshape(tf.cast([net_w, net_h], tf.float32), [1,1,1,1,2])
        # Adjust prediction
        # pred_box_xy is the predicted box center on the feature map, with each grid cell normalized to 1*1, = (sigma(t_xy) + c_xy)
        pred_box_xy    = (self.cell_grid[:,:grid_h,:grid_w,:,:] + tf.sigmoid(y_pred[..., :2]))  # shape=(batch, grid_h, grid_w, 3 boxes, 2 coordinates)
        # pred_box_wh is the predicted t_w, t_h. Note: truth_wh = anchor_wh * exp(t_wh)
        pred_box_wh    = y_pred[..., 2:4]                                                       # shape=(batch, grid_h, grid_w, 3 boxes, 2 coordinates)
        pred_box_conf  = tf.expand_dims(tf.sigmoid(y_pred[..., 4]), 4)                          # shape=(batch, grid_h, grid_w, 3 boxes, 1 confidence)
        pred_box_class = y_pred[..., 5:]                                                        # shape=(batch, grid_h, grid_w, 3 boxes, c classes)

        # Adjust ground truth
        # true_box_xy is the ground-truth box center on the feature map, = (sigma(t_xy) + c_xy), see y_true
        true_box_xy    = y_true[..., 0:2]                  # shape=(batch, grid_h, grid_w, 3 boxes, 2 coordinates)
        # true_box_wh is the object's t_w, t_h. Note: truth_wh = anchor_wh * exp(t_wh)
        true_box_wh    = y_true[..., 2:4]                  # shape=(batch, grid_h, grid_w, 3 boxes, 2 coordinates)
        true_box_conf  = tf.expand_dims(y_true[..., 4], 4) # shape=(batch, grid_h, grid_w, 3 boxes, 1 confidence)
        true_box_class = tf.argmax(y_true[..., 5:], -1)    # shape=(batch, grid_h, grid_w, 3 boxes)

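        # Compare each predicted box to all true boxes
        # One feature map has grid_w * grid_h * 3 anchors predicted boxes. YOLO's strategy: among the 3 anchors of the grid cell containing an object's center, the anchor with the highest IOU is responsible for predicting that object (its confidence = 1).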
        # initially, drag all objectness of all boxes to 0
        conf_delta  = pred_box_conf - 0 

        # then, ignore the boxes which have good overlap with some true box
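        # true_xy and true_wh are expressed relative to the original image with its width and height normalized to 1*1
        true_xy = true_boxes[..., 0:2] / grid_factor  # shape=(batch, 1, 1, 1, max objects per image (3), 2 xy coordinates); xy are feature-map coordinates, same as xy in y_true
        true_wh = true_boxes[..., 2:4] / net_factor   # shape=(batch, 1, 1, 1, max objects per image (3), 2 wh values); wh are the object's width and height in the original image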
        true_wh_half = true_wh / 2.
        true_mins    = true_xy - true_wh_half
        true_maxes   = true_xy + true_wh_half
        pred_xy = tf.expand_dims(pred_box_xy / grid_factor, 4)                        # shape=(batch, grid_h, grid_w, 3 boxes, 1, 2 coordinates)
        pred_wh = tf.expand_dims(tf.exp(pred_box_wh) * self.anchors / net_factor, 4)  # shape=(batch, grid_h, grid_w, 3 boxes, 1, 2 coordinates)
        pred_wh_half = pred_wh / 2.
        pred_mins    = pred_xy - pred_wh_half
        pred_maxes   = pred_xy + pred_wh_half    

        intersect_mins  = tf.maximum(pred_mins,  true_mins)  # shape=(batch, grid_h, grid_w, 3 boxes, max objects per image (3), 2 coordinates)
        intersect_maxes = tf.minimum(pred_maxes, true_maxes) # shape=(batch, grid_h, grid_w, 3 boxes, max objects per image (3), 2 coordinates)

        intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)  # shape=(batch, grid_h, grid_w, 3 boxes, max objects per image (3), 2 coordinates)
        intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]       # shape=(batch, grid_h, grid_w, 3 boxes, max objects per image (3))
        true_areas = true_wh[..., 0] * true_wh[..., 1]  # shape=(batch, 1,      1,      1,       max objects per image (3))
        pred_areas = pred_wh[..., 0] * pred_wh[..., 1]  # shape=(batch, grid_h, grid_w, 3 boxes, 1)

        union_areas = pred_areas + true_areas - intersect_areas  # shape=(batch, grid_h, grid_w, 3 boxes, max objects per image (3))
        iou_scores  = tf.truediv(intersect_areas, union_areas)   # shape=(batch, grid_h, grid_w, 3 boxes, max objects per image (3))

        # IOU of each predicted box with its closest ground-truth object
        best_ious   = tf.reduce_max(iou_scores, axis=4)  # shape=(batch, grid_h, grid_w, 3 boxes)

        # only predicted boxes whose best IOU is below the threshold contribute a (background) confidence loss
        conf_delta *= tf.expand_dims(tf.to_float(best_ious < self.ignore_thresh), 4) # shape=(batch, grid_h, grid_w, 3 boxes, 1 confidence)

        # Compute some online statistics
        true_xy = true_box_xy / grid_factor
        true_wh = tf.exp(true_box_wh) * self.anchors / net_factor

        true_wh_half = true_wh / 2.
        true_mins    = true_xy - true_wh_half
        true_maxes   = true_xy + true_wh_half

        pred_xy = pred_box_xy / grid_factor
        pred_wh = tf.exp(pred_box_wh) * self.anchors / net_factor 
        pred_wh_half = pred_wh / 2.
        pred_mins    = pred_xy - pred_wh_half
        pred_maxes   = pred_xy + pred_wh_half      

        intersect_mins  = tf.maximum(pred_mins,  true_mins)
        intersect_maxes = tf.minimum(pred_maxes, true_maxes)
        intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
        intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]
        true_areas = true_wh[..., 0] * true_wh[..., 1]
        pred_areas = pred_wh[..., 0] * pred_wh[..., 1]

        union_areas = pred_areas + true_areas - intersect_areas
        iou_scores  = tf.truediv(intersect_areas, union_areas)
        iou_scores  = object_mask * tf.expand_dims(iou_scores, 4)
        count       = tf.reduce_sum(object_mask)
        count_noobj = tf.reduce_sum(1 - object_mask)
        detect_mask = tf.to_float((pred_box_conf*object_mask) >= 0.5)
        class_mask  = tf.expand_dims(tf.to_float(tf.equal(tf.argmax(pred_box_class, -1), true_box_class)), 4)
        recall50    = tf.reduce_sum(tf.to_float(iou_scores >= 0.5 ) * detect_mask  * class_mask) / (count + 1e-3)
        recall75    = tf.reduce_sum(tf.to_float(iou_scores >= 0.75) * detect_mask  * class_mask) / (count + 1e-3)    
        avg_iou     = tf.reduce_sum(iou_scores) / (count + 1e-3)
        avg_obj     = tf.reduce_sum(pred_box_conf  * object_mask)  / (count + 1e-3)
        avg_noobj   = tf.reduce_sum(pred_box_conf  * (1-object_mask))  / (count_noobj + 1e-3)
        avg_cat     = tf.reduce_sum(object_mask * class_mask) / (count + 1e-3) 

        # Warm-up training
        batch_seen = tf.assign_add(batch_seen, 1.)
        true_box_xy, true_box_wh, xywh_mask = tf.cond(tf.less(batch_seen, self.warmup_batches+1),
                              # Following the design introduced in YOLOv2, the first self.warmup_batches batches compute the error between predicted boxes and prior boxes, not against the ground-truth boxes.
                              # However, the code here seems to have a problem.
                              lambda: [true_box_xy + (0.5 + self.cell_grid[:,:grid_h,:grid_w,:,:]) * (1-object_mask),
                                       true_box_wh + tf.zeros_like(true_box_wh) * (1-object_mask),   # zeros_like makes the second term 0, so this is still just true_box_wh; needs fixing
                                       tf.ones_like(object_mask)],                                   # every predicted box's position counts toward the loss
                              # later batches get no special handling
                              lambda: [true_box_xy,
                                       true_box_wh,
                                       object_mask])

        # Compare each true box to all anchor boxes
        # Note: exp(true_box_wh) = exp(t_wh) = truth_wh / anchor_wh
        # exp(true_box_wh) * self.anchors / net_factor = truth_wh / anchor_wh * self.anchors / net_factor = truth_wh / net_factor
        # wh_scale is the size of the ground-truth object relative to the input image.
        wh_scale = tf.exp(true_box_wh) * self.anchors / net_factor   # shape=(batch, grid_h, grid_w, 3 anchors, 2 coordinates)
        # wh_scale decreases as the area of the ground-truth box grows, which makes small objects more sensitive to box errors: the smaller the box, the bigger the scale
        wh_scale = tf.expand_dims(2 - wh_scale[..., 0] * wh_scale[..., 1], axis=4)

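        # Normally (after warmup_batches), xywh_mask = object_mask, i.e. only predicted boxes that contain an object (whose position, confidence and class are meaningful) contribute to the loss.
        # For predicted boxes without an object, the confidence is still meaningful (although conf_delta has already filtered out boxes whose IOU exceeds the threshold) and counts toward the loss. Their position and class are meaningless and are not counted.
        xy_delta    = xywh_mask   * (pred_box_xy-true_box_xy) * wh_scale * self.xywh_scale  # shape=(batch, grid_h, grid_w, 3 boxes, 2 positions)
        wh_delta    = xywh_mask   * (pred_box_wh-true_box_wh) * wh_scale * self.xywh_scale  # shape=(batch, grid_h, grid_w, 3 boxes, 2 positions)
        # shape=(batch, grid_h, grid_w, 3 boxes, 1 confidence); the first term is the confidence of detecting an object, the second term is the confidence of detecting background
        conf_delta  = object_mask * (pred_box_conf-true_box_conf) * self.obj_scale + (1-object_mask) * conf_delta * self.noobj_scale
        # shape=(batch, grid_h, grid_w, 3 boxes, 1 cross-entropy)
        class_delta = object_mask * \
                      tf.expand_dims(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=true_box_class, logits=pred_box_class), 4) * \
                      self.class_scale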

        # shape=(batch_size,)
        loss_xy    = tf.reduce_sum(tf.square(xy_delta),       list(range(1,5)))
        loss_wh    = tf.reduce_sum(tf.square(wh_delta),       list(range(1,5)))
        loss_conf  = tf.reduce_sum(tf.square(conf_delta),     list(range(1,5)))
        loss_class = tf.reduce_sum(class_delta,               list(range(1,5)))

        loss = loss_xy + loss_wh + loss_conf + loss_class

        loss = tf.Print(loss, [grid_h, avg_obj], message='avg_obj \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, avg_noobj], message='avg_noobj \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, avg_iou], message='avg_iou \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, avg_cat], message='avg_cat \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, recall50], message='recall50 \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, recall75], message='recall75 \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, count], message='count \t\t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, tf.reduce_sum(loss_xy),
                                       tf.reduce_sum(loss_wh),
                                       tf.reduce_sum(loss_conf),
                                       tf.reduce_sum(loss_class)],  message='loss xy, wh, conf, class: \t',   summarize=1000)

        # shape of loss = (batch_size,)
        return loss*self.grid_scale
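The "ignore" step in the listing above (compute each predicted box's best IOU against all ground-truth boxes, then drop the background confidence penalty where that IOU reaches the threshold) can be sketched without TensorFlow. This is a simplified illustration using plain lists rather than broadcast tensors; the function names and the threshold value 0.5 are assumptions for the example, not values taken from the project's configuration.

```python
def iou_xywh(a, b):
    """IOU of two boxes given as (center_x, center_y, width, height)."""
    a_min_x, a_max_x = a[0] - a[2] / 2, a[0] + a[2] / 2
    a_min_y, a_max_y = a[1] - a[3] / 2, a[1] + a[3] / 2
    b_min_x, b_max_x = b[0] - b[2] / 2, b[0] + b[2] / 2
    b_min_y, b_max_y = b[1] - b[3] / 2, b[1] + b[3] / 2
    # Clamp at 0 so boxes that do not overlap get zero intersection.
    inter_w = max(min(a_max_x, b_max_x) - max(a_min_x, b_min_x), 0.0)
    inter_h = max(min(a_max_y, b_max_y) - max(a_min_y, b_min_y), 0.0)
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union


def noobj_mask(pred_boxes, true_boxes, ignore_thresh=0.5):
    """1.0 where a prediction's best IOU with any true box is below the threshold, else 0.0."""
    mask = []
    for p in pred_boxes:
        best = max((iou_xywh(p, t) for t in true_boxes), default=0.0)
        mask.append(1.0 if best < ignore_thresh else 0.0)
    return mask


preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1)]
trues = [(0.5, 0.5, 0.2, 0.2)]
print(noobj_mask(preds, trues))
```

A prediction that overlaps a real object well is neither rewarded nor punished as background, even if it is not the anchor assigned to that object; only clearly off-target boxes are pushed toward confidence 0.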


[1] YOLOv3: An Incremental Improvement
[2] YOLO v3深入理解 (YOLO v3 In Depth)
[3] keras-yolo3 + comments

