R-CNN, Fast R-CNN, Faster R-CNN
In April of this year, while interning at a research institute, I studied the R-CNN, Fast R-CNN, and Faster R-CNN family of object detection frameworks. Here is a summary.
1. R-CNN (Regions with CNN features)
1.1 Framework
The paper states:
Our object detection system consists of three modules.
The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector.
The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
The third module is a set of class specific linear SVMs.
Bounding-box Regression
Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in DPM, we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal.
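From this we can see that R-CNN consists of three parts:
- a module that extracts region proposals;
- a convolutional neural network that extracts feature vectors;
- linear SVM classifiers, plus bounding-box regression (for localizing objects).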
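1.2 Network structure
The network's input is a region proposal extracted by selective search and then warped to a fixed size. It passes through 5 convolutional layers and 2 fully connected layers; the resulting features are sent to the SVMs for classification, while bounding-box regression (per the passage quoted above) is trained on the pool5 features.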
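The bounding-box regression step can be made concrete with a small sketch. It assumes the parameterization described in the R-CNN paper's appendix: boxes are given as (center x, center y, width, height), the center offsets are normalized by the proposal size, and the size changes are expressed in log space. The function names and the sample boxes are made up for illustration, and the learned linear regressor itself is left out; only the targets it would be trained on, and how a prediction is applied, are shown.

```python
import math

def regression_targets(proposal, ground_truth):
    """Targets a regressor would learn for one (proposal, ground truth) pair.

    Both boxes are (center_x, center_y, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return (
        (gx - px) / pw,        # horizontal shift, relative to proposal width
        (gy - py) / ph,        # vertical shift, relative to proposal height
        math.log(gw / pw),     # log-space width scaling
        math.log(gh / ph),     # log-space height scaling
    )

def apply_deltas(proposal, deltas):
    """Turn predicted deltas back into a refined box."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (px + pw * dx, py + ph * dy, pw * math.exp(dw), ph * math.exp(dh))

# Hypothetical boxes, for illustration only.
proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (54.0, 46.0, 30.0, 40.0)

targets = regression_targets(proposal, ground_truth)
refined = apply_deltas(proposal, targets)
```

Applying the exact targets to the proposal recovers the ground-truth box, which is a quick way to check that the two functions are inverses of each other.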
2. SPP-net (Spatial Pyramid Pooling)
2.1 Background
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224×224) input image. This requirement is “artificial” and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement.
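Before turning to Fast R-CNN, a word on SPP-net. As the paper says, networks such as R-CNN must first warp each region proposal to a fixed size, which limits their applicability; SPP-net therefore proposes a network that places no restriction on input size.
2.2 Implementation
The input image (of any size) first passes through the convolutional layers. The resulting feature map is then pooled at three pyramid levels (here with 16, 4, and 1 bins, i.e. 4×4, 2×2, and 1×1 grids), and the three outputs are concatenated into a fixed-length vector. This removes the restriction on input image size.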
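To see why the output length no longer depends on the input size, here is a minimal pure-Python sketch of spatial pyramid pooling on a single-channel feature map. It is an illustration under stated assumptions, not the paper's implementation: the helper names `pool_level` and `spp` are made up, max pooling is assumed, and the bin boundaries use a floor/ceil split that is one reasonable choice among several. A real network would do this per channel, so the final length would be 21 times the number of channels.

```python
def pool_level(feature_map, n):
    """Max-pool a 2-D feature map (list of rows) into an n x n grid of bins."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(n):
        r0 = (i * h) // n                    # floor of the bin's top edge
        r1 = ((i + 1) * h + n - 1) // n      # ceil of the bin's bottom edge
        for j in range(n):
            c0 = (j * w) // n
            c1 = ((j + 1) * w + n - 1) // n
            pooled.append(max(feature_map[r][c]
                              for r in range(r0, r1)
                              for c in range(c0, c1)))
    return pooled

def spp(feature_map, levels=(4, 2, 1)):
    """Concatenate the pooled bins of every pyramid level: 16 + 4 + 1 values."""
    out = []
    for n in levels:
        out.extend(pool_level(feature_map, n))
    return out

# Two feature maps of different sizes (values are arbitrary).
small = [[r * 5 + c for c in range(5)] for r in range(6)]      # 6 x 5
large = [[r * 13 + c for c in range(13)] for r in range(9)]    # 9 x 13

vec_small = spp(small)
vec_large = spp(large)
print(len(vec_small), len(vec_large))
```

Both vectors have the same length even though the inputs differ in size, which is what lets the fully connected layers that follow accept them.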
3. Fast R-CNN
3.1 Framework
A Fast R-CNN network takes as input an entire image and a set of object proposals.
The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
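The whole image is first passed through the convolutional network once to produce a feature map. Each region proposal from selective search is then mapped onto that feature map and fed into the RoI pooling layer, which extracts a fixed-length feature vector. That vector goes to a softmax layer that estimates class probabilities and, in parallel, to a bounding-box regression layer.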
The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets in which there is only one pyramid level.
As the paper notes, the RoI layer is in fact a special case of SPP-net's pooling layer: one with a single pyramid level.
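The single-level idea can be sketched in a few lines of pure Python. This is a simplified illustration, not the Fast R-CNN code: the function name `roi_pool`, the choice to express the RoI directly in feature-map cells (top row, left column, bottom row, right column, with exclusive ends), and the 2×2 output grid are all assumptions made to keep the example small. The paper uses a 7×7 grid with VGG16 and first projects image-space proposals onto the feature map.

```python
def roi_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2-D feature map into an out_h x out_w grid.

    roi = (row0, col0, row1, col1) in feature-map cells, ends exclusive.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        rs = r0 + (i * h) // out_h                      # bin top (floor)
        re = r0 + ((i + 1) * h + out_h - 1) // out_h    # bin bottom (ceil)
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = c0 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled

# An 8 x 8 feature map with arbitrary values, shared by every RoI.
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]

a = roi_pool(feature_map, (0, 0, 4, 6))   # a 4 x 6 region
b = roi_pool(feature_map, (2, 1, 7, 4))   # a 5 x 3 region
print(a)
print(b)
```

Both regions come out as 2×2 grids despite their different shapes, and both are read from the same feature map, so the convolutional layers run only once per image.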
3.2 Network structure (VGG16 as an example)
4. Faster R-CNN
4.1 Framework
Our object detection system, called Faster R-CNN, is composed of two modules.
The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.
The entire system is a single, unified network for object detection (Figure 2).
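The image is fed into the network to produce a feature map. Before RoI pooling, the feature map first passes through an RPN (described in the next subsection); the resulting region proposals and the feature map are then fed together into the RoI pooling layer. The rest is the same as Fast R-CNN.
4.2 RPN (Region Proposal Network)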
A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.
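The RPN replaces selective search as the way to extract region proposals and introduces the concept of anchor boxes. As a small window slides over the feature map, k anchor boxes of different scales and aspect ratios are considered at each position, yielding 2k objectness scores and 4k box coordinates. (The 2 stands for object / not object; the 4 stands for the coordinates x, y, w, h.)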
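The anchor bookkeeping can be sketched as follows. The example assumes the paper's default of 3 scales and 3 aspect ratios (so k = 9); the function name `make_anchors`, the center position, and the convention that a ratio means height divided by width are choices made for this illustration. It only generates the anchors for one sliding-window position and counts the outputs; the convolutional layers that actually predict the scores and coordinates are not shown.

```python
import math

def make_anchors(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors as (x, y, w, h).

    Each anchor keeps an area of scale**2 while its height/width
    ratio follows `ratios`.
    """
    anchors = []
    for s in scales:
        for ratio in ratios:
            w = s / math.sqrt(ratio)
            h = s * math.sqrt(ratio)
            anchors.append((center_x, center_y, w, h))
    return anchors

# Anchors for one (hypothetical) sliding-window position.
anchors = make_anchors(300, 200)
k = len(anchors)
num_scores = 2 * k   # object / not object for each anchor
num_coords = 4 * k   # x, y, w, h for each anchor
print(k, num_scores, num_coords)
```

With the default settings each position therefore contributes 9 anchors, 18 scores, and 36 coordinates.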
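4.3 Network structure
5. Summary
- Fast R-CNN uses the RoI pooling layer to fold the stages that followed R-CNN's CNN, namely classification (R-CNN's SVMs, replaced by a softmax layer) and bounding-box regression, into the network itself;
- Faster R-CNN uses the RPN to fold the region proposal stage that preceded Fast R-CNN into the network as well, making the detector end-to-end.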
References
[1] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. (2014). Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition.
[3] Ross Girshick. (2015). Fast R-CNN.
[4] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
- My homepage: http://www.techping.cn/
- My CSDN blog: http://blog.csdn.net/techping
- My Jianshu: http://www.reibang.com/users/b2a36e431d5e/
- My GitHub: https://github.com/techping