Weakly Supervised Object Localization

Author: Zongwei Zhou | 周縱葦
Weibo: @MrGiovanni
Email: zongweiz@asu.edu

Weakly supervised object localization refers to learning object locations in an image while relying only on image-level annotation.

Overview

There have been a number of ongoing investigations into weakly supervised object localization relying only on image-level annotation, covering self-transfer learning based (Hwang et al.), attention based (Kazi et al.), and multiple instance learning based (Li et al.) models. One of the most recognized approaches is the class activation map (CAM), introduced by Zhou et al., which is produced from a Convolutional Neural Network (CNN) trained for classification. This localization ability is generic and has encouraged a number of medical applications in weakly supervised disease localization. For example, Wang et al. calculated CAMs from a multi-class CNN and generated bounding boxes for each pathology candidate in X-rays; Gondal et al. handled multiple diabetic retinopathy lesions in one retinal fundus image by considering multiple binarized region proposals from the CAM; Qi et al. achieved better localization performance by combining the CAMs of the four tissue combinations typically observed in a placental image.

Although CAM is promising for image-level detection, it may not be a proper approach for accurate object localization, which demands both the rough location and the approximate size of the object, because it focuses only on the most discriminative region and ignores the rest of the object. For example, when CAM is applied to localize cats, it highlights only one of the most discriminative areas, such as the face, body, or tail, and fails to outline the whole cat. That is, CAM behaves appropriately when measured by detection metrics, but performs only fairly when measured by stricter overlap metrics such as IoU and Dice. For this reason, many studies aim to improve CAM towards a more ambitious task: semantic segmentation relying only on image-level annotation.
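To make the gap between detection-style and overlap-style metrics concrete, here is a minimal sketch in Python with NumPy. The toy masks and the helper name iou_and_dice are my own illustration, not taken from any of the cited papers.

```python
import numpy as np

def iou_and_dice(pred_mask, true_mask):
    """Return (IoU, Dice) for two boolean masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    total = pred.sum() + true.sum()
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / total if total else 1.0
    return float(iou), float(dice)

# Toy example (hypothetical): the object covers a 6x6 square, but the
# thresholded CAM only fires on a 2x2 patch inside it (say, the cat's face).
true_mask = np.zeros((10, 10), dtype=bool)
true_mask[2:8, 2:8] = True
cam_mask = np.zeros((10, 10), dtype=bool)
cam_mask[3:5, 3:5] = True

iou, dice = iou_and_dice(cam_mask, true_mask)
# Detection-style check: does a CAM pixel land on the object at all?
hit = bool(true_mask[3, 3])
```

In this toy case the CAM counts as a hit for detection, yet IoU is 4/36 and Dice is 8/40, which is the kind of mismatch described above.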

In Teh et al., an attention mechanism was introduced to guide the classifier to learn a more discriminative object region using a traditional region proposal method, which is computationally costly and time-consuming in practice. Further, CAM has been utilized as a new region proposal method and regularized as an attention mask to reveal more discriminative regions. Kim et al. proposed two-phase learning: the CAM from a pre-trained CNN serves as a suppression mask, and a second CNN is trained with this mask applied to its intermediate feature maps. The CAMs from both CNNs are then merged to obtain more accurate localization. This approach, however, is not suitable for medical images because the object class has to appear in the pre-training task. González-Gonzalo et al. slightly improved the localization accuracy of diabetic retinopathy lesions in an iterative manner by inpainting the input image based on the previously predicted CAM. Wei et al. trained several CNNs independently for adversarial erasing (AE) and generated the localization map recursively until training of the classification CNN fails. Zhang et al. proposed an improved version of AE, named Adversarial Complementary Learning (ACoL), by integrating those independent CNNs into a single network and training it end-to-end. Nevertheless, ACoL can only approximate maps of the same quality as CAM, albeit in a more convenient way. Another recent attempt, by Singh et al., encourages the CNN to focus on multiple relevant parts of the object beyond just the most discriminative one by randomly hiding patches of the input image.

The success of the aforementioned CAM-based and attention-based approaches rests, in essence, on two consequential presumptions. First, the CNN must be accurate enough to make meaningful predictions for the classification task; in other words, if the CNN performs poorly, for reasons such as limited labeled data, the technique will be powerless for object localization. Second, given a well-trained CNN with strong classification results, the class activation map is supposed to activate the discriminative regions of the object. Unfortunately, this is only an intuitive assumption: experiments have demonstrated its effectiveness in most scenarios, but no solid theory proves its validity. The discriminative region is not always equivalent to the target object itself. Therefore, training a CNN that classifies well is a necessary but not sufficient condition for using CAM in weakly supervised object localization.

Novel Technical Approaches in Computer Vision

Is object localization for free?
Oquab, Maxime, et al. "Is object localization for free?-weakly-supervised learning with convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
https://leon.bottou.org/publications/pdf/cvpr-2015.pdf

This is probably the first work to describe weakly supervised object localization using CNNs. However, its localization is limited to a point lying within the boundary of the object rather than the full extent of the object, a consequence of global max pooling. Max pooling was used rather than average pooling because the task was formulated as a multiple instance learning (MIL) problem. Personally, I prefer this work to Zhou et al., and especially admire the question it poses: is object localization with convolutional neural networks for free? The method is close to CAM and is described as follows:
First, we treat the last fully connected network layers as convolutions to cope with the uncertainty in object localization.
Second, we introduce a max-pooling layer that hypothesizes the possible location of the object in the image.
Third, we modify the cost function to learn from image-level supervision.
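
The second step above can be sketched roughly as follows. This is only an illustration under my own assumptions (random per-location class scores standing in for the output of the convolutionalized network, and a hypothetical helper global_max_pool), not the authors' code. It shows why the output is a single point per class rather than an extent.

```python
import numpy as np

def global_max_pool(score_maps):
    """score_maps: (C, H, W) array of per-location class scores.

    Returns the image-level score of each class and the (row, col)
    position where that maximum is attained.
    """
    c, h, w = score_maps.shape
    flat = score_maps.reshape(c, h * w)
    idx = flat.argmax(axis=1)
    scores = flat[np.arange(c), idx]
    rows, cols = np.unravel_index(idx, (h, w))
    return scores, np.stack([rows, cols], axis=1)

# Hypothetical score maps for 3 classes on a 5x5 grid of locations.
rng = np.random.default_rng(0)
score_maps = rng.normal(size=(3, 5, 5))
score_maps[1, 2, 4] = 10.0  # plant a strong response for class 1

scores, locations = global_max_pool(score_maps)
```

Only the arg-max location receives a gradient from the image-level loss, which fits the MIL view: the image is a bag of locations and the best-scoring one speaks for the bag.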

Class Activation Map (CAM)
Zhou, Bolei, et al. "Learning deep features for discriminative localization." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf

The paper's main message is focused: it revisits the global average pooling layer and sheds light on how it explicitly enables a CNN to have remarkable localization ability despite being trained on image-level labels. Though CAM works well in object localization, it has low precision, covering both relevant and non-relevant regions (noise activations and background), and it produces low-resolution maps, which impedes precise object localization when the input image is small.
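
As a minimal sketch of the idea, assuming a network whose last convolutional feature maps are followed by global average pooling and a single linear layer: the map for class c is the sum of the feature maps weighted by that class's linear-layer weights. The random arrays and the function name class_activation_map below are placeholders of mine.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """features: (K, H, W) last conv feature maps; fc_weights: (C, K).

    Returns the (H, W) map sum_k w[class_idx, k] * features[k].
    """
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))

# Placeholder activations and weights (assumed shapes, random values).
rng = np.random.default_rng(0)
features = rng.random((8, 7, 7))
fc_weights = rng.normal(size=(4, 8))

gap = features.mean(axis=(1, 2))   # global average pooling -> (K,)
logits = fc_weights @ gap          # class scores, bias omitted
cam = class_activation_map(features, fc_weights, class_idx=2)
```

The spatial mean of the map equals the class score, so the map is simply the score before pooling. Its resolution is that of the last feature maps (7x7 here), which is why it has to be upsampled and ends up coarse.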

Self-transfer Learning
Hwang, Sangheum, and Hyo-Eun Kim. "Self-transfer learning for weakly supervised lesion localization." International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2016.
https://link.springer.com/chapter/10.1007/978-3-319-46723-8_28

It works like Multi-task Learning (MTL).

Attention Networks
Teh, Eu Wern, Mrigank Rochan, and Yang Wang. "Attention Networks for Weakly Supervised Object Localization." BMVC. 2016.
http://www.cs.umanitoba.ca/~ywang/papers/bmvc16_attention.pdf

The Edge Boxes method extracts proposals (bounding boxes) that are likely to contain an object. Each proposal is passed through a linear layer to obtain its attention score. The softmax operation is then applied to the attention scores, and each normalized score is multiplied with its corresponding proposal features. This gives a whole-image feature vector that is the weighted average of the proposals. Finally, the whole-image feature is used to classify the image.
This paper introduces proposal attention to implicitly locate the object by learning the contribution of each proposed region to the final classification result. As in R-CNN, however, the region proposal approach (Edge Boxes) is adopted as an external module independent of the network, so the region locations do not adapt to the features learned during training. Also, it cannot handle multiple objects in an image because only the one proposal with the highest attention score can be detected per image. This method is not as flexible and powerful as CAM for two reasons:
First, the feature learner is not a deep neural network but simple linear layers, so the features may not be representative and robust.
Second, the region proposals are highly restricted by the conventional approach, so the final class activation map contains only a rectangular box that coarsely approximates the object.
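
The attention pooling described above can be sketched as follows. The proposal features and the attention weights w_att are random stand-ins, and attend_over_proposals is a hypothetical name; the sketch only mirrors the score, softmax, and weighted-average steps.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_over_proposals(proposal_feats, w_att, b_att=0.0):
    """proposal_feats: (N, D) features of N proposals; w_att: (D,).

    Returns the attention-weighted image feature (D,) and the weights (N,).
    """
    scores = proposal_feats @ w_att + b_att   # one linear score per proposal
    attention = softmax(scores)               # normalize across proposals
    image_feat = attention @ proposal_feats   # weighted average of proposals
    return image_feat, attention

# Random stand-ins for 5 proposals with 16-dimensional features.
rng = np.random.default_rng(0)
proposal_feats = rng.normal(size=(5, 16))
w_att = rng.normal(size=16)

image_feat, attention = attend_over_proposals(proposal_feats, w_att)
best_proposal = int(attention.argmax())  # the single box reported as the object
```

Taking the arg-max over the attention weights is what restricts the output to one rectangular proposal per image.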

Hide-and-seek
Kumar Singh, Krishna, and Yong Jae Lee. "Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization." Proceedings of the IEEE International Conference on Computer Vision. 2017.
http://openaccess.thecvf.com/content_ICCV_2017/papers/Singh_Hide-And-Seek_Forcing_a_ICCV_2017_paper.pdf

They also used CAM. Rather than modifying the CNN architecture, they modified the input image by hiding random patches from it. The underlying idea is to force the network to focus on multiple relevant parts of an object instead of only the most relevant part. I would rephrase the contribution as data augmentation: injecting noise (blocks) into the input images to learn a more robust deep neural network. In terms of feature maps, the activated region is enlarged from only the most discriminative region to several discriminative regions. I'm not sure whether the proposed method increases the number of false positive detections. Note that this work should be easy to reproduce, but I don't expect a network made more generalizable by noise-based data augmentation to improve the results dramatically.
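
Since the augmentation is simple, here is a rough sketch of what patch hiding could look like. The grid size, hiding probability, and fill value are my assumptions for illustration, not the paper's settings, and hide_patches is a hypothetical helper.

```python
import numpy as np

def hide_patches(image, grid=4, hide_prob=0.5, fill_value=0.0, rng=None):
    """Split an (H, W) or (H, W, C) image into a grid x grid layout and
    overwrite each cell with fill_value with probability hide_prob.

    Returns the augmented copy and the boolean (grid, grid) hide pattern.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    h, w = image.shape[:2]
    rows = np.linspace(0, h, grid + 1, dtype=int)  # cell boundaries
    cols = np.linspace(0, w, grid + 1, dtype=int)
    hidden = rng.random((grid, grid)) < hide_prob
    for i in range(grid):
        for j in range(grid):
            if hidden[i, j]:
                out[rows[i]:rows[i + 1], cols[j]:cols[j + 1]] = fill_value
    return out, hidden

# Toy usage on an all-ones 8x8 image, so hidden cells are easy to count.
image = np.ones((8, 8))
augmented, hidden = hide_patches(image, grid=4, hide_prob=0.5,
                                 rng=np.random.default_rng(0))
```

A new random pattern would be drawn for every training image and every epoch, while test images are left intact.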

Two phase learning
Kim, Dahun, et al. "Two-phase learning for weakly supervised object localization." Proceedings of the IEEE International Conference on Computer Vision. 2017.
https://dgyoo.github.io/papers/iccv17.pdf

The method is limited to natural images and to object labels already seen by the pre-trained model. Only the second network is trained, and the CAMs from the two networks are merged at inference time. Element-wise multiplication is used to constrain the intermediate feature maps in the second network. The quality of the suppression mask from the first network is important: if it is poor, the second network will be meaningless. Therefore, this approach is not suitable for medical applications with limited labeled data.
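
A rough sketch of the suppression and merging steps follows. The threshold, the binary form of the mask, and the element-wise maximum used for merging are all assumptions I made to get something runnable; the paper's exact choices may differ.

```python
import numpy as np

def suppression_mask(cam, threshold=0.6):
    """Binary mask that is 0 where the first network's normalized CAM is
    at or above `threshold` (an assumed value) and 1 elsewhere."""
    cam = cam - cam.min()
    peak = cam.max()
    norm = cam / peak if peak > 0 else cam
    return (norm < threshold).astype(cam.dtype)

def suppress(features, mask):
    """Element-wise multiplication of (K, H, W) feature maps by an (H, W) mask."""
    return features * mask[None, :, :]

def merge(cam_a, cam_b):
    """Merge two CAMs by element-wise maximum (an assumed merging rule)."""
    return np.maximum(cam_a, cam_b)

# Toy maps: network 1 fires at the top-left corner, network 2 at the bottom-right.
cam1 = np.zeros((4, 4))
cam1[0, 0] = 1.0
cam2 = np.zeros((4, 4))
cam2[3, 3] = 1.0

mask = suppression_mask(cam1)
suppressed = suppress(np.ones((3, 4, 4)), mask)  # intermediate maps of network 2
merged = merge(cam1, cam2)
```

If the first CAM is wrong, the mask removes the wrong evidence and the second network is pushed away from the object, which is the failure mode noted above.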

Adversarial erasing
Wei, Yunchao, et al. "Object region mining with adversarial erasing: A simple classification to semantic segmentation approach." IEEE CVPR. Vol. 1. No. 2. 2017.
http://openaccess.thecvf.com/content_cvpr_2017/papers/Wei_Object_Region_Mining_CVPR_2017_paper.pdf

With adversarial erasing (AE), a classification network first mines the most discriminative region for the image category label “dog”. Then, AE erases the mined region (head) from the image and the classification network is re-trained to discover a new object region (body) for performing classification without a performance drop. This adversarial erasing process is repeated multiple times and the erased regions are merged into an integral foreground segmentation mask. Repeating such adversarial erasing can localize increasingly discriminative regions diagnostic for the image category until no more informative region is left.
How is the stopping point recognized when "no more informative region is left"? The algorithm expresses it as a loop condition: while (training of classification is successful) do.
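
The loop can be sketched as below. The retraining step is replaced by a toy stand-in, toy_train_and_localize, which I invented so that the example runs; in the real method this would be a full retraining of the classifier followed by thresholding its CAM.

```python
import numpy as np

def adversarial_erasing(image, train_and_localize, max_rounds=10):
    """Repeat mine-then-erase while training still succeeds.

    train_and_localize(image) -> (converged, region_mask) is supplied by
    the caller. Returns the union of all mined regions.
    """
    foreground = np.zeros(image.shape, dtype=bool)
    current = image.copy()
    for _ in range(max_rounds):
        converged, region = train_and_localize(current)
        if not converged:          # "training of classification" failed: stop
            break
        foreground |= region       # merge the mined region into the mask
        current[region] = 0.0      # erase it before the next round
    return foreground

def toy_train_and_localize(image, min_evidence=0.3):
    """Toy stand-in: 'training' succeeds while some pixel still carries
    enough evidence; the mined region is whatever lies near the peak."""
    peak = image.max()
    if peak < min_evidence:
        return False, np.zeros(image.shape, dtype=bool)
    return True, image >= 0.9 * peak

# Toy "image" whose values act as per-pixel evidence for the class.
image = np.array([[0.0, 0.1, 0.0],
                  [0.9, 0.6, 0.4],
                  [0.0, 0.1, 0.0]])
mask = adversarial_erasing(image, toy_train_and_localize)
```

In the toy run the three pixels of the middle row are mined one round at a time, and the loop stops once only weak evidence (0.1) remains.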

Adversarial Complementary Learning (ACoL)
Zhang, Xiaolin, et al. "Adversarial complementary learning for weakly supervised object localization." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
http://openaccess.thecvf.com/content_cvpr_2018/papers/Zhang_Adversarial_Complementary_Learning_CVPR_2018_paper.pdf

ACoL differs from adversarial erasing (AE) in two ways. First, AE trains three networks independently for adversarial erasing, whereas ACoL trains two adversarial branches jointly by integrating them into a single network. Second, AE adopts a recursive method to generate localization maps and has to forward the networks multiple times.

Multiple-instance learning (Maxpooling)
Li, Zhe, et al. "Thoracic disease identification and localization with limited supervision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
http://openaccess.thecvf.com/content_cvpr_2018/papers/Li_Thoracic_Disease_Identification_CVPR_2018_paper.pdf

Fair Applications in Medical Image Analysis

ChestX-ray8
Wang, Xiaosong, et al. "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017.
http://openaccess.thecvf.com/content_cvpr_2017/papers/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf

Diabetic retinopathy lesions in retinal fundus images
Gondal, Waleed M., et al. "Weakly-supervised localization of diabetic retinopathy lesions in retinal fundus images." 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017.
https://arxiv.org/pdf/1706.09634.pdf

It can detect multiple lesions appearing in one image.
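
One way to get several detections out of a single heat map is to binarize it and take one box per connected region. The sketch below does that with a plain breadth-first search. The threshold and the 4-connectivity are assumptions of mine, and this is not necessarily how the paper builds its region proposals.

```python
from collections import deque

import numpy as np

def region_boxes(heatmap, threshold=0.5):
    """Return inclusive boxes (r0, c0, r1, c1), one per 4-connected
    region of pixels whose value is at or above `threshold`."""
    binary = heatmap >= threshold
    seen = np.zeros(binary.shape, dtype=bool)
    h, w = binary.shape
    boxes = []
    for r in range(h):
        for c in range(w):
            if not binary[r, c] or seen[r, c]:
                continue
            # Flood-fill this region and track its extent.
            queue = deque([(r, c)])
            seen[r, c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes

# Toy heat map with two separate hot spots.
heatmap = np.zeros((6, 6))
heatmap[0:2, 0:2] = 0.9
heatmap[4:6, 3:6] = 0.7
boxes = region_boxes(heatmap)
```

Here the two hot spots give two boxes, (0, 0, 1, 1) and (4, 3, 5, 5), instead of a single box around the global maximum.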

Placental Ultrasound Images with Residual Networks
Qi, Huan, Sally Collins, and Alison Noble. "Weakly supervised learning of placental ultrasound images with residual networks." Annual Conference on Medical Image Understanding and Analysis. Springer, Cham, 2017.

The paper combines the CAMs of the four tissue combinations typically observed in a placental image, namely (1) placenta only (PL); (2) placenta and myometrium (PL+MY); (3) placenta and subcutaneous tissue (PL+ST); (4) placenta, myometrium and subcutaneous tissue (PL+MY+ST). This is achieved by incorporating a global average pooling (GAP) layer before the fully connected layer.

Iterative saliency map refinement
González-Gonzalo, Cristina, et al. "Improving weakly-supervised lesion localization with iterative saliency map refinement." (MIDL 2018).
https://openreview.net/pdf?id=r15c8gnoG

An interesting approach that reveals discriminative image regions by inpainting based on the previous CAM, yielding a slight improvement of the final accuracy over the initial accuracy. Note that the improvement is not significant and is application-specific.

Proximal Femur Fractures
Jiménez-Sánchez, Amelia, et al. "Weakly-Supervised Localization and Classification of Proximal Femur Fractures." arXiv preprint arXiv:1809.10692 (2018).
https://arxiv.org/pdf/1809.10692.pdf

This paper investigated and adapted Spatial Transformers (ST), Self-Transfer Learning (STL), and localization from global pooling layers (CAM), covering settings with and without localization, and with supervised and weakly supervised localization. (a) and (f) are the lower- and upper-bound references, respectively. (b) requires supervised training of the localization network. (c), (d), and (e) are weakly supervised object localization methods. Their experimental results show that STL guides feature activations and boosts performance when the dataset has a larger number of labels (6 classes), but lowers performance for binary classification. Different pooling layers are investigated and, as expected, global average pooling is confirmed as the best one. Also, CAM converges faster than the compared methods (attention and STL), as it is composed of a single network.

Last edited: 2019.03.06 08:24:13
