1. Preface
https://arxiv.org/abs/2111.09452
The main challenge in moving from classic OD to open-vocabulary detection (OVD) is that existing OD datasets cover only a limited set of categories; for example, the widely used COCO has only 80 classes, which makes recognizing novel categories difficult.
Based on this observation, the idea of this paper is to automatically generate pseudo object-detection annotations from large-scale image-caption pairs and use them to train the model.
1. Introduction
OD is limited to a fixed set of objects (e.g., 80 objects for COCO)
to reduce the human labor needed for annotation: Zero-shot OD (ZSOD) & OVD
ZSOD: transfer from base to novel categories by exploiting the correlations between base and novel categories;
OVD: transfer from base to novel categories with the help of image captions (personally, I find this characterization not entirely accurate; OVD does not have to be done with captions)
both ZS-OD and OVD are limited by the small size of base category set
This paper: automatically generates bounding-box annotations for objects at scale using existing resources.
Existing vision-language models imply strong localization ability.
this paper: improve OVD using pseudo-bounding box annotations generated from large-scale image caption pairs.
left: human labor
right: image-caption + VL model --> pseudo annotation
2. Method
two components:
a pseudo bounding-box label generator;
an open vocabulary object detector.
2.1 pseudo bounding-box label generator
predefine objects of interest
input: {image, caption} pairs
image --> image encoder --> visual embedding per region
caption --> text encoder --> text embedding per token
input {visual embeddings, text embeddings} --> multi-modal encoder --> multi-modal features via image-text interaction (cross-attention).
for each object in the predefined vocabulary, e.g., "racket", use Grad-CAM to visualize its activation map.
apply proposal generator to get multiple boxes
the box with the largest overlap with the activation map is regarded as the pseudo box.
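The last two steps can be sketched in NumPy. This is a hedged illustration, not the paper's implementation: the Grad-CAM formulation is the standard pooled-gradient weighting, and the "largest overlap" score is instantiated here as IoU between each proposal and a thresholded activation mask (the threshold and the exact overlap criterion are assumptions).

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Standard Grad-CAM: weight each feature channel by its pooled gradient.

    feature_maps: (K, H, W) activations from the multi-modal encoder.
    gradients:    (K, H, W) gradients of the object word's score w.r.t. them.
    Returns an (H, W) non-negative activation map.
    """
    weights = gradients.mean(axis=(1, 2))              # (K,) channel weights
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum -> (H, W)
    return np.maximum(cam, 0.0)                        # ReLU

def select_pseudo_box(activation_map, proposals, thresh=0.5):
    """Pick the proposal overlapping the activation map best.

    Overlap is scored as IoU between the box and the binarized activation
    mask (one plausible choice; the paper's exact criterion may differ).
    proposals: list of (x1, y1, x2, y2) integer boxes.
    """
    mask = activation_map >= thresh * activation_map.max()
    best_box, best_iou = None, -1.0
    for (x1, y1, x2, y2) in proposals:
        box_mask = np.zeros_like(mask)
        box_mask[y1:y2, x1:x2] = True
        inter = np.logical_and(mask, box_mask).sum()
        union = np.logical_or(mask, box_mask).sum()
        iou = inter / union if union else 0.0
        if iou > best_iou:
            best_box, best_iou = (x1, y1, x2, y2), iou
    return best_box
```

In this sketch a proposal that tightly covers the activated region wins over both a tiny box inside it and a loose box around it, which is the behavior the pipeline relies on.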
2.2 open vocabulary object detector with pseudo-bounding box:
because label generation and detector training form a two-step pipeline, any OVD model can consume the pseudo labels.
this paper therefore selects a typical OVD framework.
- input: image, large-scale object vocabulary set
- image --> feature extractor --> object proposals --> RoI --> region-based visual embedding
- category texts in large-scale object vocabulary set + "background" --> text encoder --> text embeddings
- training: encourage high similarity for paired {region-based visual embedding, text embedding} and low similarity for unpaired ones.
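A minimal sketch of such a training objective, assuming L2-normalized embeddings and a softmax over cosine similarities with a temperature; the actual loss form and temperature value in the paper may differ.

```python
import numpy as np

def contrastive_detection_loss(region_embs, text_embs, labels, tau=0.01):
    """Per-region classification loss over the vocabulary (+ background).

    region_embs: (R, D) L2-normalized region-based visual embeddings.
    text_embs:   (C, D) L2-normalized category (and "background") embeddings.
    labels:      (R,) index of the matched category for each region.
    tau:         softmax temperature (the value here is an assumption).
    Cross-entropy over cosine similarities pulls each region toward its
    paired text embedding and away from the unpaired ones.
    """
    logits = region_embs @ text_embs.T / tau         # (R, C) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)        # softmax over categories
    picked = probs[np.arange(len(labels)), labels]   # prob of the paired text
    return -np.mean(np.log(picked + 1e-12))
```

The "background" embedding simply acts as one more row of `text_embs`, so proposals matched to no vocabulary word still receive a well-defined target.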