Paper | Open-Vocabulary Object Detection Using Captions

1 basic

  • github.com/alirezazareian/ovr-cnn
  • the first paper to propose the task of "open-vocabulary object detection"

2 introduction

OD: each category needs thousands of bounding boxes;

stage 1: use {image, caption} pairs to learn a visual semantic space;
stage 2: use annotated boxes for several classes to train object detection;
stage 3: at inference, detect objects beyond the base classes;

To summarize, we train a model that takes an image and detects any object within a given target vocabulary V_{T}.

Task Definition:

  1. test on target vocabulary V_{T};
  2. train on an image-caption dataset with the vocabulary as V_{C}
  3. train on an annotated object detection dataset with the vocabulary as V_{B}
  4. V_{T} is not known during training and can be any subset of the entire vocabulary V_{\Omega}.
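
The relation between the vocabularies can be illustrated with plain Python sets. This is only a toy sketch: the class names are hypothetical and were chosen for illustration, not taken from the paper's datasets.

```python
# Hypothetical toy vocabularies; the class names are illustrative only.
V_omega = {"cat", "dog", "person", "umbrella", "zebra", "kite"}  # entire vocabulary
V_B = {"cat", "person"}                     # base classes with box annotations
V_C = {"cat", "dog", "person", "umbrella"}  # words that appear in captions
V_T = {"cat", "dog", "zebra"}               # target classes, chosen only at test time

# Every vocabulary is a subset of the entire vocabulary.
assert V_B <= V_omega and V_C <= V_omega and V_T <= V_omega

base_targets = V_T & V_B            # targets seen with boxes during training
novel_targets = V_T - V_B           # targets never seen with boxes
caption_only = novel_targets & V_C  # novel targets seen only through captions
unseen = novel_targets - V_C        # novel targets covered only by word embeddings
```

In this toy setup, "dog" is a novel target learned from captions alone, while "zebra" appears in neither training set.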

**compare with ZSD and WSD:**

  • ZSD: no V_{C};
  • WSD: no V_{B}, and needs to know V_{T} before training;
  • OVD is a generalization of ZSD and WSD.

outcome:

  • significantly outperforms the ZSD and WSD methods;

3 Method

OVD framework:

  • meaning of open: the words in the captions are not limited, but in practice it is not literally "open", as it is limited to the pretrained word embeddings. (However, word embeddings are typically trained on very large text corpora such as Wikipedia that cover nearly every word.)

3.1 Learning visual semantic space
  • resembles PixelBERT
  • use ResNet-50 (RN50) as the visual encoder and BERT as the text encoder;
  • design a V2L (vision to language) module, which maps the feature vectors of image patches to text vectors;
  • use the grounding (main) task to train the RN50 & V2L module.

specifically,

  1. input image --> RN50 --> features of patches
  2. each patch feature (vision) --> V2L --> patch feature (language) e^{I}_{i}
  3. caption --> Embedding e^{C}_{j}--> BERT --> features of words f^{C}_{j}
  4. patch features (language), word features --> multimodal transformer --> new features for patches and words m^{I}_{i}, m^{C}_{j}.
  5. task: perform weakly supervised grounding using {e^{I}_{i}, e^{C}_{j}}: a paired {img, caption} is a positive and an unpaired {img, caption} is a negative; the distance between an image and a caption is computed by averaging over all pairs of e^{I}_{i} and e^{C}_{j}.
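
The grounding objective in step 5 can be sketched in NumPy as follows. This is a minimal sketch under assumptions, not the authors' implementation: the pairwise similarity is taken to be a dot product, the average over patches is weighted by a softmax attention per word, and the loss is a symmetric softmax contrast within a batch. The names `grounding_score` and `grounding_loss` are hypothetical.

```python
import numpy as np

def grounding_score(e_img, e_cap):
    """Image-caption score from patch and word embeddings.

    e_img: (n_I, d) patch embeddings after V2L.
    e_cap: (n_C, d) word embeddings of the caption.
    """
    sim = e_cap @ e_img.T  # (n_C, n_I) pairwise dot products
    # Assumption: each word attends over patches with a softmax.
    att = np.exp(sim - sim.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)
    # Attention-weighted sum over patches, then average over words.
    return float((att * sim).sum(axis=1).mean())

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def grounding_loss(images, captions):
    """Contrastive loss over a batch where images[k] is paired with captions[k]."""
    n = len(images)
    scores = np.array([[grounding_score(images[a], captions[b])
                        for b in range(n)] for a in range(n)])
    diag = np.arange(n)
    # Paired entries (the diagonal) are positives; all other entries are negatives.
    loss_img_to_cap = -log_softmax(scores, axis=1)[diag, diag].mean()
    loss_cap_to_img = -log_softmax(scores, axis=0)[diag, diag].mean()
    return float(loss_img_to_cap + loss_cap_to_img)
```

Minimizing this loss pushes each image to score higher with its own caption than with the other captions in the batch, and vice versa.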

The grounding objective results in a learned visual backbone and V2L layer that can map regions in the image to the words that best describe them.

Besides, to teach the model to 1) extract all objects that might be described in captions and 2) determine which word best completes the caption, the paper further introduces the image-text matching (ITM) subtask and the masked language modeling (MLM) subtask.

3.2 Learning open-vocabulary detection
  • use Faster R-CNN
  1. blocks 1-3 of the backbone extract features;
  2. RPN --> predict objectness & bounding box coordinates;
  3. non-max suppression (NMS);
  4. region-of-interest pooling (ROI pooling) to get a feature map for each potential object, which is typically used for classification in the supervised setting;
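
Step 3 can be illustrated with a generic greedy NMS in NumPy. This is a textbook sketch, not code from the OVR-CNN repository; the function name `nms` and the default threshold of 0.5 are assumptions, and boxes are assumed to have positive area.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) objectness scores.
    Returns the indices of the kept boxes, highest score first.
    """
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the top box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop boxes that overlap the top box too much.
        order = rest[iou <= iou_threshold]
    return keep
```

The boxes that survive are the proposals passed on to ROI pooling.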

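However, in the zero-shot setting, a classifier over a fixed set of classes cannot be used; instead, the box features are passed through the V2L layer and compared to the word embeddings of the classes.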

3.3 testing

Testing is basically the same as training, except that in the last step the box features after V2L are compared to the target classes instead of the base classes.
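
The last step can be sketched as follows. This is a minimal sketch under assumptions: V2L is treated as a single linear map `W_v2l`, similarity is a dot product, and the background class is given a fixed zero logit (equivalent to an all-zero embedding). The function name `classify_regions` and all argument names are hypothetical.

```python
import numpy as np

def classify_regions(region_feats, W_v2l, class_embs, class_names):
    """Assign each box to a target class or to background.

    region_feats: (N, d_v) pooled visual features, one per box.
    W_v2l:        (d_v, d_w) V2L projection (assumed linear).
    class_embs:   (K, d_w) word embeddings of the target classes.
    """
    e = region_feats @ W_v2l   # project boxes into the word-embedding space
    logits = e @ class_embs.T  # (N, K) similarity to each class
    # Assumption: background has an all-zero embedding, hence a zero logit.
    logits = np.concatenate([np.zeros((len(e), 1)), logits], axis=1)
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    names = ["background"] + list(class_names)
    return [names[k] for k in probs.argmax(axis=1)], probs
```

Because the class list is only an argument, swapping the base classes for any target vocabulary V_{T} at test time requires no retraining, provided word embeddings exist for the new classes.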
