ECCV2016 ObjectNet3D: A Large Scale Database for 3D Object Recognition

0. Keywords

Database Construction, 3D Object Recognition, 3D Pose Annotation, 3D Shape Annotation

1. Links

This paper comes from Stanford. The first author, Yu Xiang, was at the time a Chinese PhD student from the University of Michigan visiting the Computational Vision & Geometry Lab at Stanford; before that he earned his bachelor's and master's degrees at Fudan University. He is now an assistant professor in the CS department at UT Dallas. Notably, Yu Xiang is also the first author of the 3D dataset PASCAL3D+. These solid benchmark efforts, together with a series of algorithmic works on 3D object detection, attest to his standing in this field.

Paper: https://cvgl.stanford.edu/papers/xiang_eccv16.pdf

Code: https://github.com/yuxng/ObjectNet3D_toolbox

Project page: https://cvgl.stanford.edu/projects/objectnet3d/

The ObjectNet3D database proposed in this paper serves 3D object recognition, i.e., estimating 3D object pose and reconstructing 3D shape from 2D RGB images, with 2D object detection (classification and localization) as a necessary intermediate task. The figure below shows an annotated example.

ObjectNet3D

2. Overview of the Main Content

※ Abstract

We contribute a large scale database for 3D object recognition, named ObjectNet3D, that consists of 100 categories, 90,127 images, 201,888 objects in these images and 44,147 3D shapes. Objects in the 2D images in our database are aligned with the 3D shapes, and the alignment provides both accurate 3D pose annotation and the closest 3D shape annotation for each 2D object. Consequently, our database is useful for recognizing the 3D pose and 3D shape of objects from 2D images. We also provide baseline experiments on four tasks: region proposal generation, 2D object detection, joint 2D detection and 3D object pose estimation, and image-based 3D shape retrieval, which can serve as baselines for future research using our database. Our database is available online at http://cvgl.stanford.edu/projects/objectnet3d.

The abstract states that the large-scale ObjectNet3D database targets 3D object recognition (also referred to as 3D object detection) and gives exact counts of categories, images, object instances, and 3D shapes. In the database, objects in 2D images are aligned with 3D shapes, i.e., every 2D object carries a manually annotated 3D pose and a 3D shape (CAD model). (PS: although the paper claims these labels are accurate, later higher-quality datasets of the same kind, such as Pix3D and Objectron, quantitatively showed that ObjectNet3D's labels are not very precise, owing to the technology available at the time.) The paper also provides baselines on several tasks (including 2D object detection, 3D pose estimation, and 3D shape retrieval), which is essential content for a dataset paper.

※ Introduction

Recognizing 3D properties of objects from 2D RGB images is an important and widely studied task ("Recognizing 3D properties of objects from 2D images, such as 3D location, 3D pose and 3D shape, is a central problem in computer vision that has wide applications in different scenarios including robotics, autonomous driving and augmented reality.") Progress in this area depends on the construction of suitable datasets (providing 3D annotations to 2D objects). Representative datasets include NYU Depth (RGB-D), KITTI (RGB + point cloud), and PASCAL3D+ (RGB + 3D CAD models). Only with such benchmarks can supervised learning algorithms be proposed and fairly compared.

However, existing datasets with 3D annotations are mostly limited in scale, either in the number of object categories or in the number of images; at the least, no dataset in this area matches the scale of the large 2D image datasets such as ImageNet and MS-COCO. Having witnessed the progress that large-scale datasets brought to image classification, detection, and segmentation, the authors argue that a comparably large 3D object recognition dataset is badly needed. ("After witnessing the progress on image classification, 2D object detection and segmentation with the advance of such large scale 2D image databases, we believe that a large scale database with 3D annotations would significantly benefit 3D object recognition.") (This paragraph motivates ObjectNet3D; the argument and writing are worth borrowing.)

The third paragraph describes the scale of ObjectNet3D and the overall collection and construction pipeline. Images are selected from the ImageNet repository and 3D shapes from the ShapeNet repository. As for annotations, every object in every image has a bounding box and a corresponding aligned 3D shape (CAD model), where alignment means that the projection of the 3D shape roughly coincides with the region of the 2D object, as shown in Fig. 1 below. These 3D shape labels make ObjectNet3D usable for 3D object recognition (pose estimation and closest-shape retrieval), and the projections of the 3D shapes also yield approximate segmentation boundaries.

Fig. 1. An example image in our database with 2D objects aligned with 3D shapes. The alignment enables us to project each 3D shape to the image where its projection overlaps with the 2D object as shown in the image on the right.

In the fourth paragraph, the authors emphasize that their core contribution, aligning 2D objects with 3D shapes, is non-trivial, for two reasons: 1) the most similar shape for each 2D object must be picked out of hundreds or thousands of 3D shapes, and asking human annotators to compare them one by one is not feasible; 2) aligning the pose of a 3D shape with a 2D object is error-prone, so the quality of the alignment is hard to control. The paper offers a solution to each problem: (i) an existing deep metric learning method (see the reference in the paper) pre-selects candidate shapes: it takes rendered images of the 3D models as input, extracts feature embeddings, and, for a given 2D object, returns the top-K most similar 3D models as candidates for the annotator; (ii) to guarantee annotation quality, the authors built a dedicated tool for aligning 3D models to 2D objects, whose interface lets annotators adjust the camera parameters to maximize the alignment ("To guarantee the quality of the alignment, we have designed an annotation tool to align the 3D shape with the 2D object. Our annotation interface allows annotators to interactively find a set of camera parameters for each object that produce good alignment.").

In the last paragraph, the authors reiterate that they provide baseline methods on ObjectNet3D for future research.

※ Related Work

This section reviews representative datasets related to 3D object recognition. ("We review representative datasets related to 3D object recognition.")

Datasets with viewpoints. These datasets provide bounding boxes and viewpoints for 2D images, but most are small in scale, coarse in viewpoint discretization, and simple in scene context; the authors cite 3DObject and EPFL Car as concrete examples. Compared with these, ObjectNet3D provides continuous viewpoint annotation, and its images come from real scenes. ("It provides continuous viewpoint annotation to realistic images from the web.")

Datasets with depths or 3D points. These datasets register depth images or point clouds to objects in 2D images. Examples using depth include RGB-D Object, NYU Depth, and SUN RGB-D; KITTI uses point clouds (see the paper for details). Compared with these, the 3D shape annotation provided here for each 2D object carries richer information than depth or 3D points ("we align a 3D shape to each 2D object and provide 3D shape annotation to objects, which is richer information than depth or 3D points and allows us to transfer meta-data from the 3D shape back to the image.").

Datasets with 2D-3D alignments. The pioneering work on datasets with 2D-3D alignment is LabelMe3D ("An influential work in building datasets with 2D-3D alignment is LabelMe3D"). Later datasets offering 2D image / 3D shape pairs include IKEA and PASCAL3D+; unlike these two (which have shortcomings: "it is insufficient to cover the variations of common object categories and their geometry variability"), ObjectNet3D provides image-shape pairs at a much larger scale. Table 1 below compares the representative datasets.

※?Database Construction

Our goal is to build a large scale database for 3D object recognition. We resort to images in existing image repositories and propose an approach to align 3D shapes (which are available from existing 3D shape repositories) to the objects in these images. In this way, we have successfully built the ObjectNet3D database.

The database construction is described in the following six steps:

● 3.1 Object Categories

First, note that ObjectNet3D targets object category recognition. Since 3D shapes must be provided as annotations for 2D objects, only rigid object categories are selected. For deformable and articulated objects, especially animals and humans, parts would have to be moved during alignment to fit the silhouette of the 2D object, which is extremely difficult; the authors leave this for future work ("We consider the extension to non-rigid object categories as a future work."). (PS: Michael Black's group at the Max Planck Institute has long worked on digital humans and 3D human shape; their output is worth following.) Table 2 lists the 100 rigid categories in ObjectNet3D, including the 12 rigid categories of PASCAL VOC and 9 rigid categories of 3DObject.

PS: the highlighted categories arguably cannot be counted as rigid objects?

● 3.2 2D Image Acquisition

With the 100 categories fixed, the authors select 2D images from ImageNet. ImageNet organizes images by the WordNet hierarchy, where each node is a synset; images are downloaded from the synsets corresponding to the chosen categories. However, no synsets were found for can, desk lamp, and trophy, and the synsets for fork and iron contain very few images, so the authors supplemented these categories with extra images (Google Image Search). Fig. 2 below shows example images for some categories. In most images the object is highly salient, a consequence of ImageNet being built mainly for image classification. (PS: this also suggests that 3D object recognition datasets cannot yet properly support multi-object detection, i.e., images containing multiple categories and unknown numbers of objects, as in the 2D detection dataset MS-COCO.)

Fig. 2. Example images in our database.

● 3.3 3D Shape Acquisition

First, the authors manually select representative 3D shapes from Trimble 3D Warehouse for the 100 categories; these cover most subcategories of each class. For car, for instance, sedans, SUVs, vans, trucks, and other subcategories are all included. Fig. 3a shows the 7 subcategory shapes for bench; these shapes are consistently aligned in viewpoint (e.g., front view of bench) and their sizes normalized to fit into a unit sphere. Each 3D shape is also manually annotated with a few representative keypoints (the red dots in Fig. 3a), which help identify the shape's pose during image-shape alignment. In total, 783 3D shapes were collected from Trimble 3D Warehouse.

Fig. 3. Examples of the 3D shapes for bench in our database. (a) 3D shapes manually selected from Trimble 3D Warehouse. (b) 3D shapes collected from ShapeNet.

Next, to increase the number of 3D shapes, the authors draw from the ShapeNet repository, which is organized similarly to ImageNet; specifically they use the ShapeNetCore subset. ShapeNetCore covers 55 categories, 42 of which overlap with ObjectNet3D's 100 categories, and the authors download an additional 43,364 3D shapes from it (examples in Fig. 3b). These models are clearly more diverse than those from Trimble 3D Warehouse and carry richer texture ("These 3D models are valuable since they capture more shape variations and have rich texture/material information.").

● 3.4 Camera Model

This step aligns an object in an image with a 3D shape, which requires specifying a camera model; the one used by the authors is shown in Fig. 4a below.

Fig. 4. Illustration of the camera model in our database (a) and the annotation interface for 2D-3D alignment (b).

The camera model has three parts: 1) a world coordinate system O with its origin at the center of the 3D shape, with axes (i,j,k); 2) a camera coordinate system C with axes (\dot{i},\dot{j},\dot{k}), where by convention the camera looks down the negative \dot{k} axis. Under this assumption, the rotation between the two systems can be written with three variables, R=(a,e,\theta): azimuth a, elevation e, and in-plane rotation \theta; the translation can likewise be written with three variables, T=(a,e,d), where d is the distance from the camera to the origin of the object model. R and T together form the extrinsic parameters of the camera; 3) the intrinsic parameters comprise the focal length f and the viewport size \alpha. The authors fix f=1 and \alpha=2000, i.e., one unit of focal length corresponds to 2000 pixels in the real world. The principal point (u,v) is additionally assumed to lie at the image center (w/2, h/2).

With these, the projection matrix M of the camera model is:

M=\underbrace{\left[\begin{array}{ccc} \alpha f & 0 & u \\ 0 & \alpha f & v \\ 0 & 0 & 1 \end{array} \right]}_{intrinsic ~ parameters} \underbrace{[R(a,e,\theta), T(a,e,d)]}_{extrinsic ~ parameters}.~~~~~~(1)

During annotation, f and \alpha are held fixed, and the remaining 6 parameters a,e,\theta,d,u,v are adjusted to achieve the image-shape alignment. (PS: these are the settings the authors had to consider when designing and building the annotation tool.)
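The camera model above can be sketched in code. This is a minimal NumPy illustration, not the authors' implementation: the paper only specifies the parameters, so the exact composition of the rotation (azimuth about the vertical axis, then elevation, then in-plane rotation) is an assumption here.

```python
import numpy as np

def projection_matrix(a, e, theta, d, f=1.0, alpha=2000.0, u=0.0, v=0.0):
    """Build the 3x4 projection matrix M of Eq. (1).

    a, e, theta : azimuth, elevation, in-plane rotation (radians)
    d           : distance from the camera to the world origin
    f, alpha    : focal length and viewport size (fixed to 1 and 2000 in the paper)
    u, v        : principal point, typically (w/2, h/2)
    """
    ca, sa = np.cos(a), np.sin(a)
    ce, se = np.cos(e), np.sin(e)
    ct, st = np.cos(theta), np.sin(theta)
    # Assumed convention: azimuth, then elevation, then in-plane rotation.
    Ra = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])   # azimuth
    Re = np.array([[1, 0, 0], [0, ce, -se], [0, se, ce]])   # elevation
    Rt = np.array([[ct, -st, 0], [st, ct, 0], [0, 0, 1]])   # in-plane rotation
    R = Rt @ Re @ Ra
    T = np.array([[0.0], [0.0], [d]])                       # camera at distance d
    K = np.array([[alpha * f, 0, u],
                  [0, alpha * f, v],
                  [0, 0, 1]])                               # intrinsics
    return K @ np.hstack([R, T])                            # 3x4 matrix M

def project(M, X):
    """Project a 3D point X (3,) to pixel coordinates."""
    x = M @ np.append(X, 1.0)
    return x[:2] / x[2]
```

With any rotation convention, projecting the world origin lands exactly on the principal point (u, v), which is a quick sanity check on the matrix.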

● 3.5 Annotation Process

The annotation task is to provide 3D annotations for objects in 2D images, in three steps: (i) annotate a bounding box for each 2D object; occluded objects and truncated objects are also labeled; (ii) for each object defined by a bounding box, match it to the most similar 3D shape; the shapes have already been collected and organized per category (see the Trimble 3D Warehouse models in Fig. 3a), with on average 7-8 shapes per category for the annotator to choose from; (iii) the annotator aligns the chosen 3D shape with the 2D object using the camera model of Sec. 3.4 and the dedicated annotation interface shown in Fig. 4b, paging through images with the interface controls and tuning the camera parameters of each object until they fit. This adjustment process is quite involved; the original passage reads:

Annotators have full control of all the 6 camera parameters using the interface: azimuth, elevation, distance, in-plane rotation and principal point. Whenever these parameters are changed, we re-project the 3D shape to the image and display the overlap, which is helpful for the annotator to find a set of camera parameters that align the 3D shape with the 2D object well. Our criterion for the alignment is maximizing the intersection over union between the projection of the 3D shape and the 2D object. Fig. 4(b) shows the finished alignment for a computer keyboard.

Fig. 5 shows the viewpoint distribution of several categories in the dataset.

PS: of the three rotation variables, azimuth and elevation can be displayed as points on the unit sphere, but in-plane rotation cannot, so color is used to distinguish different in-plane rotations.

● 3.6 3D Shape Retrieval

After the shape annotation is done, the authors turn to the ShapeNetCore models in Fig. 3b. They cover 42 of the 100 categories, with hundreds or thousands of shapes per category, which makes manually picking the best one impractical. The authors therefore use shape retrieval to recommend the most similar shapes ("So we develop a 3D shape retrieval method by learning feature embeddings with rendered images, and we use this method to retrieve the closest 3D shapes for objects in the 42 object categories.").

Concretely, let o be a 2D object and \mathcal{S}=\{S_1, S_2, ..., S_N\} a set of N candidate 3D shapes; the task is to rank the N shapes by their similarity to o. The authors cast this as a metric learning problem: measuring the distance D(o,S) between a 2D object and a 3D shape. To bridge the two domains, each 3D shape is represented by a set of 2D images rendered from it, S=\{s_1, s_2, ..., s_n\}, where s_i is the i-th rendering and n is the number of renderings (the authors choose n=100). The object-shape distance is then defined as the average distance to the renderings, D(o,S)=1/n\sum\nolimits^n_{i=1}D(o,s_i). The problem thus reduces to the distance D(o,s_i) between two 2D images, which is an active research field in the literature.

The authors adopt the then best-performing lifted structured feature embedding method (see the reference in the paper), which outperforms contrastive embedding and triplet embedding. During training, only rendered images are used, with the renderings of each 3D shape treated as one class. At test time, the Euclidean distance between each 2D object and the rendered images is computed, and the average distance serves as the object-shape similarity. To minimize the gap between rendered images and real test images, backgrounds and lighting variations are added to the renderings (see the references in the paper for the method).
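The reduction of the object-shape distance to an average image distance can be sketched as follows, assuming an embedding network has already mapped the object crop and all renderings to feature vectors (training the lifted structured embedding itself is out of scope; the function names are mine):

```python
import numpy as np

def shape_distance(obj_emb, rendered_embs):
    """D(o, S): mean Euclidean distance between the embedding of a 2D
    object crop and the embeddings of the shape's n rendered images."""
    diffs = rendered_embs - obj_emb                  # (n, dim)
    return float(np.mean(np.linalg.norm(diffs, axis=1)))

def rank_shapes(obj_emb, shapes):
    """Return shape indices sorted from most to least similar to the object.

    `shapes` is a list of (n_i, dim) arrays, each holding the embeddings
    of one shape's rendered images."""
    dists = [shape_distance(obj_emb, s) for s in shapes]
    return sorted(range(len(shapes)), key=lambda i: dists[i])
```

The top-K entries of the returned ranking are what the annotation tool would present to the annotator.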

※?Baseline Experiments

In this section the authors provide baselines on four tasks: object proposal generation, 2D object detection, joint 2D detection and 3D pose estimation, and image-based 3D shape retrieval. In the experiments, the training/validation (trainval) set contains 45,440 images and the test set 44,687 images.

● 4.1 Object Proposal Generation

The authors evaluate four different region proposal methods on the dataset: SelectiveSearch, EdgeBoxes, Multiscale Combinatorial Grouping (MCG), and Region Proposal Network (RPN) (see the references for algorithmic details). Detection recall measures proposal quality; Fig. 6 shows how recall varies under different settings (different numbers of object proposals per image and different IoU thresholds), with RPN evaluated with both AlexNet and VGGNet backbones. Two rough conclusions: 1) with around 1,000 proposals, all methods reach close to 90% recall at IoU 0.5; 2) RPN+VGGNet performs best for IoU between 0.5 and 0.7, while MCG remains the most robust across IoU thresholds. See the paper for further interpretation of the results. (PS: when content runs thin, interpretation pads it out.)
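The detection-recall metric plotted in Fig. 6 is standard; a minimal sketch (function names are mine):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def detection_recall(gt_boxes, proposals, thresh=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal
    with IoU >= thresh."""
    covered = sum(
        any(iou(gt, p) >= thresh for p in proposals) for gt in gt_boxes
    )
    return covered / len(gt_boxes) if gt_boxes else 0.0
```

Sweeping the number of proposals and the IoU threshold over this function reproduces the kind of curves shown in Fig. 6.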

Fig. 6. Evaluation of four different object proposal method on our dataset: SelectiveSearch [37], EdgeBoxes [43], MCG [6] and RPN [25].

● 4.2 2D Object Detection

The authors use the state-of-the-art Faster R-CNN to benchmark 2D object detection, again with both AlexNet and VGGNet backbones, each pre-trained on ImageNet. Fig. 7 shows the overall architecture, and the detection pipeline is summarized in the original text: "First, an input image is fed into a sequence of convolutional layers to compute a feature map of the image. Then, given a region proposal, the RoI pooling layer extracts a feature vector for the region proposal from the feature map. The feature vector is then processed by two Fully Connected (FC) layers, i.e., FC6 and FC7 each with dimension 4096. Finally, the network terminates at two FC branches with different losses (i.e., the third branch for viewpoint estimation in Fig. 7 is not used here), one for object class classification, and the other one for bounding box regression (see Faster R-CNN [11] for more details)." Since the dataset has 100 categories, the classification FC branch has 101 outputs (background is an extra class) and the bounding-box regression branch has 404 outputs.

Fig. 7. Illustration of the network architecture based on Fast R-CNN [11] for object detection and pose estimation.

Detection accuracy is measured with Average Precision (AP), and with mean AP (mAP) over categories for the multi-class setting. Table 3 reports the mAP results, from which the authors draw three points: 1) VGGNet outperforms AlexNet; 2) among the four proposal methods, SelectiveSearch and MCG beat EdgeBoxes, while RPN benefits most from VGGNet's stronger feature extraction; 3) VGGNet+RPN reaches 67.5 mAP on the proposed dataset; for comparison, the best mAP in the 2015 ImageNet detection challenge (200 categories) was 62.0. (PS: when content runs thin, interpretation pads it out.)

Fig. 8 shows the detection AP of each category, revealing some relatively easy categories (aeroplane, motorbike and train) and some hard ones (cabinet, pencil and road pole); the hard categories either have large intra-class variability or have less discriminative features. (PS: when content runs thin, interpretation pads it out.)

Fig. 8. Bar plot of the detection AP and viewpoint estimation AOS of the 100 categories on the test set from VGGNet with SelectiveSearch proposals.

The authors further regroup the 100 categories into six super-categories and analyze detection false positives with a diagnosing tool (see the reference): "We group all 100 categories into six super-categories: container, electronics, furniture, personal items, tools and vehicles, and analyze the detection false positives of these six groups using the diagnosing tool from [14]". Fig. 9 shows the results. For tools and vehicles, localization error accounts for most of the false positives, whereas for the other groups, confusion with other categories or background is the dominant detection error. (PS: when content runs thin, interpretation pads it out.)

Fig. 9. Distribution of top-ranked false positive types from VGGNet with SelectiveSearch proposals: Loc - poor localization; Sim - confusion with a similar category; Oth - confusion with a dissimilar category; BG - a false positive fires on background.

● 4.3 Joint 2D Detection and Continuous 3D Pose Estimation

Building on the Faster R-CNN detection structure, the authors add a third branch after the last FC layer to jointly predict 3D object pose; see the pink region at the lower right of Fig. 7 (a viewpoint regression FC branch). This branch has 3 x 101 outputs, i.e., three pose variables per class: azimuth, elevation and in-plane rotation. As with bounding-box regression, the smoothed L1 loss is used for pose regression.
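The smoothed L1 loss on the three pose variables can be sketched as below; beta=1.0 mirrors the Fast R-CNN default, an assumption since the paper does not state its exact hyperparameter:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smoothed L1 (Huber-style) loss for the viewpoint-regression branch,
    summed over the 3 angles of the ground-truth class.

    Quadratic for |error| < beta, linear beyond, so large pose errors do
    not dominate the gradient."""
    diff = np.abs(np.asarray(pred) - np.asarray(target))
    per_elem = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(np.sum(per_elem))
```

In training, only the 3 outputs belonging to the ground-truth class would receive this loss, mirroring how the class-specific box regression works.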

For evaluating 3D pose estimation, two metrics are used: Average Viewpoint Precision (AVP) from PASCAL3D+, and Average Orientation Similarity (AOS) from KITTI. Since both only measure azimuth, the authors extend them to all three angles via \Delta(R,R_{gt})=\frac{1}{\sqrt{2}}\|\log(R^T R_{gt})\|_F, the geodesic distance between the estimated rotation matrix R and the ground-truth rotation matrix R_{gt}. For AVP, a pose is counted correct when \Delta(R,R_{gt})<\pi/6; for AOS, the cosine similarity between the two poses, \cos(\Delta(R,R_{gt})), serves as the quantitative score.
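A sketch of the geodesic error and the two derived criteria. For rotation matrices, (1/\sqrt{2})\|\log(R^T R_{gt})\|_F equals the rotation angle of R^T R_{gt}, so the trace-based closed form below avoids an explicit matrix logarithm:

```python
import numpy as np

def rot_z(t):
    """Rotation by angle t about the z axis (used here only for testing)."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def geodesic_distance(R, R_gt):
    """Delta(R, R_gt) = (1/sqrt(2)) * ||log(R^T R_gt)||_F, computed via the
    rotation angle of R^T R_gt: cos(angle) = (trace - 1) / 2."""
    cos_angle = (np.trace(R.T @ R_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def pose_correct(R, R_gt, tol=np.pi / 6):
    """AVP criterion: the pose counts as correct if the geodesic error < pi/6."""
    return geodesic_distance(R, R_gt) < tol

def aos_similarity(R, R_gt):
    """AOS uses the cosine of the geodesic error as a soft similarity score."""
    return float(np.cos(geodesic_distance(R, R_gt)))
```

Because the AOS score is at most 1 per detection, the resulting AOS curve is always upper-bounded by the detection AP, as noted in the discussion of Fig. 8.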

Table 4 reports the results, and Fig. 8 plots the per-category AOS (by construction of AOS, the detection AP is always somewhat higher than AOS). The gap between AOS and AP in Fig. 8 reveals categories with poor pose estimation, such as comb, fork and teapot; these categories may be nearly symmetric or have large in-plane rotation angles. To further understand the error distribution over the three angles, the authors again use the diagnosing tool and report statistics over the six super-categories in Fig. 10: azimuth accounts for the large majority of the error, while for tools and personal items the in-plane rotation error grows markedly.

Fig. 10. Viewpoint error distribution of top-ranked true positives from VGGNet with SelectiveSearch proposals.

● 4.4 Image-based 3D Shape Retrieval

This section details the deep-metric-learning-based 3D shape retrieval method introduced in Sec. 3.6. When generating the 100 renderings per 3D shape, multiple viewpoints are covered: "These viewpoints are sampled from a distribution estimated with kernel density estimation using the viewpoint annotations (azimuth, elevation and in-plane rotation) in our database for that category." To mimic real images, random backgrounds from the SUN database are composited into the renderings. To determine the best method, three embeddings are compared, namely contrastive embedding, triplet embedding and lifted structured feature embedding, all with a GoogLeNet backbone for feature extraction.

To evaluate the learned embedding, for each 3D shape half of its 100 renderings are randomly chosen as the training set and the other half for testing; after training, given a rendered query image the top-K test images are retrieved. Recall@K is the metric, "which is computed as the percentage of testing images which have at least one correctly retrieved image among the top K retrieval results."
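Recall@K as used here can be sketched as follows; the embeddings and labels are illustrative placeholders, with each label identifying which 3D shape a rendering came from:

```python
import numpy as np

def recall_at_k(query_embs, query_labels, gallery_embs, gallery_labels, k=1):
    """Fraction of queries whose top-K nearest gallery items (Euclidean
    distance) include at least one item with the same label, i.e. a
    rendering of the same 3D shape."""
    hits = 0
    for q, ql in zip(query_embs, query_labels):
        d = np.linalg.norm(gallery_embs - q, axis=1)
        topk = np.argsort(d)[:k]
        if any(gallery_labels[i] == ql for i in topk):
            hits += 1
    return hits / len(query_embs)
```

Sweeping k trades retrieval effort against coverage; the paper's user study reports the k=20 operating point.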

Table 5 shows the comparison: lifted structured feature embedding is clearly the best and is therefore chosen as the shape-recommendation method in the annotation tool ("The goal is to provide the top K ranked 3D shapes for each 2D object, then ask annotators to select the most similar 3D shape among the K returned ones, since it is not feasible to ask an annotator to select the most similar shape among hundreds or even thousands of 3D shapes."). (PS: this passage largely repeats Sec. 3.6.) Fig. 12 shows some 3D shape retrieval test examples.

Fig. 12. Example of 3D shape retrieval. Green boxes are the selected shapes. The last row shows two examples where we cannot find a similar shape among the top 5 ones.

Finally, the authors run a user study to further verify the retrieval method on real images. They randomly pick 100 objects from each of the 42 categories (those whose 3D shapes were augmented from ShapeNetCore), ask 3 annotators to judge whether the top-20 retrieved shapes contain one similar to the object, and compute Recall@20 per category from the judgments; the results are in Fig. 11. The average Recall@20 is 69.2%, which is broadly adequate for the large-scale 3D shape annotation task.

Fig. 11. Recall@20 from our user study for 42 categories that have 3D shapes from ShapeNetCore. The number of 3D shapes for each category is shown in the brackets.

※?Conclusions

In this work, we have successfully built a large scale database with 2D images and 3D shapes for 100 object categories. We provide 3D annotations to objects in our database by aligning a closest 3D shape to a 2D object. As a result, our database can be used to benchmark different object recognition tasks including 2D object detection, 3D object pose estimation and image-based 3D shape retrieval. We have provided baseline experiments on these tasks, and demonstrated the usefulness of our database.

Anchored in the provided 2D image / 3D shape pairs, the conclusion highlights the scale and usefulness of the database.

3. Novelty

The paper is conventional overall but very clearly and fluently written, a solid, content-rich mid-tier work, as its many cleanly separated headings and subheadings suggest. The lesson: when the underlying work is substantial, no tricks are needed; plainly and thoroughly presenting the work and its contributions is enough.

4. Summary

Although conventional, the paper offers much worth borrowing in its writing and in the organization of its experiments:

1) In the writing, details are given for essentially every point that needs explanation, and the seemingly unnecessary additions and clarifications bring two benefits: they enrich the paper, and more thorough analysis is never unwelcome.

2) Admittedly, the main contribution is the new large-scale dataset ObjectNet3D. Even for a dataset paper, experience says it is best to show some novelty and innovation. None of the four task baselines is original, but each adds some new work, especially the joint 3D pose estimation and the shape retrieval, which to some extent offsets the thinness of pure dataset collection and construction.

3) The first author, Yu Xiang, is also the first author of PASCAL3D+. After completing PASCAL3D+ in 2014, he extended his own work, including an annotation tool very similar to the original PASCAL3D+ annotation interface as well as the various 2D images and 3D models. Building the more substantial ObjectNet3D on top of his completed work demonstrates the value of sustained, continuous research.
