0. Keywords
annotated videos, 3D object detection, object-centric videos, pose annotations, Objectron dataset, 3D object tracking, 3D shape representation, object-centric short videos, annotated images, robotics, image retrieval, augmented reality
1. Links
This paper comes from Google Research (it's Google :-( so a VPN is needed to access it from mainland China). True to Google's habit of building technical moats, its work either goes the brute-force route, using large-scale clusters or high-end GPUs so that others cannot easily train or verify the method, or it leans on Google's vast sources of raw data and large engineering teams to collect and annotate high-quality datasets that generalize well. Objectron, a dataset paper, clearly belongs to the latter category.
Paper: https://ieeexplore.ieee.org/abstract/document/9578264
Code and dataset: https://github.com/google-research-datasets/Objectron/
Official project page: https://google.github.io/mediapipe/solutions/objectron
For 3D object detection and 6D pose estimation, Objectron contributes a high-quality, in-the-wild public dataset. Still, because 3D objects are so complex, at the time of this paper there was no work in this area that could match what MS-COCO is to 2D object detection. Objectron contains 4 million still images, but only 14,819 videos and 17,095 object instances (COCO offers far greater instance diversity), and it covers just 9 object categories (COCO has 80), far from everything one encounters in daily life. Even so, Objectron remains one of the most challenging datasets in the field to date.
Although the paper explains the dataset's construction in great detail and generously releases the data, two important pieces are withheld: 1) the training code for the baseline models; 2) the 3D bounding-box annotation tool. Experienced researchers in the field can probably find substitutes or re-implement them, but that clearly costs time and effort. This is exactly the kind of technical moat Google builds, and most companies would probably do the same.
2士飒、主要內(nèi)容概述
※ Abstract
3D object detection has a wide range of applications (robotics, augmented reality, autonomy, and image retrieval). The Objectron dataset proposed here aims to advance 3D object detection as well as several related areas (including 3D object tracking, view synthesis, and improved 3D shape representation).
Dataset specifics: The dataset contains object-centric short videos with pose annotations for nine categories and includes 4 million annotated images in 14,819 annotated videos.
In addition, the paper proposes a new evaluation metric for 3D object detection, namely 3D Intersection over Union.
Finally, on the self-built benchmark, the authors provide baselines for two tasks: 3D object detection and novel view synthesis.
※ Introduction
Driven by machine learning algorithms and large numbers of training images, computer vision tasks have seen huge gains in accuracy, and 3D object understanding has progressed as well. However, understanding objects in 3D remains much harder than in 2D because of the lack of large real-world datasets compared to 2D tasks (e.g., ImageNet [8], COCO [22], and Open Images [20]). This paper sets out to build an object-centric video dataset: short videos that move around an object and observe it from continuously varying viewpoints.
Concretely, each short video is captured with an AR-enabled device, and the metadata includes camera poses, sparse point-clouds, and surface planes. Each object also carries a manually annotated 3D bounding box, which describes its 9-DoF state: position, orientation, and dimensions 【i.e., X, Y, Z, pitch, yaw, roll, length, width, height】. To keep the data diverse, the 14,819 short videos come from a geo-diverse sample covering ten countries across five continents 【one of Google's advantages: offices all over the world】.
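To make the 9-DoF parameterization concrete, here is a minimal sketch (not code from the paper) that turns a (translation, Euler angles, size) triple into the 8 box corners; the Euler-angle order is an illustrative assumption, and the released annotations may store the rotation differently.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def box_corners_9dof(translation, euler_pyr_rad, size):
    """8 corners (8x3) of an oriented 3D box from its 9-DoF parameters.

    translation: (x, y, z) box centre; size: (length, width, height);
    euler_pyr_rad: (pitch, yaw, roll) in radians.  The 'xyz' Euler order
    used here is a convention chosen for illustration only.
    """
    unit = np.array([[sx, sy, sz]                 # canonical unit cube
                     for sx in (-0.5, 0.5)
                     for sy in (-0.5, 0.5)
                     for sz in (-0.5, 0.5)])
    R = Rotation.from_euler('xyz', euler_pyr_rad).as_matrix()
    return (R @ (unit * np.asarray(size)).T).T + np.asarray(translation)
```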
The Objectron dataset has the following advantages:
● Videos contain multiple views of the same object, enabling many applications well beyond 3D object detection. This includes multi-view geometric understanding, view synthesis, 3D shape reconstruction, etc.
● The 3D bounding box is present in the entire video and is temporally consistent, thus enabling 3D tracking applications.
● Our dataset is collected in the wild to provide better generalization for real-world scenarios in contrast to datasets that are collected in a controlled environment [13] [4]. 【mainly the LineMOD (ACCV 2012) and YCB (IJRR 2017) datasets】
● Each instance's translation and size are stored in metric scale, thanks to accurate on-device AR tracking and provides sparse point clouds in 3D, enabling sparse depth estimation techniques. The images are calibrated and the camera parameters are provided, enabling the recovery of the object's true scale.
● Our annotations are dense and continuous, unlike some of the previous work [30] 【the 3DObject dataset (ICCV 2007) from Fei-Fei Li's group, which is rather dated】 where viewpoints have been discretized to fit into bins.
● Each object category contains hundreds of instances, collected from different locations across different countries in different lighting conditions. 【stressing the large per-category scale and the better diversity of the data distribution】
※ Previous Work
The authors mainly compare Objectron against a number of representative 3D object detection datasets used in prior work:
● Compared on dataset scale, video resolution, and realism of everyday objects: BOP challenge, T-LESS, Rutgers APC, LineMOD, IC-BIN, YCB;
● Compared on annotation richness (9-DoF vs. 6-DoF): ObjectNet3D, Pascal3D+, Pix3D, 3DObject;
● Compared against scene datasets that require more complex capture setups (RGB-D or LiDAR): ScanNet, Scan2CAD, Rio;
● Compared on real vs. synthetic data (synthetic data or photo-realistic scenes vs. real world): ShapeNet, HyperSim; 【Synthetic datasets offer valuable data for training and benchmarking, but the ability to generalize to the real-world is unknown.】
For a more detailed discussion of these comparisons, see the original paper or my related blog post 6D Object Pose Estimation Datasets.
※ Data Collection and Annotation
● Object Categories
The paper first lays out several rough criteria for choosing object categories:
1) In Objectron dataset, the aim was to select meaningful categories of common objects that form a representative set of all categories that are practically relevant and technically challenging. 【this motivates cups, chairs and bikes】
2) The object categories in the dataset contain both rigid, and non-rigid objects. 【this brings in the non-rigid objects bikes and laptops; naturally, while the videos were being recorded, the non-rigid objects remain stationary】
3) Many 3D object detection models are known to exhibit difficulties in estimating rotations of symmetric objects [21]. Symmetric objects have ambiguity in their one, two, or even three degrees of rotation. 【this motivates the strongly symmetric objects cups and bottles】
4) It has been shown that vision models pay special attention to texts in the images. Re-producing texts and labels correctly are important in generative models too. 【stressing that fine details such as text sometimes need to be recovered, which motivates books and cereal boxes】
5) Since we strive for real-time perception we included a few categories (shoes and chairs) that enable exciting applications, such as augmented reality and image retrieval. 【this motivates shoes and chairs】
In total there are 9 categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.
● Data Collection
The data are captured with handheld devices running AR sessions (Apple's ARKit and Google's ARCore); the released dataset includes the recorded videos together with the metadata produced by the AR session. 【We assume the standard pinhole camera model, and provide calibration, extrinsics and intrinsics matrix for every frame in the dataset】 All videos are recorded at 1920×1080 and 30 FPS with the phone's rear (world-facing) camera, and in addition:
1) no more than five device models were used, to keep the imaging consistent;
2) each video is kept to roughly 10 seconds, to limit the drift introduced by the AR tracking itself;
3) the people capturing the data were asked not to move quickly, to avoid motion-blurred frames.
The object, of course, has to stay still for the whole capture. It is precisely this phone-based recording scheme that let the researchers quickly launch collection campaigns in different parts of the world; the figure below shows the geographic distribution of the data sources (ten countries across five continents):
● Data Annotation
Efficient and accurate data annotation is the key to building large-scale datasets. 【easier said than done】
Annotating 3D bounding boxes for each image is time-consuming and expensive. 【building a high-quality dataset is bound to be hard】
Instead, we annotate 3D objects in a video clip and populate them to all frames in the clip, scaling up the annotation process, and reducing the per image annotation cost. 【That is, for each short video only the first key frame needs to be annotated; the annotations for every subsequent frame can then be generated automatically from the camera parameters, which greatly reduces the annotation burden.】 The annotation tool's user interface looks like this:
The annotation procedure is described in the original paper as follows:
Next, we show the 3D world map to the annotator side-by-side with the images from the video sequence (Figure 4a). The annotator draws a 3D bounding box in the 3D world map, and our tool projects the 3D bounding box over all the frames given pre-computed camera poses from the AR sessions (such as ARKit or ARCore). The annotator looks at the projected bounding box and makes necessary adjustments (position, orientation, and the scale of a 3D bounding box) so the projected bounding box looks consistent across different frames. At the end of the process, the user saves the 3D bounding box and the annotation. The benefits of our approach are 1) by annotating a video once, we get annotated images for all frames in the video sequence; 2) by using AR, we can get accurate metric sizes for bounding boxes.
【From this description we can see that although 2D frames end up annotated, what is actually being annotated is the 3D point cloud; the 2D images only help the annotator judge the various attributes of the object being labeled. The annotator can only manipulate the 3D box in the 3D world map on the right, while everything on the left is obtained by projection and computation. 3D annotation tools that operate on point clouds like this are common, so even though the paper does not open-source its tool, a substitute is easy to find, e.g. CVAT, the annotation tool open-sourced by openvinotoolkit.】
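The propagation step boils down to re-projecting the single annotated, world-frame box into every frame with that frame's camera parameters. Below is a minimal pinhole-projection sketch of this idea; the per-frame intrinsics `K` and world-to-camera pose `(R, t)` are assumed inputs, and the exact matrix conventions in the released metadata may differ.

```python
import numpy as np

def project_box_to_frame(corners_world, K, R_wc, t_wc):
    """Project 3D box corners (8x3, world coordinates) into one video frame.

    K:    3x3 camera intrinsics for this frame.
    R_wc, t_wc: world-to-camera rotation (3x3) and translation (3,), i.e.
        X_cam = R_wc @ X_world + t_wc.  Convention assumed for illustration.
    Returns the 8 corners as pixel coordinates (8x2).
    """
    cam = (R_wc @ corners_world.T).T + t_wc      # world -> camera frame
    uvw = (K @ cam.T).T                          # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3]              # perspective divide

# The box is stored once in world coordinates, so repeating this projection
# with each frame's (K, R_wc, t_wc) yields temporally consistent 2D boxes
# across the whole clip.
```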
● Annotation Variance
A rigorous, high-quality annotation process calls for quantitative analysis. The authors identify two factors that affect annotation accuracy 【The accuracy of our annotation hinges on two factors】:
1) the amount of drift in the estimated camera pose throughout the captured video, and
2) the accuracy of the raters annotating the 3D bounding box.
【Factor 1 is the error in the estimated camera poses; factor 2 is the inconsistency between labels produced by different annotators.】
We compared the relative positional drift in our camera pose against an offline refined camera pose (obtained by an offline bundle adjustment) 【For factor 1, the authors compare the camera poses obtained on-device against poses refined offline with bundle adjustment; the refinement used here is SfM, described in detail in the last paragraph of Section 4.】【In addition, to keep this drift small, the videos are limited to roughly 10 seconds, as shown in the figure below.】
To evaluate the accuracy of the rater, we asked eight annotators to re-annotate same sequences. 【For factor 2, the authors rely on redundant annotation: the same sequences are re-annotated by multiple people and the results averaged to reduce the human labeling error, with eight annotators per sequence here. Google clearly has the budget for this.】
Overall for the chairs, the standard deviation for the chair orientation, translation, and scale was 4.6°, 1cm, and 4cm, respectively which demonstrates insignificant variance of the annotation results between different raters. 【The figure below shows example chair annotations. By the paper's statistics, the variance of the final annotations is very small; do these numbers also hint at an upper bound on what a prediction algorithm can achieve?】
※ Objectron Dataset
In this section, we describe the details of our Objectron dataset and provide some statistics. 【This section is a statistical breakdown of the properties of the Objectron dataset.】 There are 9 categories in total: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes, of which bikes and laptops are non-rigid. In total there are 17,095 object instances that appear in 4M annotated images from 14,819 annotated videos (not counting the unreleased evaluation set for future competitions). The train/test split is as follows:
In addition, the authors visualize the per-category distribution of camera azimuth and elevation, shown in the figure below.
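As a rough illustration of how such viewpoint angles can be computed from an annotated sample (the paper does not spell out its convention, so the z-up object-frame axis choice below is an assumption):

```python
import numpy as np

def viewpoint_angles(camera_center_world, box_translation, box_rotation):
    """Azimuth/elevation of the camera as seen from the object, in degrees.

    camera_center_world: (3,) camera position in world coordinates.
    box_translation: (3,) box centre; box_rotation: 3x3 object orientation.
    Assumes a z-up object frame purely for illustration; the paper's own
    plotting convention may differ.
    """
    d = np.asarray(camera_center_world) - np.asarray(box_translation)
    d = np.asarray(box_rotation).T @ d        # view direction in the object frame
    d /= np.linalg.norm(d)
    azimuth = np.degrees(np.arctan2(d[1], d[0]))
    elevation = np.degrees(np.arcsin(np.clip(d[2], -1.0, 1.0)))
    return azimuth, elevation
```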
※ Baseline Experiments and Evaluations
The authors release the training and test splits along with the evaluation code, but not the training code, nor pretrained models for some categories. The evaluation metrics are: 3D IoU, 2D projection error, view-point error, polar and azimuth error, and rotation error. Apart from 3D IoU, these all follow standard definitions and need no further discussion.
● 3D Intersection Over Union
The authors state that the 3D IoU defined here is new. They point out that the ways earlier works computed the overlap of 3D boxes were oversimplified, e.g. axis-aligning the boxes before computing the overlap, or projecting the 3D boxes onto some plane and computing the overlap of the 2D projected polygons. In general scenes these assumptions do not hold. Although this approach works for vehicles on the road, it has two limitations:
1) The object should sit on the same ground plane, which limits the degrees of freedom of the box from 9 to 7. The box only has freedom in yaw, and the roll and pitch are set to 0.
2) It assumes the boxes have the same height. For the Objectron datasets, these assumptions do not hold.
The authors therefore propose a general method (computing accurate 3D IoU values for general 3D-oriented boxes). The paper does not describe the procedure in great detail, instead recommending the key classical algorithm it builds on, the Sutherland-Hodgman polygon clipping algorithm, one of the best-known polygon-clipping algorithms in computer graphics; the figure below illustrates how it works. For the exact computation, refer to the released evaluation code.
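For reference, here is a minimal 2D Sutherland-Hodgman sketch. It only illustrates the clipping primitive; the released evaluation code applies the same idea in 3D, clipping the faces of one oriented box against the other box and computing the intersection volume from the clipped points.

```python
def clip_polygon(subject, clip):
    """Sutherland-Hodgman: clip polygon `subject` against convex polygon `clip`.

    Both polygons are lists of (x, y) vertices in counter-clockwise order;
    the returned list is the (possibly empty) intersection polygon.
    """
    def inside(p, a, b):
        # p lies on the inner (left) side of the directed clip edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # intersection of segment p-q with the infinite line through a-b
        (x1, y1), (x2, y2) = p, q
        (x3, y3), (x4, y4) = a, b
        den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / den
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        inputs, output = output, []
        if not inputs:
            break                         # polygons do not overlap at all
        s = inputs[-1]
        for e in inputs:
            if inside(e, a, b):
                if not inside(s, a, b):
                    output.append(intersect(s, e, a, b))
                output.append(e)
            elif inside(s, a, b):
                output.append(intersect(s, e, a, b))
            s = e
    return output

# Example: two overlapping unit squares -> the overlap square [0.5, 1] x [0.5, 1]
print(clip_polygon([(0, 0), (1, 0), (1, 1), (0, 1)],
                   [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]))
```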
In addition, for rotationally symmetric objects such as cups and bottles, whose appearance is invariant under rotation about an axis, the 3D IoU has to be redefined. During evaluation the authors therefore rotate the predicted 3D box uniformly about the symmetry axis and take the best-matching orientation (the one with the highest 3D IoU) as the prediction; the figure below shows an example.
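A sketch of that rule, under assumptions made only for illustration: boxes are dicts carrying a 3x3 rotation matrix under key "R" (other fields left untouched), `axis` is the symmetry axis in the box's own frame, and `iou_fn` is a placeholder for a general oriented-box IoU routine such as the one in the released evaluation code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def symmetric_3d_iou(pred_box, gt_box, iou_fn, axis=(0.0, 1.0, 0.0), n_steps=100):
    """3D IoU for rotation-symmetric categories such as cups and bottles.

    Rotates the predicted box about its symmetry axis in `n_steps` uniform
    increments and keeps the best IoU, mirroring the evaluation rule the
    paper describes.  Box representation and axis choice are assumptions
    made only for this sketch.
    """
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    best = 0.0
    for theta in np.linspace(0.0, 2.0 * np.pi, n_steps, endpoint=False):
        R_sym = Rotation.from_rotvec(theta * axis).as_matrix()
        candidate = dict(pred_box, R=pred_box["R"] @ R_sym)  # spin about its own axis
        best = max(best, iou_fn(candidate, gt_box))
    return best
```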
● Baselines for 3D object detection
The authors first provide baselines for the 3D object detection task. For comparison, two detectors are used: 1) MobilePose, a light-weight single-stage detection network previously proposed by Google, posted on arXiv but never published at a conference; 2) an SSD + EfficientNet-Lite two-stage detection architecture, introduced for the first time in this paper. The figure below illustrates the two network architectures.
Note that both networks regress the 2D projections of the 3D bounding-box vertices, so the 2D predictions still have to be lifted back to 3D, for which the authors use an EPnP algorithm. 【We use a similar EPnP algorithm as in [16] to lift the 2D predicted keypoints to 3D.】【Note that although OpenCV also ships an EPnP implementation, its interface does not directly fit this setting. Classical PnP algorithms take 2D points and their corresponding 3D points as input and output the estimated camera parameters, whereas here the 3D information is recovered from the known mutual geometry of the eight 2D vertices of the 3D bounding box, with the camera parameters held fixed. See the released code for the exact computation.】
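For contrast, here is a minimal sketch of the conventional OpenCV EPnP call described above, which assumes the box's metric dimensions are already known, i.e. the opposite setting from Lift2DTo3D; the corner ordering of the 2D and 3D points must of course correspond.

```python
import numpy as np
import cv2

def pose_from_known_size(keypoints_2d, box_size, K):
    """Recover the box pose with standard EPnP, given a known metric size.

    keypoints_2d: 8x2 predicted corner projections in pixels.
    box_size: (length, width, height) in metres; K: 3x3 camera intrinsics.
    This only illustrates the usual PnP interface, not the paper's
    Lift2DTo3D routine.
    """
    unit = np.array([[sx, sy, sz]
                     for sx in (-0.5, 0.5)
                     for sy in (-0.5, 0.5)
                     for sz in (-0.5, 0.5)], dtype=np.float64)
    object_pts = unit * np.asarray(box_size, dtype=np.float64)  # canonical box
    ok, rvec, tvec = cv2.solvePnP(object_pts,
                                  np.asarray(keypoints_2d, dtype=np.float64),
                                  np.asarray(K, dtype=np.float64), None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)          # box orientation in the camera frame
    return R, tvec.reshape(3)           # the scale was supplied as an input
```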
Some quantitative detection results are shown in the figure below; see the original paper for the interpretation and analysis of these numbers.
● Baselines for Neural Radiance Field
On NeRF: It can learn the scene and object representation with fine details. The NeRF model learns the color and density value of each voxel in the scene and can generate novel views. The authors use it for two tasks: We used NeRF for two baselines: 1) Computing segmentation mask and 2) Novel view synthesis. Example results are shown below; see reference [24] for the details of the procedure.
※ Conclusion
This paper introduces the Objectron dataset: a large scale object-centric dataset of 14,819 short videos in the wild with object pose annotation. We developed an efficient and scalable data collection and annotation framework based on on-device AR libraries. By releasing this dataset, we hope to enable the research community to push the limits of 3D object geometry understanding and foster new research and applications in 3D understanding, video models, object retrieval, view synthetics, and 3D reconstruction. 【Stated plainly, the paper's biggest contribution is the AR-based capture and 3D annotation framework used to build the dataset; the approach is well worth borrowing.】
3. Details Worth Special Attention
While digging into the dataset and reproducing the evaluation code, the following easily overlooked details came up:
● Lifting the 2D projected points back to 3D with EPnP: the official repository https://github.com/google-research-datasets/Objectron/ does not cover this step, and the recommended original C++ code is inconvenient to use. After some searching, Python interface code turned up on the other official page https://google.github.io/mediapipe/solutions/objectron, which points to the open-source mediapipe package https://github.com/google/mediapipe. The code and files there related to EPnP (Lift2DTo3D) are:
https://github.com/google/mediapipe/blob/master/mediapipe/python/solutions/objectron.py
https://github.com/google/mediapipe/blob/master/mediapipe/python/solution_base.py
https://github.com/google/mediapipe/blob/master/mediapipe/modules/objectron/calculators/decoder.cc#L201
https://github.com/google/mediapipe/blob/master/mediapipe/modules/objectron/calculators/epnp.cc
Because the Lift2DTo3D function above is wrapped so heavily that it is hard to understand and to reuse, further searching showed that CenterPose (arXiv 2021) uses the Objectron dataset as a benchmark, and reading its code directly is more convenient. Its implementation of Lift2DTo3D is here:
https://github.com/NVlabs/CenterPose/blob/4355198a492b72e785a02ee911a9db8d8b63c0ab/src/tools/objectron_eval/eval_image_official.py#L805
● Using the new 3D IoU metric defined in this paper: the official repository already walks through the usage:
https://github.com/google-research-datasets/Objectron/blob/master/notebooks/3D_IOU.ipynb
However, it does not cover the 3D IoU computation for symmetric objects; for that, see the solutions some users propose in the repository issues, or, as in the previous point, refer to the code in CenterPose.
● Using or reproducing the baselines in the paper: the two 3D object detection baselines are only touched on briefly and their training code is not released. The inference code can probably be tracked down in https://github.com/google/mediapipe, but that route is still overly complicated. A better option is to swap the backbone for something more common such as YOLOv5, or to switch to CenterPose, whose open-source code happens to train and evaluate on the Objectron dataset.
4. Novelty
See the advantages of the Objectron dataset enumerated in the last paragraph of the Introduction section.
5. Summary
Dataset papers like this are not common at CVPR, but reading one closely for the first time, I was won over by its writing structure and expository logic. Compared with other, far longer dataset papers, such as MS-COCO at ECCV or Pascal VOC in IJCV, Objectron is clearly short and compact. Given its dataset scale and the current stage of the field, such a comparison may not be entirely fair, but a few lessons still stand out:
● Can 3D information be annotated directly on 2D images? Without a 3D point cloud, annotating a large number of independent 2D images cannot yield all 9 degrees of freedom, but could 6, or even 5, degrees of freedom still be obtained?
● How do we establish that the labels are both well-founded and consistent? Well-founded means grounded in a principle or formula; consistent means different human annotators must agree with one another. The Annotation Variance section of this paper provides a good template for demonstrating both.
● How should this kind of paper handle the novelty concerns it is likely to face at top venues? Although the Introduction closes by enumerating as many as six advantages of the Objectron dataset, in my view the design of the baselines in the experimental section matters just as much. Testing only MobilePose would have left the paper feeling thin (heavy on data, light on methods). Adding the two-stage detection architecture and the NeRF section balances the experiments better, an approach worth borrowing.