0. Keywords
annotated videos, 3D object detection, object-centric videos, pose annotations, Objectron dataset, 3D object tracking, 3D shape representation, object-centric short videos, annotated images, robotics, image retrieval, augmented reality
1. Links
This paper comes from Google Research (it's Google :-( so a VPN is needed to access it from mainland China). True to Google's habit of building technical moats, its work either goes the brute-force route, using large-scale clusters or high-end GPUs so that others cannot easily train or verify the method, or it leans on Google's vast sources of raw data and large engineering teams to collect and annotate high-quality datasets that generalize well. Objectron, a dataset paper, clearly belongs to the latter category.
Paper: https://ieeexplore.ieee.org/abstract/document/9578264
Code and dataset: https://github.com/google-research-datasets/Objectron/
Official project page: https://google.github.io/mediapipe/solutions/objectron
For 3D object detection and 6D pose estimation, Objectron contributes a high-quality, in-the-wild public dataset. Still, because 3D objects are so complex, at the time of this paper there was no work in this area that could match what MS-COCO is to 2D object detection. Objectron contains 4 million still images, but only 14,819 videos and 17,095 object instances (COCO offers far greater instance diversity), and it covers just 9 object categories (COCO has 80), far from everything one encounters in daily life. Even so, Objectron remains one of the most challenging datasets in the field to date.
Although the paper explains the dataset's construction in great detail and generously releases the data, two important pieces are withheld: 1) the training code for the baseline models; 2) the 3D bounding-box annotation tool. Experienced researchers in the field can probably find substitutes or re-implement them, but that clearly costs time and effort. This is exactly the kind of technical moat Google builds, and most companies would probably do the same.
2士飒、主要內(nèi)容概述
※ Abstract
3D object detection has a wide range of applications (robotics, augmented reality, autonomy, and image retrieval). The Objectron dataset proposed here aims to advance 3D object detection as well as several related areas (including 3D object tracking, view synthesis, and improved 3D shape representation).
Dataset specifics: The dataset contains object-centric short videos with pose annotations for nine categories and includes 4 million annotated images in 14,819 annotated videos.
In addition, the paper proposes a new evaluation metric for 3D object detection, namely 3D Intersection over Union.
Finally, on the self-built benchmark, the authors provide baselines for two tasks: 3D object detection and novel view synthesis.
※ Introduction
Driven by machine learning algorithms and large numbers of training images, computer vision tasks have seen huge gains in accuracy, and 3D object understanding has progressed as well. However, understanding objects in 3D remains much harder than in 2D because of the lack of large real-world datasets compared to 2D tasks (e.g., ImageNet [8], COCO [22], and Open Images [20]). This paper sets out to build an object-centric video dataset: short videos that move around an object and observe it from continuously varying viewpoints.
Concretely, each short video is captured with an AR-enabled device, and the metadata includes camera poses, sparse point-clouds, and surface planes. Each object also carries a manually annotated 3D bounding box, which describes its 9-DoF state: position, orientation, and dimensions 【i.e., X, Y, Z, pitch, yaw, roll, length, width, height】. To keep the data diverse, the 14,819 short videos come from a geo-diverse sample covering ten countries across five continents 【one of Google's advantages: offices all over the world】.
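To make the 9-DoF parameterization concrete, here is a minimal sketch (not code from the paper) that turns a (translation, Euler angles, size) triple into the 8 box corners; the Euler-angle order is an illustrative assumption, and the released annotations may store the rotation differently.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def box_corners_9dof(translation, euler_pyr_rad, size):
    """8 corners (8x3) of an oriented 3D box from its 9-DoF parameters.

    translation: (x, y, z) box centre; size: (length, width, height);
    euler_pyr_rad: (pitch, yaw, roll) in radians.  The 'xyz' Euler order
    used here is a convention chosen for illustration only.
    """
    unit = np.array([[sx, sy, sz]                 # canonical unit cube
                     for sx in (-0.5, 0.5)
                     for sy in (-0.5, 0.5)
                     for sz in (-0.5, 0.5)])
    R = Rotation.from_euler('xyz', euler_pyr_rad).as_matrix()
    return (R @ (unit * np.asarray(size)).T).T + np.asarray(translation)
```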
The Objectron dataset has the following advantages:
● Videos contain multiple views of the same object, enabling many applications well beyond 3D object detection. This includes multi-view geometric understanding, view synthesis, 3D shape reconstruction, etc.
● The 3D bounding box is present in the entire video and is temporally consistent, thus enabling 3D tracking applications.
● Our dataset is collected in the wild to provide better generalization for real-world scenarios in contrast to datasets that are collected in a controlled environment [13] [4]. 【mainly the LineMOD (ACCV 2012) and YCB (IJRR 2017) datasets】
● Each instance's translation and size are stored in metric scale, thanks to accurate on-device AR tracking and provides sparse point clouds in 3D, enabling sparse depth estimation techniques. The images are calibrated and the camera parameters are provided, enabling the recovery of the object's true scale.
● Our annotations are dense and continuous, unlike some of the previous work [30] 【the 3DObject dataset (ICCV 2007) from Fei-Fei Li's group, which is rather dated】 where viewpoints have been discretized to fit into bins.
● Each object category contains hundreds of instances, collected from different locations across different countries in different lighting conditions. 【stressing the large per-category scale and the better diversity of the data distribution】
※ Previous Work
The authors mainly compare Objectron against a number of representative 3D object detection datasets used in prior work:
● Compared on dataset scale, video resolution, and realism of everyday objects: BOP challenge, T-LESS, Rutgers APC, LineMOD, IC-BIN, YCB;
● Compared on annotation richness (9-DoF vs. 6-DoF): ObjectNet3D, Pascal3D+, Pix3D, 3DObject;
● Compared against scene datasets that require more complex capture setups (RGB-D or LiDAR): ScanNet, Scan2CAD, Rio;
● Compared on real vs. synthetic data (synthetic data or photo-realistic scenes vs. real world): ShapeNet, HyperSim; 【Synthetic datasets offer valuable data for training and benchmarking, but the ability to generalize to the real-world is unknown.】
For a more detailed discussion of these comparisons, see the original paper or my related blog post 6D Object Pose Estimation Datasets.
※ Data Collection and Annotation
● Object Categories
The paper first lays out several rough criteria for choosing object categories:
1) In Objectron dataset, the aim was to select meaningful categories of common objects that form a representative set of all categories that are practically relevant and technically challenging. 【this motivates cups, chairs and bikes】
2) The object categories in the dataset contain both rigid, and non-rigid objects. 【this brings in the non-rigid objects bikes and laptops; naturally, while the videos were being recorded, the non-rigid objects remain stationary】
3) Many 3D object detection models are known to exhibit difficulties in estimating rotations of symmetric objects [21]. Symmetric objects have ambiguity in their one, two, or even three degrees of rotation. 【this motivates the strongly symmetric objects cups and bottles】
4) It has been shown that vision models pay special attention to texts in the images. Re-producing texts and labels correctly are important in generative models too. 【stressing that fine details such as text sometimes need to be recovered, which motivates books and cereal boxes】
5) Since we strive for real-time perception we included a few categories (shoes and chairs) that enable exciting applications, such as augmented reality and image retrieval. 【this motivates shoes and chairs】
In total there are 9 categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.
● Data Collection
The data are captured with handheld devices running AR sessions (Apple's ARKit and Google's ARCore); the released dataset includes the recorded videos together with the metadata produced by the AR session. 【We assume the standard pinhole camera model, and provide calibration, extrinsics and intrinsics matrix for every frame in the dataset】 All videos are recorded at 1920×1080 and 30 FPS with the phone's rear (world-facing) camera, and in addition:
1) no more than five device models were used, to keep the imaging consistent;
2) each video is kept to roughly 10 seconds, to limit the drift introduced by the AR tracking itself;
3) the people capturing the data were asked not to move quickly, to avoid motion-blurred frames.
The object, of course, has to stay still for the whole capture. It is precisely this phone-based recording scheme that let the researchers quickly launch collection campaigns in different parts of the world; the figure below shows the geographic distribution of the data sources (ten countries across five continents):
● Data Annotation
Efficient and accurate data annotation is the key to building large-scale datasets. 【easier said than done】
Annotating 3D bounding boxes for each image is time-consuming and expensive. 【building a high-quality dataset is bound to be hard】
Instead, we annotate 3D objects in a video clip and populate them to all frames in the clip, scaling up the annotation process, and reducing the per image annotation cost. 【That is, for each short video only the first key frame needs to be annotated; the annotations for every subsequent frame can then be generated automatically from the camera parameters, which greatly reduces the annotation burden.】 The annotation tool's user interface looks like this:
The annotation procedure is described in the original paper as follows:
Next, we show the 3D world map to the annotator side-by-side with the images from the video sequence (Figure 4a). The annotator draws a 3D bounding box in the 3D world map, and our tool projects the 3D bounding box over all the frames given pre-computed camera poses from the AR sessions (such as ARKit or ARCore). The annotator looks at the projected bounding box and makes necessary adjustments (position, orientation, and the scale of a 3D bounding box) so the projected bounding box looks consistent across different frames. At the end of the process, the user saves the 3D bounding box and the annotation. The benefits of our approach are 1) by annotating a video once, we get annotated images for all frames in the video sequence; 2) by using AR, we can get accurate metric sizes for bounding boxes.
【From this description we can see that although 2D frames end up annotated, what is actually being annotated is the 3D point cloud; the 2D images only help the annotator judge the various attributes of the object being labeled. The annotator can only manipulate the 3D box in the 3D world map on the right, while everything on the left is obtained by projection and computation. 3D annotation tools that operate on point clouds like this are common, so even though the paper does not open-source its tool, a substitute is easy to find, e.g. CVAT, the annotation tool open-sourced by openvinotoolkit.】
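The propagation step boils down to re-projecting the single annotated, world-frame box into every frame with that frame's camera parameters. Below is a minimal pinhole-projection sketch of this idea; the per-frame intrinsics `K` and world-to-camera pose `(R, t)` are assumed inputs, and the exact matrix conventions in the released metadata may differ.

```python
import numpy as np

def project_box_to_frame(corners_world, K, R_wc, t_wc):
    """Project 3D box corners (8x3, world coordinates) into one video frame.

    K:    3x3 camera intrinsics for this frame.
    R_wc, t_wc: world-to-camera rotation (3x3) and translation (3,), i.e.
        X_cam = R_wc @ X_world + t_wc.  Convention assumed for illustration.
    Returns the 8 corners as pixel coordinates (8x2).
    """
    cam = (R_wc @ corners_world.T).T + t_wc      # world -> camera frame
    uvw = (K @ cam.T).T                          # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3]              # perspective divide

# The box is stored once in world coordinates, so repeating this projection
# with each frame's (K, R_wc, t_wc) yields temporally consistent 2D boxes
# across the whole clip.
```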
● Annotation Variance
A rigorous, high-quality annotation process calls for quantitative analysis. The authors identify two factors that affect annotation accuracy 【The accuracy of our annotation hinges on two factors】:
1) the amount of drift in the estimated camera pose throughout the captured video, and
2) the accuracy of the raters annotating the 3D bounding box.
【Factor 1 is the error in the estimated camera poses; factor 2 is the inconsistency between labels produced by different annotators.】
We compared the relative positional drift in our camera pose against an offline refined camera pose (obtained by an offline bundle adjustment) 【For factor 1, the authors compare the camera poses obtained on-device against poses refined offline with bundle adjustment; the refinement used here is SfM, described in detail in the last paragraph of Section 4.】【In addition, to keep this drift small, the videos are limited to roughly 10 seconds, as shown in the figure below.】
To evaluate the accuracy of the rater, we asked eight annotators to re-annotate same sequences. 【For factor 2, the authors rely on redundant annotation: the same sequences are re-annotated by multiple people and the results averaged to reduce the human labeling error, with eight annotators per sequence here. Google clearly has the budget for this.】
Overall for the chairs, the standard deviation for the chair orientation, translation, and scale was 4.6°, 1cm, and 4cm, respectively which demonstrates insignificant variance of the annotation results between different raters. 【The figure below shows example chair annotations. By the paper's statistics, the variance of the final annotations is very small; do these numbers also hint at an upper bound on what a prediction algorithm can achieve?】
※ Objectron Dataset
In this section, we describe the details of our Objectron dataset and provide some statistics. 【This section is a statistical breakdown of the properties of the Objectron dataset.】 There are 9 categories in total: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes, of which bikes and laptops are non-rigid. In total there are 17,095 object instances that appear in 4M annotated images from 14,819 annotated videos (not counting the unreleased evaluation set for future competitions). The train/test split is as follows:
In addition, the authors visualize the per-category distribution of camera azimuth and elevation, shown in the figure below.
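As a rough illustration of how such viewpoint angles can be computed from an annotated sample (the paper does not spell out its convention, so the z-up object-frame axis choice below is an assumption):

```python
import numpy as np

def viewpoint_angles(camera_center_world, box_translation, box_rotation):
    """Azimuth/elevation of the camera as seen from the object, in degrees.

    camera_center_world: (3,) camera position in world coordinates.
    box_translation: (3,) box centre; box_rotation: 3x3 object orientation.
    Assumes a z-up object frame purely for illustration; the paper's own
    plotting convention may differ.
    """
    d = np.asarray(camera_center_world) - np.asarray(box_translation)
    d = np.asarray(box_rotation).T @ d        # view direction in the object frame
    d /= np.linalg.norm(d)
    azimuth = np.degrees(np.arctan2(d[1], d[0]))
    elevation = np.degrees(np.arcsin(np.clip(d[2], -1.0, 1.0)))
    return azimuth, elevation
```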
※ Baseline Experiments and Evaluations
The authors release the training and test splits along with the evaluation code, but not the training code, nor pretrained models for some categories. The evaluation metrics are: 3D IoU, 2D projection error, view-point error, polar and azimuth error, and rotation error. Apart from 3D IoU, these all follow standard definitions and need no further discussion.
● 3D Intersection Over Union
The authors state that the 3D IoU defined here is new. They point out that the ways earlier works computed the overlap of 3D boxes were oversimplified, e.g. axis-aligning the boxes before computing the overlap, or projecting the 3D boxes onto some plane and computing the overlap of the 2D projected polygons. In general scenes these assumptions do not hold. Although this approach works for vehicles on the road, it has two limitations:
1) The object should sit on the same ground plane, which limits the degrees of freedom of the box from 9 to 7. The box only has freedom in yaw, and the roll and pitch are set to 0.
2) It assumes the boxes have the same height. For the Objectron datasets, these assumptions do not hold.
The authors therefore propose a general method (computing accurate 3D IoU values for general 3D-oriented boxes). The paper does not describe the procedure in great detail, instead recommending the key classical algorithm it builds on, the Sutherland-Hodgman polygon clipping algorithm, one of the best-known polygon-clipping algorithms in computer graphics; the figure below illustrates how it works. For the exact computation, refer to the released evaluation code.
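For reference, here is a minimal 2D Sutherland-Hodgman sketch. It only illustrates the clipping primitive; the released evaluation code applies the same idea in 3D, clipping the faces of one oriented box against the other box and computing the intersection volume from the clipped points.

```python
def clip_polygon(subject, clip):
    """Sutherland-Hodgman: clip polygon `subject` against convex polygon `clip`.

    Both polygons are lists of (x, y) vertices in counter-clockwise order;
    the returned list is the (possibly empty) intersection polygon.
    """
    def inside(p, a, b):
        # p lies on the inner (left) side of the directed clip edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # intersection of segment p-q with the infinite line through a-b
        (x1, y1), (x2, y2) = p, q
        (x3, y3), (x4, y4) = a, b
        den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / den
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        inputs, output = output, []
        if not inputs:
            break                         # polygons do not overlap at all
        s = inputs[-1]
        for e in inputs:
            if inside(e, a, b):
                if not inside(s, a, b):
                    output.append(intersect(s, e, a, b))
                output.append(e)
            elif inside(s, a, b):
                output.append(intersect(s, e, a, b))
            s = e
    return output

# Example: two overlapping unit squares -> the overlap square [0.5, 1] x [0.5, 1]
print(clip_polygon([(0, 0), (1, 0), (1, 1), (0, 1)],
                   [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]))
```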
In addition, for rotationally symmetric objects such as cups and bottles, whose appearance is invariant under rotation about an axis, the 3D IoU has to be redefined. During evaluation the authors therefore rotate the predicted 3D box uniformly about the symmetry axis and take the best-matching orientation (the one with the highest 3D IoU) as the prediction; the figure below shows an example.
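A sketch of that rule, under assumptions made only for illustration: boxes are dicts carrying a 3x3 rotation matrix under key "R" (other fields left untouched), `axis` is the symmetry axis in the box's own frame, and `iou_fn` is a placeholder for a general oriented-box IoU routine such as the one in the released evaluation code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def symmetric_3d_iou(pred_box, gt_box, iou_fn, axis=(0.0, 1.0, 0.0), n_steps=100):
    """3D IoU for rotation-symmetric categories such as cups and bottles.

    Rotates the predicted box about its symmetry axis in `n_steps` uniform
    increments and keeps the best IoU, mirroring the evaluation rule the
    paper describes.  Box representation and axis choice are assumptions
    made only for this sketch.
    """
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    best = 0.0
    for theta in np.linspace(0.0, 2.0 * np.pi, n_steps, endpoint=False):
        R_sym = Rotation.from_rotvec(theta * axis).as_matrix()
        candidate = dict(pred_box, R=pred_box["R"] @ R_sym)  # spin about its own axis
        best = max(best, iou_fn(candidate, gt_box))
    return best
```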
● Baselines for 3D object detection
The authors first provide baselines for the 3D object detection task. For comparison, two detectors are used: 1) MobilePose, a light-weight single-stage detection network previously proposed by Google, posted on arXiv but never published at a conference; 2) an SSD + EfficientNet-Lite two-stage detection architecture, introduced for the first time in this paper. The figure below illustrates the two network architectures.
Note that both networks regress the 2D projections of the 3D bounding-box vertices, so the 2D predictions still have to be lifted back to 3D, for which the authors use an EPnP algorithm. 【We use a similar EPnP algorithm as in [16] to lift the 2D predicted keypoints to 3D.】【Note that although OpenCV also ships an EPnP implementation, its interface does not directly fit this setting. Classical PnP algorithms take 2D points and their corresponding 3D points as input and output the estimated camera parameters, whereas here the 3D information is recovered from the known mutual geometry of the eight 2D vertices of the 3D bounding box, with the camera parameters held fixed. See the released code for the exact computation.】
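For contrast, here is a minimal sketch of the conventional OpenCV EPnP call described above, which assumes the box's metric dimensions are already known, i.e. the opposite setting from Lift2DTo3D; the corner ordering of the 2D and 3D points must of course correspond.

```python
import numpy as np
import cv2

def pose_from_known_size(keypoints_2d, box_size, K):
    """Recover the box pose with standard EPnP, given a known metric size.

    keypoints_2d: 8x2 predicted corner projections in pixels.
    box_size: (length, width, height) in metres; K: 3x3 camera intrinsics.
    This only illustrates the usual PnP interface, not the paper's
    Lift2DTo3D routine.
    """
    unit = np.array([[sx, sy, sz]
                     for sx in (-0.5, 0.5)
                     for sy in (-0.5, 0.5)
                     for sz in (-0.5, 0.5)], dtype=np.float64)
    object_pts = unit * np.asarray(box_size, dtype=np.float64)  # canonical box
    ok, rvec, tvec = cv2.solvePnP(object_pts,
                                  np.asarray(keypoints_2d, dtype=np.float64),
                                  np.asarray(K, dtype=np.float64), None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)          # box orientation in the camera frame
    return R, tvec.reshape(3)           # the scale was supplied as an input
```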
Some quantitative detection results are shown in the figure below; see the original paper for the interpretation and analysis of these numbers.
● Baselines for Neural Radiance Field
On NeRF: It can learn the scene and object representation with fine details. The NeRF model learns the color and density value of each voxel in the scene and can generate novel views. The authors use it for two tasks: We used NeRF for two baselines: 1) Computing segmentation mask and 2) Novel view synthesis. Example results are shown below; see reference [24] for the details of the procedure.
※ Conclusion
This paper introduces the Objectron dataset: a large scale object-centric dataset of 14,819 short videos in the wild with object pose annotation. We developed an efficient and scalable data collection and annotation framework based on on-device AR libraries. By releasing this dataset, we hope to enable the research community to push the limits of 3D object geometry understanding and foster new research and applications in 3D understanding, video models, object retrieval, view synthetics, and 3D reconstruction. 【Stated plainly, the paper's biggest contribution is the AR-based capture and 3D annotation framework used to build the dataset; the approach is well worth borrowing.】
3. Details Worth Special Attention
While digging into the dataset and reproducing the evaluation code, the following easily overlooked details came up:
● Lifting the 2D projected points back to 3D with EPnP: the official repository https://github.com/google-research-datasets/Objectron/ does not cover this step, and the recommended original C++ code is inconvenient to use. After some searching, Python interface code turned up on the other official page https://google.github.io/mediapipe/solutions/objectron, which points to the open-source mediapipe package https://github.com/google/mediapipe. The code and files there related to EPnP (Lift2DTo3D) are:
https://github.com/google/mediapipe/blob/master/mediapipe/python/solutions/objectron.py
https://github.com/google/mediapipe/blob/master/mediapipe/python/solution_base.py
https://github.com/google/mediapipe/blob/master/mediapipe/modules/objectron/calculators/decoder.cc#L201
https://github.com/google/mediapipe/blob/master/mediapipe/modules/objectron/calculators/epnp.cc
Because the Lift2DTo3D function above is wrapped so heavily that it is hard to understand and to reuse, further searching showed that CenterPose (arXiv 2021) uses the Objectron dataset as a benchmark, and reading its code directly is more convenient. Its implementation of Lift2DTo3D is here:
https://github.com/NVlabs/CenterPose/blob/4355198a492b72e785a02ee911a9db8d8b63c0ab/src/tools/objectron_eval/eval_image_official.py#L805
● Using the new 3D IoU metric defined in this paper: the official repository already walks through the usage:
https://github.com/google-research-datasets/Objectron/blob/master/notebooks/3D_IOU.ipynb
However, it does not cover the 3D IoU computation for symmetric objects; for that, see the solutions some users propose in the repository issues, or, as in the previous point, refer to the code in CenterPose.
● Using or reproducing the baselines in the paper: the two 3D object detection baselines are only touched on briefly and their training code is not released. The inference code can probably be tracked down in https://github.com/google/mediapipe, but that route is still overly complicated. A better option is to swap the backbone for something more common such as YOLOv5, or to switch to CenterPose, whose open-source code happens to train and evaluate on the Objectron dataset.
4. Novelty
See the advantages of the Objectron dataset enumerated in the last paragraph of the Introduction section.
5. Summary
Dataset papers like this are not common at CVPR, but reading one closely for the first time, I was won over by its writing structure and expository logic. Compared with other, far longer dataset papers, such as MS-COCO at ECCV or Pascal VOC in IJCV, Objectron is clearly short and compact. Given its dataset scale and the current stage of the field, such a comparison may not be entirely fair, but a few lessons still stand out:
● Can 3D information be annotated directly on 2D images? Without a 3D point cloud, annotating a large number of independent 2D images cannot yield all 9 degrees of freedom, but could 6, or even 5, degrees of freedom still be obtained?
● How do we establish that the labels are both well-founded and consistent? Well-founded means grounded in a principle or formula; consistent means different human annotators must agree with one another. The Annotation Variance section of this paper provides a good template for demonstrating both.
● How should this kind of paper handle the novelty concerns it is likely to face at top venues? Although the Introduction closes by enumerating as many as six advantages of the Objectron dataset, in my view the design of the baselines in the experimental section matters just as much. Testing only MobilePose would have left the paper feeling thin (heavy on data, light on methods). Adding the two-stage detection architecture and the NeRF section balances the experiments better, an approach worth borrowing.