LVIS Dataset | LVIS: A Dataset for Large Vocabulary Instance Segmentation

FAIR has released a large-scale instance segmentation dataset: many categories, few training samples per category.

2019.8

Paper: https://arxiv.org/pdf/1908.03195.pdf

https://www.arxiv-vanity.com/papers/1908.03195/

Homepage: http://www.lvisdataset.org


Abstract

Progress on object detection is enabled by datasets that focus the research community’s attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced ‘el-vis’): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect ~2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images. Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples. Given that state-of-the-art deep learning methods for object detection perform poorly in the low-sample regime, we believe that our dataset poses an important and exciting new scientific challenge. LVIS is available at http://www.lvisdataset.org.


1 INTRODUCTION

A central goal of computer vision is to endow algorithms with the ability to intelligently describe images. Object detection is a canonical image description task; it is intuitively appealing, useful in applications, and straightforward to benchmark in existing settings. The accuracy of object detectors has improved dramatically and new capabilities, such as predicting segmentation masks and 3D representations, have been developed. There are now exciting opportunities to push these methods towards new goals.


Today, rigorous evaluation of general purpose object detectors is mostly performed in the few category regime (e.g. 80) or when there are a large number of training examples per category (e.g. 100 to 1000+). Thus, there is an opportunity to enable research in the natural setting where there are a large number of categories and per-category data is sometimes scarce. The long tail of rare categories is inescapable; annotating more images simply uncovers previously unseen, rare categories (see Fig. 9 and [29, 25, 24, 27]). Efficiently learning from few examples is a significant open problem in machine learning and computer vision, making this opportunity one of the most exciting from a scientific and practical perspective. But to open this area to empirical study, a suitable, high-quality dataset and benchmark is required.


We aim to enable this new research direction by designing and collecting LVIS (pronounced ‘el-vis’)—a benchmark dataset for research on Large Vocabulary Instance Segmentation. We are collecting instance segmentation masks for more than 1000 entry-level object categories (see Fig. 1). When completed, we plan for our dataset to contain 164k images and ~2 million high-quality instance masks. Our annotation pipeline starts from a set of images that were collected without prior knowledge of the categories that will be labeled in them. We engage annotators in an iterative object spotting process that uncovers the long tail of categories that naturally appears in the images and avoids using machine learning algorithms to automate data labeling.


We designed a crowdsourced annotation pipeline that enables the collection of our large-scale dataset while also yielding high-quality segmentation masks. Quality is important for future research because relatively coarse masks, such as those in the COCO dataset [18], limit the ability to differentiate algorithm-predicted mask quality beyond a certain, coarse point. When compared to expert annotators, our segmentation masks have higher overlap and boundary consistency than both COCO and ADE20K [28].


To build our dataset, we adopt an evaluation-first design principle. This principle states that we should first determine exactly how to perform quantitative evaluation and only then design and build a dataset collection pipeline to gather the data entailed by the evaluation. We select our benchmark task to be COCO-style instance segmentation and we use the same COCO-style average precision (AP) metric that averages over categories and different mask intersection over union (IoU) thresholds [19]. Task and metric continuity with COCO reduces barriers to entry.

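For concreteness, the metric above averages per-category AP over ten mask IoU thresholds. A sketch of the definition in our own notation (C is the category set and AP_{c,t} is the average precision for category c at mask IoU threshold t):

```latex
\mathrm{AP} \;=\; \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}}
\frac{1}{10} \sum_{t \,\in\, \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \mathrm{AP}_{c,t}
```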

Buried within this seemingly innocuous task choice are immediate technical challenges: How do we fairly evaluate detectors when one object can reasonably be labeled with multiple categories (see Fig. 2)? How do we make the annotation workload feasible when labeling 164k images with segmented objects from over 1000 categories?


The essential design choice resolving these challenges is to build a federated dataset: a single dataset that is formed by the union of a large number of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. Each small dataset provides the essential guarantee of exhaustive annotations for a single category—all instances of that category are annotated. Multiple constituent datasets may overlap and thus a single object within an image can be labeled with multiple categories. Furthermore, since the exhaustive annotation guarantee only holds within each small dataset, we do not require the entire federated dataset to be exhaustively annotated with all categories, which dramatically reduces the annotation workload. Crucially, at test time the membership of each image with respect to the constituent datasets is not known by the algorithm and thus it must make predictions as if all categories will be evaluated. The evaluation oracle evaluates each category fairly on its constituent dataset.


In the remainder of this paper, we summarize how our dataset and benchmark relate to prior work, provide details on the evaluation protocol, describe how we collected data, and then discuss results of the analysis of this data.


Dataset Timeline.

We report detailed analysis on the 5000 image val subset that we have annotated twice. We have now annotated an additional 77k images (split between train, val, and test), representing ~50% of the final dataset; we refer to this as LVIS v0.5 (see §A for details). The first LVIS Challenge, based on v0.5, will be held at the COCO Workshop at ICCV 2019.


1.1 RELATED DATASETS

Datasets shape the technical problems researchers study and consequently the path of scientific discovery [17]. We owe much of our current success in image recognition to pioneering datasets such as MNIST [16], BSDS [20], Caltech 101 [6], PASCAL VOC [5], ImageNet [23], and COCO [18]. These datasets enabled the development of algorithms that detect edges, perform large-scale image classification, and localize objects by bounding boxes and segmentation masks. They were also used in the discovery of important ideas, such as Convolutional Networks [15, 13], Residual Networks [10], and Batch Normalization [11].


LVIS is inspired by these and other related datasets, including those focused on street scenes (Cityscapes [3] and Mapillary [22]) and pedestrians (Caltech Pedestrians [4]). We review the most closely related datasets below.


COCO [18] is the most popular instance segmentation benchmark for common objects. It contains 80 categories that are pairwise distinct. There are a total of 118k training images, 5k validation images, and 41k test images. All 80 categories are exhaustively annotated in all images (ignoring annotation errors), leading to approximately 1.2 million instance segmentation masks. To establish continuity with COCO, we adopt the same instance segmentation task and AP metric, and we are also annotating all images from the COCO 2017 dataset. All 80 COCO categories can be mapped into our dataset. In addition to representing an order of magnitude more categories than COCO, our annotation pipeline leads to higher-quality segmentation masks that more closely follow object boundaries (see §4).


ADE20K [28] is an ambitious effort to annotate almost every pixel in 25k images with object instance, ‘stuff’, and part segmentations. The dataset includes approximately 3000 named objects, stuff regions, and parts. Notably, ADE20K was annotated by a single expert annotator, which increases consistency but also limits dataset size. Due to the relatively small number of annotated images, most of the categories do not have enough data to allow for both training and evaluation. Consequently, the instance segmentation benchmark associated with ADE20K evaluates algorithms on the 100 most frequent categories. In contrast, our goal is to enable benchmarking of large vocabulary instance segmentation methods.


iNaturalist [26] contains nearly 900k images annotated with bounding boxes for 5000 plant and animal species. Similar to our goals, iNaturalist emphasizes the importance of benchmarking classification and detection in the few example regime. Unlike our effort, iNaturalist does not include segmentation masks and is focussed on a different image and fine-grained category distribution; our category distribution emphasizes entry-level categories.


Open Images v4 [14] is a large dataset of 1.9M images. The detection portion of the dataset includes 15M bounding boxes labeled with 600 object categories. The associated benchmark evaluates the 500 most frequent categories, all of which have over 100 training samples (>70% of them have over 1000 training samples). Thus, unlike our benchmark, low-shot learning is not integral to Open Images. Also different from our dataset is the use of machine learning algorithms to select which images will be annotated by using classifiers for the target categories. Our data collection process, in contrast, involves no machine learning algorithms and instead discovers the objects that appear within a given set of images. Starting with release v4, Open Images has used a federated dataset design for object detection.


2 DATASET DESIGN

We followed an evaluation-first design principle: prior to any data collection, we precisely defined what task would be performed and how it would be evaluated. This principle is important because there are technical challenges that arise when evaluating detectors on a large vocabulary dataset that do not occur when there are few categories. These must be resolved first, because they have profound implications for the structure of the dataset, as we discuss next.


2.1 TASK AND EVALUATION OVERVIEW

Task and Metric. Our dataset benchmark is the instance segmentation task: given a fixed, known set of categories, design an algorithm that when presented with a previously unseen image will output a segmentation mask for each instance of each category that appears in the image along with the category label and a confidence score. Given the output of an algorithm over a set of images, we compute mask average precision (AP) using the definition and implementation from the COCO dataset [19] (for more detail see §2.3).

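As a rough illustration of the metric, the snippet below runs the reference COCO mask-AP evaluation with pycocotools. The file names are placeholders and LVIS ships its own evaluation tooling, but the AP definition is the COCO one cited above.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val.json")            # ground-truth annotations (placeholder path)
coco_dt = coco_gt.loadRes("segm_results.json")  # detector output: masks, categories, scores

coco_eval = COCOeval(coco_gt, coco_dt, iouType="segm")
coco_eval.evaluate()    # match predictions to ground truth at IoU thresholds 0.50:0.05:0.95
coco_eval.accumulate()  # build per-category precision/recall tables
coco_eval.summarize()   # report AP averaged over categories and IoU thresholds
```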

Evaluation Challenges. Datasets like PASCAL VOC and COCO use manually selected categories that are pairwise disjoint: when annotating a car, there’s never any question if the object is instead a potted plant or a sofa. When increasing the number of categories, it is inevitable that other types of pairwise relationships will occur: (1) partially overlapping visual concepts; (2) parent-child relationships; and (3) perfect synonyms. See Fig. 2 for examples.


If these relations are not properly addressed, then the evaluation protocol will be unfair. For example, most toys are not deer and most deer are not toys, but a toy deer is both—if a detector outputs deer and the object is only labeled toy, the detection will be marked as wrong. Likewise, if a car is only labeled vehicle, and the algorithm outputs car, it will be incorrectly judged to be wrong. Or, if an object is only labeled backpack and the algorithm outputs the synonym rucksack, it will be incorrectly penalized. Providing a fair benchmark is important for accurately reflecting algorithm performance.


These problems occur when the ground-truth annotations are missing one or more true labels for an object. If an algorithm happens to predict one of these correct, but missing labels, it will be unfairly penalized. Now, if all objects are exhaustively and correctly labeled with all categories, then the problem is trivially solved. But correctly and exhaustively labeling 164k images each with 1000 categories is undesirable: it forces a binary judgement deciding if each category applies to each object; there will be many cases of genuine ambiguity and inter-annotator disagreement. Moreover, the annotation workload will be very large. Given these drawbacks, we describe our solution next.


2.2 FEDERATED DATASETS

Our key observation is that the desired evaluation protocol does not require us to exhaustively annotate all images with all categories. What is required instead is that for each category c there must exist two disjoint subsets of the entire dataset D for which the following guarantees hold:


Positive set: there exists a subset of images Pc ? D such that all instances of c in Pc are segmented. In other words, Pc is exhaustively annotated for category c.

Negative set: there exists a subset of images Nc ? D such that no instance of c appears in any of these images.

Given these two subsets for a category c, Pc ∪ Nc can be used to perform standard COCO-style AP evaluation for c. The evaluation oracle only judges the algorithm on a category c over the subset of images in which c has been exhaustively annotated; if a detector reports a detection of category c on an image i ? Pc ∪ Nc, the detection is not evaluated.
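
A minimal sketch of this rule, assuming detections and image sets are kept in simple Python structures (the names are illustrative, not the dataset’s API): detections of category c are scored only on images in Pc ∪ Nc and are ignored elsewhere.

```python
def detections_to_score(detections, c, positive_sets, negative_sets):
    """Keep only detections of category c made on images where c is evaluable."""
    evaluable_images = positive_sets[c] | negative_sets[c]  # P_c ∪ N_c (sets of image ids)
    # Detections of c on any other image are neither true nor false positives:
    # the oracle simply never looks at them.
    return [d for d in detections
            if d["category"] == c and d["image_id"] in evaluable_images]
```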

By collecting the per-category sets into a single dataset, D = ∪c (Pc ∪ Nc), we arrive at the concept of a federated dataset. A federated dataset is a dataset that is formed by the union of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. By not annotating all images with all categories, freedom is created to design an annotation process that avoids ambiguous cases and collects annotations only if there is sufficient inter-annotator agreement. At the same time, the workload can be dramatically reduced.


Finally, we note that positive set and negative set membership on the test split is not disclosed and therefore algorithms have no side information about what categories will be evaluated in each image. An algorithm thus must make its best prediction for all categories in each test image.



Reduced Workload. Federated dataset design allows us to make |Pc ∪ Nc| ? |D| for every category c. This choice dramatically reduces the workload and allows us to undersample the most frequent categories in order to avoid wasting annotation resources on them (e.g. person accounts for 30% of COCO). Of our estimated ~2 million instances, likely no single category will account for more than ~3% of the total instances.

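Back-of-the-envelope arithmetic, assuming the ~1% positive-set cap and the 1% negative-set target described in §3, shows how much smaller the per-category workload becomes:

```python
num_images = 164_000                    # planned dataset size |D|
max_positive = int(0.01 * num_images)   # positive set capped at ~1% of images (stage 2)
max_negative = int(0.01 * num_images)   # negative set target of 1% of images (stage 6)
per_category = max_positive + max_negative
print(per_category, num_images)         # ~3,280 evaluable images per category vs. 164,000
```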

2.3 EVALUATION DETAILS

The challenge evaluation server will only return the overall AP, not per-category APs. We do this because: (1) it avoids leaking which categories are present in the test set; (2) given that tail categories are rare, there will be few examples for evaluation in some cases, which makes per-category AP unstable; (3) by averaging over a large number of categories, the overall category-averaged AP has lower variance, making it a robust metric for ranking algorithms.


Non-Exhaustive Annotations. We also collect an image-level boolean label, eci, indicating if image i ∈ Pc is exhaustively annotated for category c. In most cases (91%), this flag is true, indicating that the annotations are indeed exhaustive. In the remaining cases, there is at least one instance in the image that is not annotated. Missing annotations often occur in ‘crowds’ where there are a large number of instances and delineating them is difficult. During evaluation, we do not count false positives for category c on images i that have eci set to false. We do measure recall on these images: the detector is expected to predict accurate segmentation masks for the labeled instances. Our strategy differs from other datasets that use a small maximum number of instances per image, per category (10-15) together with ‘crowd regions’ (COCO) or use a special ‘group of c’ label to represent 5 or more instances (Open Images v4). Our annotation pipeline (§3) attempts to collect segmentations for all instances in an image, regardless of count, and then checks if the labeling is in fact exhaustive. See Fig. 3.
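
A small sketch of how this flag changes scoring, with illustrative data structures rather than the dataset’s actual format: unmatched detections on images where eci is false are ignored instead of being counted as false positives, while matched instances still contribute to recall.

```python
def classify_detection(det_matched, image_is_exhaustive):
    """Outcome of one detection of category c on an image i in P_c."""
    if det_matched:
        return "true_positive"   # always credited, so recall is still measured
    if image_is_exhaustive:
        return "false_positive"  # unmatched detection on an exhaustively annotated image
    return "ignored"             # unmatched, but e_ci is false: not penalized
```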

Hierarchy.

During evaluation, we treat all categories the same; we do nothing special in the case of hierarchical relationships. To perform best, for each detected object o, the detector should output the most specific correct category as well as all more general categories, e.g., a canoe should be labeled both canoe and boat. The detected object o in image i will be evaluated with respect to all labeled positive categories {c | i ∈ Pc}, which may be any subset of categories between the most specific and the most general.

Synonyms.

A federated dataset that separates synonyms into different categories is valid, but is unnecessarily fragmented (see Fig. 2, right). We avoid splitting synonyms into separate categories with WordNet [21]. Specifically, in LVIS each category c is a WordNet synset—a word sense specified by a set of synonyms and a definition.
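
A brief sketch using NLTK’s WordNet interface (our own choice of tooling for illustration; the paper only states that categories are WordNet synsets) showing how a synset bundles synonyms and sits in a hypernym hierarchy, which is what the hierarchy and synonym rules above rely on:

```python
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

canoe = wn.synset("canoe.n.01")
print(canoe.definition())    # a definition like the ones shown to annotators
print(canoe.lemma_names())   # synonyms grouped under one category (synset)

# Hypernym chain: a canoe is also a boat, so under the hierarchy rule a detector
# should report both the specific and the more general categories.
print([s.name() for s in canoe.hypernym_paths()[0]])
```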

3 DATASET CONSTRUCTION

In this section we provide an overview of the annotation pipeline that we use to collect LVIS.

3.1 ANNOTATION PIPELINE

Fig. 4 illustrates our annotation pipeline by showing the output of each stage, which we describe below. For now, assume that we have a fixed category vocabulary V. We will describe how the vocabulary was collected in §3.2.

Object Spotting, Stage 1.

The goals of the object spotting stage are to: (1) generate the positive set, Pc, for each category c ∈ V and (2) elicit vocabulary recall such that many different object categories are included in the dataset.

Object spotting is an iterative process in which each image is visited a variable number of times. On the first visit, an annotator is asked to mark one object with a point and to name it with a category c ∈ V using an autocomplete text input. On each subsequent visit, all previously spotted objects are displayed and an annotator is asked to mark an object of a previously unmarked category or to skip the image if no more categories in V can be spotted. When an image has been skipped 3 times, it will no longer be visited. The autocomplete is performed against the set of all synonyms, presented with their definitions; we internally map the selected word to its synset/category to resolve synonyms.

Obvious and salient objects are spotted early in this iterative process. As an image is visited more, less obvious objects are spotted, including incidental, non-salient ones. We run the spotting stage twice, and for each image we retain categories that were spotted in both runs. Thus two people must independently agree on a name in order for it to be included in the dataset; this increases naming consistency.

To summarize the output of stage 1: for each category in the vocabulary, we have a (possibly empty) set of images in which one object of that category is marked per image. This defines an initial positive set, Pc, for each category c.

Exhaustive Instance Marking, Stage 2.

The goals of this stage are to: (1) verify stage 1 annotations and (2) take each image i ∈ Pc and mark all instances of c in i with a point.

In this stage, (i, c) pairs from stage 1 are each sent to 5 annotators. They are asked to perform two steps. First, they are shown the definition of category c and asked to verify if it describes the spotted object. Second, if it matches, then the annotators are asked to mark all other instances of the same category. If it does not match, there is no second step. To prevent frequent categories from dominating the dataset and to reduce the overall workload, we subsample frequent categories such that no positive set exceeds more than 1% of the images in the dataset.

To ensure annotation quality, we embed a ‘gold set’ within the pool of work. These are cases for which we know the correct ground-truth. We use the gold set to automatically evaluate the work quality of each annotator so that we can direct work towards more reliable annotators. We use 5 annotators per (i, c) pair to help ensure instance-level recall.

To summarize, from stage 2 we have exhaustive instance spotting for each image i ∈ Pc for each category c ∈ V.

Instance Segmentation, Stage 3.

The goals of the instance segmentation stage are to: (1) verify the category for each marked object from stage 2 and (2) upgrade each marked object from a point annotation to a full segmentation mask.

To do this, each pair (i, o) of image i and marked object instance o is presented to one annotator who is asked to verify that the category label for o is correct and if it is correct, to draw a detailed segmentation mask for it (e.g. see Fig. 3).

We use a training task to establish our quality standards. Annotator quality is assessed with a gold set and by tracking their average vertex count per polygon. We use these metrics to assign work to reliable annotators.

In sum, from stage 3 we have for each image and spotted instance pair one segmentation mask (if it is not rejected).

Segment Verification, Stage 4.

The goal of the segment verification stage is to verify the quality of the segmentation masks from stage 3. We show each segmentation to up to 5 annotators and ask them to rate its quality using a rubric. If two or more annotators reject the mask, then we requeue the instance for stage 3 segmentation. Thus we only accept a segmentation if 4 annotators agree it is high-quality. Unreliable workers from stage 3 are not invited to judge segmentations in stage 4; we also use rejections rates from this stage to monitor annotator reliability. We iterate between stages 3 & 4 a total of four times, each time only re-annotating rejected instances.
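
A compact sketch of the stage-4 decision rule as stated above (up to 5 ratings per mask; two or more rejections send it back to stage 3):

```python
def review_mask(ratings):
    """ratings: list of booleans, True meaning an annotator rated the mask high quality."""
    rejections = sum(1 for ok in ratings if not ok)
    return "requeue_for_stage_3" if rejections >= 2 else "accept"

# review_mask([True, True, False, True, True])  -> 'accept'
# review_mask([True, False, False, True, True]) -> 'requeue_for_stage_3'
```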

To summarize the output of stage 4 (after iterating back and forth with stage 3): we have a high-quality segmentation mask for >99% of all marked objects.

Full Recall Verification, Stage 5.

The full recall verification stage finalizes the positive sets. The goal is to find images i ∈ Pc where c is not exhaustively annotated. We do this by asking annotators if there are any unsegmented instances of category c in i. We ask up to 5 annotators and require at least 4 to agree that annotation is exhaustive. As soon as two believe it is not, we mark the exhaustive annotation flag eci as false. We use a gold set to maintain quality.

To summarize the output of stage 5: we have a boolean flag eci for each image i ∈ Pc indicating if category c is exhaustively annotated in image i. This finalizes the positive sets along with their instance segmentation annotations.

Negative Sets, Stage 6.

The final stage of the pipeline is to collect a negative set Nc for each category c in the vocabulary. We do this by randomly sampling images i ∈ D \ Pc, where D is all images in the dataset. For each sampled image i, we ask up to 5 annotators if category c appears in image i. If any one annotator reports that it does, we reject the image. Otherwise i is added to Nc. We sample until the negative set Nc reaches a target size of 1% of the images in the dataset. We use a gold set to maintain quality.
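
A rough sketch of the stage-6 sampling loop under the stated rules; `annotators_report_category` stands in for the human check with up to 5 annotators and is not a real function of any API:

```python
import random

def build_negative_set(all_images, positive_set, annotators_report_category, target_frac=0.01):
    """Sample images outside P_c and keep those no annotator flags as containing c."""
    candidates = list(all_images - positive_set)       # i ∈ D \ P_c
    random.shuffle(candidates)
    target_size = int(target_frac * len(all_images))   # 1% of the dataset
    negatives = set()
    for image_id in candidates:
        if len(negatives) >= target_size:
            break
        if not annotators_report_category(image_id):   # rejected if anyone sees category c
            negatives.add(image_id)
    return negatives
```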

To summarize, from stage 6 we have a negative image set Nc for each category c ∈ V such that the category does not appear in any of the images in Nc.

3.2 VOCABULARY CONSTRUCTION

We construct the vocabulary V with an iterative process that starts from a large super-vocabulary and uses the object spotting process (stage 1) to winnow it down. We start from 8.8k synsets that were selected from WordNet by removing some obvious cases (e.g. proper nouns) and then finding the intersection with highly concrete common nouns [2]. This yields a high-recall set of concrete, and thus likely visual, entry-level synsets. We then apply object spotting to 10k COCO images with autocomplete against this super-vocabulary. This yields a reduced vocabulary with which we repeat the process once more. Finally, we perform minor manual editing. The resulting vocabulary contains 1723 synsets—the upper bound on the number of categories that can appear in LVIS.
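
The seeding step could be approximated roughly as below; the concreteness scores stand in for the ratings of [2], the threshold is a made-up placeholder, and the proper-noun filter is a crude proxy, so this is only a sketch of the idea, not the authors’ procedure:

```python
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def seed_super_vocabulary(concreteness, threshold=4.5):
    """concreteness: dict mapping a lemma (e.g. 'canoe') to a concreteness rating."""
    keep = []
    for synset in wn.all_synsets("n"):                   # all WordNet noun synsets
        lemmas = synset.lemma_names()
        if any(lemma[0].isupper() for lemma in lemmas):  # crude proxy for dropping proper nouns
            continue
        if any(concreteness.get(l.lower().replace("_", " "), 0.0) >= threshold for l in lemmas):
            keep.append(synset.name())                   # e.g. 'canoe.n.01'
    return keep
```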

4.3 EVALUATION PROTOCOL

COCO Detectors on LVIS.

To validate our annotations and federated dataset design we downloaded three Mask R-CNN [9] models from the Detectron Model Zoo [7] and evaluated them on LVIS annotations for the categories in COCO. Tab. 2 shows that both box AP and mask AP are close between our annotations and the original ones from COCO for all models, which span a wide AP range. This result validates our annotations and evaluation protocol: even though LVIS uses a federated dataset design with sparse annotations, the quantitative outcome closely reproduces the ‘gold standard’ results from dense COCO annotations.
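
For reproducing this kind of check, the evaluation can presumably be run with the lvis-api package that accompanies the dataset (our assumption about its interface; paths are placeholders):

```python
from lvis import LVIS, LVISResults, LVISEval

lvis_gt = LVIS("lvis_val_annotations.json")                    # LVIS ground truth (placeholder path)
lvis_dt = LVISResults(lvis_gt, "mask_rcnn_segm_results.json")  # detector output in COCO result format
lvis_eval = LVISEval(lvis_gt, lvis_dt, iou_type="segm")
lvis_eval.run()             # evaluate and accumulate over P_c ∪ N_c for each category
lvis_eval.print_results()   # category-averaged mask AP
```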

Federated Dataset Simulations.

For insight into how AP changes with positive and negative set sizes |Pc| and |Nc|, we randomly sample smaller evaluation sets from COCO val2017 and recompute AP. To plot quartiles and min-max ranges, we re-test each setting 20 times. In Fig. (a) we use all positive instances for evaluation, but vary max |Nc| between 50 and 5k. AP decreases somewhat (~2 points) as we increase the number of negative images, as the ratio of negative to positive examples grows with fixed |Pc| and increasing |Nc|. Next, in Fig. (b) we set max |Nc| = 50 and vary |Pc|. We observe that even with a small positive set size of 80, AP is similar to the baseline with low variance. With smaller positive sets (down to 5) variance increases, but the AP gap from 1st to 3rd quartile remains below 2 points. These simulations together with COCO detectors tested on LVIS (Tab. 2) indicate that including smaller evaluation sets for each category is viable for evaluation.

Low-Shot Detection.

To validate the claim that low-shot detection is a challenging open problem, we trained Mask R-CNN on random subsets of COCO train2017 ranging from 1k to 118k images. For each subset, we optimized the learning rate schedule and weight decay by grid search. Results on val2017 are shown in Fig. (c). At 1k images, mask AP drops from 36.4% (full dataset) to 9.8% (1k subset). In the 1k subset, 89% of the categories have more than 20 training instances, while the low-shot literature typically considers ≤20 examples per category [8].

Low-Shot Category Statistics.

Fig. 9 (left) shows category growth as a function of image count (up to 977 categories in 5k images). Extrapolating the trajectory, our final dataset will include over 1k categories (upper bounded by the vocabulary size, 1723). Since the number of categories increases during data collection, the low-shot nature of LVIS is somewhat independent of the dataset scale; see Fig. 9 (right), where we bin categories based on how many images they appear in: rare (1-10 images), common (11-100), and frequent (>100). These bins, as measured w.r.t. the training set, will be used to present disaggregated AP metrics.
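
The binning described here is simple enough to state as code (counts are with respect to the training set):

```python
def frequency_bin(train_image_count):
    """Assign a category to rare/common/frequent by how many training images it appears in."""
    if train_image_count <= 10:
        return "rare"        # 1-10 images
    if train_image_count <= 100:
        return "common"      # 11-100 images
    return "frequent"        # >100 images
```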

5 CONCLUSION

We introduced LVIS, a new dataset designed to enable, for the first time, the rigorous study of instance segmentation algorithms that can recognize a large vocabulary of object categories (>1000) and must do so using methods that can cope with the open problem of low-shot learning. While LVIS emphasizes learning from few examples, the dataset is not small: it will span 164k images and label ~2 million object instances. Each object instance is segmented with a high-quality mask that surpasses the annotation quality of related datasets. We plan to establish LVIS as a benchmark challenge that we hope will lead to exciting new object detection, segmentation, and low-shot learning algorithms.
