CVPR 2019 | LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking

Paper: https://arxiv.org/pdf/1905.02822.pdf

Code: https://github.com/Guanghan/lighttrack

MPII dataset: http://human-pose.mpi-inf.mpg.de/#download

PoseTrack dataset: https://posetrack.net/users/download.php


Abstract

In this paper, we propose a novel effective light-weight framework, called LightTrack, for online human pose tracking. The proposed framework is designed to be generic for top-down pose tracking and is faster than existing online and offline methods. Single-person Pose Tracking (SPT) and Visual Object Tracking (VOT) are incorporated into one unified functioning entity, easily implemented by a replaceable single-person pose estimation module. Our framework unifies single-person pose tracking with multi-person identity association and sheds first light upon bridging keypoint tracking with object tracking. We also propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a Re-ID module in our pose tracking system. In contrast to other Re-ID modules, we use a graphical representation of human joints for matching. The skeleton-based representation effectively captures human pose similarity and is computationally inexpensive. It is robust to sudden camera shifts that introduce human drifting. To the best of our knowledge, this is the first paper to propose an online human pose tracking framework in a top-down fashion. The proposed framework is general enough to fit other pose estimators and candidate matching mechanisms. Our method outperforms other online methods while maintaining a much higher frame rate, and is very competitive with our offline state-of-the-art. We make the code publicly available at: https://github.com/Guanghan/lighttrack.

1. Introduction

Pose tracking is the task of estimating multi-person human poses in videos and assigning unique instance IDs to each keypoint across frames. Accurate estimation of human keypoint trajectories is useful for human action recognition, human interaction understanding, motion capture, animation, etc. Recently, the publicly available PoseTrack dataset [18, 3] and MPII Video Pose dataset [17] have pushed research on human motion analysis one step further towards real-world scenarios. Two PoseTrack challenges have been held. However, most existing methods are offline and hence lack the potential to be real-time. More emphasis has been put on the Multi-Object Tracking Accuracy (MOTA) criterion than on the Frames Per Second (FPS) criterion. Existing offline methods divide the tasks of human detection, candidate pose estimation, and identity association into sequential stages. In this procedure, multi-person poses are estimated across frames within a video. Based on the pose estimation results, the pose tracking outputs are computed by solving an optimization problem. This requires the poses of future frames to be pre-computed, or at least the frames within some range.

In this paper, we propose a novel effective light-weight framework for pose tracking. It is designed to be generic, top-down (i.e., pose estimation is performed after candidates are detected), and truly online. The proposed framework unifies single-person pose tracking with multi-person identity association. It sheds first light on bridging keypoint tracking with object tracking. To the best of our knowledge, this is the first paper to propose an online pose tracking framework in a top-down fashion. The proposed framework is general enough to fit other pose estimators and candidate matching mechanisms. Thus, if individual components are further improved in the future, our framework will be faster and/or more accurate.

In contrast to Visual Object Tracking (VOT) methods, in which visual features are implicitly represented by kernels or CNN feature maps, we track each human pose by recursively updating the bounding box and its corresponding pose in an explicit manner. The bounding box region of a target is inferred from the explicit features, i.e., the human keypoints. Human keypoints can be considered as a series of special visual features. The advantages of using pose as explicit features include: (1) The explicit features are human-related and interpretable, and have a very strong and stable relationship with the bounding box position. Human pose enforces a direct constraint on the bounding box region. (2) The task of pose estimation and tracking requires human keypoints to be predicted in the first place. Taking advantage of the predicted keypoints to track the ROI region is efficient and almost free. This mechanism makes online tracking possible. (3) It naturally keeps the identity of the candidates, which greatly alleviates the burden of data association in the system. Even when data association is necessary, we can re-use the pose features for skeleton-based pose matching. Single-person Pose Tracking (SPT) and single Visual Object Tracking (VOT) are thus incorporated into one unified functioning entity, easily implemented by a replaceable single-person pose estimation module.

Figure 1. Overview of the proposed online pose tracking framework. We detect human candidates in the first frame, then track each candidate's position and pose with a single-person pose estimator. When a target is lost, we perform detection for this frame and data association with a graph convolution network for skeleton-based pose matching. We use skeleton-based pose matching because visually similar candidates with different identities may confuse visual classifiers. Extracting visual features can also be computationally expensive in an online tracking system. Pose matching is considered because we observe that in two adjacent frames, the location of a person may drift away due to a sudden camera shift, but the human pose stays almost the same, as people usually cannot move that fast.

Our contributions are three-fold: (1) We propose a general online pose tracking framework that is suitable for top-down approaches to human pose estimation. Both the human pose estimator and the Re-ID module are replaceable. In contrast to Multi-Object Tracking (MOT) frameworks, our framework is specially designed for the task of pose tracking. To the best of our knowledge, this is the first paper to propose an online human pose tracking system in a top-down fashion. (2) We propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a Re-ID module in our pose tracking system. Different from existing Re-ID modules, we use a graphical representation of human joints for matching. The skeleton-based representation effectively captures human pose similarity and is computationally inexpensive. It is robust to sudden camera shifts that introduce human drifting. (3) We conduct extensive experiments with various settings and ablation studies. Our proposed online pose tracking approach outperforms existing online methods and is competitive with the offline state-of-the-art, but at much higher frame rates. We make the code publicly available to facilitate future research.

2. Related Work

2.1. Human Pose Estimation and Tracking

Human Pose Estimation (HPE) has seen rapid progress with the emergence of CNN-based methods [34, 31, 39, 21]. The most widely used datasets, e.g., MPII [4] and LSP [20], are saturated with methods that achieve 90% and higher accuracy. Multi-person human pose estimation is more realistic and challenging, and has received increasing attention with the hosting of the COCO keypoint challenges [26] since 2017. Existing methods can be classified into top-down and bottom-up approaches. The top-down approaches [14, 32, 15] rely on a detection module to obtain human candidates and then apply single-person pose estimation to locate human keypoints. The bottom-up methods [6, 35, 30] detect human keypoints from all potential candidates and then assemble these keypoints into human limbs for each individual based on various data association techniques. The advantage of bottom-up approaches is their excellent trade-off between estimation accuracy and computational cost, because the cost is nearly invariant to the number of human candidates in the image. In contrast, the advantage of top-down approaches is their capability of disassembling the task into multiple comparatively easier tasks, i.e., object detection and single-person pose estimation. The object detector is expert at detecting hard (usually small) candidates, so that the pose estimator performs better within a focused regression space. Pose tracking is a new topic that is primarily introduced by the PoseTrack dataset [18, 3] and the MPII Video Pose dataset [17]. The task is to estimate human keypoints and assign unique IDs to each keypoint at instance level across frames in videos. A typical top-down but offline method was introduced in [17], where pose tracking is transformed into a minimum cost multi-cut problem with a graph partitioning formulation.

[17] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. ArtTrack: Articulated multi-person tracking in the wild. In CVPR, 2017.

2.2. Object Detection vs. Human Pose Estimation

Earlier works in object detection regress visual features into bounding box coordinates. HPE, on the other hand, usually regresses visual features into heatmaps, with each channel representing a human joint. Recently, research in HPE has inspired many works on object detection [40, 22, 28]. These works predict heatmaps for a set of special keypoints to infer detection results (bounding boxes). Based on this motivation, we propose to predict human keypoints to infer bounding box regions. Human keypoints are a special set of keypoints that represent detections of the human class only.

[40] X. Zhou, J. Zhuo, and P. Krähenbühl. Bottom-up object detection by grouping extreme and center points. arXiv preprint arXiv:1901.08043, 2019.

[22] H. Law and J. Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, pages 734–750, 2018.

[28] K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool. Deep Extreme Cut: From extreme points to object segmentation. In CVPR, 2018.

2.3. Multi-Object Tracking

MOT aims to estimate the trajectories of multiple objects by finding target locations while maintaining their identities across frames. Offline methods use both past and future frames to generate trajectories, while online methods only exploit information that is available up to the current frame. An online MOT pipeline [41] was presented that applies a single object tracker to keep tracking each target, given the target detections in each frame. The target state is set as tracked until the tracking result becomes unreliable. The target is then regarded as lost, and data association is performed to compute the similarity between the tracklet and the detections. Our proposed online pose tracking framework also tracks each target (with corresponding keypoints) individually while keeping its identity, and performs data association when a target is lost. However, our framework is distinct in several aspects: (a) detections are generated by the object detector only at key frames, and therefore do not need to be provided at every frame; they can be provided sparsely; (b) the single object tracker is actually a pose estimator that predicts keypoints based on an enlarged region.

2.4. Graphical Representation for Human Pose

It was recently studied in [38] how to effectively model dynamic skeletons with a specially tailored graph convolution operation. The graph convolution operation turns human skeletons into a spatio-temporal representation of human actions. Inspired by this work, we propose to employ a GCN to encode the spatial relationship among human joints into a latent representation of the human pose. The representation aims to robustly encode the pose, invariant to human location or view angle. We measure similarities between such encodings for the matching of human poses.

GCN: https://tkipf.github.io/graph-convolutional-networks/

GCN tutorial (in Chinese): https://www.cnblogs.com/SivilTaram/p/graph_neural_network_1.html

3. Proposed Method

3.1. Top-Down Pose Tracking Framework

We propose a novel top-down pose tracking framework. It has been proved that human pose can be employed for better inference of human locations [27]. We observe that, in a top-down approach, accurate human locations also ease the estimation of human poses. We further study the relationships between these two levels of information: (1) A coarse person location can be distilled into body keypoints by a single-person pose estimator. (2) The positions of human joints can be straightforwardly used to indicate rough locations of human candidates. (3) Thus, recurrently estimating one from the other is a feasible strategy for Single-person Pose Tracking (SPT).

However, it is not a good idea to merely treat the Multi-target Pose Tracking (MPT) problem as a repeated SPT problem for multiple individuals, because certain constraints need to be met: e.g., in a certain frame, two different IDs should not belong to the same person, and no two candidates should share the same identity. A better way is to track multiple individuals simultaneously and preserve/update their identities with an additional Re-ID module. The Re-ID module is essential because it is usually hard to maintain correct identities all the way; it is unlikely that individual poses can be tracked effectively across the frames of an entire video. For instance, under the following scenarios, identities have to be updated: (1) some people disappear from the camera view or get occluded; (2) new candidates come in or previous candidates re-appear; (3) people walk across each other (two identities may merge into one if not treated carefully); (4) tracking fails due to fast camera shifting or zooming.

In our method, we first treat each human candidate separately so that their corresponding identities are kept across frames. In this way, we circumvent the time-consuming offline optimization procedure. In case a tracked candidate is lost due to occlusion or camera shift, we then call the detection module to revive candidates and associate them with the tracked targets from the previous frame via pose matching. In this way, we accomplish multi-target pose tracking with an SPT module and a pose matching module.

Specifically, the bounding box of a person in the upcoming frame is inferred from the joints estimated by the pose module in the current frame. We find the minimum and maximum coordinates and enlarge this ROI region by 20% on each side. The enlarged bounding box is treated as the localized region for this person in the next frame. If the average confidence score s̄ of the estimated joints is lower than the threshold τ_s, it reflects that the target is lost, since the joints are not likely to appear in the bounding box region. The state of the target is defined as:
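The state equation itself is rendered as an image in the original post; the rule it expresses is simply that the target stays tracked when s̄ ≥ τ_s and is declared lost otherwise. Below is a minimal sketch of the box inference and this lost-target test. The function names and the default value of τ_s are illustrative assumptions, not the authors' code.

```python
import numpy as np

def bbox_from_keypoints(keypoints, enlarge=0.20):
    """Infer the next-frame ROI from the current pose.

    keypoints: (N, 3) array of (x, y, confidence) for one person.
    Returns the keypoint bounding box enlarged by `enlarge` (20%) on each side.
    """
    xs, ys = keypoints[:, 0], keypoints[:, 1]
    pad_x = enlarge * (xs.max() - xs.min())
    pad_y = enlarge * (ys.max() - ys.min())
    return (xs.min() - pad_x, ys.min() - pad_y, xs.max() + pad_x, ys.max() + pad_y)

def target_state(keypoints, tau_s=0.5):
    """'tracked' if the average joint confidence is at least tau_s, else 'lost'.
    The threshold value 0.5 is a placeholder; the paper only names it tau_s."""
    s_bar = keypoints[:, 2].mean()
    return "tracked" if s_bar >= tau_s else "lost"
```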

If the target is lost, we have two modes: (1) Fixed Keyframe Interval (FKI) mode: neglect this target until the next scheduled key-frame, where the detection module re-generates the candidates and their IDs are then associated with the tracking history. (2) Adaptive Keyframe Interval (AKI) mode: immediately revive the missing target by candidate detection and identity association. The advantage of FKI mode is that the frame rate of pose tracking is stable due to the fixed interval of keyframes. The advantage of AKI mode is that the average frame rate can be higher for non-complex videos. In our experiments, we incorporate both by taking keyframes at fixed intervals while also calling the detection module once a target is lost before the arrival of the next scheduled keyframe. The tracking accuracy is higher because when a target is lost, it is handled immediately.
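A compact sketch of this hybrid keyframe policy: the detector runs at fixed keyframe intervals (FKI) and additionally as soon as any target is lost (AKI). The `detector`, `pose_estimator`, and `associate` callables are hypothetical placeholders; the helpers come from the sketch above.

```python
def track_video(frames, detector, pose_estimator, keyframe_interval=10):
    """Minimal sketch of the online loop: detect at keyframes or on target loss,
    otherwise keep estimating each target's pose inside its enlarged ROI."""
    targets = []  # list of (track_id, keypoints)
    for idx, frame in enumerate(frames):
        lost = any(target_state(kpts) == "lost" for _, kpts in targets)
        if idx % keyframe_interval == 0 or lost:
            detections = detector(frame)               # re-generate candidates
            targets = associate(targets, detections)   # spatial consistency + pose matching
        targets = [(tid, pose_estimator(frame, bbox_from_keypoints(kpts)))
                   for tid, kpts in targets]
    return targets
```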

Figure 2. Sequentially adjacent frames with a sudden camera shift (left frames) and sudden zooming (right frames). Each bounding box in the current frame indicates the corresponding region inferred from the human keypoints of the previous frame. The human pose in the current frame is estimated by the pose estimator. The ROI for the pose estimator is the expanded bounding box.

For identity association, we propose to consider two complementary cues: spatial consistency and pose consistency. We first rely on spatial consistency, i.e., if two bounding boxes from the current and the previous frames are adjacent, or their Intersection Over Union (IOU) is above a certain threshold, we consider them to belong to the same target. Specifically, we set the matching flag m(t_k, d_k) to 1 if the maximum IOU overlap ratio o(t_k, d_k) between the tracked target t_k ∈ T_k and the corresponding detection d_k ∈ D_k for key-frame k is higher than the threshold τ_o. Otherwise, m(t_k, d_k) is set to 0:
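The matching-flag equation is an image in the original post; the rule is m(t_k, d_k) = 1 when the best IOU between the tracked box and a key-frame detection exceeds τ_o, and 0 otherwise. A minimal sketch (the function names and the default τ_o value are illustrative assumptions):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def spatial_match(track_box, detections, tau_o=0.3):
    """Return (index_of_matched_detection, m). m = 1 if the maximum IOU over
    the key-frame detections exceeds tau_o, otherwise (None, 0)."""
    if not detections:
        return None, 0
    best = max(range(len(detections)), key=lambda i: iou(track_box, detections[i]))
    return (best, 1) if iou(track_box, detections[best]) > tau_o else (None, 0)
```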

The above criterion is based on the assumption that the tracked target from the previous frame and the actual location of the target in the current frame have significant overlap, which is true for most cases. However, such an assumption is not always reliable, especially when the camera shifts swiftly. In such cases, we need to match the new observations to the tracked candidates. In Re-ID problems, this is usually accomplished by a visual feature classifier. However, visually similar candidates with different identities may confuse such classifiers. Extracting visual features can also be computationally expensive in an online tracking system. Therefore, we design a Graph Convolution Network (GCN) to leverage the graphical representation of the human joints. We observe that in two adjacent frames, the location of a person may drift away due to a sudden camera shift, but the human pose will stay almost the same, as people usually cannot move that fast, as illustrated in Fig. 2. Consequently, the graph representation of human skeletons can be a strong cue for candidate matching, which we refer to as pose matching in the following text.

3.2. Siamese Graph Convolutional Networks

Siamese Network: Given sequences of body joints in the form of 2D coordinates, we construct a spatial graph with the joints as graph nodes and the connectivities of human body structures as graph edges. The input to our graph convolutional network is the joint coordinate vectors on the graph nodes. It is analogous to image-based CNNs, where the input is formed by pixel intensity vectors residing on the 2D image grid [38]. Multiple graph convolutions are performed on the input to generate a feature representation vector as a conceptual summary of the human pose. It inherently encodes the spatial relationship among the human joints. The input to the siamese network is therefore a pair of inputs to the GCN. The distance between the two output features represents how similar two poses are to each other. Two poses are called a match if they are conceptually similar. The network is illustrated in Fig. 3. The siamese network consists of 2 GCN layers and 1 convolutional layer trained with a contrastive loss. We take normalized keypoint coordinates as input; the output is a 128-dimensional feature vector. The network is optimized with a contrastive loss L because we want the network to generate feature representations that are close enough for positive pairs, whereas for negative pairs they are separated by at least a margin. We employ the margin contrastive loss:
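Since the loss equation is rendered as an image in the original post, the sketch below assumes the standard margin contrastive form L = y·d²/2 + (1−y)·max(0, ε−d)²/2, with d the Euclidean distance between the two 128-d embeddings. The `encoder` stands in for the 2-layer GCN + conv network, and the matching threshold is an illustrative value.

```python
import torch
import torch.nn.functional as F

def margin_contrastive_loss(feat_a, feat_b, y, margin=1.0):
    """y = 1 for a positive (same-person) pair, 0 for a negative pair.
    Positive pairs are pulled together; negative pairs are pushed at least
    `margin` apart (standard margin contrastive loss form assumed)."""
    d = F.pairwise_distance(feat_a, feat_b)
    return 0.5 * (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()

def pose_match(encoder, pose_a, pose_b, threshold=0.5):
    """Two normalized-keypoint inputs match if their embeddings are close enough."""
    with torch.no_grad():
        d = F.pairwise_distance(encoder(pose_a), encoder(pose_b))
    return d.item() < threshold
```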

Graph Convolution for Skeleton: For standard 2D convolution on natural images, the output feature maps can have the same size as the input feature maps with stride 1 and appropriate padding. Similarly, the graph convolution operation is designed to output graphs with the same number of nodes. The dimensionality of the attributes of these nodes, which is analogous to the number of feature map channels in standard convolution, may change after the graph convolution operation.

The standard convolution operation is defined as follows: given a convolution operator with a kernel size of K × K and an input feature map f_in with c channels, the output value of a single channel at spatial location x can be written as:
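The equation itself appears as an image in the original post. A reconstruction of this standard convolution formula, following the formulation in [38] (the text later refers to it as Eq. (4)), with p the sampling function enumerating the K × K neighborhood of x and w the weight function, would read:

$$
f_{out}(\mathbf{x}) = \sum_{h=1}^{K}\sum_{w=1}^{K} f_{in}\big(\mathbf{p}(\mathbf{x}, h, w)\big)\cdot \mathbf{w}(h, w)
$$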

The convolution operation on graphs is defined by extending the above formulation to the case where the input feature map resides on a spatial graph V_t, i.e., the feature map f_in: V_t -> R^c has a vector on each node of the graph. The next step of the extension is to re-define the sampling function p and the weight function w. We follow the method proposed in [38]. For each node, only its adjacent nodes are sampled. The neighbor set for node v_i is B(v_i) = {v_j | d(v_j, v_i) ≤ 1}. The sampling function p: B(v_i) -> V can be written as p(v_i, v_j) = v_j. In this way, neither the number of adjacent nodes nor the weighting order is fixed. In order to have a fixed number of samples and a fixed order of weighting them, we label the neighbor nodes around the root node with a fixed number of partitions, and then weight these nodes based on their partition class. The specific partitioning method is illustrated in Fig. 4.

Figure 4. The spatial configuration partitioning strategy proposed in [38] is used for graph sampling and weighting to construct the graph convolution operation. Nodes are labeled according to their distances to the skeleton gravity center (black circle) compared with that of the root node (green): centripetal nodes have shorter distances (blue), while centrifugal nodes have longer distances than the root node (yellow).

Therefore, Eq. (4) for graph convolution is re-written as:
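This equation is also an image in the original post; reconstructed from the formulation in [38], the graph convolution sums over the neighbor set B(v_i) and weights each neighbor by its partition label l_i(v_j):

$$
f_{out}(v_{i}) = \sum_{v_{j}\in B(v_{i})} \frac{1}{Z_{i}(v_{j})}\, f_{in}(v_{j})\cdot \mathbf{w}\big(l_{i}(v_{j})\big)
$$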

where the normalization term Z_i(v_j) balances the contributions of different subsets to the output. According to the partition method mentioned above, we have:
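The partition equation is likewise an image in the original post; under the spatial configuration partitioning of [38], the label of a neighbor v_j with respect to the root node v_i is assigned by comparing their distances to the skeleton gravity center:

$$
l_{i}(v_{j}) =
\begin{cases}
0, & r_{j} = r_{i}\\
1, & r_{j} < r_{i}\\
2, & r_{j} > r_{i}
\end{cases}
$$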

where ri is the average distance from gravity center to joint i over all frames in the training set.


4. Experiments

In this section, we present quantitative results of our experiments. Some qualitative results are shown in Fig. 5.

4.1. Dataset

PoseTrack [3] is a large-scale benchmark for human pose estimation and articulated tracking in videos. It provides publicly available training and validation sets as well as an evaluation server for benchmarking on a held-out test set. The benchmark is the basis for the challenge competitions at the ICCV'17 [1] and ECCV'18 [2] workshops. The dataset consisted of over 68,000 frames for the ICCV'17 challenge and was extended to twice as many frames for the ECCV'18 challenge. It now includes 593 training videos, 74 validation videos and 375 testing videos. For the held-out test set, at most four submissions per task can be made for the same approach. Evaluation on the validation set has no submission limit. Therefore, the ablation studies in Section 4.4 are performed on the validation set. Since the PoseTrack'18 test set is not open yet, we compare our results with other approaches in Sec. 4.5 on the PoseTrack'17 test set.

4.2. Evaluation Metrics

The evaluation includes pose estimation accuracy and pose tracking accuracy. Pose estimation accuracy is evaluated using the standard mAP metric, whereas pose tracking is evaluated according to the CLEAR MOT [5] metrics, the standard for evaluating multi-target tracking.

4.3. Implementation Details

We adopt state-of-the-art key-frame object detectors trained on the ImageNet and COCO datasets. Specifically, we use pre-trained models from Deformable ConvNets [9]. We conduct experiments on the validation set to choose the object detector with the better recall rate. For the object detectors, we compare the deformable convolution versions of the R-FCN network [8] and of the FPN network [25], both with a ResNet101 backbone [16]. The FPN feature extractor is attached to the Fast R-CNN [13] head for detection. We compare the detection results with the ground truth based on the precision and recall rate on the PoseTrack'17 validation set. In order to eliminate redundant candidates, we drop candidates with low likelihood. As shown in Table 2, the precision and recall of the detectors are given for various drop thresholds. Since the FPN network performs better, we choose it as our human candidate detector. During training, we infer the ground truth bounding boxes of candidates from the annotated keypoints, because in the PoseTrack'17 dataset the bounding box positions are not provided in the annotations. Specifically, we locate a bounding box from the minimum and maximum coordinates of the 15 keypoints, and then enlarge this box by 20% both horizontally and vertically.

For the single-person human pose estimator, we adopt CPN101 [7] and MSRA152 [36] with slight modifications. We first train the networks with the merged dataset of PoseTrack'17 and COCO for 260 epochs. Then we finetune the networks solely on PoseTrack'17 for 40 epochs in order to mitigate the inaccurate regression of the head and neck. For COCO, the bottom-head and top-head positions are not given; we infer these keypoints by interpolation from the annotated keypoints. We find that by finetuning on the PoseTrack dataset, the predictions of the head keypoints are refined. During finetuning, we use the technique of online hard keypoint mining, focusing only on the losses from the 7 hardest keypoints out of the total 15 keypoints. Pose inference is performed online with a single thread.
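The online hard keypoint mining mentioned above can be sketched as follows: compute a per-keypoint loss and back-propagate only the 7 largest of the 15. This is a generic PyTorch-style sketch under that assumption, not the authors' exact loss.

```python
import torch

def hard_keypoint_mining_loss(pred_heatmaps, gt_heatmaps, topk=7):
    """pred/gt heatmaps: (B, 15, H, W) tensors. Only the `topk` hardest
    keypoints (largest per-joint MSE) contribute to the loss."""
    per_joint = ((pred_heatmaps - gt_heatmaps) ** 2).mean(dim=(2, 3))  # (B, 15)
    hardest, _ = per_joint.topk(topk, dim=1)                           # (B, topk)
    return hardest.mean()
```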

For the pose matching module, we train a siamese graph convolutional network with 2 GCN layers and 1 convolutional layer using a contrastive loss. We take normalized keypoint coordinates as input; the output is a 128-dimensional feature vector. Following [38], we use spatial configuration partitioning as the sampling method for graph convolution and use learnable edge importance weighting. To train the siamese network, we generate training data from the PoseTrack dataset. Specifically, we extract people with the same IDs in adjacent frames as positive pairs, and extract people with different IDs within the same frame and across frames as negative pairs. Hard negative pairs only include spatially overlapping poses. The numbers of collected pairs are given in Table 1. We train the model with a batch size of 32 for a total of 200 epochs with the SGD optimizer. The initial learning rate is set to 0.001 and is decayed by 0.1 at epochs 40, 60, 80 and 100. Weight decay is 10^-4.
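For reference, the training schedule quoted above (SGD, batch size 32, 200 epochs, initial learning rate 0.001, decay by 0.1 at epochs 40/60/80/100, weight decay 1e-4) maps to a configuration roughly like the following. The `model` (the siamese GCN encoder) and the `pair_loader` yielding (pose_a, pose_b, label) batches are assumed placeholders, the momentum value is an assumption, and the loss helper is the one sketched earlier.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 60, 80, 100], gamma=0.1)

for epoch in range(200):
    for pose_a, pose_b, label in pair_loader:  # positive/negative pairs mined from PoseTrack
        loss = margin_contrastive_loss(model(pose_a), model(pose_b), label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```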

4.4. Ablation Study

We conducted a series of ablation studies to analyze the contribution of each component to the overall performance.

Table 2. Detector comparison: precision and recall on the PoseTrack'17 validation set. A bounding box is considered correct if its IoU with the ground truth is above a certain threshold, which is set to 0.4 for all experiments.

Table 3. Comparison of offline pose tracking results using various detectors on the PoseTrack'17 validation set.

Detectors: We experimented with several detectors and decided to use Deformable ConvNets with ResNet101 as the backbone, Feature Pyramid Networks (FPN) for feature extraction, and the Fast R-CNN scheme as the detection head. As shown in Table 2, this detector performs better than Deformable R-FCN with the same backbone. It is no surprise that the better detector results in better performance on both pose estimation and pose tracking, as shown in Table 3.

可變型卷積提出:DCN

Offline vs. Online: We studied the effect of the keyframe interval in our online method and compared with the offline method. For a fair comparison, we use an identical human candidate detector and pose estimator for both methods. For the offline method, we pre-compute human candidate detections and estimate the pose for each candidate, then adopt a flow-based pose tracker [37], where pose flows are built by associating poses that indicate the same person across frames. For the online method, we perform truly online pose tracking. Since human candidate detection is performed only at key frames, the online performance varies with different intervals. In Table 4, we illustrate the performance of the offline method compared with the online method given various keyframe intervals. Offline methods perform better than online methods, but we can see the great potential of online methods when the detections (DET) at keyframes are more accurate, the upper bound of which is achieved with ground truth (GT) detections. As expected, more frequent keyframes help the performance. Note that the online methods only use spatial consistency for data association at key frames. We report ablation experiments on the pose matching module in the following text.

GCN vs. Spatial Consistency (SC): Next, we report results when pose matching is performed during the data association stage, compared with employing spatial consistency only. As shown in Table 5, the tracking performance increases with GCN-based pose matching. However, in some situations, different people may have near-duplicate poses, as shown in Fig. 6. To mitigate such ambiguities, spatial consistency is considered prior to pose similarity.

4.5. Performance Comparison
