Modeling and Propagating CNNs in a Tree Structure for Visual Tracking

  1. Abstract
    We present an online visual tracking algorithm that manages multiple target appearance models in a tree structure. The proposed algorithm employs Convolutional Neural Networks (CNNs) to represent target appearances, where multiple CNNs collaborate to estimate target states and determine the desirable paths for online model updates in the tree. Maintaining multiple CNNs in diverse branches of the tree structure makes it convenient to handle multi-modality in target appearances and to preserve model reliability through smooth updates along tree paths.

The final target state is estimated by sampling target candidates around the state in the previous frame and identifying the best sample in terms of a weighted average score from a set of active CNNs.

  2. Introduction
    Most existing tracking algorithms update the target model under the assumption that target appearances change smoothly over time. However, this strategy may not be appropriate for more challenging situations such as occlusion, illumination variation, abrupt motion, and deformation, which may break the temporal smoothness assumption. Some algorithms employ multiple models [38, 28, 40], multi-modal representations [11], or nonlinear classifiers [10, 12] to address these issues. However, the constructed models are still not strong enough, and online model updates are limited to sequential learning in temporal order, which may not make the models sufficiently discriminative and diverse.

However, online learning with CNNs is not straightforward because neural networks tend to forget previously learned information quickly when they learn new information [27]. This property often causes the drift problem, especially when background information contaminates target appearance models, targets are completely occluded by other objects, or tracking fails temporarily. The problem may be alleviated by maintaining multiple versions of target appearance models constructed at different time steps and selectively updating a subset of the models to keep a history of target appearances. This idea has been investigated in [22], where a pool of CNNs is used to model target appearances, but that work does not consider the reliability of each CNN when estimating target states and updating models.

We propose an online visual tracking algorithm that estimates the target state using the likelihoods obtained from multiple CNNs. The CNNs are maintained in a tree structure and updated online along the paths in the tree. Since each path keeps track of a separate history of target appearance changes, the proposed algorithm is effective in handling multi-modal target appearances and other exceptions such as short-term occlusions and tracking failures. In addition, since the new model corresponding to the current frame is constructed by fine-tuning the CNN that produces the highest likelihood for target state estimation, more consistent and reliable models are generated through online learning with only a few training examples.

The main contributions of our paper are summarized below:
• We propose a visual tracking algorithm to manage target appearance models based on CNNs in a tree structure, where the models are updated online along the path in the tree. This strategy enables us to learn more persistent models through smooth updates.
• Our tracking algorithm employs multiple models to capture diverse target appearances and performs more robust tracking even with challenges such as appearance changes, occlusions, and temporary tracking failures.

  3. Related Works
    Tracking-by-detection approaches formulate visual tracking as a discriminative object classification problem in a sequence of video frames. The techniques in this category typically learn classifiers to differentiate targets from surrounding backgrounds.

Tracking algorithms based on hand-crafted features [15, 38] often outperform CNN-based approaches. This is partly because CNNs are difficult to train online with noisily labeled data and overfit easily to a small number of training examples, so applying CNNs to visual tracking problems that involve online learning is not straightforward. For example, [22], which is based on a shallow custom neural network, does not perform as well as recent tracking algorithms based on shallow feature learning. However, CNN-based tracking algorithms have started to show competitive accuracy on the online tracking benchmark [37] by transferring CNNs pretrained on ImageNet [8].

Multiple models are often employed in generative tracking algorithms to handle target appearance variations and recover from tracking failures. Trackers based on sparse representation [28, 40] maintain multiple target templates and compute the likelihood of each sample by minimizing its reconstruction error, while [21] integrates multiple observation models via an MCMC framework. Nam et al. [29] integrate patch-matching results from multiple frames and estimate the posterior of the target state. Ensemble classifiers have also been applied to visual tracking problems. Tang et al. [32] proposed a co-tracking framework based on two support vector machines. An ensemble of weak classifiers is employed to estimate target states in [1, 3]. Zhang et al. [38] presented a framework based on multiple snapshots of SVM-based trackers to recover from tracking failures.

  4. Algorithm Overview
    Our algorithm maintains multiple target appearance models based on CNNs in a tree structure to preserve model consistency and handle appearance multi-modality effectively. As in ordinary tracking algorithms, the proposed approach consists of two main components, state estimation and model update, whose procedures are illustrated in Figure 1. Note that both components require interaction between multiple CNNs.

When a new frame is given, we draw candidate samples around the target state estimated in the previous frame and compute the likelihood of each sample as the weighted average of the scores from multiple CNNs. The weight of each CNN is determined by the reliability of the path along which the CNN has been updated in the tree structure. The target state in the current frame is estimated by finding the candidate with the maximum likelihood. After tracking a predefined number of frames, a new CNN is derived from the existing CNN that has the highest weight among those contributing to target state estimation. This strategy helps ensure smooth model updates and maintain reliable models in practice.
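
The two steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `estimate_state` and `select_parent`, the box-tuple candidate format, and the assumption that per-CNN scores and reliability weights are already available as plain lists are all hypothetical choices made for the example.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x, y, w, h) candidate


def estimate_state(
    candidates: Sequence[Box],
    scores: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> Tuple[int, float]:
    """Return (index, likelihood) of the candidate with maximum likelihood.

    scores[k][i] is the positive score that active CNN k gives candidate i.
    weights[k] is the reliability weight of CNN k (assumed to be given).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    best_idx, best_likelihood = -1, float("-inf")
    for i in range(len(candidates)):
        # Weighted average of the per-CNN scores for candidate i.
        likelihood = sum(w * s[i] for w, s in zip(weights, scores)) / total
        if likelihood > best_likelihood:
            best_idx, best_likelihood = i, likelihood
    return best_idx, best_likelihood


def select_parent(weights: Sequence[float]) -> int:
    """Index of the active CNN with the highest weight.

    In this sketch, that CNN is the one a new CNN would be derived from.
    """
    return max(range(len(weights)), key=lambda k: weights[k])


if __name__ == "__main__":
    boxes = [(10, 10, 20, 20), (12, 10, 20, 20), (14, 11, 20, 20)]
    cnn_scores = [[0.2, 0.9, 0.4], [0.3, 0.6, 0.8]]
    cnn_weights = [0.75, 0.25]
    print(estimate_state(boxes, cnn_scores, cnn_weights))
    print(select_parent(cnn_weights))
```

Dividing by the weight sum keeps the likelihood on the same scale as the individual scores even when the weights are not normalized beforehand.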

[Figure 1: procedures for state estimation and model update (image not available)]

Our approach has something in common with [22], which employs a candidate pool of multiple CNNs and selects the k nearest CNNs for tracking based on prototype matching distances. Our algorithm differs from that approach in that it focuses on how to preserve the multi-modality of multiple CNNs and maximize their reliability by introducing a novel model maintenance technique using a tree structure. Visual tracking based on a tree-structured graphical model has been investigated in [13], but that work focuses on identifying the optimal density propagation path for offline tracking. The idea in [29] is also related, but it mainly discusses posterior propagation on directed acyclic graphs for visual tracking.

  5. Proposed Algorithm
    5.1 CNN Architecture
    Our network consists of three convolutional layers and three fully connected layers. The convolution filters are identical to the ones in the VGG-M network [4] pretrained on ImageNet [8]. The last fully connected layer has 2 units for binary classification, while the preceding two fully connected layers are composed of 512 units each. All weights in these three fully connected layers are initialized randomly. The input to our network is a 75 × 75 RGB image, and its size is equivalent to the receptive field size of the single unit (per channel) in the last convolutional layer. Note that, although we borrow the convolution filters from the VGG-M network, our network is smaller than the original VGG-M network. The output for an input image x is a normalized vector [φ(x), 1 − φ(x)]^T, whose elements represent the scores for target and background, respectively.
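
As a sanity check on the 75 × 75 input size, the receptive field of a unit in the last convolutional layer can be computed from kernel sizes and strides. The layer configuration below is an assumption and is not stated in the text: it uses unpadded 7 × 7, 5 × 5, and 3 × 3 convolutions with strides 2, 2, and 1, with a 3 × 3 stride-2 max-pooling after each of the first two convolutions. The helper names are hypothetical.

```python
from typing import List, Tuple

# (kernel_size, stride) per layer. These values are ASSUMED for illustration;
# the text only says the filters come from VGG-M.
ASSUMED_LAYERS: List[Tuple[int, int]] = [
    (7, 2),  # conv1
    (3, 2),  # max-pool after conv1
    (5, 2),  # conv2
    (3, 2),  # max-pool after conv2
    (3, 1),  # conv3
]


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field (in input pixels) of one unit in the last layer."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride             # spacing between adjacent units, in input pixels
    return rf


def output_size(input_size: int, layers: List[Tuple[int, int]]) -> int:
    """Spatial size after applying all layers without padding."""
    size = input_size
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size


if __name__ == "__main__":
    print(receptive_field(ASSUMED_LAYERS))   # receptive field under the assumed layers
    print(output_size(75, ASSUMED_LAYERS))   # spatial size of the last conv output
```

Under these assumed layers, the receptive field comes out to 75 pixels and a 75 × 75 input reduces to a single spatial unit, which is consistent with the description above.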

5.2 Tree Construction
We maintain a tree structure to manage multiple CNN-based target appearance models hierarchically. In the tree structure T = {V, E}, a vertex v ∈ V corresponds to a CNN and a directed edge (u, v) ∈ E defines the relationship between two CNNs. The score of an edge (u, v) is the affinity between its two end vertices, which is given by

$$s(u, v) = \frac{1}{|F_v|} \sum_{t \in F_v} \phi_u(x^*_t) \qquad (1)$$

where F_v is the set of consecutive frames used to train the CNN associated with v, x*_t is the estimated target state at frame t, and φ_u(·) is the predicted positive score with respect to the CNN in u.
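
A small data-structure sketch may help make the tree and its edge scores concrete. It assumes that the affinity of an edge (u, v) is the mean positive score that the CNN at u assigns to the estimated target states over the frames in F_v. The `Vertex` class, the `edge_score` function, and the callable standing in for a CNN are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

State = Tuple[float, float, float, float]  # hypothetical (x, y, w, h) target state


@dataclass
class Vertex:
    """One CNN in the tree, together with the frames F_v used to train it."""
    name: str
    positive_score: Callable[[State], float]  # stands in for phi_v(.)
    frames: List[int] = field(default_factory=list)  # F_v
    parent: Optional["Vertex"] = None  # directed edge (parent, self)


def edge_score(u: Vertex, v: Vertex, states: Dict[int, State]) -> float:
    """Affinity of edge (u, v).

    ASSUMPTION: the affinity is the mean of phi_u over the estimated
    target states at the frames in F_v.
    """
    if not v.frames:
        raise ValueError("v must have at least one training frame")
    return sum(u.positive_score(states[t]) for t in v.frames) / len(v.frames)


if __name__ == "__main__":
    # Toy scorer in place of a real CNN: the score depends only on x.
    root = Vertex("root", positive_score=lambda s: s[0] / 100.0, frames=[0])
    child = Vertex("child", positive_score=lambda s: 0.5, frames=[1, 2], parent=root)
    estimated = {0: (40.0, 0, 10, 10), 1: (50.0, 0, 10, 10), 2: (70.0, 0, 10, 10)}
    print(edge_score(root, child, estimated))
```

Storing only a parent pointer per vertex is enough to walk from any CNN back to the root, which is the path along which that CNN was updated.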

5.3 Target State Estimation using Multiple CNNs
