Paper notes: Fully-Convolutional Siamese Networks for Object Tracking

Title: Fully-Convolutional Siamese Networks for Object Tracking

Source: ECCV 2016 Workshops

Project page (with MATLAB code): http://www.robots.ox.ac.uk/~luca/siamese-fc.html

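Contribution: the network is very simple, yet it runs in real time.

Drawbacks of the Siamese approach:

1. Because the Siamese tracker is a template-matching algorithm, it fails when the target changes abruptly or moves outside the bounding box (image).

2. When the background is cluttered, i.e. it contains many objects that look similar to the target, tracking quality is poor.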

Abstract: Object tracking has usually been tackled by learning a model of the object's appearance from the video itself. These methods are successful, but learning online only limits the richness of the model that can be learnt. Recently, deep convolutional networks have been used to make the learnt model more expressive. However, when the object to track is not known beforehand, the network weights have to be adapted online with stochastic gradient descent, which badly hurts the speed of the system.

In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.

1 Introduction

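In single-object tracking, the algorithm may be asked to track an arbitrary object, so it is practically impossible to have collected data in advance and trained a dedicated detector. For years the most successful approach has been to learn the object's appearance model online; examples are TLD, Struck and KCF. Their biggest weakness is that the learnt model is too simple. Deep conv-nets would be more expressive, but a couple of problems make them hard to use.

There are two such problems: the scarcity of training data and the constraint of real-time operation.

Several works try to get around these two limitations. They mainly use a pre-trained deep conv-net that was learnt for a different but related task. They fall into two groups.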

1. Shallow methods

using the network's internal representation as features, e.g. correlation filters

2. SGD

perform SGD (stochastic gradient descent) to fine-tune multiple layers of the network

Shortcomings of each:

While the use of shallow methods does not take full advantage of the benefits of end-to-end learning, methods that apply SGD during tracking to achieve state-of-the-art results have not been able to operate in real-time.

So what approach does the paper take?

a deep conv-net is trained to address a more general similarity learning problem in an initial offline phase, and then this function is simply evaluated online during tracking.

The contributions of the paper are:

The key contribution of this paper is to demonstrate that this approach achieves very competitive performance in modern tracking benchmarks at speeds that far exceed the frame-rate requirement. Specifically, we train a Siamese network to locate an exemplar image within a larger search image.

A further contribution is a novel Siamese architecture that is fully-convolutional with respect to the search image: dense and efficient sliding-window evaluation is achieved with a bilinear layer that computes the cross-correlation of its two inputs.

2 Deep similarity learning for tracking

Arbitrary objects are tracked by means of similarity learning.

We propose to learn a function f (z, x) that compares an exemplar image z to a candidate image x of the same size and returns a high score if the two images depict the same object and a low score otherwise. To find the position of the object in a new image, we can then exhaustively test all possible locations and choose the candidate with the maximum similarity to the past appearance of the object. In experiments, we will simply use the initial appearance of the object as the exemplar. The function f will be learnt from a dataset of videos with labelled object trajectories.

The function f used here is a deep conv-net. Similarity learning with deep conv-nets is typically addressed with Siamese architectures, and this is the core of the network design.

So what is a Siamese network?

Siamese networks apply an identical transformation φ to both inputs (the same φ for each of the two input images) and then combine their representations using another function g according to f (z, x) = g(φ(z), φ(x)). When the function g is a simple distance or similarity metric, the function φ can be considered an embedding.
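To make the structure f (z, x) = g(φ(z), φ(x)) concrete, here is a minimal sketch. The embedding `phi` (mean and value range of a flat pixel list) and the metric `g` (negative squared distance) are toy stand-ins chosen for illustration; they are not the conv-net or the similarity used in the paper.

```python
def phi(image):
    """Toy embedding of a flat pixel list: (mean, value range). Stand-in for a conv-net."""
    return (sum(image) / len(image), max(image) - min(image))


def g(a, b):
    """Toy similarity: negative squared Euclidean distance between two embeddings."""
    return -sum((p - q) ** 2 for p, q in zip(a, b))


def f(z, x):
    # The same phi is applied to both inputs; g then combines the two embeddings.
    return g(phi(z), phi(x))


exemplar = [1.0, 2.0, 3.0, 4.0]
same = [1.0, 2.0, 3.0, 4.0]
different = [9.0, 9.0, 0.0, 0.0]

score_same = f(exemplar, same)            # highest possible score for this toy g
score_different = f(exemplar, different)  # lower, since the embeddings differ
```

Because both branches share `phi`, swapping the two arguments of `f` gives the same score in this sketch.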

Figure 1: Fully-convolutional Siamese architecture

2.1 Fully-convolutional Siamese architecture

Similarity learning with deep conv-nets is realised through the Siamese architecture. Next comes the fully-convolutional part.

Definition of a fully-convolutional function:

1. We say that a function is fully-convolutional if it commutes with translation.

2. To give a more precise definition, introducing $L_\tau$ to denote the translation operator $(L_\tau x)[u] = x[u - \tau]$, a function $h$ that maps signals to signals is fully-convolutional with integer stride $k$ if

$$h(L_{k\tau} x) = L_\tau h(x) \tag{1}$$

for any translation $\tau$. (When $x$ is a finite signal, this only need hold for the valid region of the output.)
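The definition in eq. 1 can be checked numerically on a small 1-D example. The sketch below assumes an unpadded correlation with stride k as the function h and fills samples shifted in from outside the signal with zeros; the signal, kernel and shift are arbitrary illustrative values.

```python
def conv_stride(x, w, k):
    """Valid (unpadded) 1-D correlation of signal x with kernel w, using integer stride k."""
    n_out = (len(x) - len(w)) // k + 1
    return [sum(w[j] * x[i * k + j] for j in range(len(w))) for i in range(n_out)]


def translate(x, t):
    """(L_t x)[u] = x[u - t]; samples shifted in from outside the signal are set to 0."""
    return [0] * t + x[:len(x) - t]


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
w = [1, -2, 1]
k, tau = 2, 2

lhs = conv_stride(translate(x, k * tau), w, k)  # h(L_{k tau} x)
rhs = translate(conv_stride(x, w, k), tau)      # L_tau h(x)
valid = slice(tau, None)                        # outputs not affected by the zero fill
```

On the valid region, `lhs[valid]` and `rhs[valid]` agree: shifting the input by k·τ samples shifts the output by τ samples.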

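Advantage of a fully-convolutional network:

A much larger search image can be given to the network as input instead of a candidate image of the same size as the exemplar, and a single evaluation computes the similarity at all translated sub-windows on a dense grid.

To achieve this, we can: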

use a convolutional embedding function $\varphi$ and combine the resulting feature maps using a cross-correlation layer

$$f(z, x) = \varphi(z) * \varphi(x) + b\,\mathbb{1}, \tag{2}$$

where $b\,\mathbb{1}$ denotes a signal which takes value $b \in \mathbb{R}$ in every location. The output of this network is not a single score but rather a score map defined on a finite grid $\mathcal{D} \subset \mathbb{Z}^2$ as illustrated in Figure 1. Note that the output of the embedding function is a feature map with spatial support as opposed to a plain vector.
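A minimal sketch of the cross-correlation layer in eq. 2, assuming single-channel feature maps stored as nested lists (real feature maps have many channels, whose products would also be summed). Each output entry is the inner product of the exemplar features with one translated sub-window of the search features, plus the bias b.

```python
def cross_correlate(fz, fx, b=0.0):
    """Score map f(z, x) = phi(z) * phi(x) + b for single-channel 2-D feature maps."""
    hz, wz = len(fz), len(fz[0])
    hx, wx = len(fx), len(fx[0])
    return [[sum(fz[i][j] * fx[r + i][c + j] for i in range(hz) for j in range(wz)) + b
             for c in range(wx - wz + 1)]
            for r in range(hx - hz + 1)]


# Exemplar features, and search features that contain the same pattern in the middle.
fz = [[1, 2],
      [3, 4]]
fx = [[0, 0, 0, 0],
      [0, 1, 2, 0],
      [0, 3, 4, 0],
      [0, 0, 0, 0]]

score = cross_correlate(fz, fx)
# score == [[4, 11, 6], [14, 30, 14], [6, 11, 4]]; the peak sits where the pattern matches.
```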

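During tracking, we use a search image centred at the previous position of the target. The position of the maximum score relative to the centre of the score map, multiplied by the stride of the network, gives the displacement of the target from frame to frame. Multiple scales are searched in a single forward pass by assembling a mini-batch of scaled images.

Combining the feature maps with cross-correlation and evaluating the network once on the larger search image is mathematically equivalent to combining feature maps using the inner product and evaluating the network on each translated sub-window independently. This is useful both during training and during testing.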
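As a sketch of the displacement rule just described: find the peak of the score map, take its offset from the map centre, and multiply by the network stride. The 5x5 map and the stride of 8 below are illustrative values, and ties between equal scores are resolved arbitrarily by `max`.

```python
def displacement(score, stride):
    """Offset (dy, dx) of the score-map peak from the map centre, in input pixels."""
    h, w = len(score), len(score[0])
    _, r, c = max((score[r][c], r, c) for r in range(h) for c in range(w))
    return ((r - (h - 1) / 2) * stride, (c - (w - 1) / 2) * stride)


score = [[0.0] * 5 for _ in range(5)]
score[1][3] = 1.0  # peak one cell above and one cell to the right of the centre

dy, dx = displacement(score, stride=8)  # 8 pixels up, 8 pixels to the right
```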

2.2 Training with large search images

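A discriminative approach is used.

Figure caption: when a sub-window extends beyond the extent of the image, the missing portions are filled with the mean RGB value.

In Fig. 2, the upper and lower images of each pair are extracted from two frames of the same video; both contain the object and they are at most T frames apart.

The class of the object is ignored during training. The scale of the object within each image is normalized without corrupting the aspect ratio of the image.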
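The mean-fill rule in the caption can be sketched as follows. The image is a nested list of (R, G, B) tuples, and the helper name `crop_with_mean_fill` is hypothetical; here the mean is computed over the given image, which is an assumption about which mean is used.

```python
def crop_with_mean_fill(image, top, left, size):
    """Crop a size x size window; pixels that fall outside the image take the mean colour."""
    h, w = len(image), len(image[0])
    n = h * w
    mean = tuple(sum(px[ch] for row in image for px in row) / n for ch in range(3))
    return [[image[r][c] if 0 <= r < h and 0 <= c < w else mean
             for c in range(left, left + size)]
            for r in range(top, top + size)]


image = [[(0, 0, 0), (30, 60, 90)],
         [(30, 60, 90), (60, 120, 180)]]

# A 3x3 window starting one pixel above and to the left of the image.
patch = crop_with_mean_fill(image, top=-1, left=-1, size=3)
```

The first row and first column of `patch` lie outside the image and hold the mean colour; the bottom-right 2x2 block is the original image.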

Continuing with the figure:

In the score map, the losses of the positive and negative examples are weighted to eliminate class imbalance.
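A minimal sketch of such a class-balanced loss over one score map. It assumes the logistic loss log(1 + exp(-y v)) with labels y in {+1, -1}, and one particular balancing scheme in which positives and negatives each receive a total weight of 0.5; the exact weighting used in the paper's code may differ.

```python
import math


def balanced_logistic_loss(scores, labels):
    """Weighted logistic loss over a flattened score map.

    Assumes at least one positive (+1) and one negative (-1) label.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    total = 0.0
    for v, y in zip(scores, labels):
        # Each class shares a total weight of 0.5, however many members it has.
        weight = 0.5 / n_pos if y == 1 else 0.5 / n_neg
        total += weight * math.log(1 + math.exp(-y * v))
    return total


labels = [1, -1, -1, -1]                # one positive, three negatives
uninformative = [0.0, 0.0, 0.0, 0.0]    # every position scored 0
good = [4.0, -4.0, -4.0, -4.0]          # scores agree with the labels

loss_uninformative = balanced_logistic_loss(uninformative, labels)  # equals log(2)
loss_good = balanced_logistic_loss(good, labels)                    # much smaller
```

Without the weights, the three negatives would contribute three times as much as the single positive.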

Note that since the network is symmetric f (z, x) = f (x, z), it is in fact also fully-convolutional in the exemplar. This allows us to use different size exemplar images for different objects in theory.

2.3 ImageNet Video for tracking

It can safely be used to train a deep model for tracking without over-fitting.

2.4 Practical considerations

The practical considerations cover the following points, all of them useful hands-on details.

Dataset curation

Network architecture

The dimensions of the parameters and activations are given in Table 1. Max-pooling is employed after the first two convolutional layers. ReLU non-linearities follow every convolutional layer except for conv5, the final layer. During training, batch normalization is inserted immediately after every linear layer. The stride of the final representation is eight. An important aspect of the design is that no padding is introduced within the network. Although this is common practice in image classification, it violates the fully-convolutional property of eq. 1.
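Since no padding is used, the spatial size of each activation follows directly from the kernel sizes and strides. The sketch below works this out. The layer list (AlexNet-like kernels and strides) and the input sizes 127 and 255 are assumptions filled in for illustration, since Table 1 is not reproduced in these notes.

```python
# (name, kernel size, stride); assumed AlexNet-like values, not taken from these notes.
LAYERS = [
    ("conv1", 11, 2), ("pool1", 3, 2),
    ("conv2", 5, 1), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1),
]


def output_size(size, layers=LAYERS):
    """Spatial size after a stack of unpadded layers: out = (in - kernel) // stride + 1."""
    for _, kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size


def total_stride(layers=LAYERS):
    """Product of all layer strides, i.e. the stride of the final representation."""
    product = 1
    for _, _, stride in layers:
        product *= stride
    return product


exemplar_map = output_size(127)            # assumed 127x127 exemplar -> 6x6 features
search_map = output_size(255)              # assumed 255x255 search image -> 22x22 features
score_map = search_map - exemplar_map + 1  # cross-correlation output -> 17x17
```

Under these assumptions the total stride comes out as eight, matching the statement above.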

Tracking algorithm

Since our purpose is to prove the efficacy of our fully-convolutional Siamese network and its generalization capability when trained on ImageNet Video, we use an extremely simplistic algorithm to perform tracking.

Unlike more sophisticated trackers, we do not update a model or maintain a memory of past appearances, we do not incorporate additional cues such as optical flow or colour histograms, and we do not refine our prediction with bounding box regression.

Yet, despite its simplicity, the tracking algorithm achieves surprisingly good results when equipped with our offline-learnt similarity metric.

Online, we do incorporate some elementary temporal constraints: we only search for the object within a region of approximately four times its previous size, and a cosine window is added to the score map to penalize large displacements.
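A sketch of how a cosine window can penalize large displacements. It assumes a Hann window and a convex combination of score map and window controlled by a hypothetical `influence` weight; neither the window type nor the weight is specified in these notes.

```python
import math


def hann(n):
    """1-D Hann (raised cosine) window of length n: 0 at the ends, 1 in the middle."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_cosine_window(score, influence=0.3):
    """Blend the score map with a 2-D cosine window; `influence` is an assumed weight."""
    h, w = len(score), len(score[0])
    wy, wx = hann(h), hann(w)
    return [[(1 - influence) * score[r][c] + influence * wy[r] * wx[c]
             for c in range(w)]
            for r in range(h)]


# Two equally strong responses: one at the centre, one in the top-right corner.
score = [[0.0] * 5 for _ in range(5)]
score[2][2] = 1.0
score[0][4] = 1.0

windowed = apply_cosine_window(score)
# After windowing, the central response is preferred over the distant one.
```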

Tracking through scale space is achieved by processing several scaled versions of the search image. Any change in scale is penalized and updates of the current scale are damped.

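[Figure: overall results for this part (partial screenshots).]

Conclusion: the method performs no model update and only uses the first frame as the exemplar, yet it is surprisingly robust to motion blur, drastic change of appearance, poor illumination and scale change. On the other hand, it is sensitive to cluttered scenes: because the model is never updated, it drifts easily.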

3 Related work

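For object tracking, some works train RNNs. One example trains an RNN to predict the absolute position of the target in each frame; another trains an RNN for tracking using a differentiable attention mechanism. The results of these methods are not yet competitive, but the direction is worth studying.

Training a deep conv-net from scratch on every new video is not feasible, which leads to fine-tuning from pre-trained parameters. SO-DLT and MDNet both train a conv-net for a similar detection task in an offline phase and then use SGD at test time to learn a detector; unfortunately, these methods cannot run in real time. An alternative is shallow methods (using the internal representation of a pre-trained conv-net as features), such as FCNT and DeepSRDCF. They achieve strong results, but they also fail to run in real time because of the high dimensionality of the conv-net representation.

This work, along with some concurrent work by others, proposes conv-nets for object tracking that learn a function of a pair of images. In GOTURN, for instance, a conv-net is trained to regress directly from two images to the location, in the second image, of the object shown in the first image. Predicting a rectangle instead of a position has the advantage that scale changes can be handled without exhaustive evaluation. The disadvantage is that the approach has no intrinsic invariance to translation of the second image. The network therefore has to be shown examples at all positions, which is achieved through considerable dataset augmentation.

The competing methods MDNet, SINT and GOTURN are trained on video sequences that belong to the same ALOV/OTB/VOT datasets used by the benchmarks, so over-fitting to the benchmark scenes is a concern. This paper therefore proposes a conv-net for effective object tracking that is trained without using videos from the same source as the test set.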

4 Experiments

4.1 Implementation details
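Training: this part mainly sets the training parameters and explains how they were chosen.

Tracking: the paper reports that updating the feature representation of the exemplar online with simple strategies, such as linear interpolation, brings little gain, so the exemplar is kept fixed. Upsampling the score map with bicubic interpolation gives more accurate localization. To handle scale changes, the object is searched over five scales (exponents -2, -1, 0, 1, 2), and the scale is updated by linear interpolation.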
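The five-scale search and the interpolated scale update can be sketched as below. The scale step of 1.025 and the interpolation rate of 0.35 are assumed values used for illustration; they are not stated in these notes.

```python
def scale_factors(num_scales=5, step=1.025):
    """Candidate scale factors step**k for k = -2..2 when num_scales is 5 (step is assumed)."""
    half = num_scales // 2
    return [step ** k for k in range(-half, half + 1)]


def update_scale(current, best_factor, rate=0.35):
    """Damped update: linear interpolation between the old scale and the newly estimated one."""
    return (1 - rate) * current + rate * current * best_factor


factors = scale_factors()
# Suppose the largest candidate scale gave the best response in this frame.
new_scale = update_scale(current=1.0, best_factor=factors[-1])
```

With a rate below 1 the tracked scale moves only part of the way towards the best candidate, which damps abrupt scale changes.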

4.2 Evaluation

Two variants:

SiamFC (Siamese Fully-Convolutional)

SiamFC-3s (searches over 3 scales instead of 5)

4.3 The OTB-13 benchmark

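The OTB-13 benchmark considers the average per-frame success rate at different thresholds: a tracker is successful in a given frame if the IoU (intersection-over-union) between its estimate and the ground truth exceeds a certain threshold.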
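As a sketch of this success criterion, assuming boxes given as (x, y, width, height) and a strict "greater than threshold" test (the benchmark toolkit's exact conventions may differ):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_rate(predictions, ground_truth, threshold):
    """Fraction of frames whose IoU with the ground truth exceeds the threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predictions, ground_truth))
    return hits / len(predictions)


ground_truth = [(0, 0, 10, 10)] * 3
predictions = [(0, 0, 10, 10),    # perfect overlap, IoU = 1
               (5, 0, 10, 10),    # half shifted, IoU = 50 / 150
               (20, 20, 10, 10)]  # no overlap, IoU = 0

rate_strict = success_rate(predictions, ground_truth, threshold=0.5)  # 1 of 3 frames
rate_loose = success_rate(predictions, ground_truth, threshold=0.3)   # 2 of 3 frames
```

Sweeping the threshold from 0 to 1 and plotting the success rate gives the success plot used by the benchmark.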

4.4 The VOT benchmarks

The vot2015-final toolkit evaluates trackers on sequences chosen from a pool of 356, selected so that they represent seven different challenging situations well.

VOT-14 results:

Two evaluation measures for a tracker: accuracy and robustness

Accuracy is calculated as the average IoU.

Robustness is expressed in terms of the total number of failures.
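A simplified sketch of these two measures from a list of per-frame overlaps. It assumes that a failure is a frame with zero overlap and that accuracy is averaged over the remaining frames; the actual VOT protocol also re-initialises the tracker after a failure and ignores some frames, which is not modelled here.

```python
def vot_summary(overlaps):
    """Return (accuracy, failures) from per-frame IoU values (simplified VOT-style measures)."""
    failures = sum(1 for o in overlaps if o == 0)
    tracked = [o for o in overlaps if o > 0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


overlaps = [0.8, 0.6, 0.0, 0.7, 0.5]  # illustrative values; the third frame is a failure
accuracy, failures = vot_summary(overlaps)
```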

VOT-15 results：

VOT-16 results：

The fully-convolutional Siamese network achieves state-of-the-art results while being a real-time tracker. Performance could be improved further with additions such as:

model update, bounding-box regression, fine-tuning, memory

4.5 Dataset size

Training a conv-net needs a lot, really a lot, of data.

This finding suggests that using a larger video dataset could increase the performance even further.

5 Conclusion

In this work, we depart from the traditional online learning methodology employed in tracking, and show an alternative approach that focuses on learning strong embeddings in an offline phase. Differently from their use in classification settings, we demonstrate that for tracking applications Siamese fully-convolutional deep networks have the ability to use the available data more efficiently. This is reflected both at test-time, by performing efficient spatial searches, but also at training-time, where every sub-window effectively represents a useful sample with little extra cost. The experiments show that deep embeddings provide a naturally rich source of features for online trackers, and enable simplistic test-time strategies to perform well. We believe that this approach is complementary to more sophisticated online tracking methodologies, and expect future work to explore this relationship more thoroughly.

Last edited: 2017.12.07 01:26:30
