Title: A Twofold Siamese Network for Real-Time Object Tracking
Authors: Anfeng He, Chong Luo, Xinmei Tian, Wenjun Zeng.
Venue: CVPR 2018
Field: single-object tracking
[code]: Attempting to reproduce the paper's results; the project is in progress. Discussion and exchange are welcome.
New ideas: two SiamFC branches, channel attention.
Why it works: 1. combining deep representations (utilizing heterogeneous features), which is the main reason it outperforms its baseline SiamFC; 2. a large amount of training data (ImageNet); 3. large search regions.
Abstract:
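The authors observe that the semantic features learned for image classification and the appearance features learned for image similarity matching are complementary. The two branches, the semantic branch (S-Net) and the appearance branch (A-Net), are both built on the SiamFC architecture and are trained separately. A-Net is essentially the same as SiamFC; S-Net uses a channel attention mechanism.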
1. Introduction
The key to design a high-performance tracker is to find expressive features and corresponding classifiers that are simultaneously discriminative and generalized. Being discriminative allows the tracker to differentiate the true target from the cluttered or even deceptive background. Being generalized means that a tracker would tolerate the appearance changes of the tracked object, even when the object is not known a priori.
Discriminative power of a tracker: the ability to separate the target from a complex (cluttered, deceptive) background.
Generalization ability of a tracker: the ability to cope with appearance changes of the target.
For SiamFC, the generalization capability remains quite poor and it encounters difficulties when the target has significant appearance change. As a result, SiamFC still has a performance gap to the best online tracker.
SiamFC generalizes poorly: it drifts when the target undergoes a large appearance change. The goal of the paper is therefore to improve the generalization capability of SiamFC.
It is widely understood that, in a deep CNN trained for image classification task, features from deeper layers contain stronger semantic information and are more invariant to object appearance changes. These semantic features are an ideal complement to the appearance features trained in a similarity learning problem.
In other words, the high-level features of a CNN pretrained for image classification carry strong semantic information and are invariant to appearance changes of the object (when the target deforms, the feature still represents that target).
For the semantic branch, we further propose a channel attention mechanism to achieve a minimum degree of target adaptation. The motivation is that different objects activate different sets of feature channels. We shall give higher weights to channels that play more important roles in tracking specific targets. This is realized by computing channel-wise weights based on the channel responses at the target object and in the surrounding context. This simplest form of target adaptation improves the discrimination power of the tracker.
Some feature channels (note: channels, not features) are very useful for particular tracking targets, while others contribute almost nothing for that target; so higher weights should be given to channels that play more important roles in tracking specific targets.
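Summary:
1. SiamFC has a weakness: it easily loses the target when the target's appearance changes drastically. The semantic features of the target, in contrast, are invariant to appearance changes. Combining the two is complementary.
2. Different feature channels have different discriminative power for a particular tracking target. Some channels are important for tracking certain targets, while others contribute almost nothing for those targets.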
2. Related Work
2.1. Siamese Network Based Trackers
A notable advantage of this method is that it needs no or little online training. Thus, real-time tracking can be easily achieved.
The advantage of a fully-convolutional network is that, instead of a candidate patch of the same size of the target patch, one can provide as input to the network a much larger search image and it will compute the similarity at all translated sub-windows on a dense grid in a single evaluation.
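To make the "similarity at all translated sub-windows" idea concrete, here is a minimal NumPy sketch of dense cross-correlation between target features and search features. It is an illustration only, not the authors' implementation: the function name `cross_correlation`, the explicit loops, and the 8-channel random features are my own assumptions (a real tracker would run this as a single convolution on the GPU).

```python
import numpy as np

def cross_correlation(exemplar, search):
    """Slide the exemplar features over the search features.

    exemplar: (h, w, C) feature map of the target patch
    search:   (H, W, C) feature map of the larger search image
    returns:  (H - h + 1, W - w + 1) response map, one score per sub-window
    """
    h, w, _ = exemplar.shape
    H, W, _ = search.shape
    response = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            window = search[i:i + h, j:j + w, :]
            # Similarity score = inner product of the two feature blocks.
            response[i, j] = np.sum(window * exemplar)
    return response

rng = np.random.default_rng(0)
search = rng.standard_normal((22, 22, 8))    # 8 channels only to keep the demo fast
exemplar = search[5:11, 9:15, :].copy()      # plant the "target" at offset (5, 9)

response = cross_correlation(exemplar, search)
peak = np.unravel_index(np.argmax(response), response.shape)
print(response.shape, peak)
```

With a 6x6 exemplar and a 22x22 search map the response map is 17x17, which matches the sizes listed in the implementation details, and the peak should land on the planted offset.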
Significantly better performance is achieved without much speed drop.
SA-Siam inherits network architecture from SiamFC. We intend to improve SiamFC with an innovative way to utilize heterogeneous features.
2.2. Ensemble Trackers
A common insight of these ensemble trackers is that it is possible to make a strong tracker by utilizing different layers of CNN features. Besides, the correlation across models should be weak. In SA-Siam design, the appearance branch and the semantic branch use features at very different abstraction levels. Besides, they are not jointly trained to avoid becoming homogeneous.
2.3. Adaptive Feature Selection
Different features affect different tracking targets differently; using all features for tracking a single object is neither efficient nor effective. Recently, SENet demonstrated the effectiveness of channel-wise attention on image recognition tasks.
In our SA-Siam network, we perform channel-wise attention based on the channel activations. It can be looked on as a type of target adaptation, which potentially improves the tracking performance.
3. Our Approach
The fundamental idea behind this design: the appearance features from similarity learning and the semantic features from the classification task are complementary. This is the authors' key observation.
3.1. SA-Siam Network Architecture
The two branches are separately trained and not combined until testing time.
The appearance branch:
Similar to SiamFC.
The semantic branch:
Pretrained CNN (AlexNet), conv4/conv5, fusion module (1x1 ConvNet), crop operation, attention module.
We only train the fusion module and the channel attention module.
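The fusion and crop steps of the semantic branch can be sketched roughly as follows. This is a simplified illustration under my own assumptions: a single conv5 feature map, a fusion layer that keeps 256 channels, random stand-in weights, and the hypothetical helper names `conv1x1` and `center_crop`. The notes do not specify these details.

```python
import numpy as np

def conv1x1(features, weight, bias):
    """1x1 convolution: the same linear map applied at every spatial position.

    features: (H, W, C_in), weight: (C_in, C_out), bias: (C_out,)
    """
    return features @ weight + bias

def center_crop(features, size):
    """Keep the central size x size window of an (H, W, C) feature map."""
    H, W, _ = features.shape
    top, left = (H - size) // 2, (W - size) // 2
    return features[top:top + size, left:left + size, :]

rng = np.random.default_rng(0)
conv5 = rng.standard_normal((22, 22, 256))    # stand-in for the frozen AlexNet conv5 output
w = rng.standard_normal((256, 256)) * 0.01    # fusion weights (hypothetical initialization)
b = np.zeros(256)

fused = conv1x1(conv5, w, b)           # still 22 x 22 x 256
target_feat = center_crop(fused, 6)    # central 6 x 6 x 256 region for the target
print(fused.shape, target_feat.shape)
```

Only `w` and `b` (plus the attention module) would receive gradients in training; the AlexNet features stay fixed, in line with the note above.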
During testing time:
The response maps produced by the two branches are combined with a weighted sum. Similar to SiamFC, multiple scales are searched; the authors find that using three scales strikes a good balance between performance and speed.
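A minimal sketch of the test-time combination, assuming the weight multiplies the appearance response and its complement multiplies the semantic response (the default 0.3 is the combine weight listed under hyperparameters in these notes). The function names and the toy response maps are hypothetical, and any further post-processing a real tracker applies is left out.

```python
import numpy as np

def combine_responses(resp_a, resp_s, lam=0.3):
    """Weighted sum of the appearance and semantic response maps."""
    return lam * resp_a + (1.0 - lam) * resp_s

def locate(responses):
    """Return (scale index, row, col) of the highest peak across all scales."""
    stacked = np.stack(responses)    # (num_scales, H, W)
    k, r, c = np.unravel_index(np.argmax(stacked), stacked.shape)
    return int(k), int(r), int(c)

# Toy 17x17 response maps for three scales.
resp_a = [np.zeros((17, 17)) for _ in range(3)]
resp_s = [np.zeros((17, 17)) for _ in range(3)]
resp_a[0][2, 2] = 2.0    # distractor that only the appearance branch fires on
resp_a[1][8, 9] = 1.0    # both branches agree on this location
resp_s[1][8, 9] = 1.0

combined = [combine_responses(a, s) for a, s in zip(resp_a, resp_s)]
print(locate(resp_a), locate(combined))
```

In this toy case the appearance branch alone would follow the distractor at scale 0, while the combined map picks the location both branches agree on, which is the complementarity argument in miniature.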
3.2. Channel Attention in Semantic Branch
High-level semantic features are robust to appearance changes of the target, which makes the tracker more generalized but less discriminative, so localization is less accurate. To improve the discriminative power of the semantic branch, the authors design a channel attention mechanism.
Intuitively, different channels play different roles in tracking different targets. Some channels are extremely important for tracking certain targets but dispensable when tracking others. If we could adapt the channel importance to the tracking target, we achieve the minimum functionality of target adaptation. In order to do so, not only the target but also the background region around the target matters. Therefore, the input of the proposed attention module is not the target itself but a larger region that also contains the background context.
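Take the conv5 feature map as an example. The feature map is 22x22.
First, the feature map is divided into a 3x3 grid; the center cell is 6x6, the same size as the target region.
Then, max pooling is performed within each grid cell.
Next, a two-layer multilayer perceptron (MLP) produces a coefficient for this channel.
Finally, a sigmoid function with a bias generates the final output weight.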
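The four steps above can be sketched in NumPy as follows. Treat it as an illustration under stated assumptions rather than the paper's code: the MLP weights are random stand-ins shared across channels, the hidden size of 9 and the bias of 0.5 are taken from the implementation details in these notes, the bias is assumed to be added after the sigmoid, and applying the weights to the cropped target features is my own reading.

```python
import numpy as np

def grid_max_pool(channel, center=6):
    """Max-pool one (H, W) channel over a 3x3 grid whose center cell is center x center."""
    H, W = channel.shape
    r0, c0 = (H - center) // 2, (W - center) // 2
    rows = [0, r0, r0 + center, H]
    cols = [0, c0, c0 + center, W]
    return np.array([
        channel[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].max()
        for i in range(3) for j in range(3)
    ])

def channel_weights(features, w1, b1, w2, b2, bias=0.5):
    """features: (H, W, C). Returns one weight per channel, shape (C,)."""
    C = features.shape[2]
    pooled = np.stack([grid_max_pool(features[:, :, c]) for c in range(C)])  # (C, 9)
    hidden = np.maximum(pooled @ w1 + b1, 0.0)    # hidden layer with ReLU, (C, 9)
    logits = hidden @ w2 + b2                     # (C, 1)
    # Sigmoid plus bias keeps every weight strictly above the bias value.
    return 1.0 / (1.0 + np.exp(-logits[:, 0])) + bias

rng = np.random.default_rng(0)
feat = rng.standard_normal((22, 22, 256))          # stand-in for conv5 features with context
w1, b1 = rng.standard_normal((9, 9)) * 0.1, np.zeros(9)
w2, b2 = rng.standard_normal((9, 1)) * 0.1, np.zeros(1)

xi = channel_weights(feat, w1, b1, w2, b2)         # one weight per channel
weighted_target = feat[8:14, 8:14, :] * xi         # broadcast over the channel axis
print(xi.shape, weighted_target.shape)
```

For a 22x22 map the grid cells come out as 8, 6 and 8 pixels wide, so the center cell covers exactly the 6x6 target region while the eight outer cells summarize the surrounding context.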
3.3. Discussions of Design Choices
We separately train the two branches.
We do not fine-tune S-Net.
We keep A-Net as it is in SiamFC.
4. Experiments
4.1. Implementation Details
Network structure: A-Net has exactly the same network structure as SiamFC. S-Net uses AlexNet pretrained on ImageNet; a small change is made to the stride so that the output of S-Net has the same size as that of A-Net.
In the attention module, the pooled features of each channel are stacked into a 9-dimensional vector. The following MLP has one hidden layer with nine neurons and uses the ReLU nonlinearity. A sigmoid function with a bias of 0.5 is applied at the end; this is to ensure that no channel will be suppressed to zero.
Data dimensions:
input: 127x127x3, 255x255x3.
output: 6x6x256, 22x22x256.
conv4: 24x24x384.
conv5: 22x22x256.
response maps: 17x17.
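The sizes above can be checked with a little arithmetic. The sketch below assumes a SiamFC-style AlexNet without padding; the kernel sizes and strides in `LAYERS` are my assumption, since these notes do not list them, and `out_size` and `trace` are hypothetical helper names.

```python
def out_size(size, kernel, stride=1):
    """Spatial size after a convolution or pooling layer without padding."""
    return (size - kernel) // stride + 1

# (name, kernel, stride): assumed SiamFC-style AlexNet configuration.
LAYERS = [
    ("conv1", 11, 2), ("pool1", 3, 2),
    ("conv2", 5, 1), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1),
]

def trace(size):
    """Return the spatial size after each layer for a square input."""
    sizes = {}
    for name, kernel, stride in LAYERS:
        size = out_size(size, kernel, stride)
        sizes[name] = size
    return sizes

search = trace(255)    # 255 x 255 input
target = trace(127)    # 127 x 127 input
response = search["conv5"] - target["conv5"] + 1    # valid cross-correlation size
print(search["conv4"], search["conv5"], target["conv5"], response)
```

Under these assumptions the 255x255 input gives 24x24 at conv4 and 22x22 at conv5, the 127x127 input gives 6x6 at conv5, and correlating a 6x6 map over a 22x22 map yields the 17x17 response map.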
Training:
ILSVRC-2015; only color images are used. Implemented in TensorFlow. The average test speed is 50 fps.
Hyperparameters:
Combine weight = 0.3. Three scales.
4.2. Datasets and Evaluation Metrics
OTB:
VOT:
4.3. Ablation Analysis
The semantic branch and the appearance branch complement each other.
Using multilevel features and channel attention brings gains.
Separate vs. joint training.
4.4. Comparison with State-of-the-Arts
OTB benchmarks.
VOT2015 benchmark.
VOT2016 benchmark.
VOT2017 benchmark.
5. Conclusion
In the future, we plan to continue exploring the effective fusion of deep features in the object tracking task.