RePr: Improved Training of Convolutional Filters

Paper: http://static.tongtianta.site/paper_pdf/2e51726c-3f1d-11e9-ba5b-00163e08bb86.pdf

Abstract

A well-trained Convolutional Neural Network can easily be pruned without significant loss of performance. This is because of unnecessary overlap in the features captured by the network's filters. Innovations in network architecture such as skip/dense connections and Inception units have mitigated this problem to some extent, but these improvements come with increased computation and memory requirements at run-time. We attempt to address this problem from another angle - not by changing the network structure but by altering the training method. We show that by temporarily pruning and then restoring a subset of the model's filters, and repeating this process cyclically, overlap in the learned features is reduced, producing improved generalization. We show that the existing model-pruning criteria are not optimal for selecting filters to prune in this context and introduce inter-filter orthogonality as the ranking criterion to determine under-expressive filters. Our method is applicable both to vanilla convolutional networks and more complex modern architectures, and improves the performance across a variety of tasks, especially when applied to smaller networks.

1. Introduction

Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Much of this success is due to innovations in novel, task-specific network architectures [3, 4]. Despite variation in network design, the same core optimization techniques are used across tasks. These techniques consider each individual weight as its own entity and update them independently. Limited progress has been made towards developing a training process specifically designed for convolutional networks, in which filters are the fundamental unit of the network. A filter is not a single weight parameter but a stack of spatial kernels.

Because models are typically over-parameterized, a trained convolutional network will contain redundant filters [5, 6]. This is evident from the common practice of pruning filters [7, 8, 6, 9, 10, 11], rather than individual parameters [12], to achieve model compression. Most of these pruning methods are able to drop a significant number of filters with only a marginal loss in the performance of the model. However, a model with fewer filters cannot be trained from scratch to achieve the performance of a large model that has been pruned to be roughly the same size [6, 11, 13]. Standard training procedures tend to learn models with extraneous and prunable filters, even for architectures without any excess capacity. This suggests that there is room for improvement in the training of Convolutional Neural Networks (ConvNets).

Part of the research was done while the author was an intern at MSR

[Figure 1: Training set (left) and test set (right) accuracy for standard training versus RePr training; see Section 4.]

To this end, we propose a training scheme in which, after some number of iterations of standard training, we select a subset of the model's filters to be temporarily dropped. After additional training of the reduced network, we reintroduce the previously dropped filters, initialized with new weights, and continue standard training. We observe that following the reintroduction of the dropped filters, the model is able to achieve higher performance than was obtained before the drop. Repeated application of this process obtains models which outperform those obtained by standard training, as seen in Figure 1 and discussed in Section 4. We observe this improvement across various tasks and over various types of convolutional networks. This training procedure is able to produce improved performance across a range of possible criteria for choosing which filters to drop, and further gains can be achieved by careful selection of the ranking criterion. According to a recent hypothesis [14], the relative success of over-parameterized networks may largely be due to an abundance of initial sub-networks. Our method aims to preserve successful sub-networks while allowing the re-initialization of less useful filters.

In addition to our novel training strategy, the second major contribution of our work is an exploration of metrics to guide filter dropping. Our experiments demonstrate that standard techniques for permanent filter pruning are suboptimal in our setting, and we present an alternative metric which can be efficiently computed, and which gives a significant improvement in performance. We propose a metric based on the inter-filter orthogonality within convolutional layers and show that this metric outperforms state-of-the-art filter importance ranking methods used for network pruning in the context of our training strategy. We observe that even small, under-parameterized networks tend to learn redundant filters, which suggests that filter redundancy is not solely a result of over-parameterization, but is also due to ineffective training. Our goal is to reduce the redundancy of the filters and increase the expressive capacity of ConvNets, and we achieve this by changing the training scheme rather than the model architecture.

2. Related Work

Training Scheme Many changes to the training paradigm have been proposed to reduce over-fitting and improve generalization. Dropout [15] is widely used in training deep nets. By stochastically dropping the neurons it prevents co-adaption of feature detectors. A similar effect can be achieved by dropping a subset of activations [16]. Wu et al. [15] extend the idea of stochastic dropping to convolutional neural networks by probabilistic pooling of convolution activations. Yet another form of stochastic training recommends randomly dropping entire layers [17], forcing the model to learn similar features across various layers, which prevents extreme overfitting. In contrast, our technique encourages the model to use a linear combination of features instead of duplicating the same feature. Han et al. [18] propose Dense-Sparse-Dense (DSD), a similar training scheme, in which they apply weight regularization mid-training to encourage the development of sparse weights, and subsequently remove the regularization to restore dense weights. While DSD works at the level of individual parameters, our method is specifically designed to apply to convolutional filters.

Model Compression Knowledge Distillation (KD) [19] is a training scheme which uses soft logits from a larger trained model (teacher) to train a smaller model (student). Soft logits capture hierarchical information about the object and provide a smoother loss function for optimization. This leads to easier training and better convergence for small models. In a surprising result, Born-Again-Network [20] shows that if the student model is of the same capacity as the teacher it can outperform the teacher. A few other variants of KD have been proposed [21] and all of them require training several models. Our training scheme does not depend on an external teacher and requires less training than KD. More importantly, when combined with KD, our method gives better performance than can be achieved by either technique independently (discussed in Section 7).

Neuron ranking Interest in finding the least salient neurons/weights has a long history. LeCun [22] and Hassibi et al. [23] show that using the Hessian, which contains second-order derivatives, identifies the weak neurons and performs better than using the magnitude of the weights. Computing the Hessian is expensive and thus it is not widely used. Han et al. [12] show that the norm of the weights is still an effective ranking criterion and yields sparse models. The sparse models do not translate to faster inference, but as a neuron ranking criterion they are effective. Hu et al. [24] explore the Average Percentage of Zeros (APoZ) in the activations and use a data-driven threshold to determine the cut-off. Molchanov et al. [9] recommend the second term from the Taylor expansion of the loss function. We provide a detailed comparison and show results on using these metrics with our training scheme in Section 5.

Architecture Search Neural architecture search [25, 26, 27] is where the architecture is modified during training, and multiple neural network structures are explored in search of the best architecture for a given dataset. Such methods do not have any benefits if the architecture is fixed ahead of time. Our scheme improves training for a given architecture by making better use of the available parameters. This could be used in conjunction with architecture search if there is flexibility around the final architecture, or used on its own when the architecture is fixed due to certified model deployment, memory requirements, or other considerations.

Feature correlation A well-known shortcoming of vanilla convolutional networks is their correlated feature maps [5, 28]. Architectures like Inception-Net [29] are motivated by analyzing the correlation statistics of features across layers. They aim to reduce the correlation between the layers by using concatenated features from various sized filters, though subsequent research shows otherwise [30]. More recent architectures like ResNet [1] and DenseNet [31] aim to implicitly reduce feature correlations by summing or concatenating activations from previous layers. That said, these models are computationally expensive and require large memory to store previous activations. Our aim is to induce decorrelated features without changing the architecture of the convolutional network. This benefits all the existing implementations of ConvNets without having to change the infrastructure. While our technique performs best with vanilla ConvNet architectures, it still marginally improves the performance of modern architectures.

3. Motivation for Orthogonal Features

A feature for a convolutional filter is defined as the pointwise sum of the activations from the individual kernels of the filter. A feature is considered useful if it helps to improve the generalization of the model. A model that has poor generalization usually has features that, in aggregate, capture limited directions in activation space [32]. On the other hand, if a model's features are orthogonal to one another, they will each capture distinct directions in activation space, leading to improved generalization. For a trivially-sized ConvNet, we can compute the maximally expressive filters by analyzing the correlation of features across layers and clustering them into groups [33]. However, this scheme is computationally impractical for the deep ConvNets used in real-world applications. Alternatively, a computationally feasible option is the addition of a regularization term to the loss function used in standard SGD training which encourages the minimization of the covariance of the activations, but this produces only limited improvement in model performance [34, 5]. A similar method, in which the regularization term instead encourages the orthogonality of filter weights, has also produced marginal improvements [35, 36, 37, 38]. Shang et al. [39] discovered that the low-level filters are duplicated with opposite phase. Forcing filters to be orthogonal will minimize this duplication without changing the activation function. In addition to improvements in performance and generalization, Saxe et al. [40] show that the orthogonality of weights also improves the stability of network convergence during training. The authors of [38, 41] further demonstrate the value of orthogonal weights to the efficient training of networks. Orthogonal initialization is common practice for Recurrent Neural Networks due to their increased sensitivity to initial conditions [42], but it has somewhat fallen out of favor for ConvNets. These factors shape our motivation for encouraging orthogonality of features in the ConvNet and form the basis of our ranking criteria. Because features are dependent on the input data, determining their orthogonality requires computing statistics across the entire training set, and is therefore prohibitive. We instead compute the orthogonality of filter weights as a surrogate. Our experiments show that encouraging weight orthogonality through a regularization term is insufficient to promote the development of features which capture the full space of the input data manifold. Our method of dropping overlapping filters acts as an implicit regularization and leads to better orthogonality of the filters without hampering model convergence.

We use Canonical Correlation Analysis [43] (CCA) to study the overlap of features in a single layer. CCA finds the linear combinations of random variables that show maximum correlation with each other. It is a useful tool to determine if the learned features are overlapping in their representational capacity. Li et al. [44] apply correlation analysis to filter activations to show that most of the well-known ConvNet architectures learn similar representations. Raghu et al. [30] combine CCA with SVD to perform a correlation analysis of the singular values of activations from various layers. They show that increasing the depth of a model does not always lead to a corresponding increase of the model's dimensionality, due to several layers learning representations in correlated directions. We ask an even more elementary question - how correlated are the activations from various filters within a single layer? In an over-parameterized network like VGG-16, which has several convolutional layers with 512 filters each, it is no surprise that most of the filter activations are highly correlated. As a result, VGG-16 has been shown to be easily pruned - more than 50% of the filters can be dropped while maintaining the performance of the full network [9, 44]. Is this also true for significantly smaller convolutional networks, which under-fit the dataset? We will consider a simple network with two convolutional layers of 32 filters each, and a softmax layer at the end. Training this model on CIFAR-10 for 100 epochs with an annealed learning rate results in test set accuracy of 58.2%, far below the 93.5% achieved by VGG-16. In the case of VGG-16, we might expect that correlation between filters is merely an artifact of the over-parameterization of the model - the dataset simply does not have a dimensionality high enough to require every feature to be orthogonal to every other. On the other hand, our small network has clearly failed to capture the full feature space of the training data, and thus any correlation between its filters is due to inefficiencies in training, rather than over-parameterization.
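
To make this analysis concrete, the sketch below computes a simple within-layer overlap matrix from one layer's activations. It is only an illustration of the idea: the paper's Figure 2 uses CCA over linear combinations of activations, whereas this sketch uses a plain correlation matrix between per-filter responses; the function name and array shapes are assumptions.

```python
# Minimal sketch (not the paper's exact analysis): measure how strongly the
# responses of filters within one convolutional layer overlap, using a plain
# Pearson-correlation matrix as a stand-in for the CCA analysis in Figure 2.
import numpy as np

def filter_overlap_matrix(acts: np.ndarray) -> np.ndarray:
    """acts: activations of one conv layer, shape (N, C, H, W)."""
    n, c, h, w = acts.shape
    # Treat every spatial position of every image as one observation per filter.
    responses = acts.transpose(0, 2, 3, 1).reshape(-1, c)   # (N*H*W, C)
    corr = np.corrcoef(responses, rowvar=False)              # (C, C)
    return np.abs(corr)

# Usage: off-diagonal values near 1 indicate filters that capture overlapping
# directions (the bright spots in Figure 2, left).
acts = np.random.randn(8, 32, 16, 16)          # stand-in for real activations
overlap = filter_overlap_matrix(acts)
print(overlap.shape, overlap.max())
```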

Figure 2: Left: Canonical Correlation Analysis of activations from two layers of a ConvNet trained on CIFAR-10. Right: Distribution of change in accuracy when the model is evaluated by dropping one filter at a time.

Given a trained model, we can evaluate the contribution of each filter to the model's performance by removing (zeroing out) that filter and measuring the drop in accuracy on the test set. We will call this metric of filter importance the "greedy Oracle". We perform this evaluation independently for every filter in the model, and plot the distribution of the resulting drops in accuracy in Figure 2 (right). Most of the second layer filters contribute less than 1% in accuracy, and with first layer filters there is a long tail. Some filters are important and contribute over 4% of accuracy, but most filters are around 1%. This implies that even a tiny and under-performing network could be filter pruned without significant performance loss. The model has not efficiently allocated filters to capture wider representations of necessary features. Figure 2 (left) shows the correlations from linear combinations of the filter activations (CCA) at both the layers. It is evident that in both the layers there is a significant correlation among filter activations, with several of them close to a near perfect correlation of 1 (bright yellow spots). The second layer (upper right diagonal) has a lot more overlap of features than the first layer (lower right). For a random orthogonal matrix any value above 0.3 (lighter than dark blue) is an anomaly. The activations are even more correlated if the linear combinations are extended to kernel functions [45] or singular values [30]. Regardless, it suffices to say that standard training for convolutional filters does not maximize the representational potential of the network.
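
The greedy Oracle described above can be sketched as follows, assuming a PyTorch model and a user-supplied evaluate(model, loader) helper (both hypothetical); it simply zeroes one filter at a time and records the drop in test accuracy.

```python
# Sketch of the "greedy Oracle" ranking: zero out one filter at a time and
# record the drop in test accuracy. `evaluate` is an assumed helper that
# returns accuracy; this is illustrative, not the authors' code.
import copy
import torch

def greedy_oracle(model, conv_layer_name, test_loader, evaluate):
    base_acc = evaluate(model, test_loader)
    drops = []
    conv = dict(model.named_modules())[conv_layer_name]
    for f in range(conv.out_channels):
        pruned = copy.deepcopy(model)
        pconv = dict(pruned.named_modules())[conv_layer_name]
        with torch.no_grad():
            pconv.weight[f].zero_()               # drop filter f
            if pconv.bias is not None:
                pconv.bias[f].zero_()
        drops.append(base_acc - evaluate(pruned, test_loader))
    return drops  # larger drop => more important filter
```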

4. Our Training Scheme: RePr

We modify the training process by cyclically removing redundant filters, retraining the network, re-initializing the removed filters, and repeating. We consider each filter (3D tensor) as a single unit, and represent it as a long vector (f). Let M denote a model with F filters spread across L layers. Let $\hat{F}$ denote a subset of the F filters, such that $M_F$ denotes the complete network whereas $M_{F-\hat{F}}$ denotes a sub-network without those $\hat{F}$ filters. Our training scheme alternates between training the complete network ($M_F$) and the sub-network ($M_{F-\hat{F}}$). This introduces two hyper-parameters. The first is the number of iterations to train each of the networks before switching over; let this be $S_1$ for the full network and $S_2$ for the sub-network. These have to be non-trivial values so that each of the networks learns to improve upon the results of the previous network. The second hyper-parameter is the total number of times to repeat this alternating scheme; let it be N. This value has minimal impact beyond a certain range and does not require tuning.

The most important part of our algorithm is the metric used to rank the filters. Let R be the metric which associates some numeric value with a filter. This could be a norm of the weights or of its gradients, or our metric - inter-filter orthogonality in a layer. Here we present our algorithm agnostic to the choice of metric. Most sensible choices of filter importance result in an improvement over standard training when applied to our training scheme (see Ablation Study, Section 6).

Our training scheme operates on a macro level and is not a weight-update rule. Thus, it is not a substitute for SGD or other adaptive methods like Adam [46] and RMSProp [47]. Our scheme works with any of the available optimizers and shows improvement across the board. However, if using an optimizer that has parameter-specific learning rates (like Adam), it is important to re-initialize the learning rates corresponding to the weights that are part of the pruned filters ($\hat{F}$). The corresponding Batch Normalization [48] parameters ($\gamma$, $\beta$) must also be re-initialized. For this reason, comparisons of our training scheme with standard training are done with a common optimizer.
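
A minimal sketch of this bookkeeping, assuming a PyTorch model trained with Adam: reset the optimizer's per-weight adaptive state and the matching BatchNorm parameters for the filters indexed by pruned_idx when they are re-introduced. The layer handles and index variable are illustrative.

```python
# Reset Adam's per-weight state and the BatchNorm parameters that correspond
# to the re-initialized filters of `conv` (an nn.Conv2d) and `bn`
# (an nn.BatchNorm2d). `pruned_idx` is a list of filter indices (assumption).
import torch

def reset_optimizer_and_bn(optimizer, conv, bn, pruned_idx):
    with torch.no_grad():
        # Reset Adam's running moments (its adaptive per-weight state).
        state = optimizer.state.get(conv.weight, {})
        for key in ("exp_avg", "exp_avg_sq"):
            if key in state:
                state[key][pruned_idx] = 0.0
        # Reset the corresponding Batch Normalization parameters.
        bn.weight[pruned_idx] = 1.0        # gamma
        bn.bias[pruned_idx] = 0.0          # beta
        bn.running_mean[pruned_idx] = 0.0
        bn.running_var[pruned_idx] = 1.0
```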

We re-initialize the filters ($\hat{F}$) to be orthogonal to their value before being dropped and to the current value of the non-pruned filters ($F - \hat{F}$). We use the QR decomposition on the weights of the filters from the same layer to find the null-space and use that to find an orthogonal initialization point.
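
A minimal NumPy sketch of this re-initialization, assuming the layer's filters have already been flattened into the rows of W: QR-decompose the current filters, project random vectors onto their null-space, and use the result as the new filter values. It is an illustration, not the authors' exact procedure (for instance, it does not additionally orthogonalize the re-initialized filters against each other).

```python
import numpy as np

def reinit_pruned_filters(W: np.ndarray, pruned_idx, rng=None):
    """W: (J, d) flattened filters of one layer. Returns a copy of W whose
    pruned rows are replaced by vectors orthogonal to both their old values
    and the kept filters (all of which are still rows of W at this point)."""
    rng = np.random.default_rng() if rng is None else rng
    J, d = W.shape
    Q, _ = np.linalg.qr(W.T)            # columns of Q span the current filters
    W_new = W.copy()
    for i in pruned_idx:
        v = rng.standard_normal(d)
        v -= Q @ (Q.T @ v)              # project onto the null-space
        W_new[i] = v / (np.linalg.norm(v) + 1e-8)
    return W_new
```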

Our algorithm is training interposed with Re-initializing and Pruning - RePr (pronounced: reaper). We summarize our training scheme in Algorithm 1.

[Algorithm 1: the RePr training scheme.]
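
Since the algorithm figure cannot be reproduced here, the following sketch outlines the RePr loop it summarizes, under the assumption of generic helpers for training, ranking, masking, and re-initialization; all helper names are placeholders rather than the paper's code.

```python
# Compact sketch of the RePr loop: alternate between full-network training
# (S1 iterations), ranking and dropping a fraction of filters, sub-network
# training (S2 iterations), and orthogonal re-initialization, repeated N times.
def repr_training(model, S1, S2, N, prune_fraction, train,
                  rank_filters, mask_filters, unmask_and_reinit):
    train(model, S1)                              # initial full-network training
    for _ in range(N):
        ranking = rank_filters(model)             # metric R, e.g. inter-filter orthogonality
        n_drop = int(prune_fraction * len(ranking))
        dropped = ranking[:n_drop]                # least important filters
        mask_filters(model, dropped)              # temporarily drop them
        train(model, S2)                          # train the sub-network
        unmask_and_reinit(model, dropped)         # re-initialize orthogonally, restore filters
        train(model, S1)                          # resume full-network training
    return model
```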

We use a shallow model to analyze the dynamics of our training scheme and its impact on the train/test accuracy. A shallow model makes it feasible to compute the greedy Oracle ranking for each of the filters. This allows us to understand the impact of the training scheme alone, without confounding the results due to the impact of the ranking criteria. We provide results on larger and deeper convolutional networks in the Results (Section 8).

Consider an n-layer vanilla ConvNet, without skip or dense connections, with X filters in each layer, as shown below:

Img → Conv(X) → Conv(X) → ... → Conv(X) → FC → Softmax

We will represent this architecture as $C_n(X)$. Thus, a $C_3(32)$ has 96 filters, and when trained with SGD with a learning rate of 0.01, achieves test accuracy of 73%. Figure 1 shows training plots for accuracy on the training set (left) and test set (right). In this example, we use a RePr training scheme that prunes 30% of the filters in each cycle, with the ranking criterion R given by the greedy Oracle. We exclude a separate validation set of 5K images from the training set to compute the Oracle ranking. In the training plot, annotation [A] shows the point at which the filters are first pruned. Annotation [C] marks the test accuracy of the model at this point. The drop in test accuracy at [C] is lower than that of the training accuracy at [A], which is not a surprise as most models overfit the training set. However, the test accuracy at [D] is the same as at [C], but at this point the model only has 70% of the filters. This is not a surprising result, as research on filter pruning shows that at lower rates of pruning most if not all of the performance can be recovered [9].
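
For reference, one possible PyTorch rendering of the $C_3(32)$ model used in this example is sketched below; the kernel size, padding, pooling, and CIFAR-10 input details are our assumptions, as the text only specifies the number of layers and filters.

```python
# Hypothetical C_n(X): n vanilla conv layers of X filters each (no skip/dense
# connections), followed by a fully connected classifier; softmax is applied
# by the loss function during training.
import torch.nn as nn

def make_Cn_X(n=3, X=32, num_classes=10):
    layers = []
    in_ch = 3                                   # CIFAR-10 RGB input (assumption)
    for _ in range(n):
        layers += [nn.Conv2d(in_ch, X, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = X
    return nn.Sequential(*layers,
                         nn.AdaptiveAvgPool2d(1),
                         nn.Flatten(),
                         nn.Linear(X, num_classes))
```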

What is surprising is that the test accuracy at [E], which is only a couple of epochs after re-introducing the pruned filters, is significantly higher than at point [C]. Both point [C] and point [E] are same-capacity networks, and the higher accuracy at [E] is not due to the model convergence. In the standard training (orange line) the test accuracy does not change during this period. Models that first grow the network and then prune [49, 50], unfortunately, stopped shy of another phase of growth, which yields improved performance. In their defense, that technique defeats the purpose of obtaining a smaller network by pruning. However, if we continue RePr training for another two iterations, we see that the point [F], which is still at 70% of the original filters, yields accuracy which is comparable to the point [E] (100% of the model size).

Another observation we can make from the plots is that the training accuracy of the RePr model is lower, which signifies some form of regularization on the model. This is evident in Figure 4 (right), which shows RePr with a large number of iterations (large N). While the marginal benefit of higher test accuracy diminishes quickly, the generalization gap between train and test accuracy is reduced significantly.

5. Our Metric: inter-filter orthogonality

The goals of searching for a metric to rank the least important filters are twofold - (1) computing the greedy Oracle is not computationally feasible for large networks, and (2) the greedy Oracle may not be the best criterion. If a filter which captures a unique direction, and is thus not replaceable by a linear combination of other filters, has a lower contribution to accuracy, the Oracle will drop that filter. On a subsequent re-initialization and training, we may not get back the same set of directions.

The directions captured by the activation pattern express the capacity of a deep network [51]. Making the features orthogonal will maximize the directions captured and thus the expressiveness of the network. In a densely connected layer, orthogonal weights lead to orthogonal features, even in the presence of ReLU [42]. However, it is not clear how to compute the orthogonality of a convolutional layer.

A convolutional layer is composed of parameters grouped into spatial kernels that sparsely share the incoming activations. Should all the parameters in a single convolutional layer be considered together when accounting for orthogonality? The theory that promotes initializing weights to be orthogonal is based on densely connected layers (FC-layers), and popular deep learning libraries follow this guide¹ by considering a convolutional layer as one giant vector, disregarding the sparse connectivity. A recent attempt to study the orthogonality of convolutional filters is described in [41], but their motivation is the convergence of very deep networks (10K layers) and not the orthogonality of the features. Our empirical study suggests a strong preference for requiring orthogonality of individual filters in a layer (inter-filter & intra-layer) rather than of individual kernels.

A filter of kernel size $k$ is commonly a 3D tensor of shape $k \times k \times c$, where c is the number of channels in the incoming activations. Flatten this tensor to a 1D vector of size $k \cdot k \cdot c$, and denote it by f. Let $J_\ell$ denote the number of filters in the layer $\ell$, where $\ell \in \{1, \dots, L\}$, and L is the number of layers in the ConvNet. Let $W_\ell$ be a matrix, such that the individual rows are the flattened filters (f) of the layer $\ell$.

Let $\hat{W}_\ell$ denote the normalized weights. Then, the measure of Orthogonality for filter f in a layer $\ell$ (denoted by $O_f^{\ell}$) is computed as shown in the equations below.

$$P_\ell = \left| \hat{W}_\ell \hat{W}_\ell^{T} - I \right|$$

$$O_f^{\ell} = \frac{\sum P_\ell[f]}{J_\ell}$$

$P_\ell$ is a matrix of size $J_\ell \times J_\ell$ and $P_\ell[i]$ denotes the ith row of $P_\ell$. The off-diagonal elements of the row of $P_\ell$ for a filter f denote the projection of all the other filters in the same layer onto f. The sum of a row is minimal when the other filters are orthogonal to this given filter. We rank a filter least important (and thus subject to pruning) if this value is the largest among all the filters in the network. While we compute the metric for a filter over a single layer, the ranking is computed over all the filters in the network. We do not enforce a per-layer rank because that would require learning a pruning-percentage hyper-parameter for every layer, and some layers are more sensitive than others. Our method prunes more filters from the deeper layers compared to the earlier layers. This is in accordance with the distribution of the contribution of each filter in a given network (Figure 2, right).
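
The ranking defined by the two equations above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the authors' code; in particular, each filter (row) is normalized to unit length, which is one reading of the normalization $\hat{W}_\ell$.

```python
# Inter-filter orthogonality ranking: flatten each layer's filters, normalize,
# compute P_l = |W_hat W_hat^T - I|, and score each filter by its row sum / J_l.
import numpy as np

def orthogonality_scores(conv_weights):
    """conv_weights: list of arrays, one per layer, each of shape (J_l, ...)."""
    scores = []                                   # (layer, filter, O_f^l) triples
    for l, W in enumerate(conv_weights):
        J = W.shape[0]
        W_flat = W.reshape(J, -1)                         # rows = flattened filters f
        W_hat = W_flat / (np.linalg.norm(W_flat, axis=1, keepdims=True) + 1e-8)
        P = np.abs(W_hat @ W_hat.T - np.eye(J))           # P_l
        O = P.sum(axis=1) / J                             # O_f^l, one value per filter
        scores += [(l, f, O[f]) for f in range(J)]
    # Rank across the whole network: the largest O_f^l are the least important.
    return sorted(scores, key=lambda t: t[2], reverse=True)
```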

Computation of our metric does not require expensive calculations of the inverse of the Hessian [22] or of the second-order derivatives [23], and it is feasible for networks of any size. The most expensive calculations are L matrix products of size $J_\ell \times J_\ell$, but GPUs are designed for fast matrix multiplications. Still, our method is more expensive than computing the norm of the weights or of the activations, or the Average Percentage of Zeros (APoZ).

¹ tensorflow: ops/init_ops.py#L543 & pytorch: nn/init.py#L350

Given the choice of orthogonality of filters, an obvious question would be to ask whether adding a soft penalty to the loss function improves this training. A few researchers [35, 36, 37] have reported marginal improvements due to added regularization in the ConvNets used for task-specific models. We experimented by adding such a penalty term to the loss function, but we did not see any improvement. Soft regularization penalizes all the filters and changes the loss surface to encourage random orthogonality in the weights, without improving expressiveness.
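
For completeness, here is a sketch of the kind of soft penalty referred to above, assuming the same per-layer orthogonality term used for ranking is reused as the regularizer and weighted by a coefficient lam; both of these choices are assumptions for illustration.

```python
# Soft orthogonality penalty baseline: add a weight-orthogonality regularizer
# to the task loss instead of using RePr.
import torch

def orthogonality_penalty(conv_layers):
    penalty = 0.0
    for conv in conv_layers:                      # each an nn.Conv2d
        W = conv.weight.flatten(1)                # (J, k*k*c)
        W_hat = torch.nn.functional.normalize(W, dim=1)
        J = W_hat.shape[0]
        P = (W_hat @ W_hat.t() - torch.eye(J, device=W.device)).abs()
        penalty = penalty + P.sum() / J
    return penalty

# loss = task_loss + lam * orthogonality_penalty(conv_layers)   # lam: hyper-parameter
```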

Source: http://tongtianta.site/paper/13375
Editor: Lornatang
Proofreader: Lornatang
