Squeeze-and-Excitation Networks論文翻譯——中英文對照

文章作者:Tyan
博客:noahsnail.com | CSDN | 簡書

聲明:作者翻譯論文僅為學習,如有侵權請聯(lián)系作者刪除博文,謝謝!

翻譯論文匯總:https://github.com/SnailTyan/deep-learning-papers-translation

Squeeze-and-Excitation Networks

Abstract

Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, much existing work has shown the benefits of enhancing spatial encoding. In this work, we focus on channels and propose a novel architectural unit, which we term the "Squeeze-and-Excitation"(SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at slight computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 error to $2.251%$, achieving a $\sim25%$ relative improvement over the winning entry of 2016.

摘要

卷積神經網絡建立在卷積運算的基礎上,通過融合局部感受野內的空間信息和通道信息來提取信息特征。為了提高網絡的表示能力,許多現(xiàn)有的工作已經顯示出增強空間編碼的好處。在這項工作中,我們專注于通道,并提出了一種新穎的架構單元,我們稱之為“Squeeze-and-Excitation”(SE)塊,它通過顯式地建模通道之間的相互依賴關系,自適應地重新校準通道式的特征響應。我們證明,通過將這些塊堆疊在一起,可以構建在具有挑戰(zhàn)性的數據集上泛化得非常好的SENet架構。關鍵的是,我們發(fā)現(xiàn)SE塊以微小的計算成本為現(xiàn)有的最先進的深層架構產生了顯著的性能改進。SENets是我們ILSVRC 2017分類提交的基礎,它贏得了第一名,并將top-5錯誤率顯著減少到$2.251%$,相對于2016年的獲勝成績取得了$\sim25%$的相對改進。

1. Introduction

Convolutional neural networks (CNNs) have proven to be effective models for tackling a variety of visual tasks [19, 23, 29, 41]. For each convolutional layer, a set of filters are learned to express local spatial connectivity patterns along input channels. In other words, convolutional filters are expected to be informative combinations by fusing spatial and channel-wise information together, while restricted in local receptive fields. By stacking a series of convolutional layers interleaved with non-linearities and downsampling, CNNs are capable of capturing hierarchical patterns with global receptive fields as powerful image descriptions. Recent work has demonstrated the performance of networks can be improved by explicitly embedding learning mechanisms that help capture spatial correlations without requiring additional supervision. One such approach was popularised by the Inception architectures [14, 39], which showed that the network can achieve competitive accuracy by embedding multi-scale processes in its modules. More recent work has sought to better model spatial dependence [1, 27] and incorporate spatial attention [17].

1. 引言

卷積神經網絡(CNNs)已被證明是解決各種視覺任務的有效模型[19,23,29,41]。對于每個卷積層,沿著輸入通道學習一組濾波器來表達局部空間連接模式。換句話說,卷積濾波器被期望在局部感受野的限制下,通過融合空間信息和通道信息形成有信息量的組合。通過疊加一系列與非線性和下采樣交織的卷積層,CNN能夠捕獲具有全局感受野的分層模式,作為強大的圖像描述。最近的工作已經證明,通過顯式地嵌入有助于捕捉空間相關性的學習機制,網絡的性能可以得到改善,而不需要額外的監(jiān)督。Inception架構[14,39]推廣了一種這樣的方法,它表明網絡可以通過在其模塊中嵌入多尺度處理來取得有競爭力的準確度。更多最近的工作則尋求更好地建模空間依賴[1,27]并結合空間注意力[17]。

In contrast to these methods, we investigate a different aspect of architectural design —— the channel relationship, by introducing a new architectural unit, which we term the “Squeeze-and-Excitation” (SE) block. Our goal is to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features. To achieve this, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.

與這些方法相反,通過引入一個新的架構單元,我們稱之為“Squeeze-and-Excitation”(SE)塊,我們研究了架構設計的一個不同方向——通道關系。我們的目標是通過顯式地建模卷積特征通道之間的相互依賴性來提高網絡的表示能力。為了達到這個目的,我們提出了一種機制,使網絡能夠執(zhí)行特征重新校準,通過這種機制,它可以學習使用全局信息來選擇性地強調有信息量的特征并抑制不太有用的特征。

The basic structure of the SE building block is illustrated in Fig.1. For any given transformation $\mathbf{F}_{tr} : \mathbf{X} \rightarrow \mathbf{U}$, $\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}, \mathbf{U} \in \mathbb{R}^{W \times H \times C}$, (e.g. a convolution or a set of convolutions), we can construct a corresponding SE block to perform feature recalibration as follows. The features $\mathbf{U}$ are first passed through a squeeze operation, which aggregates the feature maps across spatial dimensions $W \times H$ to produce a channel descriptor. This descriptor embeds the global distribution of channel-wise feature responses, enabling information from the global receptive field of the network to be leveraged by its lower layers. This is followed by an excitation operation, in which sample-specific activations, learned for each channel by a self-gating mechanism based on channel dependence, govern the excitation of each channel. The feature maps $\mathbf{U}$ are then reweighted to generate the output of the SE block which can then be fed directly into subsequent layers.

Figure 1

Figure 1. A Squeeze-and-Excitation block.

SE構建塊的基本結構如圖1所示。對于任何給定的變換$\mathbf{F}_{tr} : \mathbf{X} \rightarrow \mathbf{U}$,$\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}, \mathbf{U} \in \mathbb{R}^{W \times H \times C}$(例如卷積或一組卷積),我們可以構造一個相應的SE塊來執(zhí)行特征重新校準,如下所示。特征$\mathbf{U}$首先通過squeeze操作,該操作跨越空間維度$W \times H$聚合特征映射來產生通道描述符。這個描述符嵌入了通道特征響應的全局分布,使來自網絡全局感受野的信息能夠被其較低層利用。這之后是一個excitation操作,其中通過一個基于通道依賴性的自門機制為每個通道學習到樣本特定的激活,控制每個通道的激勵。然后特征映射$\mathbf{U}$被重新加權以生成SE塊的輸出,輸出可以直接輸入到隨后的層中。

Figure 1

圖1。Squeeze-and-Excitation塊。

An SE network can be generated by simply stacking a collection of SE building blocks. SE blocks can also be used as a drop-in replacement for the original block at any depth in the architecture. However, while the template for the building block is generic, as we show in Sec. 6.3, the role it performs at different depths adapts to the needs of the network. In the early layers, it learns to excite informative features in a class agnostic manner, bolstering the quality of the shared lower level representations. In later layers, the SE block becomes increasingly specialised, and responds to different inputs in a highly class-specific manner. Consequently, the benefits of feature recalibration conducted by SE blocks can be accumulated through the entire network.

SE網絡可以通過簡單地堆疊SE構建塊的集合來生成。SE塊也可以用作架構中任意深度的原始塊的直接替換。然而,雖然構建塊的模板是通用的,但正如我們在6.3節(jié)中展示的那樣,它在不同深度所起的作用會適應網絡的需求。在前面的層中,它學習以類別不可知的方式激發(fā)有信息量的特征,增強共享的較低層表示的質量。在后面的層中,SE塊變得越來越專業(yè)化,并以高度類別特定的方式響應不同的輸入。因此,SE塊進行特征重新校準的好處可以在整個網絡中累積。

The development of new CNN architectures is a challenging engineering task, typically involving the selection of many new hyperparameters and layer configurations. By contrast, the design of the SE block outlined above is simple, and can be used directly with existing state-of-the-art architectures whose convolutional layers can be strengthened by direct replacement with their SE counterparts. Moreover, as shown in Sec. 4, SE blocks are computationally lightweight and impose only a slight increase in model complexity and computational burden. To support these claims, we develop several SENets, namely SE-ResNet, SE-Inception, SE-ResNeXt and SE-Inception-ResNet and provide an extensive evaluation of SENets on the ImageNet 2012 dataset [30]. Further, to demonstrate the general applicability of SE blocks, we also present results beyond ImageNet, indicating that the proposed approach is not restricted to a specific dataset or a task.

新CNN架構的開發(fā)是一項具有挑戰(zhàn)性的工程任務,通常涉及許多新的超參數和層配置的選擇。相比之下,上面概述的SE塊的設計是簡單的,并且可以直接用于現(xiàn)有的最先進架構,其卷積層可以通過直接替換為對應的SE塊而得到加強。另外,如第4節(jié)所示,SE塊在計算上是輕量級的,僅稍微增加模型復雜性和計算負擔。為了支持這些聲明,我們開發(fā)了一些SENets,即SE-ResNet、SE-Inception、SE-ResNeXt和SE-Inception-ResNet,并在ImageNet 2012數據集[30]上對SENets進行了廣泛的評估。此外,為了證明SE塊的一般適用性,我們還呈現(xiàn)了ImageNet之外的結果,表明所提出的方法不受限于特定的數據集或任務。

Using SENets, we won the first place in the ILSVRC 2017 classification competition. Our top performing model ensemble achieves a $2.251%$ top-5 error on the test set. This represents a $\sim 25%$ relative improvement in comparison to the winner entry of the previous year (with a top-$5$ error of $2.991%$). Our models and related materials have been made available to the research community.

使用SENets,我們贏得了ILSVRC 2017分類競賽的第一名。我們表現(xiàn)最好的模型集成在測試集上達到了$2.251%$的top-5錯誤率。與前一年的獲勝條目($2.991%$的top-5錯誤率)相比,這表示$\sim 25%$的相對改進。我們的模型和相關材料已經提供給研究界。

2. Related Work

Deep architectures. A wide range of work has shown that restructuring the architecture of a convolutional neural network in a manner that eases the learning of deep features can yield substantial improvements in performance. VGGNets [35] and Inception models [39] demonstrated the benefits that could be attained with an increased depth, significantly outperforming previous approaches on ILSVRC 2014. Batch normalization (BN) [14] improved gradient propagation through deep networks by inserting units to regulate layer inputs stabilising the learning process, which enables further experimentation with a greater depth. He et al. [9, 10] showed that it was effective to train deeper networks by restructuring the architecture to learn residual functions through the use of identity-based skip connections which ease the flow of information across units. More recently, reformulations of the connections between network layers [5, 12] have been shown to further improve the learning and representational properties of deep networks.

2. 相關工作

深層架構。大量的工作已經表明,以易于學習深度特征的方式重構卷積神經網絡的架構可以大大提高性能。VGGNets[35]和Inception模型[39]證明了增加深度可以獲得的好處,在ILSVRC 2014上明顯超過了之前的方法。批標準化(BN)[14]通過插入單元來調節(jié)層輸入、穩(wěn)定學習過程,改善了梯度在深度網絡中的傳播,這使得可以用更大的深度進行進一步的實驗。He等人[9,10]表明,通過重構架構、借助基于恒等映射的跳躍連接來學習殘差函數,可以有效地訓練更深的網絡,這種連接簡化了跨單元的信息流動。最近,網絡層間連接的重新表述[5,12]已被證明可以進一步改善深度網絡的學習和表征屬性。

An alternative line of research has explored ways to tune the functional form of the modular components of a network. Grouped convolutions can be used to increase cardinality (the size of the set of transformations) [13, 43] to learn richer representations. Multi-branch convolutions can be interpreted as a generalisation of this concept, enabling more flexible compositions of convolutional operators [14, 38, 39, 40]. Cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [6, 18] or jointly by using standard convolutional filters [22] with $1\times 1$ convolutions, while much of this work has concentrated on the objective of reducing model and computational complexity. This approach reflects an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the network with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process, and significantly enhance the representational power of the network.

另一條研究路線探索了調整網絡模塊化組件的函數形式的方法。可以用分組卷積來增加基數(變換集合的大小)[13,43],以學習更豐富的表示。多分支卷積可以解釋為這個概念的推廣,使得卷積算子可以更靈活地組合[14,38,39,40]。跨通道相關性通常被映射為新的特征組合,或者獨立于空間結構[6,18],或者通過使用帶$1\times 1$卷積的標準卷積濾波器[22]來聯(lián)合映射,而這些工作大多集中于減少模型和計算復雜度的目標。這種方法反映了一個假設,即通道關系可以被表述為具有局部感受野的、實例不可知的函數的組合。相比之下,我們主張為網絡提供一種使用全局信息顯式建模通道之間動態(tài)、非線性依賴關系的機制,這可以簡化學習過程,并顯著增強網絡的表示能力。

Attention and gating mechanisms. Attention can be viewed, broadly, as a tool to bias the allocation of available processing resources towards the most informative components of an input signal. The development and understanding of such mechanisms has been a longstanding area of research in the neuroscience community [15, 16, 28] and has seen significant interest in recent years as a powerful addition to deep neural networks [20, 25]. Attention has been shown to improve performance across a range of tasks, from localisation and understanding in images [3, 17] to sequence-based models [2, 24]. It is typically implemented in combination with a gating function (e.g. a softmax or sigmoid) and sequential techniques [11, 37]. Recent work has shown its applicability to tasks such as image captioning [4, 44] and lip reading [7], in which it is exploited to efficiently aggregate multi-modal data. In these applications, it is typically used on top of one or more layers representing higher-level abstractions for adaptation between modalities. Highway networks [36] employ a gating mechanism to regulate the shortcut connection, enabling the learning of very deep architectures. Wang et al. [42] introduce a powerful trunk-and-mask attention mechanism using an hourglass module [27], inspired by its success in semantic segmentation. This high capacity unit is inserted into deep residual networks between intermediate stages. In contrast, our proposed SE-block is a lightweight gating mechanism, specialised to model channel-wise relationships in a computationally efficient manner and designed to enhance the representational power of modules throughout the network.

注意力和門機制。從廣義上講,可以將注意力視為一種工具,將可用處理資源的分配偏向于輸入信號中信息最豐富的組成部分。這種機制的發(fā)展和理解一直是神經科學社區(qū)的一個長期研究領域[15,16,28],并且近年來作為深度神經網絡的一個強大補充引起了極大的興趣[20,25]。注意力已經被證明可以改善一系列任務的性能,從圖像的定位和理解[3,17]到基于序列的模型[2,24]。它通常結合門函數(例如softmax或sigmoid)和序列技術來實現(xiàn)[11,37]。最近的研究表明,它適用于圖像描述[4,44]和唇讀[7]等任務,在這些任務中利用它來有效地匯聚多模態(tài)數據。在這些應用中,它通常用在表示較高級別抽象的一個或多個層之上,以用于模態(tài)之間的適應。高速公路網絡[36]采用門機制來調節(jié)快捷連接,使得可以學習非常深的架構。Wang等人[42]受其在語義分割中的成功啟發(fā),引入了一個使用沙漏模塊[27]的強大的trunk-and-mask注意力機制。這個高容量的單元被插入到深度殘差網絡的中間階段之間。相比之下,我們提出的SE塊是一個輕量級的門機制,專門用于以計算高效的方式對通道關系進行建模,并被設計用于增強整個網絡中模塊的表示能力。

3. Squeeze-and-Excitation Blocks

The Squeeze-and-Excitation block is a computational unit which can be constructed for any given transformation $\mathbf{F}_{tr}: \mathbf{X} \rightarrow \mathbf{U}$, $\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}$, $\mathbf{U} \in \mathbb{R}^{W \times H \times C}$. For simplicity of exposition, in the notation that follows we take $\mathbf{F}_{tr}$ to be a standard convolutional operator. Let $\mathbf{V}= [\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_{C}]$ denote the learned set of filter kernels, where $\mathbf{v}_c$ refers to the parameters of the $c$-th filter. We can then write the outputs of $\mathbf{F}_{tr}$ as $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_{C}]$ where $$\mathbf{u}_c = \mathbf{v}_c \ast \mathbf{X} = \sum_{s=1}^{C'}\mathbf{v}^s_c \ast \mathbf{x}^s.$$ Here $\ast$ denotes convolution, $\mathbf{v}_c = [\mathbf{v}^1_c, \mathbf{v}^2_c, \dots, \mathbf{v}^{C'}_c]$ and $\mathbf{X} = [\mathbf{x}^1, \mathbf{x}^2, \dots, \mathbf{x}^{C'}]$ (to simplify the notation, bias terms are omitted). Here $\mathbf{v}^s_c$ is a $2$D spatial kernel, and therefore represents a single channel of $\mathbf{v}_c$ which acts on the corresponding channel of $\mathbf{X}$. Since the output is produced by a summation through all channels, the channel dependencies are implicitly embedded in $\mathbf{v}_c$, but these dependencies are entangled with the spatial correlation captured by the filters. Our goal is to ensure that the network is able to increase its sensitivity to informative features so that they can be exploited by subsequent transformations, and to suppress less useful ones. We propose to achieve this by explicitly modelling channel interdependencies to recalibrate filter responses in two steps, squeeze and excitation, before they are fed into the next transformation. A diagram of an SE building block is shown in Fig.1.

3. Squeeze-and-Excitation塊

Squeeze-and-Excitation塊是一個計算單元,可以為任何給定的變換構建:$\mathbf{F}_{tr}: \mathbf{X} \rightarrow \mathbf{U}$,$\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}$,$\mathbf{U} \in \mathbb{R}^{W \times H \times C}$。為了簡化說明,在接下來的表示中,我們將$\mathbf{F}_{tr}$看作一個標準的卷積算子。設$\mathbf{V}= [\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_{C}]$表示學習到的一組濾波器核,$\mathbf{v}_c$指的是第$c$個濾波器的參數。然后我們可以將$\mathbf{F}_{tr}$的輸出寫作$\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_{C}]$,其中$$\mathbf{u}_c = \mathbf{v}_c \ast \mathbf{X} = \sum_{s=1}^{C'}\mathbf{v}^s_c \ast \mathbf{x}^s.$$這里$\ast$表示卷積,$\mathbf{v}_c = [\mathbf{v}^1_c, \mathbf{v}^2_c, \dots, \mathbf{v}^{C'}_c]$,$\mathbf{X} = [\mathbf{x}^1, \mathbf{x}^2, \dots, \mathbf{x}^{C'}]$(為了簡化表示,忽略偏置項)。這里$\mathbf{v}^s_c$是一個$2$D空間核,因此表示$\mathbf{v}_c$的一個單通道,作用于$\mathbf{X}$的對應通道。由于輸出是通過所有通道的求和產生的,所以通道依賴性被隱式地嵌入到$\mathbf{v}_c$中,但是這些依賴性與濾波器捕獲的空間相關性糾纏在一起。我們的目標是確保網絡能夠提高對有信息量特征的敏感度,以便后續(xù)變換可以利用這些特征,并抑制不太有用的特征。我們提出通過顯式建模通道相互依賴性,在濾波器響應進入下一個變換之前,分squeeze和excitation兩步對其進行重新校準來實現(xiàn)這一點。SE構建塊的示意圖如圖1所示。

3.1. Squeeze: Global Information Embedding

In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operate with a local receptive field and consequently each unit of the transformation output $\mathbf{U}$ is unable to exploit contextual information outside of this region. This is an issue that becomes more severe in the lower layers of the network whose receptive field sizes are small.

3.1. Squeeze:全局信息嵌入

為了解決利用通道依賴性的問題,我們首先考慮輸出特征中每個通道的信號。每個學習到的濾波器都對局部感受野進行操作,因此變換輸出$\mathbf{U}$的每個單元都無法利用該區(qū)域之外的上下文信息。這個問題在感受野尺寸很小的網絡較低層中變得更加嚴重。

To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic $\mathbf{z} \in \mathbb{R}^{C}$ is generated by shrinking $\mathbf{U}$ through spatial dimensions $W \times H$, where the $c$-th element of $\mathbf{z}$ is calculated by: $$z_c = \mathbf{F}_{sq}(\mathbf{u}_c) = \frac{1}{W \times H}\sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i,j).$$

為了減輕這個問題,我們提出將全局空間信息壓縮成一個通道描述符。這是通過使用全局平均池化生成通道統(tǒng)計實現(xiàn)的。形式上,統(tǒng)計$\mathbf{z} \in \mathbb{R}^{C}$是通過在空間維度$W \times H$上收縮$\mathbf{U}$生成的,其中$\mathbf{z}$的第$c$個元素通過下式計算:$$z_c = \mathbf{F}_{sq}(\mathbf{u}_c) = \frac{1}{W \times H}\sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i,j).$$
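To make the squeeze step concrete, the following is a minimal sketch in PyTorch (the choice of framework is ours; the paper does not prescribe an implementation). It computes exactly the channel descriptor $\mathbf{z}$ defined above, one global average per channel:

```python
import torch

def squeeze(u: torch.Tensor) -> torch.Tensor:
    """Global average pooling over the spatial dimensions.

    u: feature maps of shape (N, C, H, W); returns z of shape (N, C),
    i.e. one statistic z_c per channel, as in the equation above.
    """
    return u.mean(dim=(2, 3))

z = squeeze(torch.randn(2, 64, 56, 56))  # -> shape (2, 64)
```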

Discussion. The transformation output $\mathbf{U}$ can be interpreted as a collection of the local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in feature engineering work [31, 34, 45]. We opt for the simplest, global average pooling, while more sophisticated aggregation strategies could be employed here as well.

討論。變換輸出$\mathbf{U}$可以被解釋為局部描述子的集合,這些描述子的統(tǒng)計信息對于整個圖像是有表現(xiàn)力的。利用這些信息在特征工程工作中很普遍[31,34,45]。我們選擇了最簡單的全局平均池化,同時這里也可以采用更復雜的匯聚策略。

3.2. Excitation: Adaptive Recalibration

To make use of the information aggregated in the squeeze operation, we follow it with a second operation which aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a nonlinear interaction between channels) and second, it must learn a non-mutually-exclusive relationship as multiple channels are allowed to be emphasised opposed to one-hot activation. To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation: $$\mathbf{s} = \mathbf{F}_{ex}(\mathbf{z}, \mathbf{W}) = \sigma(g(\mathbf{z}, \mathbf{W})) = \sigma(\mathbf{W}_2\delta(\mathbf{W}_1\mathbf{z}))$$ where $\delta$ refers to the ReLU[26] function, $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully-connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with parameters $\mathbf{W}_1$ with reduction ratio $r$ (we set it to be 16, and this parameter choice is discussed in Sec.6.3), a ReLU and then a dimensionality-increasing layer with parameters $\mathbf{W}_2$. The final output of the block is obtained by rescaling the transformation output $\mathbf{U}$ with the activations: $$\widetilde{\mathbf{x}}_c = \mathbf{F}_{scale}(\mathbf{u}_c, s_c) = s_c \cdot \mathbf{u}_c$$ where $\widetilde{\mathbf{X}} = [\widetilde{\mathbf{x}}_1, \widetilde{\mathbf{x}}_2, \dots, \widetilde{\mathbf{x}}_{C}]$ and $\mathbf{F}_{scale}(\mathbf{u}_c, s_c)$ refers to channel-wise multiplication between the feature map $\mathbf{u}_c \in \mathbb{R}^{W \times H}$ and the scalar $s_c$.

3.2. Excitation:自適應重新校準

為了利用squeeze操作中匯聚的信息,我們接下來通過第二個操作來全面捕獲通道依賴性。為了實現(xiàn)這個目標,這個函數必須符合兩個標準:第一,它必須是靈活的(特別是它必須能夠學習通道之間的非線性交互);第二,它必須學習一種非互斥的關系,因為與獨熱激活相反,這里允許強調多個通道。為了滿足這些標準,我們選擇采用一個帶sigmoid激活的簡單門機制:$$\mathbf{s} = \mathbf{F}_{ex}(\mathbf{z}, \mathbf{W}) = \sigma(g(\mathbf{z}, \mathbf{W})) = \sigma(\mathbf{W}_2\delta(\mathbf{W}_1\mathbf{z}))$$其中$\delta$是指ReLU[26]函數,$\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$,$\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$。為了限制模型復雜度和輔助泛化,我們通過在非線性周圍形成一個帶兩個全連接(FC)層的瓶頸來參數化門機制,即一個參數為$\mathbf{W}_1$、減少比率為$r$的降維層(我們將其設置為16,這個參數選擇在6.3節(jié)中討論),一個ReLU,然后是一個參數為$\mathbf{W}_2$的升維層。塊的最終輸出通過用激活來重新調節(jié)變換輸出$\mathbf{U}$得到:$$\widetilde{\mathbf{x}}_c = \mathbf{F}_{scale}(\mathbf{u}_c, s_c) = s_c \cdot \mathbf{u}_c$$其中$\widetilde{\mathbf{X}} = [\widetilde{\mathbf{x}}_1, \widetilde{\mathbf{x}}_2, \dots, \widetilde{\mathbf{x}}_{C}]$,$\mathbf{F}_{scale}(\mathbf{u}_c, s_c)$指的是特征映射$\mathbf{u}_c \in \mathbb{R}^{W \times H}$和標量$s_c$之間的逐通道乘法。
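Putting the squeeze and excitation equations together, a minimal PyTorch sketch of the whole block might look as follows; `reduction` plays the role of $r$, and the two `nn.Linear` layers correspond to $\mathbf{W}_1$ and $\mathbf{W}_2$ (PyTorch adds bias terms by default, which the paper's notation omits):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of an SE block: squeeze (global average pooling), excitation
    (FC -> ReLU -> FC -> sigmoid bottleneck), then channel-wise rescaling."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W_1: reduce to C/r
        self.fc2 = nn.Linear(channels // reduction, channels)  # W_2: restore to C
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                              # squeeze: (N, C)
        s = self.sigmoid(self.fc2(self.relu(self.fc1(z))))  # excitation: (N, C)
        return u * s.view(n, c, 1, 1)                       # F_scale: rescale U
```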

Discussion. The activations act as channel weights adapted to the input-specific descriptor $\mathbf{z}$. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, helping to boost feature discriminability.

討論。這些激活作為適應輸入特定描述符$\mathbf{z}$的通道權重。在這方面,SE塊本質上引入了以輸入為條件的動態(tài)特性,有助于提高特征的判別力。

3.3. Exemplars: SE-Inception and SE-ResNet

The flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by integrating SE blocks into two popular network families of architectures, Inception and ResNet. SE blocks are constructed for the Inception network by taking the transformation $\mathbf{F}_{tr}$ to be an entire Inception module (see Fig.2). By making this change for each such module in the architecture, we construct an SE-Inception network.

Figure 2

Figure 2. The schema of the original Inception module (left) and the SE-Inception module (right).

3.3. 實例:SE-Inception和SE-ResNet

SE塊的靈活性意味著它可以直接應用于標準卷積之外的變換。為了說明這一點,我們通過將SE塊集成到兩個流行的網絡架構系列Inception和ResNet中來開發(fā)SENets。通過將變換$\mathbf{F}_{tr}$取為一個完整的Inception模塊(參見圖2),為Inception網絡構建SE塊。通過對架構中每個這樣的模塊進行這種更改,我們構建了一個SE-Inception網絡。

Figure 2

圖2。最初的Inception模塊(左)和SE-Inception模塊(右)的示意圖。

Residual networks and their variants have shown to be highly effective at learning deep representations. We develop a series of SE blocks that integrate with ResNet [9], ResNeXt [43] and Inception-ResNet [38] respectively. Fig.3 depicts the schema of an SE-ResNet module. Here, the SE block transformation $\mathbf{F}_{tr}$ is taken to be the non-identity branch of a residual module. Squeeze and excitation both act before summation with the identity branch.

Figure 3

Figure 3. The schema of the original Residual module (left) and the SE-ResNet module (right).

殘差網絡及其變種已經證明在學習深度表示方面非常有效。我們開發(fā)了一系列SE塊,分別與ResNet[9]、ResNeXt[43]和Inception-ResNet[38]集成。圖3描述了SE-ResNet模塊的示意圖。在這里,SE塊變換$\mathbf{F}_{tr}$被取為殘差模塊的非恒等分支。squeeze和excitation都在與恒等分支相加之前起作用。

Figure 3

圖3。最初的Residual模塊(左)和SE-ResNet模塊(右)的示意圖。
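As a sketch of the placement described above (squeeze and excitation acting before summation with the identity), an SE-ResNet module can be expressed as a thin wrapper; `residual_branch` and `se_block` stand for the non-identity convolution stack and an SE block such as the one sketched in Sec. 3.2, and the shortcut is assumed to be a pure identity (no projection):

```python
import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    """Sketch of an SE-ResNet module: SE recalibrates the non-identity
    branch, and only then is the identity shortcut added."""

    def __init__(self, residual_branch: nn.Module, se_block: nn.Module):
        super().__init__()
        self.residual_branch = residual_branch  # e.g. 1x1 -> 3x3 -> 1x1 convs
        self.se_block = se_block                # e.g. SEBlock from Sec. 3.2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.se_block(self.residual_branch(x))  # recalibrate before summation
        return out + x  # identity shortcut (assumes matching shapes)
```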

4. Model and Computational Complexity

An SENet is constructed by stacking a set of SE blocks. In practice, it is generated by replacing each original block (i.e. residual block) with its corresponding SE counterpart (i.e. SE-residual block). We describe the architecture of SE-ResNet-50 and SE-ResNeXt-50 in Table 1.

Table 1

Table 1. (Left) ResNet-50. (Middle) SE-ResNet-50. (Right) SE-ResNeXt-50 with a $32\times 4d$ template. The shapes and operations with specific parameter settings of a residual building block are listed inside the brackets and the number of stacked blocks in a stage is presented outside. The inner brackets following fc indicate the output dimension of the two fully connected layers in an SE module.

4. 模型和計算復雜度

SENet通過堆疊一組SE塊來構建。實際上,它是通過用對應的SE部分(即SE殘差塊)替換每個原始塊(即殘差塊)而產生的。我們在表1中描述了SE-ResNet-50和SE-ResNeXt-50的架構。

Table 1

表1。(左)ResNet-50。(中)SE-ResNet-50。(右)具有$32\times 4d$模板的SE-ResNeXt-50。括號內列出了殘差構建塊特定參數設置的形狀和操作,括號外給出了一個階段中堆疊塊的數量。fc后面的內括號表示SE模塊中兩個全連接層的輸出維度。

For the proposed SE block to be viable in practice, it must provide an acceptable model complexity and computational overhead which is important for scalability. To illustrate the cost of the module, we take the comparison between ResNet-50 and SE-ResNet-50 as an example, where the accuracy of SE-ResNet-50 is obviously superior to ResNet-50 and approaching a deeper ResNet-101 network (shown in Table 2). ResNet-50 requires $\sim$3.86 GFLOPs in a single forward pass for a $224\times224$ pixel input image. Each SE block makes use of a global average pooling operation in the squeeze phase and two small fully connected layers in the excitation phase, followed by an inexpensive channel-wise scaling operation. In aggregate, SE-ResNet-50 requires $\sim$3.87 GFLOPs, corresponding to only a $0.26%$ relative increase over the original ResNet-50.

Table 2

Table 2. Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. The original column refers to the results reported in the original papers. To enable a fair comparison, we re-train the baseline models and report the scores in the re-implementation column. The SENet column refers to the corresponding architectures in which SE blocks have been added. The numbers in brackets denote the performance improvement over the re-implemented baselines. † indicates that the model has been evaluated on the non-blacklisted subset of the validation set (this is discussed in more detail in [38]), which may slightly improve results.

要使提出的SE塊在實踐中可行,它必須提供可接受的模型復雜度和計算開銷,這對于可擴展性很重要。為了說明模塊的成本,我們以ResNet-50和SE-ResNet-50之間的比較為例,其中SE-ResNet-50的精確度明顯優(yōu)于ResNet-50,并接近更深的ResNet-101網絡(如表2所示)。對于$224\times 224$像素的輸入圖像,ResNet-50單次前向傳播需要$\sim$3.86 GFLOPs。每個SE塊在squeeze階段使用一個全局平均池化操作,在excitation階段使用兩個小的全連接層,接下來是一個廉價的逐通道縮放操作。總的來說,SE-ResNet-50需要$\sim$3.87 GFLOPs,相對于原始的ResNet-50只增加了$0.26%$。

Table 2

表2。ImageNet驗證集上的單裁剪圖像錯誤率(%)和復雜度比較。original列是指原始論文中報告的結果。為了進行公平比較,我們重新訓練了基準模型,并在re-implementation列中報告分數。SENet列是指添加了SE塊后對應的架構。括號內的數字表示相對于重新實現(xiàn)的基準的性能改善。†表示該模型已經在驗證集的非黑名單子集上進行了評估(在[38]中有更詳細的討論),這可能稍微改善結果。

In practice, with a training mini-batch of $256$ images, a single pass forwards and backwards through ResNet-50 takes $190$ms, compared to $209$ms for SE-ResNet-50 (both timings are performed on a server with $8$ NVIDIA Titan X GPUs). We argue that it is a reasonable overhead as global pooling and small inner-product operations are less optimised in existing GPU libraries. Moreover, due to its importance for embedded device applications, we also benchmark CPU inference time for each model: for a $224\times 224$ pixel input image, ResNet-50 takes $164$ms, compared to $167$ms for SE-ResNet-50. The small additional computational overhead required by the SE block is justified by its contribution to model performance (discussed in detail in Sec. 6).

在實踐中,對于大小為256張圖像的訓練小批量數據,ResNet-50的一次前向傳播和反向傳播花費$190$ms,而SE-ResNet-50則花費$209$ms(兩個時間都在具有$8$個NVIDIA Titan X GPU的服務器上測得)。我們認為這是一個合理的開銷,因為在現(xiàn)有的GPU庫中,全局池化和小型內積操作的優(yōu)化程度較低。此外,由于其對嵌入式設備應用的重要性,我們還對每個模型的CPU推斷時間進行了基準測試:對于$224\times 224$像素的輸入圖像,ResNet-50花費$164$ms,相比之下,SE-ResNet-50花費$167$ms。SE塊所需的少量額外計算開銷對于其對模型性能的貢獻來說是合理的(在第6節(jié)中詳細討論)。
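For readers who want to reproduce this kind of comparison, a rough single-image timing harness is sketched below (our own illustration, not the paper's benchmarking code; torchvision's stock ResNet-50 stands in for the models):

```python
import time
import torch
from torchvision.models import resnet50

model = resnet50().eval()
x = torch.randn(1, 3, 224, 224)  # one 224x224 input image

with torch.no_grad():
    for _ in range(5):            # warm-up passes
        model(x)
    start = time.perf_counter()
    for _ in range(20):
        model(x)
    per_pass = (time.perf_counter() - start) / 20

print(f"{per_pass * 1e3:.1f} ms per CPU forward pass")
```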

Next, we consider the additional parameters introduced by the proposed block. All additional parameters are contained in the two fully connected layers of the gating mechanism, which constitute a small fraction of the total network capacity. More precisely, the number of additional parameters introduced is given by: $$\frac{2}{r} \sum_{s=1}^S N_s \cdot {C_s}^2$$ where $r$ denotes the reduction ratio (we set $r$ to $16$ in all our experiments), $S$ refers to the number of stages (where each stage refers to the collection of blocks operating on feature maps of a common spatial dimension), $C_s$ denotes the dimension of the output channels for stage $s$ and $N_s$ refers to the repeated block number. In total, SE-ResNet-50 introduces $\sim$2.5 million additional parameters beyond the $\sim$25 million parameters required by ResNet-50, corresponding to a $\sim 10%$ increase in the total number of parameters. The majority of these additional parameters come from the last stage of the network, where excitation is performed across the greatest channel dimensions. However, we found that the comparatively expensive final stage of SE blocks could be removed at a marginal cost in performance ($<0.1%$ top-1 error on ImageNet dataset) to reduce the relative parameter increase to $\sim 4%$, which may prove useful in cases where parameter usage is a key consideration.

接下來,我們考慮所提出的塊引入的附加參數。所有附加參數都包含在門機制的兩個全連接層中,只構成網絡總容量的一小部分。更確切地說,引入的附加參數的數量由下式給出:$$\frac{2}{r} \sum_{s=1}^S N_s \cdot {C_s}^2$$其中$r$表示減少比率(我們在所有的實驗中將$r$設置為$16$),$S$指的是階段數量(每個階段是指在共同空間維度的特征映射上運行的塊的集合),$C_s$表示階段$s$的輸出通道的維度,$N_s$表示重復塊的數量。總的來說,SE-ResNet-50在ResNet-50所需的$\sim$2500萬參數之外引入了$\sim$250萬附加參數,相當于參數總量增加了$\sim 10%$。這些附加參數中的大部分來自于網絡的最后階段,其中excitation在最大的通道維度上執(zhí)行。然而,我們發(fā)現(xiàn)SE塊相對昂貴的最后階段可以以很小的性能代價(ImageNet數據集上top-1錯誤率增加$<0.1%$)被移除,將相對參數增量減少到$\sim 4%$,這在參數使用是關鍵考慮因素的情況下可能是有用的。
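The $\sim$2.5 million figure can be checked directly from the formula. Plugging in the standard ResNet-50 stage configuration (taken from [9]: $N_s = 3, 4, 6, 3$ blocks with $C_s = 256, 512, 1024, 2048$ output channels) with $r = 16$:

```python
# Worked check of the additional-parameter formula for SE-ResNet-50.
r = 16
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]  # (N_s, C_s) per stage

extra = (2 / r) * sum(n * c ** 2 for n, c in stages)
print(f"{extra / 1e6:.2f}M additional parameters")  # ~2.51M, matching the text
```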

5. Implementation

During training, we follow standard practice and perform data augmentation with random-size cropping [39] to $224\times 224$ pixels ($299\times 299$ for Inception-ResNet-v2 [38] and SE-Inception-ResNet-v2) and random horizontal flipping. Input images are normalised through mean channel subtraction. In addition, we adopt the data balancing strategy described in [32] for mini-batch sampling to compensate for the uneven distribution of classes. The networks are trained on our distributed learning system “ROCS” which is capable of handing efficient parallel training of large networks. Optimisation is performed using synchronous SGD with momentum 0.9 and a mini-batch size of 1024 (split into sub-batches of 32 images per GPU across 4 servers, each containing 8 GPUs). The initial learning rate is set to 0.6 and decreased by a factor of 10 every 30 epochs. All models are trained for 100 epochs from scratch, using the weight initialisation strategy described in [8].

5. 實現(xiàn)

在訓練過程中,我們遵循標準做法,使用隨機大小裁剪[39]到$224\times 224$像素($299\times 299$用于Inception-ResNet-v2[38]和SE-Inception-ResNet-v2)和隨機水平翻轉進行數據增強。輸入圖像通過減去通道均值進行歸一化。另外,我們采用[32]中描述的數據均衡策略進行小批量采樣,以補償類別的不均勻分布。網絡在我們的分布式學習系統(tǒng)“ROCS”上進行訓練,該系統(tǒng)能夠處理大型網絡的高效并行訓練。使用同步SGD進行優(yōu)化,動量為0.9,小批量數據大小為1024(分布在4個服務器上,每個服務器包含8個GPU,每個GPU上是32張圖像的子批次)。初始學習率設為0.6,每30個迭代周期減小10倍。所有模型都使用[8]中描述的權重初始化策略從零開始訓練100個迭代周期。
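A sketch of this optimisation schedule in PyTorch is shown below (the distributed "ROCS" system itself is not public, so only the SGD settings are reproduced; the model here is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for an SENet
optimizer = torch.optim.SGD(model.parameters(), lr=0.6, momentum=0.9)
# divide the learning rate by 10 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... one epoch of synchronous SGD over mini-batches of 1024 images,
    # calling optimizer.step() after each backward pass ...
    scheduler.step()  # apply the decay at epoch boundaries
```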

6. Experiments

In this section we conduct extensive experiments on the ImageNet 2012 dataset [30] for the purposes: first, to explore the impact of the proposed SE block for the basic networks with different depths and second, to investigate its capacity of integrating with current state-of-the-art network architectures, which aim to a fair comparison between SENets and non-SENets rather than pushing the performance. Next, we present the results and details of the models for ILSVRC 2017 classification task. Furthermore, we perform experiments on the Places365-Challenge scene classification dataset [48] to investigate how well SENets are able to generalise to other datasets. Finally, we investigate the role of excitation and give some analysis based on experimental phenomena.

6. 實驗

在這一部分,我們在ImageNet 2012數據集[30]上進行了大量的實驗,其目的是:首先,探索提出的SE塊對不同深度基礎網絡的影響;其次,研究它與當前最先進網絡架構集成的能力,旨在對SENets和非SENets進行公平比較,而不是追求性能極限。接下來,我們將介紹ILSVRC 2017分類任務模型的結果和細節(jié)。此外,我們在Places365-Challenge場景分類數據集[48]上進行了實驗,以研究SENets能否很好地泛化到其它數據集。最后,我們研究激勵的作用,并根據實驗現(xiàn)象給出一些分析。

6.1. ImageNet Classification

The ImageNet 2012 dataset is comprised of 1.28 million training images and 50K validation images from 1000 classes. We train networks on the training set and report the top-1 and the top-5 errors using centre crop evaluations on the validation set, where $224\times 224$ pixels are cropped from each image whose shorter edge is first resized to 256 ($299\times 299$ from each image whose shorter edge is first resized to 352 for Inception-ResNet-v2 and SE-Inception-ResNet-v2).

6.1. ImageNet分類

ImageNet 2012數據集包含來自1000個類別的128萬張訓練圖像和5萬張驗證圖像。我們在訓練集上訓練網絡,并在驗證集上使用中心裁剪圖像評估來報告top-1和top-5錯誤率:每張圖像的短邊首先縮放到256,然后從中裁剪出$224\times 224$個像素(對于Inception-ResNet-v2和SE-Inception-ResNet-v2,每張圖像的短邊首先縮放到352,然后裁剪出$299\times 299$個像素)。
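The evaluation preprocessing described above corresponds to a standard resize-then-centre-crop pipeline; one way to express it, assuming torchvision, is:

```python
from torchvision import transforms

eval_transform = transforms.Compose([
    transforms.Resize(256),      # resize the shorter edge to 256
    transforms.CenterCrop(224),  # crop the central 224x224 region
    transforms.ToTensor(),
])
# For Inception-ResNet-v2 and SE-Inception-ResNet-v2:
# transforms.Resize(352) followed by transforms.CenterCrop(299).
```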

Network depth. We first compare the SE-ResNet against a collection of standard ResNet architectures. Each ResNet and its corresponding SE-ResNet are trained with identical optimisation schemes. The performance of the different networks on the validation set is shown in Table 2, which shows that SE blocks consistently improve performance across different depths with an extremely small increase in computational complexity.

網絡深度。我們首先將SE-ResNet與一系列標準ResNet架構進行比較。每個ResNet及其相應的SE-ResNet都使用相同的優(yōu)化方案進行訓練。不同網絡在驗證集上的性能如表2所示,它表明SE塊在不同深度上始終如一地提高性能,而計算復雜度的增加極小。

Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of $6.62%$, exceeding ResNet-50 ($7.48%$) by $0.86%$ and approaching the performance achieved by the much deeper ResNet-101 network ($6.52%$ top-5 error) with only half of the computational overhead ($3.87$ GFLOPs vs. $7.58$ GFLOPs). This pattern is repeated at greater depth, where SE-ResNet-101 ($6.07%$ top-$5$ error) not only matches, but outperforms the deeper ResNet-152 network ($6.34%$ top-5 error) by $0.27%$. Fig.4 depicts the training and validation curves of SE-ResNets and ResNets, respectively. While it should be noted that the SE blocks themselves add depth, they do so in an extremely computationally efficient manner and yield good returns even at the point at which extending the depth of the base architecture achieves diminishing returns. Moreover, we see that the performance improvements are consistent through training across a range of different depths, suggesting that the improvements induced by SE blocks can be used in combination with adding more depth to the base architecture.

Figure 4

Figure 4. Training curves on ImageNet. (Left): ResNet-50 and SE-ResNet-50; (Right): ResNet-152 and SE-ResNet-152.

值得注意的是,SE-ResNet-50實現(xiàn)了$6.62%$的單裁剪圖像top-5驗證錯誤率,超過了ResNet-50($7.48%$)$0.86%$,并接近更深的ResNet-101網絡($6.52%$的top-5錯誤率)的性能,而計算開銷只有后者的一半($3.87$ GFLOPs vs. $7.58$ GFLOPs)。這種模式在更大的深度上重復出現(xiàn):SE-ResNet-101($6.07%$的top-5錯誤率)不僅匹配,而且超過了更深的ResNet-152網絡($6.34%$的top-5錯誤率)$0.27%$。圖4分別描繪了SE-ResNets和ResNets的訓練和驗證曲線。雖然應該注意到SE塊本身增加了深度,但是它們以極高的計算效率做到這一點,即使在擴展基礎架構的深度已經收益遞減的點上也能產生良好的回報。而且,我們看到在一系列不同深度的訓練中,性能改進是一致的,這表明SE塊引起的改進可以與增加基礎架構的深度結合使用。

Figure 4

圖4。ImageNet上的訓練曲線。(左):ResNet-50和SE-ResNet-50;(右):ResNet-152和SE-ResNet-152。

Integration with modern architectures. We next investigate the effect of combining SE blocks with another two state-of-the-art architectures, Inception-ResNet-v2 [38] and ResNeXt [43]. The Inception architecture constructs modules of convolutions as multibranch combinations of factorised filters, reflecting the Inception hypothesis [6] that spatial correlations and cross-channel correlations can be mapped independently. In contrast, the ResNeXt architecture asserts that richer representations can be obtained by aggregating combinations of sparsely connected (in the channel dimension) convolutional features. Both approaches introduce prior-structured correlations in modules. We construct SENet equivalents of these networks, SE-Inception-ResNet-v2 and SE-ResNeXt (the configuration of SE-ResNeXt-50 ($32\times 4d$) is given in Table 1). Like previous experiments, the same optimisation scheme is used for both the original networks and their SENet counterparts.

與現(xiàn)代架構集成。接下來我們將研究SE塊與另外兩種最先進的架構Inception-ResNet-v2[38]和ResNeXt[43]結合的效果。Inception架構將卷積模塊構造為分解濾波器的多分支組合,反映了Inception假設[6],即空間相關性和跨通道相關性可以被獨立地映射。相比之下,ResNeXt架構斷言,可以通過聚合稀疏連接(在通道維度上)的卷積特征的組合來獲得更豐富的表示。兩種方法都在模塊中引入了先驗結構化的相關性。我們構造了這些網絡的SENet等價物,SE-Inception-ResNet-v2和SE-ResNeXt(表1給出了SE-ResNeXt-50($32\times 4d$)的配置)。像前面的實驗一樣,原始網絡和它們對應的SENet網絡都使用相同的優(yōu)化方案。

The results given in Table 2 illustrate the significant performance improvement induced by SE blocks when introduced into both architectures. In particular, SE-ResNeXt-50 has a top-5 error of $5.49%$ which is superior to both its direct counterpart ResNeXt-50 ($5.90%$ top-5 error) as well as the deeper ResNeXt-101 ($5.57%$ top-5 error), a model which has almost double the number of parameters and computational overhead. As for the experiments of Inception-ResNet-v2, we conjecture the difference of cropping strategy might lead to the gap between their reported result and our re-implemented one, as their original image size has not been clarified in [38] while we crop the $299\times 299$ region from a relative larger image (where the shorter edge is resized to 352). SE-Inception-ResNet-v2 ($4.79%$ top-5 error) outperforms our reimplemented Inception-ResNet-v2 ($5.21%$ top-5 error) by $0.42%$ (a relative improvement of $8.1%$) as well as the reported result in [38]. The optimisation curves for each network are depicted in Fig. 5, illustrating the consistency of the improvement yielded by SE blocks throughout the training process.

Figure 5

Figure 5. Training curves on ImageNet. (Left): ResNeXt-50 and SE-ResNeXt-50; (Right): Inception-ResNet-v2 and SE-Inception-ResNet-v2.

表2中給出的結果說明,將SE塊引入這兩種架構都會帶來顯著的性能改善。尤其是SE-ResNeXt-50的top-5錯誤率是$5.49%$,優(yōu)于它直接對應的ResNeXt-50($5.90%$的top-5錯誤率)以及更深的ResNeXt-101($5.57%$的top-5錯誤率),而后者的參數和計算開銷幾乎是前者的兩倍。對于Inception-ResNet-v2的實驗,我們猜測可能是裁剪策略的差異導致了其報告結果與我們重新實現(xiàn)的結果之間的差距,因為它們的原始圖像大小尚未在[38]中說明,而我們是從相對較大的圖像(其中較短邊被縮放到352)中裁剪出$299\times 299$大小的區(qū)域。SE-Inception-ResNet-v2($4.79%$的top-5錯誤率)比我們重新實現(xiàn)的Inception-ResNet-v2($5.21%$的top-5錯誤率)低$0.42%$(相對改進了$8.1%$),也優(yōu)于[38]中報告的結果。每個網絡的優(yōu)化曲線如圖5所示,說明了在整個訓練過程中SE塊產生了一致的改進。

Figure 5

圖5。ImageNet上的訓練曲線。(左):ResNeXt-50和SE-ResNeXt-50;(右):Inception-ResNet-v2和SE-Inception-ResNet-v2。

Finally, we assess the effect of SE blocks when operating on a non-residual network by conducting experiments with the BN-Inception architecture [14] which provides good performance at a lower model complexity. The results of the comparison are shown in Table 2 and the training curves are shown in Fig. 6, exhibiting the same phenomena that emerged in the residual architectures. In particular, SE-BN-Inception achieves a lower top-5 error of $7.14%$ in comparison to BN-Inception whose error rate is $7.89%$. These experiments demonstrate that improvements induced by SE blocks can be used in combination with a wide range of architectures. Moreover, this result holds for both residual and non-residual foundations.

Figure 6

Figure 6. Training curves of BN-Inception and SE-BN-Inception on ImageNet.

最后,我們通過對BN-Inception架構[14]進行實驗來評估SE塊在非殘差網絡上的效果,該架構在較低的模型復雜度下提供了良好的性能。比較結果如表2所示,訓練曲線如圖6所示,表現(xiàn)出與殘差架構中相同的現(xiàn)象。尤其是與錯誤率為$7.89%$的BN-Inception相比,SE-BN-Inception獲得了更低的top-5錯誤率$7.14%$。這些實驗表明SE塊帶來的改進可以與多種架構結合使用。而且,這個結果對殘差和非殘差基礎都成立。

Figure 6

圖6。BN-Inception和SE-BN-Inception在ImageNet上的訓練曲線。

Results on ILSVRC 2017 Classification Competition. ILSVRC [30] is an annual computer vision competition which has proved to be a fertile ground for model developments in image classification. The training and validation data of the ILSVRC 2017 classification task are drawn from the ImageNet 2012 dataset, while the test set consists of an additional unlabelled 100K images. For the purposes of the competition, the top-5 error metric is used to rank entries.

ILSVRC 2017分類競賽的結果。ILSVRC[30]是一個年度計算機視覺競賽,已被證明是圖像分類模型發(fā)展的沃土。ILSVRC 2017分類任務的訓練和驗證數據來自ImageNet 2012數據集,而測試集包含額外的10萬張未標記圖像。在競賽中,使用top-5錯誤率指標對參賽條目進行排名。

SENets formed the foundation of our submission to the challenge where we won first place. Our winning entry comprised a small ensemble of SENets that employed a standard multi-scale and multi-crop fusion strategy to obtain a $2.251%$ top-5 error on the test set. This result represents a $\sim 25%$ relative improvement on the winning entry of 2016 ($2.99%$ top-5 error). One of our high-performing networks is constructed by integrating SE blocks with a modified ResNeXt [43] (details of the modifications are provided in Appendix A). We compare the proposed architecture with the state-of-the-art models on the ImageNet validation set in Table 3. Our model achieves a top-1 error of $18.68%$ and a top-5 error of $4.47%$ using a $224\times 224$ centre crop evaluation on each image (where the shorter edge is first resized to 256). To enable a fair comparison with previous models, we also provide a $320\times 320$ centre crop evaluation, obtaining the lowest error rate under both the top-1 ($17.28%$) and the top-5 ($3.79%$) error metrics.

Table 3

Table 3. Single-crop error rates of state-of-the-art CNNs on ImageNet validation set. The size of test crop is $224\times 224$ and $320\times 320$/$299\times299$ as in [10]. Our proposed model, SENet, shows a significant performance improvement on prior work.

SENets是我們在該挑戰(zhàn)賽中贏得第一名的提交的基礎。我們的獲勝條目由一小組SENets的集成構成,采用標準的多尺度和多裁剪圖像融合策略,在測試集上獲得了$2.251%$的top-5錯誤率。這個結果相對于2016年的獲勝條目($2.99%$的top-5錯誤率)改進了$\sim 25%$。我們的高性能網絡之一是通過將SE塊與一個修改后的ResNeXt[43]集成構建的(附錄A提供了這些修改的細節(jié))。在表3中,我們將提出的架構與最新的模型在ImageNet驗證集上進行了比較。我們的模型在每張圖像使用$224\times 224$中心裁剪評估(短邊首先縮放到256)時取得了$18.68%$的top-1錯誤率和$4.47%$的top-5錯誤率。為了與以前的模型進行公平比較,我們也提供了$320\times 320$的中心裁剪圖像評估,在top-1($17.28%$)和top-5($3.79%$)錯誤率指標下均獲得了最低的錯誤率。

Table 3

表3。最新的CNNs在ImageNet驗證集上單裁剪圖像的錯誤率。測試的裁剪圖像大小是$224\times 224$,以及如[10]中的$320\times 320$/$299\times299$。與之前的工作相比,我們提出的模型SENet表現(xiàn)出了顯著的性能改進。

6.2. Scene Classification

Large portions of the ImageNet dataset consist of images dominated by single objects. To evaluate our proposed model in more diverse scenarios, we also evaluate it on the Places365-Challenge dataset [48] for scene classification. This dataset comprises 8 million training images and 36,500 validation images across 365 categories. Relative to classification, the task of scene understanding can provide a better assessment of the ability of a model to generalise well and handle abstraction, since it requires the capture of more complex data associations and robustness to a greater level of appearance variation.

6.2. 場景分類

ImageNet數據集的大部分由單個對象占主導的圖像組成。為了在更多樣的場景下評估我們提出的模型,我們還在Places365-Challenge數據集[48]上對場景分類進行評估。該數據集包含800萬張訓練圖像和365個類別的36500張驗證圖像。相對于分類,場景理解的任務可以更好地評估模型泛化和處理抽象的能力,因為它需要捕獲更復雜的數據關聯(lián),并對更大程度的外觀變化具有魯棒性。

We use ResNet-152 as a strong baseline to assess the effectiveness of SE blocks and follow the evaluation protocol in [33]. Table 4 shows the results of training a ResNet-152 model and a SE-ResNet-152 for the given task. Specifically, SE-ResNet-152 ($11.01%$ top-5 error) achieves a lower validation error than ResNet-152 ($11.61%$ top-5 error), providing evidence that SE blocks can perform well on different datasets. This SENet surpasses the previous state-of-the-art model Places-365-CNN [33] which has a top-5 error of $11.48%$ on this task.

Table 4

Table 4. Single-crop error rates (%) on the Places365 validation set.

我們使用ResNet-152作為強基線來評估SE塊的有效性,并遵循[33]中的評估協(xié)議。表4顯示了針對給定任務訓練ResNet-152模型和SE-ResNet-152模型的結果。具體而言,SE-ResNet-152($11.01%$的top-5錯誤率)取得了比ResNet-152($11.61%$的top-5錯誤率)更低的驗證錯誤率,證明SE塊可以在不同的數據集上表現(xiàn)良好。這個SENet超過了先前最先進的模型Places-365-CNN[33],后者在這個任務上有$11.48%$的top-5錯誤率。

Table 4

表4。Places365驗證集上的單裁剪圖像錯誤率(%)。

6.3. Analysis and Discussion

Reduction ratio. The reduction ratio $r$ introduced in Eqn. (5) is an important hyperparameter which allows us to vary the capacity and computational cost of the SE blocks in the model. To investigate this relationship, we conduct experiments based on the SE-ResNet-50 architecture for a range of different $r$ values. The comparison in Table 5 reveals that performance does not improve monotonically with increased capacity. This is likely to be a result of enabling the SE block to overfit the channel interdependencies of the training set. In particular, we found that setting $r=16$ achieved a good tradeoff between accuracy and complexity and consequently, we used this value for all experiments.

Table 5

Table 5. Single-crop error rates (%) on the ImageNet validation set and corresponding model sizes for the SE-ResNet-50 architecture at different reduction ratios $r$. Here original refers to ResNet-50.

6.3. 分析和討論

減少比率。公式(5)中引入的減少比率$r$是一個重要的超參數,它允許我們改變模型中SE塊的容量和計算成本。為了研究這種關系,我們基于SE-ResNet-50架構對一系列不同的$r$值進行了實驗。表5中的比較表明,性能并沒有隨著容量的增加而單調提升,這可能是SE塊過擬合了訓練集的通道相互依賴性的結果。特別地,我們發(fā)現(xiàn)設置$r=16$在精度和復雜度之間取得了很好的平衡,因此我們將這個值用于所有的實驗。

Table 5

表5。ImageNet驗證集上單裁剪圖像的錯誤率(%),以及SE-ResNet-50架構在不同減少比率$r$下對應的模型大小。這里original指的是ResNet-50。
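Since the two FC layers are the only source of extra parameters, the effect of $r$ on model size can be read off the formula from Sec. 4 directly; a small sweep (same ResNet-50 stage configuration assumption as before) gives:

```python
# Extra parameters of SE-ResNet-50 as the reduction ratio r varies.
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]  # (N_s, C_s)
for r in (4, 8, 16, 32):
    extra = (2 / r) * sum(n * c ** 2 for n, c in stages)
    print(f"r={r:>2}: ~{extra / 1e6:.1f}M extra parameters")
```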

The role of Excitation. While SE blocks have been empirically shown to improve network performance, we would also like to understand how the self-gating excitation mechanism operates in practice. To provide a clearer picture of the behaviour of SE blocks, in this section we study example activations from the SE-ResNet-50 model and examine their distribution with respect to different classes at different blocks. Specifically, we sample four classes from the ImageNet dataset that exhibit semantic and appearance diversity, namely goldfish, pug, plane and cliff (example images from these classes are shown in Fig. 7). We then draw fifty samples for each class from the validation set and compute the average activations for fifty uniformly sampled channels in the last SE block in each stage (immediately prior to downsampling) and plot their distribution in Fig. 8. For reference, we also plot the distribution of average activations across all 1000 classes.

Figure 7

Figure 7. Example images from the four classes of ImageNet.

Figure 8

Figure 8. Activations induced by Excitation in the different modules of SE-ResNet-50 on ImageNet. The module is named as “SE stageID blockID”.

激勵的作用。雖然SE塊已經從經驗上被證明可以改善網絡性能,但我們也想了解自門激勵機制在實踐中是如何運作的。為了更清楚地描述SE塊的行為,本節(jié)我們研究SE-ResNet-50模型的樣本激活,并考察它們在不同塊、不同類別下的分布情況。具體而言,我們從ImageNet數據集中抽取了四個表現(xiàn)出語義和外觀多樣性的類別,即金魚、哈巴狗、飛機和懸崖(圖7中顯示了這些類別的示例圖像)。然后,我們從驗證集中為每個類別抽取50個樣本,并計算每個階段最后的SE塊(緊接在下采樣之前)中50個均勻采樣通道的平均激活,并在圖8中繪制它們的分布。作為參考,我們也繪制了所有1000個類別的平均激活分布。

Figure 7

圖7。ImageNet中四個類別的示例圖像。

Figure 8

圖8。SE-ResNet-50不同模塊在ImageNet上由Excitation引起的激活。模塊以“SE stageID blockID”命名。

We make the following three observations about the role of Excitation in SENets. First, the distribution across different classes is nearly identical in lower layers, e.g. SE_2_3. This suggests that the importance of feature channels is likely to be shared by different classes in the early stages of the network. Interestingly however, the second observation is that at greater depth, the value of each channel becomes much more class-specific as different classes exhibit different preferences to the discriminative value of features e.g. SE_4_6 and SE_5_1. The two observations are consistent with findings in previous work [21, 46], namely that lower layer features are typically more general (i.e. class agnostic in the context of classification) while higher layer features have greater specificity. As a result, representation learning benefits from the recalibration induced by SE blocks which adaptively facilitates feature extraction and specialisation to the extent that it is needed. Finally, we observe a somewhat different phenomena in the last stage of the network. SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to 1 and the remainder are close to 0. At the point at which all activations take the value 1, this block would become a standard residual block. At the end of the network in the SE_5_3 (which is immediately followed by global pooling prior before classifiers), a similar pattern emerges over different classes, up to a slight change in scale (which could be tuned by the classifiers). This suggests that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network. This finding is consistent with the result of the empirical investigation in Sec. 4 which demonstrated that the overall parameter count could be significantly reduced by removing the SE blocks for the last stage with only a marginal loss of performance (< $0.1%$ top-1 error).

我們對SENets中Excitation的作用提出以下三點觀察。首先,不同類別的分布在較低層中幾乎相同,例如SE_2_3。這表明在網絡的早期階段,特征通道的重要性很可能由不同的類別共享。然而有趣的是,第二個觀察結果是,在更大的深度,每個通道的值變得更具類別特定性,因為不同類別對特征的判別性價值表現(xiàn)出不同的偏好,例如SE_4_6和SE_5_1。這兩個觀察結果與以前的研究結果一致[21,46],即低層特征通常更通用(即在分類的背景下是類別不可知的),而高層特征具有更高的特異性。因此,表示學習從SE塊引起的重新校準中受益,重新校準自適應地把特征提取和特化促進到所需要的程度。最后,我們在網絡的最后階段觀察到一個有些不同的現(xiàn)象。SE_5_2呈現(xiàn)出趨向飽和狀態(tài)的有趣趨勢,其中大部分激活接近于1,其余激活接近于0。在所有激活都取值1的點上,該塊將成為標準殘差塊。在網絡末端的SE_5_3中(緊接其后是分類器之前的全局池化),類似的模式出現(xiàn)在不同的類別上,只是尺度上有輕微的變化(可以由分類器調整)。這表明,SE_5_2和SE_5_3在為網絡提供重新校準方面不如前面的塊重要。這一發(fā)現(xiàn)與第4節(jié)實證研究的結果一致,即通過移除最后階段的SE塊,總體參數數量可以顯著減少,而性能損失很小(<$0.1%$的top-1錯誤率)。
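One way to reproduce this kind of analysis is to attach forward hooks to the sigmoid of each SE block and average the recorded excitation vectors over images of one class; the sketch below is our own illustration (the module-name filter is hypothetical and depends on how the SE-ResNet-50 is actually built):

```python
import torch.nn as nn

model = nn.Identity()  # placeholder for a trained SE-ResNet-50
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations.setdefault(name, []).append(output.detach())  # s: (N, C)
    return hook

for name, module in model.named_modules():
    if name.endswith("se.sigmoid"):  # hypothetical naming of SE sigmoid layers
        module.register_forward_hook(make_hook(name))

# Feeding fifty validation images of one class through `model` and averaging
# the recorded tensors over the batch yields curves like those in Fig. 8.
```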

7. Conclusion

In this paper we proposed the SE block, a novel architectural unit designed to improve the representational capacity of a network by enabling it to perform dynamic channel-wise feature recalibration. Extensive experiments demonstrate the effectiveness of SENets which achieve state-of-the-art performance on multiple datasets. In addition, they provide some insight into the limitations of previous architectures in modelling channel-wise feature dependencies, which we hope may prove useful for other tasks requiring strong discriminative features. Finally, the feature importance induced by SE blocks may be helpful to related fields such as network pruning for compression.

7. 結論

在本文中,我們提出了SE塊,這是一種新穎的架構單元,旨在通過使網絡能夠執(zhí)行動態(tài)的逐通道特征重新校準來提高網絡的表示能力。大量實驗證明了SENets的有效性,其在多個數據集上取得了最先進的性能。此外,它們還提供了一些關于以前架構在建模逐通道特征依賴性上的局限性的洞察,我們希望這些洞察對其它需要強判別性特征的任務也有用。最后,由SE塊引起的特征重要性可能對相關領域有幫助,例如用于壓縮的網絡剪枝。

Acknowledgements. We would like to thank Professor Andrew Zisserman for his helpful comments and Samuel Albanie for his discussions and writing edit for the paper. We would like to thank Chao Li for his contributions in the memory optimisation of the training system. Li Shen is supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract number 2014-14071600010. The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon.

致謝。我們要感謝Andrew Zisserman教授的有益評論,并感謝Samuel Albanie的討論和對論文寫作的校訂。我們要感謝Chao Li在訓練系統(tǒng)內存優(yōu)化方面的貢獻。Li Shen由國家情報總監(jiān)辦公室(ODNI)下屬的情報高級研究計劃署(IARPA)資助,合同號為2014-14071600010。本文包含的觀點和結論屬于作者,不應被解釋為必然代表ODNI、IARPA或美國政府明示或暗示的官方政策或認可。盡管有任何版權注釋,美國政府有權為政府目的復制和分發(fā)重印本。

References

[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.

[2] T. Bluche. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. In NIPS, 2016.

[3] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, D. Ramanan, and T. S. Huang. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV, 2015.

[4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, 2017.

[5] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. arXiv:1707.01629, 2017.

[6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.

[7] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In CVPR, 2017.

[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.

[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.

[12] G. Huang, Z. Liu, K. Q. Weinberger, and L. Maaten. Densely connected convolutional networks. In CVPR, 2017.

[13] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. In CVPR, 2017.

[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[15] L. Itti and C. Koch. Computational modelling of visual attention. Nature reviews neuroscience, 2001.

[16] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 1998.

[17] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.

[18] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[20] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order boltzmann machine. In NIPS, 2010.

[21] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

[22] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.

[23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[24] A. Miech, I. Laptev, and J. Sivic. Learnable pooling with context gating for video classification. arXiv:1706.06905, 2017.

[25] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.

[26] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.

[27] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.

[28] B. A. Olshausen, C. H. Anderson, and D. C. V. Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 1993.

[29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.

[31] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the fisher vector: Theory and practice. RR-8209, INRIA, 2013.

[32] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 2016.

[33] L. Shen, Z. Lin, G. Sun, and J. Hu. Places401 and places365 models. https://github.com/lishen-shirley/Places2-CNNs, 2016.

[34] L. Shen, G. Sun, Q. Huang, S. Wang, Z. Lin, and E. Wu. Multi-level discriminative dictionary learning with application to large scale image classification. IEEE TIP, 2015.

[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[36] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.

[37] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, 2014.

[38] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv:1602.07261, 2016.

[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.

[41] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.

[42] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.

[43] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[44] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

[45] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.

[46] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.

[47] X. Zhang, Z. Li, C. C. Loy, and D. Lin. Polynet: A pursuit of structural diversity in very deep networks. In CVPR, 2017.

[48] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE TPAMI, 2017.

A. ILSVRC 2017 Classification Competition Entry Details

The SENet in Table 3 is constructed by integrating SE blocks to a modified version of the $64\times 4d$ ResNeXt-152 that extends the original ResNeXt-101 [43] by following the block stacking of ResNet-152 [9]. More differences to the design and training (beyond the use of SE blocks) were as follows: (a) The number of first $1\times 1$ convolutional channels for each bottleneck building block was halved to reduce the computation cost of the network with a minimal decrease in performance. (b) The first $7\times 7$ convolutional layer was replaced with three consecutive $3\times 3$ convolutional layers. (c) The down-sampling projection $1\times 1$ with stride-2 convolution was replaced with a $3\times 3$ stride-2 convolution to preserve information. (d) A dropout layer (with a drop ratio of 0.2) was inserted before the classifier layer to prevent overfitting. (e) Label-smoothing regularisation (as introduced in [40]) was used during training. (f) The parameters of all BN layers were frozen for the last few training epochs to ensure consistency between training and testing. (g) Training was performed with 8 servers (64 GPUs) in parallelism to enable a large batch size (2048) and initial learning rate of 1.0.

A. ILSVRC 2017分類競賽提交細節(jié)

表3中的SENet是通過將SE塊集成到$64\times 4d$的ResNeXt-152的修改版本中構建的,該版本遵循ResNet-152[9]的塊堆疊方式擴展了原始的ResNeXt-101[43]。除了使用SE塊之外,設計和訓練上的更多差異如下:(a)每個瓶頸構建塊的第一個$1\times 1$卷積的通道數量減半,以在性能下降最小的情況下降低網絡的計算成本。(b)第一個$7\times 7$卷積層被三個連續(xù)的$3\times 3$卷積層所取代。(c)步長為2的$1\times 1$下采樣投影卷積被替換為步長為2的$3\times 3$卷積,以保留信息。(d)在分類器層之前插入一個dropout層(丟棄比例為0.2)以防止過擬合。(e)訓練期間使用標簽平滑正則化(如[40]中所介紹的)。(f)在最后幾個訓練迭代周期中,所有BN層的參數都被凍結,以確保訓練和測試之間的一致性。(g)使用8個服務器(64個GPU)并行執(zhí)行訓練,以實現(xiàn)大的批數據大小(2048),初始學習率為1.0。
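As one concrete illustration of modification (b), the single $7\times 7$ stem convolution can be replaced by three stacked $3\times 3$ convolutions; the sketch below is an assumption about the exact channel widths, which the appendix does not specify:

```python
import torch.nn as nn

# Hypothetical stem: three 3x3 convolutions in place of the original 7x7.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
```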
