文章作者:Tyan
博客:noahsnail.com | CSDN | 簡書
声明:作者翻译论文仅为学习,如有侵权请联系作者删除博文,谢谢!
翻譯論文匯總:https://github.com/SnailTyan/deep-learning-papers-translation
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Abstract
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
摘要
训练深度神经网络的复杂性在于,每层输入的分布在训练过程中会发生变化,因为前面的层的参数会发生变化。这通过要求较低的学习率和仔细的参数初始化减慢了训练,并且使具有饱和非线性的模型训练起来非常困难。我们将这种现象称为内部协变量转移,并通过标准化层输入来解决这个问题。我们方法的优势在于使标准化成为模型架构的一部分,并对每个训练小批量数据执行标准化。批标准化使我们能够使用更高的学习率,并且不用太关注初始化。它也起到正则化的作用,在某些情况下可以消除对Dropout的需求。将批标准化应用到最先进的图像分类模型上,在取得相同精度的情况下只需1/14的训练步骤,并以显著的优势击败了原始模型。使用批标准化网络的组合,我们改进了在ImageNet分类上已公布的最佳结果:达到了4.9%的top-5验证误差(以及4.8%的测试误差),超过了人类评估者的准确率。
1. Introduction
Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters $\Theta$ of the network, so as to minimize the loss
$$\Theta = \arg \min_\Theta \frac{1}{N}\sum_{i=1}^N \ell(x_i, \Theta)$$
where $x_{1\ldots N}$ is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch $x_{1\ldots m}$ of size $m$. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters, by computing $\frac {1} {m} \sum _{i=1} ^m \frac {\partial \ell(x_i, \Theta)} {\partial \Theta}$. Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than $m$ computations for individual examples, due to the parallelism afforded by the modern computing platforms.
1. 引言
深度学习在视觉、语音等诸多方面显著提高了现有技术的水平。随机梯度下降(SGD)已经被证明是训练深度网络的有效方式,并且已经使用诸如动量(Sutskever等,2013)和Adagrad(Duchi等,2011)等SGD变种取得了最先进的性能。SGD优化网络参数$\Theta$,以最小化损失
$$\Theta = \arg \min_\Theta \frac{1}{N}\sum_{i=1}^N \ell(x_i, \Theta)$$
$x_{1\ldots N}$是训练数据集。使用SGD,训练将逐步进行,在每一步中,我们考虑一个大小为$m$的小批量数据$x_{1 \ldots m}$。通过计算$\frac {1} {m} \sum _{i=1} ^m \frac {\partial \ell(x_i, \Theta)} {\partial \Theta}$,使用小批量数据来近似损失函数关于参数的梯度。使用小批量样本,而不是一次一个样本,在一些方面是有帮助的。首先,小批量数据上损失的梯度是整个训练集上梯度的估计,其质量随着批量大小的增加而改善。第二,由于现代计算平台提供的并行性,对一个批量的计算比单个样本计算$m$次效率更高。
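为了更直观地说明小批量梯度估计与SGD更新,下面给出一段简短的NumPy示意代码(非论文原文;其中的线性模型、平方损失和学习率等均为举例假设):

```python
import numpy as np

def sgd_step(theta, x_batch, y_batch, lr=0.1):
    """对一个小批量数据执行一步SGD(以线性模型加平方损失为例)。

    用 (1/m) * sum_i d loss(x_i)/d theta 在小批量数据上近似整个训练集上的梯度。
    """
    m = x_batch.shape[0]
    pred = x_batch @ theta                     # 线性模型的预测
    grad = x_batch.T @ (pred - y_batch) / m    # 小批量数据上的平均梯度
    return theta - lr * grad

# 玩具数据:y 约等于 3x
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
y = 3 * x[:, 0] + 0.1 * rng.normal(size=1000)
theta = np.zeros(1)
for _ in range(500):
    idx = rng.choice(1000, size=32, replace=False)   # 大小为 m=32 的小批量数据
    theta = sgd_step(theta, x[idx], y[idx])
print(theta)   # 接近 [3.]
```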
While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers —— so that small changes to the network parameters amplify as the network becomes deeper.
虽然随机梯度是简单有效的,但它需要仔细调整模型的超参数,特别是优化中使用的学习率以及模型参数的初始值。训练的复杂性在于每层的输入受到前面所有层的参数的影响——因此当网络变得更深时,网络参数的微小变化就会被放大。
The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing $$\ell = F_2(F_1(u, \Theta_1), \Theta_2)$$ where $F_1$ and $F_2$ are arbitrary transformations, and the parameters $\Theta_1, \Theta_2$ are to be learned so as to minimize the loss $\ell$. Learning $\Theta_2$ can be viewed as if the inputs $x=F_1(u,\Theta_1)$ are fed into the sub-network $$\ell = F_2(x, \Theta_2).$$
层输入分布的变化带来了一个问题,因为这些层需要不断适应新的分布。当学习系统的输入分布发生变化时,我们说它经历了协变量转移(Shimodaira,2000)。这通常通过域适应(Jiang,2008)来处理。然而,协变量转移的概念可以从整个学习系统扩展到它的组成部分,例如一个子网络或一层。考虑网络计算$$\ell = F_2(F_1(u, \Theta_1), \Theta_2)$$其中$F_1$和$F_2$是任意变换,需要学习参数$\Theta_1,\Theta_2$以便最小化损失$\ell$。学习$\Theta_2$可以看作把输入$x=F_1(u,\Theta_1)$送入到子网络$$\ell = F_2(x, \Theta_2).$$
For example, a gradient descent step $$\Theta_2\leftarrow \Theta_2 - \frac {\alpha} {m} \sum_{i=1}^m \frac {\partial F_2(x_i,\Theta_2)} {\partial \Theta_2}$$ (for batch size $m$ and learning rate $\alpha$) is exactly equivalent to that for a stand-alone network $F_2$ with input $x$. Therefore, the input distribution properties that make training more efficient —— such as having the same distribution between the training and test data —— apply to training the sub-network as well. As such it is advantageous for the distribution of $x$ to remain fixed over time. Then, $\Theta_2$ does not have to readjust to compensate for the change in the distribution of $x$.
例如,梯度下降步骤$$\Theta_2\leftarrow \Theta_2 - \frac {\alpha} {m} \sum_{i=1}^m \frac {\partial F_2(x_i,\Theta_2)} {\partial \Theta_2}$$(批量大小为$m$,学习率为$\alpha$)与输入为$x$的单独网络$F_2$的梯度下降步骤完全等价。因此,那些使训练更有效的输入分布特性(例如训练数据和测试数据具有相同的分布)也适用于子网络的训练。因此,$x$的分布随时间保持固定是有利的。这样,$\Theta_2$就不必为了补偿$x$分布的变化而重新调整。
Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network, as well. Consider a layer with a sigmoid activation function $z = g(Wu+b)$ where $u$ is the layer input, the weight matrix $W$ and bias vector $b$ are the layer parameters to be learned, and $g(x) = \frac{1}{1+\exp(-x)}$. As $|x|$ increases, $g'(x)$ tends to zero. This means that for all dimensions of $x=Wu+b$ except those with small absolute values, the gradient flowing down to $u$ will vanish and the model will train slowly. However, since $x$ is affected by $W, b$ and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of $x$ into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010) $ReLU(x)=\max(x,0)$, careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
子网络输入的固定分布对于子网络外的层也会有积极的影响。考虑一个带有sigmoid激活函数$z = g(Wu+b)$的层,其中$u$是层输入,权重矩阵$W$和偏置向量$b$是要学习的层参数,$g(x) = \frac{1}{1+\exp(-x)}$。随着$|x|$的增加,$g'(x)$趋向于0。这意味着对于$x=Wu+b$的所有维度,除了那些具有较小绝对值的维度之外,流向$u$的梯度将会消失,模型的训练会很缓慢。然而,由于$x$受$W,b$以及下面所有层的参数的影响,训练期间这些参数的改变可能会将$x$的许多维度移动到非线性的饱和区域,从而减慢收敛。这个影响随着网络深度的增加而放大。在实践中,饱和问题和由此产生的梯度消失通常通过使用修正线性单元(Nair & Hinton, 2010)$ReLU(x)=\max(x,0)$、仔细的初始化(Bengio & Glorot, 2010; Saxe et al., 2013)和小的学习率来解决。然而,如果我们能保证非线性输入的分布在网络训练时保持更稳定,那么优化器将不太可能陷入饱和状态,训练也将加速。
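下面用几行代码直观展示sigmoid在$|x|$较大时梯度趋于零的现象(非论文原文,仅作示意):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # g'(x) = g(x)(1 - g(x))

# 随着 |x| 增大,g'(x) 趋向于 0,流经该非线性的梯度随之消失
for v in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={v:5.1f}  g'(x)={sigmoid_grad(v):.6f}")
# x=  0.0  g'(x)=0.250000
# x=  2.0  g'(x)=0.104994
# x=  5.0  g'(x)=0.006648
# x= 10.0  g'(x)=0.000045
```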
We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
我们把训练过程中深度网络内部节点分布的变化称为内部协变量转移。消除它有望带来更快的训练。我们提出了一种新的机制,称之为批标准化,它朝着减少内部协变量转移迈出了一步,并由此显著加速了深度神经网络的训练。它通过一个标准化步骤来实现,该步骤固定了层输入的均值和方差。批标准化还减少了梯度对参数尺度或其初始值的依赖,从而对通过网络的梯度流动产生有益的影响。这使我们能够使用更高的学习率而没有发散的风险。此外,批标准化对模型起到正则化的作用,并减少了对Dropout(Srivastava et al., 2014)的需求。最后,批标准化通过防止网络陷入饱和模式,使得使用饱和非线性成为可能。
In Sec. 4.2, we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. Using an ensemble of such networks trained with Batch Normalization, we achieve the top-5 error rate that improves upon the best known results on ImageNet classification.
在4.2小节,我们将批标准化应用到性能最好的ImageNet分类网络上,并且表明我们只需7%的训练步骤就可以达到其性能,并且可以进一步大幅超过其准确率。使用通过批标准化训练的网络的组合,我们取得的top-5错误率改进了ImageNet分类上已知的最佳结果。
2. Towards Reducing Internal Covariate Shift
We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs $x$ as the training progresses, we expect to improve the training speed. It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.
2. 减少内部协变量转移
我们将内部协变量转移定义为:由于训练过程中网络参数的变化而引起的网络激活分布的变化。为了改善训练,我们寻求减少内部协变量转移。随着训练的进行,通过固定层输入$x$的分布,我们期望提高训练速度。众所周知(LeCun et al., 1998b; Wiesler & Ney, 2011),如果网络的输入被白化——即经线性变换使其具有零均值和单位方差,并且去相关——那么网络训练会收敛得更快。由于每一层观察到的输入都是由其下面的层产生的,因此对每一层的输入做同样的白化将是有利的。通过白化每一层的输入,我们将朝着实现输入的固定分布迈出一步,从而消除内部协变量转移的不良影响。
We could consider whitening activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network activation values (Wiesler et al., 2014; Raiko et al., 2012; Povey et al., 2014; Desjardins & Kavukcuoglu). However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step. For example, consider a layer with the input $u$ that adds the learned bias $b$, and normalizes the result by subtracting the mean of the activation computed over the training data: $\hat x=x - E[x]$ where $x = u+b$, $X={x_{1\ldots N}}$ is the set of values of $x$ over the training set, and $E[x] = \frac{1}{N}\sum_{i=1}^N x_i$. If a gradient descent step ignores the dependence of $E[x]$ on $b$, then it will update $b\leftarrow b+\Delta b$, where $\Delta b\propto -\partial{\ell}/\partial{\hat x}$. Then $u+(b+\Delta b) -E[u+(b+\Delta b)] = u+b-E[u+b]$. Thus, the combination of the update to $b$ and subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss. As the training continues, $b$ will grow indefinitely while the loss remains fixed. This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.
我们可以考虑在每个训练步骤或以某个间隔对激活值进行白化,方法是直接修改网络,或者根据网络的激活值来更改优化算法的参数(Wiesler et al., 2014; Raiko et al., 2012; Povey et al., 2014; Desjardins & Kavukcuoglu)。然而,如果这些修改穿插在优化步骤之间,那么梯度下降步骤可能会试图以需要更新标准化的方式来更新参数,这会削弱梯度步骤的效果。例如,考虑一个层,它的输入为$u$,加上学习到的偏置$b$,并通过减去在训练数据上计算的激活值均值来对结果进行标准化:$\hat x=x - E[x]$,其中$x = u+b$,$X=\lbrace x_{1\ldots N}\rbrace$是训练集上$x$值的集合,$E[x] = \frac{1}{N}\sum_{i=1}^N x_i$。如果梯度下降步骤忽略了$E[x]$对$b$的依赖,那么它将更新$b\leftarrow b+\Delta b$,其中$\Delta b\propto -\partial{\ell}/\partial{\hat x}$。那么$u+(b+\Delta b) -E[u+(b+\Delta b)] = u+b-E[u+b]$。因此,$b$的更新与随后标准化的改变相结合,导致层的输出没有变化,损失也因此没有变化。随着训练的继续,$b$将无限增长而损失保持不变。如果标准化不仅中心化而且还缩放激活值,这个问题会变得更糟。我们在最初的实验中已经观察到了这一点:当标准化参数在梯度下降步骤之外计算时,模型会发生爆炸。
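下面用一个极简的数值例子示意这一段描述的现象(非论文原文,其中的损失函数和数据均为假设):当标准化在梯度下降步骤之外计算、且梯度忽略$E[x]$对$b$的依赖时,$b$会不断增长而损失保持不变。

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=64)                 # 固定的层输入
t = rng.normal(loc=1.0, size=64)        # 均值非零的任意目标(玩具损失用)
b, lr = 0.0, 0.1

for step in range(4):
    x = u + b
    x_hat = x - x.mean()                # 标准化在梯度步骤之外计算
    loss = 0.5 * np.sum((x_hat - t) ** 2)
    # 忽略 E[x] 对 b 的依赖的朴素更新:Delta b 正比于 -dloss/dx_hat 在批量上的和
    grad_b = np.sum(x_hat - t)
    b -= lr * grad_b
    print(f"step {step}: loss={loss:.3f}  b={b:.3f}")
# 输出中 loss 始终不变(减均值抵消了 b 的更新),而 b 却不断增长
```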
The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters $\Theta$. Let again $x$ be a layer input, treated as a vector, and $\cal X$ be the set of these inputs over the training data set. The normalization can then be written as a transformation $$\hat x=Norm(x, \cal X)$$ which depends not only on the given training example $x$ but on all examples $\cal X$ -- each of which depends on $\Theta$ if $x$ is generated by another layer. For backpropagation, we would need to compute the Jacobians $\frac {\partial Norm(x,\cal X)} {\partial x}$ and $\frac {\partial Norm(x,\cal X)} {\partial \cal X}$; ignoring the latter term would lead to the explosion described above. Within this framework, whitening the layer inputs is expensive, as it requires computing the covariance matrix $Cov[x]=E_{x\in \cal X}[x x^T]- E[x]E[x]^T$ and its inverse square root, to produce the whitened activations $Cov[x]^{-1/2}(x-E[x])$, as well as the derivatives of these transforms for backpropagation. This motivates us to seek an alternative that performs input normalization in a way that is differentiable and does not require the analysis of the entire training set after every parameter update.
上述方法的问题在于,梯度下降优化没有考虑到标准化会发生这一事实。为了解决这个问题,我们希望确保对于任何参数值,网络总是产生具有所需分布的激活值。这样做将允许损失关于模型参数的梯度考虑到标准化,以及它对模型参数$\Theta$的依赖。再次设$x$为层的输入,将其看作向量,$\cal X$是这些输入在训练数据集上的集合。标准化可以写为变换$$\hat x=Norm(x, \cal X)$$它不仅依赖于给定的训练样本$x$,而且依赖于所有样本$\cal X$——如果$x$是由另一层生成的,那么它们中的每一个都依赖于$\Theta$。对于反向传播,我们需要计算雅可比矩阵$\frac {\partial Norm(x,\cal X)} {\partial x}$和$\frac {\partial Norm(x,\cal X)} {\partial \cal X}$;忽略后一项会导致上面描述的爆炸。在这个框架中,白化层输入的代价是昂贵的,因为它需要计算协方差矩阵$Cov[x]=E_{x\in \cal X}[x x^T]- E[x]E[x]^T$及其平方根的逆,从而生成白化的激活$Cov[x]^{-1/2}(x-E[x])$,以及这些变换用于反向传播的导数。这促使我们寻求一种替代方案,它以可微分的方式执行输入标准化,并且不需要在每次参数更新后分析整个训练集。
Some of the previous approaches (e.g. (Lyu & Simoncelli, 2008)) use statistics computed over a single training example, or, in the case of image networks, over different feature maps at a given location. However, this changes the representation ability of a network by discarding the absolute scale of activations. We want to preserve the information in the network, by normalizing the activations in a training example relative to the statistics of the entire training data.
以前的一些方法(例如(Lyu & Simoncelli, 2008))使用在单个训练样本上计算的统计信息,或者在图像网络的情况下,使用给定位置处不同特征图上的统计信息。然而,这通过丢弃激活值的绝对尺度改变了网络的表示能力。我们希望保留网络中的信息,方法是相对于整个训练数据的统计信息来标准化单个训练样本中的激活值。
3. Normalization via Mini-Batch Statistics
Since the full whitening of each layer's inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and unit variance. For a layer with $d$-dimensional input $x = (x^{(1)}\ldots x^{(d)})$, we will normalize each dimension $$\hat x^{(k)} = \frac{x^{(k)} - E[x^{(k)}]} {\sqrt {Var[x^{(k)}]}}$$ where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated.
3. 通过Mini-Batch统计进行标准化
由于对每一层的输入进行完全白化代价昂贵,并且不是处处可微的,因此我们做了两个必要的简化。第一个简化是,我们将单独标准化每个标量特征,使其具有零均值和单位方差,而不是对层输入和输出中的特征进行联合白化。对于具有$d$维输入$x = (x^{(1)}\ldots x^{(d)})$的层,我们将标准化每一维$$\hat x^{(k)} = \frac{x^{(k)} - E[x^{(k)}]} {\sqrt {Var[x^{(k)}]}}$$其中期望和方差在整个训练数据集上计算。如(LeCun et al., 1998b)所示,这种标准化可以加速收敛,即使特征没有去相关。
Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation $x^{(k)}$, a pair of parameters $\gamma^{(k)}, \beta^{(k)}$, which scale and shift the normalized value: $$y^{(k)} = \gamma^{(k)}\hat x^{(k)} + \beta^{(k)}.$$ These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting $\gamma^{(k)} = \sqrt{Var[x^{(k)}]}$ and $\beta^{(k)} = E[x^{(k)}]$, we could recover the original activations, if that were the optimal thing to do.
注意,简单地标准化层的每一个输入可能会改变该层所能表示的内容。例如,标准化sigmoid的输入会将它们约束到非线性的线性区域。为了解决这个问题,我们要确保插入到网络中的变换可以表示恒等变换。为此,对于每一个激活值$x^{(k)}$,我们引入一对参数$\gamma^{(k)},\beta^{(k)}$,它们对标准化后的值进行缩放和平移:$$y^{(k)} = \gamma^{(k)}\hat x^{(k)} + \beta^{(k)}.$$这些参数与原始的模型参数一起学习,并恢复网络的表示能力。实际上,通过设置$\gamma^{(k)} = \sqrt{Var[x^{(k)}]}$和$\beta^{(k)} = E[x^{(k)}]$,我们可以恢复原始的激活值,如果这是最优的选择的话。
In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation. This way, the statistics used for normalization can fully participate in the gradient backpropagation. Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.
在基于整个训练集的批量设置中,每个训练步骤都基于整个训练集,我们将使用整个训练集来标准化激活值。然而,当使用随机优化时,这是不切实际的。因此,我们做了第二个简化:由于我们在随机梯度训练中使用小批量数据,每个小批量数据产生每个激活的均值和方差的估计。这样,用于标准化的统计信息可以完全参与梯度反向传播。注意,小批量的使用是通过计算每一维的方差而不是联合协方差来实现的;在联合协方差的情况下将需要正则化,因为小批量数据的大小可能小于要白化的激活值的数量,从而导致奇异的协方差矩阵。
Consider a mini-batch $\cal B$ of size $m$. Since the normalization is applied to each activation independently, let us focus on a particular activation $x^{(k)}$ and omit $k$ for clarity. We have $m$ values of this activation in the mini-batch, $$\cal B=\lbrace x_{1\ldots m} \rbrace.$$ Let the normalized values be $\hat x_{1\ldots m}$, and their linear transformations be $y_{1\ldots m}$. We refer to the transform $$BN_{\gamma,\beta}: x_{1\ldots m}\rightarrow y_{1\ldots m}$$ as the Batch Normalizing Transform. We present the BN Transform in Algorithm 1. In the algorithm, $\epsilon$ is a constant added to the mini-batch variance for numerical stability.
考虑一个大小为$m$的小批量数据$\cal B$。由于标准化被独立地应用于每一个激活,所以让我们关注一个特定的激活$x^{(k)}$,为了清楚起见省略$k$。在小批量数据中我们有这个激活的$m$个值,$$\cal B=\lbrace x_{1\ldots m} \rbrace.$$设标准化后的值为$\hat x_{1\ldots m}$,它们的线性变换为$y_{1\ldots m}$。我们将变换$$BN_{\gamma,\beta}: x_{1\ldots m}\rightarrow y_{1\ldots m}$$称为批标准化变换(Batch Normalizing Transform)。我们在算法1中给出了BN变换。在算法中,$\epsilon$是一个为了数值稳定而加到小批量数据方差上的常量。
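结合这一段与算法1的描述,下面给出BN变换前向计算的一个简化NumPy示意(非论文官方实现,输入形状(m, d)等均为假设):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """小批量数据上的批标准化变换(参照算法1的描述)。

    x 的形状为 (m, d):m 个样本、d 个激活,每一维独立标准化。
    """
    mu = x.mean(axis=0)                        # 小批量均值
    var = x.var(axis=0)                        # 小批量方差
    x_hat = (x - mu) / np.sqrt(var + eps)      # 标准化
    y = gamma * x_hat + beta                   # 缩放和平移
    return y, (x, x_hat, mu, var, gamma, eps)  # 缓存供反向传播使用

# 用法示例:m=32 个样本、d=4 个激活的小批量数据
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
y, _ = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))           # 每一维约为 0 和 1
```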
The BN transform can be added to a network to manipulate any activation. In the notation $y = BN_{\gamma,\beta}(x)$, we indicate that the parameters $\gamma$ and $\beta$ are to be learned, but it should be noted that the BN transform does not independently process the activation in each training example. Rather, $BN_{\gamma,\beta}(x)$ depends both on the training example and the other examples in the mini-batch. The scaled and shifted values $y$ are passed to other network layers. The normalized activations $\hat x$ are internal to our transformation, but their presence is crucial. The distributions of values of any $\hat x$ has the expected value of $0$ and the variance of $1$, as long as the elements of each mini-batch are sampled from the same distribution, and if we neglect $\epsilon$. This can be seen by observing that $\sum_{i=1}^m \hat x_i = 0$ and $\frac {1} {m} \sum_{i=1}^m \hat x_i^2 = 1$, and taking expectations. Each normalized activation $\hat x^{(k)}$ can be viewed as an input to a sub-network composed of the linear transform $y^{(k)}=\gamma^{(k)}\hat x^{(k)}+\beta^{(k)}$, followed by the other processing done by the original network. These sub-network inputs all have fixed means and variances, and although the joint distribution of these normalized $\hat x^{(k)}$ can change over the course of training, we expect that the introduction of normalized inputs accelerates the training of the sub-network and, consequently, the network as a whole.
BN变换可以添加到网络中来处理任何激活。在记号$y = BN_{\gamma,\beta}(x)$中,我们指出参数$\gamma$和$\beta$是要学习的,但应该注意到,BN变换并不是独立地处理每个训练样本中的激活。相反,$BN_{\gamma,\beta}(x)$既取决于当前的训练样本,也取决于小批量数据中的其它样本。缩放和平移后的值$y$被传递到其它的网络层。标准化的激活值$\hat x$是我们变换的内部量,但它们的存在至关重要。只要每个小批量的元素从相同的分布中采样,并且忽略$\epsilon$,那么任何$\hat x$值的分布都具有0的期望和1的方差。这可以通过观察$\sum_{i=1}^m \hat x_i = 0$和$\frac {1} {m} \sum_{i=1}^m \hat x_i^2 = 1$并取期望看出。每一个标准化的激活值$\hat x^{(k)}$可以看作一个子网络的输入,该子网络由线性变换$y^{(k)}=\gamma^{(k)}\hat x^{(k)}+\beta^{(k)}$组成,后面是原始网络所做的其它处理。这些子网络的输入都有固定的均值和方差,尽管这些标准化后的$\hat x^{(k)}$的联合分布可能在训练过程中改变,但我们预计标准化输入的引入会加速子网络的训练,从而加速整个网络的训练。
During training we need to backpropagate the gradient of loss $\ell$ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform. We use chain rule, as follows (before simplification):
$$
\begin{align}
&\frac {\partial \ell}{\partial \hat x_i} = \frac {\partial \ell} {\partial y_i} \cdot \gamma\\
&\frac {\partial \ell}{\partial \sigma_{\cal B}^2} = \sum_{i=1}^m \frac {\partial \ell}{\partial \hat x_i}\cdot(x_i-\mu_{\cal B})\cdot \frac {-1}{2}(\sigma_{\cal B}^2+\epsilon)^{-3/2}\\
&\frac {\partial \ell}{\partial \mu_{\cal B}} = \sum_{i=1}^m \frac {\partial \ell}{\partial \hat x_i}\cdot \frac {-1} {\sqrt {\sigma_{\cal B}^2 + \epsilon}}\\
&\frac {\partial \ell}{\partial x_i} = \frac {\partial \ell}{\partial \hat x_i} \cdot \frac {1} {\sqrt {\sigma_{\cal B}^2 + \epsilon}} + \frac {\partial \ell}{\partial \sigma_{\cal B}^2} \cdot \frac {2(x_i - \mu_{\cal B})} {m} + \frac {\partial \ell} {\partial \mu_{\cal B}} \cdot \frac {1} {m}\\
&\frac {\partial \ell}{\partial \gamma} = \sum_{i=1}^m \frac {\partial \ell}{\partial y_i} \cdot \hat x_i \\
&\frac {\partial \ell}{\partial \beta} = \sum_{i=1}^m \frac {\partial \ell}{\partial y_i}
\end{align}
$$
Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training. Furthermore, the learned affine transform applied to these normalized activations allows the BN transform to represent the identity transformation and preserves the network capacity.
在训练过程中,我们需要通过这个变换反向传播损失$\ell$的梯度,并计算关于BN变换参数的梯度。我们使用链式法则,如下所示(简化之前):
$$
\begin{align}
&\frac {\partial \ell}{\partial \hat x_i} = \frac {\partial \ell} {\partial y_i} \cdot \gamma\\
&\frac {\partial \ell}{\partial \sigma_{\cal B}^2} = \sum_{i=1}^m \frac {\partial \ell}{\partial \hat x_i}\cdot(x_i-\mu_{\cal B})\cdot \frac {-1}{2}(\sigma_{\cal B}^2+\epsilon)^{-3/2}\\
&\frac {\partial \ell}{\partial \mu_{\cal B}} = \sum_{i=1}^m \frac {\partial \ell}{\partial \hat x_i}\cdot \frac {-1} {\sqrt {\sigma_{\cal B}^2 + \epsilon}}\\
&\frac {\partial \ell}{\partial x_i} = \frac {\partial \ell}{\partial \hat x_i} \cdot \frac {1} {\sqrt {\sigma_{\cal B}^2 + \epsilon}} + \frac {\partial \ell}{\partial \sigma_{\cal B}^2} \cdot \frac {2(x_i - \mu_{\cal B})} {m} + \frac {\partial \ell} {\partial \mu_{\cal B}} \cdot \frac {1} {m}\\
&\frac {\partial \ell}{\partial \gamma} = \sum_{i=1}^m \frac {\partial \ell}{\partial y_i} \cdot \hat x_i \\
&\frac {\partial \ell}{\partial \beta} = \sum_{i=1}^m \frac {\partial \ell}{\partial y_i}
\end{align}
$$
因此,BN变换是一个将标准化激活引入网络的可微变换。这确保了在模型训练时,各层可以在表现出较少内部协变量转移的输入分布上继续学习,从而加速训练。此外,应用于这些标准化激活上的学习到的仿射变换允许BN变换表示恒等变换,并保留了网络的容量。
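下面给出一段与上述链式法则公式对应的NumPy示意实现(非论文原文;张量形状、有限差分检验等均为笔者添加的假设):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x, x_hat, mu, var, gamma, eps)

def bn_backward(dy, cache):
    """按上面的链式法则公式计算BN变换的梯度。"""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    dx_hat = dy * gamma                                               # dl/dx_hat
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * inv_std**3      # dl/dsigma_B^2
    dmu = np.sum(dx_hat, axis=0) * -inv_std                           # dl/dmu_B
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m       # dl/dx_i
    dgamma = np.sum(dy * x_hat, axis=0)                               # dl/dgamma
    dbeta = np.sum(dy, axis=0)                                        # dl/dbeta
    return dx, dgamma, dbeta

# 对某个输入元素做有限差分检验
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
dy = rng.normal(size=(8, 3))                 # 上游梯度 dl/dy
_, cache = bn_forward(x, gamma, beta)
dx, _, _ = bn_backward(dy, cache)

h = 1e-5
x_p, x_m = x.copy(), x.copy()
x_p[0, 0] += h
x_m[0, 0] -= h
num = (np.sum(bn_forward(x_p, gamma, beta)[0] * dy)
       - np.sum(bn_forward(x_m, gamma, beta)[0] * dy)) / (2 * h)
print(dx[0, 0], num)                         # 两个数值应非常接近
```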
3.1. Training and Inference with Batch-Normalized Networks
To Batch-Normalize a network, we specify a subset of activations and insert the BN transform for each of them, according to Alg.1. Any layer that previously received $x$ as the input, now receives $BN(x)$. A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size $m>1$, or with any of its variants such as Adagrad (Duchi et al., 2011). The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization $$\hat x=\frac {x - E[x]} {\sqrt{Var[x] + \epsilon}}$$ using the population, rather than mini-batch, statistics. Neglecting $\epsilon$, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate $Var[x] = \frac {m} {m-1} \cdot E_\cal B[\sigma_\cal B^2]$, where the expectation is over training mini-batches of size $m$ and $\sigma_\cal B^2$ are their sample variances. Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation. It may further be composed with the scaling by $\gamma$ and shift by $\beta$, to yield a single linear transform that replaces $BN(x)$. Algorithm 2 summarizes the procedure for training batch-normalized networks.
3.1. 批标准化网络的训练和推断
为了对网络进行批标准化,我们根据算法1指定一个激活的子集,并为其中的每一个插入BN变换。任何先前接收$x$作为输入的层,现在接收$BN(x)$。采用批标准化的模型可以使用批量梯度下降、小批量数据大小$m>1$的随机梯度下降或其任何变种(例如Adagrad (Duchi et al., 2011))进行训练。依赖于小批量数据的激活值标准化使得训练更有效,但在推断过程中既无必要也不可取;我们希望输出只确定性地取决于输入。为此,一旦网络训练完成,我们使用总体统计量而不是小批量数据统计量来进行标准化:$$\hat x=\frac {x - E[x]} {\sqrt{Var[x] + \epsilon}}$$如果忽略$\epsilon$,这些标准化的激活与训练期间一样,具有相同的均值0和方差1。我们使用无偏方差估计$Var[x] = \frac {m} {m-1} \cdot E_{\cal B}[\sigma_{\cal B}^2]$,其中期望是在大小为$m$的训练小批量数据上取得的,$\sigma_{\cal B}^2$是它们的样本方差。改用移动平均,我们可以在模型训练时跟踪其准确率。由于均值和方差在推断时是固定的,因此标准化只是应用到每个激活上的一个简单线性变换。它还可以进一步与$\gamma$的缩放和$\beta$的平移复合,得到一个替代$BN(x)$的单一线性变换。算法2总结了训练批标准化网络的过程。
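下面是推断阶段BN的一个简化示意(非论文原文;其中用动量式移动平均跟踪总体统计量只是常见的一种实现选择,论文本身只要求使用总体均值与方差):

```python
import numpy as np

class BatchNormInference:
    """推断阶段的BN:使用总体统计量,并把标准化、缩放和平移
    折叠成单一的线性变换 y = a * x + c。"""

    def __init__(self, gamma, beta, pop_mean, pop_var, eps=1e-5):
        inv_std = 1.0 / np.sqrt(pop_var + eps)
        self.a = gamma * inv_std                      # 合并后的缩放
        self.c = beta - gamma * pop_mean * inv_std    # 合并后的平移

    def __call__(self, x):
        return self.a * x + self.c

# 训练时可以用移动平均跟踪总体统计量,例如(这只是常见做法之一):
#   running_mean = momentum * running_mean + (1 - momentum) * batch_mean
#   running_var  = momentum * running_var  + (1 - momentum) * batch_var

gamma, beta = np.array([1.5, 0.5]), np.array([0.0, -1.0])
pop_mean, pop_var = np.array([2.0, -3.0]), np.array([4.0, 9.0])
bn = BatchNormInference(gamma, beta, pop_mean, pop_var)
print(bn(np.array([[2.0, -3.0], [4.0, 0.0]])))   # 第一行输入等于均值,输出即 beta
```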
3.2. Batch-Normalized Convolutional Networks
Batch Normalization can be applied to any set of activations in the network. Here, we focus on transforms that consist of an affine transformation followed by an element-wise nonlinearity: $$z = g(Wu+b)$$ where $W$ and $b$ are learned parameters of the model, and $g(\cdot)$ is the nonlinearity such as sigmoid or ReLU. This formulation covers both fully-connected and convolutional layers. We add the BN transform immediately before the nonlinearity, by normalizing $x=Wu+b$. We could have also normalized the layer inputs $u$, but since $u$ is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, $Wu+b$ is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian" (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.
3.2. 批标准化卷积网络
批标准化可以应用于网络中的任何激活集合。这里我们专注于由仿射变换和逐元素非线性组成的变换:$$z = g(Wu+b)$$ 其中$W$和$b$是模型要学习的参数,$g(\cdot)$是诸如sigmoid或ReLU之类的非线性。这个公式涵盖了全连接层和卷积层。我们在非线性之前加入BN变换,即标准化$x=Wu+b$。我们也可以标准化层输入$u$,但由于$u$可能是另一个非线性的输出,其分布的形状可能在训练过程中改变,并且约束其一阶矩和二阶矩并不能消除协变量转移。相比之下,$Wu+b$更可能具有对称、非稀疏的分布,即“更高斯”(Hyvärinen & Oja, 2000);对其进行标准化可能产生具有稳定分布的激活。
Note that, since we normalize $Wu+b$, the bias $b$ can be ignored since its effect will be canceled by the subsequent mean subtraction (the role of the bias is subsumed by $\beta$ in Alg.1). Thus, $z = g(Wu+b)$ is replaced with $$z = g(BN(Wu))$$ where the BN transform is applied independently to each dimension of $x=Wu$, with a separate pair of learned parameters $\gamma^{(k)}$, $\beta^{(k)}$ per dimension.
注意,由于我们对$Wu+b$进行标准化,偏置$b$可以被忽略,因为它的作用将被随后的减均值所抵消(偏置的作用被算法1中的$\beta$所包含)。因此,$z = g(Wu+b)$被$$z = g(BN(Wu))$$替代,其中BN变换被独立地应用于$x=Wu$的每一维,每一维都有单独的一对学习参数$\gamma^{(k)}$,$\beta^{(k)}$。
For convolutional layers, we additionally want the normalization to obey the convolutional property —— so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg.1, we let $\cal B$ be the set of all values in a feature map across both the elements of a mini-batch and spatial locations —— so for a mini-batch of size $m$ and feature maps of size $p\times q$, we use the effective mini-batch of size $m'=|\cal B| = m\cdot pq$. We learn a pair of parameters $\gamma^{(k)}$ and $\beta^{(k)}$ per feature map, rather than per activation. Alg.2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
另外,对于卷积层,我们还希望标准化遵循卷积的特性——使得同一特征图中不同位置的不同元素以相同的方式被标准化。为了实现这一点,我们在所有位置上对一个小批量数据中的所有激活进行联合标准化。在算法1中,我们让$\cal B$是跨小批量数据的所有元素和所有空间位置的特征图中所有值的集合——因此对于大小为$m$的小批量数据和大小为$p\times q$的特征图,我们使用大小为$m'=|\cal B| = m\cdot pq$的有效小批量数据。我们为每个特征图而不是每个激活学习一对参数$\gamma^{(k)}$和$\beta^{(k)}$。算法2也做类似的修改,以便在推断期间,BN变换对给定特征图中的每一个激活应用同样的线性变换。
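以下是卷积特征图上BN的一个NumPy示意(非论文原文,假设输入布局为NCHW,即(m, 通道数, p, q)):

```python
import numpy as np

def batchnorm_conv(x, gamma, beta, eps=1e-5):
    """卷积特征图上的BN:统计量在小批量和所有空间位置上共享,
    每个特征图一对 (gamma, beta)。

    x 的形状为 (m, c, p, q),每个通道的有效小批量大小为 m*p*q。
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)     # 每个特征图的均值
    var = x.var(axis=(0, 2, 3), keepdims=True)     # 每个特征图的方差
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.default_rng(0).normal(size=(32, 16, 8, 8))   # m=32,16 个 8x8 特征图
y = batchnorm_conv(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=(0, 2, 3))[:3], y.var(axis=(0, 2, 3))[:3])  # 每个特征图约为 0 和 1
```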
3.3. Batch Normalization enables higher learning rates
In traditional deep networks, too high a learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes in layer parameters from amplifying as the data propagates through a deep network. For example, this enables the sigmoid nonlinearities to more easily stay in their non-saturated regimes, which is crucial for training deep sigmoid networks but has traditionally been hard to accomplish.
3.3. 批标准化允许更高的学习率
在传统的深度网络中,过高的学习率可能会导致梯度爆炸或梯度消失,以及陷入不好的局部最小值。批标准化有助于解决这些问题。通过标准化整个网络中的激活值,它可以防止层参数的微小变化在数据通过深度网络传播时被放大。例如,这使得sigmoid非线性更容易保持在它们的非饱和区域,这对训练深度sigmoid网络至关重要,但传统上很难实现。
Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters. Indeed, for a scalar $a$, $$BN(Wu) = BN((aW)u)$$ and thus $\frac {\partial BN((aW)u)} {\partial u}= \frac {\partial BN(Wu)} {\partial u} $, so the scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, $\frac {\partial BN((aW)u)} {\partial (aW)}= \frac{1}{a} \cdot \frac {\partial BN(Wu)} {\partial W}$ so larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth.
批标准化还使训练对参数的尺度更有弹性。通常,大的学习率可能会增大层参数的尺度,这会在反向传播中放大梯度并导致模型爆炸。然而,使用批标准化,通过层的反向传播不受其参数尺度的影响。实际上,对于标量$a$,$$BN(Wu) = BN((aW)u)$$因此$\frac {\partial BN((aW)u)} {\partial u}= \frac {\partial BN(Wu)} {\partial u}$,所以参数的尺度不影响层的雅可比矩阵,因此也不影响梯度传播。此外,$\frac {\partial BN((aW)u)} {\partial (aW)}=\frac {1} {a} \cdot \frac {\partial BN(Wu)} {\partial W}$,因此更大的权重会导致更小的梯度,批标准化会稳定参数的增长。
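下面用一个小的数值例子验证$BN(Wu)=BN((aW)u)$这一尺度不变性(非论文原文,这里取$\gamma=1$、$\beta=0$,忽略$\epsilon$的微小影响):

```python
import numpy as np

def bn(x, eps=1e-5):
    # 取 gamma=1、beta=0 的朴素标准化,沿小批量维度计算统计量
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 10))
W = rng.normal(size=(10, 5))
a = 7.0
print(np.max(np.abs(bn(u @ W) - bn(u @ (a * W)))))   # 接近 0:尺度 a 被抵消
```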
We further conjecture that Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training (Saxe et al., 2013). Consider two consecutive layers with normalized inputs, and the transformation between these normalized vectors: $\hat z = F(\hat x)$. If we assume that $\hat x$ and $\hat z$ are Gaussian and uncorrelated, and that $F(\hat x)\approx J \hat x$ is a linear transformation for the given model parameters, then both $\hat x$ and $\hat z$ have unit covariances, and $I=Cov[\hat z] =J Cov[\hat x] J^T = JJ^T$. Thus, $J$ is orthogonal, which preserves the gradient magnitudes during backpropagation. Although the above assumptions are not true in reality, we expect Batch Normalization to help make gradient propagation better behaved. This remains an area of further study.
我们进一步推测,批标准化可能会使层的雅可比矩阵的奇异值接近于1,这被认为对训练是有利的(Saxe et al., 2013)。考虑两个具有标准化输入的连续层,以及这些标准化向量之间的变换:$\hat z = F(\hat x)$。如果我们假设$\hat x$和$\hat z$是高斯且不相关的,并且$F(\hat x)\approx J \hat x$对于给定的模型参数是一个线性变换,那么$\hat x$和$\hat z$都具有单位协方差,并且$I=Cov[\hat z] =J Cov[\hat x] J^T = JJ^T$。因此,$J$是正交的,这在反向传播中保持了梯度的大小。尽管上述假设在现实中并不成立,但我们期望批标准化有助于使梯度传播表现得更好。这仍有待进一步研究。
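下面的小例子只是用来直观说明“正交的雅可比矩阵保持梯度大小”这一点(非论文原文,正交矩阵通过QR分解随机构造):

```python
import numpy as np

rng = np.random.default_rng(0)
J, _ = np.linalg.qr(rng.normal(size=(100, 100)))   # 随机正交矩阵 J

print(np.allclose(J @ J.T, np.eye(100)))           # JJ^T = I
g = rng.normal(size=100)                           # 反向传播到 z_hat 的某个梯度
print(np.linalg.norm(J.T @ g), np.linalg.norm(g))  # 两个范数相等:梯度大小被保持
```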
4. Experiments
4.1. Activations over time
To verify the effects of internal covariate shift on training, and the ability of Batch Normalization to combat it, we considered the problem of predicting the digit class on the MNIST dataset (LeCun et al., 1998a). We used a very simple network, with a 28x28 binary image as input, and 3 fully-connected hidden layers with 100 activations each. Each hidden layer computes $y = g(Wu+b)$ with sigmoid nonlinearity, and the weights $W$ initialized to small random Gaussian values. The last hidden layer is followed by a fully-connected layer with 10 activations (one per class) and cross-entropy loss. We trained the network for 50000 steps, with 60 examples per mini-batch. We added Batch Normalization to each hidden layer of the network, as in Sec.3.1. We were interested in the comparison between the baseline and batch-normalized networks, rather than achieving the state of the art performance on MNIST (which the described architecture does not).
4. 实验
4.1. 随时间变化的激活
为了验证内部协变量转移对训练的影响,以及批标准化对抗它的能力,我们考虑了在MNIST数据集上预测数字类别的问题(LeCun et al., 1998a)。我们使用了一个非常简单的网络:28x28的二值图像作为输入,3个全连接隐藏层,每层100个激活。每个隐藏层计算$y = g(Wu+b)$,其中$g$为sigmoid非线性,权重$W$初始化为小的随机高斯值。最后一个隐藏层之后是具有10个激活(每类1个)和交叉熵损失的全连接层。我们训练网络50000步,每个小批量数据有60个样本。如第3.1节所述,我们对网络的每个隐藏层添加批标准化。我们感兴趣的是基准网络和批标准化网络之间的比较,而不是在MNIST上达到最佳性能(所描述的架构并没有达到)。
Figure 1(a) shows the fraction of correct predictions by the two networks on held-out test data, as training progresses. The batch-normalized network enjoys the higher test accuracy. To investigate why, we studied inputs to the sigmoid, in the original network $N$ and batch-normalized network $N_{BN}^{tr}$ (Alg. 2) over the course of training. In Fig. 1(b,c) we show, for one typical activation from the last hidden layer of each network, how its distribution evolves. The distributions in the original network change significantly over time, both in their mean and the variance, which complicates the training of the subsequent layers. In contrast, the distributions in the batch-normalized network are much more stable as training progresses, which aids the training.
Figure 1. (a) The test accuracy of the MNIST network trained with and without Batch Normalization, vs. the number of training steps. Batch Normalization helps the network train faster and achieve higher accuracy. (b, c) The evolution of input distributions to a typical sigmoid, over the course of training, shown as {15, 50, 85}th percentiles. Batch Normalization makes the distribution more stable and reduces the internal covariate shift.
图1(a)显示了随着训练的进行,两个网络在留出的测试数据上正确预测的比例。批标准化网络具有更高的测试准确率。为了调查原因,我们在训练过程中研究了原始网络$N$和批标准化网络$N_{BN}^{tr}$(算法2)中sigmoid的输入。在图1(b,c)中,我们展示了每个网络最后一个隐藏层中一个典型激活的分布是如何演变的。原始网络中的分布随着时间的推移,其均值和方差都发生了显著变化,这使后续层的训练变得复杂。相比之下,随着训练的进行,批标准化网络中的分布要稳定得多,这有助于训练。
图1。(a)使用和不使用批标准化训练的MNIST网络的测试准确率与训练步骤数的关系。批标准化有助于网络训练得更快并取得更高的准确率。(b,c)一个典型sigmoid的输入分布在训练过程中的演变,显示为第15、50、85百分位数。批标准化使分布更稳定,并减少了内部协变量转移。
4.2. ImageNet classification
We applied Batch Normalization to a new variant of the Inception network (Szegedy et al., 2014), trained on the ImageNet classification task (Russakovsky et al., 2014). The network has a large number of convolutional and pooling layers, with a softmax layer to predict the image class, out of 1000 possibilities. Convolutional layers use ReLU as the nonlinearity. The main difference to the network described in (Szegedy et al., 2014) is that the 5x5 convolutional layers are replaced by two consecutive layers of 3x3 convolutions with up to 128 filters. The network contains $13.6\cdot10^6$ parameters, and, other than the top softmax layer, has no fully-connected layers. We refer to this model as Inception in the rest of the text. The training was performed on a large-scale, distributed architecture (Dean et al., 2012), using 5 concurrent steps on each of 10 model replicas, using asynchronous SGD with momentum (Sutskever et al.,2013), with the mini-batch size of 32. All networks are evaluated as training progresses by computing the validation accuracy @1, i.e. the probability of predicting the correct label out of 1000 possibilities, on a held-out set, using a single crop per image.
4.2. ImageNet分类
我们将批标准化应用于Inception网络(Szegedy等,2014)的一个新变种,该网络在ImageNet分类任务(Russakovsky等,2014)上训练。该网络具有大量的卷积层和池化层,以及一个softmax层,用于在1000种可能性中预测图像的类别。卷积层使用ReLU作为非线性。与(Szegedy等,2014)中描述的网络的主要区别是,5x5卷积层被两个连续的、最多有128个滤波器的3x3卷积层替换。该网络包含$13.6 \cdot 10^6$个参数,并且除了顶部的softmax层之外没有全连接层。在后文中我们将这个模型称为Inception。训练在大规模分布式架构(Dean et al., 2012)上进行,在10个模型副本的每一个上使用5个并发步骤,使用带动量的异步SGD(Sutskever等,2013),小批量数据大小为32。随着训练的进行,所有网络都通过计算验证集准确率@1来评估,即在留出集合上,每幅图像使用单个裁剪,在1000种可能性中预测正确标签的概率。
In our experiments, we evaluated several modifications of Inception with Batch Normalization. In all cases, Batch Normalization was applied to the input of each nonlinearity, in a convolutional way, as described in section 3.2, while keeping the rest of the architecture constant.
在我们的实验中,我们评估了几个带有批标准化的Inception修改版本。在所有情况下,如第3.2节所述,批标准化以卷积方式应用于每个非线性的输入,同时保持架构的其余部分不变。
4.2.1. ACCELERATING BN NETWORKS
Simply adding Batch Normalization to a network does not take full advantage of our method. To do so, we applied the following modifications:
4.2.1. 加速BN网络
仅仅将批标准化添加到网络中并不能充分利用我们方法的优势。为此,我们进行了以下修改:
Increase learning rate. In a batch-normalized model, we have been able to achieve a training speedup from higher learning rates, with no ill side effects (Sec. 3.3).
提高学习率。在批标准化模型中,我们已经能够从更高的学习率中获得训练加速,而没有不良的副作用(第3.3节)。
Remove Dropout. We have found that removing Dropout from BN-Inception allows the network to achieve higher validation accuracy. We conjecture that Batch Normalization provides similar regularization benefits as Dropout, since the activations observed for a training example are affected by the random selection of examples in the same mini-batch.
移除Dropout。我们发现从BN-Inception中移除Dropout可以使网络达到更高的验证准确率。我们推测,批标准化提供了与Dropout类似的正则化收益,因为对于一个训练样本观察到的激活值会受到同一小批量数据中随机选择的其它样本的影响。
Shuffle training examples more thoroughly. We enabled within-shard shuffling of the training data, which prevents the same examples from always appearing in a mini-batch together. This led to about 1% improvement in the validation accuracy, which is consistent with the view of Batch Normalization as a regularizer: the randomization inherent in our method should be most beneficial when it affects an example differently each time it is seen.
更彻底地打乱训练样本。我们启用了训练数据在分片内(within-shard)的打乱,这可以防止相同的样本总是一起出现在一个小批量数据中。这使验证准确率提高了约1%,这与批标准化作为正则化项的观点是一致的:当我们方法中内在的随机化在每次看到同一个样本时对其产生不同的影响时,这种随机化应该是最有益的。
Reduce the L2 weight regularization. While in Inception an L2 loss on the model parameters controls overfitting, in modified BN-Inception the weight of this loss is reduced by a factor of 5. We find that this improves the accuracy on the held-out validation data.
减少L2权重正则化。虽然在Inception中模型参数上的L2损失可以控制过拟合,但在修改后的BN-Inception中,该损失的权重减小为原来的1/5。我们发现这提高了在留出验证数据上的准确率。
Accelerate the learning rate decay. In training Inception, learning rate was decayed exponentially. Because our network trains faster than Inception, we lower the learning rate 6 times faster.
加速学习率衰减。在训练Inception时,学习率呈指数衰减。因为我们的网络比Inception训练得更快,所以我们将学习率的衰减速度加快了6倍。
Remove Local Response Normalization While Inception and other networks (Srivastava et al., 2014) benefit from it, we found that with Batch Normalization it is not necessary.
移除局部响应归一化。虽然Inception和其它网络(Srivastava等人,2014)从中受益,但我们发现使用批标准化后它是不必要的。
Reduce the photometric distortions. Because batch-normalized networks train faster and observe each training example fewer times, we let the trainer focus on more “real” images by distorting them less.
减少光度畸变。因为批标准化网络训练更快,每个训练样本被观察到的次数更少,所以我们通过减少对图像的畸变,让训练器关注更“真实”的图像。
4.2.2. SINGLE-NETWORK CLASSIFICATION
We evaluated the following networks, all trained on the LSVRC2012 training data, and tested on the validation data:
4.2.2. 单网络分类
我们评估了下面的网络,它们都在LSVRC2012训练数据上训练,并在验证数据上测试:
Inception: the network described at the beginning of Section 4.2, trained with the initial learning rate of 0.0015.
Inception:在4.2小节开头描述的网络,以0.0015的初始学习率进行训练。
BN-Baseline: Same as Inception with Batch Normalization before each nonlinearity.
BN-Baseline:与Inception相同,但在每个非线性之前加入批标准化。
BN-x5: Inception with Batch Normalization and the modifications in Sec. 4.2.1. The initial learning rate was increased by a factor of 5, to 0.0075. The same learning rate increase with original Inception caused the model parameters to reach machine infinity.
BN-x5:带有批标准化的Inception,并应用了4.2.1小节中的修改。初始学习率提高了5倍,达到0.0075。对原始Inception使用同样提高的学习率会使模型参数达到机器无穷大。
BN-x30: Like BN-x5, but with the initial learning rate 0.045 (30 times that of Inception).
BN-x30:类似于BN-x5,但初始学习率为0.045(Inception学习率的30倍)。
BN-x5-Sigmoid: Like BN-x5, but with sigmoid nonlinearity $g(t)=\frac{1}{1+\exp(-x)}$ instead of ReLU. We also attempted to train the original Inception with sigmoid, but the model remained at the accuracy equivalent to chance.
BN-x5-Sigmoid:类似于BN-x5,但使用sigmoid非线性$g(t)=\frac{1}{1+\exp(-x)}$来代替ReLU。我们也尝试训练带有sigmoid的原始Inception,但模型的准确率始终停留在相当于随机猜测的水平。
In Figure 2, we show the validation accuracy of the networks, as a function of the number of training steps. Inception reached the accuracy of 72.2% after $31 \cdot 10^6$ training steps. The Figure 3 shows, for each network, the number of training steps required to reach the same 72.2% accuracy, as well as the maximum validation accuracy reached by the network and the number of steps to reach it.
Figure 2. Single crop validation accuracy of Inception and its batch-normalized variants, vs. the number of training steps.
Figure 3. For Inception and the batch-normalized variants, the number of training steps required to reach the maximum accuracy of Inception (72.2%), and the maximum accuracy achieved by the network.
在图2中,我们显示了网络的验证准确率随训练步骤数变化的曲线。Inception在$31 \cdot 10^6$个训练步骤后达到了72.2%的准确率。图3显示了每个网络达到同样72.2%准确率所需要的训练步骤数,以及网络达到的最大验证准确率和达到该准确率所需的步骤数。
图2。Inception及其批标准化变种的单裁剪验证准确率与训练步骤数的关系。
图3。对于Inception及其批标准化变种,达到Inception最大准确率(72.2%)所需要的训练步骤数,以及网络取得的最大准确率。
By only using Batch Normalization (BN-Baseline), we match the accuracy of Inception in less than half the number of training steps. By applying the modifications in Sec. 4.2.1, we significantly increase the training speed of the network. BN-x5 needs 14 times fewer steps than Inception to reach the 72.2% accuracy. Interestingly, increasing the learning rate further (BN-x30) causes the model to train somewhat slower initially, but allows it to reach a higher final accuracy. This phenomenon is counterintuitive and should be investigated further. BN-x30 reaches 74.8% after $6 \cdot 10^6$ steps, i.e. 5 times fewer steps than required by Inception to reach 72.2%.
仅使用批标准化(BN-Baseline),我们在不到Inception一半的训练步骤内就达到了与其相当的准确率。通过应用4.2.1小节中的修改,我们显著提高了网络的训练速度。BN-x5只需要Inception的1/14的步骤就达到了72.2%的准确率。有趣的是,进一步提高学习率(BN-x30)使模型最初的训练稍慢一些,但使其能达到更高的最终准确率。这种现象违反直觉,应进一步研究。BN-x30在$6 \cdot 10^6$步之后达到了74.8%的准确率,即只用了Inception达到72.2%准确率所需步骤的1/5。
We also verified that the reduction in internal covariate shift allows deep networks with Batch Normalization to be trained when sigmoid is used as the nonlinearity, despite the well-known difficulty of training such networks. Indeed, BN-x5-Sigmoid achieves the accuracy of 69.8%. Without Batch Normalization, Inception with sigmoid never achieves better than 1/1000 accuracy.
我们还证实,尽管训练使用sigmoid的深度网络是众所周知的困难,但内部协变量转移的减少使得带有批标准化的深度网络在使用sigmoid作为非线性时也能被训练。的确,BN-x5-Sigmoid取得了69.8%的准确率。没有批标准化,使用sigmoid的Inception的准确率从未超过1/1000。
4.2.3. ENSEMBLE CLASSIFICATION
The current reported best results on the ImageNet Large Scale Visual Recognition Competition are reached by the Deep Image ensemble of traditional models (Wu et al., 2015) and the ensemble model of (He et al., 2015). The latter reports the error of 4.94%, as evaluated by the ILSVRC test server. Here we report a test error of 4.82% on test server. This improves upon the previous best result, and exceeds the estimated accuracy of human raters according to (Russakovsky et al., 2014).
4.2.3. 组合分类
目前在ImageNet大规模视觉识别竞赛中报告的最佳结果是传统模型(Wu et al., 2015)的Deep Image组合和(He等,2015)的组合模型。后者报告了由ILSVRC测试服务器评估的4.94%的top-5错误率。这里我们报告了测试服务器上4.82%的测试错误率。这改进了之前的最佳结果,并且根据(Russakovsky等,2014),这超过了人类评估者的准确率。
For our ensemble, we used 6 networks. Each was based on BN-x30, modified via some of the following: increased initial weights in the convolutional layers; using Dropout (with the Dropout probability of 5% or 10%, vs. 40% for the original Inception); and using non-convolutional Batch Normalization with last hidden layers of the model. Each network achieved its maximum accuracy after about $6 \cdot 10^6$ training steps. The ensemble prediction was based on the arithmetic average of class probabilities predicted by the constituent networks. The details of ensemble and multi-crop inference are similar to (Szegedy et al., 2014).
在我们的组合中,我们使用了6个网络。每个网络都基于BN-x30,并进行了以下部分修改:增大卷积层中的初始权重;使用Dropout(Dropout概率为5%或10%,而原始Inception为40%);对模型最后的隐藏层使用非卷积批标准化。每个网络在大约$6 \cdot 10^6$个训练步骤后达到其最大准确率。组合预测基于各组成网络预测的类别概率的算术平均。组合和多裁剪推断的细节与(Szegedy et al., 2014)类似。
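组合预测即对各网络输出的类别概率取算术平均,下面是一个极简示意(非论文原文,数字为虚构):

```python
import numpy as np

def ensemble_predict(prob_list):
    """对各组成网络预测的类别概率取算术平均。"""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

# 玩具示例:3 个网络对 2 张图像、4 个类别的预测概率
p1 = np.array([[0.70, 0.10, 0.10, 0.10], [0.20, 0.50, 0.20, 0.10]])
p2 = np.array([[0.60, 0.20, 0.10, 0.10], [0.10, 0.60, 0.20, 0.10]])
p3 = np.array([[0.80, 0.10, 0.05, 0.05], [0.30, 0.40, 0.20, 0.10]])
avg = ensemble_predict([p1, p2, p3])
print(avg.argmax(axis=1))    # 每张图像的预测类别:[0 1]
```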
We demonstrate in Fig. 4 that batch normalization allows us to set new state-of-the-art on the ImageNet classification challenge benchmarks.
Figure 4. Batch-Normalized Inception comparison with previous state of the art on the provided validation set comprising 50000 images. Ensemble results are test server evaluation results on the test set. The BN-Inception ensemble has reached 4.9% top-5 error on the 50000 images of the validation set. All other reported results are on the validation set.
我们在图4中证明,批标准化使我们能够在ImageNet分类挑战基准上创造新的最佳结果。
图4。批标准化Inception与之前的最佳结果在包含5万张图像的验证集上的比较。组合结果是测试服务器在测试集上的评估结果。BN-Inception组合在验证集的5万张图像上取得了4.9%的top-5错误率。所有其它报告的结果均在验证集上。
5. Conclusion
We have presented a novel mechanism for dramatically accelerating the training of deep networks. It is based on the premise that covariate shift, which is known to complicate the training of machine learning systems, also applies to sub-networks and layers, and removing it from internal activations of the network may aid in training. Our proposed method draws its power from normalizing activations, and from incorporating this normalization in the network architecture itself. This ensures that the normalization is appropriately handled by any optimization method that is being used to train the network. To enable stochastic optimization methods commonly used in deep network training, we perform the normalization for each mini-batch, and backpropagate the gradients through the normalization parameters. Batch Normalization adds only two extra parameters per activation, and in doing so preserves the representation ability of the network. We presented an algorithm for constructing, training, and performing inference with batch-normalized networks. The resulting networks can be trained with saturating nonlinearities, are more tolerant to increased training rates, and often do not require Dropout for regularization.
5. 结论
我们提出了一种新的机制,可以显著加快深度网络的训练。它基于这样的前提:已知会使机器学习系统训练复杂化的协变量转移,同样适用于子网络和层,而将其从网络的内部激活中消除可能有助于训练。我们提出的方法的能力来自于对激活值进行标准化,并将这种标准化融入网络架构本身。这确保了任何用于训练网络的优化方法都能恰当地处理这种标准化。为了使深度网络训练中常用的随机优化方法可用,我们对每个小批量数据执行标准化,并通过标准化参数反向传播梯度。批标准化对每个激活只增加了两个额外的参数,并以此保持了网络的表示能力。我们给出了一个用于构建、训练批标准化网络并用其执行推断的算法。所得到的网络可以使用饱和非线性进行训练,对增大的训练率有更强的容忍性,并且通常不需要Dropout来进行正则化。
Merely adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training. By further increasing the learning rates, removing Dropout, and applying other modifications afforded by Batch Normalization, we reach the previous state of the art with only a small fraction of training steps, and then beat the state of the art in single-network image classification. Furthermore, by combining multiple models trained with Batch Normalization, we perform better than the best known system on ImageNet, by a significant margin.
僅僅將批標(biāo)準(zhǔn)化添加到最先進(jìn)的圖像分類模型中,便在訓(xùn)練上獲得了顯著的加速。通過進(jìn)一步提高學(xué)習(xí)率、刪除Dropout以及應(yīng)用批標(biāo)準(zhǔn)化所帶來的其它修改,我們只用了一小部分訓(xùn)練步驟就達(dá)到了以前的最佳水平,然后在單網(wǎng)絡(luò)圖像分類中超越了最先進(jìn)的技術(shù)。此外,通過組合多個(gè)使用批標(biāo)準(zhǔn)化訓(xùn)練的模型,我們?cè)贗mageNet上的表現(xiàn)以顯著的優(yōu)勢(shì)超過了已知的最佳系統(tǒng)。
Our method bears similarity to the standardization layer of (Gülçehre & Bengio, 2013), though the two address different goals. Batch Normalization seeks a stable distribution of activation values throughout training, and normalizes the inputs of a nonlinearity since that is where matching the moments is more likely to stabilize the distribution. On the contrary, the standardization layer is applied to the output of the nonlinearity, which results in sparser activations. We have not observed the nonlinearity inputs to be sparse, either with or without Batch Normalization. Other notable differences of Batch Normalization include the learned scale and shift that allow the BN transform to represent identity, handling of convolutional layers, and deterministic inference that does not depend on the mini-batch.
我們的方法與(Gülçehre & Bengio,2013)的標(biāo)準(zhǔn)化層相似,盡管兩者要解決的目標(biāo)不同。批標(biāo)準(zhǔn)化尋求在整個(gè)訓(xùn)練過程中保持激活值的穩(wěn)定分布,并對(duì)非線性的輸入進(jìn)行標(biāo)準(zhǔn)化,因?yàn)樵谀抢锲ヅ渚睾头讲罡锌赡苁狗植挤€(wěn)定。相反,標(biāo)準(zhǔn)化層被應(yīng)用于非線性的輸出,這導(dǎo)致了更稀疏的激活。無論有沒有批標(biāo)準(zhǔn)化,我們都沒有觀察到非線性輸入是稀疏的。批標(biāo)準(zhǔn)化的其它顯著差異包括:學(xué)習(xí)到的縮放和平移使BN變換能夠表示恒等變換、對(duì)卷積層的處理,以及不依賴于小批量數(shù)據(jù)的確定性推斷。
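To illustrate the last point above, deterministic inference, here is a minimal sketch in which the transform applied to an example no longer depends on the rest of the mini-batch; the precomputed population statistics `pop_mean` and `pop_var` are assumed to be available, and the names are hypothetical.

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference-time BN: a fixed per-activation linear transform built from
    population statistics, so the output for one example does not depend on
    which other examples happen to be in its mini-batch."""
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta
```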
In this work, we have not explored the full range of possibilities that Batch Normalization potentially enables. Our future work includes applications of our method to Recurrent Neural Networks (Pascanu et al., 2013), where the internal covariate shift and the vanishing or exploding gradients may be especially severe, and which would allow us to more thoroughly test the hypothesis that normalization improves gradient propagation (Sec. 3.3). More study is needed of the regularization properties of Batch Normalization, which we believe to be responsible for the improvements we have observed when Dropout is removed from BN-Inception. We plan to investigate whether Batch Normalization can help with domain adaptation, in its traditional sense, i.e. whether the normalization performed by the network would allow it to more easily generalize to new data distributions, perhaps with just a recomputation of the population means and variances (Alg. 2). Finally, we believe that further theoretical analysis of the algorithm would allow still more improvements and applications.
在這項(xiàng)工作中,我們還沒有探索批標(biāo)準(zhǔn)化可能實(shí)現(xiàn)的全部可能性。我們未來的工作包括將該方法應(yīng)用于循環(huán)神經(jīng)網(wǎng)絡(luò)(Pascanu et al.,2013),其中內(nèi)部協(xié)變量轉(zhuǎn)移和梯度消失或爆炸可能特別嚴(yán)重,這將使我們能夠更徹底地檢驗(yàn)標(biāo)準(zhǔn)化能夠改善梯度傳播這一假設(shè)(第3.3節(jié))。批標(biāo)準(zhǔn)化的正則化性質(zhì)還需要更多的研究,我們認(rèn)為這正是從BN-Inception中刪除Dropout后我們觀察到改善的原因。我們還計(jì)劃研究批標(biāo)準(zhǔn)化是否有助于傳統(tǒng)意義上的域自適應(yīng),即網(wǎng)絡(luò)所執(zhí)行的標(biāo)準(zhǔn)化是否能使其更容易地泛化到新的數(shù)據(jù)分布,也許只需重新計(jì)算總體均值和方差(算法2)。最后,我們認(rèn)為對(duì)該算法的進(jìn)一步理論分析將帶來更多的改進(jìn)和應(yīng)用。
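The domain-adaptation idea mentioned above, re-estimating only the population statistics on data from the new distribution while keeping all learned weights (including $\gamma$ and $\beta$) fixed, could be sketched as follows; the helper name and input format are hypothetical, not from the paper.

```python
import numpy as np

def recompute_population_stats(activation_batches):
    """Re-estimate population mean and variance of a BN layer's inputs from
    mini-batches drawn from the new data distribution; no weights are updated."""
    means = np.stack([b.mean(axis=0) for b in activation_batches])
    variances = np.stack([b.var(axis=0, ddof=1) for b in activation_batches])  # unbiased variance
    return means.mean(axis=0), variances.mean(axis=0)
```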
Acknowledgments
We thank Vincent Vanhoucke and Jay Yagnik for help and discussions, and the reviewers for insightful comments.
致謝
我們感謝Vincent Vanhoucke和Jay Yagnik的幫助和討論区宇,以及審稿人的深刻評(píng)論娃殖。
References
Bengio, Yoshua and Glorot, Xavier. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010.
Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc’Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.
Desjardins, Guillaume and Kavukcuoglu, Koray. Natural neural networks. (unpublished).
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.
Gülçehre, Çağlar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013.
He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015.
Hyvärinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5): 411–430, May 2000.
Jiang, Jing. A literature survey on domain adaptation of statistical classifiers, 2008.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a.
LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and K., Muller (eds.), Neural Networks: Tricks of the trade. Springer, 1998b.
Lyu, S. and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pp. 1–8. IEEE Computer Society, Jun 23-28 2008. doi: 10.1109/CVPR.2008.4587821.
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814. Omnipress, 2010.
Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 1310–1318, 2013.
Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014.
Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924–932, 2012.
Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014.
Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90 (2):227–244, October 2000.
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.
Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pp. 1139–1147. JMLR.org, 2013.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.
Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180–184, Florence, Italy, May 2014.
Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep image: Scaling up image recognition, 2015.