Dive into Deep Learning

# Basics

## Standard notations

- Variable: $X$ (upper-case, not bold)
- Matrix: $\mathbf{X}$ (upper-case and bold)
- Vector: $\mathbf{x}$ (lower-case and bold)
- Element/Scalar: $x$ (lower-case, not bold)

## Basic Steps for Deep Learning

1. Define the model structure
2. Initialize the model's parameters
3. Loop:
    - Calculate current loss (forward propagation)
    - Calculate current gradient (backward propagation)
    - Update parameters (gradient descent)

## Backpropagation

Here are some notations we will need later. We use $w^l_{jk}$ to denote the weight for the connection from the $k^{th}$ neuron in the $(l - 1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer. We use $z^{l}_j$ to represent the weighted input of the $j^{th}$ neuron in the $l^{th}$ layer, and $a^l_j$ to represent the activation output of the $j^{th}$ neuron in the $l^{th}$ layer. Similarly, $b^l_j$ represents the bias of the $j^{th}$ neuron in the $l^{th}$ layer.

Why use this cumbersome notation? It might seem more natural to use $j$ for the input neuron and $k$ for the output neuron. Why do we use the reverse? The reason is that the activation output of the $j^{th}$ neuron in the $l^{th}$ layer can be expressed as,
$$a^l_j = \sigma(\sum_k w^l_{jk}a^{l-1}_k + b^l_j)$$
This expression can be rewritten in matrix form as,
$$\mathbf{a}^l = \sigma(\mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l)$$
where $\mathbf{a}^{l}$, $\mathbf{a}^{l-1}$ and $\mathbf{b}^l$ are vectors, and $\mathbf{W}^l$ is the weight matrix for the $l^{th}$ layer, whose entry in the $j^{th}$ row and $k^{th}$ column is $w^l_{jk}$. The elements in the $j^{th}$ row of $\mathbf{W}^l$ are the weights connecting the neurons in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer.
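As a quick illustration of the matrix form, here is a minimal NumPy sketch of the forward pass through one layer (the layer sizes and the sigmoid activation are assumptions for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: 4 neurons in layer l-1, 3 neurons in layer l.
rng = np.random.default_rng(0)
a_prev = rng.standard_normal((4, 1))   # a^{l-1}, a column vector
W = rng.standard_normal((3, 4))        # W^l, row j holds the weights into neuron j of layer l
b = rng.standard_normal((3, 1))        # b^l

z = W @ a_prev + b                     # z^l = W^l a^{l-1} + b^l
a = sigmoid(z)                         # a^l = sigma(z^l)
print(a.shape)                         # (3, 1)
```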

Then, we define the loss function $C$. Here we use the mean squared error (MSE) as an example,
$$C = \frac{1}{2}\frac{1}{m}\sum_{i}^{m}\| \mathbf{y}^{(i)} - \mathbf{a}^{L}(\mathbf{x}^{(i)}) \|^2$$
where $L$ denotes the number of layers in the network and $\mathbf{a}^L$ denotes the final output of the network. The loss of a single training example is $C_{\mathbf{x}^{(i)}} = \frac{1}{2}\|\mathbf{y}^{(i)} - \mathbf{a}^L\|^2$.

Note: backpropagation actually computes the partial derivatives $\frac{\partial C_{x^{(i)}}}{\partial w}$ and $\frac{\partial C_{x^{(i)}}}{\partial b}$ for a single training example. Then, we calculate $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ by averaging over the training examples (this step is for GD or mini-batch GD). Here we suppose the training example $\mathbf{x}$ has been fixed, and in order to simplify notation, we drop the $\mathbf{x}$ subscript, writing the loss $C_{\mathbf{x}^{(i)}}$ as $C$.

So, for each single training example $\mathbf{x}$, the loss may be written as,
$$C = \frac{1}{2}\| \mathbf{y} - \mathbf{a}^L \|^2 = \frac{1}{2}\sum_j (y_j - a^L_j)^2$$
Here, we define $\delta^l_j$ as
$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$

$\delta^l_j$ measures how much a change in the input of the $j^{th}$ neuron in the $l^{th}$ layer affects the network loss (details can be found [here](http://neuralnetworksanddeeplearning.com/chap2.html#the_four_fundamental_equations_behind_backpropagation)).

Understanding: $\delta^l_j$ expresses how strongly a change in the input of the $j^{th}$ neuron in the $l^{th}$ layer influences the final loss function.

And we have,
$$z^l_j = \sum_k w_{jk}^l a_k^{l-1} + b^l_j$$
$$a_j^l = \sigma(z_j^l)$$
Then,
$$\delta_j^L = \frac{\partial C}{\partial z^L_j} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a_k^L}{\partial z_j^L} = \frac{\partial C}{\partial a^L_j} \sigma^{'}(z^L_j)$$
Moreover,
$$\delta^l_j = \frac{\partial C}{\partial z_j^l} = \sum_k \frac{\partial C}{\partial z^{l + 1}_k} \frac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \delta^{l+1}_k \frac{\partial z_k^{l+1}}{\partial z_j^l}$$
Because
$$z_k^{l+1} = \sum_i w_{ki}^{l+1}a_i^l + b^{l+1}_k = \sum_i w^{l+1}_{ki}\sigma(z^{l}_i) + b^{l+1}_k$$
Differentiating, only the $i = j$ term survives, so
$$\frac{\partial z_k^{l+1}}{\partial z_j^l} = w^{l+1}_{kj}\sigma^{'}(z^l_j)$$
Then, we get
$$\delta_j^l = \sum_k \delta_k^{l+1} w^{l+1}_{kj} \sigma^{'}(z^l_j)$$

Understanding: $w^{l+1}_{kj}$ is the weight connecting the $k^{th}$ neuron in the $(l+1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer. The formula says that the delta of every neuron in the $(l+1)^{th}$ layer is multiplied by the weight connecting it to the $j^{th}$ neuron in the $l^{th}$ layer, and these products are summed.

Our goal is to update $w^l_{jk}$ and $b^l_j$, so we need to calculate the partial derivatives,
$$\frac{\partial C}{\partial w_{jk}^{l}} = \sum_i \frac{\partial C}{\partial z^l_{i}} \frac{\partial z^l_i}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_{j}} \frac{\partial z^l_{j}}{\partial w^l_{jk}} = \delta^{l}_j a^{l-1}_k$$
$$\frac{\partial C}{\partial b^l_j} = \sum_i \frac{\partial C}{\partial z^l_i} \frac{\partial z^l_i}{\partial b^l_j} = \delta_j^l$$
So far, we have the four key formulas of backpropagation,
$$\begin{aligned}& \delta_j^L = \frac{\partial C}{\partial a^L_j} \sigma^{'}(z^L_j) & ~(1) \\& \delta_j^l = \sum_k \delta_k^{l+1} w^{l+1}_{kj} \sigma^{'}(z^l_j) & ~(2) \\& \frac{\partial C}{\partial w_{jk}^{l}} = \delta^{l}_j a^{l-1}_k &~(3)\\& \frac{\partial C}{\partial b^l_j} = \delta_j^l &~(4) \\\end{aligned}$$

### Deduce BP with Vectorization

Here we use the concept of the differential:

- Univariate calculus: $\mathrm{d}f = f^{'}(x)\mathrm{d}x$
- **Multivariable calculus**:
    - Scalar to vector

        $$
        \mathrm{d}f = \sum_i \frac{\partial f}{\partial x_i}\mathrm{d}x_i = {\frac{\partial f}{\partial \mathbf{x}}}^T\mathrm{d}\mathbf{x}
        $$

    - Scalar to matrix

        Based on the trace of a matrix,

        $$
        \sum_i \sum_j a_{ij}b_{ij} = \mathrm{Tr}(A^TB)
        $$

        $$
        \mathrm{Tr}(AB) = \mathrm{Tr}(BA)
        $$

        we have,

        $$
        \mathrm{d}f = \sum_i \sum_j \frac{\partial f}{\partial x_{ij}}\mathrm{d}x_{ij} = \sum_i \sum_j [\frac{\partial f}{\partial \mathbf{X}}]_{ij} [\mathrm{d}\mathbf{X}]_{ij} = \mathrm{Tr}[{(\frac{\partial f}{\partial \mathbf{X}})}^T \mathrm{d}\mathbf{X}]
        $$

        so,

        $$
        \mathrm{d}f = \mathrm{Tr}[{\frac{\partial f}{\partial \mathbf{X}}}^T \mathrm{d}\mathbf{X}]
        $$


We already have,

$$\mathbf{z}^l = \mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l$$
$$\mathbf{a}^l = \sigma(\mathbf{z}^l)$$
And,
$$\frac{\partial J}{\partial \mathbf{W}^l} = \frac{\partial J}{\partial \mathbf{a}^{l}} \frac{\partial \mathbf{a}^l}{\partial \mathbf{z}^l} \frac{\partial \mathbf{z}^l}{\partial \mathbf{W}^l} = \frac{\partial J}{\partial \mathbf{z}^l} \frac{\partial \mathbf{z}^l} {\partial \mathbf{W}^l}$$
$$\mathrm{d}\mathbf{z}^l = \mathrm{d}\mathbf{W}^l \, \mathbf{a}^{l-1}$$
so,
$$\frac{\partial \mathbf{z}^l} {\partial \mathbf{W}^l} = {\mathbf{a}^{l-1}}^T$$
Then, calculate $\frac{\partial J}{\partial \mathbf{z}^l}$,
$$\mathrm{d} J = \mathrm{Tr}[{(\frac{\partial J}{\partial \mathbf{a}^l})}^T \mathrm{d} \mathbf{a}^l] = \mathrm{Tr}[{(\frac{\partial J}{\partial \mathbf{a}^l})}^T (\sigma^{'}(\mathbf{z}^l) \odot \mathrm{d} \mathbf{z}^l)]$$
$$\frac{\partial J}{\partial \mathbf{z}^l} = \frac{\partial J}{\partial \mathbf{a}^l} \odot \sigma^{'}(\mathbf{z}^l)$$
and,
$$\mathrm{d}J = \mathrm{Tr}[{(\frac{\partial J}{\partial \mathbf{z}^{l+1}})}^T \mathrm{d}\mathbf{z}^{l+1}] = \mathrm{Tr}[{(\frac{\partial J}{\partial \mathbf{z}^{l+1}})}^T \mathbf{W}^{l+1}\mathrm{d}\mathbf{a}^l]$$
$$\frac{\partial J}{\partial \mathbf{a}^l} = {(\mathbf{W}^{l+1})}^T \frac{\partial J}{\partial \mathbf{z}^{l+1}}$$
Until now we have,
$$\frac{\partial J}{\partial \mathbf{W}^l} = \frac{\partial J}{\partial \mathbf{z}^l} {\mathbf{a}^{l-1}}^T$$
$$\frac{\partial J}{\partial \mathbf{z}^l} =  ({(\mathbf{W}^{l+1})}^T \frac{\partial J}{\partial \mathbf{z}^{l+1}}) \odot \sigma^{'}(\mathbf{z}^l)$$

We note,
$$\delta^l = \frac{\partial J}{\partial \mathbf{z}^{l}}$$
And we can rewrite these formulas into matrix-based form, as
$$\begin{aligned}& \delta^L = \nabla_{\mathbf{a}^L} C \odot \sigma^{'}(\mathbf{z}^L) & ~(1) \\& \delta^l = ({(\mathbf{W}^{l+1})}^T \delta^{l+1}) \odot \sigma^{'}(\mathbf{z}^l) & ~(2) \\& \nabla_{\mathbf{W}^l}C = \delta^{l} {(\mathbf{a}^{l - 1})}^T & ~(3) \\& \nabla_{\mathbf{b}^l}C = \delta^l & ~(4) \\\end{aligned}$$
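The four matrix-based equations translate almost line by line into code. Below is a minimal NumPy sketch for a small fully connected network with sigmoid activations and the MSE loss used above; the layer sizes and random data are assumptions for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                       # assumed layer sizes: input, hidden, output
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal((m, 1)) for m in sizes[1:]]

x = rng.standard_normal((sizes[0], 1))
y = rng.standard_normal((sizes[-1], 1))

# Forward pass: store z^l and a^l for every layer.
a, activations, zs = x, [x], []
for W, b in zip(Ws, bs):
    z = W @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# (1) delta^L = grad_a C ⊙ sigma'(z^L), with C = 0.5 * ||y - a^L||^2.
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
grads_W = [None] * len(Ws)
grads_b = [None] * len(bs)
grads_W[-1] = delta @ activations[-2].T        # (3)
grads_b[-1] = delta                            # (4)

# (2) propagate delta backwards through the hidden layers.
for l in range(2, len(sizes)):
    delta = (Ws[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
    grads_W[-l] = delta @ activations[-l - 1].T
    grads_b[-l] = delta
```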

***References:***

- [知乎:矩陣求導(dǎo)術(shù)(上)](https://zhuanlan.zhihu.com/p/24709748)
- [Neural Networks and Deep Learning: How the backpropagation algorithm works](http://neuralnetworksanddeeplearning.com/chap2.html)
- [The Matrix Cookbook](https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf)
- [Caltech: EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 5](http://www.systems.caltech.edu/dsp/ee150_acospc/lectures/EE_150_Lecture_5_Slides.pdf)

## Regularization

**Key idea** is to add another term to the loss, which penalizes large weights.

### $L_1$ regularization

$$

\lambda \sum_{i=1}^{n} | w_i | = \lambda {\|\mathbf{w}\|}_1

$$

By using $L_1$ regularization, $\mathbf{w}$ will be sparse.

### $L_2$ regularization

$L_2$ regularization is used much more often when training neural networks; it makes the weights more uniform,

$$

\lambda \sum_{i=1}^{n} w_i^2 = \lambda {\|\mathbf{w}\|}_2^2

$$

In neural network, the loss function with regularization is written as,

$$J(\mathbf{w}^1, b^1, \mathbf{w}^2, b^2, ..., \mathbf{w}^L, b^L) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|\mathbf{w}^l\|^2_2$$
where $\mathbf{w}^l$ is a weight matrix and $b^l$ is a bias vector.

When we do backpropagation to update the weights (here assume we use SGD), the gradient of $\mathbf{w}^L$ is,
$$\frac{\partial J}{\partial \mathbf{w}^L} = \frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial \mathbf{w}^L} + \lambda \mathbf{w}^L$$
Writing $\frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial \mathbf{w}^L}$ as $\mathrm{d}\mathbf{w}^L$, the update is
$$\mathbf{w}^L := \mathbf{w}^L - \alpha \mathrm{d} \mathbf{w}^L - \alpha \lambda \mathbf{w}^L$$
Here, $\lambda$ is called **weight decay**: whatever the value of $\mathbf{w}^L$, this extra term shrinks the weights (pushes their absolute values toward zero).
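A minimal NumPy sketch of one update step with weight decay; `dW` stands for the gradient of the data loss alone, and all names and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))      # a weight matrix of some layer
dW = rng.standard_normal((4, 3))     # assumed gradient of the data loss w.r.t. W
lr, weight_decay = 0.1, 1e-3

# Plain SGD step plus the decay term: both pull |W| towards smaller values.
W = W - lr * dW - lr * weight_decay * W
```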

### Dropout

Core concept:

1. Dropout randomly knocks out units in the network, so on every iteration it is as if we are working with a smaller neural network, and using a smaller neural network should have a regularization effect.
2. A neuron cannot rely on any single feature, so dropout spreads out the weights.

**Understanding:** at each iteration dropout discards part of a layer's inputs (sets some of them to zero), so the weights cannot concentrate on one or a few input features; instead they are distributed more evenly. This can be understood as **shrinking the weights**, so it is similar to $L_2$ regularization.
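A minimal sketch of (inverted) dropout applied to one layer's activations; the keep probability of 0.8 is an assumption, and dividing by `keep_prob` keeps the expected activation unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 5))           # activations of some layer
keep_prob = 0.8                           # assumed keep probability

mask = rng.random(a.shape) < keep_prob    # randomly knock out ~20% of the units
a_dropped = a * mask / keep_prob          # rescale so E[a_dropped] == E[a]
```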

Tips for using Dropout:

1. Dropout is for preventing over-fitting. If the model is not over-fitting, it is better not to use dropout.

    **Understanding:** dropout is a tool against over-fitting; if the model is not over-fitting, there is no need to use it.

2. Because of dropout, the loss function $J$ cannot be defined explicitly, so it is hard to check whether the loss decreases correctly. A good practice is to turn dropout off and check that the loss decreases as expected to make sure your code has no bug, and then turn dropout back on.

### Other Regularization Methods

1. Data augmentation
    - Flipping
    - Random rotation
    - Crop
    - Distortion
2. Early stopping

    Check the dev-set error and stop training early. $\mathbf{w}$ is small at initialization and grows with the iterations; early stopping yields a mid-sized $\mathbf{w}$, so it is similar to $L_2$ regularization.

## Preprocessing

Most datasets contain images of different sizes; a common preprocessing pipeline is:

1. Scale the images to the same size, or scale one side (width or height, often the shorter one) to the same size
2. Do data augmentation: flipping, random rotation
3. Crop a square from each image randomly
4. Mean subtraction

### Per-pixel mean subtraction

Subtract the per-pixel mean from the input image. If the whole training set is $(N, C, H, W)$, the per-pixel mean is calculated, for each channel, by averaging the pixels at the same position over all images, which gives a `mean matrix` of size $(C, H, W)$.

```python
# X size is (N, C, H, W)
mean = np.mean(X, axis=0)
mean.shape
>>> (C, H, W)
```

`Caffe` uses per-pixel mean subtraction in its [tutorial]().

**Note:** per-pixel mean subtraction treats each channel independently; it is used when the pixels are not assumed to be stationary across the image (stationarity would mean that different parts of the image share the same statistics), and the mean at each pixel position is computed over all samples.

### Per-channel mean subtraction

Subtract the mean of each channel calculated over all images. If the training set is $(N, C, H, W)$, the mean is calculated per channel over all images, giving a `mean vector` of size $(C,)$.

```python
# X size is (N, C, H, W)
mean = np.mean(X, axis=(0, 2, 3))
mean.shape
>>> (C,)
```

Whether **per-pixel mean subtraction** or **per-channel mean subtraction**, both serve to "center" the data, i.e. to make the mean of the dataset close to zero, which helps train the network (keeps the gradients healthy). As far as I know, **per-channel mean subtraction** is the more common and better choice for preprocessing.

***References:***

- [Github: KaimingHe/deep-residual-networks: preprocessing #5](https://github.com/KaimingHe/deep-residual-networks/issues/5)
- [caffe: Brewing ImageNet](http://caffe.berkeleyvision.org/gathered/examples/imagenet.html)
- [Google Groups: Subtract mean image/pixel](https://groups.google.com/forum/#!topic/digits-users/FfeFp0MHQfQ)
- [StackExchange: Why do we normalize images by subtracting the dataset's image mean and not the current image mean in deep learning?](https://stats.stackexchange.com/questions/211436/why-do-we-normalize-images-by-subtracting-the-datasets-image-mean-and-not-the-c)
- [MathWorks: What is per-pixel mean?](https://cn.mathworks.com/matlabcentral/answers/292415-what-is-per-pixel-mean)

## Batch Normalization

Assume $\mathbf{X}$ is a 4d input $(N, C, H, W)$. The output of a batch normalization layer is
$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta$$
where $x$ is the mini-batch of 3d inputs $(N, H, W)$ belonging to one channel. $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are calculated per channel over the mini-batch, and $\gamma$ and $\beta$ are learnable parameter vectors of size $C$ (the number of channels).

Understanding: along the $C$ dimension, flatten the values of all other dimensions into one vector and compute their mean and variance, then normalize. That is, for each channel, the mean and variance are computed over all values of all samples in the mini-batch and used for normalization.

A toy example:

```python
import numpy as np
import torch
from torch import nn
from torch.autograd import Variable

x = np.array([
        [[[1, 1], [1, 1]], [[1, 1], [1, 1]]],
        [[[2, 2], [2, 2]], [[2, 2], [2, 2]]]
        ], dtype=np.float32)
x = Variable(torch.from_numpy(x))
# No affine parameters.
bn = nn.BatchNorm2d(2, affine=False)
output = bn(x)

>>> Variable containing:
(0 ,0 ,.,.) =
 -1.0000 -1.0000
 -1.0000 -1.0000
(0 ,1 ,.,.) =
 -1.0000 -1.0000
 -1.0000 -1.0000
(1 ,0 ,.,.) =
  1.0000  1.0000
  1.0000  1.0000
(1 ,1 ,.,.) =
  1.0000  1.0000
  1.0000  1.0000
[torch.FloatTensor of size 2x2x2x2]
```

***References:***

- [PyTorch: BatchNorm2d](http://pytorch.org/docs/master/nn.html#batchnorm2d)
- [pytorch: 利用batch normalization對(duì)Variable進(jìn)行normalize/instance normalize](http://blog.csdn.net/u014722627/article/details/68947016)

## Weight Initialization

Input features $\mathbf{x} \sim \mathcal{N}(\mu, \sigma^2)$, output $a = \sum_{i=1}^{n}w_ix_i$. Its variance is
$$\mathrm{Var}(a) = \mathrm{Var}(\sum_{i=1}^{n}w_ix_i) = \sum_{i=1}^{n}\mathrm{Var}(w_ix_i)$$
$$= \sum_{i=1}^{n}[\mathrm{E}(w_i)]^2\mathrm{Var}(x_i) + [\mathrm{E}(x_i)]^2 \mathrm{Var}(w_i) + \mathrm{Var}(w_i)\mathrm{Var}(x_i)$$
$$= \sum_{i=1}^{n} \mathrm{Var}(w_i) \mathrm{Var}(x_i)$$
$$= n\mathrm{Var}(w) \mathrm{Var}(x)$$
Here, we assumed zero-mean inputs and weights, so $\mathrm{E}[x_i] = 0$ and $\mathrm{E}[w_i] = 0$; $w_i$ and $x_i$ are independent of each other; $x_i~(i = 1,2,...,n)$ are independent and identically distributed, and $w_i~(i = 1,2,...,n)$ are also independent and identically distributed.

If we want the output $a$ to have the same variance as each of its inputs $x$, the variance of $w$ needs to be $\frac{1}{n}$, i.e. $\mathrm{Var}(w) = \frac{1}{n}$, which means $w \sim \mathcal{N}(0, \frac{1}{n})$.
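A quick numerical check of this result (a sketch; the fan-in `n` and the number of trials are arbitrary): drawing `w` with standard deviation $1/\sqrt{n}$ keeps the variance of $a$ close to the variance of $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_trials = 500, 10000

x = rng.standard_normal((num_trials, n))                 # inputs, Var(x) ≈ 1
w = rng.standard_normal((num_trials, n)) / np.sqrt(n)    # w ~ N(0, 1/n)
a = np.sum(w * x, axis=1)                                # a = sum_i w_i x_i for each trial

print(np.var(x), np.var(a))                              # both close to 1
```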

Understanding: we assumed that both the input features and the weights have zero mean, $\mathrm{E}[x_i] = 0$ and $\mathrm{E}(w_i) = 0$, that $w_i$ and $x_i$ are mutually independent, and that the $x_i$ are i.i.d. and the $w_i$ are i.i.d. Therefore, if we want $a$ to have the same variance as $x$ (so the distribution does not change from the input to the output of the layer), we need $\mathrm{Var}(w) = \frac{1}{n}$, i.e. $w \sim \mathcal{N}(0, \frac{1}{n})$. Since scaling a random variable by a constant $c$ scales its variance by $c^2$ ($\mathrm{Var}(cx) = c^2\mathrm{Var}(x)$), this gives `w = np.random.randn(n) / sqrt(n)`. In addition, deep learning implementations often initialize the parameters as shown below.

```python
import math
import numpy as np

n = 100  # fan-in of the layer
# Calculate standard deviation.
stdv = 1 / math.sqrt(n)
# Numpy
w = np.random.uniform(-stdv, stdv)
```

That is, the weights are sampled uniformly within one standard deviation around zero, which keeps $w$ close to 0.

***References:***

- [cs231n: Weight Initialization](http://cs231n.github.io/neural-networks-2/#init)
- [Wiki: Variance](https://en.wikipedia.org/wiki/Variance)
- [知乎: 為什么神經(jīng)網(wǎng)絡(luò)在考慮梯度下降的時(shí)候,網(wǎng)絡(luò)參數(shù)的初始值不能設(shè)定為全0,而是要采用隨機(jī)初始化思想?](https://www.zhihu.com/question/36068411)

## Optimization Methods

The loss function is defined as $J(\mathbf{w}, \mathbf{X})$, where $\mathbf{X} \in \mathbb{R}^{m \times n}$ is the training set of $m$ samples.

- $\eta$: learning rate

### Batch Gradient Descent (BGD)

BGD sums the gradients of all samples and takes the mean,

$$

w_i := w_i - \eta \frac{1}{m}\sum_{k=1}^{m}\nabla_{w_i}J(\mathbf{w}, \mathbf{x})^{(k)}

$$

**Advantages**:

- Simplicity

**Disadvantages**:

- Large amount of computation
- Memory may not be enough to hold all samples
- Difficult to update weights online

When the training set is very large, BGD takes a lot of time.

### Stochastic Gradient Descent

SGD takes one sample and calculates its gradient to update the weights,

$$

w_i := w_i - \eta \nabla J(\mathbf{w}, \mathbf{x})^{(k)}

$$

A drawback of SGD is that the update direction does not always point towards the minimum, because each step uses the gradient of only one sample.

### Mini-batch Gradient Descent

This method calculates the gradient over a mini-batch of samples and takes the mean to update the weights,

$$

w_i := w_i - \eta \frac{1}{b}\sum_{k=j}^{j+b}\nabla_{w_i}J(\mathbf{w}, \mathbf{x})^{(k)}

$$

```python
n_batches = m // batch_size
for i in range(n_batches):
    # Use matrix operations to calculate the loss of the whole mini-batch
    # (model, loss_function, targets, etc. are placeholders).
    output = model(x_batches[i])
    loss = loss_function(output, target_batches[i])
    # Update weights with the mini-batch gradient.
    w = w - lr * gradient
```

Mini-batch GD is much faster than batch GD. The loss of batch GD goes down all the time (assuming the learning rate is suitable), but the loss of SGD or mini-batch GD is noisy: sometimes it decreases and sometimes it increases.

**Summary**

In one epoch (a single pass through the training set), BGD updates the weights once; SGD updates the weights $m$ times; mini-batch GD updates the weights $\frac{m}{\mathrm{batch~size}}$ times.

**Note:** an epoch means one pass over the whole training set during training.

**How to choose the mini-batch size?**

- If the training set is small ($m \le 2000$): use batch gradient descent
- Typical mini-batch sizes: 64 ~ 512, a power of 2 (because of the way computer memory is laid out and accessed, this makes the computation run faster)

The following methods are optimizations based on **gradient descent**. We use $g$ to denote the gradient $\nabla J(\mathbf{w}, \mathbf{x})$ (this gradient can be the mean over all samples, a single sample, or the mean over a batch of samples). **Note:** in deep learning we often say SGD, but the SGD here usually means mini-batch gradient descent.

### Exponentially weighted averages

- $v_t$: exponentially weighted average (a moving average) at time $t$
- $\theta_t$: current value

$$

v_t = \beta v_{t-1} + (1-\beta) \theta_t

$$

It can be rewritten as,

$$

v_t = (1-\beta)\theta_t + (1-\beta) \beta \theta_{t-1} + (1-\beta)\beta^2 \theta_{t-2} + ...

$$

$v_t$ is approximately an average over the previous $\frac{1}{1-\beta}$ data points, and $v_0$ is 0.
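A minimal sketch of computing the exponentially weighted average of a noisy sequence; the data and the value of $\beta$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.sin(np.linspace(0, 3, 100)) + 0.1 * rng.standard_normal(100)  # noisy data
beta = 0.9            # averages over roughly 1 / (1 - beta) = 10 previous points

v = 0.0
averages = []
for t in theta:
    v = beta * v + (1 - beta) * t   # v_t = beta * v_{t-1} + (1 - beta) * theta_t
    averages.append(v)
```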

**Understanding:** $v_t$ approximates the average of the previous $\frac{1}{1-\beta}$ data points. It is a windowed averaging method, and $\beta$ determines the window size. (The average can also be used to predict the value at the next time step.)

> In essence it is a moving average with exponentially decaying weights: the weights decay exponentially over time, so more recent data is weighted more heavily, but older data still gets some weight.

### Momentum

The basic idea is to compute an exponentially weighted average of the gradients, and use this average gradient to update the weights.

- $\mathrm{dw}$: current gradient
- $\eta$: learning rate

$$

v_{dw} := \beta v_{dw} + (1-\beta) \mathrm{dw}

$$

$$

w := w - \eta v_{dw}

$$

Some version of momentum is written as (like PyTorch),

$$

v_{dw} := \beta v_{dw} + \mathrm{dw}

$$

It means that this version's $v$ is the first version's $v$ divided by $1-\beta$. The most common value of $\beta$ is $0.9$ (an average over roughly the last 10 gradients) for both versions of momentum. The difference is that the second version's $v$ is larger than the first one's, which only influences the effective **learning rate**.

**Understanding:** momentum uses an exponentially weighted average of the gradients as the update direction, i.e. it combines the current information with the historical information to correct the gradient and obtain a better optimization direction.

***References:***

- [deeplearing.ai: Momentum](https://mooc.study.163.com/learn/2001281003?tid=2001391036#/learn/content?type=detail&id=2001702123&cid=2001694311)

### Nesterov Momentum

Init $v_{dw}=0$

Then in each iteration $t$

Compute $\mathrm{dw}$ and $\mathrm{db}$ on the current mini-batch, then
$$v := \mu v + g$$
$$w_{\text{ahead}} = w - \eta v$$
$$v := \mu v + g_{w_{\text{ahead}}}$$
$$w := w - \eta v$$
That is, first take a provisional momentum step to get a look-ahead point $w_{\text{ahead}}$, then recompute the velocity with the gradient evaluated at that point and apply the actual update.

***References:***

- [知乎專欄:深度學(xué)習(xí)最全優(yōu)化方法總結(jié)比較(SGD,Adagrad,Adadelta,Adam,Adamax,Nadam)](https://zhuanlan.zhihu.com/p/22252270)
- [卷積神經(jīng)網(wǎng)絡(luò)中的優(yōu)化算法比較](http://shuokay.com/2016/06/11/optimization/) (Note: this blog post contains some mistakes; focus on the ideas it explains)
- [知乎:在神經(jīng)網(wǎng)絡(luò)中weight decay起到的做用是什么?momentum呢?normalization呢?](https://www.zhihu.com/question/24529483)

### RMSprop (Root Mean Square prop)

Init $s_{dw}=0$

Then in iteration $t$

Compute $\mathrm{dw}$ and $\mathrm{db}$ on the current mini-batch, then
$$s_{dw} := \beta s_{dw} + (1 - \beta) {\mathrm{dw}}^2$$
$$w := w - \eta \frac{\mathrm{dw}}{\sqrt{s_{dw}}}$$
Here ${\mathrm{dw}}^2$ is the element-wise square of $\mathrm{dw}$.

**Understanding:** intuitively, through $s$, parameters with large gradients are divided by a large value so that their updates become smaller, while parameters with small gradients are divided by a small value so that their updates become larger.

In order to avoid $\sqrt{s_{dw}}$ being zero or close to zero, in practice we often add a small value $\epsilon$ (e.g. $10^{-8}$) to the denominator, $\frac{\mathrm{dw}}{\sqrt{s_{dw}}+\epsilon}$, to avoid getting `inf` or a very large value.

### Adam (Adaptive Moment Estimation)

A combination of Momentum and RMSprop.

Init $v_{dw}=0$, $s_{dw}=0$

Then in iteration $t$

Compute $\mathrm{dw}$ and $\mathrm{db}$ on the current mini-batch, then
$$v_{dw} := \beta_1 v_{dw} + (1-\beta_1) \mathrm{dw}$$
$$s_{dw} := \beta_2 s_{dw} + (1-\beta_2) {\mathrm{dw}}^2$$
Do bias correction,
$$v_{dw}^{correct} = \frac{v_{dw}}{1 - \beta_1^t}$$
$$s_{dw}^{correct} = \frac{s_{dw}}{1 - \beta_2^t}$$
Update the weights,
$$w := w - \eta \frac{v_{dw}^{correct}}{\sqrt{s_{dw}^{correct}} + \epsilon}$$

A common value for $\beta_1$ is 0.9, and for $\beta_2$ it is 0.999.

### Learning Rate Decay

During training, decrease the learning rate as the number of epochs increases. In earlier epochs the network can accept a relatively large learning rate, which accelerates training, but as the loss decreases we get closer to the optimal solution, and a smaller learning rate helps us settle in a tighter region around the minimum.
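To tie the optimization methods above together, here is a minimal NumPy sketch of the momentum, RMSprop, and Adam update rules plus a simple learning-rate decay; the toy objective, hyper-parameter values, and decay schedule are assumptions for illustration:

```python
import numpy as np

def grad(w):
    # Assumed toy objective J(w) = 0.5 * ||w||^2, so dJ/dw = w.
    return w

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
v = np.zeros_like(w)     # momentum / first moment
s = np.zeros_like(w)     # RMSprop / second moment

for t in range(1, 101):
    dw = grad(w)

    # Momentum alone:  v = beta1 * v + (1 - beta1) * dw;        w -= lr * v
    # RMSprop alone:   s = beta2 * s + (1 - beta2) * dw**2;     w -= lr * dw / (sqrt(s) + eps)
    # Adam (below) combines both, with bias correction:
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)

    lr_t = lr / (1 + 0.01 * t)            # assumed simple learning-rate decay schedule
    w = w - lr_t * v_hat / (np.sqrt(s_hat) + eps)
```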

## 1D, 2D, 3D Convolutions

- 1D convolution:
    - Input: a vector $[C_{in}, L_{in}]$
    - Kernel: a vector $[k,]$
    - Output (one kernel): a vector $[L_{out},]$
- 2D convolution:
    - Input: an image $[1, H, W]$ or $[C_{in}, H, W]$
    - Kernel: $[C_{in}, k, k]$
    - Output (one kernel): a feature map $[H_{out}, W_{out}]$
- 3D convolution:
    - Input: a video or CT volume $[C_{in}, D, H, W]$
    - Kernel: $[C_{in}, k, k, k]$
    - Output (one kernel): $[D_{out}, H_{out}, W_{out}]$

Notice that the dimensionality of the output produced by one kernel gives the convolution its name, as the sketch below shows.
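A small PyTorch sketch that makes the shapes concrete (channel counts, kernel size, and spatial sizes are arbitrary); with a single output kernel, the spatial part of the output is 1D, 2D, or 3D respectively:

```python
import torch
from torch import nn

x1 = torch.randn(1, 3, 32)            # (N, C_in, L): 1D conv input
x2 = torch.randn(1, 3, 32, 32)        # (N, C_in, H, W): 2D conv input
x3 = torch.randn(1, 3, 8, 32, 32)     # (N, C_in, D, H, W): 3D conv input

print(nn.Conv1d(3, 1, kernel_size=3)(x1).shape)  # torch.Size([1, 1, 30])
print(nn.Conv2d(3, 1, kernel_size=3)(x2).shape)  # torch.Size([1, 1, 30, 30])
print(nn.Conv3d(3, 1, kernel_size=3)(x3).shape)  # torch.Size([1, 1, 6, 30, 30])
```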

**Note:** the dimensionality of a convolution is determined by the dimensionality of the output produced by a single kernel.

***References:***

- [網(wǎng)易-deeplearning.ai: Convolution over volumes](https://mooc.study.163.com/learn/deeplearning_ai-2001281004?tid=2001392030#/learn/content?type=detail&id=2001728687&cid=2001725124)

## Loss Function

### Classification

#### Cross Entropy

$$

H(Y, \hat{Y}) = E_Y [\log \frac{1}{\hat{Y}}] = E_Y [-\log \hat{Y}]

$$

**Basic knowledge:**

- **Entropy (Shannon entropy):** Shannon defined the entropy $H$ of a discrete random variable $X$ with possible values $\{x_1, x_2, ..., x_n\}$ and probability mass function $P(X)$ as:

    $$
    H(X) = E[I(X)] = E[-\ln(P(X))]
    $$

    Here $E$ is the *expected value operator*, and $I$ is the *information content* of $X$.

    It can be written explicitly as,

    $$
    H(X) = \sum_{i=1}^{n}P(x_i)I(x_i) = -\sum_{i=1}^{n}P(x_i)\log_b P(x_i)
    $$

    where $b$ is the base of the logarithm used. Common values of $b$ are 2, Euler's number $e$, and 10. In machine learning and deep learning, people often use $e$.


- **KL divergence** from $\hat{Y}$ to $Y$ is the difference between cross entropy and entropy

$$

\mathrm{KL}(Y\|\hat{Y}) = \sum_{i}y_i\log\frac{1}{\hat{y_i}} - \sum_{i}y_i\log\frac{1}{y_i} = \sum_{i}y_i\log\frac{y_i}{\hat{y_i}}

$$
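A small NumPy sketch that computes the entropy, cross entropy, and KL divergence for two made-up discrete distributions (using the natural logarithm) and checks the relation above:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # assumed true distribution Y
q = np.array([0.5, 0.3, 0.2])    # assumed estimated distribution Y_hat

entropy = -np.sum(p * np.log(p))           # H(Y)
cross_entropy = -np.sum(p * np.log(q))     # H(Y, Y_hat)
kl = np.sum(p * np.log(p / q))             # KL(Y || Y_hat)

print(np.isclose(kl, cross_entropy - entropy))  # True
```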

**Note:** entropy is essentially the expectation of the Shannon information content, where the information content is $I(X)$ in the formula above.

- Information content: a measure of how much information the occurrence of the event represented by the random variable brings. Low-probability events carry more information; the higher the probability of the event, the smaller the information content, so the information content is inversely related to the probability of the event.
- Entropy: the average information content of the random variable $X$.
- Cross entropy: the amount of information needed when using the estimated distribution $q$ to approximate the true distribution $p$.
- KL divergence: the difference between the cross entropy and the entropy.

***References:***

- [A Friendly Introduction to Cross-Entropy Loss](https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/)
- [知乎:如何通俗的解釋交叉熵與相對(duì)熵?](https://www.zhihu.com/question/41252833)

* * *

# Awesome Papers

### [Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv-2014](https://arxiv.org/abs/1409.1556)

This paper proposes **VGG Net**. The core idea is to use small kernels (3 × 3) to build a fairly deep network. The reason for using small kernels is that they allow the network to be made deeper while keeping the number of parameters under control.

The network uses fully connected layers during training; at test time, to handle images of different sizes, the parameters of the fully connected layers are converted into convolution kernels of the corresponding size, turning them into fully convolutional layers.

### [Deep Residual Learning for Image Recognition, CVPR-2016](https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)

The paper describes the **degradation** problem of deep networks: a deeper network can end up with a higher training loss than a shallower one; such networks are called **plain networks**. The paper argues that this is not caused by vanishing gradients, because these plain networks use **BN**, which guarantees that the forward-propagated signals have **non-zero variance**, and the authors verify experimentally that, thanks to **BN**, the gradients in backward propagation behave normally. The authors conjecture that **the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error**, and that simply increasing the number of iterations does not solve the problem. My understanding: the parameter space of these ill-conditioned networks contains many saddle-like regions, where the gradients change little and optimization is hindered.

Therefore, the paper proposes **Residual Learning**:

$$

\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}

$$

where $\mathcal{H}(\mathbf{x})$ is the output of a few stacked layers and $\mathbf{x}$ denotes the input of the first of these layers. This makes the network learn the **residual** between the input and the output, and this **residual learning** is realized by a **skip connection**, as in the sketch below.
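A minimal PyTorch sketch of a residual block with a skip connection; the two-convolution structure and layer sizes are assumptions for illustration (the blocks in the paper also include batch normalization):

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))   # F(x)
        return self.relu(residual + x)                    # H(x) = F(x) + x, via the skip connection

out = ResidualBlock(16)(torch.randn(1, 16, 32, 32))
print(out.shape)  # torch.Size([1, 16, 32, 32])
```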

**Note:** a residual network learns the residual between the output and the input; that is, the output equals the input $\mathbf{x}$ plus $\mathcal{F}(\mathbf{x})$. Previous methods instead learn a direct mapping from input to output, $\mathbf{x} \to \mathrm{output}$, without residual learning.

### [Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields](https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation)

CMU's multi-person pose estimation paper, with very good results. The core idea is heatmap + PAF: the heatmaps predict the keypoints of multiple people, and the PAFs predict the direction of each skeleton bone. By computing the direction of each bone on the image (represented as a unit vector), the method builds a representation of the relations between keypoints. The network's labels are the heatmap of each keypoint and the PAF of each bone (twice the number of bones, describing the x and y directions respectively). In the data preprocessing step, Matlab is used to turn the annotation of an image containing several people into multiple samples, split into `self_joints` and `others_joints`; data augmentation is embedded in the Caffe code and applies `scale`, `rotate`, `crop and pad`, and `flip` in turn.
