CS231n (winter 2016) : Assignment2

前言：

以斯坦福cs231n課程的python編程任務(wù)為主線迅箩，展開對該課程主要內(nèi)容的理解和部分?jǐn)?shù)學(xué)推導(dǎo)。建議PC端閱讀院仿，該課程的學(xué)習(xí)資料和代碼如下：
視頻和PPT
筆記
 assignment2初始代碼

Part 1：深層全連接神經(jīng)網(wǎng)絡(luò)（python編程任務(wù)）

我們在Assignment1中完成了簡單的2-layer全連接神經(jīng)網(wǎng)絡(luò)葱绒，但是我們之前的編程不夠模塊化凄诞，所有的計算部分（損失函數(shù)的計算、梯度的計算等等）都放在了一個函數(shù)塊里由桌，使得沒有靈活性为黎，即我們無法隨意更改網(wǎng)絡(luò)的結(jié)構(gòu)。這里行您，我們使用更加模塊化的編程方式铭乾，每個模塊之間相互獨立，運行的時候可以相互調(diào)用娃循，使得我們的神經(jīng)網(wǎng)絡(luò)結(jié)構(gòu)十分靈活片橡。就像這樣：

python 
def layer_forward(x, w): 
""" Receive inputs x and weights w """ 
# Do some computations ... 
z = # ... some intermediate value 
# Do some more computations ... 
out = # the output  
cache = (x, w, z, out) # Values we need to compute gradients  

return out, cache 

The backward pass will receive upstream derivatives and the cache object, 
and will return gradients with respect to the inputs and weights, like this:

python 
def layer_backward(dout, cache): 
""" 
Receive derivative of loss with respect to outputs and cache, 
and compute derivative with respect to inputs. 
""" 
# Unpack cache values 
x, w, z, out = cache 

# Use values in cache to compute derivatives 
dx = # Derivative of loss with respect to x 
dw = # Derivative of loss with respect to w  

return dx, dw

此外，我們會將前面學(xué)過的參數(shù)更新策略全部集成到模塊中淮野，這樣我們可以探索不同的參數(shù)更新策略的性能表現(xiàn)捧书；我們也會將Batch Normalization和Dropout應(yīng)用到模塊中吹泡，來更高效地優(yōu)化深度網(wǎng)絡(luò)。

由于這部分的編程任務(wù)較為繁重经瓷，我們把任務(wù)拆分下來爆哑，一步一步地完成：

1. 2-layer全連接神經(jīng)網(wǎng)絡(luò)

這部分我們需要完成以下編程任務(wù)（此外，需要看懂solver.py）：
--> fc_net.py里的TwoLayerNet類
--> layers.py里的前四個函數(shù)
--> optim.py

具體代碼如下：
---> fc_net.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

from layer_utils import *

class TwoLayerNet(object):   
    """    
    A two-layer fully-connected neural network with ReLU nonlinearity and    
    softmax loss that uses a modular layer design. We assume an input dimension    
    of D, a hidden dimension of H, and perform classification over C classes.    

    The architecure should be affine - relu - affine - softmax.    

    Note that this class does not implement gradient descent; instead, it    
    will interact with a separate Solver object that is responsible for running    
    optimization.    

    The learnable parameters of the model are stored in the dictionary    
    self.params that maps parameter names to numpy arrays.   
    """
    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,           
                              weight_scale=1e-3, reg=0.0):    
        """    
        Initialize a new network.   
        Inputs:    
        - input_dim: An integer giving the size of the input    
        - hidden_dim: An integer giving the size of the hidden layer    
        - num_classes: An integer giving the number of classes to classify    
        - dropout: Scalar between 0 and 1 giving dropout strength.    
        - weight_scale: Scalar giving the standard deviation for random 
                        initialization of the weights.    
        - reg: Scalar giving L2 regularization strength.    
        """    
        self.params = {}    
        self.reg = reg   
        self.params['W1'] = weight_scale * np.random.randn(input_dim, hidden_dim)     
        self.params['b1'] = np.zeros((1, hidden_dim))    
        self.params['W2'] = weight_scale * np.random.randn(hidden_dim, num_classes)  
        self.params['b2'] = np.zeros((1, num_classes))

    def loss(self, X, y=None):    
        """   
        Compute loss and gradient for a minibatch of data.    
        Inputs:    
        - X: Array of input data of shape (N, d_1, ..., d_k)    
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].  
        Returns:   
        If y is None, then run a test-time forward pass of the model and return:    
        - scores: Array of shape (N, C) giving classification scores, where              
                  scores[i, c] is the classification score for X[i] and class c. 
        If y is not None, then run a training-time forward and backward pass and    
        return a tuple of:    
        - loss: Scalar value giving the loss   
        - grads: Dictionary with the same keys as self.params, mapping parameter             
                 names to gradients of the loss with respect to those parameters.    
        """
        scores = None
        N = X.shape[0]
        # Unpack variables from the params dictionary
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        h1, cache1 = affine_relu_forward(X, W1, b1)
        out, cache2 = affine_forward(h1, W2, b2)
        scores = out              # (N,C)
        # If y is None then we are in test mode so just return scores
        if y is None:   
            return scores

        loss, grads = 0, {}
        data_loss, dscores = softmax_loss(scores, y)
        reg_loss = 0.5 * self.reg * np.sum(W1*W1) + 0.5 * self.reg * np.sum(W2*W2)
        loss = data_loss + reg_loss

       # Backward pass: compute gradients
       dh1, dW2, db2 = affine_backward(dscores, cache2)
       dX, dW1, db1 = affine_relu_backward(dh1, cache1)
       # Add the regularization gradient contribution
       dW2 += self.reg * W2
       dW1 += self.reg * W1
       grads['W1'] = dW1
       grads['b1'] = db1
       grads['W2'] = dW2
       grads['b2'] = db2

       return loss, grads

---> layers.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 

#import numpy as np

def affine_forward(x, w, b):   
    """    
    Computes the forward pass for an affine (fully-connected) layer. 
    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N   
    examples, where each example x[i] has shape (d_1, ..., d_k). We will    
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and    
    then transform it to an output vector of dimension M.    
    Inputs:    
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)    
    - w: A numpy array of weights, of shape (D, M)    
    - b: A numpy array of biases, of shape (M,)   
    Returns a tuple of:    
    - out: output, of shape (N, M)    
    - cache: (x, w, b)   
    """
    out = None
    # Reshape x into rows
    N = x.shape[0]
    x_row = x.reshape(N, -1)         # (N,D)
    out = np.dot(x_row, w) + b       # (N,M)
    cache = (x, w, b)
    
    return out, cache

def affine_backward(dout, cache):   
    """    
    Computes the backward pass for an affine layer.    
    Inputs:    
    - dout: Upstream derivative, of shape (N, M)    
    - cache: Tuple of: 
    - x: Input data, of shape (N, d_1, ... d_k)    
    - w: Weights, of shape (D, M)    
    Returns a tuple of:   
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)    
    - dw: Gradient with respect to w, of shape (D, M) 
    - db: Gradient with respect to b, of shape (M,)    
    """    
    x, w, b = cache    
    dx, dw, db = None, None, None   
    dx = np.dot(dout, w.T)                       # (N,D)    
    dx = np.reshape(dx, x.shape)                 # (N,d1,...,d_k)   
    x_row = x.reshape(x.shape[0], -1)            # (N,D)    
    dw = np.dot(x_row.T, dout)                   # (D,M)    
    db = np.sum(dout, axis=0, keepdims=True)     # (1,M)    

    return dx, dw, db

def relu_forward(x):   
    """    
    Computes the forward pass for a layer of rectified linear units (ReLUs).    
    Input:    
    - x: Inputs, of any shape    
    Returns a tuple of:    
    - out: Output, of the same shape as x    
    - cache: x    
    """   
    out = None    
    out = ReLU(x)    
    cache = x    

    return out, cache

def relu_backward(dout, cache):   
    """  
    Computes the backward pass for a layer of rectified linear units (ReLUs).   
    Input:    
    - dout: Upstream derivatives, of any shape    
    - cache: Input x, of same shape as dout    
    Returns:    
    - dx: Gradient with respect to x    
    """    
    dx, x = None, cache    
    dx = dout    
    dx[x <= 0] = 0    

    return dx

def svm_loss(x, y):   
    """    
    Computes the loss and gradient using for multiclass SVM classification.    
    Inputs:    
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class         
         for the ith input.    
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and         
         0 <= y[i] < C   
    Returns a tuple of:    
    - loss: Scalar giving the loss   
    - dx: Gradient of the loss with respect to x    
    """    
    N = x.shape[0]   
    correct_class_scores = x[np.arange(N), y]    
    margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)    
    margins[np.arange(N), y] = 0   
    loss = np.sum(margins) / N   
    num_pos = np.sum(margins > 0, axis=1)    
    dx = np.zeros_like(x)   
    dx[margins > 0] = 1    
    dx[np.arange(N), y] -= num_pos    
    dx /= N    

    return loss, dx

def softmax_loss(x, y):    
    """    
    Computes the loss and gradient for softmax classification.    Inputs:    
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class         
    for the ith input.    
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and         
         0 <= y[i] < C   
    Returns a tuple of:    
    - loss: Scalar giving the loss    
    - dx: Gradient of the loss with respect to x   
    """    
    probs = np.exp(x - np.max(x, axis=1, keepdims=True))    
    probs /= np.sum(probs, axis=1, keepdims=True)    
    N = x.shape[0]   
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N    
    dx = probs.copy()    
    dx[np.arange(N), y] -= 1    
    dx /= N    

    return loss, dx

def ReLU(x):    
    """ReLU non-linearity."""    
    return np.maximum(0, x)

---> optim.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 

import numpy as np

def sgd(w, dw, config=None):    
    """    
    Performs vanilla stochastic gradient descent.    
    config format:    
    - learning_rate: Scalar learning rate.    
    """    
   if config is None: config = {}    
   config.setdefault('learning_rate', 1e-2)    
   w -= config['learning_rate'] * dw    

   return w, config

def sgd_momentum(w, dw, config=None):    
    """    
    Performs stochastic gradient descent with momentum.    
    config format:    
    - learning_rate: Scalar learning rate.    
    - momentum: Scalar between 0 and 1 giving the momentum value.                
    Setting momentum = 0 reduces to sgd.    
    - velocity: A numpy array of the same shape as w and dw used to store a moving    
    average of the gradients.   
    """   
    if config is None: config = {}    
    config.setdefault('learning_rate', 1e-2)   
    config.setdefault('momentum', 0.9)    
    v = config.get('velocity', np.zeros_like(w))    
    next_w = None    
    v = config['momentum'] * v - config['learning_rate'] * dw    
    next_w = w + v    
    config['velocity'] = v    

    return next_w, config

def rmsprop(x, dx, config=None):    
    """    
    Uses the RMSProp update rule, which uses a moving average of squared gradient    
    values to set adaptive per-parameter learning rates.    
    config format:    
    - learning_rate: Scalar learning rate.    
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared                  
    gradient cache.    
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.    
    - cache: Moving average of second moments of gradients.   
    """    
    if config is None: config = {}    
    config.setdefault('learning_rate', 1e-2)  
    config.setdefault('decay_rate', 0.99)    
    config.setdefault('epsilon', 1e-8)    
    config.setdefault('cache', np.zeros_like(x))    
    next_x = None    
    cache = config['cache']    
    decay_rate = config['decay_rate']    
    learning_rate = config['learning_rate']    
    epsilon = config['epsilon']    
    cache = decay_rate * cache + (1 - decay_rate) * (dx**2)    
    x += - learning_rate * dx / (np.sqrt(cache) + epsilon)  
    config['cache'] = cache    
    next_x = x    

    return next_x, config

def adam(x, dx, config=None):    
    """    
    Uses the Adam update rule, which incorporates moving averages of both the  
    gradient and its square and a bias correction term.    
    config format:    
    - learning_rate: Scalar learning rate.    
    - beta1: Decay rate for moving average of first moment of gradient.    
    - beta2: Decay rate for moving average of second moment of gradient.   
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.    
    - m: Moving average of gradient.    
    - v: Moving average of squared gradient.    
    - t: Iteration number.   
    """    
    if config is None: config = {}    
    config.setdefault('learning_rate', 1e-3)    
    config.setdefault('beta1', 0.9)    
    config.setdefault('beta2', 0.999)    
    config.setdefault('epsilon', 1e-8)    
    config.setdefault('m', np.zeros_like(x))    
    config.setdefault('v', np.zeros_like(x))    
    config.setdefault('t', 0)   
    next_x = None    
    m = config['m']    
    v = config['v']    
    beta1 = config['beta1']    
    beta2 = config['beta2']    
    learning_rate = config['learning_rate']    
    epsilon = config['epsilon']   
    t = config['t']    
    t += 1    
    m = beta1 * m + (1 - beta1) * dx    
    v = beta2 * v + (1 - beta2) * (dx**2)    
    m_bias = m / (1 - beta1**t)    
    v_bias = v / (1 - beta2**t)    
    x += - learning_rate * m_bias / (np.sqrt(v_bias) + epsilon)    
    next_x = x    
    config['m'] = m    
    config['v'] = v    
    config['t'] = t    

    return next_x, config

編程完成后舆吮，我們可以用FullyConnectedNets.ipynb里的代碼來check我們的代碼是否有誤揭朝。check完之后，我們可以在CIFAR-10上跑一遍色冀，和Assignment1里的2-layer神經(jīng)網(wǎng)絡(luò)比較一下潭袱，結(jié)果應(yīng)該是差不多的。

這里锋恬，我貼一下在CIFAR-10上運行的代碼和結(jié)果圖：
---> two_layer_fc_net_start.py

__coauthor__ = 'Deeplayer'
# 6.22.2016

import matplotlib.pyplot as plt
from fc_net import *
from data_utils import get_CIFAR10_data
from solver import Solver

data = get_CIFAR10_data()
model = TwoLayerNet(reg=0.9)
solver = Solver(model, data,                
                lr_decay=0.95,                
                print_every=100, num_epochs=40, batch_size=400, 
                update_rule='sgd_momentum',                
                optim_config={'learning_rate': 5e-4, 'momentum': 0.5})

solver.train()                 

plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()


best_model = model
y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
print 'Validation set accuracy: ', (y_val_pred == data['y_val']).mean()
print 'Test set accuracy: ', (y_test_pred == data['y_test']).mean()
# Validation set accuracy:  about 52.9%
# Test set accuracy:  about 54.7%


# Visualize the weights of the best network
from vis_utils import visualize_grid

def show_net_weights(net):    
    W1 = net.params['W1']    
    W1 = W1.reshape(3, 32, 32, -1).transpose(3, 1, 2, 0)    
    plt.imshow(visualize_grid(W1, padding=3).astype('uint8'))   
    plt.gca().axis('off')    

plt.show()show_net_weights(best_model)

Figure_1.png

2. Multilayer全連接網(wǎng)絡(luò) + Batch Normalization

這部分我們需要完成以下編程任務(wù):
--> fc_net.py 里的 FullyConnectedNet類
--> layers.py 里的 batchnorm_forward 和 batchnorm_backward函數(shù)

具體代碼如下：
---> fc_net.py

__coauthor__ = 'Deeplayer'
# 6.22.2016

from layer_utils import *

class FullyConnectedNet(object):    
    """    
    A fully-connected neural network with an arbitrary number of hidden layers,    
    ReLU nonlinearities, and a softmax loss function. This will also implement    
    dropout and batch normalization as options. For a network with L layers,    
    the architecture will be    
    {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax    
    where batch normalization and dropout are optional, and the {...} block is    
    repeated L - 1 times.   
    Similar to the TwoLayerNet above, learnable parameters are stored in the    
    self.params dictionary and will be learned using the Solver class. 
    def __init__(self, hidden_dims, input_dim=3*32*32,  
                 num_classes=10,              
                 dropout=0, use_batchnorm=False, reg=0.0,    
                 weight_scale=1e-2, dtype=np.float32, seed=None):    
    """
    def __init__(self, hidden_dims, input_dim=3*32*32, 
                 num_classes=10,           
                 dropout=0, use_batchnorm=False, reg=0.0,      
                 weight_scale=1e-2, dtype=np.float32, seed=None):

        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        layers_dims = [input_dim] + hidden_dims + [num_classes]
        for i in xrange(self.num_layers):    
            self.params['W' + str(i+1)] = weight_scale * np.random.randn(layers_dims[i], layers_dims[i+1])    
            self.params['b' + str(i+1)] = np.zeros((1, layers_dims[i+1]))    
            if self.use_batchnorm and i < len(hidden_dims): 
                self.params['gamma' + str(i+1)] = np.ones((1, layers_dims[i+1]))        
                self.params['beta' + str(i+1)] = np.zeros((1, layers_dims[i+1]))
        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:    
            self.dropout_param = {'mode': 'train', 'p': dropout}    
            if seed is not None:        
                self.dropout_param['seed'] = seed
        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.use_batchnorm:    
            self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.iteritems():    
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):    
        """    
        Compute loss and gradient for the fully-connected net.    
        Input / output: Same as TwoLayerNet above.    
        """    
        X = X.astype(self.dtype)    
        mode = 'test' if y is None else 'train'    
        # Set train/test mode for batchnorm params and dropout param since they    
        # behave differently during training and testing.    
        if self.dropout_param is not None: 
            self.dropout_param['mode'] = mode    
        if self.use_batchnorm:        
        for bn_param in self.bn_params:            
            bn_param['mode'] = mode    
        scores = None    
        h, cache1, cache2, cache3, bn, out = {}, {}, {}, {}, {}, {}    
        out[0] = X

        # Forward pass: compute loss
        for i in xrange(self.num_layers-1):    
            # Unpack variables from the params dictionary    
            W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
            if self.use_batchnorm:        
                gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]        
                h[i], cache1[i] = affine_forward(out[i], W, b)        
                bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])        
                out[i+1], cache3[i] = relu_forward(bn[i])    
            else:        
                out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)

        W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
        scores, cache = affine_forward(out[self.num_layers-1], W, b)

        # If test mode return early
        if mode == 'test':   
            return scores

        loss, reg_loss, grads = 0.0, 0.0, {}
        data_loss, dscores = softmax_loss(scores, y)
        for i in xrange(self.num_layers):    
            reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        dout, dbn, dh = {}, {}, {}
        t = self.num_layers-1
        dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
        for i in xrange(t):    
            if self.use_batchnorm:        
                dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i]) 
                dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])       
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])    
            else:        
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])

        # Add the regularization gradient contribution
        for i in xrange(self.num_layers):    
            grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]

        return loss, grads

在給出 batchnorm_forward 和 batchnorm_backward函數(shù)代碼之前屯换，先給出Batch Normalization的算法和反向求導(dǎo)公式：

Batch Normalization, algorithm1.png

Backpropagate the gradient of loss ? .png

---> layers.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 

import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':    
        sample_mean = np.mean(x, axis=0, keepdims=True)       # [1,D]    
        sample_var = np.var(x, axis=0, keepdims=True)         # [1,D] 
        x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps)    # [N,D]    
        out = gamma * x_normalized + beta    
        cache = (x_normalized, gamma, beta, sample_mean, sample_var, x, eps)    
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean    
        running_var = momentum * running_var + (1 - momentum) * sample_var
    elif mode == 'test':    
        x_normalized = (x - running_mean) / np.sqrt(running_var + eps)    
        out = gamma * x_normalized + beta
    else:    
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache

def batchnorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None
    x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
    N, D = x.shape
    dx_normalized = dout * gamma       # [N,D]
    x_mu = x - sample_mean             # [N,D]
    sample_std_inv = 1.0 / np.sqrt(sample_var + eps)    # [1,D]
    dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3
    dsample_mean = -1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True) - \                                
                                   2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True)
    dx1 = dx_normalized * sample_std_inv
    dx2 = 2.0/N * dsample_var * x_mu
    dx = dx1 + dx2 + 1.0/N * dsample_mean
    dgamma = np.sum(dout * x_normalized, axis=0, keepdims=True)
    dbeta = np.sum(dout, axis=0, keepdims=True)

    return dx, dgamma, dbeta

完成編程后，我們可以用Batch Normalization.ipynb來check我們的code是否有誤与学。下面我會給出在使用Batch Normalization的情況下彤悔，6-layer神經(jīng)網(wǎng)絡(luò)在CIFAR-10上的performance∷魇兀可以預(yù)見晕窑，6-layer神經(jīng)網(wǎng)絡(luò)的performance應(yīng)該不會比2-layer神經(jīng)網(wǎng)絡(luò)的performance好多少的（因為會存在我在Assignment1最后提到的問題1）。

在這之前卵佛，我們先來看看Batch Normalization對梯度消失現(xiàn)象的緩解能力怎樣杨赤，同時給出在不同weight_scales下的情況。我們分別以sigmoid和ReLU作為為激活函數(shù)的6-layer神經(jīng)網(wǎng)絡(luò)為例截汪，測試一下：

---> batchnorm_and_weight_scales.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

from fc_net import *
from solver import *
import matplotlib.pyplot as plt
from data_utils import get_CIFAR10_data

# Load the (preprocessed) CIFAR10 data.
data = get_CIFAR10_data()

hidden_dims = [100, 100, 100, 100, 100]
num_train = 5000
small_data = {  
       'X_train': data['X_train'][:num_train],  
       'y_train': data['y_train'][:num_train],  
       'X_val': data['X_val'],  
       'y_val': data['y_val'],
}
bn_solvers = {}
solvers = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):    
    print 'Running weight scale %d / %d' % (i + 1, len(weight_scales)) 
    bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)    
    model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)    

    bn_solver = Solver(bn_model, small_data,        
                       num_epochs=10, batch_size=100,           
                       update_rule='adam',                  
                       optim_config={'learning_rate': 1e-3, },                  
                       verbose=False, print_every=1000)    
    bn_solver.train()    
    bn_solvers[weight_scale] = bn_solver    

    solver = Solver(model, small_data,                  
                    num_epochs=10, batch_size=100,      
                    update_rule='adam',                 
                    optim_config={'learning_rate': 1e-3, },  
                    verbose=False, print_every=1000)    
    solver.train()    
    solvers[weight_scale] = solver

# Plot results of weight scale experiment
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []

for ws in weight_scales: 
    best_train_accs.append(max(solvers[ws].train_acc_history))
    bn_best_train_accs.append(max(bn_solvers[ws].train_acc_history))  

    best_val_accs.append(max(solvers[ws].val_acc_history))  
    bn_best_val_accs.append(max(bn_solvers[ws].val_acc_history))  

    final_train_loss.append(np.mean(solvers[ws].loss_history[-100:]))  
    bn_final_train_loss.append(np.mean(bn_solvers[ws].loss_history[-100:]))

plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')

plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend(loc='upper left')

plt.subplot(3, 1, 3)
plt.title('Final training loss vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend(loc='upper left')

plt.gcf().set_size_inches(10, 15)
plt.show()

Activation Function: Sigmoid.png

Activation Function: ReLU.png

從上圖可以看出：

1)望拖、Batch Normalization解決了困擾學(xué)術(shù)界十幾年的sigmoid的過飽和問題（梯度消失問題），bravo挫鸽！可能你覺得上面的結(jié)果不夠直接说敏，那么我貼一下每層的權(quán)重梯度值：

Left: without Batch Normalization --- Right: with Batch Normalization

2)、即使沒有梯度消失現(xiàn)象丢郊，sigmoid還是沒有ReLU好盔沫。
3)、如果weight_scales選得好的話枫匾，當(dāng)激活函數(shù)為ReLU時架诞，Batch Normalization對識別率的提升并不多。

現(xiàn)在干茉，我給一下6-layer神經(jīng)網(wǎng)絡(luò)在CIFAR-10上的識別結(jié)果（激活函數(shù)為ReLU）：
· Validation set accuracy: 0.554
· Test set accuracy: 0.54

3. Dropout

這部分我們需要完成以下編程任務(wù):
--> 修改fc_net.py谴忧，將dropout加進(jìn)去
vlayers.py 里的 dropout_forward 和 dropout_backward函數(shù)

Dropout是我們在實際（深度）神經(jīng)網(wǎng)絡(luò)訓(xùn)練中，用得非常多的一種正則化手段，可以很好地抑制過擬合沾谓。即：在訓(xùn)練過程中委造，我們對每個神經(jīng)元，都以概率p保持它的激活狀態(tài)均驶。下面給出3-layer神經(jīng)網(wǎng)絡(luò)的dropout示意圖:

CS231n Convolutional Neural Networks for Visual Recognition.png

具體代碼如下：

對于fc_net.py我們只要修改下其中的loss函數(shù)：

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

    def loss(self, X, y=None):    
        """    
        Compute loss and gradient for the fully-connected net.    
        Input / output: Same as TwoLayerNet above.    
        """    
        X = X.astype(self.dtype)    
        mode = 'test' if y is None else 'train'    
        # Set train/test mode for batchnorm params and dropout param since they    
        # behave differently during training and testing.    
        if self.dropout_param is not None: 
            self.dropout_param['mode'] = mode    
        if self.use_batchnorm:        
        for bn_param in self.bn_params:            
            bn_param['mode'] = mode    
        scores = None    
        h, cache1, cache2, cache3, cache4, bn, out = {}, {}, {}, {}, {}, {}, {}    
        out[0] = X

        # Forward pass: compute loss
        for i in xrange(self.num_layers-1):    
            # Unpack variables from the params dictionary    
            W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
            if self.use_batchnorm:        
                gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]        
                h[i], cache1[i] = affine_forward(out[i], W, b)        
                bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])        
                out[i+1], cache3[i] = relu_forward(bn[i])
                if self.use_dropout:    
                    out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param) 
            else:        
                out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)
                if self.use_dropout:    
                    out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param)
        W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
        scores, cache = affine_forward(out[self.num_layers-1], W, b)

        # If test mode return early
        if mode == 'test':   
            return scores

        loss, reg_loss, grads = 0.0, 0.0, {}
        data_loss, dscores = softmax_loss(scores, y)
        for i in xrange(self.num_layers):    
            reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        dout, dbn, dh, ddrop = {}, {}, {}, {}
        t = self.num_layers-1
        dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
        for i in xrange(t):    
            if self.use_batchnorm:
                if self.use_dropout:    
                    ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])    
                    dout[t-i] = ddrop[t-1-i]     
                dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i]) 
                dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])       
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])    
            else:
                if self.use_dropout:    
                    ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])    
                    dout[t-i] = ddrop[t-1-i]
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])

        # Add the regularization gradient contribution
        for i in xrange(self.num_layers):    
            grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]

        return loss, grads

---> layers.py 里的 dropout_forward 和 dropout_backward函數(shù)

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

def dropout_forward(x, dropout_param):
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:  
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None
    if mode == 'train':    
        mask = (np.random.rand(*x.shape) < p) / p    
        out = x * mask
    elif mode == 'test':    
        out = x

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache


def dropout_backward(dout, cache):
    dropout_param, mask = cache
    mode = dropout_param['mode']
    dx = None

    if mode == 'train':    
        dx = dout * mask
    elif mode == 'test':    
        dx = dout

    return dx

完成編程后昏兆，我們可以用Dropout.ipynb里的代碼來check你的code是否有誤。我們可以用Dropout.ipynb里最后一部分的代碼來比較下使用和不使用dropout的區(qū)別：

Dropout vs Overfitting.png

Part 2：卷積神經(jīng)網(wǎng)絡(luò)（Convolutional Neural Networks, CNNs）

現(xiàn)在我們開始理解本課程的核心內(nèi)容 —— 卷積神經(jīng)網(wǎng)絡(luò)妇穴，對于視覺識別任務(wù)爬虱，CNNs無疑是最出彩的。和我們前面講過的全連接神經(jīng)網(wǎng)絡(luò)相比腾它，CNNs的優(yōu)越之處在哪呢跑筝？我覺得可以列出以下幾點：

1)、它的權(quán)值共享以及局部（感受野瞒滴，receptive field）連接的特點曲梗，使之更加類似生物神經(jīng)網(wǎng)絡(luò)，視覺皮層的神經(jīng)元就是局部接受信息的（即這些神經(jīng)元只響應(yīng)某些特定感受野的刺激）逛腿；
2)、在我們的圖像比較大的情況下（如 96x96仅颇、224x224单默、384x384、512x512等）忘瓦，全連接神經(jīng)網(wǎng)絡(luò)將需要訓(xùn)練超大量的參數(shù)（權(quán)重和偏置）搁廓，這不僅會使得計算變得非常耗時，還會導(dǎo)致更加嚴(yán)重的過擬合現(xiàn)象耕皮。而CNNs的權(quán)值共享和局部連接的特點境蜕，使得需要訓(xùn)練的參數(shù)銳減（指數(shù)級的）；
3)凌停、CNNs具有強(qiáng)大的特征提取能力（從邊緣到局部再到整體）粱年，而全連接神經(jīng)網(wǎng)絡(luò)基本沒有特征提取的能力。

下面我們來具體討論CNNs的結(jié)構(gòu)特點罚拟，討論之前台诗，先給一張圖，方便感受下CNNs的大致結(jié)構(gòu)：

CS231n Convolutional Neural Networks for Visual Recognition.png

1. 卷積層（Convolutional Layer）

卷積層赐俗，也可以稱之為特征提取層拉队，是CNNs最重要的部分。卷積層需要訓(xùn)練的參數(shù)是一系列的過濾器（我更喜歡卷積核這個詞）阻逮，這些過濾器的大小一致粱快，通常都是正方形。假設(shè)我們有n個過濾器，每個過濾器的大小為kxk（k通常取3或5）事哭，那么這一層我們需要訓(xùn)練的參數(shù)就有nxkxk+n/c個（這里的c表示通道數(shù)漫雷，如果是灰度圖像c=1，如果是彩色圖像c=3）慷蠕。權(quán)值共享告訴我們珊拼，一個過濾器只能提取一種特征，即當(dāng)過濾器在圖像上卷積（滑動）的過程中流炕，只提取了該圖像全局范圍內(nèi)的同一個特征澎现。所以，n個過濾器可以提取圖像的n個不同特征每辟。這里貼張卷積過程的動圖剑辫，這里的過濾器個數(shù)是6，但事實上是2種（因為有三個通道嘛）渠欺，所以提取了兩種特征：

CS231n Convolutional Neural Networks for Visual Recognition.gif

動圖中妹蔽，你會發(fā)現(xiàn)圖像外面多了一圈0，而且過濾器移動的步長（stride）為2挠将。補(bǔ)零這個操作胳岂，我們稱之為zero-padding。我們記補(bǔ)零的圈數(shù)為p舔稀，過濾器移動步長為s乳丰，那么計算輸出卷積特征（convolved feature，或者叫activation map）邊長的公式為： L=(input_dim-k+2p)/s+1内贮，輸出特征的維數(shù)則為LxLxn/c产园。zero-padding這個操作產(chǎn)生的原因是為了保證過濾器的滑動能從頭到尾剛剛好，即保證上面的公式能夠整除夜郁。上面的p什燕，s和n是需要我們提前設(shè)定好的三個超參數(shù)。對于步長s的設(shè)定竞端，s設(shè)定得越小屎即，提取的信息就越豐富，但計算量會相對大一點事富；s設(shè)定得越大剑勾，計算量會相對小一點，但是提取的信息就少一些赵颅。s的通常選擇是1虽另。

---> PS: 卷積為什么work?
自然圖像有其固有特性，也就是說饺谬，圖像的一部分的統(tǒng)計特性與其他部分是一樣的捂刺。這也意味著我們在這一部分學(xué)習(xí)的特征也能用在另一部分上谣拣，所以對于這個圖像上的所有位置，我們都能使用同樣的學(xué)習(xí)特征族展。（摘自UFLDL）

2. 池化層（Pooling Layer）

卷積層的下一層是池化層森缠，但要注意，卷積層的輸出會經(jīng)過激活函數(shù)（如ReLU）激活后仪缸，進(jìn)入池化層贵涵。池化層的作用是將卷積層輸出的維數(shù)進(jìn)一步降低，以此來減少參數(shù)的數(shù)量和計算量恰画。具體來講宾茂，是將卷積層得到的結(jié)果無重合的分成幾個子區(qū)域，然后選擇每一子區(qū)域的最大值拴还，或者平均值跨晴，或者2范數(shù)，我們以取最大值的max pooling為例（相對而言片林，max pooling的效果更好端盆，所以我們通常采用max pooling），給出一個diagram：

CS231n Convolutional Neural Networks for Visual Recognition.png

通常费封，池化層的采樣窗口大小為2x2焕妙。

有些人認(rèn)為池化層并不是必要的，如Striving for Simplicity: The All Convolutional Net弓摘。此外焚鹊，有人發(fā)現(xiàn)去除池化層對于生成式模型（generative models）很重要，例如variational autoencoders(VAEs)衣盾，generative adversarial networks(GANs)寺旺∫ィ可能在以后的模型結(jié)構(gòu)中势决，池化層會逐漸減少或者消失。

3. 全連接層（Fully-connected layer）

現(xiàn)在的很多CNNs模型蓝撇，在最后幾層（一般是1~3層）會采用全連接的方式去學(xué)習(xí)更多的信息果复。注意，全連接層的最后一層就是輸出層渤昌；除了最后一層虽抄，其它的全連接層都包含激活函數(shù)。

4. 卷積神經(jīng)網(wǎng)絡(luò)結(jié)構(gòu)（CNNs Architectures）

CNNs的通常結(jié)構(gòu)独柑，可以表述如下：

INPUT --> [[CONV --> RELU]*N --> POOL?]*M --> [FC --> RELU]*K --> FC(OUTPUT)

其中迈窟，"?"是代表池化層是可選的，可有可無忌栅；N（一般0_{3）车酣，K（一般0}2）和M（M>=0）是具體層數(shù)。

注意，我們傾向于選擇多層小size的卷積層湖员，而不是一個大size的卷積層贫悄。
現(xiàn)在，我們以3個3x3的卷積層和1個7x7的卷積層為例娘摔，加以對比說明窄坦。從下圖可以看出，這兩種方法最終得到的activation map大小是一致的凳寺，但3個3x3的卷積層明顯更好：
1)猖吴、3層的非線性組合要比1層線性組合提取出的特征具備更高的表達(dá)能力；
2)雕薪、3層小size的卷積層的參數(shù)數(shù)量要少纲岭，3x3x3<7x7；
3)怜瞒、同樣的父泳，為了便于反向傳播時的梯度計算，我們需要保留很多中間梯度吴汪，3層小size的卷積層需要保留的中間梯度更少惠窄。

3_3x3 VS 1_7x7.png

下面我給出一個最簡單的CNNs結(jié)構(gòu)的diagram（input+1conv+1pool+2fc）:

A simple CNNs architecture.png

這里我們列舉幾種常見類型的卷積神經(jīng)網(wǎng)絡(luò)結(jié)構(gòu)：

· INPUT --> FC/OUT      這其實就是個線性分類器
· INPUT --> CONV --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> POOL]*2 --> FC --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> CONV --> RELU --> POOL]*3 --> [FC --> RELU]*2 --> FC/OUT

---> PS:
1、對于輸入層（圖像層）漾橙，我們一般會把圖像大小resize成邊長為2的次方的正方形杆融。比如CIFAR-10是32x32x3，STL-10是64x64x3霜运，而ImageNet是224x224x3或者512x512x3脾歇。

2、實際工程中淘捡，我們得預(yù)估一下內(nèi)存藕各，然后根據(jù)內(nèi)存的情況去設(shè)定合理的值。例如輸入是224x224x3得圖片焦除，過濾器大小為3x3激况，共64個，zero-padding為1膘魄，這樣每張圖片需要72MB的內(nèi)存（這里的72MB囊括了圖片以及對應(yīng)的參數(shù)乌逐、梯度和激活值在內(nèi)的，所需要的內(nèi)存空間）创葡，但是在GPU上運行的話浙踢，內(nèi)存可能不夠（相比于CPU，GPU的內(nèi)存要小得多）灿渴，所以需要調(diào)整下參數(shù)洛波，比如過濾器大小改為7x7呐芥，stride改為2（ZF net），或者過濾器大小改為11x11奋岁，stride改為4（AlexNet）思瘟。

3、構(gòu)建一個實際可用的深度卷積神經(jīng)網(wǎng)絡(luò)最大的瓶頸是GPU的內(nèi)(顯)存∥帕妫現(xiàn)在很多GPU只有3/4/6GB的內(nèi)存滨攻，單卡最大的也就12G（NVIDIA），所以我們應(yīng)該在設(shè)計卷積神經(jīng)網(wǎng)的時候蓝翰，多加考慮內(nèi)存主要消耗在哪里：

大量的激活值和中間梯度值光绕；
參數(shù)，反向傳播時的梯度以及使用momentum畜份，Adagrad诞帐，or RMSProp時的緩存都會占用儲存，所以估計參數(shù)占用的內(nèi)存時爆雹，一般至少要乘以3倍停蕉；
數(shù)據(jù)的batch以及其他的類似信息或者來源信息等也會消耗一部分內(nèi)存。

下面列出一些著名的卷積神經(jīng)網(wǎng)絡(luò)：
· LeNet钙态，這是最早成功應(yīng)用的卷積神經(jīng)網(wǎng)絡(luò)慧起，Yann LeCun在論文LeNet中提出。
· AlexNet册倒，2012 ILSVRC競賽遠(yuǎn)超第2名的卷積神經(jīng)網(wǎng)絡(luò)蚓挤，掀起了深度學(xué)習(xí)的浪潮。
· ZF Net驻子，2013 ILSVRC競賽冠軍灿意，調(diào)整了Alexnet的結(jié)構(gòu)參數(shù), 擴(kuò)增了中間卷積層。
· GoogLeNet崇呵，2014 ILSVRC競賽冠軍缤剧，極大地減少了參數(shù)數(shù)量（由 60M到4M）。
· VGGNet演熟，2014 ILSVRC鞭执，證明了CNNs的深度對于最后的效果有至關(guān)重要的作用司顿。
· ResNet芒粹，2015 ILSVRC競賽冠軍，截止2016年5月10大溜，這是最先進(jìn)的模型化漆。最近Kaiming He等人，提出了改進(jìn)版Identity Mappings in Deep Residual Networks钦奋。

From Kaiming He's ICML16 tutorial

Part 3：Python編程任務(wù)（3-layer CNNs）

這部分我們需要完成以下編程任務(wù)：
1)座云、layers.py里的以下函數(shù)：
---> conv_forward_naive
---> conv_backward_naive
---> max_pool_forward_naive
---> max_pool_backward_naive

在給出卷積層的代碼前疙赠，我們先理解下卷積層的前向和后向傳播時，具體是如何計算的朦拖。為了理解方便圃阳，我們假設(shè)某一個batch里的第一張圖片為x[0, :, :, :]，有RGB三個通道璧帝，每個通道大小為7x7捍岳，padding為1，stride為2睬隶，那么x[0, :, :, :]的大小為1x3x9x9锣夹；此外，我們假設(shè)有3個過濾器苏潜，每個大小為3x3银萍，用w表示所有過濾器中的權(quán)重（如第一個濾波器的第一個通道為w[0, 0, :, :]）；偏置b的大小為1x3恤左；activation maps用out來表示贴唇，大小為3x4x4（如第一個map為out[0, :, :]）。

以剛才的假設(shè)為例飞袋，給出前向傳播和后向傳播的具體計算過程（反向傳播的那張圖片分辨率較高滤蝠，請在新的標(biāo)簽頁打開圖片并放大，或者下載后觀看）：

Forward.png

Backward.jpg

具體代碼如下：

__coauthor__ = 'Deeplayer'
# 6.25.2016 #

def conv_forward_naive(x, w, b, conv_param):
    stride, pad = conv_param['stride'], conv_param['pad']
    N, C, H, W = x.shape
    F, C, HH, WW = w.shape
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    H_new = 1 + (H + 2 * pad - HH) / stride
    W_new = 1 + (W + 2 * pad - WW) / stride
    s = stride
    out = np.zeros((N, F, H_new, W_new))

    for i in xrange(N):       # ith image    
        for f in xrange(F):   # fth filter        
            for j in xrange(H_new):            
                for k in xrange(W_new):                
                    out[i, f, j, k] = np.sum(x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] * w[f]) + b[f]

    cache = (x, w, b, conv_param)

    return out, cache


def conv_backward_naive(dout, cache):
    x, w, b, conv_param = cache
    pad = conv_param['pad']
    stride = conv_param['stride']
    F, C, HH, WW = w.shape
    N, C, H, W = x.shape
    H_new = 1 + (H + 2 * pad - HH) / stride
    W_new = 1 + (W + 2 * pad - WW) / stride

    dx = np.zeros_like(x)
    dw = np.zeros_like(w)
    db = np.zeros_like(b)

    s = stride
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
    dx_padded = np.pad(dx, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')

    for i in xrange(N):       # ith image    
        for f in xrange(F):   # fth filter        
            for j in xrange(H_new):            
                for k in xrange(W_new):                
                    window = x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s]
                    db[f] += dout[i, f, j, k]                
                    dw[f] += window * dout[i, f, j, k]                
                    dx_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] += w[f] * dout[i, f, j, k]

    # Unpad
    dx = dx_padded[:, :, pad:pad+H, pad:pad+W]

    return dx, dw, db

完成編程后授嘀，可以用ConvolutionalNetworks.ipynb里的代碼來check編程是否有誤物咳。

下面給出池化層（最大值池化）的代碼：

__coauthor__ = 'Deeplayer'
# 6.25.2016 #

def max_pool_forward_naive(x, pool_param):
    HH, WW = pool_param['pool_height'], pool_param['pool_width']
    s = pool_param['stride']
    N, C, H, W = x.shape
    H_new = 1 + (H - HH) / s
    W_new = 1 + (W - WW) / s
    out = np.zeros((N, C, H_new, W_new))
    for i in xrange(N):    
        for j in xrange(C):        
            for k in xrange(H_new):            
                for l in xrange(W_new):                
                    window = x[i, j, k*s:HH+k*s, l*s:WW+l*s] 
                    out[i, j, k, l] = np.max(window)

    cache = (x, pool_param)

    return out, cache


def max_pool_backward_naive(dout, cache):
    x, pool_param = cache
    HH, WW = pool_param['pool_height'], pool_param['pool_width']
    s = pool_param['stride']
    N, C, H, W = x.shape
    H_new = 1 + (H - HH) / s
    W_new = 1 + (W - WW) / s
    dx = np.zeros_like(x)
    for i in xrange(N):    
        for j in xrange(C):        
            for k in xrange(H_new):            
                for l in xrange(W_new):                
                    window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]                
                    m = np.max(window)               
                    dx[i, j, k*s:HH+k*s, l*s:WW+l*s] = (window == m) * dout[i, j, k, l]

    return dx

同樣，可以用ConvolutionalNetworks.ipynb里的代碼來check編程是否有誤蹄皱。

上面的編程中览闰，我們使用了多層for循環(huán)，這會使得運行速度過慢巷折。為了加快運行速度压鉴，Assignment2里提供了fast_layers.py，但需要借助Cython來生成C擴(kuò)展锻拘，加快運行速度油吭。這里，我給出naive版和fast版在運行速度上的對比署拟，從下圖可以看出婉宰，運行速度得到了極大的提升：

Naive vs Fast.png

2)、cnn.py推穷，具體代碼如下：

__coauthor__ = 'Deeplayer'
# 6.25.2016 #

from layer_utils import *

class ThreeLayerConvNet(object):    
    """    
    A three-layer convolutional network with the following architecture:       
       conv - relu - 2x2 max pool - affine - relu - affine - softmax
    """

    def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,             
                 hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
                 dtype=np.float32):
        self.params = {}
        self.reg = reg
        self.dtype = dtype

        # Initialize weights and biases
        C, H, W = input_dim
        self.params['W1'] = weight_scale * np.random.randn(num_filters, C, filter_size, filter_size)
        self.params['b1'] = np.zeros((1, num_filters))
        self.params['W2'] = weight_scale * np.random.randn(num_filters*H*W/4, hidden_dim)
        self.params['b2'] = np.zeros((1, hidden_dim))
        self.params['W3'] = weight_scale * np.random.randn(hidden_dim, num_classes)
        self.params['b3'] = np.zeros((1, num_classes))

        for k, v in self.params.iteritems():    
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        W3, b3 = self.params['W3'], self.params['b3']

        # pass conv_param to the forward pass for the convolutional layer
        filter_size = W1.shape[2]
        conv_param = {'stride': 1, 'pad': (filter_size - 1) / 2}

        # pass pool_param to the forward pass for the max-pooling layer
        pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

        # compute the forward pass
        a1, cache1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
        a2, cache2 = affine_relu_forward(a1, W2, b2)
        scores, cache3 = affine_forward(a2, W3, b3)

        if y is None:    
            return scores

        # compute the backward pass
        data_loss, dscores = softmax_loss(scores, y)
        da2, dW3, db3 = affine_backward(dscores, cache3)
        da1, dW2, db2 = affine_relu_backward(da2, cache2)
        dX, dW1, db1 = conv_relu_pool_backward(da1, cache1)

        # Add regularization
        dW1 += self.reg * W1
        dW2 += self.reg * W2
        dW3 += self.reg * W3
        reg_loss = 0.5 * self.reg * sum(np.sum(W * W) for W in [W1, W2, W3])

        loss = data_loss + reg_loss
        grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2, 'W3': dW3, 'b3': db3}

        return loss, grads

完成編程后心包，可以用ConvolutionalNetworks.ipynb里的代碼來check編程是否有誤。

3)馒铃、layers.py里的spatial_batchnorm_forward和spatial_batchnorm_backward函數(shù)蟹腾。在給出代碼前痕惋，我放張圖，方便大家理解CNNs里的Batch Normalization是怎么計算卷積層的均值mean和標(biāo)準(zhǔn)差std的：

ConvNet Batch Normalization.png

具體代碼如下：

__coauthor__ = 'Deeplayer'
# 6.25.2016 #

def spatial_batchnorm_forward(x, gamma, beta, bn_param):
    N, C, H, W = x.shape
    x_new = x.transpose(0, 2, 3, 1).reshape(N*H*W, C)
    out, cache = batchnorm_forward(x_new, gamma, beta, bn_param)
    out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)

    return out, cache


def spatial_batchnorm_backward(dout, cache):
    N, C, H, W = dout.shape
    dout_new = dout.transpose(0, 2, 3, 1).reshape(N*H*W, C)
    dx, dgamma, dbeta = batchnorm_backward(dout_new, cache)
    dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)

    return dx, dgamma, dbeta

完成編程后娃殖，可以用ConvolutionalNetworks.ipynb里的代碼來check編程是否有誤值戳。

以上面完成的ThreeLayerConvNet為例，比較下使用和不使用Batch Normalization對收斂速度的影響炉爆。從下圖中的結(jié)果可以看出述寡，使用Batch Normalization明顯加快了收斂，使得訓(xùn)練速度大幅提升（因為需要的epoch更少）：

with BN --vs-- without BN.png

---> PS:
1叶洞、數(shù)據(jù)擴(kuò)增（Data Augmentation）
當(dāng)數(shù)據(jù)集較小的情況下鲫凶，這一操作還是十分有效的，可以一定程度提高識別率衩辟。具體的擴(kuò)增方法如下：
1)螟炫、水平翻轉(zhuǎn)（Horizontal flips）

Horizontal flips.png

2)、隨機(jī)剪裁（Random crops/scales）

Random crops/scales.png

3)艺晴、色彩抖動（Color jitter）

Randomly jitter contrast.png

4)昼钻、發(fā)揮想象力（Get creative）
比如：平移、旋轉(zhuǎn)封寞、拉伸然评、切變、光學(xué)畸變等等狈究。

下面我給出一個CNN模型碗淌，測試其在CIFAR-10上的表現(xiàn)（進(jìn)行簡單的水平翻轉(zhuǎn)來擴(kuò)增數(shù)據(jù)），training set: 49000x2, validation set: 1000, test set: 10000抖锥。CNN層數(shù)結(jié)構(gòu)如下：

           [[conv - relu]x3 - pool]x3 - affine - relu - affine - softmax

訓(xùn)練結(jié)果如下：
· Validation set accuracy: 0.904
· Test set accuracy: 0.892

Training loss & Accuracy

CONV layer 1: filters

Part 4：可視化卷積神經(jīng)網(wǎng)絡(luò)

可視化手段可以直觀地揭開CNNs的神秘面紗亿眠，幫助我們更好地理解CNNs究竟學(xué)到了什么，下面我們討論下具體的可視化技術(shù)：

1. 可視化權(quán)重和激活值

以AlexNet為例磅废，給出每層部分權(quán)重和激活值的可視化如下：

CONV layer 1: filters(left) and activations(right)

CONV layer 2: filters(left) and activations(right)

CONV layer 3: activations

CONV layer 4: activations

CONV layer 5: activations

Fully-connected layer 1 & 2

Output layer

2. 檢索能最大限度激活神經(jīng)元的圖片

我們可以將大量圖片輸入網(wǎng)絡(luò)纳像，追蹤那些可以最大限度激活神經(jīng)元的圖片，然后我們可以可視化這些圖片拯勉，以此來理解神經(jīng)元在它的感受野里究竟在尋找什么竟趾，以便能夠正確地分類圖片？下圖是AlexNet的第五個pooling層（光頭躺槍 O__O "…）：

AlexNet: pooling layer 5

3. 利用t-SNE和CNNs的特征向量來可視化圖片

CNNs可以表示為對輸入圖像進(jìn)行逐層轉(zhuǎn)化宫峦，最終形成一個可以用線性分類器進(jìn)行分類的representation岔帽，這個最終形成的representation就是CNN codes（例如AlexNet里輸入分類器之前的那個4096維向量），即特征向量斗遏。

t-SNE作為對高維數(shù)據(jù)降維并可視化的最好的方法之一山卦，其可視化結(jié)果有非常棒的視覺效果鞋邑。我們可以將CNN codes輸入t-SNE诵次，得到每一張圖片（對應(yīng)一個特征向量）對應(yīng)的二維向量账蓉，然后可以可視化出如下結(jié)果（靠的越近的圖片，在CNNs眼里越相似）：

t-SNE visualization of CNN codes

4. 局部遮擋圖片

為了判斷CNNs是否是依靠圖片中正確的目標(biāo)進(jìn)行進(jìn)行分類（而不是靠蒙的）逾一，我們可以對圖片進(jìn)行局部遮擋铸本，來測試CNNs。從下圖可以看出遵堵，CNNs確實是依靠正確的目標(biāo)進(jìn)行分類的：

Occluding parts of the image

Part 5：遷移學(xué)習(xí)（Transfer Learning）

實際中箱玷，我們很少從頭開始訓(xùn)練一個CNNs，因為通常我們沒有足夠的數(shù)據(jù)陌宿。我們常采取的做法是：使用已經(jīng)在大數(shù)據(jù)集（例如ImageNet）上訓(xùn)練好的CNNs作為我們的初始模型或者一個固定的特征提取器锡足，然后用在新的數(shù)據(jù)集上。上張圖以便說明：

CS231n Convolutional Neural Networks for Visual Recognition.png

當(dāng)新數(shù)據(jù)集和預(yù)訓(xùn)練時的數(shù)據(jù)集不相似的情況下（如醫(yī)學(xué)圖像）壳坪，上圖的策略需要稍稍調(diào)整下：若新數(shù)據(jù)集較小舶得，我們需要訓(xùn)練除線性分類器之外更前面的幾層；若新數(shù)據(jù)集較大爽蝴，我們需要微調(diào)所有層沐批。

---> CS231n: Assignment 1
---> CS231n: Assignment 3

最后編輯于：2020.07.15 20:29:46

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

禁止轉(zhuǎn)載，如需轉(zhuǎn)載請通過簡信或評論聯(lián)系作者蝎亚。

人面猴
序言：七十年代末九孩，一起剝皮案震驚了整個濱河市，隨后出現(xiàn)的幾起案子发框，更是在濱河造成了極大的恐慌躺彬，老刑警劉巖，帶你破解...
沈念sama閱讀 206,968評論 6贊 482
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件梅惯，死亡現(xiàn)場離奇詭異顾患，居然都是意外死亡，警方通過查閱死者的電腦和手機(jī)个唧，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,601評論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門江解，熙熙樓的掌柜王于貴愁眉苦臉地迎上來，“玉大人徙歼，你說我怎么就攤上這事犁河。” “怎么了魄梯？”我有些...
開封第一講書人閱讀 153,220評論 0贊 344
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵桨螺，是天一觀的道長。經(jīng)常有香客問我酿秸，道長灭翔，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 55,416評論 1贊 279
?港島之戀（遺憾婚禮）
正文為了忘掉前任辣苏，我火速辦了婚禮肝箱，結(jié)果婚禮上哄褒，老公的妹妹穿的比我還像新娘。我一直安慰自己煌张，他們只是感情好呐赡，可當(dāng)我...
茶點故事閱讀 64,425評論 5贊 374
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著骏融，像睡著了一般链嘀。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上档玻，一...
開封第一講書人閱讀 49,144評論 1贊 285
城市分裂傳說
那天怀泊，我揣著相機(jī)與錄音，去河邊找鬼误趴。笑死包个，一個胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的冤留。我是一名探鬼主播碧囊，決...
沈念sama閱讀 38,432評論 3贊 401
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼，長吁一口氣：“原來是場噩夢啊……” “哼纤怒！你這毒婦竟也來了糯而？” 一聲冷哼從身側(cè)響起，我...
開封第一講書人閱讀 37,088評論 0贊 261
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤泊窘，失蹤者是張志新（化名）和其女友劉穎熄驼，沒想到半個月后，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體烘豹，經(jīng)...
沈念sama閱讀 43,586評論 1贊 300
?護(hù)林員之死
正文獨居荒郊野嶺守林人離奇死亡瓜贾，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 36,028評論 2贊 325
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時候發(fā)現(xiàn)自己被綠了携悯。大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片祭芦。...
茶點故事閱讀 38,137評論 1贊 334
活死人
序言：一個原本活蹦亂跳的男人離奇死亡，死狀恐怖憔鬼，靈堂內(nèi)的尸體忽然破棺而出龟劲，到底是詐尸還是另有隱情，我是刑警寧澤轴或，帶...
沈念sama閱讀 33,783評論 4贊 324
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布昌跌，位于F島的核電站，受9級特大地震影響照雁，放射性物質(zhì)發(fā)生泄漏蚕愤。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點故事閱讀 39,343評論 3贊 307
男人毒藥：我在死后第九天來索命
文/蒙蒙一、第九天我趴在偏房一處隱蔽的房頂上張望萍诱。院中可真熱鬧悬嗓，春花似錦、人聲如沸砂沛。這莊子的主人今日做“春日...
開封第一講書人閱讀 30,333評論 0贊 19
一樁弒父案曙求，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽碍庵。三九已至，卻和暖如春悟狱，著一層夾襖步出監(jiān)牢的瞬間静浴，已是汗流浹背。一陣腳步聲響...
開封第一講書人閱讀 31,559評論 1贊 262
情欲美人皮
我被黑心中介騙來泰國打工挤渐，沒想到剛下飛機(jī)就差點兒被人妖公主榨干…… 1. 我叫王不留苹享，地道東北人。一個月前我還...
沈念sama閱讀 45,595評論 2贊 355
代替公主和親
正文我出身青樓浴麻，卻偏偏與公主長得像得问，于是被迫代替她去往敵國和親。傳聞我的和親對象是個殘疾皇子软免，可洞房花燭夜當(dāng)晚...
茶點故事閱讀 42,901評論 2贊 345

CS231n (winter 2016) : Assignment2

前言：

Part 1：深層全連接神經(jīng)網(wǎng)絡(luò)（python編程任務(wù)）

1. 2-layer全連接神經(jīng)網(wǎng)絡(luò)

2. Multilayer全連接網(wǎng)絡(luò) + Batch Normalization

3. Dropout

Part 2：卷積神經(jīng)網(wǎng)絡(luò)（Convolutional Neural Networks, CNNs）

1. 卷積層（Convolutional Layer）

2. 池化層（Pooling Layer）

3. 全連接層（Fully-connected layer）

4. 卷積神經(jīng)網(wǎng)絡(luò)結(jié)構(gòu)（CNNs Architectures）

Part 3：Python編程任務(wù)（3-layer CNNs）

Part 4： 可視化卷積神經(jīng)網(wǎng)絡(luò)

1. 可視化權(quán)重和激活值

2. 檢索能最大限度激活神經(jīng)元的圖片

3. 利用t-SNE和CNNs的特征向量來可視化圖片

4. 局部遮擋圖片

Part 5： 遷移學(xué)習(xí)（Transfer Learning）

推薦閱讀更多精彩內(nèi)容

Part 4：可視化卷積神經(jīng)網(wǎng)絡(luò)

Part 5：遷移學(xué)習(xí)（Transfer Learning）