Preface:
This post follows the Python programming assignments of Stanford's CS231n course as its main thread, working through the course's main content together with some of the mathematical derivations. Reading on a PC is recommended. The course materials and starter code are:
Videos and slides
Lecture notes
Assignment 2 starter code
Part 1: Deep fully-connected neural networks (Python programming tasks)
In Assignment 1 we built a simple 2-layer fully-connected network, but the code was not modular: all of the computation (the loss, the gradients, and so on) was packed into a single function, which left no flexibility, i.e. we could not change the network architecture at will. Here we switch to a more modular style, where each component is independent and the components call each other at run time, so the network architecture becomes very flexible. Like this:
python
def layer_forward(x, w):
""" Receive inputs x and weights w """
# Do some computations ...
z = # ... some intermediate value
# Do some more computations ...
out = # the output
cache = (x, w, z, out) # Values we need to compute gradients
return out, cache
The backward pass will receive upstream derivatives and the cache object,
and will return gradients with respect to the inputs and weights, like this:
python
def layer_backward(dout, cache):
"""
Receive derivative of loss with respect to outputs and cache,
and compute derivative with respect to inputs.
"""
# Unpack cache values
x, w, z, out = cache
# Use values in cache to compute derivatives
dx = # Derivative of loss with respect to x
dw = # Derivative of loss with respect to w
return dx, dw
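The TwoLayerNet below composes these primitives through helper functions such as affine_relu_forward / affine_relu_backward from layer_utils.py. As a rough sketch (assuming the affine and ReLU functions defined later in layers.py), such a "sandwich" layer is typically just a pair of thin wrappers:

def affine_relu_forward(x, w, b):
    """Affine transform followed by a ReLU, treated as one composite layer."""
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache

def affine_relu_backward(dout, cache):
    """Backward pass for the affine-ReLU composite layer."""
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db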
In addition, we will fold all of the parameter update rules we studied earlier into this modular framework, so that we can compare how the different update rules perform; we will also add Batch Normalization and Dropout as modules, to optimize deep networks more effectively.
Since the programming workload of this part is fairly heavy, we split it up and work through it step by step:
1. Two-layer fully-connected network
For this part we need to complete the following (and also read solver.py until it makes sense):
--> the TwoLayerNet class in fc_net.py
--> the first four functions in layers.py
--> optim.py
The code is as follows:
---> fc_net.py
__coauthor__ = 'Deeplayer'
# 6.22.2016 #
import numpy as np
from layer_utils import *
class TwoLayerNet(object):
"""
A two-layer fully-connected neural network with ReLU nonlinearity and
softmax loss that uses a modular layer design. We assume an input dimension
of D, a hidden dimension of H, and perform classification over C classes.
The architecture should be affine - relu - affine - softmax.
Note that this class does not implement gradient descent; instead, it
will interact with a separate Solver object that is responsible for running
optimization.
The learnable parameters of the model are stored in the dictionary
self.params that maps parameter names to numpy arrays.
"""
def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
weight_scale=1e-3, reg=0.0):
"""
Initialize a new network.
Inputs:
- input_dim: An integer giving the size of the input
- hidden_dim: An integer giving the size of the hidden layer
- num_classes: An integer giving the number of classes to classify
- weight_scale: Scalar giving the standard deviation for random
initialization of the weights.
- reg: Scalar giving L2 regularization strength.
"""
self.params = {}
self.reg = reg
self.params['W1'] = weight_scale * np.random.randn(input_dim, hidden_dim)
self.params['b1'] = np.zeros((1, hidden_dim))
self.params['W2'] = weight_scale * np.random.randn(hidden_dim, num_classes)
self.params['b2'] = np.zeros((1, num_classes))
def loss(self, X, y=None):
"""
Compute loss and gradient for a minibatch of data.
Inputs:
- X: Array of input data of shape (N, d_1, ..., d_k)
- y: Array of labels, of shape (N,). y[i] gives the label for X[i].
Returns:
If y is None, then run a test-time forward pass of the model and return:
- scores: Array of shape (N, C) giving classification scores, where
scores[i, c] is the classification score for X[i] and class c.
If y is not None, then run a training-time forward and backward pass and
return a tuple of:
- loss: Scalar value giving the loss
- grads: Dictionary with the same keys as self.params, mapping parameter
names to gradients of the loss with respect to those parameters.
"""
scores = None
N = X.shape[0]
# Unpack variables from the params dictionary
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
h1, cache1 = affine_relu_forward(X, W1, b1)
out, cache2 = affine_forward(h1, W2, b2)
scores = out # (N,C)
# If y is None then we are in test mode so just return scores
if y is None:
return scores
loss, grads = 0, {}
data_loss, dscores = softmax_loss(scores, y)
reg_loss = 0.5 * self.reg * np.sum(W1*W1) + 0.5 * self.reg * np.sum(W2*W2)
loss = data_loss + reg_loss
# Backward pass: compute gradients
dh1, dW2, db2 = affine_backward(dscores, cache2)
dX, dW1, db1 = affine_relu_backward(dh1, cache1)
# Add the regularization gradient contribution
dW2 += self.reg * W2
dW1 += self.reg * W1
grads['W1'] = dW1
grads['b1'] = db1
grads['W2'] = dW2
grads['b2'] = db2
return loss, grads
---> layers.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import numpy as np
def affine_forward(x, w, b):
"""
Computes the forward pass for an affine (fully-connected) layer.
The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
examples, where each example x[i] has shape (d_1, ..., d_k). We will
reshape each input into a vector of dimension D = d_1 * ... * d_k, and
then transform it to an output vector of dimension M.
Inputs:
- x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
- w: A numpy array of weights, of shape (D, M)
- b: A numpy array of biases, of shape (M,)
Returns a tuple of:
- out: output, of shape (N, M)
- cache: (x, w, b)
"""
out = None
# Reshape x into rows
N = x.shape[0]
x_row = x.reshape(N, -1) # (N,D)
out = np.dot(x_row, w) + b # (N,M)
cache = (x, w, b)
return out, cache
def affine_backward(dout, cache):
"""
Computes the backward pass for an affine layer.
Inputs:
- dout: Upstream derivative, of shape (N, M)
- cache: Tuple of:
- x: Input data, of shape (N, d_1, ... d_k)
- w: Weights, of shape (D, M)
Returns a tuple of:
- dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
- dw: Gradient with respect to w, of shape (D, M)
- db: Gradient with respect to b, of shape (M,)
"""
x, w, b = cache
dx, dw, db = None, None, None
dx = np.dot(dout, w.T) # (N,D)
dx = np.reshape(dx, x.shape) # (N,d1,...,d_k)
x_row = x.reshape(x.shape[0], -1) # (N,D)
dw = np.dot(x_row.T, dout) # (D,M)
db = np.sum(dout, axis=0, keepdims=True) # (1,M)
return dx, dw, db
def relu_forward(x):
"""
Computes the forward pass for a layer of rectified linear units (ReLUs).
Input:
- x: Inputs, of any shape
Returns a tuple of:
- out: Output, of the same shape as x
- cache: x
"""
out = None
out = ReLU(x)
cache = x
return out, cache
def relu_backward(dout, cache):
"""
Computes the backward pass for a layer of rectified linear units (ReLUs).
Input:
- dout: Upstream derivatives, of any shape
- cache: Input x, of same shape as dout
Returns:
- dx: Gradient with respect to x
"""
dx, x = None, cache
dx = dout
dx[x <= 0] = 0
return dx
def svm_loss(x, y):
"""
Computes the loss and gradient using for multiclass SVM classification.
Inputs:
- x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
for the ith input.
- y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
0 <= y[i] < C
Returns a tuple of:
- loss: Scalar giving the loss
- dx: Gradient of the loss with respect to x
"""
N = x.shape[0]
correct_class_scores = x[np.arange(N), y]
margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
margins[np.arange(N), y] = 0
loss = np.sum(margins) / N
num_pos = np.sum(margins > 0, axis=1)
dx = np.zeros_like(x)
dx[margins > 0] = 1
dx[np.arange(N), y] -= num_pos
dx /= N
return loss, dx
def softmax_loss(x, y):
"""
Computes the loss and gradient for softmax classification. Inputs:
- x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
for the ith input.
- y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
0 <= y[i] < C
Returns a tuple of:
- loss: Scalar giving the loss
- dx: Gradient of the loss with respect to x
"""
probs = np.exp(x - np.max(x, axis=1, keepdims=True))
probs /= np.sum(probs, axis=1, keepdims=True)
N = x.shape[0]
loss = -np.sum(np.log(probs[np.arange(N), y])) / N
dx = probs.copy()
dx[np.arange(N), y] -= 1
dx /= N
return loss, dx
def ReLU(x):
"""ReLU non-linearity."""
return np.maximum(0, x)
---> optim.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import numpy as np
def sgd(w, dw, config=None):
"""
Performs vanilla stochastic gradient descent.
config format:
- learning_rate: Scalar learning rate.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-2)
w -= config['learning_rate'] * dw
return w, config
def sgd_momentum(w, dw, config=None):
"""
Performs stochastic gradient descent with momentum.
config format:
- learning_rate: Scalar learning rate.
- momentum: Scalar between 0 and 1 giving the momentum value.
Setting momentum = 0 reduces to sgd.
- velocity: A numpy array of the same shape as w and dw used to store a moving
average of the gradients.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-2)
config.setdefault('momentum', 0.9)
v = config.get('velocity', np.zeros_like(w))
next_w = None
v = config['momentum'] * v - config['learning_rate'] * dw
next_w = w + v
config['velocity'] = v
return next_w, config
def rmsprop(x, dx, config=None):
"""
Uses the RMSProp update rule, which uses a moving average of squared gradient
values to set adaptive per-parameter learning rates.
config format:
- learning_rate: Scalar learning rate.
- decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
gradient cache.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- cache: Moving average of second moments of gradients.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-2)
config.setdefault('decay_rate', 0.99)
config.setdefault('epsilon', 1e-8)
config.setdefault('cache', np.zeros_like(x))
next_x = None
cache = config['cache']
decay_rate = config['decay_rate']
learning_rate = config['learning_rate']
epsilon = config['epsilon']
cache = decay_rate * cache + (1 - decay_rate) * (dx**2)
x += - learning_rate * dx / (np.sqrt(cache) + epsilon)
config['cache'] = cache
next_x = x
return next_x, config
def adam(x, dx, config=None):
"""
Uses the Adam update rule, which incorporates moving averages of both the
gradient and its square and a bias correction term.
config format:
- learning_rate: Scalar learning rate.
- beta1: Decay rate for moving average of first moment of gradient.
- beta2: Decay rate for moving average of second moment of gradient.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- m: Moving average of gradient.
- v: Moving average of squared gradient.
- t: Iteration number.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-3)
config.setdefault('beta1', 0.9)
config.setdefault('beta2', 0.999)
config.setdefault('epsilon', 1e-8)
config.setdefault('m', np.zeros_like(x))
config.setdefault('v', np.zeros_like(x))
config.setdefault('t', 0)
next_x = None
m = config['m']
v = config['v']
beta1 = config['beta1']
beta2 = config['beta2']
learning_rate = config['learning_rate']
epsilon = config['epsilon']
t = config['t']
t += 1
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx**2)
m_bias = m / (1 - beta1**t)
v_bias = v / (1 - beta2**t)
x += - learning_rate * m_bias / (np.sqrt(v_bias) + epsilon)
next_x = x
config['m'] = m
config['v'] = v
config['t'] = t
return next_x, config
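All of these update rules share the same interface, which is how Solver drives them: pass in the current parameter, its gradient and a per-parameter config dict, and get back the updated parameter plus the (possibly extended) config. A minimal usage sketch of my own:

import numpy as np

w = np.random.randn(3, 5)                 # a parameter
dw = np.random.randn(3, 5)                # its gradient, as returned by loss()
config = {'learning_rate': 1e-2}          # Solver keeps one config dict per parameter
w, config = sgd_momentum(w, dw, config)
# config now also contains 'velocity', which carries over to the next update step.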
編程完成后舆吮,我們可以用FullyConnectedNets.ipynb里的代碼來check我們的代碼是否有誤揭朝。check完之后,我們可以在CIFAR-10上跑一遍色冀,和Assignment1里的2-layer神經(jīng)網(wǎng)絡(luò)比較一下潭袱,結(jié)果應(yīng)該是差不多的。
這里锋恬,我貼一下在CIFAR-10上運行的代碼和結(jié)果圖:
---> two_layer_fc_net_start.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import matplotlib.pyplot as plt
from fc_net import *
from data_utils import get_CIFAR10_data
from solver import Solver
data = get_CIFAR10_data()
model = TwoLayerNet(reg=0.9)
solver = Solver(model, data,
lr_decay=0.95,
print_every=100, num_epochs=40, batch_size=400,
update_rule='sgd_momentum',
optim_config={'learning_rate': 5e-4, 'momentum': 0.5})
solver.train()
plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')
plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()
best_model = model
y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
print 'Validation set accuracy: ', (y_val_pred == data['y_val']).mean()
print 'Test set accuracy: ', (y_test_pred == data['y_test']).mean()
# Validation set accuracy: about 52.9%
# Test set accuracy: about 54.7%
# Visualize the weights of the best network
from vis_utils import visualize_grid
def show_net_weights(net):
W1 = net.params['W1']
W1 = W1.reshape(3, 32, 32, -1).transpose(3, 1, 2, 0)
plt.imshow(visualize_grid(W1, padding=3).astype('uint8'))
plt.gca().axis('off')
    plt.show()

show_net_weights(best_model)
2. Multilayer fully-connected network + Batch Normalization
For this part we need to complete the following:
--> the FullyConnectedNet class in fc_net.py
--> the batchnorm_forward and batchnorm_backward functions in layers.py
The code is as follows:
---> fc_net.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import numpy as np
from layer_utils import *
class FullyConnectedNet(object):
"""
A fully-connected neural network with an arbitrary number of hidden layers,
ReLU nonlinearities, and a softmax loss function. This will also implement
dropout and batch normalization as options. For a network with L layers,
the architecture will be
{affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax
where batch normalization and dropout are optional, and the {...} block is
repeated L - 1 times.
Similar to the TwoLayerNet above, learnable parameters are stored in the
self.params dictionary and will be learned using the Solver class.
"""
def __init__(self, hidden_dims, input_dim=3*32*32,
num_classes=10,
dropout=0, use_batchnorm=False, reg=0.0,
weight_scale=1e-2, dtype=np.float32, seed=None):
self.use_batchnorm = use_batchnorm
self.use_dropout = dropout > 0
self.reg = reg
self.num_layers = 1 + len(hidden_dims)
self.dtype = dtype
self.params = {}
layers_dims = [input_dim] + hidden_dims + [num_classes]
for i in xrange(self.num_layers):
self.params['W' + str(i+1)] = weight_scale * np.random.randn(layers_dims[i], layers_dims[i+1])
self.params['b' + str(i+1)] = np.zeros((1, layers_dims[i+1]))
if self.use_batchnorm and i < len(hidden_dims):
self.params['gamma' + str(i+1)] = np.ones((1, layers_dims[i+1]))
self.params['beta' + str(i+1)] = np.zeros((1, layers_dims[i+1]))
# When using dropout we need to pass a dropout_param dictionary to each
# dropout layer so that the layer knows the dropout probability and the mode
# (train / test). You can pass the same dropout_param to each dropout layer.
self.dropout_param = {}
if self.use_dropout:
self.dropout_param = {'mode': 'train', 'p': dropout}
if seed is not None:
self.dropout_param['seed'] = seed
# With batch normalization we need to keep track of running means and
# variances, so we need to pass a special bn_param object to each batch
# normalization layer. You should pass self.bn_params[0] to the forward pass
# of the first batch normalization layer, self.bn_params[1] to the forward
# pass of the second batch normalization layer, etc.
self.bn_params = []
if self.use_batchnorm:
self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]
# Cast all parameters to the correct datatype
for k, v in self.params.iteritems():
self.params[k] = v.astype(dtype)
def loss(self, X, y=None):
"""
Compute loss and gradient for the fully-connected net.
Input / output: Same as TwoLayerNet above.
"""
X = X.astype(self.dtype)
mode = 'test' if y is None else 'train'
# Set train/test mode for batchnorm params and dropout param since they
# behave differently during training and testing.
if self.dropout_param is not None:
self.dropout_param['mode'] = mode
if self.use_batchnorm:
for bn_param in self.bn_params:
bn_param['mode'] = mode
scores = None
h, cache1, cache2, cache3, bn, out = {}, {}, {}, {}, {}, {}
out[0] = X
# Forward pass: compute loss
for i in xrange(self.num_layers-1):
# Unpack variables from the params dictionary
W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
if self.use_batchnorm:
gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]
h[i], cache1[i] = affine_forward(out[i], W, b)
bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])
out[i+1], cache3[i] = relu_forward(bn[i])
else:
out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)
W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
scores, cache = affine_forward(out[self.num_layers-1], W, b)
# If test mode return early
if mode == 'test':
return scores
loss, reg_loss, grads = 0.0, 0.0, {}
data_loss, dscores = softmax_loss(scores, y)
for i in xrange(self.num_layers):
reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
loss = data_loss + reg_loss
# Backward pass: compute gradients
dout, dbn, dh = {}, {}, {}
t = self.num_layers-1
dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
for i in xrange(t):
if self.use_batchnorm:
dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i])
dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])
dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])
else:
dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])
# Add the regularization gradient contribution
for i in xrange(self.num_layers):
grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]
return loss, grads
Before giving the code for batchnorm_forward and batchnorm_backward, here are the Batch Normalization algorithm and the formulas for its backward pass:
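(The original post shows these as an image; the same standard equations are restated here, matching the code below. The forward pass is computed per feature over a minibatch of size N, and the backward pass is given the upstream gradient dout = ?L/?y.)

$$\mu = \frac{1}{N}\sum_{i=1}^{N}x_i,\qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2,\qquad \hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}},\qquad y_i = \gamma\,\hat{x}_i+\beta$$

$$\frac{\partial L}{\partial \gamma}=\sum_i\frac{\partial L}{\partial y_i}\hat{x}_i,\qquad \frac{\partial L}{\partial \beta}=\sum_i\frac{\partial L}{\partial y_i},\qquad \frac{\partial L}{\partial \hat{x}_i}=\frac{\partial L}{\partial y_i}\gamma$$

$$\frac{\partial L}{\partial \sigma^2}=-\frac{1}{2}\sum_i\frac{\partial L}{\partial \hat{x}_i}(x_i-\mu)(\sigma^2+\epsilon)^{-3/2},\qquad \frac{\partial L}{\partial \mu}=-\frac{1}{\sqrt{\sigma^2+\epsilon}}\sum_i\frac{\partial L}{\partial \hat{x}_i}-\frac{2}{N}\frac{\partial L}{\partial \sigma^2}\sum_i(x_i-\mu)$$

$$\frac{\partial L}{\partial x_i}=\frac{1}{\sqrt{\sigma^2+\epsilon}}\frac{\partial L}{\partial \hat{x}_i}+\frac{2(x_i-\mu)}{N}\frac{\partial L}{\partial \sigma^2}+\frac{1}{N}\frac{\partial L}{\partial \mu}$$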
---> layers.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import numpy as np
def batchnorm_forward(x, gamma, beta, bn_param):
mode = bn_param['mode']
eps = bn_param.get('eps', 1e-5)
momentum = bn_param.get('momentum', 0.9)
N, D = x.shape
running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))
out, cache = None, None
if mode == 'train':
sample_mean = np.mean(x, axis=0, keepdims=True) # [1,D]
sample_var = np.var(x, axis=0, keepdims=True) # [1,D]
x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps) # [N,D]
out = gamma * x_normalized + beta
cache = (x_normalized, gamma, beta, sample_mean, sample_var, x, eps)
running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var
elif mode == 'test':
x_normalized = (x - running_mean) / np.sqrt(running_var + eps)
out = gamma * x_normalized + beta
else:
raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
# Store the updated running means back into bn_param
bn_param['running_mean'] = running_mean
bn_param['running_var'] = running_var
return out, cache
def batchnorm_backward(dout, cache):
dx, dgamma, dbeta = None, None, None
x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
N, D = x.shape
dx_normalized = dout * gamma # [N,D]
x_mu = x - sample_mean # [N,D]
sample_std_inv = 1.0 / np.sqrt(sample_var + eps) # [1,D]
dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3
dsample_mean = -1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True) - \
2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True)
dx1 = dx_normalized * sample_std_inv
dx2 = 2.0/N * dsample_var * x_mu
dx = dx1 + dx2 + 1.0/N * dsample_mean
dgamma = np.sum(dout * x_normalized, axis=0, keepdims=True)
dbeta = np.sum(dout, axis=0, keepdims=True)
return dx, dgamma, dbeta
Once the code is written, we can use BatchNormalization.ipynb to check it for errors. Below I give the performance of a 6-layer network with Batch Normalization on CIFAR-10. As one might expect, the 6-layer network should not perform much better than the 2-layer one (because of problem 1 that I mentioned at the end of the Assignment 1 post).
Before that, let us look at how well Batch Normalization alleviates vanishing gradients, and how it behaves under different weight_scales. As test cases we use 6-layer networks with sigmoid and with ReLU as the activation function:
---> batchnorm_and_weight_scales.py
__coauthor__ = 'Deeplayer'
# 6.22.2016 #
from fc_net import *
from solver import *
import matplotlib.pyplot as plt
from data_utils import get_CIFAR10_data
# Load the (preprocessed) CIFAR10 data.
data = get_CIFAR10_data()
hidden_dims = [100, 100, 100, 100, 100]
num_train = 5000
small_data = {
'X_train': data['X_train'][:num_train],
'y_train': data['y_train'][:num_train],
'X_val': data['X_val'],
'y_val': data['y_val'],
}
bn_solvers = {}
solvers = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):
print 'Running weight scale %d / %d' % (i + 1, len(weight_scales))
bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)
model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)
bn_solver = Solver(bn_model, small_data,
num_epochs=10, batch_size=100,
update_rule='adam',
optim_config={'learning_rate': 1e-3, },
verbose=False, print_every=1000)
bn_solver.train()
bn_solvers[weight_scale] = bn_solver
solver = Solver(model, small_data,
num_epochs=10, batch_size=100,
update_rule='adam',
optim_config={'learning_rate': 1e-3, },
verbose=False, print_every=1000)
solver.train()
solvers[weight_scale] = solver
# Plot results of weight scale experiment
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []
for ws in weight_scales:
best_train_accs.append(max(solvers[ws].train_acc_history))
bn_best_train_accs.append(max(bn_solvers[ws].train_acc_history))
best_val_accs.append(max(solvers[ws].val_acc_history))
bn_best_val_accs.append(max(bn_solvers[ws].val_acc_history))
final_train_loss.append(np.mean(solvers[ws].loss_history[-100:]))
bn_final_train_loss.append(np.mean(bn_solvers[ws].loss_history[-100:]))
plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')
plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend(loc='upper left')
plt.subplot(3, 1, 3)
plt.title('Final training loss vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend(loc='upper left')
plt.gcf().set_size_inches(10, 15)
plt.show()
From the figures above we can see:
1) Batch Normalization fixes the sigmoid saturation problem (vanishing gradients) that troubled the field for more than a decade, bravo! If the results above do not feel direct enough, here are the weight-gradient magnitudes of each layer:
2) Even without vanishing gradients, sigmoid is still not as good as ReLU.
3) If weight_scale is chosen well, Batch Normalization does not improve the accuracy much when the activation function is ReLU.
Now, here are the results of the 6-layer network on CIFAR-10 (with ReLU activations):
· Validation set accuracy: 0.554
· Test set accuracy: 0.54
3. Dropout
For this part we need to complete the following:
--> modify fc_net.py to add dropout
--> the dropout_forward and dropout_backward functions in layers.py
Dropout is a regularization technique used very widely when training (deep) neural networks in practice, and it suppresses overfitting well. Concretely: during training, each neuron is kept active with probability p. Below is a diagram of dropout in a 3-layer network:
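The implementation below uses "inverted dropout": the mask is divided by p at training time, so the expected activation is unchanged and nothing needs to be rescaled at test time. A tiny sanity check (my own example):

import numpy as np

np.random.seed(1)
x = np.ones(1000000)
p = 0.5                                        # probability of keeping a unit
mask = (np.random.rand(*x.shape) < p) / p      # inverted dropout mask
print 'mean without dropout:', x.mean()        # 1.0
print 'mean with dropout:   ', (x * mask).mean()   # ~1.0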
The code is as follows:
For fc_net.py we only need to modify its loss function:
__coauthor__ = 'Deeplayer'
# 6.22.2016 #
def loss(self, X, y=None):
"""
Compute loss and gradient for the fully-connected net.
Input / output: Same as TwoLayerNet above.
"""
X = X.astype(self.dtype)
mode = 'test' if y is None else 'train'
# Set train/test mode for batchnorm params and dropout param since they
# behave differently during training and testing.
if self.dropout_param is not None:
self.dropout_param['mode'] = mode
if self.use_batchnorm:
for bn_param in self.bn_params:
bn_param['mode'] = mode
scores = None
h, cache1, cache2, cache3, cache4, bn, out = {}, {}, {}, {}, {}, {}, {}
out[0] = X
# Forward pass: compute loss
for i in xrange(self.num_layers-1):
# Unpack variables from the params dictionary
W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
if self.use_batchnorm:
gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]
h[i], cache1[i] = affine_forward(out[i], W, b)
bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])
out[i+1], cache3[i] = relu_forward(bn[i])
if self.use_dropout:
out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param)
else:
out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)
if self.use_dropout:
out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param)
W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
scores, cache = affine_forward(out[self.num_layers-1], W, b)
# If test mode return early
if mode == 'test':
return scores
loss, reg_loss, grads = 0.0, 0.0, {}
data_loss, dscores = softmax_loss(scores, y)
for i in xrange(self.num_layers):
reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
loss = data_loss + reg_loss
# Backward pass: compute gradients
dout, dbn, dh, ddrop = {}, {}, {}, {}
t = self.num_layers-1
dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
for i in xrange(t):
if self.use_batchnorm:
if self.use_dropout:
ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])
dout[t-i] = ddrop[t-1-i]
dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i])
dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])
dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])
else:
if self.use_dropout:
ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])
dout[t-i] = ddrop[t-1-i]
dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])
# Add the regularization gradient contribution
for i in xrange(self.num_layers):
grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]
return loss, grads
---> the dropout_forward and dropout_backward functions in layers.py
__coauthor__ = 'Deeplayer'
# 6.22.2016 #
def dropout_forward(x, dropout_param):
p, mode = dropout_param['p'], dropout_param['mode']
if 'seed' in dropout_param:
np.random.seed(dropout_param['seed'])
mask = None
out = None
if mode == 'train':
mask = (np.random.rand(*x.shape) < p) / p
out = x * mask
elif mode == 'test':
out = x
cache = (dropout_param, mask)
out = out.astype(x.dtype, copy=False)
return out, cache
def dropout_backward(dout, cache):
dropout_param, mask = cache
mode = dropout_param['mode']
dx = None
if mode == 'train':
dx = dout * mask
elif mode == 'test':
dx = dout
return dx
完成編程后昏兆,我們可以用Dropout.ipynb里的代碼來check你的code是否有誤。我們可以用Dropout.ipynb里最后一部分的代碼來比較下使用和不使用dropout的區(qū)別:
Part 2: Convolutional Neural Networks (CNNs)
Now we come to the core content of this course: convolutional neural networks. For visual recognition tasks, CNNs are without doubt the most successful models. Compared with the fully-connected networks discussed earlier, where do the advantages of CNNs lie? I would list the following:
1) Weight sharing and local (receptive field) connectivity make CNNs more similar to biological neural networks: neurons in the visual cortex receive information locally, i.e. each neuron responds only to stimuli within its particular receptive field;
2) When images are large (e.g. 96x96, 224x224, 384x384, 512x512), a fully-connected network would have to train an enormous number of parameters (weights and biases), which not only makes computation very slow but also leads to much more severe overfitting; weight sharing and local connectivity cut the number of trainable parameters by orders of magnitude;
3) CNNs have a strong ability to extract features (from edges, to local patterns, to whole objects), whereas fully-connected networks have essentially no feature-extraction ability.
Let us now look at the structure of CNNs in detail. Before the discussion, here is a picture to give a feel for the overall architecture:
1. Convolutional Layer
The convolutional layer, which could also be called the feature-extraction layer, is the most important part of a CNN. The parameters trained in a convolutional layer are a set of filters (I prefer the term convolution kernels), all of the same size and usually square. Suppose we have n filters of size k x k (k is usually 3 or 5) operating on c input channels (c = 1 for grayscale images, c = 3 for color images); then this layer has n x k x k x c weights plus n biases to train. Weight sharing means that a single filter extracts a single kind of feature: as it slides (convolves) over the image, it looks for the same feature at every location. Hence n filters extract n different features from the image. Here is an animation of the convolution: six filter planes are shown, but they form only 2 filters (each filter has three channel planes), so two features are extracted:
In the animation you will notice that the image is surrounded by an extra border of zeros, and that the filter moves with a stride of 2. Adding this border is called zero-padding. Writing p for the number of rings of zeros and s for the stride, the side length of the output (the convolved feature, or activation map) is L = (input_dim - k + 2p)/s + 1, and the output volume has dimensions L x L x n. Zero-padding exists to make sure the filter tiles the input exactly, i.e. that the formula above divides evenly. The quantities p, s and n are hyperparameters that we must set in advance. Regarding the stride s: a smaller s extracts richer information at a somewhat higher computational cost, while a larger s computes less but also extracts less information; the usual choice is s = 1.
---> PS: Why does convolution work?
Natural images have an inherent stationarity: the statistics of one part of an image are the same as those of any other part. This means that features learned on one patch can also be applied to other patches, so we can use the same learned features at every position of the image. (Adapted from UFLDL)
2. Pooling Layer
The layer that follows a convolutional layer is a pooling layer; note, though, that the convolutional output first passes through an activation function (such as ReLU) before entering the pooling layer. The pooling layer further reduces the dimensionality of the convolutional output, and with it the number of parameters and the amount of computation. Concretely, the convolutional output is split into non-overlapping sub-regions, and each sub-region is summarized by its maximum, its average, or its 2-norm. We take max pooling (keeping the maximum) as the example; it generally works better and is the usual choice. Here is a diagram:
Typically the pooling window is 2x2.
Some argue that pooling layers are not necessary at all, as in Striving for Simplicity: The All Convolutional Net. Others have found that discarding pooling layers is important for generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs). It seems likely that pooling layers will gradually shrink or disappear in future architectures.
3. Fully-connected layer
Many current CNN models use fully-connected layers as the last few layers (usually 1 to 3 of them) to learn higher-level combinations of the extracted features. Note that the last fully-connected layer is the output layer; every fully-connected layer except the last one is followed by an activation function.
4. CNN Architectures
The typical layout of a CNN can be written as:
INPUT --> [[CONV --> RELU]*N --> POOL?]*M --> [FC --> RELU]*K --> FC(OUTPUT)
其中迈窟,"?"是代表池化層是可選的,可有可無忌栅;N(一般03)车酣,K(一般02)和M(M>=0)是具體層數(shù)。
注意,我們傾向于選擇多層小size的卷積層湖员,而不是一個大size的卷積層贫悄。
As an illustration, compare a stack of three 3x3 convolutional layers with a single 7x7 convolutional layer. As the figure below shows, both produce an activation map of the same size (they cover the same 7x7 effective receptive field), but the three 3x3 layers are clearly the better choice:
1) Three layers with nonlinearities in between can express more powerful features than a single linear combination;
2) The stack of three small convolutions has fewer parameters: per channel, 3 x (3x3) = 27 versus 7x7 = 49 (see the worked count right after this list);
3) One caveat: backpropagation has to keep the intermediate activations of every layer, so the three-layer stack actually needs somewhat more memory for these intermediates than a single large convolution.
Below is a diagram of the simplest CNN structure (input + 1 conv + 1 pool + 2 fc):
Here are a few common CNN architecture patterns:
· INPUT --> FC/OUT, which is really just a linear classifier
· INPUT --> CONV --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> POOL]*2 --> FC --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> CONV --> RELU --> POOL]*3 --> [FC --> RELU]*2 --> FC/OUT
---> PS:
1. For the input (image) layer, we usually resize images to a square whose side length is a power of 2. For example, CIFAR-10 is 32x32x3, STL-10 is 64x64x3, and ImageNet uses 224x224x3 or 512x512x3.
2. In practice we have to estimate memory usage and choose sensible values accordingly. For instance, with 224x224x3 input images and 64 filters of size 3x3 with zero-padding of 1, each image needs about 72MB of memory (counting the image together with the corresponding parameters, gradients and activations). On a GPU that may not fit (GPUs have far less memory than CPUs), so the settings have to be adjusted, for example 7x7 filters with stride 2 (ZF Net), or 11x11 filters with stride 4 (AlexNet).
3. The biggest bottleneck when building a practical deep convolutional network is GPU memory. Many GPUs have only 3/4/6GB, and the largest single cards have about 12GB (NVIDIA), so when designing a convolutional network we should think carefully about where the memory goes:
- the large number of activations and intermediate gradients;
- the parameters, their gradients during backpropagation, and the caches kept by momentum, Adagrad, or RMSProp all take space, so when estimating the memory used by parameters one should multiply by at least 3;
- each batch of data, plus bookkeeping and other metadata, also consumes some memory.
Some famous convolutional networks:
· LeNet, the first successfully applied convolutional network, introduced by Yann LeCun in the LeNet paper.
· AlexNet, the network that won the 2012 ILSVRC competition by a large margin over the runner-up and set off the deep learning wave.
· ZF Net, winner of ILSVRC 2013, which tuned AlexNet's structural hyperparameters and enlarged the middle convolutional layers.
· GoogLeNet, winner of ILSVRC 2014, which greatly reduced the number of parameters (from 60M to 4M).
· VGGNet, ILSVRC 2014, which showed that network depth is critical to final performance.
· ResNet, winner of ILSVRC 2015 and, as of May 2016, the state of the art. Kaiming He et al. recently proposed the improved version Identity Mappings in Deep Residual Networks.
Part 3: Python programming tasks (3-layer CNN)
For this part we need to complete the following:
1) The following functions in layers.py:
---> conv_forward_naive
---> conv_backward_naive
---> max_pool_forward_naive
---> max_pool_backward_naive
Before giving the code for the convolutional layer, let us see exactly how its forward and backward passes are computed. For concreteness, suppose the first image in some batch is x[0, :, :, :], with three RGB channels of size 7x7, padding 1 and stride 2, so the padded x[0, :, :, :] has size 1x3x9x9. Suppose further that there are 3 filters, each 3x3; w denotes all the filter weights (e.g. the first channel of the first filter is w[0, 0, :, :]); the bias b has size 1x3; and the activation maps are denoted out, of size 3x4x4 (e.g. the first map is out[0, :, :]).
Using these assumptions, the figures below show the forward and backward computations step by step (the backward-pass image has a high resolution; open it in a new tab and zoom in, or download it to view it):
The code is as follows:
__coauthor__ = 'Deeplayer'
# 6.25.2016 #
def conv_forward_naive(x, w, b, conv_param):
stride, pad = conv_param['stride'], conv_param['pad']
N, C, H, W = x.shape
F, C, HH, WW = w.shape
x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
H_new = 1 + (H + 2 * pad - HH) / stride
W_new = 1 + (W + 2 * pad - WW) / stride
s = stride
out = np.zeros((N, F, H_new, W_new))
for i in xrange(N): # ith image
for f in xrange(F): # fth filter
for j in xrange(H_new):
for k in xrange(W_new):
out[i, f, j, k] = np.sum(x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] * w[f]) + b[f]
cache = (x, w, b, conv_param)
return out, cache
def conv_backward_naive(dout, cache):
x, w, b, conv_param = cache
pad = conv_param['pad']
stride = conv_param['stride']
F, C, HH, WW = w.shape
N, C, H, W = x.shape
H_new = 1 + (H + 2 * pad - HH) / stride
W_new = 1 + (W + 2 * pad - WW) / stride
dx = np.zeros_like(x)
dw = np.zeros_like(w)
db = np.zeros_like(b)
s = stride
x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
dx_padded = np.pad(dx, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
for i in xrange(N): # ith image
for f in xrange(F): # fth filter
for j in xrange(H_new):
for k in xrange(W_new):
window = x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s]
db[f] += dout[i, f, j, k]
dw[f] += window * dout[i, f, j, k]
dx_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] += w[f] * dout[i, f, j, k]
# Unpad
dx = dx_padded[:, :, pad:pad+H, pad:pad+W]
return dx, dw, db
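A quick shape check of conv_forward_naive against the worked example above (7x7 RGB input, pad 1, stride 2, three 3x3 filters); this is just a sanity test of my own, not the notebook's check:

import numpy as np

x = np.random.randn(1, 3, 7, 7)
w = np.random.randn(3, 3, 3, 3)                # 3 filters, each with 3 channel planes of 3x3
b = np.random.randn(3)
out, _ = conv_forward_naive(x, w, b, {'stride': 2, 'pad': 1})
print out.shape    # (1, 3, 4, 4): one 4x4 activation map per filter, as in the figure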
完成編程后授嘀,可以用ConvolutionalNetworks.ipynb里的代碼來check編程是否有誤物咳。
下面給出池化層(最大值池化)的代碼:
__coauthor__ = 'Deeplayer'
# 6.25.2016 #
def max_pool_forward_naive(x, pool_param):
HH, WW = pool_param['pool_height'], pool_param['pool_width']
s = pool_param['stride']
N, C, H, W = x.shape
H_new = 1 + (H - HH) / s
W_new = 1 + (W - WW) / s
out = np.zeros((N, C, H_new, W_new))
for i in xrange(N):
for j in xrange(C):
for k in xrange(H_new):
for l in xrange(W_new):
window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]
out[i, j, k, l] = np.max(window)
cache = (x, pool_param)
return out, cache
def max_pool_backward_naive(dout, cache):
x, pool_param = cache
HH, WW = pool_param['pool_height'], pool_param['pool_width']
s = pool_param['stride']
N, C, H, W = x.shape
H_new = 1 + (H - HH) / s
W_new = 1 + (W - WW) / s
dx = np.zeros_like(x)
for i in xrange(N):
for j in xrange(C):
for k in xrange(H_new):
for l in xrange(W_new):
window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]
m = np.max(window)
dx[i, j, k*s:HH+k*s, l*s:WW+l*s] = (window == m) * dout[i, j, k, l]
return dx
Again, the checks in ConvolutionalNetworks.ipynb can be used to verify the code.
The implementations above use deeply nested for loops, which makes them very slow. To speed things up, Assignment 2 provides fast_layers.py, which relies on Cython to build a C extension. Here is a comparison of the naive and fast versions; as the figure below shows, the speedup is enormous:
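A rough timing sketch (assuming the Cython extension behind fast_layers.py has already been built, as the assignment's setup instructions describe):

import time
import numpy as np
from layers import conv_forward_naive
from fast_layers import conv_forward_fast

x = np.random.randn(10, 3, 31, 31)
w = np.random.randn(25, 3, 3, 3)
b = np.random.randn(25)
conv_param = {'stride': 2, 'pad': 1}

t0 = time.time()
out_naive, _ = conv_forward_naive(x, w, b, conv_param)
t1 = time.time()
out_fast, _ = conv_forward_fast(x, w, b, conv_param)
t2 = time.time()
print 'naive: %.3fs, fast: %.3fs, speedup: %.1fx' % (t1 - t0, t2 - t1, (t1 - t0) / (t2 - t1))
print 'max abs difference:', np.max(np.abs(out_naive - out_fast))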
2) cnn.py, with the following code:
__coauthor__ = 'Deeplayer'
# 6.25.2016 #
from layer_utils import *
class ThreeLayerConvNet(object):
"""
A three-layer convolutional network with the following architecture:
conv - relu - 2x2 max pool - affine - relu - affine - softmax
"""
def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,
hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
dtype=np.float32):
self.params = {}
self.reg = reg
self.dtype = dtype
# Initialize weights and biases
C, H, W = input_dim
self.params['W1'] = weight_scale * np.random.randn(num_filters, C, filter_size, filter_size)
self.params['b1'] = np.zeros((1, num_filters))
self.params['W2'] = weight_scale * np.random.randn(num_filters*H*W/4, hidden_dim)
self.params['b2'] = np.zeros((1, hidden_dim))
self.params['W3'] = weight_scale * np.random.randn(hidden_dim, num_classes)
self.params['b3'] = np.zeros((1, num_classes))
for k, v in self.params.iteritems():
self.params[k] = v.astype(dtype)
def loss(self, X, y=None):
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
W3, b3 = self.params['W3'], self.params['b3']
# pass conv_param to the forward pass for the convolutional layer
filter_size = W1.shape[2]
conv_param = {'stride': 1, 'pad': (filter_size - 1) / 2}
# pass pool_param to the forward pass for the max-pooling layer
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}
# compute the forward pass
a1, cache1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
a2, cache2 = affine_relu_forward(a1, W2, b2)
scores, cache3 = affine_forward(a2, W3, b3)
if y is None:
return scores
# compute the backward pass
data_loss, dscores = softmax_loss(scores, y)
da2, dW3, db3 = affine_backward(dscores, cache3)
da1, dW2, db2 = affine_relu_backward(da2, cache2)
dX, dW1, db1 = conv_relu_pool_backward(da1, cache1)
# Add regularization
dW1 += self.reg * W1
dW2 += self.reg * W2
dW3 += self.reg * W3
reg_loss = 0.5 * self.reg * sum(np.sum(W * W) for W in [W1, W2, W3])
loss = data_loss + reg_loss
grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2, 'W3': dW3, 'b3': db3}
return loss, grads
完成編程后心包,可以用ConvolutionalNetworks.ipynb里的代碼來check編程是否有誤。
3)馒铃、layers.py里的spatial_batchnorm_forward和spatial_batchnorm_backward函數(shù)蟹腾。在給出代碼前痕惋,我放張圖,方便大家理解CNNs里的Batch Normalization是怎么計算卷積層的均值mean和標(biāo)準(zhǔn)差std的:
具體代碼如下:
__coauthor__ = 'Deeplayer'
# 6.25.2016 #
def spatial_batchnorm_forward(x, gamma, beta, bn_param):
N, C, H, W = x.shape
x_new = x.transpose(0, 2, 3, 1).reshape(N*H*W, C)
out, cache = batchnorm_forward(x_new, gamma, beta, bn_param)
out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)
return out, cache
def spatial_batchnorm_backward(dout, cache):
N, C, H, W = dout.shape
dout_new = dout.transpose(0, 2, 3, 1).reshape(N*H*W, C)
dx, dgamma, dbeta = batchnorm_backward(dout_new, cache)
dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)
return dx, dgamma, dbeta
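A small sanity check (my own example) that the statistics really are per channel, over all N*H*W positions: in train mode each output channel should come out with roughly zero mean and unit standard deviation:

import numpy as np

np.random.seed(0)
x = 4.0 + 10.0 * np.random.randn(2, 3, 4, 5)        # (N, C, H, W)
gamma, beta = np.ones((1, 3)), np.zeros((1, 3))
out, _ = spatial_batchnorm_forward(x, gamma, beta, {'mode': 'train'})
print out.mean(axis=(0, 2, 3))    # ~[0. 0. 0.]
print out.std(axis=(0, 2, 3))     # ~[1. 1. 1.]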
完成編程后娃殖,可以用ConvolutionalNetworks.ipynb里的代碼來check編程是否有誤值戳。
以上面完成的ThreeLayerConvNet為例,比較下使用和不使用Batch Normalization對收斂速度的影響炉爆。從下圖中的結(jié)果可以看出述寡,使用Batch Normalization明顯加快了收斂,使得訓(xùn)練速度大幅提升(因為需要的epoch更少):
---> PS:
1叶洞、數(shù)據(jù)擴(kuò)增(Data Augmentation)
當(dāng)數(shù)據(jù)集較小的情況下鲫凶,這一操作還是十分有效的,可以一定程度提高識別率衩辟。具體的擴(kuò)增方法如下:
1)螟炫、水平翻轉(zhuǎn)(Horizontal flips)
2)、隨機(jī)剪裁(Random crops/scales)
3)艺晴、色彩抖動(Color jitter)
4)昼钻、發(fā)揮想象力(Get creative)
比如:平移、旋轉(zhuǎn)封寞、拉伸然评、切變、光學(xué)畸變等等狈究。
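A minimal horizontal-flip sketch (assuming the (N, C, H, W) layout returned by get_CIFAR10_data), which is how the doubled 49000x2 training set below can be produced:

import numpy as np
from data_utils import get_CIFAR10_data

data = get_CIFAR10_data()
X_train, y_train = data['X_train'], data['y_train']   # (49000, 3, 32, 32)
X_flip = X_train[:, :, :, ::-1]                        # reverse the width axis = mirror image
X_aug = np.concatenate([X_train, X_flip], axis=0)
y_aug = np.concatenate([y_train, y_train], axis=0)
print X_aug.shape    # twice as many training images, labels unchanged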
Below I give a CNN model and test it on CIFAR-10 (with simple horizontal flips to augment the data); training set: 49000x2, validation set: 1000, test set: 10000. The architecture is:
[[conv - relu]x3 - pool]x3 - affine - relu - affine - softmax
The training results:
· Validation set accuracy: 0.904
· Test set accuracy: 0.892
Part 4: Visualizing convolutional neural networks
Visualization can lift the veil on CNNs in a very direct way and help us understand what they have actually learned. Let us go through the main visualization techniques:
1. Visualizing the weights and activations
Taking AlexNet as an example, here are visualizations of some of the weights and activations at each layer:
2. Retrieving the images that maximally activate a neuron
We can feed a large number of images through the network, keep track of which ones maximally activate a given neuron, and then visualize those images to understand what the neuron is looking for within its receptive field in order to classify images correctly. The figure below shows AlexNet's fifth pooling layer (bald heads caught in the crossfire, O__O"...):
3. Visualizing images with t-SNE on the CNN feature vectors
A CNN can be seen as transforming the input image layer by layer until it becomes a representation that a linear classifier can separate. This final representation is the CNN code (for example, the 4096-dimensional vector fed into the classifier in AlexNet), i.e. the feature vector.
t-SNE is one of the best methods for reducing high-dimensional data to low dimensions for visualization, and its results look striking. We can feed the CNN codes into t-SNE to obtain a 2-D vector for each image (one per feature vector) and visualize the result as below (the closer two images are, the more similar they look to the CNN):
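A minimal sketch with scikit-learn, assuming the CNN codes have already been extracted into an (N, 4096) array and saved to disk (the file name here is hypothetical):

import numpy as np
from sklearn.manifold import TSNE

codes = np.load('cnn_codes.npy')        # hypothetical (N, 4096) array of CNN codes
xy = TSNE(n_components=2, random_state=0).fit_transform(codes)
# xy[i] is the 2-D embedding of image i; images whose codes are similar to the
# CNN end up close together, which is what the visualization below shows.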
4. Occluding parts of the image
To check whether a CNN classifies by relying on the correct object in the image (rather than guessing from context), we can occlude parts of the image and watch the prediction. The figure below shows that the CNN does indeed rely on the correct object:
Part 5: Transfer Learning
In practice we rarely train a CNN from scratch, because we usually do not have enough data. The common approach is to take a CNN that has already been trained on a large dataset (such as ImageNet) and use it either as our initialization or as a fixed feature extractor for the new dataset. Here is a figure that summarizes the strategies:
When the new dataset is not similar to the pre-training dataset (e.g. medical images), the strategy in the figure needs a small adjustment: if the new dataset is small, we should retrain not just the linear classifier but also a few of the layers before it; if the new dataset is large, we should fine-tune all layers.
---> CS231n: Assignment 1
---> CS231n: Assignment 3