This is the second half of assignment 2. Before the CNN part there were also assignments on fully connected neural nets, batch norm, and dropout. Dropout is fairly simple to implement; for batch norm I referred to someone else's solution in the previous post (I got it wrong at first and have not yet re-derived it myself; I hope to fill in that derivation later). The fully connected net implementation rests on two passes. Forward: store a cache and compute each layer's output layer by layer. Backward: use the cache and dout to compute the gradient update for the current layer's parameters (dout is passed down from the layer above; the cache belongs to the current layer). Note that the cache saved during the forward pass holds the current layer's inputs and whatever else the computation needs: for example, the cache of the first (input) layer is the input x together with that layer's weight and bias, while special layers such as dropout also need to save the current layer's mask, which decides which input features are zeroed out here, e.g. something drawn from np.random.rand(x.shape[0], x.shape[1]).
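For reference, a minimal sketch of the inverted-dropout forward pass described above (my own illustration, not the assignment's exact dropout_forward; here keep_prob is assumed to be the probability of keeping a feature, whereas the assignment's dropout_param['p'] may follow a different convention):
import numpy as np

def dropout_forward_sketch(x, keep_prob):
    # keep each feature with probability keep_prob, zero it otherwise,
    # and rescale by 1/keep_prob so the expected activation is unchanged
    mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob
    out = x * mask
    cache = (mask, keep_prob)   # the mask is saved in the cache for the backward pass
    return out, cache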
Now for the main part: implementing a CNN; after all, the fully connected nets are just a testbed. For the CNN implementation I will record my understanding and approach step by step.
1. Conv Layer Naive Forward
With padding, the convolutional layer's output dimensions are computed as H' = 1 + (H + 2 * pad - HH) / stride and W' = 1 + (W + 2 * pad - WW) / stride.
Note that each filter convolves over all channels of the input, so one filter produces a single output channel and n filters produce n output channels. The number of kernels inside one filter is determined by the number of input channels of the previous layer, one kernel per channel. Code:
def conv_forward_naive(x, w, b, conv_param):
"""
A naive implementation of the forward pass for a convolutional layer.
The input consists of N data points, each with C channels, height H and width
W. We convolve each input with F different filters, where each filter spans
all C channels and has height HH and width WW.
Input:
- x: Input data of shape (N, C, H, W)
- w: Filter weights of shape (F, C, HH, WW)
- b: Biases, of shape (F,)
- conv_param: A dictionary with the following keys:
- 'stride': The number of pixels between adjacent receptive fields in the
horizontal and vertical directions.
- 'pad': The number of pixels that will be used to zero-pad the input.
Returns a tuple of:
- out: Output data, of shape (N, F, H', W') where H' and W' are given by
H' = 1 + (H + 2 * pad - HH) / stride
W' = 1 + (W + 2 * pad - WW) / stride
- cache: (x, w, b, conv_param)
"""
out = None
#############################################################################
# TODO: Implement the convolutional forward pass. #
# Hint: you can use the function np.pad for padding. #
#############################################################################
stride = conv_param['stride']
pad = conv_param['pad']
# zero-pad the input along the spatial dimensions
x_pad = np.pad(x, [(0,0), (0,0), (pad,pad), (pad,pad)], 'constant', constant_values=0)
H_out = (x.shape[2]+2*pad-w.shape[2])/stride+1
W_out = (x.shape[3]+2*pad-w.shape[3])/stride+1
out = np.zeros((x.shape[0], w.shape[0], H_out, W_out))
# convolution
for i in range(H_out):
for j in range(W_out):
for k in range(w.shape[0]):
out[:, k, i, j] = np.sum(w[k,:,:,:]*x_pad[:, :, i*stride:i*stride+w.shape[2], j*stride:j*stride+w.shape[3]], axis=(1,2,3))+b[k]
#############################################################################
# END OF YOUR CODE #
#############################################################################
cache = (x, w, b, conv_param)
return out, cache
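A quick shape sanity check (my own toy example): with H = W = 32, a 7x7 filter, pad = 3 and stride = 1, the formula above gives H' = 1 + (32 + 2*3 - 7) / 1 = 32, so the spatial size is preserved:
import numpy as np

x = np.random.randn(2, 3, 32, 32)   # N=2 images, C=3 channels
w = np.random.randn(4, 3, 7, 7)     # F=4 filters, each spanning all 3 channels
b = np.random.randn(4)
out, _ = conv_forward_naive(x, w, b, {'stride': 1, 'pad': 3})
print(out.shape)                     # expected: (2, 4, 32, 32)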
2. Conv Layer Naive Backward
[Recommended reading]:
http://www.jefkine.com/general/2016/09/05/backpropagation-in-convolutional-neural-networks/
My intuitive take-aways:
- dx must have the same shape as x, dw the same shape as w, db the same shape as b, and dout the same shape as out. This follows directly from their meaning: dL/dx (dx in the code) measures how much a change in x would change L, and x can only change within its own shape.
- Pad x first before computing. In the convolution, each single step (say a 2x2 input patch against a 2x2 kernel) is just an elementwise multiply-and-accumulate, and every element of the output layer is produced by such a multiply-accumulate between the input and the weight matrix. Take dx as an example: each element of the input x participates in anywhere from one to several of these operations, each time being multiplied by a weight and contributing to some position of the output. If x(0,0) participates only once, then dx(0,0) = weight(0,0) x dout(...); if x(0,1) participates twice, contributing to two different output positions, then dx(0,1) = weight(0,0) x dout(...) + weight(0,1) x dout(...), and so on.
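A concrete 1-D toy example of this accumulation: take x = [x0, x1, x2] and w = [w0, w1] with stride 1 and no padding, so
out[0] = w0*x0 + w1*x1
out[1] = w0*x1 + w1*x2
x0 appears only in out[0], so dx[0] = w0*dout[0]; x1 appears in both outputs, so dx[1] = w1*dout[0] + w0*dout[1]. The implementation below does exactly this: for every output position it scatters w*dout back onto the window of x_pad that produced it (hence the += on dx_pad), and accumulates x_pad*dout into dw.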
Code:
def conv_backward_naive(dout, cache):
"""
A naive implementation of the backward pass for a convolutional layer.
Inputs:
- dout: Upstream derivatives.
- cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive
Returns a tuple of:
- dx: Gradient with respect to x
- dw: Gradient with respect to w
- db: Gradient with respect to b
"""
dx, dw, db = None, None, None
#############################################################################
# TODO: Implement the convolutional backward pass. #
#############################################################################
x = cache[0]
w = cache[1]
b = cache[2]
stride = cache[3]['stride']
pad = cache[3]['pad']
x_pad=np.pad(x, [(0,0),(0,0),(pad,pad),(pad,pad)], 'constant', constant_values=0)
dx_pad = np.zeros((x.shape[0], x.shape[1], x.shape[2]+2*pad, x.shape[3]+2*pad))
dw = np.zeros(w.shape)
db = np.zeros(b.shape)
# k indexes the first dimension (samples); f indexes the second dimension (filters);
# xi and xj index the height and width of this layer's output map.
# With the analysis above, the code below is straightforward.
for k in range(dout.shape[0]):
for f in range(dout.shape[1]):
for xi in range(dout.shape[2]):
for xj in range(dout.shape[3]):
dx_pad[k,:,xi*stride:xi*stride+w.shape[2],xj*stride:xj*stride+w.shape[3]] += w[f,:]*dout[k,f,xi,xj]
dw[f] += x_pad[k,:,xi*stride:xi*stride+w.shape[2],xj*stride:xj*stride+w.shape[3]]*dout[k,f,xi,xj]
dx = dx_pad[:,:,pad:pad+x.shape[2],pad:pad+x.shape[3]]
db = np.sum(dout, axis=(0,2,3))
#############################################################################
# END OF YOUR CODE #
#############################################################################
return dx, dw, db
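To sanity-check the backward pass, a numeric gradient check can be run against the forward pass; a sketch, assuming the assignment's eval_numerical_gradient_array helper from cs231n/gradient_check.py is available:
import numpy as np
from cs231n.gradient_check import eval_numerical_gradient_array

x = np.random.randn(2, 3, 7, 7)
w = np.random.randn(2, 3, 3, 3)
b = np.random.randn(2)
conv_param = {'stride': 1, 'pad': 1}
dout = np.random.randn(2, 2, 7, 7)   # same shape as the forward output

out, cache = conv_forward_naive(x, w, b, conv_param)
dx, dw, db = conv_backward_naive(dout, cache)
dx_num = eval_numerical_gradient_array(
    lambda x: conv_forward_naive(x, w, b, conv_param)[0], x, dout)
print(np.max(np.abs(dx - dx_num)))   # should be tiny, on the order of 1e-8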
3. Max Pooling Naive Forward
def max_pool_forward_naive(x, pool_param):
"""
A naive implementation of the forward pass for a max pooling layer.
Inputs:
- x: Input data, of shape (N, C, H, W)
- pool_param: dictionary with the following keys:
- 'pool_height': The height of each pooling region
- 'pool_width': The width of each pooling region
- 'stride': The distance between adjacent pooling regions
Returns a tuple of:
- out: Output data
- cache: (x, pool_param)
"""
out = None
#############################################################################
# TODO: Implement the max pooling forward pass #
#############################################################################
pool_height = pool_param['pool_height']
pool_width = pool_param['pool_width']
stride = pool_param['stride']
out = np.zeros(( x.shape[0], x.shape[1],
(x.shape[2]-pool_height)//stride+1,
(x.shape[3]-pool_width)//stride+1 ))
for n in range(x.shape[0]):
for c in range(x.shape[1]):
for i in range(out.shape[2]):
for j in range(out.shape[3]):
out[n,c,i,j] = np.max( x[n,c,i*stride:i*stride+pool_height, j*stride:j*stride+pool_width] )
#############################################################################
# END OF YOUR CODE #
#############################################################################
cache = (x, pool_param)
return out, cache
4. Max Pooling Naive Backward
def max_pool_backward_naive(dout, cache):
"""
A naive implementation of the backward pass for a max pooling layer.
Inputs:
- dout: Upstream derivatives
- cache: A tuple of (x, pool_param) as in the forward pass.
Returns:
- dx: Gradient with respect to x
"""
dx = None
#############################################################################
# TODO: Implement the max pooling backward pass #
#############################################################################
x = cache[0]
dx = np.zeros(x.shape)
pool_param = cache[1]
stride = pool_param['stride']
pool_height = pool_param['pool_height']
pool_width = pool_param['pool_width']
for n in range(x.shape[0]):
for c in range(x.shape[1]):
for i in range(dout.shape[2]):
for j in range(dout.shape[3]):
tmp = x[n,c,i*stride:i*stride+pool_height,j*stride:j*stride+pool_width]
binary = tmp==np.max(tmp)
# accumulate (+=) so overlapping pooling regions are also handled correctly
dx[n,c,i*stride:i*stride+pool_height,j*stride:j*stride+pool_width] += binary*dout[n,c,i,j]
#############################################################################
# END OF YOUR CODE #
#############################################################################
return dx
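A small sanity check for both pooling passes (my own toy shapes): a 2x2 pool with stride 2 halves the spatial size, and the backward pass routes each upstream gradient only to the max element of its pooling window:
import numpy as np

x = np.random.randn(2, 3, 8, 8)
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}
out, cache = max_pool_forward_naive(x, pool_param)
print(out.shape)            # expected: (2, 3, 4, 4)
dx = max_pool_backward_naive(np.ones_like(out), cache)
print((dx != 0).sum())      # one nonzero per 2x2 window: 2*3*4*4 = 96 (barring ties)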
5. Training a three-layer (Conv + ReLU + Pooling) + (Affine + ReLU) + (Affine) + Softmax network
class ThreeLayerConvNet(object):
"""
A three-layer convolutional network with the following architecture:
conv - relu - 2x2 max pool - affine - relu - affine - softmax
The network operates on minibatches of data that have shape (N, C, H, W)
consisting of N images, each with height H and width W and with C input
channels.
"""
def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,
hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
dtype=np.float32):
"""
Initialize a new network.
Inputs:
- input_dim: Tuple (C, H, W) giving size of input data
- num_filters: Number of filters to use in the convolutional layer
- filter_size: Size of filters to use in the convolutional layer
- hidden_dim: Number of units to use in the fully-connected hidden layer
- num_classes: Number of scores to produce from the final affine layer.
- weight_scale: Scalar giving standard deviation for random initialization
of weights.
- reg: Scalar giving L2 regularization strength
- dtype: numpy datatype to use for computation.
"""
self.params = {}
self.reg = reg
self.dtype = dtype
############################################################################
# TODO: Initialize weights and biases for the three-layer convolutional #
# network. Weights should be initialized from a Gaussian with standard #
# deviation equal to weight_scale; biases should be initialized to zero. #
# All weights and biases should be stored in the dictionary self.params. #
# Store weights and biases for the convolutional layer using the keys 'W1' #
# and 'b1'; use keys 'W2' and 'b2' for the weights and biases of the #
# hidden affine layer, and keys 'W3' and 'b3' for the weights and biases #
# of the output affine layer. #
############################################################################
self.params['W1'] = np.random.normal(0, weight_scale, (num_filters, input_dim[0], filter_size, filter_size))
self.params['b1'] = np.zeros(num_filters)
W2_input_size = num_filters * (input_dim[1] // 2) * (input_dim[2] // 2)  # flattened size after the 2x2 max pool
self.params['W2'] = np.random.normal(0, weight_scale, (W2_input_size, hidden_dim))
self.params['b2'] = np.zeros(hidden_dim)
self.params['W3'] = np.random.normal(0, weight_scale, (hidden_dim, num_classes))
self.params['b3'] = np.zeros(num_classes)
############################################################################
# END OF YOUR CODE #
############################################################################
for k, v in self.params.iteritems():
self.params[k] = v.astype(dtype)
def loss(self, X, y=None):
"""
Evaluate loss and gradient for the three-layer convolutional network.
Input / output: Same API as TwoLayerNet in fc_net.py.
"""
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
W3, b3 = self.params['W3'], self.params['b3']
# pass conv_param to the forward pass for the convolutional layer
filter_size = W1.shape[2]
conv_param = {'stride': 1, 'pad': (filter_size - 1) / 2}
# pass pool_param to the forward pass for the max-pooling layer
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}
scores = None
loss, grads = 0, {}
############################################################################
# TODO: Implement the forward pass for the three-layer convolutional net, #
# computing the class scores for X and storing them in the scores #
# variable. #
############################################################################
out1, cache1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
out2, cache2 = affine_relu_forward(out1, W2, b2)
out3, cache3 = affine_forward(out2, W3, b3)
scores = out3
############################################################################
# END OF YOUR CODE #
############################################################################
if y is None:
return scores
############################################################################
# TODO: Implement the backward pass for the three-layer convolutional net, #
# storing the loss and gradients in the loss and grads variables. Compute #
# data loss using softmax, and make sure that grads[k] holds the gradients #
# for self.params[k]. Don't forget to add L2 regularization! #
############################################################################
loss, dscore = softmax_loss(scores, y)
loss += 0.5 * self.reg * sum(np.sum(W * W) for W in [W1, W2, W3])
dout3, grads['W3'], grads['b3'] = affine_backward(dscore, cache3)
dout2, grads['W2'], grads['b2'] = affine_relu_backward(dout3, cache2)
dout1, grads['W1'], grads['b1'] = conv_relu_pool_backward(dout2, cache1)
grads['W3'] += self.reg*W3
grads['W2'] += self.reg*W2
grads['W1'] += self.reg*W1
############################################################################
# END OF YOUR CODE #
############################################################################
return loss, grads
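With the layers in place, the net can be trained with the assignment's Solver class; a sketch, assuming data is the CIFAR-10 dictionary (X_train / y_train / X_val / y_val) loaded in the notebook and that the hyperparameters shown are only starting points:
from cs231n.solver import Solver

model = ThreeLayerConvNet(weight_scale=1e-3, hidden_dim=500, reg=1e-3)
solver = Solver(model, data,
                num_epochs=1, batch_size=50,
                update_rule='adam',
                optim_config={'learning_rate': 1e-3},
                verbose=True, print_every=20)
solver.train()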
6. Spatial Batch Normalization
We already saw that batch normalization is a very useful technique for training deep fully-connected networks. Batch normalization can also be used for convolutional networks, but we need to tweak it a bit; the modification will be called "spatial batch normalization."
Normally batch-normalization accepts inputs of shape (N, D) and produces outputs of shape (N, D), where we normalize across the minibatch dimension N. For data coming from convolutional layers, batch normalization needs to accept inputs of shape (N, C, H, W) and produce outputs of shape (N, C, H, W) where the N dimension gives the minibatch size and the (H, W) dimensions give the spatial size of the feature map.
If the feature map was produced using convolutions, then we expect the statistics of each feature channel to be relatively consistent both between different images and different locations within the same image. Therefore spatial batch normalization computes a mean and variance for each of the C feature channels by computing statistics over both the minibatch dimension N and the spatial dimensions H and W.
Where BN for a CNN differs from BN for a fully connected net is that the conv layer's spatial BN normalizes per channel (e.g. the RGB channels), so that within each channel the distribution is consistent and normalized both across different images and across different locations of the same image. I did not write this part myself; here is martinkersner's code from GitHub:
def spatial_batchnorm_forward(x, gamma, beta, bn_param):
"""
Computes the forward pass for spatial batch normalization.
Inputs:
- x: Input data of shape (N, C, H, W)
- gamma: Scale parameter, of shape (C,)
- beta: Shift parameter, of shape (C,)
- bn_param: Dictionary with the following keys:
- mode: 'train' or 'test'; required
- eps: Constant for numeric stability
- momentum: Constant for running mean / variance. momentum=0 means that old information is discarded completely at every time step, while momentum=1 means that new information is never incorporated. The default of momentum=0.9 should work well in most situations.
- running_mean: Array of shape (D,) giving running mean of features
- running_var Array of shape (D,) giving running variance of features
Returns a tuple of:
- out: Output data, of shape (N, C, H, W)
- cache: Values needed for the backward pass
"""
out, cache = None, None
##############################################################################
# Implement the forward pass for spatial batch normalization.
# HINT: You can implement spatial batch normalization using the vanilla #
# version of batch normalization defined above. Your implementation should #
# be very short; ours is less than five lines.
##############################################################################
N, C, H, W = x.shape
x_reshaped = x.transpose(0,2,3,1).reshape(N*H*W, C)
out_tmp, cache = batchnorm_forward(x_reshaped, gamma, beta, bn_param)
out = out_tmp.reshape(N, H, W, C).transpose(0, 3, 1, 2)
return out, cache
def spatial_batchnorm_backward(dout, cache):
"""
Computes the backward pass for spatial batch normalization.
Inputs:
- dout: Upstream derivatives, of shape (N, C, H, W)
- cache: Values from the forward pass
Returns a tuple of:
- dx: Gradient with respect to inputs, of shape (N, C, H, W)
- dgamma: Gradient with respect to scale parameter, of shape (C,)
- dbeta: Gradient with respect to shift parameter, of shape (C,)
"""
dx, dgamma, dbeta = None, None, None
#############################################################################
# Implement the backward pass for spatial batch normalization.
## HINT: You can implement spatial batch normalization using the vanilla
## version of batch normalization defined above. Your implementation should
## be very short; ours is less than five lines.
##############################################################################
N, C, H, W = dout.shape
dout_reshaped = dout.transpose(0,2,3,1).reshape(N*H*W, C)
dx_tmp, dgamma, dbeta = batchnorm_backward(dout_reshaped, cache)
dx = dx_tmp.reshape(N, H, W, C).transpose(0, 3, 1, 2)
return dx, dgamma, dbeta
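A quick way to check the reshape trick: after a training-mode forward pass, each of the C channels should have roughly zero mean and unit variance when averaged over the N, H and W axes (my own toy shapes):
import numpy as np

N, C, H, W = 2, 3, 4, 5
x = 4 * np.random.randn(N, C, H, W) + 10
out, _ = spatial_batchnorm_forward(x, np.ones(C), np.zeros(C), {'mode': 'train'})
print(out.mean(axis=(0, 2, 3)))   # close to 0 for each channel
print(out.std(axis=(0, 2, 3)))    # close to 1 for each channel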
[Final tips] Tips for training:
For each network architecture that you try, you should tune the learning rate and regularization strength. When doing this there are a couple important things to keep in mind:
- If the parameters are working well, you should see improvement within a few hundred iterations
- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.
- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.
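As a concrete starting point for the coarse stage, a random search over log-spaced learning rates and regularization strengths can be run for one epoch each (a sketch assuming the Solver class and a small training subset small_data as used in the notebook; the ranges are only guesses):
import numpy as np
from cs231n.solver import Solver

results = {}
for _ in range(10):
    lr = 10 ** np.random.uniform(-4, -2)
    reg = 10 ** np.random.uniform(-4, -1)
    model = ThreeLayerConvNet(weight_scale=1e-2, reg=reg)
    solver = Solver(model, small_data,
                    num_epochs=1, batch_size=50,
                    update_rule='adam',
                    optim_config={'learning_rate': lr},
                    verbose=False)
    solver.train()
    results[(lr, reg)] = solver.best_val_acc
# then narrow the search around the best (lr, reg) pairs and train for more epochs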
Also, for the CNN backward pass there are more efficient approaches than the for loops used in the code above; see:
http://www.cnblogs.com/pinard/p/6494810.html