2. MLP: Construction, Forward Pass, and Backward Pass

%matplotlib inline

Lab1 - Multilayer Perceptrons


In this lab, we go through three examples of MLPs, covering both from-scratch implementations and the standard library:

  • Use numpy for feed-forward and gradient computing

  • Use PyTorch tensor for feed-forward and automatic differentiation

  • Use PyTorch built-in layers and optimizers

Before you get started, please install numpy, torch and torchvision in advance.

We suggest you run the following cells and study the internal mechanism of the neural networks. It is also highly recommended that you tune the hyper-parameters to obtain better results.

Some insights on dropout and Xavier initialization have been adapted from Mu Li's course Dive into Deep Learning.

Dataset and DataLoader

First of all, we use the MNIST dataset as our running example.

For simplicity, we use the ready-made dataset provided by torchvision, so we don't have to worry about data preprocessing : )

Before moving on, please check the basic concepts of Dataset and DataLoader of PyTorch.


import numpy as np
import torch
import torchvision

train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('data/', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=256, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('data/', train=False, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=256, shuffle=True)
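To see what the DataLoader yields, you can grab a single batch and inspect its shape (a quick check, not part of the original lab):

# Peek at one batch from the training loader.
images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([256, 1, 28, 28])
print(labels.shape)  # torch.Size([256])
print(labels[:10])   # the first ten digit labels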

Warm-up: numpy


A fully-connected ReLU network with one hidden layer and no biases, trained to predict y from x using cross-entropy loss. This implementation uses numpy to manually compute the forward pass, loss, and backward pass.

A numpy array is a generic n-dimensional array; it does not know anything about deep learning, gradients, or computational graphs, and is just a way to perform generic numeric computations.


def softmax(x):
    # Subtract the row-wise max for numerical stability (this does not change the result,
    # and avoids mutating the caller's array in place).
    x = x - np.max(x, axis=1, keepdims=True)
    exps = np.exp(x)
    return exps / np.sum(exps, axis=1, keepdims=True)


def cross_entropy(y_pred, y, epsilon=1e-12):
    """
    y_pred is the output of the fully connected layer (num_examples x num_classes).
    y holds the integer class labels (shape: num_examples,).
    Note that y is **not** a one-hot encoded vector.
    It can be computed as y.argmax(axis=1) from one-hot encoded label vectors if required.
    """
    n = y.shape[0]
    p = softmax(y_pred)
    # Avoid computing log(0).
    p = np.clip(p, epsilon, 1.)
    # We use multidimensional array indexing to extract
    # the softmax probability of the correct label for each sample.
    # Refer to https://docs.scipy.org/doc/numpy/user/basics.indexing.html#indexing-multi-dimensional-arrays
    # for understanding multidimensional array indexing.
    log_likelihood = -np.log(p[np.arange(n), y])
    loss = np.sum(log_likelihood) / n
    return loss
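As a quick sanity check (a small sketch, not part of the original lab): the loss is the average of -log(softmax probability of the true class), so a confident correct prediction contributes nearly 0 and uniform logits over k classes contribute ln(k).

# Tiny hand-made batch: one confident correct prediction, one uniform prediction.
logits = np.array([[5.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
print(cross_entropy(logits, labels))  # roughly (0.013 + 1.099) / 2 = 0.556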

Calculating gradients manually is error-prone; be careful when doing it yourself.

If you find it difficult, please refer to these sites (link1, link2).


def grad_cross_entropy(y_pred, y):
    """
    y_pred is the output of the fully connected layer (num_examples x num_classes).
    y holds the integer class labels (shape: num_examples,).
    Note that y is not a one-hot encoded vector.
    It can be computed as y.argmax(axis=1) from one-hot encoded label vectors if required.
    """
    n = y.shape[0]
    # d(loss)/d(y_pred) = (softmax(y_pred) - one_hot(y)) / n
    grad = softmax(y_pred)
    grad[np.arange(n), y] -= 1
    grad = grad / n
    return grad
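If you want to convince yourself that the analytic gradient above is correct, a finite-difference check is a handy (if slow) tool; the snippet below is a small sketch using the two functions just defined.

# Finite-difference check of grad_cross_entropy (illustrative only).
rng = np.random.RandomState(0)
logits = rng.randn(4, 10)
labels = rng.randint(0, 10, size=4)

analytic = grad_cross_entropy(logits, labels)
numeric = np.zeros_like(logits)
eps = 1e-6
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        plus, minus = logits.copy(), logits.copy()
        plus[i, j] += eps
        minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(plus, labels) - cross_entropy(minus, labels)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny, on the order of 1e-9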


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 256, 784, 100, 10

# Create random input and output data
# (these placeholders are immediately replaced by real MNIST batches in the loop below).
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

# Hyper-parameters
n_epochs = 10
learning_rate = 1e-3
display_freq = 50

for t in range(n_epochs):
    for batch_idx, (x, y) in enumerate(train_loader):
        # Forward pass: compute predicted y
        x = x.view(x.shape[0], -1)
        x, y = x.numpy(), y.numpy()
        h = x.dot(w1)
        h_relu = np.maximum(h, 0)
        y_pred = h_relu.dot(w2)

        # Compute and print loss
        loss = cross_entropy(y_pred, y)
        if batch_idx % display_freq == 0:
            print('epoch = {}\tbatch_idx = {}\tloss = {}'.format(t, batch_idx, loss))

        # Backprop to compute gradients of w1 and w2 with respect to loss
        grad_y_pred = grad_cross_entropy(y_pred, y)
        grad_w2 = h_relu.T.dot(grad_y_pred)
        grad_h_relu = grad_y_pred.dot(w2.T)
        grad_h = grad_h_relu.copy()
        grad_h[h < 0] = 0
        grad_w1 = x.T.dot(grad_h)

        # Update weights
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
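The test_loader defined at the top is never used in this warm-up; a minimal sketch of a test-set accuracy check with the trained weights w1 and w2 could look like this.

# Evaluate the numpy model on the test set (illustrative sketch).
correct, total = 0, 0
for x, y in test_loader:
    x = x.view(x.shape[0], -1).numpy()
    y = y.numpy()
    y_pred = np.maximum(x.dot(w1), 0).dot(w2)
    correct += (y_pred.argmax(axis=1) == y).sum()
    total += y.shape[0]
print('test accuracy = {:.4f}'.format(correct / total))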

PyTorch: Tensors and autograd


A fully-connected ReLU network with one hidden layer and no biases, trained to predict y from x by minimizing cross-entropy loss. This implementation computes the forward pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute gradients.

A PyTorch Tensor represents a node in a computational graph. If x is a Tensor with x.requires_grad=True, then x.grad is another Tensor holding the gradient of some scalar value (typically the loss) with respect to x.

Activation Function


def activation(x, method='relu'):
    assert method in ['relu', 'sigmoid', 'tanh'], "Invalid activation function!"
    if method == 'relu':
        return torch.max(x, torch.zeros_like(x))
    elif method == 'sigmoid':
        return 1. / (1. + torch.exp(-x.float()))
    else:
        # tanh(x) = (e^x - e^-x) / (e^x + e^-x)
        pos = torch.exp(x.float())
        neg = torch.exp(-x.float())
        return (pos - neg) / (pos + neg)
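As a quick check (a small sketch, not part of the lab), the hand-written activations should agree with PyTorch's built-in torch.relu, torch.sigmoid and torch.tanh up to floating-point error.

# Compare the hand-written activations with PyTorch's built-ins.
z = torch.randn(4, 5)
print(torch.allclose(activation(z, 'relu'), torch.relu(z)))
print(torch.allclose(activation(z, 'sigmoid'), torch.sigmoid(z)))
print(torch.allclose(activation(z, 'tanh'), torch.tanh(z), atol=1e-6))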

Dropout

Robustness through Perturbations

Let's think briefly about what we expect from a good statistical model. Obviously we want it to do well on unseen test data. One way we can accomplish this is by asking for what amounts to a 'simple' model. Simplicity can come in the form of a small number of dimensions, which is what we did when discussing fitting a function with monomial basis functions. Simplicity can also come in the form of a small norm for the basis functions. This is what led to weight decay and \ell_2 regularization. Yet a third way to impose some notion of simplicity is that the function should be robust under modest changes in the input. For instance, when we classify images, we would expect that alterations of a few pixels are mostly harmless.

In fact, this notion was formalized by Bishop in 1995, when he proved that Training with Input Noise is Equivalent to Tikhonov Regularization. That is, he connected the notion of having a smooth (and thus simple) function with one that is resilient to perturbations in the input. Fast forward to 2014. Given the complexity of deep networks with many layers, enforcing smoothness just on the input misses out on what is happening in subsequent layers. The ingenious idea of Srivastava et al., 2014 was to apply Bishop's idea to the internal layers of the network, too, namely to inject noise into the computational path of the network while it's training.

A key challenge in this context is how to add noise without introducing undue bias. In terms of inputs \mathbf{x}, this is relatively easy to accomplish: simply add some noise \epsilon \sim \mathcal{N}(0,\sigma^2) to it and use this data during training via \mathbf{x}' = \mathbf{x} + \epsilon. A key property is that in expectation \mathbf{E}[\mathbf{x}'] = \mathbf{x}. For intermediate layers, though, this might not be quite so desirable since the scale of the noise might not be appropriate. The alternative is to perturb coordinates as follows:

$$
\begin{aligned}
h' =
\begin{cases}
    0 & \text{ with probability } p \\
    \frac{h}{1-p} & \text{ otherwise}
\end{cases}
\end{aligned}
$$

By design, the expectation remains unchanged, i.e. \mathbf{E}[h'] = h. This idea is at the heart of dropout, where intermediate activations h are replaced during training by a random variable h' with matching expectation. The name 'dropout' arises from the notion that some neurons 'drop out' of the computation for the purpose of computing the final result.


def dropout(X, drop_prob=0.3):
    assert 0 <= drop_prob <= 1
    # In this case, all elements are dropped out
    if drop_prob == 1:
        return torch.zeros_like(X)
    mask = torch.rand(*X.size()) > drop_prob
    # keep intermediate results unbiased
    return mask.type_as(X) * X / (1.0 - drop_prob)
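To see the 'unchanged expectation' property in practice, you can average many dropout samples of the same tensor; the mean should stay close to the original values. This is a small sketch, not part of the original lab.

# Empirically check that E[dropout(X)] is approximately X.
X = torch.ones(5)
samples = torch.stack([dropout(X, drop_prob=0.3) for _ in range(10000)])
print(samples.mean(dim=0))  # each entry should be close to 1.0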

Model with a Dropout Layer


def net(x, method='relu'):
    x = x.view(x.shape[0], -1)
    hidden = activation(x.mm(w1), method=method)
    hidden = dropout(hidden)
    return hidden.mm(w2)

Loss Function


loss_func = torch.nn.CrossEntropyLoss()

Training


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 256, 784, 100, 10
# train_iter, test_iter = housing_data(batch_size)

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

# Hyper-parameters
learning_rate = 1e-3
n_epochs = 10
display_freq = 50

for t in range(n_epochs):
    for batch_idx, (x, y) in enumerate(train_loader):
        # Forward pass: compute predicted y using operations on Tensors. These are
        # exactly the same operations as in the numpy version, but we do not need to
        # keep references to intermediate values, since we are not implementing the
        # backward pass by hand.
        y_pred = net(x, method='relu')

        # Compute and print loss using operations on Tensors.
        # loss is a scalar Tensor; loss.item() gets the scalar value it holds.
        loss = loss_func(y_pred, y)
        if batch_idx % display_freq == 0:
            print('epoch = {}\tbatch_idx = {}\tloss = {}'.format(t, batch_idx, loss.item()))

        # Use autograd to compute the backward pass. This call will compute the
        # gradient of the loss with respect to all Tensors with requires_grad=True.
        # After this call, w1.grad and w2.grad will be Tensors holding the gradient
        # of the loss with respect to w1 and w2 respectively.
        loss.backward()

        # Manually update weights using gradient descent. Wrap in torch.no_grad()
        # because the weights have requires_grad=True, but we don't need to track
        # this update in autograd.
        # An alternative is to operate on weight.data and weight.grad.data.
        # Recall that tensor.data gives a tensor that shares the storage with
        # tensor, but doesn't track history.
        # You can also use torch.optim.SGD to achieve this.
        with torch.no_grad():
            w1 -= learning_rate * w1.grad
            w2 -= learning_rate * w2.grad

            # Manually zero the gradients after updating weights
            w1.grad.zero_()
            w2.grad.zero_()

PyTorch: Standard APIs


A fully-connected ReLU network with one hidden layer, trained to predict y from x by minimizing cross-entropy loss. This implementation uses the nn package from PyTorch to build the network.

PyTorch autograd makes it easy to define computational graphs and take gradients, but raw autograd can be a bit too low-level for defining complex neural networks; this is where the nn package can help. The nn package defines a set of Modules, which you can think of as neural network layers that produce output from input and may hold some trainable weights.

NOTICE:

In this section, we use the built-in SGD optimizer with an additional hyper-parameter, momentum.

Model using nn package


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 256, 784, 100, 10

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.3),
    torch.nn.Linear(H, D_out),
)

Default Initialization

In the previous sections of Dive into Deep Learning, e.g. in “Concise Implementation of Linear Regression”, net.initialize(init.Normal(sigma=0.01)) was used to pick normally distributed random numbers as initial values for the weights. If the initialization method is not specified, as in net.initialize(), MXNet uses its default random initialization: each element of the weight parameter is sampled from a uniform distribution U[-0.07, 0.07] and the bias parameters are all set to 0. (PyTorch's built-in layers likewise come with their own default initialization.) Both choices tend to work quite well in practice for moderate problem sizes.
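For reference, here is a minimal sketch (not part of the original lab) of applying an analogous default-style initialization to the PyTorch model above by hand; torch.nn.init.normal_ and torch.nn.init.zeros_ are standard PyTorch functions, while the choice of sigma = 0.01 simply mirrors the D2L example.

# Hand-apply a simple default-style initialization to the Sequential model.
# (std=0.01 is borrowed from the D2L example, not a PyTorch default.)
for layer in model:
    if isinstance(layer, torch.nn.Linear):
        torch.nn.init.normal_(layer.weight, mean=0.0, std=0.01)
        torch.nn.init.zeros_(layer.bias)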

Xavier Initialization

Let's look at the scale distribution of the activations of the hidden units h_{i} for some layer. They are given by

h_{i} = \sum_{j=1}^{n_\mathrm{in}} W_{ij} x_j

The weights W_{ij} are all drawn independently from the same distribution. Let's furthermore assume that this distribution has zero mean and variance \sigma^2 (this doesn't mean that the distribution has to be Gaussian, just that mean and variance need to exist). We don't really have much control over the inputs into the layer x_j but let's proceed with the somewhat unrealistic assumption that they also have zero mean and variance \gamma^2 and that they're independent of \mathbf{W}. In this case we can compute mean and variance of h_i as follows:

$$
\begin{aligned}
\mathbf{E}[h_i] & = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W_{ij} x_j] = 0 \\
\mathbf{E}[h_i^2] & = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W^2_{ij} x^2_j] \\
& = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W^2_{ij}] \mathbf{E}[x^2_j] \\
& = n_\mathrm{in} \sigma^2 \gamma^2
\end{aligned}
$$

One way to keep the variance fixed is to set n_\mathrm{in} \sigma^2 = 1. Now consider backpropagation. There we face a similar problem, albeit with gradients being propagated from the top layers. That is, instead of \mathbf{W} \mathbf{x} we need to deal with \mathbf{W}^\top \mathbf{g}, where \mathbf{g} is the incoming gradient from the layer above. Using the same reasoning as for forward propagation, we see that the gradients' variance can blow up unless n_\mathrm{out} \sigma^2 = 1. This leaves us in a dilemma: we cannot possibly satisfy both conditions simultaneously. Instead, we simply try to satisfy

$$
\begin{aligned}
\frac{1}{2} (n_\mathrm{in} + n_\mathrm{out}) \sigma^2 = 1 \text{ or equivalently }
\sigma = \sqrt{\frac{2}{n_\mathrm{in} + n_\mathrm{out}}}
\end{aligned}
$$

This is the reasoning underlying the eponymous Xavier initialization, proposed by Xavier Glorot and Yoshua Bengio in 2010. It works well enough in practice. For Gaussian random variables the Xavier initialization picks a normal distribution with zero mean and variance \sigma^2 = 2/(n_\mathrm{in} + n_\mathrm{out}).

For uniformly distributed random variables U[-a, a] note that their variance is given by a^2/3. Plugging a^2/3 into the condition on \sigma^2 yields that we should initialize uniformly with

U\left[-\sqrt{6/(n_\mathrm{in} + n_\mathrm{out})}, \sqrt{6/(n_\mathrm{in} + n_\mathrm{out})}\right].
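As a small illustration (a sketch, not part of the lab), the uniform bound above can be computed directly and compared with the empirical range produced by torch.nn.init.xavier_uniform_:

import math

# Xavier uniform bound for fan_in + fan_out = 784 + 100 (the sizes used in this lab).
fan_in, fan_out = 784, 100
a = math.sqrt(6.0 / (fan_in + fan_out))
print(a)  # about 0.0824

w = torch.empty(fan_in, fan_out)
torch.nn.init.xavier_uniform_(w)
print(w.min().item(), w.max().item())  # both should lie within [-a, a]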


torch.nn.init.xavier_normal_(model[0].weight)

torch.nn.init.xavier_normal_(model[-1].weight)


# The nn package also contains definitions of popular loss functions
loss_fn = torch.nn.CrossEntropyLoss()

# Hyper-parameters
learning_rate = 1e-3
momentum = 0.9
n_epochs = 10
display_freq = 50

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

for t in range(n_epochs):
    for batch_idx, (x, y) in enumerate(train_loader):
        optimizer.zero_grad()

        # Forward pass: compute predicted y by passing x to the model. Module objects
        # override the __call__ operator so you can call them like functions. When
        # doing so you pass a Tensor of input data to the Module and it produces
        # a Tensor of output data.
        y_pred = model(x.view(x.shape[0], -1))

        # Compute and print loss. We pass Tensors containing the predicted and true
        # values of y, and the loss function returns a Tensor containing the loss.
        loss = loss_fn(y_pred, y)
        if batch_idx % display_freq == 0:
            print('epoch = {}\tbatch_idx = {}\tloss = {}'.format(t, batch_idx, loss.item()))

        # Backward pass: compute gradient of the loss with respect to all the learnable
        # parameters of the model. Internally, the parameters of each Module are stored
        # in Tensors with requires_grad=True, so this call will compute gradients for
        # all learnable parameters in the model.
        loss.backward()

        optimizer.step()
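Since the model contains a Dropout layer, switch it to evaluation mode before measuring test accuracy; below is a minimal sketch (not part of the original lab) using the test_loader defined at the top.

# Evaluate on the test set with dropout disabled.
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for x, y in test_loader:
        y_pred = model(x.view(x.shape[0], -1))
        correct += (y_pred.argmax(dim=1) == y).sum().item()
        total += y.size(0)
print('test accuracy = {:.4f}'.format(correct / total))
model.train()  # switch back if you plan to keep training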

Execution screenshots
