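Linear Regression

This section covers:

The basic elements of linear regression
Implementing a linear regression model from scratch
A concise implementation of linear regression using PyTorch

The basic elements of linear regression

Model

For simplicity, we assume here that the price of a house depends on only two factors: its area (in square meters) and its age (in years). We want to explore how exactly the price relates to these two factors. Linear regression assumes that the output is a linear function of each input: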

$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$

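Dataset

We usually collect a set of real data, for example the actual sale prices of many houses together with their areas and ages. On this data we look for model parameters that minimize the error between the model's predicted prices and the true prices. In machine learning terminology, this dataset is called the training data set or training set, each house is a sample, its actual sale price is the label, and the two factors used to predict the label are features. Features describe the characteristics of a sample.

Loss function

During training we need to measure the error between the predicted price and the true price. We usually choose a non-negative number as the error, where a smaller value means a smaller error. A common choice is the squared function. For the sample with index $i$, the error is

$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$

The loss over the whole training set of $n$ samples is the average of the per-sample losses:

$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$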

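Optimization function: stochastic gradient descent

When the model and loss function are simple, the solution to the error minimization problem above can be written directly as a formula. Such a solution is called an analytical solution. Linear regression with squared error, as used in this section, falls into this category. However, most deep learning models have no analytical solution, and the loss can only be reduced as far as possible by updating the model parameters over a finite number of iterations of an optimization algorithm. Such a solution is called a numerical solution.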
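
As a minimal sketch of what an analytical solution looks like, the block below fits a line with a single feature using the standard closed-form least-squares formulas (slope = covariance / variance). It uses plain Python rather than PyTorch to stay self-contained; the function name `fit_line_closed_form` and the synthetic data are illustrative assumptions, not part of the original text.

```python
import random


def fit_line_closed_form(xs, ys):
    """Least-squares slope and intercept for y ~ w * x + b (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w = cov(x, y) / var(x); b then follows from the two means.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data: y = 2.0 * x + 4.2 plus a little Gaussian noise.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [2.0 * x + 4.2 + rng.gauss(0, 0.01) for x in xs]

w, b = fit_line_closed_form(xs, ys)
print(w, b)  # close to the generating values 2.0 and 4.2
```

No iteration is involved here: the parameters come out of a formula in one pass over the data, which is exactly what is not available for most deep learning models.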

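Among optimization algorithms for numerical solutions, mini-batch stochastic gradient descent is widely used in deep learning. The algorithm is simple: first pick initial values for the model parameters, for example at random; then update the parameters over many iterations so that each iteration is likely to reduce the loss. In each iteration, first sample uniformly at random a mini-batch $\mathcal{B}$ consisting of a fixed number of training samples, then compute the derivative (gradient) of the average loss over the mini-batch with respect to the model parameters, and finally subtract the product of this gradient and a preset positive number (the learning rate) from the current parameters.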

$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$

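Learning rate: $\eta$ is the step size taken in each update.
Batch size: $|\mathcal{B}|$ is the number of samples in each mini-batch.

To summarize, the optimization procedure has two steps:

(i) initialize the model parameters, usually at random;
(ii) iterate over the data many times, updating each parameter by moving it in the direction of the negative gradient.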
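
The two steps above can be sketched in plain Python for the two-feature model used in this section. This is only an illustration under stated assumptions: the function name `minibatch_sgd`, the zero initialization, and the synthetic data are hypothetical, and the gradient is written out by hand for the squared loss instead of using automatic differentiation.

```python
import random


def minibatch_sgd(features, labels, lr=0.03, batch_size=10, num_epochs=5, seed=0):
    """Mini-batch SGD for y ~ w . x + b with the squared loss (1/2)(y_hat - y)^2."""
    rng = random.Random(seed)
    num_inputs = len(features[0])
    # Step (i): initialize the parameters (zeros here, for reproducibility).
    w = [0.0] * num_inputs
    b = 0.0
    indices = list(range(len(features)))
    # Step (ii): pass over the data several times, one mini-batch at a time.
    for _ in range(num_epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = [0.0] * num_inputs
            grad_b = 0.0
            for i in batch:
                x, y = features[i], labels[i]
                err = sum(wj * xj for wj, xj in zip(w, x)) + b - y
                # d/dw_j of (1/2) * err^2 is err * x_j; d/db is err.
                for j in range(num_inputs):
                    grad_w[j] += err * x[j]
                grad_b += err
            # Move against the average gradient, scaled by the learning rate.
            for j in range(num_inputs):
                w[j] -= lr * grad_w[j] / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


# Hypothetical data generated from w = [2, -3.4], b = 4.2 plus small noise.
data_rng = random.Random(1)
features = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]
labels = [2.0 * x[0] - 3.4 * x[1] + 4.2 + data_rng.gauss(0, 0.01) for x in features]

w, b = minibatch_sgd(features, labels)
print(w, b)
```

Dividing the summed gradient by `len(batch)` corresponds to the factor $1/|\mathcal{B}|$ in the update rule.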

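Vectorized computation

During training or prediction we often process multiple samples at once, which calls for vectorized computation. Before introducing the vectorized expression for linear regression, consider two ways of adding two vectors.

One way is to add the two vectors element by element with scalar additions.
The other way is to add the two vectors directly as a single vector operation.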

import torch
import time

# init variable a, b as 1000 dimension vector
n = 1000
a = torch.ones(n)
b = torch.ones(n)

# define a timer class to record time
class Timer(object):
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        # start the timer
        self.start_time = time.time()

    def stop(self):
        # stop the timer and record time into a list
        self.times.append(time.time() - self.start_time)
        return self.times[-1]

    def avg(self):
        # calculate the average and return
        return sum(self.times)/len(self.times)

    def sum(self):
        # return the sum of recorded time
        return sum(self.times)

# Now we can test. First, add the two vectors element by element with a for loop.
timer = Timer()
c = torch.zeros(n)
for i in range(n):
    c[i] = a[i] + b[i]
'%.5f sec' % timer.stop()

'0.00991 sec'

# Alternatively, use torch to add the two vectors directly as a vector operation:
timer.start()
d = a + b
'%.5f sec' % timer.stop()

'0.00020 sec'

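The result is clear: in this run the vector addition is roughly 50 times faster than the loop. We should therefore use vectorized computation wherever possible to improve efficiency.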

Implementing a linear regression model from scratch

# import packages and modules
%matplotlib inline
import torch
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import random

print(torch.__version__)

1.3.0

Generating the dataset

We use a linear model to generate a dataset of 1000 samples. The linear relationship used to generate the data is:

$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$

We then plot the generated data.

# set input feature number 
num_inputs = 2
# set example number
num_examples = 1000

# set true weight and bias in order to generate corresponded label
true_w = [2, -3.4]
true_b = 4.2

features = torch.randn(num_examples, num_inputs,
                      dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()),
                       dtype=torch.float32)
                       
plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);


Reading the dataset

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle so that samples are read in random order
    for i in range(0, num_examples, batch_size):
        # the last batch may contain fewer than batch_size samples
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        yield features.index_select(0, j), labels.index_select(0, j)
        
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break

tensor([[ 0.6350,  0.9055],
        [ 1.2403,  0.0921],
        [ 0.6025, -0.2302],
        [ 0.6575, -0.9278],
        [-1.0142, -0.4754],
        [ 1.1268, -0.1300],
        [-0.0864,  0.5667],
        [-0.8504, -0.0015],
        [ 0.6423, -0.0941],
        [-0.1091, -0.6242]]) 
 tensor([2.3810, 6.3654, 6.1801, 8.6633, 3.7889, 6.9128, 2.1082, 2.5203, 5.7929,
        6.1083])

Initializing model parameters

w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float32)
b = torch.zeros(1, dtype=torch.float32)

w.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)

tensor([0.], requires_grad=True)

Defining the model

Define the model whose parameters will be trained:

$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$

def linreg(X, w, b):
    return torch.mm(X, w) + b

Defining the loss function

We use the squared error loss function:
$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$

def squared_loss(y_hat, y): 
    return (y_hat - y.view(y_hat.size())) ** 2 / 2

Defining the optimization function

Here the optimization function is mini-batch stochastic gradient descent:

$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$

def sgd(params, lr, batch_size): 
    for param in params:
        param.data -= lr * param.grad / batch_size  # use .data to update param without gradient tracking

Training

Once the dataset, model, loss function, and optimization function are defined, we can train the model.

# hyperparameter initialization
lr = 0.03
num_epochs = 5

net = linreg
loss = squared_loss

# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, all the samples in dataset will be used once
    
    # X is the feature and y is the label of a batch sample
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()  
        # calculate the gradient of batch sample loss 
        l.backward()  
        # use mini-batch stochastic gradient descent to update the model parameters
        sgd([w, b], lr, batch_size)  
        # reset parameter gradient
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))

epoch 1, loss 0.050744
epoch 2, loss 0.000227
epoch 3, loss 0.000053
epoch 4, loss 0.000052
epoch 5, loss 0.000052

w, true_w, b, true_b

(tensor([[ 2.0005],
         [-3.4000]], requires_grad=True),
 [2, -3.4],
 tensor([4.1997], requires_grad=True),
 4.2)

A concise implementation of linear regression using PyTorch

import torch
from torch import nn
import numpy as np
torch.manual_seed(1)

print(torch.__version__)
torch.set_default_tensor_type('torch.FloatTensor')

1.3.0

Generating the dataset

The dataset is generated in the same way as in the from-scratch implementation.

num_inputs = 2
num_examples = 1000

true_w = [2, -3.4]
true_b = 4.2

features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

Reading the dataset

import torch.utils.data as Data

batch_size = 10

# combine features and labels of dataset
dataset = Data.TensorDataset(features, labels)

# put dataset into DataLoader
data_iter = Data.DataLoader(
    dataset=dataset,            # torch TensorDataset format
    batch_size=batch_size,      # mini batch size
    shuffle=True,               # whether shuffle the data or not
    num_workers=2,              # number of worker subprocesses used to load data
)

for X, y in data_iter:
    print(X, '\n', y)
    break

tensor([[-0.0258,  0.4510],
        [ 0.4923, -0.1081],
        [-1.1668,  0.0468],
        [ 0.3817, -1.0940],
        [ 0.7259,  0.2551],
        [ 1.4847,  0.9639],
        [ 0.1183,  1.7620],
        [-1.3907,  0.2543],
        [ 1.5845,  0.6674],
        [-0.0429,  0.7687]]) 
 tensor([ 2.6156,  5.5594,  1.7087,  8.6942,  4.7836,  3.8924, -1.5552,  0.5531,
         5.1028,  1.5223])

Defining the model

class LinearNet(nn.Module):
    def __init__(self, n_feature):
        super(LinearNet, self).__init__()      # call the parent class constructor
        self.linear = nn.Linear(n_feature, 1)  # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`

    def forward(self, x):
        y = self.linear(x)
        return y
    
net = LinearNet(num_inputs)
print(net)

LinearNet(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)

# ways to init a multilayer network
# method one
net = nn.Sequential(
    nn.Linear(num_inputs, 1)
    # other layers can be added here
    )

# method two
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# net.add_module ......

# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
          ('linear', nn.Linear(num_inputs, 1))
          # ......
        ]))

print(net)
print(net[0])

Sequential(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)
Linear(in_features=2, out_features=1, bias=True)

Initializing model parameters

from torch.nn import init

init.normal_(net[0].weight, mean=0.0, std=0.01)
init.constant_(net[0].bias, val=0.0)  # or you can use `net[0].bias.data.fill_(0)` to modify it directly

Parameter containing:
tensor([0.], requires_grad=True)

for param in net.parameters():
    print(param)

Parameter containing:
tensor([[-0.0142, -0.0161]], requires_grad=True)
Parameter containing:
tensor([0.], requires_grad=True)

Defining the loss function

loss = nn.MSELoss()    # nn built-in squared loss function
                       # function prototype: `torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')`

Defining the optimization function

import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.03)   # built-in stochastic gradient descent function
print(optimizer)  # function prototype: `torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)`

SGD (
Parameter Group 0
    dampening: 0
    lr: 0.03
    momentum: 0
    nesterov: False
    weight_decay: 0
)

Training

num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        output = net(X)
        l = loss(output, y.view(-1, 1))
        optimizer.zero_grad() # reset gradient, equal to net.zero_grad()
        l.backward()
        optimizer.step()
    print('epoch %d, loss: %f' % (epoch, l.item()))

epoch 1, loss: 0.000238
epoch 2, loss: 0.000185
epoch 3, loss: 0.000105

# result comparison
dense = net[0]
print(true_w, dense.weight.data)
print(true_b, dense.bias.data)

[2, -3.4] tensor([[ 2.0010, -3.3994]])
4.2 tensor([4.2005])

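Comparing the two implementations

Implementation from scratch (recommended for learning)

Gives a better understanding of the underlying principles of the model and of neural networks

Concise implementation using PyTorch

Makes it faster to design and implement a model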

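Softmax and classification models

This section covers:

Basic concepts of softmax regression
How to obtain the Fashion-MNIST dataset and read the data
Implementing softmax regression from scratch: a model that classifies the images in the Fashion-MNIST training set
Re-implementing the softmax regression model with PyTorch

Basic concepts of softmax

Classification problems
Consider a simple image classification problem in which the input image is 2 pixels high and 2 pixels wide, in grayscale.
Denote the 4 pixels of the image by $x_1, x_2, x_3, x_4$.
Suppose the true label is dog, cat, or chicken, corresponding to the discrete values $y_1, y_2, y_3$.
We usually use discrete numbers to represent classes, for example $y_1=1, y_2=2, y_3=3$.
Weight vectors
With 4 features and 3 output classes, the model has 12 weights and 3 biases, and computes one output per class:
$\begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1 \end{aligned}$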

$\begin{aligned} o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2 \end{aligned}$

$\begin{aligned} o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3 \end{aligned}$

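Neural network diagram
The figure below depicts the computation above as a neural network diagram. Like linear regression, softmax regression is a single-layer neural network. Because the computation of each output $o_1, o_2, o_3$ depends on all of the inputs $x_1, x_2, x_3, x_4$, the output layer of softmax regression is also a fully connected layer.

[Figure: softmax regression is a single-layer neural network]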

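Since a classification problem requires a discrete prediction, a simple approach is to treat the output value $o_i$ as the confidence that the predicted class is $i$, and to take the class with the largest output as the prediction, that is, to output $\underset{i}{\arg\max} o_i$. For example, if $o_1,o_2,o_3$ are $0.1,10,0.1$ respectively, then since $o_2$ is the largest, the predicted class is 2, which stands for cat.

Problems with raw outputs
Using the output layer's outputs directly has two problems:
1. On the one hand, because the range of the output values is not fixed, it is hard to judge intuitively what these values mean. For example, the output value 10 in the example above means the model is "very confident" that the image is a cat, because it is 100 times the output for the other two classes. But if $o_1=o_3=10^3$, the same output value 10 means that the image is very unlikely to be a cat.
2. On the other hand, because the true labels are discrete values, the error between these discrete values and outputs of uncertain range is hard to measure.

The softmax operator solves both problems. It transforms the outputs into a probability distribution whose values are positive and sum to 1: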

$\hat{y}_1, \hat{y}_2, \hat{y}_3 = \text{softmax}(o_1, o_2, o_3)$

where

$\hat{y}_1 = \frac{ \exp(o_1)}{\sum_{i=1}^3 \exp(o_i)},\quad \hat{y}_2 = \frac{ \exp(o_2)}{\sum_{i=1}^3 \exp(o_i)},\quad \hat{y}_3 = \frac{ \exp(o_3)}{\sum_{i=1}^3 \exp(o_i)}.$

It is easy to see that $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$ and $0 \leq \hat{y}_1, \hat{y}_2, \hat{y}_3 \leq 1$, so $\hat{y}_1, \hat{y}_2, \hat{y}_3$ is a valid probability distribution. Now, if $\hat{y}_2=0.8$, then whatever the values of $\hat{y}_1$ and $\hat{y}_3$, we know that the probability of the image being a cat is 80%. In addition, we note that

$\underset{i}{\arg\max} o_i = \underset{i}{\arg\max} \hat{y}_i$

so the softmax operation does not change the predicted class.
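
The properties claimed above (outputs sum to 1, and the arg max is preserved) can be checked with a small plain-Python sketch. The helper `softmax` below is an illustrative assumption rather than the implementation used later in this text; it also subtracts the maximum output before exponentiating, a common numerical-stability trick that the original formula does not include and that leaves the result unchanged.

```python
import math


def softmax(outputs):
    """Map a list of raw outputs to probabilities that sum to 1."""
    # Subtracting the max avoids overflow in exp() and does not change the result.
    shift = max(outputs)
    exps = [math.exp(o - shift) for o in outputs]
    total = sum(exps)
    return [e / total for e in exps]


o = [0.1, 10.0, 0.1]          # the example outputs from the text
y_hat = softmax(o)

print(y_hat)
print(sum(y_hat))
# The largest raw output and the largest probability sit at the same index.
print(o.index(max(o)) == y_hat.index(max(y_hat)))
```

With the shift in place, inputs such as `[1000.0, 1000.0]` still give a valid distribution instead of overflowing.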

Computational efficiency
- Vectorized expression for a single sample
  To improve efficiency, we can express single-sample classification as a vector computation. In the image classification problem above, suppose the weights and bias of softmax regression are

$\boldsymbol{W} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix},\quad \boldsymbol{b} = \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix},$

Let the features of image sample $i$, which is 2 pixels high and 2 pixels wide, be

$\boldsymbol{x}^{(i)} = \begin{bmatrix}x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & x_4^{(i)}\end{bmatrix},$

the output of the output layer be

$\boldsymbol{o}^{(i)} = \begin{bmatrix}o_1^{(i)} & o_2^{(i)} & o_3^{(i)}\end{bmatrix},$

and the predicted probability distribution over dog, cat, and chicken be

$\boldsymbol{\hat{y}}^{(i)} = \begin{bmatrix}\hat{y}_1^{(i)} & \hat{y}_2^{(i)} & \hat{y}_3^{(i)}\end{bmatrix}.$

The vectorized expression for classifying sample $i$ with softmax regression is

$\begin{aligned} \boldsymbol{o}^{(i)} &= \boldsymbol{x}^{(i)} \boldsymbol{W} + \boldsymbol{b},\\ \boldsymbol{\hat{y}}^{(i)} &= \text{softmax}(\boldsymbol{o}^{(i)}). \end{aligned}$

- Vectorized expression for a mini-batch
  To further improve efficiency, we usually do vectorized computation over a mini-batch. More generally, given a mini-batch of size $n$ with $d$ inputs (features) and $q$ outputs (classes), let the batch features be $\boldsymbol{X} \in \mathbb{R}^{n \times d}$. Suppose the weights and bias of softmax regression are $\boldsymbol{W} \in \mathbb{R}^{d \times q}$ and $\boldsymbol{b} \in \mathbb{R}^{1 \times q}$. The vectorized expression for softmax regression is

$\begin{aligned} \boldsymbol{O} &= \boldsymbol{X} \boldsymbol{W} + \boldsymbol{b},\\ \boldsymbol{\hat{Y}} &= \text{softmax}(\boldsymbol{O}), \end{aligned}$

where the addition uses broadcasting, $\boldsymbol{O}, \boldsymbol{\hat{Y}} \in \mathbb{R}^{n \times q}$, and row $i$ of these two matrices is respectively the output $\boldsymbol{o}^{(i)}$ and the probability distribution $\boldsymbol{\hat{y}}^{(i)}$ of sample $i$.
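
To make the shapes in the mini-batch expression concrete, here is a rough plain-Python sketch with $n=2$, $d=4$, $q=3$. The helper names (`matmul`, `add_bias`, `softmax_rows`) and all numeric values are made up for illustration; nested lists stand in for matrices, and `add_bias` plays the role of broadcasting the $1 \times q$ bias across the $n$ rows.

```python
import math


def matmul(A, B):
    """Multiply an (n x d) matrix by a (d x q) matrix, both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add_bias(M, bias):
    """Add the same bias row to every row of M (what broadcasting does)."""
    return [[m + c for m, c in zip(row, bias)] for row in M]


def softmax_rows(O):
    """Apply softmax independently to each row of O."""
    result = []
    for row in O:
        shift = max(row)  # stability shift; does not change the result
        exps = [math.exp(o - shift) for o in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


X = [[1.0, 2.0, 3.0, 4.0],      # n = 2 samples, d = 4 features
     [0.5, 0.0, -1.0, 2.0]]
W = [[0.1, 0.2, 0.3],           # d = 4 rows, q = 3 columns
     [0.0, -0.1, 0.2],
     [0.3, 0.1, 0.0],
     [-0.2, 0.0, 0.1]]
b = [0.0, 0.1, -0.1]            # one bias per class

O = add_bias(matmul(X, W), b)   # shape n x q
Y_hat = softmax_rows(O)         # shape n x q, each row sums to 1

print(len(O), len(O[0]))
print([sum(row) for row in Y_hat])
```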

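Cross-entropy loss function

For sample $i$, we construct a vector $\boldsymbol{y}^{(i)}\in \mathbb{R}^{q}$ whose $y^{(i)}$-th element (where $y^{(i)}$ is the discrete class value of sample $i$) is 1 and whose other elements are 0. Our training goal can then be set as making the predicted probability distribution $\boldsymbol{\hat y}^{(i)}$ as close as possible to the true label distribution $\boldsymbol{y}^{(i)}$.

Squared loss

$\text{Loss} = \left\|\boldsymbol{\hat y}^{(i)}-\boldsymbol{y}^{(i)}\right\|^2/2$

However, to classify correctly we do not actually need the predicted probabilities to equal the label probabilities exactly. For example, in the image classification example, if $y^{(i)}=3$, we only need $\hat{y}^{(i)}_3$ to be larger than the other two predictions $\hat{y}^{(i)}_1$ and $\hat{y}^{(i)}_2$. Even if $\hat{y}^{(i)}_3$ is only 0.6, the class prediction is correct whatever the other two values are. Squared loss is too strict here: for example, $\hat y^{(i)}_1=\hat y^{(i)}_2=0.2$ gives a smaller loss (0.12) than $\hat y^{(i)}_1=0, \hat y^{(i)}_2=0.4$ (0.16), even though both give the same correct classification.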
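
The comparison in the paragraph above is easy to reproduce by hand. The short sketch below, whose function name `squared_loss` is only illustrative, evaluates the squared loss for the two predictions against the one-hot label for class 3.

```python
def squared_loss(y_hat, y_one_hot):
    """Half the squared Euclidean distance between prediction and one-hot label."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y_one_hot)) / 2


y = [0.0, 0.0, 1.0]                         # one-hot label for class 3

loss_a = squared_loss([0.2, 0.2, 0.6], y)   # (0.04 + 0.04 + 0.16) / 2
loss_b = squared_loss([0.0, 0.4, 0.6], y)   # (0.00 + 0.16 + 0.16) / 2

print(loss_a, loss_b)
```

Both predictions assign probability 0.6 to the correct class and would be classified identically, yet the squared loss ranks them differently because it also penalizes how the remaining 0.4 is split between the wrong classes.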

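One way to address this is to use a measure better suited to the difference between two probability distributions. Cross entropy is a commonly used one:

$H\left(\boldsymbol y^{(i)}, \boldsymbol {\hat y}^{(i)}\right ) = -\sum_{j=1}^q y_j^{(i)} \log \hat y_j^{(i)},$

where the subscripted $y_j^{(i)}$ is an element of the vector $\boldsymbol y^{(i)}$ and equals either 0 or 1; take care to distinguish it from the discrete class value of sample $i$, written without a subscript as $y^{(i)}$. In the formula above, only the $y^{(i)}$-th element $y^{(i)}_{y^{(i)}}$ of $\boldsymbol y^{(i)}$ is 1 and all others are 0, so $H(\boldsymbol y^{(i)}, \boldsymbol {\hat y}^{(i)}) = -\log \hat y_{y^{(i)}}^{(i)}$. In other words, cross entropy only cares about the predicted probability of the correct class, because as long as that value is large enough, the classification is guaranteed to be correct. Of course, when a sample has multiple labels, for example when an image contains more than one object, we cannot make this simplification. Even then, cross entropy still only cares about the predicted probabilities of the object classes that appear in the image.

Suppose the training set has $n$ samples. The cross-entropy loss function is defined as
$\ell(\boldsymbol{\Theta}) = \frac{1}{n} \sum_{i=1}^n H\left(\boldsymbol y^{(i)}, \boldsymbol {\hat y}^{(i)}\right ),$

where $\boldsymbol{\Theta}$ denotes the model parameters. Likewise, if each sample has only one label, the cross-entropy loss simplifies to $\ell(\boldsymbol{\Theta}) = -(1/n) \sum_{i=1}^n \log \hat y_{y^{(i)}}^{(i)}$. From another perspective, minimizing $\ell(\boldsymbol{\Theta})$ is equivalent to maximizing $\exp(-n\ell(\boldsymbol{\Theta}))=\prod_{i=1}^n \hat y_{y^{(i)}}^{(i)}$, that is, minimizing the cross-entropy loss is equivalent to maximizing the joint predicted probability of all label classes in the training set.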
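
The identity $\exp(-n\ell(\boldsymbol{\Theta}))=\prod_{i=1}^n \hat y_{y^{(i)}}^{(i)}$ can be checked numerically on two samples. This is only a sketch: the function name `cross_entropy` is illustrative, and class labels are zero-based here (as in Python indexing), whereas the formulas above count classes from 1.

```python
import math


def cross_entropy(y_hat, label):
    """Single-label cross entropy: minus the log-probability of the true class."""
    return -math.log(y_hat[label])


y_hats = [[0.1, 0.3, 0.6],
          [0.3, 0.2, 0.5]]
labels = [0, 2]               # zero-based class indices

losses = [cross_entropy(p, y) for p, y in zip(y_hats, labels)]
n = len(losses)
mean_loss = sum(losses) / n

# Joint predicted probability of the true classes: 0.1 * 0.5.
joint_prob = 1.0
for p, y in zip(y_hats, labels):
    joint_prob *= p[y]

print(mean_loss)
print(math.exp(-n * mean_loss), joint_prob)
```

The two printed values on the last line agree up to floating-point rounding, so lowering the mean cross entropy and raising the joint probability are the same objective.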

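Model training and prediction

After training the softmax regression model, given the features of any sample we can predict the probability of each output class. We usually take the class with the highest predicted probability as the output class. If it matches the true class (label), the prediction is correct. In the experiments of Section 3.6, we will use accuracy to evaluate the model. It equals the number of correct predictions divided by the total number of predictions.

Obtaining the Fashion-MNIST training set and reading the data

Before implementing softmax regression, we introduce a multi-class image classification dataset. It will be used repeatedly in later chapters so that we can compare algorithms in terms of model accuracy and computational efficiency. The most commonly used image classification dataset is the handwritten digit dataset MNIST[1]. However, most models achieve over 95% classification accuracy on MNIST. To see the differences between algorithms more clearly, we will use Fashion-MNIST[2], a dataset with more complex image content.

Here we use the torchvision package, which serves the PyTorch deep learning framework and is mainly used to build computer vision models. torchvision mainly consists of the following parts:

torchvision.datasets: functions for loading data and interfaces to common datasets;
torchvision.models: common model architectures (including pretrained models), such as AlexNet, VGG, and ResNet;
torchvision.transforms: common image transformations, such as cropping and rotation;
torchvision.utils: other useful utilities.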

# import needed package
%matplotlib inline
from IPython import display
import matplotlib.pyplot as plt

import torch
import torchvision
import torchvision.transforms as transforms
import time

import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

print(torch.__version__)
print(torchvision.__version__)

1.3.0
0.4.1a0+d94043a

Getting the dataset

mnist_train = torchvision.datasets.FashionMNIST(root='/home/kesci/input/FashionMNIST2065', train=True, download=True, transform=transforms.ToTensor())
mnist_test = torchvision.datasets.FashionMNIST(root='/home/kesci/input/FashionMNIST2065', train=False, download=True, transform=transforms.ToTensor())

class torchvision.datasets.FashionMNIST(root, train=True, transform=None, target_transform=None, download=False)

root (string) – Root directory of the dataset, where the processed/training.pt and processed/test.pt files are stored.
train (bool, optional) – If True, creates the dataset from training.pt, otherwise from test.pt.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the data already exists under root, it is not downloaded again.
transform (callable, optional) – A function or transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.
target_transform (callable, optional) – A function or transform that takes in the target and transforms it.

# show result 
print(type(mnist_train))
print(len(mnist_train), len(mnist_test))

<class 'torchvision.datasets.mnist.FashionMNIST'>
60000 10000

# We can access any sample by index
feature, label = mnist_train[0]
print(feature.shape, label)  # Channel x Height x Width

torch.Size([1, 28, 28]) 9

If no transform is applied, each input is a PIL image. We can inspect the image object:

mnist_PIL = torchvision.datasets.FashionMNIST(root='/home/kesci/input/FashionMNIST2065', train=True, download=True)
PIL_feature, label = mnist_PIL[0]
print(PIL_feature)

<PIL.Image.Image image mode=L size=28x28 at 0x7F54A41612E8>

# This function is saved in the d2lzh package for later use
def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

def show_fashion_mnist(images, labels):
    d2l.use_svg_display()
    # Here _ denotes a variable that we ignore (do not use)
    _, figs = plt.subplots(1, len(images), figsize=(12, 12))
    for f, img, lbl in zip(figs, images, labels):
        f.imshow(img.view((28, 28)).numpy())
        f.set_title(lbl)
        f.axes.get_xaxis().set_visible(False)
        f.axes.get_yaxis().set_visible(False)
    plt.show()

X, y = [], []
for i in range(10):
    X.append(mnist_train[i][0]) # append the i-th feature to X
    y.append(mnist_train[i][1]) # append the i-th label to y
show_fashion_mnist(X, get_fashion_mnist_labels(y))


# Read the data
batch_size = 256
num_workers = 4
train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=num_workers)

start = time.time()
for X, y in train_iter:
    continue
print('%.2f sec' % (time.time() - start))

4.95 sec

Softmax implemented from scratch

import torch
import torchvision
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

print(torch.__version__)
print(torchvision.__version__)

1.3.0
0.4.1a0+d94043a

Getting the training and test data

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, root='/home/kesci/input/FashionMNIST2065')

Initializing model parameters

num_inputs = 784
print(28*28)
num_outputs = 10

W = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_outputs)), dtype=torch.float)
b = torch.zeros(num_outputs, dtype=torch.float)

W.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)

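Operating on a multi-dimensional Tensor along a dimension

X = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(X.sum(dim=0, keepdim=True))  # dim=0: sum down each column, keeping the summed dimension (shape 1 x 3)
print(X.sum(dim=1, keepdim=True))  # dim=1: sum across each row, keeping the summed dimension (shape 2 x 1)
print(X.sum(dim=0, keepdim=False)) # dim=0: sum down each column, dropping the summed dimension (shape 3)
print(X.sum(dim=1, keepdim=False)) # dim=1: sum across each row, dropping the summed dimension (shape 2)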

tensor([[5, 7, 9]])
tensor([[ 6],
        [15]])
tensor([5, 7, 9])
tensor([ 6, 15])

Defining the softmax operation

$\hat{y}_j = \frac{ \exp(o_j)}{\sum_{i=1}^q \exp(o_i)}$

def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    # print("X size is ", X_exp.size())
    # print("partition size is ", partition, partition.size())
    return X_exp / partition  # broadcasting is applied here

X = torch.rand((2, 5))
X_prob = softmax(X)
print(X_prob, '\n', X_prob.sum(dim=1))

tensor([[0.2253, 0.1823, 0.1943, 0.2275, 0.1706],
        [0.1588, 0.2409, 0.2310, 0.1670, 0.2024]]) 
 tensor([1.0000, 1.0000])

Softmax regression model

$\begin{aligned} \boldsymbol{o}^{(i)} &= \boldsymbol{x}^{(i)} \boldsymbol{W} + \boldsymbol{b},\\ \boldsymbol{\hat{y}}^{(i)} &= \text{softmax}(\boldsymbol{o}^{(i)}). \end{aligned}$

def net(X):
    return softmax(torch.mm(X.view((-1, num_inputs)), W) + b)

Defining the loss function

$H\left(\boldsymbol y^{(i)}, \boldsymbol {\hat y}^{(i)}\right ) = -\sum_{j=1}^q y_j^{(i)} \log \hat y_j^{(i)},$

$\ell(\boldsymbol{\Theta}) = \frac{1}{n} \sum_{i=1}^n H\left(\boldsymbol y^{(i)}, \boldsymbol {\hat y}^{(i)}\right ),$

$\ell(\boldsymbol{\Theta}) = -(1/n) \sum_{i=1}^n \log \hat y_{y^{(i)}}^{(i)}$

y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = torch.LongTensor([0, 2])
y_hat.gather(1, y.view(-1, 1))

tensor([[0.1000],
        [0.5000]])

def cross_entropy(y_hat, y):
    return - torch.log(y_hat.gather(1, y.view(-1, 1)))

Defining accuracy

The accuracy defined here is used when we evaluate the model's predictions after training.

def accuracy(y_hat, y):
    return (y_hat.argmax(dim=1) == y).float().mean().item()

print(accuracy(y_hat, y))

0.5

# This function is saved in the d2lzh_pytorch package for later use. It will be improved step by step: its complete implementation is described in the "Image Augmentation" section
def evaluate_accuracy(data_iter, net):
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n += y.shape[0]
    return acc_sum / n

print(evaluate_accuracy(test_iter, net))

0.1445

Training the model

num_epochs, lr = 5, 0.1

# This function is saved in the d2lzh_pytorch package for later use
def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, optimizer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()
            
            # zero the gradients
            if optimizer is not None:
                optimizer.zero_grad()
            elif params is not None and params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            
            l.backward()
            if optimizer is None:
                d2l.sgd(params, lr, batch_size)
            else:
                optimizer.step() 
            
            
            train_l_sum += l.item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))

train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, batch_size, [W, b], lr)

epoch 1, loss 0.7851, train acc 0.750, test acc 0.791
epoch 2, loss 0.5704, train acc 0.814, test acc 0.810
epoch 3, loss 0.5258, train acc 0.825, test acc 0.819
epoch 4, loss 0.5014, train acc 0.832, test acc 0.824
epoch 5, loss 0.4865, train acc 0.836, test acc 0.827

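Model prediction

Now that the model is trained, we can make some predictions to see how accurate it really is.
We can now demonstrate how to classify images. Given a series of images (the third row, image output), we compare their true labels (the first row of text) with the model's predictions (the second row of text).

X, y = next(iter(test_iter))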

true_labels = d2l.get_fashion_mnist_labels(y.numpy())
pred_labels = d2l.get_fashion_mnist_labels(net(X).argmax(dim=1).numpy())
titles = [true + '\n' + pred for true, pred in zip(true_labels, pred_labels)]

d2l.show_fashion_mnist(X[0:9], titles[0:9])


Concise implementation of softmax

# load the required packages and modules
import torch
from torch import nn
from torch.nn import init
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

print(torch.__version__)

1.3.0

Initializing parameters and getting the data

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, root='/home/kesci/input/FashionMNIST2065')

Defining the network model

num_inputs = 784
num_outputs = 10

class LinearNet(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(LinearNet, self).__init__()
        self.linear = nn.Linear(num_inputs, num_outputs)
    def forward(self, x): # shape of x: (batch, 1, 28, 28)
        y = self.linear(x.view(x.shape[0], -1))
        return y
    
# net = LinearNet(num_inputs, num_outputs)

class FlattenLayer(nn.Module):
    def __init__(self):
        super(FlattenLayer, self).__init__()
    def forward(self, x): # shape of x: (batch, *, *, ...)
        return x.view(x.shape[0], -1)

from collections import OrderedDict
net = nn.Sequential(
        # FlattenLayer(),
        # LinearNet(num_inputs, num_outputs) 
        OrderedDict([
           ('flatten', FlattenLayer()),
           ('linear', nn.Linear(num_inputs, num_outputs))]) # or use our own LinearNet(num_inputs, num_outputs) instead
        )

Initializing model parameters

init.normal_(net.linear.weight, mean=0, std=0.01)
init.constant_(net.linear.bias, val=0)

Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)

Defining the loss function

loss = nn.CrossEntropyLoss() # its function prototype is shown below
# class torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')

Defining the optimization function

optimizer = torch.optim.SGD(net.parameters(), lr=0.1) # its function prototype is shown below
# class torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)

Training

num_epochs = 5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)

epoch 1, loss 0.0031, train acc 0.751, test acc 0.795
epoch 2, loss 0.0022, train acc 0.813, test acc 0.809
epoch 3, loss 0.0021, train acc 0.825, test acc 0.806
epoch 4, loss 0.0020, train acc 0.833, test acc 0.813
epoch 5, loss 0.0019, train acc 0.837, test acc 0.822

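Multilayer perceptron

Basics of the multilayer perceptron
Image classification with a multilayer perceptron, implemented from scratch
Concise implementation using PyTorch

Basics of the multilayer perceptron

Deep learning mainly focuses on multilayer models. Here we take the multilayer perceptron (MLP) as an example to introduce the concept of multilayer neural networks.

Hidden layer

The figure below shows the neural network diagram of a multilayer perceptron with one hidden layer containing 5 hidden units.

[Figure: multilayer perceptron with one hidden layer of 5 hidden units]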

Formulation

Specifically, given a mini-batch $\boldsymbol{X} \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, suppose the multilayer perceptron has a single hidden layer with $h$ hidden units. Denote the output of the hidden layer (also called the hidden-layer variables or hidden variables) by $\boldsymbol{H}$, with $\boldsymbol{H} \in \mathbb{R}^{n \times h}$. Since both the hidden layer and the output layer are fully connected, let the hidden layer's weights and bias be $\boldsymbol{W}_h \in \mathbb{R}^{d \times h}$ and $\boldsymbol{b}_h \in \mathbb{R}^{1 \times h}$, and the output layer's weights and bias be $\boldsymbol{W}_o \in \mathbb{R}^{h \times q}$ and $\boldsymbol{b}_o \in \mathbb{R}^{1 \times q}$.

We first look at one design of a multilayer perceptron with a single hidden layer. Its output $\boldsymbol{O} \in \mathbb{R}^{n \times q}$ is computed as

$\begin{aligned} \boldsymbol{H} &= \boldsymbol{X} \boldsymbol{W}_h + \boldsymbol{b}_h,\\ \boldsymbol{O} &= \boldsymbol{H} \boldsymbol{W}_o + \boldsymbol{b}_o, \end{aligned}$

that is, the hidden layer's output is used directly as the input to the output layer. Combining the two equations gives

$\boldsymbol{O} = (\boldsymbol{X} \boldsymbol{W}_h + \boldsymbol{b}_h)\boldsymbol{W}_o + \boldsymbol{b}_o = \boldsymbol{X} \boldsymbol{W}_h\boldsymbol{W}_o + \boldsymbol{b}_h \boldsymbol{W}_o + \boldsymbol{b}_o.$

The combined equation shows that, although the network introduces a hidden layer, it is still equivalent to a single-layer neural network whose output-layer weight is $\boldsymbol{W}_h\boldsymbol{W}_o$ and whose bias is $\boldsymbol{b}_h \boldsymbol{W}_o + \boldsymbol{b}_o$. It is easy to see that even if more hidden layers are added, this design remains equivalent to a single-layer neural network with only an output layer.
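
The collapse of two affine layers into one can be verified numerically. The sketch below uses plain Python lists as matrices; the helper names and every numeric value are arbitrary choices for illustration, not taken from the original text.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add_bias(M, bias):
    """Add the same bias row to every row of M."""
    return [[m + c for m, c in zip(row, bias)] for row in M]


X = [[1.0, 2.0], [3.0, -1.0]]                   # n = 2, d = 2
W_h = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]      # d = 2, h = 3
b_h = [0.1, 0.2, -0.3]
W_o = [[1.0, 0.0], [-2.0, 1.0], [0.5, 0.5]]     # h = 3, q = 2
b_o = [0.3, -0.1]

# Two stacked affine layers with no activation in between.
H = add_bias(matmul(X, W_h), b_h)
O_two_layers = add_bias(matmul(H, W_o), b_o)

# The equivalent single layer: weight W_h W_o, bias b_h W_o + b_o.
W_single = matmul(W_h, W_o)
b_single = [v + c for v, c in zip(matmul([b_h], W_o)[0], b_o)]
O_single = add_bias(matmul(X, W_single), b_single)

print(O_two_layers)
print(O_single)
```

The two printed matrices match up to floating-point rounding, so the hidden layer adds parameters but no expressive power in this design.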

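Activation functions

The root of the problem is that a fully connected layer only applies an affine transformation to the data, and a composition of affine transformations is still an affine transformation. One solution is to introduce a nonlinear transformation, for example applying an element-wise nonlinear function to the hidden variables before feeding them to the next fully connected layer. This nonlinear function is called an activation function.

Below we introduce several common activation functions:

ReLU function

The ReLU (rectified linear unit) function provides a very simple nonlinear transformation. Given an element $x$, it is defined as

$\text{ReLU}(x) = \max(x, 0).$

As we can see, ReLU keeps only the positive elements and sets negative elements to zero. To visualize this nonlinear transformation, we first define a plotting function xyplot.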

%matplotlib inline
import torch
import numpy as np
import matplotlib.pyplot as plt
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
print(torch.__version__)

1.3.0

def xyplot(x_vals, y_vals, name):
    # d2l.set_figsize(figsize=(5, 2.5))
    plt.plot(x_vals.detach().numpy(), y_vals.detach().numpy())
    plt.xlabel('x')
    plt.ylabel(name + '(x)')

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = x.relu()
xyplot(x, y, 'relu')


y.sum().backward()
xyplot(x, x.grad, 'grad of relu')


Sigmoid function

The sigmoid function maps a value to between 0 and 1:

$\text{sigmoid}(x) = \frac{1}{1 + \exp(-x)}.$

y = x.sigmoid()
xyplot(x, y, 'sigmoid')


By the chain rule, the derivative of the sigmoid function is

$\text{sigmoid}'(x) = \text{sigmoid}(x)\left(1-\text{sigmoid}(x)\right).$

The derivative of the sigmoid function is plotted below. When the input is 0, the derivative reaches its maximum of 0.25; the further the input moves from 0, the closer the derivative gets to 0.

x.grad.zero_()
y.sum().backward()
xyplot(x, x.grad, 'grad of sigmoid')
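
The derivative formula and the maximum of 0.25 at the origin can also be checked without autograd. The following plain-Python sketch compares the closed-form derivative with a central finite difference; the function names and the step size `eps` are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Closed-form derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


for x in (-4.0, 0.0, 4.0):
    print(x, sigmoid_grad(x), numeric_grad(sigmoid, x))
```

At `x = 0.0` the closed form evaluates to exactly 0.25, and at `x = -4.0` and `x = 4.0` it is already below 0.02, which is the flattening that later shows up as the vanishing gradient problem.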


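tanh function

The tanh (hyperbolic tangent) function maps a value to between -1 and 1:

$\text{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}.$

Next we plot the tanh function. When the input is near 0, tanh is close to a linear transformation. Although its shape resembles that of the sigmoid function, tanh is symmetric about the origin of the coordinate system.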

y = x.tanh()
xyplot(x, y, 'tanh')


By the chain rule, the derivative of the tanh function is

$\text{tanh}'(x) = 1 - \text{tanh}^2(x).$

The derivative of the tanh function is plotted below. When the input is 0, the derivative reaches its maximum of 1; the further the input moves from 0, the closer the derivative gets to 0.

x.grad.zero_()
y.sum().backward()
xyplot(x, x.grad, 'grad of tanh')

[Figure: plot of the gradient of tanh(x)]
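
As a sanity check on the tanh formulas, the following sketch compares the definition used in the text with `math.tanh`, checks the derivative against a finite difference, and shows why the curve resembles sigmoid: tanh(x) equals 2 * sigmoid(2x) - 1. It uses only the standard library; the helper names and sample points are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_formula(x):
    """tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x)), as defined in the text."""
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)

def tanh_grad(x):
    """Analytic derivative: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

points = (-2.0, -0.5, 0.0, 0.5, 2.0)
for x in points:
    formula_gap = abs(tanh_from_formula(x) - math.tanh(x))
    rescale_gap = abs(2.0 * sigmoid(2.0 * x) - 1.0 - math.tanh(x))
    grad_gap = abs(tanh_grad(x) - numeric_grad(math.tanh, x))
    print(x, formula_gap, rescale_gap, grad_gap)
```

All three gaps stay at rounding-error level, so tanh can be read as a sigmoid that has been rescaled and shifted to be symmetric about the origin.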

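Choosing an activation function

ReLU is a general-purpose activation function and is currently used in most cases. However, ReLU should only be used in hidden layers.

For classifiers, the sigmoid function and its combinations usually work better. Because of the vanishing gradient problem, sigmoid and tanh sometimes need to be avoided.

When the network has many layers, ReLU is the better choice: it is simple and cheap to compute, whereas sigmoid and tanh are considerably more expensive.

When choosing an activation function, start with ReLU; if the results are unsatisfactory, try other activation functions.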
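
To give a rough feel for the vanishing gradient problem mentioned above, the sketch below multiplies the activation derivative over several layers. It is a deliberately simplified illustration: it ignores the weight matrices entirely and assumes every layer sees the same pre-activation value; the depth of 10 and the value 2.0 are arbitrary choices made here.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # Derivative of max(x, 0); taken as 0 at x = 0 in this sketch.
    return 1.0 if x > 0 else 0.0

def chained_factor(grad_fn, pre_activation, depth):
    """Product of `depth` identical activation derivatives (weights ignored)."""
    return grad_fn(pre_activation) ** depth

depth, z = 10, 2.0  # illustrative depth and pre-activation value
factors = {
    name: chained_factor(fn, z, depth)
    for name, fn in [("sigmoid", sigmoid_grad),
                     ("tanh", tanh_grad),
                     ("relu", relu_grad)]
}
for name, value in factors.items():
    print(name, value)
```

Under these assumptions the sigmoid and tanh factors shrink by many orders of magnitude, while the ReLU factor stays at 1 for positive inputs. Even in the best case for sigmoid (input 0, derivative 0.25), ten layers contribute a factor of 0.25 ** 10.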

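Multilayer perceptron

A multilayer perceptron is a neural network made up of fully connected layers with at least one hidden layer, in which the output of every hidden layer is transformed by an activation function. The number of layers and the number of hidden units in each hidden layer are hyperparameters. Taking a single hidden layer as an example and reusing the notation defined earlier in this section, the multilayer perceptron computes its output as follows:

$\begin{aligned} \boldsymbol{H} &= \phi(\boldsymbol{X} \boldsymbol{W}_h + \boldsymbol{b}_h),\\ \boldsymbol{O} &= \boldsymbol{H} \boldsymbol{W}_o + \boldsymbol{b}_o, \end{aligned}$

where $\phi$ denotes the activation function.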

Implementing a multilayer perceptron from scratch

import torch
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
print(torch.__version__)

1.3.0

Loading the training set

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size,root='/home/kesci/input/FashionMNIST2065')

Defining the model parameters

num_inputs, num_outputs, num_hiddens = 784, 10, 256

W1 = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_hiddens)), dtype=torch.float)
b1 = torch.zeros(num_hiddens, dtype=torch.float)
W2 = torch.tensor(np.random.normal(0, 0.01, (num_hiddens, num_outputs)), dtype=torch.float)
b2 = torch.zeros(num_outputs, dtype=torch.float)

params = [W1, b1, W2, b2]
for param in params:
    param.requires_grad_(requires_grad=True)

Defining the activation function

def relu(X):
    return torch.max(input=X, other=torch.tensor(0.0))

Defining the network

def net(X):
    X = X.view((-1, num_inputs))
    H = relu(torch.matmul(X, W1) + b1)
    return torch.matmul(H, W2) + b2

Defining the loss function

loss = torch.nn.CrossEntropyLoss()

Training

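# lr looks large because the training helper appears to divide the update by
# batch_size (see d2l.sgd(params, lr, batch_size) in the commented-out loop
# below), while CrossEntropyLoss already averages over the batch.
num_epochs, lr = 5, 100.0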
# def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
#               params=None, lr=None, optimizer=None):
#     for epoch in range(num_epochs):
#         train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
#         for X, y in train_iter:
#             y_hat = net(X)
#             l = loss(y_hat, y).sum()
#             
#             # zero the gradients
#             if optimizer is not None:
#                 optimizer.zero_grad()
#             elif params is not None and params[0].grad is not None:
#                 for param in params:
#                     param.grad.data.zero_()
#            
#             l.backward()
#             if optimizer is None:
#                 d2l.sgd(params, lr, batch_size)
#             else:
#                 optimizer.step()  # used in the section "Concise implementation of softmax regression"
#             
#             
#             train_l_sum += l.item()
#             train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
#             n += y.shape[0]
#         test_acc = evaluate_accuracy(test_iter, net)
#         print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
#               % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))

d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)

epoch 1, loss 0.0030, train acc 0.712, test acc 0.806
epoch 2, loss 0.0019, train acc 0.821, test acc 0.806
epoch 3, loss 0.0017, train acc 0.847, test acc 0.825
epoch 4, loss 0.0015, train acc 0.856, test acc 0.834
epoch 5, loss 0.0015, train acc 0.863, test acc 0.847
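
The learning rate of 100.0 used above may look surprising. A plausible reading, based on the call `d2l.sgd(params, lr, batch_size)` in the commented-out training loop, is that the update divides the gradient by the batch size even though `CrossEntropyLoss` already returns the batch mean by default. The sketch below works through the arithmetic under that assumption; `sgd_step` is a hypothetical stand-in written for this example, not the actual `d2l.sgd` source.

```python
def sgd_step(params, grads, lr, batch_size):
    """Assumed update rule: p <- p - lr * g / batch_size.

    Hypothetical stand-in for d2l.sgd, operating on plain floats.
    """
    return [p - lr * g / batch_size for p, g in zip(params, grads)]

lr, batch_size = 100.0, 256

# If the gradient already comes from a mean loss, dividing by batch_size
# again means the step actually applied to it is lr / batch_size.
effective_lr = lr / batch_size
print(effective_lr)

# One update on a single toy parameter with a toy gradient.
new_params = sgd_step([1.0], [2.0], lr, batch_size)
print(new_params)
```

Under this assumption the effective step size is roughly 0.39, which is in the same range as the `lr=0.5` passed to `torch.optim.SGD` in the PyTorch implementation later in this section.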

Multilayer perceptron: PyTorch implementation

import torch
from torch import nn
from torch.nn import init
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

print(torch.__version__)

1.3.0

Initializing the model and its parameters

num_inputs, num_outputs, num_hiddens = 784, 10, 256
    
net = nn.Sequential(
        d2l.FlattenLayer(),
        nn.Linear(num_inputs, num_hiddens),
        nn.ReLU(),
        nn.Linear(num_hiddens, num_outputs), 
        )
    
for params in net.parameters():
    init.normal_(params, mean=0, std=0.01)
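
It can help to know how many values the initialization loop above touches. Note that the loop applies `init.normal_` to every parameter, including the biases, whereas the from-scratch version initialized the biases to zero. The following sketch counts the parameters by hand from the layer sizes; it is plain arithmetic with no PyTorch dependency, and the variable names `hidden_params`, `output_params` and `total_params` are introduced here for illustration.

```python
num_inputs, num_outputs, num_hiddens = 784, 10, 256

# nn.Linear(in, out) holds an (out x in) weight matrix and an out-sized bias.
hidden_params = num_inputs * num_hiddens + num_hiddens    # first Linear layer
output_params = num_hiddens * num_outputs + num_outputs   # second Linear layer
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```

With the model defined as above, `sum(p.numel() for p in net.parameters())` should report the same total, since the flatten layer and `nn.ReLU` are expected to have no parameters.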

Training

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size,root='/home/kesci/input/FashionMNIST2065')
loss = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(net.parameters(), lr=0.5)

num_epochs = 5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)

epoch 1, loss 0.0031, train acc 0.701, test acc 0.774
epoch 2, loss 0.0019, train acc 0.821, test acc 0.806
epoch 3, loss 0.0017, train acc 0.841, test acc 0.805
epoch 4, loss 0.0015, train acc 0.855, test acc 0.834
epoch 5, loss 0.0014, train acc 0.866, test acc 0.840
