Simple Examples to Introduce PyTorch
Translated from the GitHub repository jcjohnson/pytorch-examples.
I have added my own understanding and annotations, for example a diagram that lays out the backpropagation step more clearly.
The English has been translated into Chinese to make it easier to read.
I especially like this repository: it moves from numpy to Tensors, from hand-written backpropagation to PyTorch's automatic differentiation, and from implementing the model, loss function, and weight updates by hand to defining custom models and calling the built-in loss functions and optimizers.
After working through the whole repository you should have a fairly thorough grasp of how PyTorch works.
Simple examples to introduce PyTorch
This repository introduces fundamental PyTorch concepts through self-contained examples.
At its core, PyTorch provides two main features:
- An n-dimensional Tensor, similar to numpy but able to run on GPUs.
- Automatic differentiation for building and training neural networks.
We will use a fully-connected ReLU network as our running example. The network will have a single hidden layer, and will be trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.
Note: these examples have been updated for PyTorch 0.4, which made several major changes to the core PyTorch API. Most notably, prior to 0.4 Tensors had to be wrapped in Variable objects in order to use autograd; this functionality has now been added directly to Tensors, and Variables are deprecated.
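As a small illustrative sketch (not part of the original examples), the API change looks roughly like this:
import torch
from torch.autograd import Variable  # only needed before PyTorch 0.4

# Before 0.4: wrap a Tensor in a Variable to track gradients
x_old = Variable(torch.randn(3), requires_grad=True)

# Since 0.4: requires_grad lives directly on the Tensor
x_new = torch.randn(3, requires_grad=True)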
Contents
- Warm-up: numpy
- PyTorch: Tensors
- PyTorch: Autograd
- PyTorch: Defining new autograd functions
- TensorFlow: Static Graphs
- PyTorch: nn
- PyTorch: optim
- PyTorch: Custom nn Modules
- PyTorch: Control Flow and Weight Sharing
1. Warm-up: numpy
Before introducing PyTorch, we will first implement the network using numpy.
Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:
import numpy as np
"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x using Euclidean error.
一個(gè)全連接網(wǎng)絡(luò)模型朴上,激活函數(shù)是ReLU垒棋,具有一個(gè)隱藏層且沒(méi)有偏差,經(jīng)過(guò)訓(xùn)練可以使用歐幾里得誤差根據(jù)x來(lái)預(yù)測(cè)y痪宰。
This implementation uses numpy to manually compute the forward pass, loss, and
backward pass.
該程序?qū)崿F(xiàn)了使用numpy手動(dòng)計(jì)算前向傳播叼架,損失和后向傳播。
A numpy array is a generic n-dimensional array; it does not know anything about
deep learning or gradients or computational graphs, and is just a way to perform
generic numeric computations.
numpy數(shù)組是通用的n維數(shù)組衣撬;它對(duì)深度學(xué)習(xí)乖订,梯度或計(jì)算圖一無(wú)所知,只是執(zhí)行通用數(shù)值計(jì)算的一種方法具练。
"""
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)   # input (64, 1000)
y = np.random.randn(N, D_out)  # output (64, 10)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)   # input-to-hidden weights (1000, 100)
w2 = np.random.randn(H, D_out)  # hidden-to-output weights (100, 10)

learning_rate = 1e-6  # learning rate
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)              # matrix product -> hidden layer (64, 100)
    h_relu = np.maximum(h, 0)  # ReLU activation
    # np.maximum(X, Y) takes the element-wise maximum of X and Y;
    # np.max(a, axis=None) takes the maximum of a single array along an axis.
    y_pred = h_relu.dot(w2)    # matrix product -> output layer (64, 10)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()  # .sum() adds up all elements
    print(t, loss)  # the goal is to drive the loss down over time

    # Backprop to compute gradients of w1 and w2 with respect to loss
    # (the tricky part; see the derivation below)
    grad_y_pred = 2.0 * (y_pred - y)      # (64, 10)
    grad_w2 = h_relu.T.dot(grad_y_pred)   # (64, 100)^T dot (64, 10) = (100, 10)
    grad_h_relu = grad_y_pred.dot(w2.T)   # (64, 100)
    grad_h = grad_h_relu.copy()           # copy (64, 100)
    grad_h[h < 0] = 0                     # backprop through the ReLU
    grad_w1 = x.T.dot(grad_h)             # (1000, 100)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
This is where the original post includes a derivation diagram for the backpropagation step above (image not reproduced here).
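Since the diagram is not included here, a brief sketch of the chain rule it illustrates, written with the same names and shapes as the code above:

$L = \sum (y_{\text{pred}} - y)^2$, with $y_{\text{pred}} = h_{\text{relu}}\, w_2$, $h_{\text{relu}} = \max(h, 0)$, $h = x\, w_1$.

$\partial L / \partial y_{\text{pred}} = 2\,(y_{\text{pred}} - y)$, shape (64, 10)
$\partial L / \partial w_2 = h_{\text{relu}}^{T}\; \partial L / \partial y_{\text{pred}}$, shape (100, 10)
$\partial L / \partial h_{\text{relu}} = \partial L / \partial y_{\text{pred}}\; w_2^{T}$, shape (64, 100)
$\partial L / \partial h = \partial L / \partial h_{\text{relu}} \odot \mathbf{1}[h > 0]$, shape (64, 100)
$\partial L / \partial w_1 = x^{T}\; \partial L / \partial h$, shape (1000, 100)

Each line corresponds to one grad_* assignment in the loop above.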
2. PyTorch: Tensors
Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy alone is not enough for modern deep learning.
Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Any computation you might want to perform with numpy can also be done with PyTorch Tensors; you should think of them as a generic tool for scientific computing.
However, unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, use the device argument when constructing the Tensor to place it on the GPU.
Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we manually implement the forward and backward passes through the network using operations on PyTorch Tensors:
import torch
"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x by minimizing squared Euclidean distance.
一個(gè)全連接網(wǎng)絡(luò)模型攀芯,激活函數(shù)是ReLU屯断,具有一個(gè)隱藏層且沒(méi)有偏差,經(jīng)過(guò)訓(xùn)練可以使用歐幾里得誤差根據(jù)x來(lái)預(yù)測(cè)y。
This implementation uses PyTorch tensors to manually compute the forward pass,
loss, and backward pass.
該程序?qū)崿F(xiàn)使用PyTorch張量手動(dòng)計(jì)算前向傳播殖演,損失和后向傳播氧秘。
A PyTorch Tensor is basically the same as a numpy array: it does not know
anything about deep learning or computational graphs or gradients, and is just
a generic n-dimensional array to be used for arbitrary numeric computation.
PyTorch張量基本上與numpy數(shù)組相同:它對(duì)深度學(xué)習(xí),計(jì)算圖或梯度一無(wú)所知趴久,只是用于任意數(shù)值計(jì)算的通用n維數(shù)組丸相。
The biggest difference between a numpy array and a PyTorch Tensor is that
a PyTorch Tensor can run on either CPU or GPU. To run operations on the GPU,
just pass a different value to the `device` argument when constructing the
Tensor.
numpy數(shù)組和PyTorch張量之間的最大區(qū)別是PyTorch張量可以在CPU或GPU上運(yùn)行。要在GPU上運(yùn)行操作彼棍,只需在構(gòu)造Tensor時(shí)將不同的值傳遞給device參數(shù)即可灭忠。
"""
device = torch.device('cpu')  # run on CPU
# device = torch.device('cuda')  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)   # input (64, 1000)
y = torch.randn(N, D_out, device=device)  # output (64, 10)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)   # input-to-hidden weights (1000, 100)
w2 = torch.randn(H, D_out, device=device)  # hidden-to-output weights (100, 10)

learning_rate = 1e-6  # learning rate
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)  # matrix product -> hidden layer (64, 100)
    # torch.mm() is matrix multiplication; torch.mul() is element-wise multiplication
    h_relu = h.clamp(min=0)  # ReLU activation
    # torch.clamp(input, min, max) clamps every element of input into the
    # range [min, max] and returns a new tensor.
    y_pred = h_relu.mm(w2)  # matrix product -> output layer (64, 10)

    # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
    # of shape (); we can get its value as a Python number with loss.item().
    loss = (y_pred - y).pow(2).sum()  # sum over all elements; shape is torch.Size([])
    print(t, loss.item())  # .item() converts a zero-dimensional tensor to a Python number

    # Backprop to compute gradients of w1 and w2 with respect to loss
    # (same procedure as the numpy version above)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    # .clone() is the torch analogue of numpy's .copy(): it returns a copy of the tensor
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
3. PyTorch: Autograd
In the above examples we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it can quickly get very hairy for large, complex networks.
Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network defines a computational graph; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.
This sounds complicated, but it is pretty simple to use in practice. If we want to compute gradients with respect to some Tensor, then we set requires_grad=True when constructing that Tensor. Any PyTorch operations on that Tensor will then cause a computational graph to be built, allowing us to later perform backpropagation through the graph. If x is a Tensor with requires_grad=True, then after backpropagation x.grad will be another Tensor holding the gradient of x with respect to some scalar value.
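As a tiny standalone check (not from the original repository), this is what requires_grad and .grad look like on a toy tensor:
import torch

x = torch.ones(2, requires_grad=True)  # track operations on x
y = (x ** 2).sum()                     # y = x[0]**2 + x[1]**2, a scalar
y.backward()                           # backpropagate through the recorded graph
print(x.grad)                          # tensor([2., 2.]), i.e. dy/dx = 2*x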
Sometimes you may wish to prevent PyTorch from building computational graphs when performing certain operations on Tensors with requires_grad=True; for example, when training a neural network we usually don't want to backpropagate through the weight update step. In such scenarios we can use the torch.no_grad() context manager to prevent a computational graph from being built.
Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:
import torch
"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x by minimizing squared Euclidean distance.
一個(gè)全連接網(wǎng)絡(luò)模型,激活函數(shù)是ReLU糜工,具有一個(gè)隱藏層且沒(méi)有偏差弊添,經(jīng)過(guò)訓(xùn)練可以使用歐幾里得誤差根據(jù)x來(lái)預(yù)測(cè)y。
This implementation computes the forward pass using operations on PyTorch
Tensors, and uses PyTorch autograd to compute gradients.
該程序?qū)崿F(xiàn)使用PyTorch張量上的運(yùn)算來(lái)計(jì)算前向傳播捌木,并使用PyTorch autograd來(lái)計(jì)算梯度油坝。
When we create a PyTorch Tensor with requires_grad=True, then operations
involving that Tensor will not just compute values; they will also build up
a computational graph in the background, allowing us to easily backpropagate
through the graph to compute gradients of some downstream (scalar) loss with
respect to a Tensor. Concretely if x is a Tensor with x.requires_grad == True
then after backpropagation x.grad will be another Tensor holding the gradient
of x with respect to some scalar value.
當(dāng)我們使用require_grad = True創(chuàng)建一個(gè)PyTorch Tensor時(shí),涉及該Tensor的操作將不僅僅計(jì)算值刨裆;
他們還將在后臺(tái)建立一個(gè)計(jì)算圖澈圈,使我們能夠輕松地在該圖中反向傳播,以計(jì)算相對(duì)于張量的某些下游(標(biāo)量)
損耗的梯度帆啃。具體來(lái)說(shuō)瞬女,如果x是具有x.requires_grad == True的張量,那么在反向傳播之后x.grad將是
另一個(gè)Tensor努潘,它保持x相對(duì)于某個(gè)標(biāo)量值的梯度诽偷。
"""
device = torch.device('cpu')  # run on CPU
# device = torch.device('cuda')  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device)   # input (64, 1000)
y = torch.randn(N, D_out, device=device)  # output (64, 10)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)   # input-to-hidden weights (1000, 100)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)  # hidden-to-output weights (100, 10)

learning_rate = 1e-6  # learning rate
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
    # w2 have requires_grad=True, operations involving these Tensors will cause
    # PyTorch to build a computational graph, allowing automatic computation of
    # gradients. Since we are no longer implementing the backward pass by hand we
    # don't need to keep references to intermediate values.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)  # (64, 10)

    # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
    # is a Python number giving its value.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()
    # In other words, loss.backward() does not return anything; it populates the
    # .grad attribute (here w1.grad and w2.grad) of every Tensor that requires gradients.

    # Update weights using gradient descent. For this step we just want to mutate
    # the values of w1 and w2 in-place; we don't want to build up a computational
    # graph for the update steps, so we use the torch.no_grad() context manager
    # to prevent PyTorch from building a computational graph for the updates.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after running the backward pass
        w1.grad.zero_()
        w2.grad.zero_()
4. PyTorch: Defining new autograd functions
Under the hood, each primitive autograd operator is really two functions that operate on Tensors.
The forward function computes output Tensors from input Tensors.
The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.
In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions. We can then use our new autograd operator by constructing an instance and calling it like a function, passing Tensors containing input data.
In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:
import torch
"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x by minimizing squared Euclidean distance.
一個(gè)全連接網(wǎng)絡(luò)模型栋齿,激活函數(shù)是ReLU苗胀,具有一個(gè)隱藏層且沒(méi)有偏差,經(jīng)過(guò)訓(xùn)練可以使用歐幾里得誤差根據(jù)x來(lái)預(yù)測(cè)y褒颈。
This implementation computes the forward pass using operations on PyTorch
Tensors, and uses PyTorch autograd to compute gradients.
該代碼實(shí)現(xiàn)使用PyTorch張量上的運(yùn)算來(lái)計(jì)算前向傳播柒巫,并使用PyTorch autograd來(lái)計(jì)算梯度。
In this implementation we implement our own custom autograd function to perform
the ReLU function.
在該代碼中谷丸,我們實(shí)現(xiàn)了自己的自定義autograd函數(shù)來(lái)執(zhí)行ReLU函數(shù)堡掏。
"""
# Subclass torch.autograd.Function to define a custom autograd operator
class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """
    @staticmethod
    def forward(ctx, x):  # x is a Tensor, ctx is a context object
        """
        In the forward pass we receive a context object and a Tensor containing the
        input; we must return a Tensor containing the output, and we can use the
        context object to cache objects for use in the backward pass.
        """
        ctx.save_for_backward(x)  # save the input for use in backward
        return x.clamp(min=0)     # return the ReLU of the input (a Tensor)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive the context object and a Tensor containing
        the gradient of the loss with respect to the output produced during the
        forward pass. We can retrieve cached data from the context object, and must
        compute and return the gradient of the loss with respect to the input to the
        forward function.
        """
        x, = ctx.saved_tensors        # recover the Tensor saved in forward
        grad_x = grad_output.clone()  # copy of the upstream gradient
        grad_x[x < 0] = 0             # derivative of ReLU: zero wherever the input was negative
        # Compare with the manual version above: grad_output plays the role of
        # grad_h_relu, x plays the role of h, and grad_x plays the role of grad_h:
        #   grad_h = grad_h_relu.clone()
        #   grad_h[h < 0] = 0
        return grad_x
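# Optional sanity check (not in the original repository): torch.autograd.gradcheck
# compares the analytic backward defined above against numerical gradients. It
# expects double-precision inputs with requires_grad=True, for example:
#   from torch.autograd import gradcheck
#   test_input = torch.randn(8, 8, dtype=torch.double, requires_grad=True)
#   print(gradcheck(MyReLU.apply, (test_input,)))  # True if the gradients match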
device = torch.device('cpu')  # run on CPU
# device = torch.device('cuda')  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and output
x = torch.randn(N, D_in, device=device)   # input (64, 1000)
y = torch.randn(N, D_out, device=device)  # output (64, 10)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)   # input-to-hidden weights (1000, 100)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)  # hidden-to-output weights (100, 10)

learning_rate = 1e-6  # learning rate
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; we call our
    # custom ReLU implementation using the MyReLU.apply function
    y_pred = MyReLU.apply(x.mm(w1)).mm(w2)  # (64, 10)
    # previously: y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    with torch.no_grad():
        # Update weights using gradient descent
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after running the backward pass
        w1.grad.zero_()
        w2.grad.zero_()
5. TensorFlow: Static Graphs
PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are static, while PyTorch uses dynamic computational graphs.
In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.
Static graphs are nice because you can **optimize the graph up front**; for example, a framework might decide to fuse some graph operations for efficiency, or come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, this potentially costly up-front optimization can be amortized as the same graph is rerun again and again.
One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example, a recurrent network might be unrolled for a different number of timesteps for each data point, and this unrolling can be implemented as a loop. With a static graph, the loop construct needs to be part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on the fly for each example, we can use normal imperative flow control to perform computation that differs for each input, as the short snippet below illustrates.
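As a minimal illustration of the dynamic-graph side (a toy snippet, not from the original repository), ordinary Python control flow can depend on tensor values, and autograd records whichever branch actually ran:
import torch

x = torch.randn(3, requires_grad=True)
if x.sum() > 0:              # a plain Python if on a tensor value
    y = (2 * x).sum()
else:
    y = (3 * x).sum()
y.backward()                 # gradients flow through the branch that executed
print(x.grad)                # tensor([2., 2., 2.]) or tensor([3., 3., 3.])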
To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer net:
import tensorflow as tf
import numpy as np
"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x by minimizing squared Euclidean distance.
一個(gè)全連接網(wǎng)絡(luò)模型,激活函數(shù)是ReLU碗降,具有一個(gè)隱藏層且沒(méi)有偏差隘竭,經(jīng)過(guò)訓(xùn)練可以使用歐幾里得誤差根據(jù)x來(lái)預(yù)測(cè)y。
This implementation uses basic TensorFlow operations to set up a computational
graph, then executes the graph many times to actually train the network.
此實(shí)現(xiàn)使用基本的TensorFlow操作來(lái)設(shè)置計(jì)算圖遗锣,然后多次執(zhí)行該圖以實(shí)際訓(xùn)練網(wǎng)絡(luò)货裹。
One of the main differences between TensorFlow and PyTorch is that TensorFlow
uses static computational graphs while PyTorch uses dynamic computational
graphs.
TensorFlow和PyTorch之間的主要區(qū)別之一是TensorFlow使用靜態(tài)計(jì)算圖,而PyTorch使用動(dòng)態(tài)計(jì)算圖精偿。
In TensorFlow we first set up the computational graph, then execute the same
graph many times.
在TensorFlow中弧圆,我們首先設(shè)置計(jì)算圖赋兵,然后多次執(zhí)行同一圖。
"""
# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))   # input (64, 1000)
y = tf.placeholder(tf.float32, shape=(None, D_out))  # output (64, 10)
# tf.placeholder() creates a placeholder for the real input data and target labels;
# it only reserves the necessary memory, and the data is fed in through feed_dict
# when the graph is run inside a Session.
# Compare with the PyTorch version:
#   x = torch.randn(N, D_in, device=device)
#   y = torch.randn(N, D_out, device=device)
# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))   # input-to-hidden weights (1000, 100)
w2 = tf.Variable(tf.random_normal((H, D_out)))  # hidden-to-output weights (100, 10)
# tf.Variable() is used for trainable values such as weights and biases and must be
# given an initial value; tf.constant() creates a constant.
# Compare with the PyTorch version:
#   w1 = torch.randn(D_in, H, device=device, requires_grad=True)
#   w2 = torch.randn(H, D_out, device=device, requires_grad=True)
# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)                 # matrix product -> hidden layer (64, 100)
h_relu = tf.maximum(h, tf.zeros(1))  # ReLU activation
y_pred = tf.matmul(h_relu, w2)       # matrix product -> output layer (64, 10)
# Compare with the PyTorch version:
#   h = x.mm(w1)
#   h_relu = h.clamp(min=0)
#   y_pred = h_relu.mm(w2)
# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)
# Compare with the PyTorch version:
#   loss = (y_pred - y).pow(2).sum()

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
# Compare with the PyTorch version:
#   loss.backward()
# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6  # learning rate
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
# Compare with the PyTorch version:
#   with torch.no_grad():
#       w1 -= learning_rate * w1.grad
#       w2 -= learning_rate * w2.grad
#       w1.grad.zero_()
#       w2.grad.zero_()
# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)
6. PyTorch: nn
計(jì)算圖和autograd是定義復(fù)雜運(yùn)算符并自動(dòng)采用導(dǎo)數(shù)的非常強(qiáng)大的范例栗竖。但是對(duì)于大型神經(jīng)網(wǎng)絡(luò)暑脆,原始的autograd可能會(huì)有點(diǎn)太低級(jí)了。
在構(gòu)建神經(jīng)網(wǎng)絡(luò)時(shí)狐肢,我們經(jīng)程砺穑考慮將計(jì)算分為幾層,其中一些具有可學(xué)習(xí)的參數(shù)份名,這些參數(shù)將在學(xué)習(xí)過(guò)程中進(jìn)行優(yōu)化碟联。
在TensorFlow中妓美,像Keras,TensorFlow-Slim和TFLearn這樣的軟件包在原始計(jì)算圖上提供了更高級(jí)別的抽象鲤孵,這些抽象對(duì)構(gòu)建神經(jīng)網(wǎng)絡(luò)很有用壶栋。
在PyTorch中,nn
包可達(dá)到相同的目的普监。nn
包定義了一組模塊贵试,這些模塊大致等效于神經(jīng)網(wǎng)絡(luò)層。模塊接收輸入張量并計(jì)算輸出張量凯正,但也可以保持內(nèi)部狀態(tài)毙玻,例如包含可學(xué)習(xí)參數(shù)的張量。nn
包還定義了一組有用的損失函數(shù)廊散,這些函數(shù)通常在訓(xùn)練神經(jīng)網(wǎng)絡(luò)時(shí)使用淆珊。
在此示例中,我們使用nn
包來(lái)實(shí)現(xiàn)我們的兩層網(wǎng)絡(luò):
import torch
"""
A fully-connected ReLU network with one hidden layer, trained to predict y from x
by minimizing squared Euclidean distance.
一個(gè)全連接網(wǎng)絡(luò)模型奸汇,激活函數(shù)是ReLU,具有一個(gè)隱藏層且沒(méi)有偏差往声,經(jīng)過(guò)訓(xùn)練可以使用歐幾里得誤差根據(jù)x來(lái)預(yù)測(cè)y擂找。
This implementation uses the nn package from PyTorch to build the network.
PyTorch autograd makes it easy to define computational graphs and take gradients,
but raw autograd can be a bit too low-level for defining complex neural networks;
this is where the nn package can help. The nn package defines a set of Modules,
which you can think of as a neural network layer that has produces output from
input and may have some trainable weights or other state.
這個(gè)實(shí)現(xiàn)使用來(lái)自PyTorch的nn包來(lái)構(gòu)建網(wǎng)絡(luò)。PyTorch autograd使得定義計(jì)算圖和獲取梯度變得容易浩销,
但是原始的autograd對(duì)于定義復(fù)雜的神經(jīng)網(wǎng)絡(luò)來(lái)說(shuō)可能太低級(jí)了贯涎。這是nn軟件包可以提供幫助的地方。
nn包定義了一組模塊慢洋,您可以將其視為神經(jīng)網(wǎng)絡(luò)層塘雳,該神經(jīng)網(wǎng)絡(luò)層從輸入產(chǎn)生輸出并且可能具有一些
可訓(xùn)練的權(quán)重或其他狀態(tài)。
"""
device = torch.device('cpu')  # run on CPU
# device = torch.device('cuda')  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)   # input (64, 1000)
y = torch.randn(N, D_out, device=device)  # output (64, 10)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),   # hidden layer (64, 100)
    torch.nn.ReLU(),            # ReLU activation (64, 100)
    torch.nn.Linear(H, D_out),  # output layer (64, 10)
).to(device)
# This single model replaces the three hand-written lines:
#   h = x.mm(w1)
#   h_relu = h.clamp(min=0)
#   y_pred = h_relu.mm(w2)
# The model is built once outside the for loop; inside the loop we simply call it.
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' means that we are computing the *sum* of squared errors rather
# than the mean; this is for consistency with the examples above where we
# manually compute the loss, but in practice it is more common to use the mean
# squared error as a loss by setting reduction='mean' (called
# 'elementwise_mean' in older PyTorch versions).
loss_fn = torch.nn.MSELoss(reduction='sum')
# This replaces loss = (y_pred - y).pow(2).sum(); the loss function is defined
# once outside the for loop and simply called inside it.
learning_rate = 1e-4  # learning rate
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass; this replaces the two
    # lines w1.grad.zero_() and w2.grad.zero_() from the earlier examples.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    # Note that the gradients are zeroed *before* this call; otherwise they would
    # accumulate across iterations.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its data and gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():  # iterate over the model's learnable parameters
            param.data -= learning_rate * param.grad
7. PyTorch: optim
Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.
The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
In this example we will use the nn package to define our model as before, but we will optimize the model with the Adam algorithm provided by the optim package:
import torch
"""
A fully-connected ReLU network with one hidden layer, trained to predict y from x
by minimizing squared Euclidean distance.
一個(gè)全連接網(wǎng)絡(luò)模型,激活函數(shù)是ReLU琅攘,具有一個(gè)隱藏層且沒(méi)有偏差垮庐,經(jīng)過(guò)訓(xùn)練可以使用歐幾里得誤差根據(jù)x來(lái)預(yù)測(cè)y。
This implementation uses the nn package from PyTorch to build the network.
該實(shí)現(xiàn)使用來(lái)自PyTorch的nn軟件包來(lái)構(gòu)建網(wǎng)絡(luò)坞琴。
Rather than manually updating the weights of the model as we have been doing,
we use the optim package to define an Optimizer that will update the weights
for us. The optim package defines many optimization algorithms that are commonly
used for deep learning, including SGD+momentum, RMSProp, Adam, etc.
與其像我們一直在手動(dòng)更新模型的權(quán)重哨查,不如使用optim包定義一個(gè)優(yōu)化器,該優(yōu)化器將為我們更新權(quán)重剧辐。
optim軟件包定義了許多深度學(xué)習(xí)常用的優(yōu)化算法寒亥,包括SGD + momentum邮府,RMSProp,Adam等溉奕。
"""
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)   # input (64, 1000)
y = torch.randn(N, D_out)  # output (64, 10)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)  # the network model
loss_fn = torch.nn.MSELoss(reduction='sum')  # the loss function

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4  # learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# torch.optim: the first argument is the set of parameters to update, lr is the learning rate
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the Tensors it will update (which are the learnable weights
    # of the model)
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()
Summary
- Forward pass through the model: y_pred = model(x)
- Compute the loss: loss = loss_fn(y_pred, y)
- Zero the gradients: optimizer.zero_grad()
- Backward pass: loss.backward()
- Let the optimizer update the parameters: optimizer.step()
8. PyTorch: Custom nn Modules
Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.
In this example we implement our two-layer network as a custom Module subclass:
import torch
"""
A fully-connected ReLU network with one hidden layer, trained to predict y from x
by minimizing squared Euclidean distance.
一個(gè)全連接網(wǎng)絡(luò)模型挎峦,激活函數(shù)是ReLU沾乘,具有一個(gè)隱藏層且沒(méi)有偏差,經(jīng)過(guò)訓(xùn)練可以使用歐幾里得誤差根據(jù)x來(lái)預(yù)測(cè)y浑测。
This implementation defines the model as a custom Module subclass. Whenever you
want a model more complex than a simple sequence of existing Modules you will
need to define your model this way.
此實(shí)現(xiàn)將模型定義為自定義Module子類(lèi)。每當(dāng)您想要一個(gè)比現(xiàn)有模塊的簡(jiǎn)單序列更復(fù)雜的模型時(shí)歪玲,
都需要以這種方式定義模型迁央。
"""
# Define a custom class that subclasses torch.nn.Module
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):  # constructor; self is implicit, then the layer sizes
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()  # standard boilerplate
        self.linear1 = torch.nn.Linear(D_in, H)   # linear layer 1
        self.linear2 = torch.nn.Linear(H, D_out)  # linear layer 2

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary (differentiable) operations on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)   # input (64, 1000)
y = torch.randn(N, D_out)  # output (64, 10)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)  # instantiation: the arguments go to __init__

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')               # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # optimizer

for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)  # calling the model: the input goes to forward

    # Compute and print loss
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()  # zero the gradients
    loss.backward()        # backward pass
    optimizer.step()       # update the parameters
Subclassing nn.Module like this is the most common way to define a model in practice.
Be clear about where the arguments go: instantiating the model calls __init__, while calling the model on an input calls forward.
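A quick sketch of that distinction, using the TwoLayerNet class defined above:
net = TwoLayerNet(1000, 100, 10)   # instantiation runs __init__ and builds linear1/linear2
out = net(torch.randn(64, 1000))   # calling the instance runs nn.Module.__call__, which calls forward
print(out.shape)                   # torch.Size([64, 10])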
9. PyTorch: Control Flow and Weight Sharing
As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.
For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.
We can easily implement this model as a Module subclass:
import random
import torch
"""
To showcase the power of PyTorch dynamic graphs, we will implement a very strange
model: a fully-connected ReLU network that on each forward pass randomly chooses
a number between 1 and 4 and has that many hidden layers, reusing the same
weights multiple times to compute the innermost hidden layers.
為了展示PyTorch動(dòng)態(tài)圖的強(qiáng)大功能,我們將實(shí)現(xiàn)一個(gè)非常奇怪的模型:一個(gè)完全連接的ReLU網(wǎng)絡(luò)汇四,
該網(wǎng)絡(luò)在每個(gè)前向傳遞上隨機(jī)選擇一個(gè)1到4之間的數(shù)字接奈,并且具有許多隱藏層,多次重復(fù)使用相同
的權(quán)重計(jì)算最里面的隱藏層通孽。
"""
# Define a custom network
class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):  # constructor
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()  # standard boilerplate
        self.input_linear = torch.nn.Linear(D_in, H)    # input layer
        self.middle_linear = torch.nn.Linear(H, H)      # middle layer
        self.output_linear = torch.nn.Linear(H, D_out)  # output layer

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)            # input layer
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)  # reuse the middle layer a random number of times
        y_pred = self.output_linear(h_relu)                   # output layer
        return y_pred
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)   # input (64, 1000)
y = torch.randn(N, D_out)  # output (64, 10)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)  # instantiate the model

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')  # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)  # optimizer

for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)  # call the model

    # Compute and print loss
    loss = criterion(y_pred, y)  # compute the loss
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()  # zero the gradients
    loss.backward()        # backward pass
    optimizer.step()       # update the weights