This post is my study notes on DEEP LEARNING WITH PYTORCH: A 60 MINUTE BLITZ.
Here we continue with the second part, A GENTLE INTRODUCTION TO TORCH.AUTOGRAD:
torch.autograd is PyTorch’s automatic differentiation engine that powers neural network training. In this section, you will get a conceptual understanding of how autograd helps a neural network train. (automatic differentiation for training neural networks)
1. Background
Neural networks (NNs) are a collection of nested functions that are executed on some input data. These functions are defined by parameters (consisting of weights and biases), which in PyTorch are stored in tensors.
Training a NN happens in two steps:
- Forward Propagation: In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess.
- Backward Propagation: In backprop, the NN adjusts its parameters proportionate to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (gradients), and optimizing the parameters using gradient descent.
Training a neural network takes two steps: forward propagation and backward propagation. In backpropagation, the error (loss) is differentiated with respect to the weights and biases, which are then updated; this forward/backward cycle is repeated iteration after iteration.
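To make the two steps concrete, here is a minimal toy sketch of one forward/backward/update cycle on a single weight; the model y = w * x, the squared-error loss, and the learning rate 0.1 are made up for illustration and are not part of the tutorial.
import torch

w = torch.tensor(1.0, requires_grad=True)   # a single trainable weight
x = torch.tensor(2.0)                       # input
target = torch.tensor(6.0)                  # desired output

y = w * x                                   # forward propagation: the model's best guess
loss = (y - target) ** 2                    # error (loss) of the guess
loss.backward()                             # backward propagation: d(loss)/dw is stored in w.grad

with torch.no_grad():
    w -= 0.1 * w.grad                       # gradient descent: adjust the parameter by its gradient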
2. Usage in PyTorch
Let’s take a look at a single training step. For this example, we load a pretrained resnet18 model from torchvision. We create a random data tensor to represent a single image with 3 channels, and height & width of 64, and its corresponding label initialized to some random values. Label in pretrained models has shape (1,1000).
We load the pretrained resnet18 model from torchvision. The randomly generated data represents a single 3-channel, 64×64 image, and the label is defined as 1000 random values.
import torch
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)
Next, we run the input data through the model through each of its layers to make a prediction. This is the forward pass. We use the model’s prediction and the corresponding label to calculate the error (loss). The next step is to backpropagate this error through the network. Backward propagation is kicked off when we call .backward() on the error tensor. Autograd then calculates and stores the gradients for each model parameter in the parameter’s .grad attribute.
prediction = model(data) # forward pass
loss = (prediction - labels).sum()
loss.backward() # backward pass
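After this call, every parameter’s .grad attribute holds a gradient of the same shape as the parameter itself. As a quick illustrative check (conv1 is the name torchvision’s resnet18 gives its first convolution layer):
print(model.conv1.weight.grad.shape)   # torch.Size([64, 3, 7, 7]), same shape as the weight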
Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and momentum of 0.9. Finally, we call .step() to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in .grad.
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
optim.step() #gradient descent
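In an actual training loop these pieces repeat, and because gradients accumulate in .grad across backward calls, they are normally cleared at the start of each iteration. A minimal sketch of one full iteration using the objects defined above:
optim.zero_grad()                      # clear gradients left over from the previous iteration
prediction = model(data)               # forward pass
loss = (prediction - labels).sum()     # compute the error
loss.backward()                        # backward pass: populate .grad for every parameter
optim.step()                           # gradient descent: update the parameters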
3. Differentiation in Autograd
We create two tensors a and b with requires_grad=True. This signals to autograd that every operation on them should be tracked. We create another tensor Q from a and b.
import torch
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3*a**3 - b**2
For the .backward() method: if the tensor it is called on is a scalar, no arguments need to be passed; if it is a non-scalar, a gradient argument must be specified explicitly, where gradient is a tensor with the same shape as that tensor.
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)
# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)
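Passing a gradient of all ones is equivalent to first aggregating Q into a scalar, in which case .backward() needs no argument. A small sketch of that equivalent form (the gradients are reset and Q is rebuilt, since the previous graph was freed by the earlier backward call):
a.grad, b.grad = None, None   # reset gradients so they don't accumulate with the previous call
Q = 3*a**3 - b**2             # rebuild Q and its graph
Q.sum().backward()            # scalar output, so no gradient argument is required
print(a.grad)                 # tensor([36., 81.]) == 9*a**2
print(b.grad)                 # tensor([-12., -8.]) == -2*b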
(Skipped for now) 4. Optional Reading - Vector Calculus using autograd
5. Computational Graph
Conceptually, autograd keeps a record of data (tensors) & all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.
Below is a visual representation of the DAG in our example. In the graph, the arrows are in the direction of the forward pass. The nodes represent the backward functions of each operation in the forward pass. The leaf nodes in blue represent our leaf tensors a and b.
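The Function nodes of this DAG can also be inspected programmatically: every tensor produced by a tracked operation carries a grad_fn attribute pointing to the backward function that created it, while user-created leaf tensors have grad_fn set to None. A short sketch with the tensors above:
print(Q.grad_fn)              # <SubBackward0 ...>: the backward node for the final subtraction
print(a.grad_fn)              # None: a is a leaf tensor created by the user
print(a.is_leaf, b.is_leaf)   # True True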
DAGs are dynamic in PyTorch: when training a model, a fresh computational graph is rebuilt from scratch on every iteration.
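Because the graph is recorded anew on each forward pass, ordinary Python control flow can change its shape from one iteration to the next. A hypothetical sketch (the branching condition is made up purely for illustration):
def forward(x, w):
    # which branch runs, and hence which graph is recorded, depends on this input
    if x.sum() > 0:
        return (w * x).sum()
    return (w * x * x).sum()

w = torch.ones(3, requires_grad=True)
loss = forward(torch.randn(3), w)   # a fresh graph is built here
loss.backward()                     # ...and consumed here; the next iteration builds a new one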
6. Exclusion from the DAG
torch.autograd tracks operations on all tensors which have their requires_grad flag set to True. For tensors that don’t require gradients, setting this attribute to False excludes them from the gradient computation DAG.
The output tensor of an operation will require gradients even if only a single input tensor has requires_grad=True.
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)
a = x + y
print(f"Does `a` require gradients? : {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")
In a NN, parameters that don’t compute gradients are usually called frozen parameters. It is useful to “freeze” part of your model if you know in advance that you won’t need the gradients of those parameters (this offers some performance benefits by reducing autograd computations).
Another common use case where exclusion from the DAG is important is finetuning a pretrained network. In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels. (model finetuning)
from torch import nn, optim
model = resnet18(weights=ResNet18_Weights.DEFAULT)
# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False
Let’s say we want to finetune the model on a new dataset with 10 labels. In resnet, the classifier is the last linear layer model.fc. We can simply replace it with a new linear layer (unfrozen by default) that acts as our classifier.
model.fc = nn.Linear(512, 10)
Now all parameters in the model, except the parameters of model.fc, are frozen. The only parameters that compute gradients are the weights and bias of model.fc.
The pretrained model predicts 1000 labels; here we want only 10, so the final linear layer (input 512, output 1000) is replaced with a new one of shape (512, 10). All other parameters stay "frozen" and no longer compute gradients; this is finetuning.
# Optimize only the classifier
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
Notice although we register all the parameters in the optimizer, the only parameters that are computing gradients (and hence updated in gradient descent) are the weights and bias of the classifier.
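We can verify this by counting the parameters that still require gradients; only the weight (512*10) and bias (10) of model.fc remain trainable:
trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable))                     # 2: the weight and bias of model.fc
print(sum(p.numel() for p in trainable))  # 5130 = 512*10 + 10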
The same exclusionary functionality is available as a context manager in torch.no_grad().
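A minimal sketch of the context-manager form; operations executed inside the block are not tracked, which is the usual pattern for inference:
t = torch.ones(5, requires_grad=True)
with torch.no_grad():
    out = t * 2
print(out.requires_grad)   # False: the multiplication was not recorded in the DAG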
(Skipped for now) Further readings:
- In-place operations & Multithreaded Autograd
- Example implementation of reverse-mode autodiff
References:
- https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
- https://www.youtube.com/watch?v=tIeHLnjs5U8
- https://blog.csdn.net/PolarisRisingWar/article/details/116069338
- https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9
- https://juejin.cn/post/6844903934876729351