TensorRT is NVIDIA's accelerator for unified model deployment. It targets NVIDIA's own hardware platforms, such as the NVIDIA Tesla A100 GPU and the Jetson Xavier developer board, and its input can be trained models from any of the popular training frameworks, such as TensorFlow and PyTorch.
Official definition:
TensorRT is built on CUDA, NVIDIA’s parallel programming model, and enables you to optimize inference for all deep learning frameworks leveraging libraries, development tools and technologies in CUDA-X for artificial intelligence, autonomous machines, high-performance computing, and graphics.
TensorRT consists of two parts, an inference optimizer and a runtime, which makes it similar to Microsoft's ONNX Runtime. ONNX Runtime, however, generally only accepts models in the ONNX format, whereas TensorRT can accept models from essentially all frameworks, including ONNX, PyTorch, and TensorFlow.
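For reference, the usual interchange path is to export the trained model to ONNX, which either runtime can then consume. A minimal sketch (the file name and opset version here are arbitrary choices):
import torch
from torchvision.models.alexnet import alexnet

model = alexnet(pretrained=True).eval()
x = torch.ones((1, 3, 224, 224))  # dummy input that fixes the traced shapes
# the resulting alexnet.onnx can be fed to ONNX Runtime or to TensorRT's ONNX parser
torch.onnx.export(model, x, "alexnet.onnx", opset_version=11)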
TensorRT applies five main optimizations to a model:
1. Layer and tensor fusion
The main goal of kernel fusion is to improve GPU utilization by reducing the number of kernels: every additional operator adds another round of data reads and writes, which are relatively expensive, as well as another computation.
- Modules that can be fused, such as the conv-bn-relu triple, are therefore merged into a single module, which cuts down on data movement and repeated computation (see the fusion sketch after the quote below).
- Modules that share the same input and perform the same operation but feed different outputs, for example three parallel 1x1 convolution branches, can be computed together in parallel and the result then routed to the different downstream nodes. The concrete implementation details are left for a follow-up; quoting the docs:
When you have identical kernels which take the same input but just use different weights, you can combine the kernels by making a single kernel wider in the sense that it processes more of these operations in parallel. The output from these horizontally fused kernels will be automatically split up if they feed to different kernels further down the graph.
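To make vertical fusion concrete, here is a minimal sketch of folding a BatchNorm layer into the preceding convolution in plain PyTorch. This only illustrates the arithmetic; TensorRT performs the equivalent rewrite internally when it builds an engine.
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # BN in eval mode computes y = gamma * (x - mean) / sqrt(var + eps) + beta,
    # a per-channel affine transform, so it can be absorbed into the conv weights
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused

# sanity check: the fused conv matches conv -> bn in eval mode
conv, bn = nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)
bn(conv(torch.randn(8, 3, 32, 32)))  # one training-mode pass to populate running stats
bn.eval()
x = torch.randn(1, 3, 32, 32)
print(torch.max(torch.abs(fuse_conv_bn(conv, bn)(x) - bn(conv(x)))))  # ~1e-6 or smaller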
2. Precision Calibration
Precision calibration: inference only runs the forward pass, with no backward pass, so 32-bit floating point is not strictly needed. The forward pass can reasonably be run in FP16 or INT8 instead, which makes the stored model smaller and gives lower memory usage and latency.
The concrete implementation will be covered in a follow-up (a conversion-level sketch follows the quote below); quoting:
TensorRT achieved this by using an automated parameter-free calibration step to change the weights and activation tensors into lower precision using a representative input sample, and this is done such that the model minimizes the accuracy loss.
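At the user level, with the torch2trt converter used later in this post, reduced precision is requested at conversion time. A minimal sketch, assuming the fp16_mode / int8_mode flags documented in the torch2trt README (verify against your installed version); model and x are as in the AlexNet example below, and calib_dataset is a placeholder for a dataset yielding representative inputs:
from torch2trt import torch2trt

# FP16: store weights and run kernels in half precision
model_trt_fp16 = torch2trt(model, [x], fp16_mode=True)

# INT8: needs a small representative dataset so the calibrator can pick
# per-tensor scale factors that minimize the accuracy loss
model_trt_int8 = torch2trt(model, [x], int8_mode=True,
                           int8_calib_dataset=calib_dataset)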
3. Kernel Auto-tuning
For the same operation (convolution, etc.) there are many different low-level implementations. TensorRT can select the optimal one based on your parameters, such as batch size, filter size, and input data size, as well as the deployment platform.
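Auto-tuning itself happens automatically while the engine is built; the main knob the user controls is the workspace size, which bounds the scratch memory candidate kernels may use. A rough sketch of building an engine from the ONNX file above with the TensorRT Python API (names follow the TensorRT 7/8-era bindings; newer releases replace max_workspace_size and build_engine):
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("alexnet.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB: more workspace lets the
                                     # auto-tuner consider more kernel tactics
engine = builder.build_engine(network, config)  # kernel timing/selection happens here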
4. Dynamic Tensor Memory
Dynamic tensor memory ensures that memory is allocated for each tensor only for the duration of its usage. This naturally reduces memory footprint and improves memory reuse.
5. Multi-Stream Execution
Multi-stream execution is essential when you scale the inference to multiple clients. This is achieved by allowing multiple input streams to use the same model in parallel on a single device.
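As a rough analogy in plain PyTorch (a sketch only, not TensorRT's actual API; model stands for any CUDA-resident module, such as the model_trt built below), two independent requests can be enqueued on separate CUDA streams so their kernels may overlap on one device:
import torch

stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
xa = torch.randn(1, 3, 224, 224, device="cuda")
xb = torch.randn(1, 3, 224, 224, device="cuda")

with torch.cuda.stream(stream_a):  # enqueue request A on its own stream
    ya = model(xa)
with torch.cuda.stream(stream_b):  # enqueue request B concurrently
    yb = model(xb)
torch.cuda.synchronize()  # wait for both streams to finish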
Code:
Models can be converted with TRTorch, torch2trt, or TF-TRT.
TRTorch, torch2trt
PyTorch example:
import torch
from torch2trt import torch2trt
from torchvision.models.alexnet import alexnet
# create some regular pytorch model...
model = alexnet(pretrained=True).eval().cuda()
# create example data
x = torch.ones((1, 3, 224, 224)).cuda()
# convert to TensorRT feeding sample data as input
model_trt = torch2trt(model, [x])
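# run both the original model and the TensorRT-converted model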
y = model(x)
y_trt = model_trt(x)
# check the output against PyTorch
print(torch.max(torch.abs(y - y_trt)))
Saving and loading the model:
torch.save(model_trt.state_dict(), 'alexnet_trt.pth')
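# later, or in another process: rebuild the module and reload the serialized engine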
from torch2trt import TRTModule
model_trt = TRTModule()
model_trt.load_state_dict(torch.load('alexnet_trt.pth'))
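Once loaded, model_trt behaves like a regular nn.Module and is invoked the same way, e.g. y = model_trt(x).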