現(xiàn)象描述
當(dāng)模型在推理階段使用batch inference時,推理速度并無明顯提升舟茶,相比單幀多次推理收益不大谭期。如筆者在Xavier上測試某模型結(jié)果
batch size | inference time (ms) | per-image time (ms/img) |
---|---|---|
1 | 11.23 | 11.23 |
2 | 20.39 | 10.20 |
4 | 38.73 | 9.68 |
8 | 74.11 | 9.26 |
32 | 287.30 | 8.98 |
類似情況在網(wǎng)上也很多見,如yolov5作者的測試結(jié)果【1】
按理來說吧凉,多張圖放一個batch喂給模型隧出,模型矩陣運(yùn)算可以并行操作,推理的速度可以有batch size倍的提升阀捅,但實(shí)際觀察到的現(xiàn)象確實(shí)提升不大胀瞪,尤其是在一些算力較弱的設(shè)備上。
Cause analysis:
A bit of searching online roughly pins down the cause; see the answer in GitHub TensorRT issue 1046【2】:
In short, GPU compute capacity is the bottleneck. If a single image already nearly fills the compute cores (i.e. hits the performance ceiling), increasing the batch size cannot actually run more images in parallel. This is especially true when some of the middle layers have very large channel counts: the instantaneous matrix-multiply workload is huge, and once the CUDA cores are fully occupied, the remaining work has to queue up and wait its turn.
> Generally, GPU computation is more efficient when the batch size is larger. This is because when you have a lot of ops, you can fully utilize the GPUs and hide some inefficiency or overhead between ops. However, if there are already a lot of ops at BS=1 and even BS=1 is able to fully utilize the GPUs, you may not see any increase in efficiency anymore.
> For example, is your input size BSx3x1600x1000? This is a super large image which is expected to fully utilize even the largest GPU we have (like A100), so I don't think increasing BS gives benefit on GPU efficiency.
> In terms of N/V/K, in your case the "N" is already 1600x1000 at BS=1, so N=1600x1000 vs N=2x1600x1000 do not make too much difference in turns of GPU efficiency, compared to N=1 vs N=2.
另外一個現(xiàn)象就是gpu性能越高嗦玖,batch inference效果提升越明顯患雇。如筆者在xavier上測試單幀推理時,GPU利用率就接近60%宇挫,所以當(dāng)batch size增加時基本無增益苛吱,而yolov5作者在A100(性能天花板更高)測試時,加速效果更明顯器瘪。其實(shí)當(dāng)batch size非常大時翠储,相當(dāng)于在讓GPU持續(xù)工作直到計算完成,減少了等待時間橡疼,所以性能越高可以并行計算的量也就越大援所,加速越明顯。
可以嘗試的優(yōu)化方向:
遇到上述情況欣除,想要加快推理速度住拭,除了最直接的-換更高性能的設(shè)備,暫時想到如下兩個方向優(yōu)化:
減少計算量:
- Lower the model's input resolution
- Optimize the network structure (targeting the middle layers with very heavy compute); the core idea is to decompose large matrix multiplications into cheaper ones (see the sketch after this list). For a quick, low-effort option, check whether open-source upgrades already exist, e.g. moving from yolov5 to yolov8
- Export the model to TensorRT; apply model quantization (fp16, int8), pruning, and so on
- Upgrading the TensorRT version may bring pleasant surprises: NVIDIA's engineers may have optimized certain ops
減少cuda核等待時間:
- 異步模式(多線程等)卓研,就是不讓gpu閑著趴俘,一直去計算
Further additions to follow if anything else comes up...