Environment
Linux
GPU Tesla K80
Steps
0. Download DeepBench
Download the DeepBench package from the official repository, https://github.com/baidu-research/DeepBench
Using git:
git clone https://github.com/baidu-research/DeepBench
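1. Build
- Environment setup
The NVIDIA benchmarks require CUDA, cuDNN, MPI, and NCCL.
The first three can be loaded directly with module; here I use CUDA 8.0, cuDNN 5.1, and OpenMPI 1.10.2. NCCL is taken from the path where I installed it myself.
Most of the problems that show up later come down to the versions of these libraries.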
export MODULEPATH=/BIGDATA/app/modulefiles_GPU/:/BIGDATA/app/modulefiles
module load CUDA/8.0
module load cudnn/5.1-CUDA8.0
module load openmpi/1.10.2-gcc4.9.2
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/HOME/user_name/nccl/path/lib
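Since most later failures trace back to library versions and paths, it can help to confirm that every directory on LD_LIBRARY_PATH really exists before building. The following is only a minimal sketch; the function name missing_path_entries is hypothetical and not part of DeepBench or the module system.

```shell
#!/bin/bash
# missing_path_entries: print each entry of a colon-separated search path
# that is not an existing directory. Hypothetical helper, not part of DeepBench.
missing_path_entries() {
    local path_list=$1 entry
    local IFS=:
    for entry in $path_list; do
        # Skip empty entries (for example from a leading or doubled colon).
        [ -z "$entry" ] && continue
        [ -d "$entry" ] || printf '%s\n' "$entry"
    done
}
```

Calling missing_path_entries "$LD_LIBRARY_PATH" after the module load and export lines prints nothing when every entry exists, and otherwise lists the entries that need fixing.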
From the DeepBench directory, change into the NVIDIA directory:
cd code/nvidia
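- Build
Use the build method given in the official repository. It seems the build can be done without yhrun. The ARCH setting has to be appended to the make command: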
yhrun -n 1 make CUDA_PATH=/BIGDATA/app/CUDA/8.0 CUDNN_PATH=/BIGDATA/app/cuDNN/5.1-CUDA8.0 MPI_PATH=/BIGDATA/app/openmpi/1.10.2-gcc4.9.2 NCCL_PATH=/HOME/user_name/nccl ARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62,sm_70
Alternatively, edit the Makefile.
The targets can also be built separately, for example conv:
make conv
# The exact command (sm_70 is dropped from ARCH here, since nvcc in CUDA 8.0 does not accept it):
yhrun -n 1 make CUDA_PATH=/BIGDATA/app/CUDA/8.0 CUDNN_PATH=/BIGDATA/app/cuDNN/5.1-CUDA8.0 MPI_PATH=/BIGDATA/app/openmpi/1.10.2-gcc4.9.2 NCCL_PATH=/HOME/user_name/nccl ARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62 conv
The build succeeds:
mkdir -p bin
/BIGDATA/app/CUDA/8.0/bin/nvcc conv_bench.cu -DPAD_KERNELS=1 -o bin/conv_bench -I ../kernels/ -I /BIGDATA/app/CUDA/8.0/include -I /BIGDATA/app/cuDNN/5.1-CUDA8.0/include/ -L /BIGDATA/app/cuDNN/5.1-CUDA8.0/lib64/ -L /BIGDATA/app/CUDA/8.0/lib64 -lcurand -lcudnn --generate-code arch=compute_30,code=sm_30 --generate-code arch=compute_32,code=sm_32 --generate-code arch=compute_35,code=sm_35 --generate-code arch=compute_50,code=sm_50 --generate-code arch=compute_52,code=sm_52 --generate-code arch=compute_60,code=sm_60 --generate-code arch=compute_61,code=sm_61 --generate-code arch=compute_62,code=sm_62 -std=c++11
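The nvcc line above shows how each ARCH entry ends up on the command line: sm_XY becomes --generate-code arch=compute_XY,code=sm_XY. The sketch below reproduces that mapping in shell, so an ARCH list can be inspected before it is passed to make. The function name gencode_flags is hypothetical; the real expansion is presumably done inside the DeepBench Makefile.

```shell
#!/bin/bash
# gencode_flags: turn "sm_30,sm_35" into the matching nvcc --generate-code flags.
gencode_flags() {
    local arch_list=$1 arch flags=""
    local IFS=,
    for arch in $arch_list; do
        # sm_35 -> compute_35 for the virtual architecture part
        flags="$flags --generate-code arch=compute_${arch#sm_},code=$arch"
    done
    # Drop the leading space before printing
    printf '%s\n' "${flags# }"
}

gencode_flags "sm_30,sm_35"
```

For the two-entry list in the example this prints the first and third --generate-code flags from the build output above.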
Set LD_LIBRARY_PATH before running:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/BIGDATA/app/CUDA/8.0:/BIGDATA/app/cuDNN/5.1-CUDA8.0:/BIGDATA/app/PGIcompiler/17.1/linux86-64/2017/mpi/openmpi-1.10.2:/HOME/user_name/nccl
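The dynamic loader searches only the directories listed in LD_LIBRARY_PATH, not their subdirectories, so entries normally need to name the lib or lib64 directory that holds the .so files rather than the installation root. Here is a small sketch, assuming each package keeps its libraries under lib64 or lib (as the -L flags in the nvcc line suggest for CUDA and cuDNN); lib_dir_for is a hypothetical helper name.

```shell
#!/bin/bash
# lib_dir_for: print PREFIX/lib64 or PREFIX/lib, whichever exists first.
# Returns non-zero if neither directory is present.
lib_dir_for() {
    local prefix=$1 sub
    for sub in lib64 lib; do
        if [ -d "$prefix/$sub" ]; then
            printf '%s\n' "$prefix/$sub"
            return 0
        fi
    done
    return 1
}
```

It could be used as export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(lib_dir_for /BIGDATA/app/CUDA/8.0), and likewise for the other installation prefixes.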
2. Running the tests
- gemm benchmark
In the nvidia directory:
yhrun -n 1 ./bin/gemm_bench
With CUDA 8.0 and cuDNN 5.1 the run fails with the error below. CUDA is preconfigured on Tianhe, and I do not know how to change it.
terminate called after throwing an instance of 'std::runtime_error'
what(): sgemm failed
1760 16 1760 0 0
halfyhrun: error: gn26: task 0: Aborted (core dumped)
With CUDA 7.0 and cuDNN 4.0 it runs normally.
Part of the results:
### CUDA7.0 cudnn4.0 openmpi1.10.2 nccl1 ###
Running training benchmark
Times
----------------------------------------------------------------------------------------
m n k a_t b_t precision time (usec)
1760 16 1760 0 0 float 340 .
...
(rest omitted)
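To compare rows of different sizes, the reported time can be converted to throughput: a GEMM of size m x n x k performs about 2*m*n*k floating-point operations. The sketch below does this with awk for the one row shown above; the function name gemm_tflops is made up for illustration.

```shell
#!/bin/bash
# gemm_tflops: read rows "m n k a_t b_t precision time_usec" on stdin and
# print m, n, k and the achieved TFLOP/s, counting 2*m*n*k operations per GEMM.
gemm_tflops() {
    awk 'NF >= 7 && $1 ~ /^[0-9]+$/ {
        flops = 2 * $1 * $2 * $3
        # flops per microsecond divided by 1e6 gives TFLOP/s
        printf "%d %d %d %.2f\n", $1, $2, $3, flops / $7 / 1e6
    }'
}

echo "1760 16 1760 0 0 float 340" | gemm_tflops
```

For the row above this prints 1760 16 1760 0.29, that is, roughly 0.29 TFLOP/s. Header and ruler lines are skipped because their first field is not a number.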
- conv benchmark
In the nvidia directory:
yhrun -n 1 ./bin/conv_bench
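CUDA 8.0 with cuDNN 6.0: compiles but does not run.
CUDA 7.0 with cuDNN 4.0: does not compile; many things are reported missing, probably because the versions are too old.
CUDA 8.0 with cuDNN 5.1: fails partway through; a runtime_error at the 11th test case aborts the run: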
Illegal algorithm passed to get_fwd_algo_string. Algo: 7
In conv_bench.cu, change the last branch of the std::string get_fwd_algo_string() function from
else {
std::stringstream ss;
ss << "Illegal algorithm passed to get_fwd_algo_string. Algo: " << fwd_algo_ << std::endl;
throw std::runtime_error(ss.str());
}
to
else {
return "#unknown";
}
After recompiling and rerunning, the benchmark gets past the problem case. The 11th row shows unknown, and many later rows show unknown as well.
### CUDA8.0 cudnn5.1 openmpi1.10.2 nccl1 ###
Running training benchmark
Times
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
w h c n k f_w f_h pad_w pad_h stride_w stride_h precision fwd_time (usec) bwd_inputs_time (usec) bwd_params_time (usec) total_time (usec) fwd_algo
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
700 161 1 4 32 20 5 0 0 2 2 float 929 1136 1074 3139 IMPLICIT_GEMM
700 161 1 8 32 20 5 0 0 2 2 float 1587 2168 1928 5683 IMPLICIT_GEMM
700 161 1 16 32 20 5 0 0 2 2 float 2813 4337 3508 10658 IMPLICIT_PRECOMP_GEMM
700 161 1 32 32 20 5 0 0 2 2 float 6368 8659 6899 21926 IMPLICIT_GEMM
341 79 32 4 32 10 5 0 0 2 2 float 2174 4076 2506 8756 IMPLICIT_PRECOMP_GEMM
341 79 32 8 32 10 5 0 0 2 2 float 4211 8128 5007 17346 IMPLICIT_PRECOMP_GEMM
341 79 32 16 32 10 5 0 0 2 2 float 8459 16200 9985 34644 IMPLICIT_PRECOMP_GEMM
341 79 32 32 32 10 5 0 0 2 2 float 16903 32380 20188 69471 IMPLICIT_PRECOMP_GEMM
480 48 1 16 16 3 3 1 1 1 1 float 752 1014 1515 3281 IMPLICIT_GEMM
240 24 16 16 32 3 3 1 1 1 1 float 863 1332 1258 3453 IMPLICIT_GEMM
120 12 32 16 64 3 3 1 1 1 1 float 613 652 1005 2270 #unknown
...
(rest omitted)
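With the patch in place, rows whose algorithm is not recognised by get_fwd_algo_string are labelled #unknown. To see how many rows are affected, the last column of the saved output can be tallied. This is only a sketch; the function name count_fwd_algos is hypothetical, and the sample rows are the last three shown above.

```shell
#!/bin/bash
# count_fwd_algos: tally the last column (fwd_algo) of conv_bench result rows.
# Only lines whose first field is a number are counted, which skips headers and rulers.
count_fwd_algos() {
    awk '$1 ~ /^[0-9]+$/ { count[$NF]++ }
         END { for (algo in count) print count[algo], algo }' | sort -k1,1nr -k2
}

count_fwd_algos <<'EOF'
480 48 1 16 16 3 3 1 1 1 1 float 752 1014 1515 3281 IMPLICIT_GEMM
240 24 16 16 32 3 3 1 1 1 1 float 863 1332 1258 3453 IMPLICIT_GEMM
120 12 32 16 64 3 3 1 1 1 1 float 613 652 1005 2270 #unknown
EOF
```

For these three rows the tally is 2 IMPLICIT_GEMM and 1 #unknown. On a full run, the benchmark output can be piped in or read from a saved log file instead of the here-document.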
- rnn benchmark
In the nvidia directory:
yhrun -n 1 ./bin/rnn_bench
Runs normally with CUDA 8.0 and cuDNN 5.1:
### CUDA8.0 cudnn5.1 openmpi1.10.2 nccl1 ###
Running training benchmark
Times
----------------------------------------------------------------------------------------
type hidden N timesteps precision fwd_time (usec) bwd_time (usec)
vanilla 1760 16 50 float 19590 17450
vanilla 1760 32 50 float 18289 18044
...
lstm 512 16 25 float 3888 5551
lstm 512 32 25 float 3922 5603
...
gru 2816 32 1500 float 2638524 2475404
gru 2816 32 750 float 1319982 1240556
...
(rest omitted)
- all reduce benchmark
nccl_single_all_reduce
In the nvidia directory:
yhrun -n 1 ./bin/nccl_single_all_reduce 2
Runs normally:
NCCL AllReduce
Num Ranks: 2
---------------------------------------------------------------------------
# of floats bytes transferred Time (msec)
---------------------------------------------------------------------------
100000 400000 0.109
3097600 12390400 1.344
...
(rest omitted)
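The table reports bytes and time separately; dividing one by the other gives a rough bandwidth figure that is easier to compare across message sizes. The sketch below does this for the two rows shown above. The function name allreduce_gbps is made up, and the result is simply the reported byte count over the reported time, not a bus-bandwidth measurement.

```shell
#!/bin/bash
# allreduce_gbps: read rows "floats bytes time_msec" on stdin and print
# the byte count and bytes/time in GB/s (1 GB = 1e9 bytes).
allreduce_gbps() {
    awk 'NF == 3 && $1 ~ /^[0-9]+$/ {
        # bytes per millisecond divided by 1e6 gives GB/s
        printf "%d %.2f\n", $2, $2 / $3 / 1e6
    }'
}

printf '%s\n' "100000 400000 0.109" "3097600 12390400 1.344" | allreduce_gbps
```

For the two rows above this gives about 3.67 GB/s for the 400000-byte transfer and about 9.22 GB/s for the 12390400-byte one.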
nccl_mpi_all_reduce
In the nvidia directory:
yhrun -n 2 -N 2 mpirun -np 2 ./bin/nccl_mpi_all_reduce
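It runs but produces no results. The file that the error reports as missing does exist in that directory, so I do not know why the error appears. Note that the message names libscif.so.0 as the file that cannot be opened and is marked (ignored):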
mca: base: component_find: unable to open /BIGDATA/app/openmpi/1.10.2-gcc4.9.2/lib/openmpi/mca_btl_scif: libscif.so.0: cannot open shared object file: No such file or directory (ignored)
3. Testing with yhbatch
Because the tests take a long time and the VPN keeps dropping, the jobs can be run with yhbatch instead.
Create a file test.sh with the following contents:
#! /bin/bash
yhrun -n xx xxx_bench    # the yhrun command to run
Then submit it with yhbatch:
yhbatch -n 1 ./test.sh
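This submits the job.
When the job finishes there will be an output file named after the job ID (slurm-<jobid>.out under Slurm's default naming), containing everything that would otherwise have been printed to the console.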
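When several benchmarks go into one batch script, it may be convenient to give each its own log file so that one failing benchmark does not hide the others. The following is a sketch of what test.sh could contain. The run_bench function, the LAUNCHER variable, and the logs directory are all assumptions made for illustration; only the yhrun -n 1 default and the binary names come from the steps above.

```shell
#!/bin/bash
# run_bench NAME [ARGS...]: run ./bin/NAME through the job launcher and keep
# its output in logs/NAME.log, then report the exit status.
# LAUNCHER defaults to "yhrun -n 1"; it is left unquoted on purpose so that
# its options are split into separate words.
run_bench() {
    local name=$1 status
    shift
    mkdir -p logs
    ${LAUNCHER-yhrun -n 1} "./bin/$name" "$@" > "logs/$name.log" 2>&1
    status=$?
    echo "$name exit status: $status"
}

# Example calls, to be run from the nvidia directory:
# run_bench gemm_bench
# run_bench nccl_single_all_reduce 2
```

Uncommenting the calls at the bottom and submitting the script with yhbatch would leave one log per benchmark in logs/, while the exit-status lines end up in the job's own output file.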