Preparation
Batch operations
To make it easier to control the cluster, I wrote the script cmd2all.sh:
#!/bin/bash
if [ $# -lt 3 ]; then
    echo "usage: $0 [type cmds hosts]"
    echo "for example: ./cmd2all.sh \"cmds\" \"touch t1.txt\" \"gpu1 gpu2\""
    echo "for example: ./cmd2all.sh \"path\" \"/home/gbxu/CUDA/\" \"gpu1 gpu2\""
    exit 1
fi
type=$1 # "cmds" or "path"
cmds_or_path=$2 # e.g. "touch test.txt" or "/home/gbxu/CUDA/"
#hosts=$3 # the host list is currently hardcoded below
hosts=(gpu10 gpu11 gpu12 gpu13 gpu14 gpu15 gpu16 gpu17 gpu18)
if [ "$type" == "cmds" ]
then
    for host in ${hosts[@]}
    do
        ssh $host "nohup $cmds_or_path" &
    done
fi
if [ "$type" == "path" ]
then
    for host in ${hosts[@]}
    do
        nohup scp -r "$cmds_or_path" $host:~/ &
    done
fi
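Note that every ssh/scp above is backgrounded, so the script returns before the remote commands finish. A minimal variant of the loop that blocks until all hosts are done (just bash's built-in wait, shown as a sketch):
for host in ${hosts[@]}
do
    ssh $host "nohup $cmds_or_path" &
done
wait # block until every backgrounded ssh has exited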
Using virtualenv
For a Python 3 environment, the virtualenv must be created with virtualenv -p /usr/bin/python3 mxnetGPU.
Create a new virtualenv on each node and append to .bashrc so the environment is activated every time a shell starts (convenient for the distributed runs later):
hosts="gpu10 gpu11 gpu12 gpu13 gpu14 gpu15 gpu16 gpu17 gpu18 "
./cmd2all.sh "cmds" "sudo yum -y install epel-release && sudo yum -y install python-pip && sudo pip install virtualenv && virtualenv mxnetGPU" $hosts
./cmd2all.sh "cmds" "echo \"## gbxu MXnet-GPU\" >> .bashrc" $hosts
./cmd2all.sh "cmds" "echo \"source mxnetGPU/bin/activate\" >> .bashrc" $hosts
Trying the installation on gpu10 first
Install NVIDIA Driver
If a driver is already installed, this step is unnecessary.
lspci | grep -i nvidia # list NVIDIA devices
modinfo nvidia # inspect the currently installed driver
sudo yum -y remove "nvidia-*"
sudo sh NVIDIA-Linux-x86_64-390.25.run # install the driver
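A quick sanity check once the driver install (and any reboot) is done:
nvidia-smi # should list every GPU plus driver version 390.25
cat /proc/driver/nvidia/version # version of the loaded kernel module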
Install CUDA:
See the docs:
- Use the offline installer; the online version may run into missing dependencies.
- All versions
CUDA is NVIDIA's parallel computing framework for its own GPUs. It only pays off when the problem to be solved can be massively parallelized.
Download: see the offline installer above.
# copy the installer over && run it
# if an earlier install went wrong, uninstall first
sudo yum -y remove "cuda-*"
sudo rm -rf /usr/local/cuda*
sudo rpm -i cuda-repo-rhel7-9-2-local-9.2.148-1.x86_64.rpm
sudo yum clean all
sudo yum -y install cuda
The yum local install hit problems on gpu10, so I downloaded cuda_9.2.148_396.37_linux.run instead:
sudo sh cuda_9.2.148_396.37_linux.run
During the install, agree to the NVIDIA driver (or not, just try), and answer yes/default all the way through.
or, add /usr/local/cuda-9.2/lib64 to /etc/ld.so.conf and run ldconfig as root
Add the CUDA environment variables
# export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:$LD_LIBRARY_PATH
echo -e "export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:\$LD_LIBRARY_PATH" >> .bashrc
# export PATH=$PATH:/usr/local/cuda/bin
echo -e "export PATH=\$PATH:/usr/local/cuda/bin" >> .bashrc
Test CUDA
nvcc -V
nvidia-smi
cd /home/gbxu/NVIDIA_CUDA-9.2_Samples/1_Utilities/deviceQuery
make
./deviceQuery # if the result is PASS, the installation succeeded
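If deviceQuery passes, bandwidthTest in the same samples tree is another quick check (path assumed from the default samples install location):
cd /home/gbxu/NVIDIA_CUDA-9.2_Samples/1_Utilities/bandwidthTest
make
./bandwidthTest # should also finish with Result = PASS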
Install cuDNN:
See the docs:
cuDNN (the CUDA Deep Neural Network library) is NVIDIA's GPU-accelerated library for deep neural networks. It is not strictly required for GPU training, but it is the usual choice.
tar -xzvf cudnn-9.2-linux-x64-v7.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h
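A quick way to confirm the headers landed and to read off the version (the CUDNN_MAJOR/MINOR macros live in cudnn.h in this release):
grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h # expect major 7, minor 1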
Install prerequisites
See the docs:
sudo yum -y groupinstall "Development Tools" # build-essential is a Debian package name; this is the yum equivalent
sudo yum -y install git lapack-devel openblas-devel opencv-devel atlas-devel
Compile MXNet
See the docs:
git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet
make clean_all
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_DIST_KVSTORE=1 USE_PROFILER=1
Install MXNet in Python
cd python
pip uninstall -y mxnet
pip install -e .
Test MXNet in Python
python
>>> import mxnet as mx
>>> a = mx.nd.zeros((2,3), mx.gpu())
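A zeros allocation can succeed lazily, since mx.nd operations are asynchronous; forcing a real computation and a host copy is a stronger check that fails here if CUDA/cuDNN are broken:
>>> b = (a + 1) * 2 # launches an actual GPU kernel
>>> b.asnumpy() # blocks until the GPU work finishes and copies back to host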
Install Python libraries
Fill in whatever else is missing based on the errors hit when you finally run MXNet jobs.
pip install numpy requests
Preset build flags
cd into the source root and preset the build flags as targets in the Makefile,
# vim incubator-mxnet/Makefile
# recipe lines must start with a tab; $$(nproc) keeps the substitution for the shell
cmpl:
	make -j $$(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=1
cmplgpu:
	make -j $$(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=1 USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
After that, compiling via a make target is more convenient:
make cmplgpu
Batch environment setup
Set up the environment on gpu11-gpu18 in one batch.
First use 1.sh to push the data to the nodes:
1.sh
hosts=(gpu11 gpu12 gpu13 gpu14 gpu15 gpu16 gpu17 gpu18)
for host in ${hosts[@]}
do
echo run 1.sh at $host
scp -r process_data gbxu@$host:~/
done
Then use 2.sh to run the scripts_in_nodes.sh script on each node:
2.sh
hosts=(gpu11 gpu12 gpu13 gpu14 gpu15 gpu16 gpu17 gpu18)
for host in ${hosts[@]}
do
echo run 2.sh at $host
scp process_data/scripts_in_nodes.sh gbxu@$host:~/process_data/
ssh gbxu@$host "cd process_data && nohup ./scripts_in_nodes.sh &"
done
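The per-node runs are detached by nohup, so progress has to be polled; a small loop over the same hosts works (assuming nohup dropped its output in process_data/nohup.out):
for host in ${hosts[@]}
do
    ssh gbxu@$host "tail -n 1 process_data/nohup.out"
done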
scripts_in_nodes.sh
sudo yum -y remove "cuda-*"
sudo rpm -i cuda-repo-rhel7-9-2-local-9.2.148-1.x86_64.rpm
sudo yum clean all
sudo yum -y install cuda
echo -e "export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:\$LD_LIBRARY_PATH" >> ~/.bashrc
echo -e "export PATH=\$PATH:/usr/local/cuda/bin" >> ~/.bashrc
tar -xzvf cudnn-9.2-linux-x64-v7.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo yum -y groupinstall "Development Tools" # yum has no build-essential package
sudo yum -y install git lapack-devel openblas-devel opencv-devel atlas-devel
pip install numpy requests # add more as needed when MXNet jobs report missing modules
Compile and install MXNet
After that, MXNet only needs to be compiled on one host; the remaining nodes receive it through MXNet's sync mechanism (launch.py's --sync-dst-dir).
Start training from gpu10
- The library files need to be put into the folder that gets synced:
cd incubator-mxnet/example/image-classification
echo -e "gpu11\ngpu12\ngpu13\ngpu14\ngpu15\ngpu16\ngpu17\ngpu18\n" > hosts
rm -rf mxnet
cp -r ../../python/mxnet .
cp -r ../../lib/libmxnet.so mxnet
- Then run the command below; it syncs the folder across the cluster and starts 8 workers and 1 server:
# export DMLC_INTERFACE='ib0'; # InfiniBand is not configured yet
python ../../tools/launch.py -n 8 -s 1 --launcher ssh -H hosts --sync-dst-dir /home/gbxu/image-classification_test/ python train_mnist.py --network lenet --kv-store dist_sync --num-epochs 1 --gpus 0
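Inside train_mnist.py the only distribution-specific piece is the kvstore; a minimal sketch of what --kv-store dist_sync amounts to (real mx.kvstore API; the training loop itself is omitted):
import mxnet as mx
kv = mx.kvstore.create('dist_sync') # one synchronous kvstore handle per worker
print('worker %d of %d' % (kv.rank, kv.num_workers)) # rank/size are assigned by launch.py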
ENJOY
- For training on multiple machines each containing multiple GPUs, see the docs.
- There, dist_sync_device replaces dist_sync, since the cluster nodes have multiple GPUs; see the docs.
- mxnet-make-install-test.sh
cd incubator-mxnet
make clean_all
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_DIST_KVSTORE=1 USE_PROFILER=1
cd python
pip uninstall -y mxnet
pip install -e .
cd ../example/image-classification
echo -e "gpu11\ngpu12\ngpu13\ngpu14\ngpu15\ngpu16\ngpu17\n" > hosts
rm -rf mxnet # the copy under example/image-classification
cp -r ../../python/mxnet .
cp -r ../../lib/libmxnet.so mxnet
export DMLC_INTERFACE='ib0';
python ../../tools/launch.py -n 8 -s 1 --launcher ssh -H hosts --sync-dst-dir /home/gbxu/image-classification_test/ python train_mnist.py --network lenet --kv-store dist_sync --num-epochs 1 --gpus 0