Fatal Error: OSError: [Errno 12] Cannot allocate memory
Debugging background: I was using the code from https://github.com/arunmallya/packnet on GitHub.
While debugging, the following error appeared:
    main()
  File "main.py", line 378, in main
    manager.prune()
  File "main.py", line 263, in prune
    savename='_final', best_accuracy=accuracy)
  File "main.py", line 217, in train
    self.do_epoch(epoch_idx, optimizer)
  File "main.py", line 174, in do_epoch
    for batch, label in tqdm(self.train_data_loader, desc='Epoch: %d ' % (epoch_idx)):
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1032, in __iter__
    for obj in iterable:
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 301, in __iter__
    return DataLoaderIter(self)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 158, in __init__
    w.start()
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
When I hit this problem, given what the code does, my first suspicion was the machine's memory, so I ran
watch -n 2 nvidia-smi
watch -n 2 free -m
to watch the CPU, the GPU, physical memory and swap usage for the whole run, and found that memory was not the cause. The bug hunt went nowhere.
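The physical-memory and swap figures that `free -m` prints can also be logged from inside the training script, which makes it easier to line them up with the moment of the crash. A minimal sketch, assuming a Linux machine with `/proc/meminfo`; the helper names `parse_meminfo` and `memory_snapshot_mb` are made up for this illustration and are not part of the packnet code.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Sized fields such as MemTotal are reported in kB.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_snapshot_mb(path="/proc/meminfo"):
    """Return available RAM and free swap in MiB (Linux only)."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {
        "mem_available_mb": info.get("MemAvailable", 0) // 1024,
        "swap_free_mb": info.get("SwapFree", 0) // 1024,
    }
```

Calling `memory_snapshot_mb()` at the start of each epoch and printing the result gives roughly the same view as the `watch` commands above, but in the training log.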
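So I changed approach. Judging from the failing code and the error message, the problem was in dataloader.py, so I Googled with the keywords: dataloader OSError: [Errno 12] Cannot allocate memory
Sure enough, many people had hit this error while loading data, and I found a number of possible causes:
1. The machine's memory (already ruled out)
2. The system limit on the number of threads/processes: https://blog.csdn.net/m0_37644085/article/details/92795488 suggests raising the maximum number of processes (tried, no effect)
3. Setting pin_memory=False (tried, no effect)
4. Changing the number of data-loading workers via num_workers. The value in effect was 4. Changing it to 1 made no difference, but changing it to 0 solved the problem! The program could finally run.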
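The trial-and-error in item 4 (4 workers, then 1, then 0) can be automated. The traceback above shows that the worker processes are forked when iteration starts (`__iter__` creates `DataLoaderIter`, which calls `w.start()`), so that is the point where Errno 12 can be caught. A minimal sketch under that assumption; `start_iteration` and `make_loader` are hypothetical names for illustration, not part of PyTorch or packnet.

```python
import errno


def start_iteration(make_loader, worker_counts=(4, 1, 0)):
    """Return (num_workers, iterator) for the first worker count that starts.

    make_loader is a hypothetical factory: it takes num_workers and returns
    an iterable such as a DataLoader.
    """
    for num_workers in worker_counts:
        try:
            # Worker processes are started inside iter(), per the traceback.
            return num_workers, iter(make_loader(num_workers))
        except OSError as exc:
            if exc.errno != errno.ENOMEM:
                raise  # not Errno 12, so do not hide it
    raise OSError(errno.ENOMEM,
                  "no worker count in %r could start" % (worker_counts,))
```

With PyTorch, the factory could be something like `lambda n: DataLoader(train_dataset, batch_size=32, num_workers=n, pin_memory=False)`, where `train_dataset` and the batch size are placeholders. With `num_workers=0` the data is loaded in the main process, so no `os.fork()` happens at all; the price is slower data loading.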
I am posting this as a record: it took me more than two days of careful digging! I hope it helps.