Author: Zongwei Zhou | 周縱葦
Weibo: @MrGiovanni
Email: zongweiz@asu.edu
References.
Official documentation: multi_gpu_model
and Google
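0. Misconception
Keras now supports training a network on several GPUs at once, and doing so is quite easy, but the following line on its own does not achieve it.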
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
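When you monitor GPU usage (nvidia-smi -l 1), you will find that although none of the GPUs looks free, only one of them is actually computing; the others are merely occupied and sit idle. In other words, if your machine has several graphics cards, Keras will by default occupy every GPU it can detect, with or without the line above. That line is meant for the case where you need only one GPU: it hides the other GPUs from Keras. Suppose you have three cards, each with its own index (0, 1, 2), and to avoid getting in other people's way you want to use only one of them, say the one with index 1. In that case, set: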
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
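Monitor GPU usage again (nvidia-smi -l 1) and you will see that only one card is occupied while the others are free. So this is a misconception about using multiple cards with Keras: setting this variable does not make several GPUs work on the same training job.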
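A minimal sketch of the single-GPU case described above. The helper name `restrict_visible_gpus` is hypothetical and not part of Keras. The variable has to be set before CUDA is initialized, which in practice means before the first import of tensorflow or keras.

```python
import os

def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU indices to CUDA-based libraries.

    Hypothetical helper: call it before importing tensorflow or keras.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

# Use only the card with physical index 1 (out of cards 0, 1, 2).
visible = restrict_visible_gpus([1])
print(visible)

# import keras  # import the framework only after the variable is set
```

Inside the process, the selected card is renumbered, so the single visible GPU is addressed as device 0 (for example '/gpu:0') regardless of its physical index.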
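1. Motivation
Why train on several GPUs at the same time?
A single card has too little memory -> the batch size cannot be set very large, and sometimes even batch_size=1 runs out of memory (OUT OF MEMORY).
In my experience training deep networks, a somewhat larger batch_size tends to work better: each backpropagation weight update lets the network see more samples, so it does not overfit in a different direction at every iteration (see "Don't Decay the Learning Rate, Increase the Batch Size"). I have also seen papers arguing that it should not be set too large either, for reasons that are unclear to me, and I have not had the chance to try that myself. I suggest a batch_size roughly in the range 64~256, which generally causes no problems.
As networks keep getting deeper, however, their demand on GPU memory keeps growing. For many newcomers the biggest obstacle is not the code itself: code copied from GitHub simply does not fit on their weak GPU, so they have to lower the batch_size and end up unable to reproduce the reported results.
There are two solutions: buy one extremely powerful GPU with a huge amount of memory, or buy several ordinary GPUs and use them together.
The first option does not work, because at present even the best NVIDIA cards have no more than a dozen or so GB of memory, which is still not enough once the network gets deep, and a top-end card is poor value for money. Learning to use several GPUs in Keras is therefore the more dependable choice.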
2. Implementation
2.1 Define a class
cite: parallel_model.py
import tensorflow as tf
import keras.backend as K
import keras.layers as KL
import keras.models as KM


class ParallelModel(KM.Model):
    """Subclasses the standard Keras Model and adds multi-GPU support.
    It works by creating a copy of the model on each GPU. Then it slices
    the inputs and sends a slice to each copy of the model, and then
    merges the outputs together and applies the loss on the combined
    outputs.
    """

    def __init__(self, keras_model, gpu_count):
        """Class constructor.
        keras_model: The Keras model to parallelize
        gpu_count: Number of GPUs. Must be > 1
        """
        self.inner_model = keras_model
        self.gpu_count = gpu_count
        merged_outputs = self.make_parallel()
        super(ParallelModel, self).__init__(inputs=self.inner_model.inputs,
                                            outputs=merged_outputs)

    def __getattribute__(self, attrname):
        """Redirect loading and saving methods to the inner model. That's where
        the weights are stored."""
        if 'load' in attrname or 'save' in attrname:
            return getattr(self.inner_model, attrname)
        return super(ParallelModel, self).__getattribute__(attrname)

    def summary(self, *args, **kwargs):
        """Override summary() to display summaries of both, the wrapper
        and inner models."""
        super(ParallelModel, self).summary(*args, **kwargs)
        self.inner_model.summary(*args, **kwargs)

    def make_parallel(self):
        """Creates a new wrapper model that consists of multiple replicas of
        the original model placed on different GPUs.
        """
        # Slice inputs. Slice inputs on the CPU to avoid sending a copy
        # of the full inputs to all GPUs. Saves on bandwidth and memory.
        input_slices = {name: tf.split(x, self.gpu_count)
                        for name, x in zip(self.inner_model.input_names,
                                           self.inner_model.inputs)}

        output_names = self.inner_model.output_names
        outputs_all = []
        for i in range(len(self.inner_model.outputs)):
            outputs_all.append([])

        # Run the model call() on each GPU to place the ops there
        for i in range(self.gpu_count):
            with tf.device('/gpu:%d' % i):
                with tf.name_scope('tower_%d' % i):
                    # Run a slice of inputs through this replica
                    zipped_inputs = zip(self.inner_model.input_names,
                                        self.inner_model.inputs)
                    inputs = [
                        KL.Lambda(lambda s: input_slices[name][i],
                                  output_shape=lambda s: (None,) + s[1:])(tensor)
                        for name, tensor in zipped_inputs]
                    # Create the model replica and get the outputs
                    outputs = self.inner_model(inputs)
                    if not isinstance(outputs, list):
                        outputs = [outputs]
                    # Save the outputs for merging back together later
                    for l, o in enumerate(outputs):
                        outputs_all[l].append(o)

        # Merge outputs on CPU
        with tf.device('/cpu:0'):
            merged = []
            for outputs, name in zip(outputs_all, output_names):
                # If outputs are numbers without dimensions, add a batch dim.
                def add_dim(tensor):
                    """Add a dimension to tensors that don't have any."""
                    if K.int_shape(tensor) == ():
                        return KL.Lambda(lambda t: K.reshape(t, [1, 1]))(tensor)
                    return tensor
                outputs = list(map(add_dim, outputs))

                # Concatenate
                merged.append(KL.Concatenate(axis=0, name=name)(outputs))
        return merged
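The data flow inside make_parallel (slice the batch, run every slice through a replica, concatenate the results) can be illustrated without any GPU. The following is only a toy sketch in plain Python, not the Keras implementation: `split_batch`, `run_data_parallel` and `toy_model` are hypothetical names, and the "replicas" run one after another on the CPU.

```python
def split_batch(batch, gpu_count):
    """Cut a batch (a list of samples) into gpu_count equal slices."""
    if len(batch) % gpu_count != 0:
        raise ValueError("batch size must be divisible by gpu_count")
    size = len(batch) // gpu_count
    return [batch[i * size:(i + 1) * size] for i in range(gpu_count)]

def run_data_parallel(model_fn, batch, gpu_count):
    """Mimic the ParallelModel data flow sequentially on the CPU.

    model_fn stands in for one replica. All replicas share the same
    weights, so the same function is applied to every slice.
    """
    slices = split_batch(batch, gpu_count)        # plays the role of tf.split
    outputs = [model_fn(s) for s in slices]       # one forward pass per "GPU"
    return [y for out in outputs for y in out]    # plays the role of Concatenate(axis=0)

# Hypothetical stand-in model: doubles every sample.
toy_model = lambda samples: [2 * x for x in samples]

batch = [1, 2, 3, 4, 5, 6]   # 6 samples, 3 replicas -> 2 samples each
merged = run_data_parallel(toy_model, batch, gpu_count=3)
print(merged)
```

The merged result equals what a single model would return for the whole batch; only the work is divided. Because tf.split with an integer count likewise needs an even split, the batch size passed to training should be a multiple of gpu_count.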
2.2 Usage is very concise
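import keras
from keras.optimizers import Adam

# X_train, y_train, X_valid, y_valid, batch_size and nb_epoch are assumed
# to be defined elsewhere by the caller.
GPU_COUNT = 3  # use 3 GPUs at the same time
model = keras.applications.densenet.DenseNet201()  # e.g. DenseNet-201
model = ParallelModel(model, GPU_COUNT)
model.compile(optimizer=Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train,
          batch_size=batch_size * GPU_COUNT,  # batch_size is the per-GPU batch size
          epochs=nb_epoch, verbose=0, shuffle=True,
          validation_data=(X_valid, y_valid))
model.save_weights('/path/to/save/model.h5')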
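The final save_weights call ends up on the inner model because of the __getattribute__ override in ParallelModel. The redirect can be seen in isolation with two stand-in classes. `Inner` and `Wrapper` are hypothetical and only mirror the pattern; they are not Keras classes.

```python
class Inner:
    """Hypothetical stand-in for the wrapped Keras model."""
    def save_weights(self, path):
        return "inner weights -> " + path

    def predict(self, x):
        return "inner predict"

class Wrapper:
    """Hypothetical stand-in for ParallelModel."""
    def __init__(self, inner):
        self.inner_model = inner

    def __getattribute__(self, attrname):
        # Forward any attribute whose name contains 'load' or 'save'
        # to the inner object; everything else is resolved normally.
        if 'load' in attrname or 'save' in attrname:
            return getattr(self.inner_model, attrname)
        return super(Wrapper, self).__getattribute__(attrname)

    def predict(self, x):
        return "wrapper predict"

model = Wrapper(Inner())
print(model.save_weights("model.h5"))  # handled by Inner
print(model.predict(None))             # handled by Wrapper
```

Since the file is written by the inner model, it should be loadable into a plain single-GPU model of the same architecture, which is presumably why the class redirects these methods.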