Run a TensorFlow program on a server that has GPUs.
In general you do not need to specify explicitly whether to use the CPU or the GPU: TensorFlow detects the hardware automatically. If a GPU is found, TensorFlow runs as many operations as possible on the first GPU it detects. When the machine has more than one GPU, the GPUs beyond the first do not take part in the computation by default; to make TensorFlow use them, you must assign ops to them explicitly with with tf.device('/gpu:%d' % i), which pins an op to a specific CPU or GPU.
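A minimal sketch of explicit device placement (the op and device indices here are only illustrative):
import tensorflow as tf

# Pin ops to specific devices by name.
with tf.device('/cpu:0'):
  a = tf.constant([1.0, 2.0, 3.0], name='a')
with tf.device('/gpu:0'):
  b = a * 2.0  # this multiply is placed on the first GPU, if one exists

# allow_soft_placement lets TensorFlow fall back to the CPU when an op
# has no GPU kernel or the requested GPU is absent.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                      log_device_placement=True)) as sess:
  print(sess.run(b))  # the chosen placement is written to the log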
The source code for this article is on GitHub, in the multi_gpu_train folder.
Check how many GPUs on the current server are visible to TensorFlow:
from tensorflow.python.client import device_lib

def get_available_gpus():
    """
    Inspect the GPUs from the shell: nvidia-smi
    Inspect the processes occupying a GPU: ps aux | grep PID
    :return: list of GPU device names
    """
    local_device_protos = device_lib.list_local_devices()
    print("all: %s" % [x.name for x in local_device_protos])
    gpu_names = [x.name for x in local_device_protos if x.device_type == 'GPU']
    print("gpu: %s" % gpu_names)
    return gpu_names
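Calling it from an interactive session gives a quick sanity check (the output varies by machine):
gpus = get_available_gpus()
print("found %d GPU(s)" % len(gpus))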
The default (CPU-only) TensorFlow package cannot see any GPU devices; install the GPU build of TensorFlow:
pip install --upgrade tensorflow-gpu==1.2 -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
Check whether the GPU libraries are on the library path:
echo $LD_LIBRARY_PATH
Inspect the server's GPUs:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20                 Driver Version: 375.20                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:04:00.0     Off |                    0 |
| N/A   29C    P0    68W / 235W |      0MiB / 11471MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
tensorflow-gpu==1.3.0 fails because it cannot find libcudnn.so.6; falling back to version 1.2 fixes it.
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory
If the libcudnn library is missing, download the cuDNN library from NVIDIA's official website.
The devices finally reported:
all: [u'/cpu:0', u'/gpu:0']
gpu: [u'/gpu:0']
Launch TensorBoard:
tensorboard --logdir=/tmp/cifar10_train --port=8008
Source code
Use the tf.gfile module to prepare the directories, then call the core method train().
def main(argv=None):  # pylint: disable=unused-argument
  cifar10.maybe_download_and_extract()  # download the data set
  # Standard directory handling with the tf.gfile module
  if tf.gfile.Exists(FLAGS.train_dir):  # if previous training data exists
    tf.gfile.DeleteRecursively(FLAGS.train_dir)  # delete it recursively
  tf.gfile.MakeDirs(FLAGS.train_dir)  # create a fresh directory
  train()  # core method: training


if __name__ == '__main__':
  tf.app.run()
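The FLAGS used above and in train() (train_dir, max_steps, num_gpus, log_device_placement) are defined with tf.app.flags near the top of the script; a sketch consistent with the official CIFAR-10 multi-GPU example (the defaults in the repository may differ):
FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
                           """Directory where to write event logs and checkpoints.""")
tf.app.flags.DEFINE_integer('max_steps', 1000000,
                            """Number of batches to run.""")
tf.app.flags.DEFINE_integer('num_gpus', 1,
                            """How many GPUs to use.""")
tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")
In the official example, batch_size (default 128) and data_dir are defined as flags in cifar10.py.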
Create global_step, the training-step counter. It is incremented automatically during training; its name is global_step, its shape is [] (a scalar), its initial value is 0, and it is not a trainable parameter.
def train():
  """Train CIFAR-10 for a number of steps."""
  with tf.Graph().as_default(), tf.device('/cpu:0'):  # build the graph, placing ops on CPU 0 by default
    # trainable=False: the global step is a counter, not a trainable parameter.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)

    # Number of batches per epoch: batch_size is 128, so 50000 / 128 = 390.625.
    num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                             FLAGS.batch_size)
    decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)  # steps between learning-rate decays

    # Decay the learning rate (lr) exponentially based on the global step.
    lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                    global_step,
                                    decay_steps,
                                    cifar10.LEARNING_RATE_DECAY_FACTOR,
                                    staircase=True)
    opt = tf.train.GradientDescentOptimizer(lr)  # gradient descent with the decayed learning rate
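With staircase=True the learning rate drops in discrete steps: lr = INITIAL_LEARNING_RATE * LEARNING_RATE_DECAY_FACTOR ** floor(global_step / decay_steps). A tiny numerical sketch, using the constants from the standard CIFAR-10 example (0.1 initial rate, 0.1 decay factor, 350 epochs per decay; the values in the repository may differ):
# Staircase exponential decay, computed by hand for illustration.
initial_lr = 0.1
decay_factor = 0.1
num_batches_per_epoch = 50000 / 128.0            # 390.625 batches per epoch
decay_steps = int(num_batches_per_epoch * 350)   # 136718 steps between decays

def decayed_lr(step):
  return initial_lr * decay_factor ** (step // decay_steps)

print(decayed_lr(0))       # 0.1
print(decayed_lr(200000))  # ~0.01, one decay boundary has been crossed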
Use a prefetch queue to obtain batch_queue.
    # Get images and labels for CIFAR-10.
    images, labels = cifar10.distorted_inputs()  # distorted training images and their labels
    batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
        [images, labels], capacity=2 * FLAGS.num_gpus)  # prefetch queue that feeds the towers
Run on multiple GPUs; if fewer GPUs are available, only the available ones run. All per-tower gradients are collected into tower_grads, reuse_variables() shares the variables across the towers, and summaries gathers the summary data from the scope.
    tower_grads = []
    with tf.variable_scope(tf.get_variable_scope()):  # the shared variable scope
      for i in xrange(FLAGS.num_gpus):  # loop over the GPUs
        with tf.device('/gpu:%d' % i):  # pin this tower to GPU i
          with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # As many towers run as there are GPUs available.
            print('running: %s_%d' % (cifar10.TOWER_NAME, i))

            # Dequeue one batch for this GPU.
            image_batch, label_batch = batch_queue.dequeue()

            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)

            # Reuse variables for the next tower.
            tf.get_variable_scope().reuse_variables()

            # Retain the summaries from the final tower.
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)

            # Calculate the gradients for the batch of data on this CIFAR tower.
            grads = opt.compute_gradients(loss)

            # Keep track of the gradients across all towers.
            tower_grads.append(grads)  # tower_grads is defined outside the loop and collects every tower's gradients
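tower_loss() is defined elsewhere in the source file. Conceptually, it builds one complete copy of the model inside the tower's name scope and sums only that tower's losses; a simplified sketch following the official CIFAR-10 example (the real helper also adds per-loss summaries):
def tower_loss(scope, images, labels):
  """Build one tower of the CIFAR model and return its total loss."""
  logits = cifar10.inference(images)   # forward pass; variables are shared across towers
  _ = cifar10.loss(logits, labels)     # adds this tower's losses to the 'losses' collection
  losses = tf.get_collection('losses', scope)       # keep only this tower's losses
  total_loss = tf.add_n(losses, name='total_loss')  # cross-entropy plus weight decay
  return total_loss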
Average the gradients across the towers; the optimizer opt applies the averaged gradients, and the learning rate and gradient histograms are added to summaries.
    # We must calculate the mean of each gradient. Note that this is the
    # synchronization point across all towers.
    grads = average_gradients(tower_grads)  # average the per-tower gradients

    # Add a summary to track the learning rate.
    summaries.append(tf.summary.scalar('learning_rate', lr))

    # Add histograms for gradients.
    for grad, var in grads:
      if grad is not None:
        summaries.append(tf.summary.histogram(var.op.name + '/gradients', grad))

    # Apply the gradients to adjust the shared variables.
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

    # Add histograms for trainable variables.
    for var in tf.trainable_variables():
      summaries.append(tf.summary.histogram(var.op.name, var))
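average_gradients() is the other helper defined elsewhere in the source. It transposes the per-tower lists of (gradient, variable) pairs and averages each variable's gradient over all towers; a sketch following the official example:
def average_gradients(tower_grads):
  """Average the gradient of each shared variable across all towers."""
  average_grads = []
  # zip(*tower_grads) groups the (grad, var) pairs belonging to the same variable.
  for grad_and_vars in zip(*tower_grads):
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]  # add a leading 'tower' dimension
    grad = tf.reduce_mean(tf.concat(grads, 0), 0)             # mean over the towers
    v = grad_and_vars[0][1]   # the variable is shared, so take it from the first tower
    average_grads.append((grad, v))
  return average_grads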
Create the moving-average operation over the trainable variables.
    variable_averages = tf.train.ExponentialMovingAverage(
        cifar10.MOVING_AVERAGE_DECAY, global_step)  # exponential moving averages of the variables
    variables_averages_op = variable_averages.apply(tf.trainable_variables())  # op that updates the shadow variables
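ExponentialMovingAverage keeps a shadow copy of each variable that is updated as shadow = decay * shadow + (1 - decay) * variable every time the op runs (when a num_updates argument such as global_step is passed, the effective decay is the smaller of decay and (1 + num_updates) / (10 + num_updates)). A tiny numerical illustration with a made-up decay value:
decay = 0.99
value = 0.0     # the variable's new value
shadow = 1.0    # the shadow still remembers the old value 1.0
for _ in range(10):
  shadow = decay * shadow + (1 - decay) * value
print(shadow)   # ~0.904: the shadow drifts slowly toward the new value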
Build the train op and the summary op.
    # Group all updates into a single train op.
    train_op = tf.group(apply_gradient_op, variables_averages_op)  # the training op

    # Create a saver.
    saver = tf.train.Saver(tf.global_variables())  # checkpoints all global variables

    # Build the summary operation from the last tower summaries.
    summary_op = tf.summary.merge(summaries)  # merge the collected summaries
Run the initialization.
    # Build an initialization operation to run below.
    init = tf.global_variables_initializer()

    # Start running operations on the Graph. allow_soft_placement must be set to
    # True to build towers on GPU, as some of the ops do not have GPU
    # implementations.
    sess = tf.Session(config=tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=FLAGS.log_device_placement))
    sess.run(init)

    # Start the queue runners.
    tf.train.start_queue_runners(sess=sess)

    summary_writer = tf.summary.FileWriter(FLAGS.train_dir, sess.graph)  # where the summaries are written
Print progress every 10 steps, write a summary every 100 steps, and save a checkpoint every 1,000 steps:
    for step in xrange(FLAGS.max_steps):
      start_time = time.time()
      _, loss_value = sess.run([train_op, loss])
      duration = time.time() - start_time

      assert not np.isnan(loss_value), 'Model diverged with loss = NaN'

      if step % 10 == 0:
        num_examples_per_step = FLAGS.batch_size * FLAGS.num_gpus
        examples_per_sec = num_examples_per_step / duration
        sec_per_batch = duration / FLAGS.num_gpus

        format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f '
                      'sec/batch)')
        print(format_str % (datetime.now(), step, loss_value,
                            examples_per_sec, sec_per_batch))

      if step % 100 == 0:
        summary_str = sess.run(summary_op)
        summary_writer.add_summary(summary_str, step)

      # Save the model checkpoint periodically.
      if step % 1000 == 0 or (step + 1) == FLAGS.max_steps:
        checkpoint_path = os.path.join(FLAGS.train_dir, 'model.ckpt')
        saver.save(sess, checkpoint_path, global_step=step)
The core idea: compute the gradients on multiple GPU towers in parallel, then apply the averaged gradients and maintain moving averages of the variables. The result in TensorBoard:
OK, that's all!