1. Abstract
- A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers) is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The elastic force ties the local parameters to the global parameters kept on the parameter server.
- Enables the local workers to perform more exploration. The algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. In other words, by communicating less often with the master, the local parameters are allowed to explore ahead and drift further from the global parameters.
- Both a synchronous and an asynchronous variant of the algorithm are proposed.
- We provide the stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. The stability (convergence) proof is carried out for the round-robin scheme and compared against parallelized ADMM.
- We additionally propose the momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. A momentum-based variant is introduced as well, usable with either version.
2. Intro
- But practical image recognition systems consist of large-scale convolutional neural networks trained on few GPU cards sitting in a single computer [3, 4]. The main challenge is to devise parallel SGD algorithms to train large-scale deep learning models that yield a significant speedup when run on multiple GPU cards. The setting studied here is a single machine with multiple GPUs; the challenge is parallelizing SGD across the GPU cards.
- In this paper we introduce the Elastic Averaging SGD method (EASGD) and its variants. EASGD is motivated by the quadratic penalty method [5], but is re-interpreted as a parallelized extension of the averaging SGD algorithm [6]. The paper proposes EASGD and its variants, motivated by the quadratic penalty method but recast as a parallel version of averaging SGD.
- The elastic force links each local parameter to the center variable on the master; the center variable is updated as a moving average, both in time and in space.
- The main contribution of this paper is a new algorithm that provides fast convergent minimization while outperforming the DOWNPOUR method [2] and other baseline approaches in practice. The main contribution is faster convergence in practice, beating DOWNPOUR and the other baselines.
- EASGD reduces the communication overhead between the master and the local workers.
3. Problem setting
- This paper focuses on the problem of reducing the parameter communication overhead between the master and local workers.
4. EASGD update rule
- Compute the difference between the local parameters and the center (global) parameters, then add this difference to the gradient-descent update, so that the local parameters are pulled toward the global parameters.
- Note that choosing beta = p*alpha leads to an elastic symmetry in the update rule, i.e. there exists a symmetric force between the update of each local parameter and the center variable.
- Note also that alpha = eta*rho, where the magnitude of rho represents the amount of exploration we allow in the model. In particular, small rho allows for more exploration as it allows the local parameters to fluctuate further from the center. rho thus controls how far each local worker may explore on its own; a small rho permits more exploration, letting the local parameters drift further from the global parameters.
- The distinctive idea of EASGD is to allow the local workers to perform more exploration (small rho) and the master to perform exploitation. This is the novelty of EASGD: the local workers are allowed to explore more. A sketch of the resulting update is given below.
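A minimal NumPy sketch of the synchronous update described above, using the notation from these notes (alpha = eta*rho, and the master update amounts to beta = p*alpha overall). The function and argument names are mine, and the per-worker gradient oracles are placeholders, not the paper's code:

```python
import numpy as np

def easgd_sync_step(local_params, center, grad_fns, eta=0.01, rho=0.1):
    """One synchronous EASGD-style step (sketch).
    local_params: list of per-worker parameter vectors
    center:       center variable held by the master
    grad_fns:     per-worker stochastic gradient oracles (assumed placeholders)"""
    alpha = eta * rho                                # coupling strength alpha = eta * rho
    diffs = []
    for i in range(len(local_params)):
        d = local_params[i] - center                 # elastic difference: local minus center
        diffs.append(d)
        # local worker: stochastic gradient step plus elastic pull toward the center
        local_params[i] = local_params[i] - eta * grad_fns[i](local_params[i]) - alpha * d
    # master: the center variable moves toward the workers (a moving average in space)
    center = center + alpha * np.sum(diffs, axis=0)
    return local_params, center
```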
4.1. Asynchronous EASGD
- The previous section described the synchronous EASGD; this section introduces the asynchronous variant.
- Each worker maintains its own clock t^i, which starts from 0 and is incremented by 1 after each stochastic gradient update of its parameters, as shown in Algorithm 1. The master performs an update whenever the local workers finish tau steps of their gradient updates, where we refer to tau as the communication period. Each worker keeps its own clock, increments it after every gradient step, and every tau steps communicates with the master, updating the parameters and fetching the latest center variable.
- The worker waits for the master to send back the current center variable, computes the elastic difference, and sends that difference back to the master, which then uses it to update the center variable.
- The communication period tau controls the frequency of the communication between every local worker and the master, and thus the trade-off between exploration and exploitation. A sketch of the worker-side loop is given below.
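A sketch of one worker's loop in this asynchronous scheme. The `master` handle with `get_center()` and `add_to_center()` is hypothetical and stands in for the actual communication with the parameter server; the structure follows the description in the notes, not the paper's code:

```python
def easgd_async_worker(x_i, master, grad_fn, num_steps, eta=0.01, rho=0.1, tau=4):
    """Asynchronous-EASGD-style worker loop (sketch).
    `master` is a hypothetical handle; in a real system get_center() and
    add_to_center() would be messages exchanged with the parameter server."""
    alpha = eta * rho
    t_i = 0                                       # the worker's own clock
    for _ in range(num_steps):
        if t_i % tau == 0:                        # communicate every tau local steps
            center = master.get_center()          # wait for the master's center variable
            d = alpha * (x_i - center)            # elastic difference
            master.add_to_center(d)               # master update: center <- center + d
            x_i = x_i - d                         # worker is pulled toward the center
        x_i = x_i - eta * grad_fn(x_i)            # local stochastic gradient step
        t_i += 1                                  # clock incremented after each update
    return x_i
```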
4.2. Momentum EASGD
- It is based on Nesterov's momentum scheme [24, 25, 26], where the update of the local worker is replaced by a momentum update (the equation itself is not copied into these notes; see the sketch below).
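Since the equation is not reproduced here, the sketch below only illustrates the general shape implied by the notes: a standard Nesterov momentum step on the local worker with the elastic pull toward the center added. The exact coefficients and form are my assumption, not quoted from the paper:

```python
def eamsgd_local_step(x_i, v_i, center, grad_fn, eta=0.01, rho=0.1, delta=0.9):
    """Nesterov-momentum local update with an elastic pull toward the center
    (sketch; exact form of the paper's update is assumed, not quoted).
    delta is the momentum coefficient."""
    lookahead = x_i + delta * v_i                    # Nesterov look-ahead point
    v_i = delta * v_i - eta * grad_fn(lookahead)     # velocity update at the look-ahead
    x_i = x_i + v_i - eta * rho * (x_i - center)     # momentum step plus elastic pull
    return x_i, v_i
```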
5. Experiments
- In this section we compare the performance of EASGD and EAMSGD with the parallel method DOWNPOUR and the sequential method SGD, as well as their averaging and momentum variants. The comparison covers EASGD, EAMSGD, DOWNPOUR, and their averaging and momentum variants.
- We perform experiments in a deep learning setting on two benchmark datasets: CIFAR-10 (we refer to it as CIFAR) and ImageNet ILSVRC 2013 (we refer to it as ImageNet).
- We focus on the image classification task with deep convolutional neural networks.
6. Conclusion
- In this paper we describe a new algorithm called EASGD and its variants for training deep neural networks in the stochastic setting when the computations are parallelized over multiple GPUs. SGD is parallelized over multiple GPU cards.
- We provide the stability analysis of the asynchronous EASGD in the round-robin scheme, and show the theoretical advantage of the method over ADMM.