[1] BOTTOU L. Stochastic gradient learning in neural networks[J]. Proceedings of Neuro-Nîmes, 1991, 91(8): 12.
[2] BOTTOU L. Large-scale machine learning with stochastic gradient descent[C]//Proceedings of COMPSTAT'2010. Berlin, Germany: Springer, 2010: 177-186.
[3] RAKHLIN A, SHAMIR O, SRIDHARAN K. Making gradient descent optimal for strongly convex stochastic optimization[C]//Proceedings of the 29th International Conference on Machine Learning. Madison, WI, USA: Omnipress, 2012: 449-456.
[4] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems. New York, USA: Curran Associates Inc., 2012: 1097-1105.
[5] DAHL G E, YU D, DENG L, et al. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(1): 30-42.
[6] COLLOBERT R, WESTON J. A unified architecture for natural language processing: Deep neural networks with multitask learning[C]//Proceedings of the 25th International Conference on Machine Learning. New York, USA: ACM, 2008: 160-167.
[7] DEAN J, CORRADO G S, MONGA R, et al. Large scale distributed deep networks[J]. Advances in Neural Information Processing Systems, 2012, 2: 1223-1231.
[8] XING E P, HO Q, DAI W, et al. Petuum: A new platform for distributed machine learning on big data[J]. IEEE Transactions on Big Data, 2015, 1(2): 49-67.
[9] ABADI M, AGARWAL A, BARHAM P, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems[J]. arXiv preprint arXiv:1603.04467, 2016.
[10] LI M, ANDERSEN D G, PARK J W, et al. Scaling distributed machine learning with the parameter server[C]//Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2014: 583-598.
[11] ZHANG S, CHOROMANSKA A E, LECUN Y. Deep learning with elastic averaging SGD[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2015: 685-693.
[12] LIAN X, HUANG Y, LI Y, et al. Asynchronous parallel stochastic gradient for nonconvex optimization[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2015: 2737-2745.
[13] CHEN J, PAN X, MONGA R, et al. Revisiting distributed synchronous SGD[J]. arXiv preprint arXiv:1604.00981, 2016.
[14] TANDON R, LEI Q, DIMAKIS A G, et al. Gradient coding: Avoiding stragglers in distributed learning[C]//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR, 2017: 3368-3376.
[15] HARLAP A, CUI H, DAI W, et al. Addressing the straggler problem for iterative convergent parallel ML[C]//Proceedings of the Seventh ACM Symposium on Cloud Computing. New York, NY, USA: ACM, 2016: 98-111.
[16] MCMAHAN H B, STREETER M. Delay-tolerant algorithms for asynchronous distributed online learning[J]. Advances in Neural Information Processing Systems, 2014, 4: 2915-2923.
[17] CHAN W, LANE I. Distributed asynchronous optimization of convolutional neural networks[J]. College & Research Libraries, 2014, 76(6): 756-770.
[18] ZHENG S, MENG Q, WANG T, et al. Asynchronous stochastic gradient descent with delay compensation[C]//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR, 2017: 4120-4129.
[19] HO Q, CIPAR J, CUI H, et al. More effective distributed ML via a stale synchronous parallel parameter server[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems. New York, USA: Curran Associates Inc., 2013: 1223-1231.
[20] GUPTA S, ZHANG W, WANG F. Model accuracy and runtime tradeoff in distributed deep learning: A systematic study[C]//16th IEEE International Conference on Data Mining. New York, NY: IEEE, 2016: 171-180.
[21] ZHANG W, GUPTA S, LIAN X, et al. Staleness-aware async-SGD for distributed deep learning[C]//Proceedings of the 25th International Joint Conference on Artificial Intelligence, 2016: 2350-2356.
[22] BASU S, SAXENA V, PANJA R, et al. Balancing stragglers against staleness in distributed deep learning[C]//25th International Conference on High Performance Computing. New York, NY: IEEE, 2018: 12-21.
[23] BOTTOU L, CURTIS F E, NOCEDAL J. Optimization methods for large-scale machine learning[J]. SIAM Review, 2018, 60(2): 223-311.
[24] DUTTA S, JOSHI G, GHOSH S, et al. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD[J]. arXiv preprint arXiv:1803.01113, 2018.