Journal of University of Science and Technology of China, 2020, Vol. 50, Issue (8): 1156-1161. DOI: 10.3969/j.issn.0253-2778.2020.08.016

• Original Paper •

Group stochastic gradient descent: A tradeoff between straggler and staleness

GAO Xiang, CHEN Li   

  1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China
  • Received: 2020-07-01  Revised: 2020-07-28  Accepted: 2020-07-28  Online: 2020-08-31  Published: 2020-07-28

Abstract: Distributed stochastic gradient descent (DSGD) is widely used for large-scale distributed machine learning. Two typical implementations of DSGD are synchronous SGD (SSGD) and asynchronous SGD (ASGD). In SSGD, all workers must wait for each other, so the training speed is slowed down to that of the straggler. In ASGD, stale gradients can result in a poorly trained model. To address this problem, a new distributed SGD method named group SGD (GSGD) is proposed, which divides the workers into several groups, placing workers with similar computation and communication performance in the same group. Workers within a group operate synchronously, while different groups operate asynchronously. The proposed method mitigates the straggler problem, since workers in the same group spend little time waiting for each other, and its staleness is small, since the number of groups is much smaller than the number of workers. The convergence of the method is proved through theoretical analysis, and simulation results show that it converges faster than SSGD and ASGD on a heterogeneous cluster.
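
For intuition, the following is a minimal, self-contained Python sketch of the grouping idea described in the abstract: workers are partitioned into groups of similar speed, gradients within a group are averaged synchronously, and each group's update is applied asynchronously to the shared parameters. All names (group_workers, gsgd), the sort-and-slice grouping rule, and the toy objective are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only; not the paper's code.
import numpy as np

def group_workers(worker_speeds, num_groups):
    """Partition workers into groups of similar speed by sorting and slicing (assumed rule)."""
    order = np.argsort(worker_speeds)          # slowest to fastest
    return np.array_split(order, num_groups)   # contiguous speed ranges

def gsgd(grad_fn, w, worker_speeds, num_groups=2, lr=0.1, steps=100, seed=0):
    """Toy GSGD loop: synchronous averaging inside a group, asynchronous apply across groups."""
    rng = np.random.default_rng(seed)
    worker_speeds = np.asarray(worker_speeds)
    groups = group_workers(worker_speeds, num_groups)
    for _ in range(steps):
        # Approximate asynchrony: groups apply updates in the order they finish,
        # here taken to be the order of their mean computation time.
        finish_order = sorted(groups, key=lambda g: np.mean(worker_speeds[g]))
        w_snapshot = w.copy()                  # parameters each group read (possibly stale)
        for group in finish_order:
            # Synchronous step inside the group: average the members' gradients.
            grads = [grad_fn(w_snapshot, rng) for _ in group]
            w = w - lr * np.mean(grads, axis=0)  # each group applies its update independently
    return w

if __name__ == "__main__":
    # Toy objective: minimize ||w||^2 with noisy gradients; two natural speed clusters.
    grad_fn = lambda w, rng: 2 * w + 0.01 * rng.standard_normal(w.shape)
    speeds = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
    print(gsgd(grad_fn, np.ones(3), speeds, num_groups=2))

In this toy version, staleness appears only within one outer step (all groups read the same snapshot), whereas the paper's setting concerns groups progressing at different rates; the sketch is meant solely to show how grouping bounds both waiting time and staleness.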

Key words: Stochastic gradient descent, distributed machine learning, straggler, staleness
