
ZTE Communications, 2023, Vol. 21, Issue 1: 72-80. DOI: 10.12142/ZTECOM.202301009


Adaptive Load Balancing for Parameter Servers in Distributed Machine Learning over Heterogeneous Networks

CAI Weibo1, YANG Shulin1, SUN Gang1, ZHANG Qiming2, YU Hongfang1

  1. University of Electronic Science and Technology of China, Chengdu 611731, China
    2. ZTE Corporation, Shenzhen 518057, China
  • Received: 2022-04-11 Online: 2023-03-25 Published: 2024-03-15
  • About author:CAI Weibo is pursuing his master’s degree in communication and information system at University of Electronic Science and Technology of China. His research focuses on distributed machine learning.
    YANG Shulin is pursuing his master’s degree in communication and information system at University of Electronic Science and Technology of China. His research focuses on distributed machine learning.
    SUN Gang (gangsun@uestc.edu.cn) is a professor of computer science at University of Electronic Science and Technology of China. His research interests include machine learning, cloud computing, high-performance computing, parallel and distributed systems, ubiquitous/pervasive computing and intelligence, and cyber security.
    ZHANG Qiming is a senior system architect at ZTE Corporation. He received his bachelor’s degree from Zhejiang University, China, in 1992. His research interests include MEC and heterogeneous computing.
    YU Hongfang is a professor at University of Electronic Science and Technology of China. Her research interests include network virtualization, cloud computing and next-generation networks.
  • Supported by:
    the Computing Power Networks and New Communication Primitives Project (HC-CN-2020120001); the National Natural Science Foundation of China (62102066); the Open Research Projects of Zhejiang Lab (2022QA0AB02)

Abstract:

In distributed machine learning (DML) based on the parameter server (PS) architecture, an unbalanced communication load across PSs significantly slows down model synchronization in heterogeneous networks because bandwidth is poorly utilized. To address this problem, a network-aware adaptive PS load distribution scheme is proposed, which accelerates model synchronization by proactively adjusting the communication load on PSs according to network states. We evaluate the proposed scheme on MXNet, a widely used real-world distributed training platform, and the results show that it speeds up model training by up to 2.68 times in dynamic and heterogeneous network environments.
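The core idea stated above is to shift synchronization traffic toward PSs with more available bandwidth. The following is a minimal Python sketch of bandwidth-proportional parameter partitioning; the function names and the probed bandwidth values are illustrative assumptions, not the authors' implementation.

from typing import Dict

def partition_load(model_size: int, bandwidths: Dict[str, float]) -> Dict[str, int]:
    # Split model_size parameters among PSs in proportion to each PS's
    # currently measured bandwidth, so that slower links carry less
    # synchronization traffic.
    total_bw = sum(bandwidths.values())
    shares = {ps: int(model_size * bw / total_bw) for ps, bw in bandwidths.items()}
    # Assign any rounding remainder to the fastest PS.
    fastest = max(bandwidths, key=bandwidths.get)
    shares[fastest] += model_size - sum(shares.values())
    return shares

# Hypothetical network state (Gbit/s) obtained from a monitoring probe;
# re-partitioning would be triggered whenever these measurements change.
bandwidths = {"ps0": 10.0, "ps1": 2.5, "ps2": 5.0}
print(partition_load(17_500_000, bandwidths))
# {'ps0': 10000000, 'ps1': 2500000, 'ps2': 5000000}

In a dynamic network, such a re-partitioning step would run periodically or whenever a bandwidth change is detected, which corresponds to the abstract's notion of proactively adjusting PS load according to network states.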

Key words: distributed machine learning, network awareness, parameter server, load distribution, heterogeneous network