ZTE Communications ›› 2023, Vol. 21 ›› Issue (1): 72-80.DOI: 10.12142/ZTECOM.202301009

• Research Paper •

Adaptive Load Balancing for Parameter Servers in Distributed Machine Learning over Heterogeneous Networks

CAI Weibo1, YANG Shulin1, SUN Gang1, ZHANG Qiming2, YU Hongfang1

  1. University of Electronic Science and Technology of China, Chengdu 611731, China
    2. ZTE Corporation, Shenzhen 518057, China
  • Received: 2022-04-11 Online: 2023-03-25 Published: 2024-03-15
  • About author:CAI Weibo is pursuing his master’s degree in communication and information system at University of Electronic Science and Technology of China. His research focuses on distributed machine learning.
    YANG Shulin is pursuing his master’s degree in communication and information system at University of Electronic Science and Technology of China. His research focuses on distributed machine learning.
    SUN Gang (gangsun@uestc.edu.cn) is a professor of computer science at University of Electronic Science and Technology of China. His research interests include machine learning, cloud computing, high-performance computing, parallel and distributed systems, ubiquitous/pervasive computing and intelligence, and cyber security.
    ZHANG Qiming is a senior system architect of ZTE Corporation. He received his bachelor’s degree from Zhejiang University, China in 1992. His research interests include MEC and heterogeneous computing.
    YU Hongfang is a professor at University of Electronic Science and Technology of China. Her research interests include network virtualization, cloud computing and next-generation networks.
  • Supported by:
    the Computing Power Networks and New Communication Primitives Project (HC-CN-2020120001); the National Natural Science Foundation of China (62102066); Open Research Projects of Zhejiang Lab (2022QA0AB02)

Abstract:

In distributed machine learning (DML) based on the parameter server (PS) architecture, an unbalanced distribution of communication load across PSs significantly slows down model synchronization in heterogeneous networks because the available bandwidth is poorly utilized. To address this problem, a network-aware adaptive PS load distribution scheme is proposed, which accelerates model synchronization by proactively adjusting the communication load on PSs according to network states. We evaluate the proposed scheme on MXNet, a real-world distributed training platform, and the results show that our scheme achieves up to a 2.68-fold speed-up of model training in dynamic and heterogeneous network environments.
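To make the idea concrete, the Python sketch below illustrates one way a network-aware scheme of this kind could assign parameter blocks to PSs in proportion to their currently measured bandwidths, so that faster links carry more of the synchronization traffic. This is only a minimal sketch under assumed inputs, not the authors' MXNet implementation; all names (assign_blocks, the bandwidth and block-size figures) are hypothetical.

# Minimal sketch (hypothetical, not the paper's implementation): greedily assign
# parameter blocks to parameter servers so that each PS's expected transfer time
# (assigned bytes / measured bandwidth) stays balanced across heterogeneous links.

def assign_blocks(block_sizes, ps_bandwidths):
    """block_sizes: {block_id: size}, ps_bandwidths: {ps_id: bandwidth}."""
    assignment = {ps: [] for ps in ps_bandwidths}   # ps_id -> list of block ids
    load = {ps: 0.0 for ps in ps_bandwidths}        # bytes assigned to each PS
    # Place the largest blocks first to keep the greedy balance tight.
    for block_id, size in sorted(block_sizes.items(), key=lambda kv: -kv[1]):
        # Pick the PS that would finish its current load plus this block soonest.
        best = min(ps_bandwidths,
                   key=lambda ps: (load[ps] + size) / ps_bandwidths[ps])
        assignment[best].append(block_id)
        load[best] += size
    return assignment

if __name__ == "__main__":
    # Hypothetical example: 3 PSs with heterogeneous bandwidths (MB/s) and
    # 6 parameter blocks of different sizes (MB).
    bandwidths = {"ps0": 100.0, "ps1": 50.0, "ps2": 25.0}
    blocks = {"conv1": 8, "conv2": 16, "fc1": 64, "fc2": 32, "bn": 2, "out": 4}
    print(assign_blocks(blocks, bandwidths))

In a dynamic network, such an assignment would be recomputed whenever the measured per-PS bandwidth changes noticeably, which is the "proactive adjustment" the abstract refers to.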

Key words: distributed machine learning, network awareness, parameter server, load distribution, heterogeneous network