Hossein Shafieirad, Amir Shani, Manaf Bin-Yahya, Seyed Hossein Mortazavi, Geng Li, Xinle Du, T.H. Su, Wei Wang, Jingbin Zhou, Majid Ghaderi
Distributed machine learning (DML), such as large language model (LLM) training, has become one of the most critical services in multi-tenant cloud computing. However, communication contention among concurrent DML jobs significantly degrades overall GPU utilization and reduces training cluster efficiency. Existing approaches either achieve high performance at the cost of long scheduling runtime or reduce scheduling time at the cost of degraded performance. We present Harmonics, a novel two-tier scheduling framework that strikes a balance between scheduling latency and performance. It coordinates decisions between Local Schedulers and a lightweight Global Coordinator to enable scalable and adaptive scheduling. By combining rack-level epoch-based optimization with global coordination, Harmonics alleviates communication contention and improves resource efficiency. We implement and evaluate Harmonics on real distributed ML workloads running on a GPU testbed. Compared to state-of-the-art methods such as fair sharing, optimal scheduling, Crux, and Cassini, Harmonics reduces training time by up to 33% and communication time by up to 48%. Large-scale simulations show that it reduces scheduling time by up to 91× while improving training time by 26% in large-cluster settings.
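As a rough illustration of the two-tier idea sketched in the abstract (per-rack Local Schedulers producing epoch plans, with a lightweight Global Coordinator resolving cross-rack communication contention), the Python sketch below shows one possible structure. All class names, the greedy deferral heuristic, and the `fabric_capacity` parameter are hypothetical assumptions introduced here for illustration; they are not Harmonics' actual API or algorithm.

```python
# Illustrative two-tier scheduling sketch (hypothetical; not the Harmonics implementation).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    job_id: str
    rack: str
    comm_demand: float  # assumed fraction of rack uplink bandwidth the job's collectives need


@dataclass
class LocalScheduler:
    """Per-rack scheduler: orders jobs within an epoch to limit intra-rack contention."""
    rack: str
    jobs: List[Job] = field(default_factory=list)

    def plan_epoch(self) -> List[str]:
        # Simple heuristic: heaviest communicators first, so their collective
        # phases finish before lighter jobs compete for the same links.
        return [j.job_id for j in sorted(self.jobs, key=lambda j: j.comm_demand, reverse=True)]


class GlobalCoordinator:
    """Lightweight coordinator: staggers rack plans whose combined demand
    would oversubscribe the shared inter-rack fabric."""

    def __init__(self, fabric_capacity: float = 1.0):
        self.fabric_capacity = fabric_capacity

    def coordinate(self, racks: Dict[str, LocalScheduler]) -> Dict[str, List[str]]:
        plans = {rack: sched.plan_epoch() for rack, sched in racks.items()}
        # Greedy placeholder for global coordination: if total cross-rack demand
        # exceeds capacity, defer the lightest rack's plan to the next epoch.
        total = sum(j.comm_demand for s in racks.values() for j in s.jobs)
        if total > self.fabric_capacity:
            lightest = min(racks, key=lambda r: sum(j.comm_demand for j in racks[r].jobs))
            plans[lightest] = []  # deferred to the next epoch
        return plans


if __name__ == "__main__":
    rack_a = LocalScheduler("rackA", [Job("llm-1", "rackA", 0.6), Job("vit-2", "rackA", 0.2)])
    rack_b = LocalScheduler("rackB", [Job("llm-3", "rackB", 0.5)])
    print(GlobalCoordinator(fabric_capacity=1.0).coordinate({"rackA": rack_a, "rackB": rack_b}))
```

The sketch only conveys the division of labor: local, rack-scoped decisions each epoch, plus a cheap global pass over aggregate demand, which is what keeps coordination overhead low relative to a single monolithic scheduler.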