JOURNAL ARTICLE

Harmonics: Scalable Collective Scheduling in Multi-Tenant GPU Clusters

Hossein Shafieirad, Amir Shani, Manaf Bin-Yahya, Seyed Hossein Mortazavi, Geng Li, Xinle Du, T.H. Su, Wei Wang, Jingbin Zhou, Majid Ghaderi

Year: 2025 | Journal: Proceedings of the ACM on Networking | Vol: 3 (CoNEXT4) | Pages: 1-20 | Publisher: Association for Computing Machinery

Abstract

Distributed machine learning (DML), such as large language model (LLM) training, has become one of the most critical services in multi-tenant cloud computing. However, communication contention among concurrent DML jobs significantly degrades overall GPU utilization, leading to inefficient training cluster performance. Existing approaches either achieve high performance at the cost of long scheduling runtime or reduce scheduling time at the expense of poor performance. We present Harmonics, a novel two-tier scheduling framework that strikes a balance between scheduling latency and performance. It coordinates decisions between Local Schedulers and a lightweight Global Coordinator to enable scalable and adaptive scheduling. By combining rack-level epoch-based optimization with global coordination, Harmonics alleviates communication contention and improves resource efficiency. We implement and evaluate Harmonics on real distributed ML workloads running on a GPU testbed. Compared to state-of-the-art methods such as fair sharing, optimal scheduling, Crux, and Cassini, Harmonics reduces training time by up to 33% and communication time by up to 48%. Large-scale simulations show that it reduces scheduling time by up to 91× while improving training time by 26% in large-cluster settings.
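To make the two-tier design described in the abstract concrete, below is a minimal illustrative sketch of a generic two-tier scheduler. All names (`LocalScheduler`, `GlobalCoordinator`, `Job`) and the greedy epoch-based policy are hypothetical, chosen only to mirror the coordination pattern the abstract describes; they are not the paper's actual implementation.

```python
# Hypothetical sketch of a two-tier scheduler: rack-level LocalSchedulers
# propose per-epoch start times that serialize their jobs' communication
# phases, and a lightweight GlobalCoordinator merges the proposals,
# resolving cross-rack jobs to a single consistent start time.
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    name: str
    comm_time: float      # communication time per training epoch
    racks: frozenset      # racks the job spans (cross-rack if > 1)

class LocalScheduler:
    """Rack-level tier: greedily order local jobs, longest comm phase first."""
    def __init__(self, rack_id):
        self.rack_id = rack_id
        self.jobs = []

    def propose(self):
        # Epoch-based proposal: stagger start times so communication
        # phases within this rack do not overlap (avoiding contention).
        order = sorted(self.jobs, key=lambda j: -j.comm_time)
        t, proposal = 0.0, {}
        for job in order:
            proposal[job.name] = t
            t += job.comm_time
        return proposal

class GlobalCoordinator:
    """Global tier: merge rack proposals; a cross-rack job adopts the
    latest start any of its racks proposed, keeping racks consistent."""
    def schedule(self, local_schedulers):
        merged = {}
        for ls in local_schedulers:
            for name, start in ls.propose().items():
                merged[name] = max(merged.get(name, 0.0), start)
        return merged

# Usage: job A spans both racks; B and C are rack-local.
r0, r1 = LocalScheduler(0), LocalScheduler(1)
a = Job("A", 4.0, frozenset({0, 1}))
r0.jobs = [a, Job("B", 2.0, frozenset({0}))]
r1.jobs = [a, Job("C", 3.0, frozenset({1}))]
sched = GlobalCoordinator().schedule([r0, r1])
# -> {"A": 0.0, "B": 4.0, "C": 4.0}
```

The point of the split is that each rack solves only its own small ordering problem per epoch, while the coordinator's merge is a cheap pass over the proposals, which is how a two-tier design can keep scheduling latency low as the cluster grows.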
