Hossein Shafieirad, Amir Shani, Manaf Bin-Yahya, Seyed Hossein Mortazavi, Geng Li, Xinle Du, T.H. Su, Wei Wang, Jingbin Zhou, Majid Ghaderi
Distributed machine learning (DML), such as large language model (LLM) training, has become one of the most critical services in multi-tenant cloud computing. However, communication contention among concurrent DML jobs significantly degrades overall GPU utilization and reduces training cluster efficiency. Existing approaches either achieve high performance at the cost of long scheduling runtime or reduce scheduling time at the cost of degraded performance. We present Harmonics, a novel two-tier scheduling framework that strikes a balance between scheduling latency and performance. It coordinates decisions between Local Schedulers and a lightweight Global Coordinator to enable scalable and adaptive scheduling. By combining rack-level epoch-based optimization with global coordination, Harmonics alleviates communication contention and improves resource efficiency. We implement and evaluate Harmonics on real distributed ML workloads running on a GPU testbed. Compared to state-of-the-art methods such as fair sharing, optimal scheduling, Crux, and Cassini, Harmonics reduces training time by up to 33% and communication time by up to 48%. Large-scale simulations show that it reduces scheduling time by up to 91× while improving training time by 26% in large-cluster settings.
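As a rough illustration of the two-tier idea sketched in the abstract (per-rack Local Schedulers producing epoch plans, with a lightweight Global Coordinator resolving cross-rack communication contention), the Python sketch below shows one possible structure. All class names, the greedy deferral heuristic, and the `fabric_capacity` parameter are hypothetical assumptions introduced here for illustration; they are not Harmonics' actual API or algorithm.

```python
# Illustrative two-tier scheduling sketch (hypothetical; not the Harmonics implementation).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    job_id: str
    rack: str
    comm_demand: float  # assumed fraction of rack uplink bandwidth the job's collectives need


@dataclass
class LocalScheduler:
    """Per-rack scheduler: orders jobs within an epoch to limit intra-rack contention."""
    rack: str
    jobs: List[Job] = field(default_factory=list)

    def plan_epoch(self) -> List[str]:
        # Simple heuristic: heaviest communicators first, so their collective
        # phases finish before lighter jobs compete for the same links.
        return [j.job_id for j in sorted(self.jobs, key=lambda j: j.comm_demand, reverse=True)]


class GlobalCoordinator:
    """Lightweight coordinator: staggers rack plans whose combined demand
    would oversubscribe the shared inter-rack fabric."""

    def __init__(self, fabric_capacity: float = 1.0):
        self.fabric_capacity = fabric_capacity

    def coordinate(self, racks: Dict[str, LocalScheduler]) -> Dict[str, List[str]]:
        plans = {rack: sched.plan_epoch() for rack, sched in racks.items()}
        # Greedy placeholder for global coordination: if total cross-rack demand
        # exceeds capacity, defer the lightest rack's plan to the next epoch.
        total = sum(j.comm_demand for s in racks.values() for j in s.jobs)
        if total > self.fabric_capacity:
            lightest = min(racks, key=lambda r: sum(j.comm_demand for j in racks[r].jobs))
            plans[lightest] = []  # deferred to the next epoch
        return plans


if __name__ == "__main__":
    rack_a = LocalScheduler("rackA", [Job("llm-1", "rackA", 0.6), Job("vit-2", "rackA", 0.2)])
    rack_b = LocalScheduler("rackB", [Job("llm-3", "rackB", 0.5)])
    print(GlobalCoordinator(fabric_capacity=1.0).coordinate({"rackA": rack_a, "rackB": rack_b}))
```

The sketch only conveys the division of labor: local, rack-scoped decisions each epoch, plus a cheap global pass over aggregate demand, which is what keeps coordination overhead low relative to a single monolithic scheduler.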