Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Yizhou Luo; Qiang Wang; Shaohuai Shi; Jiaxin Lai; Shuhan Qi; Jiajia Zhang; Xuan Wang

doi:10.1109/iwqos61813.2024.10682877

JOURNAL ARTICLE

Year: 2024 Pages: 1-10

Abstract

Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reduce operational costs and enhance resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster utilization through periodic reallocation or selection of GPU resources, they also encounter challenges such as preemption and migration overhead, along with potential DL accuracy degradation. Nonetheless, few explore the potential benefits of GPU sharing to improve resource utilization and reduce job queuing times. Motivated by these insights, we present a job scheduling model allowing multiple jobs to share the same set of GPUs without altering job training settings. We introduce SJF-BSBF (shortest job first with best sharing benefit first), a straightforward yet effective heuristic scheduling algorithm. SJF-BSBF intelligently selects job pairs for GPU resource sharing and runtime settings (sub-batch size and scheduling time point) to optimize overall performance while ensuring DL convergence accuracy through gradient accumulation. In experiments with both physical DL workloads and trace-driven simulations, even as a preemption-free policy, SJF-BSBF reduces the average job completion time by 27-33% relative to the state-of-the-art preemptive DL schedulers. Moreover, SJF-BSBF can wisely determine the optimal resource sharing settings, such as the sharing time point and sub-batch size for gradient accumulation, outperforming the aggressive GPU sharing approach (baseline SJF-FFS policy) by up to 17% in large-scale traces.
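
The abstract relies on gradient accumulation to keep convergence behavior unchanged when a job runs with a smaller sub-batch size on shared GPUs. The following is a minimal sketch of that general idea, not the authors' implementation: it uses a toy one-parameter linear model, and the function names `mse_grad` and `accumulated_grad` are hypothetical. It illustrates that accumulating weighted sub-batch gradients reproduces the full-batch gradient, so the effective batch size seen by the optimizer stays the same.

```python
import random


def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, sub_batch_size):
    """Full-batch gradient computed one sub-batch at a time.

    Hypothetical helper: each sub-batch gradient is weighted by its share
    of the full batch, so uneven final sub-batches are handled correctly.
    """
    n = len(xs)
    total = 0.0
    for start in range(0, n, sub_batch_size):
        sub_x = xs[start:start + sub_batch_size]
        sub_y = ys[start:start + sub_batch_size]
        total += mse_grad(w, sub_x, sub_y) * len(sub_x) / n
    return total


# Synthetic data for illustration only (assumed, not from the paper).
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(64)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

full = mse_grad(0.5, xs, ys)
accum_even = accumulated_grad(0.5, xs, ys, sub_batch_size=16)
accum_uneven = accumulated_grad(0.5, xs, ys, sub_batch_size=10)
```

In this sketch, a smaller `sub_batch_size` lowers the per-step memory footprint at the cost of more steps per update, which is the kind of trade-off a sharing-aware scheduler would have to weigh when co-locating two jobs.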

Keywords:

Computer science; Scheduling (production processes); Deep learning; GPU cluster; Shared resource; Parallel computing; Artificial intelligence; Distributed computing; CUDA; Operating system; Mathematical optimization

Metrics

FWCI (Field Weighted Citation Impact): 1.28

Citation Normalized Percentile: 0.78

Topics

Stochastic Gradient Optimization Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Scheduling and Optimization Algorithms

Physical Sciences → Engineering → Industrial and Manufacturing Engineering

IoT and Edge/Fog Computing

Physical Sciences → Computer Science → Computer Networks and Communications

Related Documents

Multi-Tenant Deep Learning Acceleration with Competitive GPU Resource Sharing

Optimize resource scheduling in multi-tenant clusters at scale

Online Scheduling of Distributed Machine Learning Jobs for Incentivizing Sharing in Multi-Tenant Systems

On scheduling ring-all-reduce learning jobs in multi-tenant GPU clusters with communication contention

Elastic Deep Learning in Multi-Tenant GPU Clusters