As Deep Learning (DL) continues to drive a variety of applications in edge and cloud data centers, co-locating multiple DL models on the same GPU has become a widely deployed way to improve resource utilization and accelerate execution. For example, a self-driving system hosts multiple tasks simultaneously (e.g., detection, classification, and segmentation) and expects them to run concurrently on a single device. However, our analysis shows that deploying compound DNN models for multiple tenants on one GPU raises two issues. First, because of heterogeneous model structures and skewed data distributions, the co-located models produce highly imbalanced computing workloads. Second, current GPU scheduling methods lack effective resource allocation for such workloads. To address these issues, we propose a novel resource allocation method, competitive resource sharing, which benefits parallel model execution, together with the concept of a "virtual resource" that characterizes and guides practical per-task resource utilization and allocation. Our experiments demonstrate that DNN computing throughput can be improved by $2.16\times \sim 2.80\times$ in various multi-tenant scenarios.
Yongbo Yu, Fuxun Yu, Zirui Xu, Di Wang, Minjia Zhang, Ang Li, Shawn Bray, Chenchen Liu, Xiang Chen
Yu, Yongbo; Yu, Fuxun; Xu, Zirui; Wang, Di; Zhang, Minjia; Li, Ang; Liu, ChenChen; Tian, Zhi; Chen, Xiang
Yongbo Yu, Fuxun Yu, Zirui Xu, Di Wang, Minjia Zhang, Ang Li, Chenchen Liu, Zhi Tian, Xiang Chen