MapReduce has emerged as a powerful tool for distributed and scalable processing of voluminous data. For skewed input data, load balancing among the MapReduce worker nodes is necessary to minimize the overall finishing time, but it can incur massive data movement in a data center network. In this paper, we examine, for the first time, the problem of data-center-network-aware load balancing in the shuffle subphase of MapReduce. Unlike earlier studies, which generally assume that the network inside a data center has negligible delay and infinite capacity, we account for the traffic and bottlenecks in real data center networks by introducing constraints on available network bandwidth, and we demonstrate that the resulting problem can be decomposed into two subproblems, one for network flow and one for load balancing. We present effective solutions to both, which together yield a complete solution for near-optimal data-center-network-aware load balancing. We also develop a much simpler greedy algorithm with comparable performance for fast implementation in practice. The effectiveness of our solution is demonstrated on synthetic and real public datasets.
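To make the trade-off between balanced load and data movement concrete, the following is a minimal illustrative sketch of a greedy, bandwidth-aware assignment of key groups to reducer nodes. It is not the algorithm developed in the paper. The cost model (transfer time plus processing time), the function name `greedy_assign`, and the parameters `groups`, `bandwidth`, and `rate` are all assumptions made for illustration.

```python
def greedy_assign(groups, bandwidth, rate=1.0):
    """Greedily assign key groups to reducer nodes (illustrative only).

    groups:    list of dicts; groups[g][node] is the amount of data for
               key group g that already resides on `node` (hypothetical model).
    bandwidth: dict mapping node -> available inbound bandwidth
               (data units per time unit, assumed fixed).
    rate:      processing rate of every node (assumed identical).
    Returns (assignment, finish), where assignment[g] is the chosen node
    and finish[node] is that node's estimated finishing time.
    """
    finish = {node: 0.0 for node in bandwidth}
    assignment = [None] * len(groups)

    # Consider the largest key groups first, as in longest-processing-time scheduling.
    order = sorted(range(len(groups)),
                   key=lambda g: sum(groups[g].values()),
                   reverse=True)

    for g in order:
        total = sum(groups[g].values())
        best_node, best_time = None, float("inf")
        for node in finish:
            # Data not already on this node must cross the network.
            remote = total - groups[g].get(node, 0.0)
            t = finish[node] + remote / bandwidth[node] + total / rate
            if t < best_time:
                best_node, best_time = node, t
        assignment[g] = best_node
        finish[best_node] = best_time

    return assignment, finish


if __name__ == "__main__":
    # Three skewed key groups spread over two nodes "a" and "b" (made-up numbers).
    groups = [{"a": 8, "b": 2}, {"a": 1, "b": 5}, {"a": 2, "b": 2}]
    bandwidth = {"a": 2.0, "b": 2.0}
    assignment, finish = greedy_assign(groups, bandwidth)
    print(assignment, max(finish.values()))
```

In this toy model, a group is placed on the node that would finish it earliest once both the network transfer and the processing time are counted, so a node holding most of a group's data is favored when bandwidth is scarce.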