Google's MapReduce has gained significant popularity as a platform for large-scale distributed data processing. Hadoop [1] is an open-source implementation of the MapReduce [11] framework. It was originally developed to operate in a single-cluster environment and could not be leveraged for distributed data processing across federated clusters. In a federation of clusters connected by high-speed networks, computing resources may be provisioned from any cluster in the federation. Placing each map task close to its data split is critical to Hadoop's performance. In this work, we add network awareness to Hadoop's scheduling of map tasks over federated clusters. We observe a 12% to 15% reduction in execution time with Hadoop's FIFO and FAIR schedulers for varying workloads.
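The idea of network-aware map task placement can be illustrated with a minimal sketch. It is not the scheduler described in this work. It assumes a hypothetical per-cluster-pair bandwidth table and estimates the time needed to move a data split to each free node, then picks the cheapest one. The names `Node`, `transfer_cost`, `pick_node`, and `bandwidth_mbps` are illustrative assumptions and are not part of Hadoop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A worker node and the federated cluster it belongs to (hypothetical model)."""
    name: str
    cluster: str


def transfer_cost(src, dst, split_mb, bandwidth_mbps):
    """Estimated seconds to move a split of split_mb megabytes from src to dst.

    A node-local read costs 0. Otherwise the cost is the split size in megabits
    divided by the assumed bandwidth between the two clusters.
    """
    if src == dst:
        return 0.0
    return split_mb * 8 / bandwidth_mbps[(src.cluster, dst.cluster)]


def pick_node(replicas, free_nodes, split_mb, bandwidth_mbps):
    """Return the free node with the lowest estimated cost to reach any replica."""
    return min(
        free_nodes,
        key=lambda n: min(
            transfer_cost(r, n, split_mb, bandwidth_mbps) for r in replicas
        ),
    )


# Illustrative federation: two clusters, with assumed bandwidths in Mbit/s.
a1, a2, b1 = Node("a1", "A"), Node("a2", "A"), Node("b1", "B")
bandwidth_mbps = {
    ("A", "A"): 1000, ("B", "B"): 1000,
    ("A", "B"): 100, ("B", "A"): 100,
}

# The split lives on a1; a2 (same cluster) is preferred over b1 (remote cluster).
chosen = pick_node([a1], [a2, b1], split_mb=64, bandwidth_mbps=bandwidth_mbps)
```

A scheduler built along these lines would fall back to a remote cluster only when no node in the cluster holding the data has a free slot, which is the situation in which the bandwidth between clusters matters most.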
Jordà Polo, Claris Castillo, David Carrera, Yolanda Becerra, Ian Whalley, Małgorzata Steinder, Jordi Torres, Eduard Ayguadé
Bing Tang, Qi Xie, Hong S. He, Gilles Fedak
Shubbhi Taneja, Yi Zhou, Mohammed Alghamdi, Xiao Qin
Chris X. Cai, Shayan Saeed, Indranil Gupta, Roy H. Campbell, Franck Le
Radheshyam Nanduri, Nitesh Maheshwari, A. Reddyraja, Vasudeva Varma