Most public and private cloud providers have experienced failures in at least one of their services, and such failures can affect numerous applications and websites. Failure analysis is therefore one of the most critical steps in understanding the causes of different types of failures and remediating them. It relies on monitoring the most significant system metrics in order to study system behavior and how frequently it changes. The monitored data are then stored in log files and used for analysis and prediction tasks. In this paper, we focus primarily on analyzing and interpreting the characteristic behavior of finished and failed jobs in relation to the physically available resources, using a publicly available dataset, the Google cluster trace. The primary objective of our work is to improve the understanding of job failure in cloud computing environments. Our results show a clear correlation between failed jobs and requested resources, including memory, CPU, and disk space. Based on these results, we find that many techniques can be applied to increase the reliability and availability of cloud applications, such as developing scheduling algorithms, predicting job failure, limiting task resubmission, or changing the priority policies.
Mohammad S. Jassas, Qusay H. Mahmoud