Failure Analysis and Characterization of Scheduling Jobs in Google Cluster Trace

Mohammad S. Jassas; Qusay H. Mahmoud

doi:10.1109/iecon.2018.8592822

JOURNAL ARTICLE

Year: 2018

Abstract

Most public and private cloud providers have experienced a failure in one of their services, and such failures can affect numerous applications and websites. Failure analysis is therefore one of the most critical steps in understanding the causes of different types of failure and remediating them. It is based on monitoring the most significant system metrics in order to study the behavior of the system and the frequency of changes in it; the monitored data are then stored in log files for use in analysis and prediction tasks. In this paper, we focus primarily on analyzing and interpreting the characteristic behavior of finished and failed jobs in relation to physically available resources, using a publicly available dataset, the Google cluster trace. The primary objective of our work is to improve the understanding of job failure in cloud computing environments. Our results show a clear correlation between failed jobs and requested resources, including memory, CPU, and disk space. Based on these results, we find that many techniques can be applied to increase the reliability and availability of cloud applications, such as developing scheduling algorithms, predicting job failure, limiting task resubmission, or changing the priority policies.
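As a rough illustration of the kind of comparison the abstract describes, the sketch below groups jobs by requested memory and computes the fraction that failed in each group. It is not the authors' method: the records are hand-made placeholders rather than trace data, and the tuple layout, the `failure_rate_by_bucket` helper, and the 0.2 threshold are all assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical, hand-made records (NOT taken from the Google cluster trace).
# Assumed layout: (job_id, final_status, requested_cpu, requested_memory),
# with resource requests expressed as normalized fractions.
JOBS = [
    ("j1", "finished", 0.05, 0.02),
    ("j2", "failed", 0.30, 0.25),
    ("j3", "finished", 0.10, 0.04),
    ("j4", "failed", 0.40, 0.30),
    ("j5", "finished", 0.20, 0.22),
    ("j6", "failed", 0.08, 0.03),
]


def failure_rate_by_bucket(jobs, field_index, threshold):
    """Split jobs into 'low'/'high' buckets on one requested-resource
    field and return the fraction of failed jobs in each bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [failed, total]
    for job in jobs:
        bucket = "high" if job[field_index] >= threshold else "low"
        counts[bucket][1] += 1
        if job[1] == "failed":
            counts[bucket][0] += 1
    return {b: failed / total for b, (failed, total) in counts.items()}


# Compare failure rates for low vs. high requested memory (index 3).
rates = failure_rate_by_bucket(JOBS, field_index=3, threshold=0.2)
print(rates)
```

A real analysis would read job and task events from the trace files and would likely use finer-grained buckets or a correlation measure; the two-bucket split here only shows the shape of the computation.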

Keywords:

Computer science; Cloud computing; Scheduling (production processes); TRACE (psycholinguistics); Limiting; Reliability (semiconductor); Cluster (spacecraft); Task (project management); Distributed computing; Operating system; Engineering

Metrics

FWCI (Field Weighted Citation Impact): 4.46

Citation Normalized Percentile: 0.95

Topics

Cloud Computing and Resource Management

Physical Sciences → Computer Science → Information Systems

IoT and Edge/Fog Computing

Physical Sciences → Computer Science → Computer Networks and Communications

Software System Performance and Reliability

Physical Sciences → Computer Science → Computer Networks and Communications

Related Documents

Failure Characterization and Prediction of Scheduling Jobs in Google Cluster Traces

Google Cloud Trace: Characterization of Terminated Jobs

Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study

Failure Prediction of Jobs in Compute Clouds: A Google Cluster Case Study

Task shape classification and workload characterization of Google cluster trace