JOURNAL ARTICLE

Reliability-aware resource management for computational grid/cluster environments

Abstract

The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the collaborative community and should be guaranteed. In particular, in existing environments where job sites are Beowulf cluster systems, the failure of a single service node may cause an outage of the whole system. Current grid fault tolerance techniques address these issues only in an opportunistic fashion. Thus, there is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. Our grid-aware cluster resource management effort was motivated by the fact that clusters have become popular job sites in the computational grid environment. We propose a solution that handles fault tolerance at the service level, complementing the task-based solutions of some recent studies. We discuss various service availability issues related to the grid and present preliminary results obtained while implementing a smart failover and transparent job-queue replication mechanism and an automated grid installation package. Our results indicate that the benefits of the proof-of-concept framework outweigh its acceptable overhead.
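The abstract does not describe how the failover and job-queue replication mechanism is built, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes an active/standby pair of head nodes and mirrors every queue operation to the standby, so that a submitted job survives the loss of the active node. The names `HeadNode` and `ReplicatedJobQueue` are hypothetical, and the `alive` flag stands in for a real heartbeat check.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HeadNode:
    """Hypothetical service (head) node holding one copy of the job queue."""
    name: str
    alive: bool = True  # stand-in for the result of a heartbeat probe
    queue: deque = field(default_factory=deque)


class ReplicatedJobQueue:
    """Toy active/standby pair: queue operations are mirrored to the standby."""

    def __init__(self, primary: HeadNode, standby: HeadNode):
        self.active = primary
        self.standby = standby

    def _failover_if_needed(self) -> None:
        # Promote the standby when the active node is found to be down.
        if not self.active.alive:
            if not self.standby.alive:
                raise RuntimeError("no healthy head node available")
            self.active, self.standby = self.standby, self.active

    def submit(self, job_id: str) -> None:
        self._failover_if_needed()
        self.active.queue.append(job_id)
        # Mirror the submission so the job exists on both nodes.
        if self.standby.alive:
            self.standby.queue.append(job_id)

    def next_job(self):
        self._failover_if_needed()
        if not self.active.queue:
            return None
        job_id = self.active.queue.popleft()
        # Mirror the dequeue so a promoted standby does not re-run the job.
        if self.standby.alive and job_id in self.standby.queue:
            self.standby.queue.remove(job_id)
        return job_id
```

In this sketch, marking the primary node as not alive between two calls causes the next `submit` or `next_job` to switch to the standby, which already holds every pending job. A real deployment would also need to resynchronize a recovered node and to guard against both nodes believing they are active, neither of which is modeled here.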

Keywords:
Computer science, Failover, Distributed computing, Grid computing, Fault tolerance, Grid, Replication (statistics), Service (business), Reliability (semiconductor), Resource (disambiguation), Overhead (engineering), Node (physics), Resource management (computing), Database, Computer network, Operating system

Metrics

Cited By: 15
FWCI (Field Weighted Citation Impact): 1.59
Refs: 15
Citation Normalized Percentile: 0.85

Topics

Distributed and Parallel Computing Systems
Physical Sciences →  Computer Science →  Computer Networks and Communications
Distributed systems and fault tolerance
Physical Sciences →  Computer Science →  Computer Networks and Communications
Parallel Computing and Optimization Techniques
Physical Sciences →  Computer Science →  Hardware and Architecture

Related Documents

JOURNAL ARTICLE

Resource discovery and management in computational GRID environments

Alan Bradley, Kevin Curran, Gerard Parr

Journal: International Journal of Communication Systems, Year: 2005, Vol: 19 (6), Pages: 639-657
JOURNAL ARTICLE

Resource-Aware Distributed Scheduling Strategies for Large-Scale Computational Cluster/Grid Systems

Siva Viswanathan, Bharadwaj Veeravalli, Thomas G. Robertazzi

Journal: IEEE Transactions on Parallel and Distributed Systems, Year: 2007, Vol: 18 (10), Pages: 1450-1461
BOOK-CHAPTER

Power Consumption Aware Cluster Resource Management

Simon Kiertscher, Bettina Schnor, Jörg Zinke

Advances in Environmental Engineering and Green Technologies book series, Year: 2013, Pages: 20-37