Journal Article

An Abstract Interface for System Software on Large-Scale Clusters

Juan Fernández, Eitan Frachtenberg, Fabrizio Petrini and José-Carlos Sancho

in The Computer Journal

Published on behalf of British Computer Society

Volume 49, issue 4, pages 454-469
ISSN: 0010-4620
Published online May 2006 | e-ISSN: 1460-2067 | DOI: http://dx.doi.org/10.1093/comjnl/bxl020
Scalable management of distributed resources is one of the major challenges when building large-scale clusters for high-performance computing. This task includes transparent fault tolerance, efficient deployment of resources and support for all the needs of parallel applications: parallel I/O, deterministic behavior and responsiveness. These challenges may seem daunting with commodity hardware and operating systems, since neither was designed to support a global, single management view of a large-scale system. In this paper, we propose and demonstrate an abstract network interface in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, which can be thought of as a coarse-grain SIMD operating system, can allow commodity clusters to grow to thousands of nodes while still retaining the usability and performance of the single-node workstation.

Keywords: Cluster computing; cluster operating system; fault tolerance; network hardware; resource management

Journal Article. 10203 words. Illustrated.

Subjects: Computer Science
