Computation Checkpointing & Migration (Hardcover, New)

Computational clusters have long provided a mechanism for accelerating high-performance computing (HPC) applications. As today's supercomputers exceed the petaflop scale, however, they are also exhibiting an increase in heterogeneity. This heterogeneity spans a range of technologies, from multiple operating systems to hardware accelerators and novel architectures. Because of the exceptional acceleration some of these heterogeneous architectures provide, they are being embraced as viable tools for HPC applications. Given the scale of today's supercomputers, it is clear that scientists must consider the use of fault tolerance in their applications. This is particularly true as computational clusters with hundreds or thousands of processors become ubiquitous in large-scale scientific computing, leading to lower mean times to failure. This forces systems to deal effectively with the possibility of arbitrary and unexpected node failure. In this book the authors address the issue of fault tolerance via checkpointing. They discuss existing strategies for providing rollback recovery to applications, both via MPI at the user level and through application-level techniques. Checkpointing itself has been studied extensively in the literature, including in the authors' own works. Here they give a general overview of checkpointing and how it is implemented. More importantly, they describe strategies to improve the performance of checkpointing, particularly in the case of distributed systems.

R4,861




Product Details

General

Imprint

Nova Science Publishers

Country of origin

United States

Release date

July 2010

Availability

Expected to ship within 12 - 17 working days

First published

2010

Authors

, ,

Dimensions

180 x 260 x 16mm (L x W x T)

Format

Hardcover

Pages

141

Edition

New

ISBN-13

978-1-60741-840-5

Barcode

9781607418405

LSN

1-60741-840-1


