Abstract |
A method and apparatus that enable quick recovery from the failure of one or more nodes, applications, and/or communication links in a distributed computing environment, such as a cluster, or quick restoration of an application state. Recovery or restoration is facilitated by regularly saving persistent images of the in-memory checkpoint data and/or of distributed shared memory segments, using snapshots of the committed checkpoint data. When one or more nodes fail, the snapshots can be read and used either to restart the application in the most recently saved state prior to the failure or to roll back the application to an earlier state.
|
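
The snapshot-and-restore cycle described in the abstract can be illustrated with a minimal single-process sketch. It is not the claimed apparatus: the `SnapshotStore` class, the JSON file layout, and the sequence-numbered file names are all assumptions made for illustration, and a real cluster implementation would also have to coordinate snapshots across nodes and shared memory segments.

```python
import json
import os
import tempfile


class SnapshotStore:
    """Hypothetical helper: persists committed checkpoint data as numbered snapshots."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, seq):
        # Assumed naming scheme: snapshot-00000001.json, snapshot-00000002.json, ...
        return os.path.join(self.directory, f"snapshot-{seq:08d}.json")

    def sequence_numbers(self):
        """Return the sequence numbers of all complete snapshots, oldest first."""
        seqs = []
        for name in os.listdir(self.directory):
            if name.startswith("snapshot-") and name.endswith(".json"):
                seqs.append(int(name[len("snapshot-"):-len(".json")]))
        return sorted(seqs)

    def save(self, committed_state):
        """Persist a copy of the committed in-memory checkpoint data."""
        seqs = self.sequence_numbers()
        seq = seqs[-1] + 1 if seqs else 1
        # Write to a temporary file and rename it into place, so a crash
        # during the write never leaves a partial snapshot under a valid name.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(committed_state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self._path(seq))
        return seq

    def load(self, seq=None):
        """Load the most recent snapshot, or an earlier one to roll back."""
        seqs = self.sequence_numbers()
        if not seqs:
            raise FileNotFoundError("no snapshots available")
        if seq is None:
            seq = seqs[-1]
        with open(self._path(seq)) as f:
            return json.load(f)
```

In this sketch, calling `load()` with no argument corresponds to restarting in the most recently saved state, while passing an earlier sequence number corresponds to rolling the application back.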