摘要 |
A system that provides fault tolerance in a parallel processing system. During operation, the system executes a parallel computing application in parallel across a subset of computing nodes within the parallel processing system. During this process, the system monitors telemetry signals within the parallel processing system. The system analyzes the monitored telemetry signals to determine if the probability that the parallel processing system will fail is increasing. If so, the system increases the frequency at which the parallel computing application is checkpointed, wherein a checkpoint includes the state of the parallel computing application at each computing node within the parallel processing system. |