Abstract |
Embodiments of a system that adjusts a checkpointing frequency in a distributed computing system that executes multiple jobs are described. During operation, the system receives signals associated with the operation of computing nodes in the distributed computing system. Then, the system determines risk metrics for the computing nodes by using a pattern-recognition technique to identify anomalous signals among the received signals. Next, the system adjusts a checkpointing frequency of a given checkpoint for a given computing node based on a comparison of a risk metric associated with the given computing node and a threshold, thereby implementing holistic fault tolerance, in which prediction and prevention of potential faults occur across the distributed computing system. |
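
The sequence described in the abstract (receive signals, derive a risk metric, compare it with a threshold, adjust the checkpointing frequency) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `risk_metric` and `adjust_checkpoint_interval`, the numeric defaults, and the use of a simple z-score against a healthy baseline as a stand-in for the pattern-recognition technique, which the abstract does not specify. The sketch also assumes that a risk metric above the threshold leads to more frequent checkpoints; the abstract states only that the adjustment is based on the comparison.

```python
from statistics import mean, stdev


def risk_metric(signal, baseline):
    """Hypothetical risk metric: the largest absolute z-score of the observed
    samples relative to a baseline recorded during healthy operation.
    This is a simple stand-in, not the pattern-recognition technique itself.
    The baseline must contain at least two samples."""
    if not signal:
        return 0.0
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A constant baseline: any deviation at all is treated as anomalous.
        return 0.0 if all(x == mu for x in signal) else float("inf")
    return max(abs(x - mu) / sigma for x in signal)


def adjust_checkpoint_interval(default_interval_s, risk, threshold,
                               factor=2.0, min_interval_s=60.0):
    """Return the checkpoint interval (in seconds) for one computing node.

    Assumption: when the risk metric exceeds the threshold, checkpoints are
    taken more often (the interval is divided by `factor`, but never drops
    below `min_interval_s`). Otherwise the default interval is kept."""
    if risk > threshold:
        return max(min_interval_s, default_interval_s / factor)
    return default_interval_s


if __name__ == "__main__":
    # Hypothetical telemetry: temperature readings from one node.
    baseline = [50.0, 51.0, 49.0, 50.0, 50.0]
    observed = [50.0, 55.0]
    risk = risk_metric(observed, baseline)
    print(adjust_checkpoint_interval(600.0, risk, threshold=3.0))
```

In this sketch each node is evaluated independently; a deployment across many nodes would simply repeat the comparison per node with that node's own signals and baseline.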