Abstract |
Mechanisms are provided for preserving data when one or more nodes in a distributed computing system experience an error. In one embodiment, when an error occurs, an error event is identified. Based on this error event, a set of identified execution units is suspended and a set of identified data is collected. All suspended execution units are then released, i.e., allowed to continue execution from the point at which they were suspended. The data collected during suspension is then used to diagnose the cause of the error.
|
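The suspend, collect, and release sequence described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: threads in a single process stand in for execution units, a per-unit lock stands in for whatever suspension mechanism a real distributed system would use, and the names `ExecutionUnit`, `POLICY`, `handle_error`, `run_demo`, and the `io_timeout` event are hypothetical and do not come from the source.

```python
import threading


class ExecutionUnit(threading.Thread):
    """Hypothetical execution unit: a thread that does its work in small steps."""

    def __init__(self, name, steps):
        super().__init__(name=name)
        self.steps = steps
        self.counter = 0
        # Assumption: holding this lock from outside suspends the unit
        # at the boundary between two steps.
        self.step_lock = threading.Lock()

    def run(self):
        while True:
            with self.step_lock:  # blocks here while the unit is suspended
                if self.counter >= self.steps:
                    return
                self.counter += 1


# Hypothetical policy table: error event -> (units to suspend, fields to collect).
POLICY = {
    "io_timeout": (["reader", "writer"], ["counter"]),
}


def handle_error(event, units):
    """Suspend the identified units, collect the identified data, release the units."""
    unit_names, fields = POLICY[event]
    targets = [units[name] for name in unit_names]

    # 1. Suspend: take each identified unit's step lock.
    for unit in targets:
        unit.step_lock.acquire()
    try:
        # 2. Collect: snapshot the identified fields while the units are paused.
        snapshot = {
            unit.name: {field: getattr(unit, field) for field in fields}
            for unit in targets
        }
    finally:
        # 3. Release: each unit continues from the step where it was paused.
        for unit in targets:
            unit.step_lock.release()
    return snapshot


def run_demo(steps=50_000):
    units = {
        name: ExecutionUnit(name, steps)
        for name in ("reader", "writer", "logger")
    }
    for unit in units.values():
        unit.start()
    snapshot = handle_error("io_timeout", units)
    for unit in units.values():
        unit.join()
    final = {name: unit.counter for name, unit in units.items()}
    return snapshot, final
```

In this sketch the `logger` unit is never paused because the policy entry for the event does not name it, which mirrors the idea that only the identified execution units are suspended. The returned snapshot is the data that would be handed to a later diagnosis step; the counter values it contains depend on thread scheduling and will vary from run to run.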