摘要 |
A system and method for distributed fault detection. In an exemplary method, unplanned application exits and crashes may be detected at a node local level. Further, application hangs may be detected using at least one of a script and a binary at the node local level. Also, node crashes and operating system crashes may be detected using node to node heart-beating. |