摘要 |
The present invention provides a system and method of detecting a process failure and a network failure in a distributed system. The distributed system includes at least two processes, each executing on a host, operable to transmit messages (i.e., heartbeats) to each other on a plurality of networks in the distributed system. A process in the system is operable to execute a network failure algorithm for detecting failure of a network in the system. The process failure algorithm includes calculating a difference in the period of time to receive a heartbeat on a first network from a process and a period of time to receive a heartbeat on a second network from the process. If the difference exceeds a network failure threshold, the second network is suspected of failing. A process in the system is also operable to execute a process failure algorithm. The process failure algorithm includes detecting receipt of a heartbeat from a process on any one of a plurality of networks in the system within a network failure time limit. If a heartbeat is not received on any of the networks, the process is suspected of failing.
|