Title of invention: System and method for detecting process and network failures in a distributed system having multiple independent networks
Abstract: The present invention provides a system and method of detecting a process failure and a network failure in a distributed system. The distributed system includes at least two processes, each executing on a host, operable to transmit messages (i.e., heartbeats) to each other on a plurality of networks in the distributed system. A process in the system is operable to execute a network failure algorithm for detecting failure of a network in the system. The network failure algorithm includes calculating a difference between the period of time to receive a heartbeat on a first network from a process and the period of time to receive a heartbeat on a second network from the same process. If the difference exceeds a network failure threshold, the second network is suspected of failing. A process in the system is also operable to execute a process failure algorithm. The process failure algorithm includes detecting receipt of a heartbeat from a process on any one of a plurality of networks in the system within a process failure time limit. If a heartbeat is not received on any of the networks within that limit, the process is suspected of failing.
Publication number: US6782489(B2) Publication date: 2004.08.24
Application number: US20010833771 Filing date: 2001.04.13
Applicant: HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P. Inventor: FLEMING ROGER A.
Classification: G06F11/00;H04B1/74;H04L12/24;H04L12/26;(IPC1-7):G06F11/00 Main classification: G06F11/00
Agency: Agent:
Main claim:
Address: