摘要 |
Techniques for monitoring distributed software health and membership of nodes and software components operating in a compute cluster are disclosed. In one embodiment, each node in the compute cluster operates a watchdog monitoring component in addition to software operating components. The watchdogs are provided with a list of all nodes in a compute cluster that identifies every node's neighboring nodes. Each watchdog checks the health of one of its neighboring node, ensuring that this neighboring node is healthy and is operating successfully. Additionally, each watchdog verifies the cluster membership of its other neighboring nodes to ensure that the cluster is operating an adequate number of operating nodes, and that an adequate number of watchdogs are present in the cluster. If an unhealthy or non-member node is identified, the watchdog may initiate corrective action and attempt to restore the node to a correct operational state.
|