Abstract |
The monitoring of a process on a monitored node by a monitoring node is often designed as a tightly coupled interaction, but such coupling may limit the reuse of monitoring resources and processes and increase the administrative complexity of the monitoring scenario. Instead, fault detection and recovery may be provided as a non-proprietary service, in which a set of monitored nodes, together performing a set of processes, may register for monitoring by a set of monitoring nodes. If a process or an entire monitored node fails, the monitoring nodes may collaborate to initiate a restart of the affected processes on the same or a substitute monitored node (possibly in the state last reported by the respective processes). Additionally, the failure of a monitoring node may be detected, and all monitored nodes assigned to the failed monitoring node may be reassigned to a substitute monitoring node. |
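
The registration and recovery flow summarized above may be illustrated with a minimal in-memory sketch. It is not the claimed implementation: the class `MonitoringService`, its method names, and the least-loaded assignment policy are all hypothetical, and failure detection itself (for example, by heartbeats) is assumed to happen elsewhere and is represented only by explicit calls to the recovery methods.

```python
from typing import Dict, List, Optional


class MonitoringService:
    """Hypothetical registry decoupling monitoring nodes from monitored nodes."""

    def __init__(self, monitoring_nodes: List[str]) -> None:
        # monitoring node -> monitored nodes currently assigned to it
        self.assignments: Dict[str, List[str]] = {m: [] for m in monitoring_nodes}
        # monitored node -> {process name: last reported state}
        self.processes: Dict[str, Dict[str, Optional[str]]] = {}

    def _least_loaded(self) -> str:
        # Assumed policy: pick the monitoring node with the fewest assignments.
        # Assumes at least one monitoring node is still available.
        return min(self.assignments, key=lambda m: len(self.assignments[m]))

    def register(self, node: str, process_names: List[str]) -> str:
        """Register a monitored node and its processes; return its monitor."""
        monitor = self._least_loaded()
        self.assignments[monitor].append(node)
        self.processes[node] = {name: None for name in process_names}
        return monitor

    def report_state(self, node: str, process: str, state: str) -> None:
        """Record the state last reported by a process."""
        self.processes[node][process] = state

    def recover_monitored_node(
        self, failed: str, substitute: str
    ) -> Dict[str, Optional[str]]:
        """Move a failed node's processes to a substitute, keeping last states."""
        restarted = self.processes.pop(failed)
        for monitored in self.assignments.values():
            if failed in monitored:
                monitored.remove(failed)
        if substitute not in self.processes:
            self.register(substitute, [])
        self.processes[substitute].update(restarted)
        return restarted

    def recover_monitoring_node(self, failed: str) -> Dict[str, str]:
        """Reassign every monitored node of a failed monitor to a substitute."""
        orphans = self.assignments.pop(failed)
        moved: Dict[str, str] = {}
        for node in orphans:
            monitor = self._least_loaded()
            self.assignments[monitor].append(node)
            moved[node] = monitor
        return moved
```

In this sketch, a caller would construct the service with a list of monitoring node names, call `register` for each monitored node, and invoke `recover_monitored_node` or `recover_monitoring_node` once a failure has been detected. Keeping the assignment table separate from the per-process state is what allows either kind of node to be replaced without touching the other.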