Title of invention: Fault monitor for restarting failed instances of the fault monitor
Abstract: A computer system having a fault-tolerance framework in an extendable computer architecture. The computer system is formed of clusters of nodes where each node includes computer hardware and operating system software for executing jobs that implement the services provided by the computer system. Jobs are distributed across the nodes under control of a hierarchical resource management unit. The resource management unit includes hierarchical monitors that monitor and control the allocation of resources. In the resource management unit, a first monitor, at a first level, monitors and allocates elements below the first level. A second monitor, at a second level, monitors and allocates elements at the first level. The framework is extendable from the hierarchy of the first and second levels to higher levels where monitors at higher levels each monitor lower-level elements in a hierarchical tree. If a failure occurs down the hierarchy, a higher-level monitor restarts an element at a lower level. If a failure occurs up the hierarchy, a lower-level monitor restarts an element at a higher level. Each of the monitors includes termination code that causes an element to terminate if duplicate elements have been restarted for the same job. The termination code in one embodiment includes suicide code whereby an element will self-destruct when the element detects that it is an unnecessary duplicate element.
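The restart and termination behavior summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The names Element, registry, watches, check, and terminate_if_duplicate are hypothetical, failures are simulated by clearing an alive flag, and the rule that the oldest live instance survives is an assumption made here only to give the duplicate check a deterministic outcome.

```python
from itertools import count

_next_id = count(1)  # hypothetical source of unique instance ids


class Element:
    """A job or monitor instance; `watches` lists the job names it monitors."""

    def __init__(self, job, registry, watches=()):
        self.job = job
        self.registry = registry  # job name -> every instance started for it
        self.watches = list(watches)
        self.alive = True
        self.instance_id = next(_next_id)
        registry.setdefault(job, []).append(self)

    def check(self):
        """Restart every watched job that has no live instance left."""
        restarted = []
        for job in self.watches:
            instances = self.registry.get(job, [])
            if instances and not any(e.alive for e in instances):
                # Reuse the watch list of the most recent (failed) instance.
                template = instances[-1]
                restarted.append(Element(job, self.registry, template.watches))
        return restarted

    def terminate_if_duplicate(self):
        """Suicide check: exit if an older live instance of this job exists.

        Keeping the oldest instance is an assumption of this sketch.
        """
        live = [e for e in self.registry[self.job] if e.alive]
        if self.alive and any(e.instance_id < self.instance_id for e in live):
            self.alive = False
            return True
        return False


registry = {}
top = Element("monitor-2", registry, watches=["monitor-1"])
mid = Element("monitor-1", registry, watches=["job-a", "monitor-2"])
job = Element("job-a", registry)

# Failure down the hierarchy: the second-level monitor restarts the first-level one.
mid.alive = False
new_mid = top.check()[0]

# Failure up the hierarchy: the first-level monitor restarts the second-level one.
top.alive = False
new_top = new_mid.check()[0]

# A duplicate of job-a is started by mistake and removes itself.
dup = Element("job-a", registry)
dup.terminate_if_duplicate()
```

In this sketch each monitor watches both the elements below it and the monitor above it, so a single surviving monitor is enough to rebuild its neighbors in either direction, and the duplicate check is what keeps repeated restarts from leaving two live instances of the same job.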
Publication number: US6718486(B1) Publication date: 2004.04.06
Application number: US20000624747 Filing date: 2000.07.24
Applicant: LOVEJOY DAVID E. Inventors: ROSELLI DREW SHAFFER;BLASER RICO;LECHNER MIKEL CARL
Classification: G06F11/00;G06F11/14;(IPC1-7):G06F11/00 Main classification: G06F11/00
Agency: Agent:
Main claim:
Address: