Title of invention: Method for software error recovery using consistent global checkpoints
Abstract: Disclosed is a method for error recovery in a multiprocessing computer system in which each process periodically takes checkpoints. In the event of a failure, a process can be rolled back to a prior checkpoint, and execution can continue from the checkpointed state. A monitor process monitors the execution of the processes. Upon the occurrence of a failure, a target set of checkpoints is identified, and the maximum consistent global checkpoint that includes the target set of checkpoints is computed. Each process is rolled back to its associated checkpoint in that consistent global checkpoint. Upon a subsequent occurrence of the same failure, a second set of checkpoints is identified, and the minimum consistent global checkpoint that includes the target set of checkpoints is computed. Each process is again rolled back to its associated checkpoint in that consistent global checkpoint. Upon another occurrence of the same failure, the system is rolled back further to a coordinated checkpoint. Also disclosed are novel methods for calculating the minimum and maximum consistent global checkpoints. In one embodiment, they are calculated by a central process. In another embodiment, they are calculated in a distributed fashion by the individual processes.
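The escalation described in the abstract can be illustrated with a minimal Python sketch. It is not the patented calculation: the patent's own methods are not reproduced in this record, so the sketch substitutes a brute-force search over all checkpoint combinations. The model is an assumption made for illustration: checkpoint `j` of a process precedes its interval `j`, a message is a tuple `(sender, send_interval, receiver, recv_interval)`, and a global checkpoint is taken to be consistent when it leaves no orphan message (one received before the receiver's checkpoint but sent after the sender's). All function names are hypothetical.

```python
from itertools import product

def is_consistent(cut, messages):
    """True if no message is an orphan with respect to the cut (assumed model)."""
    for sender, send_iv, receiver, recv_iv in messages:
        sent_after = send_iv >= cut[sender]          # send is not recorded by the cut
        received_before = recv_iv < cut[receiver]    # receive is recorded by the cut
        if sent_after and received_before:
            return False
    return True

def consistent_cuts(counts, messages, target):
    """Brute force: every consistent cut containing the target checkpoints."""
    ranges = [range(n) for n in counts]
    return [
        cut for cut in product(*ranges)
        if all(cut[p] == idx for p, idx in target.items())
        and is_consistent(cut, messages)
    ]

def max_consistent(counts, messages, target):
    """Latest consistent cut containing the target set, or None."""
    cuts = consistent_cuts(counts, messages, target)
    # The component-wise maximum of consistent cuts is itself consistent.
    return tuple(max(col) for col in zip(*cuts)) if cuts else None

def min_consistent(counts, messages, target):
    """Earliest consistent cut containing the target set, or None."""
    cuts = consistent_cuts(counts, messages, target)
    # The component-wise minimum of consistent cuts is itself consistent.
    return tuple(min(col) for col in zip(*cuts)) if cuts else None

def rollback_plan(occurrence):
    """Escalation order stated in the abstract, keyed by failure occurrence."""
    if occurrence == 1:
        return "maximum consistent global checkpoint"
    if occurrence == 2:
        return "minimum consistent global checkpoint"
    return "coordinated checkpoint"

# Two processes with three checkpoints each (indices 0..2).
# P0 sends after its checkpoint 1; P1 receives before its checkpoint 1.
counts = [3, 3]
messages = [(0, 1, 1, 0)]
latest = max_consistent(counts, messages, {1: 0})
earliest = min_consistent(counts, messages, {1: 0})
```

With the target set fixed to checkpoint 0 of process 1, the latest consistent cut rolls process 0 back the least and the earliest rolls it back the most, which mirrors the first two escalation steps. The brute-force search is exponential in the number of processes and is only meant to make the definitions concrete.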
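Publication number: US5630047(A)  Publication date: 1997.05.13
Application number: US19950526737  Filing date: 1995.09.12
Applicant: LUCENT TECHNOLOGIES INC.  Inventor: WANG, YI-MIN
Classification: G06F11/14;(IPC1-7):G06F11/16  Main classification: G06F11/14
Agency:  Agent:
Main claim:
Address: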