Title of invention DISTRIBUTED, FAULT-TOLERANT AND HIGHLY AVAILABLE COMPUTING SYSTEM
Abstract A method and system for achieving highly available, fault-tolerant execution of components in a distributed computing system, without requiring the writer of those components to explicitly write code (such as entity beans or database transactions) to make component state persistent. This is achieved by converting the intrinsically non-deterministic behavior of the distributed system into deterministic behavior, which enables state recovery by efficient checkpoint-replay techniques. The method comprises: adapting the execution environment to enable message communication among the components; automatically associating a deterministic timestamp with each message to be communicated from a sender component to a receiver component during program execution, the timestamp representing the estimated time of arrival of the message at the receiver component; and, at each component, tracking the state of that component during program execution and periodically checkpointing the state in a local storage device. Upon failure of a component, the component state is restored by recovering a recent stored checkpoint and re-executing the events that occurred since that checkpoint. Determinism is ensured because the re-executed receiving component processes the messages in the same order as their associated timestamps.
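The checkpoint-replay idea in the abstract can be illustrated with a minimal Python sketch. Everything named here (`Message`, `stamp`, `Component`, `demo`) is hypothetical and is not taken from the application. The sketch assumes that a timestamp is the sender's virtual send time plus a fixed estimated delay, that ties are broken by sender name, and that the messages processed since the last checkpoint survive the failure so they can be replayed. The abstract does not specify any of these details.

```python
import copy
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Message:
    timestamp: int                       # deterministic virtual arrival time
    sender: str                          # tie-breaker for equal timestamps
    payload: int = field(compare=False)  # not used for ordering


def stamp(sender, send_time, est_delay, payload):
    """Assumed rule: timestamp = sender's virtual time + estimated delay."""
    return Message(send_time + est_delay, sender, payload)


class Component:
    def __init__(self):
        self.state = {"total": 0, "history": []}
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []                    # messages processed since the checkpoint

    def handle(self, msg):
        # Order-sensitive update: a different order would give a different state.
        self.state["total"] = self.state["total"] * 2 + msg.payload
        self.state["history"].append((msg.timestamp, msg.sender))

    def run(self, messages):
        # Process in timestamp order, regardless of physical arrival order.
        heap = list(messages)
        heapq.heapify(heap)
        while heap:
            msg = heapq.heappop(heap)
            self.log.append(msg)
            self.handle(msg)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []

    def recover(self, replay_source):
        # Restore the last checkpoint, then replay the later messages.
        self.state = copy.deepcopy(self.checkpoint)
        self.log = []
        self.run(replay_source)


def demo():
    a = stamp("A", send_time=10, est_delay=5, payload=1)  # timestamp 15
    b = stamp("B", send_time=3, est_delay=4, payload=2)   # timestamp 7
    c = stamp("A", send_time=20, est_delay=5, payload=3)  # timestamp 25

    comp = Component()
    comp.run([a, b])            # arrives as a, b but is processed as b, a
    comp.take_checkpoint()
    comp.run([c])
    before = copy.deepcopy(comp.state)

    survived = list(comp.log)   # assumed to survive the failure
    comp.state = None           # simulate the crash
    comp.recover(survived)
    return before, comp.state
```

Calling `demo()` returns the state just before the simulated failure and the state after recovery. The two are equal because replay processes the same messages in the same timestamp order. A real system would also need durable checkpoint storage and a way to obtain the messages to replay, which this sketch leaves out.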
Publication number WO2008133818(A1) Publication date 2008.11.06
Application number WO2008US04866 Filing date 2008.04.15
Applicant INTERNATIONAL BUSINESS MACHINES CORPORATION; DORAI, CHITRA; STROM, ROBERT E. Inventor DORAI, CHITRA; STROM, ROBERT E.
Classification G06F15/16 Main classification G06F15/16
Agency Agent
Main claim
Address