Title of invention Method and apparatus for providing fault-tolerance in parallel-processing systems
Abstract A system that provides fault tolerance in a parallel processing system. During operation, the system executes a parallel computing application across a subset of computing nodes within the parallel processing system. While the application runs, the system monitors telemetry signals within the parallel processing system. The system analyzes the monitored telemetry signals to determine whether the probability that the parallel processing system will fail is increasing. If so, the system increases the frequency at which the parallel computing application is checkpointed, wherein a checkpoint includes the state of the parallel computing application at each computing node within the parallel processing system.
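The control loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how the failure probability is estimated or by how much the checkpoint frequency is raised, so the "strictly rising over a window" test, the halving factor, the minimum interval, and the names `risk_is_increasing` and `next_checkpoint_interval` are all hypothetical choices, not the claimed method.

```python
from typing import Sequence


def risk_is_increasing(risk_history: Sequence[float], window: int = 3) -> bool:
    """Return True if the last `window` risk estimates are strictly rising.

    Assumption: the risk estimates are produced elsewhere from telemetry;
    how that is done is not specified in the abstract. `window` should be
    at least 2 for the comparison to be meaningful.
    """
    recent = list(risk_history[-window:])
    if len(recent) < window:
        return False  # not enough samples to call it a trend
    return all(a < b for a, b in zip(recent, recent[1:]))


def next_checkpoint_interval(
    current_interval_s: float,
    risk_history: Sequence[float],
    *,
    shrink_factor: float = 0.5,   # assumed: halve the interval on rising risk
    min_interval_s: float = 60.0,  # assumed floor so checkpoints do not dominate
    window: int = 3,
) -> float:
    """Shorten the checkpoint interval (raise the frequency) when risk rises."""
    if risk_is_increasing(risk_history, window):
        return max(min_interval_s, current_interval_s * shrink_factor)
    return current_interval_s


if __name__ == "__main__":
    # Rising risk: the interval is shortened from 3600 s to 1800 s.
    print(next_checkpoint_interval(3600.0, [0.01, 0.02, 0.05]))
    # Falling risk: the interval is left unchanged.
    print(next_checkpoint_interval(3600.0, [0.05, 0.02, 0.01]))
```

In this sketch the interval is only ever shortened; whether and how the frequency is relaxed again once the risk subsides is left open, since the abstract does not address it.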
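Publication number US2007220298(A1) Publication date 2007.09.20
Application number US20060385429 Filing date 2006.03.20
Applicant Inventor GROSS KENNY C.; WOOD ALAN P.
Classification G06F11/00 Main classification G06F11/00
Agency Agent
Principal claim
Address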