Title of invention: FAULT RECOVERY ON A MASSIVELY PARALLEL COMPUTER SYSTEM TO HANDLE NODE FAILURES WITHOUT ENDING AN EXECUTING JOB
Abstract: A method and apparatus for fault recovery on a parallel computer system from a soft failure without ending an executing job on a partition of nodes. In preferred embodiments, a failed hardware recovery mechanism on a service node uses a heartbeat monitor to determine when a node failure occurs. Where possible, the failed node is reset and reloaded with software without ending the software job being executed by the partition containing the failed node.
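The detect-and-reset flow described in the abstract can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation claimed in the application: the class names `Node` and `HeartbeatMonitor`, the `timeout_s` parameter, and the `reset_node` callback are all hypothetical, and timestamps are passed in explicitly so the logic stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical record for one compute node in a partition."""
    node_id: int
    last_heartbeat: float
    resets: int = 0


class HeartbeatMonitor:
    """Hypothetical service-node monitor; names and timeout are illustrative."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.nodes = {}

    def register(self, node_id: int, now: float) -> None:
        # Start tracking a node, treating registration as its first heartbeat.
        self.nodes[node_id] = Node(node_id, now)

    def beat(self, node_id: int, now: float) -> None:
        # Record a heartbeat received from a node.
        self.nodes[node_id].last_heartbeat = now

    def failed_nodes(self, now: float) -> list:
        # A node counts as failed once it has been silent longer than the timeout.
        return [
            n.node_id
            for n in self.nodes.values()
            if now - n.last_heartbeat > self.timeout_s
        ]

    def recover(self, now: float, reset_node) -> list:
        """Try to reset each silent node; return the ids that could not be reset.

        The job itself is never touched here: only the failed node is reset,
        which mirrors the goal of recovering without ending the executing job.
        """
        unrecovered = []
        for node_id in self.failed_nodes(now):
            if reset_node(node_id):  # assumed callback: reset and reload software
                node = self.nodes[node_id]
                node.resets += 1
                node.last_heartbeat = now
            else:
                unrecovered.append(node_id)
        return unrecovered
```

In this sketch, a caller would invoke `recover` periodically and fall back to ending the job only for the node ids it returns, corresponding to the "where possible" qualifier in the abstract.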
Publication number: WO2008092952(A2)
Publication date: 2008.08.07
Application number: WO2008EP51266
Filing date: 2008.02.01
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION; IBM UNITED KINGDOM LIMITED; DARRINGTON, DAVID; MCCARTHY, PATRICK, JOSEPH; PETERS, AMANDA; SIDELNIK, ALBERT
Inventor: DARRINGTON, DAVID; MCCARTHY, PATRICK, JOSEPH; PETERS, AMANDA; SIDELNIK, ALBERT
Classification number:
Main classification number:
Agency:
Agent:
Principal claim:
Address: