FAULT RECOVERY ON A MASSIVELY PARALLEL COMPUTER SYSTEM TO HANDLE NODE FAILURES WITHOUT ENDING AN EXECUTING JOB
Abstract
A method and apparatus for fault recovery on a parallel computer system from a soft failure without ending an executing job on a partition of nodes. In preferred embodiments, a failed hardware recovery mechanism on a service node uses a heartbeat monitor to determine when a node failure occurs. Where possible, the failed node is reset and reloaded with software without ending the software job being executed by the partition containing the failed node.
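The recovery flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. The names Node, Partition, HeartbeatMonitor, the timeout value, and the counters standing in for a hardware reset and software reload are all hypothetical, chosen only to show a service-node monitor detecting a stale heartbeat and recovering one node while the partition's job keeps running.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute node in a partition."""
    node_id: int
    last_heartbeat: float = 0.0
    resets: int = 0


@dataclass
class Partition:
    """Hypothetical partition of nodes running a single job."""
    nodes: dict = field(default_factory=dict)
    job_running: bool = True


class HeartbeatMonitor:
    """Service-node monitor that flags nodes whose heartbeat is stale."""

    def __init__(self, partition, timeout):
        self.partition = partition
        self.timeout = timeout  # assumed staleness threshold, in seconds

    def record_heartbeat(self, node_id, now):
        self.partition.nodes[node_id].last_heartbeat = now

    def failed_nodes(self, now):
        # A node counts as failed when its last heartbeat is older than timeout.
        return sorted(
            n.node_id
            for n in self.partition.nodes.values()
            if now - n.last_heartbeat > self.timeout
        )

    def recover(self, node_id, now):
        """Reset and reload one node; the partition's job is left untouched."""
        node = self.partition.nodes[node_id]
        node.resets += 1           # stands in for a hardware reset
        node.last_heartbeat = now  # stands in for reloading node software


def demo():
    partition = Partition(nodes={i: Node(i) for i in range(4)})
    monitor = HeartbeatMonitor(partition, timeout=5.0)
    for i in (0, 1, 3):            # node 2 never reports a heartbeat
        monitor.record_heartbeat(i, now=10.0)
    failed = monitor.failed_nodes(now=12.0)
    for node_id in failed:
        monitor.recover(node_id, now=12.0)
    return failed, partition.job_running, monitor.failed_nodes(now=12.0)
```

Time is passed in explicitly rather than read from a clock so the sketch stays deterministic. A real service node would poll on a timer and would need a fallback for nodes that cannot be recovered this way, since the abstract says recovery happens only "where possible".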
Application publication number
WO2008092952(A2)
Application publication date
2008.08.07
Application number
WO2008EP51266
Filing date
2008.02.01
Applicant
INTERNATIONAL BUSINESS MACHINES CORPORATION;IBM UNITED KINGDOM LIMITED;DARRINGTON, DAVID;MCCARTHY, PATRICK, JOSEPH;PETERS, AMANDA;SIDELNIK, ALBERT
Inventor
DARRINGTON, DAVID;MCCARTHY, PATRICK, JOSEPH;PETERS, AMANDA;SIDELNIK, ALBERT