Title of invention |
FAULT RECOVERY ON A MASSIVELY PARALLEL COMPUTER SYSTEM TO HANDLE NODE FAILURES WITHOUT ENDING AN EXECUTING JOB |
Abstract |
A method and apparatus for fault recovery on a parallel computer system from a soft failure without ending an executing job on a partition of nodes. In preferred embodiments, a failed hardware recovery mechanism on a service node uses a heartbeat monitor to determine when a node failure occurs. Where possible, the failed node is reset and reloaded with software without ending the software job being executed by the partition containing the failed node. ® KIPO & WIPO 2009 |
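The recovery flow summarized in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the class names, the timeout-based failure test, and the `resettable` flag are all hypothetical, and the reset and software reload are reduced to a counter so the example stays self-contained. The abstract does not say what happens when a reset is not possible, so the sketch only reports such nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    last_heartbeat: float      # time of the most recent heartbeat, in seconds
    resettable: bool = True    # hypothetical: whether a reset can clear the failure
    reloads: int = 0           # how many times this node was reset and reloaded


@dataclass
class ServiceNode:
    """Hypothetical service node that monitors heartbeats for one partition."""
    timeout: float             # assumed rule: silence longer than this means failure
    nodes: dict = field(default_factory=dict)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def record_heartbeat(self, node_id, now):
        self.nodes[node_id].last_heartbeat = now

    def failed_nodes(self, now):
        # A node counts as failed when its heartbeat is older than the timeout.
        return [n for n in self.nodes.values()
                if now - n.last_heartbeat > self.timeout]

    def recover(self, now):
        """Reset failed nodes where possible; the job is never touched here."""
        recovered, unrecovered = [], []
        for node in self.failed_nodes(now):
            if node.resettable:
                node.reloads += 1           # stands in for reset + software reload
                node.last_heartbeat = now   # node is treated as alive again
                recovered.append(node.node_id)
            else:
                unrecovered.append(node.node_id)
        return recovered, unrecovered
```

Calling `recover` periodically from the service node returns the IDs of nodes that were reset in place and those that could not be, leaving the decision about the running job to the caller.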
Publication number |
KR20090084897(A) |
Publication date |
2009.08.05 |
Application number |
KR20097010832 |
Filing date |
2008.02.01 |
Applicant |
INTERNATIONAL BUSINESS MACHINES CORPORATION |
Inventors |
DARRINGTON DAVID;MCCARTHY PATRICK JOSEPH;PETERS AMANDA;SIDELNIK ALBERT |
Classification codes |
G06F13/00;G06F1/24;G06F11/16;G06F15/00 |
Main classification code |
G06F13/00 |
Agency |
|
Agent |
|
Main claim |
|
Address |
|