Title of invention
Abstract In one embodiment, a method for fault tolerance and recovery in a high-performance computing (HPC) system includes monitoring a currently running node in an HPC system that includes multiple nodes. A fabric couples the multiple nodes to each other and to storage that is accessible to each of the multiple nodes and capable of storing multiple hosts, each of which is executable at any of the multiple nodes. The method includes, if a fault occurs at the currently running node, discontinuing operation of the currently running node and booting the host at a free node in the HPC system from the storage.
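Publication number JP5570095(B2) Publication date 2014.08.13
Application number JP20070543012 Filing date 2005.04.13
Applicant Inventor
Classification G06F11/20;G06F11/30;G06F15/173;G06F15/177 Main classification G06F11/20
Agency Agent
Principal claim
Address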
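
The recovery flow summarized in the abstract (monitor a running node, discontinue it on a fault, boot the host at a free node from shared storage) might be sketched roughly as below. This is an illustrative sketch only, not the claimed implementation: `Node`, `recover`, the `faulted` flag, and the representation of storage as a set of host names are all hypothetical, and assigning `candidate.host` merely stands in for an actual boot from the shared storage.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Node:
    """Hypothetical model of one HPC node."""
    name: str
    host: Optional[str] = None  # host currently executing here, if any
    faulted: bool = False       # True once the node has been taken out of service


def recover(nodes: Dict[str, Node], storage: Set[str], failed: str) -> Optional[str]:
    """Discontinue the failed node and boot its host at a free node.

    Returns the name of the node now running the host, or None if the
    host is not in shared storage or no free node is available.
    """
    node = nodes[failed]
    host = node.host

    # Discontinue operation of the faulted node.
    node.host = None
    node.faulted = True

    # The host can only be rebooted if its image is in shared storage.
    if host is None or host not in storage:
        return None

    # Any healthy node without a host qualifies, because the storage is
    # accessible to every node and each host is executable at any node.
    for candidate in nodes.values():
        if candidate.host is None and not candidate.faulted:
            candidate.host = host  # stand-in for booting from shared storage
            return candidate.name
    return None
```

With three nodes `n0`, `n1`, `n2`, where `n0` and `n1` each run a host and `n2` is free, calling `recover(nodes, {"host-a", "host-b"}, "n0")` marks `n0` as faulted and moves its host to `n2`. A second fault with no free node left returns `None`.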