Title of invention: Methods and apparatus using commutative error detection values for fault isolation in multiple node computers
Abstract: The present invention concerns methods and apparatus for performing fault isolation in multiple node computing systems using commutative error detection values (for example, checksums) to identify and isolate faulty nodes. In the present invention, the nodes forming the multiple node computing system are networked together and, during program execution, communicate with one another by transmitting information through the network. When a node injects information associated with a reproducible portion of a computer program into the network, a commutative error detection value is calculated and stored in commutative error detection apparatus associated with that node. At intervals, node fault detection apparatus associated with the multiple node computer system retrieves the commutative error detection values saved in the commutative error detection apparatus associated with each node and stores them in memory. When the computer program is executed again by the multiple node computer system, new commutative error detection values are created; the node fault detection apparatus retrieves them and stores them in memory as well. The node fault detection apparatus identifies faulty nodes by comparing the commutative error detection values that a particular node generated for the same reproducible portions of the application program in different runs. A difference between those values indicates that the node may be faulty.
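The comparison described in the abstract can be illustrated with a minimal sketch. It is not the patented apparatus: the 32-bit byte-sum checksum, the function names (`commutative_checksum`, `run_checksums`, `suspect_nodes`), and the dictionary layout mapping node IDs to injected packets are all assumptions made for illustration. The sketch shows only the key property, namely that a commutative value does not depend on the order in which a node injects its packets, so two runs of a reproducible program portion should yield the same value per node.

```python
from typing import Dict, Iterable, List

MASK = 0xFFFFFFFF  # assumed 32-bit checksum width (not specified in the abstract)


def commutative_checksum(packets: Iterable[bytes]) -> int:
    """Sum all bytes modulo 2**32; addition is commutative, so packet order is irrelevant."""
    total = 0
    for packet in packets:
        for byte in packet:
            total = (total + byte) & MASK
    return total


def run_checksums(traffic: Dict[int, List[bytes]]) -> Dict[int, int]:
    """Compute one checksum per node from the packets that node injected during a run."""
    return {node: commutative_checksum(pkts) for node, pkts in traffic.items()}


def suspect_nodes(first: Dict[int, int], second: Dict[int, int]) -> List[int]:
    """Return the nodes whose checksums differ between two runs of the same program."""
    return sorted(node for node in first if first[node] != second.get(node))


# Hypothetical traffic: node 0 sends the same packets in a different order,
# node 1 sends one corrupted byte in the second run.
run_a = {0: [b"abc", b"de"], 1: [b"xyz"]}
run_b = {0: [b"de", b"abc"], 1: [b"xyZ"]}
flagged = suspect_nodes(run_checksums(run_a), run_checksums(run_b))
```

Here node 0 is not flagged, because reordering its packets leaves the sum unchanged, while node 1 is flagged because its injected data differs between the runs. A plain byte sum is a weak error detector and stands in only for whatever commutative function an actual implementation would use.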
Publication number: US2006248370(A1) Publication date: 2006.11.02
Application number: US20050106069 Filing date: 2005.04.14
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventors: ALMASI GHEORGHE; BLUMRICH MATTHIAS A.; CHEN DONG; COTEUS PAUL; GARA ALAN; GIAMPAPA MARK E.; HEIDELBERGER PHILIP; HOENICKE DIRK I.; SINGH SARABJEET; STEINMACHER-BUROW BURKHARD D.; TAKKEN TODD; VRANAS PAVLOS
Classification: G06F11/00 Main classification: G06F11/00
Agency: Agent:
Principal claim:
Address: