Title: Identifying failure in a tree network of a parallel computer
Abstract: Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets, each including an I/O node and a plurality of compute nodes. For each processing set, embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting one or more potential problem nodes from the test compute nodes and individually testing the potential problem nodes and the links to them.
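The per-processing-set procedure in the abstract can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not say how the current test value is computed, how the test-set performance is aggregated, or how potential problem nodes are picked, so the formula in `current_test_value`, the mean-of-nodes aggregation, and the "slowest nodes are suspects" rule are all assumptions. `ProcessingSet`, `find_tree_failures`, and the `measure_*`/`test_*` callbacks are hypothetical names standing in for real hardware measurements.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProcessingSet:
    """One I/O node plus the compute nodes it serves (hypothetical model)."""
    io_node: str
    compute_nodes: List[str]


def current_test_value(io_measured: float, compute_measured: float,
                       io_expected: float) -> float:
    # ASSUMPTION: the abstract gives no formula. Here the test value is the
    # throughput lost between the I/O node and the test compute nodes,
    # normalized by the predetermined I/O node performance.
    return (io_measured - compute_measured) / io_expected


def find_tree_failures(
    pset: ProcessingSet,
    measure_io: Callable[[str], float],     # hypothetical I/O node benchmark
    measure_node: Callable[[str], float],   # hypothetical per-node benchmark
    test_node: Callable[[str], bool],       # hypothetical individual node test
    test_link: Callable[[str], bool],       # hypothetical link test
    io_expected: float,
    threshold: float,
    subset_size: int,
    suspects_per_set: int = 1,
) -> List[str]:
    """Return compute nodes whose individual node or link test fails."""
    failures: List[str] = []
    nodes = pset.compute_nodes
    # Walk the processing set in consecutive subsets of test compute nodes.
    for start in range(0, len(nodes), subset_size):
        test_nodes = nodes[start:start + subset_size]
        io_measured = measure_io(pset.io_node)
        per_node = {n: measure_node(n) for n in test_nodes}
        # ASSUMPTION: set performance is the mean of per-node measurements.
        compute_measured = sum(per_node.values()) / len(per_node)
        value = current_test_value(io_measured, compute_measured, io_expected)
        if value < threshold:
            continue  # below threshold: select another set of test nodes
        # ASSUMPTION: the slowest nodes in the set are the potential problems.
        suspects = sorted(test_nodes, key=lambda n: per_node[n])[:suspects_per_set]
        for node in suspects:
            # Test each suspect node and the link to it individually.
            if not (test_node(node) and test_link(node)):
                failures.append(node)
    return failures


if __name__ == "__main__":
    pset = ProcessingSet("io0", [f"c{i}" for i in range(8)])
    speed = {n: 100.0 for n in pset.compute_nodes}
    speed["c5"] = 20.0  # simulated degraded link behind c5
    bad = find_tree_failures(
        pset,
        measure_io=lambda io: 100.0,
        measure_node=lambda n: speed[n],
        test_node=lambda n: True,
        test_link=lambda n: speed[n] > 50.0,
        io_expected=100.0,
        threshold=0.1,
        subset_size=4,
    )
    print(bad)  # ['c5']
```

In this simulated run the first subset (`c0`..`c3`) yields a test value of 0 and is skipped, while the second subset's mean drops to 80, giving a test value of 0.2 at or above the 0.1 threshold, so its slowest node `c5` is tested individually and flagged. Injecting the measurements as callbacks keeps the control flow separate from any particular benchmark or diagnostic tool.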
Publication No.: US7783933 (B2)   Publication date: 2010.08.24
Application No.: US20060531787   Filing date: 2006.09.14
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION   Inventors: ARCHER CHARLES J.; PINNOW KURT W.; WALLENFELT BRIAN P.
Classification: G06F11/00   Main classification: G06F11/00