Title of invention: Adaptive recovery for parallel reactive power throttling
Abstract: Power throttling may be used to conserve power and reduce heat in a parallel computing environment. Compute nodes in the parallel computing environment may be organized into groups based on, for example, whether they execute tasks of the same job or receive power from the same converter. Once one of the compute nodes in the group detects that a parameter (e.g., temperature, current, or power consumption) has exceeded a first threshold, power throttling may be activated on all the nodes in the group. However, before power throttling is deactivated, a plurality of parameters associated with the group of compute nodes may be monitored to ensure that they are all below a second threshold. If so, power throttling is deactivated for all of the compute nodes in the group.
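The activation and recovery rule in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the `NodeGroup` class, its field names, and the example temperatures are hypothetical, and the comparisons follow the abstract's wording ("exceeded" the first threshold, "below" the second).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class NodeGroup:
    """Hypothetical group of compute nodes sharing one throttling state."""
    first_threshold: float   # exceeding this on ANY node activates throttling
    second_threshold: float  # ALL nodes must be below this to deactivate
    throttled: bool = False

    def update(self, readings: Dict[str, float]) -> bool:
        """Apply one round of per-node readings; return the group state."""
        values = readings.values()
        if not self.throttled:
            # A single node over the first threshold throttles the whole group.
            if any(v > self.first_threshold for v in values):
                self.throttled = True
        else:
            # Recovery requires every node to be below the second threshold.
            if all(v < self.second_threshold for v in values):
                self.throttled = False
        return self.throttled


group = NodeGroup(first_threshold=85.0, second_threshold=70.0)
print(group.update({"n0": 60.0, "n1": 90.0}))  # n1 is too hot
print(group.update({"n0": 60.0, "n1": 75.0}))  # n1 has not recovered yet
print(group.update({"n0": 60.0, "n1": 65.0}))  # every node has recovered
```

Choosing a second threshold lower than the first, as assumed here, keeps the group from switching rapidly between throttled and unthrottled states when a reading hovers near a single limit.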
Publication number: US8799694(B2)  Publication date: 2014.08.05
Application number: US201113327100  Filing date: 2011.12.15
Applicant: International Business Machines Corporation  Inventors: Gooding Thomas M.;Knudson Brant L.;Lappi Cory;Poole Ruth J.;Tauferner Andrew T.
Classification: G06F1/32  Main classification: G06F1/32
Agency: Patterson & Sheridan LLP  Agent: Patterson & Sheridan LLP
Principal claim: 1. A computer program product for managing a parallel computing system that comprises a plurality of compute nodes, wherein the computing system comprises a global clock signal that informs each of the compute nodes of the global time for the computing system, the computer program product comprising: a computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code configured to: monitor a first parameter associated with at least one of the plurality of compute nodes, wherein the first parameter comprises at least one of: a measured temperature, a measured current, and a measured power consumption, and wherein the plurality of compute nodes in the parallel computing system are coupled for data communications; upon determining that the first parameter reaches or exceeds a first threshold for the first parameter, transmit a global interrupt to at least two of the plurality of compute nodes, wherein the global interrupt includes a predefined time, wherein the at least two of the plurality of compute nodes are configured to reduce power consumption upon the global time matching the predefined time of the global interrupt, and wherein the at least two compute nodes form an operational group that is a subset of the plurality of compute nodes; after transmitting the global interrupt, determine whether second parameters respectively associated with each of the at least two compute nodes in the operational group reach or fall below a second threshold for the second parameters, wherein the second parameters comprise at least one of: a measured temperature, a measured current, and a measured power consumption; and after all the second parameters reach or fall below the second threshold, cancel the global interrupt for the at least two compute nodes in the operational group.
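The sequence recited in the claim (monitor, transmit a global interrupt carrying a predefined time, reduce power when the global time matches, then cancel once every node has recovered) can be sketched as follows. This is an illustrative model only, not the patented implementation: `GlobalInterrupt`, `ComputeNode`, `Controller`, and `lead_ticks` are hypothetical names, the global clock is modeled as integer ticks that every node observes, and a single reading per node stands in for both the first and the second parameters.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GlobalInterrupt:
    predefined_time: int  # global time at which nodes should reduce power


class ComputeNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.pending: Optional[GlobalInterrupt] = None
        self.reduced_power = False

    def receive(self, interrupt: GlobalInterrupt) -> None:
        self.pending = interrupt

    def cancel(self) -> None:
        self.pending = None
        self.reduced_power = False

    def on_tick(self, global_time: int) -> None:
        # Assumes every tick is observed, so an equality match is sufficient.
        if self.pending and global_time == self.pending.predefined_time:
            self.reduced_power = True


class Controller:
    """Hypothetical monitor for one operational group (a subset of nodes)."""

    def __init__(self, group: List[ComputeNode], first_threshold: float,
                 second_threshold: float, lead_ticks: int = 2) -> None:
        self.group = group
        self.first = first_threshold
        self.second = second_threshold
        self.lead = lead_ticks  # assumed delay between interrupt and action
        self.active = False

    def step(self, global_time: int, readings: Dict[str, float]) -> None:
        if not self.active:
            # Any reading that reaches or exceeds the first threshold
            # triggers an interrupt to the whole group.
            if any(v >= self.first for v in readings.values()):
                irq = GlobalInterrupt(global_time + self.lead)
                for node in self.group:
                    node.receive(irq)
                self.active = True
        elif all(readings[n.name] <= self.second for n in self.group):
            # Every node has reached or fallen below the second threshold.
            for node in self.group:
                node.cancel()
            self.active = False
```

Because every node in the group acts on the same predefined time rather than on receipt of the interrupt, the nodes in this model reduce power on the same tick even if the interrupt reached them at different moments.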
Address: Armonk NY US