Title: Distributed computing fault management
Abstract: An automated system may be employed to perform detection, analysis and recovery from faults occurring in a distributed computing system. Faults may be recorded in a metadata store for verification and analysis by an automated fault management process. Diagnostic procedures may confirm detected faults. The automated fault management process may perform recovery workflows involving operations such as rebooting faulting devices and excommunicating unrecoverable computing nodes from affected clusters.
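As a rough illustration of the record-then-verify step described in the abstract, the following Python sketch keeps fault reports in an in-memory stand-in for the metadata store and marks a report as confirmed only when a diagnostic reproduces it. The names `FaultRecord`, `MetadataStore`, and `verify_faults` are hypothetical and do not come from the patent; a real system would presumably use a durable store and richer diagnostics.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FaultRecord:
    """One reported fault (hypothetical schema, for illustration only)."""
    node_id: str
    description: str
    confirmed: bool = False


@dataclass
class MetadataStore:
    """In-memory stand-in for the metadata store (an assumption)."""
    records: List[FaultRecord] = field(default_factory=list)

    def record_fault(self, node_id: str, description: str) -> FaultRecord:
        # Append a new, not yet confirmed fault report.
        rec = FaultRecord(node_id, description)
        self.records.append(rec)
        return rec

    def unconfirmed(self) -> List[FaultRecord]:
        return [r for r in self.records if not r.confirmed]


def verify_faults(store: MetadataStore,
                  diagnostic: Callable[[str], bool]) -> List[FaultRecord]:
    """Run `diagnostic` on each unconfirmed report and return those it confirms.

    Reports the diagnostic does not reproduce stay unconfirmed in the store.
    """
    newly_confirmed = []
    for rec in store.unconfirmed():
        if diagnostic(rec.node_id):
            rec.confirmed = True
            newly_confirmed.append(rec)
    return newly_confirmed
```

Only confirmed records would then be handed to a recovery workflow, so a transient or misreported fault does not trigger a reboot.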
Publication number: US9274902(B1)  Publication date: 2016.03.01
Application number: US201313961720  Filing date: 2013.08.07
Applicant: Amazon Technologies, Inc.  Inventors: Morley Adam Douglas; Hunter, Jr. Barry Bailey; Lu Yijun; Rath Timothy Andrew; Muniswamy-Reddy Kiran-Kumar; Huang Xianglong; Zheng Jiandan
Classification: G06F11/00; G06F11/20; G06F11/07  Main classification: G06F11/00
Agency: Baker & Hostetler LLP  Agent: Baker & Hostetler LLP
Principal claim: 1. A distributed database system comprising: a plurality of computing nodes comprising at least a first subset of the plurality of computing nodes, the first subset configured to perform a distributed computing function, one or more of the plurality of computing nodes configured at least to: detect a fault involving the first subset of the plurality of computing nodes; perform one or more diagnostic procedures involving at least a component connected to a first computing node of the first subset of the plurality of computing nodes, the one or more diagnostic procedures selected based at least in part on determining that the component is a potential origin of the fault; perform a first one or more operations involving the first computing node, the first one or more operations selected based at least in part on the performing of the one or more diagnostic procedures; and reconfigure the first subset of the plurality of computing nodes to perform the distributed computing function without the first computing node upon determining that performing the first one or more operations has not resolved the fault.
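The sequence recited in claim 1 (diagnose, apply operations chosen from the diagnostic result, then reconfigure the subset without the node if the fault persists) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: `handle_fault`, the `diagnose` callback returning a suspected component name, the `recovery_plan` mapping, and the `is_healthy` check are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Optional, Set


def handle_fault(
    cluster: Set[str],
    node: str,
    diagnose: Callable[[str], Optional[str]],
    recovery_plan: Dict[str, List[Callable[[str], None]]],
    is_healthy: Callable[[str], bool],
) -> Set[str]:
    """Return the cluster membership after trying to recover `node`.

    All parameters are illustrative assumptions:
      diagnose      -> name of the suspected component, or None if the
                       diagnostic does not confirm a fault
      recovery_plan -> operations to try for each suspected component
      is_healthy    -> True once the node no longer shows the fault
    """
    finding = diagnose(node)
    if finding is None:
        # Fault not confirmed: leave the membership unchanged.
        return set(cluster)

    # Operations are selected based on the diagnostic result.
    for operation in recovery_plan.get(finding, []):
        operation(node)
        if is_healthy(node):
            return set(cluster)

    # No operation resolved the fault: continue without the node.
    return set(cluster) - {node}
```

For example, with a plan such as `{"disk": [reboot]}`, a node that is healthy after the reboot stays in the cluster, while a node that still fails is dropped from the returned membership.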
Address: Reno NV US