Title of invention Fault tolerance for map/reduce computing
Abstract Embodiments of the invention include a method for fault tolerance management of worker nodes during map/reduce computing in a computing cluster. The method includes subdividing a computational problem into a set of sub-problems, mapping a selection of the sub-problems in the set to respective nodes in the cluster, directing processing of the sub-problems in the respective nodes, and collecting results from completion of processing of the sub-problems. During a first, early temporal portion of processing the computational problem, failed nodes are detected and the sub-problems currently being processed by the failed nodes are re-processed. Conversely, during a second, later temporal portion of processing the computational problem, sub-problems in nodes not yet completely processed are replicated into other nodes, processing of the replicated sub-problems is directed, and the results from completion of processing of the sub-problems are collected. Finally, duplicate results are removed and the remaining results are reduced into a result set for the problem.
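The two-phase strategy in the abstract can be illustrated with a minimal, single-process Python sketch. This is not the patented implementation. The names `run_job`, `fails`, `late_fraction`, and `replicas` are hypothetical, node failures are simulated by a set of (sub-problem, attempt) pairs, and the switch from the early to the late portion is assumed to happen once a fixed fraction of sub-problems has completed; the abstract does not say how that boundary is chosen.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def run_job(
    chunks: Sequence[Sequence[int]],
    map_fn: Callable[[Sequence[int]], int],
    reduce_fn: Callable[[List[int]], int],
    fails: Set[Tuple[int, int]],
    late_fraction: float = 0.75,  # assumed boundary between early and late phase
    replicas: int = 2,            # assumed number of copies in the late phase
) -> Tuple[int, int]:
    """Simulate the job; return (reduced result, number of duplicates removed)."""
    results: List[Tuple[int, int]] = []  # (sub-problem id, value), may hold duplicates
    attempts: Dict[int, int] = {i: 0 for i in range(len(chunks))}
    pending: List[int] = list(range(len(chunks)))
    done: Set[int] = set()

    def attempt(task: int) -> bool:
        """Run one attempt on a simulated node; False means the node failed."""
        n = attempts[task]
        attempts[task] += 1
        if (task, n) in fails:
            return False
        results.append((task, map_fn(chunks[task])))
        done.add(task)
        return True

    # Early phase: detect a failed node and re-process its sub-problem.
    while pending and len(done) < late_fraction * len(chunks):
        task = pending.pop(0)
        if not attempt(task):
            pending.append(task)

    # Late phase: replicate each unfinished sub-problem onto several nodes.
    while pending:
        task = pending.pop(0)
        outcomes = [attempt(task) for _ in range(replicas)]
        if not any(outcomes):
            pending.append(task)  # every replica failed, so try again

    # Remove duplicate results, then reduce what remains.
    unique: Dict[int, int] = {}
    for task, value in results:
        unique.setdefault(task, value)
    return reduce_fn(list(unique.values())), len(results) - len(unique)
```

For example, calling `run_job([[1, 2], [3, 4], [5, 6], [7, 8]], sum, sum, {(0, 0), (3, 0)})` simulates a failure on the first attempt of sub-problems 0 and 3. Sub-problem 0 is re-processed in the early phase, while sub-problem 3 is still unfinished when the late phase begins and is therefore replicated, which yields one duplicate result that is discarded before the reduce step.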
Publication No. US8381016(B2) Publication date 2013.02.19
Application No. US201213407673 Filing date 2012.02.28
Applicant INTERNATIONAL BUSINESS MACHINES CORPORATION; KAMINSKY DAVID L. Inventor KAMINSKY DAVID L.
Classification G06F11/00 Main classification G06F11/00
Agency Agent
Principal claim
Address