Title Enhancing throughput and fault-tolerance in a parallel-processing system
Abstract One embodiment of the present invention provides a system that enhances throughput and fault-tolerance in a parallel-processing system. During operation, the system first receives a task. Next, the system partitions N computing nodes into M set-aside nodes and N-M primary computing nodes, wherein M>=1. The system then processes the task in parallel across the N-M primary computing nodes. While doing so, the system proactively monitors the health of each of the N-M primary computing nodes. If the system detects that one of the N-M primary computing nodes is at risk of failure, the system copies the portion of the task associated with the at-risk node to a subset of the M set-aside nodes. The system then processes the portion of the task in parallel across the subset of the M set-aside nodes while the N-M primary computing nodes continue executing.
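The flow described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the `Node` class, the numeric `risk` score, the `threshold` and `fanout` parameters, and the round-robin distribution of work are all assumptions introduced here, since the abstract does not say how risk is measured or how the subset of set-aside nodes is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    risk: float = 0.0  # hypothetical health score in [0, 1]; higher means closer to failure
    work: list = field(default_factory=list)


def partition(nodes, m):
    """Split N nodes into N-M primary nodes and M set-aside nodes (M >= 1)."""
    if not 1 <= m < len(nodes):
        raise ValueError("need 1 <= M < N")
    return nodes[:-m], nodes[-m:]


def assign(task, primaries):
    """Deal the task's work items round-robin across the primary nodes."""
    for i, item in enumerate(task):
        primaries[i % len(primaries)].work.append(item)


def failover(primaries, set_aside, threshold=0.8, fanout=2):
    """Copy each at-risk primary's portion onto a subset of idle set-aside nodes.

    The at-risk node keeps its own copy and continues executing.
    Returns {at-risk node name: [set-aside node names]}.
    """
    moves = {}
    idle = [n for n in set_aside if not n.work]
    for node in primaries:
        if node.risk < threshold or not node.work or not idle:
            continue
        # Reserve up to `fanout` idle set-aside nodes for this portion.
        subset, idle = idle[:fanout], idle[fanout:]
        for i, item in enumerate(node.work):
            subset[i % len(subset)].work.append(item)
        moves[node.name] = [n.name for n in subset]
    return moves
```

In use, one would call `partition` once, `assign` the task to the primary nodes, and then call `failover` whenever the monitoring step updates the `risk` scores. Spreading one node's portion over several set-aside nodes is what lets the copied portion finish sooner than it would on a single replacement node.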
Publication number US7543180(B2) Publication date 2009.06.02
Application number US20060371998 Filing date 2006.03.08
Applicant SUN MICROSYSTEMS, INC. Inventors GROSS KENNY C.;WOOD ALAN PAUL
Classification G06F11/00 Main classification G06F11/00
Agency Agent
Main claim
Address