Title: Saving program execution state
Abstract: Techniques are described for managing distributed execution of programs. In at least some situations, the techniques include decomposing or otherwise separating the execution of a program into multiple distinct execution jobs that may each be executed on a distinct computing node, such as in a parallel manner with each execution job using a distinct subset of input data for the program. In addition, the techniques may include temporarily terminating and later resuming execution of at least some execution jobs, such as by persistently storing an intermediate state of the partial execution of an execution job, and later retrieving and using the stored intermediate state to resume execution of the execution job from the intermediate state. Furthermore, the techniques may be used in conjunction with a distributed program execution service that executes multiple programs on behalf of multiple customers or other users of the service.
Publication number: US8935404(B2) Publication date: 2015.01.13
Application number: US201313737815 Filing date: 2013.01.09
Applicant: Amazon Technologies, Inc. Inventors: Sirota Peter; Nowland Ian P.; Cole Richard J.; Khanna Richendra; Cabrera Luis Felipe
Classification: G06F15/173; G06F9/48 Main classification: G06F15/173
Agency: Seed IP Law Group PLLC Agent: Seed IP Law Group PLLC
Main claim: 1. A computing system configured to manage distributed execution of programs, comprising: one or more hardware processors; and a system manager component of a distributed execution service that is configured to, when executed by at least one of the one or more hardware processors, manage distributed execution of multiple execution jobs by: initiating execution of the multiple execution jobs on multiple computing nodes; after a partial execution of at least one of the multiple execution jobs is performed but before the execution of the at least one execution jobs is completed, determining to terminate the execution of the at least one execution jobs based at least in part on a failure of at least one of the multiple computing nodes, and initiating persistent storage of an intermediate state of the partial execution of the at least one execution jobs; at a time after terminating the execution of the at least one execution jobs, retrieving the persistently stored intermediate state of the partial execution of the at least one execution jobs, and resuming the execution of the at least one execution jobs, wherein the retrieved persistently stored intermediate state is used as part of the resumed execution by using at least some of the retrieved persistently stored intermediate state as input for one or more of the at least one execution jobs whose execution is resumed; and after the execution of the multiple execution jobs is completed, providing final results from the execution.
Address: Reno NV US
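
The flow recited in the main claim (start several execution jobs, terminate one after a node failure, persist its intermediate state, later retrieve that state and resume, then provide final results) can be illustrated with a minimal single-process sketch. This is not the patented implementation. The names `save_state`, `run_job`, and `run_distributed`, the JSON state files, the summing workload, and the `fail_at` parameter that simulates a node failure are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Persist intermediate state; write to a temp file, then rename over the target."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_job(job_id, data, store_dir, fail_at=None):
    """Sum `data`, resuming from a stored intermediate state if one exists.

    `fail_at` is a hypothetical knob that simulates a node failure just before
    the item at that index is processed. Returns None if the job is terminated.
    """
    path = os.path.join(store_dir, f"job-{job_id}.json")
    state = {"next_index": 0, "partial_sum": 0}
    if os.path.exists(path):
        # Retrieve the persistently stored intermediate state and use it as input.
        with open(path) as f:
            state = json.load(f)
    for i in range(state["next_index"], len(data)):
        if fail_at is not None and i == fail_at:
            return None  # terminated before completion; stored state remains on disk
        state["partial_sum"] += data[i]
        state["next_index"] = i + 1
        save_state(path, state)
    return state["partial_sum"]


def run_distributed(data, num_jobs, store_dir, failing_job=None, fail_at=None):
    """Split `data` into distinct subsets, run one job per subset, resume failures."""
    chunks = [data[i::num_jobs] for i in range(num_jobs)]
    results = {}
    for job_id, chunk in enumerate(chunks):
        simulated_failure = fail_at if job_id == failing_job else None
        results[job_id] = run_job(job_id, chunk, store_dir, simulated_failure)
    # At a later time, resume every terminated job from its stored state.
    for job_id, result in list(results.items()):
        if result is None:
            results[job_id] = run_job(job_id, chunks[job_id], store_dir)
    # Final result: combine the per-job outputs.
    return sum(results.values())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store:
        total = run_distributed(list(range(1, 11)), 3, store, failing_job=1, fail_at=2)
        print(total)
```

In this sketch the state is checkpointed after every processed item, so a terminated job loses no completed work. A real service would more likely persist state at coarser intervals or at the moment termination is decided, and would run the jobs on separate computing nodes rather than in one loop.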