Title: OPTIMIZATION OF MAP-REDUCE SHUFFLE PERFORMANCE THROUGH SHUFFLER I/O PIPELINE ACTIONS AND PLANNING
Abstract: A shuffler receives information associated with partition segments of map task outputs and a pipeline policy for a job running on a computing device. The shuffler transmits to an operating system of the computing device a request to lock partition segments of the map task outputs in the memory of the computing device, and transmits an advisement to keep or load partition segments of map task outputs in that memory. The shuffler creates a pipeline for the job based on the pipeline policy, wherein the pipeline includes the partition segments locked in the memory of the computing device and the partition segments advised to be kept or loaded in the memory. The shuffler selects the partition segments locked in the memory, followed by the partition segments advised to be kept or loaded in the memory, as a preferential order of partition segments to shuffle.
Publication No.: US2016283282(A1)   Publication date: 2016.09.29
Application No.: US201615173741   Filing date: 2016.06.06
Applicant: International Business Machines Corporation   Inventors: Hu Zhenhua; Ma Hao Hai; Tang Wentao; Xu Qiang
Classification: G06F9/50; G06F9/54   Main classification: G06F9/50
Agency:   Agent:
Main claim: 1. A method for optimizing a MapReduce shuffle, the method comprising: one or more processors performing MapReduce processes of a job running on one or more computing devices of a distributed grid of computing devices, wherein the MapReduce processes include generation of a set of partition segments of one or more map task outputs; one or more processors receiving information regarding the set of partition segments of one or more map task outputs and a pipeline policy for the job running on the one or more computing devices of the distributed grid; one or more processors transmitting a request to an operating system of the computing device to lock a first portion of the set of partition segments of the one or more map task outputs into memory of a computing device of the distributed grid; one or more processors transmitting, to the operating system of the computing device of the distributed grid, an advisement to keep or load a second portion of the set of partition segments of the one or more map task outputs in the memory of the computing device; one or more processors building a pipeline of the one or more map task outputs, based on the pipeline policy of the job, and the first portion and the second portion of the set of partition segments; in response to receiving a fetch request from a reducer for partition segments of the one or more map task outputs, one or more processors shuffling the partition segments of the first portion of the set of partition segments before shuffling the partition segments from the second portion of the set of partition segments, as a preferential order of partition segments to shuffle; in response to shuffling one or more of the partition segments of the first portion as a response to a round of reducer requests by a reducer, one or more processors transmitting a request to the operating system of the computing device of the distributed grid to unlock, from the memory of the computing device, the one or more of the partition segments of the first portion of the set of partition segments that are shuffled; and in response to shuffling one or more of the partition segments of the second portion as a response to a round of reducer requests by a reducer, one or more processors transmitting a request to the operating system of the computing device of the distributed grid to un-advise the one or more of the partition segments of the second portion of the set of partition segments from keeping or loading, in the memory of the computing device, the one or more of the partition segments of the second portion that are shuffled.
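The ordering in the main claim can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patented implementation: `FakeOS`, `Segment`, `Shuffler`, and the `locked_keys` parameter are hypothetical names, the operating system is replaced by an object that merely records the requests it receives, and the choice of which segments to lock stands in for a pipeline policy that the record does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum


class Residency(Enum):
    LOCKED = 0    # first portion: OS was asked to lock the segment in memory
    ADVISED = 1   # second portion: OS was advised to keep or load the segment
    NONE = 2      # no outstanding lock or advisement


@dataclass
class Segment:
    map_task: str
    partition: int
    residency: Residency = Residency.NONE


@dataclass
class FakeOS:
    """Hypothetical stand-in for the OS; it only records requests."""
    calls: list = field(default_factory=list)

    def request(self, action: str, seg: Segment) -> None:
        self.calls.append((action, seg.map_task, seg.partition))


class Shuffler:
    def __init__(self, os_iface: FakeOS, segments: list, locked_keys: set):
        self.os = os_iface
        self.pipeline = []
        # Assumed policy: lock the segments named in locked_keys, advise the rest.
        for seg in segments:
            if (seg.map_task, seg.partition) in locked_keys:
                seg.residency = Residency.LOCKED
                self.os.request("lock", seg)
            else:
                seg.residency = Residency.ADVISED
                self.os.request("advise", seg)
            self.pipeline.append(seg)

    def fetch(self, partition: int) -> list:
        """Serve one round of a reducer's fetch request for a partition."""
        wanted = [s for s in self.pipeline if s.partition == partition]
        # Locked segments are shuffled before advised ones (the sort is stable).
        wanted.sort(key=lambda s: s.residency.value)
        for seg in wanted:
            # After shuffling, release the lock or withdraw the advisement.
            if seg.residency is Residency.LOCKED:
                self.os.request("unlock", seg)
            elif seg.residency is Residency.ADVISED:
                self.os.request("unadvise", seg)
            seg.residency = Residency.NONE
            self.pipeline.remove(seg)
        return [(s.map_task, s.partition) for s in wanted]


os_iface = FakeOS()
segments = [Segment("m1", 0), Segment("m2", 0), Segment("m1", 1)]
shuffler = Shuffler(os_iface, segments, locked_keys={("m2", 0)})
print(shuffler.fetch(0))  # [('m2', 0), ('m1', 0)]
```

In this run the locked segment of map task `m2` is served before the advised segment of `m1`, and the recorded requests end with an `unlock` followed by an `unadvise`. On a real system the lock and advisement would presumably map onto operating system memory calls, which this sketch deliberately leaves out.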
Address: Armonk, NY, US