Abstract |
A shuffler receives information associated with partition segments of map task outputs and a pipeline policy for a job running on a computing device. The shuffler transmits to an operating system of the computing device a request to lock some partition segments of the map task outputs in the memory of the computing device, and an advisement to keep or load other partition segments of the map task outputs in that memory. Based on the pipeline policy, the shuffler creates a pipeline for the job that includes the partition segments locked in the memory and the partition segments advised to be kept or loaded in the memory. As a preferential order of partition segments to shuffle, the shuffler selects the partition segments locked in the memory first, followed by the partition segments advised to be kept or loaded in the memory. |
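The preferential order described in the abstract can be illustrated with a minimal sketch. The names `Residency`, `Segment`, and `shuffle_order` are hypothetical and do not come from the patent, and the `NONE` state (a segment that was neither locked nor advised) is an added assumption to show where such segments would fall.

```python
from dataclasses import dataclass
from enum import Enum


class Residency(Enum):
    LOCKED = 0   # the OS was asked to lock the segment in memory
    ADVISED = 1  # the OS was advised to keep or load the segment in memory
    NONE = 2     # assumption: no request was made for this segment


@dataclass(frozen=True)
class Segment:
    map_task: str
    partition: int
    residency: Residency


def shuffle_order(segments, partition):
    """Return one partition's segments: locked first, then advised, then the rest."""
    wanted = [s for s in segments if s.partition == partition]
    # sorted() is stable, so segments with equal residency keep their pipeline order.
    return sorted(wanted, key=lambda s: s.residency.value)


pipeline = [
    Segment("map-0", 0, Residency.ADVISED),
    Segment("map-1", 0, Residency.LOCKED),
    Segment("map-2", 0, Residency.NONE),
    Segment("map-3", 1, Residency.LOCKED),
    Segment("map-4", 0, Residency.LOCKED),
]
order = [s.map_task for s in shuffle_order(pipeline, 0)]
print(order)  # ['map-1', 'map-4', 'map-0', 'map-2']
```

Sorting on a residency rank is only one way to express the preference; a real shuffler could equally keep two separate queues.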
Independent claim |
1. A method for optimizing a MapReduce shuffle, the method comprising:
one or more processors performing MapReduce processes of a job running on one or more computing devices of a distributed grid of computing devices, wherein the MapReduce processes include generation of a set of partition segments of one or more map task outputs;
one or more processors receiving information regarding the set of partition segments of the one or more map task outputs and a pipeline policy for the job running on the one or more computing devices of the distributed grid;
one or more processors transmitting a request to an operating system of a computing device of the distributed grid to lock a first portion of the set of partition segments of the one or more map task outputs into memory of the computing device;
one or more processors transmitting to the operating system of the computing device of the distributed grid an advisement to keep or load a second portion of the set of partition segments of the one or more map task outputs in the memory of the computing device;
one or more processors building a pipeline of the one or more map task outputs, based on the pipeline policy of the job, and the first portion and the second portion of the set of partition segments;
in response to receiving a fetch request from a reducer for partition segments of the one or more map task outputs, one or more processors shuffling the partition segments of the first portion of the set of partition segments before shuffling the partition segments of the second portion of the set of partition segments, as a preferential order of partition segments to shuffle;
in response to shuffling one or more of the partition segments of the first portion as a response to a round of requests by a reducer, one or more processors transmitting a request to the operating system of the computing device of the distributed grid to unlock from the memory of the computing device the one or more of the partition segments of the first portion of the set of partition segments that are shuffled; and
in response to shuffling one or more of the partition segments of the second portion as a response to a round of requests by a reducer, one or more processors transmitting a request to the operating system of the computing device of the distributed grid to un-advise the one or more of the partition segments of the second portion of the set of partition segments that are shuffled from being kept or loaded in the memory of the computing device. |
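The lock, advise, shuffle, and release steps of the claim can be traced with a small simulation. This is a sketch under stated assumptions, not the claimed implementation: `RecordingOS` is a hypothetical stand-in that only records requests instead of calling a real operating system, and the `lock_count` and `max_segments` parameters are invented here to represent a pipeline policy and the size of one round of reducer requests.

```python
class RecordingOS:
    """Hypothetical stand-in for the operating system; it only records requests."""

    def __init__(self):
        self.calls = []

    def lock(self, seg):
        self.calls.append(("lock", seg))

    def unlock(self, seg):
        self.calls.append(("unlock", seg))

    def advise(self, seg):
        self.calls.append(("advise", seg))

    def unadvise(self, seg):
        self.calls.append(("unadvise", seg))


class Shuffler:
    def __init__(self, os_iface):
        self.os = os_iface
        self.locked = []   # first portion: segments locked in memory
        self.advised = []  # second portion: segments advised to stay in memory

    def build_pipeline(self, segments, lock_count):
        """Assumed policy: lock the first `lock_count` segments, advise the rest."""
        for i, seg in enumerate(segments):
            if i < lock_count:
                self.os.lock(seg)
                self.locked.append(seg)
            else:
                self.os.advise(seg)
                self.advised.append(seg)

    def serve_round(self, max_segments):
        """Answer one round of reducer requests, locked segments first."""
        sent = []
        while len(sent) < max_segments and (self.locked or self.advised):
            if self.locked:
                sent.append((self.locked.pop(0), self.os.unlock))
            else:
                sent.append((self.advised.pop(0), self.os.unadvise))
        # After the round, release only the segments that were shuffled.
        for seg, release in sent:
            release(seg)
        return [seg for seg, _ in sent]


os_iface = RecordingOS()
shuffler = Shuffler(os_iface)
shuffler.build_pipeline(["seg-a", "seg-b", "seg-c"], lock_count=2)
first = shuffler.serve_round(max_segments=2)
second = shuffler.serve_round(max_segments=2)
print(first, second)
print(os_iface.calls)
```

On a POSIX system, calls such as `mlock`/`munlock` and `madvise` are one plausible way to back the four recorded operations, but the claim does not name specific system calls, so that mapping is an assumption.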