Title of invention |
System and method for large-scale data processing using an application-independent framework |
Abstract |
A large-scale data processing system and method for processing data in a distributed and parallel processing environment is disclosed. The system comprises a set of interconnected computing systems, each having one or more processors and memory. The set of interconnected computing systems include: a set of application-independent map modules for reading portions of input files containing data, and for producing intermediate data values by applying at least one user-specified, application-specific map operation to the data; a set of intermediate data structures distributed among a plurality of the interconnected computing systems for storing the intermediate data values; and a set of application-independent reduce modules, distinct from the plurality of application-independent map modules, for producing final output data by applying at least one user-specified, application-specific reduce operation to the intermediate data values. |
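The division of labor described in the abstract can be illustrated with a minimal single-process sketch. It is not the patented implementation: the names `run_map`, `run_reduce`, `run_job`, and `partition_of`, the CRC32-based partitioning, and the word-count operations are all assumptions chosen for illustration. The framework functions never inspect what the user-specified operations do, which is the sense in which they are application-independent.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Hypothetical partitioning rule: a deterministic hash of the key.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def run_map(split, map_op, num_partitions):
    """Application-independent map module: reads one portion of the input
    and applies the user-specified map_op to every record in it."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for record in split:
        for key, value in map_op(record):
            buckets[partition_of(key, num_partitions)][key].append(value)
    return buckets  # intermediate data structures, one per reduce partition


def run_reduce(partition_index, all_map_outputs, reduce_op):
    """Application-independent reduce module: gathers the intermediate values
    of one partition and applies the user-specified reduce_op per key."""
    grouped = defaultdict(list)
    for buckets in all_map_outputs:
        for key, values in buckets[partition_index].items():
            grouped[key].extend(values)
    return {key: reduce_op(key, values) for key, values in grouped.items()}


def run_job(splits, map_op, reduce_op, num_partitions=2):
    # In a real deployment each call below would run on a different machine.
    map_outputs = [run_map(s, map_op, num_partitions) for s in splits]
    result = {}
    for p in range(num_partitions):
        result.update(run_reduce(p, map_outputs, reduce_op))
    return result


# User-specified, application-specific operations (word count as an example).
def word_count_map(line):
    for word in line.split():
        yield word, 1


def word_count_reduce(word, counts):
    return sum(counts)
```

Swapping `word_count_map` and `word_count_reduce` for other operations changes the application without touching `run_map` or `run_reduce`.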
Publication number |
US9612883(B2) |
Publication date |
2017.04.04 |
Application number |
US201314099806 |
Filing date |
2013.12.06 |
Applicant |
Google Inc. |
Inventors |
Dean Jeffrey;Ghemawat Sanjay |
Classification numbers |
G06F17/30;G06F9/54;G06F9/48 |
Main classification number |
G06F17/30 |
Agency |
Morgan, Lewis & Bockius LLP |
Agent |
Morgan, Lewis & Bockius LLP |
Principal claim |
1. A system for large-scale processing of data in a distributed and parallel processing environment, comprising:
a set of interconnected computing systems, each having one or more processors and memory, the set of interconnected computing systems including:
a plurality of worker processes executing on the set of interconnected computing systems;
an application-independent supervisory process executing on the set of interconnected computing systems, for:
determining, for input files, a plurality of data processing tasks including a plurality of map tasks specifying data from the input files to be processed into intermediate data values and a plurality of reduce tasks specifying intermediate data values to be processed into final output data; and
assigning the data processing tasks to idle ones of the worker processes;
a set of application-independent map functions, executed by a first subset of the plurality of worker processes, for reading portions of the input files containing data, and for producing intermediate data values by applying at least one user-specified, application-specific map operation to the data, wherein the set of application-independent map functions are independent of the at least one user-specified, application-specific map operation;
a set of intermediate data structures distributed among a plurality of the interconnected computing systems for storing the intermediate data values; and
a set of application-independent reduce functions, distinct from the set of application-independent map functions, the set of application-independent reduce functions executed by a second subset of the plurality of worker processes for producing the final output data by applying at least one user-specified, application-specific reduce operation to the intermediate data values, wherein the set of application-independent reduce functions are independent of the at least one user-specified, application-specific reduce operation. |
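The two duties the claim gives the supervisory process, determining tasks and assigning them to idle workers, can be sketched as follows. This is an illustrative sketch only: the claim does not say how input files are divided or how idleness is tracked, so the fixed `split_size`, the tuple layout of a task, and the function names `determine_tasks` and `assign_tasks` are assumptions.

```python
from collections import deque


def determine_tasks(input_files, split_size, num_reduce_tasks):
    """Derive map tasks (one per slice of an input file) and reduce tasks
    (one per partition of the intermediate data).

    input_files maps a file name to its size in bytes (hypothetical layout).
    """
    map_tasks = []
    for name, size in input_files.items():
        for offset in range(0, size, split_size):
            length = min(split_size, size - offset)
            map_tasks.append(("map", name, offset, length))
    reduce_tasks = [("reduce", p) for p in range(num_reduce_tasks)]
    return map_tasks, reduce_tasks


def assign_tasks(tasks, idle_workers):
    """Hand pending tasks to idle workers, one task per worker.

    Returns the assignments made and the tasks still waiting for a worker.
    """
    pending = deque(tasks)
    idle = deque(idle_workers)
    assignments = {}
    while pending and idle:
        assignments[idle.popleft()] = pending.popleft()
    return assignments, list(pending)
```

A supervisor built this way would call `assign_tasks` again whenever a worker reports completion and becomes idle, passing the leftover tasks from the previous call.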
Address |
Mountain View CA US |