发明名称 System and method for large-scale data processing using an application-independent framework
摘要 A large-scale data processing system and method for processing data in a distributed and parallel processing environment is disclosed. The system comprises a set of interconnected computing systems, each having one or more processors and memory. The set of interconnected computing systems include: a set of application-independent map modules for reading portions of input files containing data, and for producing intermediate data values by applying at least one user-specified, application-specific map operation to the data; a set of intermediate data structures distributed among a plurality of the interconnected computing systems for storing the intermediate data values; and a set of application-independent reduce modules, distinct from the plurality of application-independent map modules, for producing final output data by applying at least one user-specified, application-specific reduce operation to the intermediate data values.
申请公布号 US9612883(B2) 申请公布日期 2017.04.04
申请号 US201314099806 申请日期 2013.12.06
申请人 Google Inc. 发明人 Dean Jeffrey;Ghemawat Sanjay
分类号 G06F17/30;G06F9/54;G06F9/48 主分类号 G06F17/30
代理机构 Morgan, Lewis & Bockius LLP 代理人 Morgan, Lewis & Bockius LLP
主权项 1. A system for large-scale processing of data in a distributed and parallel processing environment, comprising: a set of interconnected computing systems, each having one or more processors and memory, the set of interconnected computing systems including: a plurality of worker processes executing on the set of interconnected computing systems;an application-independent supervisory process executing on the set of interconnected computing systems, for: determining, for input files, a plurality of data processing tasks including a plurality of map tasks specifying data from the input files to be processed into intermediate data values and a plurality of reduce tasks specifying intermediate data values to be processed into final output data; andassigning the data processing tasks to idle ones of the worker processes;a set of application-independent map functions, executed by a first subset of the plurality of worker processes, for reading portions of the input files containing data, and for producing intermediate data values by applying at least one user-specified, application-specific map operation to the data, wherein the set of application-independent map functions are independent of the at least one user-specified, application-specific map operation;a set of intermediate data structures distributed among a plurality of the interconnected computing systems for storing the intermediate data values; anda set of application-independent reduce functions, distinct from the set of application-independent map functions, the set of application-independent reduce functions executed by a second subset of the plurality of worker processes for producing the final output data by applying at least one user-specified, application-specific reduce operation to the intermediate data values, wherein the set of application-independent reduce functions are independent of the at least one user-specified, application-specific reduce operation.
地址 Mountain View CA US