Title: SYSTEMS AND METHODS INVOLVING A MULTI-PASS ALGORITHM FOR HIGH CARDINALITY DATA
Abstract: This disclosure describes methods, systems, computer-readable media, and apparatuses for calculating a summary statistic. Calculating the summary statistic can be performed by identifying multiple subsets of a set of variable observations and assigning the subsets to grid-computing devices such that no two of the subsets are assigned to a same one of the grid-computing devices. A parallel processing operation that involves multiple processing phases at each of the grid-computing devices is then coordinated. The parallel processing operation includes each of the grid-computing devices inventorying the respectively assigned subset and generating inventory information representative of the respectively assigned subset. Subsequently, the inventory information generated by the grid-computing devices is received, and a summary statistic is determined by synthesizing the received inventory information.
Publication number: US2014330839(A1); Publication date: 2014.11.06
Application number: US201414322737; Filing date: 2014.07.02
Applicant: SAS Institute Inc.; Inventor: Meng Gang
Classification: G06F17/30; Main classification: G06F17/30
Agency: (not listed); Agent: (not listed)
Main claim: 1. A computer-program product tangibly embodied in a non-transitory, machine-readable storage medium, the storage medium comprising: stored instructions executable to cause a grid-computing device to: access a first subset of a set of observations while being operated in a grid-computing system that includes multiple additional grid-computing devices configured to access additional subsets of the set; compute hash values by using a hash function to hash observations in the first subset; generate associations by associating the hash values with the additional grid-computing devices; compute information about observations in the first subset; designate a first portion of the computed information to be retained at the grid-computing device and designate other portions of the computed information to be communicated to the additional grid-computing devices, wherein designating the first portion of the information and designating the other portions of the information is done using an addressing scheme that is based on the associations; retain the first portion of the computed information at the grid-computing device; communicate the other portions of the computed information to the additional grid-computing devices; receive information computed by the additional grid-computing devices, wherein the received information is about observations in the additional subsets; generate a summary by aggregating the retained portion of the computed information and the received information; and communicate the generated summary to another computing device configured to use the generated summary to compute a statistic that provides information about the observations in the set.
Address: Cary, NC, US
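
Illustrative sketch (not part of the patent record): the steps recited in the main claim can be simulated in a single process to show how hash-based addressing keeps every distinct value on exactly one device. This is a minimal sketch under stated assumptions, not the applicant's implementation. The function names (`owner_of`, `local_pass`, `run_grid`, `synthesize`), the use of SHA-256 as the hash function, the modulo addressing scheme, and the choice of distinct count and mode as the final statistics are all hypothetical choices made for illustration; the record itself does not specify them.

```python
import hashlib
from collections import Counter


def owner_of(value, num_nodes):
    """Associate an observation's hash value with a node index.

    Assumption: a stable SHA-256 hash taken modulo the node count
    stands in for the claim's unspecified addressing scheme.
    """
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def local_pass(subset, num_nodes):
    """First pass on one node: count its observations and bucket the
    counts by owning node. The bucket addressed to the node itself is
    the retained portion; the others are the communicated portions."""
    buckets = [Counter() for _ in range(num_nodes)]
    for value in subset:
        buckets[owner_of(value, num_nodes)][value] += 1
    return buckets


def run_grid(subsets):
    """Simulate the exchange: each node aggregates its retained bucket
    with the buckets the other nodes addressed to it."""
    num_nodes = len(subsets)
    outgoing = [local_pass(subset, num_nodes) for subset in subsets]
    summaries = []
    for node_id in range(num_nodes):
        merged = Counter()
        for sender in range(num_nodes):
            merged.update(outgoing[sender][node_id])  # adds counts
        summaries.append(merged)
    return summaries


def synthesize(summaries):
    """Final step on the receiving device: because each value is owned
    by exactly one node, per-node distinct counts can simply be summed."""
    distinct = sum(len(summary) for summary in summaries)
    mode_value, mode_count = max(
        ((value, count) for summary in summaries for value, count in summary.items()),
        key=lambda pair: (pair[1], repr(pair[0])),
    )
    return {"distinct": distinct, "mode": mode_value, "mode_count": mode_count}
```

For example, `synthesize(run_grid([["a", "b", "a"], ["b", "c", "a"], ["d", "a"]]))` combines three disjoint subsets held by three simulated devices. The design point the sketch tries to convey is that, after the exchange, no value appears in two summaries, so the final device never has to deduplicate across nodes.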