Title of Invention |
SYSTEMS AND METHODS INVOLVING A MULTI-PASS ALGORITHM FOR HIGH CARDINALITY DATA |
Abstract |
This disclosure describes methods, systems, computer-readable media, and apparatuses for calculating a summary statistic. Calculating the summary statistic can be performed by identifying multiple subsets of a set of variable observations and assigning the subsets to grid-computing devices such that no two of the subsets are assigned to a same one of the grid-computing devices. A parallel processing operation that involves multiple processing phases at each of the grid-computing devices is then coordinated. The parallel processing operation includes each of the grid-computing devices inventorying the respectively assigned subset and generating inventory information representative of the respectively assigned subset. Subsequently, the inventory information generated by the grid-computing devices is received, and a summary statistic is determined by synthesizing the received inventory information. |
Publication Number |
US2014330839(A1) |
Publication Date |
2014.11.06 |
Application Number |
US201414322737 |
Filing Date |
2014.07.02 |
Applicant |
SAS Institute Inc. |
Inventor |
Meng Gang |
Classification Number |
G06F17/30 |
Main Classification Number |
G06F17/30 |
Agency |
|
Agent |
|
Main Claim |
1. A computer-program product tangibly embodied in a non-transitory, machine-readable storage medium, the storage medium comprising:
stored instructions executable to cause a grid-computing device to:
access a first subset of a set of observations while being operated in a grid-computing system that includes multiple additional grid-computing devices configured to access additional subsets of the set;
compute hash values by using a hash function to hash observations in the first subset;
generate associations by associating the hash values with the additional grid-computing devices;
compute information about observations in the first subset;
designate a first portion of the computed information to be retained at the grid-computing device and designate other portions of the computed information to be communicated to the additional grid-computing devices, wherein designating the first portion of the information and designating the other portions of the information is done using an addressing scheme that is based on the associations;
retain the first portion of the computed information at the grid-computing device;
communicate the other portions of the computed information to the additional grid-computing devices;
receive information computed by the additional grid-computing devices, wherein the received information is about observations in the additional subsets;
generate a summary by aggregating the retained portion of the computed information and the received information; and
communicate the generated summary to another computing device configured to use the generated summary to compute a statistic that provides information about the observations in the set. |
Address |
Cary NC US |
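
The steps recited in the main claim can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the patented implementation: the function names (`owner_of`, `local_pass`, `run_grid`, `distinct_count_and_mode`), the use of SHA-256 modulo the node count as the addressing scheme, per-value counts as the "computed information", and distinct count plus mode as the final statistic are all hypothetical choices made for illustration. The record itself does not specify any of them.

```python
import hashlib
from collections import Counter


def owner_of(value, num_nodes):
    """Hypothetical addressing scheme: hash a value to the node that owns it."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def local_pass(subset, num_nodes):
    """Count the values in one node's subset and split the counts by owner.

    For node i, the entry at index i of the result is the portion it would
    retain; every other entry would be communicated to the matching node.
    """
    outgoing = [Counter() for _ in range(num_nodes)]
    for value, count in Counter(subset).items():
        outgoing[owner_of(value, num_nodes)][value] += count
    return outgoing


def run_grid(subsets):
    """Simulate the exchange: each node merges retained and received counts."""
    num_nodes = len(subsets)
    routed = [local_pass(subset, num_nodes) for subset in subsets]
    summaries = []
    for node_id in range(num_nodes):
        merged = Counter()
        for sender in range(num_nodes):
            merged.update(routed[sender][node_id])  # adds counts per value
        summaries.append(merged)
    return summaries


def distinct_count_and_mode(summaries):
    """Central step: combine per-node summaries into an illustrative statistic.

    Each value is owned by exactly one node, so distinct counts simply add up.
    """
    distinct = sum(len(summary) for summary in summaries)
    mode = max(
        (item for summary in summaries for item in summary.items()),
        key=lambda pair: pair[1],
    )
    return distinct, mode


if __name__ == "__main__":
    subsets = [["a", "b", "a"], ["b", "c", "a"], ["d", "a"]]
    print(distinct_count_and_mode(run_grid(subsets)))
```

Because every distinct value hashes to a single owning node, no node has to hold counts for the full set of values, which is the property that makes this kind of routing attractive for high-cardinality data.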