Title of invention: Systems and methods involving a multi-pass algorithm for high cardinality data
Abstract: This disclosure describes methods, systems, computer-readable media, and apparatuses for calculating a summary statistic. The summary statistic can be calculated by identifying multiple subsets of a set of variable observations and assigning the subsets to grid-computing devices such that no two of the subsets are assigned to the same grid-computing device. A parallel processing operation that involves multiple processing phases at each of the grid-computing devices is then coordinated. In the parallel processing operation, each grid-computing device inventories its assigned subset and generates inventory information representative of that subset. Subsequently, the inventory information generated by the grid-computing devices is received, and a summary statistic is determined by synthesizing the received inventory information.
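The flow in the abstract (assign disjoint subsets, inventory each subset locally, then synthesize the inventories) can be illustrated with a minimal single-process sketch. It is not the patented implementation: the names `assign_subsets`, `inventory`, and `synthesize` are hypothetical, the "devices" are simulated as list slots, and frequency counts are assumed as the form of inventory information.

```python
from collections import Counter


def assign_subsets(observations, num_devices):
    """Split observations into disjoint subsets, one per simulated device."""
    return [observations[i::num_devices] for i in range(num_devices)]


def inventory(subset):
    """Per-device phase: summarize the assigned subset as value -> count."""
    return Counter(subset)


def synthesize(inventories):
    """Coordinator phase: merge the inventories and derive summary statistics."""
    total = Counter()
    for inv in inventories:
        total.update(inv)
    if not total:
        return {"n": 0, "distinct": 0, "mode": None}
    mode, _ = total.most_common(1)[0]
    return {"n": sum(total.values()), "distinct": len(total), "mode": mode}


observations = ["a", "b", "a", "c", "b", "a", "d"]
subsets = assign_subsets(observations, 3)
summary = synthesize(inventory(s) for s in subsets)
```

Because each subset goes to exactly one simulated device, no observation is counted twice when the inventories are merged.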
Publication number: US9524311(B2) Publication date: 2016.12.20
Application number: US201414322737 Filing date: 2014.07.02
Applicant: SAS Institute Inc. Inventor: Meng Gang
Classification: G06F9/50;G06F17/30;G06F17/18;H04L29/08 Main classification: G06F9/50
Agency: Kilpatrick Townsend & Stockton LLP Agent: Kilpatrick Townsend & Stockton LLP
Main claim: 1. A computer-program product tangibly embodied in a non-transitory, machine-readable storage medium, the storage medium comprising: stored instructions executable to cause a grid-computing device to: access a first subset of a set of observations while being operated in a grid-computing system that includes multiple additional grid-computing devices configured to access additional subsets of the set; compute hash values by using a hash function to hash observations in the first subset; generate associations by associating the hash values with the additional grid-computing devices; compute information about observations in the first subset; designate a first portion of the computed information to be retained at the grid-computing device and designate other portions of the computed information to be communicated to the additional grid-computing devices, wherein designating the first portion of the computed information and designating the other portions of the computed information is done using an addressing scheme that is based on the associations; retain the first portion of the computed information at the grid-computing device; communicate the other portions of the computed information to the additional grid-computing devices; receive additional information computed by the additional grid-computing devices, wherein the received additional information is about observations in the additional subsets; generate a summary by aggregating the retained portion of the computed information and the received additional information; and communicate the generated summary to another computing device configured to use the generated summary to compute a statistic that provides statistical information about the observations in the set.
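The steps recited in the claim (hash the observations, associate hash values with devices, retain one portion and communicate the rest, aggregate, then report a summary) resemble a hash-partitioned exchange. The following is a rough sketch under stated assumptions, not the claimed implementation: devices are simulated in one process, the addressing scheme is assumed to be a SHA-256 hash taken modulo the device count, the computed information is assumed to be per-value counts, and `owner_of`, `local_pass`, `exchange_and_aggregate`, and `final_statistic` are hypothetical names.

```python
import hashlib
from collections import Counter


def owner_of(value, num_devices):
    """Assumed addressing scheme: stable hash of the value, modulo device count."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_devices


def local_pass(subset, num_devices):
    """Count values in one subset and split the counts by owning device."""
    portions = [Counter() for _ in range(num_devices)]
    for value in subset:
        portions[owner_of(value, num_devices)][value] += 1
    return portions  # a device retains its own slot and sends the others


def exchange_and_aggregate(subsets):
    """Simulate the exchange: device d merges every portion addressed to d."""
    n = len(subsets)
    outgoing = [local_pass(subset, n) for subset in subsets]
    summaries = []
    for d in range(n):
        merged = Counter()
        for sender in range(n):
            merged.update(outgoing[sender][d])
        summaries.append(merged)
    return summaries


def final_statistic(summaries):
    """Receiving device: combine the per-device summaries into set-wide statistics."""
    return {
        "n": sum(sum(s.values()) for s in summaries),
        "distinct": sum(len(s) for s in summaries),
    }


subsets = [["x", "y", "x"], ["y", "z"], ["x", "w", "w"]]
summaries = exchange_and_aggregate(subsets)
stats = final_statistic(summaries)
```

Since every distinct value hashes to exactly one owner, the per-device summaries cover disjoint sets of values, so the distinct count for the whole set is simply the sum of the per-device counts. This is one plausible reason such a scheme suits high cardinality data.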
Address: Cary, NC, US