Title of invention: Methods and systems to operate on group-by sets with high cardinality
Abstract: This disclosure describes methods, systems, computer-readable media, and apparatuses for efficiently calculating group-by statistics. A data set that includes multiple entries is accessed. The multiple entries are grouped into group-by subsets, which are formed on two or more group-by variables and which are subsets of the data set. Cardinality data is determined for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset. At least one summary of data in each of the group-by subsets is generated, wherein each of the summaries includes the cardinality data determined for the group-by subset. Objects for the group-by subsets are initialized such that the objects store the summaries. The objects may then be used to generate multiple statistical summaries of the data set.
Publication number: US9633104(B2) Publication date: 2017.04.25
Application number: US201414270297 Filing date: 2014.05.05
Applicant: SAS Institute Inc. Inventors: Wu Xunlei; Schabenberger Oliver
Classification: G06F17/30 Main classification: G06F17/30
Agency: Kilpatrick Townsend & Stockton LLP Agent: Kilpatrick Townsend & Stockton LLP
Main claim: 1. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the storage medium having instructions stored thereon, and the instructions being operable to cause a data-processing apparatus to perform operations including: accessing a data set that includes multiple entries, each of the entries including data corresponding to multiple variables; grouping the multiple entries into group-by subsets, wherein the group-by subsets are formed on two or more group-by variables, and wherein the group-by subsets include multiple disjoint subsets of the data set, multiple intersecting subsets of the data set, or multiple subsets of the data set which are formed on different combinations of group-by variables; displaying an interface that facilitates defining a subset of the data set by referencing one or more of the group-by subsets; receiving an input at the interface, the input defining a subset of the data set by referencing at least one of the group-by subsets; generating a statistical summary of the defined subset; determining cardinality data for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset; generating at least one summary of data in each of the group-by subsets, wherein each of the summaries includes the cardinality data determined for the group-by subset; initializing objects for the group-by subsets, wherein each of the objects include the cardinality data and the at least one summary, and wherein each of the objects includes values of the group-by variables used in forming the group-by subset; and generating multiple statistical summaries of the data set using the objects.
Address: Cary NC US
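Illustrative sketch (not part of the patent record): the abstract and claim 1 describe grouping entries on two or more group-by variables, storing each group's cardinality and summaries in an object together with the group-by values, and then deriving further statistical summaries from those objects. The Python below is one minimal way that idea could look, not the patented implementation. Every name in it (`GroupObject`, `build_objects`, `roll_up`, the `region`/`product`/`sales` fields, and the sample rows) is a hypothetical chosen for illustration, and the choice of count, sum, minimum, and maximum as the stored summaries is an assumption.

```python
from dataclasses import dataclass
from math import inf


@dataclass
class GroupObject:
    """Hypothetical per-group object: group-by values, cardinality, summaries."""
    key: tuple             # values of the group-by variables for this subset
    count: int = 0         # cardinality: number of entries in the subset
    total: float = 0.0     # running sum of the measured variable
    minimum: float = inf
    maximum: float = -inf

    def add(self, value):
        """Fold one entry's value into the stored summaries."""
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        """Combine another group's summaries without touching raw entries."""
        self.count += other.count
        self.total += other.total
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)

    @property
    def mean(self):
        return self.total / self.count


def build_objects(rows, group_vars, measure):
    """Single pass over the data set: one object per group-by subset."""
    objects = {}
    for row in rows:
        key = tuple(row[v] for v in group_vars)
        objects.setdefault(key, GroupObject(key)).add(row[measure])
    return objects


def roll_up(objects, group_vars, keep):
    """Derive summaries on a coarser set of variables from the objects alone."""
    positions = [group_vars.index(v) for v in keep]
    rolled = {}
    for key, obj in objects.items():
        sub_key = tuple(key[p] for p in positions)
        rolled.setdefault(sub_key, GroupObject(sub_key)).merge(obj)
    return rolled


# Made-up sample data, purely for illustration.
GROUP_VARS = ["region", "product"]
rows = [
    {"region": "East", "product": "A", "sales": 10.0},
    {"region": "East", "product": "A", "sales": 20.0},
    {"region": "East", "product": "B", "sales": 5.0},
    {"region": "West", "product": "A", "sales": 7.0},
]

objects = build_objects(rows, GROUP_VARS, "sales")
by_region = roll_up(objects, GROUP_VARS, ["region"])
overall = roll_up(objects, GROUP_VARS, [])[()]
```

Because each object carries its own cardinality, summaries such as means can be recombined across groups without rescanning the original entries; in this sketch, `by_region` and `overall` are computed only from `objects`.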