Title System And Method For Fast Identification Of Variable Roles During Initial Data Exploration
Abstract Systems and methods are provided for identifying data variable roles during initial data exploration. In one example, a computer-implemented method of determining a role for a data variable is disclosed. The method comprises identifying to a plurality of data nodes a set of data records containing data values assigned to each data node, a maximum number of levels to record in a sorted data structure at the data nodes, and the data node responsible for each of a plurality of variables. The method further comprises receiving, for each variable, from the data node responsible for the variable, a plurality of unique data values for the variable, a count for each of the unique data values, and an overflow count for the variable, wherein the number of unique data values does not exceed the maximum number of levels. A role for a variable can be determined based upon the unique data values, counts, and overflow count for the variable.
Publication No. US2014237001(A1) Publication Date 2014.08.21
Application No. US201313772404 Filing Date 2013.02.21
Applicant SAS INSTITUTE INC. Inventors Guirguis Georges H.; Pope Scott
Classification G06F17/30 Main Classification G06F17/30
Agency Agent
Main Claim 1. A computer-implemented method of determining a role for a data variable for use in data modeling of a physical process, comprising: identifying to a plurality of data nodes a set of data records containing data values assigned to each data node, a maximum number of levels to record in a sorted data structure at the data nodes, and the data node responsible for each of a plurality of variables; receiving for each variable from the data node responsible for the variable a plurality of unique data values for the variable, a count for each of the unique data values and an overflow count for the variable, wherein the number of unique data values does not exceed the maximum number of levels, wherein the data values, counts and overflow count have been generated at a plurality of data nodes by node data processors configured by data processing instructions to: determine whether a next data value for a data record can be added to the sorted data structure at the data node and that a count associated with that next data value can be added to the sorted data structure when the next data value can be added, determine whether the next data value is already included in the sorted data structure and that the count associated with that next data value can be incremented when the next data value is already included, and determine whether the next data value should not be added to the data structure and that an overflow count at that node should be incremented when the next data value cannot be added; wherein a role for a variable can be determined based upon the unique data values, counts and overflow count for a variable.
Address Cary NC US
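
The per-node counting step recited in the main claim can be illustrated with a short Python sketch. It is a hedged illustration only, not the patented implementation: the class name `LevelCounter`, its attributes, and the rule that a new value "can be added" only while fewer than `max_levels` distinct values are stored are all assumptions made for this example. The record itself does not say how that decision is made.

```python
from bisect import bisect_left


class LevelCounter:
    """Hypothetical per-variable counter kept at one data node.

    Assumption: a new value may be added only while fewer than
    `max_levels` distinct values are stored; anything else that is
    not already stored is tallied in `overflow`.
    """

    def __init__(self, max_levels):
        self.max_levels = max_levels
        self.values = []    # unique values, kept in sorted order
        self.counts = []    # counts[i] belongs to values[i]
        self.overflow = 0   # records whose value could not be stored

    def add(self, value):
        # Binary search for the value's position in the sorted list.
        i = bisect_left(self.values, value)
        if i < len(self.values) and self.values[i] == value:
            # Value already included: increment its count.
            self.counts[i] += 1
        elif len(self.values) < self.max_levels:
            # Room left: insert the value with an initial count of 1.
            self.values.insert(i, value)
            self.counts.insert(i, 1)
        else:
            # Structure is full: count the record as overflow.
            self.overflow += 1


counter = LevelCounter(max_levels=2)
for v in ["b", "a", "b", "c", "a", "d"]:
    counter.add(v)

print(counter.values, counter.counts, counter.overflow)
# ['a', 'b'] [2, 2] 2
```

In this sketch a nonzero overflow count signals that the variable has more distinct values than the configured maximum, which is one of the inputs the claim says a role determination can draw on.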