Title of invention: Automatic consistent sampling for data analysis
Abstract: A method, computer program product, and system for analyzing data within one or more databases, comprising selecting one or more databases for analysis, each database comprising one or more database objects comprising one or more data values, applying a function to each data value in each database object within the one or more databases, where the function produces function values limited to a predetermined range, identifying for analysis the data values producing a certain function value within the predetermined range to form a sampled data set, and analyzing the sampled data set to determine relationships between the database objects within and across the one or more databases.
Publication number: US8856085(B2)
Publication date: 2014.10.07
Application number: US201113185601
Filing date: 2011.07.19
Applicant: International Business Machines Corporation
Inventor: Gorelik Alexander
Classification: G06F17/30
Main classification: G06F17/30
Agency: Edell, Shapiro & Finnan, LLC
Agent: Kashef Mohammed; Edell, Shapiro & Finnan, LLC
Principal claim: 1. A computer program product for analyzing data within one or more databases, comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising computer readable program code configured to:
select one or more databases for analysis, each database comprising one or more database objects comprising one or more data values, wherein the data values in each database object are arranged in columns;
apply a function to each data value in each database object within the one or more databases, wherein the function produces function values limited to a predetermined range;
identify for analysis the data values producing a certain function value within the predetermined range to form a sampled data set;
identify for analysis the data values that produce function values other than the certain function value and reside in one or more columns lacking high cardinality to form an unsampled data set, wherein a column has a high cardinality when data values in the column satisfy one or more from a group of a predetermined cardinality threshold and a predetermined selectivity threshold; and
analyze the sampled data set with the unsampled data set by matching data values within these data sets to determine relationships between the database objects within and across the one or more databases.
Address: Armonk NY US
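
Illustrative sketch (not part of the patent record): the steps recited in claim 1 can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation. The record does not name the function, the size of the predetermined range, the "certain function value", or the thresholds, so the CRC-32 hash, the constants, and every function name below are hypothetical choices made only for illustration. The point it shows is why the sampling is "consistent": because the same function is applied to every value, a given value is either sampled in every column where it occurs or in none, so sampled columns can still be matched against each other.

```python
import zlib

# All constants below are assumptions chosen for illustration only.
NUM_BUCKETS = 100            # size of the "predetermined range" of function values
SAMPLE_BUCKET = 0            # the "certain function value" that selects a sample
CARDINALITY_THRESHOLD = 50   # distinct-value count at which a column is high cardinality
SELECTIVITY_THRESHOLD = 0.5  # distinct/total ratio at which a column is high cardinality


def bucket(value):
    """Map a data value to a function value in [0, NUM_BUCKETS).

    CRC-32 is a stand-in; the record does not say which function is used.
    """
    return zlib.crc32(str(value).encode("utf-8")) % NUM_BUCKETS


def is_high_cardinality(values):
    """Return True if the column meets either assumed threshold."""
    if not values:
        return False
    distinct = len(set(values))
    selectivity = distinct / len(values)
    return (distinct >= CARDINALITY_THRESHOLD
            or selectivity >= SELECTIVITY_THRESHOLD)


def analysis_values(values):
    """Return the distinct values of one column that are kept for analysis.

    Sampled set: values whose function value equals SAMPLE_BUCKET.
    Unsampled set: the remaining values, kept only when the column
    lacks high cardinality.
    """
    distinct = set(values)
    sampled = {v for v in distinct if bucket(v) == SAMPLE_BUCKET}
    if is_high_cardinality(values):
        return sampled
    unsampled = distinct - sampled
    return sampled | unsampled


def shared_values(column_a, column_b):
    """Match two columns on the values kept for analysis."""
    return analysis_values(column_a) & analysis_values(column_b)


if __name__ == "__main__":
    # Hypothetical columns from two different tables or databases.
    order_customer_ids = list(range(1000))
    customer_ids = list(range(500, 1500))
    overlap = shared_values(order_customer_ids, customer_ids)
    print(len(overlap), "sampled values shared by the two columns")
```

With independent random sampling of each column, two columns that truly share values could end up with disjoint samples. Selecting by function value avoids that, at the cost of a sample size that depends on how the values happen to hash.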