Title AUTOMATIC JOINING OF DATA SETS BASED ON STATISTICS OF FIELD VALUES IN THE DATA SETS
Abstract A computer system processes arbitrary data sets to identify fields of data that can be the basis of a join operation. Each data set has a plurality of entries, with each entry having a plurality of fields. For each pair of data sets, the computer system compares the values of fields in a first data set in the pair of data sets to the values of fields in a second data set in the pair of data sets, to identify fields having substantially similar sets of values. Given pairs of fields that have similar sets of values, the computer system measures entropy with respect to an intersection of the sets of values of the pair of fields. The computer system can recommend fields for a join operation between any pair of data sets in the plurality of data sets based on such statistical measures.
Publication number US2016055212(A1) Publication date 2016.02.25
Application number US201414466231 Filing date 2014.08.22
Applicant Attivio, Inc. Inventors Young Jonathan;O'Neil John;Johnson, III William K.;Serrano Martin;George Gregory
Classification G06F17/30 Main classification G06F17/30
Agency Agent
Main claim 1. A computer-implemented process comprising: receiving a plurality of data sets, each data set having a plurality of entries, each entry having a plurality of fields, wherein a field in the plurality of fields has at least one value; for each pair of data sets in the plurality of data sets: comparing the values of fields in a first data set in the pair of data sets to the values of fields in a second data set in the pair of data sets to identify fields having substantially similar sets of values, and measuring entropy with respect to an intersection of the sets of values of the identified fields from the pair of data sets; and suggesting fields for a join operation between any pair of data sets in the plurality of data sets, based at least on the measured entropy with respect to the intersection of the sets of values of the identified fields from the pair of data sets.
Address Newton MA US
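The three steps of the main claim (compare field values across a pair of data sets, measure entropy over the intersection of the value sets, suggest join fields) can be illustrated with a minimal Python sketch. This is not the patented implementation: the record does not say how "substantially similar" is decided or exactly how the entropy is computed. The sketch assumes Jaccard similarity with a hypothetical threshold `min_similarity`, Shannon entropy of each field's values restricted to the shared values, and a ranking that prefers higher entropy. All function names and the sample data are hypothetical, and each field is assumed to hold a single hashable value.

```python
import math
from collections import Counter
from itertools import combinations


def columns_of(rows):
    """Turn a list of row dicts into {field name: list of values}."""
    cols = {}
    for row in rows:
        for field, value in row.items():
            cols.setdefault(field, []).append(value)
    return cols


def jaccard(a, b):
    """Jaccard similarity of two value sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def entropy_over(values, keep):
    """Shannon entropy in bits of the values that fall inside `keep`."""
    counts = Counter(v for v in values if v in keep)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def suggest_joins(data_sets, min_similarity=0.5):
    """Rank candidate join fields for every pair of data sets.

    `data_sets` maps a data set name to a list of row dicts.
    Returns tuples (entropy, similarity, (set1, field1), (set2, field2)),
    highest entropy first (the ranking rule is an assumption).
    """
    columns = {name: columns_of(rows) for name, rows in data_sets.items()}
    suggestions = []
    for (n1, c1), (n2, c2) in combinations(columns.items(), 2):
        for f1, v1 in c1.items():
            for f2, v2 in c2.items():
                s1, s2 = set(v1), set(v2)
                similarity = jaccard(s1, s2)
                if similarity < min_similarity:
                    continue  # value sets are not similar enough
                common = s1 & s2
                # Use the weaker side so both fields must be informative.
                score = min(entropy_over(v1, common), entropy_over(v2, common))
                suggestions.append((score, similarity, (n1, f1), (n2, f2)))
    suggestions.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return suggestions


# Hypothetical sample data: two small data sets with unlabeled relationships.
customers = [
    {"cust_id": 1, "state": "MA"},
    {"cust_id": 2, "state": "MA"},
    {"cust_id": 3, "state": "NY"},
    {"cust_id": 4, "state": "NY"},
]
orders = [
    {"order_id": 10, "customer": 1, "region": "MA"},
    {"order_id": 11, "customer": 2, "region": "NY"},
    {"order_id": 12, "customer": 3, "region": "MA"},
    {"order_id": 13, "customer": 4, "region": "NY"},
]
suggestions = suggest_joins({"customers": customers, "orders": orders})
for score, similarity, left, right in suggestions:
    print(left, right, round(score, 3), round(similarity, 3))
```

In this sketch a field pair whose shared values are spread over many distinct keys (an identifier such as `cust_id`) scores higher than a pair that overlaps on only a few repeated values (such as `state` and `region`), which is one plausible reason to rank by entropy. Other similarity measures or entropy definitions would fit the claim language equally well.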