发明名称 Automated creation of join graphs for unrelated data sets among relational databases
摘要 A computer system processes a plurality of unrelated database tables to create a join graph where each node in the join graph represents a table. The nodes in the join graph are connected by weighted, directed edges, where each directed edge represents a join from a first column in a first table to a second column in a second table and where the weight of the directed edge represents a predicted level of success in performing the join. The edge weights can be based on a likelihood of finding a value from the first column in the second column. A user selects a subset of the tables, and the system creates a join tree with recommended joins between the tables selected by the user. The recommended joins are used to create a structured query language statement which is executed to return a result to the user.
申请公布号 US9507824(B2) 申请公布日期 2016.11.29
申请号 US201514834430 申请日期 2015.08.24
申请人 Attivio Inc. 发明人 Young Jonathan;O'Neil John;Johnson, III William K.;Serrano Martin;George Gregory;Das Udayan
分类号 G06F17/30 主分类号 G06F17/30
代理机构 Patent GC LLC 代理人 Patent GC LLC
主权项 1. A method implemented on a computer system, the method comprising, the computer system: for each table of a plurality of database tables and for each column of a plurality of columns within the each table, creating a profile for the each column by accessing and analyzing a subset of values stored in the column; establishing a join graph of nodes, wherein each node represents one of the plurality of database tables; for each pair of a plurality of pairs of a first table and a second table from the plurality of database tables, wherein the first table is different than the second table and wherein no defined relationship exists between the first table and the second table: for each pair of a plurality of pairs of a first column from the first table and a second column from the second table, calculating a joinability score representative of a predicted level of success in performing a join from the first table on the first column to the second table on the second column, wherein the score is determined based upon the profile for the first column and the profile for the second column, andfor one pair of the plurality of pairs of the first column from the first table and the second column from the second table, adding, based on the joinability score, a directed edge to the join graph from a node representing the first table to a node representing the second table; receiving a selection of a subset of the plurality of database tables; creating a join tree comprising a subset of edges in the join graph that spans a subset of nodes in the join graph corresponding to the selected subset of the plurality of database tables; extracting a set of joins represented by the subset of edges; and providing the extracted set of joins as a result, wherein creating a profile for the each column comprises: processing the each column to create a set of m observables, with m being a positive integer constant greater than one, wherein each observable is a function of a set of elements in the each column, independent of replications, andincluding the set of m observables in the profile for the each column, and wherein calculating the joinability score comprises: combining the set of m observables included in the profile for the first column and the set of m observables included in the profile for the second column to create a combined set of m observables, wherein each observable in the combined set of m observables is a function of a set of elements in a union between the first column and the second column, independent of replications,computing an estimated cardinality of a union between the first column and the second column based on the combined set of m observables without creating a union between the first column and the second column,computing an estimated cardinality of an intersection between the first column and the second column by subtracting the estimated cardinality of the union from the sum of an estimated cardinality of the first column and an estimated cardinality of the second column, anddividing the estimated cardinality of the intersection by the estimated cardinality of the first column.
地址 Newton MA US