Main claim |
1. A method implemented on a computer system, the method comprising, the computer system:
for each table of a plurality of database tables and for each column of a plurality of columns within the each table, creating a profile for the each column by accessing and analyzing a subset of values stored in the column;
establishing a join graph of nodes, wherein each node represents one of the plurality of database tables;
for each pair of a plurality of pairs of a first table and a second table from the plurality of database tables, wherein the first table is different than the second table and wherein no defined relationship exists between the first table and the second table:
for each pair of a plurality of pairs of a first column from the first table and a second column from the second table, calculating a joinability score representative of a predicted level of success in performing a join from the first table on the first column to the second table on the second column, wherein the score is determined based upon the profile for the first column and the profile for the second column, and
for one pair of the plurality of pairs of the first column from the first table and the second column from the second table, adding, based on the joinability score, a directed edge to the join graph from a node representing the first table to a node representing the second table;
receiving a selection of a subset of the plurality of database tables;
creating a join tree comprising a subset of edges in the join graph that spans a subset of nodes in the join graph corresponding to the selected subset of the plurality of database tables;
extracting a set of joins represented by the subset of edges; and
providing the extracted set of joins as a result,
wherein creating a profile for the each column comprises:
processing the each column to create a set of m observables, with m being a positive integer constant greater than one, wherein each observable is a function of a set of elements in the each column, independent of replications, and
including the set of m observables in the profile for the each column, and
wherein calculating the joinability score comprises:
combining the set of m observables included in the profile for the first column and the set of m observables included in the profile for the second column to create a combined set of m observables, wherein each observable in the combined set of m observables is a function of a set of elements in a union between the first column and the second column, independent of replications,
computing an estimated cardinality of a union between the first column and the second column based on the combined set of m observables without creating a union between the first column and the second column,
computing an estimated cardinality of an intersection between the first column and the second column by subtracting the estimated cardinality of the union from the sum of an estimated cardinality of the first column and an estimated cardinality of the second column, and
dividing the estimated cardinality of the intersection by the estimated cardinality of the first column. |
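The profiling and scoring steps recited above can be illustrated with a minimal Python sketch. The claim does not name a particular sketch structure; this example assumes HyperLogLog-style registers as the m observables, since each register is a maximum over the distinct elements of a column and is therefore unaffected by replications. The function names (`make_profile`, `combine`, `estimate_cardinality`, `joinability`), the register count `M`, the hash choice, and the clamping of the intersection estimate at zero are all illustrative assumptions, not part of the claim.

```python
import hashlib
import math

P = 10        # assumption: 2**P registers per column profile
M = 1 << P    # m, the number of observables (a constant greater than one)


def _hash64(value) -> int:
    """Map a value to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_profile(values):
    """Create m observables; each is a max over elements, so duplicates have no effect."""
    registers = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                      # top P bits select a register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
        if rank > registers[idx]:
            registers[idx] = rank
    return registers


def combine(profile_a, profile_b):
    """Element-wise max gives the observables of the union, without building the union."""
    return [max(x, y) for x, y in zip(profile_a, profile_b)]


def estimate_cardinality(registers):
    """HyperLogLog estimate with the usual small-range (linear counting) correction."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


def joinability(profile_a, profile_b):
    """Estimated |A ∩ B| / |A|, using |A ∩ B| = |A| + |B| - |A ∪ B|."""
    card_a = estimate_cardinality(profile_a)
    card_b = estimate_cardinality(profile_b)
    card_union = estimate_cardinality(combine(profile_a, profile_b))
    # Clamping at zero is an added guard against estimation noise (not in the claim).
    card_inter = max(0.0, card_a + card_b - card_union)
    return card_inter / card_a if card_a else 0.0


# Hypothetical columns: a foreign-key-like column with repeats, and a key column.
orders_customer_id = [i % 500 for i in range(5000)]
customers_id = list(range(1000))
score = joinability(make_profile(orders_customer_id), make_profile(customers_id))
print(f"joinability orders.customer_id -> customers.id: {score:.2f}")
```

Because every value in the first hypothetical column also appears in the second, the score from the first column to the second comes out close to 1, while the score in the reverse direction is roughly one half. This asymmetry is consistent with the claim's use of a directed edge in the join graph.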