Title of invention: ADAPTIVE HANDLING OF SKEW FOR DISTRIBUTED JOINS IN A CLUSTER
Abstract: Techniques are disclosed for detecting data skew while performing a distributed join operation on tables in a cluster of nodes managed by a database management system (cDBMS). In an embodiment, heavy hitter values in a join column of a table are determined during the runtime of a distributed join operation of the table with another table. The cDBMS keeps in a datastore a count for each unique value read from the join column of the table. The datastore may be a hash table with the unique values serving as keys, and may additionally include a heap or a sorted array for efficient count-based traversal. When the count for a particular value in the datastore exceeds a threshold, the particular value is identified as a heavy hitter value. The tuples from the joined table that include the heavy hitter value are kept local at the node to which they were originally distributed, while the other joined table tuples are broadcast to one or more nodes of the cDBMS that at least include the originally distributed nodes.
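The counting scheme described in the abstract can be illustrated with a minimal Python sketch. It assumes a single node, an in-memory stream of join-column values, and a fixed threshold. The function name `detect_heavy_hitters` and its parameters are hypothetical and do not come from the patent. `Counter` plays the role of the hash table keyed by unique value, and `most_common` stands in for the heap or sorted array used for count-based traversal.

```python
from collections import Counter


def detect_heavy_hitters(join_keys, threshold):
    """Scan join-column values and flag those whose count exceeds threshold.

    Hypothetical helper for illustration only; not the patented implementation.
    Returns (heavy, counts): heavy lists the flagged values in the order they
    crossed the threshold, counts maps every unique value to its final count.
    """
    counts = Counter()   # hash table: unique join-key value -> running count
    flagged = set()      # values already identified as heavy hitters
    heavy = []
    for key in join_keys:
        counts[key] += 1
        # A value becomes a heavy hitter the first time its count exceeds
        # the threshold, i.e. while the scan is still running.
        if counts[key] > threshold and key not in flagged:
            flagged.add(key)
            heavy.append(key)
    return heavy, counts


heavy, counts = detect_heavy_hitters([1, 2, 1, 3, 1, 1, 2], threshold=2)
# Count-ordered traversal of the datastore, most frequent value first.
top_two = counts.most_common(2)
```

In this sketch a value is flagged as soon as its running count passes the threshold, which mirrors the runtime detection the abstract describes rather than a separate statistics pass.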
Publication number: US2016267135(A1)  Publication date: 2016.09.15
Application number: US201514871490  Filing date: 2015.09.30
Applicant: Oracle International Corporation  Inventors: IDICULA SAM; ROEDIGER WOLF
Classification: G06F17/30  Main classification: G06F17/30
Agency:  Agent:
Principal claim: 1. A method for executing a join operation by determining a distribution of tuples, for the join operation, from a database among a cluster of nodes and that are coupled to a database management system (DBMS), the DBMS managing the database, comprising: executing a particular join operation to join a first table with a second table based on a first join key of the first table and a second join key of the second table, the executing further comprising: said DBMS distributing first plurality of tuples of the first table across nodes of the cluster; said DBMS distributing second plurality of tuples of the second table across nodes of the cluster; said cluster of nodes generating a respective count for each second join key value of a subset of the second join key values in said second join key at a receipt of said second plurality of tuples of the second table; based on the respective count of a particular second join key value of said subset, establishing said particular second join key value as a heavy hitter; in response to determining that the second join key value is a heavy hitter, replicating a first tuple across a set of nodes of the cluster, wherein the first tuple contains a first join key value, from the first join key, that corresponds to said particular second join key value; and each node of said set of nodes locally performing a join operation between plurality of tuples of said first table and said second table based on said first join key value and said second join key value.
Address: Redwood Shores CA US
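The steps of the principal claim can be simulated in a single process. This is a rough sketch under stated assumptions, not the claimed implementation: nodes are modeled as Python lists, the function name `skew_aware_join` is hypothetical, heavy hitters are found in one counting pass before redistribution rather than incrementally at receipt, and the "set of nodes" that receives replicated first-table tuples is taken to be every node.

```python
from collections import Counter, defaultdict


def skew_aware_join(r_tuples, s_partitions, threshold):
    """Join first table R with second table S across simulated nodes.

    r_tuples: list of (key, payload) tuples of the first table.
    s_partitions: one list of (key, payload) tuples per node, second table.
    Returns (heavy, results) where results holds (key, r_payload, s_payload).
    """
    n = len(s_partitions)

    # Count second-table join key values; keys above threshold are heavy.
    counts = Counter(k for part in s_partitions for k, _ in part)
    heavy = {k for k, c in counts.items() if c > threshold}

    # Second table: heavy-hitter tuples stay on their current node,
    # all other tuples are hash-partitioned on the join key.
    s_local = [[] for _ in range(n)]
    for node, part in enumerate(s_partitions):
        for k, v in part:
            target = node if k in heavy else hash(k) % n
            s_local[target].append((k, v))

    # First table: tuples matching a heavy key are replicated to every node
    # (assumption), the rest go to the same hash partition as their S matches.
    r_local = [[] for _ in range(n)]
    for k, v in r_tuples:
        targets = range(n) if k in heavy else [hash(k) % n]
        for t in targets:
            r_local[t].append((k, v))

    # Each node performs an ordinary local hash join.
    results = []
    for node in range(n):
        index = defaultdict(list)
        for k, v in r_local[node]:
            index[k].append(v)
        for k, sv in s_local[node]:
            for rv in index.get(k, []):
                results.append((k, rv, sv))
    return heavy, results


r = [(1, "a"), (2, "b"), (3, "c")]
s_parts = [[(1, "x1"), (1, "x2"), (2, "y")], [(1, "x3"), (3, "z")]]
heavy, joined = skew_aware_join(r, s_parts, threshold=2)
```

Because each second-table tuple ends up on exactly one node, replicating the matching first-table tuple does not produce duplicate result rows; the union of the local joins equals the ordinary join.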