Title of invention: FORMULATING GLOBAL STATISTICS FOR DISTRIBUTED DATABASES
Abstract: The present invention extends to methods, systems, and computer program products for formulating global statistics for parallel databases. In general, embodiments of the invention merge (combine) information in multiple compute node level histograms to create a global histogram for a table that is distributed across a number of compute nodes. Merging can include aligning histogram step boundaries across the compute node histograms. Merging can include aggregating histogram step-level information, such as, for example, equality rows and average range rows (or alternately equality rows, range rows, and distinct range rows), across the compute node histograms into a single global step. Merging can account for distinct values that do not appear at one or more compute nodes as well as distinct values that are counted at multiple compute nodes. A resulting global histogram can be coalesced to reduce the step count.
Publication number: US2015169688(A1) Publication date: 2015.06.18
Application number: US201514631735 Filing date: 2015.02.25
Applicant: Microsoft Technology Licensing, LLC Inventors: Halverson Alan Dale; Robinson Eric R.; Shankar Srinath; Naughton Jeffrey F.
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Principal claim: 1. A method for use at a computer system, the computer system including one or more processors and system memory, the computer system connected to a plurality of compute nodes, a data partitioning algorithm defining how portions of a table are partitioned across the plurality of compute nodes, the method for optimizing a query of the table based on global statistics for a multi-value element of the table, the method comprising: receiving a query, the query expressing a logical intent for retrieving specified data from the table, the specified data distributed across at least two of the plurality of compute nodes; considering a plurality of different parallel query plans for implementing the expressed logical intent of the query; accessing the global statistics for the multi-value element from an optimization database, the global statistics having been formulated by: formulating a global probability distribution estimate for the multi-value element by merging probability distribution estimates for the multi-value element accessed from each of the plurality of compute nodes, the global probability distribution estimate including global steps defined by step boundaries, including for each global step: calculating a number of occurrences of an upper boundary value by adding the number of occurrences of upper boundary values across the probability distribution estimates; and calculating a global central tendency for a number of orthogonal multi-value elements per distinct value for the global step, the global central tendency calculated based on a probability that a particular distinct value appears at a particular node in view of the data partitioning algorithm, the orthogonal multi-value elements being orthogonal to the multi-value element within the table; and optimizing execution of the query by selecting a parallel query plan, from among the plurality of different parallel query plans, determined to perform better than other of the plurality of query plans in view of the accessed global statistics for the multi-value element.
Address: Redmond WA US
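
Illustrative note: the abstract describes aligning step boundaries across compute node histograms and then adding step-level counts into global steps. The following is a minimal Python sketch of that idea, not the patented method itself. The names `Step`, `realign`, and `merge_histograms` are hypothetical, and the sketch makes two simplifying assumptions that the record does not state: range rows are spread uniformly across a step's width when a step is split, and rows that happen to equal a newly introduced interior boundary stay counted as range rows. It does not model distinct range rows, the per-node appearance probability mentioned in the claim, or coalescing.

```python
from dataclasses import dataclass


@dataclass
class Step:
    upper: float       # upper boundary value of the step
    eq_rows: float     # rows equal to the upper boundary value
    range_rows: float  # rows strictly between the previous boundary and upper


def realign(node_hist, boundaries):
    """Re-cut one node histogram onto the shared, sorted boundary list.

    Assumption (not from the source): range rows are split in proportion
    to interval width, i.e. a uniform spread inside each step.
    """
    out = [Step(b, 0.0, 0.0) for b in boundaries]
    index = {b: i for i, b in enumerate(boundaries)}
    prev = None
    for step in node_hist:
        i_hi = index[step.upper]
        # Equality rows always belong to the step's own upper boundary.
        out[i_hi].eq_rows += step.eq_rows
        if prev is None or step.range_rows == 0:
            # First step has no known lower bound, so it is not split.
            out[i_hi].range_rows += step.range_rows
        else:
            width = step.upper - prev
            lo = prev
            for i in range(index[prev] + 1, i_hi + 1):
                share = (boundaries[i] - lo) / width
                out[i].range_rows += step.range_rows * share
                lo = boundaries[i]
        prev = step.upper
    return out


def merge_histograms(node_hists):
    """Align boundaries across nodes, then add counts step by step."""
    boundaries = sorted({s.upper for h in node_hists for s in h})
    merged = [Step(b, 0.0, 0.0) for b in boundaries]
    for hist in node_hists:
        for g, s in zip(merged, realign(hist, boundaries)):
            g.eq_rows += s.eq_rows
            g.range_rows += s.range_rows
    return merged


# Two hypothetical node histograms over the same column.
node_a = [Step(10, 2, 0), Step(20, 3, 8)]
node_b = [Step(10, 1, 0), Step(15, 4, 2), Step(20, 1, 6)]
global_hist = merge_histograms([node_a, node_b])
```

In this example node A has no boundary at 15, so its step ending at 20 is split at 15 before the counts are added. The total row count is preserved by the merge, which is a useful sanity check for any variant of this sketch.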