Title of invention: Efficient query processing using histograms in a columnar database
Abstract: A probabilistic data structure is generated for efficient query processing using a histogram for unsorted data in a column of a columnar database. A bucket range size is determined for multiple buckets of a histogram of a column in a columnar database table. In at least some embodiments, the histogram may be a height-balanced histogram. A probabilistic data structure is generated to indicate for which particular buckets in the histogram there is a data value stored in the data block. When an indication of a query directed to the column for select data is received, the probabilistic data structure for each of the data blocks storing data for the column may be examined to determine particular ones of the data blocks which do not need to be read in order to service the query for the select data.
Publication number: US8949224(B2) Publication date: 2015.02.03
Application number: US201313742287 Filing date: 2013.01.15
Applicant: Amazon Technologies, Inc. Inventor: Gupta Anurag Windlass
Classification: G06F17/30 Main classification: G06F17/30
Agency: Meyertons, Hood, Kivlin, Kowert & Goetzel, P.C. Agent: Kowert Robert C.; Meyertons, Hood, Kivlin, Kowert & Goetzel, P.C.
Principal claim: 1. A distributed data warehouse system, comprising: a plurality of nodes; wherein at least some nodes of the plurality of nodes each comprise: storage for a columnar database table, wherein said storage comprises a plurality of data blocks; a query execution module; wherein at least one node of the plurality of nodes comprises a height-balanced histogram generator, configured to: determine a plurality of bucket range sizes for a height-balanced histogram representing a distribution of data among a plurality of buckets in a column of the columnar database table, wherein each bucket of the plurality of buckets represents an existence of one or more data values of the data in the column within a range of values; generate a probabilistic data structure for each data block of one or more data blocks storing data for the column, wherein the probabilistic data structure indicates for which buckets of the plurality of buckets there is a data value in the bucket range size stored in the data block; wherein the query execution module is configured to: receive an indication of a query directed to the column of the columnar database table for select data; in response to receiving the indication of the query: examine the probabilistic data structure for each of the one or more data blocks storing data for the column to determine particular ones of the one or more data blocks which do not need to be read in order to service the query for the select data; and read the one or more data blocks storing data for the column excepting the particular ones of the one or more data blocks which do not need to be read.
Address: Reno NV US
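Illustrative sketch: the block-skipping idea described in the abstract and claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The record does not specify the form of the probabilistic data structure, so a per-block bitmap with one bit per histogram bucket is assumed here. All function names (`height_balanced_boundaries`, `bucket_of`, `block_bitmap`, `blocks_to_read`) and the sample data are hypothetical.

```python
from bisect import bisect_right

def height_balanced_boundaries(values, num_buckets):
    """Pick bucket boundaries so each bucket holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Boundary i is the first value of bucket i (taken at the i-th quantile position).
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]

def bucket_of(value, boundaries):
    """Index of the bucket whose value range contains `value`."""
    return bisect_right(boundaries, value)

def block_bitmap(block_values, boundaries):
    """Assumed structure: bit b is set if the block holds any value in bucket b."""
    bits = 0
    for v in block_values:
        bits |= 1 << bucket_of(v, boundaries)
    return bits

def blocks_to_read(bitmaps, boundaries, low, high):
    """Indices of blocks that may hold a value in [low, high]; all others can be skipped."""
    mask = 0
    for b in range(bucket_of(low, boundaries), bucket_of(high, boundaries) + 1):
        mask |= 1 << b
    return [i for i, bits in enumerate(bitmaps) if bits & mask]

# Hypothetical column data, stored unsorted across four data blocks.
blocks = [[1, 2, 3, 50], [60, 70, 80, 90], [5, 95, 6, 7], [40, 41, 42, 43]]
all_values = [v for block in blocks for v in block]

boundaries = height_balanced_boundaries(all_values, num_buckets=4)
bitmaps = [block_bitmap(block, boundaries) for block in blocks]

# A range query for values between 80 and 100 only needs blocks 1 and 2.
candidates = blocks_to_read(bitmaps, boundaries, 80, 100)
```

Because a set bit only says that some value in the bucket's range is present, a block may be read even though it holds no matching value (a false positive), but a block holding a matching value is never skipped. In this sample, a query for values between 8 and 30 still reads blocks 2 and 3, since both hold other values from the same bucket.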