发明名称 Efficient query processing using histograms in a columnar database
摘要 A probabilistic data structure is generated for efficient query processing using a histogram for unsorted data in a column of a columnar database. A bucket range size is determined for multiples buckets of a histogram of a column in a columnar database table. In at least some embodiments, the histogram may be a height-balanced histogram. A probabilistic data structure is generated to indicate for which particular buckets in the histogram there is a data value stored in the data block. When an indication of a query directed to the column for select data is received, the probabilistic data structure for each of the data blocks storing data for the column may be examined to determine particular ones of the data blocks which do not need to be read in order to service the query for the select data.
申请公布号 US9268838(B2) 申请公布日期 2016.02.23
申请号 US201514611939 申请日期 2015.02.02
申请人 Amazon Technologies, Inc. 发明人 Gupta Anurag Windlass
分类号 G06F17/30 主分类号 G06F17/30
代理机构 Meyertons, Hood, Kivlin, Kowert & Goetzel, P.C. 代理人 Kowert Robert C.;Meyertons, Hood, Kivlin, Kowert & Goetzel, P.C.
主权项 1. A system, comprising: one or more hardware computing devices comprising a plurality of nodes, wherein the plurality of nodes comprise: storage for a plurality of data ranges corresponding to a distribution of data in a column of a columnar database table, wherein the data in the column is stored in a plurality of data blocks;storage for respective probabilistic data structures for individual ones of the plurality of data blocks, wherein the respective probabilistic data structure for a respective data block of the plurality of data blocks indicates which data ranges of the plurality of data ranges correspond to data values not stored in the respective data block;at least one node of the plurality of nodes configured to: detect a rebalancing event for the distribution of the data in the column of the columnar database table; andin response to detecting the rebalancing event: modify, based on a change to the distribution of the data corresponding to the rebalancing event, one or more of the plurality of data ranges; andupdate respective probabilistic data structures for respective ones of the plurality of data blocks, wherein said update is according to the modified one or more of the plurality of data ranges; anda query execution module configured to: receive an indication of a query directed to the column of the columnar database table; andin response to receiving the indication of the query, examine one or more of the probabilistic data structures for the plurality of data blocks to determine which ones of the plurality of data blocks not to read to service the query.
地址 Reno NV US