Title: SYSTEM AND METHOD FOR PERFORMING SET OPERATIONS WITH DEFINED SKETCH ACCURACY DISTRIBUTION
Abstract: Techniques are provided for improving the speed and accuracy of analytics on big data using theta sketches, by converting fixed-size sketches to theta sketches, and by performing set operations on sketches. In a technique for performing a set operation, two sketches are analyzed to identify the maximum value of each sketch. The maximum values of the two sketches are compared. Based on the comparison, one or more values are removed from the sketch whose maximum value is greater. After the removal, a set operation (e.g., union, intersection, or difference) is performed based on the modified sketch and the unmodified sketch. A result of the set operation is a third sketch, which may be used to estimate a cardinality of the larger data sets that are represented by the two input sketches.
Publication No.: US2015100596(A1) Publication date: 2015.04.09
Application No.: US201414448487 Filing date: 2014.07.31
Applicant: Yahoo! Inc. Inventors: Rhodes Lee; Dasgupta Anirban; Lang Kevin
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Principal claim: 1. A system comprising: a processor and a non-transitory memory comprising a sketch data structure that represents an output of a set operation on a first large data set and a second large data set, the sketch data structure comprising: a sample set that comprises a plurality of values, a target size, wherein a number of values in the sample set is based, at least in part, on the target size, and a scalar value that is based, at least in part, on the target size, and specifies an upper or lower bound for the plurality of values.
Address: Sunnyvale CA US
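
Illustrative example (not part of the patent record): the sketch structure named in the claim (a sample set, a target size, and a scalar bound) and the set operation outlined in the abstract can be pictured with the minimal Python sketch below. It is one possible reading under stated assumptions, not the applicant's implementation. The names `Sketch`, `hash01`, `set_operation`, and `build` are hypothetical; the SHA-256 hash is an arbitrary choice; and the scalar `theta` is treated as an upper bound that stands in for the "maximum value" the abstract compares.

```python
import hashlib
from dataclasses import dataclass, field


def hash01(item) -> float:
    """Map an item to a pseudo-random float between 0 and 1 (assumed hash choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


@dataclass
class Sketch:
    target_size: int                           # nominal number of retained values
    theta: float = 1.0                         # scalar upper bound: every sample is < theta
    samples: set = field(default_factory=set)  # retained hash values

    def add(self, item) -> None:
        h = hash01(item)
        if h >= self.theta:
            return                             # outside the sampled region
        self.samples.add(h)
        if len(self.samples) > self.target_size:
            # Lower theta to the largest retained hash and drop that hash,
            # which brings the sample set back to target_size values.
            self.theta = max(self.samples)
            self.samples.discard(self.theta)

    def estimate(self) -> float:
        """Estimated cardinality of the underlying data set."""
        return len(self.samples) / self.theta


def set_operation(a: Sketch, b: Sketch, op: str) -> Sketch:
    # Compare the two bounds and keep the smaller one.
    theta = min(a.theta, b.theta)
    # Only the sketch with the greater bound actually loses values here;
    # the other sketch already satisfies v < theta for all its values.
    kept_a = {v for v in a.samples if v < theta}
    kept_b = {v for v in b.samples if v < theta}
    if op == "union":
        result = kept_a | kept_b
    elif op == "intersection":
        result = kept_a & kept_b
    elif op == "difference":
        result = kept_a - kept_b
    else:
        raise ValueError(f"unknown operation: {op}")
    # The result is a third sketch; it is not trimmed back to target_size here.
    return Sketch(min(a.target_size, b.target_size), theta, result)


def build(items, target_size: int) -> Sketch:
    """Convenience helper (hypothetical): feed an iterable into a new sketch."""
    sketch = Sketch(target_size)
    for item in items:
        sketch.add(item)
    return sketch


if __name__ == "__main__":
    first = build(range(0, 10_000), target_size=64)
    second = build(range(5_000, 15_000), target_size=128)
    for op in ("union", "intersection", "difference"):
        print(op, round(set_operation(first, second, op).estimate()))
```

Because both inputs are cut to the same bound before the set operation runs, the resulting sample set covers one uniformly sampled region, so the same count-divided-by-theta estimator applies to the third sketch. The printed estimates are approximate and depend on the target sizes chosen.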