Abstract |
A method, system and computer program product for detecting outliers in a set of data points are disclosed. In one embodiment, the method comprises partitioning the set of data points into a plurality of bins, with each of the data points assigned to a respective one of the bins. A plurality of local lists are formed in parallel, identifying points in the bins as outliers, and the local lists are merged into a global list to identify one or more of the points as outliers of the data set. Embodiments of the invention provide an outlier detection system that is parallelized at two levels: the data set is split into partitions, called bins, and outliers are found in each bin in parallel; the processing of a single bin is also parallelized. These two modes of parallelism allow embodiments of the invention to scale to very large data sets.
|
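The partition, local-list and merge steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation. The abstract does not specify how points are assigned to bins or how outliers are scored, so the sketch assumes round-robin binning and scores each point by the distance to its k-th nearest neighbor in the full data set. All function names and the parameters `num_bins`, `k` and `n` are hypothetical. Only the first level of parallelism (across bins) is shown; the second level (within a single bin) is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq
import math


def partition(points, num_bins):
    """Assign each point index to exactly one bin (round-robin is an assumption)."""
    bins = [[] for _ in range(num_bins)]
    for i in range(len(points)):
        bins[i % num_bins].append(i)
    return bins


def kth_neighbor_distance(i, points, k):
    """Assumed outlier score: distance from point i to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]


def local_outliers(bin_indices, points, k, n):
    """Local list for one bin: its n highest-scoring points as (score, index)."""
    scored = [(kth_neighbor_distance(i, points, k), i) for i in bin_indices]
    return heapq.nlargest(n, scored)


def detect_outliers(points, num_bins=4, k=2, n=2):
    """Form local lists in parallel, then merge them into a global top-n list."""
    bins = partition(points, num_bins)
    with ThreadPoolExecutor() as pool:
        local_lists = list(
            pool.map(lambda b: local_outliers(b, points, k, n), bins)
        )
    # Merge step: the global top-n is always contained in the union of the
    # per-bin top-n lists, so merging the local lists loses nothing.
    merged = [item for local in local_lists for item in local]
    return [index for _, index in heapq.nlargest(n, merged)]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
            (10, 10), (-8, 9), (0.2, 0.8)]
    print(detect_outliers(data))
```

Because each local list keeps the n highest-scoring points of its bin, the merge only has to examine at most `num_bins * n` candidates rather than the whole data set. Threads are used here only for brevity; a process pool or a distributed runtime would be a more realistic choice for very large data sets.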