发明名称 Nearest neighbor clustering determination and estimation algorithm that hashes centroids into buckets and redistributes vectors between clusters
摘要 Embodiments are described for determining and/or estimating a nearest neighbor to a data vector in a dataset are presented. Some embodiments redistribute data vectors between clusters based upon the character of the clusters to more evenly balance the computational load. Some embodiments employ Locality Sensitive Hashing (LSH) functions as part of the clustering and remove redundant data vectors from the data set to mitigate unbalanced computation. The disclosed embodiments may facilitate the analysis of very large and/or very high dimensional datasets with reasonable runtimes.
申请公布号 US9552408(B2) 申请公布日期 2017.01.24
申请号 US201414163751 申请日期 2014.01.24
申请人 Facebook, Inc. 发明人 Malewicz Grzegorz
分类号 G06F15/18;G06F17/30 主分类号 G06F15/18
代理机构 Perkins Coie LLP 代理人 Perkins Coie LLP
主权项 1. A computer-implemented method for estimating a nearest neighbor to a target data vector in a dataset, comprising: receiving, by a computer system, a request to determine the nearest neighbor to the target data vector; determining, by the computer system, a preliminary clustering of the dataset, the preliminary clustering comprising multiple preliminary clusters, the determining including: for an initial set of clusters of the dataset, computing a centroid for each cluster in the initial set of clusters,hashing, by the computer system, the centroids into a set of buckets,identifying a subset of the centroids that is hashed into the set of buckets into which a specified data vector from the dataset is hashed,determining a centroid from the subset of the centroids that is nearest to the specified data vector, andclassifying the specified data vector into a first preliminary cluster, of the preliminary clusters, that corresponds to the centroid; redistributing, by the computer system, data vectors between at least two of the preliminary clusters to produce at least two redistributed clusters, the target data vector included in a first of the at least two redistributed clusters; transmitting, by the computer system, the first of the at least two redistributed clusters to a first computer system and a second of the at least two redistributed clusters to a second computer system; estimating, by the first computer system, a nearest neighbor of the target data vector within the first of the two redistributed clusters, wherein the nearest neighbor is one of multiple data vectors in the dataset that is most similar to the target data vector; and outputting, by the first computer system and in response to the request, information of the nearest neighbor to a user.
地址 Menlo Park CA US