Title: Clustering data points
Abstract: Systems and methods for clustering a group of data points based on a measure of similarity between each pair of data points in the group are provided. A pairwise similarity function can be estimated for each pair of data points in the group. A clustering algorithm can be executed to create clusters and associate data points with the clusters using the pairwise similarity function. The algorithm can be iterated multiple times until a stopping condition is reached in order to reduce variance in the output of the algorithm. The pairwise similarity function for each pair of data points can be updated between iterations of the algorithm and the results of each iteration can be aggregated. The data in each data point associated with a cluster can be consolidated into a consolidated data point.
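The abstract does not name a specific clustering algorithm, so the following is only a minimal sketch of the general idea: run a randomized clustering pass several times, aggregate how often each pair lands in the same cluster, and use that aggregate as an updated pairwise similarity for a final pass. The randomized pivot procedure, the function names `pivot_cluster` and `aggregated_clustering`, the fixed round count used as the stopping condition, and the 0.5 thresholds are all assumptions made for illustration, not details taken from the patent.

```python
import random
from itertools import combinations


def pivot_cluster(points, sim, threshold, rng):
    """One randomized pass: pick a pivot, group every remaining point
    whose similarity to the pivot meets the threshold, then repeat."""
    remaining = list(points)
    rng.shuffle(remaining)
    clusters = []
    while remaining:
        pivot = remaining.pop(0)
        cluster, rest = [pivot], []
        for p in remaining:
            (cluster if sim(pivot, p) >= threshold else rest).append(p)
        remaining = rest
        clusters.append(cluster)
    return clusters


def aggregated_clustering(points, sim, threshold=0.5, rounds=20, seed=0):
    """Repeat the randomized pass and aggregate the results.

    Assumed stopping condition: a fixed number of rounds.
    Points must be hashable and distinct.
    """
    rng = random.Random(seed)
    # Count how often each pair ends up in the same cluster.
    together = {frozenset(pair): 0 for pair in combinations(points, 2)}
    for _ in range(rounds):
        for cluster in pivot_cluster(points, sim, threshold, rng):
            for pair in combinations(cluster, 2):
                together[frozenset(pair)] += 1

    # Updated pairwise similarity: fraction of rounds a pair was co-clustered.
    def aggregated_sim(a, b):
        return together[frozenset((a, b))] / rounds

    # Final pass over the aggregated (lower-variance) similarity.
    return pivot_cluster(points, aggregated_sim, 0.5, rng)


if __name__ == "__main__":
    items = ["a1", "a2", "b1", "b2", "b3"]
    same_prefix = lambda x, y: 1.0 if x[0] == y[0] else 0.0
    print(aggregated_clustering(items, same_prefix))
```

In this toy usage the similarity is perfectly consistent, so every round agrees; with a noisy, learned similarity the aggregation step is what would smooth out differences between rounds.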
Publication number: US9053171(B2) Publication date: 2015.06.09
Application number: US201314075619 Filing date: 2013.11.08
Applicant: Google Inc. Inventors: Ailon Nir; Liberty Edo; Khalsa Harishabd
Classification: G06F17/30 Main classification: G06F17/30
Agency: Fish & Richardson P.C. Agent: Fish & Richardson P.C.
Main claim: 1. A computer-implemented method comprising: receiving, by a computer system comprising at least one processor, a set of data points and similarity values according to a pairwise similarity function, wherein the pairwise similarity function provides similarity values representative of a similarity between each data point and each other data point of the set of data points, wherein the similarity values are determined and the similarity function is estimated using one or more machine learning processes; clustering, by the computer system, the set of data points into at least one cluster based on the similarity values, the at least one cluster comprising one or more data points of the set of data points; consolidating, by the computer system, data stored in the one or more data points associated with the at least one cluster to create a consolidated data point, wherein consolidating data stored in the one or more data points associated with the at least one cluster comprises: extracting a data element from a data point of the at least one cluster; determining another data element from another data point of the at least one cluster, wherein the other data element is a duplicate of the data element; selecting one among the data element and the other data element to be added to the consolidated data point; and storing the selected data element in the consolidated data point; and using the consolidated data point when providing results responsive to an associated search query to a user.
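The consolidation steps recited in the claim (extract a data element, find a duplicate in another data point, select one, store it) can be sketched roughly as follows. The record layout (a dict mapping field names to lists of strings), the duplicate test (same field and same case- and whitespace-normalized value), and the selection rule (keep the first occurrence) are assumptions chosen for illustration; the claim itself does not specify them, and `normalize` and `consolidate` are hypothetical names.

```python
def normalize(value: str) -> str:
    """Assumed duplicate test: compare lowercase, whitespace-collapsed text."""
    return " ".join(value.lower().split())


def consolidate(cluster):
    """Merge the data points of one cluster into a consolidated data point.

    cluster: list of dicts mapping a field name to a list of string values.
    """
    selected = {}  # (field, normalized value) -> chosen raw value
    for point in cluster:
        for field, values in point.items():
            for value in values:  # extract a data element
                key = (field, normalize(value))
                # If an element with this key was already seen, the new one
                # is a duplicate; the assumed rule keeps the first occurrence.
                selected.setdefault(key, value)

    consolidated = {}
    for (field, _), value in selected.items():
        consolidated.setdefault(field, []).append(value)  # store the selection
    return consolidated


if __name__ == "__main__":
    cluster = [
        {"name": ["Joe's Pizza"], "phone": ["555-0100"]},
        {"name": ["joe's  pizza"], "phone": ["555-0100", "555-0199"]},
    ]
    print(consolidate(cluster))
```

A different selection rule (for example, preferring the most frequent or most recently updated element) could be swapped in at the `setdefault` call without changing the rest of the sketch.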
Address: Mountain View CA US