Title: System and method for evolutionary clustering of sequential data sets
Abstract: An improved system and method for evolutionary clustering of sequential data sets is provided. A snapshot cost may be determined for representing the data set under the particular clustering method used; it gives the cost of clustering the data set independently of the series of clusterings of the data sets in the sequence. A history cost may also be determined by measuring the distance between corresponding clusters of the data set and of the previous data set in the sequence; it gives the cost of clustering the data set as part of the series of clusterings of the data sets in the sequence. An overall cost for clustering the data set may be determined by minimizing the combination of the snapshot cost and the history cost. Any clustering method may be used, including flat clustering and hierarchical clustering.
Publication number: US8930365(B2) Publication date: 2015.01.06
Application number: US200611414448 Filing date: 2006.04.29
Applicant: Yahoo! Inc. Inventors: Chakrabarti Deepayan; Ravikumar Shanmugasundaram; Tomkins Andrew
Classification: G06F7/00; G06F17/30; G06K9/62 Main classification: G06F7/00
Agency: Buchenhorner Patent Law Agent: Buchenhorner Patent Law
Main claim: 1. A computer system for clustering a data set in a sequence of data sets, comprising: a processor device performing computer-executable instructions comprising: receiving a data set as part of a sequence of data sets in a series of clusterings, said data set having a plurality of data elements and each of the data sets in the sequence being acquired at different timesteps; determining a first cost of clustering the data set; wherein the first cost comprises a cost of clustering the data set independently of the series of clusterings of the data sets in the sequence, each of the data sets being acquired at different timesteps; determining a second cost of clustering the data set; wherein the second cost comprises a cost of clustering the data set as part of the series of clusterings of the data sets in the sequence; combining the first cost with the second cost at each timestep; determining an overall cost of clustering the data set as a sum of the first cost and the second cost, using a selected clustering method; minimizing the overall cost; and clustering the data set using the selected clustering method according to the minimized overall cost, such that the clustering at any time has high accuracy while also ensuring that said clustering does not change dramatically from one timestep to a next timestep.
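The steps of the claim can be illustrated with a minimal sketch, assuming a k-means-style flat clustering as the selected method. The record does not specify any of the following, so they are assumptions made for illustration only: the function names, the use of squared Euclidean distance for both costs, the matching of clusters across timesteps by list index, and the `weight` parameter (with `weight=1.0` the overall cost is the plain sum of the first and second costs named in the claim).

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def snapshot_cost(points: Sequence[Point], centroids: Sequence[Point]) -> float:
    """First cost: quality of the clustering on the current data set alone."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def history_cost(centroids: Sequence[Point], prev: Sequence[Point]) -> float:
    """Second cost: distance between corresponding clusters of consecutive
    timesteps (correspondence by index is an assumption)."""
    return sum(sq_dist(c, q) for c, q in zip(centroids, prev))


def overall_cost(points, centroids, prev=None, weight: float = 1.0) -> float:
    """Overall cost: snapshot cost plus (hypothetically weighted) history cost."""
    cost = snapshot_cost(points, centroids)
    if prev is not None:
        cost += weight * history_cost(centroids, prev)
    return cost


def cluster_timestep(points: Sequence[Point], init: Sequence[Point],
                     prev: Optional[Sequence[Point]] = None,
                     weight: float = 1.0, iters: int = 20) -> List[Point]:
    """Lloyd-style iteration that lowers the overall cost at one timestep."""
    centroids = [tuple(float(v) for v in c) for c in init]
    w = weight if prev is not None else 0.0
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        groups: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            groups[j].append(p)
        # Update step: for fixed assignments, the centroid minimizing
        # sum ||x - c||^2 + w * ||c - prev_c||^2 is a weighted mean.
        new = []
        for j, g in enumerate(groups):
            denom = len(g) + w
            if denom == 0:
                new.append(centroids[j])  # empty cluster, no history: keep as is
                continue
            dim = len(centroids[j])
            anchor = prev[j] if prev is not None else (0.0,) * dim
            new.append(tuple((sum(p[d] for p in g) + w * anchor[d]) / denom
                             for d in range(dim)))
        if new == centroids:
            break
        centroids = new
    return centroids


# Two timesteps: the second data set is the first one shifted by 2 along x.
t1 = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
t2 = [(2.0, 0.0), (2.0, 2.0), (12.0, 0.0), (12.0, 2.0)]
c1 = cluster_timestep(t1, init=[(0.0, 0.0), (10.0, 0.0)])
c2 = cluster_timestep(t2, init=c1, prev=c1, weight=2.0)
```

With these inputs, the centroids at the second timestep settle at (1, 1) and (11, 1) instead of the purely data-driven (2, 1) and (12, 1): the history term pulls each cluster partway back toward its previous position, which is the "does not change dramatically" behavior the claim describes. A larger `weight` would favor stability over snapshot accuracy.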
Address: Sunnyvale CA US