Title of invention: System and method for grouping multiple streams of data
Abstract: A document clustering system and method of assigning a document to a cluster of documents containing related content are provided. Each cluster is associated with a cluster summary describing the content of the documents in the cluster. The method comprises: determining, at a document clustering system, whether the document should be grouped with one or more previously created cluster summaries, the previously created cluster summaries being stored in a memory in a B-tree data structure; and if it is determined that the document should not be grouped with the one or more previously created cluster summaries, then creating, at a document clustering system, a cluster summary based on the content of the document and storing the created cluster summary in the B-tree data structure.
Publication number: US8965893(B2)
Publication date: 2015.02.24
Application number: US201012857688
Filing date: 2010.08.17
Applicant: Rogers Communications Inc.
Inventors: Cvet Michael; Andritsos Periklis; Estrada Francisco; Braziunas Darius
Classification: G06F17/30
Main classification: G06F17/30
Agency: Rowand LLP
Agent: Rowand LLP
Main claim: 1. A method of assigning a document to one of a plurality of clusters of documents containing related content, the document having at least one feature, each cluster being associated with a cluster summary describing the content of the documents in the cluster and comprising a summary of one or more features of the documents in the cluster, wherein the summary comprises an aggregated feature vector that describes the set of documents summarized, the method comprising: determining, at a document clustering system, whether the document should be grouped with one or more previously created cluster summaries by comparing the at least one feature of the document with each previously created cluster summary and evaluating the similarity therebetween, the previously created cluster summaries being stored in a memory in a B-tree data structure; determining that the document should not be grouped with the one or more previously created cluster summaries, and creating, at a document clustering system, a cluster summary based on the content of the document and storing the created cluster summary in the B-tree data structure, wherein the leaf nodes of the B-tree store the cluster summaries; identifying at least one outlier in the plurality of cluster summaries associated with each of the plurality of clusters; attempting to merge the at least one outlier into other cluster summaries; incrementing a merge attempt count each time a merge is attempted; and removing the outlier from the B-tree data structure when the merge attempt count exceeds a predetermined threshold.
Address: Toronto, Ontario CA
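
The flow recited in claim 1 (compare a document with stored cluster summaries, create a new summary when none is similar enough, then retry merging outliers and drop them after too many attempts) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the patented implementation: the names `ClusterSummary`, `SummaryStore`, `assign`, and `sweep_outliers` are hypothetical, cosine similarity over term counts is an assumed similarity measure, "outlier" is assumed to mean a summary covering very few documents, and a plain `dict` stands in for the leaf level of the B-tree, since the Python standard library has no B-tree.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ClusterSummary:
    """Hypothetical summary: an aggregated feature vector plus bookkeeping."""
    features: Counter = field(default_factory=Counter)
    doc_count: int = 0
    merge_attempts: int = 0

    def absorb(self, features, n_docs=1):
        # Aggregate by summing term counts (an assumed aggregation rule).
        self.features.update(features)
        self.doc_count += n_docs


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SummaryStore:
    def __init__(self, sim_threshold=0.5, outlier_size=1, max_merge_attempts=3):
        self.summaries = {}  # stand-in for the B-tree leaves (assumption)
        self.next_id = 0
        self.sim_threshold = sim_threshold
        self.outlier_size = outlier_size
        self.max_merge_attempts = max_merge_attempts

    def _best_match(self, features, exclude=None):
        # Compare against every stored summary and keep the most similar one.
        best_id, best_sim = None, 0.0
        for cid, summary in self.summaries.items():
            if cid == exclude:
                continue
            sim = cosine(features, summary.features)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        return best_id, best_sim

    def assign(self, tokens):
        """Group the document with an existing summary or create a new one."""
        features = Counter(tokens)
        cid, sim = self._best_match(features)
        if cid is not None and sim >= self.sim_threshold:
            self.summaries[cid].absorb(features)
            return cid
        cid = self.next_id
        self.next_id += 1
        summary = ClusterSummary()
        summary.absorb(features)
        self.summaries[cid] = summary
        return cid

    def sweep_outliers(self):
        """Try to merge small summaries; drop those that keep failing."""
        for cid in list(self.summaries):
            summary = self.summaries[cid]
            if summary.doc_count > self.outlier_size:
                continue  # not an outlier under the assumed size rule
            summary.merge_attempts += 1
            target, sim = self._best_match(summary.features, exclude=cid)
            if target is not None and sim >= self.sim_threshold:
                self.summaries[target].absorb(summary.features, summary.doc_count)
                del self.summaries[cid]
            elif summary.merge_attempts > self.max_merge_attempts:
                del self.summaries[cid]
```

As a usage example, assigning two documents that share most of their terms places them under one summary, while an unrelated document gets its own summary; repeated calls to `sweep_outliers()` then increment that summary's merge attempt count until it exceeds `max_merge_attempts` and the summary is removed. A real system would replace the linear scan in `_best_match` with a descent through the B-tree's internal nodes.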