Title: System and method for clustering data in input and output spaces
Abstract: A method of clustering a plurality of documents having input and output space data is disclosed that uses both input and output space criteria. The method can include aggregating documents into clusters based on input and/or output space similarity measures, and then refining the clusters based on further input and/or output space similarity measures. Aggregating the documents into clusters can include forming a hierarchical tree based on the input and/or output space similarity measures where the hierarchical tree has a root node, branching into intermediate nodes, and branching into leaf nodes covering individual documents, where the hierarchical tree includes a leaf node for each document of the plurality of documents. The method can then include forming a forest of sub-trees of the hierarchical tree based on cluster criteria. Textual and numeric similarity measures can be used depending on the type and distribution of data in the input and output spaces.
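The refinement step, in which a cluster formed in the input space is either maintained or subdivided according to an output space measure, can be sketched as follows. This is only an illustration under stated assumptions: the record does not specify the output space measure or the subdivision rule, so the sketch assumes numeric output values, uses the value range as a stand-in dissimilarity, and splits at the largest gap. The names `refine_cluster` and `max_spread` are hypothetical.

```python
def refine_cluster(outputs, max_spread):
    """Maintain or subdivide one cluster using its output space values.

    outputs: dict mapping document id -> numeric output value (assumed).
    max_spread: largest allowed range of output values in a cluster (assumed).
    Returns a list of clusters, each a list of document ids.
    """
    ordered = sorted(outputs, key=outputs.get)
    values = [outputs[doc] for doc in ordered]

    # Maintain the cluster if its output values are close enough together.
    if len(values) < 2 or values[-1] - values[0] <= max_spread:
        return [ordered]

    # Otherwise subdivide at the largest gap between neighbouring values
    # and refine each part recursively.
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    cut = gaps.index(max(gaps)) + 1
    left = {doc: outputs[doc] for doc in ordered[:cut]}
    right = {doc: outputs[doc] for doc in ordered[cut:]}
    return refine_cluster(left, max_spread) + refine_cluster(right, max_spread)


# Documents "a" and "b" have similar outputs; "c" is far away and is split off.
refined = refine_cluster({"a": 1.0, "b": 1.2, "c": 9.0}, max_spread=0.5)
```

A textual output space would need a different measure (for example a token-overlap score); the range-based rule here applies only to the numeric case.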
Publication number: US9116974(B2) Publication date: 2015.08.25
Application number: US201313833022 Filing date: 2013.03.15
Applicant: Robert Bosch GmbH Inventors: Heit Juergen; Dey Sanjoy; Srinivasan Soundararajan
Classification: G06F17/30 Main classification: G06F17/30
Agency: Maginot Moore & Beck LLP Agent: Maginot Moore & Beck LLP
Main claim: 1. A computer automated method of clustering a plurality of documents, each document including input space data and output space data, the method comprising: for each document in the plurality of documents, reading the input space data of the document from memory or storage; computing an input space similarity measure between the document and other documents of the plurality of documents using a computing device; aggregating the document into a first plurality of clusters based on the input space similarity measure; storing the first plurality of clusters in a database; for each cluster in the first plurality of clusters, reading the output space data of the documents in the current cluster of the first plurality of clusters; computing an output space similarity measure for the plurality of documents in the current cluster using the computing device; and maintaining or subdividing the current cluster in the first plurality of clusters based on the output space similarity measure, wherein aggregating the documents into a first plurality of clusters comprises: forming a hierarchical tree based on the input space similarity measure, the hierarchical tree having a root node covering all of the plurality of documents, branching into intermediate nodes covering subsets of the plurality of documents, and branching into leaf nodes covering individual documents of the plurality of documents, the hierarchical tree including a leaf node for each document of the plurality of documents; computing a node similarity measure for each node of the hierarchical tree; retrieving a node similarity threshold from memory or storage, the node similarity threshold being less than the node similarity measure of the leaf nodes of the hierarchical tree; performing a graph traversal search over each node of the hierarchical tree starting with the root node to form a forest of sub-trees of the hierarchical tree by: comparing the node similarity measure for the current node in the graph traversal search with the node similarity threshold; and if the node similarity measure of the current node is equal to or greater than the node similarity threshold, storing the current node as a cluster in the first plurality of clusters, not proceeding further down the depth of the current branch of the hierarchical tree, and continuing the graph traversal search on the next branch of the hierarchical tree; if the node similarity measure of the current node is less than the node similarity threshold, continuing the graph traversal search further down the current branch of the hierarchical tree.
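The threshold-based traversal recited in the claim can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` class and the `cut_tree` function are hypothetical names, the node similarity measures are assumed to be precomputed, and a depth-first order is assumed since the claim only requires a graph traversal search starting at the root.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    docs: list                 # ids of the documents covered by this node
    similarity: float          # node similarity measure (assumed precomputed)
    children: list = field(default_factory=list)


def cut_tree(root, threshold):
    """Return the nodes stored as clusters, i.e. the roots of the sub-tree forest."""
    clusters = []
    stack = [root]
    while stack:
        node = stack.pop()
        # At or above the threshold: store the node as a cluster and do not
        # descend further on this branch. The leaf check is only a safeguard,
        # since the claim requires the threshold to be below leaf similarity.
        if node.similarity >= threshold or not node.children:
            clusters.append(node)
        else:
            # Below the threshold: continue further down the current branch.
            stack.extend(reversed(node.children))
    return clusters


# Toy tree: root -> (ab, cd); ab -> (a, b); cd -> (c, d).
a, b, c, d = (Node([name], 1.0) for name in "abcd")
ab = Node(["a", "b"], 0.8, [a, b])
cd = Node(["c", "d"], 0.3, [c, d])
root = Node(["a", "b", "c", "d"], 0.1, [ab, cd])

forest = [node.docs for node in cut_tree(root, threshold=0.5)]
```

With a threshold of 0.5, the node covering "a" and "b" is kept whole as one cluster, while the node covering "c" and "d" falls below the threshold and is split down to its leaves.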
Address: Stuttgart DE