Independent claim |
1. A computer automated method of clustering a plurality of documents, each document including input space data and output space data, the method comprising:
for each document in the plurality of documents, reading the input space data of the document from memory or storage; computing an input space similarity measure between the document and other documents of the plurality of documents using a computing device; aggregating the document into a first plurality of clusters based on the input space similarity measure; storing the first plurality of clusters in a database; for each cluster in the first plurality of clusters, reading the output space data of the documents in the current cluster of the first plurality of clusters; computing an output space similarity measure for the plurality of documents in the current cluster using the computing device; and maintaining or subdividing the current cluster in the first plurality of clusters based on the output space similarity measure, wherein aggregating the documents into a first plurality of clusters comprises:
forming a hierarchical tree based on the input space similarity measure, the hierarchical tree having a root node covering all of the plurality of documents, branching into intermediate nodes covering subsets of the plurality of documents, and branching into leaf nodes covering individual documents of the plurality of documents, the hierarchical tree including a leaf node for each document of the plurality of documents; computing a node similarity measure for each node of the hierarchical tree; retrieving a node similarity threshold from memory or storage, the node similarity threshold being less than the node similarity measure of the leaf nodes of the hierarchical tree; performing a graph traversal search over each node of the hierarchical tree starting with the root node to form a forest of sub-trees of the hierarchical tree by: comparing the node similarity measure for the current node in the graph traversal search with the node similarity threshold; and if the node similarity measure of the current node is equal to or greater than the node similarity threshold, storing the current node as a cluster in the first plurality of clusters, not proceeding further down the depth of the current branch of the hierarchical tree, and continuing the graph traversal search on the next branch of the hierarchical tree; if the node similarity measure of the current node is less than the node similarity threshold, continuing the graph traversal search further down the current branch of the hierarchical tree. |
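The threshold-based traversal recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `Node` class, the `cut_tree` function, the choice of a depth-first traversal, and the example similarity values are all hypothetical, and the node similarity measures are assumed to have been computed already. The sketch only shows how comparing each node with the threshold, starting at the root, yields a forest of sub-trees whose roots become the first plurality of clusters.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; the names here are illustrative assumptions."""
    docs: List[int]                 # ids of the documents covered by this node
    similarity: float               # node similarity measure, assumed precomputed
    children: List["Node"] = field(default_factory=list)


def cut_tree(root: Node, threshold: float) -> List[List[int]]:
    """Depth-first traversal from the root that stores a node as a cluster
    once its similarity is equal to or greater than the threshold."""
    clusters: List[List[int]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        # The claim requires the threshold to be below the leaf similarity,
        # so leaves always qualify; the extra leaf check is only a safeguard.
        if node.similarity >= threshold or not node.children:
            clusters.append(node.docs)   # store as a cluster, do not descend
        else:
            # Below the threshold: continue further down this branch.
            stack.extend(reversed(node.children))
    return clusters


# Example tree over four documents (all values are made up for illustration).
leaves = [Node([i], 1.0) for i in range(4)]
left = Node([0, 1], 0.9, leaves[:2])
right = Node([2, 3], 0.4, leaves[2:])
root = Node([0, 1, 2, 3], 0.2, [left, right])

clusters = cut_tree(root, threshold=0.5)
print(clusters)  # [[0, 1], [2], [3]]
```

In this example the root and the right intermediate node fall below the threshold and are expanded, while the left intermediate node meets it and is kept whole, so documents 0 and 1 form one cluster and documents 2 and 3 each form their own. Each resulting cluster would then be maintained or subdivided according to the output space similarity measure, a step the sketch does not cover.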