摘要 |
A method for document management includes automatically extracting respective features from each of a set of documents. The features are processed in a computer so as to generate respective vectors for the documents, each vector including elements having respective values that represent properties of a respective document. A similarity between the documents is assessed by computing a measure of distance between the respective vectors. The documents are automatically clustered responsively to the similarity so as to identify a cluster of the documents belonging to a common document type. Similar methods may be used in supervised categorization, wherein documents are compared and categorized based on a training set that is prepared for each document type. |