Abstract |
A document comparison and identification method comprises the steps of: identifying (S210), in a source document, words having a predetermined number of characters or more; generating (S220) a list containing the identified words, while excluding (S220) from the list any identified word that occurs with a predetermined frequency or greater throughout a set of documents to be searched; searching (S230) each of a plurality of documents in the set of documents for occurrences of the identified words stored in the list; determining (S230), for each of the plurality of documents, how many identified words from the list occur in the document; and calculating (S240) a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
|
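The steps S210 to S240 can be illustrated with a minimal Python sketch. The abstract does not give the similarity formula, the tokenization rule, or how "frequency" is measured, so the following are assumptions made only for illustration: words are runs of ASCII letters compared case-insensitively; frequency is the fraction of searched documents containing the word; and similarity is the ratio of matched words to listed words, set to zero when fewer than the minimum required number of matches are found. All function and parameter names (`build_word_list`, `count_matches`, `similarity`, `rank_documents`, `min_chars`, `max_doc_fraction`, `min_matches`) are hypothetical.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")  # assumed tokenization: runs of ASCII letters


def tokenize(text):
    """Return the set of distinct lower-cased words in a text."""
    return {w.lower() for w in WORD_RE.findall(text)}


def build_word_list(source, documents, min_chars=6, max_doc_fraction=0.9):
    """S210/S220: pick long words from the source, drop overly common ones."""
    # S210: words of at least `min_chars` characters in the source document.
    candidates = {w for w in tokenize(source) if len(w) >= min_chars}
    doc_words = [tokenize(d) for d in documents]
    kept = []
    for word in sorted(candidates):
        # Assumed frequency measure: share of documents containing the word.
        containing = sum(1 for words in doc_words if word in words)
        fraction = containing / len(doc_words) if doc_words else 0.0
        # S220: exclude words at or above the predetermined frequency.
        if fraction < max_doc_fraction:
            kept.append(word)
    return kept


def count_matches(word_list, document):
    """S230: count how many listed words occur in one document."""
    words = tokenize(document)
    return sum(1 for w in word_list if w in words)


def similarity(total_words, matched_words, min_matches):
    """S240 (assumed formula): matched/total, or 0 below the minimum."""
    if total_words == 0 or matched_words < min_matches:
        return 0.0
    return matched_words / total_words


def rank_documents(source, documents, min_chars=6,
                   max_doc_fraction=0.9, min_matches=2):
    """Run S210 to S240 and return one similarity score per document."""
    word_list = build_word_list(source, documents, min_chars, max_doc_fraction)
    return [
        similarity(len(word_list), count_matches(word_list, d), min_matches)
        for d in documents
    ]
```

Under these assumptions, `rank_documents(source, documents)` returns scores in the same order as `documents`. Raising `min_matches` suppresses documents that share only a few listed words with the source, and lowering `max_doc_fraction` removes more of the common vocabulary from the list before matching.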