Title of invention: System and method for detecting duplicate and similar documents
Abstract: A system and method are described for rapidly determining document similarity among a set of documents, such as a set obtained from an information retrieval (IR) system. A ranked list of the most important terms in each document is obtained using a phrase recognizer system. The list is stored in a database and is used to compute document similarity with a simple database query. If the number of terms not contained in both documents is less than a predetermined threshold relative to the total number of terms in the document, the documents are determined to be very similar. These techniques are shown to accurately recognize that documents revised to contain parts of other documents are still closely related to the original document. These teachings further provide for computing a document signature that can be used to rapidly compare documents that are likely to be identical.
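The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the term lists are typed in by hand rather than produced by a phrase recognizer, in-memory sets stand in for the database query, and the function names, the 0.2 threshold, the use of the combined term count as the "total number of terms", and the SHA-256 signature are all assumptions made for illustration.

```python
import hashlib


def normalize(terms):
    """Lower-case and trim terms, dropping duplicates (assumed normalization)."""
    return {t.strip().lower() for t in terms if t.strip()}


def are_very_similar(terms_a, terms_b, max_unshared_fraction=0.2):
    """Return True when few terms are missing from either document.

    Assumption: "total number of terms" is taken as the number of distinct
    terms across both lists; the abstract does not pin this down.
    """
    a, b = normalize(terms_a), normalize(terms_b)
    total = len(a | b)
    if total == 0:
        return False  # nothing to compare
    unshared = len(a ^ b)  # terms present in only one of the two lists
    return unshared / total < max_unshared_fraction


def term_signature(terms):
    """Order-independent signature of a term list (hypothetical scheme).

    Equal signatures flag documents that are likely identical; SHA-256 is
    a stand-in, not the signature defined in the patent.
    """
    joined = "\n".join(sorted(normalize(terms)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hand-written term lists standing in for phrase recognizer output.
doc_a = [
    "document similarity", "information retrieval", "phrase recognizer",
    "database query", "ranked list", "document signature",
    "duplicate detection", "threshold", "term list", "revised document",
]
doc_b = doc_a[:-1]  # a revision that lost one term
doc_c = ["stock market", "interest rates", "quarterly earnings"]

print(are_very_similar(doc_a, doc_b))
print(are_very_similar(doc_a, doc_c))
print(term_signature(doc_a) == term_signature(list(reversed(doc_a))))
```

In this sketch the cheap signature check would run first to catch likely identical documents, and the set comparison would be applied only to the remaining pairs.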
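Publication number: US7139756(B2) Publication date: 2006.11.21
Application number: US20020054366 Filing date: 2002.01.22
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventors: COOPER JAMES W.; CODEN ANNI; BROWN ERIC W.
Classification: G06F7/00; G06F17/30 Main classification: G06F7/00
Agency: Agent:
Main claim:
Address: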