Title of invention: Computer-Implemented System And Method For Identifying Near Duplicate Documents
Abstract: A computer-implemented system and method for identifying near duplicate documents is provided. A set of documents is obtained and each document is divided into segments. Each of the segments is hashed. A segment identification and sequence order is assigned to each of the hashed segments. The sequence order is based on an order in which the segments occur in one such document. The segments are compared based on the segment identification and those documents with at least two matching segments are identified. The sequence orders of the matching segments are compared and based on the comparison, a determination is made that the identified documents share a relative sequence of the matching segments. The identified documents are designated as near duplicate documents.
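The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several choices here are assumptions made only for illustration: segments are taken to be non-empty lines, the segment identification is a SHA-1 digest of the segment text, the sequence order is the segment's zero-based position in its document, and "share a relative sequence" is read as "the matching segments appear in the same relative order in both documents". The function names `segment`, `fingerprint`, and `near_duplicates` are hypothetical.

```python
import hashlib
from itertools import combinations


def segment(text):
    """Split a document into segments. Assumption: one segment per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fingerprint(text):
    """Hash each segment and pair the hash (segment id) with its sequence order."""
    return [
        (hashlib.sha1(seg.encode("utf-8")).hexdigest(), order)
        for order, seg in enumerate(segment(text))
    ]


def near_duplicates(docs, min_matches=2):
    """Return sorted name pairs whose matching segments occur in the same relative order.

    docs maps a document name to its text. min_matches=2 follows the abstract's
    "at least two matching segments".
    """
    # For each document, record the first position at which each segment id occurs.
    first_pos = {}
    for name, text in docs.items():
        positions = {}
        for seg_id, order in fingerprint(text):
            positions.setdefault(seg_id, order)
        first_pos[name] = positions

    pairs = []
    for a, b in combinations(sorted(docs), 2):
        shared = first_pos[a].keys() & first_pos[b].keys()
        if len(shared) < min_matches:
            continue
        # Walk the shared segments in document a's order and check that their
        # positions in document b are strictly increasing as well.
        in_a_order = sorted(shared, key=lambda seg_id: first_pos[a][seg_id])
        orders_in_b = [first_pos[b][seg_id] for seg_id in in_a_order]
        if all(x < y for x, y in zip(orders_in_b, orders_in_b[1:])):
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    sample = {
        "a": "alpha\nbeta\ngamma",
        "b": "intro\nalpha\nbeta\nextra",
        "c": "beta\nalpha",
    }
    print(near_duplicates(sample))
```

In this sketch, documents `a` and `b` share two segments in the same order and are paired, while `c` shares the same two segments in reversed order and is not. Comparing every pair of documents is quadratic in the number of documents; a practical system would more likely index documents by segment identification first, which the abstract's "compared based on the segment identification" step may suggest.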
Publication number: US2014082006(A1)
Publication date: 2014.03.20
Application number: US201314027141
Filing date: 2013.09.13
Applicant: FTI CONSULTING INC.
Inventors: KNIGHT WILLIAM C.; ANTOCH STEVE; MCNEE SEAN M.
Classification: G06F17/30
Main classification: G06F17/30
Agency:
Agent:
Main claim:
Address: