Abstract |
A computer-implemented system and method for identifying near-duplicate documents are provided. A set of documents is obtained, and each document is divided into segments. Each segment is hashed. A segment identification and a sequence order are assigned to each hashed segment, where the sequence order reflects the order in which the segments occur in the document. The segments are compared based on the segment identification, and those documents with at least two matching segments are identified. The sequence orders of the matching segments are then compared and, based on the comparison, a determination is made that the identified documents share a relative sequence of the matching segments. The identified documents are designated as near-duplicate documents. |
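The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation; several details are assumptions made only for illustration: segments are taken to be sentences split on periods, SHA-1 hex digests serve as the segment identification, and "sharing a relative sequence" is interpreted as the matching segments appearing in the same relative order in both documents. The names `split_into_segments`, `hash_segments`, and `find_near_duplicates` are hypothetical.

```python
import hashlib
from itertools import combinations


def split_into_segments(text):
    # Assumption: a segment is a sentence, split naively on periods.
    return [s.strip().lower() for s in text.split(".") if s.strip()]


def hash_segments(text):
    # Pair each segment's hash (used here as the segment identification)
    # with its sequence order, i.e. its position within the document.
    return [
        (hashlib.sha1(seg.encode("utf-8")).hexdigest(), order)
        for order, seg in enumerate(split_into_segments(text))
    ]


def find_near_duplicates(docs, min_matches=2):
    # docs maps a document name to its text.
    index = {name: hash_segments(text) for name, text in docs.items()}
    near_duplicates = []
    for a, b in combinations(sorted(index), 2):
        # First position of each segment identification in document b.
        first_pos_b = {}
        for seg_id, order in index[b]:
            first_pos_b.setdefault(seg_id, order)

        # Walk document a in order and record where each matching
        # segment occurs in document b (each identification counted once).
        seen = set()
        b_orders = []
        for seg_id, _order in index[a]:
            if seg_id in first_pos_b and seg_id not in seen:
                seen.add(seg_id)
                b_orders.append(first_pos_b[seg_id])

        # Require at least `min_matches` matching segments.
        if len(b_orders) < min_matches:
            continue

        # Assumed interpretation of "share a relative sequence": the
        # matches, taken in a's order, are also strictly increasing in b.
        if all(x < y for x, y in zip(b_orders, b_orders[1:])):
            near_duplicates.append((a, b))
    return near_duplicates
```

Under these assumptions, two documents that contain the same two sentences in the same order are paired, while a document that contains those sentences in reversed order is not. A pairwise comparison like this is quadratic in the number of documents; a practical system would more likely look up candidates through an inverted index keyed on segment identification.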