Abstract |
<p>Disclosed is a computer-assisted method for finding duplicate or near-duplicate documents or text spans within a document collection (#100) by using high-discriminability text fragments. Distinctive features of the documents or text spans are identified (#110). For each pair of documents or text spans that share at least one distinctive feature, the distinctive features of the two members are compared to determine whether they are duplicates or near-duplicates (#114).</p> |
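<p>The three steps in the abstract (identify distinctive features, pair up items that share at least one, compare the features of each pair) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the disclosure: word trigrams stand in for "text fragments", a fragment counts as "distinctive" when it occurs in at most <code>max_df</code> documents, and the comparison is Jaccard overlap against a <code>threshold</code>. The names <code>shingles</code> and <code>find_near_duplicates</code> are hypothetical.</p>

```python
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams in text (an assumed fragment type)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_near_duplicates(docs, n=3, max_df=2, threshold=0.5):
    """docs maps doc id -> text. Returns a sorted list of (id_a, id_b, score)."""
    # Collect fragments and count how many documents contain each one.
    fragments = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    doc_freq = defaultdict(int)
    for frags in fragments.values():
        for f in frags:
            doc_freq[f] += 1

    # Keep only fragments rare enough to be distinctive (assumed criterion).
    features = {
        doc_id: {f for f in frags if doc_freq[f] <= max_df}
        for doc_id, frags in fragments.items()
    }

    # Inverted index: only pairs sharing at least one feature become candidates.
    index = defaultdict(set)
    for doc_id, feats in features.items():
        for f in feats:
            index[f].add(doc_id)
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))

    # Compare the feature sets of each candidate pair (Jaccard overlap, assumed).
    results = []
    for a, b in sorted(candidates):
        union = features[a] | features[b]  # non-empty: the pair shares a feature
        score = len(features[a] & features[b]) / len(union)
        if score >= threshold:
            results.append((a, b, score))
    return results
```

<p>Calling <code>find_near_duplicates({"a": text_a, "b": text_b, "c": text_c})</code> returns only those pairs that share a distinctive fragment and whose overlap meets the threshold. Pairs with no distinctive fragment in common are never compared, which is what keeps the pairwise step inexpensive on a large collection.</p>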