Abstract |
Duplicate or near-duplicate documents can be identified by creating a vector that represents the evaluated document, where the vector values are the serial numbers (indices) of the summary vector coordinates, sorted according to the value in each coordinate. The summary vector is calculated by summing the bits of the hashes of the document's shingles. Vectors representing other documents can be reduced in size to 64-bit fingerprints and stored in permanent memory. Duplicates or near-duplicates can then be identified by comparing these stored fingerprints with the vector representing the evaluated document. |
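
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the word-level shingle size `k`, the use of BLAKE2b as the 64-bit hash, the tie-breaking rule, and in particular the `fingerprint64` reduction and the Hamming-distance comparison are assumptions chosen for illustration, since the abstract does not specify them.

```python
import hashlib

BITS = 64  # assumed width of each shingle hash and of the summary vector


def shingles(text, k=4):
    """Return the set of k-word shingles of a text (k is an assumed parameter)."""
    words = text.split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hash64(s):
    """64-bit hash of a shingle; BLAKE2b is an arbitrary choice for this sketch."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def summary_vector(text, k=4):
    """Sum the bits of all shingle hashes, one counter per bit position."""
    counts = [0] * BITS
    for sh in shingles(text, k):
        h = hash64(sh)
        for i in range(BITS):
            counts[i] += (h >> i) & 1
    return counts


def rank_vector(summary):
    """Coordinate indices sorted by the value in each coordinate.

    Ties are broken by index, which is an assumption of this sketch.
    """
    return sorted(range(len(summary)), key=lambda i: (summary[i], i))


def fingerprint64(rank):
    """Hypothetical reduction of a rank vector to a compact fingerprint.

    Bit `coord` is set when that coordinate falls in the upper half of the
    ordering. The abstract does not say how the reduction is performed.
    """
    fp = 0
    half = len(rank) // 2
    for pos, coord in enumerate(rank):
        if pos >= half:
            fp |= 1 << coord
    return fp


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Under these assumptions, a stored fingerprint would be compared against the evaluated document with something like `hamming(stored_fp, fingerprint64(rank_vector(summary_vector(text))))`, treating a small distance as a likely near-duplicate. The threshold would need to be tuned on real data.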