Title of invention: Method and apparatus for identifying near-duplicate documents
Abstract: Duplicate or near-duplicate documents can be identified by creating a vector that represents the evaluated document, where the vector values are the serial numbers of the summary vector's coordinates, sorted according to the value in each coordinate. The summary vector is calculated by summing the bits of the hashes of the document's shingles. Vectors representing other documents can be reduced in size to 64-bit fingerprints and stored in permanent memory. Duplicates or near-duplicates can then be identified by comparing these stored fingerprints with the vector representing the evaluated document.
Publication number: US8370390(B1) Publication date: 2013.02.05
Application number: US201113284919 Filing date: 2011.10.30
Applicant: PERMAKOFF VADIM Inventor: PERMAKOFF VADIM
Classification: G06F7/00;G06F17/30 Main classification: G06F7/00
Agency: Agent:
Main claim:
Address:
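
The procedure outlined in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the abstract does not specify the shingle size, the hash function, the sort direction, how a vector is reduced to a 64-bit fingerprint, or how fingerprints are compared. The names `shingles`, `hash64`, `summary_vector`, `rank_vector`, `fingerprint`, and `hamming` are hypothetical, as are the choices of 3-word shingles, BLAKE2b, descending order, a SimHash-style majority reduction, and Hamming distance.

```python
import hashlib

HASH_BITS = 64  # assumed hash width, chosen to match the 64-bit fingerprints


def shingles(text, k=3):
    """Return the set of k-word shingles of text (k=3 is an illustrative choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hash64(shingle):
    """Hash a shingle to a 64-bit integer (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def summary_vector(text, k=3):
    """Coordinate i is the sum of bit i over all shingle hashes."""
    counts = [0] * HASH_BITS
    for s in shingles(text, k):
        h = hash64(s)
        for i in range(HASH_BITS):
            counts[i] += (h >> i) & 1
    return counts


def rank_vector(counts):
    """Coordinate indices sorted by value (descending assumed), ties broken by index."""
    return sorted(range(len(counts)), key=lambda i: (-counts[i], i))


def fingerprint(text, k=3):
    """Assumed SimHash-style reduction: set bit i when most shingle hashes set it."""
    n = len(shingles(text, k))
    fp = 0
    for i, c in enumerate(summary_vector(text, k)):
        if 2 * c > n:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of differing bits between two fingerprints (assumed comparison)."""
    return bin(a ^ b).count("1")
```

Under these assumptions, a caller would store `fingerprint(doc)` for each known document and flag a new document as a near-duplicate when `hamming` against a stored fingerprint falls below a chosen threshold; the threshold itself would need to be tuned on real data.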