Title of Invention: Near-duplicate document detection for web crawling
Abstract: A system generates a hash value for a fetched document and compares it with a set of stored hash values to identify those stored hash values in which a sequence of bit positions, less than all of the bit positions, matches the corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value, and identifies the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.
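The two-stage comparison described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the 64-bit hash width, the choice of the top 16 bits as the matched bit sequence, the reading of "substantially similar" as a Hamming distance of at most 3 bits, and the names `HashStore`, `prefix_of`, and `hamming` are all hypothetical and do not come from the record. How the hash value is derived from the document is left out.

```python
from collections import defaultdict

HASH_BITS = 64      # assumed hash width; not stated in the abstract
PREFIX_BITS = 16    # assumed length of the matched bit sequence
MAX_DIFF_BITS = 3   # assumed threshold for "substantially similar"


def prefix_of(h: int) -> int:
    """Return the top PREFIX_BITS bits of a HASH_BITS-wide hash."""
    return h >> (HASH_BITS - PREFIX_BITS)


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


class HashStore:
    """Hypothetical store of hash values, bucketed by their bit prefix."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, h: int) -> None:
        self._buckets[prefix_of(h)].append(h)

    def find_near_duplicate(self, h: int):
        # Stage 1: keep only stored hashes whose prefix bits match exactly.
        for candidate in self._buckets.get(prefix_of(h), []):
            # Stage 2: full comparison of the candidate against the new hash.
            if hamming(candidate, h) <= MAX_DIFF_BITS:
                return candidate
        return None


store = HashStore()
store.add(0xFFFF_0000_0000_0000)
match = store.find_near_duplicate(0xFFFF_0000_0000_0003)      # 2 bits differ
no_match = store.find_near_duplicate(0xFFFF_0000_0000_00FF)   # 8 bits differ
```

With a single prefix table as sketched here, a stored hash that differs from the query inside the prefix bits is never examined, so a practical design would presumably need further tables keyed on other bit sequences; that extension is an assumption and is not described in the abstract.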
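Publication No.: US8548972(B1)  Publication Date: 2013.10.01
Application No.: US201213422130  Filing Date: 2012.03.16
Applicant: JAIN ARVIND; MANKU GURMEET SINGH; GOOGLE INC.  Inventor: JAIN ARVIND; MANKU GURMEET SINGH
Classification: G06F17/30  Main Classification: G06F17/30
Agency:  Agent:
Principal Claim:
Address: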