Title: Duplicate document detection in a web crawler system
Abstract: Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
Publication No.: US7627613(B1)  Publication Date: 2009.12.01
Application No.: US20030614111  Filing Date: 2003.07.03
Applicant: GOOGLE INC.  Inventors: DULITZ DANIEL; VERSTAK ALEXANDRE A.; GHEMAWAT SANJAY; DEAN JEFFREY A.
Classification: G06F12/00; G06F17/30  Main Classification: G06F12/00
Agency:  Agent:
Main Claim:
Address:
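
The flow described in the abstract (identify same-content documents, merge, filter by a query-independent metric, pick one representative) can be illustrated with a minimal sketch. This is not the patented implementation. The names `Doc`, `DuplicateIndex`, `fingerprint`, `score`, and `max_set_size` are hypothetical, and three assumptions are made that the record does not state: "same content" is modeled as an exact SHA-256 match, inclusion and exclusion is modeled as keeping only the highest-scoring documents, and the "predefined conditions" for the representative are reduced to "highest score, ties broken by URL".

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    url: str
    content: str
    score: float  # assumed query-independent metric (hypothetical, e.g. a link-based rank)


def fingerprint(content: str) -> str:
    """Content key; exact-match hashing is an assumption of this sketch."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class DuplicateIndex:
    """Maps a content fingerprint to the set of documents sharing that content."""

    def __init__(self, max_set_size: int = 3):
        # Assumed policy: each duplicate set keeps at most this many documents.
        self.max_set_size = max_set_size
        self.sets = {}  # fingerprint -> list of Doc, best score first

    def add(self, doc: Doc) -> Doc:
        """Merge a newly crawled document and return the set's representative."""
        key = fingerprint(doc.content)
        existing = self.sets.get(key, [])
        # Merge: drop any stale entry for the same URL, then add the new document.
        merged = [d for d in existing if d.url != doc.url] + [doc]
        # Include/exclude by the query-independent metric: keep the top scorers.
        merged.sort(key=lambda d: (-d.score, d.url))
        self.sets[key] = merged[: self.max_set_size]
        # Assumed condition for the representative: the highest-scoring document.
        return self.sets[key][0]


index = DuplicateIndex(max_set_size=2)
index.add(Doc("http://a.example/x", "hello", 0.2))
index.add(Doc("http://b.example/x", "hello", 0.9))
representative = index.add(Doc("http://c.example/x", "hello", 0.5))
print(representative.url)
```

Keeping each set sorted by score makes choosing the representative a constant-time lookup, at the cost of a sort on every insertion. A production crawler would likely use a different structure and a more tolerant notion of content equality.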