Title of invention: Techniques for detecting duplicate web pages
Abstract: Techniques are disclosed for detecting web pages with duplicate content. In one embodiment, a set of shingles is computed for each page of a group of pages. An aggregate set of shingles is determined based on the sets of shingles computed for the group of pages. A first subset from the aggregate set of shingles is determined by selecting, from the aggregate set, shingles whose frequencies in the aggregate set exceed a specified threshold. A modified set of shingles is generated for each page of the group of pages by removing, from the set of shingles for that page, any shingle included in the first subset. One or more duplicate pages in the group of pages are determined based at least in part on the modified sets of shingles generated for the group of pages.
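The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: shingles are taken to be runs of `k` consecutive words, a shingle's frequency is taken to be the number of pages that contain it, and pages are compared by Jaccard similarity of their modified shingle sets. The function names and the parameters `k`, `threshold`, and `min_similarity` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in text (assumed shingle definition)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def modified_shingle_sets(pages, k=3, threshold=2):
    """Drop shingles that are too common across the group of pages.

    pages maps a page name to its text. A shingle's frequency is assumed
    here to be the number of pages whose shingle set contains it.
    """
    per_page = {name: shingles(text, k) for name, text in pages.items()}
    # Aggregate the per-page sets and count each shingle's frequency.
    freq = Counter(s for page_set in per_page.values() for s in page_set)
    # First subset: shingles whose frequency exceeds the threshold.
    frequent = {s for s, n in freq.items() if n > threshold}
    # Modified set: each page's shingles minus the frequent ones.
    return {name: page_set - frequent for name, page_set in per_page.items()}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(pages, k=3, threshold=2, min_similarity=0.8):
    """Return page-name pairs whose modified shingle sets are similar enough."""
    modified = modified_shingle_sets(pages, k, threshold)
    return [
        (p, q)
        for p, q in combinations(sorted(modified), 2)
        if jaccard(modified[p], modified[q]) >= min_similarity
    ]
```

Under these assumptions, text shared by most pages in the group (for example a common site header) is filtered out before comparison, so it cannot make otherwise unrelated pages look alike. The threshold has to be chosen with care: if it is set too low, shingles shared only by genuine duplicates are removed as well.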
Publication number: US7698317(B2)
Publication date: 2010.04.13
Application number: US20070788505
Filing date: 2007.04.20
Applicant: YAHOO! INC.
Inventors: SASTURKAR AMIT; AHUJA RAJAT; RAVIKUMAR SHANMUGASUNDARAM; OFITSEROV VLADIMIR
Classification: G06F17/00
Main classification: G06F17/00
Agency:
Agent:
Main claim:
Address: