Title of invention: Handling dynamic URLs in crawl for better coverage of unique content
Abstract: Techniques for identifying duplicate webpages are provided. In one technique, one or more parameters of a first unique URL are identified, where each of the identified parameters does not substantially affect the content of the corresponding webpage. The first URL and subsequent URLs may be rewritten to drop each of those parameters. Each subsequent URL is then compared to the first URL. If a subsequent URL is the same as the first URL, the webpage corresponding to the subsequent URL is not accessed or crawled. In another technique, the parameters of multiple URLs are sorted, for example alphabetically. If any of the resulting URLs are the same, the webpages of the duplicate URLs are not accessed or crawled.
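The two techniques in the abstract can be illustrated with a minimal Python sketch. It is not the implementation claimed in the application. The names `IGNORED_PARAMS`, `canonicalize`, and `urls_to_crawl` are hypothetical, and the set of content-neutral parameters is simply assumed here, since the abstract does not say how those parameters are identified.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters already known not to affect page content.
# How they are identified is outside the scope of this sketch.
IGNORED_PARAMS = {"sessionid", "ref"}


def canonicalize(url, ignored=IGNORED_PARAMS):
    """Drop content-neutral parameters and sort the rest alphabetically."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((k, v) for k, v in pairs if k not in ignored)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def urls_to_crawl(urls, ignored=IGNORED_PARAMS):
    """Return the URLs whose canonical form has not been seen before."""
    seen = set()
    result = []
    for url in urls:
        key = canonicalize(url, ignored)
        if key in seen:
            continue  # duplicate after rewriting: skip instead of crawling
        seen.add(key)
        result.append(url)
    return result
```

For example, `http://example.com/p?a=1&b=2&sessionid=abc` and `http://example.com/p?b=2&a=1&sessionid=xyz` both rewrite to `http://example.com/p?a=1&b=2`, so only the first of the two would be crawled. The sketch combines parameter dropping and sorting in one step for brevity; the abstract presents them as separate techniques.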
Publication number: US2008091685(A1) Publication date: 2008.04.17
Application number: US20060580443 Filing date: 2006.10.13
Applicant: GARG PRIYANK S; BHATTACHARJEE ARNABNIL Inventor: GARG PRIYANK S.; BHATTACHARJEE ARNABNIL
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Principal claim:
Address: