Title of invention |
System and method for focused re-crawling of web sites |
Abstract |
A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
|
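The steps named in the abstract (crawl from seeds, partition pages, discover patterns, restrict later crawling) can be sketched as follows. This is only an illustrative sketch, not the patented method: the abstract does not say how pages are judged relevant or how patterns are derived, so the in-memory `FAKE_WEB`, the `is_relevant` keyword test, and the choice of first-level URL path prefixes as exclusion patterns are all assumptions made for this example.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical toy "Web": URL -> (page text, outgoing links). Not real data.
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/docs/a",
                                     "http://example.com/ads/1"]),
    "http://example.com/docs/a": ("crawler docs", ["http://example.com/docs/b"]),
    "http://example.com/docs/b": ("crawler guide", []),
    "http://example.com/ads/1": ("buy now", ["http://example.com/ads/2"]),
    "http://example.com/ads/2": ("buy more", []),
}


def crawl(seeds, excluded_prefixes=()):
    """Breadth-first crawl from the seed URLs, skipping excluded prefixes."""
    seen, order, queue = set(), [], deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen or url not in FAKE_WEB:
            continue
        if any(url.startswith(prefix) for prefix in excluded_prefixes):
            continue  # restricted by a discovered exclusion pattern
        seen.add(url)
        order.append(url)
        queue.extend(FAKE_WEB[url][1])
    return order


def partition(urls, is_relevant):
    """Split crawled URLs into (relevant, irrelevant) by page text."""
    relevant = [u for u in urls if is_relevant(FAKE_WEB[u][0])]
    irrelevant = [u for u in urls if not is_relevant(FAKE_WEB[u][0])]
    return relevant, irrelevant


def first_segment_prefix(url):
    """Return scheme://host/first-dir/ for a URL, or None for top-level pages."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        return None
    return f"{parts.scheme}://{parts.netloc}/{segments[0]}/"


def discover_exclusions(relevant, irrelevant):
    """Assumed heuristic: exclude prefixes seen only among irrelevant pages."""
    bad = {first_segment_prefix(u) for u in irrelevant} - {None}
    good = {first_segment_prefix(u) for u in relevant} - {None}
    return bad - good


def is_relevant(text):
    """Hypothetical relevance test: the page mentions 'crawler'."""
    return "crawler" in text


seeds = ["http://example.com/"]
first_pass = crawl(seeds)
relevant, irrelevant = partition(first_pass, is_relevant)
exclusions = discover_exclusions(relevant, irrelevant)
second_pass = crawl(seeds, excluded_prefixes=tuple(exclusions))
```

On this toy data the first pass visits all five pages, and the second pass no longer follows anything under the `ads/` directory. An inclusion pattern could be handled symmetrically by keeping only URLs that match a prefix seen among relevant pages.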
Publication number |
US7882099(B2) |
Publication date |
2011.02.01 |
Application number |
US20080054482 |
Filing date |
2008.03.25 |
Applicant |
INTERNATIONAL BUSINESS MACHINES CORPORATION |
Inventors |
AGRAWAL NEERAJ;BALAKRISHNAN SREERAM VISWANATH;JOSHI SACHINDRA |
Classification |
G06F17/30 |
Main classification |
G06F17/30 |
Agency |
|
Agent |
|
Principal claim |
|
Address |
|