Title: Scheduler for search engine crawler
Abstract: A search engine crawler includes a distributed set of schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., WWW). Each scheduler handles the scheduling of document identifiers (for crawling) for a subset of the known document identifiers. Using a starting set of document identifiers, such as the document identifiers crawled (or scheduled for crawling) during the most recent completed crawl, the scheduler removes from the starting set those document identifiers that have been unreachable in each of the last X crawls. Other filtering mechanisms may also be used to filter out some of the document identifiers in the starting set. The resulting list of document identifiers is written to a scheduled output file for use in a next crawl cycle.
Publication number: US8042112(B1) Publication date: 2011.10.18
Application number: US20040882956 Filing date: 2004.06.30
Applicant: GOOGLE INC. Inventors: ZHU HUICAN; IBEL MAXIMILIAN; ACHARYA ANURAG; GOBIOFF HOWARD BRADLEY
Classification: G06F9/46; G06F7/00 Main classification: G06F9/46
Agency: Agent:
Principal claim:
Address:
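
The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation. The names `CrawlHistory`, `unreachable_in_last_x`, `schedule_segment`, and `write_scheduled_output` are hypothetical, as are the per-URL list of reachability flags and the one-URL-per-line output format. The sketch also assumes that a URL with fewer than X recorded crawls is kept, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class CrawlHistory:
    """Hypothetical per-URL record: one flag per past crawl, oldest first.

    True means the document was reachable in that crawl.
    """
    reachable: List[bool] = field(default_factory=list)


def unreachable_in_last_x(history: CrawlHistory, x: int) -> bool:
    """Return True only if the URL was unreachable in each of the last x crawls."""
    if x <= 0 or len(history.reachable) < x:
        # Assumption: too little history means the URL is not filtered out.
        return False
    return not any(history.reachable[-x:])


def schedule_segment(
    starting_set: Iterable[str],
    histories: Dict[str, CrawlHistory],
    x: int,
) -> List[str]:
    """Filter one scheduler's segment of the starting set, preserving order."""
    scheduled = []
    for url in starting_set:
        history = histories.get(url, CrawlHistory())
        if unreachable_in_last_x(history, x):
            continue  # dropped: unreachable in each of the last x crawls
        scheduled.append(url)
    return scheduled


def write_scheduled_output(urls: Iterable[str], path: str) -> None:
    """Write the scheduled URLs, one per line (assumed file format)."""
    with open(path, "w", encoding="utf-8") as out:
        for url in urls:
            out.write(url + "\n")


# Example data (made up for illustration).
example_histories = {
    "http://a.example/": CrawlHistory([True, False, False]),
    "http://b.example/": CrawlHistory([False, False, False]),
    "http://c.example/": CrawlHistory([False]),
}
example_starting_set = [
    "http://a.example/",
    "http://b.example/",
    "http://c.example/",
    "http://d.example/",
]
example_schedule = schedule_segment(example_starting_set, example_histories, x=3)
```

With X = 3, only `http://b.example/` is removed, because it is the only URL that failed in all of its last three crawls. The other filtering mechanisms mentioned in the abstract would be additional checks inside the loop in `schedule_segment`.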