Title of invention System and method for associating an extensible set of data with documents downloaded by a web crawler
Abstract A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues. The web crawler includes a set of tools for storing an extensible set of data with each document address in the Frontier. These tools enable the applications to which the web crawler passes downloaded documents to store a record of information associated with each download, where each record of information includes an extensible set of name/value pairs specified by the applications. The applications also determine how many records of information to retain for each document, when to delete records of information, and so on. In another aspect of the present invention, the Frontier includes a set of parallel "priority queues," each associated with a distinct priority level. Queue elements for documents to be downloaded are assigned a priority level, and then stored in the corresponding priority queue. Queue elements are then distributed from the priority queues to a set of underlying queues in accordance with their relative priorities. The threads then process the queue elements in the underlying queues.
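Publication number US6351755(B1) Publication date 2002.02.26
Application number US19990433006 Filing date 1999.11.02
Applicant ALTA VISTA COMPANY Inventors NAJORK MARC ALEXANDER; HEYDON CLARK ALLAN
Classification G06F17/30;(IPC1-7):G06F17/21 Main classification G06F17/30
Agency Agent
Principal claim
Address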
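
The arrangement described in the abstract can be illustrated with a small single-threaded sketch. It is not the patented implementation: the class name `Frontier`, its methods, the `max_records` retention limit, and the rule that lower-numbered priority levels are drained first are all assumptions made for illustration. The abstract only says that elements are distributed "in accordance with their relative priorities" and that the applications decide how many records to keep. The concurrent worker threads mentioned in the abstract are omitted.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """Toy frontier: priority queues feed per-host underlying queues.

    All names here are hypothetical; none come from the patent text.
    """

    def __init__(self, priority_levels=3, max_records=2):
        # One FIFO per priority level; level 0 is treated as most urgent
        # (an assumption, the abstract does not fix the ordering).
        self.priority_queues = [deque() for _ in range(priority_levels)]
        # Underlying queues, one per host component of the address.
        self.host_queues = defaultdict(deque)
        # Per-address history of download records (name/value dicts).
        self.records = defaultdict(list)
        # Retention limit chosen by the application; assumed to be >= 1.
        self.max_records = max_records

    def enqueue(self, url, priority):
        """Store an address in the priority queue for its level."""
        self.priority_queues[priority].append(url)

    def distribute(self):
        """Move addresses to host queues, most urgent level first."""
        for queue in self.priority_queues:
            while queue:
                url = queue.popleft()
                self.host_queues[urlsplit(url).hostname].append(url)

    def add_record(self, url, **fields):
        """Attach an application-defined record; keep only the newest ones."""
        history = self.records[url]
        history.append(dict(fields))
        del history[:-self.max_records]


# Example use with made-up addresses and field names.
frontier = Frontier()
frontier.enqueue("http://a.example/low", priority=2)
frontier.enqueue("http://a.example/high", priority=0)
frontier.distribute()
frontier.add_record("http://a.example/high", checksum="abc", status=200)
```

Because each record is an ordinary dictionary, an application can add new field names without any change to the frontier itself, which is the sense of "extensible" this sketch tries to capture.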