Abstract
A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues. The Frontier includes a set of parallel "priority queues," each associated with a distinct priority level. Queue elements for documents to be downloaded are assigned a priority level, and then stored in the corresponding priority queue. Queue elements are then distributed from the priority queues to a set of underlying queues in accordance with their relative priorities. The threads then process the queue elements in the underlying queues. When performing a continuous crawl, the web crawler reinserts the queue element for a downloaded document into the Frontier in accordance with a download priority level associated with the downloaded document. For example, the download priority level may be determined as a function of an expiration date and time associated with the document whose document address is denoted by the queue element.
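The two-tier queue architecture described above can be sketched briefly. The following is a minimal single-threaded Python illustration only; all class, method, and parameter names are hypothetical (the patent does not specify an API), the locking required by the multi-threaded crawl is omitted, and the mapping from expiration time to priority level is an assumed example of the "function of an expiration date and time" mentioned in the abstract. Queue elements enter one of several priority queues, are distributed to per-host underlying queues, and, in a continuous crawl, are re-enqueued with a priority derived from the document's expiration time.

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    class Frontier:
        def __init__(self, num_priorities=3):
            # One FIFO "priority queue" per distinct priority level
            # (index 0 = highest priority).
            self.priority_queues = [deque() for _ in range(num_priorities)]
            # Underlying queues: all addresses sharing a common host
            # component are stored in the same queue.
            self.host_queues = defaultdict(deque)

        def enqueue(self, url, priority=0):
            # Assign the queue element a priority level and store it
            # in the corresponding priority queue.
            self.priority_queues[priority].append(url)

        def distribute(self):
            # Distribute queue elements from the priority queues to the
            # underlying per-host queues, higher priorities first.
            for pq in self.priority_queues:
                while pq:
                    url = pq.popleft()
                    self.host_queues[urlparse(url).hostname].append(url)

        def dequeue(self, host):
            # Called by a worker thread to obtain the next document
            # address to download for a given host.
            q = self.host_queues.get(host)
            return q.popleft() if q else None

    def reinsert_after_download(frontier, url, expires_at):
        # Continuous crawl: re-enqueue a downloaded document with a
        # priority derived from its expiration time. This particular
        # mapping is an assumption for illustration; sooner expiry
        # yields a higher download priority level.
        seconds_left = max(0.0, expires_at - time.time())
        if seconds_left < 3600:
            frontier.enqueue(url, priority=0)   # expires within an hour
        elif seconds_left < 86400:
            frontier.enqueue(url, priority=1)   # expires within a day
        else:
            frontier.enqueue(url, priority=2)   # expires later

Because all addresses for a host land in one underlying queue, a thread that drains a single queue naturally serializes requests to that host, which is why the distribution step, rather than the priority queues themselves, is what the worker threads consume from.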