Title of Invention: METHOD AND SYSTEM FOR SCHEDULING WEB CRAWLERS ACCORDING TO KEYWORD SEARCH
Abstract: A method and a system for scheduling web crawlers according to keyword search. The method comprises: a scheduling end receiving a task request command sent by a crawling node; the scheduling end acquiring secondary download link addresses from a priority bucket, generating tasks, and adding the generated tasks to a task list; acquiring keyword link addresses from a dynamic bucket, deriving a derivative link address for each result page corresponding to each keyword link address, generating one task per page from the derivative link addresses, and adding those tasks to the task list; acquiring a keyword link address from a basic bucket, generating tasks, and adding the generated tasks to the task list; and the scheduling end returning the task list to the crawling node. By adjusting the number of tasks each virtual bucket is allowed to add, the number of scheduled link addresses of each type can be flexibly tuned. In addition, crawling popular keywords more frequently prevents missed data and reduces repeated crawls of unpopular keywords.
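The "derivative link addresses" in the abstract can be illustrated with a short sketch: given a keyword search URL and the known page count of its result set, one link address is derived per result page. The URL pattern with a `page` query parameter is an assumption for illustration; real target websites use their own pagination formats.

```python
# Hypothetical sketch: derive one link address per result page for a
# multipage keyword. The "&page=N" parameter format is an assumption.

def derive_page_links(keyword_url: str, page_count: int) -> list:
    """Return a derivative link address for each result page."""
    return [f"{keyword_url}&page={n}" for n in range(1, page_count + 1)]

links = derive_page_links("https://example.com/search?kw=phone", 3)
# links -> one URL each for pages 1, 2, and 3
```

One task would then be generated per derived link, so a popular keyword with many result pages yields proportionally more crawl tasks.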
Publication No.: US2016328475(A1)  Publication Date: 2016.11.10
Application No.: US201515110564  Filing Date: 2015.01.09
Applicants: BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO., LTD.; BEIJING JINGDONG CENTURY TRADING CO., LTD.  Inventors: Liao Yaohua; Li Xiaowei
Classification: G06F17/30; G06F9/48  Main Classification: G06F17/30
Claim 1: 1. A method for scheduling web crawlers according to a keyword search, characterized in comprising:
Step (12): a scheduling end receiving a task request command sent by a crawling node;
Step (13): the scheduling end acquiring a secondary download link address from a priority bucket that stores secondary download link addresses, generating a task, and adding the generated task to a task list; if the number of tasks allowed to be added to the task list from the priority bucket is reached, performing Step (16), and otherwise performing Step (14), wherein the secondary download link addresses are link addresses requiring secondary download, obtained by analyzing the pages crawled by the crawling node according to the tasks in the task list;
Step (14): the scheduling end acquiring keyword link addresses from a dynamic bucket that stores keyword multipage link addresses, deriving a derivative link address for each page corresponding to a keyword link address, generating one task per page from the derivative link addresses, and adding those tasks to the task list; if the number of tasks allowed to be added to the task list from the dynamic bucket is reached, performing Step (16), and otherwise performing Step (15), wherein the keyword link addresses are link addresses of search result pages generated in a target website according to the keyword, and the search result pages corresponding to the keyword link addresses in the dynamic bucket have a number of pages no less than a preset page-count threshold that is no less than 2;
Step (15): the scheduling end acquiring a keyword link address from a basic bucket that stores keyword link addresses, generating a task, and adding the generated task to the task list; if the number of tasks allowed to be added to the task list from the basic bucket is reached, performing Step (16), wherein the keyword link addresses are link addresses of search result pages generated in a target website according to the keyword, and the search result pages corresponding to the keyword link addresses in the basic bucket have a number of pages no less than a preset page-count threshold that is no less than 2; and
Step (16): the scheduling end returning the task list to the crawling node, the crawling node performing the tasks in the received task list.
Address: Haidian District, Beijing, CN