Title of invention: Incremental crawling of multiple content providers using aggregation
Abstract: A method for incremental crawling of content stored on a plurality of content providers using aggregation is provided. The method comprises receiving a request to crawl content on one or more associated content providers; retrieving one or more first references to content on a first content provider; retrieving one or more second references to content on one or more second content providers during the same request; aggregating the first and second references; and returning the aggregated first and second references. This is done while taking into consideration an opaque timestamp object, which is managed in a distributed manner. The opaque timestamp is filled in by the content providers but stored on the crawler side between crawling sessions.
Publication number: US8799261(B2) Publication date: 2014.08.05
Application number: US200812343009 Filing date: 2008.12.23
Applicant: International Business Machines Corporation Inventors: Kenig Batya;Radchenko Constantin;Shapiro Eitan
Classification: G06F7/00;G06F17/30 Main classification: G06F7/00
Agency: Edell, Shapiro & Finnan, LLC Agent: Polimeni Joe;Edell, Shapiro & Finnan, LLC
Principal claim: 1. A method for incremental crawling of content stored on a plurality of content providers using aggregation, the method comprising:
  receiving a request to crawl content on one or more associated content providers, wherein the request comprises at least a starting index value and a range value associated with a quantity of content to be accessed on the associated content providers;
  forwarding the request to a first content provider on a list, in response to determining that there is no valid state information, wherein forwarding the request to a first content provider on a list further comprises:
    incrementing the range value by the starting index value; and
    passing timing information corresponding to the first content provider with the request;
  forwarding the request to a first content provider identified by the state information as a next content provider, in response to determining that there is valid state information based on a comparison of the starting index value with a last received index, wherein forwarding the request to a first content provider identified by the state information further comprises:
    setting the starting index value to a value indicated by the state information as a next starting index; and
    passing the state information and timing information corresponding to the first content provider with the request;
  receiving references, state information, or timing information from the first content provider;
  aggregating the received references, state information, and timing information with other references, state information, and timing information, respectively;
  forwarding the request to a second content provider on the list, in response to a quantity of content from the first content provider being less than the quantity of content to be accessed on the associated content providers;
  updating the state information with the next content provider and corresponding next starting index, in response to determining that the request includes no unsatisfied references that other content providers on the list are able to satisfy; and
  returning the aggregated references, the updated state information, and the aggregated time stamp.
Address: Armonk NY US
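Illustrative sketch (not part of the patent record): the flow recited in the principal claim can be approximated in a few lines of Python. Everything named here is hypothetical and chosen only for illustration: `CrawlState`, `ListProvider`, its `fetch` method, and `crawl` do not appear in the patent. The sketch assumes each provider can page through its changed items by a local index, that the opaque timestamp is simply the newest modification time a provider has seen, and that state is valid when the requested starting index equals the last index returned.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class CrawlState:
    """Hypothetical state the crawler sends back with its next request."""
    last_index: int     # global index just past the last reference returned
    next_provider: int  # position in the provider list to resume from
    next_start: int     # provider-local index to resume from


class ListProvider:
    """Hypothetical in-memory provider; items are (modified_at, reference)."""

    def __init__(self, name: str, items: List[Tuple[int, str]]):
        self.name = name
        self.items = items

    def fetch(self, start: int, count: int, timestamp: Optional[int]):
        # Only the provider interprets the timestamp; the aggregator
        # stores it and passes it back without looking inside.
        since = -1 if timestamp is None else timestamp
        changed = [ref for t, ref in self.items if t > since]
        new_ts = max((t for t, _ in self.items), default=timestamp)
        return changed[start:start + count], new_ts


def crawl(providers: List[ListProvider], start: int, count: int,
          state: Optional[CrawlState], timestamps: Dict[str, int]):
    """Return (references, updated state, aggregated timestamps)."""
    if state is not None and state.last_index == start:
        # Valid state: resume at the provider and local index it names.
        idx, local_start, skip = state.next_provider, state.next_start, 0
    else:
        # No valid state: start at the first provider, enlarge the range
        # by the starting index, and discard the leading references later.
        idx, local_start, skip = 0, 0, start
    need = count + skip
    refs: List[str] = []
    new_ts: Dict[str, int] = {}
    while idx < len(providers) and len(refs) < need:
        provider = providers[idx]
        got, ts = provider.fetch(local_start, need - len(refs),
                                 timestamps.get(provider.name))
        refs.extend(got)
        new_ts[provider.name] = ts
        if len(refs) < need:
            # Provider returned fewer than requested: go to the next one.
            idx, local_start = idx + 1, 0
        else:
            local_start += len(got)
    result = refs[skip:]
    new_state = CrawlState(start + len(result), idx, local_start)
    return result, new_state, new_ts
```

In this sketch the caller keeps the returned timestamps between crawling sessions and passes the previous session's values on every page of the next session, so a provider reports only items changed since then. Whether a real provider pages by index or by some other cursor is an assumption of the sketch.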