Title of invention: Systems and methods of web crawling
Abstract: Methods and systems for dynamically training a web crawler. The web crawler maintains one or more categories, each comprising a set of words. The method includes selecting at least one hyperlink in response to a query received from a user. The method further includes determining a hyperlink score for the at least one hyperlink based on a category score associated with each of the one or more categories. The category score associated with each of the one or more categories is updated based at least in part on the hyperlink score. The updated category score is compared with the hyperlink score to select a category from the one or more categories. The set of words associated with the selected category is updated based on the content of a web page pointed to by the at least one hyperlink.
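Publication number: US9576052(B2); Publication date: 2017.02.21
Application number: US201313942812; Filing date: 2013.07.16
Applicant: XEROX CORPORATION; Inventors: Singh Nidhi; Coursimault Jean-Marc; Poirier Herve; Monet Nicolas
Classification: G06F17/30; Main classification: G06F17/30
Agency: ; Agent: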
Independent claim: 1. A method for training a web crawler, wherein the web crawler maintains one or more categories each comprising a set of words, the method comprising: in response to receiving a query from a user: selecting, by a processor, at least one hyperlink based on the set of words; determining, by the processor, a hyperlink score for the at least one hyperlink based on a predetermined category score associated with each of one or more categories and a membership value of the at least one hyperlink for each of the one or more categories; updating, by the processor, the predetermined category score associated with each of the one or more categories based at least on a discount factor associated with the predetermined category score and an association of learning rate with a measure of contribution of the one or more categories for the selection of the at least one hyperlink and another measure of correctness of the selection of the at least one hyperlink with respect to semantic of the query; comparing, by the processor, the updated predetermined category score with the hyperlink score to select a category from the one or more categories; and updating, by the processor, the set of words associated with the category based on content of a web page pointed by the at least one hyperlink.
Address: Norwalk CT US
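
The steps recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation: the record gives no formulas, so every name and rule below is an assumption made for illustration. In particular, the membership value is assumed to be the fraction of link tokens found in a category's word set, the hyperlink score is assumed to be a membership-weighted sum of category scores, the update is assumed to be `discount * score + learning_rate * contribution * correctness`, and the comparison step is assumed to pick the highest-scoring category whose updated score is at least the hyperlink score.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Category:
    """A crawler category: a set of words plus a running score (hypothetical layout)."""
    words: Set[str]
    score: float = 1.0


def membership(link_text: str, category: Category) -> float:
    # Assumed membership value: share of link tokens present in the category's words.
    tokens = link_text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in category.words for t in tokens) / len(tokens)


def hyperlink_score(link_text: str, categories: Dict[str, Category]) -> float:
    # Assumed combination: category scores weighted by the link's membership values.
    return sum(c.score * membership(link_text, c) for c in categories.values())


def update_category_scores(
    link_text: str,
    categories: Dict[str, Category],
    correctness: float,
    discount: float = 0.9,
    learning_rate: float = 0.1,
) -> None:
    # Assumed update: discount the old score, then add the learning rate times
    # the category's contribution (its membership value) times the correctness
    # of the selection with respect to the query.
    for c in categories.values():
        contribution = membership(link_text, c)
        c.score = discount * c.score + learning_rate * contribution * correctness


def select_category(link_score: float, categories: Dict[str, Category]) -> Optional[str]:
    # Assumed comparison rule: keep categories whose updated score is at least
    # the hyperlink score, then take the one with the highest score.
    eligible = [name for name, c in categories.items() if c.score >= link_score]
    if not eligible:
        return None
    return max(eligible, key=lambda name: categories[name].score)


def train_step(
    link_text: str,
    page_text: str,
    categories: Dict[str, Category],
    correctness: float,
) -> Optional[str]:
    """Run one score / update / compare / word-update pass for a selected hyperlink."""
    link_score = hyperlink_score(link_text, categories)
    update_category_scores(link_text, categories, correctness)
    chosen = select_category(link_score, categories)
    if chosen is not None:
        # Extend the chosen category's word set with words from the linked page.
        categories[chosen].words.update(page_text.lower().split())
    return chosen
```

For example, with categories `sports` (words `football`, `league`) and `finance` (words `stock`, `market`), calling `train_step("football league results", "football transfer news", categories, correctness=1.0)` selects `sports` under these assumed rules and adds the page's words to that category only.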