Title of invention: LANGUAGE-ORIENTED FOCUSED CRAWLING USING TRANSLITERATION BASED META-FEATURES
Abstract: A web page identified by a URL stored in a downloads queue is downloaded, and hyperlinks in the downloaded web page are identified. Each hyperlink is screened by parsing the hyperlink (optionally only the URL of the hyperlink) to identify features comprising character strings, computing for each feature values for one or more meta-features indicative of the hyperlinked web page being in a target language, aggregating the meta-feature values to generate a score for the hyperlink, and adding the URL of the hyperlink to the downloads queue conditional upon the score satisfying a screening criterion. The downloading, identifying, and screening are iteratively repeated to perform web crawling, and an index of web pages in the target language is constructed based on analysis of content of the downloaded web pages. The meta-features may include a transliterated target word meta-feature, a language code meta-feature, a country code meta-feature, or so forth.
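The download, identify, and screen loop described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `crawl`, `LinkExtractor`, `fetch`, `score_link`, `threshold`, and `max_pages` are all assumptions, the screening criterion is assumed to be a simple score threshold, and the index-construction step is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (href, anchor_text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(seed_urls, fetch, score_link, threshold, max_pages=100):
    """Hypothetical focused-crawl loop.

    fetch(url) returns HTML text or None; score_link(url, anchor_text)
    returns a number. Both are supplied by the caller.
    """
    queue = deque(seed_urls)      # the downloads queue
    seen = set(seed_urls)         # URLs already queued
    downloaded = {}
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        downloaded[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href, anchor in parser.links:
            target = urljoin(url, href)
            # Screening: queue the link only if its score passes (assumed threshold test).
            if target not in seen and score_link(target, anchor) >= threshold:
                seen.add(target)
                queue.append(target)
    return downloaded
```

Passing `fetch` and `score_link` in as parameters keeps the loop independent of any particular network library or scoring scheme, so it can be exercised against an in-memory dictionary of pages.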
Publication number: US2014258261(A1) Publication date: 2014.09.11
Application number: US201313792806 Filing date: 2013.03.11
Applicant: XEROX CORPORATION Inventors: Singh Nidhi; Coursimault Jean-Marc; Monet Nicolas; Poirer Herve
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Main claim: 1. A non-transitory storage medium storing instructions readable and executable by an electronic data processing device to perform a crawling method including the operations of: (i) identifying hyperlinks in a current document wherein each hyperlink links to a linked document and includes anchor text and a linked document identifier; (ii) scoring each hyperlink by parsing the hyperlink to identify features comprising character strings, computing values for one or more meta-features for each feature wherein the meta-features are indicative of the linked document being in a target language, and aggregating the meta-feature values of the features of the hyperlink to generate a score for the hyperlink; and (iii) downloading documents linked by hyperlinks of the current document whose scores satisfy a screening criterion and not downloading documents linked by hyperlinks of the current document whose scores do not satisfy the screening criterion.
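Operation (ii) of the claim, scoring a hyperlink from meta-feature values, might look like the sketch below. Everything specific here is an assumption made for illustration: the Hindi-oriented word and code lists, the tokenization rule, the 0/1 meta-feature values, aggregation by summation, and the threshold form of the screening criterion. The claim itself does not fix any of these.

```python
import re

# Hypothetical resources for a Hindi-oriented crawl (illustrative only).
TRANSLITERATED_WORDS = {"samachar", "khabar", "bharat"}
LANGUAGE_CODES = {"hi", "hin"}
COUNTRY_CODES = {"in"}

# Each meta-feature maps one feature (a character string) to a value.
META_FEATURES = (
    lambda f: 1.0 if f in TRANSLITERATED_WORDS else 0.0,  # transliterated target word
    lambda f: 1.0 if f in LANGUAGE_CODES else 0.0,        # language code
    lambda f: 1.0 if f in COUNTRY_CODES else 0.0,         # country code
)


def parse_features(url, anchor_text=""):
    """Split the URL and anchor text into lowercase character strings."""
    tokens = re.split(r"[\W_]+", f"{url} {anchor_text}".lower())
    return [t for t in tokens if t]


def score_hyperlink(url, anchor_text=""):
    """Aggregate meta-feature values over all features (assumed: plain sum)."""
    features = parse_features(url, anchor_text)
    return sum(mf(f) for f in features for mf in META_FEATURES)


def should_download(url, anchor_text="", threshold=1.0):
    """Screening criterion, assumed here to be score >= threshold."""
    return score_hyperlink(url, anchor_text) >= threshold
```

With these assumed lists, a link such as `http://news.example.in/hi/samachar` fires all three meta-features, while `http://news.example.com/en/sports` fires none. Exact token matching is crude (the English word "in" in anchor text would also count as a country code), so a practical scorer would likely weight meta-features or restrict each one to the part of the hyperlink where it is meaningful.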
Address: Norwalk CT US