Abstract
A web page identified by a URL stored in a downloads queue is downloaded, and hyperlinks in the downloaded web page are identified. Each hyperlink is screened by: parsing the hyperlink (optionally, only the URL of the hyperlink) to identify features comprising character strings; computing, for each feature, values of one or more meta-features indicative of the hyperlinked web page being in a target language; aggregating the meta-feature values to generate a score for the hyperlink; and adding the URL of the hyperlink to the downloads queue only if the score satisfies a screening criterion. The downloading, identifying, and screening are iteratively repeated to perform web crawling, and an index of web pages in the target language is constructed based on analysis of content of the downloaded web pages. The meta-features may include a transliterated target word meta-feature, a language code meta-feature, a country code meta-feature, and so forth.
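The screening loop described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the word list, code sets, weights, threshold, the weighted-sum aggregation, and the function names (`parse_features`, `score_url`, `screen`, `crawl`) are all assumptions chosen for illustration, with Hungarian as a hypothetical target language. The `fetch` argument stands in for whatever downloader is used, and construction of the index from page content is omitted.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical resources for an assumed target language (Hungarian).
TRANSLITERATED_WORDS = {"hirek", "magyar", "ujsag"}  # accent-stripped target words
LANGUAGE_CODES = {"hu"}   # language code expected in a host label or path
COUNTRY_CODES = {"hu"}    # country code expected as the top-level domain
# Assumed weights and threshold; the source does not specify the aggregation.
WEIGHTS = {"translit": 1.0, "lang_code": 1.5, "country_code": 2.0}
THRESHOLD = 1.5

HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)
TOKEN_RE = re.compile(r"[^a-z0-9]+")


def parse_features(url):
    """Split a URL into (character string, is_top_level_domain) features."""
    parsed = urlparse(url.lower())
    labels = [label for label in (parsed.hostname or "").split(".") if label]
    features = []
    for i, label in enumerate(labels):
        is_tld = i == len(labels) - 1
        features += [(t, is_tld) for t in TOKEN_RE.split(label) if t]
    rest = parsed.path + " " + parsed.query
    features += [(t, False) for t in TOKEN_RE.split(rest) if t]
    return features


def meta_feature_values(token, is_tld):
    """Compute one value per meta-feature for a single feature."""
    return {
        "translit": 1.0 if token in TRANSLITERATED_WORDS else 0.0,
        "lang_code": 1.0 if not is_tld and token in LANGUAGE_CODES else 0.0,
        "country_code": 1.0 if is_tld and token in COUNTRY_CODES else 0.0,
    }


def score_url(url):
    """Aggregate meta-feature values into a score (assumed: weighted sum)."""
    return sum(
        WEIGHTS[name] * value
        for token, is_tld in parse_features(url)
        for name, value in meta_feature_values(token, is_tld).items()
    )


def screen(url):
    """Screening criterion (assumed): score at or above a fixed threshold."""
    return score_url(url) >= THRESHOLD


def crawl(seeds, fetch, max_pages=10):
    """Download queued pages and enqueue only hyperlinks that pass screening."""
    queue = deque(seeds)
    seen = set(seeds)
    downloaded = []
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # hypothetical downloader: URL -> HTML text
        downloaded.append(url)
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)
            if link not in seen and screen(link):
                seen.add(link)
                queue.append(link)
    return downloaded
```

Under these assumptions, a link such as `http://example.hu/hirek/` scores on both the country code and the transliterated word and is queued, while `http://example.com/news` scores zero and is skipped without ever being downloaded. That is the point of screening on the hyperlink alone: the decision costs no extra page fetch.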