Abstract |
A method, apparatus, and system are disclosed for harvesting publicly accessible data from internet web pages. In one embodiment, the invention includes emulating user requests that are consistent with a user operating an industry-standard browser, receiving text in response to the emulated requests, using a set of relevance estimators to select a most relevant candidate from a set of data items, and segmenting text received from a web page into extractable blocks. Relevance estimators may use techniques such as word matching, pattern matching, format matching, context assessment, and word proximity. The extracted data may be aggregated into a database and used in applications such as phone directories or sales catalogs. The present invention facilitates data harvesting from web pages related to one or more specified topics.
|
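As an illustration only, the segmentation and relevance-estimation steps described in the abstract might be sketched as follows. This is a minimal sketch, not the disclosed implementation: the function names (`segment_blocks`, `word_match`, `pattern_match`, `word_proximity`, `most_relevant`), the blank-line segmentation rule, and the unweighted sum of estimator scores are all assumptions introduced here for clarity.

```python
import re
from typing import Callable, List, Optional

# Hypothetical type: an estimator maps a text block to a relevance score.
Estimator = Callable[[str], float]


def segment_blocks(text: str) -> List[str]:
    """Split page text into candidate blocks (assumed rule: blank lines)."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


def _tokens(block: str) -> List[str]:
    """Lowercase alphanumeric tokens of a block."""
    return re.findall(r"[a-z0-9]+", block.lower())


def word_match(keywords: List[str]) -> Estimator:
    """Word matching: fraction of the keywords present in the block."""
    def score(block: str) -> float:
        words = set(_tokens(block))
        return sum(k.lower() in words for k in keywords) / len(keywords)
    return score


def pattern_match(pattern: str) -> Estimator:
    """Pattern matching: 1.0 if the regular expression occurs, else 0.0."""
    compiled = re.compile(pattern)

    def score(block: str) -> float:
        return 1.0 if compiled.search(block) else 0.0
    return score


def word_proximity(first: str, second: str, window: int = 5) -> Estimator:
    """Word proximity: 1.0 if both words occur within `window` tokens."""
    def score(block: str) -> float:
        toks = _tokens(block)
        pos_a = [i for i, t in enumerate(toks) if t == first.lower()]
        pos_b = [i for i, t in enumerate(toks) if t == second.lower()]
        if not pos_a or not pos_b:
            return 0.0
        gap = min(abs(a - b) for a in pos_a for b in pos_b)
        return 1.0 if gap <= window else 0.0
    return score


def most_relevant(blocks: List[str],
                  estimators: List[Estimator]) -> Optional[str]:
    """Return the block with the highest summed score, or None if empty."""
    if not blocks:
        return None
    return max(blocks, key=lambda b: sum(e(b) for e in estimators))


# Usage with made-up page text and made-up estimator settings.
page = ("Welcome to our site.\n\n"
        "Contact sales: call 555-0100 today.\n\n"
        "Copyright notice.")
estimators = [
    word_match(["contact", "sales"]),
    pattern_match(r"\b\d{3}-\d{4}\b"),
    word_proximity("call", "sales", window=3),
]
best = most_relevant(segment_blocks(page), estimators)
print(best)
```

Summing the scores treats every estimator as equally important; a weighted combination would be an equally plausible choice, and the abstract does not specify either.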