Abstract |
A method and system for collecting data from resources connected to a network for addition to a database that contains data records for businesses. A database of URL records is built according to a data structure that includes data elements useful for determining whether the entity they describe qualifies as a business. The data elements of the two databases are used to form web mining strategies. A distributed processing system mines large numbers of web pages in parallel. Bandwidth use and transmission times are reduced at the distributed device end by summarizing each web page's content in an index that is returned to a central processor in the form of a byte. The central processor analyzes the byte and earmarks for complete content extraction only those web pages that have enough business content.
|
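
The one-byte index described in the abstract can be illustrated with a minimal sketch. The abstract does not say what each bit of the byte encodes or how the central processor decides that a page has "enough business content", so the bit layout, the regular expressions, the function names `summarize_page` and `needs_full_extraction`, and the `min_signals` threshold below are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical bit layout: each bit flags one kind of business signal.
# The abstract does not define these; they are assumptions for this sketch.
SIGNALS = {
    0: re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"),   # phone number
    1: re.compile(r"\b\d{5}(?:-\d{4})?\b"),                # postal code
    2: re.compile(r"\b(?:Inc|LLC|Ltd|Corp)\b\.?", re.I),   # company suffix
    3: re.compile(r"\b(?:contact us|about us)\b", re.I),   # business-style page links
    4: re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),             # email address
}


def summarize_page(text: str) -> int:
    """Distributed-device side: reduce a page to a single byte (0..255)."""
    index = 0
    for bit, pattern in SIGNALS.items():
        if pattern.search(text):
            index |= 1 << bit  # set the bit for each signal found
    return index


def needs_full_extraction(index_byte: int, min_signals: int = 3) -> bool:
    """Central-processor side: earmark the page if enough bits are set.

    The threshold rule is an assumption; the abstract only says the
    byte is analyzed for sufficient business content.
    """
    return bin(index_byte & 0xFF).count("1") >= min_signals


if __name__ == "__main__":
    page = "Acme Widgets Inc. Contact us at (555) 123-4567 or sales@acme.example"
    byte = summarize_page(page)
    print(byte, needs_full_extraction(byte))
```

Under these assumptions only the single byte crosses the network for each page, and the full page content is fetched and parsed only when `needs_full_extraction` returns true.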