Title of invention: METHOD AND SYSTEM FOR AUTOMATICALLY EXTRACTING DATA FROM WEB SITES
Abstract: In accordance with an embodiment, data may be automatically extracted from semi-structured web sites. Unsupervised learning may be used to analyze web sites and discover their structure. One method utilizes a set of heterogeneous "experts," each expert being capable of identifying certain types of generic structure. Each expert represents its discoveries as "hints." Based on these hints, the system may cluster the pages and text segments and identify semi-structured data that can be extracted. To identify a good clustering, a probabilistic model of the hint-generation process may be used.
Publication number: CA2614774(A1) Publication date: 2007.01.25
Application number: CA20062614774 Filing date: 2006.07.14
Applicant: FETCH TECHNOLOGIES, INC Inventors: GAZEN, BORA C.; MINTON, STEVEN N.
Classification: G06F7/00 Main classification: G06F7/00
Agency: Agent:
Principal claim:
Address:
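
The expert-and-hint idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the two experts (`url_prefix_expert`, `title_token_expert`), the representation of a hint as a set of page URLs that an expert believes belong together, and the `min_votes` parameter are all hypothetical. In particular, the abstract mentions a probabilistic model of hint generation for choosing a good clustering; this sketch swaps in a much simpler agreement threshold with union-find merging.

```python
from collections import defaultdict
from itertools import combinations


def url_prefix_expert(pages):
    """Hypothetical expert: pages under the same URL directory belong together."""
    groups = defaultdict(set)
    for url in pages:
        groups[url.rsplit("/", 1)[0]].add(url)
    # A hint is only useful if it relates at least two pages.
    return [g for g in groups.values() if len(g) > 1]


def title_token_expert(pages):
    """Hypothetical expert: pages whose text starts with the same word belong together."""
    groups = defaultdict(set)
    for url, text in pages.items():
        tokens = text.split()
        if tokens:
            groups[tokens[0]].add(url)
    return [g for g in groups.values() if len(g) > 1]


def cluster_pages(pages, experts, min_votes=2):
    """Merge pages that at least `min_votes` experts hinted belong together.

    This is a simple stand-in for the probabilistic model named in the
    abstract, not an implementation of it.
    """
    # Count, for every pair of pages, how many hints place them together.
    votes = defaultdict(int)
    for expert in experts:
        for hint in expert(pages):
            for a, b in combinations(sorted(hint), 2):
                votes[(a, b)] += 1

    # Union-find over pages; merge pairs with enough agreement.
    parent = {url: url for url in pages}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for (a, b), n in votes.items():
        if n >= min_votes:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for url in pages:
        clusters[find(url)].add(url)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    sample = {
        "http://example.com/item/1": "Product Widget A",
        "http://example.com/item/2": "Product Widget B",
        "http://example.com/about/us": "About our company",
        "http://example.com/about/jobs": "Careers at the company",
    }
    for cluster in cluster_pages(sample, [url_prefix_expert, title_token_expert]):
        print(cluster)
```

In the sample data, both experts agree that the two item pages belong together, so they are merged; only the URL expert relates the two "about" pages, so with `min_votes=2` they stay separate. Lowering `min_votes` to 1 would merge them as well, which shows how the threshold trades precision for recall in this simplified setting.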