Title of Invention METHOD AND SYSTEM FOR AUTOMATICALLY EXTRACTING DATA FROM WEB SITES
Abstract In accordance with an embodiment, data may be automatically extracted from semi-structured web sites. Unsupervised learning may be used to analyze web sites and discover their structure. One method utilizes a set of heterogeneous “experts,” each expert being capable of identifying certain types of generic structure. Each expert represents its discoveries as “hints.” Based on these hints, the system may cluster the pages and text segments and identify semi-structured data that can be extracted. To identify a good clustering, a probabilistic model of the hint-generation process may be used.
Publication Number US2011282877(A1) Publication Date 2011.11.17
Application Number US201113191369 Filing Date 2011.07.26
Applicant GAZEN BORA C.;MINTON STEVEN N.;FETCH TECHNOLOGIES, INC. Inventor GAZEN BORA C.;MINTON STEVEN N.
Classification G06F17/30 Main Classification G06F17/30
Agency Agent
Main Claim
Address