Title: System and method for automatic wrapper induction using target strings
Abstract: Wrappers are induced for multiple domains where, for a given target string having relatively universal distribution across domains of interest, a first wrapper may be defined and trained for a particular domain. Target strings extracted from that domain may be used to search for documents in other domains. New wrappers may be learned for other domains also containing the target strings. Further, a first wrapper may be learned for a given domain using a limited amount of training data from that single domain. The first wrapper is then applied to all pages in the domain to extract the relevant information. A few of the new words extracted are then searched against the document collection to obtain a list of domains that contain the extracted words. The updated information may be used as training data to learn new wrappers on those domains.
Publication number: US9223871(B2); Publication date: 2015.12.29
Application number: US201313837961; Filing date: 2013.03.15
Applicant: Homer TLC, Inc.; Inventor: Mallapragada Naga Surya Siva Kalyana Pavan Kumar
Classification: G06F17/30; Main classification: G06F17/30
Agency: Norton Rose Fulbright US LLP; Agent: Norton Rose Fulbright US LLP
Independent claim: 1. A method for automatically constructing wrappers across a plurality of domains, executed by a processor, the method comprising: creating a first wrapper in a first domain in the plurality of domains using a first set of training data, the first set of training data created from a subset of documents in the first domain; applying the first wrapper to each page in the first domain to extract additional training data; combining the first set of training data with the additional training data to generate a first target string; searching domains in the plurality of domains other than the first domain to determine if any domain in the plurality of domains other than the first domain comprises at least one document having at least one portion of the first target string; and creating a second wrapper for at least one of the other domains in the plurality of domains from the first target string.
Address: Wilmington DE US
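
Illustrative sketch (not part of the patent record): the steps recited in claim 1 can be pictured with a toy example. The record does not specify how a wrapper is represented or learned, so the code below assumes a very simple left/right-delimiter wrapper expressed as a regular expression. The function names (`learn_wrapper`, `apply_wrapper`, `induce_across_domains`), the domain names, and the sample pages are all hypothetical and chosen only for illustration.

```python
import os
import re


def learn_wrapper(pages, targets):
    """Learn a toy delimiter wrapper (assumption: not the patented learner).

    The wrapper is the longest text shared immediately before every known
    target occurrence, plus the longest text shared immediately after it.
    """
    lefts, rights = [], []
    for page in pages:
        for target in targets:
            for m in re.finditer(re.escape(target), page):
                lefts.append(page[:m.start()])
                rights.append(page[m.end():])
    if not lefts:
        return None
    # Longest common suffix of the left contexts (via reversed strings).
    left = os.path.commonprefix([s[::-1] for s in lefts])[::-1]
    # Longest common prefix of the right contexts.
    right = os.path.commonprefix(rights)
    if not left or not right:
        return None
    return re.compile(re.escape(left) + r"(.+?)" + re.escape(right))


def apply_wrapper(wrapper, pages):
    """Extract every string the wrapper matches on the given pages."""
    found = set()
    for page in pages:
        found.update(wrapper.findall(page))
    return found


def induce_across_domains(corpus, first_domain, seed_pages, seed_targets):
    """Follow the claimed steps: seed wrapper, extract, combine, search, relearn."""
    wrappers = {}
    first = learn_wrapper(seed_pages, seed_targets)
    if first is None:
        return wrappers, set(seed_targets)
    wrappers[first_domain] = first
    # Combine the seed training data with what the first wrapper extracts.
    targets = set(seed_targets) | apply_wrapper(first, corpus[first_domain])
    for domain, pages in corpus.items():
        if domain == first_domain:
            continue
        # Search the other domains for documents containing a target string.
        if any(t in page for page in pages for t in targets):
            wrapper = learn_wrapper(pages, targets)
            if wrapper is not None:
                wrappers[domain] = wrapper
    return wrappers, targets


# Hypothetical corpus: domain name -> list of page sources.
corpus = {
    "a.example": [
        "<html><b>Brand:</b> <span class='v'>Acme</span></html>",
        "<html><b>Brand:</b> <span class='v'>Bolt</span></html>",
        "<html><b>Brand:</b> <span class='v'>Crux</span></html>",
    ],
    "b.example": [
        "<td>maker</td><td>Acme</td><td>9.99</td>",
        "<td>maker</td><td>Crux</td><td>4.50</td>",
        "<td>maker</td><td>Dyna</td><td>7.25</td>",
    ],
    "c.example": ["<p>No product data on this page.</p>"],
}

# Seed training data comes from a subset (two pages) of the first domain.
wrappers, targets = induce_across_domains(
    corpus, "a.example", corpus["a.example"][:2], {"Acme", "Bolt"}
)
new_in_b = apply_wrapper(wrappers["b.example"], corpus["b.example"])
```

In this sketch the wrapper learned on the second domain extracts a value ("Dyna") that never appeared in the first domain, which is the point of reusing target strings as training data. A real wrapper learner would need to handle noisy markup and inconsistent delimiters, which this toy version does not attempt.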