Title of invention: HIGH PRECISION MULTI ENTITY EXTRACTION
Abstract: Techniques for high precision multi entity extraction are provided. A wrapper that represents a generalized structure of a set of training web pages is accessed. The wrapper includes one or more annotations that indicate a set of attributes that are included in each of a plurality of records. Record boundaries are determined based on nodes included in the wrapper, where the record boundaries delimit the plurality of records within any training page of the set of training web pages. The wrapper is modified to include one or more boundary nodes, where the one or more boundary nodes indicate the record boundaries of the plurality of records within the set of training web pages. Multiple records are extracted from a web page, where extracting the multiple records comprises detecting record completions based at least on the wrapper and on a document object model (DOM) representation of the web page.
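The extraction flow summarized in the abstract can be illustrated with a minimal sketch. This is not the claimed method: the `WrapperNode` type, the `extract_records` function, the matching on tag plus `class`, and the sample page are all hypothetical, chosen only to show how a wrapper that carries attribute annotations and boundary nodes might be walked against a DOM tree to detect where one record ends and the next begins.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class WrapperNode:
    """Hypothetical wrapper node: a pattern plus optional annotations."""
    tag: str                          # element tag the node matches
    css_class: Optional[str]          # required class value (None = any)
    attribute: Optional[str] = None   # annotation: record attribute name
    is_boundary: bool = False         # boundary node: a new record starts here


def matches(node: WrapperNode, element: ET.Element) -> bool:
    """Assumed matching rule: same tag and, if given, same class value."""
    if element.tag != node.tag:
        return False
    return node.css_class is None or element.get("class") == node.css_class


def extract_records(wrapper, page_root):
    """Walk the DOM in document order and split it into records."""
    records, current = [], {}
    for element in page_root.iter():
        for node in wrapper:
            if not matches(node, element):
                continue
            if node.is_boundary and current:
                # Reaching the next boundary completes the previous record.
                records.append(current)
                current = {}
            if node.attribute is not None:
                current[node.attribute] = (element.text or "").strip()
    if current:
        # The end of the document completes the last record.
        records.append(current)
    return records


# Hypothetical page and wrapper, for illustration only.
PAGE = """
<html><body><ul>
  <li class="item"><span class="name">Lamp</span><span class="price">$20</span></li>
  <li class="item"><span class="name">Desk</span><span class="price">$150</span></li>
</ul></body></html>
"""

WRAPPER = [
    WrapperNode("li", "item", is_boundary=True),
    WrapperNode("span", "name", attribute="name"),
    WrapperNode("span", "price", attribute="price"),
]

RECORDS = extract_records(WRAPPER, ET.fromstring(PAGE.strip()))
```

In this sketch a record is treated as complete when the next boundary node is reached or the document ends. A real wrapper would be induced from the training pages and would match on richer structure than tag and class.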
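Publication number: US2010185684(A1) Publication date: 2010.07.22
Application number: US20090351676 Filing date: 2009.01.09
Applicant: MADAAN AMIT;TIWARI CHARU Inventor: MADAAN AMIT;TIWARI CHARU
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Independent claim:
Address: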