摘要 |
Techniques for high precision multi entity extraction are provided. A wrapper that represents a generalized structure of a set of training web pages is accessed. The wrapper includes one or more annotations that indicate a set of attributes that are included in each of a plurality of records. Record boundaries are determined based on nodes included in the wrapper, where the record boundaries delimit the plurality of records within any training page of the set of training web pages. The wrapper is modified to include one or more boundary nodes, where the one or more boundary nodes indicate the record boundaries of the plurality of records within the set of training web pages. Multiple records are extracted from a web page, where extracting the multiple records comprises detecting record completions based at least on the wrapper and on a document object model (DOM) representation of the web page.
|