Title of invention: METHOD AND APPARATUS FOR EXTRACTING STRUCTURED DATA FROM HTML PAGES
Abstract: A method and apparatus for extracting structured data from HTML pages, whereby an HTML file belonging to a pre-determined class of HTML files can be transformed into an instance tree (142). Besides the HTML file, the extraction procedure takes two other inputs: a set of constraints (134) and a structure template (140). The steps in the process are: parsing the HTML file, thereby creating a parse tree (126); annotating the parse tree, thereby creating an annotated parse tree (130); creating an array of nodes from the annotated parse tree using the set of constraints (134); and generating an instance tree (142) from the array of nodes using the structure template (140). The instance tree (142) encodes, in a form that other computer programs can use, all the relevant information in the HTML file as prescribed by the set of constraints (134), and makes the structure of this information explicit.
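The four steps named in the abstract can be sketched roughly as follows. This is only an illustrative reading, not the patented implementation: the abstract does not say what the annotations contain, how constraints are expressed, or what a structure template looks like. Here the annotations are assumed to be a tag path and a document-order index, the constraints are assumed to be predicates keyed by field name, and the template is assumed to be a root name plus an ordered field list. All names (`Node`, `TreeBuilder`, `annotate`, `select_nodes`, `build_instance_tree`, `extract`) and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    """One element of the parse tree (hypothetical representation)."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Step 1: build a parse tree from the HTML text."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def annotate(root):
    """Step 2 (assumed annotations): tag path and document-order index."""
    ordered = []

    def visit(node, path):
        node.path, node.order = path, len(ordered)
        ordered.append(node)
        for child in node.children:
            visit(child, path + "/" + child.tag)

    visit(root, root.tag)
    return ordered


def select_nodes(ordered, constraints):
    """Step 3: keep the nodes that satisfy a constraint, in document order."""
    return [(field, node.text)
            for node in ordered
            for field, accepts in constraints.items() if accepts(node)]


def build_instance_tree(array, template):
    """Step 4: group the node array into records shaped by the template."""
    root_name, fields = template
    records = []
    for field, value in array:
        # Assumption: the first template field marks the start of a record.
        if field == fields[0] or not records:
            records.append({})
        records[-1][field] = value
    return {root_name: records}


def extract(html, constraints, template):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return build_instance_tree(
        select_nodes(annotate(builder.root), constraints), template)


# Hypothetical inputs for one class of pages (a simple product list).
HTML = "<ul><li><b>Widget</b><i>9.99</i></li><li><b>Gadget</b><i>19.50</i></li></ul>"
CONSTRAINTS = {
    "name": lambda n: n.path.endswith("li/b"),
    "price": lambda n: n.path.endswith("li/i"),
}
TEMPLATE = ("catalog", ["name", "price"])

instance_tree = extract(HTML, CONSTRAINTS, TEMPLATE)
print(instance_tree)
```

In this sketch the constraints decide which nodes are relevant and the template decides how they nest, which mirrors the separation of inputs (134) and (140) described in the abstract. A different class of pages would need different constraints and possibly a different template, but the same four steps.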
Publication number: CA2422490(C) Publication date: 2010.10.12
Application number: CA20002422490 Filing date: 2000.09.08
Applicant: SEDGHI, ALI R. Inventor: SEDGHI, ALI R.
Classification: G06F15/00;G06F17/30 Main classification: G06F15/00
Agency: Agent:
Principal claim:
Address: