Title of invention |
FINDING MULTIPLE FIELD GROUPINGS IN SEMI-STRUCTURED DOCUMENTS |
Abstract |
A method is provided for parsing a semi-structured document having a plurality of document lines on which a series of items are listed, the listing of each item spanning one or more document lines. The method includes: obtaining a plurality of candidate records, each candidate record spanning one or more lines of the document; defining a term representing an optimal cost of selecting a number n of candidate records to span the document lines up to a given ending document line i; efficiently evaluating the term over a first range of values for n and a second range of values for i; and selecting a subset of the plurality of candidate records as a global optimal parse of the document, wherein the subset selected is based on the evaluation of the defined term. |
Publication number |
US2014281938(A1) |
Publication date |
2014.09.18 |
Application number |
US201313799289 |
Filing date |
2013.03.13 |
Applicant |
PALO ALTO RESEARCH CENTER INCORPORATED |
Inventor |
Pavlopoulou Christina |
Classification |
G06F17/24 |
Main classification |
G06F17/24 |
Agency |
|
Agent |
|
Principal claim |
1. A method for parsing a semi-structured document having a plurality of document lines on which a series of items are listed, the listing of each item spanning one or more document lines, said method comprising:
obtaining a plurality of candidate records, each candidate record spanning one or more lines of the document; defining a term representing an optimal cost of selecting a number n of candidate records to span the document lines up to a given ending document line i; efficiently evaluating the term over a first range of values for n and a second range of values for i; and selecting a subset of the plurality of candidate records as a global optimal parse of the document, wherein the subset selected is based on the evaluation of the term. |
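The term described in the claim lends itself to a dynamic-programming table. The following is a minimal sketch under stated assumptions, not the patented implementation: it assumes each candidate record is a hypothetical tuple `(start_line, end_line, cost)` with 1-indexed inclusive lines, that the selected records must tile the document without gaps or overlaps, and that the final parse is the one with the lowest total cost over all values of n. None of these details is specified in the record above. Under those assumptions, the term C(n, i) is the minimum, over candidates ending at line i, of C(n-1, start-1) plus the candidate's cost, with C(0, 0) = 0.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical representation (an assumption, not from the patent record):
# (start_line, end_line, cost), lines 1-indexed and inclusive.
Candidate = Tuple[int, int, float]


def best_parse(
    num_lines: int,
    candidates: List[Candidate],
    max_records: Optional[int] = None,
) -> Tuple[float, List[Candidate]]:
    """Return (total_cost, chosen_records) for the cheapest tiling of lines 1..num_lines."""
    if max_records is None:
        max_records = num_lines  # at most one record per line

    # Group valid candidates by their ending line; drop out-of-range spans.
    by_end: Dict[int, List[Candidate]] = {}
    for cand in candidates:
        start, end, _ = cand
        if 1 <= start <= end <= num_lines:
            by_end.setdefault(end, []).append(cand)

    inf = math.inf
    # cost[n][i]: cheapest way to span lines 1..i with exactly n records.
    cost = [[inf] * (num_lines + 1) for _ in range(max_records + 1)]
    back: List[List[Optional[Candidate]]] = [
        [None] * (num_lines + 1) for _ in range(max_records + 1)
    ]
    cost[0][0] = 0.0

    # Evaluate the term over the range of n and the range of ending lines i.
    for n in range(1, max_records + 1):
        for i in range(1, num_lines + 1):
            for cand in by_end.get(i, []):
                start, _, c = cand
                total = cost[n - 1][start - 1] + c
                if total < cost[n][i]:
                    cost[n][i] = total
                    back[n][i] = cand

    # Assumed selection rule: take the n whose full-document cost is lowest.
    best_n = min(range(max_records + 1), key=lambda n: cost[n][num_lines])
    if cost[best_n][num_lines] == inf:
        return inf, []  # no set of candidates tiles the whole document

    # Walk the back-pointers from the last line to recover the chosen records.
    chosen: List[Candidate] = []
    n, i = best_n, num_lines
    while n > 0:
        cand = back[n][i]
        assert cand is not None
        chosen.append(cand)
        i = cand[0] - 1
        n -= 1
    chosen.reverse()
    return cost[best_n][num_lines], chosen
```

For example, with four document lines and candidates `(1, 2, 1.0)`, `(3, 4, 1.0)`, `(1, 4, 3.0)`, `(1, 1, 0.5)`, and `(2, 4, 2.0)`, the sketch selects the two records `(1, 2, 1.0)` and `(3, 4, 1.0)` at a total cost of 2.0. The table has (max_records + 1) × (num_lines + 1) entries and each candidate is examined once per value of n, so the work grows as the product of the range of n and the number of candidates.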
Address |
Palo Alto CA US |