Title of invention ADAPTIVE DOCUMENT SAMPLING FOR INFORMATION EXTRACTION
Abstract A method and apparatus for improved sampling of documents for training sets input to information extraction systems are provided, improving the recall and robustness of wrapper extraction. A passive sampling technique provides a list of documents to present for human annotation, ordered by the representativeness of each document based on structural and content statistics. Thus, the document that has the most interesting attributes and is most representative of the cluster of structurally similar documents to which it belongs is presented for annotation first. The problem is mapped to the classical ‘Set-Cover’ problem and solved using a greedy approach. An active sampling technique refines and reorders the sample list produced by the passive sampling technique after initial annotations, based on the human annotations, the spatial boundaries of the documents, and structural and content statistics. The proposed techniques work at the site level and perform page-level structural analysis using XPath-term frequency, XPath-document frequency, and XPath-importance.
Publication number US2010228738(A1) Publication date 2010.09.09
Application number US20090398162 Filing date 2009.03.04
Applicant MEHTA RUPESH R;SENGAMEDU SRINIVASAN H Inventor MEHTA RUPESH R.;SENGAMEDU SRINIVASAN H.
Classification G06F17/30;G06F17/21 Main classification G06F17/30
Agency Agent
Main claim
Address
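
The abstract states that passive sampling is mapped to the classical Set-Cover problem and solved greedily, but it does not say which elements are being covered. The sketch below is only an illustration under an assumption: each page is represented by the set of XPaths it contains, and the greedy rule repeatedly picks the page that covers the most XPaths not yet covered. The function name `greedy_sample_order` and the sample data are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Set


def greedy_sample_order(doc_features: Dict[str, Set[str]]) -> List[str]:
    """Greedy set cover: order documents by how many new features each adds.

    doc_features maps a document id to the set of features (assumed here to
    be XPaths) found in that document. This is an illustrative assumption,
    not the patent's stated representation.
    """
    # Universe of features that still need to be covered.
    uncovered: Set[str] = set().union(*doc_features.values())
    remaining = dict(doc_features)
    order: List[str] = []

    while uncovered and remaining:
        # Pick the document covering the most uncovered features.
        # Iterating over sorted ids makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda d: len(remaining[d] & uncovered))
        gain = remaining[best] & uncovered
        if not gain:
            break  # no remaining document adds anything new
        order.append(best)
        uncovered -= gain
        del remaining[best]

    return order


# Hypothetical pages from one site, each described by its XPaths.
pages = {
    "page_a": {"/html/body/div/h1", "/html/body/div/span"},
    "page_b": {"/html/body/div/h1", "/html/body/div/span", "/html/body/table/tr/td"},
    "page_c": {"/html/body/ul/li"},
}
annotation_order = greedy_sample_order(pages)
```

With this sample data, `page_b` is chosen first because it covers three of the four XPaths, `page_c` follows because it adds the one remaining XPath, and `page_a` is never selected because it adds nothing new. A weighted variant (for example, scoring XPaths by the frequency or importance statistics the abstract names) would change only the `key` function, though the exact weighting used in the patent is not given here.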