Title of invention METHOD AND SYSTEM FOR EXTRACTING AND MANAGING INFORMATION CONTAINED IN ELECTRONIC DOCUMENTS
Abstract This invention relates to a method and system that use metadata to facilitate the extraction and enable the management of information contained in electronic documents. This metadata describes the content of the documents based on the composition of their structure and the manner in which the information in question is arranged in that structure. In addition to providing a description that makes it possible to automatically manage the models used for extraction, this metadata also defines a logical schema for managing the extracted information. The method begins with a preparation step (10), in which said metadata (1) and document samples (2) are collected and stored in the system. A training step (20) follows, in which the system uses said metadata (1) and the respective document samples (2) to build and train the models (3) used for extraction. Finally, in the extraction step (30), the system receives a collection of electronic documents (4) and uses the trained models (3) to extract the information of interest. Once extracted, this information is stored (5) by the system in accordance with the logical schema defined by the metadata, so that it can be managed immediately. The system allows the method to be applied even when the information is dispersed throughout large documents. In one preferred embodiment, the metadata is defined using an XSD (XML Schema Definition), and the document samples are labelled in an XML format, allowing them to be validated against that XSD.
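The three steps named in the abstract (preparation 10, training 20, extraction 30) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and does not come from the patent: the `invoice` schema, the sample text, the function names `prepare`, `train` and `extract`, and above all the "model", which here is only a regular expression learned from the text preceding each labelled field. The check in `prepare` merely confirms that each label is declared in the XSD; it is not full XSD validation.

```python
import re
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical metadata (1): an XSD naming the fields of interest.
METADATA_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="invoice">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="number" type="xs:string"/>
        <xs:element name="total" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Hypothetical document sample (2), labelled with the schema's element names.
SAMPLE_XML = ("<invoice>Invoice no. <number>A-17</number> "
              "Amount due: <total>120.50</total></invoice>")


def prepare(xsd_text, samples):
    """Step 10: collect metadata and samples, rejecting undeclared labels."""
    root_decl = ET.fromstring(xsd_text).find(XS + "element")
    fields = [e.get("name") for e in root_decl.iter(XS + "element")
              if e is not root_decl]
    docs = [ET.fromstring(s) for s in samples]
    for doc in docs:
        for child in doc:
            if child.tag not in fields:
                raise ValueError(f"label {child.tag!r} is not in the schema")
    return fields, docs


def train(docs):
    """Step 20: build one toy model per field from the text before its label."""
    models = {}
    for doc in docs:
        before = doc.text or ""
        for child in doc:
            cue = before.strip()
            if cue:
                models[child.tag] = re.compile(re.escape(cue) + r"\s*(\S+)")
            before = child.tail or ""  # text leading up to the next label
    return models


def extract(fields, models, documents):
    """Step 30: apply the models and store results keyed by schema field."""
    records = []
    for text in documents:
        record = {}
        for field in fields:
            match = models[field].search(text) if field in models else None
            record[field] = match.group(1) if match else None
        records.append(record)
    return records


fields, docs = prepare(METADATA_XSD, [SAMPLE_XML])
models = train(docs)
new_document = ("Dear customer,\nInvoice no. B-42 was issued on Monday.\n"
                "Several unrelated paragraphs may appear here.\n"
                "Amount due: 99.00\n")
records = extract(fields, models, [new_document])
print(records)
```

Because each record is keyed by the field names read from the schema, the extracted values already follow the logical layout that the metadata defines, which is the property the abstract relies on for immediate management. A real system would replace the regular expressions with trained extraction models and the label check with a proper XSD validator.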
Publication number WO2011100814(A1) Publication date 2011.08.25
Application number WO2011BR00047 Filing date 2011.02.16
Applicant BERTOLI MARTINS, ALEXANDRE JONATAN Inventor BERTOLI MARTINS, ALEXANDRE JONATAN
Classification G06F17/30 Main classification G06F17/30
Agency Agent
Main claim
Address