Abstract |
This invention relates to a method and system that use metadata to facilitate the extraction and enable the management of information contained in electronic documents. This metadata describes the content of the documents based on the composition of their structure and the manner in which the information in question is arranged in that structure. In addition to providing a description that makes it possible to automatically manage the models used for extraction, this metadata also defines a logical schema for managing the information extracted. The method begins with a preparation step (10) in which said metadata (1) and document samples (2) are collected and stored in the system. The training step (20) is then performed, in which the system uses said metadata (1) and respective document samples (2) to build and train the models (3) used for extraction. Finally, in the extraction step (30), the system receives a collection of electronic documents (4) and uses the trained models (3) to extract the information of interest. This information, once extracted, is stored (5) by the system in accordance with the logical schema defined using the metadata, enabling it to be managed immediately. The system enables the method to be applied even if the information is dispersed throughout large documents. In one preferred embodiment, the metadata is defined using an XSD (XML Schema Definition), and the document samples are labelled in an XML format, allowing them to be validated by that XSD. |
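
The preferred embodiment, in which an XSD serves as the metadata and XML-labelled samples are checked against it, can be illustrated with a minimal sketch. Everything in it is assumed for illustration and does not come from the invention itself: the invoice schema, the sample text, and the helper names `declared_elements`, `undeclared_labels` and `extract_record` are hypothetical. The check shown is a deliberate simplification that only compares label names with the element names declared in the schema; it is not full XSD validation, and it does not cover the training of extraction models.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by XML Schema elements in ElementTree's notation.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical metadata: a tiny XSD describing an invoice document.
XSD_TEXT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="invoice">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="number" type="xs:string"/>
        <xs:element name="total" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Hypothetical labelled sample: document text with the fields of interest tagged.
SAMPLE_TEXT = (
    "<invoice>Invoice no. <number>A-17</number>, "
    "amount due <total>120.50</total>.</invoice>"
)


def declared_elements(xsd_text):
    """Return the set of element names declared anywhere in the XSD."""
    root = ET.fromstring(xsd_text)
    return {el.get("name") for el in root.iter(XS + "element") if el.get("name")}


def undeclared_labels(sample_text, allowed):
    """Return labels used in the sample that the metadata does not declare."""
    root = ET.fromstring(sample_text)
    return sorted({el.tag for el in root.iter()} - allowed)


def extract_record(sample_text, allowed):
    """Collect the text of each labelled leaf field, keyed by its schema name."""
    root = ET.fromstring(sample_text)
    return {
        el.tag: (el.text or "").strip()
        for el in root.iter()
        if el.tag in allowed and len(el) == 0  # leaf elements only
    }


allowed = declared_elements(XSD_TEXT)
problems = undeclared_labels(SAMPLE_TEXT, allowed)
record = extract_record(SAMPLE_TEXT, allowed)
```

In this sketch the record is keyed by the element names declared in the schema, which mirrors the idea that the same metadata both describes the documents and defines the logical schema under which extracted information is stored.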