Abstract |
In one embodiment, the present invention includes a method for conditioning semi-structured text to enhance its use as a data source for an analytical processing tool. In general, the method involves analyzing the semi-structured text to identify portions of text (referred to herein as sub-documents) that exhibit a repetitive characteristic. Next, the text of each identified sub-document is integrated, for example, by filtering the text for relevant words, removing stop words, stemming certain words, adding or replacing certain words with synonyms, modifying the spelling of certain words, and/or resolving certain homonyms based on a document class assigned to the semi-structured text. Once integrated, the sub-documents are mapped to existing structures defined for the document class and/or sub-document type. Finally, the mapped textual elements are used to generate an index or, alternatively, are inserted directly into a structured data repository, such as a database.
|
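The steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the blank-line sub-document boundary, the "Field: value" line layout, the stop-word and synonym lists, the crude suffix-stripping stemmer, and every function name are assumptions made only for illustration. Spelling correction and class-based homonym resolution are left out for brevity.

```python
import re
from collections import defaultdict

# Hypothetical word lists; a real system would load these per document class.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "for", "with"}
SYNONYMS = {"automobile": "car", "auto": "car"}
FIELD_PATTERN = re.compile(r"^(\w+):\s*(.*)$")


def split_sub_documents(text):
    """Assume each blank-line-separated block is one repeating sub-document."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block for block in blocks if block.strip()]


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def condition(value):
    """Lowercase and tokenize, drop stop words, apply synonyms, then stem."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [stem(SYNONYMS.get(token, token)) for token in kept]


def map_to_structure(sub_document):
    """Map 'Field: value' lines onto a record keyed by lowercase field name."""
    record = {}
    for line in sub_document.splitlines():
        match = FIELD_PATTERN.match(line.strip())
        if match:
            record[match.group(1).lower()] = condition(match.group(2))
    return record


def build_index(records):
    """Build an inverted index from (field, term) to sub-document ids."""
    index = defaultdict(set)
    for doc_id, record in enumerate(records):
        for field, terms in record.items():
            for term in terms:
                index[(field, term)].add(doc_id)
    return index


sample = """Title: Used automobile for sale
Body: Selling the cars with new tires

Title: Auto parts
Body: Brakes and engines for trucks
"""

records = [map_to_structure(s) for s in split_sub_documents(sample)]
index = build_index(records)
```

In this sketch, "automobile" and "Auto" both end up under the term "car" in the title field, so a query on that single term reaches both sub-documents. The records could equally be written as rows of a database table instead of being indexed, which corresponds to the alternative the abstract mentions.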