Independent claim |
1. An information extraction system that extracts phrases from documents and outputs the extracted phrases, comprising:
a processing device with an input unit that receives an input word list including a plurality of phrases; a storage unit that stores a plurality of documents including documents having formats different from each other; a pattern determining unit, of the processing device, that selects one document from the plurality of documents stored in the storage unit, finds, for each of the documents, a pattern that separates a phrase included in the input word list from other words in the selected document, and stores in the storage unit the found pattern in association with the selected document; a phrase candidate extracting unit, of the processing device, that extracts a character string separated by a pattern stored in the storage unit from the document associated with that pattern and determines the character string to be a phrase candidate; and a phrase selecting unit, of the processing device, that selects, from among the phrase candidates extracted by the phrase candidate extracting unit or partial character strings included in the phrase candidates, a phrase candidate or a partial character string that satisfies a predetermined condition as a target phrase to be outputted, wherein the pattern determining unit finds the pattern by i) obtaining character strings each consisting of a predetermined number of characters located at least at one of immediately before and immediately after the phrases included in the input word list, and ii) extracting, as the pattern, portions common to at least two of the obtained character strings. |
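
Steps i) and ii) of the pattern determining unit, together with the phrase candidate extracting unit, can be illustrated with a minimal Python sketch. It is only one possible reading of the claim, not the claimed implementation. The function names (`contexts`, `common_portion`, `find_pattern`, `extract_candidates`), the default context length `n=4`, and the choice to treat a "common portion" as the longest suffix (before a phrase) or prefix (after a phrase) shared by at least two context strings are all assumptions made for illustration. The phrase selecting unit and its predetermined condition are not modeled.

```python
import re
from collections import Counter


def contexts(document, phrases, n):
    """Step i): collect the n characters immediately before and after each phrase occurrence."""
    before, after = [], []
    for phrase in phrases:
        for m in re.finditer(re.escape(phrase), document):
            before.append(document[max(0, m.start() - n):m.start()])
            after.append(document[m.end():m.end() + n])
    return before, after


def common_portion(strings, from_end):
    """Step ii), simplified: longest affix shared by at least two strings ('' if none).

    from_end=True compares suffixes (text just before a phrase);
    from_end=False compares prefixes (text just after a phrase).
    """
    counts = Counter()
    for s in strings:
        for k in range(1, len(s) + 1):
            counts[s[-k:] if from_end else s[:k]] += 1
    shared = [affix for affix, count in counts.items() if count >= 2]
    return max(shared, key=len, default="")


def find_pattern(document, phrases, n=4):
    """Return a (left, right) delimiter pair for one document (assumed pattern shape)."""
    before, after = contexts(document, phrases, n)
    return common_portion(before, True), common_portion(after, False)


def extract_candidates(document, pattern):
    """Extract every character string enclosed by the stored pattern as a phrase candidate."""
    left, right = pattern
    if not left or not right:
        return []
    regex = re.escape(left) + r"(.+?)" + re.escape(right)
    return [m.group(1) for m in re.finditer(regex, document)]


# Hypothetical document and input word list, for illustration only.
document = "<li>apple</li><li>banana</li><li>cherry</li>"
word_list = ["apple", "banana"]

pattern = find_pattern(document, word_list)
candidates = extract_candidates(document, pattern)
```

In this hypothetical example the two known phrases share the same surrounding characters, so the derived pattern also delimits `cherry`, which is not in the input word list. Because the pattern is found per document, documents in different formats would each get their own delimiter pair.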