Title of invention: Information extraction system, information extraction method, information extraction program, and information service system
Abstract: According to the present invention, phrases of the same kind can be extracted from a plurality of documents having various formats. A storage device stores a plurality of documents that have various formats. A pattern candidate creating unit receives a list of input words that are selected as samples from among the phrases to be included in a dictionary. The pattern candidate creating unit selects one document, determines the character strings immediately before and after the input words in the selected document as candidates of patterns, and stores these forward and backward character strings as a pattern candidate. The pattern candidate creating unit executes the above processes for each of the documents. A phrase candidate creating unit extracts phrases interposed between patterns included in the pattern candidate as candidates of phrases to be output, and stores the extracted phrases as a phrase candidate. A phrase selecting unit outputs, to an output device, each candidate phrase in the phrase candidate that satisfies a predetermined condition as an output word.
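The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the fixed context `width`, and the use of a document-count threshold (`min_docs`) as the "predetermined condition" are all assumptions made for the example.

```python
import re
from collections import Counter


def pattern_candidates(doc, seeds, width=3):
    """Collect (before, after) context pairs around each seed word in one document."""
    pairs = set()
    for seed in seeds:
        for m in re.finditer(re.escape(seed), doc):
            before = doc[max(0, m.start() - width):m.start()]
            after = doc[m.end():m.end() + width]
            if before and after:  # skip seeds touching the document boundary
                pairs.add((before, after))
    return pairs


def phrase_candidates(doc, pairs):
    """Extract every string interposed between a (before, after) pattern pair."""
    found = set()
    for before, after in pairs:
        regex = re.escape(before) + r"(.+?)" + re.escape(after)
        found.update(re.findall(regex, doc))
    return found


def extract(docs, seeds, width=3, min_docs=1):
    """Learn patterns per document, then keep phrases seen in at least min_docs documents.

    The min_docs threshold is a hypothetical stand-in for the
    'predetermined condition' mentioned in the abstract.
    """
    counts = Counter()
    for doc in docs:
        pairs = pattern_candidates(doc, seeds, width)
        counts.update(phrase_candidates(doc, pairs))  # one vote per document
    return {p for p, n in counts.items() if n >= min_docs and p not in seeds}


docs = [
    "<li>Tokyo</li><li>Osaka</li><li>Kyoto</li>",
    "City: Tokyo; City: Nagoya; City: Osaka;",
]
new_words = extract(docs, ["Tokyo"])
```

Because patterns are learned separately for each document, the HTML list and the plain-text list each contribute their own delimiters, which is the point of handling documents with differing formats.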
Publication number: US8886661(B2) Publication date: 2014.11.11
Application number: US200712294143 Filing date: 2007.03.23
Applicant: NEC Corporation Inventors: Mizuguchi Hironori;Tsuchida Masaaki;Kusui Dai;Kawai Hideki
Classification: G06F17/30;G06F17/28;G06F17/27;G06Q30/02;G06Q30/06 Main classification: G06F17/30
Agency: Young & Thompson Agent: Young & Thompson
Principal claim: 1. An information extraction system that extracts phrases in documents from the documents and outputs the extracted phrases, comprising: a processing device with an input unit that receives an input word list including a plurality of phrases; a storage unit that stores a plurality of documents including documents having formats different from each other; a pattern determining unit, of the processing device, that selects one document from the plurality of documents stored in the storage unit, finds a pattern, which separates a phrase included in the input word list from other words in the selected document, as a pattern, for each of the documents, and stores in the storage unit the found pattern associated with the selected document; a phrase candidate extracting unit, of the processing device, that extracts a character string separated by a pattern stored in the storage unit from a document associated with the pattern and determines the character string as a phrase candidate; and a phrase selecting unit, of the processing device, that, among phrase candidates extracted by the phrase candidate extracting unit or partial character strings included in the phrase candidates, selects as a target phrase to be outputted a phrase candidate or a partial character string that satisfies a predetermined condition, wherein the pattern determining unit finds the pattern by i) obtaining character strings each consisting of a predetermined number of characters located at least at one of immediately before and after the phrases included in the input word list, and ii) extracting, as the pattern, portions common to at least two of the character strings from the obtained character strings.
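Steps i) and ii) of the claim can be illustrated with a short sketch. It assumes one particular reading of "portions common to at least two of the character strings": the shared suffix of two preceding contexts and the shared prefix of two following contexts, that is, the part adjacent to the phrase. The claim itself does not fix this interpretation, and the helper names and the `width` parameter are hypothetical.

```python
from itertools import combinations
from os.path import commonprefix  # character-wise common prefix of strings


def contexts(doc, seeds, width):
    """Step i): fixed-width strings immediately before and after each seed occurrence."""
    befores, afters = [], []
    for seed in seeds:
        start = doc.find(seed)
        while start != -1:
            end = start + len(seed)
            befores.append(doc[max(0, start - width):start])
            afters.append(doc[end:end + width])
            start = doc.find(seed, end)
    return befores, afters


def common_prefixes(strings):
    """Step ii) for following contexts: prefixes shared by at least two strings."""
    return {commonprefix([a, b]) for a, b in combinations(strings, 2)} - {""}


def common_suffixes(strings):
    """Step ii) for preceding contexts: suffixes shared by at least two strings."""
    reversed_strings = [s[::-1] for s in strings]
    return {p[::-1] for p in common_prefixes(reversed_strings)}


doc = "<td>Tokyo</td><td class=x>Osaka</td>"
befores, afters = contexts(doc, ["Tokyo", "Osaka"], width=4)
left_patterns = common_suffixes(befores)   # shared part just before the phrases
right_patterns = common_prefixes(afters)   # shared part just after the phrases
```

In this example the two preceding contexts differ (`<td>` versus `s=x>`), so only their shared tail survives as the left pattern, while the following contexts agree in full.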
Address: Tokyo JP