Title of invention INFORMATION EXTRACTION METHOD, INFORMATION EXTRACTION DEVICE, AND INFORMATION EXTRACTION PROGRAM
Abstract PROBLEM TO BE SOLVED: To extract the body text from a structured document without depending on text-extraction rules. SOLUTION: A document set recording part 2 records the HTML files of the documents to be processed in a document set DB 3. A link source information extraction part 4 extracts the hyperlinks embedded in an HTML file acquired from the document set DB 3, together with the text information surrounding each link (link peripheral text information). A text extraction part 5 treats an HTML file acquired from the document set DB 3 as the HTML file of a link destination document, on condition that a hyperlink referring to that file has been extracted by the link source information extraction part 4. The text extraction part 5 then compares the character strings of the text information present in the HTML file of the link destination document with the character string of the link peripheral text information, and extracts a representative part of the link destination document as the body text. An output part 6 outputs the extracted body text. COPYRIGHT: (C)2013, JPO&INPIT
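The flow described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: it assumes the link's anchor text stands in for the link peripheral text information, that block-level HTML elements are the candidate parts of a destination document, and that similarity is a simple word-overlap (Jaccard) score. The names `LinkContextParser`, `TextBlockParser`, `overlap`, `extract_links`, and `extract_body` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate "parts" of a document.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "body"}


class LinkContextParser(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._buf = []

    def handle_data(self, data):
        if self._href is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._buf).strip()))
            self._href = None


class TextBlockParser(HTMLParser):
    """Split a document into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def overlap(a, b):
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def extract_links(html):
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    return parser.links


def extract_body(documents, target_url):
    """Return the block of documents[target_url] most similar to the text
    of links pointing at it, or None if no such link or block exists."""
    contexts = [
        text
        for url, html in documents.items() if url != target_url
        for href, text in extract_links(html) if href == target_url and text
    ]
    parser = TextBlockParser()
    parser.feed(documents[target_url])
    parser.close()
    if not contexts or not parser.blocks:
        return None
    return max(parser.blocks,
               key=lambda block: max(overlap(block, c) for c in contexts))
```

Here `documents` is a plain dictionary mapping a URL to its HTML string, playing the role of the document set DB. Jaccard overlap tends to favor short blocks, so a practical system would likely need a more robust similarity measure and a wider window of text around each link.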
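Publication number JP2013030041(A) Publication date 2013.02.07
Application number JP20110166460 Filing date 2011.07.29
Applicant NIPPON TELEGR & TELEPH CORP Inventor
Classification G06F17/30;G06F13/00 Main classification G06F17/30
Agency Agent
Principal claim
Address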