Abstract |
<P>PROBLEM TO BE SOLVED: To extract the body text from a structured document without depending on rules for text extraction. <P>SOLUTION: A document set recording part 2 records the HTML files of the documents to be processed in a document set DB 3. A link source information extraction part 4 extracts, from an HTML file acquired from the document set DB 3, each embedded hyperlink together with the text information surrounding the link (link peripheral text information). A text extraction part 5 takes an HTML file acquired from the document set DB 3 as a link destination document and identifies, among the hyperlinks extracted by the link source information extraction part 4, those that refer to that file. The text extraction part 5 then compares the character strings of the text information present in the HTML file of the link destination document with the character strings of the link peripheral text information, and extracts a representative part of the link destination document as the body text. An output part 6 outputs the extracted body text. <P>COPYRIGHT: (C)2013,JPO&INPIT |
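
The flow described in the SOLUTION paragraph can be illustrated with a minimal sketch. This is not the patented implementation; it rests on several assumptions that do not come from the abstract: the document set DB is modeled as a plain dict mapping URL to HTML, hyperlinks are matched by exact `href` equality, "link peripheral text" is taken to be the text of the block element containing the link, and the "representative part" is chosen as the block of the destination document that shares the most words with that peripheral text. All function and class names are hypothetical.

```python
from html.parser import HTMLParser
import re

# Assumption: these tags delimit the text blocks that are compared.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}


class BlockParser(HTMLParser):
    """Splits an HTML file into text blocks and records the hrefs found in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # list of (block_text, [href, ...])
        self._text = []
        self._hrefs = []

    def flush(self):
        # Close the current block; blocks without visible text are dropped.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._hrefs))
        self._text, self._hrefs = [], []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._text.append(data)


def parse_blocks(html):
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def tokens(text):
    return set(re.findall(r"\w+", text.lower()))


def link_peripheral_text(documents, target_url):
    """Role of part 4: text surrounding every hyperlink that refers to target_url."""
    found = []
    for url, html in documents.items():
        if url == target_url:
            continue
        for text, hrefs in parse_blocks(html):
            if target_url in hrefs:
                found.append(text)
    return found


def extract_body(documents, target_url):
    """Role of part 5: pick the destination block that best matches the link text."""
    context = set()
    for text in link_peripheral_text(documents, target_url):
        context |= tokens(text)
    best_text, best_score = None, 0
    for text, _ in parse_blocks(documents[target_url]):
        score = len(tokens(text) & context)  # assumption: word-overlap score
        if score > best_score:
            best_text, best_score = text, score
    return best_text  # None when no link text overlaps any block
```

Calling `extract_body(documents, url)` returns the best-matching block of the page at `url`, or `None` when no other document in the set links to it. Word-level tokenization suits space-delimited languages only; for Japanese text a character n-gram comparison would be a more plausible stand-in, which is again an assumption rather than something the abstract states.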