Abstract |
PROBLEM TO BE SOLVED: To correctly extract text art portions as well when producing a structured document in a document information processor. SOLUTION: A picture and character separation part 200 separates an image portion G10 and a text portion G20 from an HTML document acquired by an electronic document input part 100. A text art acquisition part 300 identifies, in the extracted character information, text art, that is, a character array whose pattern carries meaning. A structured document production part 500 refers to the information G24a, G24b, or G24c about the text art identified by the text art acquisition part 300, together with the structured information G20c produced in the usual way by a text analysis part 400 or the pieces of information G10a and G10b about the image portion G10, to complete the logically structured document. COPYRIGHT: (C)2005,JPO&NCIPI |
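
The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented method: the abstract does not say how text art is recognized, so the symbol-density test in `looks_like_text_art`, its `threshold` value, and all class and function names here are assumptions made purely for illustration. The sketch separates image and text portions from an HTML string, flags text blocks that are mostly symbols as text art, and merges everything into one structured list.

```python
from html.parser import HTMLParser


class PictureTextSeparator(HTMLParser):
    """Hypothetical stand-in for the picture and character separation part."""

    def __init__(self):
        super().__init__()
        self.images = []  # image portion: src attributes of <img> tags
        self.texts = []   # text portion: non-empty text blocks

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip("\n"))


def looks_like_text_art(block, threshold=0.6):
    """Assumed heuristic, not from the source: treat a block as text art
    when most of its non-whitespace characters are symbols."""
    chars = [c for c in block if not c.isspace()]
    if not chars:
        return False
    symbols = sum(1 for c in chars if not c.isalnum())
    return symbols / len(chars) >= threshold


def build_structure(html):
    """Combine image, text, and text art information into one list."""
    separator = PictureTextSeparator()
    separator.feed(html)
    separator.close()
    doc = [{"type": "image", "src": src} for src in separator.images]
    for block in separator.texts:
        kind = "text_art" if looks_like_text_art(block) else "text"
        doc.append({"type": kind, "content": block})
    return doc


if __name__ == "__main__":
    sample = '<p>Hello world</p><img src="a.png"><pre>+--+\n|  |\n+--+</pre>'
    for node in build_structure(sample):
        print(node)
```

A real text art acquisition part would likely need more than symbol density, for example line-by-line layout analysis, since ordinary punctuation-heavy text could also cross a simple threshold.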