Abstract |
PROBLEM TO BE SOLVED: To grasp the structural features of a document from a recognition result by optically reading and recognizing its characters, and then analyzing the logical structure of the document, according to a specific rule, from the image information and character information obtained. SOLUTION: An OCR part 11 of an optical image reader 10 reads the document to be recognized. An image information recognition part 12 analyzes the image of the read document and recognizes image information other than character information, such as ruled lines and underlines. A character position information recognition part 133 recognizes, from the image information, the positions at which characters appear and segments the character patterns. In addition, as a character code conversion part 131 converts the characters into character codes by character recognition, a font information recognition part 132 recognizes the font of each character, including whether it is printed or handwritten. Then, a DTD generation part 21 of a document analyzing device 20 analyzes the logical structure of the document according to the specific rule and generates a document type definition (DTD) that determines the structure or format of the document. |
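
The flow from recognition results to a generated DTD can be illustrated with a minimal Python sketch. The abstract does not disclose the "specific rule", so everything here is an assumption for illustration only: the `RecognizedLine` record, the `classify` rule (underlined printed text is a title, handwritten text is an entry, anything else is a paragraph), and the element names are all hypothetical and do not represent the method of the device itself.

```python
from dataclasses import dataclass


@dataclass
class RecognizedLine:
    """Hypothetical record combining the outputs of the recognition parts."""
    text: str          # character codes from character recognition
    x: int             # appearance position, left edge
    y: int             # appearance position, top edge
    handwritten: bool  # font information: handwritten (True) or printed (False)
    underlined: bool   # image information: an underline was detected


def classify(line: RecognizedLine) -> str:
    """Assumed rule (not from the source): map a line to a logical element name."""
    if line.underlined and not line.handwritten:
        return "title"
    if line.handwritten:
        return "entry"
    return "paragraph"


def generate_dtd(lines, root="document"):
    """Build DTD text declaring the element types found in the document."""
    # Visit lines in reading order: top to bottom, then left to right.
    ordered = sorted(lines, key=lambda ln: (ln.y, ln.x))

    # Collect distinct element names in order of first appearance.
    names = []
    for line in ordered:
        name = classify(line)
        if name not in names:
            names.append(name)

    if not names:
        return f"<!ELEMENT {root} EMPTY>"

    # The root may contain any sequence of the discovered element types.
    decls = [f"<!ELEMENT {root} ({' | '.join(names)})*>"]
    decls += [f"<!ELEMENT {name} (#PCDATA)>" for name in names]
    return "\n".join(decls)


if __name__ == "__main__":
    sample = [
        RecognizedLine("Taro", 80, 40, handwritten=True, underlined=False),
        RecognizedLine("Application Form", 10, 5, handwritten=False, underlined=True),
        RecognizedLine("Name:", 10, 40, handwritten=False, underlined=False),
    ]
    print(generate_dtd(sample))
```

In this sketch the position information only fixes the reading order and the font and image information only drive the classification; a real rule set would presumably also use ruled lines and layout to infer nesting between elements.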