摘要 |
<p>A method for identifying and extracting text data from a table-cell frame. The method includes the steps of tracing connected components of a document image, tracing white contours within a connected component, defining a frame outline based on the white contours, identifying unattached character data inside the frame outline, and defining an initial rectangular area inside the frame outline. The method further includes detecting black pixels in a horizontal or vertical direction from the initial rectangular area in order to create an extended character area, locating boundary pixels lying inside the extended character area for each white contour, identifying black pixels positioned between boundary pixels lying inside the extended character area, combining black pixels positioned between boundary pixels lying inside the extended character area so as to form at least one connected component, recognizing the at least one connected component as a text component if it is not recognized as a vertical line, as a horizontal line, as part of a broken line, or as part of the frame, and defining a character node of a hierarchical tree structure corresponding to the extended character area and containing both the at least one connected component and any identified unattached connected components. <IMAGE></p> |