摘要 |
Physical page layout analysis for optical character recognition is performed. A physical page layout analysis method finds constituent parts of an image and gives an initial data-type label, such as text or non-text. Within the text data, connected components are identified and analyzed. Tab-stops are detected from groups of edge-aligned connected components. The detected tab-stops are used to deduce the column layout of the page by finding column partitions. The column layout is then applied to find the polygonal boundaries of and a reading order of regions containing flowing text, headings, and pull-outs.
|