Abstract |
A method and system for preprocessing an image that includes a plurality of columns, or regions, of text are disclosed. A plurality of components associated with the text is determined. A line height and a column spacing are then determined for the components, and each component is associated with a column based on the line height and the column spacing. A set of characteristic parameters is calculated for each column, and the components of each column are merged based on the characteristic parameters to form subwords and words. A first plurality of words and/or subwords is merged and processed as a first region, and a second plurality of words and/or subwords is merged and processed as a second region, wherein at least a portion of the second region vertically overlaps at least a portion of the first region. |
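
The flow described above (components, then line height and column spacing, then column assignment, then merging into words) can be illustrated with a minimal sketch. It is not the disclosed method: the abstract does not say how line height, column spacing, or the characteristic parameters are computed, so the sketch assumes components are bounding boxes `(x, y, width, height)`, takes the median component height as the line height, and receives the hypothetical thresholds `column_spacing` and `word_gap` as arguments. The function names are likewise illustrative, and the handling of vertically overlapping regions is not covered.

```python
from statistics import median


def estimate_line_height(boxes):
    """Assumed estimator: the median height of all component boxes."""
    return median(b[3] for b in boxes)


def split_columns(boxes, column_spacing):
    """Start a new column wherever the horizontal gap exceeds column_spacing."""
    columns = []
    right_edge = None  # right-most x covered by the current column
    for box in sorted(boxes, key=lambda b: b[0]):
        x, _, w, _ = box
        if right_edge is None or x - right_edge > column_spacing:
            columns.append([])
            right_edge = x + w
        else:
            right_edge = max(right_edge, x + w)
        columns[-1].append(box)
    return columns


def merge_words(column, line_height, word_gap):
    """Merge boxes on the same text line whose horizontal gap is <= word_gap."""
    words = []
    # Order by approximate line index, then left to right within a line.
    ordered = sorted(column, key=lambda b: (round(b[1] / line_height), b[0]))
    for x, y, w, h in ordered:
        if words:
            px, py, pw, ph = words[-1]
            same_line = abs(y - py) < line_height / 2
            if same_line and x - (px + pw) <= word_gap:
                # Grow the previous box so it also covers the current one.
                top = min(y, py)
                bottom = max(y + h, py + ph)
                right = max(x + w, px + pw)
                words[-1] = (px, top, right - px, bottom - top)
                continue
        words.append((x, y, w, h))
    return words


# Illustrative input: three components in one column, two in another.
boxes = [(0, 0, 10, 10), (12, 0, 10, 10), (30, 0, 10, 10),
         (100, 0, 10, 10), (112, 0, 10, 10)]
line_height = estimate_line_height(boxes)
columns = split_columns(boxes, column_spacing=40)
words_per_column = [merge_words(c, line_height, word_gap=3) for c in columns]
```

In this sketch the column split runs before word merging, so a wide gap between columns is never mistaken for a gap between words. A real implementation would presumably derive both thresholds from the measured line height and column spacing rather than take them as fixed inputs.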