摘要 |
A method and system for preprocessing an image for Optical Character Recognition (OCR), wherein the image includes a plurality of columns is disclosed. Each column includes one or more of Arabic text and non-text items. The method includes determining a plurality of components associated with one or more of the Arabic text and the non-text items, wherein a component includes a set of connected pixels. On determining the plurality of components, a line height and a column spacing is determined for the plurality of components. The plurality of components are then associated with a column of the plurality of columns based on the line height and the column spacing. Subsequently, a set of characteristic parameters are calculated for each column and the plurality of components of each column are merged based on the set of characteristic parameters to form sub-words and words.
|