Abstract |
Disclosed is a computer-implemented method of text extraction in colour compound documents. The method connects similarly coloured pixels of an image of a colour compound document into connected components (CCs); classifies each CC as either text or non-text; refines the text CC classification for each text CC using global colour context statistics; groups text CCs into text blocks; recovers misclassified non-text CCs into a nearby text block; and removes extraneous CCs from each text block using local colour context statistics, thereby providing the extracted text in the text blocks. Also disclosed is a computer-implemented method of locating graphics objects in a colour compound document image. The method connects similarly coloured pixels of said image into connected components (CCs) and places the CCs in an enclosure tree; classifies (330, 730) each CC into one of a plurality of classes, wherein at least one class (862) represents salient graphics components; identifies (1140), for each CC of said class representing salient graphics components, a graphics container (441) on which to perform semantic analysis; profiles (1170) descendants of said graphics container in said tree to obtain semantic context statistics; and decides (1710) whether the graphics container contains a whole or part of a graphics object based on said semantic context statistics.
|
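The first step common to both methods, connecting similarly coloured pixels into connected components, can be illustrated with a minimal sketch. The abstract does not specify the connectivity, the colour similarity measure, or any threshold, so the 4-connectivity, the per-channel distance in `colour_distance`, the comparison against the seed pixel, and the `threshold` value below are all assumptions made for illustration only, as are the function names.

```python
from collections import deque


def colour_distance(a, b):
    """Largest per-channel absolute difference between two RGB tuples.

    Assumed similarity measure; the abstract does not name one.
    """
    return max(abs(x - y) for x, y in zip(a, b))


def label_connected_components(image, threshold=16):
    """Label 4-connected regions of similarly coloured pixels.

    image: list of rows, each row a list of (r, g, b) tuples.
    Returns (labels, count), where labels[y][x] is the CC index of
    the pixel at row y, column x, and count is the number of CCs.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[-1] * width for _ in range(height)]
    count = 0
    for y0 in range(height):
        for x0 in range(width):
            if labels[y0][x0] != -1:
                continue  # pixel already belongs to a CC
            # Start a new CC and grow it by breadth-first flood fill.
            # Neighbours are compared with the seed colour (an assumed
            # choice) so the region cannot drift along a gradient.
            seed = image[y0][x0]
            labels[y0][x0] = count
            queue = deque([(y0, x0)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] == -1
                            and colour_distance(image[ny][nx], seed) <= threshold):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return labels, count
```

Calling `label_connected_components` on a small page image returns one label per pixel. Later stages such as text/non-text classification or building the enclosure tree would operate on these labelled regions and are not sketched here.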