Abstract |
<p>A method of identifying semantic units in an electronic document comprises the steps of: providing (10) an electronic document described in a page description language, the document comprising at least one page having a plurality of text fragments, each text fragment including a plurality of glyphs that have not been identified as semantic units, the document further comprising geometric information and page description language parameters; determining (14) strips of at least one glyph each by comparing (48) the geometric positions of subsequent glyphs; determining (16) zones of at least one strip each, wherein a zone is defined by the combined area of strips whose geometric areas overlap one another; determining (102) a boundary between two semantic units in a zone based on the geometric properties of the glyphs; sorting (104) the identified semantic units of the zone into a sorted list; and combining (108) subsequent semantic units in the sorted list according to geometric considerations.</p> |
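The strip-determining (14) and zone-determining (16) steps can be illustrated with a minimal Python sketch. It is an illustration only, not the claimed implementation: the `Glyph` type, the axis-aligned bounding boxes, the `max_gap` threshold, and the rule "same line and small horizontal gap" are all assumptions, since the abstract does not specify how geometric positions are compared.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical layout


@dataclass
class Glyph:
    char: str
    box: Box


def union(a: Box, b: Box) -> Box:
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def build_strips(glyphs: List[Glyph], max_gap: float = 5.0) -> List[List[Glyph]]:
    """Group glyphs in document order by comparing each glyph with its predecessor.

    Assumed rule: a glyph joins the current strip if it overlaps the previous
    glyph vertically and the horizontal gap is at most max_gap in magnitude.
    """
    strips: List[List[Glyph]] = []
    for g in glyphs:
        if strips:
            prev = strips[-1][-1]
            same_line = prev.box[1] < g.box[3] and g.box[1] < prev.box[3]
            gap = g.box[0] - prev.box[2]
            if same_line and -max_gap <= gap <= max_gap:
                strips[-1].append(g)
                continue
        strips.append([g])
    return strips


def strip_box(strip: List[Glyph]) -> Box:
    """Bounding box of all glyphs in a strip."""
    box = strip[0].box
    for g in strip[1:]:
        box = union(box, g.box)
    return box


def build_zones(strips: List[List[Glyph]]) -> List[Box]:
    """Merge strips whose areas overlap; each zone is the combined area."""
    zones: List[Box] = []
    for s in strips:
        box = strip_box(s)
        merged = True
        while merged:  # the box may grow, so re-check the remaining zones
            merged = False
            for z in zones:
                if overlaps(z, box):
                    box = union(z, box)
                    zones.remove(z)
                    merged = True
                    break
        zones.append(box)
    return zones
```

Under these assumptions, two glyphs that sit side by side on one line form a single strip, a strip on a vertically overlapping line is merged into the same zone, and a distant glyph ends up in a zone of its own. Boundary detection (102), sorting (104), and combining (108) would then operate within each zone.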