Abstract |
A method of identifying semantic units in an electronic document includes the steps of: providing an electronic document described in a page description language, the document having at least one page having a plurality of text fragments, each text fragment including a plurality of glyphs that have not been identified as semantic units, the document further including geometric information and page description language parameters; determining strips of at least one glyph by comparing the geometric positions of subsequent glyphs; determining zones of at least one strip, wherein a zone is defined by the combined area of strips whose geometric areas overlap with each other; determining a boundary between two semantic units in a zone based on the geometric properties of the glyphs; sorting the identified semantic units in the zone into a sorted list; and combining subsequent semantic units in the sorted list according to geometric considerations.
|
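The steps recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `Glyph` and `Unit` types, the helper names, the tolerances (`y_tol`, `max_gap`, `gap_ratio`), the use of bounding boxes to approximate the "combined area" of strips, and the choice to combine units that share a line are all assumptions made for illustration. Coordinates are assumed to grow downward, and glyphs are assumed to arrive in document order.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward

@dataclass
class Glyph:          # hypothetical glyph record
    char: str
    x: float          # left edge
    y: float          # top edge
    w: float          # width
    h: float          # height

@dataclass
class Unit:           # hypothetical semantic unit (here: a word or a line)
    text: str
    x0: float
    x1: float
    y: float

def build_strips(glyphs: List[Glyph], y_tol=1.0, max_gap=20.0) -> List[List[Glyph]]:
    """Group subsequent glyphs into strips by comparing their positions."""
    strips: List[List[Glyph]] = []
    for g in glyphs:
        if strips:
            prev = strips[-1][-1]
            gap = g.x - (prev.x + prev.w)
            if abs(g.y - prev.y) <= y_tol and -1.0 <= gap <= max_gap:
                strips[-1].append(g)
                continue
        strips.append([g])
    return strips

def bbox(strip: List[Glyph]) -> Box:
    return (min(g.x for g in strip), min(g.y for g in strip),
            max(g.x + g.w for g in strip), max(g.y + g.h for g in strip))

def overlaps(a: Box, b: Box) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def build_zones(strips):
    """Merge strips whose areas overlap; a zone is approximated by a union box."""
    zones = []  # list of (box, member strips)
    for s in strips:
        box, members = bbox(s), [s]
        i = 0
        while i < len(zones):
            zbox, zstrips = zones[i]
            if overlaps(box, zbox):
                box = (min(box[0], zbox[0]), min(box[1], zbox[1]),
                       max(box[2], zbox[2]), max(box[3], zbox[3]))
                members = zstrips + members
                del zones[i]
                i = 0  # the box grew, so recheck the remaining zones
            else:
                i += 1
        zones.append((box, members))
    return zones

def split_units(strip: List[Glyph], gap_ratio=0.25) -> List[Unit]:
    """Place a boundary wherever the inter-glyph gap is large relative to glyph height."""
    groups, cur = [], [strip[0]]
    for prev, g in zip(strip, strip[1:]):
        if g.x - (prev.x + prev.w) > gap_ratio * prev.h:
            groups.append(cur)
            cur = []
        cur.append(g)
    groups.append(cur)
    return [Unit("".join(g.char for g in u), u[0].x, u[-1].x + u[-1].w, u[0].y)
            for u in groups]

def combine(units: List[Unit], y_tol=1.0) -> List[Unit]:
    """Sort units top-to-bottom, left-to-right, then join neighbours on the same line."""
    out: List[Unit] = []
    for u in sorted(units, key=lambda u: (u.y, u.x0)):
        if out and abs(out[-1].y - u.y) <= y_tol:
            last = out[-1]
            out[-1] = Unit(last.text + " " + u.text, last.x0, u.x1, last.y)
        else:
            out.append(u)
    return out

def identify(glyphs: List[Glyph]) -> List[List[str]]:
    result = []
    for _, strips in build_zones(build_strips(glyphs)):
        words = [u for s in strips for u in split_units(s)]
        result.append([u.text for u in combine(words)])
    return result

def layout(text, x, y, w=6.0, h=12.0):
    """Hypothetical test helper: one fixed-width glyph per non-space character."""
    return [Glyph(c, x + i * w, y, w, h) for i, c in enumerate(text) if c != " "]

glyphs = layout("ab cd", 0, 0) + layout("ef", 0, 10) + layout("gh", 200, 0)
print(identify(glyphs))  # [['ab cd', 'ef'], ['gh']]
```

In this toy input the first two lines overlap vertically and therefore fall into one zone, while the distant fragment "gh" forms its own zone. Real page description data would need font metrics and more careful thresholds than the fixed values assumed here.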