Abstract |
Syntagms of a text whose individual elements are written without separators are segmented into chunks, each chunk being a string of at least one individual element, such as an ideogram of the Mandarin Chinese language. A lexicon (LEX) is defined as a set of strings, each string comprising at least one of the individual elements. The syntagm being segmented is searched in order, element by element (INDX), by looking up in the lexicon the strings corresponding to any of said chunks. When a search succeeds, the chunk located is stored with an associated cost. A check is made as to whether the chunk located was already present in the lexicon; if it was, the cost associated with it is reduced. A plurality of candidate segmentation sequences is thus generated, each corresponding to a respective segmentation pattern with a corresponding accrued cost. The candidate sequence having the lowest associated cost is selected as the final result of segmentation.
|
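The lowest-cost selection described in the abstract can be illustrated with a minimal sketch. It is not the claimed method: the function name `segment`, the constants `BASE_COST` and `LEXICON_DISCOUNT`, the `max_len` window, and the rule that an element missing from the lexicon may still stand alone as a full-cost chunk are all assumptions introduced here. The sketch also swaps explicit enumeration of candidate sequences for dynamic programming, which returns the same minimum-cost sequence without listing every candidate.

```python
# Hypothetical cost values; the abstract does not specify any numbers.
BASE_COST = 10         # cost stored with every chunk that is located
LEXICON_DISCOUNT = 9   # reduction applied when the chunk is in the lexicon


def segment(syntagm, lexicon, max_len=4):
    """Return (accrued_cost, chunks) for the cheapest segmentation.

    Assumption: a single element not found in the lexicon is still
    accepted as a chunk at full cost, so a segmentation always exists.
    """
    n = len(syntagm)
    # best[i] holds the cheapest (cost, chunks) for syntagm[i:].
    best = [None] * (n + 1)
    best[n] = (0, [])
    # Walk the syntagm element by element, from the end to the start.
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            chunk = syntagm[i:j]
            in_lexicon = chunk in lexicon
            # Multi-element chunks must be lexicon entries.
            if not in_lexicon and j - i > 1:
                continue
            cost = BASE_COST - (LEXICON_DISCOUNT if in_lexicon else 0)
            tail_cost, tail_chunks = best[j]
            candidates.append((cost + tail_cost, [chunk] + tail_chunks))
        # Keep the candidate with the lowest accrued cost.
        best[i] = min(candidates, key=lambda c: c[0])
    return best[0]


if __name__ == "__main__":
    lex = {"中国", "人民", "中国人"}
    print(segment("中国人民", lex))
```

With these assumed costs, splitting the example syntagm into two lexicon entries is cheaper than taking the three-element entry followed by a stray single element, so the two-chunk sequence is selected. Ties between equally cheap sequences are resolved by whichever candidate is generated first, which is a property of this sketch and not something the abstract states.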