Abstract
Syntagms of a text whose individual elements are written without separators are segmented into chunks, each chunk being a string of at least one individual element, such as an ideogram of the Mandarin Chinese language. A lexicon is defined as a set of strings, each string containing at least one of the individual elements. The syntagm being segmented is searched in order, on an element-by-element basis, by looking up in the lexicon the strings that correspond to any of the chunks. When a search yields a positive result, the chunk located is stored with an associated cost. A check is then made as to whether the chunk located was already present in the lexicon; if it was, the cost associated with it is reduced. A plurality of candidate segmentation sequences is thus generated, each corresponding to a respective segmentation pattern with its own accrued cost. The candidate sequence with the lowest accrued cost is selected as the final result of segmentation.
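The selection of the lowest-cost candidate described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `segment`, the constants `BASE_COST` and `KNOWN_DISCOUNT`, the `max_len` limit, and the rule that a single element missing from the lexicon may still form a chunk at full cost are all assumptions made for illustration. The sketch enumerates every candidate sequence exhaustively, which is only practical for short syntagms.

```python
BASE_COST = 1.0       # assumed cost stored with every chunk
KNOWN_DISCOUNT = 0.5  # assumed reduction when the chunk is in the lexicon


def segment(syntagm, lexicon, max_len=4):
    """Return (accrued_cost, chunks) for the cheapest candidate sequence."""
    candidates = []

    def walk(pos, chunks, cost):
        # A candidate is complete once every element has been consumed.
        if pos == len(syntagm):
            candidates.append((cost, chunks))
            return
        # Try chunks of 1..max_len elements starting at the current element.
        for end in range(pos + 1, min(len(syntagm), pos + max_len) + 1):
            chunk = syntagm[pos:end]
            known = chunk in lexicon
            # Assumption: multi-element chunks must be lexicon entries;
            # a single unknown element is kept as a fallback chunk.
            if not known and end - pos > 1:
                continue
            chunk_cost = BASE_COST - (KNOWN_DISCOUNT if known else 0.0)
            walk(end, chunks + [chunk], cost + chunk_cost)

    walk(0, [], 0.0)
    # The candidate with the lowest accrued cost is the final segmentation.
    return min(candidates, key=lambda c: c[0])


if __name__ == "__main__":
    lexicon = {"北京", "大学", "北京大学", "生"}
    cost, chunks = segment("北京大学生", lexicon)
    print(cost, chunks)  # 1.0 ['北京大学', '生']
```

With these assumed costs, a sequence built from fewer, longer lexicon entries accrues less cost than one built from many short chunks, so `北京大学` + `生` beats `北京` + `大学` + `生`. A production system would likely replace the exhaustive enumeration with a lattice or dynamic-programming search.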