Title of invention AUTOMATIC SEGMENTATION OF TEXTS COMPRISING CHUNKS WITHOUT SEPARATORS
Abstract Syntagms of a text including individual elements written without separators are segmented into chunks, each chunk being a string of at least one individual element, such as an ideogram of the Mandarin Chinese language. A lexicon (LEX) is defined including a set of strings, each string comprising at least one of the individual elements. The syntagm being segmented is searched in order, on an element-by-element basis (INDX), by looking up in the lexicon strings corresponding to any of said chunks. In the case of a positive search result, the chunk located is stored with an associated cost. A check is made as to whether the chunk located was already present in the lexicon. If the chunk located was already present, the cost associated therewith is reduced. A plurality of candidate segmentation sequences is thus generated, each corresponding to a respective segmentation pattern with a corresponding accrued cost. The candidate sequence having the lowest associated cost is selected as the final result of segmentation.
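The procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the patented method: the abstract gives no cost values and does not say how candidate sequences are enumerated, so the sketch assumes a fixed cost per chunk (`chunk_cost`), a reduced cost for chunks found in the lexicon (`lexicon_cost`), and a dynamic-programming pass that keeps only the cheapest candidate per position. The function name `segment` and both cost parameters are hypothetical.

```python
from typing import List, Set, Tuple


def segment(
    syntagm: str,
    lexicon: Set[str],
    chunk_cost: float = 1.0,    # assumed cost of a chunk not in the lexicon
    lexicon_cost: float = 0.5,  # assumed reduced cost of a lexicon chunk
) -> Tuple[List[str], float]:
    """Return the lowest-cost segmentation of `syntagm` and its accrued cost."""
    n = len(syntagm)
    max_len = max((len(s) for s in lexicon), default=1)

    # best[i] holds (accrued cost, chunks) for the prefix syntagm[:i].
    best: List[Tuple[float, List[str]]] = [(0.0, [])]
    best += [(float("inf"), [])] * n

    # Scan the syntagm element by element (the index plays the role of INDX).
    for i in range(n):
        cost_i, chunks_i = best[i]
        for j in range(i + 1, min(n, i + max_len) + 1):
            chunk = syntagm[i:j]
            if chunk in lexicon:
                cost = lexicon_cost      # chunk present in the lexicon
            elif j == i + 1:
                cost = chunk_cost        # fall back to a single element
            else:
                continue                 # unknown multi-element string
            candidate = cost_i + cost
            if candidate < best[j][0]:
                best[j] = (candidate, chunks_i + [chunk])

    chunks, total = best[n][1], best[n][0]
    return chunks, total


if __name__ == "__main__":
    lex = {"北京", "大学", "北京大学", "生"}
    print(segment("北京大学生", lex))
```

Keeping one best candidate per position gives the same result as generating every candidate sequence and choosing the cheapest, because the accrued cost is a simple sum over chunks. The single-element fallback is an added assumption so that text not covered by the lexicon can still be segmented.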
Publication number CA2523992(A1) Publication date 2004.12.09
Application number CA20032523992 Filing date 2003.05.28
Applicant LOQUENDO S.P.A. Inventor BADINO, LEONARDO
Classification G06F17/27;G06F17/28 Main classification G06F17/27
Agency Agent
Principal claim
Address