摘要 |
A computer-implemented method includes receiving a plurality of character strings. The number of strings (M) in the plurality of strings having a unique substring of X characters at an extremity of the string is determined, the number of strings (N) in the plurality of strings having at least X characters in the string is determined. A probability is determined, based on a predetermined model for a distribution of characters in the strings, that the unique substring of X characters would occur M or more times out of the N strings, given that the unique character string occurs at least once. Based on the probability, the number M, and the number N, it is determined that the unique character string is a significant affix in the plurality of character strings, and the unique character string is stored. |