Abstract |
Methods and apparatus, including computer program products, for identifying a language corresponding to a string of data include receiving a data string and dividing the data string into coded character sequences for each of a plurality of languages. For coded character sequences having a particular number of characters, the length of one or more of the sequences varies among the languages. The coded character sequences are analyzed to calculate, for each of the plurality of languages, a probability that the data string corresponds to that language. The calculated probabilities are compared among the languages, and a language is identified as corresponding to the data string based on the comparison.
|
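The steps summarized in the abstract can be illustrated with a minimal sketch. It is not the claimed implementation: the two languages, their encodings (`latin-1` and `shift_jis`), the tiny training samples, the character-bigram model, the `FLOOR` probability for unseen bigrams, and the per-bigram averaging are all assumptions chosen for illustration. The sketch decodes the same bytes once per candidate language, so a two-character sequence spans 2 bytes under `latin-1` and 4 bytes under `shift_jis`, then scores each decoding and picks the best-scoring language.

```python
import math
from collections import Counter

N = 2          # characters per sequence (assumed)
FLOOR = 1e-6   # assumed probability for sequences never seen in training

# Hypothetical per-language setup: (encoding, tiny training sample).
LANGUAGES = {
    "english": ("latin-1",
                "the quick brown fox jumps over the lazy dog and then the other thing"),
    "japanese": ("shift_jis",
                 "これは日本語の文章です。日本語のテキストを識別します。"),
}


def char_ngrams(text, n=N):
    """Split text into overlapping sequences of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(sample):
    """Count character n-grams in a training sample."""
    counts = Counter(char_ngrams(sample))
    return counts, sum(counts.values())


MODELS = {lang: (enc, train(sample)) for lang, (enc, sample) in LANGUAGES.items()}


def score(data, encoding, counts, total):
    """Average log-probability of the n-grams obtained by decoding
    `data` with one language's encoding."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return float("-inf")  # bytes are not valid in this encoding
    grams = char_ngrams(text)
    if not grams:
        return float("-inf")
    logs = [math.log(counts[g] / total) if counts[g] else math.log(FLOOR)
            for g in grams]
    # Averaging keeps languages comparable even though the same bytes
    # yield different numbers of characters per encoding (an assumption).
    return sum(logs) / len(logs)


def score_languages(data):
    """Compute one score per candidate language for the same byte string."""
    return {lang: score(data, enc, counts, total)
            for lang, (enc, (counts, total)) in MODELS.items()}


def identify_language(data):
    """Compare the per-language scores and return the best language."""
    scores = score_languages(data)
    return max(scores, key=scores.get)
```

For example, `identify_language("the lazy dog".encode("latin-1"))` selects `"english"` with these toy samples, while the Shift JIS bytes of `"日本語のテキスト"` select `"japanese"`. A fixed floor is used instead of add-one smoothing so that the differing sizes of the training samples do not skew the comparison; a practical system would need far larger training data and a properly normalized model.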