Abstract |
A method is disclosed for classifying data according to the character encoding in which it has been encoded. A fingerprint (62) is constructed from the data, wherein the fingerprint comprises, for each of a plurality of predetermined character encoding schemes, at least one confidence value representing a confidence that the data was encoded using said character encoding scheme. The fingerprint also comprises a frequency value for each of a subset of byte values, each frequency value representing the frequency of occurrence of a respective byte value in the data. A statistical classification of the data is then performed based on the fingerprint. The method then preferably identifies a language represented by textual data in the classified data and applies a language-specific policy based on the identified language.
|
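The fingerprint described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the disclosure: the candidate encoding list `ENCODINGS`, the tracked byte subset `TRACKED_BYTES`, the function names, and the crude confidence measure (1.0 if the data decodes strictly under an encoding, 0.0 otherwise). The statistical classifier that would consume the fingerprint is left out.

```python
from collections import Counter

# Hypothetical candidate encodings; the abstract only says "a plurality of
# predetermined character encoding schemes".
ENCODINGS = ("ascii", "utf-8", "shift_jis", "euc_jp")

# Hypothetical subset of byte values whose frequencies are recorded.
TRACKED_BYTES = (0x00, 0x1B, 0x82, 0xA4, 0xE3)


def confidence(data: bytes, encoding: str) -> float:
    """Crude stand-in for a confidence value: 1.0 if `data` decodes
    without error under `encoding`, else 0.0 (an assumption, not the
    measure used in the disclosure)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return 0.0
    return 1.0


def build_fingerprint(data: bytes) -> list:
    """Return one confidence value per candidate encoding, followed by
    one relative frequency per tracked byte value."""
    counts = Counter(data)   # iterating bytes yields ints 0..255
    total = len(data) or 1   # avoid division by zero on empty input
    confidences = [confidence(data, enc) for enc in ENCODINGS]
    frequencies = [counts[b] / total for b in TRACKED_BYTES]
    return confidences + frequencies


if __name__ == "__main__":
    sample = "こんにちは".encode("utf-8")
    print(build_fingerprint(sample))
```

The resulting vector has a fixed length (here, four confidence values plus five frequency values), so it could be passed to any statistical classifier that accepts numeric feature vectors.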