摘要 |
A method is disclosed, for recognizing whether some electronic data is the digital representation of a piece of text and, if so, in which character encoding it has been encoded. A fingerprint is constructed from the data, wherein the fingerprint comprises, for each of a plurality of predetermined character encoding schemes, at least one confidence value, representing a confidence that the data was encoded using said character encoding scheme. The fingerprint also comprises a frequency value for each of a subset of byte values, each frequency value representing the frequency of occurrence of a respective byte value in the data. A statistical classification of the data is then performed based on the fingerprint.
The method may be applied to spam classification. |