Abstract |
A method for mechanically identifying the language and character code system of a computer-encoded text document. For each target language/character code system, a list LBSL/C of byte strings of a specified length is built in advance; it stores the byte strings of that length that can occur in a text document of the relevant language/character code system. For each language/character code system, an “occurrence rate of learnt byte strings” is calculated, i.e. the proportion of the byte strings of the specified length contained in the target text document that already exist in the list LBSL/C. Only when exactly one language/character code system has an “occurrence rate of learnt byte strings” close to 1 is that language/character code system output as the result.
|
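The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names (`byte_ngrams`, `build_learnt_list`, `occurrence_rate`, `identify`), the byte-string length of 2, the threshold of 0.95 standing in for "close to 1", the use of overlapping windows, and the tiny training samples are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, Iterable, Iterator, Optional, Set


def byte_ngrams(data: bytes, n: int) -> Iterator[bytes]:
    """Yield every n-byte string in data (overlapping windows: an assumption)."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def build_learnt_list(samples: Iterable[bytes], n: int) -> Set[bytes]:
    """Collect the n-byte strings seen in sample documents of one
    language/character code system (a stand-in for the list LBSL/C)."""
    learnt: Set[bytes] = set()
    for sample in samples:
        learnt.update(byte_ngrams(sample, n))
    return learnt


def occurrence_rate(document: bytes, learnt: Set[bytes], n: int) -> float:
    """Fraction of the document's n-byte strings that are already in the list."""
    total = 0
    known = 0
    for gram in byte_ngrams(document, n):
        total += 1
        if gram in learnt:
            known += 1
    return known / total if total else 0.0


def identify(document: bytes,
             lists: Dict[str, Set[bytes]],
             n: int = 2,
             threshold: float = 0.95) -> Optional[str]:
    """Return the single label whose rate is close to 1, else None.

    `threshold` is a hypothetical cut-off for "close to 1".
    """
    candidates = [label for label, learnt in lists.items()
                  if occurrence_rate(document, learnt, n) >= threshold]
    return candidates[0] if len(candidates) == 1 else None


# Toy training data: one short sentence per character code system.
training_text = "これは日本語の文書です"
lists = {
    "ja/Shift_JIS": build_learnt_list([training_text.encode("shift_jis")], 2),
    "ja/EUC-JP": build_learnt_list([training_text.encode("euc_jp")], 2),
}

result = identify("日本語の文書".encode("shift_jis"), lists)
print(result)  # ja/Shift_JIS
```

Returning `None` whenever zero or several candidates pass the threshold mirrors the abstract's condition that a result is output only when exactly one language/character code system qualifies.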