Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for identifying misrepresented characters in strings of text. A computer system receives text that includes characters identified as being encoded in UTF-8. The characters are represented as code point values, each code point value representing one character in the text. The computer system determines that the text likely includes characters incorrectly converted from Win-1252 to UTF-8 by comparing the code point values that represent the text with test values. Based on the comparison, the computer system identifies sequences of characters in the text that were likely incorrectly converted.
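The comparison described above can be illustrated with a minimal Python sketch. It is not the claimed method; it assumes one common failure mode, in which the bytes of UTF-8 text were read as Win-1252 and then re-encoded as UTF-8, so that each original character appears as a "lead" character followed by one to three "continuation" characters. The function names and the particular test values (`CONTINUATION`, the lead ranges) are hypothetical choices made for illustration.

```python
# Hypothetical sketch: the names and test values below are illustrative
# assumptions, not taken from the abstract.

def _cp1252_code_point(byte: int) -> int:
    """Return the code point a single byte becomes when read as Win-1252."""
    try:
        return ord(bytes([byte]).decode("cp1252"))
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81, 0x8D) are undefined in Python's cp1252
        # codec; assume they pass through as the same-valued control code.
        return byte


# Test values: code points that UTF-8 continuation bytes (0x80-0xBF)
# turn into after being misread as Win-1252.
CONTINUATION = frozenset(_cp1252_code_point(b) for b in range(0x80, 0xC0))


def _expected_continuations(code_point: int) -> int:
    """How many continuation characters should follow a misread lead byte."""
    if 0xC2 <= code_point <= 0xDF:   # lead byte of a 2-byte UTF-8 sequence
        return 1
    if 0xE0 <= code_point <= 0xEF:   # lead byte of a 3-byte UTF-8 sequence
        return 2
    if 0xF0 <= code_point <= 0xF4:   # lead byte of a 4-byte UTF-8 sequence
        return 3
    return 0


def find_suspect_sequences(text: str) -> list:
    """Return (start, end, substring) for each likely misconverted sequence."""
    hits = []
    i = 0
    while i < len(text):
        n = _expected_continuations(ord(text[i]))
        tail = text[i + 1 : i + 1 + n]
        if n and len(tail) == n and all(ord(ch) in CONTINUATION for ch in tail):
            hits.append((i, i + 1 + n, text[i : i + 1 + n]))
            i += 1 + n               # skip past the whole suspect sequence
        else:
            i += 1
    return hits


if __name__ == "__main__":
    # Simulate the assumed failure: UTF-8 bytes misread as Win-1252.
    garbled = "café".encode("utf-8").decode("cp1252")
    print(find_suspect_sequences(garbled))
    print(find_suspect_sequences("café"))
```

In this sketch a correctly encoded "é" is not flagged, because the characters that follow it do not match the continuation test values. A production detector would likely need additional checks, since legitimate text can occasionally contain the same character pairs.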