摘要 |
An automated system and method is provided for debugging training data used to train an automated language identifier. The system and method collects texts written in a particular language, generates an occurrence count for words in each text by counting the number of times each of the words is found within the text, and generates an occurrence ratio (OR) of each of the words by dividing the occurrence count by the total number of words in each text. Words are then filtered from the texts in which their occurrence ratios are substantially higher than their occurrence ratios in at least one of the other texts, to generate a clean text.
|