Title of invention AUTOMATIC LANGUAGE IDENTIFICATION SYSTEM FOR MULTILINGUAL OPTICAL CHARACTER RECOGNITION
Abstract 1. A method for automatically determining one or more languages associated with text in a document, comprising the steps of: segmenting the document into a plurality of word tokens; forming at least one hypothesis of the characters in said word tokens; defining a dictionary for each one of plural languages; determining confidence factors with respect to said plural languages for said word hypotheses, which factors are based on whether the dictionary for a given language indicates whether a word hypothesis is found in that language; defining a plurality of regions in the document, each of which contains at least one word; determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.
2. The method of claim 1 wherein a hypothesis is formed only for words having a minimum length of at least two characters.
3. The method of claim 1 wherein said confidence factors for hypothesized words are weighted in accordance with the lengths of the hypothesized words.
4. The method of claim 1 further including the steps of determining a recognition probability for each hypothesis, and weighting said confidence factors in accordance with the recognition probabilities.
5. The method of claim 1 wherein said confidence factors for hypothesized words are weighted in accordance with the frequencies with which the hypothesized words appear in the respective languages.
6. The method of claim 1 wherein said initial hypothesis is formed by means of a classifier that is generic to each of said plural languages.
7. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of: defining at least one zone in the document which contains a plurality of words; defining a dictionary for each one of plural languages; for each word in the zone, determining a confidence factor with respect to each of said plural languages, which factor is based on whether the respective dictionaries contain the word; identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone; selecting a local region in the zone which contains at least one word; identifying a region language for the local region, based upon the confidence factor associated with the words in the region; determining whether the region language is the same as the zone language; and segregating the local region from other regions in the zone if its region language is not the same as the zone language.
8. A method for automatically determining one or more languages associated with text in a document, comprising the steps of: segmenting the document into a plurality of zones containing regions of word tokens; forming at least one hypothesis of the characters in said word tokens; defining a dictionary for each one of plural languages; for each hypothesized word, determining which ones of said dictionaries contain the hypothesis for the word and determining a confidence value for each language; identifying a zone language for each zone, based upon the confidence values associated with the words in the zone; identifying a region language for each region, based upon the confidence values associated with the words in the region; designating the zone language as the region language if the confidence values associated with the words in the region are not sufficiently high; and clustering regions in a zone which have the same region language to form a subzone that is identified with a particular language.
9. The method of claim 8 wherein a hypothesis is formed only for words having a predetermined minimum number of characters greater than one.
10. The method of claim 8 further including the step of weighting said confidence values in accordance with the lengths of the hypothesized words.
11. The method of claim 8 further including the steps of determining a recognition probability for each hypothesis, and weighting said confidence values in accordance with the recognition probabilities.
12. The method of claim 8 wherein said initial hypothesis is formed by means of a classifier that is generic to each of said plural languages.
13. A method for automatically determining one or more languages associated with text in a document, comprising the steps of: segmenting the document into a plurality of word tokens; forming at least one hypothesis of the characters in said word tokens; for each word hypothesis, determining a confidence factor that indicates whether the word is contained in each of said plural languages; defining a plurality of regions in the document, each of which contains at least one word; determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.
14. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of: defining at least one zone in the document which contains a plurality of words; for each word in the zone, determining a confidence factor that indicates whether the word is contained in each of said plural languages; identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone; selecting a local region in the zone which contains at least one word; identifying a region language for the local region, based upon the confidence factor associated with the words in the region; determining whether the region language is the same as the zone language; and segregating the local region from other regions in the zone if its region language is not the same as the zone language.
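The flow recited in claims 8 to 10 can be illustrated with a minimal Python sketch. It is not the patented implementation. The toy dictionaries, the two-character minimum, the 0.5 threshold for "sufficiently high", the normalisation by character count, the restriction of clustering to adjacent regions, and all function names are assumptions introduced here for illustration. Word hypotheses are taken as already-recognised strings, so the OCR classifier and recognition probabilities of claims 11 and 12 are left out.

```python
from collections import defaultdict

# Hypothetical toy dictionaries; a real system would load full lexicons.
DICTIONARIES = {
    "en": {"the", "cat", "sat", "on", "mat", "house", "is", "red"},
    "fr": {"le", "chat", "est", "sur", "la", "maison", "rouge"},
}

MIN_WORD_LEN = 2        # assumption: claim 9 only requires "greater than one"
MIN_REGION_SCORE = 0.5  # assumption: cut-off for "sufficiently high"


def word_confidences(word):
    """Per-language confidence for one word hypothesis, weighted by its length."""
    w = word.lower()
    if len(w) < MIN_WORD_LEN:
        return {}
    return {lang: float(len(w)) for lang, d in DICTIONARIES.items() if w in d}


def language_scores(words):
    """Sum word confidences per language, normalised by total characters."""
    totals = defaultdict(float)
    for word in words:
        for lang, conf in word_confidences(word).items():
            totals[lang] += conf
    length = sum(len(w) for w in words if len(w) >= MIN_WORD_LEN) or 1
    return {lang: score / length for lang, score in totals.items()}


def best_language(scores):
    """Return (language, score) for the top-scoring language, or (None, 0.0)."""
    if not scores:
        return None, 0.0
    lang = max(scores, key=scores.get)
    return lang, scores[lang]


def segment_zone(regions):
    """Label each region, fall back to the zone language, then cluster."""
    zone_words = [w for region in regions for w in region]
    zone_lang, _ = best_language(language_scores(zone_words))
    labels = []
    for region in regions:
        lang, score = best_language(language_scores(region))
        # Low-confidence regions inherit the zone language.
        labels.append(lang if score >= MIN_REGION_SCORE else zone_lang)
    # Assumption: only adjacent regions with the same label are merged.
    subzones = []
    for idx, lang in enumerate(labels):
        if subzones and subzones[-1][0] == lang:
            subzones[-1][1].append(idx)
        else:
            subzones.append((lang, [idx]))
    return subzones
```

With the regions `[["the", "cat", "sat"], ["on", "the", "mat"], ["le", "chat", "est"], ["xq", "zz"]]`, the sketch returns `[("en", [0, 1]), ("fr", [2]), ("en", [3])]`: the last region matches no dictionary, so it takes the zone language, and the French region is split off as its own subzone.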
Publication number EA001689(B1) Publication date 2001.06.25
Application number EA20000000321 Filing date 1997.11.20
Applicant SCANSOFT, INC. Inventors PON, LEONARD, K.;KANUNGO, TARAS;YANG, JUN;CHOY, KENNETH, CHAN;BOXSER, MINDY, R.
Classification G06K9/68 Main classification G06K9/68
Agency Agent
Principal claim
Address