Title of invention: Efficient language identification
Abstract: A system and methods for language identification of natural language text are presented. The system stores expected character counts and variances for a list of characters found in a natural language. These expected counts and variances are stored for each of the multiple languages to be considered during language identification. At run time, one or more languages are identified for a text sample by comparing actual character counts with expected character counts. The present methods can be combined with upstream analysis of Unicode ranges for the characters in the text sample to limit the number of languages considered. Further, n-gram methods can be used in downstream processing to select the most probable language from among the languages identified by the present system and methods.
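Publication number: US8027832(B2) Publication date: 2011.09.27
Application number: US20050056707 Filing date: 2005.02.11
Applicant: MICROSOFT CORPORATION Inventors: RAMSEY WILLIAM D.; SCHMID PATRICIA M.; POWELL KEVIN R.
Classification: G06F17/27 Main classification: G06F17/27
Agency: Agent:
Principal claim:
Address: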
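
The comparison of actual and expected character counts described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a binomial model, so that for a sample of n letters a character with probability p has expected count n*p and variance n*p*(1-p), and it ranks languages by mean squared deviation scaled by variance. The names `PROFILES`, `deviation_score`, and `rank_languages` are hypothetical, and the probabilities are approximate placeholder values rather than figures from the patent.

```python
from collections import Counter

# Hypothetical per-language character probabilities (approximate placeholders,
# not taken from the patent). A real system would estimate them from corpora.
PROFILES = {
    "en": {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "h": 0.061},
    "de": {"e": 0.164, "n": 0.098, "i": 0.076, "s": 0.073, "r": 0.070},
}


def deviation_score(text, profile):
    """Mean variance-scaled squared gap between actual and expected counts.

    Lower means the sample fits the language profile better.
    Assumes a binomial model for each character count (an assumption).
    """
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    if n == 0:
        return float("inf")  # no evidence for any language
    actual = Counter(letters)
    total = 0.0
    for ch, p in profile.items():
        expected = n * p            # expected count of ch in n letters
        variance = n * p * (1 - p)  # binomial variance of that count
        total += (actual[ch] - expected) ** 2 / variance
    return total / len(profile)


def rank_languages(text, profiles=PROFILES):
    """Return language codes ordered from best to worst fit."""
    scores = {lang: deviation_score(text, prof) for lang, prof in profiles.items()}
    return sorted(scores, key=scores.get)


if __name__ == "__main__":
    print(rank_languages("The weather on the other side of the hills is rather wet."))
```

In a fuller pipeline, as the abstract suggests, a Unicode-range check could narrow the candidate profiles before this step, and an n-gram model could pick among the top-ranked languages afterwards.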