Title of invention: LANGUAGE IDENTIFICATION IN MULTILINGUAL TEXT
Abstract: Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. Each section is identified and assigned a weight: more informative sections receive a higher weight and less informative sections a lower weight. A language likelihood score is determined for each word, phrase, or character n-gram in a section, and the scores within a section are combined for each language. The combined section scores are then summed to obtain a total document score for each language, and these document scores can be ranked to determine the primary language of the document.
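The scoring pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the toy likelihood table `TOKEN_LIKELIHOODS`, the section weights in `SECTION_WEIGHTS`, whitespace tokenization into single words, averaging as the way scores are "combined" within a section, and multiplying each section score by its weight before summing.

```python
from collections import defaultdict

# Hypothetical per-token language likelihoods (toy values, not from the source).
TOKEN_LIKELIHOODS = {
    "the": {"en": 0.9, "fr": 0.05},
    "language": {"en": 0.6, "fr": 0.1},
    "le": {"fr": 0.8, "en": 0.05},
    "bonjour": {"fr": 0.95},
    "monde": {"fr": 0.9},
}

# Hypothetical section weights: more informative sections weigh more.
SECTION_WEIGHTS = {"title": 3.0, "body": 1.0, "footer": 0.5}


def score_section(text):
    """Combine per-token likelihoods into one score per language.

    Assumption: "combined" is modeled as the mean over all tokens in the section.
    """
    tokens = text.lower().split()
    totals = defaultdict(float)
    for token in tokens:
        # Unknown tokens contribute nothing to any language.
        for lang, likelihood in TOKEN_LIKELIHOODS.get(token, {}).items():
            totals[lang] += likelihood
    count = max(len(tokens), 1)  # avoid division by zero for empty sections
    return {lang: total / count for lang, total in totals.items()}


def score_document(sections):
    """Sum weighted section scores per language and rank languages by total.

    `sections` is a list of (section_kind, plain_text) pairs.
    Returns (language, score) pairs sorted from highest to lowest score.
    """
    doc_scores = defaultdict(float)
    for kind, text in sections:
        weight = SECTION_WEIGHTS.get(kind, 1.0)  # default weight is an assumption
        for lang, score in score_section(text).items():
            doc_scores[lang] += weight * score
    return sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `score_document([("title", "bonjour le monde"), ("body", "the language"), ("footer", "the")])` returns the languages ordered by total score, and the first entry is the candidate primary language. A real system would likely also score phrases and character n-grams, as the abstract mentions, rather than single words only.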
Publication No.: WO2012050743(A3)  Publication date: 2012.06.21
Application No.: WO2011US52133  Filing date: 2011.09.19
Applicant: MICROSOFT CORPORATION  Inventors: LI, KANG; KLODER, STEPHEN ALLEN; JOHNSON, IAN GEORGE; ALONICHAU, SIARHEI
Classification: G06F17/21; G06F9/44; G06F17/28  Main classification: G06F17/21
Agency:  Agent:
Principal claim:
Address: