Title: Systems and methods for language detection
Abstract: Implementations of the present disclosure are directed to a method, a system, and a computer program storage device for detecting a language in a text message. A plurality of different language detection tests are performed on a message associated with a user. Each language detection test determines a set of scores representing a likelihood that the message is in one of a plurality of different languages. One or more combinations of the score sets are provided as input to one or more distinct classifiers. Output from each of the classifiers includes a respective indication that the message is in one of the different languages. The language in the message may be identified as being the indicated language from one of the classifiers, based on a confidence score and/or an identified linguistic domain.
Publication number: US9372848(B2) Publication date: 2016.06.21
Application number: US201414517183 Filing date: 2014.10.17
Applicant: Machine Zone, Inc. Inventors: Bojja Nikhil; Wang Pidong; Linder Fredrik; Puzon Bartlomiej
Classification: G06F17/27; G06F17/28 Main classification: G06F17/27
Agency: Goodwin Procter LLP Agent: Goodwin Procter LLP
Main claim: 1. A computer-implemented method of identifying a language of a message, the method comprising: training a first classifier using training data comprising collections of first score sets from different language detection tests and an indication of the correct language for each collection of score sets wherein each first score set comprises a plurality of respective scores each representing a likelihood that a respective first message is in one of a plurality of different languages; performing a plurality of the language detection tests on text in a message authored by a user, each language detection test determining a respective set of scores, each score in the set of scores representing a likelihood that the message is in a respective language of the plurality of different languages; providing one or more combinations of the score sets as input to one or more distinct classifiers including the first classifier; obtaining as output from each of the one or more classifiers a respective indication that the message is in one of the plurality of different languages, the indication comprising a confidence score; and identifying the language of the message based on one of the confidence scores.
Address: Palo Alto CA US
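
The flow recited in the main claim (several detection tests, each producing a score set; combinations of score sets fed to trained classifiers; the final language chosen by confidence score) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the two toy tests (`dictionary_test`, `alphabet_test`), the word and character lists, the nearest-centroid classifier, the softmax-style confidence, and the training sentences are all hypothetical stand-ins for whatever tests and classifiers an actual implementation would use.

```python
import math
from collections import defaultdict

LANGS = ["en", "es", "fr"]

# Hypothetical word and character lists for two toy detection tests.
STOPWORDS = {
    "en": {"the", "is", "and", "you", "are", "where"},
    "es": {"el", "la", "es", "y", "donde", "estas"},
    "fr": {"le", "la", "est", "et", "vous", "ou"},
}
MARK_CHARS = {"en": set("wk"), "es": set("ñáíó¿¡"), "fr": set("çèêàù")}


def normalize(raw):
    """Turn raw counts into scores that sum to 1 (uniform if no evidence)."""
    total = sum(raw.values())
    if total == 0:
        return {lang: 1 / len(LANGS) for lang in LANGS}
    return {lang: raw[lang] / total for lang in LANGS}


def dictionary_test(text):
    """Score set based on how many words appear in each stopword list."""
    words = text.lower().split()
    return normalize({l: sum(w in STOPWORDS[l] for w in words) for l in LANGS})


def alphabet_test(text):
    """Score set based on characters that are typical of each language."""
    chars = text.lower()
    return normalize({l: sum(c in MARK_CHARS[l] for c in chars) for l in LANGS})


TESTS = {"dictionary": dictionary_test, "alphabet": alphabet_test}


def features(text, test_names):
    """Concatenate the score sets of the selected tests into one vector."""
    return [TESTS[name](text)[lang] for name in test_names for lang in LANGS]


class CentroidClassifier:
    """Nearest-centroid classifier over a chosen combination of score sets."""

    def __init__(self, test_names):
        self.test_names = test_names
        self.centroids = {}

    def train(self, examples):
        """examples: (message, correct language) pairs."""
        grouped = defaultdict(list)
        for text, lang in examples:
            grouped[lang].append(features(text, self.test_names))
        for lang, vecs in grouped.items():
            self.centroids[lang] = [sum(col) / len(vecs) for col in zip(*vecs)]

    def predict(self, text):
        """Return (language, confidence); confidence is a softmax-style weight."""
        vec = features(text, self.test_names)
        weights = {l: math.exp(-math.dist(vec, c)) for l, c in self.centroids.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


def identify_language(text, classifiers):
    """Pick the indication with the highest confidence across classifiers."""
    return max((clf.predict(text) for clf in classifiers), key=lambda r: r[1])


# Hypothetical labeled training messages.
TRAINING = [
    ("where are you", "en"), ("the work is weak", "en"),
    ("¿dónde estás?", "es"), ("el niño es pequeño y la niña", "es"),
    ("où est le garçon", "fr"), ("vous et le café", "fr"),
]

# Two classifiers, each fed a different combination of score sets.
classifiers = [
    CentroidClassifier(["dictionary", "alphabet"]),
    CentroidClassifier(["dictionary"]),
]
for clf in classifiers:
    clf.train(TRAINING)

language, confidence = identify_language("el gato es negro y la casa es grande", classifiers)
print(language, round(confidence, 3))
```

Selecting the single highest-confidence output is only one way to read "identifying the language of the message based on one of the confidence scores". The abstract also mentions choosing a classifier by an identified linguistic domain, which this sketch does not attempt.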