Title of invention |
Language segmentation of multilingual texts |
Abstract |
The claimed subject matter provides a system and/or method for segmenting a multi-language text. An exemplary method comprises determining an initial probability distribution for sentences in the multi-language text, the initial probability distribution indicating the likelihood of each sentence being in each of a set of languages. A probability of language transitions across sentences may be learned based on the initial probability distribution. Additionally, a highest probability language sequence of sentences in the multi-language text may be determined based on a combination of the probability of language transitions and the prior probability distribution provided by an initial model. Further, web documents are annotated at a sentence by sentence level such that each sentence of a web document is labeled in a given language according to the highest probability language determined. |
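The decoding step described in the abstract can be illustrated with a standard Viterbi search over per-sentence language labels. This is a minimal sketch and not the patented implementation: the function name `viterbi_languages`, the dictionary-based inputs, and the choice to combine the initial model's per-sentence probabilities with the transition probabilities by simple multiplication (addition in log space) are all assumptions made for illustration.

```python
import math


def viterbi_languages(sentence_probs, transition, languages):
    """Return the most probable language label for each sentence.

    sentence_probs: list of dicts {lang: probability} from an initial model
                    (hypothetical input format).
    transition:     dict {(prev_lang, next_lang): probability}.
    languages:      list of candidate language codes.
    """
    if not sentence_probs:
        return []

    def log(p):
        # Work in log space; treat zero probability as impossible.
        return math.log(p) if p > 0 else float("-inf")

    # Best log score of any path ending in each language at the first sentence.
    scores = {lang: log(sentence_probs[0][lang]) for lang in languages}
    backpointers = []

    for probs in sentence_probs[1:]:
        new_scores, pointers = {}, {}
        for cur in languages:
            # Pick the predecessor language that maximizes the path score.
            best_prev = max(
                languages,
                key=lambda prev: scores[prev] + log(transition[(prev, cur)]),
            )
            new_scores[cur] = (
                scores[best_prev]
                + log(transition[(best_prev, cur)])
                + log(probs[cur])
            )
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final language.
    path = [max(languages, key=lambda lang: scores[lang])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    langs = ["en", "fr"]
    probs = [
        {"en": 0.9, "fr": 0.1},
        {"en": 0.45, "fr": 0.55},  # ambiguous sentence
        {"en": 0.9, "fr": 0.1},
    ]
    sticky = {("en", "en"): 0.9, ("en", "fr"): 0.1,
              ("fr", "en"): 0.1, ("fr", "fr"): 0.9}
    print(viterbi_languages(probs, sticky, langs))
```

In the usage example, the middle sentence leans slightly toward French in isolation, but the transition probabilities make a switch away from English and back again less likely than staying in English throughout.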
Publication number |
US9400787(B2) |
Publication date |
2016.07.26 |
Application number |
US201314073036 |
Filing date |
2013.11.06 |
Applicant |
Microsoft Technology Licensing, LLC |
Inventor |
Aue Anthony |
Classification |
G06F17/28;G06F17/27 |
Main classification |
G06F17/28 |
Agency |
|
Agent |
Wight Steve;Swain Sandy;Minhas Micky |
Independent claim |
1. A method of segmenting a multi-language text, comprising:
determining, using a processing unit, an initial probability distribution for sentences in a web document in the multi-language text, the initial probability distribution indicating the likelihood of each sentence being in each of a set of languages;
learning, using the processing unit, a probability of language transitions across sentences based on the initial probability distribution;
determining, using the processing unit, a highest probability language sequence of sentences in the multi-language text based on a combination of the probability of language transitions and a prior probability distribution provided by an initial model; and
annotating web documents at a sentence by sentence level such that each sentence of a web document is labeled in a given language according to the highest probability language determined. |
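The "learning" step of the claim, in which transition probabilities are derived from the initial probability distribution, could be approximated as follows. The claim does not state an estimation procedure, so this sketch assumes one: it accumulates soft counts from the initial model's distributions over adjacent sentence pairs and normalizes each row. The function name `estimate_transitions`, the `smoothing` parameter, and the input layout are hypothetical.

```python
def estimate_transitions(documents, languages, smoothing=1e-3):
    """Estimate P(next_lang | prev_lang) from soft per-sentence distributions.

    documents: list of documents, each a list of dicts {lang: probability}
               produced by an initial model (hypothetical input format).
    languages: list of candidate language codes.
    smoothing: small pseudo-count so that no transition has zero probability.
    """
    # Start every (prev, next) pair at the smoothing pseudo-count.
    counts = {(a, b): smoothing for a in languages for b in languages}

    for sentence_probs in documents:
        # Walk over adjacent sentence pairs within one document.
        for prev, cur in zip(sentence_probs, sentence_probs[1:]):
            for a in languages:
                for b in languages:
                    # Expected count of the a -> b transition, assuming the
                    # two sentence distributions are independent.
                    counts[(a, b)] += prev[a] * cur[b]

    # Normalize each row so the outgoing probabilities sum to 1.
    transition = {}
    for a in languages:
        row_total = sum(counts[(a, b)] for b in languages)
        for b in languages:
            transition[(a, b)] = counts[(a, b)] / row_total
    return transition


if __name__ == "__main__":
    docs = [[{"en": 0.9, "fr": 0.1}] * 3]
    table = estimate_transitions(docs, ["en", "fr"])
    for pair, prob in sorted(table.items()):
        print(pair, round(prob, 3))
```

A fuller treatment would likely re-estimate these counts iteratively, for example with an expectation-maximization loop over a hidden Markov model, but a single pass is enough to show how transition probabilities can be derived from the initial distribution alone.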
Address |
Redmond WA US |