Title Filled translation for bootstrapping language understanding of low-resourced languages
Abstract Annotated training data (e.g., sentences) in a first language are used to generate annotated training data for a second language. For example, annotated sentences in English are manually collected first and are then used to generate annotated sentences in Chinese. The annotated training data includes slot labels, slot values, and carrier phrases. The carrier phrases are the portions of the training data that are outside of a slot. The carrier phrases are translated from the first language to one or more translations in the second language. The translations may include machine translations as well as human translations. Entities for the slot values are determined for the translated sentences using content sources that include locale-dependent entities. The determined entities are used to fill the slots in the translations of the second language. All or a portion of the resulting sentences may be used for training models in the second language.
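The split between carrier phrases and slots described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the inline `<label>value</label>` annotation format, the `{label}` abstract-token format, and the function name `abstract_slots` are all assumptions introduced here, since the record does not specify how annotations are encoded.

```python
import re

# Assumed annotation format (hypothetical): <slot_label>slot value</slot_label>
SLOT_PATTERN = re.compile(r"<(?P<label>\w+)>(?P<value>.*?)</(?P=label)>")


def abstract_slots(annotated_sentence):
    """Replace each annotated slot with an abstract token such as {city}.

    Returns the abstract sentence (carrier phrases plus tokens) and the
    list of (slot label, slot value) pairs that were removed.
    """
    slots = []

    def _to_token(match):
        slots.append((match.group("label"), match.group("value")))
        return "{" + match.group("label") + "}"

    abstract_sentence = SLOT_PATTERN.sub(_to_token, annotated_sentence)
    return abstract_sentence, slots


sentence = "book a table at <restaurant>Canlis</restaurant> in <city>Seattle</city>"
abstract_sentence, slots = abstract_slots(sentence)
print(abstract_sentence)
print(slots)
```

Only the text outside the tokens (the carrier phrases) would then be passed to translation, so the English slot values never need to be translated.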
Publication number US9613027(B2) Publication date 2017.04.04
Application number US201314074358 Filing date 2013.11.07
Applicant Microsoft Technology Licensing, LLC Inventors Hwang Mei-Yuh; Ni Yong
Classification G06F17/28;G06F17/00;G06F17/20;G06F17/21;G06F9/44;G06Q10/00;G06Q50/00;G10L15/00;G10L13/00 Main classification G06F17/28
Agency Agent
Principal claim 1. A computer-implemented method, performed by at least one processor, for using training data in a first language to create training data in a second language, comprising: accessing the training data in the first language that include sentences that each comprises one or more carrier phrases, and one or more slot labels with slot values; performing slot abstraction on at least a portion of the training data to create a first plurality of abstract sentences that each comprises one or more carrier phrases, and one or more abstract tokens that replace the slot labels and the slot values; translating at least partially through machine translation the carrier phrases to the second language to generate a second plurality of abstract sentences in the second language; accessing a database of a plurality of locale-dependent entities based on a locale corresponding to the second language; replacing each of abstract tokens in the second plurality of abstract sentences in the second language with multiple locale-dependent entities from the plurality of locale-dependent entities for the slot type, in order to create a plurality of filled translated sentences for inclusion in the training data in the second language; training a locale-dependent statistical model based on the training data in the second language; and recognizing speech in the second language based on the locale-dependent statistical model.
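The slot-filling step of the claim (replacing each abstract token in a translated abstract sentence with multiple locale-dependent entities) might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the `{label}` token format, the `<label>value</label>` output annotation, the in-memory dictionary standing in for the locale-dependent entity database, the `per_slot` cap, and the hand-written Chinese sentence standing in for a translation. The translation, training, and speech-recognition steps are not shown.

```python
import re
from itertools import product

# Assumed abstract-token format (hypothetical): {slot_label}
TOKEN_PATTERN = re.compile(r"\{(\w+)\}")


def fill_slots(abstract_translation, entities_by_slot, per_slot=2):
    """Fill every abstract token with up to `per_slot` locale entities.

    `entities_by_slot` is a stand-in for the locale-dependent entity
    database: it maps a slot label to a list of entity strings.
    Each output sentence keeps its slot annotation so that it can be
    used directly as annotated training data.
    """
    labels = TOKEN_PATTERN.findall(abstract_translation)
    choices = [entities_by_slot[label][:per_slot] for label in labels]

    filled_sentences = []
    for combination in product(*choices):
        values = iter(combination)
        filled = TOKEN_PATTERN.sub(
            lambda m: "<{0}>{1}</{0}>".format(m.group(1), next(values)),
            abstract_translation,
        )
        filled_sentences.append(filled)
    return filled_sentences


# Hand-written stand-in for a translated abstract sentence
# ("book a flight to {city}"), not actual machine-translation output.
zh_abstract = "预订去{city}的航班"
zh_entities = {"city": ["北京", "上海", "广州"]}

for filled_sentence in fill_slots(zh_abstract, zh_entities):
    print(filled_sentence)
```

Taking the Cartesian product over slots means a sentence with several tokens yields every combination of the selected entities, which is why the sketch caps the number of entities per slot; the record itself does not state how many entities are used or how they are chosen.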
Address Redmond, WA, US