Title of invention |
Method and apparatus for building a language model |
Abstract |
A method includes: acquiring data samples; performing categorized sentence mining in the acquired data samples to obtain categorized training samples for multiple categories; building a text classifier based on the categorized training samples; classifying the data samples using the text classifier to obtain a class vocabulary and a corpus for each category; mining the corpus for each category according to the class vocabulary for the category to obtain a respective set of high-frequency language templates; training on the templates for each category to obtain a template-based language model for the category; training on the corpus for each category to obtain a class-based language model for the category; training on the class vocabulary for each category to obtain a lexicon-based language model for the category; and building a speech decoder according to an acoustic model, the class-based language model and the lexicon-based language model for any given field, and the data samples. |
Publication number |
US9396724(B2) |
Publication date |
2016.07.19 |
Application number |
US201414181263 |
Filing date |
2014.02.14 |
Applicant |
TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED |
Inventors |
Rao Feng;Lu Li;Chen Bo;Zhang Xiang;Yue Shuai;Li Lu |
Classification codes |
G10L15/06;G10L15/183;G10L15/197 |
Main classification code |
G10L15/06 |
Agency |
Morgan, Lewis & Bockius LLP |
Agent |
Morgan, Lewis & Bockius LLP |
Principal claim |
1. A method of building a speech to text decoder, comprising:
at a device having one or more processors and memory: acquiring data samples for building a language model; performing categorized sentence mining in the acquired data samples to obtain mining results comprising a respective set of sentences obtained through the categorized sentence mining for each of a plurality of categories; obtaining categorized training samples based on the mining results; building a text classifier based on the categorized training samples; classifying the data samples using the text classifier to obtain a respective class vocabulary and a respective training corpus for each of a plurality of categories; mining the respective training corpus for each category according to the respective class vocabulary for the category to obtain a respective set of high-frequency language templates; performing training on the respective set of high-frequency language templates for each category to obtain a respective template-based language model for the category; performing training on the respective training corpus for each category to obtain a respective class-based language model for the category; and performing training on the respective class vocabulary for each category to obtain a respective lexicon-based language model, wherein the respective template-based language model, the respective class-based language model, and the respective lexicon-based language model for a given category are language models for a given field, and the method further comprises:
building the speech to text decoder according to a previously obtained acoustic model, the respective template-based language model, the respective class-based language model and the respective lexicon-based language model for the given field, and the data samples. |
Address |
Shenzhen, Guangdong Province CN |
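
The principal claim recites mining each category's training corpus "according to the respective class vocabulary" to obtain high-frequency language templates, but this record does not say how the mining is done. The sketch below is one possible reading, offered as an illustration only: every token found in the class vocabulary is replaced by a slot marker, and the resulting patterns are kept if they occur often enough. The function name `mine_templates`, the `<ENTITY>` slot marker, the `min_count` threshold, whitespace tokenization, and the sample sentences are all assumptions made for this example and do not come from the patent.

```python
from collections import Counter


def mine_templates(corpus, class_vocabulary, slot="<ENTITY>", min_count=2):
    """Return slot-based sentence patterns that occur at least min_count times.

    Assumption: a "template" is a sentence in which every token belonging to
    the class vocabulary has been replaced by a generic slot marker.
    """
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()  # naive whitespace tokenization
        templated = [slot if tok in class_vocabulary else tok for tok in tokens]
        # Only sentences that actually mention a vocabulary entry form a template.
        if slot in templated:
            counts[" ".join(templated)] += 1
    # Keep only the high-frequency patterns.
    return {tpl: n for tpl, n in counts.items() if n >= min_count}


# Hypothetical corpus and class vocabulary for a "music" category.
corpus = [
    "play thriller now",
    "play yesterday now",
    "play thriller",
    "stop the music",
]
class_vocabulary = {"thriller", "yesterday"}

templates = mine_templates(corpus, class_vocabulary)
print(templates)
```

Under these assumptions, the surviving templates would feed the template-based language model, the full corpus the class-based model, and the class vocabulary the lexicon-based model. A real system would likely need multi-word vocabulary entries and language-appropriate word segmentation, which this sketch leaves out.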