Method and apparatus for building a language model, Application No. US201414181263

Title of invention	Method and apparatus for building a language model
Abstract	A method includes: acquiring data samples; performing categorized sentence mining in the acquired data samples to obtain categorized training samples for multiple categories; building a text classifier based on the categorized training samples; classifying the data samples using the text classifier to obtain a class vocabulary and a corpus for each category; mining the corpus for each category according to the class vocabulary for the category to obtain a respective set of high-frequency language templates; training on the templates for each category to obtain a template-based language model for the category; training on the corpus for each category to obtain a class-based language model for the category; training on the class vocabulary for each category to obtain a lexicon-based language model for the category; building a speech decoder according to an acoustic model, the class-based language model and the lexicon-based language model for any given field, and the data samples.
Publication No.	US9396724(B2)	Publication date	2016.07.19
Application No.	US201414181263	Filing date	2014.02.14
Applicant	TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED	Inventors	Rao Feng;Lu Li;Chen Bo;Zhang Xiang;Yue Shuai;Li Lu
Classification	G10L15/06;G10L15/183;G10L15/197	Main classification	G10L15/06
Agency	Morgan, Lewis & Bockius LLP	Agent	Morgan, Lewis & Bockius LLP
Main claim	1. A method of building a speech to text decoder, comprising: at a device having one or more processors and memory: acquiring data samples for building a language model; performing categorized sentence mining in the acquired data samples to obtain mining results comprising a respective set of sentences obtained through the categorized sentence mining for each of a plurality of categories; obtaining categorized training samples based on the mining results; building a text classifier based on the categorized training samples; classifying the data samples using the text classifier to obtain a respective class vocabulary and a respective training corpus for each of a plurality of categories; mining the respective training corpus for each category according to the respective class vocabulary for the category to obtain a respective set of high-frequency language templates; performing training on the respective set of high-frequency language templates for each category to obtain a respective template-based language model for the category; performing training on the respective training corpus for each category to obtain a respective class-based language model for the category; and performing training on the respective class vocabulary for each category to obtain a respective lexicon-based language model, wherein the respective template-based language model, the respective class-based language model, and the respective lexicon-based language model for a given category are language models for a given field, and the method further comprises: building the speech to text decoder according to a previously obtained acoustic model, the respective template-based language model, the respective class-based language model and the respective lexicon-based language model for the given field, and the data samples.
Address	Shenzhen, Guangdong Province CN
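
The template-mining step named in the abstract and in claim 1 (mining each category's corpus according to its class vocabulary to obtain high-frequency language templates) might be sketched as follows. This record does not say how templates are extracted, so everything below is an assumption made for illustration: the slot-substitution approach, whitespace tokenization, the function name `mine_templates`, the `<CLASS>` slot token, the `min_count` threshold, and the toy corpus and vocabulary.

```python
from collections import Counter


def mine_templates(corpus, class_vocabulary, slot="<CLASS>", min_count=2):
    """Return (template, count) pairs that occur at least min_count times.

    Assumed approach: every token found in the class vocabulary is replaced
    by a slot symbol, and the resulting sentence patterns are counted.
    """
    vocab = set(class_vocabulary)
    counts = Counter()
    for sentence in corpus:
        # Assumption: whitespace tokenization; real data may need a segmenter.
        tokens = sentence.split()
        # Sentences with no class-vocabulary word yield no template.
        if not any(tok in vocab for tok in tokens):
            continue
        template = " ".join(slot if tok in vocab else tok for tok in tokens)
        counts[template] += 1
    # Keep only the high-frequency patterns, most frequent first.
    return [(t, c) for t, c in counts.most_common() if c >= min_count]


# Hypothetical corpus and class vocabulary for a single "music" category.
corpus = [
    "play jazz now",
    "play rock now",
    "play blues now",
    "stop jazz please",
    "what time is it",
]
class_vocabulary = ["jazz", "rock", "blues"]

templates = mine_templates(corpus, class_vocabulary)
```

Under these assumptions, the surviving templates would be the input for training the per-category template-based language model, while the unmodified corpus and the class vocabulary would feed the class-based and lexicon-based models respectively.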