Title of invention: Method and apparatus for building a language model
Abstract: A method includes: acquiring data samples; performing categorized sentence mining in the acquired data samples to obtain categorized training samples for multiple categories; building a text classifier based on the categorized training samples; classifying the data samples using the text classifier to obtain a class vocabulary and a corpus for each category; mining the corpus for each category according to the class vocabulary for the category to obtain a respective set of high-frequency language templates; training on the templates for each category to obtain a template-based language model for the category; training on the corpus for each category to obtain a class-based language model for the category; training on the class vocabulary for each category to obtain a lexicon-based language model for the category; and building a speech decoder according to an acoustic model, the class-based language model and the lexicon-based language model for any given field, and the data samples.
Publication number: US9396724(B2) Publication date: 2016.07.19
Application number: US201414181263 Filing date: 2014.02.14
Applicant: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED Inventors: Rao Feng;Lu Li;Chen Bo;Zhang Xiang;Yue Shuai;Li Lu
Classification: G10L15/06;G10L15/183;G10L15/197 Main classification: G10L15/06
Agency: Morgan, Lewis & Bockius LLP Agent: Morgan, Lewis & Bockius LLP
Principal claim: 1. A method of building a speech to text decoder, comprising: at a device having one or more processors and memory: acquiring data samples for building a language model; performing categorized sentence mining in the acquired data samples to obtain mining results comprising a respective set of sentences obtained through the categorized sentence mining for each of a plurality of categories; obtaining categorized training samples based on the mining results; building a text classifier based on the categorized training samples; classifying the data samples using the text classifier to obtain a respective class vocabulary and a respective training corpus for each of a plurality of categories; mining the respective training corpus for each category according to the respective class vocabulary for the category to obtain a respective set of high-frequency language templates; performing training on the respective set of high-frequency language templates for each category to obtain a respective template-based language model for the category; performing training on the respective training corpus for each category to obtain a respective class-based language model for the category; and performing training on the respective class vocabulary for each category to obtain a respective lexicon-based language model, wherein the respective template-based language model, the respective class-based language model, and the respective lexicon-based language model for a given category are language models for a given field, and the method further comprises: building the speech to text decoder according to a previously obtained acoustic model, the respective template-based language model, the respective class-based language model and the respective lexicon-based language model for the given field, and the data samples.
Address: Shenzhen, Guangdong Province CN
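Illustrative note: the claim recites mining each category's training corpus "according to the respective class vocabulary" to obtain high-frequency language templates, but this record does not say how the mining is performed. The sketch below shows one plausible reading, under stated assumptions: class-vocabulary words in a sentence are replaced by a slot tag, and the resulting patterns are counted and thresholded. The function name `mine_templates`, the `<CLASS>` tag, the `min_count` threshold, whitespace tokenization, and the sample sentences are all hypothetical and do not come from the patent.

```python
from collections import Counter


def mine_templates(corpus, class_vocabulary, slot="<CLASS>", min_count=2):
    """Return {template: count} for templates seen at least min_count times.

    Assumption: a template is a sentence in which every class-vocabulary
    word has been replaced by a slot tag. Tokens are split on whitespace,
    which would need to be swapped for a word segmenter on unsegmented text.
    """
    vocab = set(class_vocabulary)
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        templated = [slot if tok in vocab else tok for tok in tokens]
        # Sentences without any class word produce no template.
        if slot in templated:
            counts[" ".join(templated)] += 1
    return {t: c for t, c in counts.items() if c >= min_count}


# Hypothetical corpus and class vocabulary for a single category.
corpus = [
    "play song by adele",
    "play song by coldplay",
    "play something by adele",
    "stop the music",
]
templates = mine_templates(corpus, ["adele", "coldplay"])
print(templates)
```

In this reading, the surviving templates would serve as training data for the template-based language model, while the class vocabulary itself would feed the lexicon-based model that fills the slot.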