Title of invention: DISCRIMINATIVE DATA SELECTION FOR LANGUAGE MODELING
Abstract: A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus. The computer system may derive feature vectors from the unspeakable corpus and train a classifier to perform discriminative data selection for language modeling based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.
Publication number: US2016336006(A1) Publication date: 2016.11.17
Application number: US201514711447 Filing date: 2015.05.13
Applicant: Microsoft Technology Licensing, LLC Inventors: Levit Michael; Chang Shuangyu; Dumoulin Benoit
Classification: G10L15/06; G10L15/10; G10L15/14; G10L15/18 Main classification: G10L15/06
Agency: Agent:
Principal claim: 1. A computer system for language modeling, the computer system comprising: a processor configured to execute computer-executable instructions; and memory storing computer-executable instructions configured to: collect training data from one or more information sources; generate a spoken corpus containing text of transcribed speech; generate a typed corpus containing typed text; derive feature vectors from the spoken corpus; analyze the typed corpus to determine feature vectors representing items of typed text; generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus; derive feature vectors from the unspeakable corpus; and train a classifier based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.
Address: Redmond WA US
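
The steps recited in the principal claim (derive feature vectors, filter the typed corpus against the spoken corpus by a similarity threshold, then train a classifier on spoken versus unspeakable vectors) can be illustrated with a minimal sketch. The record does not specify the feature type, the similarity measure, or the classifier, so everything concrete here is an assumption made for illustration: bag-of-words counts as features, cosine similarity as the measure, a perceptron as the classifier, and all function names (`featurize`, `cosine`, `build_unspeakable`, `train_perceptron`, `is_speakable`) and sample sentences are hypothetical.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words count vector (an assumed feature type, not from the record)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_unspeakable(typed_corpus, spoken_vectors, threshold):
    """Keep only typed items that are NOT within the threshold of any spoken vector."""
    kept = []
    for item in typed_corpus:
        vec = featurize(item)
        if all(cosine(vec, s) < threshold for s in spoken_vectors):
            kept.append(item)
    return kept


def train_perceptron(positives, negatives, epochs=10):
    """Train a simple perceptron (assumed classifier): +1 spoken, -1 unspeakable."""
    weights, bias = Counter(), 0.0
    data = [(v, 1) for v in positives] + [(v, -1) for v in negatives]
    for _ in range(epochs):
        for vec, label in data:
            score = bias + sum(weights[t] * c for t, c in vec.items())
            if label * score <= 0:  # misclassified or on the boundary: update
                for t, c in vec.items():
                    weights[t] += label * c
                bias += label
    return weights, bias


def is_speakable(text, weights, bias):
    """Classify a new item of text with the trained weights."""
    vec = featurize(text)
    return bias + sum(weights[t] * c for t, c in vec.items()) > 0


# Hypothetical toy corpora, for illustration only.
spoken_corpus = ["call mom on her cell phone", "what is the weather today"]
typed_corpus = [
    "what is the weather today",
    "http www example com index html",
    "asdf qwerty zxcv",
]

spoken_vectors = [featurize(s) for s in spoken_corpus]
unspeakable_corpus = build_unspeakable(typed_corpus, spoken_vectors, threshold=0.5)
unspeakable_vectors = [featurize(u) for u in unspeakable_corpus]
weights, bias = train_perceptron(spoken_vectors, unspeakable_vectors)

print(unspeakable_corpus)
print(is_speakable("call the weather phone", weights, bias))
print(is_speakable("www example com", weights, bias))
```

In this sketch the trained classifier could then be used to score candidate text, keeping items classified as speakable as language-model training data. The threshold value of 0.5 is arbitrary; a real system would presumably tune it.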