Title: Maximum likelihood channel normalization
Abstract: Features are disclosed for applying maximum likelihood methods to channel normalization in automatic speech recognition (“ASR”). Feature vectors computed from an audio input of a user utterance can be compared to a Gaussian mixture model. The Gaussian that corresponds to each feature vector can be determined, and statistics (e.g., constrained maximum likelihood linear regression statistics) can then be accumulated for each feature vector. Using these statistics, or some subset thereof, offsets and/or a diagonal transform matrix can be computed for each feature vector. The offsets and/or diagonal transform matrix can be applied to the corresponding feature vector to generate a feature vector normalized based on maximum likelihood methods. The ASR process can then proceed using the transformed feature vectors.
Publication number: US9378729(B1) Publication date: 2016.06.28
Application number: US201313797662 Filing date: 2013.03.12
Applicant: Amazon Technologies, Inc. Inventor: Salvador Stan Weidner
Classification: G10L15/04;G10L15/02;G10L15/20;G10L15/14;G10L15/08 Main classification: G10L15/04
Agency: Knobbe, Martens, Olson & Bear, LLP Agent: Knobbe, Martens, Olson & Bear, LLP
Main claim: 1. A system comprising: a computer-readable memory storing executable instructions; and one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least: receive a stream of audio data regarding an utterance of a user; calculate a first feature vector based at least partly on a first frame of the audio data; perform a comparison of a Gaussian mixture model to the first feature vector; identify a first Gaussian of the Gaussian mixture model based at least partly on the comparison of the Gaussian mixture model to the first feature vector; compute a first likelihood based at least partly on the first feature vector and the first Gaussian; generate first updated speech recognition statistics based at least partly on speech recognition statistics and the first likelihood computed based at least partly on the first feature vector and the first Gaussian; generate a first updated feature-vector transform based at least partly on a feature-vector transform and the first updated speech recognition statistics; generate a first normalized feature vector based on the first updated feature-vector transform and the first feature vector; calculate a second feature vector based at least partly on a second frame of the audio data; perform a comparison of the Gaussian mixture model to the second feature vector; identify a second Gaussian of the Gaussian mixture model based at least partly on the comparison of the Gaussian mixture model to the second feature vector; compute a second likelihood based at least partly on the second feature vector and the second Gaussian; and generate second updated speech recognition statistics based on the first updated speech recognition statistics and the second likelihood computed based at least partly on the second feature vector and the second Gaussian; subsequent to generating the first normalized feature vector: generate a second updated feature-vector transform based on the first updated feature-vector transform, a time associated with the first frame, and the second updated speech recognition statistics; and generate a second normalized feature vector based on the second updated feature-vector transform and the second feature vector.
Address: Seattle, WA, US
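
Illustrative sketch (not part of the patent record): the per-frame loop described in the abstract and main claim could look roughly like the Python below. It is a minimal sketch under several assumptions, not the patented implementation. It estimates only a per-dimension offset (the abstract also allows a diagonal transform matrix), assigns each frame to its single most likely Gaussian instead of weighting the statistics by the likelihood, and uses a hypothetical `decay` factor as a stand-in for the time dependence mentioned in the claim. The class name `OnlineOffsetNormalizer` and all parameter names are invented for illustration.

```python
import math


class OnlineOffsetNormalizer:
    """Online maximum likelihood offset against a diagonal-covariance GMM (sketch)."""

    def __init__(self, means, variances, weights, decay=1.0):
        self.means = [list(m) for m in means]          # K Gaussians x D dimensions
        self.variances = [list(v) for v in variances]  # diagonal covariances
        self.log_weights = [math.log(w) for w in weights]
        self.decay = decay                             # assumed forgetting factor
        dim = len(self.means[0])
        self.num = [0.0] * dim     # accumulated (mean - x) / variance
        self.den = [0.0] * dim     # accumulated 1 / variance
        self.offset = [0.0] * dim  # current per-dimension offset

    def _log_likelihood(self, x, k):
        """Log of weight times Gaussian density for component k."""
        total = self.log_weights[k]
        for xd, mu, var in zip(x, self.means[k], self.variances[k]):
            total -= 0.5 * (math.log(2.0 * math.pi * var) + (xd - mu) ** 2 / var)
        return total

    def normalize(self, x):
        """Update the statistics with one frame and return the normalized frame."""
        # Compare the frame to the GMM and keep the most likely Gaussian.
        k = max(range(len(self.means)), key=lambda j: self._log_likelihood(x, j))
        for d, xd in enumerate(x):
            precision = 1.0 / self.variances[k][d]
            # Decay the old statistics, then add this frame's contribution.
            self.num[d] = self.decay * self.num[d] + precision * (self.means[k][d] - xd)
            self.den[d] = self.decay * self.den[d] + precision
            # ML offset for this dimension given the accumulated statistics.
            self.offset[d] = self.num[d] / self.den[d]
        return [xd + b for xd, b in zip(x, self.offset)]


# Two well-separated Gaussians; every frame carries a constant channel shift.
normalizer = OnlineOffsetNormalizer(
    means=[[0.0, 0.0], [10.0, 10.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
    weights=[0.5, 0.5],
    decay=0.9,
)
first = normalizer.normalize([1.0, -1.0])    # clean frame [0, 0] shifted by [1, -1]
second = normalizer.normalize([11.0, 9.0])   # clean frame [10, 10] with the same shift
```

In this toy setup the estimated offset cancels the constant shift, so each normalized frame lands back on the mean of the Gaussian it was assigned to. With real speech features, the estimate would only approach the channel offset gradually as frames accumulate.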