Title of invention: Online incremental adaptation of deep neural networks using auxiliary Gaussian mixture models in speech recognition
Abstract: Methods and systems for online incremental adaptation of neural networks using Gaussian mixture models in speech recognition are described. In an example, a computing device may be configured to receive an audio signal and a subsequent audio signal, both signals having speech content. The computing device may be configured to apply a speaker-specific feature transform to the audio signal to obtain a transformed audio signal. The speaker-specific feature transform may be configured to include speaker-specific speech characteristics of a speaker-profile relating to the speech content. Further, the computing device may be configured to process the transformed audio signal using a neural network trained to estimate a respective speech content of the audio signal. Based on outputs of the neural network, the computing device may be configured to modify the speaker-specific feature transform, and apply the modified speaker-specific feature transform to a subsequent audio signal.
Publication number: US9466292(B1)  Publication date: 2016.10.11
Application number: US201313886620  Filing date: 2013.05.03
Applicant: Google Inc.  Inventors: Lei Xin;Aleksic Petar
Classification: G10L15/00;G10L15/16  Main classification: G10L15/00
Agency: McDonnell Boehnen Hulbert & Berghoff LLP  Agent: McDonnell Boehnen Hulbert & Berghoff LLP
Principal claim: 1. A method comprising: receiving, by a computing device, a sequence of consecutive audio signals, the sequence including a first audio signal at a first time and a second audio signal at a second time later than the first time, wherein the first audio signal and the second audio signal include speech content; selecting, by the computing device, a speaker-specific profile based on speech characteristics determined from one or more of the audio signals of the sequence; applying a speaker-specific feature transform to the first audio signal to obtain a transformed audio signal, wherein the speaker-specific feature transform is determined based on one or more speaker-specific speech characteristics of the selected speaker-profile; processing, by the computing device, the transformed audio signal using a neural network trained to estimate a given speech content of the first audio signal; modifying the speaker-specific feature transform based on an output of the neural network to obtain a modified speaker-specific feature transform; and applying the modified speaker-specific feature transform to the second audio signal to obtain a respective transformed audio signal to be processed by the neural network to estimate a respective speech content of the second audio signal.
Address: Mountain View, CA, US
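
Illustrative sketch: the loop recited in the principal claim (transform the first signal, run the network, modify the transform from the network output, apply the modified transform to the next signal) might be outlined as below. This is not the patented implementation. Every name here (`BiasTransform`, `nearest_mean_classifier`, `update_transform`, `recognize_sequence`) is hypothetical. The sketch assumes a bias-only transform rather than a full affine one, a nearest-mean classifier standing in for the trained neural network, and a single mean per class standing in for the auxiliary Gaussian mixture models named in the title.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[float]


@dataclass
class BiasTransform:
    """Hypothetical speaker-specific transform: a per-dimension offset."""
    bias: List[float]
    frames_seen: int = 0  # number of frames already folded into the estimate

    def apply(self, frames: Sequence[Frame]) -> List[Frame]:
        return [[x + b for x, b in zip(frame, self.bias)] for frame in frames]


def nearest_mean_classifier(class_means: Sequence[Frame]) -> Callable[[Frame], int]:
    """Stand-in for the trained network: index of the closest class mean."""
    def classify(frame: Frame) -> int:
        dists = [sum((x - m) ** 2 for x, m in zip(frame, mean))
                 for mean in class_means]
        return dists.index(min(dists))
    return classify


def update_transform(transform: BiasTransform,
                     raw_frames: Sequence[Frame],
                     labels: Sequence[int],
                     class_means: Sequence[Frame]) -> BiasTransform:
    """Assumed update rule: running mean of (class mean - raw frame).

    The labels come from the classifier output, so the transform is
    modified based on what the network estimated for this signal.
    """
    bias = list(transform.bias)
    total = transform.frames_seen
    for frame, label in zip(raw_frames, labels):
        total += 1
        target = class_means[label]
        for d in range(len(bias)):
            residual = target[d] - frame[d]
            bias[d] += (residual - bias[d]) / total
    return BiasTransform(bias=bias, frames_seen=total)


def recognize_sequence(signals: Sequence[Sequence[Frame]],
                       transform: BiasTransform,
                       classify: Callable[[Frame], int],
                       class_means: Sequence[Frame]
                       ) -> Tuple[List[List[int]], BiasTransform]:
    """Process consecutive signals, adapting the transform after each one."""
    results: List[List[int]] = []
    for raw_frames in signals:
        transformed = transform.apply(raw_frames)       # apply current transform
        labels = [classify(f) for f in transformed]     # network estimate
        results.append(labels)
        # The modified transform is used for the next signal in the sequence.
        transform = update_transform(transform, raw_frames, labels, class_means)
    return results, transform
```

In this sketch the update is computed from the raw (untransformed) frames, so the bias estimate does not compound with itself across signals. Calling `recognize_sequence` with a list of signals, each a list of feature frames, returns the per-frame labels and the adapted transform.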