Title of invention |
Online incremental adaptation of deep neural networks using auxiliary Gaussian mixture models in speech recognition |
Abstract |
Methods and systems for online incremental adaptation of neural networks using Gaussian mixture models in speech recognition are described. In an example, a computing device may be configured to receive an audio signal and a subsequent audio signal, both signals having speech content. The computing device may be configured to apply a speaker-specific feature transform to the audio signal to obtain a transformed audio signal. The speaker-specific feature transform may be configured to include speaker-specific speech characteristics of a speaker-profile relating to the speech content. Further, the computing device may be configured to process the transformed audio signal using a neural network trained to estimate a respective speech content of the audio signal. Based on outputs of the neural network, the computing device may be configured to modify the speaker-specific feature transform, and apply the modified speaker-specific feature transform to the subsequent audio signal. |
Publication number |
US9466292(B1) |
Publication date |
2016.10.11 |
Application number |
US201313886620 |
Filing date |
2013.05.03 |
Applicant |
Google Inc. |
Inventors |
Lei Xin;Aleksic Petar |
Classification codes |
G10L15/00;G10L15/16 |
Main classification code |
G10L15/00 |
Agency |
McDonnell Boehnen Hulbert & Berghoff LLP |
Agent |
McDonnell Boehnen Hulbert & Berghoff LLP |
Independent claim |
1. A method comprising:
receiving, by a computing device, a sequence of consecutive audio signals, the sequence including a first audio signal at a first time and a second audio signal at a second time later than the first time, wherein the first audio signal and the second audio signal include speech content; selecting, by the computing device, a speaker-specific profile based on speech characteristics determined from one or more of the audio signals of the sequence; applying a speaker-specific feature transform to the first audio signal to obtain a transformed audio signal, wherein the speaker-specific feature transform is determined based on one or more speaker-specific speech characteristics of the selected speaker-profile; processing, by the computing device, the transformed audio signal using a neural network trained to estimate a given speech content of the first audio signal; modifying the speaker-specific feature transform based on an output of the neural network to obtain a modified speaker-specific feature transform; and applying the modified speaker-specific feature transform to the second audio signal to obtain a respective transformed audio signal to be processed by the neural network to estimate a respective speech content of the second audio signal. |
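The loop recited in the claim (transform, process, modify the transform, apply it to the next signal) can be illustrated with a minimal sketch. This record does not give the update formula, so everything below is an assumption made for illustration: the transform is reduced to a bias-only shift, the "neural network" is a toy stand-in (`toy_network`), and the auxiliary Gaussian mixture model is simplified to one unit-covariance Gaussian per class with known means. The names `BiasTransform`, `toy_network`, `means`, and `offset` are hypothetical and do not come from the patent.

```python
import numpy as np


class BiasTransform:
    """Hypothetical speaker-specific feature transform x -> x + b.

    Bias-only is a simplifying assumption; the record does not specify
    the form of the transform.
    """

    def __init__(self, dim):
        self.b = np.zeros(dim)
        self.sum_resid = np.zeros(dim)  # running sum of (target - feature)
        self.count = 0                  # number of frames seen so far

    def apply(self, feats):
        return feats + self.b

    def update(self, feats, posteriors, means):
        # With unit-covariance Gaussians, the maximum-likelihood bias is the
        # average of (posterior-weighted class mean - untransformed feature).
        targets = posteriors @ means
        self.sum_resid += (targets - feats).sum(axis=0)
        self.count += feats.shape[0]
        self.b = self.sum_resid / self.count


def toy_network(feats, means):
    """Stand-in for the trained network: softmax over negative squared
    distances to the class means. Returns per-frame class posteriors."""
    d2 = ((feats[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    logits = -0.5 * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
means = np.array([[-3.0, -3.0], [3.0, 3.0]])  # assumed auxiliary GMM means
offset = np.array([1.0, -1.0])                # assumed speaker-specific shift


def make_utterance(n_frames):
    """Synthetic utterance: class means plus noise plus the speaker shift."""
    labels = rng.integers(0, len(means), size=n_frames)
    return means[labels] + 0.5 * rng.standard_normal((n_frames, 2)) + offset


transform = BiasTransform(dim=2)
for _ in range(2):  # first audio signal, then second audio signal
    feats = make_utterance(200)
    transformed = transform.apply(feats)             # apply current transform
    posteriors = toy_network(transformed, means)     # network output
    transform.update(feats, posteriors, means)       # modify the transform
```

In this sketch the statistics accumulate across utterances, so the transform applied to the second signal already reflects what was learned from the first; with the synthetic data above, `transform.b` should move toward `-offset`.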
Address |
Mountain View CA US |