摘要 |
A speech processing method, comprising: receiving a speech input which comprises a sequence of feature vectors; determining the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising: providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and adapting the acoustic model to the mismatched speech input, the speech processing method further comprising determining the likelihood of a sequence of features occurring in a given language using a language model; and combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein adapting the acoustic model to the mismatched speaker input comprises: relating speech from the mismatched speaker input to the speech used to train the acoustic model using: a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that: y=f(F(x,v),u) where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and jointly estimating u and v.
|