Abstract |
A text-to-speech (TTS) system is trained according to a linear dynamic model (LDM). Text is converted to a sequence of linguistic units (e.g. phonemes or sub-phonemes), and each state of each unit is looked up in an acoustic model table to produce a sequence of speech vectors. Before being output as speech, this sequence is adjusted to increase the variance of the speech vectors vi(d) based on a predefined global variance v. A predefined number T of hidden vectors xt evolve according to a state equation involving an observation matrix H, a state transformation matrix F, covariance matrices Q and R, and mean vectors m. Second-order LDMs may be constrained to be critically damped towards a target q, and speech parameter trajectories Y may be calculated according to a steepest ascent method. |
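
The quantities named in the abstract can be illustrated with a minimal sketch. It is not the claimed method: the abstract does not give the exact equations, so the forms used here are assumptions. Specifically, the sketch assumes a state equation x_t = F x_{t-1} + m + w_t with w_t ~ N(0, Q), an observation equation y_t = H x_t + e_t with e_t ~ N(0, R), a simple per-dimension rescaling as a stand-in for the variance adjustment, and a scalar double-pole recursion with hypothetical pole `a` as one way to obtain critically damped motion towards a target q. All function names are illustrative.

```python
import numpy as np


def simulate_ldm(F, H, m, Q, R, x0, T, rng):
    """Sample T observations from an assumed LDM:
    x_t = F x_{t-1} + m + w_t,  y_t = H x_t + e_t."""
    x = np.asarray(x0, dtype=float)
    ys = []
    for _ in range(T):
        w = rng.multivariate_normal(np.zeros(x.shape[0]), Q)  # state noise
        x = F @ x + m + w
        e = rng.multivariate_normal(np.zeros(H.shape[0]), R)  # observation noise
        ys.append(H @ x + e)
    return np.array(ys)  # shape (T, observation dimension)


def scale_to_global_variance(Y, v):
    """Assumed stand-in for the variance adjustment: rescale each
    dimension of Y about its mean so that its variance matches v."""
    mean = Y.mean(axis=0)
    var = Y.var(axis=0)
    scale = np.sqrt(v / np.maximum(var, 1e-12))  # guard against zero variance
    return mean + (Y - mean) * scale


def critically_damped_trajectory(x0, q, a, T):
    """Noise-free scalar second-order recursion with a double pole at a
    (0 < a < 1), started at rest at x0; it approaches q without oscillating."""
    prev2 = prev1 = float(x0)
    out = []
    for _ in range(T):
        x = 2 * a * prev1 - a * a * prev2 + (1 - a) ** 2 * q
        out.append(x)
        prev2, prev1 = prev1, x
    return np.array(out)
```

For example, with a two-dimensional hidden state and a one-dimensional observation, `simulate_ldm` returns an array of shape `(T, 1)`, and passing that array to `scale_to_global_variance` with a chosen `v` yields a trajectory whose per-dimension variance equals `v` while its mean is unchanged.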