Abstract
A method of deriving speech synthesis parameters from an input speech audio signal, wherein the audio signal is segmented on the basis of estimated positions of glottal closure instants and the resulting segments are processed to obtain the complex cepstrum, which is used to derive a synthesis filter. A reconstructed speech signal is produced by passing a pulse excitation signal, derived from the positions of the glottal closure instants, through the synthesis filter, and is compared with the input speech audio signal. The pulse excitation signal and the complex cepstrum are then iteratively modified to minimize the difference between the reconstructed speech signal and the input speech audio signal: the positions of the pulses in the excitation signal are optimized to reduce the mean squared error between the two signals, and the complex cepstrum is recalculated using the optimized pulse positions.
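The iterative analysis-by-synthesis loop summarized above can be sketched in plain Python. This is a minimal illustration, not the claimed method, and it rests on several assumptions the abstract does not state: unit-amplitude pulses, a fixed odd segment length starting at each pulse, a synthesis filter obtained by averaging the segments' complex cepstra and inverting the average, and a greedy per-pulse search over a small window of shifts. All function names (`complex_cepstrum`, `estimate_filter`, `refine_positions`, `analysis_by_synthesis`) are hypothetical, and a naive O(N^2) DFT is used only to keep the example dependency-free.

```python
import cmath
import math

# Hypothetical sketch: unit-amplitude pulses, odd-length pulse-synchronous
# segments and cepstral averaging are assumptions, not taken from the source.


def dft(x):
    """Naive O(N^2) DFT; adequate for the short frames used here."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse of dft()."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def complex_cepstrum(frame):
    """IDFT of log|X| + j*unwrapped arg(X); odd length avoids the Nyquist bin."""
    n = len(frame)
    if n % 2 == 0:
        raise ValueError("frame length must be odd")
    spec = dft(frame)
    half = n // 2
    log_spec, prev = [], None
    for k in range(half + 1):
        ph = cmath.phase(spec[k])
        if prev is not None:  # unwrap: remove 2*pi jumps between neighbouring bins
            ph -= 2 * math.pi * round((ph - prev) / (2 * math.pi))
        prev = ph
        log_spec.append(math.log(max(abs(spec[k]), 1e-12)) + 1j * ph)
    # Mirror the lower half so the log spectrum is conjugate-symmetric (real input).
    log_spec += [log_spec[n - k].conjugate() for k in range(half + 1, n)]
    return idft(log_spec)


def cepstrum_to_impulse_response(ceps):
    """Invert the complex cepstrum: h = IDFT(exp(DFT(c)))."""
    return [v.real for v in idft([cmath.exp(v) for v in dft(ceps)])]


def estimate_filter(x, positions, length):
    """Average the complex cepstra of segments starting at each pulse (assumption)."""
    segs = [x[p:p + length] for p in positions if 0 <= p and p + length <= len(x)]
    ceps = [complex_cepstrum(s) for s in segs]
    mean = [sum(c[i] for c in ceps) / len(ceps) for i in range(length)]
    return cepstrum_to_impulse_response(mean)


def synthesize(positions, h, n):
    """Pass a unit-amplitude pulse train through the FIR filter h (length-n output)."""
    y = [0.0] * n
    for p in positions:
        for i, v in enumerate(h):
            if 0 <= p + i < n:
                y[p + i] += v
    return y


def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def refine_positions(x, positions, h, max_shift):
    """Greedy search: move each pulse within +/-max_shift samples if that lowers the MSE."""
    pos = list(positions)
    best = mse(x, synthesize(pos, h, len(x)))
    for k in range(len(pos)):
        base = pos[k]
        for s in range(-max_shift, max_shift + 1):
            if not 0 <= base + s < len(x):
                continue
            trial = pos[:k] + [base + s] + pos[k + 1:]
            err = mse(x, synthesize(trial, h, len(x)))
            if err < best:
                best, pos = err, trial
    return pos, best


def analysis_by_synthesis(x, gci, length=15, max_shift=2, iters=5):
    """Alternate pulse-position refinement and cepstrum re-estimation; keep the best fit."""
    pos = list(gci)
    h = estimate_filter(x, pos, length)
    best = (mse(x, synthesize(pos, h, len(x))), pos, h)
    for _ in range(iters):
        pos = refine_positions(x, pos, h, max_shift)[0]
        h = estimate_filter(x, pos, length)  # recompute the cepstrum from refined pulses
        err = mse(x, synthesize(pos, h, len(x)))
        if err < best[0]:
            best = (err, pos, h)
    return best
```

For example, `analysis_by_synthesis(x, gci, length=15)` returns the lowest mean squared error found, together with the corresponding pulse positions and filter impulse response. Keeping the best iterate rather than the last one guards against a cepstrum re-estimation step that happens to increase the error, since only the position search is guaranteed not to.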