Abstract |
A system and method for synthesizing a facial image compare a speech frame from an incoming speech signal with acoustic features stored within visually similar entries in an audio-visual codebook to produce a set of weights. The audio-visual codebook also stores visual features corresponding to the acoustic features. A composite visual feature is generated as a weighted sum of the corresponding visual features, and the facial image is synthesized from this composite feature. The audio-visual codebook may include multiple samples of the acoustic and visual features for each entry, where each entry corresponds to a sequence of one or more phonemes. |
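
The weighting step described in the abstract can be illustrated with a minimal sketch. The abstract says only that a speech frame is compared with stored acoustic features to produce weights. The Euclidean distance, the softmax-style normalization, the `gamma` parameter, the function names, and the toy codebook values below are all assumptions made for illustration and are not taken from the source. The sketch also assumes the visually similar entries have already been selected.

```python
import math


def codebook_weights(frame, acoustic_entries, gamma=1.0):
    """Map a speech frame to one weight per codebook entry.

    Assumption (not from the source): weights are a softmax over
    negative Euclidean distances, so closer acoustic features get
    larger weights and all weights sum to 1.
    """
    distances = [math.dist(frame, a) for a in acoustic_entries]
    d_min = min(distances)  # shift by the minimum for numerical stability
    scores = [math.exp(-gamma * (d - d_min)) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


def composite_visual_feature(weights, visual_entries):
    """Weighted sum of the visual features paired with each acoustic entry."""
    dim = len(visual_entries[0])
    return [
        sum(w * v[k] for w, v in zip(weights, visual_entries))
        for k in range(dim)
    ]


# Hypothetical toy codebook: three preselected entries, each pairing a
# 2-D acoustic feature with a 2-D visual feature.
acoustic = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
visual = [[0.1, 0.5], [0.4, 0.2], [0.9, 0.7]]

frame = [0.0, 0.0]  # incoming speech frame (hypothetical values)
weights = codebook_weights(frame, acoustic)
feature = composite_visual_feature(weights, visual)
```

Because the weights are non-negative and sum to 1 in this sketch, the composite feature is a convex combination of the stored visual features. Any other normalized similarity measure could be substituted for the distance-based weighting assumed here.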