Abstract |
PROBLEM TO BE SOLVED: To reduce the computational load of likelihood calculation by narrowing down the candidate words, using mouth shape information obtained during the vocalizing section from a photographed image of the speaker's mouth. SOLUTION: A mouth shape recognition part 102 recognizes the shape and movement of the mouth during vocalization from a face image signal S101 (photographed image) read out from an image frame buffer 101. A word dictionary 104 stores, in advance, syllable information and a phoneme model for each word candidate to be recognized. A mouth shape syllable matching part 103 evaluates the degree of matching between the syllable information input from the word dictionary 104 and the syllables obtained from the movement of the mouth shape, and outputs the result (mouth shape syllable matching score). A word candidate narrowing part 105 narrows down the word candidates according to the mouth shape syllable matching score. Then, a voice recognition part 108 compares the sequence of voice frames S108 of the input sound section with the phoneme models S111 of the words narrowed down by the word candidate narrowing part 105, and outputs the word with the highest likelihood as the recognition result. |
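
The two-stage flow described above (score each dictionary word against the mouth shape, discard poor matches, then run the acoustic likelihood only on the survivors) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `DictionaryEntry`, `mouth_shape_score`, `narrow_candidates` and `recognize`, the position-wise scoring rule, the fixed threshold, and the representation of syllable information as vowel-like mouth classes are all assumptions. The abstract does not specify how the matching score is computed or how the cut-off is chosen.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class DictionaryEntry:
    """One word in the word dictionary (hypothetical layout)."""
    word: str
    syllables: Tuple[str, ...]  # assumed: one mouth-shape class per syllable


def mouth_shape_score(observed: Sequence[str], expected: Sequence[str]) -> float:
    """Assumed scoring rule: fraction of positions where the classes agree."""
    longest = max(len(observed), len(expected))
    if longest == 0:
        return 1.0
    matches = sum(1 for o, e in zip(observed, expected) if o == e)
    return matches / longest


def narrow_candidates(
    entries: Sequence[DictionaryEntry],
    observed: Sequence[str],
    threshold: float,
) -> List[DictionaryEntry]:
    """Keep only words whose matching score reaches the (assumed) threshold."""
    return [e for e in entries if mouth_shape_score(observed, e.syllables) >= threshold]


def recognize(
    entries: Sequence[DictionaryEntry],
    observed: Sequence[str],
    acoustic_likelihood: Callable[[str], float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Run the costly acoustic comparison only on the narrowed candidates."""
    survivors = narrow_candidates(entries, observed, threshold)
    if not survivors:
        return None
    return max(survivors, key=lambda e: acoustic_likelihood(e.word)).word


if __name__ == "__main__":
    dictionary = [
        DictionaryEntry("sakura", ("a", "u", "a")),
        DictionaryEntry("katana", ("a", "a", "a")),
        DictionaryEntry("midori", ("i", "o", "i")),
    ]
    # Stand-in for comparing voice frames with a phoneme model (made-up values).
    fake_log_likelihood = {"sakura": -10.0, "katana": -14.0, "midori": -20.0}
    print(recognize(dictionary, ("a", "u", "a"), fake_log_likelihood.__getitem__, 0.6))
```

In this toy run the word "midori" is discarded by the mouth shape stage, so the acoustic likelihood function is called for two words instead of three. That reduction in acoustic comparisons is the computational saving the abstract aims at.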