Abstract |
A method is provided for training acoustic models in an automatic speech recognizer ("ASR") without explicitly matching decoded scripts with the correct scripts from which acoustic training data is generated. In the method, audio data is input and segmented to produce audio segments. The audio segments are clustered into groups such that the clustered audio segments in each group have similar characteristics; the groups respectively form audio similarity classes. Audio segment probability distributions are then calculated for the clustered audio segments in the audio similarity classes, and audio segment frequencies for the clustered audio segments are determined based on those probability distributions. The audio segment frequencies are matched to known audio segment frequencies for at least one of letters, combinations of letters, and words to determine frequency matches, and a textual corpus of words is formed based on the frequency matches. Acoustic models of the automatic speech recognizer are then trained based on the textual corpus. In addition, the method may receive and cluster video or biometric data and match such data to the audio data to cluster the audio segments into the groups more accurately. An apparatus for performing the method is also provided. |
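
The frequency-matching step described in the abstract can be illustrated with a minimal sketch. It assumes the audio segments have already been clustered, so each segment carries an integer label for its audio similarity class. The function names, the greedy nearest-frequency pairing, and the toy frequency table are all hypothetical choices made for illustration; the abstract does not specify how the matching is performed.

```python
from collections import Counter


def class_frequencies(labels):
    """Relative frequency of each audio similarity class in a label sequence."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def match_by_frequency(class_freqs, known_freqs):
    """Greedily pair each class with the unused text unit of closest frequency.

    Assumption: a one-to-one greedy pairing, most frequent class first.
    A real system would likely need a more robust matching criterion.
    """
    available = dict(known_freqs)
    mapping = {}
    for cls, freq in sorted(class_freqs.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        best = min(available, key=lambda unit: abs(available[unit] - freq))
        mapping[cls] = best
        del available[best]
    return mapping


# Toy data: cluster labels for ten segments (hypothetical).
labels = [0, 1, 0, 2, 0, 1, 0, 0, 1, 2]
# Toy table of known unit frequencies (illustrative values, not real statistics).
known = {"e": 0.6, "t": 0.3, "a": 0.1}

mapping = match_by_frequency(class_frequencies(labels), known)
# Replace each class label with its matched unit to form a rough textual corpus.
corpus = [mapping[cls] for cls in labels]
```

In this sketch the resulting `corpus` stands in for the textual corpus on which the acoustic models would then be trained; the training step itself is outside the scope of the example.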