Title: Speech syllable/vowel/phone boundary detection using auditory attention cues
Abstract: In syllable or vowel or phone boundary detection during speech, an auditory spectrum may be determined for an input window of sound and one or more multi-scale features may be extracted from the auditory spectrum. Each multi-scale feature can be extracted using a separate two-dimensional spectro-temporal receptive filter. One or more feature maps corresponding to the one or more multi-scale features can be generated and an auditory gist vector can be extracted from each of the one or more feature maps. A cumulative gist vector may be obtained through augmentation of each auditory gist vector extracted from the one or more feature maps. One or more syllable or vowel or phone boundaries in the input window of sound can be detected by mapping the cumulative gist vector to one or more syllable or vowel or phone boundary characteristics using a machine learning algorithm.
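The gist-extraction and augmentation steps described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: each feature map is a 2-D array (frequency channels by time frames), the gist of a map is taken as the mean of each cell in a coarse grid, and "augmentation" is read as concatenation. The names `gist_vector` and `cumulative_gist_vector` and the default 2-by-2 grid are hypothetical; the receptive filters and the machine learning mapping are omitted.

```python
from typing import List, Sequence

# Assumed layout: rows are frequency channels, columns are time frames.
FeatureMap = Sequence[Sequence[float]]


def gist_vector(feature_map: FeatureMap, grid_rows: int, grid_cols: int) -> List[float]:
    """Summarize one feature map by the mean of each cell of a coarse grid.

    The grid-mean statistic is an assumption made for illustration.
    """
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    if not (0 < grid_rows <= n_rows and 0 < grid_cols <= n_cols):
        raise ValueError("grid must be non-empty and no finer than the feature map")
    gist = []
    for i in range(grid_rows):
        # Integer bounds of the i-th band of rows.
        r0, r1 = i * n_rows // grid_rows, (i + 1) * n_rows // grid_rows
        for j in range(grid_cols):
            # Integer bounds of the j-th band of columns.
            c0, c1 = j * n_cols // grid_cols, (j + 1) * n_cols // grid_cols
            cell = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            gist.append(sum(cell) / len(cell))
    return gist


def cumulative_gist_vector(feature_maps: Sequence[FeatureMap],
                           grid_rows: int = 2, grid_cols: int = 2) -> List[float]:
    """Concatenate the gist vectors of all feature maps into one vector."""
    cumulative: List[float] = []
    for feature_map in feature_maps:
        cumulative.extend(gist_vector(feature_map, grid_rows, grid_cols))
    return cumulative
```

Under these assumptions, the concatenated vector would be the input that a trained classifier maps to boundary or non-boundary labels for the window.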
Publication No.: US9251783(B2)  Publication date: 2016.02.02
Application No.: US201414307426  Filing date: 2014.06.17
Applicant: Sony Computer Entertainment Inc.  Inventors: Kalinli-Akbacak Ozlem; Chen Ruxin
Classification: G10L15/04; G10L15/05; G10L15/16; G10L15/24; G10L15/34; G10L25/03  Main classification: G10L15/04
Agency: JDI Patent  Agent: Isenberg Joshua D.; JDI Patent
Main claim: 1. A method, comprising: extracting one or more multi-scale features from an auditory spectrum for an input window of sound, wherein each multi-scale feature is extracted using a separate two-dimensional spectro-temporal receptive filter; generating one or more feature maps corresponding to the one or more multi-scale features; extracting an auditory gist vector from each of the one or more feature maps; obtaining a cumulative gist vector through augmentation of each auditory gist vector extracted from the one or more feature maps; detecting one or more syllable or vowel or phone boundaries in the input window of sound by mapping the cumulative gist vector to one or more syllable or vowel or phone boundaries; and determining a number of syllables per unit time or a number of syllables per utterance using the one or more syllable or vowel or phone boundaries in the input window of sound.
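The final step of the claim, determining a number of syllables per unit time from detected boundaries, might be sketched as follows. This is an illustration under stated assumptions, not the claimed method: boundaries are given as times in seconds relative to the start of the window, each detected boundary is assumed to close exactly one syllable, and the function name `syllables_per_second` is hypothetical.

```python
from typing import Sequence


def syllables_per_second(boundary_times: Sequence[float], window_duration: float) -> float:
    """Estimate the syllable rate of one input window.

    Assumption: every detected boundary inside the window marks the end
    of one syllable, so the syllable count equals the boundary count.
    """
    if window_duration <= 0:
        raise ValueError("window_duration must be positive")
    # Ignore boundaries reported outside the window.
    inside = [t for t in boundary_times if 0.0 <= t <= window_duration]
    return len(inside) / window_duration
```

For example, four boundaries detected in a 2-second window would give a rate of 2 syllables per second under this counting convention.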
Address: Tokyo JP