Title ACOUSTIC SIGNATURE BUILDING FOR A SPEAKER FROM MULTIPLE SESSIONS
Abstract Disclosed herein are methods of diarizing audio data using a first-pass blind diarization and a second-pass blind diarization that generate speaker statistical models, wherein the first-pass blind diarization operates on a per-frame basis and the second-pass blind diarization operates on a per-word basis, and methods of creating acoustic signatures for a common speaker based only on the statistical models of the speakers in each audio session.
Publication number US2016217793(A1) Publication date 2016.07.28
Application number US201615006575 Filing date 2016.01.26
Applicant Verint Systems Ltd. Inventors Gorodetski Alex;Shapira Ido;Wein Ron;Sidi Oana
Classification G10L17/20;G10L17/16;G10L15/06;G10L25/84;G10L15/04;G10L17/04;G10L17/02 Main classification G10L17/20
Agency Agent
Main claim 1. A method of blind diarization of audio data having a first-pass blind diarization process and a second-pass blind diarization process, the method comprising: identifying non-speech segments in the audio using a voice-activity-detector (VAD) and segmenting audio data into a plurality of utterances that are separated by the identified non-speech segments; representing each utterance as an utterance model representative of a plurality of feature vectors of each utterance; clustering the utterance models; constructing a plurality of speaker models from the clustered utterance models; constructing a hidden Markov model (HMM) of the plurality of speaker models; decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; for each segment that was identified by the VAD, decoding the segment using a large-vocabulary continuous speech recognition (LVCSR) decoder, wherein the LVCSR decoder outputs words and non-speech symbols; analyzing the sequence of output words and non-speech symbols from the LVCSR decoder, wherein non-speech parts are discarded and the segment is refined, resulting in sub-segments comprising words; constructing a second plurality of speaker models by feeding the resulting sub-segments into a clustering algorithm; constructing a second HMM of the second plurality of speaker models; and decoding a best path corresponding to the sequence of output words in the second HMM by applying a Viterbi algorithm that performs word-level segmentation.
Address Herzliya Pituach IL
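The decoding steps recited in the main claim (finding the best sequence of speaker models over utterances in the first pass, or over words in the second pass) can be illustrated with a small Viterbi sketch. This is not the patented implementation: the function name `viterbi_speaker_path`, the input layout `log_likes[t][s]`, and the `stay_log_prob` and `switch_log_prob` transition values are all assumptions chosen for illustration, and a uniform initial speaker prior is assumed.

```python
import math


def viterbi_speaker_path(log_likes,
                         stay_log_prob=math.log(0.9),
                         switch_log_prob=math.log(0.1)):
    """Return the most likely speaker index for each unit (utterance or word).

    log_likes[t][s] is the (hypothetical) log-likelihood of unit t under
    speaker model s. The transition values are illustrative assumptions.
    """
    if not log_likes:
        return []
    n_speakers = len(log_likes[0])

    # Uniform initial prior: it adds the same constant to every state,
    # so it is omitted here.
    score = list(log_likes[0])
    backpointers = []

    for obs in log_likes[1:]:
        new_score, pointers = [], []
        for s in range(n_speakers):
            # Best predecessor for state s, favoring staying with one speaker.
            best_prev = max(
                range(n_speakers),
                key=lambda p: score[p]
                + (stay_log_prob if p == s else switch_log_prob),
            )
            trans = stay_log_prob if best_prev == s else switch_log_prob
            new_score.append(score[best_prev] + trans + obs[s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Backtrace from the best final state.
    state = max(range(n_speakers), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With per-word log-likelihoods as input, the returned path assigns one speaker label per word, which corresponds to the word-level segmentation of the second pass. The stay and switch values only control how strongly the path resists speaker changes; a real system would estimate them rather than fix them.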