ACOUSTIC SIGNATURE BUILDING FOR A SPEAKER FROM MULTIPLE SESSIONS, Application No. US201615006575

Title of invention	ACOUSTIC SIGNATURE BUILDING FOR A SPEAKER FROM MULTIPLE SESSIONS
Abstract	Disclosed herein are methods of diarizing audio data using first-pass blind diarization and second-pass blind diarization that generate speaker statistical models, wherein the first-pass blind diarization is on a per-frame basis and the second-pass blind diarization is on a per-word basis, and methods of creating acoustic signatures for a common speaker based only on the statistical models of the speakers in each audio session.
Publication No.	US2016217793(A1)	Publication date	2016.07.28
Application No.	US201615006575	Filing date	2016.01.26
Applicant	Verint Systems Ltd.	Inventors	Gorodetski Alex;Shapira Ido;Wein Ron;Sidi Oana
Classification	G10L17/20;G10L17/16;G10L15/06;G10L25/84;G10L15/04;G10L17/04;G10L17/02	Main classification	G10L17/20
Agency		Agent
Main claim	1. A method of blind diarization of audio data having a first-pass blind diarization process and a second-pass blind diarization process, the method comprising: identifying non-speech segments in the audio using a voice-activity-detector (VAD) and segmenting audio data into a plurality of utterances that are separated by the identified non-speech segments; representing each utterance as an utterance model representative of a plurality of feature vectors of each utterance; clustering the utterance models; constructing a plurality of speaker models from the clustered utterance models; constructing a hidden Markov model (HMM) of the plurality of speaker models; decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; for each segment that was identified by the VAD, decoding the segment using a large-vocabulary continuous speech recognition (LVCSR) decoder, wherein the LVCSR decoder outputs words and non-speech symbols; analyzing the sequence of output words and non-speech symbols from the LVCSR decoder, wherein non-speech parts are discarded and the segment is refined, resulting in sub-segments comprising words; constructing a second plurality of speaker models by feeding the resulting sub-segments into a clustering algorithm; constructing a second HMM of the second plurality of speaker models; and decoding a best path corresponding to the sequence of output words in the second HMM by applying a Viterbi algorithm that performs word-level segmentation.
Address	Herzliya Pituach IL
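
Illustrative note on the main claim: the claim names two generic building blocks, splitting audio into utterances at VAD-identified non-speech segments and decoding the best speaker sequence through an HMM with a Viterbi algorithm. The following is a minimal Python sketch of those two steps only, not the patented implementation. The function names `split_utterances` and `viterbi_speakers`, the per-frame boolean VAD flags, the per-unit log-likelihood table, and the single `stay_prob` transition parameter are all assumptions made for illustration; the record does not specify any of them.

```python
import math


def split_utterances(vad_flags):
    """Group consecutive speech frames (True) into (start, end) spans, end exclusive.

    Assumption: the VAD output is one boolean per frame.
    """
    utterances, start = [], None
    for i, is_speech in enumerate(vad_flags):
        if is_speech and start is None:
            start = i                      # a new utterance begins
        elif not is_speech and start is not None:
            utterances.append((start, i))  # non-speech closes the utterance
            start = None
    if start is not None:                  # audio ended while still in speech
        utterances.append((start, len(vad_flags)))
    return utterances


def viterbi_speakers(log_likes, stay_prob=0.9):
    """Return the most likely speaker index for each unit (utterance or word).

    log_likes[t][s] is the log-likelihood of unit t under speaker model s.
    Assumption: a fully connected HMM with one state per speaker, a uniform
    initial distribution, and probability `stay_prob` of keeping the speaker.
    """
    n_spk = len(log_likes[0])
    if n_spk == 1:
        return [0] * len(log_likes)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_spk - 1))

    def trans(prev, cur):
        return log_stay if prev == cur else log_switch

    score = list(log_likes[0])  # uniform prior adds a constant, so it is dropped
    back = []                   # back[t-1][s]: best predecessor of state s at unit t
    for obs in log_likes[1:]:
        new_score, ptr = [], []
        for s in range(n_spk):
            best_prev = max(range(n_spk), key=lambda p: score[p] + trans(p, s))
            ptr.append(best_prev)
            new_score.append(score[best_prev] + trans(best_prev, s) + obs[s])
        score = new_score
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(n_spk), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

In this sketch the same decoder could serve both passes by changing what a "unit" is: utterances in the first pass, word-level sub-segments in the second. The transition penalty discourages rapid speaker changes, so a single weakly contradicting unit does not flip the decoded speaker.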