Title of invention Speaker indexing device and speaker indexing method
Abstract A speaker indexing device extracts a plurality of features from a speech signal on a frame-by-frame basis, models a distribution of first feature sets by a mixture distribution containing as many probability distributions as there are speakers, selects for each probability distribution either first feature sets located within a predetermined distance from the center of the probability distribution or a predetermined number of first feature sets in sequence starting from the first feature set closest to the center of the probability distribution, selects a second feature for the frame corresponding to the selected first feature sets as first training data for the speaker corresponding to the probability distribution and, using the first training data, trains a speaker model to be used to append to each frame identification information for identifying the speaker speaking in the frame.
Publication number US9536525(B2) Publication date 2017.01.03
Application number US201514825653 Filing date 2015.08.13
Applicant FUJITSU LIMITED Inventor Hayakawa Shoji
Classification G10L17/00;G10L17/04;G10L17/02 Main classification G10L17/00
Agency Fujitsu Patent Center Agent Fujitsu Patent Center
Principal claim 1. A speaker indexing device comprising: a processor configured to: from a speech signal containing a conversation between a plurality of speakers, extract at least one feature and a data set including at least two values on a frame-by-frame basis, each frame having a predetermined time length, the at least one feature and the data set representing human speech features; model a distribution of the data set extracted for each frame, by a mixture distribution that contains as many probability distributions as the number of speakers; for each of the probability distributions, select from among the data set extracted for each frame, either data sets located within a predetermined distance from the center of the probability distribution or a predetermined number of data sets in sequence starting from the data set closest to the center of the probability distribution, and select the at least one feature extracted for the frame corresponding to each selected data set, as first training data for speaker of the plurality of speakers that corresponds to the probability distribution; train a speaker model for each of the plurality of speakers by using the first training data selected for the corresponding speaker, the speaker model representing the speech features of the speaker by a probability distribution of the at least one feature; and based on the speaker model of each of the plurality of speakers and on the at least one feature for each frame, append to each frame identification information for identifying the speaker speaking in the frame.
Address Kawasaki JP
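
Illustrative note: the training-data selection step recited in the principal claim can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation. It assumes the mixture distribution has already been fitted, so that each probability distribution is summarized by a center and per-dimension variances, and it assumes a variance-normalized (diagonal Mahalanobis) distance, which the record does not specify. The function name select_training_frames and all parameter names are hypothetical.

```python
import math


def select_training_frames(data_sets, features, centers, variances,
                           max_distance=None, top_n=None):
    """Pick per-speaker training features from frames near each center.

    data_sets : per-frame tuples of values (the modeled "data set").
    features  : per-frame feature used to train the speaker model.
    centers   : one center per probability distribution (one per speaker).
    variances : per-dimension variances for each distribution (assumed
                diagonal covariance; an assumption of this sketch).
    Exactly one of max_distance or top_n must be given, mirroring the two
    alternatives in the claim.
    """
    if (max_distance is None) == (top_n is None):
        raise ValueError("specify exactly one of max_distance or top_n")

    selected = []
    for center, var in zip(centers, variances):
        # Distance of every frame's data set to this distribution's center.
        scored = []
        for idx, point in enumerate(data_sets):
            dist = math.sqrt(sum((x - c) ** 2 / v
                                 for x, c, v in zip(point, center, var)))
            scored.append((dist, idx))
        scored.sort()  # closest frame first

        if max_distance is not None:
            # Alternative 1: every frame within a predetermined distance.
            chosen = [idx for dist, idx in scored if dist <= max_distance]
        else:
            # Alternative 2: a predetermined number of frames, closest first.
            chosen = [idx for _, idx in scored[:top_n]]

        # The training data are the features of the chosen frames,
        # not the data sets that were used for the selection.
        selected.append([features[idx] for idx in chosen])
    return selected


# Toy example with two speakers and five frames (made-up values).
data_sets = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (1.5, 1.5), (2.9, 3.1)]
features = ["f0", "f1", "f2", "f3", "f4"]
centers = [(0.0, 0.0), (3.0, 3.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]

by_distance = select_training_frames(data_sets, features, centers,
                                     variances, max_distance=0.5)
by_count = select_training_frames(data_sets, features, centers,
                                  variances, top_n=3)
```

In this toy data the frame labeled f3 lies midway between the two centers, so the distance threshold excludes it from both speakers' training data, which is the point of restricting selection to frames near a center. With the fixed-count alternative the same frame can be picked for both speakers when top_n is set too large for the data.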