Title of Invention: Method and system for achieving emotional text to speech utilizing emotion tags assigned to text data
Abstract: A method and system for achieving emotional text to speech (TTS). The method includes: receiving text data; generating an emotion tag for the text data by rhythm piece; and performing TTS on the text data according to the emotion tag, where each emotion tag is expressed as a set of emotion vectors and each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories. A system for the same includes: a text data receiving module; an emotion tag generating module; and a TTS module for performing TTS, wherein the emotion tag is expressed as a set of emotion vectors and each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories.
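As a concrete illustration of the data structure described in the abstract, the following sketch represents the emotion tag of a rhythm piece as an emotion vector of scores over emotion categories. The category names, types, and the way a final category and score are picked are assumptions for illustration only and are not taken from the patent.

from dataclasses import dataclass
from typing import Dict

# Assumed (illustrative) emotion categories; the patent does not fix this set.
EMOTION_CATEGORIES = ("neutral", "happy", "sad", "angry")

@dataclass
class EmotionVector:
    scores: Dict[str, float]  # one emotion score per emotion category

    def final_category_and_score(self):
        # One plausible reading: the final emotion category is the
        # highest-scoring category, and its score is the final emotion score.
        category = max(self.scores, key=self.scores.get)
        return category, self.scores[category]

# The emotion tag for one rhythm piece is then a set of such vectors, e.g.:
tag = [EmotionVector({"neutral": 0.2, "happy": 0.7, "sad": 0.05, "angry": 0.05})]

In this reading, each rhythm piece of the received text carries its own tag, and the TTS step consumes the final category and score per piece.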
Publication Number: US9117446 (B2)    Publication Date: 2015.08.25
Application Number: US201113221953    Filing Date: 2011.08.31
Applicant: International Business Machines Corporation    Inventors: Bao Shenghua; Chen Jian; Qin Yong; Shi Qin; Shuang Zhiwei; Su Zhong; Wen Liu; Zhang Shi Lei
Classification: G10L13/08; G10L13/10    Main Classification: G10L13/08
Attorney/Agent Firm: Fleit Gibbons Gutman Bongini & Bianco PL    Attorneys/Agents: Fleit Gibbons Gutman Bongini & Bianco PL; Grzesik Thomas
Main Claim: 1. A method for achieving emotional Text To Speech (TTS), the method comprising: receiving a set of text data; organizing each of a plurality of words in the set of text data into a plurality of rhythm pieces; generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories; determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores; determining, for each of the plurality of rhythm pieces, a final emotion category for the rhythm piece based on at least each of the plurality of emotion categories; and performing, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises: decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and determining, for each of the set of phones, a speech feature based on Fi = (1 − Pemotion) * Fi-neutral + Pemotion * Fi-emotion, wherein: Fi is the value of the ith speech feature of one of the set of phones; Pemotion is the final emotion score of the rhythm piece in which that phone lies; Fi-neutral is the value of the ith speech feature in the neutral emotion category; and Fi-emotion is the value of the ith speech feature in the final emotion category.
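The per-phone feature computation in the claim is a linear interpolation between the neutral and the emotional feature values, weighted by the rhythm piece's final emotion score. A minimal sketch, assuming Pemotion lies in [0, 1] and that the speech feature is a scalar (for example, a pitch value in Hz); the function name and the example numbers are illustrative and not from the patent.

def interpolate_speech_feature(f_neutral: float, f_emotion: float, p_emotion: float) -> float:
    # Fi = (1 - Pemotion) * Fi-neutral + Pemotion * Fi-emotion
    # p_emotion: final emotion score of the rhythm piece containing the phone
    # f_neutral: i-th speech feature value in the neutral emotion category
    # f_emotion: i-th speech feature value in the final emotion category
    return (1.0 - p_emotion) * f_neutral + p_emotion * f_emotion

For instance, with a neutral pitch of 200.0 Hz, a pitch of 260.0 Hz in the final emotion category, and a final emotion score of 0.7, the interpolated value is 0.3 * 200.0 + 0.7 * 260.0 = 242.0 Hz, so a stronger emotion score pulls the synthesized feature closer to the emotional value.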
Address: Armonk, NY, US