Title of invention: System and method for synthetic voice generation and modification
Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages.
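The selection step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `VoiceUnit` record, the `PHONE_CATEGORY` table, the `combine` and `select_units` helpers, the voice names "A" and "B", and the fallback to any available voice when the preferred one has no matching unit. Real unit-selection systems also score candidates with target and join costs, which this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceUnit:
    voice: str    # hypothetical name of the source text-to-speech voice
    phone: str    # phone label of the recorded unit
    unit_id: str  # stand-in for the stored waveform segment


# Hypothetical mapping from phone to phonetic category.
PHONE_CATEGORY = {"k": "consonant", "ae": "vowel", "t": "consonant"}


def combine(*databases):
    """Merge several per-voice unit lists into one combined database."""
    combined = []
    for db in databases:
        combined.extend(db)
    return combined


def select_units(combined, policy, target_phones):
    """For each target phone, pick a unit from the voice the policy names
    for that phone's category; fall back to any voice if none matches."""
    selected = []
    for phone in target_phones:
        preferred = policy.get(PHONE_CATEGORY[phone])
        candidates = [u for u in combined if u.phone == phone]
        if not candidates:
            raise LookupError(f"no unit for phone {phone!r}")
        match = next((u for u in candidates if u.voice == preferred),
                     candidates[0])
        selected.append(match)
    return selected


voice_a = [VoiceUnit("A", "k", "A_k_001"),
           VoiceUnit("A", "ae", "A_ae_001"),
           VoiceUnit("A", "t", "A_t_001")]
voice_b = [VoiceUnit("B", "k", "B_k_001"),
           VoiceUnit("B", "ae", "B_ae_001"),
           VoiceUnit("B", "t", "B_t_001")]

# Policy: take vowels from voice B and consonants from voice A.
policy = {"vowel": "B", "consonant": "A"}

selected = select_units(combine(voice_a, voice_b), policy, ["k", "ae", "t"])
print([u.unit_id for u in selected])
# prints ['A_k_001', 'B_ae_001', 'A_t_001']
```

Because the selected units are stored segments rather than model parameters, a synthesizer built this way would concatenate them directly, which matches the abstract's statement that neither voice needs to be parameterized.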
Publication number: US9269346(B2) Publication date: 2016.02.23
Application number: US201514623183 Filing date: 2015.02.16
Applicant: AT&T Intellectual Property I, L.P. Inventors: Conkie Alistair D.; Syrdal Ann K.
Classification: G10L13/027; G10L13/047; G10L13/06; H04B7/04; H04B7/06; H04W72/04 Main classification: G10L13/027
Agency: Agent:
Independent claim: 1. A method comprising: storing, in a database, voice data, wherein the voice data is associated with a plurality of voices, wherein the plurality of voices are stored within libraries according to emotions; identifying, using user speech exhibited by a user, a user emotion; identifying, via a processor and according to the user emotion, a first text-to-speech voice of the plurality of voices which are in the database, wherein the first text-to-speech voice has a first emotional content from a first speaker; identifying, via the processor and according to the user emotion, a second text-to-speech voice of the plurality of voices which are in the database, wherein the second text-to-speech voice has a second emotional content from a second speaker, and wherein the second emotional content is distinct from the first emotional content; and synthesizing synthesized speech using the first text-to-speech voice and the second text-to-speech voice, wherein the synthesized speech mimics the user emotion.
Address: Atlanta, GA, US