Invention Title
System and method for synthetic voice generation and modification
Abstract
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers, voices of a single speaker speaking in different styles, or voices of different languages.
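The abstract describes merging two voice-unit databases and selecting units per phonetic category according to a policy. The following sketch is purely illustrative, not the patented implementation: the database layout, category names, and `select_units` helper are all hypothetical stand-ins for the idea of a policy that names, per phonetic category, which voice supplies the units.

```python
# Two hypothetical voice databases, each mapping a phonetic category
# (e.g. "vowel", "fricative") to the voice units available for it.
voice_a = {"vowel": ["a_ah", "a_iy"], "fricative": ["a_s", "a_f"]}
voice_b = {"vowel": ["b_ah", "b_iy"], "fricative": ["b_s", "b_f"]}

# Combine the two databases into one, tagging each unit with its source
# voice so a policy can later discriminate between them.
combined = {}
for name, db in (("A", voice_a), ("B", voice_b)):
    for category, units in db.items():
        combined.setdefault(category, []).extend((name, u) for u in units)

# The policy: for each phonetic category, which voice to draw units from.
policy = {"vowel": "A", "fricative": "B"}

def select_units(category):
    """Return the units of `category` permitted by the policy."""
    wanted = policy[category]
    return [unit for voice, unit in combined[category] if voice == wanted]

print(select_units("vowel"))      # vowels come from voice A
print(select_units("fricative"))  # fricatives come from voice B
```

Note that, as the abstract states, nothing here parameterizes either voice: selection operates directly on the stored units of the combined database.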
Publication Number
US9269346(B2)
Publication Date
2016.02.23
Application Number
US201514623183
Filing Date
2015.02.16
Applicant
AT&T Intellectual Property I, L.P.
Inventors
Conkie, Alistair D.; Syrdal, Ann K.
Classification Codes
G10L13/027; G10L13/047; G10L13/06; H04B7/04; H04B7/06; H04W72/04
Primary Classification
G10L13/027
Agent Firm
|
Agent
|
Principal Claim
1. A method comprising:
storing, in a database, voice data, wherein the voice data is associated with a plurality of voices, wherein the plurality of voices are stored within libraries according to emotions;
identifying, using user speech exhibited by a user, a user emotion;
identifying, via a processor and according to the user emotion, a first text-to-speech voice of the plurality of voices which are in the database, wherein the first text-to-speech voice has a first emotional content from a first speaker;
identifying, via the processor and according to the user emotion, a second text-to-speech voice of the plurality of voices which are in the database, wherein the second text-to-speech voice has a second emotional content from a second speaker, and wherein the second emotional content is distinct from the first emotional content; and
synthesizing synthesized speech using the first text-to-speech voice and the second text-to-speech voice, wherein the synthesized speech mimics the user emotion.
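The claim's flow (store voices in emotion-keyed libraries, detect the user's emotion, then pick two matching voices from distinct speakers) can be sketched as below. This is an illustrative toy, not the claimed system: the library contents, the `detect_emotion` placeholder, and `pick_voices` are all hypothetical, and a real system would classify emotion from acoustic features rather than punctuation.

```python
# Voice libraries organized by emotion; each entry is (speaker, voice id),
# standing in for the claim's "libraries according to emotions".
libraries = {
    "happy": [("speaker1", "happy_warm"), ("speaker2", "happy_bright")],
    "sad":   [("speaker1", "sad_soft"),   ("speaker2", "sad_low")],
}

def detect_emotion(user_speech):
    """Placeholder emotion classifier over the user's speech."""
    return "happy" if "!" in user_speech else "sad"

def pick_voices(user_speech):
    """Identify a first and second voice matching the user's emotion.

    The two voices come from distinct speakers, mirroring the claim's
    first/second emotional content from first/second speakers.
    """
    emotion = detect_emotion(user_speech)
    first, second = libraries[emotion][:2]
    return emotion, first, second

emotion, first_voice, second_voice = pick_voices("Great news!")
print(emotion, first_voice, second_voice)
```

Synthesis itself (blending the two selected voices so the output mimics the user emotion) is outside this sketch's scope.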
Address
Atlanta, GA, US