Title SPEECH SYNTHESIS MODEL SELECTION
Abstract In some implementations, a text-to-speech system may perform a mapping of acoustic frames to linguistic model clusters in a pre-selection process for unit selection synthesis. An architecture may leverage data-driven models, such as neural networks that are trained using recorded speech samples, to effectively map acoustic frames to linguistic model clusters during synthesis. This architecture may allow for improved handling and synthesis of combinations of unseen linguistic features.
Publication number US2016343366(A1) Publication date 2016.11.24
Application number US201514716063 Filing date 2015.05.19
Applicant Google Inc. Inventors Fructuoso Javier Gonzalvo; Chun Byungha
Classification G10L13/027; G10L13/08; G10L13/047 Main classification G10L13/027
Agency Agent
Main claim 1. A computer-implemented method comprising: receiving textual input to a text-to-speech system; identifying a particular set of linguistic features that correspond to the textual input; providing the particular set of linguistic features as input to a first neural network that has been trained to identify a set of acoustic features given a set of linguistic features; receiving, as output from the first neural network, a particular set of acoustic features identified for the particular set of linguistic features; providing a representation of the particular set of acoustic features as input to a second neural network that has been trained to identify a text-to-speech model given a set of acoustic features; receiving, as output from the second neural network, data that indicates a particular text-to-speech model for the representation of the particular set of acoustic features; and generating, based at least on the particular text-to-speech model, audio data that represents the textual input.
Address Mountain View, CA, US
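
The data flow recited in claim 1 (text, then linguistic features, then a first network producing acoustic features, then a second network indicating a text-to-speech model, then audio) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the per-character feature extractor, the `TinyNet` class with fixed random (untrained) weights, the layer sizes, the number of models, and the sine-tone stand-in for synthesis are all hypothetical. A real system would use trained networks and would use the selected model to pre-select recorded units rather than to pick a tone.

```python
import math
import random

VOWELS = set("aeiou")
NUM_MODELS = 4        # hypothetical number of candidate text-to-speech models
SAMPLE_RATE = 8000    # hypothetical output sample rate in Hz
FRAME_SAMPLES = 80    # hypothetical number of audio samples per input frame


def linguistic_features(text):
    """Hypothetical linguistic front end: one 3-value feature vector per character."""
    n = max(len(text), 1)
    return [[float(ch.lower() in VOWELS), float(ch.isspace()), i / n]
            for i, ch in enumerate(text)]


class TinyNet:
    """One-hidden-layer network with fixed random weights (untrained, illustration only)."""

    def __init__(self, n_in, n_hidden, n_out, seed):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]

    def forward(self, x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


# First network: linguistic features -> acoustic features (5 values, an assumption).
first_net = TinyNet(3, 8, 5, seed=0)
# Second network: acoustic features -> one score per candidate model.
second_net = TinyNet(5, 8, NUM_MODELS, seed=1)


def select_models(text):
    """Return the index of the highest-scoring model for each input frame."""
    ids = []
    for feats in linguistic_features(text):
        acoustic = first_net.forward(feats)
        scores = second_net.forward(acoustic)
        ids.append(max(range(NUM_MODELS), key=lambda k: scores[k]))
    return ids


def generate_audio(model_ids):
    """Stand-in for synthesis: each selected model contributes a short sine tone."""
    samples = []
    for model_id in model_ids:
        freq = 200.0 + 100.0 * model_id
        samples.extend(math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
                       for t in range(FRAME_SAMPLES))
    return samples


def synthesize(text):
    """Chain the steps: features -> acoustic features -> model choice -> audio."""
    return generate_audio(select_models(text))
```

Calling `synthesize("hello world")` returns 880 samples (11 characters times 80 samples per frame). Keeping the two networks separate mirrors the claim's structure: the second network sees only acoustic features, so in principle a combination of linguistic features never observed in training can still be routed to a model through its predicted acoustics.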