Title: User specified keyword spotting using long short term memory neural network feature extractor
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for recognizing keywords using a long short term memory neural network. One of the methods includes receiving, by a device for each of multiple variable length enrollment audio signals, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, processing each of the plurality of enrollment feature vectors using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector, and generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether another audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the enrollment LSTM output vectors for the enrollment audio signal.
Publication number: US9508340(B2)  Publication date: 2016.11.29
Application number: US201414579603  Filing date: 2014.12.22
Applicant: Google Inc.  Inventors: Parada San Martin Maria Carolina; Sainath Tara N.; Chen Guoguo
Classification: G10L15/16; G10L15/26; G10L15/02; G06F1/32; G10L15/08; G10L15/06; G10L15/28  Main classification: G10L15/16
Agency: Fish & Richardson P.C.  Agent: Fish & Richardson P.C.
Independent claim: 1. A method comprising: receiving, by a device for each of multiple variable length enrollment audio signals each encoding a respective spoken utterance of an enrollment phrase, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, wherein when the device determines that another audio signal encodes another spoken utterance of the enrollment phrase, the device performs a particular action assigned to the enrollment phrase; and for each of the multiple variable length enrollment audio signals: processing each of the plurality of enrollment feature vectors for the respective variable length enrollment audio signal using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector; and generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether the other audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the enrollment LSTM output vectors for the enrollment audio signal, wherein a predetermined length of each of the template fixed length representations is the same.
Address: Mountain View, CA, US
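
Illustrative sketch (not part of the patent record): the enrollment step described in claim 1 can be pictured roughly as follows. This is a minimal sketch under several assumptions that the record does not state: a single-layer LSTM with random, untrained weights stands in for the trained network; "combining" is taken to mean concatenating the last k output vectors; and signals with fewer than k frames are zero-padded at the front. The names lstm_outputs and fixed_length_template are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_outputs(features, W, U, b):
    """Run a single-layer LSTM over features (T x D).

    W has shape (4H, D), U has shape (4H, H), b has shape (4H,).
    Returns one output vector per input feature vector (T x H).
    """
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x in features:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)  # input, forget, output gates; candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.array(outputs).reshape(len(outputs), hidden)


def fixed_length_template(outputs, k):
    """Combine at most k LSTM output vectors into a vector of length k * H.

    Assumption: keep the last k outputs, concatenate them, and zero-pad
    at the front when the signal has fewer than k frames. Requires k >= 1.
    """
    hidden = outputs.shape[1]
    last = outputs[-k:]  # at most k vectors
    pad = np.zeros((k - len(last), hidden))
    return np.concatenate([pad, last]).ravel()


# Two enrollment signals of different lengths (5 and 12 frames).
rng = np.random.default_rng(0)
D, H, k = 3, 4, 8
W = rng.standard_normal((4 * H, D))
U = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)

short_template = fixed_length_template(
    lstm_outputs(rng.standard_normal((5, D)), W, U, b), k)
long_template = fixed_length_template(
    lstm_outputs(rng.standard_normal((12, D)), W, U, b), k)
```

Both templates have length k * H regardless of how many frames the enrollment signal contained, which is the property the claim's final clause requires. How the templates are later compared against a new audio signal is outside what this sketch covers.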