Title of invention: Keyword detection without decoding
Abstract: Embodiments pertain to automatic speech recognition in mobile devices to establish the presence of a keyword. An audio waveform is received at a mobile device. Front-end feature extraction is performed on the audio waveform, followed by acoustic modeling, high-level feature extraction, and output classification to detect the keyword. Acoustic modeling may use a neural network or a vector quantization dictionary, and high-level feature extraction may use pooling.
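The vector quantization option mentioned in the abstract can be illustrated with a minimal sketch. It assumes hard assignment by squared Euclidean distance and a one-hot output; the record specifies neither, and the names `vq_code`, `codebook`, and `frames`, along with all numeric values, are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vq_code(frame: Sequence[float],
            codebook: Sequence[Sequence[float]]) -> List[float]:
    """Code one frame as a one-hot vector over the dictionary entries.

    Assumption: the frame is assigned to its single nearest codeword.
    The output length equals the dictionary size, so every coded frame
    has the same predetermined length.
    """
    nearest = min(range(len(codebook)),
                  key=lambda k: squared_distance(frame, codebook[k]))
    return [1.0 if k == nearest else 0.0 for k in range(len(codebook))]


# Hypothetical 2-dimensional front-end features and a 3-entry dictionary.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.1]]

# One fixed-length vector per frame.
events = [vq_code(frame, codebook) for frame in frames]
```

Each coded frame depends only on the dictionary, so no language model or decoding pass is involved at this stage.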
Publication number: US9378733(B1) Publication date: 2016.06.28
Application number: US201313860982 Filing date: 2013.04.11
Applicant: Google Inc. Inventors: Vanhoucke Vincent O.;Vinyals Oriol;Nguyen Patrick An Phu;San Martin Maria Carolina Parada;Schalkwyk Johan
Classification: G10L17/24;G06F21/46;G10L15/08;G10L15/22;G10L25/51 Main classification: G10L17/24
Agency: Fish & Richardson P.C. Agent: Fish & Richardson P.C.
Principal claim: 1. A computer-implemented method comprising:
  receiving, by a key phrase detection system that is trained to detect a presence of an utterance of a particular key phrase in an audio waveform, a plurality of audio frame vectors that each model an audio waveform during a different period of time;
  generating, by the key phrase detection system, two or more acoustic event vectors by coding respective ones of two or more audio frame vectors from the plurality of audio frame vectors without decoding using a language model, each of the two or more acoustic event vectors having a predetermined length;
  generating, by the key phrase detection system, a pooled event vector by pooling all of the two or more acoustic event vectors, the pooled event vector having the predetermined length;
  determining, by the key phrase detection system, whether the particular key phrase was present in the audio waveform during the period of time modeled by the audio frame vectors; and
  outputting, by the key phrase detection system, a score that indicates a likelihood of whether or not the particular key phrase was present in the audio waveform during the period of time modeled by the audio frame vectors in response to determining whether the particular key phrase was present in the audio waveform during the period of time modeled by the audio frame vectors.
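The pooling and scoring steps of the claim can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: element-wise max pooling and a logistic classifier are choices made here (the claim says only "pooling" and "a score"), and the names `max_pool`, `key_phrase_score`, the weights, the bias, and the input values are all hypothetical.

```python
import math
from typing import List, Sequence


def max_pool(event_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Pool all event vectors element-wise into one vector of the same length.

    Assumption: max pooling. Any pooling that preserves the vector length
    would match the claim wording equally well.
    """
    if not event_vectors:
        raise ValueError("at least one event vector is required")
    length = len(event_vectors[0])
    if any(len(v) != length for v in event_vectors):
        raise ValueError("all event vectors must share the predetermined length")
    # zip(*...) groups the i-th element of every vector together.
    return [max(column) for column in zip(*event_vectors)]


def key_phrase_score(pooled: Sequence[float],
                     weights: Sequence[float],
                     bias: float) -> float:
    """Map the pooled vector to a score in (0, 1) with a logistic unit.

    Assumption: a single linear layer with a sigmoid output stands in
    for the output classification stage.
    """
    activation = bias + sum(w * x for w, x in zip(weights, pooled))
    return 1.0 / (1.0 + math.exp(-activation))


# Hypothetical acoustic event vectors of predetermined length 3.
event_vectors = [
    [0.1, 0.7, 0.0],
    [0.6, 0.2, 0.1],
    [0.3, 0.9, 0.0],
]

pooled = max_pool(event_vectors)  # same length as each input vector
score = key_phrase_score(pooled, weights=[2.0, 3.0, -1.0], bias=-3.0)
```

Because the pooled vector has the same length regardless of how many event vectors are pooled, the classifier input size stays fixed for utterances of any duration.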
Address: Mountain View CA US