Principal claim |
1. A computer-implemented method comprising:
receiving, by a key phrase detection system that is trained to detect a presence of an utterance of a particular key phrase in an audio waveform, a plurality of audio frame vectors that each model an audio waveform during a different period of time;
generating, by the key phrase detection system, two or more acoustic event vectors by coding respective ones of two or more audio frame vectors from the plurality of audio frame vectors without decoding using a language model, each of the two or more acoustic event vectors having a predetermined length;
generating, by the key phrase detection system, a pooled event vector by pooling all of the two or more acoustic event vectors, the pooled event vector having the predetermined length;
determining, by the key phrase detection system, whether the particular key phrase was present in the audio waveform during the period of time modeled by the audio frame vectors; and
outputting, by the key phrase detection system, a score that indicates a likelihood of whether or not the particular key phrase was present in the audio waveform during the period of time modeled by the audio frame vectors in response to determining whether the particular key phrase was present in the audio waveform during the period of time modeled by the audio frame vectors. |
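
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the claim does not specify the coding function, the pooling operation, or the scoring function, so the tanh projection, element-wise max pooling, logistic scorer, random weights, and every name below (`encode_frame`, `pool_events`, `key_phrase_score`, `detect`, `EVENT_LEN`) are assumptions chosen only for illustration. No language model is involved at any step.

```python
import math
import random

# Assumed "predetermined length" shared by event vectors and the pooled vector.
EVENT_LEN = 4


def encode_frame(frame, weights):
    """Code one audio frame vector into a fixed-length acoustic event vector.

    Hypothetical coder: a linear projection followed by tanh.
    """
    return [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]


def pool_events(events):
    """Pool all event vectors by element-wise max (an assumed pooling choice).

    The result has the same length as each input event vector.
    """
    return [max(column) for column in zip(*events)]


def key_phrase_score(pooled, classifier_weights, bias=0.0):
    """Map the pooled event vector to a likelihood in (0, 1) with a logistic unit."""
    z = sum(w * p for w, p in zip(classifier_weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def detect(frames, weights, classifier_weights, threshold=0.5):
    """Run the receive -> code -> pool -> determine -> output sequence."""
    events = [encode_frame(frame, weights) for frame in frames]
    pooled = pool_events(events)
    score = key_phrase_score(pooled, classifier_weights)
    return score >= threshold, score


# Untrained, randomly initialised parameters and synthetic frames (illustration only).
rng = random.Random(0)
FRAME_DIM = 6
weights = [[rng.uniform(-1, 1) for _ in range(FRAME_DIM)] for _ in range(EVENT_LEN)]
classifier_weights = [rng.uniform(-1, 1) for _ in range(EVENT_LEN)]
frames = [[rng.uniform(-1, 1) for _ in range(FRAME_DIM)] for _ in range(5)]

present, score = detect(frames, weights, classifier_weights)
print(f"key phrase present: {present}, score: {score:.3f}")
```

Because the pooling is element-wise, the pooled vector keeps the predetermined length regardless of how many frames are coded, which is the property the claim recites for the pooled event vector.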