Title of Invention: Source-specific speech interactions
Abstract: A speech system may be configured to operate in conjunction with a stationary base device and a handheld remote device to receive voice commands from a user. Voice commands may be directed either to the base device or to the handheld device. When performing automatic speech recognition (ASR), natural language understanding (NLU), dialog management, text-to-speech (TTS) conversion, and other speech-related tasks, the system may utilize various models, including ASR models, NLU models, dialog models, and TTS models. Different models may be used depending on whether the user has chosen to speak into the base device or the handheld audio device. The different models may be designed to accommodate the different characteristics of audio and speech that are present in audio provided by the two different components and the different characteristics of the environmental situation of the user.
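The source-dependent model selection described in the abstract can be sketched as a simple lookup keyed on the capturing device. This is a minimal illustration only; the registry structure, device labels, and model names below are hypothetical and do not appear in the patent.

```python
# Hypothetical sketch: the speech service keys its ASR/NLU/dialog/TTS
# models on which device captured the audio -- the stationary base
# device (far-field audio) or the handheld remote (near-field audio).

MODELS = {
    "base": {            # stationary base device: far-field capture
        "asr": "asr-far-field",
        "nlu": "nlu-far-field",
        "dialog": "dialog-hands-free",
        "tts": "tts-far-field",
    },
    "handheld": {        # handheld remote device: near-field capture
        "asr": "asr-near-field",
        "nlu": "nlu-near-field",
        "dialog": "dialog-handheld",
        "tts": "tts-near-field",
    },
}

def select_models(source: str) -> dict:
    """Return the model set for the device that captured the audio."""
    if source not in MODELS:
        raise ValueError(f"unknown audio source: {source!r}")
    return MODELS[source]
```

For example, `select_models("handheld")["asr"]` would yield the near-field ASR model, reflecting the abstract's point that each model set accommodates the audio characteristics of its device.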
Publication Number: US9293134 (B1)    Publication Date: 2016-03-22
Application Number: US201414502103    Filing Date: 2014-09-30
Applicant: Amazon Technologies, Inc.    Inventors: Shirin Saleem; Shamitha Somashekar; Aimee Therese Piercy; Kurt Wesley Piersol; Marcello Typrin
Classification Codes: G10L15/18; G10L15/26; G10L13/00    Primary Classification: G10L15/18
Agent: Lee & Hayes, PLLC    Attorney: Lee & Hayes, PLLC
Principal Claim: 1. A system comprising:
a base device configured to operate from a fixed location to capture far-field audio containing first user speech in response to a user speaking a keyword;
a handheld device configured to operate from a moveable location to capture near-field audio containing second user speech; and
a speech service configured to perform acts comprising:
receiving a first audio signal from the base device, the first audio signal corresponding to the far-field audio;
performing automatic speech recognition (ASR) on the first audio signal using a first ASR model to obtain first ASR results, wherein the first ASR model was trained using far-field audio signals;
receiving a second audio signal from the handheld device, the second audio signal corresponding to the near-field audio;
performing ASR on the near-field audio using a second ASR model to obtain second ASR results, wherein the second ASR model was trained using near-field audio signals;
performing natural language understanding (NLU) on the first ASR results using a first NLU model to determine a first meaning of the first user speech, wherein the first NLU model was trained using ASR results from far-field audio signals; and
performing NLU on the second ASR results using a second NLU model to determine a second meaning of the second user speech, wherein the second NLU model was trained using ASR results from near-field audio signals.
Address: Seattle, WA, US