Title of Invention: VIDEO ANALYSIS BASED LANGUAGE MODEL ADAPTATION
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving audio data obtained by a microphone of a wearable computing device, wherein the audio data encodes a user utterance, receiving image data obtained by a camera of the wearable computing device, identifying one or more image features based on the image data, identifying one or more concepts based on the one or more image features, selecting one or more terms associated with a language model used by a speech recognizer to generate transcriptions, adjusting one or more probabilities associated with the language model that correspond to one or more of the selected terms based on the relevance of one or more of the selected terms to the one or more concepts, and obtaining a transcription of the user utterance using the speech recognizer.
Publication No.: US2014379346(A1)  Publication Date: 2014.12.25
Application No.: US201313923545  Filing Date: 2013.06.21
Applicant: Google Inc.  Inventors: Aleksic Petar; Lei Xin
Classification: G10L15/26; G10L15/25  Main Classification: G10L15/26
Agency:  Agent:
Main Claim: 1. A computer-implemented method comprising: receiving audio data obtained by a microphone of a wearable computing device, wherein the audio data encodes an utterance of a user; receiving image data obtained by a camera of the wearable computing device; identifying one or more image features based on the image data; classifying the image data as pertaining to a particular activity, based at least on the one or more image features, wherein the particular activity is unrelated to providing an explicit user input to the wearable computing device; selecting one or more terms associated with a language model used by a speech recognizer to generate transcriptions; adjusting one or more probabilities associated with the language model that correspond to one or more of the selected terms based on the relevance of one or more of the selected terms to the particular activity; and obtaining, as an output of the speech recognizer that uses the adjusted probabilities, a transcription of the user utterance.
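The probability-adjustment step in the claim can be illustrated with a minimal sketch. The record does not state how probabilities are adjusted, so everything below is an assumption made for illustration: a toy unigram language model, an exponential boost followed by renormalization, hand-written relevance scores for a hypothetical "cooking" activity, and the function names `adapt_unigram_probs`, `score`, and `pick_transcription`. Image feature extraction and activity classification are not modeled; the activity is taken as already known.

```python
import math


def adapt_unigram_probs(probs, relevance, strength=2.0):
    """Boost terms relevant to the detected activity, then renormalize.

    probs:     term -> base unigram probability (toy language model)
    relevance: term -> relevance to the activity, in [0, 1] (assumed scores)
    strength:  hypothetical tuning knob for how strongly to bias
    """
    scaled = {
        term: p * math.exp(strength * relevance.get(term, 0.0))
        for term, p in probs.items()
    }
    total = sum(scaled.values())
    return {term: value / total for term, value in scaled.items()}


def score(words, probs, floor=1e-6):
    """Log-probability of a word sequence under the unigram model."""
    return sum(math.log(probs.get(word, floor)) for word in words)


def pick_transcription(candidates, probs):
    """Return the candidate transcription with the highest model score."""
    return max(candidates, key=lambda text: score(text.split(), probs))


# Toy data: two acoustically similar hypotheses for the same utterance.
base_probs = {"the": 0.90, "add": 0.05, "flower": 0.04, "flour": 0.01}
cooking_relevance = {"flour": 1.0, "add": 0.5}  # assumed, not from the record
candidates = ["add the flower", "add the flour"]

adapted_probs = adapt_unigram_probs(base_probs, cooking_relevance)
print(pick_transcription(candidates, base_probs))
print(pick_transcription(candidates, adapted_probs))
```

With the unadjusted model the more frequent word "flower" wins; after boosting terms relevant to the assumed cooking activity, the hypothesis containing "flour" scores higher. A production recognizer would apply such biasing to an n-gram or neural model during decoding rather than rescoring whole strings, so this sketch only shows the direction of the adjustment.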
Address: Mountain View, CA, US