摘要 |
<p>Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing joint image-audio queries. In one aspect, a method includes receiving, from a client device, a joint image-audio query including query image data and query audio data. Query image feature data is determined from the query image data. Query audio feature data is determined from the audio data. The query image feature data and the query audio feature data are provided to a joint image-audio relevance model trained to generate relevance scores for a plurality of resources, each resource including resource image data defining a resource image for the resource and text data defining resource text for the resource. Each relevance score is a measure of the relevance of corresponding resource to the joint image-audio query. Data defining search results indicating the order of the resources is provided to the client device.</p> |