A facility for using a mobile device to search video content takes advantage of computing capacity on the mobile device to capture input through a camera and/or a microphone, extract an audio-video signature of the input in real time, and to perform progressive search. By extracting a joint audio-video signature from the input in real time as the input is received and sending the signature to the cloud to search similar video content through the layered audio-video indexing, the facility can provide progressive results of candidate videos for progressive signature captures.