Abstract |
A method of embedding video for text search includes extracting visual features from a video. The visual features may include, for example, appearance information, motion, audio, and/or similar features. Term vectors are determined from textual descriptions associated with the video. The text may, for example, appear in a title for the video or within the video itself (e.g., as subtitles). A feature projection is computed based on the extracted video features, and a textual projection is computed based on the term vectors. A semantic embedding is computed based on the feature projection and the textual projection by jointly optimizing semantic predictability and semantic descriptiveness. |
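The joint optimization described above can be illustrated with a minimal sketch. It is not the claimed method. It assumes squared-error losses for both terms, ridge regularization, and an alternating least-squares solver, none of which are specified in the abstract. Descriptiveness is modeled as how well the embedding `S` reconstructs the term vectors `Y` through a textual projection `A`, and predictability as how well a feature projection `W` predicts `S` from the video features `X`. The names `fit_joint_embedding`, `joint_loss`, `rank_videos`, `k`, and `lam` are hypothetical.

```python
import numpy as np


def joint_loss(X, Y, W, A, S, lam):
    """Assumed objective: descriptiveness + predictability + ridge penalty."""
    descriptiveness = np.sum((Y - A @ S) ** 2)   # terms reconstructed from embedding
    predictability = np.sum((S - W.T @ X) ** 2)  # embedding predicted from features
    penalty = lam * (np.sum(A ** 2) + np.sum(W ** 2))
    return descriptiveness + predictability + penalty


def fit_joint_embedding(X, Y, k=3, lam=1e-2, n_iters=30, seed=0):
    """Alternating least squares over A, W, and S (an assumed solver).

    X: (d, n) video features, one column per video.
    Y: (m, n) term vectors, one column per video.
    Returns W (d, k), A (m, k), S (k, n), and the loss after each iteration.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    S = rng.standard_normal((k, n))
    I_k, I_d = np.eye(k), np.eye(d)
    history = []
    for _ in range(n_iters):
        # Textual projection: minimize ||Y - A S||^2 + lam ||A||^2 over A.
        A = np.linalg.solve(S @ S.T + lam * I_k, S @ Y.T).T
        # Feature projection: minimize ||S - W^T X||^2 + lam ||W||^2 over W.
        W = np.linalg.solve(X @ X.T + lam * I_d, X @ S.T)
        # Embedding: minimize ||Y - A S||^2 + ||S - W^T X||^2 over S.
        S = np.linalg.solve(A.T @ A + I_k, A.T @ Y + W.T @ X)
        history.append(joint_loss(X, Y, W, A, S, lam))
    return W, A, S, history


def rank_videos(query_terms, X, W, A, lam=1e-2):
    """Rank videos for a text query by cosine similarity in the embedding."""
    k = A.shape[1]
    # Map the query term vector (m,) into the embedding via the textual projection.
    s_q = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ query_terms)
    V = W.T @ X  # videos mapped into the embedding via the feature projection
    sims = (s_q @ V) / (np.linalg.norm(s_q) * np.linalg.norm(V, axis=0) + 1e-12)
    return np.argsort(-sims)  # video indices, best match first
```

Each update solves its subproblem in closed form, so under these assumptions the objective does not increase from one iteration to the next. At query time only `W` and `A` are needed: videos are embedded from their features alone, which is why the predictability term matters for search over videos that have no text.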