Main claim |
1. A computer-implemented process for performing video concept detection on a video clip based upon a prescribed set of target concepts, comprising:
using a computer to perform the following process actions: segmenting the clip into a plurality of shots, wherein each shot comprises a series of consecutive frames that represent a distinctive coherent visual theme; constructing a multi-layer multi-instance (MLMI) structured metadata representation of each shot, comprising,
a layer indicator l,
a hierarchy of three layers, said hierarchy comprising,
an uppermost shot layer, l=1, comprising the plurality of shots segmented from the clip,
an intermediate key-frame sub-layer, l=2, contiguously beneath the shot layer, comprising one or more key-frames for each shot, wherein each key-frame comprises one or more of the target concepts, and
a lowermost key-region sub-layer, l=3, contiguously beneath the key-frame sub-layer, comprising a set of filtered key-regions for each key-frame, wherein each filtered key-region comprises a particular target concept, and
a rooted tree structure, comprising a connected acyclic directed graph of nodes, wherein each node comprises structured metadata of a certain granularity describing a particular visual concept, and the granularity of the metadata increases for each successive layer down the hierarchy;
validating a set of pre-generated trained models of the target concepts using a set of training shots selected from the plurality of shots;
recursively generating an MLMI kernel kMLMI( ) which models the MLMI structured metadata representation of each shot by comparing prescribed pairs of shots;
utilizing a regularization framework in conjunction with kMLMI( ) to generate a modified learned objective decision function f( ) which learns a classifier for determining if a particular shot x, that is not in the set of training shots, comprises instances of the target concepts, wherein the regularization framework introduces explicit constraints which serve to restrict instance classification in the key-frame and key-region sub-layers, thus maximizing the precision of the classifier, wherein the explicit constraints introduced by the regularization framework comprise,
a constraint A comprising a ground truth for the target concepts and instance classification labels for the plurality of shots in the shot layer, said constraint A serving to minimize instance classification errors for said shots, and
a constraint B comprising the ground truth, instance classification labels for the key-frames in the key-frame sub-layer, and instance classification labels for the sets of filtered key-regions in the key-region sub-layer, said constraint B serving to minimize instance classification errors for said key-frames and said filtered key-regions. |
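
For illustration only, the three-layer rooted tree and a recursively computed kernel over pairs of shots might be sketched as follows. The claim does not give the form of kMLMI( ), so everything here is an assumption: the `Node` class, the `rbf` node similarity, and the particular recursion in `k_mlmi` (node similarity multiplied by one plus the mean kernel over all child pairs) are hypothetical stand-ins, not the claimed kernel.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One tree node: shot (layer 1), key-frame (layer 2), or key-region (layer 3)."""
    layer: int
    features: List[float]          # hypothetical per-node feature vector
    children: List["Node"] = field(default_factory=list)


def rbf(a: List[float], b: List[float], gamma: float = 0.5) -> float:
    """Gaussian similarity between two feature vectors (an assumed choice)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * d2)


def k_mlmi(a: Node, b: Node) -> float:
    """Hypothetical recursive kernel comparing two nodes of the same layer.

    Node similarity is multiplied by (1 + mean kernel over all child pairs),
    so finer-grained layers contribute to the shot-level comparison.
    """
    if a.layer != b.layer:
        raise ValueError("nodes must belong to the same layer")
    k = rbf(a.features, b.features)
    if not a.children or not b.children:
        return k                    # lowermost layer: plain node similarity
    total = sum(k_mlmi(ca, cb) for ca in a.children for cb in b.children)
    return k * (1.0 + total / (len(a.children) * len(b.children)))


# A shot with one key-frame that holds one key-region, compared with itself.
region = Node(3, [0.1, 0.2])
frame = Node(2, [0.5], [region])
shot = Node(1, [1.0], [frame])
print(k_mlmi(shot, shot))  # 3.0
```

A matrix of such pairwise values over the training shots could then be passed to a kernel classifier; the regularization constraints A and B recited in the claim are not modeled in this sketch.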