Abstract |
A method, system, and/or computer program product tracks an object in a video. A bounding box is defined by the user in a first frame, thus representing the object to be tracked based on a point of interest. A static dictionary D is populated with densely overlapping patches from a search window. A new frame in the video is detected, and candidate patches in the new frame that potentially depict the object being tracked are identified. The candidate patches are co-located with the multiple densely overlapping patches to form a dynamic candidate dictionary Y of candidate patches. Candidate patches that best match the densely overlapping patches from the first frame are identified by an L1-norm solution, in order to identify a best-matched patch in the new frame. |
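The dictionary-population step described in the abstract can be illustrated with a minimal sketch. It assumes a grayscale search window stored as a 2-D NumPy array; the function name `dense_patches`, the `patch` and `stride` parameters, and the unit-norm scaling of each atom are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def dense_patches(window, patch=4, stride=2):
    """Collect overlapping patches from a 2-D window as dictionary columns.

    Hypothetical helper: patch size, stride and normalization are assumptions.
    """
    h, w = window.shape
    cols = []
    # A stride smaller than the patch size makes neighbouring patches overlap.
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            p = window[y:y + patch, x:x + patch].astype(float).ravel()
            norm = np.linalg.norm(p)
            # Scale each atom to unit L2 norm (left unchanged if all zeros).
            cols.append(p / norm if norm > 0 else p)
    # One column per patch: shape (patch * patch, number_of_patches).
    return np.stack(cols, axis=1)

# Toy 8x8 "search window"; in practice this would be a crop of the first frame.
window = np.arange(64, dtype=float).reshape(8, 8)
D = dense_patches(window)
```

The same helper could be applied to a new frame to build the candidate dictionary Y, provided the candidate patches are sampled at the same grid locations.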
Principal claim |
1. A method to track an object in a video, the method comprising:
initializing, by one or more processors, a first frame in a video by detecting a search window over an object to be tracked, wherein initializing the first frame comprises defining multiple densely overlapping patches within the search window;
populating, by one or more processors, a static dictionary D with the densely overlapping patches from the search window;
detecting, by one or more processors, a new frame in the video, wherein the new frame includes the object being tracked;
identifying, by one or more processors, candidate patches, in the new frame, that potentially depict the object being tracked;
co-locating, by one or more processors, the candidate patches with the multiple densely overlapping patches to form a dynamic candidate dictionary Y of candidate patches;
identifying, by one or more processors, candidate patches that best match the densely overlapping patches from the first frame to generate selected candidate patches by minimizing a solution:
$\min \|D\alpha_k - y_k\|_2^2 + \lambda \|\alpha_k\|_1$ where the solution minimizes a square of an L2-norm for a distance between atoms in a dictionary D of the densely overlapping patches times an n-dimensional coefficient vector $\alpha_k$ ($D\alpha_k$) and each candidate patch ($y_k$) in dictionary Y, plus a Lagrange multiplier lambda ($\lambda$) times an L1-norm of $\alpha_k$, wherein the Lagrange multiplier is determined by the gradient between an initial atom $d_k$ from D and the candidate atom $y_k$ from Y;
weighting, by one or more processors, the selected candidate patches based on a sparse coefficient of confidence of the selected candidate patches belonging to the object being tracked;
identifying, by one or more processors, a highest weighted candidate patch, from the selected candidate patches, as a patch that depicts the object being tracked in the new frame of the video; and
constructing, by one or more processors, a confidence map for candidate patches in the dictionary Y, wherein the confidence map is a 2-D matrix that depicts a level of confidence that a patch from the new frame matches a densely overlapping patch from the first frame, wherein the confidence map is based on:
$y_{x,y} = \alpha_{xy} D'$, $[x,y] \in O_w$, where $\alpha_{xy}$ is an n-dimensional coefficient vector for each candidate patch at location (x, y), where [x,y] are elements of ($\in$) an object window ($O_w$) that describe locations in an object window in the new frame, wherein $D' = [D_o\ D_b]$, wherein $D_o$ is a dictionary of object patches from the object being tracked, wherein $D_b$ is a dictionary of background patches outside of the object being tracked, and wherein $D_o$ and $D_b$ are both used to discriminate the object patches from the background patches. |
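The L1-regularized minimization and the confidence map recited in the claim can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the solver is plain iterative soft-thresholding (ISTA), which the claim does not name; λ is a fixed constant rather than being derived from a gradient between atoms; and the confidence score (object-atom coefficient mass minus background-atom coefficient mass) is one hypothetical way to use $D' = [D_o\ D_b]$ discriminatively. The names `ista`, `soft_threshold`, `D_prime`, and `conf_map` are all illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise shrinkage operator used by ISTA."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam=0.1, n_iter=200):
    """Approximately minimize ||D a - y||_2^2 + lam * ||a||_1 (assumed solver)."""
    # Lipschitz constant of the gradient 2 * D^T (D a - y).
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - y)
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Toy dictionary D' = [D_o  D_b]: two object atoms, then two background atoms.
n_obj = 2
D_prime = np.eye(4)

# Four candidate patches laid out on a 2x2 grid of locations (columns of Y).
Y = np.array([
    [1.0, 0.0, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

conf = []
for k in range(Y.shape[1]):
    alpha = ista(D_prime, Y[:, k], lam=0.1)
    # Assumed score: weight on object atoms minus weight on background atoms.
    conf.append(np.abs(alpha[:n_obj]).sum() - np.abs(alpha[n_obj:]).sum())

# 2-D confidence map over candidate locations; the maximum marks the best match.
conf_map = np.array(conf).reshape(2, 2)
best = np.unravel_index(np.argmax(conf_map), conf_map.shape)
```

With an identity dictionary each coefficient is simply the candidate value shrunk by λ/2, which makes the toy result easy to check by hand; a real dictionary of image patches would have correlated atoms and would need more iterations to converge.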