A method of generating a temporal saliency map is disclosed. In a particular embodiment, the method includes receiving an object bounding box from an object tracker. The method includes cropping a video frame based at least in part on the object bounding box to generate a cropped image. The method further includes performing spatial dual segmentation on the cropped image to generate an initial mask and performing temporal mask refinement on the initial mask to generate a refined mask. The method also includes generating a temporal saliency map based at least in part on the refined mask.
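The sequence of steps above can be sketched as a short pipeline. This is an illustrative sketch only, not the disclosed implementation: the abstract does not define how spatial dual segmentation, temporal mask refinement, or the saliency map update are computed, so each is replaced here by a labeled stand-in. The assumptions are: two threshold-based segmentations combined by intersection, an overlap check against the previous mask, and a weighted blend of the previous map with the refined mask. All function names, the `(x, y, width, height)` bounding box layout, and the `weight` and `min_iou` parameters are hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[float]]          # grayscale frame as rows of intensities
Mask = List[List[int]]             # binary mask, 1 = foreground
BBox = Tuple[int, int, int, int]   # (x, y, width, height): assumed layout


def crop(frame: Image, box: BBox) -> Image:
    """Crop the video frame to the tracker's bounding box."""
    x, y, w, h = box
    return [row[x:x + w] for row in frame[y:y + h]]


def spatial_dual_segmentation(img: Image) -> Mask:
    """Stand-in (assumption): run two simple segmentations with different
    thresholds (mean and mid-range) and keep pixels both mark as foreground."""
    pixels = [p for row in img for p in row]
    mean_t = sum(pixels) / len(pixels)
    mid_t = (min(pixels) + max(pixels)) / 2
    seg_a = [[1 if p > mean_t else 0 for p in row] for row in img]
    seg_b = [[1 if p > mid_t else 0 for p in row] for row in img]
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(seg_a, seg_b)]


def iou(m1: Mask, m2: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = sum(a & b for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
    union = sum(a | b for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
    return 1.0 if union == 0 else inter / union


def temporal_mask_refinement(initial: Mask, previous: Optional[Mask],
                             min_iou: float = 0.5) -> Mask:
    """Stand-in (assumption): reject an initial mask that overlaps poorly
    with the previous frame's mask and reuse the previous mask instead."""
    if previous is None or iou(initial, previous) >= min_iou:
        return initial
    return previous


def update_saliency(prev: Optional[List[List[float]]], refined: Mask,
                    weight: float = 0.5) -> List[List[float]]:
    """Stand-in (assumption): blend the previous saliency map with the
    refined mask; the first frame uses the refined mask directly."""
    if prev is None:
        return [[float(v) for v in row] for row in refined]
    return [[weight * s + (1 - weight) * m for s, m in zip(rs, rm)]
            for rs, rm in zip(prev, refined)]


def process_frame(frame: Image, box: BBox,
                  prev_mask: Optional[Mask] = None,
                  prev_saliency: Optional[List[List[float]]] = None,
                  weight: float = 0.5):
    """Run one frame through the steps in the order the abstract lists them."""
    cropped = crop(frame, box)
    initial = spatial_dual_segmentation(cropped)
    refined = temporal_mask_refinement(initial, prev_mask)
    saliency = update_saliency(prev_saliency, refined, weight)
    return refined, saliency
```

In use, `process_frame` would be called once per video frame, passing the refined mask and saliency map returned for the previous frame back in. The sketch assumes the bounding box keeps the same size between frames so that consecutive masks can be compared element by element.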