3DSal: An Efficient 3D-CNN Architecture for Video Saliency Prediction

Abstract:

In this work, we contribute to the video saliency research community by developing a novel saliency prediction model. We propose a video saliency model based on a 3D CNN architecture that captures motion information across multiple adjacent frames. Our model performs a cubic convolution on six consecutive frames to extract spatio-temporal features, allowing us to predict the saliency map of the last frame using its preceding frames. We thoroughly evaluate the performance of our model against state-of-the-art saliency models on three large-scale datasets (DHF1K, UCF-SPORTS, and DAVIS). Experimental results demonstrate that our model is competitive with state-of-the-art models.
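
As a rough illustration of the six-frame input described above, the following minimal Python sketch slides a window of six consecutive frames over a video and applies a single naive 3D convolution to one window. It is not the 3DSal architecture: the single-channel frames, the kernel size, the absence of padding, and the helper names `clips` and `conv3d_valid` are all assumptions made for illustration, since the abstract does not specify layer details.

```python
from typing import List

Frame = List[List[float]]  # H x W frame; single channel is an assumption

CLIP_LEN = 6  # six consecutive frames, as stated in the abstract


def clips(frames: List[Frame], clip_len: int = CLIP_LEN) -> List[List[Frame]]:
    """Return every window of clip_len consecutive frames.

    Window i covers frames[i : i + clip_len]; its prediction target would be
    the saliency map of the last frame in the window.
    """
    return [frames[i:i + clip_len] for i in range(len(frames) - clip_len + 1)]


def conv3d_valid(clip: List[Frame], kernel: List[Frame]) -> List[Frame]:
    """Naive 'valid' 3D convolution of a T x H x W clip with a kt x kh x kw kernel.

    Implemented as cross-correlation (no kernel flip), the convention used by
    most CNN libraries. The output has shape
    (T - kt + 1) x (H - kh + 1) x (W - kw + 1).
    """
    t, h, w = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(t - kt + 1):
        plane = []
        for y in range(h - kh + 1):
            row = []
            for x in range(w - kw + 1):
                acc = 0.0
                # Sum over the kernel's temporal and spatial extent.
                for dz in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out
```

With a kernel whose temporal depth equals the clip length (6 here), the temporal axis collapses to a single plane, which is one simple way for a six-frame clip to yield one spatial map for its last frame. A real model would stack several learned 3D layers with multiple channels instead of a single fixed kernel.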