Inferring Shared Attention in Social Scene Videos

Lifeng Fan^{⚹ 1}, Yixin Chen^{⚹ 1}, Ping Wei^{2,1}, Wenguan Wang^{3,1} and Song-Chun Zhu^{1}

Center for Vision, Cognition, Learning, and Autonomy, UCLA¹

(⚹ indicates equal contribution.)

Abstract

This paper addresses the new problem of inferring shared attention in third-person social scene videos. Shared attention is the phenomenon in which two or more individuals simultaneously look at a common target in a social scene. Perceiving and identifying shared attention in videos plays a crucial role in social activities and social scene understanding. We propose a spatial-temporal neural network to detect shared attention intervals in videos and predict shared attention locations in frames. In each video frame, human gaze directions and potential target boxes are the two key features for spatially detecting shared attention in the social scene. In the temporal domain, a convolutional Long Short-Term Memory network uses temporal continuity and transition constraints to optimize the predicted shared attention heatmap. We collect a new dataset, VideoCoAtt, from public TV show videos; it contains 380 complex video sequences with more than 492,000 frames covering diverse social scenes for the study of shared attention. Experiments on this dataset show that our model can effectively infer shared attention in videos. We also empirically verify the effectiveness of the different components of our model.
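
To make the spatial idea in the abstract concrete, the following is a minimal sketch of how gaze directions and candidate target boxes could be combined into a shared attention heatmap for a single frame. It is not the network described in the paper: the Gaussian angular falloff, the element-wise product across people, and the names `gaze_map`, `shared_map`, `best_box`, and `sigma_deg` are all assumptions made for illustration, and the temporal (convolutional LSTM) stage is not covered.

```python
import math

def gaze_map(width, height, head, direction, sigma_deg=15.0):
    """Score each pixel by how closely it lies along one person's gaze ray.

    Assumption: a Gaussian falloff in angle around the gaze direction.
    """
    hx, hy = head
    gaze_angle = math.atan2(direction[1], direction[0])
    sigma = math.radians(sigma_deg)
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - hx, y - hy
            if dx == 0 and dy == 0:
                row.append(0.0)  # the head position itself is not a target
                continue
            diff = math.atan2(dy, dx) - gaze_angle
            diff = math.atan2(math.sin(diff), math.cos(diff))  # wrap to [-pi, pi]
            row.append(math.exp(-(diff * diff) / (2.0 * sigma * sigma)))
        grid.append(row)
    return grid

def shared_map(maps):
    """Element-wise product: high only where every person is looking."""
    height, width = len(maps[0]), len(maps[0][0])
    out = [[1.0] * width for _ in range(height)]
    for m in maps:
        for y in range(height):
            for x in range(width):
                out[y][x] *= m[y][x]
    return out

def best_box(heatmap, boxes):
    """Return the candidate box (x0, y0, x1, y1), half-open, with the highest mean heat."""
    def mean_heat(box):
        x0, y0, x1, y1 = box
        cells = [heatmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        return sum(cells) / len(cells)
    return max(boxes, key=mean_heat)

# Hypothetical 20x20 frame: person A looks right, person B looks down (image coordinates).
heat = shared_map([
    gaze_map(20, 20, head=(0, 10), direction=(1, 0)),
    gaze_map(20, 20, head=(10, 0), direction=(0, 1)),
])
candidates = [(8, 8, 12, 12), (15, 2, 19, 6)]
target = best_box(heat, candidates)
print(target)
```

In this toy frame the two gaze rays cross at pixel (10, 10), so the candidate box around that point is selected. A learned model would replace both the hand-set falloff and the product rule, and would smooth the per-frame heatmaps over time.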

Paper and Demo

Paper

Lifeng Fan^⚹, Yixin Chen^⚹, Ping Wei, Wenguan Wang and Song-Chun Zhu. Inferring Shared Attention in Social Scene Videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [pdf]

@inproceedings{FanCVPR2018,
  title     = {Inferring Shared Attention in Social Scene Videos},
  author    = {Lifeng Fan and Yixin Chen and Ping Wei and Wenguan Wang and Song-Chun Zhu},
  year      = {2018},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}
}

Demo

VideoCoAtt Dataset

The dataset is freely available for research purposes only. You can download the dataset from here.

Questions and suggestions are welcome. Please email lfan [at] ucla.edu.

Please cite this paper if you use the dataset:

Lifeng Fan^⚹, Yixin Chen^⚹, Ping Wei, Wenguan Wang and Song-Chun Zhu. Inferring Shared Attention in Social Scene Videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.