Modeling 4D Human-Object Interactions

Ping Wei, Yibiao Zhao, Nanning Zheng, and Song-Chun Zhu

Overview and Demo

We present a 4D human-object interaction (4DHOI) model for jointly solving three vision tasks: i) segmenting events in a video sequence, ii) recognizing and parsing events, and iii) localizing contextual objects.


The 4DHOI model represents the geometric, temporal, and semantic relations in daily events involving human-object interactions. In 3D space, the interactions between human poses and contextual objects are modeled by semantic co-occurrence and geometric compatibility. Along the time axis, the interactions are represented as a sequence of atomic event transitions. The 4DHOI model is a hierarchical spatial-temporal graph whose structure and parameters are learned with an ordered expectation maximization algorithm. Inference is performed by a dynamic programming beam search algorithm that simultaneously carries out event segmentation, recognition, and object localization.
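To illustrate the inference idea, the following is a minimal sketch (not the authors' implementation) of a dynamic programming beam search that jointly segments a sequence and labels each segment with an atomic event. The inputs are hypothetical: `frame_scores[t][a]` is a per-frame log-score for atomic event `a`, and `trans[a][b]` is a log-score for transitioning from atomic event `a` to `b`.

```python
def beam_search_segment(frame_scores, trans, beam_width=5):
    """Jointly segment and label a sequence of frame scores.

    Returns (score, segmentation), where segmentation is a list of
    (label, end_frame) pairs, one per atomic-event segment.
    This is an illustrative sketch, not the paper's actual algorithm.
    """
    n_labels = len(frame_scores[0])
    # Each beam entry: (cumulative log-score, [(label, end_frame), ...])
    beams = [(0.0, [])]
    for t, scores in enumerate(frame_scores):
        candidates = []
        for score, segs in beams:
            last = segs[-1][0] if segs else None
            for a in range(n_labels):
                s = score + scores[a]
                if a == last:
                    # Extend the current segment to frame t.
                    candidates.append((s, segs[:-1] + [(a, t)]))
                else:
                    # Start a new atomic-event segment; pay the
                    # transition cost if there is a previous segment.
                    if last is not None:
                        s += trans[last][a]
                    candidates.append((s, segs + [(a, t)]))
        # Keep only the top-scoring hypotheses (the "beam").
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]
```

For example, with two atomic events whose per-frame scores switch halfway through a four-frame clip, the search recovers one segment per event along with the combined score.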
We collected a large-scale multiview RGB-D event dataset containing 8 event categories, 11 object classes, 3,815 video sequences, and 383,036 RGB-D frames captured by three RGB-D cameras. A subset of the dataset (about 8 GB), covering the same event categories and viewpoints but with fewer sequences per category, can be downloaded from here.