Modeling 4D Human-Object Interactions for Event and Object Recognition

Ping Wei1,2, Yibiao Zhao2, Nanning Zheng1, and Song-Chun Zhu2

1 Xi'an Jiaotong University, China      2 University of California, Los Angeles, USA

Introduction

Event recognition and object recognition in video sequences are two challenging tasks due to complex temporal structures and large appearance variations. In this paper, we propose a 4D human-object interaction model in which the two tasks jointly boost each other. The human-object interaction is defined in 4D space: i) the co-occurrence and geometric constraints between human pose and objects in 3D space; ii) the sub-event transitions and object coherence along the 1D temporal dimension. We represent the structure of events, sub-events, and objects in a hierarchical graph. For an input RGB-D video, we design a dynamic programming beam search algorithm to i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset containing 3,815 video sequences and 383,036 RGB-D frames captured by Kinect cameras. Experimental results on this dataset show the effectiveness of our method.
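
The search procedure itself is not detailed in this summary; below is a minimal sketch, under stated assumptions, of how a dynamic-programming-style beam search could jointly segment a video into sub-events and label each segment. The per-frame sub-event scores and the sub-event transition matrix are hypothetical placeholders standing in for the learned pose and object potentials described in the paper.

# Hedged sketch (not the authors' implementation): beam search over
# segmentations, where each hypothesis is a list of labeled sub-event segments.
import numpy as np

def beam_search_segmentation(frame_scores, transition, beam_width=10, min_len=5):
    """frame_scores: (T, K) per-frame scores for K sub-event labels.
    transition: (K, K) scores for moving from one sub-event label to the next.
    Returns the best-scoring list of (start, end, label) segments covering all T frames."""
    T, K = frame_scores.shape
    # beams[t] holds up to beam_width hypotheses whose last segment ends at frame t.
    # Each hypothesis is a tuple (total_score, last_label, segments).
    beams = {0: [(0.0, None, [])]}
    for t in range(min_len, T + 1):
        candidates = []
        for s, hyps in list(beams.items()):
            if t - s < min_len:
                continue  # would create a segment shorter than min_len
            for score, last, segs in hyps:
                for label in range(K):
                    seg_score = frame_scores[s:t, label].sum()
                    trans_score = transition[last, label] if last is not None else 0.0
                    candidates.append((score + seg_score + trans_score, label,
                                       segs + [(s, t, label)]))
        if candidates:
            candidates.sort(key=lambda h: h[0], reverse=True)
            beams[t] = candidates[:beam_width]  # keep only the top hypotheses
    final = beams.get(T, [])
    return max(final, key=lambda h: h[0])[2] if final else []

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(60, 4))   # 60 frames, 4 hypothetical sub-event labels
    trans = rng.normal(size=(4, 4))
    print(beam_search_segmentation(scores, trans))

In the paper, the hypotheses would additionally carry event and object assignments so that segmentation, event recognition, and object detection are scored jointly; the sketch above keeps only the temporal part of that search.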

The paper is here.

Concurrent Action Detection with Structural Prediction

Ping Wei1,2, Nanning Zheng1, Yibiao Zhao2, and Song-Chun Zhu2

1 Xi'an Jiaotong University, China      2 University of California, Los Angeles, USA

Introduction

Action recognition has often been posed as a classification problem, which assumes that a video sequence has only one action class label and that different actions are independent. However, a single human body can perform multiple actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model in which action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by a unary local detector and by its relations with other actions. We use wavelet features to represent the action sequence and design a composite temporal logic descriptor to describe the relations between actions. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.
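
As a rough illustration of the scoring idea (unary detector responses combined with pairwise action relations), here is a minimal sketch under stated assumptions. The relation function below is a simple interval-relation check loosely inspired by Allen's interval algebra, and the unary scores and relation weights are hypothetical placeholders; it is not the paper's exact composite temporal logic descriptor or learned model.

# Hedged sketch (not the authors' implementation): scoring a set of concurrent
# action detections, where each detection is an (interval, label) pair and the
# total score combines unary detector responses with pairwise temporal-relation
# compatibilities.
from itertools import combinations

def temporal_relation(a, b):
    """Return a coarse temporal relation between intervals a=(s1, e1) and b=(s2, e2)."""
    (s1, e1), (s2, e2) = a, b
    if e1 <= s2:
        return "before"
    if e2 <= s1:
        return "after"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 >= s2 and e1 <= e2:
        return "during"
    if s2 >= s1 and e2 <= e1:
        return "contains"
    return "overlaps"

def joint_score(detections, unary, relation_weights):
    """detections: list of (interval, label) pairs.
    unary: dict mapping (interval, label) to a local detector score.
    relation_weights: dict mapping (label_i, label_j, relation) to a weight."""
    score = sum(unary[(iv, lb)] for iv, lb in detections)
    for (iv1, lb1), (iv2, lb2) in combinations(detections, 2):
        rel = temporal_relation(iv1, iv2)
        score += relation_weights.get((lb1, lb2, rel), 0.0)
    return score

if __name__ == "__main__":
    # One body performing "drink" while "sit": the "sit" interval contains "drink".
    dets = [((10, 80), "sit"), ((30, 50), "drink")]
    unary = {((10, 80), "sit"): 1.2, ((30, 50), "drink"): 0.9}
    weights = {("sit", "drink", "contains"): 0.5}
    print(joint_score(dets, unary, weights))  # 2.6

In a structural-SVM setting, weights of this kind would be learned jointly rather than hand-set, and a window search over the video would propose the candidate intervals that such a scoring function ranks.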

The paper is here.