Animated Pose Templates for Modelling and Detecting Human Actions

Benjamin Yao, Bruce Nie, Zicheng Liu and Song-Chun Zhu

Download: paper  datasets  code


This paper presents animated pose templates (APTs) for detecting short-term, long-term and contextual actions from cluttered scenes in videos. Each pose template consists of two components: i) a shape template with deformable parts represented in an And-node, whose appearances are represented by Histogram of Oriented Gradients (HOG) features; and ii) a motion template specifying the motion of the parts by Histogram of Optical Flow (HOF) features. A shape template may have more than one motion template, represented by an Or-node. Therefore each action is defined as a mixture (Or-node) of pose templates in an And-Or tree structure. While this pose template is suitable for detecting short-term action snippets in 2-5 frames, we extend it in two ways: i) for long-term actions, we animate the pose templates by adding temporal constraints in a Hidden Markov Model (HMM); and ii) for contextual actions, we treat contextual objects as additional parts of the pose templates and add constraints that encode spatial correlations between parts. To train the model, we manually annotate part locations on several key frames of each video and cluster them into pose templates using EM. This leaves the unknown parameters for our learning algorithm in two groups: i) latent variables for the unannotated frames, including pose IDs and part locations; and ii) model parameters shared by all training samples, such as weights for the HOG and HOF features, canonical part locations of each pose, and coefficients penalizing pose transitions and part deformations. To learn these parameters, we introduce a semi-supervised structural SVM algorithm that iterates between two steps: i) learning (updating) model parameters from the labeled data by solving a structural SVM optimization; and ii) imputing the missing variables (i.e., detecting actions on unlabeled frames) with the parameters learned in the previous step, progressively accepting high-scoring frames as newly labeled examples.
This algorithm belongs to a family of optimization methods known as the Concave-Convex Procedure (CCCP), which converges to a local optimum. The inference algorithm consists of two components: i) detecting top candidates for the pose templates; and ii) computing the sequence of pose templates. Both are done by dynamic programming, or more precisely beam search. In experiments, we demonstrate that our method is capable of discovering salient poses of actions as well as interactions with contextual objects. We test our method on several public action datasets and a challenging outdoor contextual action dataset collected by ourselves. The results show that our model achieves comparable or better performance than state-of-the-art methods.
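The alternating learning scheme in the abstract can be sketched as a toy self-training loop. This is only an illustration under simplifying assumptions (the function name is ours, and ordinary least squares stands in for the paper's structural SVM step): fit a linear scorer on the labeled data, then promote confidently scored unlabeled samples to the labeled set and repeat.

```python
import numpy as np

def self_training_loop(X_lab, y_lab, X_unlab, threshold=0.5, n_iters=3):
    """Toy sketch of the alternating scheme described above (not the
    authors' full structural SVM): fit a linear scorer on labeled data,
    then accept high-scoring unlabeled samples as new labeled examples."""
    X_lab, y_lab = list(X_lab), list(y_lab)
    w = np.zeros(len(X_lab[0]))
    for _ in range(n_iters):
        # Step 1: "learn" w from labeled data (least squares stands in
        # for the structural SVM optimization in the paper).
        A, b = np.array(X_lab), np.array(y_lab)
        w, *_ = np.linalg.lstsq(A, b, rcond=None)
        # Step 2: impute labels on unlabeled data; accept confident ones.
        remaining = []
        for x in X_unlab:
            s = float(np.dot(w, x))
            if abs(s) > threshold:          # high score -> pseudo-label
                X_lab.append(x)
                y_lab.append(np.sign(s))
            else:
                remaining.append(x)
        X_unlab = remaining
    return w
```

The key property this mimics is that each round enlarges the labeled set only with samples the current model scores confidently, so the two steps reinforce each other as in a CCCP-style alternation.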


Moving pose templates

Moving pose templates are used to model short-term actions, or so-called action snippets: actions observed in 3-5 frames that contain rich appearance and motion information about the action. A moving pose template consists of two components: a shape template and a motion template. Each action is composed of a sequence of key poses (action snippets), and the number of poses depends on the complexity of the action. For example, as the figure above shows, the hand-clapping action contains three poses, each of which is a moving pose template consisting of a shape template and one of two motion templates, depending on the direction of movement.
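Scoring a moving pose template can be sketched as a per-part sum of an appearance term, a motion term, and a deformation penalty. This is a minimal illustration with hypothetical names (the paper's actual feature extraction and weight learning are not shown): `hog_feats`/`hof_feats` stand for precomputed HOG/HOF descriptors at each part location.

```python
import numpy as np

def score_moving_pose_template(hog_feats, hof_feats, w_shape, w_motion,
                               anchors, part_locs, defor_coef):
    """Toy scoring sketch (names are ours, not the paper's): sum, over
    parts, a HOG appearance term, a HOF motion term, and a quadratic
    deformation penalty between each part and its canonical anchor."""
    score = 0.0
    for i, loc in enumerate(part_locs):
        score += float(np.dot(w_shape[i], hog_feats[i]))   # shape term
        score += float(np.dot(w_motion[i], hof_feats[i]))  # motion term
        dx = np.asarray(loc, dtype=float) - np.asarray(anchors[i], dtype=float)
        score -= defor_coef * float(np.dot(dx, dx))        # deformation cost
    return score
```

The deformation term lets parts shift away from their canonical locations at a quadratic cost, which is what makes the shape template deformable.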

Animated pose templates

Animated pose templates build on several moving pose templates and are used to model long-term, continuous actions: walking and running, for example, can be represented by sequences of moving pose templates. Our animated pose template is a generative model. The shape templates of consecutive action snippets are considered trackable, so we track the bounding boxes of the root and part nodes over time with an HMM that captures both the spatial constraints on the movement of bounding boxes between frames and the transitions between moving pose templates. The details inside each part are considered untrackable, so we compute histograms of optical flow without pixel correspondence.
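The HMM over moving pose templates can be decoded with a standard Viterbi-style dynamic program. The sketch below is a simplified illustration (variable names are ours): `frame_scores[t][k]` is the detection score of template k at frame t, and `transition_cost[j][k]` penalizes switching templates between frames, standing in for the paper's temporal constraints.

```python
import numpy as np

def best_pose_sequence(frame_scores, transition_cost):
    """Viterbi-style DP sketch: find the template sequence maximizing
    per-frame detection scores minus pose-transition penalties."""
    frame_scores = np.asarray(frame_scores, dtype=float)
    trans = np.asarray(transition_cost, dtype=float)
    T, K = frame_scores.shape
    dp = frame_scores[0].copy()            # best score ending in each state
    back = np.zeros((T, K), dtype=int)     # backpointers for the argmax
    for t in range(1, T):
        cand = dp[:, None] - trans         # cand[j, k]: come from j, go to k
        back[t] = np.argmax(cand, axis=0)
        dp = cand.max(axis=0) + frame_scores[t]
    # Backtrace the best template sequence.
    seq = [int(np.argmax(dp))]
    for t in range(T - 1, 0, -1):
        seq.append(int(back[t][seq[-1]]))
    return seq[::-1]
```

With a large transition cost the decoder stays on one template; with a free transition it follows the per-frame maxima. In practice the paper prunes this search with beam search rather than exhaustive DP.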

Animated pose templates with contextual object

Contextual objects are divided into two subsets based on the information flow between object and pose: weak objects and strong objects. Weak objects are either too small or too diverse in appearance to be detected reliably on their own; for these, the inference process localizes them from the inferred body parts, with quadratic functions constraining the relative positions of body part and object. Strong objects, such as cups and trash cans, are more distinguishable on their own, and we treat them in the same way as body parts in the moving pose templates.
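The weak-object case can be sketched as picking, among candidate object positions, the one with the smallest quadratic penalty around the location predicted from an inferred body part. All names here are hypothetical; the expected offset and precision matrix stand in for the learned quadratic constraint.

```python
import numpy as np

def localize_weak_object(part_loc, offset, precision, candidates):
    """Sketch of the weak-object case (names are ours): given an inferred
    body-part location, score candidate object positions with a quadratic
    penalty around the expected offset and keep the best one."""
    expected = np.asarray(part_loc, dtype=float) + np.asarray(offset, dtype=float)
    best, best_cost = None, float("inf")
    for c in candidates:
        d = np.asarray(c, dtype=float) - expected
        cost = float(d @ (np.asarray(precision, dtype=float) @ d))
        if cost < best_cost:
            best, best_cost = c, cost
    return best
```

Because the object is localized through the pose rather than detected independently, a weak object needs no reliable appearance model of its own.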

Examples of detection results

MSR datasets

Coffee & cigarette dataset

CMU dataset

Datasets Download

Microsoft Research Dataset with part annotations

UCLA action datasets

This work is supported by NSF grants IIS-1018751 and CNS-1028381, and ONR MURI grant N00014-10-1-093