Learning Animated Basis for Action Detection & Recognition


The goal of this work is to study an Animated Basis model for learning, detecting and recognizing actions (of both humans and animals) from real-world videos. Many previous approaches to human action recognition either assume that the foreground objects (and sometimes their body parts) are detected and tracked, or assume that the silhouettes of these objects are available (through foreground segmentation). These assumptions, however, are often too hard to satisfy, especially for real-world videos with cluttered and dynamic backgrounds. As a result, early work on action recognition has been largely limited to a few well-controlled action datasets with little background clutter or motion, such as the KTH dataset and the Weizmann dataset (and reported recognition performance on these two datasets has already saturated).

Several new action datasets with realistic background settings have been proposed recently, for example the CMU dataset [Y. Ke et al. ICCV07], the HOHA dataset [I. Laptev et al. CVPR08] and the UCF YouTube Action Dataset [J.G. Liu et al. CVPR09]. Learning action templates from these datasets, however, remains a challenge. On the one hand, there is no algorithm that detects and tracks foreground objects robustly. On the other hand, labeling these actions is extremely tedious (and does not scale up easily) for two reasons: 1) one bounding box per frame is required to offset global motion; 2) frame-to-frame correspondences are required to synchronize actions. Two groups of methods have been proposed to get around this challenge. In group 1 (e.g. [E. Shechtman et al. PAMI07] and [Y. Ke et al. ICCV07]), a single example per action is used as the template to match against other videos (there is essentially no learning). In group 2 (e.g. [J.G. Liu et al. CVPR09]), a model of loosely grouped feature points (bag-of-features) is proposed to avoid the requirements of strict alignment and synchronization. The problem with group 1 is that a single example is too rigid a template and cannot handle large variations. The problem with group 2 is the same as that of bag-of-features models for 2D images -- discarding the spatial (and temporal) locations of features degrades discriminative power.

In this work, we extend the methods of group 1 by learning the templates from a set of training examples. A semi-supervised learning approach is used to avoid the difficulty of annotation. We test our method mainly on the CMU dataset.

Overview of our approach

  • Representation. Our Animated Basis model is inspired by the active basis model of Y.N. Wu et al. In our generative model, an action template is a sequence of image templates, each of which consists of a set of shape and motion primitives (Gabor basis elements and optical-flow patches) at selected orientations and locations (see the "template" column of Fig. 1). The primitives are allowed to perturb locally in position and orientation when they are linearly combined to encode each training or testing example, as illustrated by the "train1-5" columns of Fig. 1.
    Figure 1. Illustration of the animated basis model. The 1st row shows training videos. The 2nd row shows shape primitives (each frame has 70 Gabor wavelets, illustrated by bars). The 3rd row shows motion primitives (each frame has 30 optical-flow patches, illustrated by colored rectangles). The "template" column shows the initialization video and learned templates. The "train1-train5" columns show locally shifted versions of the shape and motion primitives fitted to the corresponding training videos. Red bounding boxes in the training videos are the detection windows.
We illustrate the different directions and speeds of the motion primitives with colors in the figure below.
  • Learning. We use a three-step semi-supervised learning procedure (see Fig. 2 for an illustration). 1) For each action class, a template is initialized from a labeled (red bounding box) training video. 2) The template is used to detect actions in the other training videos of the same class by a dynamic space-time warping algorithm (blue bounding boxes). 3) The template is updated by pooling over all aligned frames using a shared pursuit algorithm. Steps 2 and 3 are iterated several times to arrive at an optimal action template.
Figure 2. Diagram of our semi-supervised learning algorithm. See text for details.
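The iterate-align-pool loop above can be sketched on a toy problem. The following is a minimal numpy illustration, not the paper's actual algorithm: 1D signals stand in for videos, a window of one "labeled" signal initializes the template, alignment is done by best-shift correlation instead of dynamic space-time warping, and the update simply averages the aligned windows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the three-step procedure: 1D signals play the role of
# videos; a window of one signal plays the role of the labeled template.
T, W = 100, 20                       # signal length, template window length
pattern = np.sin(np.linspace(0, 3 * np.pi, W))

def make_signal(offset):
    s = 0.1 * rng.standard_normal(T)
    s[offset:offset + W] += pattern  # embed the "action" at a known offset
    return s

signals = [make_signal(off) for off in (10, 35, 60)]

# Step 1: initialize the template from the single "labeled" example.
template = signals[0][10:10 + W].copy()

for _ in range(3):                   # iterate steps 2 and 3
    # Step 2: detect (align) the template in every signal by best shift.
    offsets = []
    for s in signals:
        scores = [np.dot(template, s[t:t + W]) for t in range(T - W)]
        offsets.append(int(np.argmax(scores)))
    # Step 3: update the template by pooling over the aligned windows.
    template = np.mean([s[t:t + W] for s, t in zip(signals, offsets)],
                       axis=0)

print(offsets)  # recovered offsets of the embedded pattern
```

With low noise, the loop recovers the embedded offsets even though only one example was labeled, which is the essential point of the semi-supervised procedure.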

Image Representation by Active Basis Model (this part will be updated later)

    Active basis for image.
    Each Gabor wavelet element is illustrated by a thin ellipsoid at a certain location and orientation. The upper half shows the perturbation of one basis element. By shifting its location or orientation or both within a limited range, the basis element (illustrated by a black ellipsoid) can change to other Gabor wavelet elements (illustrated by the blue ellipsoids). Because of the perturbations of the basis elements, the active basis represents a deformable template.
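The local perturbation described above amounts to a local maximization of filter responses over a small range of positions and orientations. The sketch below, a simplified illustration rather than the active basis implementation, assumes the Gabor filter responses have already been computed as an (orientation, height, width) array; here they are random for demonstration.

```python
import numpy as np

# Assumed precomputed Gabor responses: 8 orientations on a 32x32 grid.
rng = np.random.default_rng(1)
responses = rng.random((8, 32, 32))

def local_max(responses, x, y, o, dx=2, do=1):
    """Max response over a small perturbation range around position (x, y)
    and orientation index o: the element may shift by up to dx pixels and
    do orientation steps, mimicking a deformable basis element."""
    n_orient, H, W = responses.shape
    best = -np.inf
    for oo in range(max(0, o - do), min(n_orient, o + do + 1)):
        patch = responses[oo,
                          max(0, y - dx):min(H, y + dx + 1),
                          max(0, x - dx):min(W, x + dx + 1)]
        best = max(best, patch.max())
    return best

# The perturbed response is never smaller than the response at the
# nominal location and orientation.
print(local_max(responses, 16, 16, 3) >= responses[3, 16, 16])
```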


    The above model can be extended to video:

Semi-supervised Learning from Weakly Labeled Cluttered Videos

  • Shared pursuit algorithm
      Shared sketch algorithm. A selected element (colored ellipsoid) is shared by all the training images. For each image, a perturbed version of the element seeks to sketch a local edge segment near the element by a local maximization operation. The elements of the active basis are selected sequentially according to the Kullback-Leibler divergence between the pooled distribution (colored solid curve) of filter responses and the background distribution (black dotted curve). The divergence can be simplified into a pursuit index, which is the sum of the transformed filter responses. The sum essentially counts the number of edge segments sketched by the perturbed versions of the element.
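The sequential, shared selection with local inhibition can be sketched as follows. This is a simplified toy, not the actual shared pursuit: it sums raw responses over images as the pursuit index (the real algorithm applies a transformation to the filter responses first) and suppresses a square neighborhood around each selected element so the next element sketches a different edge.

```python
import numpy as np

rng = np.random.default_rng(2)
n_imgs, n_orient, H, W = 5, 4, 16, 16
resp = rng.random((n_imgs, n_orient, H, W))  # per-image filter responses

def select_elements(resp, n_elements=3, radius=2):
    pooled = resp.sum(axis=0)  # pursuit index: responses summed over images
    chosen = []
    for _ in range(n_elements):
        o, y, x = np.unravel_index(np.argmax(pooled), pooled.shape)
        chosen.append((int(o), int(y), int(x)))
        # Inhibit responses near the selected element (all orientations),
        # so subsequent elements sketch different parts of the shape.
        pooled[:, max(0, y - radius):y + radius + 1,
                  max(0, x - radius):x + radius + 1] = -np.inf
    return chosen

elements = select_elements(resp)
print(elements)
```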

  • Dynamic Space-time warping algorithm
      The original DTW algorithm solves the problem of finding a temporal match between two one-dimensional signals Q and C. As shown in Fig. 3, the matching process is equivalent to finding an optimal path in a [T1×T2] matrix, where T1 and T2 are the lengths of Q and C respectively, and can therefore be solved by dynamic programming. The path is the warping function w in our formulation, which satisfies temporal continuity and causality. We want to optimize the spatial transformation S, which is also a continuous function, together with w. As illustrated in Fig. 3, this problem can be interpreted as finding an optimal path in a 5-dimensional space (x, y, s, t_template, t_target). In Fig. 3, we draw a 3D cube for illustration purposes. In the original DTW algorithm, the starting and ending points of the two signals are aligned, but we cannot assume the same for the template and target videos. We solve this problem by searching for the starting and ending points on two surfaces of the cube (illustrated by the shaded areas in Fig. 3).
    Figure 3. Illustration of the dynamic space-time warping algorithm. The illustration of DTW is adapted from [Eamonn Keogh, "Exact Indexing of Dynamic Time Warping", 2002].
      Results of the DSTW algorithm:
      Figure 4. First row, from left to right: 1) a matrix representing the time warping path, where the T and T0 axes stand for the temporal domains of the target video and the template respectively; inside the matrix, the white zig-zag line is the time warping path, the dark area denotes the space of all possible time warping paths, and the gray area represents the impossible search range, which is set manually to save computation time; 2) the shape primitive template; 3) the motion primitive template. Second row, from left to right: 1) the original video; 2) deformed shape primitives; 3) deformed motion primitives.
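For reference, the classic 1D DTW recurrence that our DSTW extends can be written in a few lines. This is the textbook algorithm (both endpoints aligned, which DSTW relaxes by also searching over start/end points), not our full space-time version.

```python
import numpy as np

def dtw(q, c):
    """Classic DTW cost between 1-D signals q and c via dynamic programming
    on a len(q) x len(c) matrix; both start and end points are aligned."""
    T1, T2 = len(q), len(c)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = abs(q[i - 1] - c[j - 1])
            # Continuity and causality: each step moves forward in time,
            # matching, stretching, or compressing by one sample.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

q = [0.0, 1.0, 2.0, 1.0, 0.0]
c = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, stretched in time
print(dtw(q, c))                     # 0.0: a perfect warped match
```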

Experiments on the CMU dataset

  • We test our algorithm by detecting actions in cluttered videos from the CMU dataset. To save time, we use the DSTW algorithm to detect multiple events in a single search.
  • We generate the precision-recall (P-R) curve of each action by varying the detection threshold; these are the blue curves in Fig. 6. The red curves show the baseline detection algorithm reported in [Y. Ke et al. ICCV07], and we use the same decision criterion as that paper for counting a detection. Our performance curve is computed by averaging detection results from 5 separate runs with random selection of training videos; error bounds illustrate the best and worst cases over these random selections.
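The threshold sweep behind a P-R curve can be sketched generically. The scores and labels below are made up for illustration; they are not results from our experiments.

```python
import numpy as np

# Hypothetical detection scores and ground-truth labels per detection
# window (1 = true detection, 0 = false alarm).
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
labels = np.array([1,   1,   0,   1,   0,   0])

# Sweep the threshold over the observed scores; each threshold yields one
# (precision, recall) point on the curve.
points = []
for thresh in sorted(scores):
    keep = scores >= thresh
    tp = np.sum(labels[keep] == 1)          # true positives kept
    precision = tp / keep.sum()             # fraction of detections correct
    recall = tp / labels.sum()              # fraction of true events found
    points.append((float(precision), float(recall)))

for p, r in points:
    print(f"precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold moves along the curve toward higher recall at the cost of precision, which is exactly the trade-off the blue curves in Fig. 6 trace out.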


    Coming soon.

Related Publication:

Benjamin Yao and Song-Chun Zhu, Learning Deformable Action Templates from Cluttered Videos, ICCV'09 [paper|poster].

[Back to Benjamin's homepage]