| Learning Animated Basis for Action Detection & Recognition |
|
|
The goal of this work is to study an Animated Basis model for
learning, detecting and recognizing actions (of both humans and animals) in real-world videos. Many previous approaches to human action recognition assume either that the foreground objects (and sometimes their body parts) have been detected and tracked, or that the silhouettes of these objects are available (through foreground segmentation). These assumptions, however, are sometimes too hard to satisfy, especially for real-world videos with cluttered and dynamic backgrounds. As a result, early work on action recognition has been largely limited to a few well-controlled action datasets with little background clutter or motion, such as the KTH dataset and the Weizmann dataset (and reported recognition performance on these two datasets has already saturated).
Several new action datasets with realistic background settings have been proposed recently, for example the CMU dataset [Y. Ke et al. ICCV07], the HOHA dataset [I. Laptev et al. CVPR08] and the UCF YouTube Action Dataset [J.G. Liu et al. CVPR09]. Learning action templates from these datasets, however, remains a challenge. On the one hand, there is no algorithm that detects and tracks foreground objects robustly. On the other hand, labeling these actions is extremely tedious (and does not scale easily) for two reasons: 1) one bounding box per frame is required
to offset global motion; 2) frame-to-frame correspondences are required to synchronize actions. Two groups of methods have been proposed to get around this challenge. In group 1 (e.g. [E. Shechtman et al. PAMI07] and [Y. Ke et al. ICCV07]), one example per action is used as the template to match against other videos (there is essentially no learning). In group 2 (e.g. [J.G. Liu et al. CVPR09]), a model using loosely grouped feature points (bag-of-features) is proposed to avoid the requirements of strict alignment and synchronization. The problem with group 1 is that using one example as the template is too strict and cannot handle large variations. The problem with group 2 is the same as that of bag-of-features models for 2D images: discarding the spatial (and temporal) locations of features may degrade discriminative power.
In this work, we extend the method of group 1 by learning the templates from a set of training examples. A semi-supervised learning approach is used to avoid the difficulty of annotation. We test our method mainly on the CMU dataset.
|
|
- Representation. Our Animated Basis model is inspired by the active basis model of Y.N. Wu et al. In our generative model,
an action template is a sequence of image templates, each of which consists of a set of shape and motion primitives (Gabor basis elements and optical-flow patches) at selected orientations
and locations (see the "template" column of Fig. 1). The primitives are allowed to shift locally in position and orientation when they are linearly combined to encode each training or testing example, as illustrated by
the "train1-5" columns of Fig. 1.
|
|
[Figure 1 layout. Columns: Template, Train 1, Train 2, Train 3, Train 4, Train 5. Rows: Training Videos, Shape Primitives, Motion Primitives.]
Figure. 1 Illustration of the animated basis model.
The 1st row shows training videos. The 2nd row shows shape primitives (each frame has 70 Gabor wavelets illustrated by bars). The 3rd row shows motion primitives (each frame has 30 optical-flow patches illustrated by colored rectangles). The "template" column shows the initialization video and the learned templates. The "train1-train5" columns show locally shifted versions of the
shape and motion primitives fitted to the corresponding training
videos.
Red bounding boxes in the training videos are detection windows.
|
|
The colors in the figure below indicate the direction and speed of the motion primitives. |
|
|
|
- Learning. We use a three-step
semi-supervised learning procedure (see Fig. 2 for an illustration). 1) For each
action class, a template is initialized from a labeled
(red bounding box) training video. 2) The template
is used to detect actions from other training videos of
the same class by a dynamic space-time warping algorithm (blue bounding boxes). 3) The template is updated by
pooling over all aligned frames using a shared pursuit algorithm. The
2nd and 3rd steps iterate several times to arrive at an optimal
action template.
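The three steps above can be sketched on a toy problem. The code below is only an illustration under strong assumptions: videos are replaced by 1D signals, the dynamic space-time warping detector is replaced by a plain sliding-window search, and the shared pursuit update is replaced by a per-position mean. The names `best_window` and `learn_template` are hypothetical and do not come from the original work.

```python
def best_window(template, signal):
    """Stand-in for step 2: start index of the window closest to the template."""
    n = len(template)
    best_start, best_cost = 0, float("inf")
    for s in range(len(signal) - n + 1):
        # Sum of squared differences between the template and this window.
        cost = sum((signal[s + k] - template[k]) ** 2 for k in range(n))
        if cost < best_cost:
            best_start, best_cost = s, cost
    return best_start


def learn_template(labeled_clip, unlabeled_signals, iterations=3):
    """Alternate detection (step 2) and template update (step 3)."""
    template = list(labeled_clip)  # step 1: initialize from the labeled example
    starts = []
    for _ in range(iterations):
        # Step 2: detect the action in every other training signal.
        starts = [best_window(template, sig) for sig in unlabeled_signals]
        windows = [sig[s:s + len(template)]
                   for sig, s in zip(unlabeled_signals, starts)]
        windows.append(list(labeled_clip))
        # Step 3 (assumed simplification): pool aligned windows by averaging.
        template = [sum(col) / len(windows) for col in zip(*windows)]
    return template, starts


template, starts = learn_template([0, 5, 0],
                                  [[0, 0, 0, 4, 0, 0], [0, 6, 0, 0, 0, 0]])
```

The point of the sketch is the alternation itself: only one example needs a manual label, and the remaining training examples are aligned automatically before they contribute to the template.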
|
|
|
Figure. 2 Diagram of our semi-supervised learning algorithm. See text for details. |
|
 |
Active basis for image.
Each Gabor wavelet element is illustrated by a thin ellipsoid at a certain location and orientation. The upper half shows the perturbation of one basis element. By shifting its location or orientation or both within a limited range, the basis element (illustrated by a black ellipsoid) can change to other Gabor wavelet elements (illustrated by the blue ellipsoids). Because of the perturbations of the basis elements, the active basis represents a deformable template.
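The perturbation of a single basis element can be read as a local maximization over nearby positions and orientations. The following is a minimal sketch under assumptions: the filter responses are taken as a precomputed nested list `resp[o][y][x]`, shifts are applied along the image axes rather than along the normal direction of the element, and the function name `perturbed_response` and its parameters are hypothetical.

```python
def perturbed_response(resp, x, y, o, max_shift=1, max_rot=1):
    """Best response reachable by shifting element (x, y, o) within a small range.

    resp[o][y][x] is the filter response magnitude at orientation index o,
    row y and column x. Returns (value, x, y, o) of the winning perturbed element.
    """
    n_orient = len(resp)
    height, width = len(resp[0]), len(resp[0][0])
    best = (float("-inf"), x, y, o)
    for d_o in range(-max_rot, max_rot + 1):
        oo = (o + d_o) % n_orient  # orientation indices wrap around
        for d_y in range(-max_shift, max_shift + 1):
            yy = y + d_y
            if not 0 <= yy < height:
                continue
            for d_x in range(-max_shift, max_shift + 1):
                xx = x + d_x
                if not 0 <= xx < width:
                    continue
                value = resp[oo][yy][xx]
                if value > best[0]:
                    best = (value, xx, yy, oo)
    return best


# Two orientations on a 3x3 grid: a weak response at the nominal element
# and a stronger one at a neighbouring position and orientation.
resp = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
resp[0][1][1] = 1.0
resp[1][0][2] = 5.0
shifted = perturbed_response(resp, x=1, y=1, o=0)
fixed = perturbed_response(resp, x=1, y=1, o=0, max_shift=0, max_rot=0)
```

With no perturbation allowed, the element keeps its nominal response; with a one-step range it moves to the stronger neighbouring element, which is what makes the template deformable.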
|
 |
The above model can be extended to video:
|
|
|
|
- Shared pursuit algorithm
 |
Shared sketch algorithm. A selected element (colored ellipsoid) is shared by all the training images. For each image, a perturbed version of the element seeks to sketch a local edge segment near the element by a local maximization operation. The elements of the active basis are selected sequentially according to the Kullback-Leibler divergence between the pooled distribution (colored solid curve) of filter responses and the background distribution (black dotted curve). The divergence can be simplified into a pursuit index, which is the sum of the transformed filter responses. The sum essentially counts the number of edge segments sketched by the perturbed versions of the element.
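The selection loop described above can be sketched in one dimension. This is an illustration under assumptions, not the authors' implementation: each training image is reduced to a list of filter responses over positions, the response transform is replaced by a simple cap at `cap`, and the names `local_max` and `shared_sketch` are hypothetical.

```python
def local_max(responses, pos, shift):
    """Largest response within +/- shift of pos, and the position where it occurs."""
    lo = max(0, pos - shift)
    hi = min(len(responses), pos + shift + 1)
    best = max(range(lo, hi), key=lambda p: responses[p])
    return responses[best], best


def shared_sketch(images, n_elements, shift=1, cap=6.0):
    """Greedily select positions shared by all images (toy 1D version)."""
    work = [list(img) for img in images]  # copies, because responses get inhibited
    n_pos = len(work[0])
    selected = []
    for _ in range(n_elements):
        def pursuit_index(pos):
            # Sum over images of the capped, locally maximized response.
            return sum(min(local_max(img, pos, shift)[0], cap) for img in work)

        pos = max(range(n_pos), key=pursuit_index)
        selected.append(pos)
        # Each image explains away the edge segment it sketched near this element.
        for img in work:
            _, hit = local_max(img, pos, shift)
            for p in range(max(0, hit - shift), min(n_pos, hit + shift + 1)):
                img[p] = 0.0
    return selected


images = [
    [0, 3, 0, 0, 0, 2, 0],
    [0, 0, 3, 0, 2, 0, 0],
    [3, 0, 0, 0, 0, 0, 2],
]
elements = shared_sketch(images, n_elements=2)
```

In this example the strong responses sit at positions 1, 2 and 0 in the three images, yet a single shared element at position 1 accounts for all of them because each image uses its own perturbed version.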
|
| |
- Dynamic Space-time warping algorithm
The original DTW algorithm solves the problem of finding a temporal
match between two one-dimensional signals Q and C. As shown
in Fig. 3, the matching process is equivalent to finding an
optimal path in a [T1×T2] matrix, where T1 and T2 are the lengths
of signals Q and C respectively, and can therefore be solved
by dynamic programming. The path is the warping
function w in our definition, which satisfies time continuity
and causality. We want to optimize S, which is also
a continuous function, jointly with w. As illustrated
in Fig. 3, this problem can also be interpreted
as finding an optimal path in a five-dimensional space
(x, y, s, t_template, t_target); in Fig. 3 we draw a 3D cube for
illustration purposes. The original DTW algorithm aligns
the starting and ending points of the two signals,
but we cannot assume the same for template and target
videos. We solve this problem by searching for the starting
and ending points on two surfaces of the cube (illustrated
by the shaded areas in Fig. 3).
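For reference, the one-dimensional case can be sketched as follows. The first function is the classic dynamic-programming DTW with aligned endpoints. The second relaxes the start and end points along the target, which is only a 1D analogue of searching the two cube surfaces, not the five-dimensional search itself. The absolute difference as local cost and the function names are assumptions made for illustration.

```python
def dtw_cost(q, c):
    """Classic DTW: the first and last samples of q and c are forced to align."""
    inf = float("inf")
    t1, t2 = len(q), len(c)
    d = [[inf] * (t2 + 1) for _ in range(t1 + 1)]
    d[0][0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            local = abs(q[i - 1] - c[j - 1])
            # Continuity and causality: only step right, down, or diagonally.
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[t1][t2]


def subsequence_dtw(template, target):
    """DTW with free start and end in the target.

    Returns (cost, start, end), where start and end are inclusive target indices.
    """
    inf = float("inf")
    t1, t2 = len(template), len(target)
    cost = [[inf] * t2 for _ in range(t1)]
    start = [[0] * t2 for _ in range(t1)]
    for j in range(t2):
        # Free start: the template may begin at any target sample.
        cost[0][j] = abs(template[0] - target[j])
        start[0][j] = j
    for i in range(1, t1):
        cost[i][0] = cost[i - 1][0] + abs(template[i] - target[0])
        for j in range(1, t2):
            prev_cost, prev_start = min(
                (cost[i - 1][j], start[i - 1][j]),
                (cost[i][j - 1], start[i][j - 1]),
                (cost[i - 1][j - 1], start[i - 1][j - 1]),
            )
            cost[i][j] = prev_cost + abs(template[i] - target[j])
            start[i][j] = prev_start
    # Free end: pick the cheapest cell in the last template row.
    end = min(range(t2), key=lambda j: cost[t1 - 1][j])
    return cost[t1 - 1][end], start[t1 - 1][end], end


match = subsequence_dtw([1, 2, 3], [0, 0, 1, 2, 3, 0, 0])
```

With aligned endpoints the padding around the action in the target is charged to the match, whereas the free-endpoint variant locates the action inside the longer signal at zero cost.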
|
|
Figure. 3 Illustration of the dynamic space-time warping algorithm. *Illustration of DTW is adapted from [Eamonn Keogh, “Exact Indexing of Dynamic Time Warping”, 2002] |
Results of the DSTW algorithm: |
|
Figure. 4 First row from left to right: 1) A matrix representing the time warping
path, where the T and T0 axes stand for the temporal domains of the target video and the template respectively. Inside the matrix,
the white zig-zag line is the time warping path. The dark area denotes the space of all possible time warping
paths. The gray area is the excluded search range, which is set up manually to save computation time. 2) Shape primitive template. 3) Motion primitive template. Second row from left to right: 1) original video, 2) deformed shape primitives, 3) deformed motion primitives. |
|
|
- We test our algorithm by detecting actions in cluttered videos from the CMU dataset. To save time, we use
the DSTW algorithm to detect multiple events in a single search.
- We generate the precision-recall (P-R) curve of each action by varying the detection threshold;
these are the blue curves in Fig.6.
The red curves are the baseline detection results reported in
[Y. Ke et al. ICCV07]. We use the same criterion
as that paper to decide whether a detection counts as correct. Our performance curve is computed by
averaging detection results from 5 separate runs with randomly
selected training videos; the error bounds show the
best and worst cases over the random selections.
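Sweeping the threshold to obtain a P-R curve can be sketched as below. This is a generic illustration, not the evaluation code used for the reported curves: each detection is assumed to come with a score and a flag saying whether it passed the correctness criterion, and the function name `precision_recall_curve` and its arguments are hypothetical.

```python
def precision_recall_curve(scores, is_correct, num_ground_truth):
    """Return one (threshold, precision, recall) point per detection.

    Lowering the threshold to each detection's score in turn admits one more
    detection, so the curve is traced by visiting detections in descending
    score order. Tied scores are not merged in this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve = []
    true_pos = false_pos = 0
    for i in order:
        if is_correct[i]:
            true_pos += 1
        else:
            false_pos += 1
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / num_ground_truth
        curve.append((scores[i], precision, recall))
    return curve


# Four detections, three of them correct, against four annotated events.
curve = precision_recall_curve(
    scores=[0.9, 0.8, 0.7, 0.6],
    is_correct=[True, False, True, True],
    num_ground_truth=4,
)
```

Averaging several such curves from different random training selections, and recording the per-point minimum and maximum, would give a mean curve with best-case and worst-case bounds.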
|
 |
| |
|
|
|
|
Benjamin Yao and Song-Chun Zhu. Learning Deformable Action Templates from Cluttered Videos, ICCV'09 [paper|poster].