Active Basis Image Template (project page)

This work proposes an active basis model, a shared sketch algorithm, and a computational architecture of SUM-MAX maps for representing, learning, and recognizing deformable templates.

Representation: active basis model.

In our generative model, a deformable template is in the form of an active basis, which consists of a small number of Gabor wavelet elements at selected locations and orientations. These elements are allowed to slightly perturb their locations and orientations before they are linearly combined to generate the observed image.
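To make the generative step concrete, here is a minimal toy sketch (not the authors' released code; the element list, Gabor parameters, and perturbation bounds are all assumed for illustration): each element may shift its location and orientation slightly before the elements are linearly combined into an image.

```python
import numpy as np

def gabor(size, x0, y0, theta, freq=0.2, sigma=2.0):
    """A real Gabor wavelet on a size x size lattice, centered at (x0, y0)."""
    ys, xs = np.mgrid[0:size, 0:size]
    xr = (xs - x0) * np.cos(theta) + (ys - y0) * np.sin(theta)
    yr = -(xs - x0) * np.sin(theta) + (ys - y0) * np.cos(theta)
    return np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * freq * xr)

def deform_and_compose(size, elements, coeffs, max_shift=2, max_rot=np.pi / 12, rng=None):
    """Perturb each element's location and orientation slightly, then sum linearly."""
    if rng is None:
        rng = np.random.default_rng(0)
    img = np.zeros((size, size))
    for (x0, y0, theta), c in zip(elements, coeffs):
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)   # location perturbation
        dtheta = rng.uniform(-max_rot, max_rot)                    # orientation perturbation
        img += c * gabor(size, x0 + dx, y0 + dy, theta + dtheta)
    return img

elements = [(20, 20, 0.0), (40, 40, np.pi / 4), (20, 40, np.pi / 2)]
img = deform_and_compose(64, elements, coeffs=[1.0, 0.8, 1.2])
print(img.shape)  # (64, 64)
```

Each call with a different random seed yields a differently deformed instance of the same template, which is the sense in which the basis is "active".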
Below: deformed templates on observed deer examples. In the top left corner is the learned active basis template shared by these training examples. Following the common template are pairs of images and their corresponding deformed templates.

Learning: shared sketch.

The active basis model, in particular the locations and orientations of its basis elements, can be learned from training images by the shared sketch algorithm. The algorithm selects the elements of the active basis sequentially from a dictionary of Gabor wavelets. At each step, the selected element is shared by all the training images, and it is perturbed to encode, or sketch, a nearby edge segment in each training image.
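The selection-with-perturbation step can be sketched in a toy form as follows; the response array, the perturbation range, and the zeroing-out inhibition rule are simplified stand-ins for the actual Gabor filter responses and local inhibition (a sketch under these assumptions, not the released code):

```python
import numpy as np

def shared_sketch(responses, n_select, perturb=1):
    """Toy shared sketch selection.

    responses: array (n_images, n_positions, n_orientations) of squared
    filter responses. At each step, pick the (position, orientation) whose
    perturbed-maximum response, summed over all images, is largest; then
    inhibit nearby responses so each edge is sketched at most once.
    """
    R = responses.astype(float).copy()
    n_img, n_pos, n_ori = R.shape
    selected = []
    for _ in range(n_select):
        score = np.zeros((n_pos, n_ori))
        for p in range(n_pos):
            lo, hi = max(0, p - perturb), min(n_pos, p + perturb + 1)
            # each image maximizes over the allowed position perturbation
            score[p] = R[:, lo:hi, :].max(axis=1).sum(axis=0)
        p, o = np.unravel_index(int(score.argmax()), score.shape)
        selected.append((int(p), int(o)))
        # inhibition: nearby responses are explained and removed
        lo, hi = max(0, p - perturb), min(n_pos, p + perturb + 1)
        R[:, lo:hi, :] = 0.0
    return selected

# three images with the same edge at slightly different positions:
R = np.zeros((3, 10, 4))
R[0, 4, 2] = R[1, 5, 2] = R[2, 6, 2] = 10.0
print(shared_sketch(R, 1))  # [(5, 2)]: one shared element, perturbed per image
```

The example shows the key point: a single shared element at position 5 explains all three images, because each image is allowed to perturb it by one position.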
Below: animated illustration of the selected active Gabor elements for the deer examples.

Inference: SUM-MAX map.

The recognition of the deformable template in an image can be accomplished by a computational architecture that alternates between SUM maps and MAX maps. Computing the MAX maps deforms the active basis to match the image data, and computing the SUM maps scores the template matching by the log-likelihood of the deformed active basis.
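A minimal toy version of this computation is below, with a single response map standing in for the per-orientation SUM1 maps, and the summed responses standing in for the log-likelihood score (an illustrative sketch, not the released implementation):

```python
import numpy as np

def sum_max_score(sum1, template, perturb=1):
    """Toy SUM-MAX inference.

    sum1: (H, W) map of filter responses (the SUM1 stage, simplified to
    one map). MAX1 takes a local maximum over position perturbations,
    deforming the template; SUM2 sums the maxed responses at the
    template's element offsets, scoring the match at each location.
    """
    H, W = sum1.shape
    # MAX1: local maximum over a (2*perturb+1)^2 window
    max1 = np.zeros_like(sum1)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - perturb), min(H, i + perturb + 1)
            j0, j1 = max(0, j - perturb), min(W, j + perturb + 1)
            max1[i, j] = sum1[i0:i1, j0:j1].max()
    # SUM2: template score at every location where the template fits
    th = max(dx for dx, dy in template) + 1
    tw = max(dy for dx, dy in template) + 1
    sum2 = np.full((H - th + 1, W - tw + 1), -np.inf)
    for i in range(sum2.shape[0]):
        for j in range(sum2.shape[1]):
            sum2[i, j] = sum(max1[i + dx, j + dy] for dx, dy in template)
    return sum2
```

Detection then amounts to thresholding or taking the argmax of the SUM2 map.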

Below: detected examples of deer.


Ying Nian Wu, Zhangzhang Si, Chuck Fleming, Song-Chun Zhu. Deformable template as active basis. International Conference on Computer Vision, 2007. PDF | Code | Slides
Ying Nian Wu, Zhangzhang Si, Haifeng Gong, Song-Chun Zhu. Learning active basis model for object detection and recognition. International Journal of Computer Vision, in press. PDF | Code

Discovering object templates with weak supervision (project page)

Objects at unknown locations


The sequence of templates learned in EM iterations 0, 2, 4, 6, 8, 10, the first of which is the starting template learned from the first training image, with no activity (perturbation) of the basis elements and no given bounding box. The images are rescaled to 0.5 times their original sizes. The number of elements in the active basis is 50.

The above plots display the superposed deformed templates from the last iteration on the training images. We use the first 20 images of the Weizmann horse dataset, which exhibit moderate deformations.


The sequence of templates learned in EM iterations. The template size is 225 * 169. The number of elements is 60. The number of iterations is 3.

The superposed deformed templates in the last iteration on the training images.


The template size is 192 * 145. The number of elements is 50.

The superposed deformed templates in the last iteration on the training images.


The template size is 285 * 190. The number of elements is 50.

The superposed deformed templates in the last iteration on the training images.

In the above examples, the initial template is learned from one example.
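As a rough illustration of this EM loop, the toy 1-D analogue below alternates between locating the best-matching window in each signal (E-step) and re-averaging the aligned windows (M-step); the correlation score and plain averaging stand in for the actual template matching and template re-learning, and the template is initialized from the first training signal, mirroring "learned from one example":

```python
import numpy as np

def em_localize(signals, tlen, n_iter=5):
    """Toy EM for learning a template at unknown object locations (1-D)."""
    template = signals[0][:tlen].copy()   # initial template from one example
    for _ in range(n_iter):
        windows = []
        for s in signals:
            # E-step: the best-correlated window is the inferred location
            scores = [float(s[i:i + tlen] @ template)
                      for i in range(len(s) - tlen + 1)]
            best = int(np.argmax(scores))
            windows.append(s[best:best + tlen])
        # M-step: re-learn the template from the aligned windows
        template = np.mean(windows, axis=0)
    return template
```

With a shared pattern hidden at a different offset in each signal, the loop recovers the pattern without any given bounding window.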

Objects with unknown rotations or poses

Baseball Cap
The following animation illustrates the iterative learning of the baseball cap template from cap examples with unknown rotations. Three iterations are shown. The first iteration learns a cap template with a moderate amount of noisy sketches. This raw template is then used to detect the maximum-a-posteriori rotation in each example, and the detected instances are rotated into alignment. The second iteration learns a cleaner template from the automatically aligned examples. The updated template is again used to detect the maximum-a-posteriori rotations of the baseball cap examples. The third iteration does not change the template from the second iteration, signaling the convergence of the learning algorithm.
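The same iterate-detect-align-relearn loop can be sketched in a toy 1-D form, where rotation is replaced by circular shift and averaging stands in for template re-learning (an illustrative analogue, not the actual learning code):

```python
import numpy as np

def learn_with_unknown_rotation(patterns, n_iter=3):
    """Toy rotation-EM: each 'image' is a 1-D circular pattern."""
    template = patterns[0].copy()          # raw initial template
    n = len(template)
    for _ in range(n_iter):
        aligned = []
        for p in patterns:
            # E-step: detect the maximum-scoring shift (the MAP rotation)
            scores = [float(np.roll(p, -k) @ template) for k in range(n)]
            k = int(np.argmax(scores))
            aligned.append(np.roll(p, -k))  # undo the detected rotation
        # M-step: re-learn a cleaner template from the aligned examples
        template = np.mean(aligned, axis=0)
    return template
```

When the inputs are shifted copies of one pattern, the loop aligns them all and converges to that pattern, mirroring how the cap template stops changing once the examples are aligned.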
The figure below shows: left) the learned template after the algorithm converges; right) the training examples and the rotated and deformed templates.

Template learned from 15 images of baseball caps facing different orientations. The image size is 100 * 100. The number of elements is 40. The number of iterations is 5.


Template learned from images of horses facing two different directions. The first row displays the templates learned in the first 3 iterations of the EM algorithm. In the second row, for each training image the deformed template (either in the original pose or in the flipped pose) is plotted to the right of it. The number of training images is 57. The image size is 120 * 100. The number of elements is 40. The number of EM iterations is 3.


Template learned from 11 images of pigeons facing different directions. The image size is 150 * 150. The number of elements is 50. The number of iterations is 3.

Discovering object categories

Local learning on animal faces

We show one example of local learning with active basis templates. Local representative templates are learned from an ensemble of 123 images of animal heads. Starting from each image as a seed, we learn an initial template. We then iteratively find its K nearest neighbors, based on sigmoid-transformed log-likelihood scores, and re-learn the template from these K nearest neighbors. In the following experiment, K = 5. The templates are ranked by their information gain, computed as the average log-likelihood score over a template's K nearest neighbors. We trim the learned templates so that the neighbor sets of the remaining templates do not overlap. This leaves 15 templates:

The following are the 15 templates and their corresponding 5 nearest neighbors.
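The local learning loop can be sketched with mean feature vectors standing in for active basis templates and negative Euclidean distance standing in for the sigmoid-transformed log-likelihood score (a toy analogue under these assumptions, not the experiment code):

```python
import numpy as np

def local_templates(feats, K=5, n_iter=3):
    """Toy local learning: seed a template per example, refine by K-NN."""
    feats = np.asarray(feats, dtype=float)
    templates, neighbors = [], []
    for seed in range(len(feats)):
        t = feats[seed].copy()
        for _ in range(n_iter):
            d = np.linalg.norm(feats - t, axis=1)  # score = -distance
            nn = np.argsort(d)[:K]                 # K nearest neighbors
            t = feats[nn].mean(axis=0)             # re-learn from them
        templates.append(t)
        d = np.linalg.norm(feats - t, axis=1)
        neighbors.append(set(np.argsort(d)[:K].tolist()))
    # rank by "information gain" (here: higher mean score is better)
    gains = [-float(np.linalg.norm(feats[list(nb)] - templates[j], axis=1).mean())
             for j, nb in enumerate(neighbors)]
    kept, covered = [], set()
    for j in np.argsort(gains)[::-1]:              # best-ranked first
        if not (neighbors[j] & covered):           # trim overlapping neighbors
            kept.append(int(j))
            covered |= neighbors[j]
    return kept, templates
```

On data with a few tight clusters, the trimming step keeps roughly one representative template per cluster, which is the effect the 15 retained templates illustrate.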


Ying Nian Wu, Zhangzhang Si, Haifeng Gong, Song-Chun Zhu. Learning active basis model for object detection and recognition. International Journal of Computer Vision, in press. PDF | Code
Zhangzhang Si, Haifeng Gong, Song-Chun Zhu and Ying Nian Wu. Learning active basis models by EM-type algorithms. Statistical Science, in press. PDF | Code

Hybrid (mixed) image template (project page)

Combining the methodology of the active basis model with the idea of composing two types of image manifolds, this work seeks an integrated model for a general set of image features, regardless of their forms and metrics. In our search for informative and generative image features, we observe that simple features, and simple combinations of them, provide promising results in object categorization.

(a) A hedgehog image may be seen as a collection of local image patches, which are either sketches or textures. (b) Quantization in the image space and in the histogram feature space provides an image primitive dictionary and a texture dictionary, respectively, which compete to explain the observed image patches. A hybrid template of the hedgehog is composed of sketches and histogram prototypes explaining local image patches at different locations.


This work implements a method for learning hybrid image templates composed of local sketches and local textures at selected locations. The local sketches and local textures in a hybrid template account for shapes and appearances in images, respectively. With a simple design of features, we demonstrate a learning algorithm in which sketch and texture features are selected and compared by an information-theoretic criterion. Through the two types of features, each local image patch is projected to a low-dimensional space, which enables robust modeling. With this algorithm we can build generative models for object or texture image categories from a small number of features and a small number of training examples, making it a useful tool in various vision tasks. We learn one hybrid template for each category from a small number of roughly aligned training images.

The following pictures are symbolic illustrations of the templates automatically learned from example images of each image category. Black strokes represent sketches, and red blobs represent orientation histograms (note that some are direction-less). It is interesting to see how the two types of features complement each other in explaining different regions of the image lattice, though they do overlap at parts of the object boundary. Mouse over the images to see which object categories they are learned from.
bear head bonsai butterfly cat head clock dollar bill hedgehog horse ketch laptop lion head pigeon head pig head pizza tiger head

The following figures are taken from our poster. These may be helpful in explaining the learning algorithm for hybrid templates and the matching of the templates on testing examples.

The workflow:
Feature selection criteria:
Sketch/texture feature competition:
We also experimented on adaptive textural background where sketch features arise:


In conducting categorization experiments, we are interested in:

  1. whether the combination of the two feature types leads to a performance improvement that is statistically significant;
  2. whether this improvement is consistent across image categories and training sample sizes;
  3. whether and how the significance of the improvement changes across categories.

We compare three models: one that selects only sketch features, one that selects only orientation histogram features, and one that combines both. As the measure of classification performance we use the AUC (area under the ROC curve). Five cross-validation runs per model and per training sample size (5 * 3 * 5 runs in total) are performed, and the three curves are plotted together with their 95% Student-t confidence bounds. The combined model (blue line) shows a significant improvement over both the sketch-only and the texture-only models, and this holds consistently across training sample sizes. Below we show the results for four categories.
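The evaluation protocol can be sketched as follows: the AUC is computed via the rank (Mann-Whitney) statistic, and the 95% Student-t bound uses the critical value for 4 degrees of freedom, matching the 5 cross-validation runs described above (a minimal illustrative helper, not the experiment code):

```python
import math
import statistics

def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def t_interval(values, t_crit=2.776):
    """95% Student-t interval; t_crit=2.776 corresponds to 4 degrees of
    freedom, i.e. 5 cross-validation runs."""
    m = statistics.mean(values)
    half = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return m - half, m + half

print(auc([0.9, 0.8, 0.7], [0.4, 0.3]))          # 1.0: all positives ranked above negatives
print(t_interval([0.80, 0.82, 0.78, 0.81, 0.79]))
```

Two models are then compared by checking whether their confidence intervals separate at each training sample size.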


pig head


cat head

Evaluating feature combination on a spectrum of categories. We selected ~60 object categories and ~40 texture categories. The classification task is then not only distinguishing objects from objects, but also objects from (structured) clutter and one type of clutter from another. This is expected to be a more realistic setting than mere object categorization, since in a real detection/parsing task we would scan over all regions of a testing image and recognize both objects and salient textures.

60 object categories and 41 texture categories from the Caltech101, CUReT and LHI datasets. The task is made moderately difficult by easily confused object categories, e.g., 18 kinds of animal faces, and by similar texture categories. The boxplot compares the three distributions of average precision (AP) over the 100+ categories. The three boxes correspond to the sketch-only, texture-only and combined models. The plot shows that the improvement in classification performance is consistent over a reasonably large set of image categories.

Sketch/texture competition over a wide spectrum of image categories:


Zhangzhang Si, Haifeng Gong, Ying Nian Wu, Song-Chun Zhu. Learning mixed image templates for object recognition. IEEE Conference on Computer Vision and Pattern Recognition, June 2009. PDF | Code: learning, evaluation | poster (pptx)
Ying Nian Wu, Zhangzhang Si, Haifeng Gong, Song-Chun Zhu. Learning active basis model for object detection and recognition. International Journal of Computer Vision, in press. PDF | Code