Active Basis Model, Shared Sketch Algorithm, and Sum-Max Maps for Representing, Learning, and Recognizing Deformable Templates
Ying Nian Wu, Zhangzhang Si, Haifeng Gong, and Song-Chun Zhu



Learning the template from the training images. Each selected Gabor basis
element is represented by a bar.
Detecting and sketching the objects in the testing images using
the learned template.
Examples of learning and detecting cats.
Allowing uncertainties in locations, scales and orientations of objects in training data.
Finding clusters in mixed training data.
Learning part templates for finer sketch at higher resolution.
Paper | Code | Experiments | Model | Learning | Inference | Hierarchical | Texture
The code in this section is more readable and up-to-date than the code in
Section 3. You can start reading from the file "ReadMeFirst".
Typical parameters: scale parameter of Gabor = .7; locationShiftLimit = 4 pixels
(reduced to 3 pixels if the overall locations, scales and rotations are also adjusted, as in item 2 below).
The length or width of the bounding box is around 100 pixels. For a larger scale (e.g., scale parameter = 1), change
locationShiftLimit and the size of the bounding box proportionally. You can tune these parameters, although
in principle they can be selected automatically.
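As a concrete record of these settings, here is a small MATLAB configuration sketch; the variable names are ours for illustration, not necessarily those used in the released code.

% Typical parameter settings from the text above (illustrative variable names).
gaborScale = 0.7;          % scale parameter of the Gabor filters
locationShiftLimit = 4;    % pixels; use 3 when overall location/scale/rotation are also adjusted
boxSize = 100;             % approximate length/width of the bounding box, in pixels
% For a larger Gabor scale (e.g., 1), scale the shift limit and the box proportionally:
factor = 1 / 0.7;
locationShiftLimitLarge = round(locationShiftLimit * factor);   % -> 6
boxSizeLarge = round(boxSize * factor);                         % -> 143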
- Learning from roughly aligned images, and detection using the learned template.
(a) Basic version. This code is the basis for the other code. It learns a single-scale template in the learning step. It searches over multiple resolutions, rotations and left-right poses in the inference (detection, recognition) step. The learning step and the inference step are usually iterated in an unsupervised learning algorithm. The active basis model is designed for unsupervised learning, such as the learning of codebooks.
(b) Local shift version. Recommended. It learns a single-scale template. It allows local shifts in location, scale and rotation, as well as left-right flips of the object in each training image.
- Learning from images where the locations, scales, orientations and left-right poses of the objects are unknown.
Basic version. It learns a single-scale template. It searches over multiple resolutions, rotations and left-right poses in detection.
- Clustering by fitting a mixture of active basis models.
(a) Basic EM version. It learns single-scale templates. It is initialized by random clustering, with a rigorous implementation of the E-step. This code only serves as a proof of concept.
(b) Local shift version. Recommended. It searches over multiple resolutions, rotations and left-right poses in detection. It starts from random clustering, and allows multiple random restarts.
- Learning multi-scale templates.
(a) Learning from aligned images. This code has a different structure and additional features. It learns templates at multiple scales. It uses Gabor wavelets at multiple scales, although one can also use a single-scale Gabor wavelet and multiple resolutions of the images.
(b) Learning from non-aligned images. Based on the code in 4(a).
Post-publication experiences
- Discriminative adjustment by logistic regression: Conditioning on the hidden variables, the active basis model implies a logistic regression for classification based on the MAX1 scores of the selected basis elements. The lambda parameters of the active basis model can be re-estimated discriminatively by regularized logistic regression to further improve classification performance. See Experiment 16, done by Ruixun Zhang, then a summer visiting student from Peking U. A sketch of this adjustment is given after this list.
- Model-based clustering: The EM algorithm for clustering with random initialization seems to perform better than we claimed in the paper, provided the number of training examples in each cluster is not too small. See Experiment 17.
- Local normalization: We have experimented with local normalization of filter responses (e.g., scale parameter = .7, half window size = 20), which is option (3) in Section 3.5 of the paper. We have posted the code and results on the experiments pages in Section 3. It appears that local normalization is better than normalization within the template window, i.e., option (2) in Section 3.5 of the paper, in terms of the appearance of the learned templates and the classification performance. Local normalization is also easier to implement than normalization within the template window. In the updated code of Section 2, this is the only normalization option. A sketch of this normalization is given after this list.
- Range of activity: We have also experimented with a reduced range of activity (e.g., up to 3 or 4 pixels, where the subsampling rate is 1). The model is still quite resilient with the reduced activity, which gives a tighter fit. For a cleaner template, it is important to resize the images to a proper size, so that the bounding box of the template is not too large, e.g., around 100 pixels. In principle, finding the right template size can be done automatically by maximum likelihood.
- Saturation level: The current saturation level in the sigmoid transformation is 6 (after local normalization). Experiments on classification suggest that performance is optimal if it is lowered to 3.5. It is still unclear whether the lower level is better for unsupervised learning. The form of the sigmoid transformation is sketched after this list.
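The following MATLAB fragment is a minimal sketch of the discriminative adjustment described in the first item above: it re-estimates the lambda weights by ridge-regularized logistic regression on the MAX1 scores, using plain gradient ascent. The stand-in data, the ridge penalty rho, the step size eta, and the omission of an intercept are our illustrative choices, not necessarily what Experiment 16 uses.

% Re-estimate lambda by ridge-regularized logistic regression on MAX1 scores.
MAX1scores = rand(200, 50);              % stand-in: 200 examples x 50 selected elements
labels = double(rand(200, 1) > 0.5);     % stand-in 0/1 class labels
lambda0 = ones(50, 1);                   % generative estimates of lambda
rho = 0.01;                              % ridge penalty (assumed value)
eta = 0.5;                               % step size
lambda = lambda0;
for iter = 1:500
    p = 1 ./ (1 + exp(-MAX1scores * lambda));                       % P(label = 1)
    g = MAX1scores' * (labels - p) / numel(labels) - rho * lambda;  % gradient
    lambda = lambda + eta * g;           % ascent on the penalized log-likelihood
end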
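Next, a minimal sketch of the local normalization described above, in the form we assume from the text: each squared filter response is divided by the average response over orientations within a local window; the released code may differ in details.

% Local normalization of filter responses. SUM1{o} holds the squared Gabor
% responses of one image at orientation o; half window size = 20.
O = 16;  SUM1 = cell(O, 1);
for o = 1:O, SUM1{o} = rand(128, 128); end    % stand-in response maps
hw = 20;                                      % half window size
avg = zeros(size(SUM1{1}));
for o = 1:O, avg = avg + SUM1{o} / O; end     % average over orientations
k = ones(2*hw + 1) / (2*hw + 1)^2;            % local averaging kernel
localAvg = conv2(avg, k, 'same');             % local mean within the window
for o = 1:O
    SUM1{o} = SUM1{o} ./ max(localAvg, eps);  % normalize; guard against zero
end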
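Finally, the sigmoid transformation with its saturation level, in the form we believe the paper uses (please check the paper for the exact expression): h(0) = 0, and h(r) approaches the saturation level xi as r grows.

% Sigmoid transformation with saturation level xi (assumed form).
xi = 6;                                       % current default; 3.5 found better for classification
h = @(r) xi * (2 ./ (1 + exp(-2 * r / xi)) - 1);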
This section includes some experiments done after the publication of the two technical reports.
- The code in this section is only for reproducing the experiments. We deliberately
keep some fossilized versions of the code in this section for our own record. See
Section 2 for the most updated code.
- In the mex-C-code "Clearn.c" in this Section 3, many lines are devoted to a bin-chain trick
for global maximization during learning. It now appears that this trick does not speed up the
computation very much, unless the images are very large. So in the updated code in Section 2,
we do not use this trick.
- We also posted code where we normalize each filter response by the local average within a local window, or by the average response within the scanning window of the template.
- For most experiments, we also posted code where we use two large natural images as background images to pool the marginal distributions of filter responses. There is not much change in the results.
- The experiments are re-numbered in the revised version of the manuscript.
- Experiments 1 and 2: Supervised learning (without local shift) and detection, on the ETHZ shape data; supervised learning (without local shift), on the LHI data of animal faces etc.; supervised learning with local shift, which improves the learning results (25 experiments).
- Experiment 3: Supervised learning and classification
- Experiment 4: Model-based clustering (without local shift)
- Experiment 5: Learning from non-aligned images with unknown locations and scales, but the same rotation and aspect ratio
- Experiment 6: Learning a moving template
- Experiment 7: Composing multiple templates
- Experiment 8: Geometric transformation of a template
- Experiment 9: Local learning of representative templates
- Experiment 10: Synthesis by multi-scale Gabors and DoGs
- Experiment 11: Clustering initialized by local learning
- Experiment 12: Discriminative learning by AdaBoost
- Experiment 13: Learning with unknown rotations plus unknown locations and scales and left-right flip (100+ experiments on learning patterns of natural scenes)
- Experiment 14: Composing simple geometric shapes into shape scripts
- Experiment 15: Learning hierarchical active basis by composing partial templates
- Experiment 16: Discriminatively adjusting the active basis model by regularized logistic regression (summer project of Ruixun Zhang, then a visiting undergraduate student from Peking U)
- Experiment 17: Model-based clustering while allowing local shifts of templates
Bug reports
We did not fix the bugs in all the code for the above experiments, because they do not affect the existing results.
Version cross-validation
Check the code by comparing different versions or implementations.
How to represent a deformable template?
The answer is not much beyond the wavelet representation: just add some perturbations to the wavelet elements. The perturbations and the coefficients of the wavelet elements are the hidden variables that seek to explain each individual training image.

Active basis. Each Gabor wavelet element is illustrated by a thin ellipsoid at a certain
location and orientation. The upper half shows the perturbation of one basis element.
By shifting its location or orientation or both within a limited range, the basis
element (illustrated by a black ellipsoid) can change to other Gabor wavelet
elements (illustrated by the blue ellipsoids). Because of the perturbations of the
basis elements, the active basis represents a deformable template.


The active basis model is

$$I_m = \sum_{i=1}^{n} c_{m,i}\, B_{m,i} + U_m, \qquad B_{m,i} = B_{x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}},$$

where "I" stands for a training image, "B" stands for a basis element, "c" stands for coefficients, "x" stands for pixel location, "s" stands for scale, "alpha" stands for orientation, "i" indexes basis elements, "m" indexes training images, "U" stands for the unexplained residual image, and "Delta" stands for perturbations or activities. The Delta x and Delta alpha are hidden variables that are to be inferred.
Simplicity: the active basis model is the simplest generalization of the Olshausen-Field V1 model, and we hope it may be relevant to the cortex beyond V1. The generalization is in the subscripts (or attributes). The model is also the simplest instance of the AND-OR graph advocated by Zhu and Mumford for vision. AND means composition of basis elements; OR means variations of basis elements caused by the Delta hidden variables. The Delta variables can be understood as "residuals" of a shape model built on the geometric attributes of the basis elements, where the shape model takes the form of an average shape plus residuals, which is the simplest shape model.
The following are the deformed templates matched to the training images. They are not about motion; they illustrate deformations. The model is generative and seeks to explain the images.
The probability model at the bottom layer, i.e., conditioning on the Delta hidden variables, or equivalently on the deformed template B_m, is

$$p(I_m \mid B_m) = q(I_m) \prod_{i=1}^{n} \frac{p_i(r_{m,i})}{q(r_{m,i})}, \qquad p_i(r) = \frac{1}{Z(\lambda_i)} \exp\{\lambda_i\, h(r)\}\, q(r),$$

where r_{m,i} = |\langle I_m, B_{m,i} \rangle|^2 is the filter response, h() is the sigmoid transformation, q() is the marginal density pooled from background images, and each \lambda_i is the natural parameter of the exponential tilting.
How to learn a deformable template from a set of training images?
The answer is not much beyond matching pursuit: just do it simultaneously on all the training images. The intuitive concepts of "edges" and "sketch" emerge from this process.
Shared sketch algorithm. A selected element (colored ellipsoid) is shared by all the
training images. For each image, a perturbed version of the element seeks
to sketch a local edge segment near the element by a local maximization operation.
The elements of the active basis are selected sequentially according to the
Kullback-Leibler divergence between the pooled distribution (colored solid curve)
of filter responses and the background distribution (black dotted curve).
The divergence can be simplified into a pursuit index, which is the sum of
the transformed filter responses. The sum essentially counts the number of
edge segments sketched by the perturbed versions of the element.
The prototype of the algorithm is the shared matching pursuit algorithm:

(0) For m = 1, ..., M, initialize the residual image U_m <- I_m. Set i <- 0.
(1) Set i <- i + 1. Select $(x_i, \alpha_i) = \arg\max \sum_m \max_{\Delta x, \Delta\alpha} |\langle U_m, B_{x+\Delta x,\, s,\, \alpha+\Delta\alpha} \rangle|^2$, where the local max is over the allowed perturbations.
(2) For each m, retrieve the perturbations $(\Delta x_{m,i}, \Delta\alpha_{m,i})$ that attain the local max in (1), let $B_{m,i} = B_{x_i+\Delta x_{m,i},\, s,\, \alpha_i+\Delta\alpha_{m,i}}$, and set $c_{m,i} = \langle U_m, B_{m,i} \rangle$.
(3) For each m, update the residual image $U_m \leftarrow U_m - c_{m,i} B_{m,i}$.
(4) Stop if i = n; otherwise go to (1).
The generative learning is largely defined by Step (3), the explaining-away step of matching pursuit, which leads to non-maximum suppression performed separately on each individual image.
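The following MATLAB fragment is a minimal sketch of the shared selection step, under strong simplifying assumptions of ours: a single scale, a single orientation map per image, location perturbation only, and explaining away approximated by zeroing the responses near each selected element rather than by exact residual subtraction.

% Shared sketch selection, simplified. R{m} is a map of transformed filter
% responses h(SUM1) for image m.
M = 5;  R = cell(M, 1);
for m = 1:M, R{m} = rand(64, 64); end    % stand-in response maps
[H, W] = size(R{1});  n = 20;  dX = 4;   % number of elements; shift limit
selected = zeros(n, 2);
for i = 1:n
    pursuitIndex = zeros(H, W);
    for m = 1:M
        % local maximization over location shifts of up to dX pixels
        pursuitIndex = pursuitIndex + movmax(movmax(R{m}, 2*dX+1, 1), 2*dX+1, 2);
    end
    [~, ind] = max(pursuitIndex(:));     % element shared by all the images
    [x, y] = ind2sub([H, W], ind);
    selected(i, :) = [x, y];
    rows = max(1, x-dX):min(H, x+dX);  cols = max(1, y-dX):min(W, y+dX);
    for m = 1:M
        R{m}(rows, cols) = 0;            % crude explaining away / suppression
    end
end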
After learning the deformable template, how to recognize it in a new image?
The answer is no much beyond Gabor filtering, just add a local max filtering,
and then a linear shape filtering.
But this is to be followed by a top-down pass
for sketching the detected object.

Sum-max maps. The SUM1 maps are obtained by convolving the input image with Gabor
filters at all the locations and orientations. The ellipsoids in the SUM1 maps
illustrate the local filtering or summation operation. The MAX1 maps are obtained
by applying a local maximization operator to the SUM1 maps. The arrows in the
MAX1 maps illustrate the perturbations over which the local maximization is taken.
The SUM2 maps are computed by applying a local summation operator to the MAX1 maps,
where the summation is over the elements of the active basis. This operation
computes the log-likelihood of the deformed active basis, and can be interpreted
as a shape filter.
The bottom-up process computes the log-likelihood ratio score

$$\log \frac{p(I \mid B)}{q(I)} = \sum_{i=1}^{n} \max_{\Delta x, \Delta\alpha} \Big[ \lambda_i\, h\big(|\langle I,\, B_{x_i+\Delta x,\, s,\, \alpha_i+\Delta\alpha} \rangle|^2\big) - \log Z(\lambda_i) \Big].$$

It can be decomposed into the following bottom-up operations:

$$\mathrm{SUM1}(x, \alpha) = |\langle I, B_{x, s, \alpha} \rangle|^2,$$
$$\mathrm{MAX1}(x, \alpha) = \max_{\Delta x, \Delta\alpha} \mathrm{SUM1}(x + \Delta x, \alpha + \Delta\alpha),$$
$$\mathrm{SUM2} = \sum_{i=1}^{n} \big[ \lambda_i\, h(\mathrm{MAX1}(x_i, \alpha_i)) - \log Z(\lambda_i) \big],$$

which are followed by the following top-down operations:

$$(\Delta x_i, \Delta\alpha_i) = \arg\max_{\Delta x, \Delta\alpha} \mathrm{SUM1}(x_i + \Delta x, \alpha_i + \Delta\alpha),$$
$$B_i \to B_{x_i + \Delta x_i,\, s,\, \alpha_i + \Delta\alpha_i},$$

which retrieve the perturbations that attain the local maxima and place the deformed template on the detected object.
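The following MATLAB fragment sketches these bottom-up operations under simplifying assumptions of ours: a single scale, location perturbation only, crude hand-made Gabor-like kernels, and a toy two-element template; the released code implements all of this far more carefully.

% Bottom-up SUM1 -> MAX1 -> SUM2, simplified.
I = rand(128);  O = 4;  dX = 4;          % stand-in image; orientations; shift limit
xi = 6;  h = @(r) xi * (2 ./ (1 + exp(-2*r/xi)) - 1);   % sigmoid transformation
gabor = cell(O, 1);
for o = 1:O                              % crude zero-mean Gabor-like kernels
    a = (o - 1) * pi / O;
    [X, Y] = meshgrid(-8:8);  u = X * cos(a) + Y * sin(a);
    g = exp(-(X.^2 + Y.^2) / 18) .* cos(2 * pi * u / 6);
    gabor{o} = g - mean(g(:));
end
template = [0 0 1 1.5 0.5; 3 -2 2 1.2 0.4];   % toy rows: [dx dy orientation lambda logZ]
SUM1 = cell(O, 1);  MAX1 = cell(O, 1);
for o = 1:O
    SUM1{o} = conv2(I, gabor{o}, 'same').^2;                  % squared responses
    MAX1{o} = movmax(movmax(SUM1{o}, 2*dX+1, 1), 2*dX+1, 2);  % local max filtering
end
SUM2 = zeros(size(I));                   % template matching score at each center
for k = 1:size(template, 1)
    t = template(k, :);
    shifted = circshift(MAX1{t(3)}, [-t(1), -t(2)]);   % align element offset (wraps at borders)
    SUM2 = SUM2 + t(4) * h(shifted) - t(5);            % log-likelihood ratio terms
end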
The above architecture is more representational than operational, more top-down than bottom-up. The bottom-up operations serve the purpose of top-down representation. The neurons in the MAX1 layer can be interpreted as representational OR-nodes of an AND-OR graph, where the max operation in the bottom-up pass serves the purpose of arg-max inference in the top-down pass. The neurons in the SUM2 layer represent templates or partial templates by selective connections to neurons in the MAX1 layer.
How to model more articulated objects?
The answer is to further compose multiple active basis models that serve as part-templates.
The hierarchical model and the inference architecture are recursions of the themes in active basis.

Sum-max maps. A SUM2 map is computed for each sub-template. For each SUM2 map,
a MAX2 map is computed by applying a local maximization operator to the SUM2 map.
Then a SUM3 map is computed by summing over the two MAX2 maps. The SUM3 map
scores the template matching, where the template consists of two sub-templates
that are allowed to locally shift their locations.

The hierarchical
active basis model shares the same form as the original active basis model.
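A minimal MATLAB sketch of the MAX2/SUM3 step for two part-templates, in an assumed simplified form: each part's SUM2 map is locally max-pooled, then the maps are shifted by the parts' default offsets from the object center and summed.

% MAX2 / SUM3 for two part-templates, simplified.
SUM2a = rand(128);  SUM2b = rand(128);   % stand-in part score maps
offsetA = [-20, 0];  offsetB = [20, 0];  % default part offsets from the object center
dPart = 6;                               % allowed local shift of each part
MAX2a = movmax(movmax(SUM2a, 2*dPart+1, 1), 2*dPart+1, 2);   % local max of part scores
MAX2b = movmax(movmax(SUM2b, 2*dPart+1, 1), 2*dPart+1, 2);
SUM3 = circshift(MAX2a, -offsetA) + circshift(MAX2b, -offsetB);  % object score map (wraps at borders)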
Shape Script Model
Composing simple geometric shapes (line segments, curve segments,
parallel bars, angles, corners, ellipsoids) represented by active basis models.
Are they related to V2 cells or beyond?

Sum-max maps. The sub-templates can be simple geometric shapes describable by simple geometric attributes, such as line segments, curve segments, corners, angles, parallel bars and ellipsoids, which are allowed to change their attributes, such as locations, scales and orientations, as well as aspect ratios, curvatures, angles, etc.
Where do texture statistics come from?
The answer is that they are pooled as a null hypothesis against which to test the sketch hypothesis. The null hypothesis itself also provides a strong description of image intensity patterns.
Nonadaptive background (or negative) images. The residual image of the active basis
model is assumed to share the same marginal statistics as these images.
Both images are 1024x768. The left image is a rural scene (higher entropy and lower kurtosis),
and the right image is an urban scene (lower entropy and higher kurtosis).
Left: Marginal density q() of responses pooled from the above two natural images.
It is roughly invariant over scales of Gabor wavelets. It has a very long tail
(the small bump at the end is caused by sigmoid saturation), reflecting the
fact that there are edges in natural images. The densities of the active basis
coefficients are exponential tilting of q().
Middle: Link function between natural parameter and mean parameter. This is used
to infer the natural parameters from the observed mean responses at selected
locations and orientations.
Right: Log of normalizing constant versus natural parameter. This is used for
calculating log-likelihood ratio score for template matching.
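A small numeric MATLAB sketch of this machinery, in the form we assume from the captions: the tilted density is exp(lambda * h) q / Z over bins of the transformed response, the link function maps the natural parameter lambda to the mean parameter, and the link is inverted by table lookup.

% Exponential tilting: compute Z(lambda) and the mean parameter mu(lambda)
% on a grid, then invert the link to recover lambda from an observed mean.
bins = linspace(0, 6, 61);               % bin centers of h(r); saturation at 6
q = exp(-bins);  q = q / sum(q);         % stand-in background histogram
lambdas = linspace(0, 3, 301);
Z = zeros(size(lambdas));  mu = zeros(size(lambdas));
for k = 1:numel(lambdas)
    w = exp(lambdas(k) * bins) .* q;     % unnormalized tilted density
    Z(k) = sum(w);
    mu(k) = sum(bins .* w) / Z(k);       % mean parameter: the link function
end
rbar = 2.5;                              % observed mean transformed response
lambdaHat = interp1(mu, lambdas, rbar);  % invert the link (mu is monotone in lambda)
logZhat = interp1(lambdas, log(Z), lambdaHat);   % used in the log-likelihood score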
Adaptive background, where the marginal density q() is pooled from the testing images directly.
For each image, at each orientation, a marginal histogram q of filter responses is pooled over all the pixels of this image. Such an adaptive q forms a natural couple with p, so that their ratios can be used to score template matching. Such marginal histograms lead to the Markov random field model of texture. The concept of stochastic texture arises as a summary of failed sketching, or as the strongest null hypothesis against the hypothesis of sketchability.
The explicit form of the Markov random field texture model is

$$p(I; \lambda) = \frac{1}{Z(\lambda)} \exp\Big\{ \sum_{\alpha} \sum_{x} \lambda_{\alpha}\big( h(|\langle I, B_{x,s,\alpha} \rangle|^2) \big) \Big\}\, q(I),$$

where the potential functions \lambda_{\alpha}() are estimated so that the marginal histograms of filter responses under the model match the pooled histograms, and q(I) is the reference distribution.
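A minimal MATLAB sketch of pooling the adaptive marginal histograms described above (assumed form: one histogram of transformed responses per orientation, pooled over all pixels of the testing image):

% Pool, for one testing image, a marginal histogram of transformed filter
% responses at each orientation, over all pixels of that image.
O = 16;  SUM1 = cell(O, 1);
for o = 1:O, SUM1{o} = rand(128, 128); end    % stand-in squared responses
xi = 6;  h = @(r) xi * (2 ./ (1 + exp(-2*r/xi)) - 1);
edges = linspace(0, xi, 31);                  % histogram bin edges for h(r)
q = zeros(O, numel(edges) - 1);
for o = 1:O
    r = h(SUM1{o}(:));                        % transformed responses, all pixels
    q(o, :) = histcounts(r, edges, 'Normalization', 'probability');
end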
Research reported on this page has been supported by
NSF DMS 0707055, 1007889 and NSF IIS 0713652.
Back to active basis homepage