Active Basis Model, Shared Sketch Algorithm, and Sum-Max Maps for Representing, Learning, and Recognizing Deformable Templates
Ying Nian Wu, Zhangzhang Si, Haifeng Gong, and Song-Chun Zhu

 
 


Learning the template from the training images. Each selected Gabor basis 
element is represented by a bar. 
 
 
 
 
Detecting and sketching the objects in the testing images using 
the learned template. 
Examples of learning and detecting cats.

Allowing uncertainties in locations, scales and orientations of objects in training data.
 
 
 
 
 
   
 
 
Finding clusters in mixed training data. 
 
 
 
 
 
 
 
 
  
 
Learning part templates for finer sketch at higher resolution. 
 
Paper | Code | Experiments | Model | Learning | Inference | Hierarchical | Texture
The code in this section is more readable and up-to-date than the code in Section 3. You can start reading from the file "ReadMeFirst".
Typical parameters: scale parameter of Gabor = .7; locationShiftLimit = 4 pixels (reduced to 3 pixels if the overall locations, scales and rotations are also adjusted, as in 2); the length or width of the bounding box is around 100 pixels. For a larger scale (e.g., scale parameter = 1), change locationShiftLimit and the size of the bounding box proportionally. You can tune these parameters, although in principle they can be selected automatically.
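For concreteness, a minimal sketch of these settings as a MATLAB parameter struct; the field names are illustrative, not the variable names used in the distributed code:

    % Typical settings for learning at the default scale
    param.gaborScale         = 0.7;  % scale parameter of the Gabor wavelets
    param.locationShiftLimit = 4;    % pixels; use 3 when overall location/scale/rotation are also adjusted
    param.templateSize       = 100;  % length or width of the bounding box, in pixels
    % For a larger Gabor scale (e.g., 1.0), scale locationShiftLimit
    % and templateSize proportionally.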
-   Learning from roughly aligned images, and detection using the learned template.
(a) Basic version
This code is the basis for the other code. It learns a single-scale template in the learning step. It searches over multiple resolutions, rotations and left-right poses in the inference (detection, recognition) step. The learning step and the inference step are usually iterated in an unsupervised learning algorithm. The active basis model is designed for unsupervised learning, such as the learning of codebooks.

(b) Local shift version. Recommended.
It learns a single-scale template. It allows local shift in location, scale and rotation, as well as left-right flip of the object, in each training image.
 
-   Learning from images where the locations, scales, orientations and left-right poses of the objects are unknown.
Basic version. It learns a single-scale template. It searches over multiple resolutions, rotations and left-right poses in detection.
 
-   Clustering by fitting a mixture of active basis models. (A conceptual sketch of the EM loop appears after this list.)
(a) Basic EM version. It learns single-scale templates. It is initialized by random clustering, with a rigorous implementation of the E-step. This code only serves as a proof of concept.
(b) Local shift version. Recommended.
It searches over multiple resolutions, rotations and left-right poses in detection. It starts from random clustering, and allows for multiple random starts.
-   Learning multi-scale templates.
(a) Learning from aligned images. This code has a different structure and additional features. It learns templates at multiple scales. It uses Gabor wavelets at multiple scales, although one can also use a single-scale Gabor wavelet and multiple resolutions of the images.
(b) Learning from non-aligned images. Based on the code in 4.(a).
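As promised above, here is a conceptual sketch of the EM clustering loop for a mixture of K active basis models. It is only meant to convey the structure of the computation; learnActiveBasis and matchTemplate are hypothetical stand-ins for the learning and template-scoring routines in the distributed code:

    % resp: numImages x K matrix of cluster responsibilities,
    % initialized by random clustering
    for iter = 1:numIter
        % M-step: re-learn a template for each cluster, weighting
        % each image by its responsibility
        for k = 1:K
            template{k} = learnActiveBasis(images, resp(:, k));   % hypothetical
        end
        % E-step: responsibilities from template matching scores
        % (log-likelihood ratios, maximized over local shifts)
        for m = 1:numImages
            for k = 1:K
                logL(k) = matchTemplate(images{m}, template{k});  % hypothetical
            end
            p = exp(logL - max(logL));   % subtract max for numerical stability
            resp(m, :) = p / sum(p);
        end
    end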
 
 
Post-publication experiences
-   Discriminative adjustment by logistic regression: Conditioning on the hidden variables, the active basis model implies a logistic regression for classification based on the MAX1 scores of the selected basis elements. The lambda parameters of the active basis model can be re-estimated discriminatively by regularized logistic regression to further improve classification performance. See Experiment 16, done by Ruixun Zhang, then a summer visiting student from Peking U.
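A minimal sketch of this adjustment, assuming the Statistics Toolbox function lassoglm is available; X and y are illustrative names, not those in the experiment code:

    % X: M x n matrix whose m-th row contains the (sigmoid-transformed)
    %    MAX1 scores of the n selected basis elements on image m
    % y: M x 1 vector of labels, 1 for positives, 0 for negatives
    [lambdaAdj, fit] = lassoglm(X, y, 'binomial', 'Lambda', 0.01);
    b0 = fit.Intercept;
    % Classification score for a new image with MAX1 scores xNew (1 x n)
    score = b0 + xNew * lambdaAdj;    % threshold to classify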
-   Model-based clustering: The EM algorithm for clustering with random initialization seems to perform better than we claimed in the paper, if the number of training examples in each cluster is not too small. See Experiment 17.
-   Local normalization: We have experimented with local normalization of the filter responses (e.g., scale parameter = .7, half window size = 20), which is option (3) in Section 3.5 of the paper. We have posted the code and results on the experiments pages in Section 3. It appears that local normalization is better than normalization within the template window, i.e., option (2) in Section 3.5 of the paper, in terms of both the appearance of the learned templates and the classification performance. Local normalization is also easier to implement than normalization within the template window. In the updated code of Section 2, we have made this the only normalization option.
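A minimal sketch of the local normalization at one orientation, assuming S1 is the map of Gabor energies at that orientation (half window size 20, as above); in practice the local average may also pool across orientations:

    halfW = 20;                            % half window size
    w = ones(2*halfW+1) / (2*halfW+1)^2;   % local averaging kernel
    localAvg = filter2(w, S1);             % average energy within the window
    S1norm = S1 ./ (localAvg + eps);       % eps guards against division by zero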
-   Range of activity: We have also experimented with a reduced range of activity (e.g., up to 3 or 4 pixels, where the subsampling rate is 1). The model is still quite resilient with reduced activity, which gives a tighter fit. For a cleaner template, it is important to resize the images to a proper size, so that the bounding box of the template is not too large, e.g., around 100 pixels. In principle, finding the right template size can be done automatically by maximum likelihood.
-   Saturation level: The current saturation level in the sigmoid transformation is 6 (after local normalization). Experiments on classification suggest that the performance is optimal if it is lowered to 3.5. It is still unclear whether this lower level is also better for unsupervised learning.
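For reference, a sketch of a saturating sigmoid of the kind used here (xi is the saturation level; the exact form in the distributed code may differ slightly):

    xi = 6;   % current saturation level; try 3.5 for classification
    h = @(r) xi * (2 ./ (1 + exp(-2*r/xi)) - 1);   % increasing, h(0) = 0, h(r) -> xi as r grows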
 
 
This section includes some experiments done after the publication of the two technical reports.
-  The code in this section is only for reproducing the experiments. We deliberately keep some fossilized versions of the code in this section for our own record. See Section 2 for the most up-to-date code.
-  In the mex-C code "Clearn.c" in this Section 3, many lines are devoted to a bin-chain trick for global maximization during learning. It now appears that this trick does not speed up the computation very much, unless the images are very large. So in the updated code in Section 2, we do not use this trick.
-  We also posted code where we normalize each filter response by the local average within a local window, or by the average response within the scanning window of the template.
-  For most experiments, we also posted code where we use two large natural images as background images to pool the marginal distributions of filter responses. There is not much change in the results.
  
 -  The experiments are re-numbered in the revised version of the manuscript. 
 
- Experiments 1 and 2: Supervised learning (without local shift) and detection, ETHZ shape data; supervised learning (without local shift), LHI data of animal faces etc.; supervised learning with local shift, which improves learning results (25 experiments)
- Experiment 3: Supervised learning and classification
- Experiment 4: Model-based clustering (without local shift)
- Experiment 5: Learning from non-aligned images with unknown locations and scales, but the same rotation and aspect ratio
- Experiment 6: Learning a moving template
- Experiment 7: Composing multiple templates
- Experiment 8: Geometric transformation of the template
- Experiment 9: Local learning of representative templates
- Experiment 10: Synthesis by multi-scale Gabors and DoGs
- Experiment 11: Clustering initialized by local learning
- Experiment 12: Discriminative learning by AdaBoost
- Experiment 13: Learning with unknown rotations, plus unknown locations, scales and left-right flip (100+ experiments on learning patterns of natural scenes)
- Experiment 14: Composing simple geometric shapes into shape scripts
- Experiment 15: Learning hierarchical active basis by composing partial templates
- Experiment 16: Discriminatively adjusting the active basis model by regularized logistic regression (summer project of Ruixun Zhang, then a visiting undergraduate student from Peking U)
- Experiment 17: Model-based clustering while allowing local shifts of templates
 
 
Bug reports
We did not remove the bugs from all the code in the above experiments, because they do not affect the existing results.
Version cross-validation
Check the code by comparing different versions or implementations.
 
How to represent a deformable template?
The answer is not much beyond the wavelet representation: just add some perturbations to the wavelet elements.
The perturbations and the coefficients of the wavelet elements are the hidden variables that seek to explain each individual training image.

Active basis. Each Gabor wavelet element is illustrated by a thin ellipsoid at a certain 
location and orientation. The upper half shows the perturbation of one basis element. 
By shifting its location or orientation or both within a limited range, the basis 
element (illustrated by a black ellipsoid) can change to other Gabor wavelet 
elements (illustrated by the blue ellipsoids). Because of the perturbations of the 
basis elements, the active basis represents a deformable template.


whew "I" stands for training image, "B" stands for basis, "c" stands for coefficients, "x" stands for 
pixel, "s" stands for scale, "alpha" stands for orientation, "i" indexes 
basis elements, "m" indexes training images, "Delta" stands for perturbations
or activities. The Delta x and Delta alpha are hidden variables that are to be inferred. 
 
Simplicity: the active basis model is the simplest generalization of the Olshausen-Field V1 model, and we hope it may be relevant to the cortex beyond V1. The generalization is in the subscripts (or attributes). The model is also the simplest instance of the AND-OR graph advocated by Zhu and Mumford for vision: AND means composition of basis elements; OR means variations of basis elements caused by the Delta hidden variables. The Delta variables can be understood as "residuals" of a shape model built on the geometric attributes of the basis elements, where the shape model takes the form of an average shape plus residuals, which is the simplest shape model.
The following are the deformed templates matched to the training images. They are not about motion; they illustrate deformations.
The model is generative and seeks to explain the images. 
 
The probability model at the bottom layer, i.e., conditional on the Delta hidden variables, or equivalently the deformed template B_m, is

   p(I_m | B_m) = q(I_m) prod_{i=1}^{n} [ exp{lambda_i h(r_{m,i})} / Z(lambda_i) ],

where r_{m,i} = |<I_m, B_{m,i}>|^2 is the response to the i-th deformed element, h() is the sigmoid transformation, q() is the background density pooled from natural images, and Z(lambda_i) is the normalizing constant.
How to learn a deformable template from a set of training images?
The answer is not much beyond matching pursuit: just run it simultaneously on all the training images.
The intuitive concepts of "edges" and "sketch" emerge from this process.
Shared sketch algorithm. A selected element (colored ellipsoid) is shared by all the 
training images. For each image, a perturbed version of the element seeks 
to sketch a local edge segment near the element by a local maximization operation. 
The elements of the active basis are selected sequentially according to the 
Kullback-Leibler divergence between the pooled distribution (colored solid curve) 
of filter responses and the background distribution (black dotted curve). 
The divergence can be simplified into a pursuit index, which is the sum of 
the transformed filter responses. The sum essentially counts the number of 
edge segments sketched by the perturbed versions of the element. 
 
 
The prototype of the algorithm is the shared matching pursuit algorithm:

(0) For m = 1, ..., M, initialize the residual image U_m <- I_m. Set i <- 1.
(1) For each candidate (x, alpha), compute sum_m max_{Delta x, Delta alpha} |<U_m, B_{x + Delta x, s, alpha + Delta alpha}>|^2.
(2) Select the (x_i, alpha_i) that maximizes the sum in (1).
(3) For each m, let (Delta x_{m,i}, Delta alpha_{m,i}) be the maximizing perturbation in (1), let B_{m,i} be the corresponding perturbed element, let c_{m,i} = <U_m, B_{m,i}>, and update U_m <- U_m - c_{m,i} B_{m,i}.
(4) Stop if i = n; otherwise let i <- i + 1 and go back to (1).
The generative learning is largely defined by Step (3), which is the explaining-away of matching pursuit, and which leads to non-maximum suppression performed separately on each individual image.
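A minimal MATLAB sketch of the element selection, working on precomputed SUM1/MAX1 maps (defined in the inference section below) rather than explicit residual images, with explaining-away approximated by non-maximum suppression as in the shared sketch algorithm. The helpers suppress and localMax, and all variable names, are illustrative, not those of the distributed code:

    % S1{m}{o}: SUM1 map of image m at orientation o (Gabor energy)
    % M1{m}{o}: MAX1 map, local max of S1 over perturbations
    % h: the sigmoid transformation; n: number of elements to select
    for i = 1:n
        best = -inf;
        for o = 1:numOrient
            pooled = zeros(size(M1{1}{o}));
            for m = 1:M
                pooled = pooled + h(M1{m}{o});  % sum of transformed local maxima over images
            end
            [v, idx] = max(pooled(:));          % best site at this orientation
            if v > best
                best = v; bestIdx = idx; bestO = o;
            end
        end
        [py, px] = ind2sub(size(M1{1}{bestO}), bestIdx);
        selected(i, :) = [px, py, bestO];       % record the selected element
        for m = 1:M
            % retrieve the arg-max perturbation in image m, then suppress
            % the responses overlapping the selected (perturbed) element
            S1{m} = suppress(S1{m}, px, py, bestO);  % hypothetical helper
            M1{m} = localMax(S1{m});                 % hypothetical helper
        end
    end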
After learning the deformable template, how to recognize it in a new image?
The answer is not much beyond Gabor filtering: just add a local max filtering, and then a linear shape filtering.
But this is to be followed by a top-down pass for sketching the detected object.

Sum-max maps. The SUM1 maps are obtained by convolving the input image with Gabor 
filters at all the locations and orientations. The ellipsoids in the SUM1 maps 
illustrate the local filtering or summation operation. The MAX1 maps are obtained 
by applying a local maximization operator to the SUM1 maps. The arrows in the 
MAX1 maps illustrate the perturbations over which the local maximization is taken. 
The SUM2 maps are computed by applying a local summation operator to the MAX1 maps, 
where the summation is over the elements of the active basis. This operation 
computes the log-likelihood of the deformed active basis, and can be interpreted
as a shape filter.
The bottom-up process computes the log-likelihood ratio score of the template centered at each candidate location X:

   SUM2(X) = sum_{i=1}^{n} [ lambda_i h(MAX1(X + x_i, alpha_i)) - log Z(lambda_i) ].
It can be decomposed into the following bottom-up operations:

   SUM1(x, alpha) = |<I, B_{x, s, alpha}>|^2,
   MAX1(x, alpha) = max_{Delta x, Delta alpha} SUM1(x + Delta x, alpha + Delta alpha),
   SUM2(X) = sum_{i=1}^{n} [ lambda_i h(MAX1(X + x_i, alpha_i)) - log Z(lambda_i) ],
which are followed by the following top-down operations:

   (Delta x_i, Delta alpha_i) = arg max_{Delta x, Delta alpha} SUM1(X + x_i + Delta x, alpha_i + Delta alpha),   i = 1, ..., n,

which retrieve the perturbations of the selected elements, so that the deformed template can be drawn on the detected object.
 
The above architecture is more representational than operational, more top-down than bottom-up. The bottom-up operations serve the purpose of top-down representation. The neurons in the MAX1 layer can be interpreted as representational OR-nodes of an AND-OR graph, where the max operation in the bottom-up pass serves the purpose of arg-max inference in the top-down pass. The neurons in the SUM2 layer represent templates or partial templates by selective connections to neurons in the MAX1 layer.
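To make the bottom-up pass concrete, here is a minimal MATLAB sketch of the SUM1 and MAX1 maps for a single image, assuming a grayscale input and a cell array gabor{o} of Gabor filters at numOrient orientations. Two simplifications relative to the distributed code: circshift wraps around at the borders, and the max is taken over a square neighborhood rather than along each element's normal direction:

    im = double(imread('image.jpg'));   % grayscale input image (illustrative file name)
    shift = 4;                          % locationShiftLimit, in pixels
    for o = 1:numOrient
        S1{o} = abs(filter2(gabor{o}, im)).^2;       % SUM1: local Gabor energy
    end
    for o = 1:numOrient
        M = S1{o};
        for dx = -shift:shift
            for dy = -shift:shift
                M = max(M, circshift(S1{o}, [dy, dx]));  % max over location shifts
            end
        end
        % a further max over neighboring orientations (Delta alpha) is analogous
        M1{o} = M;
    end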
 
How to model more articulated objects?
The answer is to further compose multiple active basis models that serve as part-templates.
The hierarchical model and the inference architecture are recursions of the themes in the active basis.

Sum-max maps. A SUM2 map is computed for each sub-template. For each SUM2 map, 
a MAX2 map is computed by applying a local maximization operator to the SUM2 map. 
Then a SUM3 map is computed by summing over the two MAX2 maps. The SUM3 map 
scores the template matching, where the template consists of two sub-templates 
that are allowed to locally shift their locations.
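In the same notation as above, a sketch of the recursion described in this caption (X_k denotes the nominal location of sub-template k relative to the whole template; the range of Delta X is a modeling choice):

   MAX2_k(X) = max_{Delta X} SUM2_k(X + Delta X),   k = 1, ..., K,
   SUM3(X) = sum_{k=1}^{K} MAX2_k(X + X_k).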

The hierarchical 
active basis model shares the same form as the original active basis model. 
 Shape Script Model 
 Composing simple geometric shapes (line segments, curve segments, 
parallel bars, angles, corners, ellipsoids) represented by active basis models.  
 Are they related to V2 cells or beyond?  

Sum-max maps. The sub-templates can be simple geometric shapes describable by simple geometric attributes, such as line segments, curve segments, corners, angles, parallels and ellipsoids, which are allowed to change their attributes, such as locations, scales and orientations, as well as aspect ratios, curvatures, angles, etc.
Where do texture statistics come from?
The answer is that they are pooled as the null hypothesis for testing the sketch hypothesis.
The null hypothesis itself also provides a strong description of the image intensity pattern.
Nonadaptive background (or negative) images. The residual image of the active basis
model is assumed to share the same marginal statistics as these images. 
Both images are 1024x768. The left image is a rural scene (higher entropy and lower kurtosis), 
and the right image is an urban scene (lower entropy and higher kurtosis). 
Left: Marginal density q() of responses pooled from the above two natural images.  
It is roughly invariant over scales of Gabor wavelets. It has a very long tail
(the small bump at the end is caused by sigmoid saturation), reflecting the 
fact that there are edges in natural images. The densities of the active basis 
coefficients are exponential tilting of q().  
Middle: Link function between natural parameter and mean parameter. This is used
to infer the natural parameters from the observed mean responses at selected 
locations and orientations. 
Right: Log of normalizing constant versus natural parameter. This is used for 
calculating log-likelihood ratio score for template matching. 
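The exponential tilting mentioned in the caption has the standard exponential family form; in the notation used above (lambda is the natural parameter):

   p_lambda(r) = exp{lambda h(r)} q(r) / Z(lambda),   Z(lambda) = E_q[ exp{lambda h(r)} ],
   mu(lambda) = E_lambda[ h(r) ] = d log Z(lambda) / d lambda,

where mu() is the link function between the natural parameter and the mean parameter: lambda_i is estimated by solving mu(lambda_i) = the observed average of h(r) at the selected location and orientation.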
 
Adaptive background, where the marginal density q() is pooled from the testing images directly.
For each image, at each orientation, a marginal histogram q of the filter responses is pooled over all the pixels of the image. Such an adaptive q forms a natural couple with p, so that their ratios can be used to score template matching. Such marginal histograms lead to the Markov random field model of texture. The concept of stochastic texture arises as a summary of failed sketching, or as the strongest null hypothesis against the hypothesis of sketchability.
The explicit form of the Markov random field texture model (the FRAME model) is

   p(I; Lambda) = (1 / Z(Lambda)) exp{ sum_{alpha} sum_{x} lambda_alpha( r(x, alpha) ) },

where r(x, alpha) is the filter response at pixel x and orientation alpha, each lambda_alpha() is a one-dimensional potential function determined by the pooled marginal histogram, and Z(Lambda) is the normalizing constant.
Research reported on this page has been supported by 
NSF DMS 0707055, 1007889 and NSF IIS 0713652. 
Back to active basis homepage