The following expands on the Discussion sections of the two technical reports.
- The difference between generative and discriminative methods is mainly about hidden variables
and their explaining-away competition in inference and learning. In generative learning,
these hidden variables compete to explain away
each individual image, instead of the class label of the image. The image is more informative
for inferring the hidden variables than the class label is, especially when the number of hidden
variables is large: the class label y is only binary, whereas the image x contains much more information.
Max-margin learning does go far beyond the class labels or p(y|x) to capture the images, but it may still not
be as informative as the generative likelihood p(x|y).
- The form of the generative model is p(hidden variables|A) p(observed data|hidden variables, B),
where A and B are parameters. In the active basis model, we assume B is already learned as
the dictionary of basis elements, and we focus on learning A, which is supposed to be the dictionary
or codebook of templates or part-templates.
- Some preliminary experiments in this paper suggest that it is possible to learn active basis
templates or partial templates from non-aligned images. The goal is to learn a dictionary or codebook
of partial templates in an unsupervised manner. The generative model is more natural for this task.
It is essentially a recursion of Olshausen-Field learning, which iterates between an inference/encoding
step and a codebook re-learning step.
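For concreteness, the two-step recursion can be sketched in toy NumPy form, with a simple matching-pursuit encoder standing in for the inference/encoding step and a least-squares update for the codebook re-learning step. All names, sizes, and parameters here are illustrative, not the actual implementation:

```python
import numpy as np

def encode(images, dictionary, k):
    # Inference/encoding step: sparse-code each image with at most k
    # elements via matching pursuit (a stand-in for the actual inference).
    codes = np.zeros((images.shape[0], dictionary.shape[0]))
    for m, img in enumerate(images):
        residual = img.copy()
        for _ in range(k):
            resp = dictionary @ residual          # filter responses
            j = int(np.argmax(np.abs(resp)))      # winner takes all
            codes[m, j] += resp[j]
            residual = residual - resp[j] * dictionary[j]  # explain away
    return codes

def relearn(images, codes):
    # Codebook re-learning step: least-squares update of the dictionary
    # given the current codes, then renormalize each element.
    D, *_ = np.linalg.lstsq(codes, images, rcond=None)
    return D / np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1e-12)

rng = np.random.default_rng(0)
images = rng.normal(size=(50, 16))
dictionary = rng.normal(size=(8, 16))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
for _ in range(5):                 # the Olshausen-Field-style recursion
    codes = encode(images, dictionary, k=3)
    dictionary = relearn(images, codes)
codes = encode(images, dictionary, k=3)
```

Each iteration alternates between explaining each image with a few elements and re-fitting the codebook to the resulting codes.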
- Hidden variables seek to explain each individual training image. In the active basis model,
the perturbations of the basis elements and the coefficients of the basis elements are
both hidden variables. We can continue to add hidden variables such as unknown
overall locations, scales, view points, and categories of the templates or partial templates,
in order to make the learning more unsupervised. We can also add mid-scale hidden variables to
account for larger deformations.
- In active basis model and its extensions, the hidden variables are dense and layered. They are
dense in the sense that each basis element is associated with a perturbation variable. They are
layered because the perturbation variables can be defined at the levels of elements, part-templates,
and whole template. Learning in the presence of dense and layered hidden variables may have to be
generative for estimation efficiency.
- In the active basis model, the selected and perturbed (arg-max) basis elements inhibit nearby basis elements
in each individual training image, because the selected and perturbed element explains away
part of the image, so that the nearby basis elements cannot further explain away the residual image.
This step is missing in discriminative learning. This image specific arg-max inhibition may define
generative learning in this special context. This is a matching pursuit effect.
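A tiny numerical illustration of this arg-max inhibition, with hand-picked unit vectors in a hypothetical 4-d image space:

```python
import numpy as np

# Two highly correlated basis elements and one independent element.
b1 = np.array([1.0, 0.0, 0.0, 0.0])
b2 = np.array([0.9, np.sqrt(1 - 0.81), 0.0, 0.0])  # overlaps b1 heavily
b3 = np.array([0.0, 0.0, 1.0, 0.0])                # independent of b1

image = 2.0 * b1 + 1.0 * b3

# Select the arg-max element (b1 here) and subtract its contribution.
c1 = image @ b1
residual = image - c1 * b1

# b2 is inhibited: its response collapses on the residual, because b1
# already explained away the part of the image they share.
print(image @ b2, residual @ b2)   # 1.8 vs. 0.0
# b3 is untouched: it can still explain the residual.
print(image @ b3, residual @ b3)   # 1.0 vs. 1.0
```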
- The residual images neutralize selected basis elements and those basis elements that are highly
correlated with them, because they have zero or small responses on the residual images. So if we want to
continue to explain the residual images, we will avoid the selected elements and their neighbors.
- This neutralizing data set does not involve reweighting the original data set as in AdaBoost, or resampling
(filtering) the original data set as in the earlier boosting scheme, so there is no loss of effective training examples.
This neutralizing data set viewpoint connects generative learning and discriminative learning.
- In the generative approach, we remove the arg-max element from each example in
creating the neutralizing example, so that the arg-max element inhibits overlapping elements. This is
sharper than discriminative reweighting.
- The above learning process essentially removes the objects from the positive images, so that the residual
images become natural images. In that sense, generative learning of objects amounts to removal of the
objects from the training images. We can introduce further image-specific hidden variables for sharper
removal of the object in each individual positive image, so the image-specific hidden variables are naturally
needed in the generative model.
- Another way to think about it from the classification point of view is that here the features are the coefficients
for reconstructing the images, instead of the filter responses. The filter responses can be highly dependent both
linearly and statistically. The coefficients may be much less so. The coefficients can be obtained from the filter
responses by some sparsification, which does two things. One is thresholding noisy responses to zero. The other is like
non-maximum suppression or winner-takes-all.
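The two operations can be sketched as follows, in a hypothetical 1-d version where `threshold` and `window` are illustrative parameters:

```python
import numpy as np

def sparsify(responses, threshold, window):
    # (1) threshold noisy responses to zero
    coeffs = np.where(np.abs(responses) >= threshold, responses, 0.0)
    # (2) winner-takes-all within each local window
    #     (a crude 1-d non-maximum suppression)
    for start in range(0, len(coeffs), window):
        block = coeffs[start:start + window]   # view into coeffs
        if np.any(block):
            keep = int(np.argmax(np.abs(block)))
            winner = block[keep]
            block[:] = 0.0
            block[keep] = winner
    return coeffs

responses = np.array([0.1, 0.9, 0.8, 0.05, 0.0, 0.6, 0.7, 0.2])
coeffs = sparsify(responses, threshold=0.5, window=4)
# only one above-threshold winner survives per window:
# [0.  0.9 0.  0.  0.  0.  0.7 0. ]
```

The surviving coefficients are much less dependent than the raw responses, since correlated neighbors have been suppressed.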
- The above is an expansion of the statement at the end of Section 4.4: "In the pursuit of basis elements,
the generative learning does not re-weight positive examples, and the inhibition between basis elements is carried out
through residual images. In the discriminative learning, the inhibition is done by re-weighting the
training examples."
- The above discussion hinges on an additive model with an explicit residual image, and matching
pursuit is the simplest concrete embodiment of explaining away. As we show in the appendix of our ICCV paper,
it is a special case of a probability model p(I_m | B_m), where B_m is the deformed template that sketches
image I_m, and B_m depends on the hidden variables of
perturbations. The form of p(I_m | B_m) is based on exponential tilting or maximum entropy.
When we select the next basis element, we want to continue to improve the current p(I_m | B_m) by
adding one more stroke to the template, so that the new p(I_m | B_m) with the new template B_m gives a better
explanation to I_m. We want to select the new stroke so that the overall improvement over all training images
is maximized.
- From discriminative point of view, the current p(I_m | B_m) (or its Monte Carlo samples) is the current
negative training set. We want to find the new stroke to distinguish the positive examples from these
negative training sets. However, unlike in AdaBoost, where all the positive examples share the same re-weighted
negative training set, in generative learning, each image I_m has its own negative training set p(I_m | B_m),
because different images I_m have different deformed templates B_m caused by the hidden variables of perturbations. Such image
specific negative training set pushes for sharper selection of the next stroke.
- The current p(I_m | B_m) is said to have explained away I_m, and we want to find a new stroke to continue to
further explain away I_m by a new p(I_m | B_m) with a new template B_m with the new stroke added. So the
generative learning sketches and paints each I_m by an image specific p(I_m | B_m). This is consistent with
the above residual-based understanding of removing the object from each I_m. To be able to remove the object,
you have to be able to generate the object.
- Although hidden variables can be maxed out or integrated out and reduced to invariant
features such as local max or histograms, so that we do not have to deal with image-specific hidden
variables, invariant features in the discriminative approach do
not account for the top-down pass and the above arg-max explaining-away inhibition. In fact, the local max
operation itself needs justification in the discriminative approach.
- Moreover, the invariance in discriminative learning is at best approximate and would better be
called blurring. The reason is that
the invariance is supposed to be relative to the variations of the original form or shape. But the original
shape is unknown and there can be many original shapes, so it is impossible to be completely invariant. Invariance
is possible only if we explicitly model the original shape and its deformations; in particular, the
original shape is the same over all the examples, and the deformations are represented by hidden variables
that are specific to individual examples.
- The generative likelihood can be considered the full likelihood, whereas the discriminative likelihood
without regularization can be considered
partial likelihood. The generative likelihood focuses
on the internal structures of the positive training set instead of the boundary between positive
and negative training sets.
- More specifically, the posterior class probability is a non-linear function
of the likelihood ratio p/q, and the class probability saturates at 1 for typical positive
examples that are well within the classification boundary and that have large likelihood ratios.
That is, if we increase the likelihood ratio by a large amount, then we only increase the
class probability by a small amount. So the class probability may not
be as strong as the likelihood ratio for modeling typical positive examples inside the
classification boundary. The class probability
is more sensitive for marginal examples close to the boundary.
- Conditional on the hidden variables, the active basis model follows an exponential
family model relative to the probability density of natural images. The lambda parameters
(or weights) are estimated based on an orthogonality-independence assumption on the background/residual
model q. This q may be adequate for residual images of the positive training images, but
is far from capturing patterns in negative images or natural images. If we let q be the
distribution of negative or natural images, then we can correct for
this assumption by fine-tuning or relaxing the lambda's given the selected basis elements, either by
fitting the full generative exponential model or by fitting the discriminative logistic regression
classifier, which is the conditional distribution of the full exponential model. In both cases, the
log-likelihoods are concave, but the gradients are different. The generative gradient is in the
form of the difference between observed average and the expectation of the sufficient statistics,
thus the generative model targets the "average" or the "center". The computation of the
expected sufficient statistics involves reweighting the natural image patches, but the average
sufficient statistics stay the same, without reweighting the positive images.
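The generative gradient described above can be sketched as follows. This is a toy version in which `background_feats` plays the role of natural-image statistics and the exponential reweighting approximates the model expectation by importance sampling; all names and sizes are illustrative:

```python
import numpy as np

def generative_gradient(lambdas, positive_feats, background_feats):
    # observed average of the sufficient statistics
    # (no reweighting of the positive examples) ...
    observed = positive_feats.mean(axis=0)
    # ... minus the model expectation, computed by reweighting the
    # background (natural-image) feature vectors with exp(lambda . h)
    w = np.exp(background_feats @ lambdas)
    w /= w.sum()
    expected = w @ background_feats
    return observed - expected

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=(100, 4))    # positives: shifted mean
bg = rng.normal(0.0, 1.0, size=(1000, 4))    # background statistics
grad = generative_gradient(np.zeros(4), pos, bg)
# at lambda = 0 the gradient points toward the positives' mean shift
```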
- Discriminative adjustment of the lambda's by logistic regression tends to overfit the data, so
regularization is necessary. The L2 regularization is preferable because it encourages smoothness
of the lambda's of the selected basis elements. The L2-regularized logistic regression is similar to
SVM; the logistic loss is smoother than the hinge loss. The logistic regression is a consequence of the
relative exponential family model. Usually, some negative images may match some portions of
the template but completely miss other portions. So we should down-weight the lambda's of those
portions that are frequently matched by negative images or false positives. We may also learn some negative
templates from false positives using a mixture model, for better modeling of q.
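The discriminative fine-tuning can be sketched as a from-scratch L2-regularized logistic regression on hypothetical responses of the selected basis elements; `c` is the ridge strength and all data here are synthetic:

```python
import numpy as np

def l2_logistic_grad(lambdas, feats, labels, c):
    # gradient of the L2-regularized logistic log-loss
    p = 1.0 / (1.0 + np.exp(-(feats @ lambdas)))   # P(y = 1 | x)
    return feats.T @ (p - labels) / len(labels) + c * lambdas

def fit(feats, labels, c=0.1, lr=0.5, steps=500):
    # plain gradient descent; the regularized loss is convex,
    # so this converges to the unique optimum
    lambdas = np.zeros(feats.shape[1])
    for _ in range(steps):
        lambdas -= lr * l2_logistic_grad(lambdas, feats, labels, c)
    return lambdas

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
y = (x @ np.array([2.0, -1.0, 0.0]) > 0).astype(float)
lam = fit(x, y)   # recovers the sign pattern, shrunk by the ridge term
```

Since the selection step has already reduced the dimensionality, this adjustment is cheap.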
- Although logistic regression is a partial likelihood and therefore less efficient than the full
generative likelihood, regularization changes the story considerably. The regularization term can
be understood as reducing variance or as corresponding to a prior distribution. But maybe it is
better seen as approaching the generative likelihood. Take SVM for example. Suppose the data are within a
ball. Then the larger the margin, the smaller the volume occupied by positive and negative examples.
So a uniform distribution on this margin-excluded ball would have high generative likelihood. So in
this sense, SVM is doing something close to generative model, although the uniform distribution over
the margin-voided ball is rather crude as a generative model.
- The general idea is to use the generative model to select basis elements and infer hidden variables,
as well as to estimate the lambda parameters based on the conditional independence assumption. This leads to
a stand-alone generative model whose likelihood can be used for detection and classification. After
generative learning, given the selected basis elements and the inferred hidden variables, we can
re-estimate lambda parameters by logistic regression. We believe the generative model is more efficient
in learning with hidden variables. Moreover, the selection of the basis elements hugely reduces the
dimensionality so that discriminative adjustment can be done very efficiently.
- Take clustering for example. One can do clustering using discriminative method. But in general,
classification focuses on marginal examples (either relative to negatives or relative to other clusters).
Such marginal examples may not reveal a clear view of clustering. Clustering is clearer from a global view, as
a matter of polarizing away from the center. The discriminative approach may fail to generate a strong enough
polarizing force in the initial stage of the clustering algorithm, where all clusters are similar.
In an EM mixture model, those examples that are clustered well receive higher weights in the E-step. But in
classification, those examples that are not classified well receive higher weights. There seems to be
a conflict of interest.
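A one-dimensional toy example of the opposite weighting schemes (the numbers are purely illustrative):

```python
import numpy as np

# Two 1-d Gaussian clusters with known means; compute the E-step
# responsibilities of a mixture model (equal priors, unit variance).
x = np.array([-2.0, -1.8, 0.0, 1.9, 2.1])
means = np.array([-2.0, 2.0])

log_lik = -0.5 * (x[:, None] - means[None, :]) ** 2
resp = np.exp(log_lik)
resp /= resp.sum(axis=1, keepdims=True)    # E-step responsibilities

# EM effectively trusts the well-clustered examples: x near -2 or 2
# gets responsibilities near 0/1, while the ambiguous x = 0 gets
# 0.5/0.5.  Boosting-style reweighting would do the opposite and
# upweight the ambiguous example near the boundary.
confidence = resp.max(axis=1)
```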
- Clustering is about discrete hidden variables. Even for continuous hidden variables, such as
unknown locations and orientations, the focus on those marginal examples may not be helpful for us to
infer such hidden variables for overall alignment among all the examples.
- As to the explaining-away mechanism in learning, one closed-form mechanism is the matching pursuit
mechanism in the linear additive model, as we discussed above. Another
related mechanism is basis pursuit or lasso, which is based on the trick of replacing the L0 norm by the
L1 norm. Just as we crafted a shared matching pursuit algorithm,
if we want to use basis pursuit or lasso, we also need to do it in a way that the selected elements
are shared by all the training images. This can be enforced by tying the coefficients across images,
and defining an L1 sparsity penalty shared by all the images. We did not use lasso because matching pursuit works well
and is very fast. We were also concerned
whether L1 is strong enough for sparsity. Moreover, the LARS solution to lasso is also a step-wise
procedure similar to matching pursuit. Since we do not want to use heavily overlapping Gabors to
sketch the images anyway, and the Gabors are small relative to the template, we are not concerned
with the greediness of matching pursuit.
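For reference, the shared selection can be sketched as follows: at each step, the element maximizing the summed (locally max-pooled) response over all training images is selected, and each image then performs its own arg-max inhibition on its residual. This is a 1-d toy in which perturbation is modeled as an index shift among dictionary rows, not the actual algorithm:

```python
import numpy as np

def shared_pursuit(images, dictionary, k, shift=1):
    residuals = [img.astype(float).copy() for img in images]
    selected = []
    n = dictionary.shape[0]
    for _ in range(k):
        # selection is tied across ALL images: each image contributes
        # a local max over perturbations (index shifts) of element j
        scores = np.zeros(n)
        for r in residuals:
            resp = np.abs(dictionary @ r)
            for j in range(n):
                lo, hi = max(0, j - shift), min(n, j + shift + 1)
                scores[j] += resp[lo:hi].max()
        j = int(np.argmax(scores))
        selected.append(j)
        # image-specific arg-max inhibition on each residual
        for r in residuals:
            lo, hi = max(0, j - shift), min(n, j + shift + 1)
            resp = dictionary[lo:hi] @ r
            jm = lo + int(np.argmax(np.abs(resp)))
            r -= (dictionary[jm] @ r) * dictionary[jm]
    return selected

images = [np.array([0.0, 0.0, 3.0, 0.0, 0.0]),
          np.array([0.0, 0.0, 2.0, 0.0, 0.5])]
selected = shared_pursuit(images, np.eye(5), k=1)
# the shared element lands within one shift of position 2
```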
- An important point is that we cannot do matching pursuit or lasso-type sparsification first on
individual images, and then learn the next layer from the sparsified coefficients. That would be an early decision.
Even if we obtain the optimal solution to the sparse coding problem for each individual image, this optimal
solution does not account for uncertainty in the selected basis elements and their coefficients, and
it may not provide an optimal fit to the model in the layer above, whatever that model may be.
This is a key obstacle for learning
the next layer beyond Olshausen-Field model. But on the other hand, we cannot give up explaining-away
sparsification, because the model learned from the unsparsified coefficients does not give sparse
explanation of the image data, thus violating the principle of Olshausen-Field.
- If one thinks in terms of edge detection, the above paragraph essentially says that you cannot
do edge detection before learning. Such edge detection is an early commitment. No matter how "optimal"
it may be, it may not be a good fit to the patterns you are looking for. That is, when you are looking
for patterns of edges, you simply cannot do edge detection first and then find the patterns. You have to
find the patterns together with edge detection. The two cannot be separated. This is the same with
both learning and inference (detection and recognition). In recognition, you cannot detect edges
before you detect the object. Instead, you detect the edges together with, or even after, you detect the
object. The important message is that early decision should always be avoided. To put it another way,
you need the edges to see the pattern, but you also need the pattern to see the edges. Such a
chicken-and-egg problem can only be solved by joint maximization of an objective function such as the likelihood.
- If one thinks in more general terms of compositionality, the tricky thing is that you cannot find
the compositional patterns of the constituent elements without identifying the constituent elements,
and yet identifying constituent elements before seeing the compositional patterns is an early decision. So
you have to identify the constituent elements and learn the pattern at the same time, or by looping
through bottom-up and top-down passes. Building up the hierarchy by a bottom-up procedure does not work.
- Our solution is to simplify the learning problem first, starting from roughly aligned images, and
later extend the learning to nonaligned images by introducing more hidden variables. For roughly aligned images,
we can perform shared sparsification. That is, we do sparsification together with learning the next layer
beyond Olshausen-Field. In other words, the next layer of learning is to memorize the commonly shared
sparsifications. So each neuron in the next layer memorizes a commonly shared sparsification.
- Explaining-away can be made very light in inference, if it is already memorized in learning,
via selection, or equivalently trimming, of highly redundant elements. In the active basis model,
the selected basis elements have little overlap. So in inference, these elements generally
do not interfere with each other in the local max perturbation.
- The sparsification of the Olshausen-Field model turns the layer of coefficients into a point process,
no longer another layer of images or dense maps. Thus the model on the coefficients should be drastically different from
the model on images. It cannot be a simple recursion. For the sparse point process, the best one
can do may be to memorize commonly occurring patterns.
- We have been thinking about the layer on top of Olshausen-Field model for a long time. The form of this
layer in return provides a justification why we need Olshausen-Field model in the first place. The sparsity
principle, as appealing as it is, does not tell us exactly what the purpose of sparse representation is. We
somehow feel that the active basis model provides one justification: we want to perturb the sparse coding
elements in order to generalize to similar shapes. We have been feeling that going beyond Olshausen-Field
is the key to understanding vision, and the layer above Olshausen-Field must be fundamentally different from
the layer of Olshausen-Field.
- Generalizability can (or maybe should) be defined extrapolatively, where the testing images come from
different distributions than the training images. Such extrapolation can be achieved by
changing the distribution of the hidden variables. In the shape script model, extrapolation can be
achieved by exaggerating the geometric parameters of the elementary shapes, or by recombining
different elementary shapes. Extrapolating from a regular bike to a tandem bike is
the simplest example of extrapolative generalization.
- Extrapolative generalizability is a criterion for the usefulness of the knowledge learned from
data. The learned knowledge should have the ability to deal with "surprises" in order for an intelligent
system to survive the real world. In fact, an intelligent system should be curious and adventurous in seeking out surprises, in order to learn rapidly. A surprise lies in between a positive testing example
and a negative testing example. It is a positive example because it can be explained by existing knowledge.
It is a negative example because it cannot be fully explained by existing knowledge without some
extrapolation. Also, a surprise lies in between a testing example and a training example. It is a testing
example because it demands explanation by existing knowledge. It is a training example because it offers
an opportunity to re-learn or adapt the existing model.
- In order to explain surprise by existing knowledge, we need to explain in
what aspects the surprise is similar to past examples, and more importantly,
in what aspects the surprise differs from past examples. These aspects are
hidden variables, which can be dense and layered as discussed above.
- The Olshausen-Field model learns Gabor-like wavelets by enforcing sparsity.
These wavelet elements turn out to be localized in both spatial and frequency domains,
and are similar to each other. They are indexed by attributes
such as locations, orientations, scales, and phases. The similarity between the elements is
not assumed in learning, and the attributes are not learned.
The concept of location already exists in the labeling of pixels, but not orientation and scale.
Labeling the learned wavelet elements by locations, orientations, and scales amounts to
a re-representation, so that the layer above it is built upon these attributes, in
addition to the wavelet coefficients.
- These attributes serve as hidden variables of the active basis model. The
local deformations of the template and the global geometrical transformations such as
translation, dilation, and rotation of the template are all based on linear
additive manipulations of these attributes. Such linear additive manipulations
trace out the highly non-linear manifolds in the image space, and most natural
images live on such manifolds. The original image manifolds are non-Euclidean.
But the distributions of these attributes span flat Euclidean manifolds,
so that linear interpolation becomes meaningful.
- If we do not want to index the geometrical attributes of the basis elements explicitly
or embed them into a geometric space, then at least we should organize them in a topological
neighborhood system, so that perturbations of basis elements can be represented as changing
the basis elements to their respective neighbors. Such a
neighborhood system makes the ensemble of natural images more connected, so that one can easily flow from
one natural image to another. If we define the neighborhood system based on the Euclidean distance
of the image intensities, then this neighborhood system is less connected.
- What is more fundamental than the attributes themselves is the underlying group of
transformations that transform one wavelet element to another, under the condition that these
elements are similar to each other. The attributes are only a result
of representing such a group of transformations. In other words, the transformations are the
verbs, and the attributes are the nouns. The verb aspect is more important than the noun
aspect. In the active basis model,
transformations of limited ranges are applied to the elements first before
they are linearly combined to generate the image.
In general, such transformations correspond to the physical deformations and motions.
The introduction of such transformations that connect the wavelet elements leads to further
sparse coding, which is equivalent to inferring the underlying average shape that is subject
to deformation or the speed of the underlying motion.
- In the Olshausen-Field model, the small number of basis elements that are turned
on provide a sparse coding of the observed image. Then we need to continue to code
these on-elements by an even smaller number of templates. The above-mentioned attributes
enable further sparse coding of the on-elements, with additive residuals, and these
additive residuals are the perturbations of the basis elements in the active basis model.
- The perturbations of the basis elements are the most important non-linearities
in the active basis model in the top-down generative direction. This leads to the
non-linear local max operation in the bottom-up detection direction. Other non-linearities,
which are less important but still very necessary, include the sigmoid saturation
of the filter responses (justified by a ratio of mixture distributions, not by a logistic
perceptron), as well as the local normalization of the filter responses (which amounts
to local linear whitening) before sigmoid saturation.
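These two bottom-up non-linearities can be sketched generically; the exact functional forms in the model may differ, and this only shows the shape of the computation:

```python
import numpy as np

def normalize_and_saturate(responses, window=3, sat=6.0):
    energy = responses ** 2
    # local normalization: divide each squared response by the average
    # energy in a window around it (a crude local whitening)
    kernel = np.ones(window) / window
    local_avg = np.convolve(energy, kernel, mode="same")
    normalized = energy / np.maximum(local_avg, 1e-12)
    # sigmoid-style saturation: monotone, bounded above by `sat`
    return sat * np.tanh(normalized / sat)

r = np.array([0.1, 5.0, 0.2, 0.1, 0.0])
out = normalize_and_saturate(r)
# the strong response stays dominant but is bounded; weak responses
# near a strong one are suppressed by the normalization
```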
- The shape script model further composes elementary geometric shapes such as line segments,
curve segments, parallel bars, corners, junctions, ellipsoids, where each elementary shape is modeled by an active basis
template. This is a second-layer sparse coding. These elementary shapes are indexed by their
overall locations, orientations, scales, lengths, widths, curvatures, angles, and aspect ratios. Some of
these attributes are 3D-aware in the sense that they change with the underlying 3D change of view point, and some can be
3D-informative in the sense that they can be used to infer the underlying 3D structure. These attributes
lead to further sparse coding, which is equivalent to inferring the underlying 3D information.
- In general, endowing the sparse coding elements with attributes, or organizing the sparse
coding elements by transformation groups so that one can relate one element to another, enables us to
generalize by varying these attributes. It leads to
further sparse coding by a generative model at a higher layer, and such further sparse coding is
equivalent to inferring the underlying physical properties or hidden variables of the higher layer
model that generates the sparse coding elements. The necessity of the higher layer model can be
tested against the random coincidence of the sparse coding elements.
Such transformation groups enable extrapolative generalization by
moving further along the orbits of the transformation groups.
- The learning of the sparse coding elements is only half the story. The learning of the attributes or
the transformation group is equally important. It is not clear if it should be learned or simply be designed.
But if it is designed, then the learning of the sparse coding elements may be much simplified.
- Because the next layer of model is built on the attributes of the sparse coding elements and
we need to be able to vary the attributes in order to generalize, the sparse coding elements must be aggressively selected,
so that each element has enough room to maneuver. That is, we cannot simultaneously include
elements whose attributes are very similar to each other, because otherwise it would become meaningless to
talk about varying these elements.
- Neurally, the aggressive selection can be accomplished by explaining-away lateral inhibition.
Lateral inhibition can either be carried out in short term activities of neurons in inference, or be stored in
long term memory via the trimming or selection of neuron connectivities in learning. It is
better to memorize it in learning. Once explaining-away is memorized or hardwired in learning by
trimming or selection of connections, there is little explaining-away in inference.
The sparse coding also suggests neurally both localized and distributed representation, where each concept is
represented by a single node, but the concept is memorized by selective connections from this node to a small number of nodes
at the layer below. The sparse and selective connections result in symbolic representations of image intensities.
- Sparsity is important for (1) time and energy efficiency of the brain,
(2) sharp learning and inference, (3) signal to symbol transition, and (4)
extrapolative generalization by exaggerating
the values of the small number of hidden variables, and by recombining
the values of the hidden variables. Exaggeration requires continuous hidden variables whose
distributions span flat manifolds.
- In general terms, sparsity is about reductionism, and about understanding by parsimonious models. However,
here parsimony is different from that in physics, in the sense that it is flexibly parsimonious.
That is, the world can be complex, but each instance can still be explained by a low-dimensional representation,
and different instances are explained by different low-dimensional representations, under a common
representational scheme.
- When we use the words "symbolic" or "explicit", we really mean attributed sparse coding elements as well
as further coding of the attributes.
- The active basis model and its hierarchical generalizations such as shape script model
are special cases of AND-OR graphs proposed by Zhu and Mumford for vision. The local shifting at
each layer corresponds to the OR-nodes. The AND-OR graph is an intuitive and concrete form of
a generative model, where the OR-nodes are the hidden variables.
- An OR-node may involve discrete change of categories. It may also involve the continuous change
of geometric attributes such as locations, orientations, scales, and aspect ratios, though the
continuous change can be implemented by discretization.
- The OR-nodes lead to important non-linearities such as the local max operations
in the bottom-up pass of inference. They also lead to the top-down pass of inference, which
commits the values of the OR-nodes by retrieving the arg-max.
- The top-down pass is to impose a generative model on the image, or to parse the image, or
in our model, to sketch the image.
- The AND-OR graph leads to the SUM-MAX maps for inference, which turns out to be a simplified and
specialized version of the max-product algorithm for fitting graphical models (this was realized
after the publication of our recent papers).
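A 1-d sketch of the SUM-MAX computation; the element positions and shift range are illustrative:

```python
import numpy as np

def sum_max(response_map, elements, shift=1):
    # MAX: each selected element (an OR-node) takes the local max of
    # its responses over perturbations, and remembers the arg-max.
    # SUM: the template score (an AND-node) sums these local maxima.
    score, argmax = 0.0, []
    n = len(response_map)
    for pos in elements:
        lo, hi = max(0, pos - shift), min(n, pos + shift + 1)
        j = lo + int(np.argmax(response_map[lo:hi]))
        argmax.append(j)            # retrieved in the top-down pass
        score += response_map[j]
    return score, argmax

resp = np.array([0.1, 0.9, 0.2, 0.0, 0.7, 0.1])
score, where = sum_max(resp, elements=[0, 5])
# score = 0.9 + 0.7; the arg-max positions [1, 4] sketch the image
```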
- Sparsity should be aggressively enforced when learning an AND-OR graph, so that each AND-node
is only connected to a small number of OR-nodes, but each OR-node spans a big range.
- The AND-nodes are for representational efficiency by multiplicative composition. The OR-nodes
account for variability and generalizability.
- The AND-OR graph is about recursive compositional whole-part relationship. Each whole concept is
an AND-composition of a selected set of OR-parts. So the AND-OR graph is sparsely and selectively wired.
The wiring happens in learning.
- The AND-OR graph is a specialized layered belief network. Each layer consists of a sub-layer of
AND-nodes and a sub-layer of OR-nodes. Each layer is a dictionary of concepts, and each concept
is represented by a single AND-node. This may look like a localized representation, but it is also
a distributed representation at the same time, because each AND-node is connected to multiple OR-nodes,
and different AND-nodes may share overlapping OR-nodes. These OR-nodes are shared and changeable parts.
- The architecture of the whole AND-OR representation of the visual knowledge is an upside-down
triangle, where the dictionary becomes larger as we move up to higher layers. But in inference, the
small number of nodes that are turned on form a normal triangle, where more nodes are turned on at
the lower layers, and fewer nodes are turned on at the upper layers. These are the selected words
from the dictionary.
- The AND-OR graph may also be a model for the layered neural network. There are two types of
neurons, the AND-neurons and the OR-neurons. The AND-neurons perform sum pooling. The OR-neurons
perform max pooling as well as arg-max retrieval. The network is sparsely and selectively connected,
not densely connected.
- Generic local max or local average operations are robust to local distortions, but they are not strictly
invariant to them. There is no generic invariance to local distortions, because the local distortions can be
applied to different shapes, and the undistorted shapes or forms are unknown.
- For the local max to be strictly or maximally invariant to the local deformation, the local max
must be pooled specifically around the locations and orientations that are at the centers of the local shifts,
that is, these highly selected locations and orientations describe the original undistorted shape. The local
max is not invariant at other locations and orientations relative to the distortions of this shape.
- These locations and orientations are those of the selected basis elements of the active basis template,
and the local shifts are the hidden variables.
- The above-mentioned selected locations and orientations cannot be replaced by simple generic
subsampling of locations and orientations, because these locations and orientations form specific templates
of different object categories, poses, and parts.
- In general, each max or average pooling is performed over a group of units, and different
groups may highly overlap with each other by sharing common units. A concept is a combination of highly selected groups.
Only the pooling of the selected groups is invariant to the variations in this concept. Each such group corresponds to
an OR-node, and each combination of OR-nodes corresponds to an AND-node.
- Invariance is relative to the OR-nodes in an AND-OR graph model, or the hidden variables of a
generative model. To be strictly or maximally invariant, we must learn the generative model in learning
and infer the hidden variables in inference. The only way to be invariant to the hidden variables is to
chase them down in inference. This is what the local max pooling is trying to do.
- So the local max pooling should not only remember the local max but, more importantly, also
remember the arg-max, i.e., where the local maximum is achieved. The arg-max values are used to sketch or parse
the image.
- Local max is for inferring the actual perturbations, and it is sharper than local average, which does not
infer the perturbations. Actually, local max and local average form a pair of hypotheses,
local sketch vs. local texture, so that the local max is scored against the local average.
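As a toy numerical illustration of this pairing (our own construction, not a formula from the reports), the local max can be scored against the local average within a window of filter responses:

```python
import numpy as np

def sketch_vs_texture(responses):
    """Score the local-sketch hypothesis against the local-texture hypothesis:
    a large positive score means one element dominates (a sketch);
    a near-zero score means the responses are flat (texture)."""
    return float(np.max(responses) - np.mean(responses))

edge_like = np.array([0.1, 0.1, 2.0, 0.1])     # one dominant response: sketch
texture_like = np.array([0.6, 0.5, 0.6, 0.5])  # roughly uniform: texture

print(sketch_vs_texture(edge_like) > sketch_vs_texture(texture_like))  # True
```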
- The max operation is more than an invariance pooling. It is a computational step of the max-product
algorithm for fitting a graphical model to the image data in the inference stage. As we realized after
the publication of our recent papers, the sum-max maps are a simplified and specialized version of the
max-product algorithm, which is a belief propagation algorithm for inferring the posterior mode.
- The max operation is also important for learning the active basis model. It tells the learning
algorithm which max scores should be summed over: the selected locations and orientations around which
to pool the max scores define the learned template.
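The sum-max architecture can be sketched in one dimension. This is a drastic simplification of the 2-D maps over locations and orientations in the papers: `delta` is the allowed perturbation range, and all names below are our own.

```python
import numpy as np

def max1(sum1, delta):
    """MAX1 map: local max of the SUM1 (filter response) map over shifts
    within +/- delta, plus the arg-max shift, i.e., the inferred perturbation."""
    n = len(sum1)
    m1 = np.empty(n)
    shift = np.empty(n, dtype=int)
    for x in range(n):
        lo, hi = max(0, x - delta), min(n, x + delta + 1)
        j = lo + int(np.argmax(sum1[lo:hi]))
        m1[x], shift[x] = sum1[j], j - x
    return m1, shift

def sum2(m1, selected):
    """SUM2 score: sum of MAX1 scores at the selected elements' locations
    (the selected locations define the learned template)."""
    return float(sum(m1[x] for x in selected))

s1 = np.array([0.0, 1.0, 0.2, 0.0, 0.8, 0.1])
m1, shift = max1(s1, delta=1)
score = sum2(m1, selected=[0, 4])
# The stored shifts let the top-down pass back-trace the perturbed elements:
sketch = [int(x + shift[x]) for x in [0, 4]]
```

Note that the max pooling (bottom-up) and the arg-max retrieval (top-down) come from the same operation: only the positions in `selected`, chosen in learning, contribute to the template score.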
- The learned active basis templates may correspond to neurons in V2 or
beyond. A template or a part-template is memorized by the connections from
the V2 neuron to the complex V1 cells that correspond to the selected basis
elements. Each V2 cell is sparsely connected to a small number of
highly selected complex V1 cells. Different V2 cells are connected to different sets
of V1 cells.
- Each V1 complex cell is connected to a number of neighboring V1 simple cells. The V1 complex
cell pools the maximum of the responses of its V1 simple cells. More importantly, we
hypothesize that the V1 complex cell maintains its bond to the particular simple cell
that achieves the maximum. In other words, the V1 complex cell stores both the
local max and the arg-max.
- The bond between the V1 complex cell and the arg-max simple cell might be realized by
synchronization (Tai Sing Lee, personal communication).
- We hypothesize that when a V2 cell is activated by the bottom-up pass, and the activation is
above a threshold, it is then allowed to sustain its activation, and suppress nearby V2 cells.
Meanwhile it starts a top-down pass to sustain the activations of the complex V1
cells connected to it. These complex V1 cells in turn sustain the activations of their arg-max simple
cells. These arg-max simple cells then inhibit the non-maximal simple cells. So we should see
a reduction in neural activities after recognition, which means a sharpening of the interpretation
of the image.
- The selection step in learning active basis templates may correspond to the
aggressive trimming of redundant connections between V2 and V1 cells. The connection
weights are the lambda parameters estimated after the selection step.
- The visual cortex is more a representation device than an inference device,
or in other words, it is more a generative model than a classifier. The top-down
pass and the lateral inhibition are reflections of this.
- Our model stands on the shoulders of the Olshausen-Field model, which accounts for simple V1 cells, and lays the foundation
of the generative models for vision. It is natural to ask what the layer above the Olshausen-Field model is,
or what is in V2. The Olshausen-Field model learns a dictionary of basis elements or strokes. The layer above it should
learn a dictionary of templates or part-templates sketched by the strokes.
- In more general terms, after learning a model p(hidden 1) p(visible 1| hidden 1, parameter 1),
where parameter 1 is a dictionary of basis elements or connection weights from V1 simple cells to image pixels,
we want to continue the learning by a model p(hidden 2) p(hidden 1 | hidden 2, parameter 2)
p(visible 1 | hidden 1, parameter 1). In general, "hidden k" are the activities of neurons at layer k, and
"parameter k" are the connections and their weights from layer k to layer k-1. The difficulty here
is that given visible 1, it is hard to infer hidden 1 because the inference involves explaining-away
competition between the hidden 1 variables or the sparsification of the hidden 1 variables. Moreover,
the uncertainty in the posterior distribution of the hidden 1 is hard to represent, but we cannot
rely on a point estimate without accounting for the uncertainties of the posterior,
no matter how good this point estimate is. Even worse, the posterior distribution of hidden 1
can change if we change the initial simple prior p(hidden 1) to a more sophisticated prior model
p(hidden 2) p(hidden 1 | hidden 2, parameter 2).
- The first layer hidden 1 deals with the visible layer or environmental data
visible 1 directly. Ultimately, the second layer hidden 2 or higher layers are also
intended to explain the environmental data visible 1, instead of the activities of the first layer
neurons hidden 1. In other words, hidden 1 is not visible 2. In fact, viewed from hidden 2, hidden 1 is
reporting the redundant and ambiguous proposals for explaining visible 1. At this bottom-up reporting
stage, hidden 1 is not making committed decisions yet, as it really should not. Hidden 2 receives the
redundant and double-talking proposals from hidden 1, and then tells hidden 1 which proposals it will
accept. With this top-down guidance, hidden 1 can make decisions on how to explain visible 1.
- So there are two phases of hidden 1: The bottom-up phase and the top-down phase. The bottom-up
phase is proposal, and the top-down phase is verification. The top-down phase is much sparser than
the bottom-up phase. The neural activity in the top-down phase is greatly reduced but sustained much
longer.
- Because hidden 1 is not visible 2, the learning of parameter 2 may not be a simple recursion of
the learning of parameter 1.
- The learning of parameter 2 is intimately tied to the inference of hidden 1 given
parameter 1. The role of parameter 2 may be to memorize the commonly shared results of the inference of
hidden 1. So the learning algorithm for parameter 2 may be a shared version of the inference algorithm
of hidden 1 given parameter 1, just like the shared sketch algorithm.
- The sparsity in hidden 1 requires that there should be lateral inhibition or selection in inferring
hidden 1. Therefore, the learning algorithm of parameter 2, which is a shared version of the inference
algorithm of hidden 1, should also have a selection step, so that different hidden 2 nodes are
selectively and sparsely connected to different hidden 1 nodes. In other words, their connections memorize
the commonly shared results of the inference of hidden 1 given parameter 1.
This is a drastic change from the wiring
of the connections from hidden 1 to visible 1, which may be localized and convolutional, but which is otherwise
densely connected. The lateral inhibition between hidden 1 nodes leaves its imprint on the sparse connections
from hidden 2 to hidden 1.
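A toy 1-D version of such a shared, selective learning step might look as follows. This is a drastic simplification of the shared-sketch-style selection; the inhibition radius and the scores are illustrative choices of ours.

```python
import numpy as np

def shared_selection(max1_maps, n_select, inhibit=2):
    """Repeatedly pick the position whose local-max score, summed over all
    training images, is largest; then suppress nearby positions, hardwiring
    the explaining-away inhibition into the selected sparse connections."""
    pooled = np.sum(np.stack(max1_maps), axis=0).astype(float)
    selected = []
    for _ in range(n_select):
        x = int(np.argmax(pooled))
        selected.append(x)
        lo, hi = max(0, x - inhibit), min(len(pooled), x + inhibit + 1)
        pooled[lo:hi] = -np.inf   # lateral inhibition, memorized as sparsity
    return selected

# Local-max maps from two training images of the same category.
maps = [np.array([0.1, 0.9, 0.2, 0.1, 0.7, 0.0]),
        np.array([0.0, 0.8, 0.1, 0.2, 0.9, 0.1])]
print(shared_selection(maps, n_select=2))   # [1, 4]
```

The selection is shared across images, so the resulting sparse connections (here, positions 1 and 4) are the commonly shared inference results rather than per-image decisions.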
- In general, the high-level knowledge may be learned by the shared execution of mid-level tasks, or the
high-level knowledge is commonly shared mid-level processing results within specific object categories.
- We suspect that the growth of the brain, or learning, is really a matter of
trimming unwanted connections or re-wiring the connections. Redundant connections not only make inference difficult
but also bring in a lot of noise
that hurts generalizability.
- The selection in learning parameter 2 may cause a "signal
to symbol" transition, so that hidden 2 is more symbolic. As a result, the generative direction from
hidden 2 to hidden 1 may be occlusion-based or logical instead of simply being linear additive as in
the generative direction from hidden 1 to visible 1.
- Our attempt is to first simplify the task by learning
from images where the objects are from the same category and are roughly aligned in the training images.
Then we hope to make the learning less supervised by allowing multiple categories, multiple poses,
and unknown locations and scales. Some preliminary experiments have already been attempted along this direction.
In the whole scheme of things, the active basis model is intended for modeling the dictionary of part-templates,
which are to be further composed into the dictionary of templates at still higher layers.
- In active basis model, the hidden 1 variables are inferred together with the learning of parameter 2, the
templates, by the shared sketch algorithm, so that parameter 2 memorizes the inference of hidden 1. We are not
relying on a generic bottom-up operation for inferring the hidden 1 variables before the learning of parameter 2.
- In the learned templates, the selected basis elements have little overlap. This means that the explaining-away
lateral inhibition is memorized and hardwired by the trimming of redundant connections. Then in the inference
stage for template matching, the perturbed basis elements on each image do not have much overlap, so there is
little explaining-away inhibition in inference.
- The resulting active basis and its hierarchical generalizations thus are sparsely connected. In active
basis model, each template node is selectively connected to a small number of "V1 complex cells". Given the
template nodes, the selected "V1 complex nodes" become disentangled: they are linearly orthogonal and statistically
independent, approximately, so inference becomes easy. The approximate conditional orthogonal independence may
be the reason for the existence of the hidden nodes at each layer. This is to be distinguished from PCA,
which has a fixed set of orthogonal independent nodes. In conditional orthogonal independence, different
template nodes are connected to different sets of "V1 complex nodes", and these different sets may overlap.
- Here approximate orthogonality means that the Gabor elements either have little overlap in spatial domain,
or if they do overlap in spatial domain, they have little overlap in frequency domain, so we can superpose
multi-scale templates in the same place.
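This claim is easy to check numerically with toy 1-D Gabor elements; the parameter values below are our own illustrative choices, not values from the papers.

```python
import numpy as np

def gabor(x, center, freq, sigma=2.0):
    """Unit-norm 1-D Gabor element: Gaussian envelope times a cosine wave."""
    g = np.exp(-(x - center) ** 2 / (2 * sigma ** 2)) * np.cos(freq * (x - center))
    return g / np.linalg.norm(g)

x = np.arange(0.0, 40.0, 0.1)
g0 = gabor(x, center=10, freq=2.0)
g_far = gabor(x, center=30, freq=2.0)  # little overlap in the spatial domain
g_hi = gabor(x, center=10, freq=6.0)   # overlaps in space, not in frequency

print(abs(np.dot(g0, g_far)))  # near zero: envelopes barely overlap
print(abs(np.dot(g0, g_hi)))   # near zero: frequency bands barely overlap
```

Both inner products are close to zero, so elements of either kind can coexist in a template with approximately orthogonal, independent coefficients.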
- As we move up the hierarchy, we may see increasingly more types of nodes at each layer,
but each node is connected to fewer nodes at the layer below.
Each layer consists of two sub-layers: one sub-layer of AND-nodes, and one sub-layer
of OR-nodes.
- The top-down generative model involves random selections or perturbations of geometric attributes such as
locations and orientations. This leads to the local max in the bottom-up pass and a back-tracing step in the
top-down pass to find the arg-max.
- The sum operation is applied to local max scores of selected locations and orientations, which are the
locations and orientations of the selected basis elements. These locations and orientations are selected in
the learning algorithm. These are supposed
to be the centers of local shifts, so that the local max scores on these locations and orientations are
strictly or maximally invariant to such shifts. Other local max scores are ignored for the lack of invariance.
- The max operation is more than an invariance pooling, and the role of the "V1 complex cells" is more than
being invariance pooling units. It is to infer the perturbations of the basis elements, which are
hidden variables. The max pooling is only half of the story and is a computation step in the max-product
algorithm for inference in a graphical model. The retrieval of the arg-max, i.e., the perturbations,
in the top-down pass is more fundamental, because it is fitting the generative model to the image data.
- In a hypothetical neural implementation, the "V1 complex cell" maintains its dynamic bond with the
"V1 simple cell" whose response achieves the local maximum among all the simple cells connected to this
complex cell. This bond, which is to record the arg-max and which might be maintained by synchronization,
is needed for back-tracing the actual perturbations.
- The max maps are not representations of the image data. They are highly redundant proposals for representing
the image data, and are a bottom-up step in the max-product inference algorithm in a graphical model.
The representation of the image is obtained after the top-down retrieval pass,
which sketches the object by arg-max using the selected and perturbed Gabor elements, and the arg-max
elements inhibit the other non-arg-max elements.
- The nodes are continuous (even though they may be implemented
with discretization), not binary or Bernoulli. The sigmoid saturation is justified by the ratio of two mixture distributions,
not by a logistic perceptron. Specifically, the log-likelihood ratio of the two mixture distributions
is linear in a saturated function of the filter responses. The local normalization of filter responses before saturation is justified by linear
whitening, to make the local spectrum flat over frequencies. The original local spectrum or the second-order statistics may not
contain much information about specific object shapes or surface texture properties. They may be caused by lighting and shading
variations or some physical properties that are common in the natural scenes. These are unwanted noises that are
irrelevant or even harmful to object recognition.
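A minimal sketch of these two preprocessing steps, assuming a 1-D response map; the saturation level `zeta = 6` and the window size are illustrative choices of ours, not prescribed values.

```python
import numpy as np

def local_normalize(energy, half_window=5, eps=1e-8):
    """Divide each squared filter response by the average energy in a local
    window: a crude linear whitening that flattens the local spectrum."""
    n = len(energy)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out[i] = energy[i] / (np.mean(energy[lo:hi]) + eps)
    return out

def saturate(r, zeta=6.0):
    """Sigmoid saturation: h(0) = 0, increasing, and h(r) -> zeta as r grows,
    mimicking the shape of a mixture log-likelihood ratio."""
    return zeta * (2.0 / (1.0 + np.exp(-2.0 * r / zeta)) - 1.0)

energy = np.array([1.0, 4.0, 1.0, 100.0, 1.0])
h = saturate(local_normalize(energy))  # large responses are capped below zeta
```

The normalization removes the overall local contrast, and the saturation bounds the contribution of any single response, so extreme responses caused by lighting or shading cannot dominate the template score.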
- The AND-OR graph seems to be both a localized representation and a distributed
representation. It is localized because each concept is represented by a single AND-node. It is
distributed because each AND-node is connected to a collection of OR-nodes.
Partial damage of the connections will not have a big effect in inference, and partial observations
of the OR-nodes may allow inference of the AND-node, which in turn allows the predictions of the
unobserved OR-nodes. A purely distributed representation or associative memory does not explicitly index
the local modes of the energy surface, and all these local
modes share the same connections. This may be too implicit.
- Each layer of AND-nodes forms a dictionary. In learning, the dictionary sizes are larger at the upper layers
than at the lower layers, but in inference, the numbers of activated nodes are smaller
at the upper layers than at the lower layers.
The following are some strategies:
- Learn exponential family model with bottom-up features in sufficient statistics.
- Assume independent prior for hidden nodes, and approximate their posterior
by independent distributions.
- Require the posterior of hidden nodes to be exactly independent
distribution, and let the prior be dictated by the posterior.
- Prohibit overlapping hidden nodes from connecting to the same higher node, so that
given the higher node, both the prior and posterior can be made independent.
We prefer the last strategy, which is to hardwire the explaining-away in learning.
Two issues with generative models that had puzzled us for a long time:
- Partition function. In our Markov random field texture model, the normalizing
constant (the Z-function or partition function) is not computable. So model fitting
has to rely on Markov chain Monte Carlo. While this has a nice "analysis by
synthesis" interpretation, it is just too slow (maybe not for today's GPUs). (from 1995 to 2000)
- Explaining away. In Olshausen-Field model, we need to represent each image
by a small number of wavelet elements, and they are selected from the
dictionary of all the wavelet elements, which compete with each other to explain
away each image. If we perform basis selection for each image, the
selection results cannot be easily used for further learning, because there is no
correspondence between the sparse representations of different images, and the
selection result in each image is an early decision that does not account for the
uncertainty in selection or the posterior distribution of the selection. The dilemma is that
you have to do sparsification or explaining away, but you cannot do it before
learning the next layer. (from 2000 to 2006)
These two obstacles have been circumvented in our current work as we explained above.
- Sparsity of Olshausen-Field.
- Compositionality of S. Geman; And-or graph of Zhu-Mumford.
- Invariance of Riesenhuber-Poggio.
Form of representation:
- B. A. Olshausen and D. J. Field, Emergence of simple-cell receptive field properties
by learning a sparse code for natural images, Nature, 381, 607-609, 1996.
- S. C. Zhu, C. E. Guo, Y. Z. Wang, and Z. J. Xu, What are
textons? International Journal of Computer Vision, 62,
121-143, 2005.
- S. C. Zhu and D. B. Mumford, A stochastic grammar of images, Foundations and
Trends in Computer Graphics and Vision, 2, 259-362, 2006.
Scheme of learning:
- S. Mallat and Z. Zhang, Matching pursuits with time-frequency dictionaries, IEEE
Transactions on Signal Processing, 41, 3397-3415, 1993.
- J. H. Friedman, Exploratory projection pursuit, Journal of the American Statistical
Association, 82, 249-266, 1987.
- Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line
learning and an application to boosting, Journal of Computer and System Sciences,
55, 119-139, 1997.
- P. A. Viola and M. J. Jones, Robust real-time face detection, International
Journal of Computer Vision, 57, 137-154, 2004.
- S. C. Zhu, Y. N. Wu, and D. B. Mumford, Minimax entropy principle and its application
to texture modeling, Neural Computation, 9, 1627-1660, 1997.
- S. Della Pietra, V. Della Pietra, and J. Lafferty, Inducing features of random fields,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380-393, 1997.
Architecture of inference:
- M. Riesenhuber and T. Poggio, Hierarchical models of object recognition in cortex,
Nature Neuroscience, 2, 1019-1025,
1999.
This work is a continuation of our long term search for generative models and
model-based algorithms, as well as our attempt to understand these models within
a common information-theoretical framework. The active basis model can be considered
a revision of our previous model on textons. It can also be viewed as an inhomogeneous
version of the Markov random field model that we previously developed for textures.
More importantly, the active basis model is the simplest instance of the and-or graph
in the compositional framework that we have been studying. The and-or grammar naturally
suggests further composing multiple active bases to represent more articulated shapes.
The architecture of the sum-max maps is a natural computational tool for parsing the
observed image according to the and-or grammar.
- Wu, Y. N., Guo, C., and Zhu, S. C. (2008) From information scaling to regimes of
statistical models. Quarterly of Applied Mathematics, 66, 81-122.
- Wu, Y. N., Si, Z., Fleming, C., and Zhu, S. C. (2007) Deformable template as
active basis. Proceedings of International Conference of Computer Vision.
- Wu, Y. N., Li, J., Liu, Z., and Zhu, S. C. (2007) Statistical principles in image
modeling. Technometrics, 49, 249-261.
- Guo, C., Zhu, S. C. and Wu, Y. N. (2007) Primal sketch: integrating structure
and texture. Computer Vision and Image Understanding, 106, 5-19.
- Zhu, S. C. and Mumford, D. (2006) A stochastic grammar of images. Foundations
and Trends in Computer Graphics and Vision, 2, 259-362.
- Zhu S. C., Guo C., Wang Y., and Xu Z. (2005) What are textons? International
Journal of Computer Vision, 62, 121-143.
- Zhu S. C. (2003) Statistical modeling and conceptualization of visual patterns.
IEEE Pattern Analysis and Machine Intelligence, 25, 691-712.
- Guo, C., Zhu, S. C., and Wu, Y. N. (2003) Towards a mathematical theory of primal
sketch and sketchability. Proceedings of International Conference of Computer Vision.
1228-1235.
- Guo, C., Zhu, S. C., and Wu, Y. N. (2003) Visual learning by integrating descriptive
and generative models. International Journal of Computer Vision. 53, 5-29.
- Zhu, S. C., Guo, C., Wu, Y. N. and Wang, Y. (2002) What are textons? Proceedings of
European Conference of Computer Vision, 793-807.
- Wu, Y. N., Zhu, S. C., and Guo, C. (2002) Statistical modeling of texture sketch.
Proceedings of European Conference of Computer Vision, 240-254.
- Wu, Y. N. and Zhu, S. C. (2001) Vision and the art of data augmentation. Discussion of a paper
by Meng and van Dyk. Journal of Computational and Graphical Statistics, 10, 90-93.
- Wu, Y. N., Zhu, S. C., and Liu, X. (2000) Equivalence of Julesz ensembles and
FRAME models. International Journal of Computer Vision, 38, 245-261.
- Zhu, S. C., Liu, X., and Wu, Y. N. (2000) Exploring texture ensembles by efficient
Markov chain Monte Carlo - towards a `trichromacy' theory of texture. IEEE Pattern
Analysis and Machine Intelligence, 22, 554-569.
- Zhu, S. C., Wu, Y. N., and Mumford, D. B. (1997) Minimax entropy principle and its
application to texture modeling. Neural Computation, 9, 1627-1660.
- Zhu, S. C. and Mumford, D. (1997) Prior learning and Gibbs reaction-diffusion.
IEEE Pattern Analysis and Machine Intelligence, 19, 1236-1250.
- Zhu, S. C., Wu, Y. N., and Mumford, D. B. (1998) Filter, Random field, And Maximum
Entropy (FRAME): towards a unified theory for texture modeling. International Journal of
Computer Vision, 27, 107-126.