The following expands on the Discussion sections of the two technical reports.
- The difference between generative and discriminative methods is mainly about hidden variables
and their explaining-away competition in inference and learning. In generative learning,
these hidden variables compete to explain away
each individual image, instead of the class label of the image. The image is more informative
for inferring the hidden variables than the class label is, especially when the number of hidden
variables is large: the class label y is only binary, whereas the image x contains much more information.
Max-margin learning does go far beyond the class labels or p(y|x) to capture the images, but it may still not
be as informative as the generative likelihood p(x|y).
- The form of the generative model is p(hidden variables|A) p(observed data|hidden variables, B),
where A and B are parameters. In the active basis model, we assume B is already learned as
the dictionary of basis elements, and we focus on learning A, which is supposed to be the dictionary
or codebook of templates or part-templates.
- Some preliminary experiments in this paper suggest that it is possible to learn active basis
templates or partial templates from non-aligned images. The goal is to learn a dictionary or codebook
of partial templates in an unsupervised manner. The generative model is more natural for this task.
It is essentially a recursion of Olshausen-Field learning, which iterates between an inference/encoding
step and a codebook re-learning step.
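For concreteness, the two-step recursion can be sketched in toy NumPy form, with a simple matching-pursuit encoder standing in for the inference/encoding step and a least-squares update for the codebook re-learning step. All names, sizes, and parameters here are illustrative, not the actual implementation:

```python
import numpy as np

def encode(images, dictionary, k):
    # Inference/encoding step: sparse-code each image with at most k
    # elements via matching pursuit (a stand-in for the actual inference).
    codes = np.zeros((images.shape[0], dictionary.shape[0]))
    for m, img in enumerate(images):
        residual = img.copy()
        for _ in range(k):
            resp = dictionary @ residual          # filter responses
            j = int(np.argmax(np.abs(resp)))      # winner takes all
            codes[m, j] += resp[j]
            residual = residual - resp[j] * dictionary[j]  # explain away
    return codes

def relearn(images, codes):
    # Codebook re-learning step: least-squares update of the dictionary
    # given the current codes, then renormalize each element.
    D, *_ = np.linalg.lstsq(codes, images, rcond=None)
    return D / np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1e-12)

rng = np.random.default_rng(0)
images = rng.normal(size=(50, 16))
dictionary = rng.normal(size=(8, 16))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
for _ in range(5):                 # the Olshausen-Field-style recursion
    codes = encode(images, dictionary, k=3)
    dictionary = relearn(images, codes)
codes = encode(images, dictionary, k=3)
```

Each iteration alternates between explaining each image with a few elements and re-fitting the codebook to the resulting codes.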
- Hidden variables seek to explain each individual training image. In the active basis model,
the perturbations of the basis elements and the coefficients of the basis elements are
both hidden variables. We can continue to add hidden variables such as unknown
overall locations, scales, view points, and categories of the templates or partial templates,
in order to make the learning more unsupervised. We can also add mid-scale hidden variables to
account for larger deformations.
- In active basis model and its extensions, the hidden variables are dense and layered. They are
dense in the sense that each basis element is associated with a perturbation variable. They are
layered because the perturbation variables can be defined at the levels of elements, part-templates,
and whole template. Learning in the presence of dense and layered hidden variables may have to be
generative for estimation efficiency.
- In the active basis model, the selected and perturbed (arg-max) basis elements inhibit nearby basis elements
in each individual training image, because the selected and perturbed element explains away
part of the image, so that the nearby basis elements cannot further explain away the residual image.
This step is missing in discriminative learning. This image specific arg-max inhibition may define
generative learning in this special context. This is a matching pursuit effect.
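A tiny numerical illustration of this arg-max inhibition, with hand-picked unit vectors in a hypothetical 4-d image space:

```python
import numpy as np

# Two highly correlated basis elements and one independent element.
b1 = np.array([1.0, 0.0, 0.0, 0.0])
b2 = np.array([0.9, np.sqrt(1 - 0.81), 0.0, 0.0])  # overlaps b1 heavily
b3 = np.array([0.0, 0.0, 1.0, 0.0])                # independent of b1

image = 2.0 * b1 + 1.0 * b3

# Select the arg-max element (b1 here) and subtract its contribution.
c1 = image @ b1
residual = image - c1 * b1

# b2 is inhibited: its response collapses on the residual, because b1
# already explained away the part of the image they share.
print(image @ b2, residual @ b2)   # 1.8 vs. 0.0
# b3 is untouched: it can still explain the residual.
print(image @ b3, residual @ b3)   # 1.0 vs. 1.0
```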
- The residual images neutralize selected basis elements and those basis elements that are highly
correlated with them, because they have zero or small responses on the residual images. So if we want to
continue to explain the residual images, we will avoid the selected elements and their neighbors.
- This neutralizing data set does not involve reweighting the original data set as in AdaBoost, or resampling
(filtering) the original data set as in the earlier boosting scheme, so there is no loss of effective training examples.
This neutralizing data set viewpoint connects generative learning and discriminative learning.
- In the generative approach, we remove the arg-max element from each example in
creating the neutralizing example, so that the arg-max element inhibits overlapping elements. This is
sharper than discriminative reweighting.
- The above learning process essentially removes the objects from the positive images, so that the residual
images become natural images. In that sense, generative learning of objects amounts to removal of the
objects from the training images. We can introduce further image-specific hidden variables for sharper
removal of the object in each individual positive image, so the image-specific hidden variables are naturally
needed in the generative model.
- Another way to think about it from the classification point of view is that here the features are the coefficients
for reconstructing the images, instead of the filter responses. The filter responses can be highly dependent both
linearly and statistically. The coefficients may be much less so. The coefficients can be obtained from the filter
responses by some sparsification, which does two things. One is thresholding noisy responses to zero. The other is like
non-maximum suppression or winner-takes-all.
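The two operations can be sketched as follows, in a hypothetical 1-d version where `threshold` and `window` are illustrative parameters:

```python
import numpy as np

def sparsify(responses, threshold, window):
    # (1) threshold noisy responses to zero
    coeffs = np.where(np.abs(responses) >= threshold, responses, 0.0)
    # (2) winner-takes-all within each local window
    #     (a crude 1-d non-maximum suppression)
    for start in range(0, len(coeffs), window):
        block = coeffs[start:start + window]   # view into coeffs
        if np.any(block):
            keep = int(np.argmax(np.abs(block)))
            winner = block[keep]
            block[:] = 0.0
            block[keep] = winner
    return coeffs

responses = np.array([0.1, 0.9, 0.8, 0.05, 0.0, 0.6, 0.7, 0.2])
coeffs = sparsify(responses, threshold=0.5, window=4)
# only one above-threshold winner survives per window:
# [0.  0.9 0.  0.  0.  0.  0.7 0. ]
```

The surviving coefficients are much less dependent than the raw responses, since correlated neighbors have been suppressed.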
- The above is an expansion of the statement at the end of Section 4.4: "In the pursuit of basis elements,
the generative learning does not re-weight positive examples, and the inhibition between basis elements is carried out
through residual images. In the discriminative learning, the inhibition is done by re-weighting the
training examples."
- The above discussion hinges on an additive model with an explicit residual image, and matching
pursuit is the simplest concrete embodiment of explaining away. As we show in the appendix of our ICCV paper,
it is a special case of a probability model p(I_m | B_m), where B_m is the deformed template that sketches
image I_m, and B_m depends on the hidden variables of
perturbations. The form of p(I_m | B_m) is based on exponential tilting or maximum entropy.
When we select the next basis element, we want to continue to improve the current p(I_m | B_m) by
adding one more stroke to the template, so that the new p(I_m | B_m) with the new template B_m gives a better
explanation to I_m. We want to select the new stroke so that the overall improvement over all training images
is maximized.
- From discriminative point of view, the current p(I_m | B_m) (or its Monte Carlo samples) is the current
negative training set. We want to find the new stroke to distinguish the positive examples from these
negative training sets. However, unlike in AdaBoost, where all the positive examples share the same re-weighted
negative training set, in generative learning, each image I_m has its own negative training set p(I_m | B_m),
because different images I_m have different deformed templates B_m caused by the hidden variables of perturbations. Such image
specific negative training set pushes for sharper selection of the next stroke.
- The current p(I_m | B_m) is said to have explained away I_m, and we want to find a new stroke to continue to
further explain away I_m by a new p(I_m | B_m) with a new template B_m with the new stroke added. So the
generative learning sketches and paints each I_m by an image specific p(I_m | B_m). This is consistent with
the above residual-based understanding of removing the object from each I_m. To be able to remove the object,
you have to be able to generate the object.
- Although hidden variables can be maxed out or integrated out and reduced to invariant
features such as local max or histograms, so that we do not have to deal with image-specific hidden
variables, invariant features in the discriminative approach do
not account for the top-down pass and the above arg-max explaining-away inhibition. In fact, the local max
operation itself needs justification in the discriminative approach.
- Moreover, the invariance in discriminative learning is at best approximate and would better be
called blurring. The reason is that
the invariance is supposed to be relative to the variations of the original form or shape. But the original
shape is unknown and there can be many original shapes, so it is impossible to be completely invariant. Invariance
is possible only if we explicitly model the original shape and its deformations; in particular, the
original shape is the same over all the examples, and the deformations are represented by hidden variables
that are specific to individual examples.
- The generative likelihood can be considered the full likelihood, whereas the discriminative likelihood
without regularization can be considered
partial likelihood. The generative likelihood focuses
on the internal structures of the positive training set instead of the boundary between positive
and negative training sets.
- More specifically, the posterior class probability is a non-linear function
of the likelihood ratio p/q, and the class probability saturates at 1 for typical positive
examples that are well within the classification boundary and that have large likelihood ratios.
That is, if we increase the likelihood ratio by a large amount, then we only increase the
class probability by a small amount. So the class probability may not
be as strong as the likelihood ratio for modeling typical positive examples inside the
classification boundary. The class probability
is more sensitive for marginal examples close to the boundary.
- Conditional on the hidden variables, the active basis model follows an exponential
family model relative to the probability density of natural images. The lambda parameters
(or weights) are estimated based on an orthogonality-independence assumption on the background/residual
model q. This q may be adequate for residual images of the positive training images, but
is far from capturing patterns in negative images or natural images. If we let q be the
distribution of negative or natural images, then we can correct for
this assumption by fine-tuning or relaxing the lambda's given the selected basis elements, either by
fitting the full generative exponential model or by fitting the discriminative logistic regression
classifier, which is the conditional distribution of the full exponential model. In both cases, the
log-likelihoods are concave, but the gradients are different. The generative gradient is in the
form of the difference between observed average and the expectation of the sufficient statistics,
thus the generative model targets the "average" or the "center". The computation of the
expected sufficient statistics involves reweighting the natural image patches, but the average
sufficient statistics stay the same, without reweighting the positive images.
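The generative gradient described above can be sketched as follows. This is a toy version in which `background_feats` plays the role of natural-image statistics and the exponential reweighting approximates the model expectation by importance sampling; all names and sizes are illustrative:

```python
import numpy as np

def generative_gradient(lambdas, positive_feats, background_feats):
    # observed average of the sufficient statistics
    # (no reweighting of the positive examples) ...
    observed = positive_feats.mean(axis=0)
    # ... minus the model expectation, computed by reweighting the
    # background (natural-image) feature vectors with exp(lambda . h)
    w = np.exp(background_feats @ lambdas)
    w /= w.sum()
    expected = w @ background_feats
    return observed - expected

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=(100, 4))    # positives: shifted mean
bg = rng.normal(0.0, 1.0, size=(1000, 4))    # background statistics
grad = generative_gradient(np.zeros(4), pos, bg)
# at lambda = 0 the gradient points toward the positives' mean shift
```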
- Discriminative adjustment of the lambda's by logistic regression tends to overfit the data, so
regularization is necessary. The L2 regularization is preferable because it encourages smoothness
of the lambda's of the selected basis elements. The L2-regularized logistic regression is similar to
SVM; the logistic loss is smoother than the hinge loss. The logistic regression is a consequence of the
relative exponential family model. Usually, some negative images may match some portions of
the template but completely miss other portions. So we should down-weight the lambda's of those
portions that are frequently matched by negative images or false positives. We may also learn some negative
templates from false positives using a mixture model, for better modeling of q.
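The discriminative fine-tuning can be sketched as a from-scratch L2-regularized logistic regression on hypothetical responses of the selected basis elements; `c` is the ridge strength and all data here are synthetic:

```python
import numpy as np

def l2_logistic_grad(lambdas, feats, labels, c):
    # gradient of the L2-regularized logistic log-loss
    p = 1.0 / (1.0 + np.exp(-(feats @ lambdas)))   # P(y = 1 | x)
    return feats.T @ (p - labels) / len(labels) + c * lambdas

def fit(feats, labels, c=0.1, lr=0.5, steps=500):
    # plain gradient descent; the regularized loss is convex,
    # so this converges to the unique optimum
    lambdas = np.zeros(feats.shape[1])
    for _ in range(steps):
        lambdas -= lr * l2_logistic_grad(lambdas, feats, labels, c)
    return lambdas

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
y = (x @ np.array([2.0, -1.0, 0.0]) > 0).astype(float)
lam = fit(x, y)   # recovers the sign pattern, shrunk by the ridge term
```

Since the selection step has already reduced the dimensionality, this adjustment is cheap.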
- Although logistic regression is a partial likelihood and therefore less efficient than the full
generative likelihood, regularization changes the story considerably. The regularization term can
be understood as reducing variance or as corresponding to a prior distribution. But maybe it is
better seen as approaching the generative likelihood. Take SVM for example. Suppose the data are within a
ball. Then the larger the margin, the smaller the volume occupied by positive and negative examples.
So a uniform distribution on this margin-excluded ball would have high generative likelihood. So in
this sense, SVM is doing something close to generative model, although the uniform distribution over
the margin-voided ball is rather crude as a generative model.
- The general idea is to use the generative model to select basis elements and infer hidden variables,
as well as to estimate the lambda parameters based on the conditional independence assumption. This leads to
a stand-alone generative model whose likelihood can be used for detection and classification. After
generative learning, given the selected basis elements and the inferred hidden variables, we can
re-estimate lambda parameters by logistic regression. We believe the generative model is more efficient
in learning with hidden variables. Moreover, the selection of the basis elements hugely reduces the
dimensionality so that discriminative adjustment can be done very efficiently.
- Take clustering for example. One can do clustering using discriminative method. But in general,
classification focuses on marginal examples (either relative to negatives or relative to other clusters).
Such marginal examples may not reveal a clear view of clustering. Clustering is clearer from a global view, as
a matter of polarizing away from the center. The discriminative approach may fail to generate a strong enough
polarizing force in the initial stage of the clustering algorithm, where all clusters are similar.
In an EM mixture model, those examples that are clustered well receive higher weights in the E-step. But in
classification, those examples that are not classified well receive higher weights. There seems to be
a conflict of interest.
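A one-dimensional toy example of the opposite weighting schemes (the numbers are purely illustrative):

```python
import numpy as np

# Two 1-d Gaussian clusters with known means; compute the E-step
# responsibilities of a mixture model (equal priors, unit variance).
x = np.array([-2.0, -1.8, 0.0, 1.9, 2.1])
means = np.array([-2.0, 2.0])

log_lik = -0.5 * (x[:, None] - means[None, :]) ** 2
resp = np.exp(log_lik)
resp /= resp.sum(axis=1, keepdims=True)    # E-step responsibilities

# EM effectively trusts the well-clustered examples: x near -2 or 2
# gets responsibilities near 0/1, while the ambiguous x = 0 gets
# 0.5/0.5.  Boosting-style reweighting would do the opposite and
# upweight the ambiguous example near the boundary.
confidence = resp.max(axis=1)
```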
- Clustering is about discrete hidden variables. Even for continuous hidden variables, such as
unknown locations and orientations, the focus on those marginal examples may not be helpful for us to
infer such hidden variables for overall alignment among all the examples.
- As to the explaining-away mechanism in learning, one closed-form mechanism is the matching pursuit
mechanism in the linear additive model, as we discussed above. Another
related mechanism is basis pursuit or lasso, which is based on the trick of replacing the L0 norm by the
L1 norm. Just as we crafted a shared matching pursuit algorithm,
if we want to use basis pursuit or lasso, we also need to do it in a way that the selected elements
are shared by all the training images. This can be enforced by tying the coefficients across images,
and defining an L1 sparsity penalty shared by all the images. We did not use lasso because matching pursuit works well
and is very fast. We were also concerned
whether L1 is strong enough for sparsity. Moreover, the LARS solution to lasso is also a step-wise
procedure similar to matching pursuit. Since we do not want to use heavily overlapping Gabors to
sketch the images anyway, and the Gabors are small relative to the template, we are not concerned
with the greediness of matching pursuit.
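For reference, the shared selection can be sketched as follows: at each step, the element maximizing the summed (locally max-pooled) response over all training images is selected, and each image then performs its own arg-max inhibition on its residual. This is a 1-d toy in which perturbation is modeled as an index shift among dictionary rows, not the actual algorithm:

```python
import numpy as np

def shared_pursuit(images, dictionary, k, shift=1):
    residuals = [img.astype(float).copy() for img in images]
    selected = []
    n = dictionary.shape[0]
    for _ in range(k):
        # selection is tied across ALL images: each image contributes
        # a local max over perturbations (index shifts) of element j
        scores = np.zeros(n)
        for r in residuals:
            resp = np.abs(dictionary @ r)
            for j in range(n):
                lo, hi = max(0, j - shift), min(n, j + shift + 1)
                scores[j] += resp[lo:hi].max()
        j = int(np.argmax(scores))
        selected.append(j)
        # image-specific arg-max inhibition on each residual
        for r in residuals:
            lo, hi = max(0, j - shift), min(n, j + shift + 1)
            resp = dictionary[lo:hi] @ r
            jm = lo + int(np.argmax(np.abs(resp)))
            r -= (dictionary[jm] @ r) * dictionary[jm]
    return selected

images = [np.array([0.0, 0.0, 3.0, 0.0, 0.0]),
          np.array([0.0, 0.0, 2.0, 0.0, 0.5])]
selected = shared_pursuit(images, np.eye(5), k=1)
# the shared element lands within one shift of position 2
```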
- An important point is that we cannot do matching pursuit or lasso-type sparsification first on
individual images, and then learn the next layer from the sparsified coefficients. That would be an early decision.
Even if we obtain the optimal solution to the sparse coding problem for each individual image, this optimal
solution does not account for uncertainty in the selected basis elements and their coefficients, and
it may not provide an optimal fit to the model in the layer above, whatever that model may be.
This is a key obstacle for learning
the next layer beyond Olshausen-Field model. But on the other hand, we cannot give up explaining-away
sparsification, because the model learned from the unsparsified coefficients does not give sparse
explanation of the image data, thus violating the principle of Olshausen-Field.
- If one thinks in terms of edge detection, the above paragraph essentially says that you cannot
do edge detection before learning. Such edge detection is an early commitment. No matter how "optimal"
it may be, it may not be a good fit to the patterns you are looking for. That is, when you are looking
for patterns of edges, you simply cannot do edge detection first and then find the patterns. You have to
find the patterns together with edge detection. The two cannot be separated. This is the same with
both learning and inference (detection and recognition). In recognition, you cannot detect edges
before you detect the object. Instead, you detect the edges together with, or even after, you detect the
object. The important message is that early decision should always be avoided. To put it another way,
you need the edges to see the pattern, but you also need the pattern to see the edges. Such a
chicken-and-egg problem can only be solved by joint maximization of an objective function such as the likelihood.
- If one thinks in more general terms of compositionality, the tricky thing is that you cannot find
the compositional patterns of the constituent elements without identifying the constituent elements,
and yet identifying constituent elements before seeing the compositional patterns is an early decision. So
you have to identify the constituent elements and learn the pattern at the same time, or by looping
through bottom-up and top-down passes. Building up the hierarchy by a bottom-up procedure does not work.
- Our solution is to simplify the learning problem first, starting from roughly aligned images, and
later extend the learning to nonaligned images by introducing more hidden variables. For roughly aligned images,
we can perform shared sparsification. That is, we do sparsification together with learning the next layer
beyond Olshausen-Field. In other words, the next layer of learning is to memorize the commonly shared
sparsifications. So each neuron in the next layer memorizes a commonly shared sparsification.
- Explaining-away can be made very light in inference, if it is already memorized in learning,
via selection, or equivalently trimming, of highly redundant elements. In the active basis model,
the selected basis elements have little overlap. So in inference, these elements generally
do not interfere with each other in the local max perturbation.
- The sparsification of the Olshausen-Field model turns the layer of coefficients into a point process,
no longer another layer of images or dense maps. Thus the model on the coefficients should be drastically different from
the model on images. It cannot be a simple recursion. For the sparse point process, the best one
can do may be to memorize commonly occurring patterns.
- We have been thinking about the layer on top of Olshausen-Field model for a long time. The form of this
layer in return provides a justification why we need Olshausen-Field model in the first place. The sparsity
principle, as appealing as it is, does not tell us exactly what the purpose of sparse representation is. We
somehow feel that the active basis model provides one justification: we want to perturb the sparse coding
elements in order to generalize to similar shapes. We have been feeling that going beyond Olshausen-Field
is the key to understanding vision, and the layer above Olshausen-Field must be fundamentally different from
the layer of Olshausen-Field.
- Generalizability can (or maybe should) be defined extrapolatively, where the testing images come from
different distributions than the training images. Such extrapolation can be achieved by
changing the distribution of the hidden variables. In the shape script model, extrapolation can be
achieved by exaggerating the geometric parameters of the elementary shapes, or by recombining
different elementary shapes. Extrapolating from a regular bike to a tandem bike is
the simplest example of extrapolative generalization.
- Extrapolative generalizability is a criterion for the usefulness of the knowledge learned from
data. The learned knowledge should have the ability to deal with "surprises" in order for an intelligent
system to survive the real world. In fact, an intelligent system should be curious and adventurous in seeking out surprises, in order to learn rapidly. A surprise lies in between a positive testing example
and a negative testing example. It is a positive example because it can be explained by existing knowledge.
It is a negative example because it cannot be fully explained by existing knowledge without some
extrapolation. Also, a surprise lies in between a testing example and a training example. It is a testing
example because it demands explanation by existing knowledge. It is a training example because it offers
an opportunity to re-learn or adapt the existing model.
- In order to explain surprise by existing knowledge, we need to explain in
what aspects the surprise is similar to past examples, and more importantly,
in what aspects the surprise differs from past examples. These aspects are
hidden variables, which can be dense and layered as discussed above.
- The Olshausen-Field model learns Gabor-like wavelets by enforcing sparsity.
These wavelet elements turn out to be localized in both spatial and frequency domains,
and are similar to each other. They are indexed by attributes
such as locations, orientations, scales, and phases. The similarity between the elements is
not assumed in learning, and the attributes are not learned.
The concept of location already exists in the labeling of pixels, but not orientation and scale.
Labeling the learned wavelet elements by locations, orientations, and scales amounts to
a re-representation, so that the layer above it is built upon these attributes, in
addition to the wavelet coefficients.
- These attributes serve as hidden variables of the active basis model. The
local deformations of the template and the global geometrical transformations such as
translation, dilation, and rotation of the template are all based on linear
additive manipulations of these attributes. Such linear additive manipulations
trace out the highly non-linear manifolds in the image space, and most natural
images live on such manifolds. The original image manifolds are non-Euclidean.
But the distributions of these attributes span flat Euclidean manifolds,
so that linear interpolation becomes meaningful.
- If we do not want to index the geometrical attributes of the basis elements explicitly
or embed them into a geometric space, then at least we should organize them in a topological
neighborhood system, so that perturbations of basis elements can be represented as changing
the basis elements to their respective neighbors. Such a
neighborhood system makes the ensemble of natural images more connected, so that one can easily flow from
one natural image to another. If we define the neighborhood system based on the Euclidean distance
of the image intensities, then this neighborhood system is less connected.
- What is more fundamental than the attributes themselves is the underlying group of
transformations that transform one wavelet element to another, under the condition that these
elements are similar to each other. The attributes are only a result
of representing such a group of transformations. In other words, the transformations are the
verbs, and the attributes are the nouns. The verb aspect is more important than the noun
aspect. In the active basis model,
transformations of limited ranges are applied to the elements first before
they are linearly combined to generate the image.
In general, such transformations correspond to the physical deformations and motions.
The introduction of such transformations that connect the wavelet elements leads to further
sparse coding, which is equivalent to inferring the underlying average shape that is subject
to deformation or the speed of the underlying motion.
- In the Olshausen-Field model, the small number of basis elements that are turned
on provide a sparse coding of the observed image. Then we need to continue to code
these on-elements by an even smaller number of templates. The above-mentioned attributes
enable further sparse coding of the on-elements, with additive residuals, and these
additive residuals are the perturbations of the basis elements in the active basis model.
- The perturbations of the basis elements are the most important non-linearities
in the active basis model in the top-down generative direction. This leads to the
non-linear local max operation in the bottom-up detection direction. Other non-linearities,
which are less important but still very necessary, include the sigmoid saturation
of the filter responses (justified by a ratio of mixture distributions, not by a logistic
perceptron), as well as the local normalization of the filter responses (which amounts
to local linear whitening) before sigmoid saturation.
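These two bottom-up non-linearities can be sketched generically; the exact functional forms in the model may differ, and this only shows the shape of the computation:

```python
import numpy as np

def normalize_and_saturate(responses, window=3, sat=6.0):
    energy = responses ** 2
    # local normalization: divide each squared response by the average
    # energy in a window around it (a crude local whitening)
    kernel = np.ones(window) / window
    local_avg = np.convolve(energy, kernel, mode="same")
    normalized = energy / np.maximum(local_avg, 1e-12)
    # sigmoid-style saturation: monotone, bounded above by `sat`
    return sat * np.tanh(normalized / sat)

r = np.array([0.1, 5.0, 0.2, 0.1, 0.0])
out = normalize_and_saturate(r)
# the strong response stays dominant but is bounded; weak responses
# near a strong one are suppressed by the normalization
```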
- The shape script model further composes elementary geometric shapes such as line segments,
curve segments, parallel bars, corners, junctions, ellipsoids, where each elementary shape is modeled by an active basis
template. This is a second-layer sparse coding. These elementary shapes are indexed by their
overall locations, orientations, scales, lengths, widths, curvatures, angles, and aspect ratios. Some of
these attributes are 3D-aware in the sense that they change with the underlying 3D change of view point, and some can be
3D-informative in the sense that they can be used to infer the underlying 3D structure. These attributes
lead to further sparse coding, which is equivalent to inferring the underlying 3D information.
- In general, endowing the sparse coding elements with attributes, or organizing the sparse
coding elements by transformation groups so that one can relate one element to another, enables us to
generalize by varying these attributes. It leads to
further sparse coding by a generative model at a higher layer, and such further sparse coding is
equivalent to inferring the underlying physical properties or hidden variables of the higher layer
model that generates the sparse coding elements. The necessity of the higher layer model can be
tested against the random coincidence of the sparse coding elements.
Such transformation groups enable extrapolative generalization by
moving further along the orbits of the transformation groups.
- The learning of the sparse coding elements is only half the story. The learning of the attributes or
the transformation group is equally important. It is not clear if it should be learned or simply be designed.
But if it is designed, then the learning of the sparse coding elements may be much simplified.
- Because the next layer of model is built on the attributes of the sparse coding elements and
we need to be able to vary the attributes in order to generalize, the sparse coding elements must be aggressively selected,
so that each element has enough room to maneuver. That is, we cannot simultaneously include
elements whose attributes are very similar to each other, because otherwise it would become meaningless to
talk about varying these elements.
- Neurally, the aggressive selection can be accomplished by explaining-away lateral inhibition.
Lateral inhibition can either be carried out in short term activities of neurons in inference, or be stored in
long term memory via the trimming or selection of neuron connectivities in learning. It is
better to memorize it in learning. Once explaining-away is memorized or hardwired in learning by
trimming or selection of connections, there is little explaining-away in inference.
The sparse coding also suggests neurally both localized and distributed representation, where each concept is
represented by a single node, but the concept is memorized by selective connections from this node to a small number of nodes
at the layer below. The sparse and selective connections result in symbolic representations of image intensities.
- Sparsity is important for (1) time and energy efficiency of the brain,
(2) sharp learning and inference, (3) signal to symbol transition, and (4)
extrapolative generalization by exaggerating
the values of the small number of hidden variables, and by recombining
the values of the hidden variables. Exaggeration requires continuous hidden variables whose
distributions span flat manifolds.
- In general terms, sparsity is about reductionism, and about understanding by parsimonious models. However,
here parsimony is different from that in physics, in the sense that it is flexibly parsimonious.
That is, the world can be complex, but each instance can still be explained by a low-dimensional representation,
and different instances are explained by different low-dimensional representations, under a common
representational scheme.
- When we use the words "symbolic" or "explicit", we really mean attributed sparse coding elements as well
as further coding of the attributes.
- The active basis model and its hierarchical generalizations such as shape script model
are special cases of AND-OR graphs proposed by Zhu and Mumford for vision. The local shifting at
each layer corresponds to the OR-nodes. The AND-OR graph is an intuitive and concrete form of
a generative model, where the OR-nodes are the hidden variables.
- An OR-node may involve discrete change of categories. It may also involve the continuous change
of geometric attributes such as locations, orientations, scales, and aspect ratios, though the
continuous change can be implemented by discretization.
- The OR-nodes lead to important non-linearities such as the local max operations
in the bottom-up pass of inference. They also lead to the top-down pass of inference, which
commits the values of the OR-nodes by retrieving the arg-max.
- The top-down pass is to impose a generative model on the image, or to parse the image, or
in our model, to sketch the image.
- The AND-OR graph leads to the SUM-MAX maps for inference, which turns out to be a simplified and
specialized version of the max-product algorithm for fitting graphical models (this was realized
after the publication of our recent papers).
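A 1-d sketch of the SUM-MAX computation; the element positions and shift range are illustrative:

```python
import numpy as np

def sum_max(response_map, elements, shift=1):
    # MAX: each selected element (an OR-node) takes the local max of
    # its responses over perturbations, and remembers the arg-max.
    # SUM: the template score (an AND-node) sums these local maxima.
    score, argmax = 0.0, []
    n = len(response_map)
    for pos in elements:
        lo, hi = max(0, pos - shift), min(n, pos + shift + 1)
        j = lo + int(np.argmax(response_map[lo:hi]))
        argmax.append(j)            # retrieved in the top-down pass
        score += response_map[j]
    return score, argmax

resp = np.array([0.1, 0.9, 0.2, 0.0, 0.7, 0.1])
score, where = sum_max(resp, elements=[0, 5])
# score = 0.9 + 0.7; the arg-max positions [1, 4] sketch the image
```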
- Sparsity should be aggressively enforced when learning an AND-OR graph, so that each AND-node
is only connected to a small number of OR-nodes, but each OR-node spans a big range.
- The AND-nodes are for representational efficiency by multiplicative composition. The OR-nodes
account for variability and generalizability.
- The AND-OR graph is about recursive compositional whole-part relationship. Each whole concept is
an AND-composition of a selected set of OR-parts. So the AND-OR graph is sparsely and selectively wired.
The wiring happens in learning.
- The AND-OR graph is a specialized layered belief network. Each layer consists of a sub-layer of
AND-nodes and a sub-layer of OR-nodes. Each layer is a dictionary of concepts, and each concept
is represented by a single AND-node. This may look like a localized representation, but it is also
a distributed representation at the same time, because each AND-node is connected to multiple OR-nodes,
and different AND-nodes may share overlapping OR-nodes. These OR-nodes are shared and changeable parts.
- The architecture of the whole AND-OR representation of the visual knowledge is an upside-down
triangle, where the dictionary becomes larger as we move up to higher layers. But in inference, the
small number of nodes that are turned on form a normal triangle, where more nodes are turned on at
the lower layers, and fewer nodes are turned on at the upper layers. These are the selected words
from the dictionary.
- The AND-OR graph may also be a model for the layered neural network. There are two types of
neurons, the AND-neurons and the OR-neurons. The AND-neurons perform sum pooling. The OR-neurons
perform max pooling as well as arg-max retrieval. The network is sparsely and selectively connected,
not densely connected.
- Generic local max or local average operations are robust to local distortions, but they are not strictly
invariant to them. There is no generic invariance to local distortions, because the local distortions can be
applied to different shapes, and the undistorted shapes or forms are unknown.
- For the local max to be strictly or maximally invariant to the local deformation, the local max
must be pooled specifically around the locations and orientations that are at the centers of the local shifts,
that is, these highly selected locations and orientations describe the original undistorted shape. The local
max is not invariant at other locations and orientations relative to the distortions of this shape.
- These locations and orientations are those of the selected basis elements of the active basis template,
and the local shifts are the hidden variables.
- The above-mentioned selected locations and orientations cannot be replaced by simple generic
subsampling of locations and orientations, because these locations and orientations form specific templates
of different object categories, poses, and parts.
- In general, each max or average pooling is performed over a group of units, and different
groups may highly overlap with each other by sharing common units. A concept is a combination of highly selected groups.
Only the pooling of the selected groups is invariant to the variations in this concept. Each such group corresponds to
an OR-node, and each combination of OR-nodes corresponds to an AND-node.
- Invariance is relative to the OR-nodes in an AND-OR graph model, or the hidden variables of a
generative model. To be strictly or maximally invariant, we must learn the generative model in learning
and infer the hidden variables in inference. The only way to be invariant to the hidden variables is to
chase them down in inference. This is what the local max pooling is trying to do.
- So the local max pooling should not only remember the local max but, more importantly, also
remember the arg-max, i.e., where the local maximum is achieved. The arg-max values are used to sketch or parse
the image.
- Local max is for inferring the actual perturbations, and it is sharper than local average, which does not
infer the perturbations. Actually, local max and local average form a pair of hypotheses,
local sketch vs. local texture, so that the local max is scored against the local average.
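As a toy numerical illustration of this pairing (our own construction, not a formula from the reports), the local max can be scored against the local average within a window of filter responses:

```python
import numpy as np

def sketch_vs_texture(responses):
    """Score the local-sketch hypothesis against the local-texture hypothesis:
    a large positive score means one element dominates (a sketch);
    a near-zero score means the responses are flat (texture)."""
    return float(np.max(responses) - np.mean(responses))

edge_like = np.array([0.1, 0.1, 2.0, 0.1])     # one dominant response: sketch
texture_like = np.array([0.6, 0.5, 0.6, 0.5])  # roughly uniform: texture

print(sketch_vs_texture(edge_like) > sketch_vs_texture(texture_like))  # True
```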
- The max operation is more than an invariance pooling. It is a computational step of the max-product
algorithm for fitting a graphical model to the image data in the inference stage. As we realized after
the publication of our recent papers, the sum-max maps are a simplified and specialized version of the
max-product algorithm, which is a belief propagation algorithm for inferring the posterior mode.
- The max operation is also important for learning the active basis model. It tells the learning
algorithm which max scores should be summed over: the selected locations and orientations around which
to pool the max scores define the learned template.
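The sum-max architecture can be sketched in one dimension. This is a drastic simplification of the 2-D maps over locations and orientations in the papers: `delta` is the allowed perturbation range, and all names below are our own.

```python
import numpy as np

def max1(sum1, delta):
    """MAX1 map: local max of the SUM1 (filter response) map over shifts
    within +/- delta, plus the arg-max shift, i.e., the inferred perturbation."""
    n = len(sum1)
    m1 = np.empty(n)
    shift = np.empty(n, dtype=int)
    for x in range(n):
        lo, hi = max(0, x - delta), min(n, x + delta + 1)
        j = lo + int(np.argmax(sum1[lo:hi]))
        m1[x], shift[x] = sum1[j], j - x
    return m1, shift

def sum2(m1, selected):
    """SUM2 score: sum of MAX1 scores at the selected elements' locations
    (the selected locations define the learned template)."""
    return float(sum(m1[x] for x in selected))

s1 = np.array([0.0, 1.0, 0.2, 0.0, 0.8, 0.1])
m1, shift = max1(s1, delta=1)
score = sum2(m1, selected=[0, 4])
# The stored shifts let the top-down pass back-trace the perturbed elements:
sketch = [int(x + shift[x]) for x in [0, 4]]
```

Note that the max pooling (bottom-up) and the arg-max retrieval (top-down) come from the same operation: only the positions in `selected`, chosen in learning, contribute to the template score.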
- The learned active basis templates may correspond to neurons in V2 or
beyond. A template or a part-template is memorized by the connections from
the V2 neuron to the complex V1 cells that correspond to the selected basis
elements. Each V2 cell is sparsely connected to a small number of
highly selected complex V1 cells. Different V2 cells are connected to different sets
of V1 cells.
- Each V1 complex cell is connected to a number of neighboring V1 simple cells. The V1 complex
cell pools the maximum of the responses of its V1 simple cells. More importantly, we
hypothesize that the V1 complex cell maintains its bond to the particular simple cell
that achieves the maximum. In other words, the V1 complex cell stores both the
local max and the arg-max.
- The bond between the V1 complex cell and the arg-max simple cell might be realized by
synchronization (Tai Sing Lee, personal communication).
- We hypothesize that when a V2 cell is activated by the bottom-up pass, and the activation is
above a threshold, it is then allowed to sustain its activation, and suppress nearby V2 cells.
Meanwhile it starts a top-down pass to sustain the activations of the complex V1
cells connected to it. These complex V1 cells in turn sustain the activations of their arg-max simple
cells. These arg-max simple cells then inhibit the non-maximal simple cells. So we should see
a reduction in neural activities after recognition, which means a sharpening of the interpretation
of the image.
- The selection step in learning active basis templates may correspond to the
aggressive trimming of redundant connections between V2 and V1 cells. The connection
weights are the lambda parameters estimated after the selection step.
- The visual cortex is more a representation device than an inference device,
or in other words, it is more a generative model than a classifier. The top-down
pass and the lateral inhibition are reflections of this.
- Our model stands on the shoulders of the Olshausen-Field model, which accounts for simple V1 cells, and lays the foundation
of the generative models for vision. It is natural to ask what the layer above the Olshausen-Field model is,
or what is in V2. The Olshausen-Field model learns a dictionary of basis elements or strokes. The layer above it should
learn a dictionary of templates or part-templates sketched by the strokes.
- In more general terms, after learning a model p(hidden 1) p(visible 1| hidden 1, parameter 1),
where parameter 1 is a dictionary of basis elements or connection weights from V1 simple cells to image pixels,
we want to continue the learning by a model p(hidden 2) p(hidden 1 | hidden 2, parameter 2)
p(visible 1 | hidden 1, parameter 1). In general, "hidden k" are the activities of neurons at layer k, and
"parameter k" are the connections and their weights from layer k to layer k-1. The difficulty here
is that given visible 1, it is hard to infer hidden 1 because the inference involves explaining-away
competition between the hidden 1 variables or the sparsification of the hidden 1 variables. Moreover,
the uncertainty in the posterior distribution of the hidden 1 is hard to represent, but we cannot
rely on a point estimate without accounting for the uncertainties of the posterior,
no matter how good this point estimate is. Even worse, the posterior distribution of hidden 1
can change if we change the initial simple prior p(hidden 1) to a more sophisticated prior model
p(hidden 2) p(hidden 1 | hidden 2, parameter 2).
- The first layer hidden 1 deals with the visible layer or environmental data
visible 1 directly. Ultimately, the second layer hidden 2 or higher layers are also
intended to explain the environmental data visible 1, instead of the activities of the first layer
neurons hidden 1. In other words, hidden 1 is not visible 2. In fact, viewed from hidden 2, hidden 1 is
reporting the redundant and ambiguous proposals for explaining visible 1. At this bottom-up reporting
stage, hidden 1 is not making committed decisions yet, as it really should not. Hidden 2 receives the
redundant and double-talking proposals from hidden 1, and then tells hidden 1 which proposals it will
accept. With this top-down guidance, hidden 1 can make decisions on how to explain visible 1.
- So there are two phases of hidden 1: The bottom-up phase and the top-down phase. The bottom-up
phase is proposal, and the top-down phase is verification. The top-down phase is much sparser than
the bottom-up phase. The neural activity in the top-down phase is greatly reduced but sustained much
longer.
- Because hidden 1 is not visible 2, the learning of parameter 2 may not be a simple recursion of
the learning of parameter 1.
- The learning of parameter 2 is intimately tied to the inference of hidden 1 given
parameter 1. The role of parameter 2 may be to memorize the commonly shared results of the inference of
hidden 1. So the learning algorithm for parameter 2 may be a shared version of the inference algorithm
of hidden 1 given parameter 1, just like the shared sketch algorithm.
- The sparsity in hidden 1 requires that there should be lateral inhibition or selection in inferring
hidden 1. Therefore, the learning algorithm of parameter 2, which is a shared version of the inference
algorithm of hidden 1, should also have a selection step, so that different hidden 2 nodes are
selectively and sparsely connected to different hidden 1 nodes. In other words, their connections memorize
the commonly shared results of the inference of hidden 1 given parameter 1.
This is a drastic change from the wiring
of the connections from hidden 1 to visible 1, which may be localized and convolutional, but which is otherwise
densely connected. The lateral inhibition between hidden 1 nodes leaves its imprint on the sparse connections
from hidden 2 to hidden 1.
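A toy 1-D version of such a shared, selective learning step might look as follows. This is a drastic simplification of the shared-sketch-style selection; the inhibition radius and the scores are illustrative choices of ours.

```python
import numpy as np

def shared_selection(max1_maps, n_select, inhibit=2):
    """Repeatedly pick the position whose local-max score, summed over all
    training images, is largest; then suppress nearby positions, hardwiring
    the explaining-away inhibition into the selected sparse connections."""
    pooled = np.sum(np.stack(max1_maps), axis=0).astype(float)
    selected = []
    for _ in range(n_select):
        x = int(np.argmax(pooled))
        selected.append(x)
        lo, hi = max(0, x - inhibit), min(len(pooled), x + inhibit + 1)
        pooled[lo:hi] = -np.inf   # lateral inhibition, memorized as sparsity
    return selected

# Local-max maps from two training images of the same category.
maps = [np.array([0.1, 0.9, 0.2, 0.1, 0.7, 0.0]),
        np.array([0.0, 0.8, 0.1, 0.2, 0.9, 0.1])]
print(shared_selection(maps, n_select=2))   # [1, 4]
```

The selection is shared across images, so the resulting sparse connections (here, positions 1 and 4) are the commonly shared inference results rather than per-image decisions.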
- In general, the high-level knowledge may be learned by the shared execution of mid-level tasks, or the
high-level knowledge is commonly shared mid-level processing results within specific object categories.
- We suspect that the growth of the brain, or learning, is really a matter of
trimming unwanted connections or re-wiring the connections. Redundant connections not only make inference difficult
but also bring in a lot of noise
that hurts generalizability.
- The selection in learning parameter 2 may cause a "signal
to symbol" transition, so that hidden 2 is more symbolic. As a result, the generative direction from
hidden 2 to hidden 1 may be occlusion-based or logical instead of simply being linear additive as in
the generative direction from hidden 1 to visible 1.
- Our attempt is to first simplify the task by learning
from images where the objects are from the same category and are roughly aligned in the training images.
Then we hope to make the learning less supervised by allowing multiple categories, multiple poses,
and unknown locations and scales. Some preliminary experiments have already been attempted along this direction.
In the whole scheme of things, the active basis model is intended for modeling the dictionary of part-templates,
which are to be further composed into the dictionary of templates at still higher layers.
- In active basis model, the hidden 1 variables are inferred together with the learning of parameter 2, the
templates, by the shared sketch algorithm, so that parameter 2 memorizes the inference of hidden 1. We are not
relying on a generic bottom-up operation for inferring the hidden 1 variables before the learning of parameter 2.
- In the learned templates, the selected basis elements have little overlap. This means that the explaining-away
lateral inhibition is memorized and hardwired by the trimming of redundant connections. Then in the inference
stage for template matching, the perturbed basis elements on each image do not have much overlap, so there is
little explaining-away inhibition in inference.
- The resulting active basis and its hierarchical generalizations thus are sparsely connected. In active
basis model, each template node is selectively connected to a small number of "V1 complex cells". Given the
template nodes, the selected "V1 complex nodes" become disentangled: they are linearly orthogonal and statistically
independent, approximately, so inference becomes easy. The approximate conditional orthogonal independence may
be the reason for the existence of the hidden nodes at each layer. This is to be distinguished from PCA,
which has a fixed set of orthogonal independent nodes. In conditional orthogonal independence, different
template nodes are connected to different sets of "V1 complex nodes", and these different sets may overlap.
- Here approximate orthogonality means that the Gabor elements either have little overlap in spatial domain,
or if they do overlap in spatial domain, they have little overlap in frequency domain, so we can superpose
multi-scale templates in the same place.
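This claim is easy to check numerically with toy 1-D Gabor elements; the parameter values below are our own illustrative choices, not values from the papers.

```python
import numpy as np

def gabor(x, center, freq, sigma=2.0):
    """Unit-norm 1-D Gabor element: Gaussian envelope times a cosine wave."""
    g = np.exp(-(x - center) ** 2 / (2 * sigma ** 2)) * np.cos(freq * (x - center))
    return g / np.linalg.norm(g)

x = np.arange(0.0, 40.0, 0.1)
g0 = gabor(x, center=10, freq=2.0)
g_far = gabor(x, center=30, freq=2.0)  # little overlap in the spatial domain
g_hi = gabor(x, center=10, freq=6.0)   # overlaps in space, not in frequency

print(abs(np.dot(g0, g_far)))  # near zero: envelopes barely overlap
print(abs(np.dot(g0, g_hi)))   # near zero: frequency bands barely overlap
```

Both inner products are close to zero, so elements of either kind can coexist in a template with approximately orthogonal, independent coefficients.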
- As we move up the hierarchy, we may see increasingly more types of nodes at each layer,
but each node is connected to fewer nodes at the layer below.
Each layer consists of two sub-layers: one sub-layer of AND-nodes, and one sub-layer
of OR-nodes.
- The top-down generative model involves random selections or perturbations of geometric attributes such as
locations and orientations. This leads to the local max in the bottom-up pass and a back-tracing step in the
top-down pass to find the arg-max.
- The sum operation is applied to local max scores of selected locations and orientations, which are the
locations and orientations of the selected basis elements. These locations and orientations are selected in
the learning algorithm. These are supposed
to be the centers of local shifts, so that the local max scores on these locations and orientations are
strictly or maximally invariant to such shifts. Other local max scores are ignored for the lack of invariance.
- The max operation is more than an invariance pooling, and the role of the "V1 complex cells" is more than
being invariance pooling units. It is to infer the perturbations of the basis elements, which are
hidden variables. The max pooling is only half of the story and is a computation step in the max-product
algorithm for inference in a graphical model. The retrieval of the arg-max, i.e., the perturbations,
in the top-down pass is more fundamental, because it is fitting the generative model to the image data.
- In a hypothetical neural implementation, the "V1 complex cell" maintains its dynamic bond with the
"V1 simple cell" whose response achieves the local maximum among all the simple cells connected to this
complex cell. This bond, which is to record the arg-max and which might be maintained by synchronization,
is needed for back-tracing the actual perturbations.
- The max maps are not representations of the image data. They are highly redundant proposals for representing
the image data, and are a bottom-up step in the max-product inference algorithm in a graphical model.
The representation of the image is obtained after the top-down retrieval pass,
which sketches the object by arg-max using the selected and perturbed Gabor elements, and the arg-max
elements inhibit the other non-arg-max elements.
- The nodes are continuous (even though they may be implemented
with discretization), not binary or Bernoulli. The sigmoid saturation is justified by the ratio of two mixture distributions,
not by a logistic perceptron. Specifically, the log-likelihood ratio of the two mixture distributions
is linear in a saturated function of the filter responses. The local normalization of filter responses before saturation is justified by linear
whitening, to make the local spectrum flat over frequencies. The original local spectrum or the second-order statistics may not
contain much information about specific object shapes or surface texture properties. They may be caused by lighting and shading
variations or some physical properties that are common in the natural scenes. These are unwanted noises that are
irrelevant or even harmful to object recognition.
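A minimal sketch of these two preprocessing steps, assuming a 1-D response map; the saturation level `zeta = 6` and the window size are illustrative choices of ours, not prescribed values.

```python
import numpy as np

def local_normalize(energy, half_window=5, eps=1e-8):
    """Divide each squared filter response by the average energy in a local
    window: a crude linear whitening that flattens the local spectrum."""
    n = len(energy)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out[i] = energy[i] / (np.mean(energy[lo:hi]) + eps)
    return out

def saturate(r, zeta=6.0):
    """Sigmoid saturation: h(0) = 0, increasing, and h(r) -> zeta as r grows,
    mimicking the shape of a mixture log-likelihood ratio."""
    return zeta * (2.0 / (1.0 + np.exp(-2.0 * r / zeta)) - 1.0)

energy = np.array([1.0, 4.0, 1.0, 100.0, 1.0])
h = saturate(local_normalize(energy))  # large responses are capped below zeta
```

The normalization removes the overall local contrast, and the saturation bounds the contribution of any single response, so extreme responses caused by lighting or shading cannot dominate the template score.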
- The AND-OR graph seems to be both a localized representation and a distributed
representation. It is localized because each concept is represented by a single AND-node. It is
distributed because each AND-node is connected to a collection of OR-nodes.
Partial damage of the connections will not have a big effect in inference, and partial observations
of the OR-nodes may allow inference of the AND-node, which in turn allows the predictions of the
unobserved OR-nodes. A purely distributed representation or associative memory does not explicitly index
the local modes of the energy surface, and all these local
modes share the same connections. This may be too implicit.
- Each layer of AND-nodes forms a dictionary. In learning, the dictionary sizes are larger at the upper layers
than at the lower layers, but in inference, the numbers of activated nodes are smaller
at the upper layers than at the lower layers.
The following are some strategies:
- Learn exponential family model with bottom-up features in sufficient statistics.
- Assume independent prior for hidden nodes, and approximate their posterior
by independent distributions.
- Require the posterior of hidden nodes to be exactly independent
distribution, and let the prior be dictated by the posterior.
- Prohibit overlapping hidden nodes from connecting to the same higher node, so that
given the higher node, both the prior and posterior can be made independent.
We prefer the last strategy, which is to hardwire the explaining-away in learning.
Two issues with generative models that had puzzled us for a long time:
- Partition function. In our Markov random field texture model, the normalizing
constant (the Z-function or partition function) is not computable. So model fitting
has to rely on Markov chain Monte Carlo. While this has a nice "analysis by
synthesis" interpretation, it is just too slow (maybe not for today's GPUs). (from 1995 to 2000)
- Explaining away. In Olshausen-Field model, we need to represent each image
by a small number of wavelet elements, and they are selected from the
dictionary of all the wavelet elements, which compete with each other to explain
away each image. If we perform basis selection for each image, the
selection results cannot be easily used for further learning, because there is no
correspondence between the sparse representations of different images, and the
selection result in each image is an early decision that does not account for the
uncertainty in selection or the posterior distribution of the selection. The dilemma is that
you have to do sparsification or explaining away, but you cannot do it before
learning the next layer. (from 2000 to 2006)
These two obstacles have been circumvented in our current work as we explained above.
- Sparsity of Olshausen-Field.
- Compositionality of S. Geman; And-or graph of Zhu-Mumford.
- Invariance of Riesenhuber-Poggio.
Form of representation:
- B. A. Olshausen and D. J. Field, Emergence of simple-cell receptive field properties
by learning a sparse code for natural images, Nature, 381, 607-609, 1996.
- S. C. Zhu, C. E. Guo, Y. Z. Wang, and Z. J. Xu, What are
textons? International Journal of Computer Vision, 62,
121-143, 2005.
- S. C. Zhu and D. B. Mumford, A stochastic grammar of images, Foundations and
Trends in Computer Graphics and Vision, 2, 259-362, 2006.
Scheme of learning:
- S. Mallat and Z. Zhang, Matching pursuits with time-frequency dictionaries, IEEE
Transactions on Signal Processing, 41, 3397-3415, 1993.
- J. H. Friedman, Exploratory projection pursuit, Journal of the American Statistical
Association, 82, 249-266, 1987.
- Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line
learning and an application to boosting, Journal of Computer and System Sciences,
55, 119-139, 1997.
- P. A. Viola and M. J. Jones, Robust real-time face detection, International
Journal of Computer Vision, 57, 137-154, 2004.
- S. C. Zhu, Y. N. Wu, and D. B. Mumford, Minimax entropy principle and its application
to texture modeling, Neural Computation, 9, 1627-1660, 1997.
- S. Della Pietra, V. Della Pietra, and J. Lafferty, Inducing features of random fields,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380-393, 1997.
Architecture of inference:
- M. Riesenhuber and T. Poggio, Hierarchical models of object recognition in cortex,
Nature Neuroscience, 2, 1019-1025,
1999.
This work is a continuation of our long term search for generative models and
model-based algorithms, as well as our attempt to understand these models within
a common information-theoretical framework. The active basis model can be considered
a revision of our previous model on textons. It can also be viewed as an inhomogeneous
version of the Markov random field model that we previously developed for textures.
More importantly, the active basis model is the simplest instance of the and-or graph
in the compositional framework that we have been studying. The and-or grammar naturally
suggests further composing multiple active bases to represent more articulated shapes.
The architecture of the sum-max maps is a natural computational tool for parsing the
observed image according to the and-or grammar.
- Wu, Y. N., Guo, C., and Zhu, S. C. (2008) From information scaling to regimes of
statistical models. Quarterly of Applied Mathematics, 66, 81-122.
- Wu, Y. N., Si, Z., Fleming, C., and Zhu, S. C. (2007) Deformable template as
active basis. Proceedings of International Conference of Computer Vision.
- Wu, Y. N., Li, J., Liu, Z., and Zhu, S. C. (2007) Statistical principles in image
modeling. Technometrics, 49, 249-261.
- Guo, C., Zhu, S. C. and Wu, Y. N. (2007) Primal sketch: integrating structure
and texture. Computer Vision and Image Understanding, 106, 5-19.
- Zhu, S. C. and Mumford, D. (2006) A stochastic grammar of images. Foundations
and Trends in Computer Graphics and Vision, 2, 259-362.
- Zhu S. C., Guo C., Wang Y., and Xu Z. (2005) What are textons? International
Journal of Computer Vision, 62, 121-143.
- Zhu S. C. (2003) Statistical modeling and conceptualization of visual patterns.
IEEE Pattern Analysis and Machine Intelligence, 25, 691-712.
- Guo, C., Zhu, S. C., and Wu, Y. N. (2003) Towards a mathematical theory of primal
sketch and sketchability. Proceedings of International Conference of Computer Vision.
1228-1235.
- Guo, C., Zhu, S. C., and Wu, Y. N. (2003) Visual learning by integrating descriptive
and generative models. International Journal of Computer Vision. 53, 5-29.
- Zhu, S. C., Guo, C., Wu, Y. N. and Wang, Y. (2002) What are textons? Proceedings of
European Conference of Computer Vision, 793-807.
- Wu, Y. N., Zhu, S. C., and Guo, C. (2002) Statistical modeling of texture sketch.
Proceedings of European Conference of Computer Vision, 240-254.
- Wu, Y. N. and Zhu, S. C. (2001) Vision and the art of data augmentation. Discussion of a paper
by Meng and van Dyk. Journal of Computational and Graphical Statistics, 10, 90-93.
- Wu, Y. N., Zhu, S. C., and Liu, X. (2000) Equivalence of Julesz ensembles and
FRAME models. International Journal of Computer Vision, 38, 245-261.
- Zhu, S. C., Liu, X., and Wu, Y. N. (2000) Exploring texture ensembles by efficient
Markov chain Monte Carlo - towards a `trichromacy' theory of texture. IEEE Pattern
Analysis and Machine Intelligence, 22, 554-569.
- Zhu, S. C., Wu, Y. N., and Mumford, D. B. (1997) Minimax entropy principle and its
application to texture modeling. Neural Computation, 9, 1627-1660.
- Zhu, S. C. and Mumford, D. (1997) Prior learning and Gibbs reaction-diffusion.
IEEE Pattern Analysis and Machine Intelligence, 19, 1236-1250.
- Zhu, S. C., Wu, Y. N., and Mumford, D. B. (1998) Filter, Random field, And Maximum
Entropy (FRAME): towards a unified theory for texture modeling. International Journal of
Computer Vision, 27, 107-126.