Learning Compositional Models for Object Categories from Small Sample Sets
Car (side view)
Mp3 player
    Figure 1. Samples from four object categories during each stage of the relationship pursuit. Objects become more coherent as new relationships are added. With this analysis-by-synthesis methd, new configurations can be generated from a small traing set. (Also see Fig. 5 for more details)


Modeling object categories is a challenging task due to the many structural variations between instances of the same category. There have been many non-hierarchical approaches to modeling object categories, all with limited levels of success.Appearance based models, which represent objects primarily by their photometric properties, such as global PCA, KPCA, fragments, SIFTs, and patches, tend to disregard geometric information about the position of important keypoints within an object. Thus, they are not well-suited for recognition in scenarios where pose, occlusion, or part re-configuration are factors. Structure based models, which include information about relative or absolute positions of features, such as the constellation model and pictorial structures, are more powerful than appearance based approaches as they can model relationships between groups of parts and thus improve recognition accuracy, but are rarely hierarchical and, as such, cannot account for radical transformations of the part positions.

Many of the problems in previous approaches come from lack of a good definition for object category that captures what is invariant between instances of the same class. We define an object category as an equivalence class where the object parts and their relations to one another are the invariants that make instances in the same category equivalent. Bikes always have wheels, clocks always have hands, and these parts are always related similarly to one another. We can capture the variation in these commonalities through a constrained compositional model that defines the set of instances for each object category. We provide a way to generalize the minimax entropy framework to learn such a model for objects with dynamic structure. This model combines a SCFG to capture the variance and hierarchy of object parts with the relational constraints of an MRF. This framework accounts for the dynamic structure of the model, in which constraints may not always exist from instance to instance, by pursuing relations according to their frequency of occurrence. The MRF constraints that we add to the SCFG match two main statistics of the model:

  • The frequencies of part ocurrences
  • The statistics on the spatial layouts of parts.


  • And-Or Graph. The model we learn for object categories is a combination of a SCFG and an MRF, referred to as an “And-Or Graph”. This model is like a language for an object category where parts of the object are analogous to words and parts of speech in language, and relational constraints model the context between them. Figure 2 (a) shows an example of an And-Or graph. An object is created by starting at the root of the graph and expanding nodes until only terminals remain, as in a SCFG. Node expansions in this structure can be thought of as “And” nodes, where one node expands into multiple nodes and “Or” nodes, which can only choose one of their child nodes to expand into.
    Figure 2: (a) An example of an “And-Or Graph”. And nodes must decompose into all of their children, while Or nodes can only decompose into one. Constraints are shown as horizontal lines. (b)“Parse graphs”, which are each one walk of the model.(c)“Configurations”, which consist only of terminal nodes and any constraints they inherited.
  • Relationships. The horizontal line between A and B represents a relational constraint. Fig. 3 illustrates a dictionary of relationships between nodes. Each relationship can be thought of as a filter that operates on a set of nodes V , producing a response. These responses can be pooled over a number of parse graphs G to create histograms that can be used as node constraints. The type of relationship will likely differ based on whether it is defined for one, two, or more nodes. Fig. 3 visualizes a set of relationships that could be used to augment an And-Or graph.
    Figure. 3 Visualization of possible pairwise relationships to be learned in the And-Or graph.
  • Parse graph. We define a parse graph pg as an instance drawn from the model, which is analogous to a sentence diagram in natural language. Figure 2 (b) shows some parse graphs from our example model. The Or nodes are determined during a walk of the graph, fixing the parts, so only And nodes remain. Fig. 4 shows a parse graph for the bike category.
    Figure 4: A parse graph of a bike. Nodes are composed via certain relations to form the representation of the bike.

Learning and Estimation with the And-Or Graph

    Suppose we have a set of observed parse graphs that follow f, the true distribution governing the objects from a ground truth databased. The objective is to learn a model p, which approaches f by minimizing the Kullback-Leibler divergence KL(f||p). This is equivalent to the ML estimate for the optimal vocabulary, relation set and parameters.
    Learning the probability model of And-Or graph includes two phases, both of which follow the same principle.
  • 1. Estimating the parameters from training data for given relation set and vocabulary
  • 2. Learning and pursuing the relation set.
  • Maximum Likelihood Learning by Sampling. We follow the stochastic gradient method adopted in learning FRAME model (Zhu et al. 1997) to estimate model parameters. This is the method of analysis-by-synthesize adopted in texture modeling (See paper for more details).
  • Learning and Pursuing the Relation Set. In addition to learning the parameters, we can also augment the relation set in an And-Or Graph, thus pursuing the energy terms in the same way as filters and statistics were pursued in texture modeling by the minimax entropy principle (Zhu et al. 1997).
    Suppose we start with an empty relation set, creating a parse graph with just its SCFG componenent defined. We define a greedy pursuit where, at each step, we add a relation e+ selected from the relationship library as illustrated in Fig. 3 so as to maximally reduce KL-divengence between p+ and true distribution f.
    Figure 5 shows an example of the above discused learning by random synthesis method. Samples drawn from the model at each stage of the relationship pursuit. At each stage, a new relationship was added and the parameters were estimated using the MLE by sampling method. Figure 5 shows this process for the clock and bicycle categories. The initial samples are wild and unconstrained, while the objects appear more coherent as relationships are added.
    Figure 5: Samples from p during each stage of the relationship pursuit. Objects become more coherent as new relationships are added.

Experiments on learing and sampling

  • Learning By Random Synthesis.
  • Figure 6 shows samples from the learned model for 24 different object categories. We can see that the samples are perceptually equivalent to their category examples, even if slightly different on a part-by-part basis.
    Figure 6. Twenty-four object categories with high intra-class variability and their corresponding samples.
    Fig. 7 shows the And-Or graph that was learned for the bicycle category and the relationships that were added to pairs and single parts. We can almost see causal chains appearing in the relationships learned (the frame constrains the chain’s position and scale, which in turn constrains the pedal’s position and scale). Each part decomposes into one of a number of terminals. These terminals could be represented by an appearance model for each part, though in our case we used exemplars from the initial dataset, scaled, oriented, and positioned to create new object configurations.
    Figure 7: The learned And-Or graph for the bicycle category, showing the relationships added between pairs of nodes.

  • Predicting Missing Parts Using Learned Model.
  • The sampling process can be used to provide top-down proposals for inference. Fig. 8 shows the results of removing the wheel, frame, and handle bars from a perfect sketch of a bike and predicting the true part positions at each stage of the learning process. The missing parts of the structure are first reintroduced, and then their constraints are sampled for 100 iterations. The model with few constraints does not yield good results, while the full model predicts the part locations perfectly, as shown overlaid in the original image. The error is measured as the sum of the thin-plate-spline deformation energy and affine energy needed to move the predicted part to its true location, and is shown in the last column. This shows that the compositional model provides strong top-down cues for part positions and appearances, and can predict the presence of parts that are occluded, missing, or had weak bottom-up cues for recognition and detection tasks.
    Figure 8: Top-down prediction of missing parts at each stage of the relationship pursuit. A neighborhood of parts is fixed and the remaining parts are Gibbs sampled. The accuracy of the prediction is measured by the Thin-plate-spline + affine transformation needed to move the predicted part to its true position. We can see that this energy decreases drastically as we add more relations to the model.
  • Small Sample Set Generalization.
  • Due to the consistency of many of the perceptual similarities between objects in the same class, e.g. relative position, we can learn our model from a very small sample set. Fig. 9 shows samples drawn from the model learned from just 6 training instances. Despite there being so few training instances, their parts can be reconfigured and adjusted according to the model to produce radically different instances. Note that the part configurations and appearances in the samples differ greatly from those in the training set, yet the objects are still coherent. This is useful for recognition tasks, where new instances can be recognized despite not appearing in the training data. One can also generate large amounts of training data for discriminative tasks using this model, learned from a small, easily obtained set of images. Such an experiment was done comparing recognition results using two different datasets. The first was fully hand-collected, while the second consisted of hand-collected data and samples generated from our model. The latter classifier showed a 15% improvement over solely hand-collected data, likely because there were more varied data samples available in the second dataset. Further work is being done on this topic.
    Figure 9: Demonstration of the model’s generalizability. The model learned from only 6 training instances can produce the varied samples below.

Experiments on object recognition using the And-Or graph

    We apply our inference algorithm to five object categories – clock, bike, computer, cup/bowl, and teapot. Fig. 10 shows recognition results for the five categories. For each input image, the image on the right shows the recognized parts from the image in different colors. It should be mentioned that the recognition algorithm is distinct from most of the classification algorithms in the literature. It interprets the image as a parse graph, which includes the classification of categories and parts, matches the leaf templates to images, and hallucinates occluded parts.
    Figure 10: Recognition experiments on 5 object categories.

    Some recent work was done to show the accuracy of our method versus common recognition and classification techniques. This work uses our methodology for learning but includes additional experimental results for the inference algorithm. In this work, our model was trained on the categories of rear car and bicycle and these models were used to recognize objects in the Lotus Hill Database as well as the Caltech 101 dataset. Figure 11 shows the results of this experiment. Our model was compared against a HOG-based SVM (Dalal et al. 2006) and a Haar-feature-based Adaboost model (Viola and Jones 2001). One can see that our model performs much better than either of these two methods. For the Caltech 101 dataset, we trained our model and ran it against a Probabilistic Boosting Tree (Tu 2005) as well as a SIFT-based boosting method (Zhang et al. 2005), both of which performed worse than our grammar-based approach.
    Figure 11: ROC curves for recognizing bicycles (from the LHI dataset) and rear-cars (from the Caltech 101 dataset). Our model outperforms a HOG-based SVM and Adaboost on bicycles and PBT and SIFT-based boosting for rear-cars.

Related Publication:

Jake Porway, Benjamin Yao, and Song-chun Zhu, Learning compositional models for object categories from small sample sets, Book Chapter in Sven Dickinson et al (eds.) Object Categorization: Computer and Human Vision Perspectives, Cambridge University Press. 2009 [pdf].
Lin L., Wu T., Porway J, Xu Z., A stochasitic graph grammar for compositional object representation and recognition. Under review for Pattern Recognition

[Back to Benjamin's homepage]