International Journal of Computer Vision

Learning Sparse FRAME Models for Natural Image Patterns

Jianwen Xie 1   Wenze Hu 2    Song-Chun Zhu 1    Ying Nian Wu 1

1 University of California, Los Angeles (UCLA), USA     2 Google Inc, USA


Please also check out our more recent work on learning sparse FRAME models by Generative Boosting

Paper | Latex | Talk


Abstract

It is well known that natural images admit sparse representations by redundant dictionaries of basis functions such as Gabor-like wavelets. However, it is still an open question as to what the next layer of representational units above the layer of wavelets should be. We address this fundamental question by proposing a sparse FRAME (Filters, Random field, And Maximum Entropy) model for representing natural image patterns. Our sparse FRAME model is an inhomogeneous generalization of the original FRAME model. It is a non-stationary Markov random field model that reproduces the observed statistical properties of filter responses at a subset of selected locations, scales and orientations. Each sparse FRAME model is intended to represent an object pattern and can be considered a deformable template. The sparse FRAME model can be written as a shared sparse coding model, which motivates us to propose a two-stage algorithm for learning the model. The first stage selects the subset of wavelets from the dictionary by a shared matching pursuit algorithm. The second stage then estimates the parameters of the model given the selected wavelets. Our experiments show that the sparse FRAME models are capable of representing a wide variety of object patterns in natural images and that the learned models are useful for object classification.


1. Inhomogeneous FRAME

1.1 Model

We start from modeling roughly aligned images of object patterns from the same category, such as the images in each category of Fig 1. Let {Im, m = 1,...,M} be a set of training images. We use the notation Bx,s,α to denote a basis function such as a Gabor wavelet centered at pixel x and tuned to scale s and orientation α. We assume that s and α take values within a finite and properly discretized range. The inner product <I, Bx,s,α> can be considered the filter response of I at pixel x to a filter tuned to scale s and orientation α. All basis functions are normalized to have unit l2 norm. The inhomogeneous FRAME model is a probability distribution defined on the image I:

p(I; λ) = (1/Z(λ)) exp{ Σx,s,α λx,s,α |<I, Bx,s,α>| } q(I)        (1)

where λx,s,α is the coefficient of the absolute value of the filter response at (x, s, α), q(I) is a known reference distribution such as the Gaussian white noise model, and Z(λ) is the normalizing constant.
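As a concrete illustration, the following Python sketch computes the absolute filter responses |<I, Bx,s,α>| over a small bank of Gabor-like basis functions. The odd-phase Gabor construction and the parameter values are illustrative assumptions, not the exact filter bank used here.

```python
import numpy as np
from scipy.signal import correlate2d

def gabor_basis(size, scale, alpha):
    """Odd (sine-phase) Gabor-like basis function on a size x size patch,
    normalized to unit l2 norm; 'scale' controls the envelope and wavelength."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    xr = xs * np.cos(alpha) + ys * np.sin(alpha)      # rotate coordinates
    yr = -xs * np.sin(alpha) + ys * np.cos(alpha)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * scale ** 2))
    wave = np.sin(2.0 * np.pi * xr / (2.0 * scale))
    B = envelope * wave
    return B / np.linalg.norm(B)

def filter_responses(I, scales, n_orient, size=17):
    """Absolute filter responses |<I, B_{x,s,alpha}>| at every pixel x,
    obtained by correlating the image with each normalized basis function."""
    H = {}
    for s in scales:
        for k in range(n_orient):
            alpha = np.pi * k / n_orient
            B = gabor_basis(size, s, alpha)
            H[(s, k)] = np.abs(correlate2d(I, B, mode="same"))
    return H  # H[(s, k)][x] approximates |<I, B_{x,s,alpha}>|
```

For example, filter_responses(I, scales=(3, 5, 8), n_orient=16) would produce the statistics that enter model (1) for a 3-scale, 16-orientation bank.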

Fig 1. Training images. The inhomogeneous FRAME is a generative model that seeks to represent and generate the object patterns shown above. Two categories of training sets: 6 hummingbirds and 12 cats.

1.2 Learning

The FRAME model is a special case of the exponential family model, and the parameter λ = {λx,s,α, ∀x,s,α} can be estimated from the training images by MLE. The MLE can be obtained by the stochastic gradient algorithm analyzed by Younes (1999) [1]. Let λ(t) be the current estimate of λ, and let {Ĩm, m = 1,...,M̃} be a sample of synthesized images drawn from p(I; λ(t)). Then we can update λ by

λ(t+1)x,s,α = λ(t)x,s,α + γt [ (1/M) Σm=1..M |<Im, Bx,s,α>| − (1/M̃) Σm=1..M̃ |<Ĩm, Bx,s,α>| ]        (2)

where γt is the step size. The synthesized images {Ĩm, m = 1,...,M̃} can be drawn from p(I; λ(t)) by Hamiltonian Monte Carlo (HMC) [2]. The ratio of the normalizing constants

Z(λ(t+1)) / Z(λ(t)) = Eλ(t)[ exp{ Σx,s,α (λ(t+1)x,s,α − λ(t)x,s,α) |<I, Bx,s,α>| } ]        (3)

can be approximated by averaging over the sampled images {Ĩm}. Starting from λ(0) = 0 and log Z(λ(0)) = 0, we can compute log Z(λ(t)) along the learning process by iteratively updating its value as follows:

log Z(λ(t+1)) = log Z(λ(t)) + log [ Z(λ(t+1)) / Z(λ(t)) ]        (4)
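The following Python sketch outlines the stochastic gradient updates (2) together with the running estimate of log Z(λ) in (3)-(4). The HMC sampler is abstracted into a hypothetical sample_stats function supplied by the user; it is a sketch of the learning loop, not the paper's implementation.

```python
import numpy as np

def learn_dense_frame(stats_obs, sample_stats, n_steps=500, step_size=0.01):
    """Sketch of stochastic-gradient MLE for the dense FRAME model.
    stats_obs: observed statistics averaged over training images,
               (1/M) sum_m |<I_m, B_{x,s,alpha}>|, flattened into a vector.
    sample_stats(lam): assumed to run HMC chains under the current lambda and
               return (averaged synthesized statistics, per-image statistic
               matrix of shape [M_tilde, dim]) for the synthesized images."""
    lam = np.zeros_like(stats_obs)
    log_Z = 0.0
    for t in range(n_steps):
        stats_syn, per_image = sample_stats(lam)        # (1/M~) sum_m |<I~_m, B>|
        delta = step_size * (stats_obs - stats_syn)     # gradient step as in (2)
        # ratio Z(lam + delta) / Z(lam) approximated by a Monte Carlo average
        # over the synthesized images, as in (3); accumulated as in (4)
        log_Z += np.log(np.mean(np.exp(per_image @ delta)))
        lam = lam + delta
    return lam, log_Z
```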

Figure 2 illustrates the learning process by showing the synthesized images as λ is updated by the algorithm. The synthesized image starts from Gaussian white noise sampled from q(I) and gradually becomes similar to the observed images in overall shape and appearance.

Fig 2. Learning sequence by the inhomogeneous FRAME. The size of the images is 70 × 70. We work with grey-level images in this work. A separate model is learned from each training set shown in Fig 1 (Top: Hummingbird. Bottom: Cat). Synthesized images generated at steps t = 1, 4, 7, 10, 13, 20, 50, 100, 200, 300, 400, and 500.


2. Sparse FRAME

2.1 Model

In model (1), the sum over (x, s, α) ranges over all the pixels x and all the scales s and orientations α. We call such a model the dense FRAME. It is possible to sparsify the model by selecting only a small subset of (x, s, α) so that the sum is restricted to this selected subset, provided that the selected subset of basis functions is sufficient to interpret or reconstruct the training images. More explicitly, we can write the sparsified model as

Deformable shared sparse coding:

Im = Σi=1..n cm,i Bxi+Δxm,i, si, αi+Δαm,i + εm        (5)

Sparsified inhomogeneous FRAME:

p(I; B, λ) = (1/Z(λ)) exp{ Σi=1..n λi |<I, Bxi,si,αi>| } q(I)        (6)

where n is the number of selected basis functions, λ = {λi, i=1,...,n} are the coefficients, cm,i are the reconstruction coefficients, and {Bxi,si,αi, i=1,...,n} is the selected set of basis functions shared by the training images. We may allow these shared basis functions to perturb their locations and orientations to account for shape deformations. (Δxm,i, Δαm,i) are the perturbations of the location and orientation of the i-th basis function in the m-th image. Both Δxm,i and Δαm,i are assumed to vary within limited ranges.

2.2 Learning

The learning algorithm consists of two stages. (1) Selecting B = {Bxi,si,αi, i=1,...,n} by deformable shared sparse coding. (2) Estimating λ = {λi, i=1,...,n} given the selected B.

Stage 1: We want to select the basis functions B = {Bxi,si,αi, i=1,...,n} by minimizing the sum of reconstruction errors over all training images:

min over B, {cm,i}, {(Δxm,i, Δαm,i)} of Σm=1..M || Im − Σi=1..n cm,i Bxi+Δxm,i, si, αi+Δαm,i ||²        (7)

We can extend the matching pursuit algorithm [3] to select basis functions to code multiple images simultaneously, while inferring local perturbations by local max pooling [4]:

Deformable shared matching pursuit algorithm
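A minimal Python sketch of this selection step is given below. The layout of the basis dictionary, the form of the allowed perturbation sets, and the exhaustive search over the dictionary are simplifying assumptions made for illustration.

```python
import numpy as np

def shared_matching_pursuit(images, basis, perturbations, n_select):
    """Sketch of deformable shared matching pursuit.
    images: list of 2D arrays (residuals are updated as wavelets are selected).
    basis: dict mapping a key (x, s, alpha) to a basis image of unit l2 norm,
           the same size as the images, with the wavelet placed at location x.
    perturbations: function (x, s, alpha) -> list of keys in `basis` giving
           the allowed perturbed versions (x + dx, s, alpha + da).
    Returns the selected keys and, per selection, the per-image
    (perturbed key, coefficient) pairs."""
    residuals = [I.astype(float).copy() for I in images]
    selected, coefficients = [], []
    for _ in range(n_select):
        # pick the basis function that best explains all residuals, allowing
        # each image its own local perturbation (local max pooling)
        def gain(key):
            return sum(max(np.sum(R * basis[p]) ** 2 for p in perturbations(*key))
                       for R in residuals)
        best = max(basis.keys(), key=gain)
        selected.append(best)
        per_image = []
        for m, R in enumerate(residuals):
            p = max(perturbations(*best), key=lambda q: abs(np.sum(R * basis[q])))
            c = np.sum(R * basis[p])            # least-squares coefficient
            residuals[m] = R - c * basis[p]     # explain away this response
            per_image.append((p, c))
        coefficients.append(per_image)
    return selected, coefficients
```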

Stage 2: After selecting B = {Bxi,si,αi, i=1,...,n}, we can then model {Im} by the sparse FRAME model (6) by estimating λ via MLE. p(I; B, λ) in (6) now serves as a deformable template, in that the log-likelihood ratio of Im is

L(Im | B, λ) = Σi=1..n λi maxΔx,Δα |<Im, Bxi+Δx, si, αi+Δα>| − log Z(λ)        (8)

which serves as the template matching score, where we allow each selected Bi to perturb its location and orientation in view of (5), with the perturbations inferred by local max pooling. Again, let λ(t) be the current estimate of λ, and let {Ĩm, m = 1,...,M̃} be the synthesized images drawn from p(I; λ(t)) by parallel chains. Then we can update λ by

λ(t+1)i = λ(t)i + γt [ (1/M) Σm=1..M maxΔx,Δα |<Im, Bxi+Δx, si, αi+Δα>| − (1/M̃) Σm=1..M̃ |<Ĩm, Bxi,si,αi>| ]        (9)

The learned p(I; λ(t)) models the appearance of the undeformed template, so there is an explicit separation between appearance and shape variations. After we learn λ and compute Z(λ) as in (4), we can use the learned model as a deformable template to be matched to a testing image, where the template matching score at each location can be computed as in (8).

Fig 4. Synthesis. Synthesized images generated by the learned sparse FRAME model. With HMC and warm start, they are produced by M̃ = 36 parallel chains. The images are 100 × 100. The number of selected wavelets is 300.
Fig 3. Reconstruction. The images are 100 × 100. The number of selected wavelets is 300. The first row shows symbolic sketches of selected Gabor wavelets at different scales, where each selected Gabor wavelet is symbolized by a bar. The first 4 sketches correspond to 4 different scales. The last one is the superposition of the 4 scales, where smaller scales appear darker. The remaining rows display five examples of the observed images, the deformed sketches, the reconstructed images, and the residual images.


3. Synthesis by FRAME

Figure 5 and Figure 6 display images generated by the dense models and the sparse models, respectively, learned from roughly aligned training images. All the synthesized images are generated by the HMC algorithm during the learning process.

Fig 5. Dense model. Images generated by the learned dense FRAME models. The typical image size is 70 × 70. We run a single chain in the learning process. The typical number of training images for each category is around 10.
Fig 6. Sparse model. Images generated by the learned sparse FRAME models. The typical image size is 80 × 80. The number of selected wavelets is 300.


4. Geometric Transformation

We can obtain geometrically transformed versions of the learned model by directly applying the operations of dilation, rotation, flipping, or even changing the aspect ratio to B = {Bxi,si,αi, i=1,...,n}, without changing the values of λ. This amounts to simple affine transformations of (xi, si, αi), i=1,...,n. Figure 7 shows some examples of geometric transformations of the sparse FRAME model: flipping and rotation.
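The following Python sketch illustrates such a transformation applied to the list of selected wavelets; the choice of center point and the modulo-π treatment of orientations are illustrative assumptions.

```python
import numpy as np

def transform_template(wavelets, center, rotate=0.0, flip=False, scale=1.0):
    """Sketch of geometrically transforming a sparse FRAME template.
    wavelets: list of (x, y, s, alpha) for the selected basis functions.
    The lambda coefficients are kept unchanged; only the geometry is moved."""
    cx, cy = center
    out = []
    for (x, y, s, alpha) in wavelets:
        dx, dy = x - cx, y - cy
        if flip:                                  # left/right flip about the center
            dx, alpha = -dx, np.pi - alpha
        ca, sa = np.cos(rotate), np.sin(rotate)   # rotate the position around center
        dx, dy = ca * dx - sa * dy, sa * dx + ca * dy
        out.append((cx + scale * dx, cy + scale * dy,
                    s * scale, (alpha + rotate) % np.pi))
    return out
```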

Fig 7. Geometric transformation. Each case shows the synthesized images and the symbolic sketch templates of the transformed models, which are derived from the learned model. Left and middle: rotation over 360 degrees. Right: left/right flipping.


5. Detection

The learned model can be used to detect the object in a testing image by deformable template matching. We scan the template over the image domain D, and at each location X ∈ D we match the template to the image patch within the bounding box centered at X by computing the template matching score based on (8). We then choose the location X that achieves the maximum template matching score as the center of the detected object. We also allow geometric transformations of the template in detection, including rotation, scaling and left/right flipping. Figure 8 shows examples of detection.
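A minimal Python sketch of this scanning procedure is shown below, assuming a score_fn that evaluates the template matching score (8) on an image patch; the stride and the single-scale scan are simplifications.

```python
import numpy as np

def detect(score_fn, image, template_size, stride=4):
    """Sketch of detection by scanning: slide the template over the image and
    keep the location with the highest matching score.
    score_fn(patch) is assumed to return the log-likelihood ratio score (8)
    of the learned template on a patch of size template_size."""
    h, w = template_size
    H, W = image.shape
    best_score, best_xy = -np.inf, None
    for y in range(0, H - h + 1, stride):
        for x in range(0, W - w + 1, stride):
            s = score_fn(image[y:y + h, x:x + w])
            if s > best_score:
                best_score, best_xy = s, (x, y)
    return best_xy, best_score
```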

Fig 8. Detection by template matching. Left top: a synthesized image generated by the model trained on 8 roughly aligned images with 250 wavelets. (The lack of regional contrasts in the synthesized image is due to the fact that no DoG wavelets are used). Left bottom: a model template represented by symbolic sketches of selected Gabor wavelets. Right: testing images where the objects are located by the bounding boxes.


6. Clustering

Model-based clustering can be accomplished by an EM-type algorithm that fits mixtures of sparse FRAME models. Figure 9 and Figure 10 illustrate 5 experiments. The EM-type algorithm usually converges within 3-5 iterations. For each cluster, we generate 144 parallel chains in learning.
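The following Python sketch outlines the alternating structure of such an EM-type algorithm, assuming hypothetical learn_template and score functions that wrap the two-stage learning and the matching score (8); hard assignments are used and empty clusters are not handled in this sketch.

```python
def cluster_em(images, init_labels, n_clusters, learn_template, score, n_iter=5):
    """Sketch of EM-type fitting of a mixture of sparse FRAME models.
    learn_template(images) -> template and score(template, image) -> matching
    score are assumed to wrap the two-stage learning and the score in (8)."""
    labels = list(init_labels)
    for _ in range(n_iter):
        # M-step: re-learn one template per cluster from its current members
        templates = [learn_template([im for im, l in zip(images, labels) if l == k])
                     for k in range(n_clusters)]
        # E-step (hard assignment): each image goes to its best-scoring template
        labels = [max(range(n_clusters), key=lambda k: score(templates[k], im))
                  for im in images]
    return labels, templates
```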

Fig 9. Clustering. First row: synthesized images from a mixture of sparse FRAME models learned from 45 training images. Last three rows: three groups of training images are separated by the mixture models. Number of wavelets in each model is 300.

To evaluate the clustering quality, we use two metrics: conditional purity and conditional entropy [5]. For a random training image, let x be its true category (unknown to the algorithm) and y be the inferred category. The conditional purity is defined as

Σy p(y) maxx p(x|y),

and the conditional entropy is defined as

Σy p(y) Σx p(x|y) log (1 / p(x|y)),

where both p(y) and p(x|y) can be estimated from the training set. A better clustering algorithm should give higher purity and lower entropy. We perform 7 clustering experiments. The number of clusters varies from 2 to 5 and is assumed known in these experiments. The number of images in each cluster is typically 15, except in one experiment. We compare the performance of the sparse FRAME with that of k-means based on HoG features. Table 1 displays the clustering accuracies and standard errors based on 10 repetitions of each experiment.
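For reference, the two metrics can be computed from the true and inferred labels as in the following Python sketch.

```python
import numpy as np
from collections import Counter

def purity_and_entropy(true_labels, cluster_labels):
    """Sketch: conditional purity and conditional entropy of a clustering."""
    n = len(true_labels)
    purity, entropy = 0.0, 0.0
    for y, size in Counter(cluster_labels).items():
        p_y = size / n                                     # p(y)
        members = [x for x, c in zip(true_labels, cluster_labels) if c == y]
        p_x_given_y = np.array(list(Counter(members).values())) / size
        purity += p_y * p_x_given_y.max()                  # p(y) max_x p(x|y)
        entropy += p_y * float(-np.sum(p_x_given_y * np.log(p_x_given_y)))
    return purity, entropy
```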

Fig 10. Clustering. Each row illustrates one clustering experiment by displaying a synthesized image (100 × 100) and a training example from each cluster. The typical number of wavelets for each template is 300. The number of images within each cluster is around 15 to 20.

Table 1. Comparison of conditional purity (the first two rows) and conditional entropy (the last two rows) between sparse FRAME and k-means for clustering

 
                     Exp1           Exp2           Exp3           Exp4           Exp5           Exp6           Exp7
k-means (purity)     0.623 ± 0.016  0.870 ± 0.043  0.933 ± 0.141  0.825 ± 0.121  0.911 ± 0.086  0.888 ± 0.091  0.687 ± 0.110
FRAME (purity)       0.943 ± 0.063  0.990 ± 0.016  0.938 ± 0.131  0.895 ± 0.132  1.000 ± 0.000  0.879 ± 0.141  0.741 ± 0.111
k-means (entropy)    0.652 ± 0.009  0.376 ± 0.086  0.092 ± 0.195  0.243 ± 0.167  0.226 ± 0.084  0.199 ± 0.126  0.639 ± 0.161
FRAME (entropy)      0.145 ± 0.157  0.037 ± 0.060  0.090 ± 0.191  0.155 ± 0.189  0.000 ± 0.000  0.179 ± 0.208  0.497 ± 0.192


7. Unsupervised learning of codebooks

We can learn a codebook of sparse FRAME models from non-aligned images without annotation by adopting the method of [6]. The learning algorithm iterates the following two steps: (1) Image encoding: given the current codebook, encode the training images by spatially translated, rotated, and scaled versions of the models (templates) in the codebook. (2) Codebook re-learning: re-learn each model in the codebook from the image patches currently covered by this template. Figure 11 and Figure 12 illustrate experiments of codebook learning. For the experiments in Figure 12, we select a small number of wavelets of a single scale, so the synthesized images mainly capture the edge patterns.
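The following Python sketch outlines this alternating scheme, with hypothetical encode and relearn functions standing in for the template matching and the two-stage sparse FRAME learning.

```python
def learn_codebook(images, templates, encode, relearn, n_iter=5):
    """Sketch of the alternating codebook learning scheme.
    encode(templates, image) is assumed to return pairs (template index, image
    patch covered by that template, with translation/rotation/scale inferred);
    relearn(patches) re-estimates one sparse FRAME template from its patches."""
    for _ in range(n_iter):
        # Step 1: image encoding with the current codebook
        covered = [[] for _ in templates]
        for im in images:
            for t_idx, patch in encode(templates, im):
                covered[t_idx].append(patch)
        # Step 2: codebook re-learning from the currently covered patches
        templates = [relearn(patches) if patches else templates[t]
                     for t, patches in enumerate(covered)]
    return templates
```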

Fig 11. Codebook learning (interpretation by reconstruction). The first row displays the synthesized images (100 × 100) from the mixture models. The second row displays training images and the third row shows their reconstructions by the mixture models.

Fig 12. Codebook learning (interpretation by sketching). In each experiment, synthesized images (100 × 100) from the models of the learned codebook are displayed together with the training images and their sketches by the learned models, where each Gabor wavelet is illustrated by a bar. Lotus experiment: each model has 30 wavelets, learned from 7 images. Cat experiment: each model has 40 wavelets, learned from 20 images.


8. Object Classification Using Learned Codebooks

8.1 "Bag-of-word" Representation by A Learned Codebook

The learned codebook of sparse FRAME models can serve as the "words" in the "bag-of-words" method for object classification. Suppose we have a codebook of T models {B(t), t=1,...,T} learned from training images. For each image Im, we denote by R(t)m(X,S,A) = L(Im | B(t)X,S,A) the log-likelihood score of B(t) at location X, scale S, and orientation A, where both S and A are assumed to take values within a finite and properly discretized range. Let r(t)m(A) = max(maxX,S R(t)m(X,S,A), 0) be the maximum score at orientation A. Then each image Im can be represented by a T×NA-dimensional feature vector (r(t)m(A), t=1,...,T, ∀A), where NA is the number of possible values A can take. After extracting features, we can use any discriminative method to train classifiers (e.g., linear logistic regression or SVM [7]) on such feature vectors for object classification. Spatial pyramid matching (SPM) [8] can also be used to further boost the classification performance.
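The following Python sketch assembles such a feature vector from precomputed score maps; the layout of score_maps is an assumption made for illustration.

```python
import numpy as np

def bow_features(score_maps, n_orient):
    """Sketch: build the T x NA 'bag-of-words' feature vector.
    score_maps[t][a] is assumed to be an array of log-likelihood scores
    R(t)m(X, S, A) of template t at orientation index a, over all X and S."""
    feats = []
    for t_maps in score_maps:                        # loop over the T templates
        for a in range(n_orient):                    # loop over orientations A
            r = max(float(np.max(t_maps[a])), 0.0)   # r(t)m(A) = max(max score, 0)
            feats.append(r)
    return np.array(feats)                           # length T * NA
```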

8.2 Binary Classification

We evaluate the "bag-of-word" representation extracted by a codebook of sparse FRAME models on a binary classification task. We test it on a collection of 16 categories from Caltech-101 [9], all 5 categories from ETHZ Shape [10] and all 3 categories from Graz-02 [11] datasets. The task is to separate each category from a negative background category. For each category, we learn a codebook of T=10 sparse FRAME templates. Each template is of the size 100 × 100 and has n = 40 wavelets. We set scale S∈{0.8, 1, 1.2} and orientation A∈{±1, 0} × π/16. Binary classification is done with linear logistic regression with regularization by l2 norm. We compare our results with those obtained by SIFT features and SVM classifier, where SIFT features are quantized into "words" by K-means clustering (K=50, 100, 500) and fed into linear or kernel SVM. The best result among these six combinations (3 numbers of words × 2 types of SVM) is then reported. Table 2 shows the comparison results of the binary classification experiments.

Table 2. Classification accuracies (%) on binary classification tasks for 24 categories from Caltech-101, ETHZ Shape and Graz-02 datasets

Datasets       SIFT+SVM      Our method    | Datasets         SIFT+SVM      Our method
Watch          90.1 ± 1.0    89.1 ± 1.6    | Sunflower        76.0 ± 2.5    89.6 ± 3.7
Laptop         73.5 ± 5.3    89.8 ± 2.7    | Chair            62.5 ± 5.0    82.9 ± 4.7
Piano          84.5 ± 4.2    93.8 ± 2.6    | Lamp             61.5 ± 4.5    86.6 ± 4.3
Ketch          82.2 ± 0.8    83.3 ± 6.5    | Dragonfly        66.0 ± 4.0    89.9 ± 5.7
Motorbike      93.9 ± 1.2    92.2 ± 2.9    | Umbrella         73.4 ± 4.4    90.0 ± 0.7
Guitar         70.0 ± 2.4    77.3 ± 6.3    | Cellphone        68.7 ± 5.1    95.7 ± 1.8
Schooner       64.3 ± 2.2    87.7 ± 2.8    | Face             91.8 ± 2.3    94.4 ± 2.3
Ibis           67.8 ± 6.0    85.3 ± 2.7    | Starfish         73.1 ± 6.7    90.0 ± 2.3
ETHZ-Bottle    68.6 ± 3.2    77.5 ± 5.6    | ETHZ-Cup         66.0 ± 3.3    62.5 ± 3.0
ETHZ-Swans     64.2 ± 1.5    74.0 ± 7.5    | ETHZ-Giraffes    61.5 ± 6.4    73.3 ± 4.8
ETHZ-Apple     55.0 ± 1.8    65.8 ± 6.1    | Graz02-Person    70.4 ± 1.2    68.2 ± 3.8
Graz02-Car     64.0 ± 6.7    59.6 ± 5.5    | Graz02-Bike      68.5 ± 2.8    71.3 ± 5.1

8.3 Multi-class Classification

Our second set of experiments is on the LHI-Animal-Faces dataset [12], which consists of around 2200 images of 20 categories of animal or human faces. We randomly select half of the images per class for training and use the rest for testing. We learn a codebook of 10 sparse FRAME models for each category in an unsupervised way. We then combine the codebooks of all the categories (in total 20 × 10 = 200 codewords). The maps of the template matching scores from the models in the combined codebook are computed for each image and fed into SPM, which equally divides an image into 1, 4, and 16 areas and concatenates the maximum scores from the different image areas into a feature vector. We use a multi-class SVM to train image classifiers on the feature vectors, and then evaluate the classification accuracies of these classifiers on the testing data using the one-versus-all rule. Table 3 lists four published results [12] on this dataset obtained by other methods: (a) HoG features trained with SVM, (b) the Hybrid Image Template (HIT) [12], (c) multiple transformation-invariant HITs (Mixture of HIT) [12], and (d) part-based HoG features trained with latent SVM [13]. Our method outperforms the other methods in terms of classification accuracy on this dataset.
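The SPM pooling used here can be sketched as follows; the three pyramid levels (1, 4, and 16 cells) follow the description above, while the rest is illustrative.

```python
import numpy as np

def spm_features(score_map, levels=(1, 2, 4)):
    """Sketch of spatial pyramid max pooling of one template's score map:
    the map is divided into 1, 4, and 16 cells and the maximum score in each
    cell is concatenated into the feature vector."""
    H, W = score_map.shape
    feats = []
    for g in levels:                                 # grid of g x g cells
        for i in range(g):
            for j in range(g):
                cell = score_map[i * H // g:(i + 1) * H // g,
                                 j * W // g:(j + 1) * W // g]
                feats.append(float(cell.max()))
    return np.array(feats)                           # 1 + 4 + 16 values per template
```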

Table 3. Classification accuracies (%) on the animal faces dataset.

HoG+SVM    HIT     Mixture of HIT    Part-based LSVM    Our method
70.8       71.6    75.6              77.6               79.4

8.4 Domain Transfer

Classifiers learned in one domain (the source domain) may perform poorly in other domains (the target domains), because the training and testing data may not come from the same distribution. Learning domain-invariant feature representations can alleviate this problem. In this experiment, we test our proposed representation on the task of domain transfer across four domain datasets and compare with published results [14] [15] [16] [17] [18] [19] [20]. The four datasets are Amazon, Webcam, DSLR, and Caltech-256 [21]. Each dataset is regarded as a domain.

Table 4. Recognition accuracies (%) on target domains  (C: Caltech, A:Amazon, W:Webcam, and D:DSLR)

Method        C→A           C→D           A→C           A→W           W→C           W→A           D→A           D→W
Metric [14]   33.7 ± 0.8    35.0 ± 1.1    27.3 ± 0.7    36.0 ± 1.0    21.7 ± 0.5    32.3 ± 0.8    30.3 ± 0.8    55.6 ± 0.7
SGF [15]      40.2 ± 0.7    36.6 ± 0.8    37.7 ± 0.5    37.9 ± 0.7    29.2 ± 0.7    38.2 ± 0.6    39.2 ± 0.7    69.5 ± 0.9
GFK [16]      46.1 ± 0.6    55.0 ± 0.9    39.6 ± 0.4    56.9 ± 1.0    32.8 ± 0.7    46.2 ± 0.7    46.2 ± 0.6    80.2 ± 0.4
FDDL [17]     39.3 ± 2.9    55.0 ± 2.8    24.3 ± 2.2    50.4 ± 3.5    22.9 ± 2.6    41.1 ± 2.6    36.7 ± 2.5    65.9 ± 4.9
MMDT [18]     49.4 ± 0.8    56.5 ± 0.9    36.4 ± 0.8    64.6 ± 1.2    32.2 ± 0.8    47.7 ± 0.9    46.9 ± 1.0    74.1 ± 0.8
SDDL [19]     49.5 ± 2.6    76.7 ± 3.9    27.4 ± 2.4    72.0 ± 4.8    29.7 ± 1.9    49.4 ± 2.1    48.9 ± 3.8    72.6 ± 2.1
Our method    62.2 ± 1.6    52.2 ± 4.0    46.7 ± 2.5    53.2 ± 4.9    39.1 ± 3.0    53.2 ± 4.4    55.3 ± 2.9    72.4 ± 3.1

For the experiment with single-source training, the 10 classes common to all four datasets are used: backpack, touring-bike, calculator, head-phones, computer-keyboard, laptop-101, computer-monitor, computer-mouse, coffee-mug, and video-projector. For the experiment with multiple-source training, all 31 classes in Amazon, Webcam, and DSLR are used. We follow the evaluation protocol in [16].

Table 5. Performance comparison on multiple source domain transfer

Source            Target    SGF [15]    RDALR [20]    FDDL [17]    Our method
DSLR, Amazon      Webcam    52 ± 2.5    36.9 ± 1.1    41.0 ± 2.4   52.2 ± 1.4
Amazon, Webcam    DSLR      39 ± 1.1    31.2 ± 1.3    38.4 ± 3.4   54.5 ± 3.3
Webcam, DSLR      Amazon    28 ± 0.8    20.9 ± 0.9    19.0 ± 1.2   32.1 ± 1.6

We randomly sample labeled data in the source domain as training examples and unlabeled data in the target domain as testing examples. We learn a combined codebook (by learning a codebook of 3 templates with n = 40 wavelets for each category and combining them together), then use it to extract feature vectors and train classifiers with a multi-class SVM, using the same scheme as in Section 8.3. We evaluate the classification accuracies of these classifiers on the testing domain. For each pair of source and target domains, we report the averaged accuracies on the target domain as well as the standard errors. Table 4 and Table 5 show the comparisons of recognition accuracies on the target domains for single-source and multiple-source training. Our method performs significantly better than previous methods on 8 out of 11 sub-tasks, even though we do not make use of any domain adaptation techniques. This suggests that the learned codebooks of models capture intrinsically meaningful patterns.



Conclusion

The sparse FRAME model has the following properties.

(1) It can reconstruct the training images.

(2) It can synthesize new images.

(3) It separates shape deformations and appearance variations.

(4) It gives interpretable sketches.

(5) Dictionaries or codebooks of models can be learned in an unsupervised manner.

(6) It combines rich traditions of harmonic analysis and Markov random field models.



References

[1] L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports, 65, 177-228, 1999.
[2] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2011.
[3] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41, 3397-3415, 1993.
[4] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025, 1999.
[5] T. Tuytelaars, C. H. Lampert, M. B. Blaschko, and W. Buntine. Unsupervised object discovery: a comparison. IJCV, 2009.
[6] Y. Hong, Z. Si, W. Hu, S.-C. Zhu, and Y. N. Wu, Unsupervised learning of compositional sparse code for natural image representation. Quarterly of Applied Mathematics, in press, 2013.
[7] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.
[8] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. CVPR, 2006.
[9] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR Workshop, 2004.
[10] V. Ferrari, F. Jurie, and C. Schmid. From images to shape models for object detection. IJCV, 87, 284-303, 2010.
[11] M. Marszalek and C. Schmid. Accurate object localization with shape masks. CVPR, 2007.
[12] Z. Si and S. C. Zhu. Learning Hybrid image Template (HIT) by Information Projection. IEEE Trans. PAMI, 34, 1354–1367, 2012.
[13] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. PAMI, 32, 1627–1645, 2010.
[14] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. ECCV, 213-226, 2010.
[15] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: an unsupervised approach. ICCV, 999-1006, 2011.
[16] B. Gong, Y. Shi, F. Sha and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. CVPR, 2066–2073, 2012.
[17] M. Yang, L. Zhang, X. Feng, and D. Zhang. Fisher discrimination dictionary learning for sparse representation. ICCV, 543-550, 2011.
[18] J. Hoffman, E. Rodner, J. Donahue, K. Saenko, and T. Darrell. Efficient learning of domain-invariant image representations. ICLR, 2013.
[19] S. Shekhar, V. M. Patel, H. V. Nguyen, and R. Chellappa. Generalized domain adaptive dictionaries. CVPR, 2013.
[20] I.-H. Jhuo, D. Liu, D. T. Lee, and S. Chang. Robust visual domain adaptation with low-rank reconstruction. CVPR, 2168-2175, 2012.
[21] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, Caltech, 2007.


This work is supported by NSF DMS 1310391, ONR MURI N00014-10-1-0933, and DARPA MSEE FA8650-11-1-7149.