Learning Multi-grid Generative ConvNets by Minimal Contrastive Divergence

Ruiqi Gao, Yang Lu, Junpei Zhou*, Song-Chun Zhu, and Ying Nian Wu

Equal contributions

University of California, Los Angeles (UCLA), USA

* College of Computer Science and Technology, Zhejiang University, China


This paper proposes a minimal contrastive divergence method for learning energy-based generative ConvNet models of images at multiple scales, or grids, simultaneously. Each model is an energy-based probabilistic model whose energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). Learning such a model requires MCMC sampling of synthesized examples from the current model. In our learning algorithm, within each learning iteration and for each observed training image, we generate synthesized images at multiple grids by initializing the finite-step MCMC sampling from the minimal 1 × 1 version of the training image; the synthesized image at each grid is obtained by finite-step MCMC sampling initialized from the synthesized image at the previous, coarser grid. After obtaining the synthesized examples, the parameters of the models at the multiple grids are updated separately and simultaneously based on the differences between the synthesized examples and the observed training examples. We call this learning method multi-grid minimal contrastive divergence. We show that this method can learn realistic energy-based generative ConvNet models, and that it outperforms the original contrastive divergence (CD), which initializes the MCMC sampling from the observed images, as well as persistent CD.
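The learning procedure can be summarized by the following PyTorch-style sketch. It is a minimal illustration under stated assumptions, not the released code: the function names, the treatment of the ConvNet output as an energy, and the hyper-parameter values are placeholders, and the langevin helper is the finite-step sampler sketched in the algorithm section below.

    import torch.nn.functional as F

    def multigrid_cd_step(models, optimizers, images,
                          grid_sizes=(4, 16, 64), mcmc_steps=30):
        """One learning iteration of multi-grid minimal CD (illustrative sketch).

        models[k]     : energy-based ConvNet for grid k (its output is read as an energy).
        optimizers[k] : optimizer for the parameters of models[k].
        images        : a batch of observed training images at the finest grid.
        """
        # Initialize the MCMC chains from the minimal 1 x 1 version of each image.
        synth = F.adaptive_avg_pool2d(images, 1)

        for k, size in enumerate(grid_sizes):
            # Observed images down-scaled to the current grid.
            obs_k = F.adaptive_avg_pool2d(images, size)

            # Up-scale the synthesized images from the previous, coarser grid and
            # refine them by finite-step Langevin sampling under the current model.
            synth = F.interpolate(synth, size=size, mode='bilinear',
                                  align_corners=False)
            synth = langevin(models[k], synth, n_steps=mcmc_steps)

            # Update model k from the difference between observed and synthesized
            # examples: lower the energy of the observed images, raise it on the
            # synthesized ones.
            loss = models[k](obs_k).mean() - models[k](synth).mean()
            optimizers[k].zero_grad()
            loss.backward()
            optimizers[k].step()

        return synth  # synthesized images at the finest grid

Each grid has its own model and optimizer; the grids are coupled only through the synthesized images, which are passed upward as initializations for the MCMC sampling at the next grid.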


The paper can be downloaded here.


The code for all experiments can be downloaded here.

Multi-grid CD Algorithm

Figure 1. Synthesized images at multiple grids. From left to right: 4 × 4 grid, 16 × 16 grid, and 64 × 64 grid. The synthesized image at each grid is obtained by 30-step Langevin sampling initialized from the synthesized image at the previous, coarser grid, starting from the 1 × 1 grid.
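The 30-step Langevin sampling can be sketched as follows, assuming the model returns the energy of each image in a batch; the step size and noise scale here are illustrative placeholders, not the settings used in the paper.

    import torch

    def langevin(model, x, n_steps=30, step_size=0.002, noise_scale=0.01):
        """Finite-step Langevin sampling for an energy-based model (sketch).

        model : ConvNet whose output is read as the energy of each image.
        x     : initial synthesized images, e.g. up-scaled from the coarser grid.
        Returns a detached tensor of refined synthesized images.
        """
        x = x.clone().detach().requires_grad_(True)
        for _ in range(n_steps):
            energy = model(x).sum()
            grad = torch.autograd.grad(energy, x)[0]
            # Gradient step on the energy plus Gaussian diffusion noise.
            x = x - 0.5 * step_size * grad + noise_scale * torch.randn_like(x)
            x = x.detach().requires_grad_(True)
        return x.detach()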



Exp 1: Synthesis and diagnosis
Exp 2: Learning feature maps for classification
Exp 3: Image inpainting

Experiment 1: Synthesis and diagnosis

We learn multi-grid models from CelebA (Liu et al., 2015) and MIT places205 (Zhou et al., 2014) datasets. In the CelebA dataset, the images are cropped at the center to 64 × 64. We randomly sample 10,000 images for training. For the MIT places205 dataset, we learn the models from images of a single place category. The number of training images is 8,500 for rock and 15,100 for the other categories.

Figure 2. Synthesized images from models learned on the CelebA dataset (top) and the forest road category of the MIT places205 dataset (bottom). From left to right: observed images, and images synthesized by DCGAN (Radford, Metz, and Chintala, 2015), single-grid CD, and multi-grid CD. CD1 and persistent CD cannot synthesize realistic images, so their results are not shown.

Figure 3. Synthesized images from models learned by multi-grid CD on 4 categories of the MIT places205 dataset. From left to right: rock, volcano, hotel room, building facade.

Experiment 2: Learning feature maps for classification

To evaluate the quality of the feature maps learned by multi-grid CD, we perform a classification experiment. The procedure is similar to the one outlined in (Radford, Metz, and Chintala, 2015): we use multi-grid CD as a feature extractor. We first train the models on the SVHN training set in an unsupervised way. We then extract the top-layer feature maps of the three grids and train a two-layer classification CNN on top of them. Figure 4 shows synthesis results and Table 1 reports classification errors.
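Conceptually, the pipeline freezes the learned models and uses them only as feature extractors. The sketch below illustrates this setup; the hypothetical model.features interface, the common feature-map size, and the exact classifier architecture are assumptions made for illustration and do not reproduce the configuration reported in the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def extract_features(models, images, grid_sizes=(4, 16, 64), out_size=4):
        """Concatenate top-layer feature maps from the three grids (illustrative).

        Assumes each learned model exposes a hypothetical `features(x)` method
        returning its top-layer feature map; maps are pooled to a common size.
        """
        feats = []
        with torch.no_grad():  # the learned models are frozen feature extractors
            for model, size in zip(models, grid_sizes):
                x = F.adaptive_avg_pool2d(images, size)
                feats.append(F.adaptive_avg_pool2d(model.features(x), out_size))
        return torch.cat(feats, dim=1)

    class TwoLayerClassifier(nn.Module):
        """Small two-layer classifier on top of the frozen feature maps (sketch)."""

        def __init__(self, in_channels, n_classes=10):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
            self.fc = nn.Linear(128, n_classes)

        def forward(self, feature_map):
            h = F.relu(self.conv(feature_map))
            h = F.adaptive_avg_pool2d(h, 1).flatten(1)  # global average pooling
            return self.fc(h)

The classifier is then trained on the labeled subset while the feature extractors are kept fixed.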

Figure 4. Synthesized images from models learned on the SVHN dataset. From left to right: observed images, images synthesized by DCGAN (Radford, Metz, and Chintala, 2015), single-grid CD and multi-grid CD. CD1 and persistent CD cannot synthesize realistic images and their results are not shown.

Table 1: Test error rates (%) of classification on SVHN with 1,000, 2,000 and 4,000 labeled training images.

Method                                                    1000     2000     4000
DGN (Kingma et al., 2014)                                36.02      -        -
Virtual Adversarial (Miyato et al., 2015)                24.63      -        -
Auxiliary Deep Generative Model (Maaløe et al., 2016)    22.86      -        -
DCGAN + L2-SVM (Radford, Metz, and Chintala, 2015)       22.48      -        -
persistent CD                                            42.74    35.20    29.16
one-step CD (CD1)                                        29.75    23.90    19.15
single-grid CD                                           21.63    17.90    15.07
supervised CNN with the same structure                   39.04    22.26    15.24
multi-grid CD + CNN classifier                           19.73    15.86    12.71

Experiment 3: Image inpainting

We conduct image inpainting experiments on the CelebA dataset; please see the paper for training details. We test three different types of masks: 1) a 32 × 32 square mask, 2) a doodle mask with approximately 25% of the pixels missing, and 3) a salt-and-pepper mask with approximately 60% of the pixels missing. Figure 5 shows inpainting examples for the three types of masks.
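A natural way to use a learned energy-based model for inpainting is to run Langevin sampling on the missing pixels while keeping the observed pixels fixed. The sketch below illustrates this idea for a single grid; it is only an illustration of the principle, not the inpainting procedure released with the paper.

    import torch

    def inpaint(model, image, mask, n_steps=30, step_size=0.002, noise_scale=0.01):
        """Fill in masked pixels by Langevin sampling with observed pixels clamped.

        image : input image; the values under the mask can be arbitrary.
        mask  : tensor with 1 at the missing pixels and 0 at the observed pixels.
        Illustrative single-grid sketch, not the released inpainting code.
        """
        x = image.clone().detach().requires_grad_(True)
        for _ in range(n_steps):
            energy = model(x).sum()
            grad = torch.autograd.grad(energy, x)[0]
            proposal = x - 0.5 * step_size * grad + noise_scale * torch.randn_like(x)
            # Update only the missing pixels; keep the observed pixels fixed.
            x = (mask * proposal + (1 - mask) * image).detach().requires_grad_(True)
        return x.detach()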

Figure 5. Inpainting examples on the CelebA dataset for the three types of masks. In each block, from left to right: (1) the original image; (2) the masked input; (3) the image inpainted by multi-grid CD.

Further, we perform quantitative evaluations using two metrics: 1) per-pixel difference and 2) peak signal-to-noise ratio (PSNR). Both metrics are computed, on the masked pixels only, between the inpainting results obtained by the different methods and the original face images. We compare with persistent CD, CD1, and single-grid CD, as well as with the ContextEncoder (CE) (Pathak et al., 2016). The results are shown in Table 2.
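For reference, both metrics can be computed over the masked pixels as in the following NumPy sketch, assuming pixel values lie in [0, 1] so that the PSNR peak value is 1.

    import numpy as np

    def masked_error_and_psnr(original, inpainted, mask):
        """Per-pixel absolute difference and PSNR over the masked pixels only.

        original, inpainted : arrays with pixel values in [0, 1].
        mask                : boolean array, True at the masked (recovered) pixels.
        """
        diff = original[mask] - inpainted[mask]
        per_pixel_error = np.abs(diff).mean()
        mse = np.square(diff).mean()
        psnr = 10.0 * np.log10(1.0 / mse)  # peak signal value is 1.0
        return per_pixel_error, psnr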

Table 2: Quantitative evaluations for the three types of masks. Lower error values are better; higher PSNR values are better. PCD, SCD and MCD denote persistent CD, single-grid CD and multi-grid CD, respectively; CE denotes ContextEncoder.

                   PCD      CD1      SCD      CE       MCD
Error   Mask      0.056    0.081    0.066    0.045    0.042
        Doodle    0.055    0.078    0.055    0.050    0.045
        Pepper    0.069    0.084    0.054    0.060    0.036
PSNR    Mask      12.81    12.66    15.97    17.37    16.42
        Doodle    12.92    12.68    14.79    15.40    16.98
        Pepper    14.93    15.00    15.36    17.04    19.34


Acknowledgements

We thank Zhuowen Tu for sharing his insights on his recent work on introspective learning. We thank Jianwen Xie for helpful discussions.

The work is supported by DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.