Learning Generative ConvNets via Multi-grid Modeling and Sampling

Ruiqi Gao, Yang Lu, Junpei Zhou*, Song-Chun Zhu, and Ying Nian Wu

Equal contributions

University of California, Los Angeles (UCLA), USA

* College of Computer Science and Technology, Zhejiang University, China


This paper proposes a multi-grid method for learning energy-based generative ConvNet models of images. For each grid, we learn an energy-based probabilistic model where the energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). Learning such a model requires generating synthesized examples from the model. Within each iteration of our learning algorithm, for each observed training image, we generate synthesized images at multiple grids by initializing the finite-step MCMC sampling from a minimal 1 x 1 version of the training image. The synthesized image at each subsequent grid is obtained by a finite-step MCMC initialized from the synthesized image generated at the previous coarser grid. After obtaining the synthesized examples, the parameters of the models at multiple grids are updated separately and simultaneously based on the differences between synthesized and observed examples. We show that this multi-grid method can learn realistic energy-based generative ConvNet models, and it outperforms the original contrastive divergence (CD) and persistent CD.
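For readers who want the update rule in symbols, the following is the standard maximum-likelihood gradient for an energy-based model, which is what "differences between synthesized and observed examples" refers to. The notation is an assumption made here for illustration and is not taken from the paper: $U_\theta$ is the energy at one grid, $x_i$ are the $n$ observed images at that grid, and $\tilde{x}_i$ are the corresponding synthesized images produced by finite-step MCMC.

```latex
p_\theta(x) = \frac{1}{Z(\theta)} \exp\!\left[-U_\theta(x)\right],
\qquad
\frac{\partial}{\partial\theta}\,\frac{1}{n}\sum_{i=1}^{n}\log p_\theta(x_i)
= \mathbb{E}_{p_\theta}\!\left[\frac{\partial U_\theta(x)}{\partial\theta}\right]
- \frac{1}{n}\sum_{i=1}^{n}\frac{\partial U_\theta(x_i)}{\partial\theta}
\approx \frac{1}{n}\sum_{i=1}^{n}\frac{\partial U_\theta(\tilde{x}_i)}{\partial\theta}
- \frac{1}{n}\sum_{i=1}^{n}\frac{\partial U_\theta(x_i)}{\partial\theta}.
```

The expectation under the model is intractable, so it is approximated by the average over synthesized images. In the multi-grid setting, one such update would be applied to each grid's parameters separately.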


The paper can be downloaded here.


The code can be downloaded here.

Please cite our paper if you use our code:
title={Learning Generative ConvNets via Multi-grid Modeling and Sampling},
author={Gao, Ruiqi and Lu, Yang and Zhou, Junpei and Zhu, Song-Chun and Wu, Ying Nian},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
year={2018} }

Multi-grid Learning and Sampling

Figure 1. Synthesized images at multiple grids. From left to right: 4 × 4 grid, 16 × 16 grid and 64 × 64 grid. The synthesized image at each grid is obtained by 30-step Langevin sampling initialized from the synthesized image at the previous coarser grid, beginning from the 1 × 1 grid.
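The coarse-to-fine sampling scheme in Figure 1 can be sketched roughly as follows. This is a minimal NumPy illustration, not the released code: the quadratic `toy_energy_grad` is a hypothetical stand-in for the gradient of a learned ConvNet energy, and the helper names, the step size, the nearest-neighbor upsampling and the single-channel images are all assumptions. Only the grid sizes (4, 16, 64), the 1 × 1 initialization and the 30 Langevin steps per grid come from the text above.

```python
import numpy as np

def avg_pool_to(img, size):
    """Downsample a square 2-D image to size x size by block averaging."""
    k = img.shape[0] // size
    return img.reshape(size, k, size, k).mean(axis=(1, 3))

def upsample(img, size):
    """Nearest-neighbor upsampling of a square 2-D image to size x size."""
    k = size // img.shape[0]
    return np.kron(img, np.ones((k, k)))

def toy_energy_grad(x, mu=0.5):
    """Gradient of the toy energy U(x) = 0.5 * ||x - mu||^2.
    Hypothetical placeholder for the gradient of a ConvNet energy."""
    return x - mu

def langevin(x, grad_fn, n_steps, step, rng):
    """Finite-step Langevin dynamics: x <- x - (step^2 / 2) * dU/dx + step * noise."""
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step ** 2 * grad_fn(x) + step * noise
    return x

def multigrid_sample(observed, grids=(4, 16, 64), n_steps=30, step=0.1, seed=0):
    """Synthesize one image per grid, starting from the 1 x 1 version of `observed`."""
    rng = np.random.default_rng(seed)
    x = avg_pool_to(observed, 1)      # minimal 1 x 1 version of the training image
    samples = []
    for size in grids:
        x = upsample(x, size)         # initialize from the previous, coarser grid
        x = langevin(x, toy_energy_grad, n_steps, step, rng)
        samples.append(x)
    return samples
```

Calling `multigrid_sample` on a 64 × 64 array returns three arrays of shapes 4 × 4, 16 × 16 and 64 × 64. In an actual model, each grid would have its own learned energy in place of `toy_energy_grad`.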



Exp 1 : Synthesis and diagnosis
Exp 2 : Learning feature maps for classification
Exp 3 : Image inpainting

Experiment 1: Synthesis and diagnosis

We learn multi-grid models from five datasets: CelebA, LSUN, CIFAR-10, SVHN and MIT Places205. Figures 2-4 show synthesized results from models learned on CelebA, LSUN and CIFAR-10. Table 1 shows a quantitative evaluation of the results on CIFAR-10. More synthesized results can be found here.

Figure 2. Synthesized images from models learned on the CelebA dataset. From left to right: observed images, images synthesized by DCGAN (Radford, Metz, and Chintala, 2015), single-grid method and multi-grid method. CD1 and persistent CD cannot synthesize realistic images and their results are not shown.

(a) Original images.

(b) Synthesized images.

Figure 3. Synthesized images generated by the multi-grid models learned from the LSUN bedrooms dataset.

Figure 4. Synthesized images generated by the multi-grid models learned from the CIFAR-10 dataset. Each row illustrates a category, and the multi-grid models are learned conditional on the category.

Table 1: Inception scores on CIFAR-10.

| | Real images | DCGAN | Multi-grid method |
|---|---|---|---|
| Inception score | 11.237 | 6.581 | 6.565 |
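For reference, the Inception score is commonly defined as exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is the class distribution predicted by a pretrained Inception network. The sketch below computes that quantity from an array of predicted probabilities. It assumes the probabilities have already been obtained, and the function name and the `eps` smoothing constant are illustrative choices, not part of the paper's evaluation code.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from an (N, K) array of class probabilities p(y|x).

    Assumes each row sums to 1. `eps` only guards against log(0).
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)   # p(y), averaged over images
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Two sanity checks follow from the definition: if every image gets the same uniform prediction, the score is 1, and if predictions are confident one-hot vectors spread evenly over K classes, the score approaches K.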

To check the diversity of Langevin dynamics sampling, we synthesize images by initializing the Langevin dynamics from the same 1 x 1 image. As shown in Fig. 5, after 90 steps of Langevin dynamics, the sampled images from the same 1 x 1 image are different from each other.

Figure 5. Synthesized images obtained by initializing the Langevin dynamics sampling from the same 1 × 1 image. Each block of 4 images is generated from the same 1 × 1 image.

Experiment 2: Learning feature maps for classification

To evaluate the features learned by the multi-grid method, we perform a semi-supervised classification experiment following the procedure outlined in (Radford, Metz, and Chintala, 2015). That is, we use the multi-grid method as a feature extractor. We first train a multi-grid model on the combination of the SVHN training and testing sets in an unsupervised way. Then we extract features from the learned model and train a classifier with a small amount of labeled data. Tables 2-3 show classification errors using features from grid 3 to train an SVM classifier and features from all three grids to train a two-layer CNN, respectively.

Table 2: Classification error of L2-SVM trained on the features learned from SVHN.

| Test error rate with # of labeled images | 1000 | 2000 | 4000 |
|---|---|---|---|
| Persistent CD | 45.74 | 39.47 | 34.18 |
| One-step CD | 44.38 | 35.87 | 30.45 |
| Wasserstein GAN | 43.15 | 38.00 | 32.56 |
| Deep directed generative models | 44.99 | 34.26 | 27.44 |
| DCGAN | 38.59 | 32.51 | 29.37 |
| Single-grid method | 36.69 | 30.87 | 25.60 |
| Multi-grid method | 30.23 | 26.54 | 22.83 |

Table 3: Classification error of the CNN classifier trained on the features from all three grids learned from SVHN.

| Test error rate with # of labeled images | 1000 | 2000 | 4000 |
|---|---|---|---|
| DGN | 36.02 | - | - |
| Virtual adversarial | 24.63 | - | - |
| Auxiliary deep generative model | 22.86 | - | - |
| Deep directed generative models | 44.99 | 34.26 | 27.44 |
| DCGAN | 38.59 | 32.51 | 29.37 |
| Supervised CNN with the same structure | 39.04 | 22.26 | 15.24 |
| Multi-grid method + CNN classifier | 19.73 | 15.86 | 12.71 |

Experiment 3: Image inpainting

We conduct image inpainting experiments on the CelebA dataset. Please check the paper for learning details. We test three different shapes of masks: 1) a 32 × 32 square mask, 2) a doodle mask with approximately 25% missing pixels, and 3) a pepper and salt mask with approximately 60% missing pixels. Figure 6 shows inpainting examples for the three types of masks. Table 4 shows the quantitative comparison.
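To make the mask types concrete, here is a rough sketch of how the square mask and the pepper and salt mask could be generated. Several details are assumptions made for illustration: the 64 × 64 image size, the centered placement of the square, the uniform random choice of missing pixels, and the zero fill value. The doodle mask is omitted because its shape is not specified here.

```python
import numpy as np

def square_mask(size=64, hole=32):
    """Boolean mask, True = missing pixel: a centered hole x hole square (placement assumed)."""
    mask = np.zeros((size, size), dtype=bool)
    start = (size - hole) // 2
    mask[start:start + hole, start:start + hole] = True
    return mask

def pepper_salt_mask(size=64, missing=0.6, seed=0):
    """Boolean mask where each pixel is missing independently with probability `missing`."""
    rng = np.random.default_rng(seed)
    return rng.random((size, size)) < missing

def apply_mask(img, mask, fill=0.0):
    """Return a copy of `img` with the masked pixels replaced by `fill`."""
    out = img.copy()
    out[mask] = fill
    return out
```

With these defaults, `square_mask()` hides 1024 of the 4096 pixels, and `pepper_salt_mask()` hides roughly 60% of them.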

Figure 6. Inpainting examples on the CelebA dataset for three types of masks. In each block from left to right: (1) the original image; (2) the masked input; (3) the image inpainted by the multi-grid method.

Table 4: Quantitative evaluations for three types of masks. Lower values of error are better. Higher values of PSNR are better. PCD, CD1, SG, CE and MG indicate persistent CD, one-step CD, single-grid method, ContextEncoder and multi-grid method, respectively.

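| Metric | Mask type | PCD | CD1 | SG | CE | MG |
|---|---|---|---|---|---|---|
| Error | Mask | 0.056 | 0.081 | 0.066 | 0.045 | 0.042 |
| Error | Doodle | 0.055 | 0.078 | 0.055 | 0.050 | 0.045 |
| Error | Pepper | 0.069 | 0.084 | 0.054 | 0.060 | 0.036 |
| PSNR | Mask | 12.81 | 12.66 | 15.97 | 17.37 | 16.42 |
| PSNR | Doodle | 12.92 | 12.68 | 14.79 | 15.40 | 16.98 |
| PSNR | Pepper | 14.93 | 15.00 | 15.36 | 17.04 | 19.34 |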
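The two metrics in Table 4 can be illustrated with the sketch below. PSNR follows its usual definition, 10 · log10(peak² / MSE). The error metric is written here as the mean absolute per-pixel difference on the masked region, relative to the pixel value range, which is an assumption about how it is computed. The region over which PSNR is evaluated and the [0, 1] pixel range are likewise assumptions. Please check the paper for the exact definitions.

```python
import numpy as np

def masked_error(original, recovered, mask, value_range=1.0):
    """Mean absolute per-pixel difference on the masked region (assumed definition)."""
    diff = np.abs(original - recovered)[mask]
    return float(diff.mean() / value_range)

def psnr(original, recovered, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((original - recovered) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```

As a quick check, a recovered image that is off by 0.1 at every pixel (with pixel values in [0, 1]) has an MSE of 0.01 and therefore a PSNR of 20 dB.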


We thank Zhuowen Tu for sharing his insights on his recent work on introspective learning. We thank Jianwen Xie for helpful discussions.

The work is supported by DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.