Learning Generative ConvNets via Multi-grid Modeling and Sampling

Ruiqi Gao^†, Yang Lu^†, Junpei Zhou*, Song-Chun Zhu, and Ying Nian Wu

^† Equal contributions

University of California, Los Angeles (UCLA), USA

* College of Computer Science and Technology, Zhejiang University, China

Abstract

This paper proposes a multi-grid method for learning energy-based generative ConvNet models of images. For each grid, we learn an energy-based probabilistic model where the energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). Learning such a model requires generating synthesized examples from the model. Within each iteration of our learning algorithm, for each observed training image, we generate synthesized images at multiple grids by initializing the finite-step MCMC sampling from a minimal 1 x 1 version of the training image. The synthesized image at each subsequent grid is obtained by a finite-step MCMC initialized from the synthesized image generated at the previous coarser grid. After obtaining the synthesized examples, the parameters of the models at multiple grids are updated separately and simultaneously based on the differences between synthesized and observed examples. We show that this multi-grid method can learn realistic energy-based generative ConvNet models, and it outperforms the original contrastive divergence (CD) and persistent CD.

Paper

The paper can be downloaded here.

Code

The code can be downloaded here.

Please cite our paper if you use our code:

@inproceedings{gao2018learning,
title={Learning Generative ConvNets via Multi-grid Modeling and Sampling},
author={Gao, Ruiqi and Lu, Yang and Zhou, Junpei and Zhu, Song-Chun and Wu, Ying Nian},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
year={2018} }

Multi-grid Learning and Sampling
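
The sampling scheme described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the released code: the grid sizes (4, 16, 64), the step size, the number of Langevin steps, and the helper names are assumptions made for this example, and a toy quadratic energy stands in for the bottom-up ConvNet energy at each grid. The parameter update, which is based on the difference between synthesized and observed examples, is omitted.

```python
import numpy as np

def downsample(img, size):
    """Average-pool a square (H, H) image to (size, size); H must be divisible by size."""
    f = img.shape[0] // size
    return img.reshape(size, f, size, f).mean(axis=(1, 3))

def upsample(img, size):
    """Nearest-neighbour upsampling of a square image to (size, size)."""
    f = size // img.shape[0]
    return np.kron(img, np.ones((f, f)))

def langevin(x, grad_energy, n_steps, step_size, rng):
    """Finite-step Langevin dynamics: x <- x - (s^2 / 2) * dE/dx + s * noise."""
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size ** 2 * grad_energy(x) + step_size * noise
    return x

def multigrid_sample(image, grid_sizes, grad_energies,
                     n_steps=30, step_size=0.01, seed=0):
    """Start from the 1 x 1 version of `image`, then sample grid by grid,
    initializing each grid from the sample at the previous coarser grid."""
    rng = np.random.default_rng(seed)
    x = downsample(image, 1)              # minimal 1 x 1 version of the image
    samples = []
    for size, grad_e in zip(grid_sizes, grad_energies):
        x = upsample(x, size)             # initialize from the coarser grid
        x = langevin(x, grad_e, n_steps, step_size, rng)
        samples.append(x)
    return samples

# Hypothetical setup: a random "training image" and toy energies
# E_s(x) = 0.5 * ||x - t_s||^2, whose gradient is x - t_s.
image = np.random.default_rng(1).random((64, 64))
sizes = [4, 16, 64]
grads = [lambda x, t=downsample(image, s): x - t for s in sizes]
samples = multigrid_sample(image, sizes, grads)
print([s.shape for s in samples])
```

In a real model each entry of `grads` would be the gradient of a learned ConvNet energy with respect to the image, typically obtained by automatic differentiation.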

Experiments

Contents

Exp 1: Synthesis and diagnosis
Exp 2: Learning feature maps for classification
Exp 3: Image inpainting

Experiment 1: Synthesis and diagnosis

We learn multi-grid models from five datasets: CelebA, LSUN, CIFAR-10, SVHN and MIT Places205. Figures 2-4 show synthesized results learned on CelebA, LSUN and CIFAR-10. Table 1 reports a quantitative evaluation of the results learned on CIFAR-10. More synthesized results can be found here.

(a) Original images. (b) Synthesized images.
Figure 3. Synthesized images generated by the multi-grid models learned from the LSUN bedrooms dataset.
Figure 4. Synthesized images generated by the multi-grid models learned from the CIFAR-10 dataset. Each row illustrates a category, and the multi-grid models are learned conditional on the category.

Table 1: Inception scores on CIFAR-10.
	Real images	DCGAN	Multi-grid method
Inception score	11.237	6.581	6.565
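
For reference, the Inception score is the exponential of the average KL divergence between the per-image class distribution p(y|x) and the marginal class distribution p(y). The sketch below computes it from an array of class probabilities. It assumes those probabilities have already been produced by a pretrained Inception classifier; the function name and the single-split computation are choices made for this illustration and may differ from the evaluation protocol behind Table 1.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from an (N, C) array of class probabilities p(y|x).

    Computes exp( mean_x KL( p(y|x) || p(y) ) ), where p(y) is the mean of
    p(y|x) over the N images. `eps` guards against log(0).
    """
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)              # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Hypothetical predictions: uninformative (uniform) versus confident and
# diverse (one image per class, each predicted with certainty).
uniform = np.full((5, 10), 0.1)
confident = np.eye(10)
print(inception_score(uniform), inception_score(confident))
```

The score is lowest when every image receives the same class distribution and approaches the number of classes when predictions are both confident and evenly spread across classes.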

Experiment 2: Learning feature maps for classification

To evaluate the features learned by the multi-grid method, we perform a semi-supervised classification experiment following the procedure outlined in (Radford, Metz, and Chintala, 2015). That is, we use the multi-grid method as a feature extractor. We first train a multi-grid model on the combination of the SVHN training and testing sets in an unsupervised way. Then we extract features from the learned model and train a classifier with a small amount of labeled data. Tables 2 and 3 show the classification errors obtained using features from grid 3 to train an SVM classifier and features from all three grids to train a two-layer CNN, respectively.

Table 2: Classification error of L2-SVM trained on the features learned from SVHN.
Test error rate with # of labeled images	1000	2000	4000
Persistent CD	45.74	39.47	34.18
One-step CD	44.38	35.87	30.45
Wasserstein GAN	43.15	38.00	32.56
Deep directed generative models	44.99	34.26	27.44
DCGAN	38.59	32.51	29.37
Single-grid method	36.69	30.87	25.60
Multi-grid method	30.23	26.54	22.83
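
The evaluation behind Table 2 fits a linear L2-SVM (squared hinge loss) on fixed features. The following is a minimal one-vs-rest sketch in plain NumPy under stated assumptions: the synthetic two-cluster `features` stand in for features extracted from the learned model, and the optimizer, regularization weight, learning rate, and function names are chosen for illustration rather than taken from the paper.

```python
import numpy as np

def train_l2_svm(features, labels, n_classes, reg=1e-3, lr=0.01, n_iters=500):
    """One-vs-rest linear SVM with squared hinge loss, fit by gradient descent."""
    n, d = features.shape
    targets = -np.ones((n, n_classes))
    targets[np.arange(n), labels] = 1.0           # +1 for the true class, -1 otherwise
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    for _ in range(n_iters):
        scores = features @ w + b
        slack = np.maximum(0.0, 1.0 - targets * scores)
        grad_scores = -2.0 * targets * slack / n  # gradient of mean squared hinge
        w -= lr * (features.T @ grad_scores + reg * w)
        b -= lr * grad_scores.sum(axis=0)
    return w, b

def error_rate(features, labels, w, b):
    """Fraction of examples whose highest-scoring class is not the true label."""
    pred = np.argmax(features @ w + b, axis=1)
    return float(np.mean(pred != labels))

# Hypothetical "extracted features": two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
centers = np.array([[-2.0, -2.0], [2.0, 2.0]])
features = centers[labels] + 0.3 * rng.standard_normal((100, 2))
w, b = train_l2_svm(features, labels, n_classes=2)
print(error_rate(features, labels, w, b))
```

In the semi-supervised setting, the classifier would be fit on features of the 1000, 2000 or 4000 labeled images and the error rate measured on held-out test images.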

Table 3: Classification error of a two-layer CNN classifier trained on the features from all three grids learned from SVHN.
Test error rate with # of labeled images	1000	2000	4000
DGN	36.02	-	-
Virtual adversarial	24.63	-	-
Auxiliary deep generative model	22.86	-	-
Deep directed generative models	44.99	34.26	27.44
DCGAN	38.59	32.51	29.37
Supervised CNN with the same structure	39.04	22.26	15.24
Multi-grid method + CNN classifier	19.73	15.86	12.71

Experiment 3: Image inpainting

We conduct image inpainting experiments on the CelebA dataset. Please check the paper for learning details. We test three different shapes of masks: 1) a 32 × 32 square mask, 2) a doodle mask with approximately 25% missing pixels, and 3) a pepper and salt mask with approximately 60% missing pixels. Figure 6 shows inpainting examples for the three types of masks. Table 4 shows the quantitative comparison.

Table 4: Quantitative evaluations for three types of masks. Lower values of error are better. Higher values of PSNR are better. PCD, CD1, SG, CE and MG indicate persistent CD, one-step CD, single-grid method, ContextEncoder and multi-grid method, respectively.
	Mask	PCD	CD1	SG	CE	MG
Error	Mask	0.056	0.081	0.066	0.045	0.042
	Doodle	0.055	0.078	0.055	0.050	0.045
	Pepper	0.069	0.084	0.054	0.060	0.036
PSNR	Mask	12.81	12.66	15.97	17.37	16.42
	Doodle	12.92	12.68	14.79	15.40	16.98
	Pepper	14.93	15.00	15.36	17.04	19.34
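
To make the mask types and the metrics in Table 4 concrete, here is a small NumPy sketch. It relies on several assumptions that the page does not specify: images are 64 × 64 with pixel values in [0, 1], the square mask is centered, the error is the mean absolute per-pixel difference, and both metrics are computed over the masked pixels only. The doodle mask is omitted because its construction is not described here.

```python
import numpy as np

def square_mask(size=64, hole=32):
    """Boolean mask that is True on a centered hole x hole square (missing pixels)."""
    mask = np.zeros((size, size), dtype=bool)
    start = (size - hole) // 2
    mask[start:start + hole, start:start + hole] = True
    return mask

def pepper_salt_mask(size=64, missing=0.6, seed=0):
    """Boolean mask where each pixel is missing independently with probability `missing`."""
    rng = np.random.default_rng(seed)
    return rng.random((size, size)) < missing

def masked_error(original, recovered, mask):
    """Mean absolute per-pixel difference over the masked region (assumed definition)."""
    return float(np.abs(original - recovered)[mask].mean())

def psnr(original, recovered, mask, max_value=1.0):
    """Peak signal-to-noise ratio in dB over the masked region."""
    mse = float(((original - recovered) ** 2)[mask].mean())
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Hypothetical example: fill the masked pixels with a constant and score the result.
rng = np.random.default_rng(1)
original = rng.random((64, 64))
mask = square_mask()
recovered = original.copy()
recovered[mask] = 0.5
print(masked_error(original, recovered, mask), psnr(original, recovered, mask))
```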

Acknowledgement

We thank Zhuowen Tu for sharing his insights on his recent work on introspective learning. We thank Jianwen Xie for helpful discussions.

The work is supported by DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.