Ruiqi Gao†, Yang Lu†, Junpei Zhou*, Song-Chun Zhu,
and Ying Nian Wu
† Equal contributions
University of California, Los Angeles (UCLA), USA
* College of Computer Science and Technology, Zhejiang University, China
This paper proposes a multi-grid method for learning energy-based generative ConvNet models of images. For each grid, we learn an energy-based probabilistic model where the energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). Learning such a model requires generating synthesized examples from the model. Within each iteration of our learning algorithm, for each observed training image, we generate synthesized images at multiple grids by initializing the finite-step MCMC sampling from a minimal 1 × 1 version of the training image. The synthesized image at each subsequent grid is obtained by finite-step MCMC initialized from the synthesized image generated at the previous coarser grid. After obtaining the synthesized examples, the parameters of the models at the multiple grids are updated separately and simultaneously based on the differences between the synthesized and observed examples. We show that this multi-grid method can learn realistic energy-based generative ConvNet models, and that it outperforms the original contrastive divergence (CD) and persistent CD.
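The coarse-to-fine sampling loop described above can be illustrated with a minimal numpy sketch. This is not the paper's code: it assumes Langevin dynamics as the finite-step MCMC and takes the energy gradient at each grid as a user-supplied function (a toy quadratic energy stands in for the learned ConvNet energy).

```python
import numpy as np

def downsample(img, factor):
    """Average-pool a square image by `factor` (size must be divisible by factor)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(img, factor):
    """Nearest-neighbor upsampling by `factor`."""
    return np.kron(img, np.ones((factor, factor)))

def langevin_step(img, grad_energy, step_size, rng):
    """One Langevin update: descend the energy gradient, then add Gaussian noise."""
    return (img - 0.5 * step_size ** 2 * grad_energy(img)
            + step_size * rng.standard_normal(img.shape))

def multi_grid_sample(obs, grid_sizes, grad_energies, n_steps=10,
                      step_size=0.1, seed=0):
    """Generate one synthesized image per grid, coarse to fine.

    Sampling at the coarsest grid is initialized from the minimal 1x1
    (mean) version of the observed image `obs`; each finer grid is
    initialized from the upsampled sample of the previous grid.
    """
    rng = np.random.default_rng(seed)
    current = downsample(obs, obs.shape[0])  # minimal 1x1 version
    samples = []
    for size, grad_e in zip(grid_sizes, grad_energies):
        current = upsample(current, size // current.shape[0])
        for _ in range(n_steps):
            current = langevin_step(current, grad_e, step_size, rng)
        samples.append(current.copy())
    return samples
```

In the full algorithm, the samples returned at each grid would then drive the parameter updates: each grid's model parameters move in the direction of the difference between gradients computed on observed and synthesized images.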
The paper can be downloaded here.
The code can be downloaded here.
Please cite our paper if you use our code.

Contents
Exp 1: Synthesis and diagnosis
Exp 2: Learning feature maps for classification
Exp 3: Image inpainting
We learn multi-grid models from five datasets: CelebA, LSUN, CIFAR-10, SVHN, and MIT Places205. Figures 2-4 show synthesized results learned from CelebA, LSUN, and CIFAR-10. Table 1 shows a quantitative evaluation of the results learned on CIFAR-10. More synthesized results can be found here.
Figure 4. Synthesized images generated by the multi-grid models learned from the CIFAR-10 dataset. Each row illustrates a category, and the multi-grid models are learned conditioned on the category.
To evaluate the features learned by the multi-grid method, we perform a semi-supervised classification experiment following the same procedure outlined in (Radford, Metz, and Chintala, 2015). That is, we use the multi-grid method as a feature extractor. We first train a multi-grid model on the combination of the SVHN training and testing sets in an unsupervised way. Then we extract features from the learned model and train a classifier with a small amount of labeled data. Tables 1-2 show classification errors when using features from grid 3 to train an SVM classifier, and features from all three grids to train a two-layer CNN, respectively.
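The extract-then-classify pipeline above can be sketched as follows. This is a rough illustration, not the experiment's code: a hand-written filter bank with global max pooling stands in for the learned ConvNet features, and a nearest-class-mean classifier stands in for the SVM trained on the labeled subset.

```python
import numpy as np

def extract_features(images, filters):
    """Toy stand-in for the learned feature extractor: valid
    cross-correlation of each image with each filter, then global
    max pooling, giving one scalar response per filter."""
    fh, fw = filters.shape[1:]
    feats = []
    for img in images:
        h, w = img.shape
        responses = []
        for f in filters:
            best = -np.inf
            for i in range(h - fh + 1):
                for j in range(w - fw + 1):
                    best = max(best, float((img[i:i+fh, j:j+fw] * f).sum()))
            responses.append(best)
        feats.append(responses)
    return np.array(feats)

def nearest_mean_classifier(train_x, train_y):
    """Stand-in for the SVM: fit one mean per class on the small
    labeled set, then classify test points by the nearest class mean."""
    classes = np.unique(train_y)
    means = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    def predict(x):
        dists = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        return classes[dists.argmin(axis=1)]
    return predict
```

In the actual experiment the features come from grid 3 of the learned multi-grid model (or all three grids for the two-layer CNN classifier), not from a fixed filter bank.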
Test error rate (%) with # of labeled images
Method | 1000 | 2000 | 4000
Persistent CD | 45.74 | 39.47 | 34.18 |
One-step CD | 44.38 | 35.87 | 30.45 |
Wasserstein GAN | 43.15 | 38.00 | 32.56 |
Deep directed generative models | 44.99 | 34.26 | 27.44 |
DCGAN | 38.59 | 32.51 | 29.37 |
Single-grid method | 36.69 | 30.87 | 25.60 |
Multi-grid method | 30.23 | 26.54 | 22.83 |
We conduct image inpainting experiments using the CelebA dataset. Please check the paper for learning details. We test three different shapes of masks: 1) a 32 × 32 square mask, 2) a doodle mask with approximately 25% missing pixels, and 3) a salt-and-pepper mask with approximately 60% missing pixels. Figure 6 shows inpainting examples for the three types of masks. Table 4 shows quantitative comparison results.
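Two of the three mask types, and the two reported metrics, can be sketched in numpy. This is an illustration of one plausible setup, not the paper's evaluation code: we assume the per-pixel error is a mean absolute error over the missing region, and we omit the free-form doodle mask, whose shape is not specified here.

```python
import numpy as np

def square_mask(size, hole=32):
    """Centered square hole; 1 = observed pixel, 0 = missing."""
    mask = np.ones((size, size))
    start = (size - hole) // 2
    mask[start:start + hole, start:start + hole] = 0
    return mask

def salt_pepper_mask(size, missing_frac=0.6, seed=0):
    """Randomly drop about `missing_frac` of the pixels."""
    rng = np.random.default_rng(seed)
    return (rng.random((size, size)) >= missing_frac).astype(float)

def masked_error(truth, recon, mask):
    """Mean absolute per-pixel error over the missing region only
    (an assumed definition; the paper's exact metric may differ)."""
    missing = mask == 0
    return float(np.abs(truth - recon)[missing].mean())

def psnr(truth, recon, max_val=1.0):
    """Peak signal-to-noise ratio over the whole image, in dB."""
    mse = float(((truth - recon) ** 2).mean())
    return 10.0 * np.log10(max_val ** 2 / mse)
```

During inpainting itself, the observed pixels are clamped to their known values after every MCMC step, so only the masked-out region is synthesized by the model.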
Metric | Mask type | PCD | CD1 | SG | CE | MG
Error | Mask | 0.056 | 0.081 | 0.066 | 0.045 | 0.042
Error | Doodle | 0.055 | 0.078 | 0.055 | 0.050 | 0.045
Error | Pepper | 0.069 | 0.084 | 0.054 | 0.060 | 0.036
PSNR | Mask | 12.81 | 12.66 | 15.97 | 17.37 | 16.42
PSNR | Doodle | 12.92 | 12.68 | 14.79 | 15.40 | 16.98
PSNR | Pepper | 14.93 | 15.00 | 15.36 | 17.04 | 19.34
We thank Zhuowen Tu for sharing his insights on his recent work on introspective learning. We thank Jianwen Xie for helpful discussions.
The work is supported by DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.