† Equal contributions
University of California, Los Angeles (UCLA), USA
* College of Computer Science and Technology, Zhejiang University, China
This paper proposes a multi-grid minimal contrastive divergence method for learning energy-based generative ConvNet models of images at multiple scales, or grids, simultaneously. Each model is an energy-based probabilistic model whose energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). Learning such a model requires MCMC sampling of synthesized examples from the current model. In our learning algorithm, within each learning iteration and for each observed training image, we generate synthesized images at multiple grids by initializing finite-step MCMC sampling from the minimal 1 × 1 version of the training image; the synthesized image at each grid is obtained by finite-step MCMC sampling initialized from the synthesized image at the previous, coarser grid. After obtaining the synthesized examples, the parameters of the models at the multiple grids are updated separately and simultaneously based on the differences between the synthesized examples and the observed training examples. We call this learning method multi-grid minimal contrastive divergence. We show that this method can learn realistic energy-based generative ConvNet models, and that it outperforms the traditional contrastive divergence (CD) and persistent CD, which initialize the MCMC sampling from the observed images.
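The multi-grid sampling chain described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the energy gradient here is a toy quadratic stand-in for a ConvNet energy, the grid sizes (4 × 4, 16 × 16, 64 × 64) are assumed for illustration, and block-averaging/nearest-neighbor resampling is one simple choice among many.

```python
import numpy as np

def downsample(img, size):
    # Block-averaging downsampler (illustrative choice).
    f = img.shape[0] // size
    return img.reshape(size, f, size, f).mean(axis=(1, 3))

def upsample(img, size):
    # Nearest-neighbor upsampler (illustrative choice).
    f = size // img.shape[0]
    return np.repeat(np.repeat(img, f, axis=0), f, axis=1)

def langevin_steps(x, grad_energy, n_steps=10, step=0.01, rng=None):
    # Finite-step Langevin dynamics; grad_energy stands in for the
    # gradient of a learned ConvNet energy function.
    rng = rng or np.random.default_rng(0)
    for _ in range(n_steps):
        x = x - 0.5 * step ** 2 * grad_energy(x) + step * rng.normal(size=x.shape)
    return x

def multi_grid_sample(train_img, grids=(4, 16, 64), grad_energy=lambda x: x):
    # Initialize the chain from the minimal 1x1 version of the training image,
    # then refine grid by grid: upsample the previous synthesis and run
    # finite-step MCMC at the current grid.
    x = downsample(train_img, 1)
    synthesized = []
    for size in grids:
        x = upsample(x, size)
        x = langevin_steps(x, grad_energy)
        synthesized.append(x)
    return synthesized

img = np.random.default_rng(1).normal(size=(64, 64))
outs = multi_grid_sample(img)
print([o.shape for o in outs])  # [(4, 4), (16, 16), (64, 64)]
```

The synthesized example at each grid then drives the parameter update for that grid's model, against the observed training examples.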
The paper can be downloaded here.
The code for all experiments can be downloaded here.
Contents

Exp 1 : Synthesis and diagnosis
Exp 2 : Learning feature maps for classification
Exp 3 : Image inpainting
We learn multi-grid models from CelebA (Liu et al., 2015) and MIT places205 (Zhou et al., 2014) datasets. In the CelebA dataset, the images are cropped at the center to 64 × 64. We randomly sample 10,000 images for training. For the MIT places205 dataset, we learn the models from images of a single place category. The number of training images is 8,500 for rock and 15,100 for the other categories.
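The center-crop preprocessing for CelebA can be sketched as follows. This is a minimal sketch: the 178 × 218 input size is the standard aligned CelebA resolution, assumed here for illustration.

```python
import numpy as np

def center_crop(img, size=64):
    # Crop a square patch of side `size` from the center of the image.
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

# Aligned CelebA images are 218 (h) x 178 (w); keep the central 64x64 patch.
img = np.zeros((218, 178, 3))
cropped = center_crop(img)
print(cropped.shape)  # (64, 64, 3)
```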
To evaluate the quality of the feature maps learned by multi-grid CD, we perform a classification experiment, following a procedure similar to the one outlined in (Radford, Metz, and Chintala, 2015): we use the multi-grid CD models as a feature extractor. We first train the models on the SVHN training set in an unsupervised way, then extract the top-layer feature maps of the three grids and train a two-layer classification CNN on top of them. Figure 4 shows synthesized results and Table 1 shows classification errors.
| Test error rate with # of labeled images | 1000 | 2000 | 4000 |
| --- | --- | --- | --- |
| DGN (Kingma et al., 2014) | 36.02 | - | - |
| Virtual Adversarial (Miyato et al., 2015) | 24.63 | - | - |
| Auxiliary Deep Generative Model (Maaløe et al., 2016) | 22.86 | - | - |
| DCGAN + L2-SVM (Radford, Metz, and Chintala, 2015) | 22.48 | - | - |
| Supervised CNN with the same structure | 39.04 | 22.26 | 15.24 |
| multi-grid CD + CNN classifier | 19.73 | 15.86 | 12.71 |
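The feature-extraction step can be sketched as follows. This is an illustrative sketch only: the channel counts and spatial sizes of the three grids' top-layer feature maps are invented placeholders, not the paper's architecture, and random arrays stand in for the actual extracted features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in top-layer feature maps from the three grids, in
# (batch, channels, height, width) layout; shapes are hypothetical.
feat_grid1 = rng.normal(size=(8, 64, 1, 1))
feat_grid2 = rng.normal(size=(8, 128, 4, 4))
feat_grid3 = rng.normal(size=(8, 256, 8, 8))

def to_vector(feats):
    # Flatten each image's feature map into a single feature vector.
    return feats.reshape(feats.shape[0], -1)

# Concatenate the features from all three grids; the two-layer
# classification CNN is then trained on top of this representation.
features = np.concatenate(
    [to_vector(f) for f in (feat_grid1, feat_grid2, feat_grid3)], axis=1)
print(features.shape)  # (8, 18496)
```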
We conduct image inpainting experiments using the CelebA dataset. Please see the paper for learning details. We test three different shapes of masks: 1) a 32 × 32 square mask, 2) a doodle mask with approximately 25% missing pixels, and 3) a salt-and-pepper mask with approximately 60% missing pixels. Figure 5 shows inpainting examples for the three types of masks.
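Two of the mask types can be sketched directly (the doodle mask is hand-drawn and omitted here). This is an illustrative sketch; placing the square mask at the image center is an assumption made for simplicity.

```python
import numpy as np

def square_mask(h=64, w=64, size=32):
    # 32x32 square mask (True = missing pixel), centered for illustration.
    mask = np.zeros((h, w), dtype=bool)
    top, left = (h - size) // 2, (w - size) // 2
    mask[top:top + size, left:left + size] = True
    return mask

def salt_and_pepper_mask(h=64, w=64, missing=0.6, rng=None):
    # Drop roughly `missing` fraction of pixels uniformly at random.
    rng = rng or np.random.default_rng(0)
    return rng.random((h, w)) < missing

m_square = square_mask()
m_sp = salt_and_pepper_mask()
print(m_square.mean())            # 0.25 (32*32 of 64*64 pixels)
print(round(m_sp.mean(), 2))      # roughly 0.6
```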
Further, we perform quantitative evaluations using two metrics: 1) per-pixel difference and 2) peak signal-to-noise ratio (PSNR). Both metrics are computed on the masked pixels, between the inpainting results obtained by different methods and the original face images. We compare with persistent CD, CD1, and single-grid CD, as well as with the ContextEncoder (Pathak et al., 2016) (CE). The results are shown in Table 2.
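The two metrics, restricted to the masked pixels, can be sketched as follows. This is a minimal sketch assuming pixel values in [0, 255]; the noisy image below is a stand-in for an actual inpainting result.

```python
import numpy as np

def masked_metrics(result, original, mask):
    # Per-pixel absolute difference and PSNR, computed only on the
    # masked (missing) pixels. Pixel values assumed in [0, 255].
    diff = np.abs(result[mask] - original[mask])
    per_pixel = diff.mean()
    mse = (diff ** 2).mean()
    psnr = 10 * np.log10(255.0 ** 2 / mse) if mse > 0 else np.inf
    return per_pixel, psnr

rng = np.random.default_rng(0)
original = rng.uniform(0, 255, size=(64, 64))
mask = rng.random((64, 64)) < 0.6
result = original + rng.normal(0, 5, size=original.shape)  # inpainting stand-in
per_pixel, psnr = masked_metrics(result, original, mask)
print(round(per_pixel, 1), round(psnr, 1))
```

Lower per-pixel difference and higher PSNR indicate better inpainting on the missing region.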
We thank Zhuowen Tu for sharing his insights on his recent work on introspective learning. We thank Jianwen Xie for helpful discussions.
The work is supported by DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.