Cooperative Training of Descriptor and Generator Networks



Jianwen Xie 1,2, Yang Lu 1,3, Ruiqi Gao 1, Song-Chun Zhu 1, and Ying Nian Wu 1

1 University of California, Los Angeles, USA
2 Hikvision Research Institute, Santa Clara, USA
3 Facebook, USA


Abstract

This paper studies the cooperative training of two generative models for image modeling and synthesis. Both models are parametrized by convolutional neural networks (ConvNets). The first model is a deep energy-based model, whose energy function is defined by a bottom-up ConvNet, which maps the observed image to the energy. We call it the descriptor network. The second model is a generator network, which is a non-linear version of factor analysis. It is defined by a top-down ConvNet, which maps the latent factors to the observed image. The maximum likelihood learning algorithms of both models involve MCMC sampling such as Langevin dynamics. We observe that the two learning algorithms can be seamlessly interwoven into a cooperative learning algorithm that can train both models simultaneously. Specifically, within each iteration of the cooperative learning algorithm, the generator model generates initial synthesized examples to initialize a finite-step MCMC that samples and trains the energy-based descriptor model. After that, the generator model learns from how the MCMC changes its synthesized examples. That is, the descriptor model teaches the generator model by MCMC, so that the generator model accumulates the MCMC transitions and reproduces them by direct ancestral sampling. We call this scheme MCMC teaching. We show that the cooperative algorithm can learn highly realistic generative models.
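For reference, the two models and their updates can be summarized as follows. This is a sketch in our own notation: f(x; θ) is the bottom-up ConvNet score of the descriptor relative to a Gaussian reference N(0, s²I), g(z; α) is the top-down ConvNet of the generator with residual noise N(0, σ²I), δ is the Langevin step size, and ε_τ ~ N(0, I) is the injected noise.

```latex
% Descriptor: p(x; \theta) \propto \exp[f(x; \theta)] \, \mathcal{N}(x; 0, s^2 I).
% Langevin sampling of synthesized images (Step D1 in the flow chart below):
x_{\tau+1} = x_\tau - \frac{\delta^2}{2}\Big[\frac{x_\tau}{s^2} - \frac{\partial}{\partial x} f(x_\tau; \theta)\Big] + \delta\, \epsilon_\tau .
% Descriptor update (Step D2), with observed \{x_i\} and synthesized \{\tilde{x}_i\}:
\Delta\theta \propto \frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial\theta} f(x_i; \theta)
  - \frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}} \frac{\partial}{\partial\theta} f(\tilde{x}_i; \theta) .

% Generator: x = g(z; \alpha) + e, \; z \sim \mathcal{N}(0, I_d), \; e \sim \mathcal{N}(0, \sigma^2 I).
% Langevin inference of latent factors (Step G1):
z_{\tau+1} = z_\tau + \frac{\delta^2}{2}\Big[\frac{1}{\sigma^2}
  \Big(\frac{\partial g(z_\tau; \alpha)}{\partial z}\Big)^{\!\top}\big(x - g(z_\tau; \alpha)\big) - z_\tau\Big] + \delta\, \epsilon_\tau .
% Generator update (Step G2), reconstructing examples from their latent factors:
\Delta\alpha \propto \sum_i \frac{\partial}{\partial\alpha}\Big[-\frac{1}{2\sigma^2}\,\|x_i - g(z_i; \alpha)\|^2\Big] .
```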

Cooperative Learning

 

Figure: (a) The flow chart for training the descriptor net. The update in Step D2 is based on the difference between the observed examples and the synthetic examples. The Langevin sampling of the synthetic examples from the current model in Step D1 can be time-consuming. (b) The flow chart for training the generator net. The update in Step G2 is based on the observed examples and their inferred latent factors. The Langevin sampling of the latent factors from the current posterior distribution in Step G1 can be time-consuming. (c) The flow chart of the cooperative learning of descriptor and generator nets. The part of the flow chart for training the descriptor is similar to Algorithm D in (a), except that the D1 Langevin sampling is initialized from the initial synthetic examples supplied by the generator. The part of the flow chart for training the generator can also be mapped to Algorithm G in (b), except that the revised synthetic examples play the role of the observed examples, and the known generated latent factors can be used as inferred latent factors (or be used to initialize the G1 Langevin sampling of the latent factors).
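To map the flow chart to code, here is a minimal PyTorch sketch of one cooperative iteration (Steps D1, D2, and G2). The toy descriptor and generator architectures, the Langevin settings, and the optimizer choices are illustrative assumptions, not the configuration used in the paper.

```python
# A minimal PyTorch sketch of one CoopNets iteration (MCMC teaching).
# The network sizes, Langevin settings, and optimizers below are illustrative
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn

img_size, z_dim = 32, 16

# Descriptor: bottom-up ConvNet mapping an image to a scalar score f(x; theta).
descriptor = nn.Sequential(
    nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 8 * 8, 1))

# Generator: top-down network mapping latent factors z to an image g(z; alpha).
generator = nn.Sequential(
    nn.Linear(z_dim, 64 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (64, 8, 8)),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())

opt_d = torch.optim.Adam(descriptor.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)

def langevin_revise(x, n_steps=15, step=0.002, ref_sigma=0.3):
    """Step D1: finite-step Langevin dynamics on images under the descriptor."""
    x = x.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        # Energy = Gaussian reference term minus the ConvNet score f(x; theta).
        energy = (x ** 2).sum() / (2 * ref_sigma ** 2) - descriptor(x).sum()
        grad = torch.autograd.grad(energy, x)[0]
        x = (x - 0.5 * step ** 2 * grad + step * torch.randn_like(x)).detach()
    return x

def coopnets_iteration(x_obs):
    n = x_obs.size(0)
    # The generator proposes initial synthesized examples by ancestral sampling.
    z = torch.randn(n, z_dim)
    x_init = generator(z).detach()
    # The descriptor revises them by MCMC: the "teaching" signal.
    x_revised = langevin_revise(x_init)
    # Step D2: raise the score on observed data, lower it on revised samples.
    loss_d = descriptor(x_revised).mean() - descriptor(x_obs).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Step G2: the generator learns to reproduce the MCMC revision from the
    # same latent factors z that generated x_init.
    loss_g = ((generator(z) - x_revised) ** 2).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# Example: one iteration on a random batch standing in for observed images.
print(coopnets_iteration(torch.randn(8, 3, img_size, img_size)))
```

In this sketch, the latent factors z that produced the initial synthesized examples are reused directly when the generator learns from the revised examples; as noted in the caption, they could instead be used to initialize a G1 Langevin refinement.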

Code and Paper

The code for the CoopNets algorithm can be downloaded from: (i) Matlab with MatConvNet, (ii) Python with TensorFlow, and (iii) Python with PyTorch.

The code for spatial-temporal CoopNets can be downloaded from: Python with TensorFlow.

If you wish to use our code, please cite the following papers:

Cooperative Training of Descriptor and Generator Networks
Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu
arXiv preprint arXiv:1609.09408, 2016

Cooperative Learning of Energy-Based Model and Latent Variable Model via MCMC Teaching
Jianwen Xie, Yang Lu, Ruiqi Gao, Ying Nian Wu
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018

TeX

The TeX file can be downloaded here.

Experiments

Contents

Exp 1 : Experiment on generating texture patterns
Exp 2 : Experiment on generating object patterns
Exp 3 : Experiment on generating scene patterns
Exp 4 : Experiment on generating handwritten digits
Exp 5 : Experiment on large-scale benchmark datasets
Exp 6 : Experiment on pattern completion
Exp 7 : Experiment on generating dynamic textures

Experiment 1: Generating texture patterns (stationary CoopNets)

 

Figure 1. Generating texture patterns. For each category, the first image displays the training image, and the rest are 3 of the images generated by the CoopNets algorithm.

Experiment 2: Generating object patterns (non-stationary CoopNets)


Figure 2. Generating object patterns. Each row displays one object experiment, where the first 3 images are 3 of the training images, and the rest are 6 of the synthesized images.

  

Figure 3. Left: Average softmax class probability on a single ImageNet category versus the number of training images. Middle: Top-5 classification error. Right: Average pairwise structural similarity.

Experiment 3: Generating scene patterns (non-stationary CoopNets)


Figure 4. Generating scene patterns. Both observed and synthesized scene images are shown for each category. The image size is 64 × 64 pixels. The categories are from the MIT Places205 dataset. (a) volcano. (b) desert. (c) rock. (d) apartment building.


  

(a) observed images        (b) synthesized images

Figure 5. Images generated by CoopNets learned from 10 ImageNet scene categories. The training set consists of 1,100 images randomly sampled from each category, for a total of 11,000 training images.



Figure 6. Interpolation between the latent vectors of the images at the two ends.
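The interpolation in Figure 6 simply decodes points on the line segment between two latent vectors with the learned generator. A minimal sketch, with a small placeholder network standing in for the trained generator:

```python
# A minimal sketch of latent-space interpolation: decode evenly spaced points
# on the segment between two latent vectors. The tiny fully connected network
# below is only a placeholder for the trained top-down generator g(z; alpha).
import torch
import torch.nn as nn

z_dim = 16
generator = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                          nn.Linear(256, 3 * 64 * 64), nn.Tanh())

z1, z2 = torch.randn(z_dim), torch.randn(z_dim)
with torch.no_grad():
    frames = [generator(((1 - t) * z1 + t * z2).unsqueeze(0)).view(3, 64, 64)
              for t in torch.linspace(0, 1, steps=8)]
print(len(frames), frames[0].shape)  # 8 interpolated 3 x 64 x 64 images
```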

Table 1: Inception scores of different methods on learning from 10 ImageNet scene categories. n is the number of training images randomly sampled from each category.

Method n=50 n=100 n=300 n=500 n=700 n=900 n=1100
CoopNets 2.66 ± .13 3.04 ± .13 3.41 ± .13 3.48 ± .08 3.59 ± .11 3.65 ± .07 3.79 ± .15
DCGAN 2.26 ± .16 2.50 ± .15 3.16 ± .15 3.05 ± .12 3.13 ± .09 3.34 ± .05 3.47 ± .06
EBGAN 2.23 ± .17 2.40 ± .14 2.62 ± .08 2.46 ± .09 2.65 ± .04 2.64 ± .04 2.75 ± .08
W-GAN 1.80 ± .09 2.19 ± .12 2.34 ± .06 2.62 ± .08 2.86 ± .10 2.88 ± .07 3.14 ± .06
VAE 1.62 ± .09 1.63 ± .06 1.65 ± .05 1.73 ± .04 1.67 ± .03 1.72 ± .02 1.73 ± .02
InfoGAN 2.21 ± .04 1.73 ± .01 2.15 ± .03 2.42 ± .05 2.47 ± .05 2.29 ± .03 2.08 ± .04
DDGM 2.65 ± .17 1.05 ± .03 3.27 ± .14 3.42 ± .09 3.47 ± .13 3.41 ± .08 3.34 ± .11
Algorithm G 1.72 ± .07 1.94 ± .09 2.32 ± .09 2.40 ± .06 2.45 ± .05 2.54 ± .05 2.61 ± .06
Persistent CD 1.30 ± .08 1.94 ± .03 1.80 ± .02 1.53 ± .02 1.45 ± .04 1.35 ± .02 1.51 ± .02
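For reference, the Inception score reported in Table 1 is the exponential of the average KL divergence between a pretrained classifier's conditional label distribution p(y|x) on the synthesized images and its marginal p(y). A minimal sketch of the final computation, assuming the classifier's softmax outputs have already been collected (the number of splits and the random stand-in data are assumptions):

```python
# A minimal sketch of the Inception score, IS = exp( E_x KL( p(y|x) || p(y) ) ),
# computed from an array of per-image softmax class probabilities.
import numpy as np

def inception_score(probs, n_splits=10, eps=1e-12):
    """probs: (N, K) array, each row the classifier's softmax output p(y|x)."""
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)  # marginal p(y) within the split
        kl = (chunk * (np.log(chunk + eps) - np.log(p_y + eps))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))

# Example with random probabilities standing in for real classifier outputs.
fake_probs = np.random.dirichlet(np.ones(1000), size=5000)
print(inception_score(fake_probs))
```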

Experiment 4: Generating handwritten digits

We conduct an experiment on learning CoopNets from the MNIST dataset of handwritten digits. The images are grayscale, with a size of 28 × 28 pixels. Figure 7 displays some examples synthesized by the learned models after training. Table 2 shows a comparison of Gaussian Parzen window log-likelihood estimates on the MNIST testing set.

  

Figure 7. Generating handwritten digits by CoopNets after training.

Table 2: A comparison of Parzen window-based log-likelihood estimates for the MNIST dataset.

Model Log-likelihood
DBN 138 ± 2.0
Stacked CAE 121 ± 1.6
Deep GSN 214 ± 1.1
GAN 225 ± 2.0
Generator in CoopNets (ours) 226 ± 2.1
Descriptor in CoopNets (ours) 228 ± 2.1
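For reference, the estimates in Table 2 fit a Gaussian Parzen window (a kernel density estimate) to samples drawn from each model and evaluate the average log-density of the MNIST test images under it. A minimal sketch of this computation; the bandwidth sigma is normally tuned on a validation split, and the value and array sizes below are placeholders:

```python
# A minimal sketch of the Gaussian Parzen window log-likelihood estimate:
# log p(x) = logsumexp_m[ -||x - s_m||^2 / (2 sigma^2) ] - log M - (d/2) log(2 pi sigma^2),
# averaged over the test set, where {s_m} are M generated samples.
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test, samples, sigma=0.2):
    """test: (N, D) test images (flattened); samples: (M, D) generated images."""
    d = test.shape[1]
    dist2 = ((test[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)  # (N, M)
    log_kernel = -dist2 / (2.0 * sigma ** 2)
    log_norm = np.log(samples.shape[0]) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(logsumexp(log_kernel, axis=1) - log_norm))

# Example with random arrays standing in for MNIST test images and model samples.
print(parzen_log_likelihood(np.random.rand(50, 784), np.random.rand(200, 784)))
```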

Experiment 5: Evaluation on large-scale benchmark datasets

     

(a) observed human face images        (b) observed bedroom images

     

(a) synthesized human face images        (b) synthesized bedroom images

Figure 8. (a) Generating human face images (128 × 128 pixels). The synthesized images are generated by the CoopNets algorithm learned from the CelebA dataset with 200K training images. (b) Generating bedroom images (256 × 256 pixels). The synthesized images are generated by the CoopNets algorithm learned from the LSUN bedroom dataset with 3,033K training images.

Table 3: The performance of CoopNets, DCGAN, W-GAN, and VAE on the LSUN bedroom, CelebA, and CIFAR-10 datasets with respect to the Fréchet Inception Distance (FID). Lower is better.

Method LSUN CelebA CIFAR-10
W-GAN 67.72 52.54 48.40
DCGAN 70.40 21.40 37.70
VAE 243.47 50.53 126.32
Generator in CoopNets (ours) 64.30 16.98 35.25
Descriptor in CoopNets (ours) 35.42 16.65 33.61
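For reference, the FID in Table 3 is the Fréchet distance between Gaussians fitted to Inception features of real and synthesized images. A minimal sketch of the final computation, assuming the feature matrices have already been extracted (the feature extractor is not shown, and the random inputs are stand-ins):

```python
# A minimal sketch of the Frechet Inception Distance between two sets of
# feature vectors (e.g. Inception pool features of real and generated images):
# FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}).
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real, feat_fake):
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * covmean))

# Example with random features standing in for Inception activations.
print(fid(np.random.randn(1000, 64), np.random.randn(1000, 64)))
```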

Experiment 6: Pattern completion

We conduct an experiment on learning from fully observed training images and then testing the learned model on completing occluded testing images. The purpose of this experiment is to check whether the learned generator model can generalize to the testing data. Figure 9 shows the qualitative results. We report the recovery errors and compare our method with 8 different image inpainting methods in Table 4.
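A minimal sketch of how completion can be done with a learned generator: infer the latent factors from the observed pixels only (here by noisy gradient, i.e. Langevin, steps on the log posterior), then fill in the occluded pixels from g(z). The toy generator, mask, and hyper-parameters below are placeholders rather than the trained model or the exact procedure used in the paper.

```python
# A minimal sketch of pattern completion with a learned generator: infer the
# latent factors z from the observed pixels, then read off the occluded pixels
# from g(z). The toy generator stands in for the trained model.
import torch
import torch.nn as nn

z_dim = 16
generator = nn.Sequential(              # placeholder for the trained g(z; alpha)
    nn.Linear(z_dim, 64 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (64, 8, 8)),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())

def complete(x_occluded, mask, n_steps=200, step=0.03, sigma=0.3):
    """mask: 1 on observed pixels, 0 on occluded pixels (same shape as image)."""
    z = torch.randn(x_occluded.size(0), z_dim, requires_grad=True)
    for _ in range(n_steps):
        recon = generator(z)
        # Negative log posterior ~ ||mask*(x - g(z))||^2 / (2 sigma^2) + ||z||^2 / 2
        loss = ((mask * (x_occluded - recon)) ** 2).sum() / (2 * sigma ** 2) \
               + 0.5 * (z ** 2).sum()
        grad = torch.autograd.grad(loss, z)[0]
        with torch.no_grad():
            z -= 0.5 * step ** 2 * grad
            z += step * torch.randn_like(z)   # drop this line for MAP inference
    with torch.no_grad():
        recon = generator(z)
    return mask * x_occluded + (1 - mask) * recon   # keep the observed pixels

# Example: complete a random "image" whose central block is masked out.
x = torch.rand(1, 3, 32, 32) * 2 - 1    # placeholder image in [-1, 1]
m = torch.ones_like(x); m[:, :, 10:22, 10:22] = 0
completed = complete(x, m)
```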

(a) face

(b) forest road

(c) hotel room

Figure 9. Pattern completion. First row: original images. Second row: occluded images. Third row: images recovered by CoopNets. (a) face. (b) forest road. (c) hotel room.

Table 4: Comparison of face recovery performance of different methods in 3 experiments (M30, M40, and M50), measured by recovery error and PSNR.

Metric Experiment CoopNets DCGAN MRF-L1 MRF-L2 inter-1 inter-2 inter-3 inter-4 inter-5
error M30 0.115 0.211 0.132 0.134 0.120 0.120 0.265 0.120 0.120
error M40 0.124 0.212 0.148 0.149 0.135 0.135 0.314 0.135 0.135
error M50 0.136 0.214 0.178 0.179 0.170 0.166 0.353 0.164 0.164
PSNR M30 16.893 12.116 15.739 15.692 16.203 16.635 9.524 16.665 16.648
PSNR M40 16.098 11.984 14.834 14.785 15.065 15.644 8.178 15.698 15.688
PSNR M50 15.105 11.890 13.313 13.309 13.220 14.009 7.327 14.164 14.161

Experiment 7: Generating dynamic textures

We learn to generate video sequences by cooperative training of a spatial-temporal descriptor and a spatial-temporal generator. The descriptor consists of multiple layers of spatial-temporal filters that capture spatial-temporal features of the video sequences at various scales, while the generator maps the latent variables to the video sequences through multiple layers of spatial-temporal kernels. Figure 10 displays the results. To evaluate the quality of the synthesized examples, we compare our model with several baseline models for dynamic textures in terms of PSNR and structural similarity (SSIM) on 6 dynamic texture videos. Table 5 reports the average performance of the models over the 6 videos.
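As an illustration of what such spatial-temporal networks can look like, the sketch below builds a descriptor and a generator from 3D (space-time) convolutions. The layer sizes, kernel shapes, and latent dimension are assumptions, not the architecture used in the paper.

```python
# A minimal sketch of spatial-temporal descriptor and generator networks using
# 3D (space-time) convolutions. Layer sizes, kernel shapes, and the latent
# dimension are illustrative assumptions, not the architecture in the paper.
import torch
import torch.nn as nn

z_dim = 32  # latent factors for a whole video clip

# Descriptor: 3D ConvNet mapping a video (C x T x H x W) to a scalar score.
st_descriptor = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.Conv3d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 4 * 8 * 8, 1))

# Generator: top-down 3D deconvolution network mapping latent factors to a video.
st_generator = nn.Sequential(
    nn.Linear(z_dim, 64 * 4 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (64, 4, 8, 8)),
    nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose3d(32, 3, kernel_size=4, stride=2, padding=1), nn.Tanh())

# Shape check on a batch of two 16-frame 32 x 32 clips.
video = torch.randn(2, 3, 16, 32, 32)
print(st_descriptor(video).shape)                  # torch.Size([2, 1])
print(st_generator(torch.randn(2, z_dim)).shape)   # torch.Size([2, 3, 16, 32, 32])
```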

   

(a) burning fire heating a pot        (b) waterfall

   

(c) flashing lights        (d) water vapour

Figure 10. Generating dynamic textures by learning spatial-temporal CoopNets. In each example, the first video is the observed example, and the remaining two are synthesized examples generated by the learned model. (a) burning fire heating a pot. (b) waterfall. (c) flashing lights. (d) water vapour.

Table 5: A comparison of models for dynamic textures in terms of average PSNR and SSIM over the 6 videos.

Model PSNR SSIM
LDS 19.148 0.5939
FFT-LDS 12.463 0.2898
MKGPDM 14.288 0.3577
HOSVD 18.392 0.4573
CoopNets (ours) 19.407 0.5988

Acknowledgement

We thank Hansheng Jiang, Zilong Zheng, Erik Nijkamp, Tengyu Liu, Yaxuan Zhu, Zhaozhuo Xu, and Xiaolin Fang for their assistance with coding and experiments. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. The work is supported by a Hikvision gift fund, NSF DMS 1310391, DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.
