Cooperative Training of Descriptor and Generator Networks



Jianwen Xie 1,2, Yang Lu 1,3, Ruiqi Gao 1, Song-Chun Zhu 1, and Ying Nian Wu 1

1 University of California, Los Angeles, USA
2 Hikvision Research Institute, Santa Clara, USA
3 Facebook, USA


Abstract

This paper studies the cooperative training of two generative models for image modeling and synthesis. Both models are parametrized by convolutional neural networks (ConvNets). The first model is a deep energy-based model, whose energy function is defined by a bottom-up ConvNet, which maps the observed image to the energy. We call it the descriptor network. The second model is a generator network, which is a non-linear version of factor analysis. It is defined by a top-down ConvNet, which maps the latent factors to the observed image. The maximum likelihood learning algorithms of both models involve MCMC sampling such as Langevin dynamics. We observe that the two learning algorithms can be seamlessly interwoven into a cooperative learning algorithm that can train both models simultaneously. Specifically, within each iteration of the cooperative learning algorithm, the generator model generates initial synthesized examples to initialize a finite-step MCMC that samples and trains the energy-based descriptor model. After that, the generator model learns from how the MCMC changes its synthesized examples. That is, the descriptor model teaches the generator model by MCMC, so that the generator model accumulates the MCMC transitions and reproduces them by direct ancestral sampling. We call this scheme MCMC teaching. We show that the cooperative algorithm can learn highly realistic generative models.
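For reference, the two models and their updates can be summarized as follows. This is a sketch in our own notation: f(x; θ) is the bottom-up ConvNet score of the descriptor relative to a Gaussian reference N(0, s²I), g(z; α) is the top-down ConvNet of the generator with residual noise N(0, σ²I), δ is the Langevin step size, and ε_τ ~ N(0, I) is the injected noise.

```latex
% Descriptor: p(x; \theta) \propto \exp[f(x; \theta)] \, \mathcal{N}(x; 0, s^2 I).
% Langevin sampling of synthesized images (Step D1 in the flow chart below):
x_{\tau+1} = x_\tau - \frac{\delta^2}{2}\Big[\frac{x_\tau}{s^2} - \frac{\partial}{\partial x} f(x_\tau; \theta)\Big] + \delta\, \epsilon_\tau .
% Descriptor update (Step D2), with observed \{x_i\} and synthesized \{\tilde{x}_i\}:
\Delta\theta \propto \frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial\theta} f(x_i; \theta)
  - \frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}} \frac{\partial}{\partial\theta} f(\tilde{x}_i; \theta) .

% Generator: x = g(z; \alpha) + e, \; z \sim \mathcal{N}(0, I_d), \; e \sim \mathcal{N}(0, \sigma^2 I).
% Langevin inference of latent factors (Step G1):
z_{\tau+1} = z_\tau + \frac{\delta^2}{2}\Big[\frac{1}{\sigma^2}
  \Big(\frac{\partial g(z_\tau; \alpha)}{\partial z}\Big)^{\!\top}\big(x - g(z_\tau; \alpha)\big) - z_\tau\Big] + \delta\, \epsilon_\tau .
% Generator update (Step G2), reconstructing examples from their latent factors:
\Delta\alpha \propto \sum_i \frac{\partial}{\partial\alpha}\Big[-\frac{1}{2\sigma^2}\,\|x_i - g(z_i; \alpha)\|^2\Big] .
```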

Cooperative Learning

 

Figure: (a) The flow chart for training the descriptor net. The update in Step D2 is based on the difference between the observed examples and the synthetic examples. The Langevin sampling of the synthetic examples from the current model in Step D1 can be time-consuming. (b) The flow chart for training the generator net. The update in Step G2 is based on the observed examples and their inferred latent factors. The Langevin sampling of the latent factors from the current posterior distribution in Step G1 can be time-consuming. (c) The flow chart of the cooperative learning of descriptor and generator nets. The part of the flow chart for training the descriptor is similar to Algorithm D in (a), except that the D1 Langevin sampling is initialized from the initial synthetic examples supplied by the generator. The part of the flow chart for training the generator can also be mapped to Algorithm G in (b), except that the revised synthetic examples play the role of the observed examples, and the known generated latent factors can be used as inferred latent factors (or be used to initialize the G1 Langevin sampling of the latent factors).
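To map the flow chart to code, here is a minimal PyTorch sketch of one cooperative iteration (Steps D1, D2, and G2). The toy descriptor and generator architectures, the Langevin settings, and the optimizer choices are illustrative assumptions, not the configuration used in the paper.

```python
# A minimal PyTorch sketch of one CoopNets iteration (MCMC teaching).
# The network sizes, Langevin settings, and optimizers below are illustrative
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn

img_size, z_dim = 32, 16

# Descriptor: bottom-up ConvNet mapping an image to a scalar score f(x; theta).
descriptor = nn.Sequential(
    nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 8 * 8, 1))

# Generator: top-down network mapping latent factors z to an image g(z; alpha).
generator = nn.Sequential(
    nn.Linear(z_dim, 64 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (64, 8, 8)),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())

opt_d = torch.optim.Adam(descriptor.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)

def langevin_revise(x, n_steps=15, step=0.002, ref_sigma=0.3):
    """Step D1: finite-step Langevin dynamics on images under the descriptor."""
    x = x.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        # Energy = Gaussian reference term minus the ConvNet score f(x; theta).
        energy = (x ** 2).sum() / (2 * ref_sigma ** 2) - descriptor(x).sum()
        grad = torch.autograd.grad(energy, x)[0]
        x = (x - 0.5 * step ** 2 * grad + step * torch.randn_like(x)).detach()
    return x

def coopnets_iteration(x_obs):
    n = x_obs.size(0)
    # The generator proposes initial synthesized examples by ancestral sampling.
    z = torch.randn(n, z_dim)
    x_init = generator(z).detach()
    # The descriptor revises them by MCMC: the "teaching" signal.
    x_revised = langevin_revise(x_init)
    # Step D2: raise the score on observed data, lower it on revised samples.
    loss_d = descriptor(x_revised).mean() - descriptor(x_obs).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Step G2: the generator learns to reproduce the MCMC revision from the
    # same latent factors z that generated x_init.
    loss_g = ((generator(z) - x_revised) ** 2).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# Example: one iteration on a random batch standing in for observed images.
print(coopnets_iteration(torch.randn(8, 3, img_size, img_size)))
```

In this sketch, the latent factors z that produced the initial synthesized examples are reused directly when the generator learns from the revised examples; as noted in the caption, they could instead be used to initialize a G1 Langevin refinement.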

Code and Paper

The code for the CoopNets algorithm can be downloaded from: (i) Matlab with MatConvNet, (ii) Python with TensorFlow, and (iii) Python with PyTorch.

The code for spatial-temporal CoopNets can be downloaded from: Python with TensorFlow.

If you wish to use our code, please cite the following papers:

Cooperative Training of Descriptor and Generator Networks
Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu
arXiv preprint arXiv:1609.09408, 2016

Cooperative Learning of Energy-Based Model and Latent Variable Model via MCMC Teaching
Jianwen Xie, Yang Lu, Ruiqi Gao, Ying Nian Wu
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018

TeX

The TeX file can be downloaded here.

Experiments

Contents

Exp 1 : Experiment on generating texture patterns
Exp 2 : Experiment on generating object patterns
Exp 3 : Experiment on generating scene patterns
Exp 4 : Experiment on generating handwritten digits
Exp 5 : Experiment on large-scale benchmark datasets
Exp 6 : Experiment on pattern completion
Exp 7 : Experiment on generating dynamic textures

Experiment 1: Generating texture patterns (stationary CoopNets)

 

Figure 1. Generating texture patterns. For each category, the first image displays the training image, and the rest are 3 of the images generated by the CoopNets algorithm.

Experiment 2: Generating object patterns (non-stationary CoopNets)


Figure 2. Generating object patterns. Each row displays one object experiment, where the first 3 images are 3 of the training images, and the rest are 6 of the synthesized images.

  

Figure 3. Left: Average softmax class probability on a single ImageNet category versus the number of training images. Middle: Top-5 classification error. Right: Average pairwise structural similarity.

Experiment 3: Generating scene patterns (non-stationary CoopNets)


Figure 4. Generating scene patterns. Both observed and synthesized scene images are shown for each category. The image size is 64 × 64 pixels. The categories are from the MIT Places205 dataset. (a) volcano. (b) desert. (c) rock. (d) apartment building.


  

(a) observed images        (b) synthesized images

Figure 5. Images generated by CoopNets learned from 10 ImageNet scene categories. The training set consists of 1,100 images randomly sampled from each category, for a total of 11,000 training images.



Figure 6. Interpolation between the latent vectors of the images at the two ends.
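The interpolation in Figure 6 simply decodes points on the line segment between two latent vectors with the learned generator. A minimal sketch, with a small placeholder network standing in for the trained generator:

```python
# A minimal sketch of latent-space interpolation: decode evenly spaced points
# on the segment between two latent vectors. The tiny fully connected network
# below is only a placeholder for the trained top-down generator g(z; alpha).
import torch
import torch.nn as nn

z_dim = 16
generator = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                          nn.Linear(256, 3 * 64 * 64), nn.Tanh())

z1, z2 = torch.randn(z_dim), torch.randn(z_dim)
with torch.no_grad():
    frames = [generator(((1 - t) * z1 + t * z2).unsqueeze(0)).view(3, 64, 64)
              for t in torch.linspace(0, 1, steps=8)]
print(len(frames), frames[0].shape)  # 8 interpolated 3 x 64 x 64 images
```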

Table 1: Inception scores of different methods on learning from 10 ImageNet scene categories. n is the number of training images randomly sampled from each category.

Method n=50 n=100 n=300 n=500 n=700 n=900 n=1100
CoopNets 2.66 ± .13 3.04 ± .13 3.41 ± .13 3.48 ± .08 3.59 ± .11 3.65 ± .07 3.79 ± .15
DCGAN 2.26 ± .16 2.50 ± .15 3.16 ± .15 3.05 ± .12 3.13 ± .09 3.34 ± .05 3.47 ± .06
EBGAN 2.23 ± .17 2.40 ± .14 2.62 ± .08 2.46 ± .09 2.65 ± .04 2.64 ± .04 2.75 ± .08
W-GAN 1.80 ± .09 2.19 ± .12 2.34 ± .06 2.62 ± .08 2.86 ± .10 2.88 ± .07 3.14 ± .06
VAE 1.62 ± .09 1.63 ± .06 1.65 ± .05 1.73 ± .04 1.67 ± .03 1.72 ± .02 1.73 ± .02
InfoGAN 2.21 ± .04 1.73 ± .01 2.15 ± .03 2.42 ± .05 2.47 ± .05 2.29 ± .03 2.08 ± .04
DDGM 2.65 ± .17 1.05 ± .03 3.27 ± .14 3.42 ± .09 3.47 ± .13 3.41 ± .08 3.34 ± .11
Algorithm G 1.72 ± .07 1.94 ± .09 2.32 ± .09 2.40 ± .06 2.45 ± .05 2.54 ± .05 2.61 ± .06
Persistent CD 1.30 ± .08 1.94 ± .03 1.80 ± .02 1.53 ± .02 1.45 ± .04 1.35 ± .02 1.51 ± .02
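For reference, the Inception score reported in Table 1 is the exponential of the average KL divergence between a pretrained classifier's conditional label distribution p(y|x) on the synthesized images and its marginal p(y). A minimal sketch of the final computation, assuming the classifier's softmax outputs have already been collected (the number of splits and the random stand-in data are assumptions):

```python
# A minimal sketch of the Inception score, IS = exp( E_x KL( p(y|x) || p(y) ) ),
# computed from an array of per-image softmax class probabilities.
import numpy as np

def inception_score(probs, n_splits=10, eps=1e-12):
    """probs: (N, K) array, each row the classifier's softmax output p(y|x)."""
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)  # marginal p(y) within the split
        kl = (chunk * (np.log(chunk + eps) - np.log(p_y + eps))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))

# Example with random probabilities standing in for real classifier outputs.
fake_probs = np.random.dirichlet(np.ones(1000), size=5000)
print(inception_score(fake_probs))
```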

Experiment 4: Generating handwritten digits

We conduct an experiment on learning CoopNets from the MNIST dataset of handwritten digits. The images are grayscale, with a size of 28 × 28 pixels. Figure 7 displays some examples synthesized by the learned models after training. Table 2 shows a comparison of Gaussian Parzen window log-likelihood estimates on the MNIST testing set.

  

Figure 7. Generating handwritten digits by CoopNets after training.

Table 2: A comparison of Parzen window-based log-likelihood estimates for the MNIST dataset.

Model Log-likelihood
DBN 138 ± 2.0
Stacked CAE 121 ± 1.6
Deep GSN 214 ± 1.1
GAN 225 ± 2.0
Generator in CoopNets (ours) 226 ± 2.1
Descriptor in CoopNets (ours) 228 ± 2.1
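For reference, the estimates in Table 2 fit a Gaussian Parzen window (a kernel density estimate) to samples drawn from each model and evaluate the average log-density of the MNIST test images under it. A minimal sketch of this computation; the bandwidth sigma is normally tuned on a validation split, and the value and array sizes below are placeholders:

```python
# A minimal sketch of the Gaussian Parzen window log-likelihood estimate:
# log p(x) = logsumexp_m[ -||x - s_m||^2 / (2 sigma^2) ] - log M - (d/2) log(2 pi sigma^2),
# averaged over the test set, where {s_m} are M generated samples.
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test, samples, sigma=0.2):
    """test: (N, D) test images (flattened); samples: (M, D) generated images."""
    d = test.shape[1]
    dist2 = ((test[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)  # (N, M)
    log_kernel = -dist2 / (2.0 * sigma ** 2)
    log_norm = np.log(samples.shape[0]) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(logsumexp(log_kernel, axis=1) - log_norm))

# Example with random arrays standing in for MNIST test images and model samples.
print(parzen_log_likelihood(np.random.rand(50, 784), np.random.rand(200, 784)))
```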

Experiment 5: Evaluation on large-scale benchmark datasets

     

(a) observed human face images        (b) observed bedroom images

     

(a) synthesized human face images        (b) synthesized bedroom images

Figure 8. (a) Generating human face images (128 × 128 pixels). The synthesized images are generated by the CoopNets algorithm learned from the CelebA dataset with 200K training images. (b) Generating bedroom images (256 × 256 pixels). The synthesized images are generated by the CoopNets algorithm learned from the LSUN bedroom dataset with 3,033K training images.

Table 3: The performance of CoopNets, DCGAN, W-GAN, and VAE on the LSUN bedroom, CelebA, and CIFAR-10 datasets with respect to the Fréchet Inception Distance (FID). Lower is better.

Method LSUN CelebA CIFAR-10
W-GAN 67.72 52.54 48.40
DCGAN 70.40 21.40 37.70
VAE 243.47 50.53 126.32
Generator in CoopNets (ours) 64.30 16.98 35.25
Descriptor in CoopNets (ours) 35.42 16.65 33.61
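For reference, the FID in Table 3 is the Fréchet distance between Gaussians fitted to Inception features of real and synthesized images. A minimal sketch of the final computation, assuming the feature matrices have already been extracted (the feature extractor is not shown, and the random inputs are stand-ins):

```python
# A minimal sketch of the Frechet Inception Distance between two sets of
# feature vectors (e.g. Inception pool features of real and generated images):
# FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}).
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real, feat_fake):
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * covmean))

# Example with random features standing in for Inception activations.
print(fid(np.random.randn(1000, 64), np.random.randn(1000, 64)))
```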

Experiment 6: Pattern completion

We conduct an experiment on learning from fully observed training images and then testing the learned model on completing occluded testing images. The purpose of this experiment is to check whether the learned generator model can generalize to the testing data. Figure 9 shows the qualitative results. We report the recovery errors and compare our method with 8 different image inpainting methods in Table 4.
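A minimal sketch of how completion can be done with a learned generator: infer the latent factors from the observed pixels only (here by noisy gradient, i.e. Langevin, steps on the log posterior), then fill in the occluded pixels from g(z). The toy generator, mask, and hyper-parameters below are placeholders rather than the trained model or the exact procedure used in the paper.

```python
# A minimal sketch of pattern completion with a learned generator: infer the
# latent factors z from the observed pixels, then read off the occluded pixels
# from g(z). The toy generator stands in for the trained model.
import torch
import torch.nn as nn

z_dim = 16
generator = nn.Sequential(              # placeholder for the trained g(z; alpha)
    nn.Linear(z_dim, 64 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (64, 8, 8)),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())

def complete(x_occluded, mask, n_steps=200, step=0.03, sigma=0.3):
    """mask: 1 on observed pixels, 0 on occluded pixels (same shape as image)."""
    z = torch.randn(x_occluded.size(0), z_dim, requires_grad=True)
    for _ in range(n_steps):
        recon = generator(z)
        # Negative log posterior ~ ||mask*(x - g(z))||^2 / (2 sigma^2) + ||z||^2 / 2
        loss = ((mask * (x_occluded - recon)) ** 2).sum() / (2 * sigma ** 2) \
               + 0.5 * (z ** 2).sum()
        grad = torch.autograd.grad(loss, z)[0]
        with torch.no_grad():
            z -= 0.5 * step ** 2 * grad
            z += step * torch.randn_like(z)   # drop this line for MAP inference
    with torch.no_grad():
        recon = generator(z)
    return mask * x_occluded + (1 - mask) * recon   # keep the observed pixels

# Example: complete a random "image" whose central block is masked out.
x = torch.rand(1, 3, 32, 32) * 2 - 1    # placeholder image in [-1, 1]
m = torch.ones_like(x); m[:, :, 10:22, 10:22] = 0
completed = complete(x, m)
```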

(a) face

(b) forest road

(c) hotel room

Figure 9. Pattern completion. First row: original images. Second row: occluded images. Third row: images recovered by CoopNets. (a) face. (b) forest road. (c) hotel room.

Table 4: Comparison of face recovery performance of different methods in 3 experiments (M30, M40, and M50), measured by recovery error and PSNR.

Metric Experiment CoopNets DCGAN MRF-L1 MRF-L2 inter-1 inter-2 inter-3 inter-4 inter-5
error M30 0.115 0.211 0.132 0.134 0.120 0.120 0.265 0.120 0.120
error M40 0.124 0.212 0.148 0.149 0.135 0.135 0.314 0.135 0.135
error M50 0.136 0.214 0.178 0.179 0.170 0.166 0.353 0.164 0.164
PSNR M30 16.893 12.116 15.739 15.692 16.203 16.635 9.524 16.665 16.648
PSNR M40 16.098 11.984 14.834 14.785 15.065 15.644 8.178 15.698 15.688
PSNR M50 15.105 11.890 13.313 13.309 13.220 14.009 7.327 14.164 14.161

Experiment 7: Generating dynamic textures

We learn to generate video sequences by cooperative training of a spatial-temporal descriptor and a spatial-temporal generator. The descriptor consists of multiple layers of spatial-temporal filters that capture spatial-temporal features of the video sequences at various scales, while the generator maps the latent variables to the video sequences through multiple layers of spatial-temporal kernels. Figure 10 displays the results. To evaluate the quality of the synthesized examples, we compare our model with several baseline models for dynamic textures in terms of PSNR and structural similarity (SSIM) on 6 dynamic texture videos. Table 5 reports the average performance of the models over the 6 videos.
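As an illustration of what such spatial-temporal networks can look like, the sketch below builds a descriptor and a generator from 3D (space-time) convolutions. The layer sizes, kernel shapes, and latent dimension are assumptions, not the architecture used in the paper.

```python
# A minimal sketch of spatial-temporal descriptor and generator networks using
# 3D (space-time) convolutions. Layer sizes, kernel shapes, and the latent
# dimension are illustrative assumptions, not the architecture in the paper.
import torch
import torch.nn as nn

z_dim = 32  # latent factors for a whole video clip

# Descriptor: 3D ConvNet mapping a video (C x T x H x W) to a scalar score.
st_descriptor = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.Conv3d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 4 * 8 * 8, 1))

# Generator: top-down 3D deconvolution network mapping latent factors to a video.
st_generator = nn.Sequential(
    nn.Linear(z_dim, 64 * 4 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (64, 4, 8, 8)),
    nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose3d(32, 3, kernel_size=4, stride=2, padding=1), nn.Tanh())

# Shape check on a batch of two 16-frame 32 x 32 clips.
video = torch.randn(2, 3, 16, 32, 32)
print(st_descriptor(video).shape)                  # torch.Size([2, 1])
print(st_generator(torch.randn(2, z_dim)).shape)   # torch.Size([2, 3, 16, 32, 32])
```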

   

(a) burning fire heating a pot        (b) waterfall

   

(c) flashing lights        (d) water vapour

Figure 10. Generating dynamic textures by learning spatial-temporal CoopNets. In each example, the first video is the observed example, and the remaining two are synthesized examples generated by the learned model. (a) burning fire heating a pot. (b) waterfall. (c) flashing lights. (d) water vapour.

Table 5: A comparison of models for dynamic textures in terms of average PSNR and SSIM over the 6 videos.

Model PSNR SSIM
LDS 19.148 0.5939
FFT-LDS 12.463 0.2898
MKGPDM 14.288 0.3577
HOSVD 18.392 0.4573
CoopNets (ours) 19.407 0.5988

Acknowledgement

We thank Hansheng Jiang, Zilong Zheng, Erik Nijkamp, Tengyu Liu, Yaxuan Zhu, Zhaozhuo Xu, and Xiaolin Fang for their assistance with coding and experiments. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. The work is supported by a Hikvision gift fund, NSF DMS 1310391, DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.
