Learning Dynamic Generator Model
by Alternating Back-Propagation Through Time



Jianwen Xie 1*, Ruiqi Gao 2*, Zilong Zheng 2, Song-Chun Zhu 2, and Ying Nian Wu 2

(* Equal contributions)
1 Hikvision Research Institute, Santa Clara, USA
2 University of California, Los Angeles (UCLA), USA


Abstract

This paper studies the dynamic generator model for spatial-temporal processes such as dynamic textures and action sequences in video data. In this model, each time frame of the video sequence is generated by a generator model, which is a non-linear transformation of a latent state vector, where the non-linear transformation is parametrized by a top-down neural network. The sequence of latent state vectors follows a non-linear auto-regressive model, where the state vector of the next frame is a non-linear transformation of the state vector of the current frame as well as an independent noise vector that provides randomness in the transition. The non-linear transformation of this transition model can be parametrized by a feedforward neural network. We show that this model can be learned by an alternating back-propagation through time algorithm that iteratively samples the noise vectors and updates the parameters in the transition model and the generator model. We show that our training method can learn realistic models for dynamic textures and action patterns.
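The model and one learning iteration can be summarized in a short sketch. Below is a minimal illustration of the dynamic generator model and one step of alternating back-propagation through time: a Langevin inference step that samples the noise vectors by back-propagating the reconstruction error through time, followed by a gradient update of the transition and generator parameters given the sampled noise vectors. The released code is in TensorFlow; the PyTorch style, network sizes, and names used here (TransitionNet, EmissionNet, langevin_steps, delta, sigma) are illustrative assumptions rather than the authors' implementation, and the per-video appearance vector used for appearance consistency is omitted for brevity.

import torch
import torch.nn as nn

class TransitionNet(nn.Module):
    # Transition model: s_t = F([s_{t-1}, xi_t]), with i.i.d. Gaussian noise xi_t.
    def __init__(self, state_dim=20, noise_dim=20, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim), nn.Tanh())

    def forward(self, s_prev, xi):
        return self.net(torch.cat([s_prev, xi], dim=-1))

class EmissionNet(nn.Module):
    # Generator model: x_t = G(s_t), a top-down network from state vector to frame.
    def __init__(self, state_dim=20, frame_dim=64 * 64 * 3, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, frame_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)

def unroll(trans, emit, s0, xis):
    # Run the transition model through time and emit one frame per state.
    frames, s = [], s0
    for t in range(xis.shape[0]):
        s = trans(s, xis[t])
        frames.append(emit(s))
    return torch.stack(frames)          # shape (T, batch, frame_dim)

def abptt_step(trans, emit, video, s0, xis, opt,
               langevin_steps=20, delta=0.3, sigma=0.5):
    # Inference step: sample the noise vectors xi_{1:T} by Langevin dynamics,
    # back-propagating the reconstruction error through time.
    for _ in range(langevin_steps):
        xis = xis.detach().requires_grad_(True)
        recon = unroll(trans, emit, s0, xis)
        energy = ((video - recon) ** 2).sum() / (2 * sigma ** 2) \
                 + 0.5 * (xis ** 2).sum()
        grad = torch.autograd.grad(energy, xis)[0]
        xis = xis - 0.5 * delta ** 2 * grad + delta * torch.randn_like(xis)
    # Learning step: update the transition and generator parameters
    # given the sampled noise vectors.
    xis = xis.detach()
    opt.zero_grad()
    recon = unroll(trans, emit, s0, xis)
    ((video - recon) ** 2).sum().backward()
    opt.step()
    return xis

In use, opt would be an optimizer over the parameters of both networks (e.g., torch.optim.Adam over trans.parameters() and emit.parameters()), video a tensor of shape (T, batch, frame_dim), and xis initialized from a standard Gaussian of shape (T, batch, noise_dim).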

Paper

The paper can be downloaded here.

The tex file can be downloaded here.

The poster can be downloaded here.

Slides

The slides for the AAAI 2019 oral presentation can be downloaded here.

Code and Data

The Python code using TensorFlow can be downloaded here.

If you wish to use our code, please cite the following paper: 

Learning Dynamic Generator Model by Alternating Back-Propagation Through Time
Jianwen Xie*, Ruiqi Gao*, Zilong Zheng, Song-Chun Zhu, Ying Nian Wu
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI) 2019 

Experiments

Contents

Exp 1: Learn to generate dynamic textures
Exp 2: Learn to generate action patterns with appearance consistency
Exp 3: Learn from incomplete data for recovery
Exp 4: Learn to remove content
Exp 5: Learn to animate static image

Experiment 1: Learn to generate dynamic textures

 

 

Figure 1. Generating dynamic textures. For each category, the first video is the observed video, and the other three are videos synthesized by the learned model. The observed video is 60 frames long, while the synthesized videos are 120 frames long.

Experiment 2: Learn to generate action patterns with appearance consistency

Figure 2. Generating action patterns. (Top) Synthesizing human actions (Weizmann dataset). (Bottom) Synthesizing animal actions (animal action dataset). The first row shows the observed videos, while the second and third rows display two corresponding synthesized videos for each observed video. In the human action experiment, the synthesized videos contain more frames than the observed videos.


 

Figure 3. Video interpolation obtained by interpolating between the appearance latent vectors of the videos at the two ends. (Left) Melting. (Right) Blooming.
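For reference, a minimal sketch of the interpolation behind Figure 3, assuming a learned generator G(appearance, states) that takes a per-video appearance vector in addition to the state sequence; the function name and signature are placeholders for the learned model rather than a released API.

import torch

def interpolate_videos(G, a_left, a_right, states, num_steps=8):
    # Decode videos whose appearance vectors lie on the line segment
    # between the appearance vectors of the two end videos.
    videos = []
    for i in range(num_steps):
        w = i / (num_steps - 1)
        a = (1 - w) * a_left + w * a_right   # linear interpolation in latent space
        videos.append(G(a, states))
    return videos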

Experiment 3: Learn from incomplete data for recovery

 

 

Figure 4. Learn from incomplete data. In each example, the first video is the occluded training video, and the second is the recovered result.

Experiment 4: Learn to remove content

 

Figure 5. In each example, the first video is the original, and the second is the result after the target object has been removed by our algorithm. (Left) Removing a walking person in front of a fountain. (Right) Removing a moving boat on the lake.

Experiment 5: Learn to animate static image

 

 

Figure 6. Learn to animate static image. The first row displays the static image frames, and the second row shows the corresponding animations.

Acknowledgment

Part of the work was done while Ruiqi Gao was an intern at Hikvision Research Institute during the summer of 2018. She thanks Director Jane Chen for her help and guidance. The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and a Hikvision gift to UCLA. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Related References

[1] Xie, Jianwen, et al. "Synthesizing Dynamic Patterns by Spatial-Temporal Generative ConvNet." CVPR. 2017.

[2] Xie, Jianwen, et al. "Cooperative Training of Descriptor and Generator Networks." PAMI. 2018.

[3] Han, Tian, et al. "Alternating back-propagation for generator network." AAAI. 2017.

[4] Tulyakov, Sergey, et al. "MoCoGAN: Decomposing Motion and Content for Video Generation." CVPR. 2018.

[5] Doretto, Gianfranco, et al. "Dynamic textures." IJCV. 2003.
