Spatial-Temporal Generative ConvNet:

Learning Energy-Based Models for Video Synthesis



Jianwen Xie, Song-Chun Zhu, and Ying Nian Wu

University of California, Los Angeles (UCLA), USA


Abstract

Video sequences contain rich dynamic patterns, such as dynamic texture patterns that exhibit stationarity in the temporal domain, and action patterns that are non-stationary in either the spatial or the temporal domain. We show that a spatial-temporal generative ConvNet can be used to model and synthesize dynamic patterns. The model defines a probability distribution on the video sequence, and the log probability is defined by a spatial-temporal ConvNet that consists of multiple layers of spatial-temporal filters to capture spatial-temporal patterns of different scales. The model can be learned from the training video sequences by an “analysis by synthesis” learning algorithm that iterates the following two steps. Step 1 synthesizes video sequences from the currently learned model. Step 2 then updates the model parameters based on the difference between the synthesized video sequences and the observed training sequences. We show that the learning algorithm can synthesize realistic dynamic patterns.
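
Below is a minimal, hypothetical PyTorch sketch of the two-step "analysis by synthesis" loop described in the abstract (the released code, linked under Code and Data, is in Matlab and TensorFlow). The network architecture, Langevin step size, and iteration counts are illustrative placeholders rather than the paper's settings, and the Gaussian reference term is an assumption following the general generative ConvNet formulation.

import torch
import torch.nn as nn

class SpatialTemporalScorer(nn.Module):
    # Toy spatial-temporal ConvNet f(x; theta); the model is taken to be
    # p(x; theta) proportional to exp(f(x; theta)) q(x), with q a Gaussian reference.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(5, 7, 7), stride=(2, 3, 3)), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2)), nn.ReLU(),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x):                          # x: (batch, 3, frames, height, width)
        h = self.features(x).mean(dim=(2, 3, 4))   # pool over space-time
        return self.head(h).squeeze(-1)            # one score per video

def langevin_synthesis(model, x, n_steps=20, step_size=0.01, sigma=1.0):
    # Step 1: draw video sequences from the current model by Langevin dynamics.
    x = x.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        log_p = model(x).sum() - (x ** 2).sum() / (2 * sigma ** 2)
        grad = torch.autograd.grad(log_p, x)[0]
        x = (x + 0.5 * step_size ** 2 * grad
               + step_size * torch.randn_like(x)).detach()
    return x

def learn(model, observed, n_iters=100, lr=1e-4):
    # Step 2 (outer loop): update theta from the difference between the
    # synthesized sequences and the observed training sequences.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    synthesized = torch.randn_like(observed)
    for _ in range(n_iters):
        synthesized = langevin_synthesis(model, synthesized)
        # Maximum-likelihood gradient ~ E_data[df/dtheta] - E_model[df/dtheta]
        loss = model(synthesized).mean() - model(observed).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return synthesized

if __name__ == "__main__":
    videos = torch.randn(4, 3, 16, 64, 64)   # stand-in for real training sequences
    model = SpatialTemporalScorer()
    learn(model, videos, n_iters=5)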

Paper

The CVPR paper can be downloaded here.

The CVPR tex file can be downloaded here.

The TPAMI paper can be downloaded here.

The TPAMI tex file can be downloaded here.

The poster can be downloaded here.

Code and Data

The Matlab code and data for all experiments can be downloaded here.

The Python (TensorFlow) code for the synthesis experiment can be downloaded here or from GitHub.

If you wish to use our code, please cite the following papers:

Synthesizing Dynamic Patterns by Spatial-Temporal Generative ConvNet
Jianwen Xie, Song-Chun Zhu, Ying Nian Wu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017 
Learning Energy-based Spatial-Temporal Generative ConvNet for Dynamic Patterns
Jianwen Xie, Song-Chun Zhu, Ying Nian Wu
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2019

Experiments

Experiment 1: Generating dynamic textures with both spatial and temporal stationarity

In each example, the first video is the observed video, and the other three are synthesized videos.

Sea
Flowing water
Ocean
River
Water in a pool
Lake

Experiment 2: Generating dynamic textures with only temporal stationarity

Exp 2.1: Learning from one training video

In each example, the first video is the observed video, and the other three are synthesized videos.

Burning fire heating a pot
Waterfall
Flashing lights
Washing machine washing clothes
Boiling water
Mountain stream
Fountain
Spring water

Exp 2.2: Learning from multiple training videos

The first panel displays the 30 observed videos, and the second panel displays 39 synthesized videos.

Observed fire videos
Synthesized fire videos

Experiment 3: Generating action patterns without spatial or temporal stationarity

In the running-cow example, the first five videos are the originals and the rest are synthesized; in the running-tiger example, the first two are the originals and the rest are synthesized.

Running cows

Running tigers

Experiment 4: Learning from incomplete data

In each example, the first video is the ground truth, the second is the occluded training video, and the third is the recovered result; a toy sketch of the recovery step follows this list.

Single region masks
50% missing frames
50% salt and pepper masks
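
Below is a minimal, hypothetical sketch of the recovery step referenced above: it runs the same Langevin synthesis as the sketch after the abstract, but resets the observed pixels after every update so that only the masked region is synthesized. The clamping scheme and hyper-parameters are illustrative assumptions, and the model argument can be any scorer such as the toy SpatialTemporalScorer sketched earlier; the exact completion procedure is described in the paper and the released code.

import torch

def recover(model, occluded, mask, n_steps=200, step_size=0.01, sigma=1.0):
    # mask == 1 where pixels are observed, 0 where they are missing.
    # Langevin updates are run on the whole video, but observed pixels are
    # clamped to their known values after every step (an assumption made
    # for illustration, not necessarily the paper's exact procedure).
    x = torch.where(mask.bool(), occluded, torch.randn_like(occluded))
    for _ in range(n_steps):
        x.requires_grad_(True)
        log_p = model(x).sum() - (x ** 2).sum() / (2 * sigma ** 2)
        grad = torch.autograd.grad(log_p, x)[0]
        x = (x + 0.5 * step_size ** 2 * grad
               + step_size * torch.randn_like(x)).detach()
        x = torch.where(mask.bool(), occluded, x)   # clamp observed pixels
    return x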

Experiment 5: Background inpainting

In each example, the first video is the original video, the second is the video with a black mask occluding the object to be removed, and the third is the inpainting result produced by our algorithm.

Moving boat

Walking person

Acknowledgement

This work is supported by NSF DMS 1310391, DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579. We thank Zilong Zheng for his assistance with the Python implementation.

