Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns



Jianwen Xie 1*, Ruiqi Gao 2*, Zilong Zheng 2, Song-Chun Zhu 2, and Ying Nian Wu 2

(* Equal contributions)
1 Hikvision Research Institute, Santa Clara, USA
2 University of California, Los Angeles (UCLA), USA


Abstract

Dynamic patterns are characterized by complex spatial and motion patterns. Understanding dynamic patterns requires a disentangled representational model that separates the factorial components. A commonly used model for dynamic patterns is the state space model, where the state evolves over time according to a transition model and generates the observed image frames according to an emission model. To model the motions explicitly, it is natural to base the model on the motions, or displacement fields, of the pixels. Thus, in the emission model, we let the hidden state generate the displacement field, which warps the trackable component in the previous image frame to generate the next frame, while adding a simultaneously emitted residual image to account for the change that cannot be explained by the deformation. The warping of the previous image accounts for the trackable part of the change between frames, while the residual image accounts for the intrackable part. We learn the model parameters with a maximum likelihood algorithm that iterates between inferring the latent noise vectors that drive the transition model and updating the parameters given the inferred latent vectors. Meanwhile, we adopt a regularization term that penalizes the norms of the residual images to encourage the model to explain the change between image frames by trackable motion. Unlike existing methods for dynamic patterns, we learn our model in an unsupervised setting, without ground truth displacement fields or optical flows. In addition, our model defines a notion of intrackability through the separation of the warped and residual components in each image frame. We show that our method can synthesize realistic dynamic patterns and disentangle appearance, trackable and intrackable motions. The learned model can be useful for motion transfer, and it naturally defines and measures the intrackability of a dynamic pattern.
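
The per-frame emission step described above can be sketched numerically. The following is a minimal NumPy/SciPy illustration (not the authors' TensorFlow implementation): a displacement field emitted from the hidden state warps the previous frame into the trackable component, and the residual image carries the intrackable part. The backward-sampling convention and the function name are assumptions for illustration.

import numpy as np
from scipy.ndimage import map_coordinates

def emit_next_frame(prev_frame, displacement, residual):
    """prev_frame: (H, W) array; displacement: (2, H, W) pixel offsets (dy, dx),
    as would be emitted from the hidden state; residual: (H, W) residual image."""
    H, W = prev_frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Backward warp: sample the previous frame at the displaced coordinates.
    coords = np.stack([ys + displacement[0], xs + displacement[1]])
    trackable = map_coordinates(prev_frame, coords, order=1, mode="nearest")
    return trackable + residual  # trackable (warped) part + intrackable (residual) part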

Paper

The paper can be downloaded here.

The tex file can be downloaded here.

The poster can be downloaded here.

Slides

The AAAI 2020 Oral presentation can be downloaded here.

Code and Data

The Python code, based on TensorFlow, is coming soon.

If you wish to use our code, please cite the following paper: 

Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns
Jianwen Xie*, Ruiqi Gao*, Zilong Zheng, Song-Chun Zhu, Ying Nian Wu
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI) 2020 

Experiments

Contents

Exp 1: Dynamic texture synthesis
Exp 2: Disentanglement of appearance and motion
Exp 3: Disentanglement of trackable and intrackable motions

Experiment 1: Dynamic texture synthesis

     

     

Figure 1. Each row shows one example of dynamic texture synthesis. For each example, the first column shows a single training video. The second and third columns show two synthesized videos generated by the learned model.
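
Once the model is learned from a single training video as in Figure 1, new videos can be synthesized by sampling fresh latent noise vectors and unrolling the transition and emission models. The sketch below is a hedged outline of this procedure; transition_net, emission_net, and warp are hypothetical stand-ins for the learned networks and the warping routine sketched earlier, not the released API.

import numpy as np

def synthesize(first_frame, init_state, transition_net, emission_net, warp, T=60, z_dim=16):
    """Unroll the learned model for T steps, driven by fresh latent noise vectors z_t."""
    frames, state, frame = [first_frame], init_state, first_frame
    for t in range(T):
        z = np.random.randn(z_dim)                 # i.i.d. Gaussian noise driving the transition
        state = transition_net(state, z)           # s_t = transition(s_{t-1}, z_t)
        displacement, residual = emission_net(state)
        frame = warp(frame, displacement) + residual
        frames.append(frame)
    return frames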

Experiment 2: Disentanglement of appearance and motion

   

   

Figure 2. Motion exchange. The first row shows two observed facial expression videos. The second row shows the result of motion exchange between the two observed videos.


           

           

Figure 3. Motion transfer. The first row shows a single training video. The second row shows some static appearance images. The third row shows the corresponding motion transfer results, where the motion learned from the training video is applied to the static images shown in the second row.


           

           

Figure 4. Motion transfer. The first row shows a single training video. The second row shows some static appearance images. The third row shows the corresponding motion transfer results, where the motion learned from the training video is applied to the static images shown in the second row.


       

       

Figure 5. Motion transfer. The first row shows a single training video. The second row shows some static appearance images. The third row shows the corresponding motion transfer results, where the motion learned from the training video is applied to the static images shown in the second row.
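
The motion exchange and motion transfer results above follow the same basic protocol: motion inferred from one video is re-applied to a different appearance. The sketch below assumes the learned model factors a video into per-frame displacement fields and residuals; infer_motion and warp are hypothetical helpers, and whether the source residuals are carried over to the new appearance is a design choice, shown here only as an option.

def motion_transfer(source_video, target_image, infer_motion, warp, use_residual=False):
    """Re-apply motion inferred from source_video to a static target_image.
    infer_motion is assumed to return (displacement, residual) pairs per frame."""
    frames = [target_image]
    frame = target_image
    for displacement, residual in infer_motion(source_video):
        frame = warp(frame, displacement)          # trackable motion carried over
        if use_residual:
            frame = frame + residual               # optionally carry over intrackable changes
        frames.append(frame)
    return frames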

Experiment 3: Disentanglement of trackable and intrackable motions

       

preference rate λ1 = 0.5

       

preference rate λ1 = 5

Figure 6. Unsupervised disentanglement of trackable and intrackable motions in a video of a burning fire heating a pot. The first column shows the original videos, the second column shows the trackable components, and the third column shows the intrackable components.


       

preference rate λ1 = 0.5

       

preference rate λ1 = 5

Figure 7. Unsupervised disentanglement of trackable and intrackable motions in a waving flag video. The first column shows the original videos, the second column shows the trackable components, and the third column shows the intrackable components.
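
The preference rate λ1 above is the weight of the regularization term that penalizes the norms of the residual images: a larger λ1 pushes the model to explain more of the frame-to-frame change by trackable warping, leaving less in the intrackable residual. The sketch below illustrates how such a penalty could enter a per-frame reconstruction objective and how an intrackability score could be read off the resulting decomposition; the exact forms used in the paper may differ, so treat these as illustrative.

import numpy as np

def frame_objective(frame, warped_prev, residual, lam1=0.5):
    """Per-frame data term plus lambda_1-weighted residual penalty (illustrative form)."""
    recon = np.sum((frame - (warped_prev + residual)) ** 2)   # reconstruction error
    penalty = lam1 * np.sum(residual ** 2)                    # discourages intrackable explanation
    return recon + penalty

def intrackability_score(frames, residuals, eps=1e-8):
    """Fraction of total frame-to-frame change carried by the residual (intrackable) part."""
    change = sum(np.sum((f1 - f0) ** 2) for f0, f1 in zip(frames[:-1], frames[1:]))
    resid = sum(np.sum(r ** 2) for r in residuals)
    return resid / (change + eps)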

Acknowledgment

This work is supported by DARPA XAI project N66001-17-2-4029, ARO project W911NF1810296, and ONR MURI project N00014-16-1-2007. We thank Yifei Xu for his assistance with the experiments. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Related References

[1] Xie, Jianwen, et al. "Learning Dynamic Generator Model by Alternating Back-Propagation Through Time." AAAI. 2019.

[2] Xie, Jianwen, et al. "Synthesizing Dynamic Patterns by Spatial-Temporal Generative ConvNet." CVPR. 2017.

[3] Gong, Haifeng, et al. "Intrackability: Characterizing Video Statistics and Pursuing Video Representations." IJCV. 2012.

[4] Doretto, Gianfranco, et al. "Dynamic textures." IJCV. 2003.
