Generative VoxelNet: Learning Energy-Based Models

for 3D Shape Synthesis and Analysis



Jianwen Xie 1*, Zilong Zheng 2*, Ruiqi Gao 2, Wenguan Wang 2,3, Song-Chun Zhu 2, and Ying Nian Wu 2

(* Equal contributions)
1 Cognitive Computing Lab, Baidu Research, USA
2 University of California, Los Angeles (UCLA), USA
3 ETH Zurich, Switzerland


Abstract

3D data contains rich geometric information about objects and scenes and is valuable for understanding the 3D physical world. With the recent emergence of large-scale 3D datasets, it becomes increasingly important to have a powerful 3D generative model for 3D shape synthesis and analysis. This paper proposes a deep 3D energy-based model to represent volumetric shapes. The maximum likelihood training of the model follows an “analysis by synthesis” scheme. The benefits of the proposed model are six-fold: first, unlike GANs and VAEs, the model training does not rely on any auxiliary models; second, the model can synthesize realistic 3D shapes by Markov chain Monte Carlo (MCMC); third, the conditional model can be applied to 3D object recovery and super-resolution; fourth, the model can serve as a building block in a multi-grid modeling and sampling framework for high-resolution 3D shape synthesis; fifth, the model can be used to train a 3D generator via MCMC teaching; sixth, the model, trained without supervision, provides a powerful feature extractor for 3D data that is useful for 3D object classification. Experiments demonstrate that the proposed model can generate high-quality 3D shape patterns and is useful for a wide variety of 3D shape analysis tasks.

Generative VoxelNet

Paper

The TPAMI journal paper can be downloaded here.

The TPAMI tex file can be downloaded here.

The CVPR conference paper can be downloaded here.

The CVPR tex file can be downloaded here.

The poster can be downloaded here.

Code and Data

The Python code, implemented in TensorFlow, can be downloaded here.

If you wish to use our code, please cite the following papers:

Learning Descriptor Networks for 3D Shape Synthesis and Analysis
Jianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, Ying Nian Wu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018 
Generative VoxelNet: Learning Energy-Based Models for 3D Shape Synthesis and Analysis
Jianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, Ying Nian Wu
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2020

Experiments

Contents

Exp 1 : Experiment on 3D object synthesis
Exp 2 : Experiment on 3D object recovery
Exp 3 : Experiment on 3D object super resolution
Exp 4 : Experiment on cooperative training of 3D generator
Exp 5 : Experiment on 3D object classification
Exp 6 : Experiment on 3D multi-grid modeling and sampling

Experiment 1: Generating 3D Objects

Each row displays one experiment: the first three 3D objects are observed training examples, columns 4 to 9 show six synthesized 3D objects, and columns 10 to 13 show the nearest neighbors retrieved from the training set for the last four synthesized objects.
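As a reference for the experiments below, here is a minimal PyTorch sketch of the two basic operations the model relies on: drawing samples from p(Y; θ) ∝ exp(f(Y; θ)) by Langevin dynamics, and the “analysis by synthesis” maximum likelihood update. The scoring network f_net and all hyperparameter values are hypothetical placeholders; the released implementation is in TensorFlow.

```python
import torch

def langevin_sample(f_net, y, n_steps=20, step_size=0.1):
    """Approximately sample from p(Y; theta) by Langevin dynamics."""
    for _ in range(n_steps):
        y = y.detach().requires_grad_(True)
        grad = torch.autograd.grad(f_net(y).sum(), y)[0]  # grad_Y f(Y; theta)
        # Y_{t+1} = Y_t + (s^2 / 2) * grad + s * noise
        y = y + 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(y)
    return y.detach()

def train_step(f_net, optimizer, observed):
    """One maximum likelihood update: raise f on data, lower it on samples."""
    synthesized = langevin_sample(f_net, torch.randn_like(observed))
    # The log-likelihood gradient is E_data[grad f] - E_model[grad f],
    # so we minimize the difference of the two means below.
    loss = f_net(synthesized).mean() - f_net(observed).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```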

Experiment 2: 3D Object Recovery

We can recover occluded data by sampling from the conditional distribution p(YM|YV, θ), which is learned from fully observed training pairs {(YM, YV)}, where YM is the masked part of the data and YV is the visible part. The sampling is again accomplished by Langevin dynamics, identical to the dynamics that samples from p(Y; θ) except that the visible part YV is kept fixed and only the masked part YM is updated. For each experiment shown below, the first row displays original 3D data as ground truth, the second row displays the corresponding corrupted data, and the third row displays the recovery results produced by the learned model.
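Below is a minimal sketch of the masked Langevin update in the same hypothetical PyTorch setting as the synthesis sketch above (f_net is the placeholder scoring network; mask marks the occluded voxels):

```python
import torch

def recover(f_net, y_corrupted, mask, n_steps=64, step_size=0.1):
    """Recover occluded voxels; mask is 1 on Y_M (occluded), 0 on Y_V (visible)."""
    y = y_corrupted.clone()
    for _ in range(n_steps):
        y = y.detach().requires_grad_(True)
        grad = torch.autograd.grad(f_net(y).sum(), y)[0]
        update = 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(y)
        y = y + mask * update  # update the masked part only; Y_V stays fixed
    return y.detach()
```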

Experiment 3: 3D Object Super Resolution

We can perform super-resolution on a low-resolution 3D object by sampling from p(Yhigh|Ylow, θ), which is learned from fully observed training pairs {(Yhigh, Ylow)}. In each iteration, we first up-scale Ylow by expanding each voxel into a d × d × d block of constant intensity (where d is the scaling ratio) to obtain an up-scaled version Y'high of Ylow, and then run Langevin dynamics starting from Y'high to obtain Yhigh. The first row displays original 3D data as ground truth, the second row displays the corresponding low-resolution (16 × 16 × 16) 3D data, and the third row displays the corresponding super-resolution (64 × 64 × 64) results produced by our learned model.
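A sketch of this procedure under the same assumptions, using nearest-neighbor up-scaling to expand each voxel into a d × d × d constant block (f_net_high is a placeholder model trained at the target resolution):

```python
import torch
import torch.nn.functional as F

def super_resolve(f_net_high, y_low, d=4, n_steps=64, step_size=0.1):
    """E.g., y_low of shape (batch, 1, 16, 16, 16) -> (batch, 1, 64, 64, 64) for d = 4."""
    # Expand each voxel into a d x d x d block of constant intensity.
    y = F.interpolate(y_low, scale_factor=d, mode='nearest')
    # Refine the up-scaled shape by Langevin dynamics under the learned model.
    for _ in range(n_steps):
        y = y.detach().requires_grad_(True)
        grad = torch.autograd.grad(f_net_high(y).sum(), y)[0]
        y = y + 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(y)
    return y.detach()
```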

Experiment 4: Cooperative Training of 3D Generator

We evaluate a 3D generator trained jointly with the 3D EBM via the cooperative learning (MCMC teaching) scheme [2, 3] on latent space interpolation and 3D object arithmetic experiments.
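A minimal sketch of one cooperative training step, reusing langevin_sample from the synthesis sketch above (f_net, g_net, and the latent dimension are hypothetical placeholders): the generator proposes initial samples, the EBM's Langevin dynamics revises them, the EBM learns from the contrast with observed data, and the generator learns to reproduce the revised samples from its own latent codes.

```python
import torch

def cooperative_step(f_net, g_net, f_opt, g_opt, observed, z_dim=100):
    z = torch.randn(observed.size(0), z_dim)
    y_init = g_net(z).detach()                  # generator proposal
    y_revised = langevin_sample(f_net, y_init)  # EBM revision (MCMC teaching)

    # EBM update: raise f on observed data, lower it on revised samples.
    f_loss = f_net(y_revised).mean() - f_net(observed).mean()
    f_opt.zero_grad()
    f_loss.backward()
    f_opt.step()

    # Generator update: chase the revised samples from the same latent codes.
    g_loss = ((g_net(z) - y_revised) ** 2).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```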

Exp 4.1: Interpolation

The following results show interpolation between the latent vectors of the 3D objects at the two ends. Our method learns a smooth 3D generator model that traces the manifold of the 3D data distribution.
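A short sketch of the interpolation, assuming a trained generator g_net mapping latent vectors to voxel grids (names are hypothetical):

```python
import torch

def interpolate(g_net, z1, z2, n=8):
    # Decode a linear path between the two latent codes.
    return [g_net((1 - a) * z1 + a * z2) for a in torch.linspace(0.0, 1.0, n)]
```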

Exp 4.2: 3D Object Arithmetic

The following shows 3D object arithmetic performed by the 3D generator network, which encodes semantic knowledge of 3D shapes in its latent space.
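A sketch of the arithmetic in the same hypothetical setting; in practice the latent codes would be inferred from observed shapes:

```python
def shape_arithmetic(g_net, z_a, z_b, z_c):
    # Analogous to word-vector arithmetic: result is "A minus B plus C".
    return g_net(z_a - z_b + z_c)
```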

Experiment 5: 3D Object Classification

We first train a single model on all categories of the ModelNet10 training set in an unsupervised manner, and then use the learned model as a feature extractor. Based on the extracted feature vectors, we train a multinomial logistic regression classifier on the labeled data and evaluate classification accuracy on the test set using the one-versus-all rule (a sketch of this protocol follows the table below). The following table shows 3D object classification results on the ModelNet10 dataset.

Method / Classification Accuracy
Geometry Image 88.4%
PANORAMA-NN 91.1%
ECC 90.0%
3D ShapeNets 83.5%
DeepPana 85.5%
SPH 79.8%
LFD 79.9%
VConv-DAE 80.5%
VoxNet 92.0%
3D-GAN 91.0%
3D-WINN 91.9%
Primitive GAN 92.2%
Generative VoxelNet (EBM) (ours) 92.4%
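Below is a sketch of this protocol; extract_features is a hypothetical stand-in for pooling the trained model's intermediate activations into a fixed-length vector per shape.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def classify(extract_features, train_voxels, train_labels, test_voxels):
    x_train = np.stack([extract_features(v) for v in train_voxels])
    x_test = np.stack([extract_features(v) for v in test_voxels])
    # One-versus-all logistic regression on the extracted features.
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(x_train, train_labels)
    return clf.predict(x_test)
```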

Experiment 6: 3D Multi-grid Modeling and Sampling

Following the multi-grid modeling and sampling framework [4], we learn one model at each of a sequence of grids (e.g., 16 × 16 × 16, 32 × 32 × 32, and 64 × 64 × 64 voxels). Sampling proceeds from the coarsest grid to the finest: the sample synthesized at one grid is up-scaled (each voxel expanded into a constant block, as in Experiment 3) and used to initialize the finite-step Langevin dynamics of the model at the next grid. The models at the different grids thus serve as building blocks for high-resolution 3D shape synthesis.
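A sketch of the multi-grid sampling loop, reusing langevin_sample from the synthesis sketch above; f_nets holds one placeholder model per grid, ordered from coarsest to finest:

```python
import torch
import torch.nn.functional as F

def multigrid_sample(f_nets, batch_size, base=16):
    # Start from noise at the coarsest grid (e.g., 16 x 16 x 16).
    y = torch.randn(batch_size, 1, base, base, base)
    for i, f_net in enumerate(f_nets):
        if i > 0:
            # Up-scale the previous grid's sample to initialize this grid.
            y = F.interpolate(y, scale_factor=2, mode='nearest')
        y = langevin_sample(f_net, y)  # refine at the current resolution
    return y
```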


Reference

[1] Jianwen Xie*, Yang Lu*, Song-Chun Zhu, Ying Nian Wu. "A Theory of Generative ConvNet." International Conference on Machine Learning. 2016. (*equal contribution)

[2] Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu. "Cooperative Training of Descriptor and Generator Networks." IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018.

[3] Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu. "Cooperative Learning of Energy-Based Model and Latent Variable Model via MCMC Teaching." The Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

[4] Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, Ying Nian Wu. "Learning Generative ConvNets via Multi-Grid Modeling and Sampling." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
