IEEE Conference on Computer Vision and Pattern Recognition, 2014

Single View 3D Scene Parsing by Attributed Grammar                                  

Xiaobai Liu, Yibiao Zhao and Song-Chun Zhu

Center for Vision, Cognition, Learning and Art, UCLA, CA


In this paper, we present an attributed grammar for parsing man-made outdoor scenes into semantic surfaces while simultaneously recovering their 3D models. The grammar takes superpixels as its terminal nodes and applies five production rules to organize the scene into a hierarchical parse graph. Each graph node corresponds to a surface, or a composite of surfaces, in the 3D world or the 2D image. Nodes are described by attributes for the global scene model, e.g. focal length and vanishing points, or for surface properties, e.g. surface normal, contact lines with other surfaces, and relative spatial location. Each production rule is associated with equations that constrain the attributes of a parent node and those of its child nodes. Given an input image, our goal is to construct a hierarchical parse graph by recursively applying the five grammar rules while satisfying the attribute constraints. We develop an effective top-down/bottom-up cluster sampling procedure that explores this constrained space efficiently. We evaluate our method on both public benchmarks and newly built datasets, and achieve state-of-the-art performance in layout estimation and region segmentation. We also demonstrate that our method can recover detailed 3D models with relaxed Manhattan structures, which clearly advances the state of the art in single-view 3D reconstruction.



Automatically creating a high-quality 3D model from a single view provides background context for other high-level vision tasks, e.g. human activity recognition. This is a challenging problem due to its ill-posed nature. However, given an image of a man-made outdoor scene, humans can recognize its 3D structure effortlessly. We conjecture that humans make 3D inferences using commonsense knowledge: most objects rest on the ground due to gravity, buildings usually stand upright, and man-made scenes often exhibit Manhattan-type structure, i.e., parallel lines in the world converge at vanishing points in the image. Recently, researchers have also tried to use physical laws to guide 3D reconstruction. Integrating these cues can certainly improve performance, but an open problem is how to select the most useful knowledge during inference.
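The vanishing-point cue mentioned above has a simple geometric form: in homogeneous coordinates, the line through two image points is their cross product, and the intersection of two lines is again a cross product. The following sketch (illustrative coordinates, not from the paper) recovers a vanishing point from two image segments that come from parallel world edges.

```python
# Vanishing point as the intersection of two image lines, using
# homogeneous coordinates. All coordinates below are hypothetical,
# chosen only to illustrate the computation.

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def intersect(l1, l2):
    """Intersection of two homogeneous lines, dehomogenized."""
    x, y, w = cross(l1, l2)
    return (x / w, y / w)

# Two segments from parallel world edges (e.g. the top and bottom
# edges of a facade) converge at the vanishing point:
l1 = line_through((0.0, 0.0), (2.0, 1.0))   # y = x/2
l2 = line_through((0.0, 2.0), (2.0, 2.5))   # y = x/4 + 2
vp = intersect(l1, l2)                      # → (8.0, 4.0)
```

If the world lines were also parallel in the image, the intersection would be a point at infinity (w = 0), which is why the dehomogenization step can fail for degenerate inputs.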

In this paper, we present a simple attributed grammar for the 3D parsing of man-made scenes. The basic observation is that, just as a language generates a large number of sentences from a small vocabulary through a few grammar rules, the visual patterns in a scene can be decomposed hierarchically into primitives through a few grammar rules.
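The parse-graph idea above can be made concrete with a small data-structure sketch. All names and the example constraint here are hypothetical, not the authors' implementation; the point is only that each node carries attributes and that a production rule couples a parent's attributes to its children's.

```python
# Minimal sketch of an attributed parse graph (hypothetical names).
# Each node has a semantic label, an attribute dictionary (e.g. a
# surface normal), and child nodes produced by a grammar rule.

class ParseNode:
    def __init__(self, label, attributes, children=()):
        self.label = label            # e.g. "scene", "ground", "facade"
        self.attributes = attributes  # e.g. {"normal": (0.0, 1.0, 0.0)}
        self.children = list(children)

def check_constraint(node, constraint):
    """Recursively verify a parent/child attribute constraint."""
    ok = all(constraint(node, c) for c in node.children)
    return ok and all(check_constraint(c, constraint) for c in node.children)

def upright(parent, child):
    # Example constraint: a facade stands upright, so its surface
    # normal must be horizontal (zero vertical component).
    if child.label == "facade":
        return abs(child.attributes["normal"][1]) < 1e-6
    return True

scene = ParseNode("scene", {}, [
    ParseNode("ground", {"normal": (0.0, 1.0, 0.0)}),
    ParseNode("facade", {"normal": (1.0, 0.0, 0.0)}),
])
valid = check_constraint(scene, upright)   # → True
```

During inference, a sampler would propose rule applications and accept only those parse graphs whose attribute constraints hold (or score them softly), which is the role the constraints play in the paper's top-down/bottom-up procedure.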


Figure 1. A typical result of our approach. (a) input image overlaid with detected parallel lines; (b) segmentation of the scene layout; (c) and (d) synthesized images from novel viewpoints.





[1] Xiaobai Liu, Yibiao Zhao, and Song-Chun Zhu. Single-view 3D scene parsing by attributed grammar. In IEEE CVPR, 2014.



Xiaobai Liu, LXB at UCLA dot EDU