3D Representation and 2D Representation, Which is Better?
When learning to recognize images of an object seen from different views, should we learn a few key views or learn a 3D model of the object?

In human vision research, this question has been widely discussed. Researchers have engaged in a well-documented debate [1,2,3,4], but no final conclusion has been reached.
Our simple answer to this question is: different objects should be represented differently, and representation efficiency is the key to deciding which representation is better.
In fact, we believe the two representations can be combined to achieve better representation efficiency, as in the desk globe images below. For the desk globe, the viewer-centered representation fits the globe part well, since that pattern is a circle in almost all desk globe images. For the base and handle, we believe the object-centered representation is more efficient, since the image patterns vary across views but are all projections of the same 3D shape.

Inspired by the active basis model [6], we have found a method to automatically learn this desired representation; the black/red sketches above compose the resulting templates of our algorithm. We have designed a mixed representation containing both 3D and 2D primitives, which correspond to the object-centered and viewer-centered representations respectively.
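
To make the mixed representation concrete, here is a minimal Python sketch, not the implementation from [5]. The class names `Primitive2D` and `Primitive3D`, the `render` function, and the toy coordinates are all hypothetical, and the sketch assumes orthographic projection with rotation about the vertical axis only. It shows the key difference between the two kinds of primitives: a 2D primitive keeps its image position in every view, while a 3D primitive is projected anew for each viewpoint.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive2D:
    """Viewer-centered sketch: a fixed image position, identical in every view."""
    x: float
    y: float


@dataclass(frozen=True)
class Primitive3D:
    """Object-centered sketch: a point in the object's own 3D frame."""
    x: float
    y: float
    z: float


def render(template, azimuth_deg):
    """Return image positions of all primitives for one viewpoint.

    Assumption: orthographic projection, rotation about the vertical (y) axis.
    """
    t = math.radians(azimuth_deg)
    points = []
    for p in template:
        if isinstance(p, Primitive2D):
            # 2D primitives do not depend on the viewpoint.
            points.append((p.x, p.y))
        else:
            # 3D primitives are rotated, then projected onto the image plane.
            points.append((p.x * math.cos(t) + p.z * math.sin(t), p.y))
    return points


# Toy template: one viewer-centered and one object-centered primitive.
template = [Primitive2D(0.0, 1.0), Primitive3D(1.0, 0.0, 0.0)]
front = render(template, 0)
side = render(template, 90)
```

In this toy example the 2D primitive lands at the same image position in `front` and `side`, while the 3D primitive moves from the right edge toward the image center as the object turns.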

By using the same probabilistic model for both types of primitives, we are able to evaluate their information contribution to a given set of object images under different views. Using the computed information contribution as an index, we automatically select 3D and 2D primitives and mix them to form intuitive object templates. The figure below shows some of the learned object templates from different object categories.
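
The selection step can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm from [5]: the per-image gains are made-up numbers standing in for scores from the shared probabilistic model, the contribution is taken to be their plain average, and the function names, primitive names, and threshold are hypothetical. The point it illustrates is that, because both primitive types are scored on the same scale, they can be ranked together and mixed freely.

```python
from statistics import mean


def information_contribution(per_image_gains):
    """Average a primitive's gain over the training images.

    Assumption: the per-image gains were already computed by a shared
    probabilistic model, so 2D and 3D scores are directly comparable.
    """
    return mean(per_image_gains)


def select_primitives(candidates, max_count, min_gain=0.0):
    """Rank 2D and 3D candidates together and keep the most informative ones."""
    scored = [
        (information_contribution(gains), name, kind)
        for name, kind, gains in candidates
    ]
    # Drop candidates that contribute no information, then sort best-first.
    scored = [s for s in scored if s[0] > min_gain]
    scored.sort(reverse=True)
    return [(name, kind) for _, name, kind in scored[:max_count]]


# Hypothetical candidates: (name, type, gain on each training image).
candidates = [
    ("globe_circle", "2D", [2.0, 2.5, 1.5]),
    ("base_edge", "3D", [1.0, 1.5, 0.5]),
    ("handle_arc", "3D", [1.5, 2.0, 1.0]),
    ("background_blob", "2D", [-0.5, 0.0, -0.5]),
]
selected = select_primitives(candidates, max_count=3)
```

With these toy numbers, `selected` mixes one 2D primitive with two 3D primitives, and the uninformative background candidate is discarded.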


With the 3D model in hand, matching simply amounts to trying each angle and checking whether the template fits a test image well. Given a viewpoint, the ideal projection scenario would look like this:
We ran experiments on the UIUC 3D car dataset and tested the performance of our 3D model on car pose estimation and object detection. The figure below shows our pose estimation performance and some of the projected templates using the estimated pose.
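
The matching idea of trying each angle and keeping the best fit can be sketched in a few lines of Python. This is a simplified stand-in, not the method evaluated above: it assumes orthographic projection with rotation about the vertical axis, represents the model as bare 3D points, and replaces the probabilistic template score with a nearest-point distance. The names `project`, `match_score`, and `estimate_azimuth` and the toy model are hypothetical.

```python
import math


def project(points3d, azimuth_deg):
    """Orthographic projection after rotating about the vertical (y) axis."""
    t = math.radians(azimuth_deg)
    return [(x * math.cos(t) + z * math.sin(t), y) for x, y, z in points3d]


def match_score(projected, observed):
    """Higher is better: negative sum of distances from each projected
    point to its nearest observed point (a stand-in for a likelihood)."""
    return -sum(min(math.dist(p, q) for q in observed) for p in projected)


def estimate_azimuth(points3d, observed, candidate_angles):
    """Try every candidate angle and return the one whose projection fits best."""
    return max(
        candidate_angles,
        key=lambda a: match_score(project(points3d, a), observed),
    )


# Toy asymmetric model and a synthetic "test image" generated at 40 degrees.
model = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.5), (-0.5, 0.5, 1.0), (0.3, -1.0, -0.8)]
observed = project(model, 40)
best = estimate_azimuth(model, observed, range(0, 360, 10))
```

Because the toy observation is generated from the model itself, the search recovers the generating angle exactly; on real images the score would be noisy and the angular step would limit the achievable pose accuracy.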

References
- I. Biederman and M. Bar. One-shot viewpoint invariance in matching novel objects. Vision Research, 1999.
- I. Biederman and P. C. Gerhardstein. Viewpoint-dependent mechanisms in visual object recognition: Reply to Tarr and Bülthoff (1995). Journal of Experimental Psychology: Human Perception and Performance, 1995.
- W. G. Hayward and M. J. Tarr. Testing conditions for viewpoint invariance in object recognition. Journal of Experimental Psychology: Human Perception and Performance, 1997.
- M. J. Tarr and H. H. Bülthoff. Is human object recognition better described by geon structural descriptions or by multiple views? Comment on Biederman and Gerhardstein (1993). Journal of Experimental Psychology: Human Perception and Performance, 1995.
- W. Hu and S.C. Zhu. Learning a probabilistic model mixing 3D and 2D primitives for view invariant object recognition. CVPR, 2010.
- Y.N. Wu, Z. Si, H. Gong and S.C. Zhu. Learning active basis model for object detection and recognition. IJCV, 2010.