Learning V1 cells with vector representations of local contents and matrix representations of local motions



Ruiqi Gao 1, Jianwen Xie 2, Song-Chun Zhu 1, and Ying Nian Wu 1


1 University of California, Los Angeles (UCLA), USA
2 Hikvision Research Institute, Santa Clara, USA


Abstract

Simple cells in primary visual cortex (V1) can be approximated by Gabor filters, and adjacent simple cells tend to have a quadrature-phase relationship. This paper entertains the hypothesis that a key purpose of such simple cells is to perceive local motions, i.e., displacements of pixels, caused by the relative motions between the agent and the surrounding environment. Specifically, we propose a representational model that couples the vector representations of local image contents with the matrix representations of local pixel displacements. When the image changes from one time frame to the next due to pixel displacements, the vector at each pixel is rotated by a matrix that represents the displacement of that pixel. We show that by learning from pairs of images that are deformed versions of each other, we can learn both the vector and matrix representations. The units in the learned vector representations reproduce properties of V1 simple cells. The learned model enables perceptual inference of local motions.

Paper

The paper can be downloaded here.

Code

The TensorFlow code is coming soon!

If you wish to use our code or results, please cite the following paper: 

@article{gao2019learning,
  title={Learning V1 cells with vector representations of local contents and matrix representations of local motions},
  author={Gao, Ruiqi and Xie, Jianwen and Zhu, Song-Chun and Wu, Ying Nian},
  journal={arXiv preprint arXiv:1902.03871},
  year={2019}
}

Background

 

Figure 1. (a) The primary visual cortex (V1) is the first step in representing and interpreting retinal image data (source: internet). (b) Cells in V1 respond to bars at different locations, orientations, and sizes (source: internet).

Representational Model

In this paper, we propose a representational scheme that couples the vector representation of static image contents with the matrix representation of changes caused by pixel displacements.

As illustrated by the diagram, the large rectangle stands for the image, the inner square for the local image content, and the center dot for a single pixel. Among the arrows, the short arrow labeled δ(x) represents the displacement of a pixel, which stays inside the inner square; the longer arrows vt(x) are vectors representing the local image content, and they rotate as the image deforms under the pixel displacements.
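
To make the coupling concrete, below is a minimal NumPy sketch of the parametric version of this idea: vt(x) is split into 2D sub-vectors, and each sub-vector is rotated by a 2x2 matrix whose angle depends linearly on the displacement δ(x). The function names and the frequency vectors are illustrative assumptions for exposition, not the paper's actual implementation.

# Minimal sketch (assumed names, not the paper's code): rotate each 2D
# sub-vector of the content vector by an angle induced by the displacement.
import numpy as np

def rotation(theta):
    # 2x2 rotation matrix for a single 2D sub-vector
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def rotate_subvectors(v, delta, freqs):
    # v     : (2K,) content vector at one pixel, i.e., K two-dimensional sub-vectors
    # delta : (2,) displacement of that pixel
    # freqs : (K, 2) assumed frequency vectors; sub-vector k is rotated by freqs[k] . delta
    v_next = np.empty_like(v)
    for k in range(len(freqs)):
        theta = float(freqs[k] @ delta)
        v_next[2 * k:2 * k + 2] = rotation(theta) @ v[2 * k:2 * k + 2]
    return v_next

# Toy usage: a random content vector, displaced by half a pixel to the right.
K = 8
v_t = np.random.randn(2 * K)
freqs = np.random.randn(K, 2)   # stand-in for learned frequencies
v_t1 = rotate_subvectors(v_t, np.array([0.5, 0.0]), freqs)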

Experiments

Exp 1 : Learned units
Exp 2 : Inference of displacement field
Exp 3 : Unsupervised learning
Exp 4 : Multi-step frame animation
Exp 5 : Frame interpolation

Experiment 1: Learned units

Figure 2. Learned results on the original image pairs. (a) Learned units. Each block shows two learned units within the same sub-vector. (b) Fitted Gabor patterns. (c) Distributions of spatial-frequency bandwidth (in octaves) and spatial phase.

Figure 3. Learned results on band-pass image pairs. (a) Learned units. Each block shows two learned units within the same sub-vector. (b) Distribution of the Gabor envelope shapes in the width and length plane. (c) Differences in frequency, orientation, and phase between paired units within each sub-vector.

Experiment 2: Inference of displacement field

 

Figure 4. Examples of inference of the displacement field. For each block, from left to right: It, It+1, the ground-truth displacement field, and the displacement fields inferred by the original motion model, the local-mixing motion model, FlowNetCS (FNCS), and FlowNet2 (FN2). The displacement fields are color coded.
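
As a rough illustration of how such a representation could support this inference (an assumption for exposition, not the paper's code): at each pixel, search a discretized set of candidate displacements and keep the one whose motion matrix best aligns the rotated vt(x) with the observed vt+1(x). The sketch below reuses rotate_subvectors and freqs from the earlier snippet.

def infer_displacement(v_t, v_t1, candidates, freqs):
    # v_t, v_t1  : (H, W, 2K) vector maps of two consecutive frames
    # candidates : list of (2,) candidate displacements on a discretized grid
    # returns an (H, W, 2) displacement field
    H, W, _ = v_t.shape
    field = np.zeros((H, W, 2))
    for i in range(H):
        for j in range(W):
            scores = [rotate_subvectors(v_t[i, j], d, freqs) @ v_t1[i, j]
                      for d in candidates]
            field[i, j] = candidates[int(np.argmax(scores))]
    return field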

Experiment 3: Unsupervised learning

 

Figure 5. Examples of displacement fields inferred by unsupervised learning. The top row shows the observed image sequences, while the bottom row shows the inferred, color-coded displacement fields.

Table 1. Average distance between inferred and ground truth displacements
Method                      Inference error
FlowNetC                    1.324
FlowNetS                    1.316
FlowNetSD                   0.799
FlowNetCS                   0.713
FlowNet2                    0.686
Our model (no mixing)       0.884
Our model (local mixing)    0.444

Experiment 4: Multi-step frame animation

 

Figure 6. Examples of multi-step animation, learned with the non-parametric version of M. For each block, the first row shows the ground-truth frame sequence, while the second row shows the animated frame sequence.
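
For intuition, multi-step animation with such a representation could be sketched as repeatedly applying the motion matrix to the vector map of the first frame and decoding each resulting map back to pixels. The snippet below uses the parametric rotation from the first sketch (the non-parametric version of M would instead look up a learned matrix per discretized displacement), and decode stands in for a learned decoder; both are assumptions for illustration.

def animate(v_t, delta_field, freqs, decode, n_steps=5):
    # v_t         : (H, W, 2K) vector map of the starting frame
    # delta_field : (H, W, 2) displacement applied at each step
    # decode      : assumed learned mapping from a vector map back to an image
    H, W, _ = v_t.shape
    frames = []
    v = v_t
    for _ in range(n_steps):
        v = np.array([[rotate_subvectors(v[i, j], delta_field[i, j], freqs)
                       for j in range(W)]
                      for i in range(H)])
        frames.append(decode(v))
    return frames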

Experiment 5: Frame interpolation

 

Figure 7. Examples of frame interpolation, learned with the non-parametric version of M. For each block, the first and last frames are given, while the frames between them are interpolated.

Acknowledgment

The work reported in this paper is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and a Hikvision gift to UCLA. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
