Learning V1 cells with vector representations of local contents and matrix representations of local motions



Ruiqi Gao 1, Jianwen Xie 2, Song-Chun Zhu 1, and Ying Nian Wu 1


1 University of California, Los Angeles (UCLA), USA
2 Hikvision Research Institute, Santa Clara, USA


Abstract

Simple cells in primary visual cortex (V1) can be approximated by Gabor filters, and adjacent simple cells tend to have a quadrature-phase relationship. This paper entertains the hypothesis that a key purpose of such simple cells is to perceive local motions, i.e., displacements of pixels, caused by the relative motions between the agent and the surrounding environment. Specifically, we propose a representational model that couples the vector representations of local image contents with the matrix representations of local pixel displacements. When the image changes from one time frame to the next due to pixel displacements, the vector at each pixel is rotated by a matrix that represents the displacement of that pixel. We show that by learning from pairs of images that are deformed versions of each other, we can learn both the vector and matrix representations. The units in the learned vector representations reproduce properties of V1 simple cells. The learned model enables perceptual inference of local motions.

Paper

The paper can be downloaded here.

Code

The TensorFlow code is coming soon!

If you wish to use our code or results, please cite the following paper: 

@article{gao2019learning,
  title={Learning V1 cells with vector representations of local contents and matrix representations of local motions},
  author={Gao, Ruiqi and Xie, Jianwen and Zhu, Song-Chun and Wu, Ying Nian},
  journal={arXiv preprint arXiv:1902.03871},
  year={2019}
}

Background

 

Figure 1. (a) The primary visual cortex (V1) is the first step in representing and interpreting retinal image data (source: internet). (b) Cells in V1 respond to bars at different locations, orientations, and sizes (source: internet).

Representational Model

In this paper, we propose a representational scheme that couples the vector representation of static image contents with the matrix representation of changes caused by pixel displacements.

As illustrated by the diagram, the large rectangle stands for the image, the inner square for the local image content, and the center dot for a single pixel. Among the arrows, the short arrow labeled δ(x) represents the displacement of a pixel, which stays inside the inner square; the longer arrows vt(x) are vectors representing the local image content, and they rotate as the image deforms under the pixel displacements.
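
To make the coupling concrete, below is a minimal NumPy sketch of the parametric version of this idea: vt(x) is split into 2D sub-vectors, and each sub-vector is rotated by a 2x2 matrix whose angle depends linearly on the displacement δ(x). The function names and the frequency vectors are illustrative assumptions for exposition, not the paper's actual implementation.

# Minimal sketch (assumed names, not the paper's code): rotate each 2D
# sub-vector of the content vector by an angle induced by the displacement.
import numpy as np

def rotation(theta):
    # 2x2 rotation matrix for a single 2D sub-vector
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def rotate_subvectors(v, delta, freqs):
    # v     : (2K,) content vector at one pixel, i.e., K two-dimensional sub-vectors
    # delta : (2,) displacement of that pixel
    # freqs : (K, 2) assumed frequency vectors; sub-vector k is rotated by freqs[k] . delta
    v_next = np.empty_like(v)
    for k in range(len(freqs)):
        theta = float(freqs[k] @ delta)
        v_next[2 * k:2 * k + 2] = rotation(theta) @ v[2 * k:2 * k + 2]
    return v_next

# Toy usage: a random content vector, displaced by half a pixel to the right.
K = 8
v_t = np.random.randn(2 * K)
freqs = np.random.randn(K, 2)   # stand-in for learned frequencies
v_t1 = rotate_subvectors(v_t, np.array([0.5, 0.0]), freqs)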

Experiments

Exp 1 : Learned units
Exp 2 : Inference of displacement field
Exp 3 : Unsupervised learning
Exp 4 : Multi-step frame animation
Exp 5 : Frame interpolation

Experiment 1: Learned units

Figure 2. Learned results on the original image pairs. (a) Learned units. Each block shows two learned units within the same sub-vector. (b) Fitted Gabor patterns. (c) Distributions of spatial-frequency bandwidth (in octaves) and spatial phase.

Figure 3. Learned results on band-pass image pairs. (a) Learned units. Each block shows two learned units within the same sub-vector. (b) Distribution of the Gabor envelope shapes in the width and length plane. (c) Differences in frequency, orientation, and phase between paired units within each sub-vector.

Experiment 2: Inference of displacement field

 

Figure 4. Examples of inference of the displacement field. For each block, from left to right: It, It+1, the ground-truth displacement field, and the displacement fields inferred by the original motion model, the local-mixing motion model, FlowNetCS (FNCS), and FlowNet2 (FN2). The displacement fields are color coded.
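
As a rough illustration of how such a representation could support this inference (an assumption for exposition, not the paper's code): at each pixel, search a discretized set of candidate displacements and keep the one whose motion matrix best aligns the rotated vt(x) with the observed vt+1(x). The sketch below reuses rotate_subvectors and freqs from the earlier snippet.

def infer_displacement(v_t, v_t1, candidates, freqs):
    # v_t, v_t1  : (H, W, 2K) vector maps of two consecutive frames
    # candidates : list of (2,) candidate displacements on a discretized grid
    # returns an (H, W, 2) displacement field
    H, W, _ = v_t.shape
    field = np.zeros((H, W, 2))
    for i in range(H):
        for j in range(W):
            scores = [rotate_subvectors(v_t[i, j], d, freqs) @ v_t1[i, j]
                      for d in candidates]
            field[i, j] = candidates[int(np.argmax(scores))]
    return field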

Experiment 3: Unsupervised learning

 

Figure 5. Examples of displacement fields inferred by unsupervised learning. The top row shows the observed image sequences, while the bottom row shows the inferred, color-coded displacement fields.

Table 1. Average distance between inferred and ground truth displacements
Method                      Inference error
FlowNetC                    1.324
FlowNetS                    1.316
FlowNetSD                   0.799
FlowNetCS                   0.713
FlowNet2                    0.686
Our model (no mixing)       0.884
Our model (local mixing)    0.444

Experiment 4: Multi-step frame animation

 

Figure 6. Examples of multi-step animation, learned with the non-parametric version of M. For each block, the first row shows the ground-truth frame sequence, while the second row shows the animated frame sequence.
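
For intuition, multi-step animation with such a representation could be sketched as repeatedly applying the motion matrix to the vector map of the first frame and decoding each resulting map back to pixels. The snippet below uses the parametric rotation from the first sketch (the non-parametric version of M would instead look up a learned matrix per discretized displacement), and decode stands in for a learned decoder; both are assumptions for illustration.

def animate(v_t, delta_field, freqs, decode, n_steps=5):
    # v_t         : (H, W, 2K) vector map of the starting frame
    # delta_field : (H, W, 2) displacement applied at each step
    # decode      : assumed learned mapping from a vector map back to an image
    H, W, _ = v_t.shape
    frames = []
    v = v_t
    for _ in range(n_steps):
        v = np.array([[rotate_subvectors(v[i, j], delta_field[i, j], freqs)
                       for j in range(W)]
                      for i in range(H)])
        frames.append(decode(v))
    return frames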

Experiment 5: Frame interpolation

 

Figure 7. Examples of frame interpolation, learned with the non-parametric version of M. For each block, the first and last frames are given, while the frames between them are interpolated.

Acknowledgment

The work reported in this paper is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and a Hikvision gift to UCLA. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
