Overview

This website is in conjunction with our ICCV 2011 paper Unsupervised Learning of Event AND-OR Grammar and Semantics from Video (PDF). We study the problem of automatically learning event AND-OR grammar from videos of a controlled environment, e.g. an office where students conduct daily activities. We propose to learn the event gammar under the information projection and minimum description length principles in a coherent probabilistic framework, without manual supervision about what events happen and when they happen. Firstly a predefined set of unary and binary relations are detected for each video frame: e.g. agent's position, pose and interaction with environment. Then their co-occurrences are clustered into a dictionary of simple and transient atomic events. Recursively simpler events are grouped into longer and complexer events, resulting in a stochastic event grammar. By modeling time constraints of successive events, the learned grammar becomes context-sensitive. We introduce a new dataset of surveillance-style video in office, and present a prototype system for video analysis integrating bottom-up detection, grammatical learning and parsing. On this dataset, the learning algorithm is able to automatically discover important events and construct a stochastic grammar, which can be used to accurately parse newly observed video. The learned grammar can be used as a prior to improve or correct the noisy bottom-up detection of atomic events. It can also be used to infer semantics of the scene.

Video

Part 1 (ZIP), Part 2 (ZIP)

Above are two parts of the Office Life video dataset. In total there are 57885 video frames in the format of jpeg images. The resolution is 1280 * 720.

Sample frames:

The videos are taken in the office scene below, where the semantic regions are manually labelled.

Below are graphical illustrations of several atomic actions learned from video. Each atomic action is represented as a list of grounded unary relations about the agent pose, and binary relations highlighting the interaction between the agent and environment.

Under information projection principle we learn a stochastic AND-OR grammar from video. The figure below is the graphical representation of this event AND-OR grammar, where for brevity we only show the graph structure and omit the branching probabilities of OR nodes.

Using the learned event grammar, we parse the sequence of atomic actions extracted from a long video. A partial parse graph is shown in the following figure: