Joint Inference of Groups, Events and Human Roles in Aerial Videos

Tianmin Shu1, Dan Xie1, Brandon Rothrock2, Sinisa Todorovic3 and Song-Chun Zhu1

1Center for Vision, Cognition, Learning and Art, UCLA

2Jet Propulsion Laboratory, Caltech

3School of EECS, Oregon State University

Pipeline

Introduction

With the advent of drones, aerial video analysis is becoming increasingly important; yet, it has received scant attention in the literature. This project addresses the new problem of parsing low-resolution aerial videos of large spatial areas: grouping the people and objects engaged in events, assigning roles to them, and recognizing the events. Due to the low resolution and top-down views, person detection and tracking (the standard input to recent approaches to event recognition) are very unreliable. We address these challenges with a novel framework for joint inference of the above tasks, as reasoning about each task in isolation typically fails in our setting. Given noisy tracklets of people and detections of large objects and scene surfaces (e.g., building, grass), we use a spatiotemporal AND-OR graph to drive the joint inference, using Markov Chain Monte Carlo (MCMC) and dynamic programming. We also introduce a new formalism of deformable templates characterizing latent sub-events. For evaluation, we have collected a new set of aerial videos using a hex-rotor flying over picnic areas rich with group events. Our results demonstrate that we successfully address the above inference tasks under challenging conditions.
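To make the inference procedure above more concrete, here is a minimal, hypothetical Python sketch of a Metropolis-Hastings loop over group assignments. All names (tracklets, energy, propose) and the toy energy function are illustrative assumptions for exposition only; they are not the actual model, which scores full spatiotemporal AND-OR graph parses and interleaves MCMC with dynamic programming.

import math
import random

# Toy stand-ins for the paper's quantities: each tracklet is a person
# trajectory; a "state" assigns every tracklet to one of the groups.
tracklets = list(range(10))   # 10 tracklet ids (toy data)
num_groups = 3

def energy(state):
    # Stand-in energy: penalize group-size imbalance. The real model
    # scores a full parse (groups, events, roles) instead.
    sizes = [sum(1 for g in state if g == k) for k in range(num_groups)]
    mean = len(state) / num_groups
    return sum((s - mean) ** 2 for s in sizes)

def propose(state):
    # Proposal move: reassign one random tracklet to a random group.
    new_state = list(state)
    new_state[random.randrange(len(new_state))] = random.randrange(num_groups)
    return new_state

def mcmc(iters=5000, temperature=1.0):
    state = [random.randrange(num_groups) for _ in tracklets]
    e = energy(state)
    for _ in range(iters):
        cand = propose(state)
        e_cand = energy(cand)
        # Metropolis acceptance: always take downhill moves; take uphill
        # moves with probability exp(-dE / T).
        if e_cand <= e or random.random() < math.exp((e - e_cand) / temperature):
            state, e = cand, e_cand
    return state, e

In the full model, each accepted grouping would additionally be scored by a dynamic-programming pass that parses each group's activity into latent sub-events via the deformable templates.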

Paper and Demo

Paper

Tianmin Shu, Dan Xie, Brandon Rothrock, Sinisa Todorovic and Song-Chun Zhu. Joint inference of groups, events and human roles in aerial videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. (to appear) (Oral) [pdf]

@inproceedings{ShuCVPR15,
  title     = {Joint Inference of Groups, Events and Human Roles in Aerial Videos},
  author    = {Tianmin Shu and Dan Xie and Brandon Rothrock and Sinisa Todorovic and Song-Chun Zhu},
  year      = {2015},
  booktitle = {CVPR}
} 

Demo

UCLA Aerial Video Dataset

Introduction

At UCLA, we assembled a new low-cost hex-rotor equipped with a GoPro camera. The platform damps the camera's high-frequency vibration and hovers autonomously using GPS and a barometer. It flies 20-90 m above the ground and stays airborne for about 5 minutes. We used this hex-rotor to record a set of videos of scripted scenarios at a park with varied terrain: hiking routes, parking lots, camping sites, and picnic areas with shelters, restrooms, tables, trash bins, and BBQ ovens. By detecting and tracking the humans and objects in these videos, we can recognize events such as barbecuing, queuing, exchanging objects, and loading/unloading.

We collected scripted events involving interactions between humans and objects at two different sites. The original videos were pre-processed with camera calibration and frame registration (a generic registration sketch follows below). After pre-processing, the dataset contains 27 videos in total, each between 2 and 5 minutes long. We annotate hierarchical semantic information in the videos: objects, roles, events, and groups.
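As a rough illustration of the frame-registration step in the pre-processing, the sketch below aligns a frame to a reference frame using ORB features and a RANSAC-fitted homography in OpenCV. This is a generic registration recipe written under our own assumptions; the dataset's actual pre-processing pipeline may differ.

import cv2
import numpy as np

def register_frame(frame, reference):
    # Detect and describe keypoints in both frames.
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(frame, None)
    kp2, des2 = orb.detectAndCompute(reference, None)

    # Match descriptors with cross-checked brute force, best matches first.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC rejects moving people/objects as outliers, so the homography
    # is dominated by the static ground plane seen from above.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = reference.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))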

Image of our hex-rotor in the air with a GoPro camera.

A frame from the original aerial videos of site A.

A frame from the original aerial videos of site B.

Summary of dataset

Table summarizing the dataset.

Download

We will release the dataset after CVPR 2015.

Please cite this paper if you use the dataset:

Tianmin Shu, Dan Xie, Brandon Rothrock, Sinisa Todorovic and Song-Chun Zhu. Joint inference of groups, events and human roles in aerial videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Contact

Questions? Please contact Tianmin Shu (stm512 [at] g.ucla.edu)