Aerial Full Motion Video (FMV) Stabilization, Tracking and Activity Recognition

This work was done as part of the VIRAT project at UCF.

Aerial video collected by electro-optical and infrared cameras on small UAV platforms is rapidly becoming a low-cost, timely source of imagery. This growth in aerial video has driven the development of systems for natural disaster remediation and surveillance monitoring. Given the large volume of video currently generated by such platforms, it is infeasible for users to sort through thousands of hours of footage for events or activities of interest. Vision-based action recognition methods capable of detecting specific actions of interest in aerial video are therefore needed.

Recognizing Actions from UAV Video

In this work we address the challenge of recognizing human actions from video captured at high altitudes. To get a full grasp of the problem at hand, it is useful to contrast "actions from above" with the traditional ground-camera setting. In near-field, ground-camera action recognition, humans in the scene are hundreds of pixels both tall and wide, and articulated motion can be clearly distinguished. Most existing work on human action classification performs best with data at this resolution and viewpoint.

At altitudes over 400 feet, humans captured on video are, on average, less than 8 pixels wide and 15 pixels tall, and the overhead camera perspective leads to frequent self-occlusion of the limbs.

Given the low resolution of humans within the aerial video sequences, we did not rely on high-level features such as contours or tracking of individual limbs. At the same time, raw low-level motion features were forgone because of the extensive UAV jitter, which translates into unstable video. Instead, we focused on generating spatiotemporal action templates from a number of mid-level motion features, which reduces the sensitivity of the resulting templates to noisy low-level measurements in aerial video sequences. We also investigated the use of temporally small templates, which do not require motion compensation of the aerial video.
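The template-based idea above can be illustrated with a minimal sketch. The specifics here are assumptions for illustration only: simple frame differencing stands in for the system's mid-level motion features, and an averaged motion-energy template matched by normalized correlation stands in for the actual template construction and matching.

```python
import numpy as np

def motion_energy(clip):
    """Per-pixel motion energy via absolute frame differences.

    clip: (T, H, W) array of grayscale frames. Frame differencing is a
    simple stand-in for the mid-level motion features used in the
    actual system (assumption for illustration).
    """
    return np.abs(np.diff(clip, axis=0))  # shape (T-1, H, W)

def action_template(clips):
    """Spatiotemporal template: mean motion energy over training clips."""
    return np.mean([motion_energy(c) for c in clips], axis=0)

def match_score(template, clip):
    """Normalized correlation between a template and a query clip's energy."""
    t = template.ravel()
    e = motion_energy(clip).ravel()
    t = (t - t.mean()) / (t.std() + 1e-8)
    e = (e - e.mean()) / (e.std() + 1e-8)
    return float(np.dot(t, e) / t.size)

# Tiny synthetic example: a small "moving blob" action, roughly the
# size of a person seen from altitude, versus a motionless noise clip.
rng = np.random.default_rng(0)
T, H, W = 6, 16, 16

def moving_clip():
    clip = rng.normal(0.0, 0.01, (T, H, W))
    for t in range(T):
        clip[t, 4:8, 2 + t:6 + t] += 1.0  # blob drifts rightward
    return clip

template = action_template([moving_clip() for _ in range(5)])
score_action = match_score(template, moving_clip())
score_static = match_score(template, rng.normal(0.0, 0.01, (T, H, W)))
print(score_action > score_static)
```

Because the template averages motion energy over several short clips, pixel-level noise in any single clip is damped, which is the same intuition behind building templates from mid-level rather than raw low-level features.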