CRAM: Compact Representation of Actions in Movies

Mikel Rodriguez, "CRAM: Compact Representation of Actions in Movies," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, 2010.

Every day, millions of hours of video are captured around the world by CCTV cameras, webcams, and traffic cams. In the United States alone, an estimated 26 million video cameras produce more than four billion hours of footage every week. In the time it takes to read this sentence, close to 20,000 hours of video have been captured and saved at different locations in the U.S. However, the vast majority of this wealth of data is never analyzed by humans. Instead, most of the video is used in an archival, post-factum manner, consulted only after an event of interest has occurred.


The main reason this data goes unexploited is that video browsing and retrieval are inconvenient due to inherent spatio-temporal redundancy: extended periods of time contain little or no activity of interest. In most videos, a specific activity of interest occupies only a small region of the video's entire spatio-temporal extent.


In this work, we introduce activity-specific video summaries, which provide an effective means of browsing and indexing video based on a set of events of interest. Our method automatically generates a compact video representation of a long sequence, which features only activities of interest while preserving the general dynamics of the original video.

Given a long input video sequence, we compute optical flow and represent the corresponding vector field in the Clifford Fourier domain. Dynamic regions within the flow field are identified in the phase spectrum volume of the flow field. We then estimate the likelihood that relevant activities occur within the video by correlating it with spatio-temporal maximum average correlation height (MACH) filters. Finally, the input sequence is condensed via a temporal shift optimization, resulting in a short clip that simultaneously displays multiple instances of each relevant activity.
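
As a rough illustration of the first two steps, the sketch below computes a per-frame phase-spectrum saliency map from a dense flow field. It assumes the flow (U, V) has already been computed by some dense optical-flow routine, and it uses the common 2-D simplification of encoding the vector field as a complex image so that a standard FFT stands in for the Clifford Fourier transform; the function name and smoothing parameters are illustrative, not the paper's implementation.

```matlab
% Hedged sketch: phase-spectrum saliency of an optical-flow field.
% U, V : real x-by-y flow components for one frame (assumed given).
function S = flowPhaseSaliency(U, V)
    % Embed the 2-D vector field as a complex image. For 2-D fields
    % the Clifford Fourier transform can be computed via standard
    % complex FFTs, which this embedding stands in for.
    F = fft2(complex(U, V));
    % Keep only the phase spectrum and transform back: regions whose
    % motion deviates from the global spectrum produce strong responses.
    R = ifft2(exp(1i * angle(F)));
    S = abs(R).^2;
    % Smooth the response into a stable dynamic-region map.
    S = imfilter(S, fspecial('gaussian', 9, 2));
end
```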

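The action-likelihood step can be pictured as a 3-D cross-correlation carried out in the frequency domain. The sketch below assumes a MACH filter spectrum H has already been synthesized from training examples of the action (filter training is not shown) and that the clip and filter volumes have matching sizes; the names are illustrative.

```matlab
% Hedged sketch: correlating a spatio-temporal MACH action template
% with a clip in the frequency domain. Only the detection step is
% shown; H is the pre-trained filter's frequency-domain volume.
function [score, loc] = machCorrelate(clip, H)
    % clip : x-by-y-by-t volume (e.g., flow magnitude or the complex
    % flow embedding above); H must have the same size as clip.
    C = ifftn(fftn(clip) .* conj(H));   % 3-D cross-correlation
    R = abs(C);
    [score, idx] = max(R(:));           % peak response = likely action
    [i, j, k] = ind2sub(size(R), idx);
    loc = [i, j, k];                    % spatio-temporal peak location
end
```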

In order to generate action-specific summaries, we identify the most relevant activities based on pre-defined action templates. Space-time "worms" (tubes) containing activities of interest are then assembled into a synopsis video of a specified temporal length. To include as many activities as possible in the short synopsis, different action instances may be displayed concurrently, even if they originally occurred at different times; a greedy version of this shift assignment is sketched below. The resulting synopsis video packs the events of interest into a short clip that can also serve as an index into the original long video, since each event keeps a pointer to its original spatio-temporal location.
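
The paper poses the temporal shift as an optimization; the sketch below is only a greedy approximation of that idea. It assumes each detected worm is given as a binary spatio-temporal mask masks{i} no longer than the synopsis, with all masks sharing the same spatial size, and places worms one by one at the start time that overlaps least with what has already been placed.

```matlab
% Hedged sketch: greedy temporal-shift assignment for packing action
% worms into a synopsis of synLen frames. masks : cell array of
% binary x-by-y-by-len_i volumes; each len_i is assumed <= synLen.
function shifts = packWorms(masks, synLen)
    n = numel(masks);
    [h, w, ~] = size(masks{1});
    occ = zeros(h, w, synLen);          % current occupancy of synopsis
    shifts = zeros(1, n);
    for i = 1:n
        li = size(masks{i}, 3);
        best = inf;
        for s = 1:(synLen - li + 1)     % candidate start times
            overlap = occ(:, :, s:s+li-1) .* masks{i};
            cost = sum(overlap(:));     % collision with placed worms
            if cost < best
                best = cost; shifts(i) = s;
            end
        end
        s = shifts(i);                  % commit the least-overlap shift
        occ(:, :, s:s+li-1) = occ(:, :, s:s+li-1) + masks{i};
    end
end
```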

Resources:

Source code (Matlab/C++)

CRAM: Compact Representation of Actions in Movies (PDF)

Video Sequences used in the paper (please email me)