Introduction

 
The goal of this research project was to investigate the limits of automated camera tracking in a completely uncalibrated environment, and to implement such a system.  Specifically, given a sequence of images taken by a video camera moving through an unchanging environment, what techniques are available for recovering the motion of the camera, and to what extent can that motion be recovered, based only on the apparent motion of objects in the video?  This discussion is meant as an introduction to the topic and an exploration of its basic concepts and founding principles.

It is natural for us to regard this task as somewhat trivial.  Much of the human brain's computational power is devoted to processing and comprehending what our eyes see, which makes the entire process seem effortless to us.  We have relatively little trouble watching a video clip on a monitor and visualizing how the camera was moving when the sequence was recorded.  In fact, we perform this very task constantly, using feedback from our vision to situate ourselves as we move around.

Presumably, the human visual system performs this task by recognizing individual objects and points of interest and noting how they change over time.  Assuming that those objects are in fact not moving, we interpret their apparent motion as the result of our head and eyes moving.  This is the approach we shall take in trying to recover a camera's motion from a movie clip shot by it.  We will first identify the moving parts of the image sequence (we will loosely refer to this as feature tracking in the rest of the discussion) and then analyze this motion to infer how the camera was moving.  What makes this second task particularly difficult (and in fact not completely solvable) is that we will not assume any knowledge of the camera's physical characteristics, such as viewing angle, aspect ratio, etc.

Practically speaking, such a system will not be able to perform perfect feature tracking as input to its motion analysis.  Hence, what is often done is to make an initial attempt at feature tracking, which will inevitably include noise in its measurements as well as a number of erroneous observations.  As long as the measurements are not too noisy and only a few false matches are made, we can potentially still make a decent estimate of camera motion.  Typically the process is repeated at this point, using the estimate of camera motion as a basis for refining the feature tracking.  The whole cycle may be repeated until a stable result is attained.  This project will focus mainly on feature tracking and on how to infer camera motion from these measurements.  We will not discuss in detail the iterative methods for improving such an initial estimate.
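
Although the refinement cycle itself is outside our scope, its general shape can be illustrated with a toy sketch.  The example below is only an assumption-laden simplification: it reduces "camera motion" to a single 2D image shift, which is far simpler than the uncalibrated problem considered here, and it is not the method developed in this project.  The function names (estimate_translation, residual, refine) and the threshold parameter are hypothetical and chosen purely for illustration.  The sketch estimates the shift from all feature matches, discards matches that disagree with that estimate, and repeats until the set of accepted matches stops changing.

```python
from math import hypot
from statistics import median


def estimate_translation(matches):
    """Estimate a 2D shift (dx, dy) as the per-axis median displacement.

    `matches` is a non-empty list of ((x1, y1), (x2, y2)) pairs: a feature's
    position in one frame and its matched position in the next frame.
    """
    dxs = [x2 - x1 for (x1, _), (x2, _) in matches]
    dys = [y2 - y1 for (_, y1), (_, y2) in matches]
    return median(dxs), median(dys)


def residual(match, dx, dy):
    """Distance between the observed match and the one predicted by (dx, dy)."""
    (x1, y1), (x2, y2) = match
    return hypot(x2 - (x1 + dx), y2 - (y1 + dy))


def refine(matches, threshold=2.0, max_iters=10):
    """Alternate between estimating the shift and rejecting bad matches.

    Toy stand-in for the estimate/refine cycle: the motion model is a pure
    image translation, which is an assumption made only for this example.
    """
    inliers = list(matches)
    for _ in range(max_iters):
        dx, dy = estimate_translation(inliers)
        # Re-test every original match against the current motion estimate.
        kept = [m for m in matches if residual(m, dx, dy) <= threshold]
        if not kept or kept == inliers:
            break  # nothing usable left, or the result has stabilized
        inliers = kept
    return estimate_translation(inliers), inliers
```

Given a handful of noisy matches that share a common shift plus a few false matches, refine(matches) returns the estimated shift together with the matches judged consistent with it.  The median is used for the estimate because it tolerates a minority of gross errors; any other robust estimator could be substituted.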