Camera Tracking

Admittedly, this portion of the project is not entirely complete.  A number of difficulties were encountered in trying to implement a working camera tracking system.  These were mostly due to unexpected sensitivity to measurement noise and false feature matches.

We will now proceed to discuss the theory behind scene information reconstruction and its application to camera tracking.  We will address individually the problems encountered in implementation.
 

Epipolar Geometry

The mathematical foundation for scene reconstruction is epipolar geometry.  The epipolar system encapsulates the constraints present in a multiple view system where point correspondences are known between individual image frames.  Consider Fig.1 where we have two image planes representing frames from our sequence.  For the sake of his explanation well refer to one plane as being the left one and the other the right (this is the standard labeling as it refers to stereoscopic system in general).  Left left camera's optical center Ol, the 3D point P being imaged, and the right camera's optical center Or, form a plane pi called the epipolar plane.  Furthermore, the point P projects to pl and pr in the left and right image planes respectively.  We also have the image of the right camera's optical center in the left image plane el and the left camera's optical center in the right image plane er.  These are called the epipoles.  Finally, the lines ll through el and pl, and lr through er and pr are called epipolar lines.  Notice that all these various points and lines all lie on the plane pi.

The importance of the system becomes clear once we understand the role of the epipolar lines.  Take, for example, the image of the 3D point P in the left image plane.  If we don't know the actual position of P, the point pl could correspond to any 3D point along the line Ol pl.  From the left image plane's point of view, any point along Ol pl would correspond to a point projected onto its image plane somewhere along lr.  That's the important constraint, once the epipolar system is known, any point pl in the left image plane is constrained to be projected somewhere along lr in the right image plane and vice versa.

Fig.1 - Epipolar Geometry
Without worrying about the relative position of the two cameras or the position of any points in the system, we can still concisely express this relationship between the left and right camera systems:
prT . F . pl = 0
where pl and pr are in pixel coordinates (not to be confused with camera coordinates!).  Here, the 3x3 matrix F is referred to as the fundamental matrix.  F has rank 2 and has el and er as its left and right null spaces respectively.  In other words, they can be found by F . el = 0 and vice versa.  It is also interesting to note that F-1 = FT, giving us a simple inverse relationship from pl to pr.  This sets up a wonderfully useful constraint for finding feature correspondences: Any point in the left image pl must have a correspondence that lies somewhere on the line ll in the right image plane where ll is simply the line F . pl = 0.  This implies that if we are looking for a point corresponding to pl, we only need to perform a search along the 1D ll in the other image rather than the entire 2D image as a whole!
 

Estimating the Fundamental Matrix

The epipolar constraint is great!  But how can we run it in reverse?  In other words, we have point correspondences (from our feature tracking which was discussed previously) and which to know how to retrieve the fundamental matrix which gave rise to them.  The direct closed-form equation for finding F directly requires the use of 7 point correspondences between two frames and requires solving for the roots of a 6th degree polynomial.  We will not discuss this here because a much simpler numerical method exists when at least 8 point correspondences are known.  This is the popular 8 point algorithm.

Using at least 8 point correspondences, we set up a homogeneous linear system of equations in the 9 unknown entries of F using the equation relating points in the two image planes to each other given above.  In other words, for each pair of corresponding points, we have an entry in our system which is:

xlxrF11 + ylxrF12 + xrF13 + xlyrF21 + ylyrF22 + yrF23 + xlF31 + ylF32 + 1 = 0
where the 1 term is due to the fact that we can only recover the constraint up to a scale factor.  Taking each of these equations we form the linear system Af = 0.

This can be solved quite readily (and fast) using singular value decomposition.  That is, decompose A into matrices UDVT where A = UDVT, matrices U and V are orthornormal and D is diagonal.  The diagonal elements of D corresponds to the singular values of the system and hence we are guaranteed that one of them will be zero.  The corresponding column in V to the zero element in D is the correct solution because we know that F is rank 2!
 

Extrapolating more Fundamental Matrices

The eight point algorithm gives us a simple way of estimating the F between any two frames of a movie where we have tracked at least 8 feature correspondences.  However, how can we possibly extrapolate the Fs between frames where we do not have access to point correspondences?  This arises, for example, between the first few and last few frames of a movie sequence.  We are unlikely to have been successful in tracking at least 8 features over the entire length of the sequence.

The name of the equivalent constraint system to the fundamental matrix but for three view stereopsis is known as the trifocal tensor.  It encapsulates all the geometry between 3 views of a scene.  We will not address the trifocal tensor directly but we will use one of its constraints to enable us to extrapolate fundamental matrices between views where we are lacking point correspondences.

Between any set of completely determined fundamental matrices between three views, it is possible to perform a point transfer.  That is to say, if we have three views 1, 2, and 3 and we know the fundamental matrices F12, F13, and F23, we are able to determine where in the third view a point will lie given its projection in views 1 and 2.  Quite simply, the point in the first image plane p1 and its correspondence in the second image plane point p2, give rise to two epipolar lines in the third plane, p1F13 = 0 and p2F23 = 0.  We know that the point p3 must line on both epipolar lines and hence must be at its intersection.  Of course, this fails if the point in question lies on the trifocal plane (the plane defined by the three camera centers) giving rise to collinear epipolar lines in the third view.  Written concisely, a point transfer from views 1 and 2 to view 3 is equivalent to:

p3 = (F31p1) x (F32p2)
Point transfers between three views lead us to a direct method of extrapolating new fundamental matrices.  Say we have completely determined the fundamental matrices between three views which we will refer to as views 1, 2 and 3.  In other words, F12, F13, and F23 are known.  Furthermore, say we have also completely determined the fundamental matrices between three other views where two of them are shared with the first set.  In other words between views 2, 3, and 4 giving rise to F23, F24, and F34.  It is possible to infer the fundamental matrix F14 from this information directly!

Consider all the point correspondences which which gave rise to the estimate of F23.  There are at least 8 of them if we used the SVD estimation technique described above.  Using point transfer, we can determine where the points giving rise to these correspondences should be in view 1 since F12, F13, and F23 are known.  The point correspondence can also be transferred to view number 4 in the same way.  We now have at least 8 point correspondences between views 1 and 4.  This is enough once again to use SVD estimation!

What Are Fundamental Matrices Good For?

Now that we have fundamental matrices relating (hopefully) every single frame of a movie to every other frame of the movie, what can we do with this?  The first step is to examine location of the epipoles (remember, these are the projections of the camera centers in each others image planes).  For example, we can, from the point of view of the very first image frame in our movie sequence, look at where the camera centers lie in every other frame.  Visually, this means that we can play back the change in location of the camera center from the point of view of the first camera.  This information can be drawn in the pixel coordinates of the first image frame as exactly the epipoles with respect to every other frame.  Furthermore, the field of view of the camera can be drawn in the first image frame by noting that it simply corresponds to the epipolar lines in every other frame corresponding to the corners of the first image frame.  (The savvy reader will note that the field of view will extend in front and behind the camera...but this can be resolved simply by noting which correspondences lie in front of the camera).

Numerical Stability

Well all is well until one tries to implement such a system.  I was very surprised at the sensitivity of the eight point algorithm to noise in measurements and to the occasional mismatched point correspondence coming from the harris feature detection previously discussed.  Although I normalized all coordinates (as suggested in the literature) before attempting fundamental matrix estimation through SVD, the system still seemed ridiculously sensitive to noise.  Consider Fig.2, where 18 point correspondences between two frames of the dinosaur test sequence are drawn in green.  The fundamental matrix F relating the two frames was estimated through SVD using these 18 correspondences.  Then the corresponding epipolar lines for each one of the features was drawn back into one of the frames in yellow.  As we can obviously see, the lines do not intersect the corresponding points as they should.  In fact they they are terribly noisy even though visual inspection of the point correspondences indicates that they are pretty accurate!
Fig.2 - Epipolar Lines and Feature Correspondences
We do however notice that one point correspondence on the dinosaur's head is not quite right.  In fact, as discussed in the previous section, the point correspondence is due to the boundary between the dinosaur's head the blue background, not an actual static feature in the scene.  Removing this single point correspondence from the system, we see that the fundamental matrix is better approximated and that the epipolar lines are indeed much closer to the actual point correspondences.  However, they are still not perfect (in fact, I'm not quite sure how closely I could ever expect to reproject these epipolar lines).  Nonetheless, without human intervention to remove spurious false feature tracking data, the estimations of F are severely plagued by noise.

Due to this I was unable to pursue any test implementations further than the noisy estimation of fundamental matrices.

Fig.3 - Epipolar Lines with Removal of False Feature
(a single feature on the top of dinosaur's head was suppressed)
However, as a final note, it is somewhat gratifying to see that in Fig.3, the epipolar lines in the left image do indeed intersect at a plausible camera center with respect to the point of view of the right image.  The right image was obviously taken by a camera placed to the right and slightly behind the dinosaur.  This is the position of the camera implied by the epipoles in the left image.