The task of tracking the motion of objects or other features within a movie sequence can be broken down into two main problems. Firstly, we must decide what we will be tracking over time and how to identify these features. Secondly, once this is resolved, we must deal with the question of exactly how we will go about tracking particular features over time.
Our main purpose in identifying features is to reduce the information content of an entire sequence of moving images to a discrete set of moving points. Furthermore, as is discussed in the next section, first identifying particular features of interest reduces the effort required to gauge motion in a sequence. Rather than inspecting the entire image to identify motion, we can allow ourselves to focus only on the parts of the image which are of interest. For the purposes of this first foray into the the field of information recovery from an uncalibrated movie clip, we will restrict ourselves to identification of features which have an identifiable center or point of interest. Being able to reduce a feature to a point measurement will enable us to use very fundamental math in trying to infer information from our motion data. The use of other types of features such as lines or planes can be used but these will not be be addressed here.Since we seek to reduce the information content of a movie sequence, a reasonable approach to tackling this problem is to try to abstract the data in successive steps. We will do so with the aid of an example movie clip (Fig.1) of a rotating toy dinosaur. Although one of our original assumptions about input movie clips was that of a static scene with only a moving camera, one can clearly see how this sequence can be thought of as a stationary object with the camera orbiting around it.
![]() |
|
(src: Visual Geometry Group at Oxford University) |
We will assume that interesting features give rise to areas of high contrast in our input images, therefore a logical first step is to look at the image gradients of the input. There are a number of different ways to compute the image gradients. The simplest (and fastest) of which are the forward and backward difference approaches. To compute these, for every pixel, we simply subtract from its value, the value of its left neighbor to get its horizontal component and subtract the value of its bottom neighbor to get its vertical component. This simplistic approach has the drawback that it samples only three pixels to arrive at its calculation of the two components of the image gradient. The central difference approach takes into account four pixel values. For every pixel, one subtracts the left neighbor from the right and the top from the bottom to get the two gradient magnitudes. We can get better and better approximations to the image gradient as we increase the complexity of the calculation algorithm. For example, the cubic approximation takes into account 9 neighboring pixel values in its calculation but requires a tremendous number of addition, subtraction, and multiplication operations with respect to the forward difference approach.In experimenting with image gradient calculation, I devised and implemented a new technique which is especially useful when the image gradient output is further being analyzed. The technique is both fast and robust to noise. The idea is similar to the central difference algorithm, in that it calculates the image derivatives by subtracting the left from the right pixel, and the top from the bottom pixel. We note that this gives us a cross shaped pixel sampling. We approximate the image gradient a second time by performing the same operations but using an X shaped sampling. i.e.. subtract the upper left from the lower right pixel, and the upper right from the lower left pixel. After proper scaling and rotation of the second image gradient approximation (Note: the scaling and rotation can be applied through one single floating point multiplication on each component), we have two equivalent approximations to the horizontal and vertical derivatives. In practice, these will not match, and in fact, in areas where the image is plagued by noise, the two will not correspond at all. The simplest way to combine the two results is to average them together but this throws away the extra information we have as to how closely the two correlate. Rather, we will normalize the two gradient approximations and subtract one from the other. We note, that if the two were the same, we get a zero vector after subtraction, however, if they do not match we may get a vector of magnitude up to 2. So, we measure the magnitude of the difference vector and add 1. This gives a value between 1 and 3 (where 1 represents perfect correlation and 3 represents opposite results). Simply dividing the average of our two gradient estimates by this value provides an interesting way of scaling the image gradient by a factor related to our confidence in the gradient at that pixel. An un-optized implementation was created to test the results of this approach (bwImage::gradToulouse() in bwImage.C). It should be noted that this approach is only useful to us since later we will be examining areas of the image with high image gradient values. This approach simply suppresses high image gradients in areas where we aren't confident in the image gradient direction. Fig.2 allows us to make a qualitative comparison between the two approaches. Notice that inside the dinosaurs body in image gradient A, we see relatively a relatively high, somewhat noisy image gradient response, due to the dinosaur's skin texture. In the image gradient b (calculated using the aforementioned approach), the image gradient is somewhat suppressed on the dinosaur's skin but is held at the same level for "true" image edges.
It may have occurred to you that in speaking about these different approaches to image gradient calculations, I've been vaguely using phrases such as "subtract the left pixel from the right pixel". What it is that we are subtracting? In most simple cases, simply taking the luminance component of a color image and working with that is sufficient. However, it did occur to me that in doing so, we in fact are throwing away important information. Take, for example, the border between the dinosaur's orange tail and the blue background. When viewed in color, there is high contrast between these two. However, when converted to black and white, we wee much less contrast because the luminance of the orange is relatively close to that of the blue. Looking at the image gradient around the dinosaur's tail in Fig.2a shows the decreased contrast in this area. The simplest way to use this extra color information is to form the image gradient in each of the red, green, and blue color components individually and then simply sum their results. This is not by any means perfect and will give rise exceedingly high image gradients between black and white borders (where each of the R, G, and B gradient components will be high) bu it does provide acceptable and useful results. Fig.2b was also created using this additive RGB gradient technique. Notice that around the tail and back sections, the image gradient edge is more emphasized than that of the simple black and white technique.
![]() |
![]() |
|
|
|
Image gradients are interesting in their own right, but they do not give rise directly to point features. Use of further edge detection and edge following techniques is also interesting and, as desired, abstracts further the information content in the image. In particular, the Canny Edge Enhancer gives very good results which would seem to lead to interesting feature points. Although visual examination of Canny output would lead one to believe that there is rich feature content information, choosing particular edge intersections or edge curvatures as points of interests remains a difficult task to implement. Only seemingly ad-hoc techniques seem to be available.An interesting alternative to edge detection is corner detection. We will examine the Harris Corner Detector (popular for its simplicity and effectiveness) in further detail as a way to identify interest points. The Harris Detector is based on the following matrix:
| (dI/dx)2
(dI/dx)(dI/dy) |
C = |
|
| (dI/dx)(dI/dy)
(dI/dy)2 |
They key to unlocking the power of this equation occurs when we examine its eigenvalues. We note, that when we the matrix has two large eigenvalues this correspondents to two separate principal directions in the underlying image gradient. We calculate this quickly and efficiently in a simple equation for the corner response:
where k is a tunable parameter (Harris suggests 0.04).
In practice, Harris Corner detection is calculated using a particular small window size and summing each of the derivatives over their given widow region.Corners themselves are therefore areas where the this response is very high. Fig3 shows the results of applying the Harris Corner Operator to dinosaur test sequence using the two gradient approximation techniques shown in Fig.2 respectively. Notice that there is a very slight improvement in corner detection for the spikes on the dinosaur's back when the improved image gradient technique is used (Fig 3b). This is mostly due to the technique of recombining the R, G, and B gradients. However, the improvement that we see here in the Harris Corner Detector is too slight to merit the increased computational time required to produce the improved gradient estimate. From this point on, we will no discuss the issue of image gradient estimation and leave it up to the reader's discretion for how image gradients should be calculated and their results weighted against computational complexity.
![]() |
![]() |
|
|
|
Given the result of some feature response function, such as the Harris Corner Operator, we still do not have the desired discrete positions of particular features. Rather we have an intensity map (Fig.3) of feature response. We propose that to extract discrete feature positions, we find the centroid of each "hotspot" in the intensity map. This is easier said than done! A plausible approach (and the one implemented here) is to order the pixels according to response intensity and then perform a search for features in decreasing order through this list. The algorithm proposed is to choose the brightest pixel from the list and then initially calculate the centroid and standard deviation of intensity over the same window size as used for the corner detection. If the standard deviation indicates that the centroid is not fully enclosed by the chosen window, we expand the window and recalculate around the calculated centroid. Conversely, if the standard deviation indicates that the hotspot is much smaller than the chosen window size, we reduce the size of the window and recalculate. This process is repeatedly iteratively until the solution converges. The final centroid and window size give us the position and size of a feature. Once this is achieved, we remove all response pixels within the final window size from the ordered list and repeat the entire process for another feature.It should be noted that this iterative process is quite delicate and blows up to large window sizes on occasion. The implementation used here for testing is encapsulated in featureFinder.C. Fig.4 displays the results of our feature detection over the dinosaur test sequence with the algorithm set to extract the 6 most "interesting" features at every image frame. It is interesting to note that although we have yet to make any attempt to extract frame-to-frame feature correspondences, the algorithm still extracts the same set of features at every frame. This will be useful as we continue on with feature tracking.
![]() |
|
(video compression artifacts are responsible for the degradation of the red squares) |
The reader should realize that although we have been using motion sequences to demonstrate the concept of feature extraction, we have yet to deal with any algorithm which explicitly attempts to "track" feature over time. This is the next and final important step in achieving feature tracking. Assuming that we are given a discrete list of features for every frame of a sequence (for example the result of harris corner detection discussed above), we wish to form sets of correspondences between them over the sequence of frames.In general, intensity correlation is used to achieve this task. Given a chosen window size, compare the sub region in one image to that in the next on a pixel by pixel basis. Two widely used approaches are cross correlation and sum of squared differences matching. The first of which is faster to calculate (we simply sum the product of the intensity of each pair of corresponding pixels) but is only applicable in the case where the energy of the sub regions is the same. This will generally not be the case, so the second approach is used, in which, as the name implies, we sum the square of the differences between each pair of corresponding pixels. A simple fast OpenGL based image correlation class is implemented in correlator.C. The naive approach to finding a correspondence for a given feature in subsequent frames would be to search the entire image for the best correlation. This is computationally quite expensive and in most situations unacceptably slow. We therefore reduce the search space by only comparing each feature with the identified features of the next frame. Those yielding a high enough correlation, are then considered to be matched across a pair of frames. Applying this approach iteratively builds up chains of tracked features across the image sequence.
In this test implementation, further refinement of this simple approach yielded an algorithm capable of tracking a specified minimum number of features each lasting a specified minimum interval of time over the image sequence. It's implementation is in featureCorrelator.C. Fig.5 displays the results for our dinosaur clip where each detected feature was given an arbitrary random color to show that it is individual features that are being tracked. The movie clearly shows the rotation of features as the underlying dinosaur is rotated.
![]() |
|
|
Fig.6 simply superimposes the feature tracking information shown in the previous figure atop the underlying image sequence (note that the images frames in the snapshots correspond exactly between Fig.5 and 6)
![]() |
|
|
Although the algorithm and approaches discussed do yield very good results in most cases, they are not perfect. In particular they do not deal with object occlusions. For example, in the test sequence, although the dinosaur's tail is visible throughout most of the sequence, it is hidden by the dinosaurs body part way through the clip. The given approach then erroneously identifies features in the first section where the tail is visible as being seperate entities then those in the second interval of its visibility. This is obviously due to the fact that we really on frame to frame image correspondences. We make no attempt to match features across across more than a pair of subsequent frames. This approach could provide useful further correlations, we will see that it is unnecessary for our purposes.However, object occlusion also gives rise to another, more important, problem in our feature tracking. Actually occlusion boundaries are tracked rather than static features of interest. For example, the corner formed by the occlusions due to the dinosaur's armpit are tracked even though they correspond to no particular static feature on the dinosaur. This type of feature tracking error will cause problems as we continue with camera tracking. However, it should be noted that relatively few of these cases were noticed with the tested input.
The complete set of C++ source code files used during testing can be found here. Of particular interest up to this point are the following classes: correlator, featureCorrelator, featureFinder, bwImage, floatImage, rgbImage and the test applications: correlTest, harrisTest, and trackTest
Fig.7 and 8 show the results of the same test implementation on to other test image sequences.
![]() |
![]() |
|
(src: CMU VASC Image Database) |
|
![]() |
![]() |
|
(src: CMU VASC Image Database) |
|