|
CAMSHIFT Experiments
Last update: 11/15/2004
|
Introduction
Background and context
Design experiments
Implementation
Summary
References
Acknowledgments
This tutorial explores design and implementation strategies for CAMSHIFT-based tracking systems using the Intel OpenCV Library (version beta 3.1) and SAI/MFSM.
Full source code and VC++ project files for the application are available for download from the MFSM Source Forge page: sourceforge.net/projects/mfsm (package name: mfsm (legacy); file name: CAMSHIFTTracker_7.0-1.0). The CAMSHIFT specific code is regrouped in the files CamShiftModule.[h|cpp] in the coltrk namespace. Some general OpenCV elements encapsulations are regrouped in the files OpenCVModule.[h|cpp] in the opencv namespace.
Please acknowledge use of this information and/or code by citing this page, or, in the case of academic publications, by citing the following Technical Report: CAMSHIFT Tracker Design Experiments with Intel OpenCV and SAI, Institute for Robotics and Intelligent Systems, University of Southern California.
Required modules: Image, DirectShow Live Input, DirectShow File Input, Display, Scripting
Required libraries: cv.lib (OpenCV), strmiids.lib strmbase.lib (for Direct show stuff, tested with DirectX 9.0)
The Continuously Adaptive Mean SHIFT (CAMSHIFT) algorithm (Bradski, 1998), is based on the mean shift algorithm (Comaniciu and Meer, 1997), a robust non-parametric iterative technique for finding the mode of probability distributions.
Given a color image and a color histogram, the image produced from the original color image by using the histogram as a look-up table is called back-projection image. If the histogram is a model density distribution, then the back projection image is a probability distribution of the model in the color image. CAMSHIFT detects the mode in the probability distribution image by applying mean shift while dynamically adjusting the parameters of the target distribution. In a single image, the process is iterated until convergence (or until an upper bound on the number of iterations is reached).
A detection algorithm can be applied to successive frames of a video sequence to track a single target. The search area can be restricted around the last known position of the target, resulting in possibly large computational savings. This type of scheme introduces a feed-back loop, in which the result of the detection is used as input to the next detection process. The version of CAMSHIFT applying these concepts to tracking of a single target in a video stream is called Coupled CAMSHIFT.
The Coupled CAMSHIFT algorithm as described in (Bradski, 1998) is demonstrated in a real-time head tracking application, which is part of the Intel OpenCV library. The area of application specifically considered is that of Perceptual User Interfaces (PUIs), in this particular case using head movements as seen in a face shot video stream to control various interactive programs. The color model used for the head is actually a skin color tone model, initialized by sampling an area specified manually in one image. Best results are obtained for skin area tracking when using the Hue component in the color images and the histogram. In the remainder of this paper, CAMSHIFT refers to the coupled CAMSHIFT algorithm, as implemented in OpenCV.
OpenCV contains features useful in implementing a CAMSHIFT-based tracking system, that will be explored in the experiments described in this paper. These features range from general purpose data structures (e.g. histograms) and functions (e.g. color conversion, back projection), to more specialized functions (e.g. MeanShift, CamShift) that package calls to general purpose functions in specific algorithms, to a black-box CAMSHIFT tracker class that encapsulates all necessary data structures and hides all intermediate library calls internally. Although it can make programming easier, such a neatly packaged programming component still requires something else to build a working system: an architecture model.
Software Architecture is the field of study concerned with the design, analysis and implementation of software systems (Shaw and Garlan, 1996). A specific architecture can be described as a set of computational components, and their inter-relations, or connectors. An architectural style characterizes families of architectures that share some patterns of structural organization.
Whether explicitly or implicitly, any implementation of a computer program follows some architectural model. The CAMSHIFT demonstration that comes with OpenCV is designed and implemented using Microsoft's DirectShow dataflow architecture and library. Principles common to dataflow architectures form the Pipes and Filters architectural style. It is particularly popular in the domains of signal processing, parallel processing and distributed processing (Andrews, 2000), to name but a few.
In the Pipes and Filters style, the components, called filters have a set of inputs and a set of outputs. Each component reads (and consumes) a stream of data on its input and produces a stream of data on its output. The connectors are pipes in which the output stream of a filter flows to the input of the next filter. Filters can be characterized by their input/output behavior: source filters produce a stream without any input; transform filters consume an input stream and produce an output stream; sink filters consume an input stream but do not produce any output. The OpenCV CAMSHIFT demonstration comprises three main filters (see Figure 1): a video source, a CamShift transform filter that encapsulates the OpenCV CAMSHIFT tracker class, and a video renderer sink. An additional color space conversion filter is required for the renderer to accept the output of the CamShift filter as input.
Figure 1: DirectShow graph for the OpenCV CAMSHIFT tracker demonstration
The Pipes and Filters style has a number of desirable properties that make it an attractive and efficient model for a number of applications. It is relatively simple to describe, understand and implement. The localization and isolation of computations facilitates system design, implementation, maintenance and evolution. Reversely, filters can be implemented and tested separately. Furthermore, because filters are independent, the model naturally supports parallel and distributed processing. Dataflow models are also quite intuitive, and allow to model systems while preserving the flow of data. Because of the well-defined and constrained interaction modalities between filters, complex systems are easily understandable as a series of well defined local transformations. This property is obviously not fully leveraged in the CAMSHIFT demonstration, as the tracking algorithm is implemented as a single filter.
In spite of these positive aspects, dataflow models have well studied limitations that make them unsuitable for designing multi-modal, interactive systems. Generally, most systems are designed in different styles, introducing unmapped areas at the interface between the resulting parts. The SAI architectural style was introduced to allow consistent design and modeling of complex systems (Francois, 2004). This tutorial explores how SAI can help design a more explicit, more understandable, easier to implement, and maybe more efficient, CAMSHIFT tracking system.
This section presents experiments in designing a CAMSHIFT-based tracking system with SAI. A reverse engineering approach, starting from the OpenCV CAMSHIFT demonstration, and getting deeper into the details of CAMSHIFT, produces designs of increasing apparent complexity but also increasing explicitness.
The most immediate way to implement a CAMSHIFT-based tracking system with OpenCV is to use the CvCamShiftTracker class as a black-box, mirroring its use in the DirectShow filter. Figure 2 shows the resulting design (design 1). Just like the CAMSHIFT specific portion of the OpenCV CAMSHIFT demonstration is completely encapsulated in a single DirectShow filter, it is encoded as a minimal functional unit in SAI. This unit, composed of a single source and a single cell, is delineated by a dashed line in the figure. Standard video input and image display modules provide capture and display functionalities.
Figure 2: Black-box design. The dashed line delineates the CAMSHIFT functional unit. A flow monitoring unit allows to collect throughput and latency statistics with minimum impact on overall system performance.
One defining feature of the SAI framework is the explicit identification of two fundamental data classes: volatile and persistent. Volatile data is used, produced and/or consumed, and remains in a system only for a limited fraction of its lifetime. For example, in a video processing application, the video frames captured and processed are typically volatile data: after they have been processed and displayed or saved, they are not kept in the system. Process parameters, on the other hand, must remain in the system for the whole duration of its activity.
In design 1, an instance of the tracker class encodes both processing functions and persistent data, and should thus be held in a source. A custom node type encapsulates an instance of the CvCamShiftTracker class. The specifications for this CAMSHIFT tracker node are as follows:
coltrk::CCamShiftTrackerNode (fsf::CNode) "COLTRK_CAMSHIFT_TRACKER_NODE"
The input and output frames are volatile. The custom CAMSHIFT tracker cell feeds each input frame to the persistent instance of CvCamShiftTracker, and creates an output synthetic frame with the result of its processing. The synthetic frame is added to the pulse and subsequently displayed. The logical definition for the CAMSHIFT Tracker cell is as follows:
| coltrk::CCamShiftTracker (fsf::CCell) | COLTRK_CAMSHIFT_TRACKER |
| Active filter | [IMAGE_IMAGE "Input"] |
| Passive filter | [COLTRK_CAMSHIFT_TRACKER_NODE "CamShift tracker"] |
| Output | (x (IMAGE_IMAGE "Tracking")) |
Because the same persistent instance of CvCamShiftTracker is accessed concurrently by individual threads processing subsequent input frames, the CAMSHIFT tracker node class must provide a mutual exclusion mechanism to ensure computational consistency. As a result, the processing of each input frame is performed within a critical section, resulting in a sequentialization of this part of the system.
Design 1, although correct and straightforward, is not really satisfactory. First, the output of the tracker cell is an image, which is useful for visualizing the results of the tracker, but definitely not the form in which they would be used in a PUI-based system for example. This aspect is easily solved by limiting the CAMSHIFT tracker cell to only tracking, and having a separate custom cell perform the rendering of the synthetic image.
Even then, having a single persistent object carry all the computations in a critical section without a second look is, at least potentially, not the most efficient use of SAI's asynchronous parallel processing model. The sequentialization of the processing in this part of the system re-correlates throughput and latency, a potentially serious performance bottleneck. The modularity of SAI could certainly be better used.
An analysis of the code for the CvCamShiftTracker class shows a three-step processing sequence for each input frame:
cvCvtColor function for color space conversion. A cell encapsulating this functionality can be defined as follows:
| opencv::CCvtColor (fsf::CCell) | OPENCV_CVT_COLOR |
| Active filter | [IMAGE_IMAGE "Input"] |
| Passive filter | [FSF_NODE "Root" [FSF_INT32_NODE "Conversion code" CVT_COLOR_CODE] [FSF_INT32_NODE "Nb output channels" CVT_COLOR_CHANNELS]*] |
| Output | (x (IMAGE_IMAGE "HSV image")) |
cvCalcBackProject. These elements are encapsulated as a Histogram node and a Back-projection cell:
coltrk::CHistogram (fsf::CNode) "COLTRK_HISTOGRAM"
| coltrk::CBackProject (fsf::CCell) | COLTRK_BACK_PROJECT |
| Active filter | [IMAGE_IMAGE "HSV image"] |
| Passive filter | [COLTRK_HISTOGRAM "Histogram"] |
| Output | (x (IMAGE_IMAGE "Back projection image")) |
cvCamShift function. Given an initial search window and a back-projection image, the function returns the bounding box of the most probable detection area in the image, and a size and orientation estimate of the distribution. The bounding box is used to infer the initial search window in the next input back-projection image. The back projection image is obviously volatile data. The initial search window data (input) is related to the detected area (output) in the previous frame. In an asynchronous approach, this information can be encoded as the last known detection result (bounding box), which is persistent information updated after the processing of each new frame, thus forming a feed-back loop. The rectangle data structure CvRect defined in OpenCV is encapsulated in a Rectangle node, implemented as a type specialization of the fsf::CTypeNode template class in the opencv namespace; so are the CvBox2D and CvConnectedComp structures:
opencv::RectangleNode (fsf::CTypeNode Note that this implementation is not strictly thread safe, as it assumes that operations on these structures are atomic (no mutual exclusion mechanism).
The CAMSHIFT cell definition is as follows:
opencv::Box2DNode (fsf::CTypeNode
opencv::ConnectedCompNode (fsf::CTypeNode
coltrk::CCamShift (fsf::CCell) COLTRK_CAMSHIFT Active filter [IMAGE_IMAGE "Back projection image"] Passive filter [OPENCV_RECTANGLE_NODE "Window"] Output (x (OPENCV_CONNECTED_COMP_NODE "CamShift") (OPENCV_BOX_2D_NODE "CamShift-Box"))
Figure 3 shows an SAI design (design) making these three independent steps (color conversion, back-projection and CAMSHIFT) explicit, with a separate rendering unit to produce the result visualization images.
Figure 3: A 3(+1)-step design. The dashed line delineates the subsystem functionally equivalent to that delineated in figure 2. A flow monitoring unit allows to collect throughput and latency statistics with minimum impact on overall system performance.
The specifications for the rendering cell is as follows:
| coltrk::CRender (fsf::CCell) | COLTRK_RENDER |
| Active filter | [FSF_NODE "Root" [IMAGE_IMAGE "Input" RENDER_INPUT] [OPENCV_CONNECTED_COMP_NODE "Tracking" RENDER_COMP]* [OPENCV_BOX_2D_NODE "Tracking-Box" RENDER_BOX]* [IMAGE_IMAGE "Back projection image" RENDER_BACK_PROJECT]* ] |
| Passive filter | [FSF_BOOL_NODE "Show back projection"] |
| Output | (x (IMAGE_IMAGE "Composite image")) |
The composition of the four units in a straightforward process dependency sequence is functionally equivalent to the single CAMSHIFT source/cell unit of design 1. Because independent processes are separated, this design is more efficient than the single-unit design. Instead of a single critical section processing for each input frame, only histogram node access and last known bounding box access are placed in critical sections and they are independent. Since the processing involved in each critical section is much lighter than in the single unit case, the constraints put by the processing time on the system throughput are much weaker. Another advantage of design 2 is to make explicit the main data structures and the different components that are involved in the processing, and their relationships. The finer modularity of the system makes it more understandable and also more flexible in its implementation, maintenance and evolution. In the light of these observations, the CAMSHIFT unit of design 2 is worth a closer look.
As noted earlier, the core of CAMSHIFT is a feed-back loop in which the last known target bounding box is used to set the initial search location in the new input image. First, mean shift is used to compute the centroid of the 2D color probability distribution within the search window. second, the size and orientation of the target probability distribution are computed from the zeroth and second moments respectively. Finally, the last known target bounding box is updated, and will be used to compute the initial search location for the next frame. Cells implementing these three steps could be defined as follows:
| coltrk::CMeanShift (fsf::CCell) | COLTRK_MEANSHIFT |
| Active filter | [IMAGE_IMAGE "Back projection image"] |
| Passive filter | [OPENCV_RECTANGLE_NODE "Window"] |
| Output | (x (OPENCV_CONNECTED_COMP_NODE "MeanShift")) |
| coltrk::CMoments (fsf::CCell) | COLTRK_MOMENTS |
| Active filter | [FSF_NODE "Root" [IMAGE_IMAGE "Back projection image" MOMENTS_BACK_PROJECT] [OPENCV_CONNECTED_COMP_NODE "MeanShift" MOMENTS_COMP] ] |
| Passive filter | [FSF_NODE "Root"] |
| Output | (x (OPENCV_CONNECTED_COMP_NODE "CamShift") (OPENCV_BOX_2D_NODE "CamShift-Box")) |
| coltrk::CWindowUpdate (fsf::CCell) | COLTRK_WINDOW_UPDATE |
| Active filter | [OPENCV_CONNECTED_COMP_NODE "CamShift"] |
| Passive filter | [OPENCV_RECTANGLE_NODE "Window"] |
| Output | (no output) |
Figure 4 shows an SAI design (design 3) based on such cells. The three CAMSHIFT stages (mean shift, moments and update) are explicit, and form a feed-back loop. Since the MeanShift and the Update cells both use the persistent last known bounding box, they are both connected to a common source that holds the bounding box node.
Figure 4: An explicit feedback loop design. The dashed delineates the subsystem functionally equivalent to the CAMSHIFT unit (step III) in figure 3. A flow monitoring unit allows to collect throughput and latency statistics with minimum impact on overall system performance.
Because of the asynchrony of the processing, the target information used for initialization in the MeanShift cell at any given time is in fact the latest available known bounding box. If the latency of the feed-back loop subsystem is higher than the inverse of its throughput, the time difference between the previous result used for input and the frame being processed will be more than the interval between two consecutive frames. The algorithm will then be applied in conditions similar to that of a system with lower throughput, and will perform consistently. However, thanks to the asynchronous parallelism, the higher overall throughput will be maintained for the system.
Another type of parallelism is illustrated in the branching of the stream to follow independent parallel paths. After coming out of the Moments cell, the stream follows a path to the Update cell, and another path through the rendering cell and finally to the display cell. While pulse-level multithreading principally improves throughput, stream-level parallelism can have a major impact on latency. The result of the processing should be used as soon as possible for the application of interest (here visualization) and for update, in no arbitrarily imposed order. As long as computing resources (in a general sense) are available, and assuming fair scheduling, the model allows to achieve minimal latency.
Design 3 certainly gives more insight in the structure of the CAMSHIFT tracking system than the black-box designs in either DirectShow or SAI. In fact, design 3 could (and, normally, should) be directly derived from a description of CAMSHIFT and its use for tracking. The system design graph itself gives an informational pictorial representation of the algorithm. Depending on the level of detail in which each element is described, the graph may be seen as a conceptual or logical level design. It is also to be used as a structural map of the code that implements it (physical level). Note that such a system would not be easily modeled with the same level of detail in a traditional dataflow approach. In particular, because they lack the concept of shared repositories, dataflow models make feed-back loops involving more than one processing unit ill-defined.
Design 3 makes good use of SAI modeling power and has a level of detail suitable to the communication and sharing of the design, and the efficient implementation of a corresponding system using OpenCV. The next section reports on the implementation of the three proposed designs and on their comparative performance.
SAI designs constitute blueprints for implementation that map closely to the code. The three designs presented above can be readily implemented using MFSM and relevant modules. This section describes the corresponding demonstration program, and presents comparative numerical studies of coding effort and system performance.
Full source code and VC++ project files for the application are available for download from the MFSM Source Forge page: sourceforge.net/projects/mfsm (package name: CAMSHIFT tracker). The systems developed make use of existing modules as often as possible, to minimize the amount of coding required. The CAMSHIFT specific code is regrouped in the files CamShiftModule.[h|cpp] in the coltrk namespace. Some general OpenCV elements encapsulations are regrouped in the files OpenCVModule.[h|cpp] in the opencv namespace. The application skeleton, including video input and image display, is provided by the demonstration developed in the Video 101 tutorial.
The three designs have been built into a single application. The design to run can be specified as input to the executable, as well as the requested frame rate and an eventual input video file name (if no file name is specified, the input will be taken from a USB or FireWire video camera). If live input is requested and no camera (USB or FireWire) is detected, the live input module crashes ("feature" of the live input module). The color histogram is initialized (start/reset) from a window in the center of the image.
Usage: ColorTracking <design> <frame rate> [<file name>]
For example, to run the graph corresponding to Design 3 with live input at a requested frame rate of 30 fps: ColorTracking 3 30
For example, to run the graph corresponding to Design 2 on file c:\mytrackingvideo.avi at a requested frame rate of 15 fps: ColorTracking 2 15 c:\mytrackingvideo.avi
Move the camera so that the target fills the center of the image, then press F1 to start tracking. Press F4 to see the backprojection images. Press F3 to reset the histogram. Press ESC to quit. Before terminating, the application writes flow monitoring statistics to the console: filtered throughput and latency, and number of frames processed.
Software Line Of Code (SLOC) metrics help estimate code volume, and thus development effort. SLOC measurements were computed with the Code Count tool for C++. The physical SLOC definition is independent of the programming language syntax; the logical SLOC definitions involve language-specific syntax. Table 1 shows SLOC counts for the code implementing the CAMSHIFT DirectShow filter and for the code implementing analogous SAI elements (CAMSHIFT tracker cell and node in design 1). The simpler API offered by MFSM requires less than one third of the coding to achieve the same functionality. Even if, in both cases, most of the framework "wrapping" code can be generated automatically, a smaller code volume is always preferable, because simpler to understand. The demonstration applications using the filter and the module are quite different in their structure, and thus not directly comparable.
Table 1: SLOC metrics for CAMSHIFT tracker DirectShow filter and SAI module.
| SLOC | Physical | Logical |
| DirectShow filter (OpenCV) | 735 | 504 |
| Tracker module (SAI design 1) | 169 | 162 |
Table 2 shows SLOC counts for the different SAI designs. The numbers reported distinguish non-specific elements (MFSM core library and non-OpenCV related modules) from CAMSHIFT specific elements (module(s) and application graph). The tests do not include any user interface code. Only specific code must be actually written in order to implement the systems, non specific code is available as libraries or source files that can be included directly into a project.
Table 2: SLOC metrics for the systems implementing the three SAI designs.
| SLOC | Physical | Logical |
| Non specific | 4059 | 2998 |
| Design 1 specific | 277 | 263 |
| Design 1 total | 4336 | 3261 |
| % specific | 6% | 8% |
| Design 2 specific | 734 | 579 |
| Design 2 total | 4793 | 3577 |
| % specific | 15% | 16% |
| Design 3 specific | 963 | 794 |
| Design 3 total | 5022 | 3792 |
| % specific | 19% | 21% |
As can be expected, as the design complexity increases, so does the amount of specific code to write. Yet, this amount only represents about one fifth of the total project SLOC count for design 3. As mentioned above, some of this code is ``wrapper'' code that could be generated automatically. The remaining code is directly related to the implementation of the specific algorithms of interest for the experiments or application. In the present case, this implementation is a mix of direct coding and use of specific OpenCV data structures and functions. Note that hardly any code that is not directly relevant to a Computer Vision researcher was written: the MFSM core library, as well as video input and display, are examples of code that any researcher in the field should be able to take for granted.
Also note that some of the specific code developed for the CAMSHIFT demonstration would actually belong in a wide scale OpenCV encapsulation module, which would be directly reusable in subsequent projects.
SAI's asynchronous parallel processing model allows to specify optimal designs in terms of throughput and latency. Critical questions in the context of these experiments are: (1) how well MFSM implements the asynchronous parallel processing model; and, (2) what influence a system's design has on its throughput and latency.
Efficiency and scalability of systems implemented with MFSM are supported by a large number of experiments. Multithreading support enables real-time performance for media stream processing using current state-of-the art single or multi-processor machines, while allowing to achieve maximum throughput and minimizing latency.
A flow monitoring module (part of MFSM) implements a cell that collects throughput and latency statistics with minimum impact on overall system performance. For each design, measurements were taken as shown on figures 2, 3 and 4. The latency is measured as the time difference from the image time stamp (set at capture time), to the time when the detection data becomes available in the system. This latency measure does not take into account camera delay, which is a combination of the latency imposed by the capture rate, and hardware delays. Rendering and display of the composite image are also left out, to be consistent with the use of the tracking for PUI purposes.
Measurements were taken in similar conditions, with live video input at 320x240 and 640x480 pixels, for an effective throughput of 30 frames per second. For these experiments, the critical elements of the computer system used are: a dual Intel Pentium 4 at 2.2GHz CPU, a Logitech QuickCam Pro 4000 (USB) for the live video input, and an nVidia Quadro4 XGL graphics card for display. The latency as well as the processor load linked to the system showed no significant variation among the three tracking systems. For reference, the numerical values obtained in this particular experimental setting were: 320x240: <6ms latency and <20% processor load; 640x480: <30ms latency and <80% processor load.
These results suggest that, even for such a simple system, a significantly more complex design (in terms of architectural component instances involved) does not result in a significantly more computationally expensive nor less efficient software system. For more complex systems, full use of SAI's properties should result in significant performance improvements. The experimental results also bring supporting evidence that MFSM, although it is a proof-of-concept implementation yet to be fully optimized, is efficient enough for real-time video processing on modern computers.
This tutorial presented design and implementation experiments for CAMSHIFT-based tracking systems using OpenCV and SAI. The reverse engineering of the OpenCV CAMSHIFT illustrated the intellectual and practical value of a precise system representation that explicits the internal structure of the algorithms that it comprises as well as their relationships. An SAI design graph also represents a detailed blueprint for the coding of the system, whose implementation, using a corresponding architectural middleware such as MFSM, is then straightforward. A large part of the code can be directly imported as reusable modules, significantly reducing development effort. When used correctly, SAI confers properties to the system architectures designed in the style, that translate into efficiency and optimality properties for the software implementing those designs. The modularity of the resulting code and the availability of a detailed map (the design graph) facilitate system maintenance and evolution, code reuse in subsequent applications, as well as experiment reproducibility and objective comparative testing of various functionally similar algorithms in otherwise identical systems. Finally, even in the case of such a simple and straightforward system as the CAMSHIFT-based tracker, an explicit design provides much better understanding of the algorithm and its dynamics. Proof-of-concept system coding effort is kept to a low level, and performance of the resulting system is not impacted. In the case of complex systems, full use of SAI's properties should result in significant performance improvements.
This work was supported in part by the Advanced Research and Development Activity of the U.S. Government under contract No. MDA904-03-C-1786.