Sample application (sources) - 149K
Sample application (binaries) - 130K
Sample video #1 - 7290K
Sample video #2 - 11084K

Introduction
Since I wrote my first article about motion detection, I have received a lot of e-mails from people
around the world who found the article quite useful and applied the code in many different areas.
Those areas range from simple video surveillance tools to quite impressive applications, like laser
gesture recognition, detecting comets with a telescope, detecting humming-birds and taking camera
shots of them, controlling a water cannon, and many others.
In this article I would like to discuss one more application, which uses motion detection as its
first step and then performs some interesting routines with the detected object - hands gesture
recognition. Let's suppose we have a camera which monitors some area. When somebody gets into the
area and makes some hand gestures in front of the camera, the application should detect the type of
the gesture and, for example, raise an event. When a hands gesture is detected, the application may
perform different actions depending on the type of the gesture. For example, a gesture recognition
application may control some sort of device, or another application, sending different commands to
it depending on the recognized gesture. What type of hands gestures are we talking about? The
application discussed in this article can recognize up to 15 gestures, which are combinations of 4
different positions of 2 hands - a hand is not raised, raised diagonally down, raised diagonally up, or raised
straight.
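To make the later code samples easier to follow, a gesture can be modeled simply as a pair of hand positions. The sketch below is only an illustration of such a model (the names follow those used in the code samples later in the article; the actual type definitions are part of the attached sources):
// possible positions of a single hand (illustrative definition)
public enum HandPosition
{
    NotRaised,
    RaisedDiagonallyDown,
    RaisedStraight,
    RaisedDiagonallyUp
}
// a gesture is a combination of the left and the right hand positions (illustrative definition)
public struct Gesture
{
    public HandPosition LeftHand;
    public HandPosition RightHand;
}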
All the algorithms described in the article are based on the
AForge.NET framework,
which provides the different image processing routines used by the application. The
application also uses some motion detection routines, which are inspired by the
framework and another article dedicated to
motion detection.
Before we go into a deep discussion of what the application does and how it is
implemented, let's take a look at a very quick demo ...
Motion detection and object extraction
Before we can start with hands gesture recognition, we first of all need to extract
the human body which demonstrates the gesture, and find a good moment when the
actual gesture recognition should be done. For both of these tasks we are going to
reuse some motion detection ideas described in the article dedicated to motion
detection.
For the object extraction task we are going to use an approach based on background
modeling. Let's suppose that the very first frame of a video stream does not contain any moving
objects, but just contains the background scene.
Of course, such an assumption cannot be valid in all cases. But, first of all, it is valid
in most cases, so it is quite applicable; and second, our algorithm is going to be
adaptive, so it can handle situations when the first frame contains more than just the background.
But let's take it one step at a time ... So, our very first frame can be taken as an approximation of the background
frame.
// check background frame
if ( backgroundFrame == null )
{
// save image dimension
width = image.Width;
height = image.Height;
frameSize = width * height;
// create initial background image
bitmapData = image.LockBits(
new Rectangle( 0, 0, width, height ),
ImageLockMode.ReadOnly, PixelFormat.Format24bppRgb );
// apply grayscale filter getting unmanaged image
backgroundFrame = grayscaleFilter.Apply( new UnmanagedImage( bitmapData ) );
// unlock source image
image.UnlockBits( bitmapData );
...
}
Now let's suppose that after a while we receive a new frame which contains some object, and our
task is to extract it.
When we have two images, the background and the image with an object, we may use the Difference
filter to get a difference image:
// lock source image
bitmapData = image.LockBits(
new Rectangle( 0, 0, width, height ),
ImageLockMode.ReadOnly, PixelFormat.Format24bppRgb );
// apply the grayscale filter
grayscaleFilter.Apply( new UnmanagedImage( bitmapData ), currentFrame );
// unlock source image
image.UnlockBits( bitmapData );
// set background frame as an overlay for the difference filter
differenceFilter.UnmanagedOverlayImage = backgroundFrame;
// apply difference filter
differenceFilter.Apply( currentFrame, motionObjectsImage );
On the difference image it is possible to see the absolute difference between the two images - whiter
areas show areas of higher difference and black areas show areas of no difference. The
next two steps are:
- Threshold the difference image using the Threshold filter, so each pixel can be classified
as a significant change (most probably caused by a moving object) or a non-significant change.
- Remove noise from the thresholded difference image using the Opening filter. After this step,
stand-alone pixels, which could be caused by a noisy camera and other circumstances, are removed,
so we have an image which depicts only the more or less significant areas of change (motion areas).
// apply threshold filter
thresholdFilter.ApplyInPlace( motionObjectsImage );
// apply opening filter to remove noise
openingFilter.ApplyInPlace( motionObjectsImage );
It looks like we have got quite a good hands gesture image and are ready for the next step - recognition ...
Not yet. The object image we got as an example represents a quite recognizable human body
demonstrating a hands gesture. But before we get such an image in our video stream, we'll receive a
lot of other frames containing many other objects which are far from being a human body.
Such objects could be anything else moving across the scene, or even noise bigger than
the one we filtered out before. To get rid of such false objects, let's go through all objects in the image
and check their size. To achieve this we are going to use the BlobCounter class:
// process blobs
blobCounter.ProcessImage( motionObjectsImage );
Blob[] blobs = blobCounter.GetObjectInformation( );
int maxSize = 0;
Blob maxObject = new Blob( 0, new Rectangle( 0, 0, 0, 0 ) );
// find biggest blob
if ( blobs != null )
{
foreach ( Blob blob in blobs )
{
int blobSize = blob.Rectangle.Width * blob.Rectangle.Height;
if ( blobSize > maxSize )
{
maxSize = blobSize;
maxObject = blob;
}
}
}
How are we going to use the information about the biggest object's size? First of all, we are going to
implement the adaptive background we mentioned before. Suppose that from time to time we may have some
minor changes in the scene, like small changes in lighting conditions, movements of small objects, or even a
small object appearing and staying on the scene. To take these changes into account, we are going to have an
adaptive background - we are going to change our background frame (which is initially taken from the
first video frame) in the direction of the changes using the MoveTowards
filter. The MoveTowards
filter slightly changes one image in the direction that reduces its difference with a second, provided image.
For example, if we have a background image, which contains the scene only, and an object image, which contains the
same scene plus an object on it, then applying the MoveTowards
filter to the background image repeatedly will make it the same as the object image after a while - the more we apply the
MoveTowards filter to the background
image, the more evident the presence of the object on it becomes (the background image becomes "closer" to the
object image - the difference becomes smaller).
So, we check the size of the biggest object in the current frame and, if it is not that big, we consider
the object insignificant and just update our background frame to adapt to the changes:
// if we have only small objects, then let's adapt to the changes in the scene
if ( ( maxObject.Rectangle.Width < 20 ) || ( maxObject.Rectangle.Height < 20 ) )
{
// move background towards current frame
moveTowardsFilter.UnmanagedOverlayImage = currentFrame;
moveTowardsFilter.ApplyInPlace( backgroundFrame );
}
The second usage of the maximum object's size is to find an object which is significant enough to potentially
be a human body. To save CPU time, our hands gesture recognition algorithm is not going to analyze every object which
happens to be the biggest on the current frame, but only objects which satisfy certain requirements:
if ( ( maxObject.Rectangle.Width >= minBodyWidth ) &&
( maxObject.Rectangle.Height >= minBodyHeight ) &&
( !firstFrame ) )
{
// do further processing of the frame
}
OK, now we have an image which contains a moving object, and the object's size is reasonable enough that it could
potentially be a human body. Are we ready to pass the image to the hands gesture recognition module for further processing?
Again, not yet ...
Yes, we've detected a quite big object, which may be a human body demonstrating some gesture. But what if the object
is still moving? What if the object has not stopped yet and is not yet ready to demonstrate the real gesture it would
like to show? Do we really want to pass all these frames to the hands gesture recognition module while the object is
still in motion, loading our CPU with more computations? Moreover, since the object is still in motion, we may even detect
a gesture which is not the one the object intends to demonstrate. So, let's not hurry with gesture recognition yet.
After we've detected an object which is a candidate for further processing, we would like to give it a chance to stop for
a while and demonstrate something to us - a gesture. If the object is constantly moving, it does not want to demonstrate anything to us,
so we can skip its processing. To catch the moment when the object has stopped, we are going to use another motion detector,
which is based on the difference between frames. This motion detector checks the amount of change between two consecutive video frames
(the current and the previous one) and, depending on this, decides whether motion is detected or not. But in this particular
case we are interested not in detecting motion, but in detecting the absence of motion.
// check motion level between frames
differenceFilter.UnmanagedOverlayImage = previousFrame;
// apply difference filter
differenceFilter.Apply( currentFrame, betweenFramesMotion );
// apply threshold filter
thresholdFilter.ApplyInPlace( betweenFramesMotion );
// apply opening filter to remove noise
openingFilter.ApplyInPlace( betweenFramesMotion );
// calculate amount of changed pixels
VerticalIntensityStatistics vis = new VerticalIntensityStatistics( betweenFramesMotion );
int[] histogram = vis.Gray.Values;
int changedPixels = 0;
for ( int i = 0, n = histogram.Length; i < n; i++ )
{
changedPixels += histogram[i] / 255;
}
// check motion level
if ( (double) changedPixels / frameSize <= motionLimit )
{
framesWithoutMotion++;
}
else
{
// reset counters
framesWithoutMotion = 0;
framesWithoutGestureChange = 0;
notDetected = true;
}
As can be seen from the code above, the between-frames difference is checked by analyzing the changedPixels variable,
which is used to calculate the amount of change as a percentage; this value is then compared with the configured motion limit to
check whether we have motion or not. But, as can also be seen from the code above, we don't call the gesture recognition routine
immediately after detecting that there is no motion. Instead, we keep a counter which counts the number of consecutive
frames without motion. Only when the number of consecutive frames without motion reaches a certain value do we finally pass
the object to the hands gesture recognition module.
// check if we don't have motion for a while
if ( framesWithoutMotion >= minFramesWithoutMotion )
{
if ( notDetected )
{
// extract the biggest blob
blobCounter.ExtractBlobsImage( motionObjectsImage, maxObject );
// recognize gesture from the image
Gesture gesture = gestureRecognizer.Recognize( maxObject.Image, true );
maxObject.Image.Dispose( );
...
}
}
One more comment before we move on to the hands gesture recognition discussion. To make sure we don't get a false gesture recognition,
we make one more additional check - we check that the same gesture is recognized on several consecutive frames. This additional
check makes sure that the object we've detected really demonstrates one gesture for a while and that the gesture recognition module
provides an accurate result.
// check if the gesture has changed since the previous frame
if (
( gesture.LeftHand == previousGesture.LeftHand ) &&
( gesture.RightHand == previousGesture.RightHand )
)
{
framesWithoutGestureChange++;
}
else
{
framesWithoutGestureChange = 0;
}
// check if gesture was not changing for a while
if ( framesWithoutGestureChange >= minFramesWithoutGestureChange )
{
if ( GestureDetected != null )
{
GestureDetected( this, gesture );
}
notDetected = false;
}
previousGesture = gesture;
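Client code only needs to subscribe to the GestureDetected event fired above in order to act on recognized gestures. Below is a hypothetical usage sketch (the detector variable and the delegate name are assumptions made for illustration; the actual declarations can be found in the attached sources):
// subscribe to the gesture detection event of the detector (illustrative names)
detector.GestureDetected += new GestureDetectionHandler( OnGestureDetected );
...
// handle recognized gestures, e.g. by sending a command to a controlled device
private void OnGestureDetected( object sender, Gesture gesture )
{
    Console.WriteLine( "Left hand: {0}, right hand: {1}",
        gesture.LeftHand, gesture.RightHand );
}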
Hands Gesture Recognition
Now, when we have detected an object to process, we can analyze it, trying to recognize a hands gesture. The hands gesture recognition
algorithm described below assumes that the target object occupies the entire image, not just part of it:
The idea of our hands gesture recognition algorithm is quite simple and 100% based on histograms and statistics, not on something
like pattern recognition, neural networks, etc. This makes the algorithm quite easy to implement and understand.
The core idea of the algorithm is based on analyzing two kinds of histograms of the object - the horizontal and the vertical histogram, which can
be calculated using the HorizontalIntensityStatistics
and VerticalIntensityStatistics classes:
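For example, both histograms of an object's image can be obtained in just a few lines (here objectImage is a placeholder for the grayscale image of the extracted object; the Gray property of each statistics class provides the corresponding histogram):
// gather intensity statistics of the object's image
HorizontalIntensityStatistics his = new HorizontalIntensityStatistics( objectImage );
VerticalIntensityStatistics vis = new VerticalIntensityStatistics( objectImage );
// get histogram values - each value is the sum of pixel intensities
// in the corresponding image column (horizontal) or row (vertical)
int[] horizontalHistogram = his.Gray.Values;
int[] verticalHistogram = vis.Gray.Values;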
We are going to start hands gesture recognition by utilizing the horizontal histogram, since it looks more useful for the first step.
The first thing we are going to do is find the areas of the image which are occupied by the hands and the area which is occupied by the torso.
Let's take a closer look at the horizontal histogram. As can be seen from the histogram, the hands' areas have relatively small
values, while the torso area is represented by a peak of high values. Taking into account some simple relative proportions
of the human body, we may say that a human hand's thickness can never exceed 30% of the body's height (30% is quite a big value, but
let's take it for safety and as an example). So, applying simple thresholding to the horizontal histogram, we can easily separate the hands'
areas from the torso area:
// get statistics about horizontal pixels distribution
HorizontalIntensityStatistics his = new HorizontalIntensityStatistics( bodyImageData );
int[] hisValues = (int[]) his.Gray.Values.Clone( );
// build map of hands (0) and torso (1)
double torsoLimit = torsoCoefficient * bodyHeight;
for ( int i = 0; i < bodyWidth; i++ )
{
hisValues[i] = ( (double) hisValues[i] / 255 > torsoLimit ) ? 1 : 0;
}
From the thresholded horizontal histogram we can easily calculate the hands' lengths and the torso's width - the length of the right hand
is the width of the empty area on the histogram from the right, the length of the left hand is the width of the empty area from the left,
and the torso's width is the width of the area between the two empty areas:
// get hands' length
int leftHand = 0;
while ( ( leftHand < bodyWidth ) && ( hisValues[leftHand] == 0 ) )
leftHand++;
int rightHand = bodyWidth - 1;
while ( ( rightHand > 0 ) && ( hisValues[rightHand] == 0 ) )
rightHand--;
rightHand = bodyWidth - ( rightHand + 1 );
// get torso's width
int torsoWidth = bodyWidth - leftHand - rightHand;
Now, when we have the hands' lengths and the torso's width, we can determine whether a hand is raised or not. For each hand, the algorithm
tries to detect whether the hand is not raised, raised diagonally down, raised straight, or raised diagonally up. All 4 possible positions are
demonstrated on the image below, in the order they were listed above:
To check whether a hand is raised or not, we are again going to use some statistical assumptions about body proportions. If a hand is not
raised, its width on the horizontal histogram will not exceed, for example, 30% of the torso's width. Otherwise it is raised in some way.
// process left hand
if ( ( (double) leftHand / torsoWidth ) >= handsMinProportion )
{
// hand is raised
}
else
{
// hand is not raised
}
So far we are able to recognize one hand position - when the hand is not raised. Now we need to
complete the algorithm by recognizing the exact hand position when it is raised. To do this we'll use
the VerticalIntensityStatistics class, which was mentioned before. But now the class will be
applied not to the entire object's image, but only to the hand's image:
// extract left hand's image
Crop cropFilter = new Crop( new Rectangle( 0, 0, leftHand, bodyHeight ) );
Bitmap leftHandImage = cropFilter.Apply( bodyImageData );
// get left hand's position
gesture.LeftHand = GetHandPosition( leftHandImage );
The image above contains quite good samples, and using the above histograms it is quite easy to
recognize the gesture. But in some cases we may not get histograms as clear as the ones above,
but rather noisy histograms caused by lighting conditions and shadows. So, before making any
final decision about the raised hand, let's perform two small preprocessing steps on the vertical
histogram. These two additional steps are quite simple, so their exact code is not provided here, but it can
be retrieved from the source code attached to the article.
1) First of all, we need to remove low values from the histogram, for example values which are lower than 10% of
the histogram's maximum value. The image below demonstrates a hand image which contains
some artifacts caused by shadows. This type of artifact can easily be removed by filtering out the low values
on the histogram, which is also demonstrated on the image below (the histogram is already filtered).
2) Another type of issue we also need to take care of is a "twin" hand, which is
actually a shadow. This can also be easily solved by walking through the histogram and removing all
peaks which are not the highest peak (a rough sketch of both steps is given below).
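As an illustration only (the implementation actually used by the application is in the attached sources), the two filtering steps could be sketched roughly like this, operating directly on the histogram values array:
// Step 1 (illustrative sketch): suppress histogram values below 10% of the maximum value
private static void FilterLowValues( int[] histogram )
{
    int max = 0;
    foreach ( int value in histogram )
    {
        if ( value > max )
            max = value;
    }
    int threshold = max / 10;
    for ( int i = 0; i < histogram.Length; i++ )
    {
        if ( histogram[i] < threshold )
            histogram[i] = 0;
    }
}
// Step 2 (illustrative sketch): keep only the highest peak, removing smaller peaks
// (for example, a "twin" hand caused by a shadow)
private static void FilterSmallPeaks( int[] histogram )
{
    // find the position of the global maximum
    int maxPos = 0;
    for ( int i = 1; i < histogram.Length; i++ )
    {
        if ( histogram[i] > histogram[maxPos] )
            maxPos = i;
    }
    // expand to the left of the maximum while values stay non-zero, then clear the rest
    int left = maxPos;
    while ( ( left > 0 ) && ( histogram[left - 1] != 0 ) )
        left--;
    for ( int i = 0; i < left; i++ )
        histogram[i] = 0;
    // expand to the right of the maximum while values stay non-zero, then clear the rest
    int right = maxPos;
    while ( ( right < histogram.Length - 1 ) && ( histogram[right + 1] != 0 ) )
        right++;
    for ( int i = right + 1; i < histogram.Length; i++ )
        histogram[i] = 0;
}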
At this point we should have quite clear vertical histograms, like the ones we've seen before, so
now we are just a few steps away from recognizing the hands gesture.
Let's start with recognizing a straight raised hand first. If we take a look at the image of a straight
hand, we may make one more assumption about body proportions - the length of the hand is much bigger
than its width. In the case of a straight raised hand, its histogram should have a quite high, but thin,
peak. So, let's use these properties to check whether the hand is raised straight:
if ( ( (double) handImage.Width / ( histogram.Max - histogram.Min + 1 ) ) >
minStraightHandProportion )
{
handPosition = HandPosition.RaisedStraight;
}
else
{
// processing of diagonally raised hand
}
(Note: the Min
and Max
properties of the Histogram class return the minimum and maximum
values with non-zero probability. In the above sample code these values are used to calculate the
width of the histogram area occupied by the hand. See the documentation of the AForge.Math namespace.)
Now we need to make the last check to determine whether the hand is raised diagonally up or diagonally
down. As we can see from the histograms of hands raised diagonally up and down, the peak for a hand raised
diagonally up is shifted towards the beginning of the histogram (towards the top in the case of a vertical histogram), while
the peak of a hand raised diagonally down is shifted more towards the center. Again, we can use this property to
check the exact type of raised hand:
if ( ( (double) histogram.Min / ( histogram.Max - histogram.Min + 1 ) ) <
maxRaisedUpHandProportion )
{
handPosition = HandPosition.RaisedDiagonallyUp;
}
else
{
handPosition = HandPosition.RaisedDiagonallyDown;
}
We are done! Now our algorithm is able to recognize 4 positions of each hand. Applying the same procedure to
the second hand, our algorithm provides the following results for the 4 hands gestures which were
demonstrated above:
- Left hand is not raised; right hand is not raised;
- Left hand is raised diagonally down; right hand is not raised;
- Left hand is raised straight; right hand is not raised;
- Left hand is raised diagonally up; right hand is not raised.
If two hands which are not raised are not considered a gesture, then the algorithm can recognize 15
hands gestures, which are combinations of the different hand positions (4 positions for each of the 2 hands
give 4 x 4 = 16 combinations, minus the one where neither hand is raised).
Conclusion
As we can see from the above, we have algorithms which, first of all, allow us to extract a
moving object from a video feed and, second, to successfully recognize hands gestures demonstrated
by the object. The recognition algorithm is very simple and easy both to implement and to
understand. Also, since it is based only on information from histograms, it is quite efficient
and does not require a lot of computational resources, which is quite important in
case we need to process many frames per second.
To make the algorithms easy to understand, we've used generic image processing routines from the
AForge.Imaging library, which is part of the AForge.NET framework.
This means that by going from generic routines to specialized ones (routines which may combine several steps in one),
it is easily possible to improve the performance of these algorithms even more.
Concerning possible areas of improvement of these algorithms, we may identify the following:
- More robust recognition in the case of hand shadows on walls;
- Handling of dynamic scenes, where different kinds of motion may occur behind the main object.
