Hands Gesture Recognition

by Andrew Kirillov

Some ideas about Hands Gesture Recognition in still images and video feeds.
Posted: October 12, 2008  
Updated: February 16, 2010  
Programming languages: C#  
AForge.NET framework: 2.1.1  

Sample application (sources) - 149K
Sample application (binaries) - 130K
Sample video #1 - 7290K
Sample video #2 - 11084K



Since I wrote my first article about motion detection, I have received a lot of e-mails from people around the world who found the article quite useful and applied the code in many different areas. These ranged from simple video surveillance tools to quite impressive applications, such as laser gesture recognition, detecting comets with a telescope, detecting hummingbirds and taking camera shots of them, controlling a water cannon, and many others.

In this article I would like to discuss one more application, which uses motion detection as its first step and then does some interesting processing of the detected object - hands gesture recognition. Let's suppose we have a camera which monitors some area. When somebody enters the area and makes some hands gestures in front of the camera, the application should detect the type of the gesture and, for example, raise an event. When a hands gesture is detected, the application may perform different actions depending on its type. For example, a gesture recognition application may control some device or another application by sending it different commands depending on the recognized gesture. What type of hands gestures are we talking about? The application discussed in this article can recognize up to 15 gestures, which are combinations of 4 different positions of 2 hands: the hand is not raised, raised diagonally down, raised diagonally up or raised straight.

All the algorithms described in the article are based on the AForge.NET framework, which provides different image processing routines used by the application. The application also uses some motion detection routines, which are inspired by the framework and another article dedicated to motion detection.

Before we go into a deep discussion of what the application does and how it is implemented, let's take a look at a very quick demo ...

Motion detection and object extraction

Before we can start with hands gesture recognition, we first need to extract the human body which demonstrates some gesture, and find a good moment when the actual gesture recognition should be done. For both of these tasks we are going to reuse some motion detection ideas described in the article dedicated to motion detection.

For the object extraction task we are going to use an approach based on background modeling. Let's suppose that the very first frame of a video stream does not contain any moving objects, but only a background scene.

Background image

Of course, such an assumption cannot be valid for all cases. But, first of all, it is valid for most cases, so it is quite applicable, and second, our algorithm is going to be adaptive, so it can handle situations when the first frame contains more than just the background. But let's take things one step at a time. So, our very first frame can be taken as an approximation of the background frame.

// check background frame
if ( backgroundFrame == null )
{
    // save image dimension
    width     = image.Width;
    height    = image.Height;
    frameSize = width * height;

    // create initial background image
    bitmapData = image.LockBits(
        new Rectangle( 0, 0, width, height ),
        ImageLockMode.ReadOnly, PixelFormat.Format24bppRgb );
    // apply grayscale filter getting unmanaged image
    backgroundFrame = grayscaleFilter.Apply( new UnmanagedImage( bitmapData ) );
    // unlock source image
    image.UnlockBits( bitmapData );
}


Now let's suppose that after a while we receive a new frame, which contains some object, and our task is to extract it.

Object image

When we have two images, the background and the image with an object, we may use Difference filter to get a difference image:

// lock source image
bitmapData = image.LockBits(
    new Rectangle( 0, 0, width, height ),
    ImageLockMode.ReadOnly, PixelFormat.Format24bppRgb );

// apply the grayscale filter
grayscaleFilter.Apply( new UnmanagedImage( bitmapData ), currentFrame );

// unlock source image
image.UnlockBits( bitmapData );

// set background frame as an overlay for difference filter
differenceFilter.UnmanagedOverlayImage = backgroundFrame;

// apply difference filter
differenceFilter.Apply( currentFrame, motionObjectsImage );

Difference image

On the difference image it is possible to see the absolute difference between the two images: whiter areas show higher difference and black areas show no difference. The next two steps are:

  1. Threshold the difference image using the Threshold filter, so that each pixel is classified either as a significant change (most probably caused by a moving object) or as a non-significant change.
  2. Remove noise from the thresholded difference image using the Opening filter. After this step the stand-alone pixels, which could be caused by a noisy camera and other circumstances, are removed, so we'll have an image which depicts only the more or less significant areas of change (motion areas).

// apply threshold filter
thresholdFilter.ApplyInPlace( motionObjectsImage );

// apply opening filter to remove noise
openingFilter.ApplyInPlace( motionObjectsImage );

Thresholded image
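To make the difference and threshold steps more concrete, here is a minimal sketch that works on plain byte arrays instead of AForge.NET images. It is not the framework's implementation: the class name, the method name and the threshold value used in the usage note are assumptions made only for illustration, and the noise removal done by the Opening filter is left out.

```csharp
using System;

public static class DifferenceSketch
{
    // Absolute per-pixel difference followed by a binary threshold.
    // Both inputs are 8-bit grayscale frames of the same size, stored row by row.
    public static byte[] ThresholdedDifference( byte[] background, byte[] current, byte threshold )
    {
        if ( background.Length != current.Length )
            throw new ArgumentException( "Frames must have the same size." );

        byte[] result = new byte[current.Length];

        for ( int i = 0; i < current.Length; i++ )
        {
            // absolute difference between the two frames
            int diff = Math.Abs( current[i] - background[i] );

            // white (255) marks a significant change, black (0) everything else
            result[i] = ( diff >= threshold ) ? (byte) 255 : (byte) 0;
        }

        return result;
    }
}
```

For example, with a background of { 10, 10, 200 }, a current frame of { 12, 90, 100 } and a hypothetical threshold of 15, only the last two pixels are marked as changed.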

It looks like we have got quite a good hands gesture image and are ready for the next step, recognition ... Not yet. The object image we got as an example shows a quite recognizable human body, which demonstrates some hands gesture. But before we get such an image from our video stream, we'll receive a lot of other frames, which may contain many other objects that are far from being a human body. Such objects could be anything else moving across the scene, or even noise bigger than what we filtered out before. To get rid of some false objects, let's go through all objects in the image and check their size. To achieve this we are going to use the BlobCounter class:

// process blobs
blobCounter.ProcessImage( motionObjectsImage );
Blob[] blobs = blobCounter.GetObjectInformation( );

int maxSize = 0;
Blob maxObject = new Blob( 0, new Rectangle( 0, 0, 0, 0 ) );

// find biggest blob
if ( blobs != null )
{
    foreach ( Blob blob in blobs )
    {
        int blobSize = blob.Rectangle.Width * blob.Rectangle.Height;

        if ( blobSize > maxSize )
        {
            maxSize = blobSize;
            maxObject = blob;
        }
    }
}

How are we going to use the information about the biggest object's size? First of all, we are going to implement the adaptive background mentioned before. Suppose that from time to time we have some minor changes in the scene, like minor changes in lighting conditions, movements of small objects, or even a small object that has appeared and stayed in the scene. To take these changes into account, we are going to have an adaptive background: we change our background frame (which is initialized from the first video frame) in the direction of these changes using the MoveTowards filter. The MoveTowards filter slightly changes one image so that its difference from a second provided image becomes smaller. For example, if we have a background image, which contains the scene only, and an object image, which contains the same scene plus an object, then applying the MoveTowards filter repeatedly to the background image will make it the same as the object image after a while. The more times we apply the MoveTowards filter to the background image, the more evident the object becomes in it (the background image becomes "closer" to the object image and the difference becomes smaller).

So, we are checking the size of the biggest object in the current frame and, if it is not that big, we consider the object as not significant and we just update our background frame to adapt to the changes:

// if we have only small objects then let's adapt to changes in the scene
if ( ( maxObject.Rectangle.Width < 20 ) || ( maxObject.Rectangle.Height < 20 ) )
{
    // move background towards current frame
    moveTowardsFilter.UnmanagedOverlayImage = currentFrame;
    moveTowardsFilter.ApplyInPlace( backgroundFrame );
}
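To illustrate how repeated small steps pull the background towards the current frame, here is a minimal sketch on plain byte arrays. It does not use the AForge.NET MoveTowards filter itself; the class name and the step parameter are assumptions made for illustration only.

```csharp
using System;

public static class MoveTowardsSketch
{
    // Moves every background pixel towards the matching pixel of the current
    // frame by at most 'step' intensity levels (the step size is an assumption).
    public static void MoveTowards( byte[] background, byte[] current, int step )
    {
        if ( background.Length != current.Length )
            throw new ArgumentException( "Frames must have the same size." );

        for ( int i = 0; i < background.Length; i++ )
        {
            int diff = current[i] - background[i];

            // never overshoot: the change is limited by both the step and the difference
            int change = Math.Sign( diff ) * Math.Min( Math.Abs( diff ), step );

            background[i] = (byte) ( background[i] + change );
        }
    }
}
```

With a step of 1, a background pixel of 100 needs three frames to reach a current value of 103, so a short-lived change barely affects the background, while an object that stays in the scene is gradually absorbed into it.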

The second use of the maximum object's size is to find an object which is quite significant and may potentially be a human body. To save CPU time, our hands gesture recognition algorithm is not going to analyze every object that happens to be the biggest in the current frame, but only objects which satisfy some requirements:

if ( ( maxObject.Rectangle.Width >= minBodyWidth ) &&
     ( maxObject.Rectangle.Height >= minBodyHeight ) &&
     ( !firstFrame ) )
{
    // do further processing of the frame
}

OK, now we have an image which contains a moving object, and the object's size is reasonable enough that it could potentially be a human body. Are we ready to pass the image to the hands gesture recognition module for further processing? Again, not yet ...

Yes, we've detected a quite big object, which may be a human body demonstrating some gesture. But what if the object is still moving? What if the object has not stopped yet and is not ready to demonstrate the real gesture it would like to demonstrate? Do we really want to pass all these frames to the hands gesture recognition module while the object is still in motion, loading our CPU with more computations? Moreover, since the object is still in motion, we may even detect a gesture which is not the one the object would like to demonstrate. So, let's not hurry with gesture recognition yet.

After we've detected an object which is a candidate for further processing, we would like to give it a chance to stop for a while and demonstrate something to us - a gesture. If the object is constantly moving, it does not want to demonstrate anything, so we can skip its processing. To catch the moment when the object has stopped, we are going to use another motion detector, which is based on the difference between frames. This motion detector checks the amount of change between two consecutive video frames (the current and the previous one) and, depending on this, decides whether motion is detected or not. But in this particular case we are interested not in detecting motion, but in detecting its absence.

// check motion level between frames
differenceFilter.UnmanagedOverlayImage = previousFrame;

// apply difference filter
differenceFilter.Apply( currentFrame, betweenFramesMotion );

// apply threshold filter
thresholdFilter.ApplyInPlace( betweenFramesMotion );

// apply opening filter to remove noise
openingFilter.ApplyInPlace( betweenFramesMotion );

// calculate amount of changed pixels
VerticalIntensityStatistics vis = new VerticalIntensityStatistics( betweenFramesMotion );

int[] histogram = vis.Gray.Values;
int   changedPixels = 0;

for ( int i = 0, n = histogram.Length; i < n; i++ )
    changedPixels += histogram[i] / 255;

// check motion level
if ( (double) changedPixels / frameSize <= motionLimit )
{
    framesWithoutMotion++;
}
else
{
    // reset counters
    framesWithoutMotion = 0;
    framesWithoutGestureChange = 0;
    notDetected = true;
}
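The division by 255 in the loop above works because the thresholded image contains only the values 0 and 255, so each histogram entry divided by 255 is the number of white pixels it covers. The following minimal sketch shows the same calculation on a plain byte array; the class and method names are hypothetical, and it assumes one histogram entry per image row rather than using the AForge.NET statistics class.

```csharp
public static class MotionLevelSketch
{
    // Sums pixel intensities of a binary (0/255) image for each row.
    public static int[] RowSums( byte[] binaryImage, int width, int height )
    {
        int[] sums = new int[height];

        for ( int y = 0; y < height; y++ )
        {
            for ( int x = 0; x < width; x++ )
            {
                sums[y] += binaryImage[y * width + x];
            }
        }

        return sums;
    }

    // Converts the row sums into the number of white (changed) pixels.
    public static int CountChangedPixels( int[] rowSums )
    {
        int changedPixels = 0;

        foreach ( int sum in rowSums )
        {
            changedPixels += sum / 255;
        }

        return changedPixels;
    }

    // True when the fraction of changed pixels does not exceed the limit.
    public static bool IsStill( int changedPixels, int frameSize, double motionLimit )
    {
        return (double) changedPixels / frameSize <= motionLimit;
    }
}
```

For a 3x2 image with three white pixels, the row sums are { 510, 255 }, the changed pixel count is 3 and the motion level is 3 / 6 = 0.5, which is far above any reasonable motion limit.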

As can be seen from the code above, the difference between frames is checked by analyzing the changedPixels variable, which is used to calculate the fraction of changed pixels in the frame; this value is then compared with the configured motion limit to decide whether there is motion or not. But, as can also be seen from the code above, we don't call the gesture recognition routine immediately after we detect that there is no motion. Instead, we keep a counter of consecutive frames without motion, and only when this counter reaches a certain value do we finally pass the object to the hands gesture recognition module.

// check if we don't have motion for a while
if ( framesWithoutMotion >= minFramesWithoutMotion )
{
    if ( notDetected )
    {
        // extract the biggest blob
        blobCounter.ExtractBlobsImage( motionObjectsImage, maxObject );

        // recognize gesture from the image
        Gesture gesture = gestureRecognizer.Recognize( maxObject.Image, true );
        maxObject.Image.Dispose( );
    }
}

One more comment before we move on to the hands gesture recognition discussion. To make sure we don't have false gesture recognition, we make one additional check: we check that the same gesture is recognized on several consecutive frames. This additional check makes sure that the object we've detected really demonstrates one gesture for a while and that the gesture recognition module provides an accurate result.

// check if gesture has changed since the previous frame
if (
    ( gesture.LeftHand == previousGesture.LeftHand ) &&
    ( gesture.RightHand == previousGesture.RightHand )
    )
{
    framesWithoutGestureChange++;
}
else
{
    framesWithoutGestureChange = 0;
}

// check if gesture was not changing for a while
if ( framesWithoutGestureChange >= minFramesWithoutGestureChange )
{
    if ( GestureDetected != null )
        GestureDetected( this, gesture );
    notDetected = false;
}

previousGesture = gesture;
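The two counters described above can be put together in a small state machine. The sketch below is one possible arrangement under simplifying assumptions, not the sample application's code: the class name is hypothetical, the gesture is represented by a plain string, and the motion decision is passed in as a boolean.

```csharp
public sealed class GestureDebouncer
{
    private readonly int minFramesWithoutMotion;
    private readonly int minFramesWithoutGestureChange;

    private int framesWithoutMotion;
    private int framesWithoutGestureChange;
    private bool notDetected = true;
    private string previousGesture = "";

    public GestureDebouncer( int minFramesWithoutMotion, int minFramesWithoutGestureChange )
    {
        this.minFramesWithoutMotion = minFramesWithoutMotion;
        this.minFramesWithoutGestureChange = minFramesWithoutGestureChange;
    }

    // Feeds one frame; returns true when the gesture should be reported.
    public bool Process( bool motionDetected, string gesture )
    {
        if ( motionDetected )
        {
            // any motion resets the counters and re-arms detection
            framesWithoutMotion = 0;
            framesWithoutGestureChange = 0;
            notDetected = true;
            return false;
        }

        framesWithoutMotion++;

        // wait until the object has been still long enough,
        // and report each gesture only once
        if ( ( framesWithoutMotion < minFramesWithoutMotion ) || ( !notDetected ) )
            return false;

        // count consecutive frames showing the same gesture
        if ( gesture == previousGesture )
            framesWithoutGestureChange++;
        else
            framesWithoutGestureChange = 0;

        previousGesture = gesture;

        if ( framesWithoutGestureChange >= minFramesWithoutGestureChange )
        {
            notDetected = false;
            return true;
        }

        return false;
    }
}
```

With both limits set to 2, a gesture that stays unchanged in still frames is reported on the fourth such frame and is not reported again until motion has been seen.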

Hands Gesture Recognition

Now that we have detected an object to process, we can analyze it and try to recognize a hands gesture. The hands gesture recognition algorithm described below assumes that the target object occupies the entire image, not just part of it:

Sample objects to recognize

The idea of our hands gesture recognition algorithm is quite simple and is based entirely on histograms and statistics, not on something like pattern recognition, neural networks, etc. This makes the algorithm quite easy to implement and understand.

The core idea of this algorithm is based on analyzing two kinds of object histograms, horizontal and vertical, which can be calculated using the HorizontalIntensityStatistics and VerticalIntensityStatistics classes:

Horizontal and Vertical histograms

We are going to start hands gesture recognition from utilizing horizontal histogram since for the first step it looks more useful. The first thing we are going to do is to find areas of the image, which are occupied by hands, and the area, which is occupied by torso.

Let's take a closer look at the horizontal histogram. As can be seen from the histogram, the hands' areas have relatively small values, while the torso area is represented by a peak of high values. Taking into account some simple relative proportions of the human body, we may say that a hand's thickness can never exceed 30% of the body's height (30% is quite a big value, but let's take it for safety and as an example). So, applying simple thresholding to the horizontal histogram, we can easily classify the hands' areas and the torso area:

// get statistics about horizontal pixels distribution 
HorizontalIntensityStatistics his = new HorizontalIntensityStatistics( bodyImageData );
int[] hisValues = (int[]) his.Gray.Values.Clone( );

// build map of hands (0) and torso (1)
double torsoLimit = torsoCoefficient * bodyHeight;

for ( int i = 0; i < bodyWidth; i++ )
    hisValues[i] = ( (double) hisValues[i] / 255 > torsoLimit ) ? 1 : 0;

Conversion of Horizontal's histogram

From the thresholded horizontal histogram we can easily calculate the hands' lengths and the torso's width: the length of the right hand is the width of the empty area on the right of the histogram, the length of the left hand is the width of the empty area on the left, and the torso's width is the width of the area between the empty areas:

// get hands' length (bounds are checked before the histogram is accessed)
int leftHand = 0;
while ( ( leftHand < bodyWidth ) && ( hisValues[leftHand] == 0 ) )
    leftHand++;

int rightHand = bodyWidth - 1;
while ( ( rightHand > 0 ) && ( hisValues[rightHand] == 0 ) )
    rightHand--;
rightHand = bodyWidth - ( rightHand + 1 );

// get torso's width
int torsoWidth = bodyWidth - leftHand - rightHand;
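As a worked example of the last two steps, the sketch below builds the hands/torso map from column sums and then measures it. It operates on a plain integer array rather than on AForge.NET statistics; the class and method names are hypothetical, and it assumes that at least one column is classified as torso.

```csharp
public static class BodyMapSketch
{
    // Converts column sums of a binary (0/255) body image into a map
    // of hands (0) and torso (1).
    public static int[] BuildTorsoMap( int[] columnSums, int bodyHeight, double torsoCoefficient )
    {
        double torsoLimit = torsoCoefficient * bodyHeight;
        int[] map = new int[columnSums.Length];

        for ( int i = 0; i < columnSums.Length; i++ )
        {
            map[i] = ( (double) columnSums[i] / 255 > torsoLimit ) ? 1 : 0;
        }

        return map;
    }

    // Measures the empty areas on both sides of the torso and the torso itself.
    public static void Measure( int[] map, out int leftHand, out int rightHand, out int torsoWidth )
    {
        int width = map.Length;

        leftHand = 0;
        while ( ( leftHand < width ) && ( map[leftHand] == 0 ) )
            leftHand++;

        int right = width - 1;
        while ( ( right > 0 ) && ( map[right] == 0 ) )
            right--;

        rightHand = width - ( right + 1 );
        torsoWidth = width - leftHand - rightHand;
    }
}
```

For a body 10 pixels high with a coefficient of 0.3, columns containing 2, 2, 2, 9, 10, 9 and 1 white pixels give the map { 0, 0, 0, 1, 1, 1, 0 }, so the left hand is 3 columns long, the right hand is 1 column long and the torso is 3 columns wide.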

Now that we have the hands' lengths and the torso's width, we can determine whether a hand is raised or not. For each hand, the algorithm tries to detect whether the hand is not raised, raised diagonally down, raised straight or raised diagonally up. All 4 possible positions are demonstrated in the image below in the order they were listed above:

Possible hand's positions

To check if a hand is raised or not, we are going to use some statistical assumptions about body proportions again. If the hand is not raised, its width on the horizontal histogram will not exceed, for example, 30% of the torso's width. Otherwise it is raised somehow.

// process left hand
if ( ( (double) leftHand / torsoWidth ) >= handsMinProportion )
{
    // hand is raised
}
else
{
    // hand is not raised
}

So far we are able to recognize only one hand position: when the hand is not raised. Now we need to complete the algorithm by recognizing the exact hand position when it is raised. To do this we'll use the VerticalIntensityStatistics class, which was mentioned before. But now the class will be applied not to the entire object's image, but only to the hand's image:

// extract left hand's image
Crop cropFilter = new Crop( new Rectangle( 0, 0, leftHand, bodyHeight ) );
Bitmap leftHandImage = cropFilter.Apply( bodyImageData );

// get left hand's position
gesture.LeftHand = GetHandPosition( leftHandImage );

The image above contains quite good samples, and using the above histograms it is quite easy to recognize the gesture. But in some cases we may not have histograms as clear as the ones above, but rather noisy histograms, which may be caused by light conditions and shadows. So before making any final decision about the raised hand, let's perform two small preprocessing steps on the vertical histogram. These two additional steps are quite simple, so their code is not provided here, but it can be found in the source code attached to the article.

1) First of all we need to remove low values from the histogram, for example those lower than 10% of the histogram's maximum value. The image below demonstrates a hand's image which contains some artifacts caused by shadows. Such artifacts can be easily removed by filtering low values out of the histogram, which is also demonstrated in the image below (the histogram is already filtered).

2) Another type of issue which we also need to take care of is a "twin" hand, which is actually a shadow. This can also be easily solved by walking through the histogram and removing all peaks which are not the highest peak.
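The two preprocessing steps could be sketched as follows. This is not the code from the attached sources, only an illustration under stated assumptions: the class and method names are hypothetical, the histogram is a plain integer array, and a "peak" is taken to be a contiguous run of non-zero values.

```csharp
using System;

public static class HandHistogramCleanup
{
    // Step 1: zero out values below a fraction of the histogram's maximum.
    public static void RemoveLowValues( int[] histogram, double fraction )
    {
        int max = 0;
        foreach ( int value in histogram )
            max = Math.Max( max, value );

        double limit = fraction * max;

        for ( int i = 0; i < histogram.Length; i++ )
        {
            if ( histogram[i] < limit )
                histogram[i] = 0;
        }
    }

    // Step 2: keep only the run of non-zero values which contains the maximum.
    public static void KeepHighestPeak( int[] histogram )
    {
        if ( histogram.Length == 0 )
            return;

        int maxIndex = 0;
        for ( int i = 1; i < histogram.Length; i++ )
        {
            if ( histogram[i] > histogram[maxIndex] )
                maxIndex = i;
        }

        if ( histogram[maxIndex] == 0 )
            return;

        // expand from the maximum to both sides while values are non-zero
        int start = maxIndex;
        while ( ( start > 0 ) && ( histogram[start - 1] != 0 ) )
            start--;

        int end = maxIndex;
        while ( ( end < histogram.Length - 1 ) && ( histogram[end + 1] != 0 ) )
            end++;

        // remove everything outside of the highest peak
        for ( int i = 0; i < histogram.Length; i++ )
        {
            if ( ( i < start ) || ( i > end ) )
                histogram[i] = 0;
        }
    }
}
```

Applied to { 1, 0, 20, 40, 30, 0, 3, 15, 10, 0 } with a fraction of 0.1, the first step removes the values 1 and 3, and the second step removes the smaller "twin" peak { 15, 10 }, leaving only { 20, 40, 30 } in place.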

At this point we should have quite clear vertical histograms, like the ones we've seen before, so now we are a few steps away from recognizing the hands gesture.

Let's start with recognizing a straight raised hand first. If we take a look at the image of a straight hand, we can make one more assumption about body proportions: the length of the hand is much bigger than its width. In the case of a straight raised hand, its histogram should have a quite high but thin peak. So, let's use these properties to check if the hand is raised straight:

if ( ( (double) handImage.Width / ( histogram.Max - histogram.Min + 1 ) ) >
     minStraightHandProportion )
    handPosition = HandPosition.RaisedStraigh;
else
{
    // processing of diagonally raised hand
}

(Note: Min and Max properties of Histogram class return minimum and maximum values with non-zero probability. In the above sample code these values are used to calculate the width of the histogram area occupied by hand. See documentation to AForge.Math namespace).

Now we need to make the last check to determine if the hand is raised diagonally up or diagonally down. As we can see from histograms of raised diagonally up/down hands, the peak for the diagonally up hand is shifted to the beginning of the histogram (to the top in the case of vertical histogram), but the peak of the diagonally down hand is shifted more to the center. Again we can use this property to check the exact type of raised hand:

if ( ( (double) histogram.Min / ( histogram.Max - histogram.Min + 1 ) ) <
     maxRaisedUpHandProportion )
    handPosition = HandPosition.RaisedDiagonallyUp;
else
    handPosition = HandPosition.RaisedDiagonallyDown;
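Both checks can be combined into one small routine. The sketch below works on a plain integer array, where the first and the last non-zero entries play the role of the histogram's Min and Max; the enumeration, the class name and the proportion values used in the usage note are assumptions made for illustration, not values taken from the sample application.

```csharp
using System;

public enum SketchHandPosition
{
    RaisedDiagonallyDown,
    RaisedStraight,
    RaisedDiagonallyUp
}

public static class HandClassifierSketch
{
    public static SketchHandPosition Classify( int[] histogram, int handWidth,
        double minStraightHandProportion, double maxRaisedUpHandProportion )
    {
        // find the first and the last rows occupied by the hand
        int min = -1;
        int max = -1;

        for ( int i = 0; i < histogram.Length; i++ )
        {
            if ( histogram[i] != 0 )
            {
                if ( min < 0 )
                    min = i;
                max = i;
            }
        }

        if ( min < 0 )
            throw new ArgumentException( "Histogram is empty." );

        // height of the histogram area occupied by the hand
        int thickness = max - min + 1;

        // long and thin hand: raised straight
        if ( (double) handWidth / thickness > minStraightHandProportion )
            return SketchHandPosition.RaisedStraight;

        // peak close to the top of the image: raised diagonally up
        if ( (double) min / thickness < maxRaisedUpHandProportion )
            return SketchHandPosition.RaisedDiagonallyUp;

        return SketchHandPosition.RaisedDiagonallyDown;
    }
}
```

With hypothetical proportions of 1.5 and 0.1 and a hand image 60 pixels wide, a hand occupying rows 40 to 49 is classified as raised straight (60 / 10 = 6), rows 2 to 61 as raised diagonally up (2 / 60 is below 0.1) and rows 30 to 89 as raised diagonally down (30 / 60 = 0.5).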

We are done! Now our algorithm is able to recognize 4 positions of each hand. Applying the same to the second hand, our algorithm will provide the following results for the 4 hands gestures which were demonstrated above:

  • Left hand is not raised; Right hand is not raised;
  • Left hand is raised diagonally down; Right hand is not raised;
  • Left hand is raised straight; Right hand is not raised;
  • Left hand is raised diagonally up; Right hand is not raised.

If two hands which are not raised are not considered to be a gesture, then the algorithm can recognize 15 hands gestures, which are combinations of the different hand positions.
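The number 15 comes from 4 positions of the left hand times 4 positions of the right hand, minus the single combination where neither hand is raised. A short sketch enumerating the combinations is shown below; the class name and the position names are hypothetical labels for the four positions described above.

```csharp
using System.Collections.Generic;

public static class GestureCountSketch
{
    private static readonly string[] Positions =
    {
        "NotRaised", "RaisedDiagonallyDown", "RaisedStraight", "RaisedDiagonallyUp"
    };

    // Lists all left/right combinations except "both hands not raised".
    public static List<string> EnumerateGestures( )
    {
        List<string> gestures = new List<string>( );

        foreach ( string left in Positions )
        {
            foreach ( string right in Positions )
            {
                if ( ( left == "NotRaised" ) && ( right == "NotRaised" ) )
                    continue;

                gestures.Add( left + "/" + right );
            }
        }

        return gestures;
    }
}
```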


As we can see from the above article, we have got algorithms which, first, allow us to extract a moving object from a video feed and, second, to successfully recognize hands gestures demonstrated by the object. The recognition algorithm is very simple, both to implement and to understand. Also, since it is based only on information from histograms, it is quite efficient and does not require a lot of computational resources, which is quite important if we need to process a lot of frames per second.

To make the algorithms easy to understand we've used generic image processing routines from the AForge.Imaging library, which is part of the AForge.NET framework. This means that by going from generic routines to specialized ones (routines which combine several steps in one), it is easily possible to improve the performance of these algorithms even more.

Concerning possible improvements of these algorithms, we may identify the following areas:

  • More robust recognition in the case of hands' shadows on walls;
  • Handling of a dynamic scene, where different kinds of motion may occur behind the main object.