My Random List

Nikolai Smolyanskiy’s blog

Face Tracking SDK in Kinect For Windows 1.5

Posted by nsmoly on May 21, 2012

After a long journey, my team at Microsoft shipped Face Tracking SDK as part of Kinect For Windows 1.5! I worked on the 3D face tracking technology (starting from the times when it was part of Avatar Kinect) and so I’d like to describe its capabilities and limitations in this post. First of all, here is the demo:

You can use the Face Tracking SDK in your program if you install Kinect for Windows Developer Toolkit 1.5. After you install it, go to the provided samples and build and run the “Face Tracking Visualization” C++ sample or the “Face Tracking Basics-WPF” C# sample. Of course, you need to have a Kinect camera attached to your PC ;-) The face tracking engine takes 4-8 ms per frame, depending on how powerful your PC is. It does its computations on the CPU only (it does not use the GPU, since the GPU may be needed to render graphics).

If you look at the 2 code samples mentioned above, you can see that it is relatively easy to add face tracking capabilities to your application. You need to link with the provided lib, place 2 DLLs in the global path or in the working directory of your executable (so they can be found) and add something like this to your code (this is in C++; you can also do it in C#, see the code samples):

// Include main Kinect SDK .h file
#include "NuiAPI.h"

// Include the face tracking SDK .h file
#include "FaceTrackLib.h"

// Create an instance of a face tracker
IFTFaceTracker* pFT = FTCreateFaceTracker();
if( !pFT )
{
    // Handle errors
}

// Initialize cameras configuration structures.
// IMPORTANT NOTE: resolutions and focal lengths must be accurate, since it affects tracking precision!
// It is better to use enums defined in NuiAPI.h

// Video camera config with width, height, focal length in pixels
// NUI_CAMERA_COLOR_NOMINAL_FOCAL_LENGTH_IN_PIXELS focal length is computed for 640x480 resolution
// If you use different resolutions, multiply this focal length by the scaling factor
FT_CAMERA_CONFIG videoCameraConfig = {640, 480, NUI_CAMERA_COLOR_NOMINAL_FOCAL_LENGTH_IN_PIXELS};

// Depth camera config with width, height, focal length in pixels
// NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS focal length is computed for 320x240 resolution
// If you use different resolutions, multiply this focal length by the scaling factor
FT_CAMERA_CONFIG depthCameraConfig = {320, 240, NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS};

// Initialize the face tracker
HRESULT hr = pFT->Initialize(&videoCameraConfig, &depthCameraConfig, NULL, NULL);
if( FAILED(hr) )
{
    // Handle errors
}

// Create a face tracking result interface
IFTResult* pFTResult = NULL;
hr = pFT->CreateFTResult(&pFTResult);
if( FAILED(hr) )
{
    // Handle errors
}

// Prepare image interfaces that hold RGB and depth data
IFTImage* pColorFrame = FTCreateImage();
IFTImage* pDepthFrame = FTCreateImage();
if( !pColorFrame || !pDepthFrame )
{
    // Handle errors
}

// Attach created interfaces to the RGB and depth buffers that are filled with
// corresponding RGB and depth frame data from Kinect cameras
pColorFrame->Attach(640, 480, colorCameraFrameBuffer, FTIMAGEFORMAT_UINT8_R8G8B8, 640*3);
pDepthFrame->Attach(320, 240, depthCameraFrameBuffer, FTIMAGEFORMAT_UINT16_D13P3, 320*2);
// You can also use the Allocate() method, in which case the IFTImage interfaces own their memory.
// In this case use the CopyTo() method to copy buffers

FT_SENSOR_DATA sensorData;
sensorData.pVideoFrame = pColorFrame;
sensorData.pDepthFrame = pDepthFrame;
sensorData.ZoomFactor = 1.0f;       // Not used, must be 1.0
POINT viewOffset = {0, 0};
sensorData.ViewOffset = viewOffset; // Not used, must be (0,0)

bool isFaceTracked = false;

// Track a face
while ( true )
{
    // Call Kinect API to fill colorCameraFrameBuffer and depthCameraFrameBuffer with RGB and depth data

    // Check if we are already tracking a face
    if( !isFaceTracked )
    {
        // Initiate face tracking.
        // This call is more expensive and searches the input frame for a face.
        hr = pFT->StartTracking(&sensorData, NULL, NULL, pFTResult);
        if( SUCCEEDED(hr) && SUCCEEDED(pFTResult->GetStatus()) )
        {
            isFaceTracked = true;
        }
        else
        {
            // No faces found
            isFaceTracked = false;
        }
    }
    else
    {
        // Continue tracking. It uses a previously known face position.
        // This call is less expensive than StartTracking()
        hr = pFT->ContinueTracking(&sensorData, NULL, pFTResult);
        if( FAILED(hr) || FAILED(pFTResult->GetStatus()) )
        {
            // Lost the face
            isFaceTracked = false;
        }
    }

    // Do something with pFTResult like visualize the mask, drive your 3D avatar,
    // recognize facial expressions
}

// Clean up
pFTResult->Release();
pColorFrame->Release();
pDepthFrame->Release();
pFT->Release();

The code calls the face tracker through either StartTracking() or ContinueTracking(). StartTracking() is the more expensive call, since it searches the passed RGB frame for a face. ContinueTracking() uses the previous face location to resume tracking. StartTracking() is more stable when you have big breaks between frames, since it is stateless.

There are 2 modes in which the face tracker operates: with skeleton-based information and without. In the 1st mode you pass an array with 2 head points to the StartTracking/ContinueTracking methods. These head points are the ends of the head bone contained in the NUI_SKELETON_DATA structure returned by the Kinect API. This head bone is indexed by the NUI_SKELETON_POSITION_HEAD member of the NUI_SKELETON_POSITION_INDEX enumeration. The 1st head point is the neck position and the 2nd head point is the head position. These points allow the face tracker to find a face faster and more easily, so this mode is cheaper in terms of computer resources (and sometimes more reliable at big head rotations). The 2nd mode only requires a color frame and a depth frame, with an optional region of interest parameter that tells the face tracker where to search the RGB frame for the user’s face. If the region of interest is not passed (passed as NULL), the face tracker will try to find a face in the full RGB frame, which is the slowest mode of operation of the StartTracking() method. ContinueTracking() will use a previously found face and so is much faster.

Camera configuration structure - it is very important to pass correct parameters in it, such as frame width, height and the corresponding camera focal length in pixels. We don’t read these automatically from the Kinect camera, to give more advanced users more flexibility. If you don’t initialize them to the correct values (which can be read from Kinect APIs), the tracking accuracy will suffer or the tracking will fail entirely.
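As a rough illustration of the "multiply this focal length by the scaling factor" rule from the listing above, here is a minimal sketch. The `CameraConfig` struct and `MakeCameraConfig` helper are hypothetical stand-ins, not SDK types, and the sketch assumes the new resolution has the same aspect ratio and field of view as the nominal one.

```cpp
#include <cassert>

// Hypothetical stand-in for a camera description:
// frame width, frame height and focal length, all in pixels.
struct CameraConfig {
    unsigned width;
    unsigned height;
    float focalLengthInPixels;
};

// Scales a focal length that is known for one frame width to another
// frame width (assumes the same aspect ratio and field of view).
inline CameraConfig MakeCameraConfig(unsigned width, unsigned height,
                                     float nominalFocalLength,
                                     unsigned nominalWidth)
{
    assert(nominalWidth > 0);
    const float scale = static_cast<float>(width) / static_cast<float>(nominalWidth);
    return CameraConfig{ width, height, nominalFocalLength * scale };
}
```

For example, if the nominal color focal length is defined for 640x480, a 1280x960 color stream would use twice that value: `MakeCameraConfig(1280, 960, nominalColorFocalLength, 640)`.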

Frame of reference for 3D results -  the face tracking SDK uses both depth and color data, so we had to pick which camera space (video or depth) to use to compute 3D tracking results in. Due to some technical advantages we decided to do it in the color camera space. So the resulting frame of reference for 3D face tracking results is the video camera space. It is a right handed system with Z axis pointing towards a tracked person and Y pointing UP. The measurement units are meters. So it is very similar to Kinect’s skeleton coordinate frame with the exception of the origin and its optical axis orientation (the skeleton frame of reference is in the depth camera space). Online documentation has a sample that describes how to convert from color camera space to depth camera space.
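To make the frame of reference more concrete, here is a small sketch of how a 3D point in the color camera space could be projected onto the color image with a plain pinhole model. The `Point3`, `Pixel` and `ProjectToColorImage` names are made up for this example, and the sign conventions are an assumption (image v grows downwards, principal point at the image center), so treat it as an approximation rather than the SDK's exact camera model.

```cpp
#include <cassert>

struct Point3 { float x, y, z; };   // meters, color camera space
struct Pixel  { float u, v; };      // pixels, origin at the top-left corner

// Projects a 3D point (right-handed system, Y up, Z pointing from the
// camera towards the person) onto the color image with a pinhole model.
inline Pixel ProjectToColorImage(const Point3& p, float imageWidth,
                                 float imageHeight, float focalLengthInPixels)
{
    assert(p.z > 0.0f);  // the point must be in front of the camera
    Pixel px;
    px.u = imageWidth  * 0.5f + (p.x / p.z) * focalLengthInPixels;
    px.v = imageHeight * 0.5f - (p.y / p.z) * focalLengthInPixels; // v grows downwards
    return px;
}
```

A point on the optical axis, such as (0, 0, 1), lands in the image center; this is also why a wrong focal length shifts projected face points away from the face.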

Also, here are several things that will affect tracking accuracy:

1) Light – a face should be well lit without too many harsh shadows on it. Bright backlight or sidelight may make tracking worse.

2) Distance to the Kinect camera – the closer you are to the camera the better it will track. The tracking quality is best when you are closer than 1.5 meters (4.9 feet) to the camera. At closer range Kinect’s depth data is more precise and so the face tracking engine can compute face 3D points more accurately.

3) Occlusions – if you have thick glasses or Lincoln like beard, you may have issues with the face tracking. This is still an open area for improvement :-)  Face color is NOT an issue as can be seen on this video

Here are some technical details for more technologically/math minded people: We used the Active Appearance Model as the foundation for our 2D feature tracker. Then we extended our computation engine to use Kinect’s depth data, so it can track faces/heads in 3D. This made it much more robust compared to 2D feature point trackers. The Active Appearance Model alone is not robust enough to handle all real-world scenarios. Of course, we also used lots of secret sauce to make things work well together :-)  You can read about some of these algorithms here, here and here.

Have fun with the face tracking SDK!


Credits - Many people worked on this project or helped with their expertise:

Christian Hutema (led the project), Lin Liang, Nikolai Smolyanskiy, Sean Anderson, Evgeny Salnikov, Jayman Dalal, Jian Sun, Xin Tong, Zhengyou Zhang, Cha Zhang, Simon Baker, Qin Cai.

98 Responses to “Face Tracking SDK in Kinect For Windows 1.5”

  1. Gold Post said

    Awesome work Nikolai. And you haven’t aged a day. :-)

  2. Alex0700 said

    Very good work, I’m trying to use it in c# to cut the head as the purple rect in c++ sample but there’s no GetFaceRect() magic method in c#.
    Is there anywhere we can find a map to see which positions map to each FeaturePoint (maybe not all, but the most useful ones)

    • nsmoly said

      Thanks! You can see definitions of the 2D points in the photo in the The API returns 87 2D points that you can use to get various face features. The rectangle is easy to get based on those. In addition to that you get 3D head pose and 3D animation and shape units that are parameters of the 3D model. You can get 3D model vertices if you feed tracking data into IFTModel interface. Unfortunately, C# API is a sample only and is pretty basic. You can get full API if you call directly into the interop. See the C# sample for FTInterop.cs it has all the method bindings to the native COM APIs. The 3D vertices returned by IFTModel interface are also semantic and stable (if vertex N is the corner of the left eye it will stay that way for other tracked frames).
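As a sketch of "the rectangle is easy to get based on those": one simple option is the axis-aligned bounding box of the returned 2D points. The `Point2`, `Rect` and `BoundingRect` names below are made up for the example; in a real program the points would come from the tracker result (C++ or the C# interop).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct Point2 { float x, y; };                      // image coordinates in pixels
struct Rect   { float left, top, right, bottom; };

// Returns the smallest axis-aligned rectangle that contains all points.
inline Rect BoundingRect(const Point2* points, std::size_t count)
{
    assert(points != nullptr && count > 0);
    Rect r{ points[0].x, points[0].y, points[0].x, points[0].y };
    for (std::size_t i = 1; i < count; ++i) {
        r.left   = std::min(r.left,   points[i].x);
        r.top    = std::min(r.top,    points[i].y);
        r.right  = std::max(r.right,  points[i].x);
        r.bottom = std::max(r.bottom, points[i].y);
    }
    return r;
}
```

This box hugs the tracked feature points, so you may want to pad it by some margin if you need to cut out the whole head.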

  3. [...] [...]

  4. Ahn said

    Looks so good! Actually, I want to see your full code…
    I was running the avatar mode.
    I want to show the rotation angles, like on your screen.
    How did you do it?
    Please reply to my e-mail address. Thanks!

  5. @MarkDunne said


    I’m trying to use the orientation of the face, i.e., the tracked face’s “look direction”, and the orientation of either arm, i.e., where the arms are pointing so when a user directs their eye/face gaze and points at a real world object in front of a projection screen, I can work out the real world 3D point where these two vectors/rays meet/cross and have the avatar look out at that point from its virtual world. I’m using the XNA Avateering sample as the starting block for this experiment.

    Question 1: How do I get the look direction of the tracked face, is it simple derived from the rotation data for the tracked face?
    Question 2: Does the face-tracking just rely on the rgb camera or is it also using the depth camera to figure out where the position of the head is?

    Sometimes the triangle mesh appears totally in the wrong place on the screen, no where near my head. Also, lighting conditions play a huge role in successful face-tracking. It can be difficult to work with at night. Thank god for the Kinect Studio, one good recording and you never have to dance around the room again.

    • nsmoly said

      1) To get the vector that points to where the face looks, you can either derive it from the rotation angles or call the IFTModel interface and pass the computed IFTResult object returned by the face tracker. IFTModel returns a list of 3D points on the face (feature points) that you can use to compute this vector. The Euler rotation angles are part of the IFTResult (call GetHeadPose()). The angles are in degrees and you should take into account that they are computed in the right-handed coordinate system of the color camera. The 0,0,0 angles correspond to the face looking straight at the camera, aligned with its optical axis.
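Here is one possible way to turn the pitch and yaw angles into a look direction. It is a sketch under assumptions that are not confirmed by the SDK documentation: yaw rotates about Y, pitch about X, roll is applied first (so it spins about the view axis and drops out), and a face at angles (0, 0, 0) looks along -Z, i.e. back at the camera. `Vec3` and `LookDirection` are names made up for the example; check the signs against your own data.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Assumed convention: direction = Ry(yaw) * Rx(pitch) * (0, 0, -1),
// with angles given in degrees in the color camera frame.
inline Vec3 LookDirection(float pitchDegrees, float yawDegrees)
{
    assert(std::isfinite(pitchDegrees) && std::isfinite(yawDegrees));
    const float kDegToRad = 3.14159265358979f / 180.0f;
    const float p = pitchDegrees * kDegToRad;
    const float y = yawDegrees * kDegToRad;
    Vec3 d;
    d.x = -std::sin(y) * std::cos(p);
    d.y =  std::sin(p);
    d.z = -std::cos(y) * std::cos(p);
    return d;   // unit length by construction
}
```

To intersect this ray with another one (for example an arm direction), start it at the 3D head position returned by the tracker.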

      2) We use both the RGB and the depth camera. We rely on both of them equally up to 1.5 meters away from the camera; beyond that distance the tracker starts “trusting” RGB data more and relies less on depth data, since depth becomes noisy. Light plays a role in the tracking: harsh shadows or occlusions may affect it. At closer range (<1.5 meters) light affects it less, since there is enough data from the camera for OK face tracking.

      The face triangles should be on your face. If they are really off then we might want to investigate.. You can capture a frame and contact me to look at it.


      • Michael said

        Is the depth data necessary? The official documentation says the depth is optional for face tracking.
        However, if NULL is passed to the second argument (pDepthCameraConfig) of IFTFaceTracker::Initialize(), IFTFaceTracker cannot be initialized properly… Is it really optional, since it is said the depth information is required for the tracking or detection?
        Also, if the depth data pointer is not passed to the FT_SENSOR_DATA, it always fails to locate the face.
        Thanks in advance,

      • nsmoly said

        Depth data is optional, but you must have Kinect connected to your PC for SDK to work. If you don’t use depth then make sure that you call IFTFaceTracker::Initialize() correctly (with no depth camera).

      • nsmoly said

        Also, if you don’t use depth, the accuracy drops quite a bit

      • Michael said

        Thanks for your reply! With the Kinect connected, I tried to replace Kinect video buffer with my video captured from webcam. When the depth camera points to the scene far away, like 1.5 meter away as you said, the tracking effect is acceptable.
        If I don’t use the depth information, I have a problem calling IFTFaceTracker::Initialize() correctly. Using ” FT_CAMERA_CONFIG myCameraConfig = {640, 480, 1.0}; HRESULT hr = pFT->Initialize(&myCameraConfig, NULL, NULL, NULL); ” from the sample code, it kept returning E_POINTER. And this made CreateFTResult fail as well.

      • nsmoly said

        Sorry, I misled you in my previous comments — the Kinect for Windows Face Tracking SDK requires you to pass a depth frame and initialize it to be used with the depth camera. You can use your HD color camera to augment face tracking with higher quality video feed, but then you must implement your own Depth to Color UV conversion function (and pass it to the face tracker initialization).

        Thanks for using it.

  6. Naeem Aleahmadi said

    Hi, I wondered how to assign the x,y of one of the face features to the mouse cursor, or at least how to write them to a variable.

    • nsmoly said

      You can read 2D facial points, the 3D head position and 3D animation coefficients from the IFTResult interface (in C++). There is a corresponding thing in the C# interop API. Then you can call IFTModel and pass that data to create a 3D mask with 121 3D vertices located on your face. 2D and 3D points stay on your face even when it moves, meaning that if the left eye corner corresponds to 3D vertex N, then N will always track that point.

  7. buchtak said

    Very nice work.The tracker seems to generalize to arbitrary faces very well. I was kinda surprised about this as poor generalization is a known problem of AAM. Are you planning to publish the details of the algorithm at ICCV, CVPR or somewhere else? It would be nice if you’d uncover at least some of the secret sauce ;-).

    Also, is/will it be possible to create and train a custom AAM?

    • nsmoly said

      this is not just AAM that you see there. AAM alone drifts quickly. It needs more constraints to “nail” the mask to the face. We will probably publish details soon.

  8. Jayneil said

    Hi! Great work! I am currently doing research in Computer Vision and trying to develop a head pose estimation system in real time. Now, I had a question. Assuming that the user looks straight at the camera (this being the origin 0,0,0 point), he has the option to turn his face 90 degrees to the right or the left. Up to what angle can the SDK track the head and give its direction in terms of roll, pitch and yaw? For example, if the user moves his head left 90 degrees from the origin point, will the SDK be able to track the face movement all the way, or is there some limitation? Also, regarding getting the head movement in terms of real-time coordinates (like roll, pitch and yaw), what manipulations/algorithms does one have to apply, or is there a pre-built function in the SDK for that?

    • nsmoly said

      The FT SDK returns pitch, yaw and roll angles (in degrees) of the tracked head for each frame of data. See IFTResult interface on how to read them.

      The angle limits that the FT SDK can track are:
      Yaw: +-45 degrees
      Roll: +-90 degrees
      Pitch: +-25 degrees
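If your application needs to know when the head pose is near the edge of what the tracker handles, a simple range check against the limits listed above could look like this. `HeadPose` and `IsWithinTrackingLimits` are names made up for the example; the angles are assumed to be the per-frame pitch, yaw and roll in degrees.

```cpp
#include <cmath>

struct HeadPose { float pitch, yaw, roll; };  // degrees

// Checks a head pose against the limits quoted above:
// yaw +-45, roll +-90, pitch +-25 degrees.
inline bool IsWithinTrackingLimits(const HeadPose& pose)
{
    return std::fabs(pose.yaw)   <= 45.0f
        && std::fabs(pose.roll)  <= 90.0f
        && std::fabs(pose.pitch) <= 25.0f;
}
```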

  9. Pixel said

    Wow, Very Impressive ..
    so my next task is run this demo project in my laptop ….

    tin tin ta tin

  10. Predrag Popovic said


    I was looking at your sample FaceTracking3D-WPF and, as far as I understand, by using the GetTriangles method I can get vertices and use them to create a 3D model. What I would like to do now is map the result to some 3D face model (an avatar like in the documentation, designed in Maya or 3ds Max). Any suggestions on how to do this, and what should my CG artist supply to me?

    • nsmoly said

      You can either map FT SDK animation units to some predetermined bone movements (based on some mapping logic) or you can use an algorithm similar to ICP: FT SDK model vertices can be mapped to your rendered model vertices, and to deduce how to move your model (to find deformation parameters) you can solve a least squares problem (find your model parameters that minimize the squared distance between the 2 models).
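To illustrate the least squares idea in its simplest form, here is a sketch that fits a single deformation weight. It assumes your model deforms linearly (vertex = base + w * delta) and that vertex i of your model has already been matched to target point i from the tracked mask; `Vec3` and `FitBlendWeight` are names made up for the example. A real rig would solve for many parameters at once, but the principle is the same.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Finds the weight w that minimizes
//   sum_i || base[i] + w * delta[i] - target[i] ||^2
// Closed form: w = sum(delta . (target - base)) / sum(delta . delta).
inline float FitBlendWeight(const std::vector<Vec3>& base,
                            const std::vector<Vec3>& delta,
                            const std::vector<Vec3>& target)
{
    assert(base.size() == delta.size() && base.size() == target.size());
    float num = 0.0f;
    float den = 0.0f;
    for (std::size_t i = 0; i < base.size(); ++i) {
        const float rx = target[i].x - base[i].x;
        const float ry = target[i].y - base[i].y;
        const float rz = target[i].z - base[i].z;
        num += delta[i].x * rx + delta[i].y * ry + delta[i].z * rz;
        den += delta[i].x * delta[i].x + delta[i].y * delta[i].y + delta[i].z * delta[i].z;
    }
    return den > 0.0f ? num / den : 0.0f;  // no deformation if delta is all zeros
}
```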

  11. Jun Li said

    Hi, very good job!
    I found if my face get close to kinect, then it does not work. It seems that I have to keep my face at least 80cm~1m away from kinect . Can I change this distance? I mean is it possible that it works when my face 50~60cm close to kinect? Thanks a lot.

    • nsmoly said

      You need to switch Kinect to “near mode” in which depth data is available at short range ~40 cm. We use depth data and so when you are too close there is a black hole on a face and the tracker stops working. Near mode pushes the range closer to the camera. It can be switched on when you initialize Kinect camera via its api.

      • Jun Li said

        I am using the kinect for XBOX 360, is it possible to switch to the near mode? Thanks

      • nsmoly said

        I don’t think it is possible with Kinect for Xbox. You need Kinect for Windows camera for the near mode.

  12. p12 said

    I’ve sorted my first issue, however I am getting a lot of ‘undeclared identifier’ errors :

    1> main.cpp
    1>main.cpp(69): error C2065: ‘colorCameraFrameBuffer’ : undeclared identifier
    1>main.cpp(70): error C2065: ‘depthCameraFrameBuffer’ : undeclared identifier
    1>main.cpp(75): error C2065: ‘colorFrame’ : undeclared identifier
    1>main.cpp(76): error C2065: ‘depthFrame’ : undeclared identifier
    1>main.cpp(78): error C2661: ‘tagPOINT::tagPOINT’ : no overloaded function takes 2 arguments
    1>main.cpp(88): error C3861: ‘ProcessKinectIO’: identifier not found
    1>main.cpp(95): error C2065: ‘pFTResult’ : undeclared identifier
    1>main.cpp(96): error C2065: ‘pFTResult’ : undeclared identifier
    1>main.cpp(96): error C2227: left of ‘->Status’ must point to class/struct/union/generic type
    1> type is ”unknown-type”
    1>main.cpp(110): error C2065: ‘pFTResult’ : undeclared identifier
    1>main.cpp(111): error C2065: ‘pFTResult’ : undeclared identifier
    1>main.cpp(111): error C2227: left of ‘->Status’ must point to class/struct/union/generic type
    1> type is ”unknown-type”
    1>main.cpp(124): error C2065: ‘pFTResult’ : undeclared identifier
    1>main.cpp(124): error C2227: left of ‘->Release’ must point to class/struct/union/generic type
    1> type is ”unknown-type”
    1>Build FAILED.
    1>Time Elapsed 00:00:03.34
    ========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

    Is there something I have missed in your tutorial ?

    • nsmoly said

      Hello, it seems like you’ve got lots of undeclared variables (or missing #include FaceTrackLib.h). Please see C++ code sample provided in the Face Tracking SDK for a compilable and working code sample. The code from my page is only an example it will not compile as is since it is not a finished project.


    • Todd said

      For the camera data, I think that is just pseudocode. You had better follow the Kinect SDK 1.5 Face Tracking sample code for a really workable configuration.

      • nsmoly said

        Yes, the camera code is just pseudocode and the code in general on this page is NOT intended for compilation! It is just a “text” sample. Please use C++ or C# samples provided with the Face tracking SDK – they have fully operational code that you can build and modify.

  13. Vankatesh said

    I have a few questions:

    1. Is the face tracking API supported completely in C++ as well as C# or just C++ at the moment? Which language do you recommend to use?

    2. I copy pasted your code and tried to execute it in a Visual Studio 2010 Win32 console application template. These are the errors that I got:

    IntelliSense: identifier “colorCameraFrameBuffer” is undefined
    IntelliSense: identifier “FORMAT_UINT8_R8G8B8″ is undefined
    IntelliSense: identifier “depthCameraFrameBuffer” is undefined
    IntelliSense: identifier “colorFrame” is undefined
    IntelliSense: identifier “depthFrame” is undefined
    IntelliSense: no suitable constructor exists to convert from “int” to “tagPOINT”
    IntelliSense: expected a ‘)’
    IntelliSense: identifier “ProcessKinectIO” is undefined
    IntelliSense: class “IFTResult” has no member “Status”
    IntelliSense: class “IFTResult” has no member “Status”

    What did I do wrong?

    3. In the sample visualization code provided in C++ in the developer toolkit ,it uses WPF and hence it is very difficult to display the yaw,pitch, roll angles on screen. Also, there are very few resources online that mention how to enable a console in WPF in VIsual Studio using C++. Can you please help out here?

    • nsmoly said

      The provided sample on my page is just a “text sample” it is not intended for compilation (has pseudocode in it). You can build code samples included into the face tracking sdk. They should build and there are samples for C++ and C#. The API itself is native COM object and easy to consume in C++. For C#, look at the C# code sample – it has an assembly with the interop that makes it simple to call FT COM API.

  14. Jayneil said

    I just tested the Face API and displayed the head orientation angles. I had one more question regarding that. All the functions in the Face Tracking API give output/results in reference to the Kinect coordinate frame. So, the results you get would make sense as long as the coordinate frame of the user’s head (the coordinate system where the user’s head is at the origin) is completely parallel to the Kinect coordinate system. But this will not always be the case, as the user will keep on moving. So, is there a hidden transformation of coordinate system that takes place inside the Kinect SDK (meaning the data calculated using the Kinect coordinate frame is converted to the user coordinate frame)? Please do advise regarding this.

    • nsmoly said

      The Face Tracking SDK coordinate system is just like Kinect’s skeleton coordinate frame: right-handed, Z pointing towards the user, Y up, but the origin is shifted to the video camera’s optical center. Kinect’s skeletal tracking coordinate system has its origin at the depth (IR) camera’s optical center. This is because the face tracking API uses RGB + depth data heavily and does its computations in the visual camera frame of reference. Kinect’s skeletal tracking does it purely in IR camera space. You can convert from the FT API coordinate frame to Kinect’s skeletal coordinate frame by following this sample (not sure if there is one on MSDN):
      This function just approximates the real frame-to-frame transfer. Future versions of the Kinect API may contain an API for this. Use this code as is, no warranties or anything :-)

      This function demonstrates a simplified (and approximate) way of converting from the color camera space to the depth camera space.
      It takes a 3D point in the color camera space and returns its coordinates in the depth camera space.
      The algorithm is as follows:
      1) take a point in the depth camera space that is near the resulting converted 3D point. As a “good enough approximation”
      we take the coordinates of the original color camera space point.
      2) Project the depth camera space point to (u,v) depth image space
      3) Convert depth image (u,v) coordinates to (u’,v’) color image coordinates with Kinect API
      4) Un-project the converted (u’,v’) color image point to the 3D color camera space (uses the known Z from the depth space)
      5) Find the translation vector between the two spaces as translation = colorCameraSpacePoint - depthCameraSpacePoint
      6) Translate the original passed color camera space 3D point by the inverse of the computed translation vector.

      This algorithm is only a rough approximation and assumes that the transformation between camera spaces is roughly the same in
      a small neighbourhood of a given point.
      HRESULT ConvertFromColorCameraSpaceToDepthCameraSpace(const XMFLOAT3* pPointInColorCameraSpace, XMFLOAT3* pPointInDepthCameraSpace)
      {
          if (!pPointInColorCameraSpace || !pPointInDepthCameraSpace)
          {
              return E_POINTER;
          }

          // Camera settings - these should be changed according to camera mode
          float depthImageWidth = 320.0f;
          float depthImageHeight = 240.0f;
          float depthCameraFocalLengthInPixels = NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS;
          float colorImageWidth = 640.0f;
          float colorImageHeight = 480.0f;
          float colorCameraFocalLengthInPixels = NUI_CAMERA_COLOR_NOMINAL_FOCAL_LENGTH_IN_PIXELS;

          // Take a point in the depth camera space near the expected resulting point. Here we use the passed color camera space 3D point
          // We want to convert it from depth camera space back to color camera space to find the shift vector between spaces. Then
          // we will apply reverse of this vector to go back from the color camera space to the depth camera space
          XMFLOAT3 depthCameraSpace3DPoint = *pPointInColorCameraSpace;

          // Project the depth camera 3D point to the depth image
          XMFLOAT2 depthImage2DPoint;
          depthImage2DPoint.x = depthImageWidth * 0.5f + ( depthCameraSpace3DPoint.x / depthCameraSpace3DPoint.z ) * depthCameraFocalLengthInPixels;
          depthImage2DPoint.y = depthImageHeight * 0.5f - ( depthCameraSpace3DPoint.y / depthCameraSpace3DPoint.z ) * depthCameraFocalLengthInPixels;

          // Transform from the depth image space to the color image space
          POINT colorImage2DPoint;
          NUI_IMAGE_VIEW_AREA viewArea = { NUI_IMAGE_DIGITAL_ZOOM_1X, 0, 0 }; // no digital zoom, no view offset
          HRESULT hr = NuiImageGetColorPixelCoordinatesFromDepthPixel(
              NUI_IMAGE_RESOLUTION_640x480, &viewArea,
              LONG(depthImage2DPoint.x + 0.5f), LONG(depthImage2DPoint.y + 0.5f), USHORT(depthCameraSpace3DPoint.z*1000.0f) << NUI_IMAGE_PLAYER_INDEX_SHIFT,
              &colorImage2DPoint.x, &colorImage2DPoint.y );
          if (FAILED(hr))
          {
              return hr;
          }

          // Unproject in the color camera space
          XMFLOAT3 colorCameraSpace3DPoint;
          colorCameraSpace3DPoint.z = depthCameraSpace3DPoint.z;
          colorCameraSpace3DPoint.x = (( float(colorImage2DPoint.x) - colorImageWidth*0.5f ) / colorCameraFocalLengthInPixels) * colorCameraSpace3DPoint.z;
          colorCameraSpace3DPoint.y = ((-float(colorImage2DPoint.y) + colorImageHeight*0.5f ) / colorCameraFocalLengthInPixels) * colorCameraSpace3DPoint.z;

          // Compute the translation from the depth to color camera spaces
          XMVECTOR vTranslationFromColorToDepthCameraSpace = XMLoadFloat3(&colorCameraSpace3DPoint) - XMLoadFloat3(&depthCameraSpace3DPoint);

          // Transform the original color camera 3D point to the depth camera space by using the inverse of the computed shift vector
          XMVECTOR v3DPointInKinectSkeletonSpace = XMLoadFloat3(pPointInColorCameraSpace) - vTranslationFromColorToDepthCameraSpace;
          XMStoreFloat3(pPointInDepthCameraSpace, v3DPointInKinectSkeletonSpace);

          return S_OK;
      }

  15. Todd said

    Hello, Nikolai
    I want to use the Microsoft Face Tracking SDK to track some 3D model sequences which is not from kinect camera. Basically I can give the rendered 3D textured model image (640×480, R8G8B8) and corresponding depth image(640×480, UINT16) to the IFTFaceTrack. Although I did not give the head and neck position to the StartTracking function, I hope it is still able to track the model.

    Currently, I found that in order to activate the face tracker engine, I need to call NuiInitialize() although I actually did not use data from Kinect. The problem is that the startTracking method always return E_POINTER. I know that IFTFaceTrack class can accept color image in R8G8B8 format and depth image in D16 format, which I currently choose. However, in the FaceTracker Demo code in the SDK, the color image format is BGRX and depth is D13P3. Does the data format have to be this?

    Another Question, since the camera parameter is significant for the tracking result, how can I get the focal length in pixels for a virtual camera in DirectX?

    Hope the Face Tracking SDK can work with other source of data~~~

    BTW, thanks so much for the great work on this SDK!

    • nsmoly said

      Hi Todd, yes you must have attached Kinect and call NuiInitialize() to use the face tracking API (commercial reasons). But with those you can still feed the API from other sources like your rendered 3D faces as long as the format of RGB and depth images is correct. The API can accept several formats for RGB frames:
      FTIMAGEFORMAT_UINT8_GR8, FTIMAGEFORMAT_UINT8_R8G8B8, FTIMAGEFORMAT_UINT8_A8R8G8B8, FTIMAGEFORMAT_UINT8_X8R8G8B8, FTIMAGEFORMAT_UINT8_B8G8R8X8, FTIMAGEFORMAT_UINT8_B8G8R8A8 and for depth frames it must be: FTIMAGEFORMAT_UINT16_D13P3 (last 3 bits are reserved for Kinect’s player ID!). If you pass something wrong it should return FT_ERROR_INVALID_INPUT_IMAGE.
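If you render your own depth images, the D13P3 layout described above (13 bits of depth, with the last 3 bits reserved for the player ID) can be produced with a small helper like the one below. This is a sketch with made-up helper names; it assumes depth is stored in millimeters, as in the Kinect depth stream.

```cpp
#include <cassert>
#include <cstdint>

const int kPlayerIndexBits = 3;  // low 3 bits are reserved for the player index

// Packs a depth value in millimeters and a player index into one 16-bit pixel.
inline std::uint16_t PackDepthD13P3(unsigned depthMillimeters, unsigned playerIndex = 0)
{
    assert(depthMillimeters < (1u << 13));            // depth must fit into 13 bits
    assert(playerIndex < (1u << kPlayerIndexBits));   // index must fit into 3 bits
    return static_cast<std::uint16_t>((depthMillimeters << kPlayerIndexBits) | playerIndex);
}

// Extracts the depth in millimeters from a packed pixel.
inline unsigned DepthMillimeters(std::uint16_t packed)
{
    return packed >> kPlayerIndexBits;
}

// Extracts the player index from a packed pixel.
inline unsigned PlayerIndex(std::uint16_t packed)
{
    return packed & ((1u << kPlayerIndexBits) - 1u);
}
```

A plain 16-bit depth buffer that stores millimeters without the shift would be read as values 8 times too small, which is one way a synthetic depth image can end up in the wrong format.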

      The focal length in case of DirectX rendering is tricky to compute in pixels: your view/projection matrix should have a focal length as part of it (mixed with other values). The projection matrix is still a perspective one and so works the same way as a real camera projection. How to turn that into a pixel value, I am not sure at the moment. For a real camera you can get its focal length in millimeters and then estimate it in pixels based on the FOV or sensor size/resolution. In the rendered world you need to know its physical dimensions and somehow estimate the focal length. You can try various values and see if the tracker succeeds. The average value for a real webcam, or for tracking videos, that worked OK for us in the past is ~513-560 pixels.
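One common way to estimate a focal length in pixels is the standard pinhole relation f = (size / 2) / tan(FOV / 2). The sketch below applies it and also shows the equivalent computation from a perspective projection matrix, assuming the usual convention where element [1][1] equals cot(fovY / 2) and pixels are square. Both function names are made up for the example, and this is an assumption about your rendering setup, not something the face tracking SDK provides.

```cpp
#include <cassert>
#include <cmath>

// Focal length in pixels from the image size (in pixels) along one axis
// and the field of view (in degrees) along the same axis.
inline float FocalLengthFromFov(float imageSizeInPixels, float fovDegrees)
{
    assert(fovDegrees > 0.0f && fovDegrees < 180.0f);
    const float halfFovRadians = fovDegrees * 0.5f * 3.14159265358979f / 180.0f;
    return (imageSizeInPixels * 0.5f) / std::tan(halfFovRadians);
}

// Same quantity from a perspective projection matrix, assuming its
// [1][1] element is cot(fovY / 2) (typical for DirectX-style matrices).
inline float FocalLengthFromProjection(float projection11, float viewportHeightInPixels)
{
    return projection11 * viewportHeightInPixels * 0.5f;
}
```

For instance, a 480 pixel high viewport rendered with a 90 degree vertical FOV gives about 240 pixels, so a narrower FOV is needed to land in the 513-560 pixel range mentioned above.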

      • Todd said

        Yeah, I notice that my depth format wasn’t correct. So I will change it to see if I can get the startTracking function return true once. For the focal length, I found that if you define it wrong, the tracking face result may be off the face in the color frame. I hope the wrong focal length would not lead to tracking failure~~~~

  16. Is the face data accurate enough to be used to distinguish different people? In other words, 3D facial recognition? Looking at I see that different shape units are available, such as “Eye separation distance”. I would use these measurements to build a profile for each person during a registration process. Then later, I could compare these profiles against an unknown in order to determine who the person is. This would seem possible now – or am I missing something?

    • nsmoly said

      Yes, you can use shape units returned by the API as a help to recognize people. However, the underlying 3D model is still pretty basic (a low-definition Candide model) and so the computed shape coefficients are not sufficient for strong classification. You can combine them with other image-based classifiers to get to a good level of recognition (use them as a weak classifier in combination with other classification methods). I would not recommend building your recognition system solely on the current shape units: you will have too many false positives (they will get much better some time in the future though :-))

  17. Jayneil said

    ‘Get2DShapePoints’ function in the Kinect Face Tracking SDK gives the 87 feature points in 2D. How to get the Z coordinate for these points? Can I use ‘NuiTransformSkeletonToDepthImage’ function for this? Also the Face Tracking API follows the Kinect coordinate frame(With Kinect camera at 0,0,0). Does the skeletal tracking follow the same coordinate frame or a different one?

  18. [...] Face Tracking SDK in Kinect For Windows 1.5 [...]

  19. Really nice work there, Nikolai.
    But i am running into trouble trying to track the face without a skeleton in c#.
    My sample is very similar to the c# example included in the sdk.
    I modified it, so i don’t need to rely on skeletons. My problem is, that the frame.trackSuccessful property is always false. The call is this.faceTracker.Track(colorImageFormat, colorImage, depthImageFormat, depthImage); The reference mentions that there are 3 possible calls: only colorimage+depthimage, colorimage+depthimage+region of interest or colorimage+depthimage+skeleton. The calls without skeleton never work.
    What also bugs me is that cpu usage is just at about 10% whereas the c++ sample produces 30% on my machine.

    thanks in advance!


    • nsmoly said

      The C# sample was intended to be used with the skeleton and so it may have some shortcomings when you run it without the skeleton. The underlying C++ API can track with and without the skeleton info available. Most likely you have some issue in C# code somewhere. 30% CPU utilization on C++ sample looks right. 10% is too little I think

  20. dany said

    can it track multiple face at once?

    • nsmoly said

      Yes, it can. You will have to create multiple instances of IFTFaceTracker and then give each of them a region of interest (that of the corresponding face, which you can detect with the face detection API).

  21. Todd said

    Hello, Nikolai

    This is the student who is trying to use the Kinect Face Tracking SDK to track 3D model sequence which is not from Kinect Camera. Currently, the program still failed at FT_ERROR_HEAD_SEARCH_FAILED stage. I feel it is close to the success but just don’t know what is the problem so I come here and ask for your kind help.

    Basically, I described my problem in the Microsoft forum for Kinect SDK 1.5. The link is as below:

    Besides that, I want to know if the SDK has specific requirement to track a face. For example:

    1. Is the initial 3D head position or the skeleton information necessary for tracker to start, since in my case there is no body.

    2. Does the SDK have a specific requirement about the pixel range of the face model in the depth image? For example, the depth image I get from the SingleFace demo can track a face which has depth in the range [129, 186], where the nose tip is the lowest value and the contour has the highest value. This means the head has a depth range of about (186-129=)57. Is it necessary to keep that for 3D data from other sources too?

    3. The only part which may be different from the Kinect camera is the mapping function between the texture and the depth. Currently I define them like this:

    The FTRegDepthToColor is self-defined function to map my own depth info to color info, which is:

    FTAPI FTRegDepthToColor(UINT depthFrameWidth, UINT depthFrameHeight, UINT colorFrameWidth, UINT colorFrameHeight,
        FLOAT zoomFactor, POINT viewOffset, LONG depthX, LONG depthY, USHORT depthZ, LONG* pColorX, LONG* pColorY)
    {
        // Identity mapping: depth and color pixels are assumed to be already aligned
        HRESULT hr = S_OK;
        *pColorX = depthX;
        *pColorY = depthY;
        return hr;
    }

    Is there a problem with this?

    Please look at my post questions at the Microsoft forum for more information about what I have done. Hope I am not so far from the key~~~

    Thanks again in advance!

    • nsmoly said

      Hi Todd, what you are trying to do is possible. You have to provide your own mapping function from depth to color pixels, and you have to set the focal lengths of the video and depth cameras to the right values (this is important). The error that you are getting, FT_ERROR_HEAD_SEARCH_FAILED, typically happens when the API cannot find a face in a color frame or when the depth area that corresponds to the face area in the color frame is invalid (for instance, all values are too far from reality). Also, you need to provide the depth frame in Kinect’s format, which has the 3 lowest bits reserved for a user ID. You can set this user ID to 1 for all pixels.

      Replies to other questions:
      1) Kinect’s skeleton is not needed for FT API to work. If you don’t provide a head orientation vector or region of interest then the API will do face detection over the full color frame and will track a found face.
      2) The FT API is able to track faces when the face area width/height is bigger than ~25 pixels (actual number varies). In depth it is less important, but it may fail when you’ve got too few pixels.
      3) Your mapping function is correct. Also, you need to set the focal length of the depth camera to some realistic value (in pixels). You can estimate it if you know the field of view of your camera. The FT API uses a simplistic pinhole camera model and assumes the same focal length in X and Y.
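To make the depth format remark concrete, here is a minimal sketch of packing plain depth values into the layout described above: depth in the upper 13 bits, user ID in the 3 lowest bits (the format kang refers to further down as FTIMAGEFORMAT_UINT16_D13P3). The helper names are hypothetical, and the input is assumed to be depth in millimetres that fits in 13 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of low bits reserved for the user (player) index.
const int kUserIndexBits = 3;

// Pack one depth value (millimetres, must fit in 13 bits) with a user ID.
uint16_t packDepth(uint16_t depthMm, uint16_t userId) {
    assert(userId < (1u << kUserIndexBits));
    assert(depthMm < (1u << (16 - kUserIndexBits)));
    return static_cast<uint16_t>((depthMm << kUserIndexBits) | userId);
}

// Pack a whole frame, tagging every pixel with the same user ID
// (1 by default, as suggested in the reply above).
std::vector<uint16_t> packDepthFrame(const std::vector<uint16_t>& depthMm,
                                     uint16_t userId = 1) {
    std::vector<uint16_t> packed;
    packed.reserve(depthMm.size());
    for (uint16_t d : depthMm) {
        packed.push_back(packDepth(d, userId));
    }
    return packed;
}
```

The packed buffer is what would then be handed to the tracker as the depth image. Getting the depth back is a right shift by 3.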


      • Cédric said


        I am also trying to use the Face Tracker of the Microsoft Kinect SDK with some inputs different from the Kinect camera (by the way, thanks for this great code). I have nearly succeeded in using it. However, I still have a rotation misalignment between my 3D inputs and the result of the face tracker (about 6 degrees on the pitch angle).

        I expect that there is an issue with the focal length of the depth camera. I generate my own depth map from a 3D model, so I can tune the focal length parameter used for the generation. However, when I modify the focal length used at the initialization of face tracker, it has no effect on the results. I have tried a lot of different values (500, 571, 531, 200, 800, 1, 10000), but the result of the face tracker is always the same… Is it a bug ? Can you at least provide me the value used by default for the focal length of the depth camera ?


      • nsmoly said

        The default depth camera focal length in Kinect is defined by – NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS define and is set to 285.63 for 320×240 resolution. So you probably should try that value. Not sure how 3D data is generated in your case, so a different focal length could be better for you… The Kinect values are in NuiAPI.h and in NuiImageCamera.h in Kinect SDK.
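Since the nominal value is tied to a resolution, it has to be rescaled when the frame size differs. A minimal sketch, assuming the focal length scales linearly with image width and the aspect ratio stays the same (the function name is hypothetical):

```cpp
#include <cassert>

// Rescale a focal length given in pixels for one image width to another width.
// ASSUMPTION: same sensor and field of view, only the resolution changes.
double scaleFocalLength(double nominalFocalPx, double nominalWidthPx, double actualWidthPx) {
    assert(nominalWidthPx > 0.0 && actualWidthPx > 0.0);
    return nominalFocalPx * (actualWidthPx / nominalWidthPx);
}
```

With the numbers quoted above, 285.63 at 320×240 becomes 571.26 at 640×480, which is close to the 571 that Cédric tried.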


      • Todd said

        Dear Nicolai

        Thanks so much for your reply. your words:”you can set this user ID for all pixels to be 1″ actually is very important. the SDK is really a nice one for 3D data research.

  22. Hi,

    I’m new to Kinect programming. I’m trying to use Get2DShapePoints() to get the absolute points in 2D. The sample code only uses GetProjected3DShape, which only gives mapped points. I want the raw points. Also, the 87 points mentioned on can’t be mapped to the 122 points in GetProjected3DShape. Please help!

  23. yu said

    Thank you very much for your work and the demo, I learned a lot. I have a question: can the face tracking engine use a normal video as the input data? That means asynchronous tracking: we first record the face images with the camera as a video file (like .avi), and then use this video file as the input data to track the face feature points. Can the face tracking SDK do that?
    Thank you very much and I am eager for your reply.

    • nsmoly said

      You can do this as long as you have a Kinect camera attached to your computer. Then you need to feed your video frames to the face tracking API and provide the correct camera configuration, i.e. resolution and focal length. It is hard to estimate a focal length for a video if you don’t know the characteristics of the camera that it was made with. You can estimate it from the field of view though. The approximate focal length for most consumer webcams is about 500-600 pixels. The focal length affects the quality of tracking greatly.

  24. raicon said

    it is possible to save a face bmp. picture ? …. how ?

    I think I need to save in a buffer only the face pixels…..the pixels from the pink rectangle. But i don`t know how. I need your help please !

    I want to modify the sample application to save a bitmap file on my computer !

    • nsmoly said

      you can cut that portion of the rgb frame and use Windows APIs to save it as a BMP :-)

      • raicon said

        how to cut this portion ? i tried to use the coordinates used by sdk to draw the pink rectangle to save only the face into a bitmap…. but in my saved picture i don`t have the face. Can someone show me how to cut this portion ?

  25. kang said

    Hi. I’m trying to make a simple face tracking program by following your code. Initializing is OK, both color image and depth image come in nicely, but I have no idea why StartTracking returns -1884553214(=0x8fac0002). In case of FaceTrackingVisualization sample, StartTracking returns S_OK and the program works fine.

    • kang said

      I think I’ve solved the problem.

      I found out that the image format of the depth buffer should be FTIMAGEFORMAT_UINT16_D13P3, not FTIMAGEFORMAT_UINT16_D16.

  26. Todd said

    Hello, Nikolai

    I tried the tracking on the 3D model and find that the feature points of eyes do not move for eye close motion. Is that correct or I did something wrong which affected the eye tracking? Thank you!

    • nsmoly said

      Do you mean – does it track the eye gaze? It does not track the eye gaze. It only computes the position of the eyes on the face.

      • Todd said

        sorry Nikolai, I mean eye blink. when I close, looks like both 2D result and 3D model do not follow the motion. Just want to make sure. Thank you!!!!

  27. raicun said

    How can I use this application to save only the face into a bitmap?

    • nsmoly said

      You can read the face rectangle from IFTResult and then use it to cut this RECT out of input color frame that you passed to the engine to get this IFTResult. The rectangle is in color frame image coordinates.
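A minimal sketch of the cropping step described above. The `Rect` and `Image` types are hypothetical stand-ins (the real code would use the Win32 RECT from IFTResult and the color buffer passed to the tracker). It assumes tightly packed rows with no padding, and a rectangle whose right and bottom edges are exclusive, as with RECT.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the Win32 RECT: right/bottom are exclusive.
struct Rect { int left, top, right, bottom; };

// Stand-in for a color frame. ASSUMPTION: rows are tightly packed.
struct Image {
    int width, height, bytesPerPixel;
    std::vector<uint8_t> pixels;
};

// Copy the part of `src` covered by `r` into a new image.
// The rectangle is clamped to the frame so it never reads out of bounds.
Image cropImage(const Image& src, Rect r) {
    assert(src.pixels.size() ==
           static_cast<std::size_t>(src.width) * src.height * src.bytesPerPixel);
    r.left = std::max(r.left, 0);
    r.top = std::max(r.top, 0);
    r.right = std::min(r.right, src.width);
    r.bottom = std::min(r.bottom, src.height);

    Image out;
    out.bytesPerPixel = src.bytesPerPixel;
    out.width = std::max(r.right - r.left, 0);
    out.height = std::max(r.bottom - r.top, 0);
    if (out.width == 0 || out.height == 0) {
        out.width = out.height = 0;   // rectangle lies outside the frame
        return out;
    }

    const std::size_t rowBytes = static_cast<std::size_t>(out.width) * src.bytesPerPixel;
    for (int y = r.top; y < r.bottom; ++y) {
        const uint8_t* row = src.pixels.data() +
            (static_cast<std::size_t>(y) * src.width + r.left) * src.bytesPerPixel;
        out.pixels.insert(out.pixels.end(), row, row + rowBytes);
    }
    return out;
}
```

The cropped buffer can then be written out with whatever bitmap API you prefer. A crop that comes out empty or shows the wrong region usually means the rectangle was taken in different coordinates than the frame, or that the row stride assumption does not hold for your buffer.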

  28. Marco said

    Hi, I cant find the way to get per pixel data of any IFTImage. Is there a way?

    What I want to do is to send the color video buffer via sockets to a Adobe AIR program.

    You know if there may be another way to do this?

    Thanks in advance!

  29. Cluster said

    It will really help, if you make a post regarding Face Tracking in C#.

    One other thing, i want to know that . . .
    I want to do Face Recognition by using these Feature Points . . .

    Will you please suggest me how i can utilize these Feature Points and Depth information for Face Recognition purpose not for Face Detection (You are already doing Face Detection) ?

    Thanks in advance.

  30. Cluster said

    And please re share the samples of conference

  31. Cluster said

    How we get Face Feature points with depth information too ?

    • Todd said

      need to know the relation between the texture and the depth. There definitely is offset between the two camera.

      • Cluster said

        Yes, its true there is offset between the cameras.
        If we look at the default example of Face Tracking in C#, this.facePoints = frame.GetProjected3DShape(); return Feature Points collection. By using this i can only get X, Y of Feature Point. I need depth information as well. How we can achieve this ?

        CoordinateMapper.MapColorFrameToDepthFrame return depth points ? I am not getting any clue how i can use these depth points with correspond to Feature Points ?

      • nsmoly said

        The depth and color camera 3D spaces are not the same and there is an offset between them (strictly speaking, they are also rotated relative to each other, since the cameras’ optical axes are not perfectly parallel). The FT SDK returns 3D results in the color camera 3D space and 2D points in the color camera 2D uv space. To convert back to the depth camera space (where the Kinect skeleton exists) you need to apply an extrinsic transform (rotation, translation). It is not easy to get it from the Kinect APIs (no way in fact), but see this post –
        to get an approximation of it.


      • Todd said

        m_pFaceTracker->Initialize(&videoConfig, pDepthConfig, FTRegDepthToColor, NULL);

        The FTRegDepthToColor is self-defined function to map my own depth info to color info. In C++ version, if you use kinect default, you leave the 3rd parameter of Initialize as NULL, so the face tracker uses the kinect mapping function. Have you tried to look into this direction? I do not use the kinect texture and depth mapping yet so i am not sure the answer to your question. But hope this reply helps!

      • Cluster said

        Todd, you are right on your place.
        But i am not looking for texture mapping.

        Let suppose, you get face feature points through this.facePoints = frame.GetProjected3DShape(); It’s just a coordinate (123, 213) etc. Nothing else. I am looking for Depth information with these points. Like (X, Y, Depth) = (123,213, 1423) depth in mm.

        What should I do to get the depth information corresponding to that particular point?

        I am confused and I am new to this.
        Please guide

      • Todd said

        OK, now I see. So actually what you try to do is reversed mapping of projected 2D coordinates to the original 3D coordinates in world space. You need the world matrix, view matrix and projection matrix. Since the class also output the translation, rotation, scale parameters, world matrix should be easy to build. The other two I am not sure.

      • Cluster said

        Again misunderstanding!
        Depth information is separate and feature point information is separate.
        Todd do you have any media of communication so that i can discuss with you easily ? I am on skype. Please share any media of communication ASAP.

      • Todd said

        Hello, Cluster
        The func you use is “frame.GetProjected3DShape()”, which is used to draw the yellow lines of the 3D face model on the IFTImage object. If you have learned CG before, it is straightforward that this 2D coordinate actually comes from a 3D coordinate after its projection. In my opinion, it is not correct to directly use the 2D (x, y) as the (x, y) of a 3D (x, y, z). What makes sense is to unproject the 2D coordinate to 3D, but why?

        If you go to the Microsoft Kinect Face Tracking SDK web page, you can find that the author uses a simple standard 3D face model and adapts it to the information captured by the Kinect. In the current face tracking domain, the idea of adapting a generic 3D face model to 2D video or texture has been talked about a lot. So, if you really use “frame.GetProjected3DShape()”, you might need to consider unprojecting to get the coordinates right.

        However, if you just need 2D coordinate of a pixel and its represented depth value, you need to know the mapping function of the two camera. Just google “Kinect two camera calibration” to get some useful information.

        Hope helps!
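For readers unsure what "unprojecting" means here, a minimal pinhole-camera sketch follows. It is not the SDK's internal code. It assumes a single focal length for X and Y (as nsmoly describes for the FT API), a principal point at the image centre, image v growing downward, camera Y pointing up and Z pointing forward. All type and function names are hypothetical, and the depth must already be the value registered to that color pixel.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
struct Pixel { double u, v; };

// ASSUMPTION: ideal pinhole camera, principal point at the image centre.
struct PinholeCamera {
    double focalPx;            // same focal length in X and Y
    double widthPx, heightPx;
};

// 2D pixel plus its depth -> 3D point in the camera's space.
// The result is in the same units as `depth`.
Vec3 unproject(const PinholeCamera& cam, Pixel p, double depth) {
    assert(cam.focalPx > 0.0);
    Vec3 out;
    out.x = (p.u - 0.5 * cam.widthPx) * depth / cam.focalPx;
    out.y = (0.5 * cam.heightPx - p.v) * depth / cam.focalPx;  // flip: v grows downward
    out.z = depth;
    return out;
}

// 3D point in the camera's space -> 2D pixel (inverse of unproject).
Pixel project(const PinholeCamera& cam, Vec3 pt) {
    assert(pt.z > 0.0);
    Pixel out;
    out.u = 0.5 * cam.widthPx + cam.focalPx * pt.x / pt.z;
    out.v = 0.5 * cam.heightPx - cam.focalPx * pt.y / pt.z;
    return out;
}
```

Note that a wrong focal length scales x and y but leaves z untouched, which is one reason the tracked mask can drift off the face when the camera configuration is inaccurate.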

      • nsmoly said

        IFTResult has 2D points that are feature points on the face in color (u,v) space. It also has so-called “animation units”, a 3D head pose and “shape units”, which you can use to reconstruct the 3D points on the face (121 of them) by calling the IFTModel interface. There are 2 methods: one to get 3D points given the 3D pose, animation units and shape units (from the IFTResult interface), and a second to get the 3D points projected onto the uv color image. The 2D points from IFTResult are a bit different and exist only in the uv 2D space.

        Sorry for this duality. We will fix it in future releases. Also, the returned 3D points are in the color camera 3D space, not in Kinect’s depth camera space (like the skeleton)! See my other post about how to convert from color camera space to depth camera space. This release is still a bit experimental, so the API is not fully baked (which is why it is part of the optional dev kit). Future releases will be 100% integrated with the skeleton API.

      • nsmoly said

        Unfortunately, Kinect API does not have a function to map color UV coordinates to depth UV coordinates. You can create an approximation of it (invert provided depth UV to color UV function) if you do the following:
        1) If the depth frame you use is 2 times smaller than the color frame, scale down your color UV by 2 (divide by 2) to get a 1st approximation of the depth UV coordinates.
        2) Sample Z in the neighbourhood of the converted depth UV point (you may have a hole there, but you need to find at least 1 closest point with valid Z). Then apply forward transform from Kinect API to jump back from depth UV to color UV.
        3) Calculate a vector between your original color UV point and newly computed color UV point. Scale it down by 2 and apply this “shift vector” to your depth point (from which you got your last color UV point) to get a new “better” depth UV estimation.
        4) repeat steps 2 and 3 until convergence (until your point in color stops moving or the vector becomes very small).

        I’ll try to post this code soon online.
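Until that code is available, the four steps above can be sketched as follows. This is an illustration, not the author's implementation. `DepthFrame`, `DepthToColorFn` and the function names are hypothetical, and the forward depth-to-color mapping is passed in as a callback standing in for the Kinect API call. A hole is assumed to be a depth of 0, and the search radius of 3 pixels is an arbitrary choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <functional>
#include <vector>

struct Point2 { double x, y; };

// Hypothetical depth frame: row-major depth values, 0 marks a hole.
struct DepthFrame {
    int width, height;
    std::vector<uint16_t> mm;
    uint16_t at(int x, int y) const { return mm[y * width + x]; }
};

// Stand-in for the forward depth UV -> color UV mapping of the Kinect API.
typedef std::function<Point2(int x, int y, uint16_t z)> DepthToColorFn;

// Step 2 helper: closest pixel with valid depth within `radius` of (x, y).
bool nearestValid(const DepthFrame& f, int x, int y, int radius, int& outX, int& outY) {
    long best = -1;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            const int px = x + dx, py = y + dy;
            if (px < 0 || py < 0 || px >= f.width || py >= f.height) continue;
            if (f.at(px, py) == 0) continue;
            const long d2 = static_cast<long>(dx) * dx + static_cast<long>(dy) * dy;
            if (best < 0 || d2 < best) { best = d2; outX = px; outY = py; }
        }
    }
    return best >= 0;
}

// Returns true and writes the depth pixel once the mapped color point is
// within one depth pixel (`scale` color pixels) of the requested one.
bool colorToDepth(Point2 color, const DepthFrame& depth, const DepthToColorFn& depthToColor,
                  Point2& outDepth, double scale = 2.0, int maxIterations = 20) {
    assert(scale > 0.0);
    Point2 guess = { color.x / scale, color.y / scale };            // step 1
    for (int i = 0; i < maxIterations; ++i) {
        int sx = 0, sy = 0;
        if (!nearestValid(depth, static_cast<int>(std::lround(guess.x)),
                          static_cast<int>(std::lround(guess.y)), 3, sx, sy)) {
            return false;                                           // no valid Z nearby
        }
        const Point2 mapped = depthToColor(sx, sy, depth.at(sx, sy));  // step 2
        const double ex = color.x - mapped.x;                       // step 3
        const double ey = color.y - mapped.y;
        if (std::hypot(ex, ey) < scale) {                           // step 4
            outDepth.x = sx;
            outDepth.y = sy;
            return true;
        }
        guess.x = sx + ex / scale;                                  // shifted estimate
        guess.y = sy + ey / scale;
    }
    return false;                                                   // did not converge
}
```

Once the depth pixel is found, its Z value is the depth belonging to the original color point. The iteration cap guards against oscillation around holes.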


    • nsmoly said

      If you call IFTModel interface it will return you the 3D points of the face mask in color camera space (3D points). Then apply method described here –
      to move to the depth camera space (3D space with the origin at the depth camera center, or the Kinect skeleton space) if needed.

  32. Avner said

    Is there a way to use images on the computer that weren’t captured by the kinect camera (jpegs for instance) – using the kinect SDK to get the 87 image points?

    • nsmoly said

      You can do it if you simulate a depth frame and have a Kinect connected to your PC

      • Avner said

        Unfortunately I have never used the Kinect SDK and it is more as proof of concept for some ASM algorithm I was working on. Can you supply a code snippet that takes a image.jpg path and returns a list of coordinates?
        I have no idea how to simulate a “depth frame” :) BTW is it possible to simulate a kinect camera connection (Maybe through a driver)?

  33. PabloRomero said

    Hi, I’ve seen that in your YouTube video you show the head pose (yaw, roll, pitch). My question is:

    How can I show this in the sample Kinect program? Is there a ready-made function? And if there is not, how can I draw the information on the video in real time?

    By the way, thanks,


  34. Hi, is the toolkit able to track non-symmetric expression, like lips and eyebrow?

  35. anuran said

    I am trying to record the values of the Action Units (AU) to a file, but the AU values change continuously for a static face (tested with a dummy face, with no change in the position of the Kinect, the light, or the dummy face background). Is there any way to stop that from happening?

  36. nivanka said

    Is there a way to recognize a specific action by the user (for example, a nod gesture)?
    Thank you. :)

  37. Dylan said

    What is the rotation order of the rotation result of IFTResult::Get3DPose? Is it XYZ, ZXY, etc.? There seems to be no official documentation on it. Also does the Face Tracking API filter its input, or would it benefit from the depth input being prefiltered by me to remove flickering and fill in holes?

  39. Niket said

    I want to eliminate the dependency on the IR camera in this application. In this manner I won’t be able to get the 3D face mask but can I still get the box around the person’s face?

    • nsmoly said

      You cannot eliminate dependency on Kinect camera in this API. If you need only face detection, then I suggest using available face detection APIs (for example from OpenCV)

  40. Palash Goyal said

    I am printing the values of Action Units but they are nowhere close to the expected values. Is there any mistake on my part? Is there any way to train the Kinect SDK with the new face and use that model to get AUs(using the SDK tracking function)?

    • nsmoly said

      AU values may change from person to person. So if you want to use it to drive an avatar you need extra filtering/smoothing on top.
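One simple form of such smoothing is an exponential moving average over each AU coefficient. The sketch below is one possible filter, not what the SDK or Avatar Kinect uses. The class name is hypothetical, and the smoothing factor has to be tuned by hand (smaller values are smoother but lag more).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exponential moving average over a vector of animation unit coefficients.
class AuSmoother {
public:
    // alpha in (0, 1]: weight of the newest frame.
    explicit AuSmoother(double alpha) : alpha_(alpha) {
        assert(alpha > 0.0 && alpha <= 1.0);
    }

    // Feed the raw AUs of one frame, get the smoothed AUs back.
    const std::vector<double>& update(const std::vector<double>& raw) {
        if (state_.size() != raw.size()) {
            state_ = raw;                 // first frame (or AU count changed): no history yet
            return state_;
        }
        for (std::size_t i = 0; i < raw.size(); ++i) {
            state_[i] += alpha_ * (raw[i] - state_[i]);
        }
        return state_;
    }

private:
    double alpha_;
    std::vector<double> state_;
};
```

The filter should be reset (a new `AuSmoother` created) whenever tracking is lost, otherwise stale values leak into the next tracked face.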

  41. yu said

    I am using the face tracking SDK to do some face tracking, and there are some tracking failures. I printed the tracking status returned by the IFTResult::GetStatus method. There are several different return values:

    first situation:return value is -1884553213, this situation means FT_ERROR_FACE_DETECTOR_FAILED, the sdk provides the meaning of the value.

    second situation: return value is -1884553208, it means FT_ERROR_EVAL_FAILED, the sdk also provides the meaning.

    third: return value is -1884553204, the sdk does not provide the meaning of the value.

    fourth: return value is -1884553205, the sdk does not provide the meaning of the value too.

    So, I would like to know what the latter 2 return values mean.

    About all the 4 kinds of tracking failed, could you give me any suggestions/ideas/information?

    Thanks in advance!

    • yu said

      About the latter 2 values, I found their meanings: the 3rd means FT_ERROR_USER_LOST, the 4th means FT_ERROR_HEAD_SEARCH_FAILED. So about these 4 kinds of failure, could you give me any suggestions/ideas/information?
      Thank you very much!
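The negative decimal numbers in this thread are easier to read in hexadecimal, where the low 16 bits carry the error code. A small sketch for decoding them follows. The name table contains only the four codes identified in the comments above, and the 0x8FAC prefix is inferred from the values reported here (for example -1884553214 = 0x8fac0002). The authoritative list is in FaceTrackLib.h.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Print a status value the way it appears in headers, e.g. "0x8FAC0003".
std::string toHex(int32_t status) {
    char buf[16];
    const int n = std::snprintf(buf, sizeof buf, "0x%08X",
                                static_cast<unsigned>(static_cast<uint32_t>(status)));
    assert(n == 10);
    (void)n;
    return buf;
}

// Low 16 bits: the error code itself.
unsigned errorCode(int32_t status) {
    return static_cast<uint32_t>(status) & 0xFFFFu;
}

// Names for the codes identified in this comment thread only.
const char* errorName(int32_t status) {
    if ((static_cast<uint32_t>(status) >> 16) != 0x8FACu) {
        return "not a face tracking error code";
    }
    switch (errorCode(status)) {
        case 0x3: return "FT_ERROR_FACE_DETECTOR_FAILED";
        case 0x8: return "FT_ERROR_EVAL_FAILED";
        case 0xB: return "FT_ERROR_HEAD_SEARCH_FAILED";
        case 0xC: return "FT_ERROR_USER_LOST";
        default:  return "unknown here, check FaceTrackLib.h";
    }
}
```

For instance, -1884553213 prints as 0x8FAC0003 and maps to FT_ERROR_FACE_DETECTOR_FAILED.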
