VR Drone

April 5, 2016

VR Drone based on DJI S900 hexacopter. The camera is attached at the bottom.


I made this prototype in 2015 while exploring whether Virtual Reality headsets and drones could be fused to make flying easier. It is a drone (a hexacopter, to be precise) with a custom-made wide-angle stereo camera: two calibrated fisheye cameras in a parallel setup with a human-like baseline (calibrating the fisheye lenses was a bit tricky). The camera images form two stereo hemispheres, one for each eye. The cameras cover 185 degrees, and the stereo effect is good over a 160-degree range. There are some convergence issues on the sides, but that is acceptable for peripheral vision.
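As a back-of-the-envelope illustration of the stereo geometry involved (my own sketch, not the project's code; the function name and the ~0.065 m "human-like" baseline are assumptions): for a parallel stereo pair, a point at distance Z produces a disparity of f·B/Z pixels, where B is the baseline and f the focal length in pixels.

```cpp
#include <cmath>

// Disparity of a point at depthMeters for a parallel stereo rig
// with the given baseline and focal length (in pixels).
// A human-like baseline (~0.065 m) yields natural-scale depth perception.
float DisparityPixels(float focalPx, float baselineMeters, float depthMeters)
{
    return focalPx * baselineMeters / depthMeters;
}
```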


Stereo Camera With Tegra TK1


The camera streams a real-time H.264 stereo video feed at 30 fps over WiFi or 4G. The streaming and image processing are done on an NVIDIA Tegra TK1 board. Big thanks to NVIDIA for making such an awesome mobile computer! The Tegra TX1 is even better now (in 2016).

The ground station receives the video feed and renders it in an Oculus Rift DK2 at 75 fps with upsampling. The higher rendering frame rate is important for reducing VR sickness. In theory, one needs to stream 90 fps from the camera to avoid VR sickness, but in practice that consumes too much network bandwidth and is not practical (as of 2015-2016). Wireless streaming from a moving drone with existing protocols (over WiFi or 4G) proved to be a challenge. In addition to the streaming issues, I had one nasty flyaway and one crash due to radio interference between the GPS and other antennas. I ended up adding some RF shielding (photos below). I guess aluminum foil is critical tech in aerospace.


RF shielding

A drone operator gets a 180-degree immersive view with digital panning in the Virtual Reality headset and flies the drone from that view. This project uses digital panning rather than the mechanical gimbals used by other projects I looked at. Digital panning reduces VR sickness because the view rendering has very little latency (<10 ms when looking around). In a way, the user experience is similar to cinema: the video itself is 24 fps, but looking around happens at a much higher, human-like frame rate.
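To illustrate what digital panning boils down to (a hypothetical sketch under an assumed equidistant "f-theta" fisheye model, not this project's actual code): for each headset orientation, a view ray is mapped into fisheye image coordinates, where the distance from the image center is proportional to the ray's angle off the optical axis. Looking around only changes which pixels are sampled, so it adds no camera or network latency.

```cpp
#include <cmath>

struct Pixel { float u, v; };

// Map a view ray (in the camera frame, +Z = optical axis) to pixel
// coordinates of an equidistant ("f-theta") fisheye image:
// radial distance from the image center = focalPx * theta,
// where theta is the angle between the ray and the optical axis.
Pixel FisheyeProject(float dirX, float dirY, float dirZ,
                     float focalPx, float centerU, float centerV)
{
    float len   = std::sqrt(dirX*dirX + dirY*dirY + dirZ*dirZ);
    float theta = std::acos(dirZ / len);       // angle off the optical axis
    float r     = focalPx * theta;             // equidistant projection
    float phi   = std::atan2(dirY, dirX);      // azimuth around the axis
    return { centerU + r * std::cos(phi), centerV + r * std::sin(phi) };
}
```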

Here is the flight video (with Oculus footage):


I tested this system with about 15 people. Test pilots could take off, fly and land in VR with no issues. Users reported a sensation of flying and out-of-body experiences. About 70% of users did not experience motion/VR sickness, thanks to the rendering/control scheme used in this project. People usually experience less sickness when standing than when flying seated.









Sailing simulation demo

February 18, 2014

A few years ago (in 2010) I made a sailing simulation demo. This video shows some footage from it:


The sailboat model is motivated by the real physics of sailing, but it is not 100% correct. I made a dynamical system that roughly approximates the real thing. After reading several white papers, I realized that making a "real" simulator would be a huge endeavor, so I cut a few corners :-). Even with these simplifications, the model can tack realistically, can get stuck "in irons", the wind can blow the boat backwards, and so on. I also implemented a simple collision system, AI that drives the NPC boats (the ones with red sails), line of sight, wave/ocean simulation, skeletal animation, lighting, hand-crafted 3D models, etc. The AI that controls the NPC boats does not cheat: it controls the same dynamical system as the player's boat, by doing global optimization over a control/goal function. The goal of the NPC boats is to catch the player's boat (the one with white sails). The demo is built on DirectX 9.0 and uses the basic framework that comes with the DX9 code samples.
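To give a flavor of the kind of simplification such a model can use (this is my own toy sketch, not the demo's actual equations), here is a crude sail thrust function: no forward drive inside the ~45-degree "no-go zone" around the wind direction, which is exactly how a boat gets stuck "in irons".

```cpp
#include <cmath>

// Toy sail model: forward thrust as a function of the boat heading relative
// to the direction the wind blows FROM (both in radians).
// Inside the no-go zone (within ~45 degrees of the wind) the sail produces
// no drive, so the boat can get stuck "in irons".
float SailThrust(float boatHeadingRad, float windFromRad, float windSpeed)
{
    const float kPi = 3.14159265358979f;
    // Angle off the wind, folded into [0, pi]; 0 = pointing straight upwind
    float offWind = std::fabs(std::remainder(boatHeadingRad - windFromRad, 2.0f * kPi));
    const float noGoZone = kPi / 4.0f; // ~45 degrees
    if (offWind < noGoZone)
        return 0.0f; // "in irons": no drive
    return windSpeed * std::sin(offWind); // crude lift/drive approximation
}
```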

You can use the demo source code if you wish (as is, with no warranty). The ZIP file with the C++ source code (a VS 2010 project), shaders and the demo executable (runs on Windows with DX9 installed) is located in this file – SailingSim.zip


My team at Microsoft shipped the high-definition face tracking and face shape modelling tech as part of Xbox One Kinect back in November 2013. Here are some facial expressions that Kinect can track in real time and animate in 3D (gray masks) with pretty good precision:



The system finds a large number of semantic points on a user's face in 3D. Developers can then use this information to animate avatars or do other processing (like recognizing expressions). This video shows the early prototype of this technology that I made back in 2012:

This is a more technical video that demos the 3D mask painted over the video stream (for those who like this stuff):

The algorithm we created is a fusion of a 2D video feature tracker based on the Active Appearance Model (AAM) and a 3D tracker based on the Iterative Closest Point (ICP) algorithm, which aligns the 3D face model to the depth data from Kinect. Both trackers are resolved together and linked by special 2D-3D constraints so that the alignment makes sense and looks natural. We are going to publish the details as a paper some time soon. We also compute the 3D face shape for a given user from a set of input frames (RGB + depth).

We published this paper that describes the face tracking and modeling algorithms in detail. The resulting 3D models are pretty realistic and look like this:




Now it is up to game developers to use this tech in their titles! The API is available as part of the Xbox One XDK. For example, you can produce NPC faces like this in your game:

Nikolai as a textured shape model



The 3D model that we shipped in the Xbox One Kinect face capture system is very capable and flexible. The face tracking and face shape modelling are used in the just-released Kinect Sports Rivals game by the Rare studio. You can see the face shape modelling demo in this video:



Check out these videos by the DIY UCap System:

This one shows the final "product" – http://vimeo.com/64562384

This one shows the underlying tech – http://vimeo.com/64563659

And this one shows some of the static expressions – http://vimeo.com/64565212

This is a really cool demo of what will be possible in computer graphics –

Those lights and reflections are almost too good to believe! I wonder how fast it would drain a mobile device battery…


Kinect has two cameras, video and depth (IR), and therefore there are two different coordinate systems in which you can compute things: the depth camera frame of reference (which Kinect's skeleton API uses and returns results in) and the color camera coordinate system. The Face Tracking API, which we shipped with the Kinect For Windows SDK 1.5 developer toolkit, computes results in the color camera frame since it relies heavily on RGB data. To use the 3D face tracking results with the Kinect skeleton, you may want to convert from the color camera space to the depth camera space. Both are right-handed coordinate systems with Z pointing out (towards the user) and Y pointing up, but the two systems do not share the same origin, and their axes are not collinear due to camera differences. Therefore you need to convert from one system to the other.

Unfortunately, the Kinect API does not provide this functionality yet. The proper way to convert between the two camera spaces is to calibrate the cameras and use their extrinsic parameters for the conversion. Unfortunately, the Kinect API does not expose those either, nor does it provide any function that does the conversion. So I came up with the code (below) that can be used to approximately convert from the color camera space to the depth camera space. This code only approximates the "real" conversion, so keep that in mind when using it. The code is provided "as is" with no warranties; use it at your own risk 🙂

/*
    This function demonstrates a simplified (and approximate) way of converting from the color camera space to the depth camera space.
    It takes a 3D point in the color camera space and returns its coordinates in the depth camera space.
    The algorithm is as follows:
        1) Take a point in the depth camera space that is near the resulting converted 3D point. As a "good enough" approximation
           we take the coordinates of the original color camera space point.
        2) Project the depth camera space point to (u,v) depth image space.
        3) Convert the depth image (u,v) coordinates to (u',v') color image coordinates with the Kinect API.
        4) Un-project the converted (u',v') color image point to the 3D color camera space (using the known Z from the depth space).
        5) Find the translation vector between the two spaces as translation = colorCameraSpacePoint - depthCameraSpacePoint.
        6) Translate the original color camera space 3D point by the inverse of the computed translation vector.

    This algorithm is only a rough approximation and assumes that the transformation between camera spaces is roughly the same in
    a small neighbourhood of a given point.
*/
HRESULT ConvertFromColorCameraSpaceToDepthCameraSpace(const XMFLOAT3* pPointInColorCameraSpace, XMFLOAT3* pPointInDepthCameraSpace)
{
    // Camera settings - these should be changed according to the camera mode
    float depthImageWidth = 320.0f;
    float depthImageHeight = 240.0f;
    float depthCameraFocalLengthInPixels = NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS;
    float colorImageWidth = 640.0f;
    float colorImageHeight = 480.0f;
    float colorCameraFocalLengthInPixels = NUI_CAMERA_COLOR_NOMINAL_FOCAL_LENGTH_IN_PIXELS;

    // Take a point in the depth camera space near the expected resulting point. Here we use the passed color camera space 3D point.
    // We want to convert it from the depth camera space back to the color camera space to find the shift vector between the spaces.
    // Then we will apply the inverse of this vector to go back from the color camera space to the depth camera space
    XMFLOAT3 depthCameraSpace3DPoint = *pPointInColorCameraSpace;

    // Project the depth camera space 3D point to the depth image
    XMFLOAT2 depthImage2DPoint;
    depthImage2DPoint.x = depthImageWidth  * 0.5f + ( depthCameraSpace3DPoint.x / depthCameraSpace3DPoint.z ) * depthCameraFocalLengthInPixels;
    depthImage2DPoint.y = depthImageHeight * 0.5f - ( depthCameraSpace3DPoint.y / depthCameraSpace3DPoint.z ) * depthCameraFocalLengthInPixels;

    // Transform from the depth image space to the color image space
    NUI_IMAGE_VIEW_AREA viewArea = { NUI_IMAGE_DIGITAL_ZOOM_1X, 0, 0 };
    POINT colorImage2DPoint;
    HRESULT hr = NuiImageGetColorPixelCoordinatesFromDepthPixel(
         NUI_IMAGE_RESOLUTION_640x480, &viewArea,
         LONG(depthImage2DPoint.x + 0.5f), LONG(depthImage2DPoint.y + 0.5f),
         USHORT(depthCameraSpace3DPoint.z * 1000.0f) << NUI_IMAGE_PLAYER_INDEX_SHIFT,
         &colorImage2DPoint.x, &colorImage2DPoint.y );
    if( FAILED(hr) )
    {
        return hr;
    }

    // Un-project to the color camera space
    XMFLOAT3 colorCameraSpace3DPoint;
    colorCameraSpace3DPoint.z = depthCameraSpace3DPoint.z;
    colorCameraSpace3DPoint.x = (( float(colorImage2DPoint.x) - colorImageWidth*0.5f  ) / colorCameraFocalLengthInPixels) * colorCameraSpace3DPoint.z;
    colorCameraSpace3DPoint.y = ((-float(colorImage2DPoint.y) + colorImageHeight*0.5f ) / colorCameraFocalLengthInPixels) * colorCameraSpace3DPoint.z;

    // Compute the translation from the depth camera space to the color camera space
    XMVECTOR vTranslationFromColorToDepthCameraSpace = XMLoadFloat3(&colorCameraSpace3DPoint) - XMLoadFloat3(&depthCameraSpace3DPoint);

    // Transform the original color camera 3D point to the depth camera space by applying the inverse of the computed shift vector
    XMVECTOR v3DPointInKinectSkeletonSpace = XMLoadFloat3(pPointInColorCameraSpace) - vTranslationFromColorToDepthCameraSpace;
    XMStoreFloat3(pPointInDepthCameraSpace, v3DPointInKinectSkeletonSpace);

    return S_OK;
}
After a long journey, my team at Microsoft shipped the Face Tracking SDK as part of Kinect For Windows 1.5! I worked on the 3D face tracking technology (starting from the days when it was part of Avatar Kinect), so I'd like to describe its capabilities and limitations in this post. First of all, here is the demo:


You can use the Face Tracking SDK in your program if you install the Kinect for Windows Developer Toolkit 1.5. After installing it, go to the provided samples and run/build the "Face Tracking Visualization" C++ sample or the "Face Tracking Basics-WPF" C# sample. Of course, you need a Kinect camera attached to your PC 😉 The face tracking engine runs at 4-8 ms per frame depending on how powerful your PC is. It does its computations on the CPU only (it does not use the GPU, since that may be needed to render graphics).

If you look at the two code samples mentioned above, you can see that it is relatively easy to add face tracking to your application. You need to link against the provided lib, place two DLLs in the global path or in the working directory of your executable (so they can be found), and add something like this to your code (this is C++; you can also do it in C#, see the code samples):

// Include the main Kinect SDK .h file
#include "NuiAPI.h"

// Include the face tracking SDK .h file
#include "FaceTrackLib.h"

// Create an instance of a face tracker
IFTFaceTracker* pFT = FTCreateFaceTracker();
if(!pFT)
{
    // Handle errors
}

// Initialize the camera configuration structures.
// IMPORTANT NOTE: resolutions and focal lengths must be accurate, since they affect tracking precision!
// It is better to use the enums defined in NuiAPI.h

// Video camera config with width, height, focal length in pixels
// The NUI_CAMERA_COLOR_NOMINAL_FOCAL_LENGTH_IN_PIXELS focal length is computed for 640x480 resolution
// If you use a different resolution, multiply this focal length by the scaling factor
FT_CAMERA_CONFIG videoCameraConfig = {640, 480, NUI_CAMERA_COLOR_NOMINAL_FOCAL_LENGTH_IN_PIXELS};

// Depth camera config with width, height, focal length in pixels
// The NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS focal length is computed for 320x240 resolution
// If you use a different resolution, multiply this focal length by the scaling factor
FT_CAMERA_CONFIG depthCameraConfig = {320, 240, NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS};

// Initialize the face tracker
HRESULT hr = pFT->Initialize(&videoCameraConfig, &depthCameraConfig, NULL, NULL);
if( FAILED(hr) )
{
    // Handle errors
}

// Create a face tracking result interface
IFTResult* pFTResult = NULL;
hr = pFT->CreateFTResult(&pFTResult);
if( FAILED(hr) )
{
    // Handle errors
}

// Prepare image interfaces that hold RGB and depth data
IFTImage* pColorFrame = FTCreateImage();
IFTImage* pDepthFrame = FTCreateImage();
if(!pColorFrame || !pDepthFrame)
{
    // Handle errors
}

// Attach the created interfaces to the RGB and depth buffers that are filled with
// the corresponding RGB and depth frame data from the Kinect cameras
pColorFrame->Attach(640, 480, colorCameraFrameBuffer, FTIMAGEFORMAT_UINT8_R8G8B8, 640*3);
pDepthFrame->Attach(320, 240, depthCameraFrameBuffer, FTIMAGEFORMAT_UINT16_D13P3, 320*2);
// You can also use the Allocate() method, in which case the IFTImage interfaces own their memory.
// In that case use the CopyTo() method to copy buffers

FT_SENSOR_DATA sensorData;
sensorData.pVideoFrame = pColorFrame;
sensorData.pDepthFrame = pDepthFrame;
sensorData.ZoomFactor = 1.0f;   // Not used, must be 1.0
sensorData.ViewOffset.x = 0;    // Not used, must be (0,0)
sensorData.ViewOffset.y = 0;

bool isFaceTracked = false;

// Track a face
while ( true )
{
    // Call the Kinect API to fill colorCameraFrameBuffer and depthCameraFrameBuffer with RGB and depth data

    // Check if we are already tracking a face
    if( !isFaceTracked )
    {
        // Initiate face tracking.
        // This call is more expensive and searches the input frame for a face.
        hr = pFT->StartTracking(&sensorData, NULL, NULL, pFTResult);
        if( SUCCEEDED(hr) && SUCCEEDED(pFTResult->GetStatus()) )
        {
            isFaceTracked = true;
        }
        else
        {
            // No faces found
            isFaceTracked = false;
        }
    }
    else
    {
        // Continue tracking. It uses the previously known face position.
        // This call is less expensive than StartTracking()
        hr = pFT->ContinueTracking(&sensorData, NULL, pFTResult);
        if( FAILED(hr) || FAILED(pFTResult->GetStatus()) )
        {
            // Lost the face
            isFaceTracked = false;
        }
    }

    // Do something with pFTResult like visualize the mask, drive your 3D avatar,
    // recognize facial expressions, etc.
}

// Clean up: release pFTResult, pColorFrame, pDepthFrame and pFT when done

The code calls the face tracker via either the StartTracking() or ContinueTracking() function. StartTracking() is the more expensive function, since it searches the passed RGB frame for a face. ContinueTracking() uses the previous face location to resume tracking. StartTracking() is more stable when there are big gaps between frames, since it is stateless.

The face tracker operates in two modes: with skeleton-based information and without. In the first mode you pass an array of two head points to the StartTracking/ContinueTracking methods. These head points are the ends of the head bone contained in the NUI_SKELETON_DATA structure returned by the Kinect API; the head bone is indexed by the NUI_SKELETON_POSITION_HEAD member of the NUI_SKELETON_POSITION_INDEX enumeration. The first head point is the neck position and the second is the head position. These points let the face tracker find a face faster and more easily, so this mode is cheaper in terms of compute (and sometimes more reliable at large head rotations). The second mode requires only a color frame plus a depth frame, with an optional region-of-interest parameter that tells the face tracker where in the RGB frame to search for a face. If the region of interest is not passed (i.e. passed as NULL), the face tracker tries to find a face in the full RGB frame, which is the slowest mode of operation of StartTracking(). ContinueTracking() uses the previously found face and so is much faster.

Camera configuration structure – it is very important to pass correct parameters in it, such as the frame width, height and the corresponding camera focal length in pixels. We don't read these automatically from the Kinect camera in order to give advanced users more flexibility. If you don't initialize them to the correct values (which can be read from the Kinect APIs), tracking accuracy will suffer or tracking will fail entirely.
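One detail worth spelling out is the focal length scaling mentioned in the code comments: the nominal focal length constants are defined for a specific resolution, and the focal length in pixels scales linearly with image width. A minimal sketch (the helper function is mine, and the constants are redefined here only to keep the snippet self-contained; in real code use the values from NuiAPI.h):

```cpp
#include <cmath>

// Nominal focal lengths as documented for the Kinect SDK (assumed values,
// redefined here only so this sketch compiles standalone):
const float kColorNominalFocalPx = 531.15f; // defined for 640x480
const float kDepthNominalFocalPx = 285.63f; // defined for 320x240

// Focal length in pixels scales linearly with image width, so multiply
// the nominal value by the ratio of the actual to the nominal width.
float ScaledFocalLength(float nominalFocalPx, float nominalWidth, float actualWidth)
{
    return nominalFocalPx * (actualWidth / nominalWidth);
}
```

For example, running the color camera at 1280x960 would double the 640x480 nominal focal length.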

Frame of reference for 3D results – the face tracking SDK uses both depth and color data, so we had to pick which camera space (video or depth) to compute 3D tracking results in. Due to some technical advantages, we decided to use the color camera space. So the frame of reference for the 3D face tracking results is the video camera space: a right-handed system with the Z axis pointing towards the tracked person and Y pointing up, with measurements in meters. It is very similar to Kinect's skeleton coordinate frame, except for the origin and the orientation of its optical axis (the skeleton frame of reference is in the depth camera space). The online documentation has a sample that describes how to convert from the color camera space to the depth camera space.

Also, here are several things that will affect tracking accuracy:

1) Light – a face should be well lit, without too many harsh shadows on it. Bright backlight or sidelight may degrade tracking.

2) Distance to the Kinect camera – the closer you are to the camera, the better the tracking. Quality is best when you are closer than 1.5 meters (4.9 feet) to the camera: at close range Kinect's depth data is more precise, so the face tracking engine can compute face 3D points more accurately.

3) Occlusions – if you have thick glasses or a Lincoln-like beard, you may have issues with face tracking. This is still an open area for improvement 🙂 Face color is NOT an issue, as can be seen in this video

Here are some technical details for the more technologically/math-minded: we used the Active Appearance Model as the foundation for our 2D feature tracker, then extended the computation engine to use Kinect's depth data so it can track faces/heads in 3D. This made it much more robust than 2D-only feature point trackers, since the Active Appearance Model alone is not robust enough to handle all real-world scenarios. Of course, we also used lots of secret sauce to make everything work well together 🙂 You can read about some of these algorithms here, here and here.

Have fun with the face tracking SDK!

We published this paper that describes the face tracking algorithm in detail.

Credits – Many people worked on this project or helped with their expertise:

Christian Hutema (led the project), Lin Liang, Nikolai Smolyanskiy, Sean Anderson, Evgeny Salnikov, Jayman Dalal, Jian Sun, Xin Tong, Zhengyou Zhang, Cha Zhang, Simon Baker, Qin Cai.