# FAQ - Frequently Asked Questions
Is this an open source version of the hand tracking from the RealSense SDK?
- Actually, no. The software here reimplements some ideas from the 2013 Intel® Skeletal Hand Tracking Library Experimental Release and adds some more modern CNN approaches for better initialization/recovery to improve robustness. Furthermore, unlike the SDK offerings, what's provided here is not an end-to-end solution, but rather a collection of resources and tools to help create one.
Why release this software? What value does it have after enough data has been collected?
- The more data the better. Also, if there are sensor or other HW changes, it may be necessary to collect new data, since the characteristics of the data can vary from one camera configuration to the next. It's not just the differences between the RealSense SR300 and RS400 cameras; within the RS400 product line there can be a variety of stereo baselines and illumination models. Furthermore, there are different usage models for hand tracking - will the camera be in a laptop lid, on a desk facing up, or on a VR HMD facing outward? The ideal training dataset will depend on the circumstances of the application.
What about using the other channels such as IR and RGB as inputs to the CNN?
- Sure, that data is already being saved by the annotation program and could easily be added as input to the CNN architecture. The reason it is not being done currently with the SR300 is that the depth data by itself has fairly consistent and crisp silhouette contours, which seemed to provide adequate results with the CNN trained using only the depth channel.
Why is the focused hand segment at a different angle in the 64x64 segmented view compared to the full depth image view?
- The goal is to always align the wrist and palm upward in order to reduce the variety of hand poses that the CNN ever sees. This reduces the learning burden, so the CNN can have fewer nodes/parameters, and fewer annotated frames need to be collected to cover the range of all possible inputs.
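As a rough illustration of that re-sampling step (a minimal sketch, not the repo's actual code), the snippet below assumes rough 2D wrist and palm estimates in the full depth image and builds the 64x64 crop with a rotation that maps the wrist-to-palm direction to "up" in the patch. The `DepthImage` struct and the `scale` parameter are placeholders invented for this example.

```cpp
#include <cmath>
#include <vector>

struct DepthImage { int w, h; std::vector<float> d; float at(int x, int y) const { return d[y * w + x]; } };

// Resample a size x size patch centered on the palm, rotated so the wrist-to-palm
// direction in the source image maps to "up" (decreasing v) in the output patch.
std::vector<float> AlignedHandPatch(const DepthImage& img,
                                    float wx, float wy,      // wrist estimate (pixels)
                                    float px, float py,      // palm estimate  (pixels)
                                    int size = 64, float scale = 1.0f)
{
    std::vector<float> patch(size * size, 0.0f);
    float dx = px - wx, dy = py - wy;
    float len = std::sqrt(dx * dx + dy * dy);
    if (len < 1e-6f) return patch;                           // degenerate wrist/palm estimate
    dx /= len; dy /= len;                                    // unit wrist->palm direction
    for (int v = 0; v < size; ++v)
        for (int u = 0; u < size; ++u)
        {
            float lx = (u - size / 2) * scale;               // patch offset from the palm center
            float ly = (v - size / 2) * scale;
            int sx = int(px - dy * lx - dx * ly);            // rotate the offset back into the
            int sy = int(py + dx * lx - dy * ly);            // full image and sample there
            if (sx >= 0 && sx < img.w && sy >= 0 && sy < img.h)
                patch[v * size + u] = img.at(sx, sy);
        }
    return patch;
}
```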
Is there a technical write-up of the approach used in this software?
- The readme file references the dynamics-based hand tracking paper from 2013. It describes the basic idea of fitting a hand model to the point cloud depth data by adding additional point-plane constraints to an articulated rigid-body solver. This technique works frame-to-frame until tracking is lost and the model gets caught in a bad local minimum or misfit. To improve robustness, the original research suggested spawning a number of simulations using a variety of ad-hoc heuristics to search for the best possible fit, but this is no longer needed when using machine learning (ML).
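To make the point-plane idea concrete, here is a toy sketch (not the library's actual solver): a single free-floating body with only a translational degree of freedom is "fit" to a few depth points by repeatedly relaxing point-plane constraints, each of which asks that a surface sample `s` with normal `n` satisfy dot(n, p - s) = 0 for its depth point `p`. The real system does this for every bone of the articulated hand model, together with the joint constraints.

```cpp
#include <cstdio>

struct V3 { float x, y, z; };
static V3 add(V3 a, V3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static V3 sub(V3 a, V3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 mul(V3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

int main()
{
    V3 position   = {0, 0, 0};                             // unknown body translation being solved for
    V3 surface[3] = {{0, 0, 1}, {1, 0, 1}, {0, 1, 1}};     // model surface samples (body space)
    V3 normal     = {0, 0, 1};                             // surface normal of this toy patch
    V3 cloud[3]   = {{0, 0, 1.5f}, {1, 0, 1.5f}, {0, 1, 1.5f}}; // depth points the surface should reach

    for (int iter = 0; iter < 20; ++iter)                  // relax each point-plane constraint in turn
        for (int i = 0; i < 3; ++i)
        {
            V3 s = add(surface[i], position);              // surface sample in world space
            float err = dot(normal, sub(cloud[i], s));     // signed distance along the normal
            position = add(position, mul(normal, err));    // move the body to cancel the error
        }

    std::printf("converged translation: %f %f %f\n", position.x, position.y, position.z);
    return 0;
}
```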
How exactly does machine learning help with hand tracking?
- Depth pixels from a depth camera indicate that there is something at some point XYZ in the view volume, but not what it is. Even if it is assumed to be a hand, it is still unknown where on the hand (which finger) each depth pixel belongs - even though it is painfully obvious which finger is which to anybody who looks at the image. CNN classification provides this same human-understanding level of information. The CNN used here is trained to identify approximate fingertip and palm locations on the cropped 2D depth image. So even starting from a single frame, there is (hopefully) enough information to remove all ambiguity and initialize the model close to the user's true pose. Then a simple fitting of the 3D model to the point cloud input should converge to a precise and correctly matching pose, just like snapping the final piece of a puzzle into place.
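A hypothetical sketch of how that landmark information can seed the 3D fit: assuming the CNN emits one small heatmap per landmark (five fingertips plus the palm) over the 64x64 crop, take the peak of each heatmap, read the depth at that pixel, and back-project with the camera intrinsics to get an approximate 3D position to pull the corresponding part of the model toward. The heatmap layout and intrinsics parameters here are illustrative assumptions, not the repo's actual interface.

```cpp
#include <vector>

struct Landmark3D { float x, y, z; };

// heatmaps: num_landmarks * 64 * 64 values, row-major; depth: the 64x64 crop (meters).
std::vector<Landmark3D> LandmarksFromHeatmaps(const std::vector<float>& heatmaps,
                                              const std::vector<float>& depth,
                                              int num_landmarks,
                                              float fx, float fy, float cx, float cy)
{
    const int N = 64 * 64;
    std::vector<Landmark3D> out;
    for (int k = 0; k < num_landmarks; ++k)
    {
        int best = 0;                                            // argmax of this landmark's heatmap
        for (int i = 1; i < N; ++i)
            if (heatmaps[k * N + i] > heatmaps[k * N + best]) best = i;
        int u = best % 64, v = best / 64;
        float z = depth[best];                                   // depth at the peak pixel
        out.push_back({ (u - cx) * z / fx, (v - cy) * z / fy, z }); // pinhole back-projection
    }
    return out;
}
```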
Can using just ML provide all of the full pose information on its own? In the future, will a model based system even be necessary for an interactive application with hand tracking?
- That is a very real possibility. At some point the frame-to-frame dynamics-based fitting algorithm may only be useful for automatically annotating ground-truth data-sets before any ML system has been trained to do the job. It may also be that, for performance/efficiency reasons, it still makes sense to use a fitting approach at the last stage if it requires less CPU/GPU resources.
How many frames need to be captured and properly labelled to train a CNN?
- The trained CNN weights (the binary block in the .cnnb file) included in this repo were trained using over 100K frames. Only actual captured depth frames were used. It's not just the number of frames; it's also ensuring there is good coverage over all the possible hand poses. I collected all the data myself and, not surprisingly, it works great for me. I'm able to clench and roll my fist and have the system follow my motion - a situation where there aren't enough geometric features for the point-cloud fitting to adequately track the hand (a depth camera can't see a spinning ball). Unfortunately, the tracking system didn't work as well for some of the other people here in the lab with hands shaped or sized differently than mine. So many more frames (probably millions) from a variety of users would be helpful.
Why not just synthetically generate training data instead of going to the trouble of collecting data-sets?
- Using both would be best. It is still helpful to use real camera data since, even after adding simulated camera noise and depth-reconstruction artifacts to synthetic data, it may not cover all the variety that an actual device experiences. With the ability to auto-label, there is also the possibility of improving ML performance with some on-site re-training for a particular user and/or environment.
Have there been any advances in 3D geometric model fitting since 2013?
- See the Articulated ICP research by Andrea Tagliasacchi et al. - same objective, more mathematical rigor in the formulation of the solution. For our purposes, we continued with a dynamics-based rigid-body model - a very successful solution for a wide variety of applications in gaming, animation, and robotics that is easily extensible for experimenting with new ideas. For example, the key landmark locations provided by our CNN were integrated into the solution by adding constraints on the known parts of the model and solving for them along with all the other joint-attachment, range-limit, collision, and point cloud constraints. Rather than concentrating on hand-model improvements, the research emphasis here (aka demo of the day) was more focused on HW depth sensing and prototyping full systems that incorporate various ML ideas.
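Schematically, that "one solver, many constraint types" design looks something like the sketch below: the CNN landmark hints become just another entry in the same constraint list as joint attachments, angular limits, collisions, and point-cloud point-plane terms, and everything is solved together each frame. The struct layout and names are illustrative, not the repo's actual data structures.

```cpp
#include <vector>

enum class ConstraintType { JointAttachment, AngularLimit, Collision, PointCloudPlane, CnnLandmark };

struct Constraint
{
    ConstraintType type;
    int bodyA, bodyB;      // indices of the hand-model bones involved (bodyB = -1 means "world")
    float target[3];       // attachment point, plane point, or CNN landmark position
    float axis[3];         // limit axis or plane normal (unused by some constraint types)
};

// Per frame: keep the permanent skeleton constraints, append the transient ones, then hand
// the whole list to the rigid-body solver.
std::vector<Constraint> BuildFrameConstraints(const std::vector<Constraint>& skeleton,      // joints + limits + collisions
                                              const std::vector<Constraint>& pointCloudFit, // point-plane terms
                                              const std::vector<Constraint>& cnnLandmarks)  // ML-provided hints
{
    std::vector<Constraint> all = skeleton;
    all.insert(all.end(), pointCloudFit.begin(), pointCloudFit.end());
    all.insert(all.end(), cnnLandmarks.begin(), cnnLandmarks.end());
    return all;
}
```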
Can the accuracy be improved? The frame-to-frame tracking isn't quite correct, even under slow movement.
- The best way to improve the 3D model fitting, for near pixel-perfect ground truth, isn't to focus on the solver, but rather to use a more accurate model. (Side note: for some reason I personally never noticed any inaccuracies with the model included with this software.) Describing hand variety requires a number of parameters. Unfortunately, the software here only provides manual uniform scaling; anything beyond this would require manually editing the model file. For a more automatic calibration, relevant research to consider includes Online Generative Model Personalization for Hand Tracking by A. Tkach et al.
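For context, that manual scaling amounts to a single global factor, roughly like the sketch below (hypothetical types, not the actual model file format). Real hands also differ in palm width, finger length ratios, and thickness, which is why per-segment parameters, or an automatic personalization method like the one by Tkach et al., would be needed for a closer fit.

```cpp
#include <vector>

struct Bone { float length, radius; };      // simplified stand-in for a hand-model segment

// Uniform scaling: every segment gets the same factor, so relative proportions never change.
void ScaleHandUniform(std::vector<Bone>& bones, float s)
{
    for (auto& b : bones) { b.length *= s; b.radius *= s; }
}
```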
What are the HW requirements to run the software and what level of optimizations have been done?
- The software runs on an x86 CPU. There are only minimal optimizations within the code - just enough to achieve a reasonable frame rate on a typical PC. In particular, there is a minor amount of SIMD vectorization in some of the CNN propagation, and a background thread is utilized during hand-tracking runtime to keep the main thread at full fps. This was intentional, in order to keep the code as simple as possible. The goal of this open source repo is to maximize the educational benefit to a wide audience of students and researchers interested in hand pose estimation. Not everyone in the CV or ML community (or the game industry, for that matter) will be familiar with heavily optimized low-level C++ code like you see in typical graphics engines.
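As an illustration of the kind of light SIMD involved (an assumption-laden sketch, not the repo's actual CNN code), a fully connected layer's inner product can be vectorized with SSE intrinsics so four multiply-adds happen per instruction:

```cpp
#include <immintrin.h>

// out[j] = dot(in, row j of W) + bias[j]; n is assumed to be a multiple of 4 for brevity.
void DenseForwardSSE(const float* in, const float* W, const float* bias,
                     float* out, int n, int m)
{
    for (int j = 0; j < m; ++j)
    {
        __m128 acc = _mm_setzero_ps();
        const float* w = W + j * n;
        for (int i = 0; i < n; i += 4)                  // 4 input*weight products per iteration
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(in + i), _mm_loadu_ps(w + i)));
        float lanes[4];
        _mm_storeu_ps(lanes, acc);                      // horizontal sum of the 4 lanes
        out[j] = lanes[0] + lanes[1] + lanes[2] + lanes[3] + bias[j];
    }
}
```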
also, I just realized while using this tool: you can essentially just record a section of video where you hold the same pose at the beginning (making the pose easy to recognize), then rotate your hand so parts get occluded, and in the annotator just lock the pose, lol
- Yup Justin. You got it.
Are there any plans for Unity or Unreal Engine support?
- No plans in the immediate future, but it's not hard to grab the transforms from the hand tracker and pass them into a game engine to drive a skeletal rig. The software provided here is intended to help build a tracking system, and the intended audience is students and researchers working on pose estimation. The fidelity of the tracking provided out of the box may not be sufficient for application developers to use in their programs.
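For anyone who wants to try it, the glue code would look roughly like this hedged sketch: copy each tracked bone's pose into the corresponding joint of an engine-side skeletal rig every frame. The `Pose`/`EngineRigJoint` types and the coordinate fix-up are hypothetical placeholders; the actual setters and handedness convention depend on the engine (Unity, Unreal, etc.).

```cpp
#include <cstddef>
#include <vector>

struct Quat { float x, y, z, w; };
struct Pose { float px, py, pz; Quat q; };
struct EngineRigJoint { Pose local; };                   // stand-in for a game-engine bone

void DriveRigFromTracker(const std::vector<Pose>& trackedBones,  // one pose per hand-model bone
                         std::vector<EngineRigJoint>& rig)
{
    for (std::size_t i = 0; i < trackedBones.size() && i < rig.size(); ++i)
    {
        Pose p = trackedBones[i];
        p.pz  = -p.pz;                                   // example handedness flip (engine dependent)
        p.q.x = -p.q.x; p.q.y = -p.q.y;                  // matching quaternion adjustment for that mirror
        rig[i].local = p;                                // hand the pose to the engine's joint
    }
}
```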