DeltaDorsal | Leyi Zou

Abstract

The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position.

DeltaDorsal extracts skin deformations for hand pose estimation without the need for temporal continuity. Our approach improves tracking under self-occlusion and in scenarios where conventional visual cues are weak or absent. Results show that DeltaDorsal outperforms state-of-the-art hand pose models in egocentric, self-occluded conditions and better recognizes subtle gestures that were previously difficult to capture from purely visual data.

Main Contributions

An analysis of the prevalence and impact of self-occlusion scenarios in common egocentric hand datasets, motivating the use of dorsal features.
An open-source end-to-end pipeline that transforms dorsal skin imagery into hand pose predictions and click detection without temporal dependencies.
An evaluation of the system’s performance on 12 participants versus state-of-the-art baselines, as well as analyses with respect to occlusion, skin tone, image size, and backbone.
Several exemplary applications of the system in key hand interactions, including pinching, tapping, and isometric force click.

High-Resolution Dorsal Dataset Collection

We collected a new dataset of over 170,000 high-resolution frames of dorsal hand data across 17 gestures from 12 participants.

Examples of each gesture in our data collection. Not depicted are fanning (30 s) and free form (60 s). Shown at the top are an aligned image of the reference, the image of the dorsal features during the gesture, and the cosine similarity map between the DINO features of the reference and the current image. Darker regions of the similarity map indicate lower cosine similarity, that is, a larger difference from the reference. (I: index finger, M: middle finger, R: ring finger, P: pinky, T: thumb).

System Design

System architecture. Users input a “reference” image of their hand in a neutral position and an image of the hand performing a gesture. An initial hand pose prediction from HaMeR is then used to align both hands so that their dorsal features are spatially localized. Images of the dorsal features are then fed into DINOv3 to extract image features. These features, along with the cosine similarity and the difference between the features of the “reference” image and the current image, are fed into the change encoder. A regression head then predicts the current hand pose, which can be processed with MANO using a prior shape prediction to generate a hand mesh. Optionally, users can use the initial translation prediction from HaMeR to localize the final mesh in the camera frame.
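The change-encoder inputs described above can be illustrated with a minimal sketch. It assumes that the reference and current images have already been aligned and converted into equal-length lists of per-patch feature vectors. The function names, the "current minus reference" sign convention, and the toy values are hypothetical and are not taken from the released pipeline. A real implementation would use batched tensor operations rather than plain Python lists.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def delta_features(ref_patches, cur_patches):
    """Build per-patch (similarity, difference) pairs.

    Both arguments are lists of patch feature vectors, assumed to come
    from spatially aligned reference and current images.
    The difference is computed as current minus reference (an assumption).
    """
    if len(ref_patches) != len(cur_patches):
        raise ValueError("reference and current need the same number of patches")
    out = []
    for ref, cur in zip(ref_patches, cur_patches):
        sim = cosine_similarity(ref, cur)
        diff = [c - r for r, c in zip(ref, cur)]
        out.append((sim, diff))
    return out


# Toy example: 2 patches with 3-dimensional features (hypothetical values).
reference = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
current = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
features = delta_features(reference, current)
```

In this toy example the first patch is unchanged, so its similarity is 1 and its difference is zero, while the second patch has orthogonal features and a similarity of 0. Low-similarity patches correspond to the darker regions of the similarity map in the dataset figure.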

Evaluation

Our proposed system reduces the mean per-joint angle error (MPJAE) by over 18% compared to SOTA models, mitigates the negative impacts of self-occlusion, and is not meaningfully affected by skin color. To demonstrate the practical utility of our approach, we evaluate its performance on downstream applications like pinch and tap detection. Finally, to illustrate the potential of our skin deformation analysis, we showcase an interaction not possible with conventional egocentric methods: isometric “force click” detection with no discernible hand motion, akin to a trackpad press on a surface or pressing fingers together from an already-touching pose.
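As a rough illustration of the headline metric, the sketch below computes a mean per-joint angle error and a relative error reduction. It assumes each joint is described by a single angle in degrees and that errors are averaged uniformly over joints. The paper's exact MPJAE definition may differ (for example, it may be computed on 3D joint rotations), and all names and numbers here are hypothetical.

```python
def angle_diff_deg(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def mpjae(pred, target):
    """Mean per-joint angle error over equal-length lists of joint angles."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total = sum(angle_diff_deg(p, t) for p, t in zip(pred, target))
    return total / len(pred)


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline (0.2 means 20%)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Toy values, not results from the paper.
error = mpjae([10.0, 350.0], [20.0, 10.0])  # per-joint errors: 10 and 20 degrees
reduction = relative_reduction(10.0, 8.0)
```

The wrap-around handling in `angle_diff_deg` treats 350° and 10° as 20° apart rather than 340°, which matters if angles are stored in a 0 to 360 range.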

For more system information, please refer to our paper.