Researchers Develop a Novel Vote-Based Model for More Accurate Hand-Held Object Pose Estimation
- Research
Researchers have developed a new deep learning framework for accurate and robust hand-held object pose estimation
Estimating the pose of hand-held objects is a critical and challenging problem in robotics and computer vision. While leveraging multi-modal RGB and depth data is a promising solution, existing approaches still struggle with hand-induced occlusions and with fusing the two modalities. In a new study, researchers developed a deep learning framework that addresses these issues by introducing a novel vote-based fusion module and a hand-aware pose estimation module.

Title: Novel deep learning framework for accurate and robust hand-object pose estimation
Caption: The proposed framework will enable robots to accurately and more efficiently handle complex objects, while also advancing augmented reality technologies to support more lifelike hand-object interactions.
Credit: Dan Ruscoe from Flickr
Source Link: https://openverse.org/image/f3ea2cc0-5dde-4804-842b-6b710256a785?q=Robotic+Arm&p=53
License: CC BY 2.0
Usage restrictions: You are free to share and adapt the material. Attribution is required, with a link to the license, and you must indicate if changes are made to the work.
Many robotic applications rely on robotic arms or hands to handle different types of objects. Estimating the pose of such hand-held objects is an important yet challenging task in robotics, computer vision, and even augmented reality (AR) applications. A promising direction is to use multi-modal data, such as color (RGB) and depth (D) images. With the increasing availability of 3D sensors, many machine learning approaches have emerged that leverage such multimodal data.
However, existing approaches still face two main challenges. First, their accuracy drops when hands occlude the objects they hold, obscuring critical features required for pose estimation. Hand-object interactions also introduce non-rigid transformations, which further complicate the problem: the hand can change the shape or structure of the held object, such as when squeezing a soft ball, distorting the object’s perceived geometry. Second, most current techniques extract features from separate RGB and RGB-D backbones and fuse them at the feature level. Because these two backbones handle inherently different modalities, this fusion can result in representation distribution shifts, meaning features learned from RGB images may misalign with those extracted from RGB-D inputs, degrading pose estimation. Furthermore, during fine-tuning, dense interactions between the two backbones disrupt performance and limit the benefits of incorporating RGB features.
To address these issues, a research team led by Associate Professor Phan Xuan Tan from the Innovative Global Program, College of Engineering at Shibaura Institute of Technology, Japan, along with Dr. Dinh-Cuong Hoang and other researchers from FPT University, Vietnam, developed an innovative deep neural network specifically designed for pose estimation from RGB-D images. “The key innovation of our deep learning framework lies in a vote-based fusion mechanism, which effectively integrates both 2D (RGB) and 3D (depth) keypoints, while addressing hand-induced occlusions and the difficulties of fusing multimodal data. Additionally, it decouples the learning process and incorporates a self-attention-based hand-object interaction model, resulting in substantial improvements,” explains Dr. Tan. Their study was made available online on February 17, 2025, and will be published in Volume 120 of the Alexandria Engineering Journal in May 2025.
The proposed deep learning framework comprises four components: backbones that extract high-dimensional features from 2D images and 3D point cloud data, voting modules, a novel vote-based fusion module, and a hand-aware object pose estimation module. Initially, the 2D and 3D backbones predict 2D and 3D keypoints of both hands and objects from the RGB-D images. Keypoints are meaningful locations in the input that help describe the pose of the hands and objects. Next, the voting modules within each backbone independently cast votes for their respective keypoints, as sketched below.
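To make the voting step concrete, here is a minimal PyTorch sketch of a keypoint-voting head, based only on the description above rather than the authors’ code. Each seed position (a pixel for the RGB branch, a 3D point for the depth branch) regresses per-keypoint offsets, and a vote is the seed plus its offset; the module name, layer sizes, and coordinate dimensions are illustrative assumptions.

```python
# Hypothetical keypoint-voting head (a sketch, not the authors' implementation).
import torch
import torch.nn as nn

class VotingModule(nn.Module):
    def __init__(self, feat_dim: int, num_keypoints: int, coord_dim: int):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.coord_dim = coord_dim  # assumed: 2 for the RGB branch, 3 for the depth branch
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_keypoints * coord_dim),
        )

    def forward(self, feats: torch.Tensor, seeds: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) per-seed features from a backbone
        # seeds: (B, N, coord_dim) seed coordinates (pixels or 3D points)
        B, N, _ = feats.shape
        offsets = self.mlp(feats).view(B, N, self.num_keypoints, self.coord_dim)
        # Each seed votes for every keypoint: vote = seed + predicted offset.
        return seeds.unsqueeze(2) + offsets  # (B, N, num_keypoints, coord_dim)
```

For instance, a 2D voting head over 1,024 pixel seeds and 8 keypoints would be instantiated as `VotingModule(feat_dim=128, num_keypoints=8, coord_dim=2)`, with the depth branch using `coord_dim=3`.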
These votes are then integrated by the vote-based fusion module, which dynamically combines the 2D and 3D votes using radius-based neighborhood projection and channel attention mechanisms. The former preserves local information, while the latter adapts to varying input conditions, ensuring robustness and accuracy. This vote-based fusion effectively leverages the complementary strengths of RGB and depth information, mitigating the impact of hand-induced occlusions and misalignment and thereby enabling accurate hand-object pose estimation.
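As a rough illustration of how such a fusion step could operate, the sketch below projects 3D votes into the image with a pinhole camera model, pools 2D vote features within a pixel radius around each projection, and reweights the fused channels with squeeze-and-excitation style attention. The projection helper, radius value, and layer sizes are all assumptions for illustration, not the paper’s actual design.

```python
# Hypothetical vote-fusion step (a sketch under assumed shapes and camera model).
import torch
import torch.nn as nn

def project_to_image(votes_3d: torch.Tensor, intrinsics: torch.Tensor) -> torch.Tensor:
    # Pinhole projection of 3D votes (B, M, 3) using a 3x3 camera intrinsics matrix.
    uvw = votes_3d @ intrinsics.T
    return uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-6)  # (B, M, 2) pixel coordinates

class VoteFusion(nn.Module):
    def __init__(self, dim_2d: int, dim_3d: int, radius: float = 8.0):
        super().__init__()
        self.radius = radius  # pixel radius for neighborhood grouping (assumed value)
        fused = dim_2d + dim_3d
        self.attn = nn.Sequential(  # squeeze-and-excitation style channel attention
            nn.Linear(fused, fused // 4), nn.ReLU(),
            nn.Linear(fused // 4, fused), nn.Sigmoid(),
        )

    def forward(self, votes_2d, feats_2d, votes_3d, feats_3d, intrinsics):
        proj = project_to_image(votes_3d, intrinsics)      # (B, M, 2)
        dist = torch.cdist(proj, votes_2d)                 # (B, M, N) pixel distances
        mask = (dist < self.radius).float()                # radius-based neighborhood
        weights = mask / mask.sum(-1, keepdim=True).clamp(min=1.0)
        pooled_2d = weights @ feats_2d                     # average 2D features per 3D vote
        fused = torch.cat([feats_3d, pooled_2d], dim=-1)   # (B, M, dim_2d + dim_3d)
        return fused * self.attn(fused)                    # channel-wise reweighting
```

The radius-based grouping keeps fusion local (each 3D vote only sees nearby 2D evidence), while the learned channel weights let the network lean on depth features when RGB is unreliable, and vice versa.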
The final component, the hand-aware object pose estimation module, further improves accuracy by using a self-attention mechanism to capture the complex relationships between hand and object keypoints. This allows the system to account for the non-rigid transformations caused by different hand poses and grips.
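A minimal sketch of what such a hand-aware stage might look like is given below: hand and object keypoint embeddings form a single token sequence processed by multi-head self-attention, so that object keypoints can attend to the hand configuration. The embedding size, head count, and residual layout are illustrative assumptions.

```python
# Hypothetical hand-aware refinement via self-attention (a sketch, not the paper's exact design).
import torch
import torch.nn as nn

class HandAwareAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_kp: torch.Tensor, hand_kp: torch.Tensor) -> torch.Tensor:
        # obj_kp: (B, K_obj, dim) object keypoint embeddings
        # hand_kp: (B, K_hand, dim) hand keypoint embeddings
        tokens = torch.cat([obj_kp, hand_kp], dim=1)  # one joint token sequence
        out, _ = self.attn(tokens, tokens, tokens)    # every keypoint attends to all others
        tokens = self.norm(tokens + out)              # residual connection + layer norm
        # Return the refined object keypoints for downstream pose regression.
        return tokens[:, : obj_kp.size(1)]
```

In this arrangement, the pose regressor conditions its object keypoint estimates on how the hand is gripping the object, which is what allows the module to compensate for grip-induced deformations.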
To test their framework, the researchers conducted experiments on three public datasets. The results showed significant improvements in accuracy (up to 15%) and robustness over state-of-the-art approaches. Furthermore, on-site experiments demonstrated an average precision of 76.8%, with performance improvements of up to 13.9% over existing methods. The framework also achieves inference times of 40 milliseconds without refinement and 200 milliseconds with refinement, demonstrating its real-world applicability.
“Our research directly addresses a long-standing bottleneck in the robotics and computer vision industries—accurate object pose estimation in occluded, dynamic, and complex hand-object interaction scenarios,” remarks Dr. Tan. “Our approach is not only more accurate but also simpler than many existing techniques. It has the potential to accelerate the deployment of AI-powered systems, such as efficient automated robotic assembly lines, human-assistive robotics, and immersive AR/VR technologies.”
Overall, this innovative approach represents a significant step forward in robotics, enabling robots to more effectively handle complex objects and advancing AR technologies to model more lifelike hand-object interactions.
Reference
Title of original paper: Vote-based multimodal fusion for hand-held object pose estimation
Journal: Alexandria Engineering Journal
DOI:
Authors:
About Associate Professor Phan Xuan Tan from SIT, Japan
Dr. Phan Xuan Tan is an Associate Professor at the College of Engineering, Shibaura Institute of Technology (SIT), Japan. His research focuses on computer vision, image processing, deep learning, vision-based robotics, and artificial intelligence safety. He earned a B.E. degree in Electrical-Electronic Engineering from the Military Technical Academy and an M.S. degree in Computer and Communication Engineering from the Hanoi University of Science and Technology, Vietnam. He received his Ph.D. in Functional Control Systems from SIT.
Funding Information
N/A