You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping

ICRA 2025

¹Fudan University   ²Tencent Robotics X Lab

* Equal contributions.   Corresponding author.

We propose YOEO, a unified, single-stage method for category-level articulated object 6D pose estimation that enables real-time robotic manipulation.

Abstract

This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these approaches primarily follow a complex multi-stage pipeline that first segments part instances in the point cloud and then estimates the Normalized Part Coordinate Space (NPCS) representation for 6D poses. Such pipelines suffer from high computational cost and low performance in real-time robotic tasks. To address these limitations, we propose YOEO, a single-stage method that outputs instance segmentation and NPCS representations simultaneously in an end-to-end manner. We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points on the same part instance to vote for the same centroid. We then apply a clustering algorithm to distinguish instances based on the distances between their voted centroids. Finally, we separate the NPCS region of each instance and align it with the real point cloud to recover the final pose and size. Experimental results on the GAPartNet dataset demonstrate the pose estimation capability of our single-stage method. We also deploy our synthetically trained model in a real-world setting, where it provides real-time visual feedback at 200 Hz, enabling a physical Kinova robot to interact with unseen articulated objects. This showcases the utility and effectiveness of our proposed method.
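The voting-and-clustering step described above can be made concrete with a short sketch. The example below assumes the network has already produced per-point semantic labels and centroid offsets; the choice of DBSCAN and its hyperparameters (`eps`, `min_samples`) are illustrative assumptions, as the paper only states that a clustering algorithm groups points by their estimated centroids.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_part_instances(points, sem_labels, centroid_offsets, eps=0.05):
    """Group points into part instances by clustering their voted centroids.

    points:           (N, 3) observed partial point cloud
    sem_labels:       (N,)   predicted per-point semantic labels
    centroid_offsets: (N, 3) predicted offsets from each point to its part centroid
    Returns an (N,) array of instance ids (-1 marks unassigned/noise points).
    """
    votes = points + centroid_offsets              # every point votes for its part centroid
    instance_ids = np.full(len(points), -1, dtype=np.int64)
    next_id = 0
    for cls in np.unique(sem_labels):              # cluster within each semantic class
        mask = sem_labels == cls
        labels = DBSCAN(eps=eps, min_samples=10).fit_predict(votes[mask])
        labels[labels >= 0] += next_id             # keep ids unique across classes
        instance_ids[mask] = labels
        next_id = max(next_id, instance_ids.max() + 1)
    return instance_ids
```

Points whose votes land close together receive the same instance id, so two drawers of the same cabinet are separated even though they share a semantic label.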

Method Overview

The workflow of our proposed YOEO framework. The Feature Extraction module extracts per-point features from a partial point cloud. These features are fed into three parallel modules that predict, for each point, the NPCS map, the semantic label, and the offset to the part centroid. A clustering algorithm then distinguishes instances that share the same semantic label and groups points belonging to the same instance. Finally, an alignment algorithm is applied between the predicted NPCS map and the real point cloud to estimate the 6DoF pose parameters.
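For the final alignment step, a common closed-form choice is the Umeyama least-squares similarity transform, which recovers a uniform scale, rotation, and translation mapping the predicted NPCS coordinates of one instance onto the observed points. Treating the aligner as Umeyama (in practice often wrapped in RANSAC for robustness to outlier correspondences) is our assumption here; the page only mentions an alignment algorithm.

```python
import numpy as np

def umeyama_alignment(npcs, observed):
    """Solve min over (s, R, t) of ||s * R @ npcs_i + t - observed_i||^2.

    npcs:     (N, 3) predicted normalized part coordinates for one instance
    observed: (N, 3) corresponding points in the camera frame
    Returns scale s, rotation R (3x3), translation t (3,).
    """
    mu_x, mu_y = npcs.mean(0), observed.mean(0)
    X, Y = npcs - mu_x, observed - mu_y
    cov = Y.T @ X / len(npcs)                   # 3x3 cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                            # guard against reflections
    R = U @ S @ Vt
    var_x = (X ** 2).sum() / len(npcs)          # variance of the NPCS coordinates
    s = np.trace(np.diag(D) @ S) / var_x        # optimal uniform scale
    t = mu_y - s * R @ mu_x
    return s, R, t
```

The part size then follows by scaling the instance's NPCS bounding box with the recovered scale s.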

The detailed architecture of YOEO.

FC: Fully Connected layer, LFA: Local Feature Aggregation, RS: Random Sampling, MLP: shared Multi-Layer Perceptron, US: Up-sampling.
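The sketch below illustrates the three parallel point-wise heads in PyTorch. It is a structural sketch only: the FC/LFA/RS/US vocabulary matches RandLA-Net-style backbones, so the real Local Feature Aggregation blocks with random down-sampling and up-sampling are abbreviated here as shared MLPs, and all layer widths are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

def shared_mlp(c_in, c_out):
    # a "shared MLP" is a 1x1 convolution applied independently to every point
    return nn.Sequential(nn.Conv1d(c_in, c_out, 1), nn.BatchNorm1d(c_out), nn.ReLU())

class YOEOSketch(nn.Module):
    def __init__(self, num_sem_classes=10, feat_dim=128):
        super().__init__()
        # stand-in for the FC -> LFA/RS encoder -> US decoder backbone
        self.backbone = nn.Sequential(
            shared_mlp(3, 32), shared_mlp(32, 64), shared_mlp(64, feat_dim))
        # three parallel per-point prediction heads
        self.sem_head = nn.Conv1d(feat_dim, num_sem_classes, 1)  # semantic logits
        self.off_head = nn.Conv1d(feat_dim, 3, 1)                # offsets to part centroids
        self.npcs_head = nn.Conv1d(feat_dim, 3, 1)               # NPCS coordinates

    def forward(self, xyz):                          # xyz: (B, 3, N) point cloud
        feat = self.backbone(xyz)                    # (B, feat_dim, N) per-point features
        sem = self.sem_head(feat)                    # (B, num_sem_classes, N)
        offsets = self.off_head(feat)                # (B, 3, N)
        npcs = torch.sigmoid(self.npcs_head(feat))   # (B, 3, N), assuming NPCS in [0, 1]^3
        return sem, offsets, npcs
```

Because all three heads read the same per-point features, one forward pass yields everything the clustering and alignment stages need, which is what makes the single-stage, real-time design possible.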

Experiments on the GAPartNet Dataset

Experiments on Real-World Perception

Sim-to-Real Video

More Qualitative Results on the Datasets

BibTeX

@inproceedings{yoeo2025,
  title     = {You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2025},
}

Acknowledgements