Title: Keypoint abstraction using large models for object-relative imitation learning.
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have proven effective as a succinct representation for capturing essential object features, and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their manual design nature and reliance on additional human labels limit their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels.
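To make the object-relative action representation concrete, below is a minimal Python sketch of how a policy output expressed in a keypoint-centric frame could be mapped back to the robot base frame. The frame construction (keypoint centroid plus PCA axes) and all function names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): mapping an end-effector target that a
# policy predicts in a keypoint-centric frame back into the robot base frame.
import numpy as np

def build_keypoint_frame(keypoints_3d: np.ndarray) -> np.ndarray:
    """Construct a 4x4 object-relative frame from N detected 3D keypoints.

    Origin = keypoint centroid; axes from PCA of the keypoint cloud
    (one plausible choice; the actual frame construction may differ).
    """
    origin = keypoints_3d.mean(axis=0)
    centered = keypoints_3d - origin
    # Principal axes give a repeatable orientation across object poses and views.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    R = vt.T
    if np.linalg.det(R) < 0:          # keep a right-handed frame
        R[:, -1] *= -1
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, origin
    return T                           # T_base_keypoint

def to_base_frame(T_base_keypoint: np.ndarray, action_in_kp: np.ndarray) -> np.ndarray:
    """Map a 4x4 end-effector target predicted in the keypoint frame to the base frame."""
    return T_base_keypoint @ action_in_kp
```

In this sketch, the keypoints would come from the LM-proposed, demonstration-verified detector, and the (hypothetical) keypoint-conditioned policy predicts actions relative to that frame, which is what lets the same policy transfer across object poses and camera views.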
Award ID(s):
2214177
PAR ID:
10629485
Publisher / Repository:
IEEE International Conference on Robotics and Automation
Date Published:
Journal Name:
Proceedings IEEE International Conference on Robotics and Automation
ISSN:
1050-4729
Format(s):
Medium: X
Location:
Atlanta, Georgia
Sponsoring Org:
National Science Foundation
More Like This
1. With the advent of pre-trained language models (LMs), increasing research efforts have focused on infusing commonsense and domain-specific knowledge to prepare LMs for downstream tasks. These works attempt to leverage knowledge graphs, the de facto standard of symbolic knowledge representation, along with pre-trained LMs. While existing approaches leverage external knowledge, it remains an open question how to jointly incorporate knowledge graphs represented in varying contexts — from local (e.g., sentence) and document-level to global knowledge — to enable knowledge-rich and interpretable exchange across contexts. In addition, incorporating varying contexts can especially benefit long document understanding tasks that leverage pre-trained LMs, which are typically bounded by the input sequence length. In light of these challenges, we propose KALM, a language model that jointly leverages knowledge in local, document-level, and global contexts for long document understanding. KALM first encodes long documents and knowledge graphs into the three knowledge-aware context representations. KALM then processes each context with context-specific layers. These context-specific layers are followed by a ContextFusion layer that facilitates knowledge exchange to derive an overarching document representation. Extensive experiments demonstrate that KALM achieves state-of-the-art performance on three long document understanding tasks across six datasets/settings. Further analyses reveal that the three knowledge-aware contexts are complementary and all contribute to model performance, while the importance and information exchange patterns of different contexts vary across tasks and datasets.
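As a rough illustration of the architecture sketched in the abstract above, the following PyTorch snippet shows one way three context-specific layers could feed a ContextFusion step. Layer types, dimensions, and the fusion mechanism are assumptions rather than the published model.

```python
# Hedged PyTorch sketch of context-specific layers followed by a fusion step.
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Exchange information across the three knowledge-aware contexts."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, local, doc, global_):
        # Stack the three context representations and let them attend to one another.
        ctx = torch.stack([local, doc, global_], dim=1)      # (B, 3, dim)
        fused, _ = self.attn(ctx, ctx, ctx)
        return self.proj(fused.mean(dim=1))                  # overarching document representation

class ToyKALM(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # One context-specific layer per context (local, document-level, global KG).
        self.local_layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.doc_layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.global_layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.fusion = ContextFusion(dim)

    def forward(self, local_ctx, doc_ctx, global_ctx):
        # Each context is processed by its own layer, pooled, then fused.
        l = self.local_layer(local_ctx).mean(dim=1)
        d = self.doc_layer(doc_ctx).mean(dim=1)
        g = self.global_layer(global_ctx).mean(dim=1)
        return self.fusion(l, d, g)
```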
2. This paper presents a visual servoing method for controlling a robot in the configuration space purely by using its natural features. We first created a data collection pipeline that uses camera intrinsics, extrinsics, and forward kinematics to generate 2D projections of a robot's joint locations (keypoints) in image space. Using this pipeline, we are able to collect large sets of real-robot data, which we use to train real-time keypoint detectors. The inferred keypoints from the trained model are used as control features in an adaptive visual servoing scheme that estimates, at runtime, the Jacobian relating keypoint changes to joint velocities. We compared the 2D configuration control performance of this method to the skeleton-based visual servoing method (the only other algorithm for purely vision-based configuration-space visual servoing) and demonstrated that the keypoints provide more robust and less noisy features, which result in better transient response. We also demonstrate the first vision-based 3D configuration-space control results in the literature and discuss their limitations. Our data collection pipeline is available at https://github.com/JaniC-WPI/KPDataGenerator.git and can be used to collect image datasets and train real-time keypoint detectors for various robots and environments.
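The adaptive scheme described above can be illustrated with a short sketch of online Jacobian estimation: a Broyden-style rank-1 update of the map from joint velocities to keypoint-feature velocities, combined with a proportional control law. Names, gains, and the specific update rule are assumptions, not the paper's code.

```python
# Illustrative sketch of adaptive visual servoing with an online Jacobian estimate.
import numpy as np

class AdaptiveVisualServo:
    def __init__(self, n_features: int, n_joints: int, gain: float = 0.5, beta: float = 0.1):
        # In practice the Jacobian would be initialized from small exploratory motions.
        self.J = np.zeros((n_features, n_joints))   # image Jacobian estimate
        self.gain = gain                             # proportional control gain
        self.beta = beta                             # Broyden update step size

    def update_jacobian(self, d_features: np.ndarray, d_q: np.ndarray):
        """Broyden rank-1 update from an observed feature change and joint motion."""
        denom = float(d_q @ d_q) + 1e-9
        self.J += self.beta * np.outer(d_features - self.J @ d_q, d_q) / denom

    def control(self, features: np.ndarray, target: np.ndarray) -> np.ndarray:
        """Joint-velocity command driving the keypoint features toward the target."""
        error = target - features
        return self.gain * np.linalg.pinv(self.J) @ error
```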
3. The success of 6-DoF grasp learning with point-cloud input is tempered by the computational costs of the input's unordered nature and the pre-processing needed to reduce the point cloud to a manageable size. These properties lead to failure on small objects with low point-cloud cardinality. Instead of point clouds, this manuscript explores grasp generation directly from RGB-D image input. The approach, called Keypoint-GraspNet (KGN), operates in perception space by detecting projected gripper keypoints in the image, then recovering their SE(3) poses with a PnP algorithm. Training of the network uses a synthetic dataset derived from primitive-shape objects with known continuous grasp families. Trained with only single-object synthetic data, Keypoint-GraspNet achieves superior results on our single-object dataset, comparable performance with state-of-the-art baselines on a multi-object test set, and outperforms the most competitive baseline on small objects. Keypoint-GraspNet is more than 3x faster than the tested point-cloud methods. Robot experiments show a high success rate, demonstrating KGN's practical potential.
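The keypoint-to-pose step can be sketched in a few lines of OpenCV: detected 2D gripper keypoints plus their known 3D layout in the gripper frame recover an SE(3) grasp pose via PnP. The function name, keypoint layout, and solver choice below are assumptions.

```python
# Minimal PnP sketch: recover a camera-frame grasp pose from detected gripper keypoints.
import cv2
import numpy as np

def grasp_pose_from_keypoints(kp_2d: np.ndarray,      # (N, 2) detected image keypoints, N >= 4
                              kp_3d: np.ndarray,      # (N, 3) canonical gripper keypoints
                              K: np.ndarray) -> np.ndarray:
    """Return a 4x4 camera-frame grasp pose recovered via PnP."""
    ok, rvec, tvec = cv2.solvePnP(kp_3d.astype(np.float64),
                                  kp_2d.astype(np.float64),
                                  K.astype(np.float64),
                                  distCoeffs=None,
                                  flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T
```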
4. Promising results have been achieved recently in category-level manipulation that generalizes across object instances. Nevertheless, it often requires expensive real-world data collection and manual specification of semantic keypoints for each object category and task. Additionally, coarse keypoint predictions and ignoring intermediate action sequences hinder adoption in complex manipulation tasks beyond pick-and-place. This work proposes a novel category-level manipulation framework that leverages an object-centric, category-level representation and model-free 6-DoF motion tracking. The canonical object representation is learned solely in simulation and then used to parse a category-level task trajectory from a single demonstration video. The demonstration is reprojected to a target trajectory tailored to a novel object via the canonical representation. During execution, the manipulation horizon is decomposed into long-range, collision-free motion and last-inch manipulation. For the latter, a category-level behavior cloning (CatBC) method leverages motion tracking to perform closed-loop control. CatBC follows the target trajectory, projected from the demonstration and anchored to a dynamically selected category-level coordinate frame. The frame is automatically selected along the manipulation horizon by a local attention mechanism. This framework makes it possible to teach different manipulation strategies by providing only a single demonstration, without complicated manual programming. Extensive experiments demonstrate its efficacy on a range of challenging industrial tasks in high-precision assembly that involve learning complex, long-horizon policies. The process exhibits robustness against uncertainty due to dynamics as well as generalization across object instances and scene configurations.
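The trajectory-reprojection idea above can be illustrated with a small sketch: demonstration waypoints parsed in the canonical category-level frame are re-anchored to a novel instance through its estimated canonical-to-world transform. The matrix convention and names below are assumptions, not the authors' interface.

```python
# Hedged sketch: re-anchor a demonstrated trajectory to a novel object instance.
import numpy as np

def reproject_trajectory(demo_waypoints_canon: np.ndarray,   # (T, 4, 4) poses in the canonical frame
                         T_world_instance: np.ndarray        # (4, 4) canonical frame anchored to the novel object
                         ) -> np.ndarray:
    """Map a demonstrated end-effector trajectory onto a novel object instance."""
    return np.einsum('ij,tjk->tik', T_world_instance, demo_waypoints_canon)

# During execution, 6-DoF motion tracking would keep T_world_instance up to date,
# so a behavior-cloning policy can follow the reprojected trajectory in closed loop.
```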
5. This paper presents a novel strategy for training keypoint detection models for robotics applications. Our goal is to develop methods that can robustly detect and track natural features on robotic manipulators. Such features can be used for vision-based control and pose estimation when placing artificial markers (e.g., ArUco) on the robot's body is not possible or practical at runtime. Prior methods require accurate camera calibration and robot kinematic models in order to label training images with the keypoint locations. In this paper, we remove these dependencies by utilizing inpainting methods: in the training phase, we attach ArUco markers along the robot's body and label the keypoint locations as the centers of those markers. We then use an inpainting method to reconstruct the parts of the robot occluded by the ArUco markers. As such, the markers are artificially removed from the training images, and labeled data is obtained to train markerless keypoint detection algorithms without the need for camera calibration or robot models. Using this approach, we trained a model for real-time keypoint detection and used the inferred keypoints as control features in an adaptive visual servoing scheme. We obtained successful control results with this fully model-free control strategy, utilizing natural robot features at runtime and requiring neither camera calibration nor robot models at any stage of the process.
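A hedged sketch of the labeling pipeline described above: detect ArUco markers, take their centers as keypoint labels, and inpaint the marker regions so the training images appear markerless. It assumes the OpenCV >= 4.7 aruco API; the dictionary choice, mask dilation, and function names are illustrative, not the paper's code.

```python
# Illustrative sketch: ArUco centers as keypoint labels, then marker removal by inpainting.
import cv2
import numpy as np

def label_and_inpaint(image_bgr: np.ndarray):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    detector = cv2.aruco.ArucoDetector(aruco_dict, cv2.aruco.DetectorParameters())
    corners, ids, _ = detector.detectMarkers(gray)

    keypoints, mask = [], np.zeros(gray.shape, dtype=np.uint8)
    for quad in corners:                       # one (1, 4, 2) corner array per detected marker
        quad = quad.reshape(4, 2)
        keypoints.append(quad.mean(axis=0))    # marker center -> keypoint label
        cv2.fillConvexPoly(mask, quad.astype(np.int32), 255)

    # Remove the markers so the detector is trained on natural robot features only.
    mask = cv2.dilate(mask, np.ones((7, 7), np.uint8))
    inpainted = cv2.inpaint(image_bgr, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
    return np.array(keypoints), inpainted
```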