This paper presents a visual servoing method for controlling a robot in the configuration space using purely its natural features. We first created a data collection pipeline that uses camera intrinsics, extrinsics, and forward kinematics to generate 2D projections of a robot's joint locations (keypoints) in image space. Using this pipeline, we are able to collect large sets of real-robot data, which we use to train real-time keypoint detectors. The keypoints inferred by the trained model are used as control features in an adaptive visual servoing scheme that estimates, at runtime, the Jacobian relating changes in the keypoints to the joint velocities. We compared the 2D configuration control performance of this method to the skeleton-based visual servoing method (the only other algorithm for purely vision-based configuration space visual servoing) and demonstrated that the keypoints provide more robust and less noisy features, which result in a better transient response. We also demonstrate the first vision-based 3D configuration space control results in the literature and discuss their limitations. Our data collection pipeline is available at https://github.com/JaniC-WPI/KPDataGenerator.git and can be used to collect image datasets and train real-time keypoint detectors for various robots and environments.
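As a rough illustration of the labeling step described above, the sketch below projects 3D joint positions into pixel coordinates with a standard pinhole camera model. It is not the KPDataGenerator implementation: the function names and the example values are hypothetical, the joint positions are assumed to have already been computed by forward kinematics in the robot base frame, and lens distortion is ignored.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float]


def project_point(p_world: Sequence[float],
                  R: Sequence[Sequence[float]],
                  t: Sequence[float],
                  fx: float, fy: float,
                  cx: float, cy: float) -> Pixel:
    """Project one 3D point to pixel coordinates (pinhole model, no distortion).

    R (3x3) and t (3,) are the extrinsics mapping base-frame coordinates into
    the camera frame: p_cam = R * p_world + t.
    fx, fy, cx, cy are the intrinsics (focal lengths and principal point, in pixels).
    """
    # Extrinsics: express the point in the camera frame.
    x, y, z = [sum(R[i][j] * p_world[j] for j in range(3)) + t[i]
               for i in range(3)]
    if z <= 0.0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective division, then scale and shift to pixels.
    return (fx * x / z + cx, fy * y / z + cy)


def project_joints(joints_world: Sequence[Sequence[float]],
                   R: Sequence[Sequence[float]],
                   t: Sequence[float],
                   fx: float, fy: float,
                   cx: float, cy: float) -> List[Pixel]:
    """Project every joint position; the result is one keypoint label per joint."""
    return [project_point(p, R, t, fx, fy, cx, cy) for p in joints_world]


if __name__ == "__main__":
    # Hypothetical values: camera aligned with the base frame, 600 px focal length.
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    joints = [(0.0, 0.0, 1.0), (0.5, -0.25, 2.0)]  # assumed output of forward kinematics
    print(project_joints(joints, identity, (0.0, 0.0, 0.0), 600.0, 600.0, 320.0, 240.0))
```

In a pipeline of this kind, each image would be stored together with the projected keypoints as its training label, so label quality depends directly on the accuracy of the calibration and the kinematic model.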
Utilizing Inpainting for Training Keypoint Detection Algorithms Towards Markerless Visual Servoing
This paper presents a novel strategy to train keypoint detection models for robotics applications. Our goal is to develop methods that can robustly detect and track natural features on robotic manipulators. Such features can be used for vision-based control and pose estimation when placing artificial markers (e.g. ArUco) on the robot’s body is not possible or practical at runtime. Prior methods require accurate camera calibration and robot kinematic models in order to label training images with the keypoint locations. In this paper, we remove these dependencies by utilizing inpainting methods: in the training phase, we attach ArUco markers along the robot’s body and label the keypoint locations as the centers of those markers. We then use an inpainting method to reconstruct the parts of the robot occluded by the ArUco markers. In this way, the markers are artificially removed from the training images, and labeled data is obtained to train markerless keypoint detection algorithms without the need for camera calibration or robot models. Using this approach, we trained a model for real-time keypoint detection and used the inferred keypoints as control features for an adaptive visual servoing scheme. We obtained successful control results with this fully model-free control strategy, which uses natural robot features at runtime and requires no camera calibration or robot models at any stage of the process.
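The labeling idea in the abstract (keypoint = center of an attached marker, followed by removal of the marker region) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the marker corners are assumed to come from some marker detector as four ordered pixel coordinates, the helper names `marker_center` and `inpaint_box` are hypothetical, and the abstract does not say which inpainting method consumes the masked region.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def marker_center(corners: Sequence[Point]) -> Point:
    """Return the intersection of the marker's diagonals as the keypoint label.

    corners: four (x, y) pixel coordinates in order around the marker.
    Under perspective, the diagonal intersection tracks the true marker center
    more closely than the plain average of the four corners.
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    d1x, d1y = x2 - x0, y2 - y0  # diagonal from corner 0 to corner 2
    d2x, d2y = x3 - x1, y3 - y1  # diagonal from corner 1 to corner 3
    denom = d1x * d2y - d1y * d2x
    if abs(denom) < 1e-12:
        raise ValueError("degenerate marker corners")
    # Solve c0 + s * d1 = c1 + u * d2 for s using 2D cross products.
    s = ((x1 - x0) * d2y - (y1 - y0) * d2x) / denom
    return (x0 + s * d1x, y0 + s * d1y)


def inpaint_box(corners: Sequence[Point], pad: int = 2) -> Tuple[int, int, int, int]:
    """Padded axis-aligned box (x_min, y_min, x_max, y_max) covering the marker.

    Pixels inside this box would be flagged for the inpainting step; the
    padding (an assumed value) covers the marker's printed border.
    """
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return (math.floor(min(xs)) - pad, math.floor(min(ys)) - pad,
            math.ceil(max(xs)) + pad, math.ceil(max(ys)) + pad)


if __name__ == "__main__":
    # Hypothetical detection of one marker in an image.
    corners = [(10.0, 10.0), (30.0, 10.0), (30.0, 30.0), (10.0, 30.0)]
    print(marker_center(corners), inpaint_box(corners))
```

Each training sample would then pair the inpainted image with the marker centers recorded before removal, which is what lets the labels be produced without camera calibration or a robot model.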
- Award ID(s): 1928506
- PAR ID: 10561835
- Publisher / Repository: IEEE
- Date Published:
- ISBN: 979-8-3503-8457-4
- Page Range / eLocation ID: 3086 to 3092
- Subject(s) / Keyword(s): robot control, vision-based control, adaptive control, keypoint tracking
- Format(s): Medium: X
- Location: Yokohama, Japan
- Sponsoring Org: National Science Foundation
More Like this
-
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have proven effective as a succinct representation for capturing essential object features and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their manual design and reliance on additional human labels limit their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels.
-
Existing approaches for autonomous control of pan-tilt-zoom (PTZ) cameras use multiple stages in which object detection and localization are performed separately from the control of the PTZ mechanisms. These approaches require manual labels and suffer from performance bottlenecks due to error propagation across the multi-stage flow of information. The large size of object detection neural networks also makes prior solutions infeasible for real-time deployment on resource-constrained devices. We present an end-to-end deep reinforcement learning (RL) solution called Eagle to train a neural network policy that directly takes images as input to control the PTZ camera. Training reinforcement learning policies in the real world is cumbersome due to labeling effort, runtime environment stochasticity, and fragile experimental setups. We introduce a photo-realistic simulation framework for training and evaluating PTZ camera control policies. Eagle achieves superior camera control performance by maintaining the object of interest close to the center of captured images at high resolution, and has up to 17% more tracking duration than the state-of-the-art. Eagle policies are lightweight (90x fewer parameters than Yolo5s) and can run on embedded camera platforms such as Raspberry Pi (33 FPS) and Jetson Nano (38 FPS), facilitating real-time PTZ tracking in resource-constrained environments. With domain randomization, Eagle policies trained in our simulator can be transferred directly to real-world scenarios.
-
In clinical practice, therapeutic and diagnostic endoluminal procedures of the human body often use a scope, catheter, or passive pill-shaped camera. Unfortunately, such devices used in the circulatory system and gastrointestinal tract are often uncomfortable and invasive, and require the patient to be sedated. With current technology, some regions of the body are often inaccessible to the clinician. Herein, a magnetically actuated soft endoluminal inchworm robot that may extend clinicians’ ability to reach further into the human body and practice new procedures is described, modeled, and analyzed. A detailed locomotion model is proposed that takes into account the elastic deformation of the robot and its interactions with the environment. The model is validated with in vitro and ex vivo (pig intestine) physical experiments and is shown to capture the robot’s gait characteristics through a lumen. Utilizing dimensional analysis, the effects of the mechanical properties and design variables on the robot’s motion are investigated further to advance the understanding of this endoluminal robot concept.
-
A Global Correction Framework for Camera Registration in Video See-Through Augmented Reality Systems
Augmented reality (AR) enhances the user’s perception of the real environment by superimposing virtual images generated by computers. These virtual images provide additional visual information that complements the real-world view. AR systems are rapidly gaining popularity in various manufacturing fields such as training, maintenance, assembly, and robot programming. In some AR applications, it is crucial for the invisible virtual environment to be precisely aligned with the physical environment so that human users can accurately perceive the virtual augmentation in conjunction with their real surroundings. The process of achieving this accurate alignment is known as calibration. During some robotics applications using AR, we observed instances of misalignment in the visual representation within the designated workspace. This misalignment can potentially impact the accuracy of the robot’s operations during the task. Building on previous research on AR-assisted robot programming systems, this work investigates the sources of misalignment errors and presents a simple and efficient calibration procedure to reduce misalignment errors in general video see-through AR systems. To accurately superimpose virtual information onto the real environment, it is necessary to identify the sources and propagation of errors. In this work, we outline the linear transformation and projection of each point from the virtual world space to the virtual screen coordinates. An offline calibration method is introduced to determine the offset matrix from the head-mounted display (HMD) to the camera, and experiments are conducted to validate the improvement achieved through the calibration process.