Title: A Large Model’s Ability to Identify 3D Objects as a Function of Viewing Angle
Abstract:
Virtual reality is increasingly used to support embodied AI agents, such as robots, which frequently engage in 'sim-to-real' learning approaches. At the same time, tools such as large vision-and-language models offer new capabilities that apply to a wide variety of tasks. To understand how such agents can learn from simulated environments, we explore a vision-and-language model's ability to recover the type of object represented by a photorealistic 3D model as a function of the 3D perspective from which the model is viewed. We used photogrammetry to create 3D models of commonplace objects and rendered 2D images of these models from a fixed set of 420 virtual camera perspectives. A well-studied image-and-language model (CLIP) was used to generate text categorizations (i.e., prompts) corresponding to these images. Using multiple instances of various object classes, we studied which camera perspectives were most likely to return accurate text categorizations for each class of object.
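To make the evaluation step concrete, the following is a minimal sketch of zero-shot CLIP classification for a single rendered view, using the Hugging Face transformers API. The checkpoint name, candidate labels, and image filename are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: score one rendered view of a 3D model against candidate
# object classes with CLIP. Checkpoint, class list, and filename are
# illustrative assumptions only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical object classes and a render from one virtual camera pose.
labels = ["a photo of a mug", "a photo of a shoe", "a photo of a toy car"]
image = Image.open("renders/mug_azimuth030_elevation45.png")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_labels)
probs = logits.softmax(dim=-1).squeeze(0)

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Repeating this scoring over all 420 camera poses, and recording whether the top-scoring label matches the true class, would yield the kind of per-viewpoint accuracy pattern the study examines.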
Award ID(s):
2145642 2024878
PAR ID:
10511952
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
IEEE
Date Published:
Journal Name:
Proceedings of the IEEE Artificial Intelligence x Virtual Reality (AIxVR) Conference
ISBN:
979-8-3503-7202-1
Page Range / eLocation ID:
14 to 15
Format(s):
Medium: X
Location:
Los Angeles, CA, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Integrating multimodal data such as RGB and LiDAR from multiple views significantly increases computational and communication demands, which is challenging for resource-constrained autonomous agents that must meet the time-critical deadlines of mission-critical applications. To address this challenge, we propose CoOpTex, a collaborative task execution framework designed for cooperative perception in distributed autonomous systems (DAS). CoOpTex's contribution is twofold: (a) it fuses multiview RGB images to create a panoramic camera view for 2D object detection and utilizes 360° LiDAR for 3D object detection, improving accuracy with a lightweight Graph Neural Network (GNN) that integrates object coordinates from both perspectives; (b) to optimize task execution and meet deadlines, it dynamically offloads computationally intensive image stitching tasks to auxiliary devices when available and adjusts RGB frame capture rates based on device mobility and processing capabilities. We implement CoOpTex in real time on static and mobile heterogeneous autonomous agents, reducing deadline violations by 100% while improving 2D detection frame rates by 2.2 times in stationary and 2 times in mobile conditions, demonstrating its effectiveness for real-time cooperative perception.
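As a rough illustration of the adaptive scheduling in (b), the sketch below shows one way an agent could decide whether to offload image stitching and how to scale the RGB capture rate. The thresholds, fields, and policy are hypothetical and are not taken from CoOpTex.

```python
# Hedged sketch of a deadline-aware offloading and frame-rate policy in the
# spirit of CoOpTex. All thresholds and field names are illustrative.
from dataclasses import dataclass

@dataclass
class AgentState:
    local_stitch_ms: float   # estimated local image-stitching latency
    remote_stitch_ms: float  # estimated latency on an auxiliary device
    link_ms: float           # round-trip transfer latency to that device
    deadline_ms: float       # end-to-end perception deadline
    is_mobile: bool          # whether the agent is currently moving

def should_offload(state: AgentState, helper_available: bool) -> bool:
    """Offload stitching only if a helper exists and it beats local compute."""
    if not helper_available:
        return False
    return state.remote_stitch_ms + state.link_ms < min(state.local_stitch_ms,
                                                        state.deadline_ms)

def rgb_frame_rate(state: AgentState, base_fps: float = 10.0) -> float:
    """Scale capture rate down when compute is near the deadline or when mobile."""
    budget = state.deadline_ms / max(state.local_stitch_ms, 1e-3)
    fps = base_fps * min(1.0, budget)
    return fps * 0.5 if state.is_mobile else fps  # mobile scaling is an assumption

state = AgentState(local_stitch_ms=80, remote_stitch_ms=30, link_ms=15,
                   deadline_ms=100, is_mobile=True)
print(should_offload(state, helper_available=True), rgb_frame_rate(state))
```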
  2. We focus on addressing the object counting limitations of vision-language models, with a particular emphasis on Contrastive Language-Image Pre-training (CLIP) models. Centered on our hypothesis that counting knowledge can be abstracted into linear vectors within the text embedding space, we develop a parameter-efficient fine-tuning method and several zero-shot methods to improve CLIP's counting accuracy. Through comprehensive experiments, we demonstrate that our learning-based method not only outperforms full-model fine-tuning in counting accuracy but also retains the broad capabilities of pre-trained CLIP models. Our zero-shot text embedding editing techniques are also effective in situations where training data is scarce, and can be extended to improve Stable Diffusion's ability to generate images with precise object counts. We also contribute two specialized datasets to train and evaluate CLIP’s counting capabilities. Our code is available at https://github.com/UW-Madison-Lee-Lab/CLIP_Counting. 
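The linear-vector hypothesis can be illustrated with a small sketch: estimate a "one → two" direction from paired prompts in CLIP's text embedding space and add it to a held-out prompt's embedding. The prompt pairs and the unscaled addition are illustrative assumptions, not the paper's fine-tuning or editing procedure (see the linked repository for that).

```python
# Hedged sketch: estimate a linear "one -> two" direction in CLIP's text
# embedding space from paired prompts, then apply it to a new prompt.
# Prompt pairs and the direct addition are illustrative assumptions only.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts):
    tokens = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

pairs = [("one dog", "two dogs"), ("one apple", "two apples"),
         ("one car", "two cars")]
ones = embed([a for a, _ in pairs])
twos = embed([b for _, b in pairs])
count_dir = (twos - ones).mean(dim=0)  # averaged "one -> two" direction

# Edit a held-out prompt's embedding toward the "two" concept.
edited = embed(["one cat"]) + count_dir
edited = edited / edited.norm(dim=-1, keepdim=True)

# Sanity check: the edited embedding should sit closer to "two cats".
candidates = embed(["one cat", "two cats"])
print((edited @ candidates.T).squeeze(0))
```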
  3. The success of image generative models has enabled us to build methods that can edit images based on text or other user input. However, these methods are bespoke, imprecise, require additional information, or are limited to 2D image edits. We present GeoDiffuser, a zero-shot optimization-based method that unifies common 2D and 3D image-based object editing capabilities into a single method. Our key insight is to view image editing operations as geometric transformations. We show that these transformations can be directly incorporated into the attention layers of diffusion models to implicitly perform editing operations. Our training-free optimization method uses an objective function that seeks to preserve object style while generating plausible images, for instance with accurate lighting and shadows. It also inpaints disoccluded parts of the image where the object was originally located. Given a natural image and user input, we segment the foreground object using SAM and estimate a corresponding transform, which is used by our optimization approach for editing. GeoDiffuser can perform common 2D and 3D edits such as object translation, 3D rotation, and removal. We present quantitative results, including a perceptual study, showing that our approach outperforms existing methods.
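In highly simplified form, the idea of expressing an edit as a geometric transformation applied inside the network can be sketched by warping a feature map with a 2D affine transform. The rotation and translation values below are arbitrary, and this is not GeoDiffuser's actual attention-sharing mechanism.

```python
# Simplified sketch: apply a 2D affine transform (rotation + translation) to a
# feature map with grid_sample, the kind of warp a geometry-driven edit pushes
# into the network. All transform values are arbitrary illustrations.
import math
import torch
import torch.nn.functional as F

feat = torch.randn(1, 64, 32, 32)  # stand-in for a diffusion feature map

angle = math.radians(15.0)  # hypothetical 15-degree in-plane rotation
tx, ty = 0.2, 0.0           # hypothetical translation in normalized coordinates
theta = torch.tensor([[[math.cos(angle), -math.sin(angle), tx],
                       [math.sin(angle),  math.cos(angle), ty]]])

grid = F.affine_grid(theta, feat.shape, align_corners=False)
warped = F.grid_sample(feat, grid, align_corners=False)
print(warped.shape)  # same spatial size, contents geometrically transformed
```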
  4. We provide an approach to reconstruct spatiotemporal 3D models of aging objects, such as fruit, with time-varying shape and appearance, using multi-view time-lapse videos captured by a microenvironment of Raspberry Pi cameras. Our approach represents the 3D structure of the object prior to aging using a static 3D mesh reconstructed from multiple photographs of the object captured with a rotating camera track. We manually align the 3D mesh to the images at the first time instant. Our approach then automatically deforms the aligned 3D mesh to match the object across the multi-viewpoint time-lapse videos. We texture-map the deformed 3D meshes with intensities from the frames at each time instant to create the spatiotemporal 3D model of the object. Our results reveal the time dependence of volume loss due to transpiration and of color transformation due to enzymatic browning on banana peels and in exposed parts of bitten fruit.
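One quantity this reconstruction enables, volume loss over time, can be computed directly from each deformed mesh. The sketch below uses the standard signed-tetrahedron formula on hypothetical vertex and face arrays rather than the authors' data.

```python
# Hedged sketch: track volume across a sequence of watertight deformed meshes
# using the signed-tetrahedron formula. Arrays here are placeholders.
import numpy as np

def mesh_volume(vertices: np.ndarray, faces: np.ndarray) -> float:
    """Volume of a closed triangle mesh (sum of signed tetrahedra to the origin)."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    return float(np.abs(np.einsum("ij,ij->i", v0, np.cross(v1, v2)).sum()) / 6.0)

# Hypothetical unit tetrahedron standing in for one time instant's mesh.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])

# Shrinking the mesh slightly at each "time instant" mimics volume loss.
volumes = [mesh_volume(verts * (1.0 - 0.02 * t), faces) for t in range(5)]
print([round(v, 4) for v in volumes])
```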
  5. Contemporary developments in computer vision and artificial intelligence show promise to greatly improve the lives of those with disabilities. In this paper, we propose one such development: a wearable object recognition device in the form of eyewear. Our device is specialized to recognize items from the produce section of a grocery store, but it serves as a proof of concept for any similar object recognition wearable. It is user-friendly, featuring buttons that are pressed to capture images with the built-in camera. A convolutional neural network (CNN) is used to train the object recognition system. After an object is recognized, a text-to-speech system informs the user which object they are holding, along with the price of the product. With an accuracy of 99.35%, our device identifies objects more accurately than existing models.
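The capture-classify-announce loop of such a wearable might look roughly like the sketch below, where a pretrained CNN backbone classifies the captured frame and the resulting sentence would be handed to a text-to-speech engine. The class list, price table, image path, and MobileNet backbone are placeholders, not the authors' trained system.

```python
# Hedged sketch of the capture -> CNN classify -> speak loop described above.
# Classes, prices, and the image path are placeholders; a real device would
# load a CNN fine-tuned on produce images instead of raw ImageNet weights.
import torch
from PIL import Image
from torchvision import models, transforms

classes = ["apple", "banana", "orange"]            # hypothetical produce classes
prices = {"apple": 0.79, "banana": 0.25, "orange": 0.99}

model = models.mobilenet_v2(weights="IMAGENET1K_V1")
model.classifier[1] = torch.nn.Linear(model.last_channel, len(classes))
model.eval()  # in practice, load fine-tuned produce weights here

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def announce(image_path: str) -> str:
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        idx = model(image).argmax(dim=-1).item()
    name = classes[idx]
    return f"This is a {name}. Price: ${prices[name]:.2f}."

print(announce("captured_frame.jpg"))  # this string would be sent to the TTS engine
```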