Title: A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction
People with blindness and low vision (pBLV) encounter substantial challenges in comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to vision loss, pBLV have difficulty accessing and identifying potential tripping hazards independently. Previous assistive technologies for the visually impaired often struggle in real-world scenarios because they need constant training and lack robustness, which limits their effectiveness, especially in dynamic and unfamiliar environments where accurate and efficient perception is crucial. Therefore, we frame our research question in this paper as: How can we assist pBLV in recognizing scenes, identifying objects, and detecting potential tripping hazards in unfamiliar environments, where existing assistive technologies often falter due to their lack of robustness? We hypothesize that by leveraging large pretrained foundation models and prompt engineering, we can create a system that effectively addresses the challenges faced by pBLV in unfamiliar environments. Large pretrained foundation models are prevalent, particularly in assistive robotics applications, because extensive pretraining gives them accurate perception and robust contextual understanding in real-world scenarios. Motivated by this, we present a pioneering approach that leverages foundation models to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Specifically, our method begins by using a large image tagging model (i.e., Recognize Anything Model (RAM)) to identify all common objects present in the captured images. The recognition results and user query are then integrated into a prompt, tailored specifically for pBLV, using prompt engineering. By combining the prompt and input image, a vision-language foundation model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks by analyzing environmental objects and scenic landmarks relevant to the prompt. We evaluate our approach through experiments conducted on both indoor and outdoor datasets. Our results demonstrate that our method can recognize objects accurately and provide insightful descriptions and analysis of the environment for pBLV.
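The prompt-assembly step described in the abstract (recognized object tags plus a user query merged into one prompt) can be sketched as follows. This is only an illustrative sketch: the function name `build_prompt`, the template wording, and the sample tags are assumptions made here, not the authors' actual prompt, and the calls to RAM and InstructBLIP are deliberately left out.

```python
from typing import List


def build_prompt(tags: List[str], user_query: str) -> str:
    """Merge recognized object tags and a user query into one text prompt.

    The template wording is hypothetical; the abstract does not give the
    prompt the authors actually used.
    """
    # Normalize tags and drop duplicates while keeping first-seen order.
    unique_tags = list(dict.fromkeys(t.strip().lower() for t in tags if t.strip()))
    tag_text = ", ".join(unique_tags) if unique_tags else "no objects recognized"
    return (
        "You are assisting a person with blindness or low vision. "
        f"Objects recognized in the image: {tag_text}. "
        f"Question: {user_query.strip()} "
        "Describe the surroundings and point out any potential tripping hazards."
    )


if __name__ == "__main__":
    # Hypothetical tags standing in for the output of an image tagging model.
    print(build_prompt(["chair", "stairs", "Chair"], "What is in front of me?"))
```

In a full system, the returned string would be passed together with the captured image to the vision-language model; that part depends on the specific model interface and is not shown here.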
Award ID(s):
2345139 2236097
PAR ID:
10559620
Author(s) / Creator(s):
Publisher / Repository:
MDPI
Date Published:
Journal Name:
Journal of Imaging
Volume:
10
Issue:
5
ISSN:
2313-433X
Page Range / eLocation ID:
103
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Walking in real-world environments involves constant decision-making, e.g., when approaching a staircase, an individual decides whether to engage (climbing the stairs) or avoid. For the control of assistive robots (e.g., robotic lower-limb prostheses), recognizing such motion intent is an important but challenging task, primarily due to the lack of available information. This paper presents a novel vision-based method to recognize an individual’s motion intent when approaching a staircase before the potential transition of motion mode (walking to stair climbing) occurs. Leveraging the egocentric images from a head-mounted camera, the authors trained a YOLOv5 object detection model to detect staircases. Subsequently, an AdaBoost and gradient boost (GB) classifier was developed to recognize the individual’s intention of engaging or avoiding the upcoming stairway. This novel method has been demonstrated to provide reliable (97.69%) recognition at least 2 steps before the potential mode transition, which is expected to provide ample time for the controller mode transition in an assistive robot in real-world use. 
  2. We survey applications of pretrained foundation models in robotics. Traditional deep learning models in robotics are trained on small datasets tailored for specific tasks, which limits their adaptability across diverse applications. In contrast, foundation models pretrained on internet-scale data appear to have superior generalization capabilities, and in some instances display an emergent ability to find zero-shot solutions to problems that are not present in the training data. Foundation models may hold the potential to enhance various components of the robot autonomy stack, from perception to decision-making and control. For example, large language models can generate code or provide common sense reasoning, while vision-language models enable open-vocabulary visual recognition. However, significant open research challenges remain, particularly around the scarcity of robot-relevant training data, safety guarantees and uncertainty quantification, and real-time execution. In this survey, we study recent papers that have used or built foundation models to solve robotics problems. We explore how foundation models contribute to improving robot capabilities in the domains of perception, decision-making, and control. We discuss the challenges hindering the adoption of foundation models in robot autonomy and provide opportunities and potential pathways for future advancements. The GitHub project corresponding to this paper can be found here: https://github.com/robotics-survey/Awesome-Robotics-Foundation-Models . 
  3. Emerging technologies offer the potential to expand the domain of the future workforce to extreme environments, such as outer space and alien terrains. To understand how humans navigate in such environments that lack familiar spatial cues, this study examined spatial perception in three types of environments. The environments were simulated using virtual reality. We examined participants’ ability to estimate the size and distance of stimuli under conditions of minimal, moderate, or maximum visual cues, corresponding to an environment simulating outer space, an alien terrain, or a typical cityscape, respectively. The findings show underestimation of distance in both the maximum and the minimum visual cue environments but a tendency for overestimation of distance in the moderate environment. We further observed that depth estimation was substantially better in the minimum environment than in the other two environments. However, estimation of height was more accurate in the environment with maximum cues (cityscape) than in the environment with minimum cues (outer space). More generally, our results suggest that familiar visual cues facilitated better estimation of size and distance than unfamiliar cues. In fact, the presence of unfamiliar, and perhaps misleading, visual cues (characterizing the alien terrain environment) was more disruptive to distance and size perception than a total absence of visual cues. The findings have implications for training workers to better adapt to extreme environments.
  4. BACKGROUND: Today, various emerging assistive applications (apps) running on smartphones have been introduced, such as Seeing AI, TapTapSee, and BeMyEyes. These assistive apps are designed to assist people with visual impairment in navigating unfamiliar environments, reading text, and identifying objects and persons. Yet, little is known about how those with visual impairment perceive the assistive apps. OBJECTIVE: This study aims to advance knowledge of user experience with those assistive apps. METHODS: To address the knowledge gap, this study conducted phone interviews with a convenience sample of 30 individuals with visual impairment. RESULTS: The results indicated that those with visual impairment showed a range of preferences, needs, and concerns about user interfaces and interactions with the assistive apps. DISCUSSIONS: Given their needs and concerns, this study offered a set of facilitators to promote user adoption of the assistive apps, which should be valuable guidance to user interface/interaction designers in the field.
  5. Agaian, Sos S.; Jassim, Sabah A. (Ed.)
    Face recognition technologies have been in high demand in the past few decades due to the increase in human-computer interactions. Face recognition is also one of the essential components in interpreting human emotions, intentions, and facial expressions for smart environments. This non-intrusive biometric authentication system relies on identifying unique facial features and pairing alike structures for identification and recognition. Application areas of facial recognition systems include homeland and border security, identification for law enforcement, access control to secure networks, authentication for online banking, and video surveillance. While it is easy for humans to recognize faces under varying illumination conditions, it is still a challenging task in computer vision. Non-uniform illumination and uncontrolled operating environments can impair the performance of visual-spectrum based recognition systems. To address these difficulties, a novel Anisotropic Gradient Facial Recognition (AGFR) system that is capable of autonomous thermal infrared to visible face recognition is proposed. The main contributions of this paper include a framework for a thermal/fused-thermal-visible to visible face recognition system and a novel human-visual-system inspired thermal-visible image fusion technique. Extensive computer simulations using CARL, IRIS, AT&T, Yale and Yale-B databases demonstrate the efficiency, accuracy, and robustness of the AGFR system. Keywords: Infrared thermal to visible facial recognition, anisotropic gradient, visible-to-visible face recognition, nonuniform illumination face recognition, thermal and visible face fusion method