Title: A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction
People with blindness and low vision (pBLV) encounter substantial challenges when it comes to comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to vision loss, pBLV have difficulty independently identifying potential tripping hazards. Previous assistive technologies for the visually impaired often struggle in real-world scenarios because they require constant training and lack robustness, which limits their effectiveness, especially in dynamic and unfamiliar environments where accurate and efficient perception is crucial. We therefore frame our research question as: how can we assist pBLV in recognizing scenes, identifying objects, and detecting potential tripping hazards in unfamiliar environments, where existing assistive technologies often falter due to their lack of robustness? We hypothesize that by leveraging large pretrained foundation models and prompt engineering, we can create a system that effectively addresses the challenges faced by pBLV in unfamiliar environments. Motivated by the growing use of large pretrained foundation models, particularly in assistive robotics, where extensive pretraining yields accurate perception and robust contextual understanding of real-world scenes, we present a pioneering approach that leverages foundation models to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Specifically, our method begins by applying a large image tagging model (the Recognize Anything Model, RAM) to identify all common objects present in the captured image. The recognition results and the user query are then integrated into a prompt, tailored specifically for pBLV, using prompt engineering. By combining the prompt and the input image, a vision-language foundation model (InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks by analyzing environmental objects and scenic landmarks relevant to the prompt. We evaluate our approach through experiments conducted on both indoor and outdoor datasets. Our results demonstrate that our method can recognize objects accurately and provide insightful descriptions and analysis of the environment for pBLV.
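A minimal sketch of the described pipeline is given below, assuming the Hugging Face transformers implementation of InstructBLIP. The get_scene_tags helper is a hypothetical stand-in for the Recognize Anything Model (RAM), and the checkpoint name and prompt template are illustrative assumptions rather than the authors' exact configuration.

import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Vision-language model that generates the final description (checkpoint name assumed).
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b"
).to(device)

def get_scene_tags(image):
    # Hypothetical stand-in for the Recognize Anything Model (RAM); in the actual
    # system this call would return the object tags RAM assigns to the image.
    return ["sidewalk", "curb", "bicycle", "traffic cone"]

def describe_scene(image_path, user_query):
    image = Image.open(image_path).convert("RGB")
    tags = get_scene_tags(image)

    # Prompt engineering: fold the recognized objects and the user's query into a
    # single instruction tailored for pBLV (the template below is illustrative).
    prompt = (
        f"The image contains: {', '.join(tags)}. "
        f"I am a person with low vision. {user_query} "
        "Describe the surroundings in detail and warn me about any tripping hazards."
    )

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

# Example usage: describe_scene("street_scene.jpg", "Is it safe to walk forward?")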
Award ID(s):
2345139 2236097
PAR ID:
10559620
Publisher / Repository:
MDPI
Date Published:
Journal Name:
Journal of Imaging
Volume:
10
Issue:
5
ISSN:
2313-433X
Page Range / eLocation ID:
103
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Walking in real-world environments involves constant decision-making; for example, when approaching a staircase, an individual decides whether to engage (climb the stairs) or avoid it. For the control of assistive robots (e.g., robotic lower-limb prostheses), recognizing such motion intent is an important but challenging task, primarily due to the lack of available information. This paper presents a novel vision-based method to recognize an individual's motion intent when approaching a staircase, before the potential transition of motion mode (walking to stair climbing) occurs. Leveraging egocentric images from a head-mounted camera, the authors trained a YOLOv5 object detection model to detect staircases. Subsequently, an AdaBoost and gradient boosting (GB) classifier was developed to recognize the individual's intention of engaging or avoiding the upcoming stairway. This method has been demonstrated to provide reliable (97.69%) recognition at least two steps before the potential mode transition, which is expected to provide ample time for controller mode transition in an assistive robot in real-world use.
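A minimal sketch of this kind of two-stage pipeline, assuming the public ultralytics/yolov5 torch.hub interface and scikit-learn's gradient boosting; the feature set, labels, and training data are hypothetical placeholders, not the authors' setup.

import torch
from sklearn.ensemble import GradientBoostingClassifier

# Stage 1: staircase detector. The paper fine-tunes YOLOv5 on egocentric images;
# here a generic pretrained checkpoint is loaded for illustration.
detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def stair_features(frame):
    # Reduce the top detection to simple geometric features
    # (box area, vertical position, confidence); zeros if nothing is detected.
    results = detector(frame)
    boxes = results.xyxy[0]  # rows of (x1, y1, x2, y2, conf, cls)
    if len(boxes) == 0:
        return [0.0, 0.0, 0.0]
    x1, y1, x2, y2, conf, _ = boxes[0].tolist()
    return [(x2 - x1) * (y2 - y1), (y1 + y2) / 2.0, conf]

# Stage 2: intent classifier mapping per-step feature vectors to engage / avoid.
# X_train and y_train would come from labeled approach sequences (placeholders here).
clf = GradientBoostingClassifier()
# clf.fit(X_train, y_train)
# intent = clf.predict([stair_features(current_frame)])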
  2. We survey applications of pretrained foundation models in robotics. Traditional deep learning models in robotics are trained on small datasets tailored for specific tasks, which limits their adaptability across diverse applications. In contrast, foundation models pretrained on internet-scale data appear to have superior generalization capabilities, and in some instances display an emergent ability to find zero-shot solutions to problems that are not present in the training data. Foundation models may hold the potential to enhance various components of the robot autonomy stack, from perception to decision-making and control. For example, large language models can generate code or provide common sense reasoning, while vision-language models enable open-vocabulary visual recognition. However, significant open research challenges remain, particularly around the scarcity of robot-relevant training data, safety guarantees and uncertainty quantification, and real-time execution. In this survey, we study recent papers that have used or built foundation models to solve robotics problems. We explore how foundation models contribute to improving robot capabilities in the domains of perception, decision-making, and control. We discuss the challenges hindering the adoption of foundation models in robot autonomy and provide opportunities and potential pathways for future advancements. The GitHub project corresponding to this paper can be found here: https://github.com/robotics-survey/Awesome-Robotics-Foundation-Models . 
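As one concrete illustration of the open-vocabulary visual recognition mentioned above, the sketch below scores an image against free-form text labels with a CLIP checkpoint through Hugging Face transformers; the model name and label set are assumptions chosen for demonstration.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot, open-vocabulary recognition: labels are plain text and can be
# changed at inference time without retraining.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg").convert("RGB")
labels = ["a staircase", "a wet floor sign", "a charging station", "an open doorway"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # similarity of the image to each label
probs = logits.softmax(dim=-1)[0]

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.2f}")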
  3. Emerging technologies offer the potential to expand the domain of the future workforce to extreme environments, such as outer space and alien terrains. To understand how humans navigate in such environments, which lack familiar spatial cues, this study examined spatial perception in three types of environments, simulated using virtual reality. We examined participants' ability to estimate the size and distance of stimuli under conditions of minimal, moderate, or maximum visual cues, corresponding to an environment simulating outer space, an alien terrain, or a typical cityscape, respectively. The findings show underestimation of distance in both the maximum and the minimum visual-cue environments but a tendency toward overestimation of distance in the moderate environment. We further observed that depth estimation was substantially better in the minimum environment than in the other two environments. However, estimation of height was more accurate in the environment with maximum cues (cityscape) than in the environment with minimum cues (outer space). More generally, our results suggest that familiar visual cues facilitated better estimation of size and distance than unfamiliar cues. In fact, the presence of unfamiliar, and perhaps misleading, visual cues (characterizing the alien terrain environment) was more disruptive to distance and size perception than a total absence of visual cues. The findings have implications for training workers to better adapt to extreme environments.
  4. BACKGROUND: Various emerging assistive applications (apps) running on smartphones, such as Seeing AI, TapTapSee, and BeMyEyes, have been introduced. These assistive apps are designed to help people with visual impairment navigate unfamiliar environments, read text, and identify objects and persons. Yet, little is known about how people with visual impairment perceive these apps. OBJECTIVE: This study aims to advance knowledge of the user experience with these assistive apps. METHODS: To address the knowledge gap, this study conducted phone interviews with a convenience sample of 30 individuals with visual impairment. RESULTS: The results indicated that individuals with visual impairment have a range of preferences, needs, and concerns about user interfaces and interactions with the assistive apps. DISCUSSION: Given these needs and concerns, this study offers a set of facilitators to promote user adoption of the assistive apps, which should provide valuable guidance to user interface and interaction designers in the field.
  5. A worker’s attentional and cognitive failures—such as lack of attention, failure to identify a tripping hazard, or misperception of a hazard’s risks—can lead to unsafe behaviors and, consequently, accidents. Previous literature has shown that individual characteristics such as personality may affect a person’s selective attention. However, few studies have attempted to empirically examine how a worker’s personality affects attention and situation awareness on a jobsite. The present study examines how workers’ emotional stability (neuroticism) affects their cognitive failures (especially attentional failure) when they are exposed to fall-to-same-level hazardous conditions. To achieve this goal—and given that eye movements represent the most direct manifestation of visual attention—the personalities of construction workers were assessed via self-completion questionnaires, and their attention and situation awareness were monitored continuously and in real time using a mobile wearable eye-tracking apparatus. Correlational analyses revealed a significant relationship between neuroticism and the attentional distribution of workers. These results suggest that workers do not allocate their attention equally to all hazardous areas and that these differences in attentional distribution are modulated by personality characteristics (neuroticism). A more detailed investigation of this connection yielded a specific pattern: less neurotic workers periodically look down and scan ahead to obtain feedforward information about tripping hazards, and these individuals remain fully aware of the environment and its associated hazards. The findings of this study suggest the value of assessing personality to identify workers who are more likely to be involved in accidents.