Regardless of how much data artificial intelligence agents have available, agents will inevitably encounter previously unseen situations in real-world deployments. Reacting to novel situations by acquiring new information from other people—socially situated learning—is a core faculty of human development. Unfortunately, socially situated learning remains an open challenge for artificial intelligence agents because they must learn how to interact with people to seek out the information that they lack. In this article, we formalize the task of socially situated artificial intelligence—agents that seek out new information through social interactions with people—as a reinforcement learning problem where the agent learns to identify meaningful and informative questions via rewards observed through social interaction. We manifest our framework as an interactive agent that learns how to ask natural language questions about photos as it broadens its visual intelligence on a large photo-sharing social network. Unlike active-learning methods, which implicitly assume that humans are oracles willing to answer any question, our agent adapts its behavior based on observed norms of which questions people are or are not interested in answering. Through an 8-month deployment in which it interacted with 236,000 social media users, our agent improved its performance at recognizing new visual information by 112%. A controlled field experiment confirmed that our agent outperformed an active-learning baseline by 25.6%. This work advances opportunities for continuously improving artificial intelligence (AI) agents that better respect norms in open social environments.
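The abstract above frames question-asking as reinforcement learning in which the reward is observed through social interaction. The following is a minimal, purely illustrative sketch of that idea, not the deployed system: an epsilon-greedy bandit chooses among a few hypothetical question templates and treats "did anyone answer?" as the reward. The templates, answer rates, and class names are all invented, and the simulated response stands in for real users.

```python
# Minimal sketch of the reward-from-social-interaction idea, assuming a
# bandit-style learner over hypothetical question templates; the real agent
# generated free-form questions about photos and learned from live users.
import random
from collections import defaultdict

QUESTION_TEMPLATES = [                    # hypothetical question types
    "What breed is this dog?",
    "Where was this photo taken?",
    "What is the name of this dish?",
]

class SocialQuestionAsker:
    """Epsilon-greedy learner: reward = 1 if a person answers, 0 if ignored."""
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # times each template was asked
        self.values = defaultdict(float)  # running mean social reward

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(QUESTION_TEMPLATES)                   # explore
        return max(QUESTION_TEMPLATES, key=lambda q: self.values[q])   # exploit

    def update(self, question, reward):
        # Incremental mean update of the observed social reward.
        self.counts[question] += 1
        n = self.counts[question]
        self.values[question] += (reward - self.values[question]) / n

def simulated_social_response(question):
    # Stand-in for deployment: made-up answer rates per question type.
    answer_rate = {"What breed is this dog?": 0.7,
                   "Where was this photo taken?": 0.4,
                   "What is the name of this dish?": 0.6}
    return 1.0 if random.random() < answer_rate[question] else 0.0

agent = SocialQuestionAsker()
for _ in range(1000):
    q = agent.choose()
    agent.update(q, simulated_social_response(q))
print({q: round(agent.values[q], 2) for q in QUESTION_TEMPLATES})
```

In the actual framework described in the abstract, the policy generates free-form natural language questions conditioned on the photo, and the reward also reflects whether the answer is informative, not just whether anyone replied.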
Analyzing the Roles of Language and Vision in Learning from Limited Data
Does language help make sense of the visual world? How important is it to actually see the world rather than having it described with words? These basic questions about the nature of intelligence have been difficult to answer because we only had one example of an intelligent system – humans – and limited access to cases that isolated language or vision. However, the development of sophisticated Vision-Language Models (VLMs) by artificial intelligence researchers offers us new opportunities to explore the contributions that language and vision make to learning about the world. We ablate components from the cognitive architecture of these models to identify their contributions to learning new tasks from limited data. We find that a language model leveraging all components recovers a majority of a VLM's performance, despite its lack of visual input, and that language seems to allow this by providing access to prior knowledge and reasoning.
- Award ID(s):
- 2107048
- PAR ID:
- 10542074
- Publisher / Repository:
- Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci), 2024.
- Date Published:
- ISSN:
- 1069-7977
- Format(s):
- Medium: X
- Location:
- Netherlands
- Sponsoring Org:
- National Science Foundation
More Like this
-
Psychologists recognize Raven's Progressive Matrices as a useful test of general human intelligence. While many computational models investigate various forms of top-down, deliberative reasoning on the test, there has been less research on bottom-up perceptual processes, like Gestalt image completion, that are also critical in human test performance. In this work, we investigate how Gestalt visual reasoning on the Raven's test can be modeled using generative image inpainting techniques from computer vision. We demonstrate that a reasoning agent that has access to an off-the-shelf inpainting model trained only on photorealistic images of objects achieves a score of 27/36 on the Colored Progressive Matrices, which corresponds to average performance for nine-year-old children. We also show that when our agent uses inpainting models trained on other datasets (faces, places, and textures), it does not perform as well. Our results illustrate how learning visual regularities in real-world images can translate into successful reasoning about artificial test stimuli. On the flip side, our results also highlight the limitations of such transfer, which may contribute to explanations for why intelligence tests like the Raven's are often sensitive to people's individual sociocultural backgrounds. (A toy sketch of this mask-inpaint-compare pipeline appears after this list.)
-
The ability to process social information is a critical component of children's early language and cognitive development. However, as children reach their first birthday, they begin to locomote themselves, dramatically affecting their visual access to this information. How do these postural and locomotor changes affect children's access to the social information relevant for word-learning? Here, we explore this question by using head-mounted cameras to record the egocentric visual perspective of 36 infants (8 to 16 months of age) and using computer vision algorithms to estimate the proportion of faces and hands in infants' environments. We find that infants' posture and orientation to their caregiver modulate their access to social information, confirming previous work that suggests motoric developments play a significant role in the emergence of children's linguistic and social capacities. We suggest that the combined use of head-mounted cameras and the application of new computer vision techniques is a promising avenue for understanding the statistics of infants' visual and linguistic experience. (A minimal sketch of the per-frame detection step appears after this list.)
-
'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. CLEVR_HYP is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine this encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability. (A toy encoder-decoder sketch appears after this list.)
-
We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs - a critical skill for robots to operate in the unstructured 3D world. Towards this end, we propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities, while maintaining their zero-shot robustness. We achieve this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial and geometric reasoning skills on top of those abstractions in a semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks: 1) completing partially observed objects and 2) localizing hidden objects from language descriptions. Experiments show that SemAbs can generalize to novel vocabulary, materials/lighting, classes, and domains (i.e., real-world scans) from training on limited 3D synthetic data. (A rough illustration of language-conditioned relevancy scoring appears after this list.)
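For the first entry above (Gestalt completion on the Raven's test via image inpainting), here is a toy, hedged sketch of the mask-inpaint-compare idea. OpenCV's classical inpainting and a synthetic gradient image stand in for the generative inpainting models and actual test items used in that work; the candidate answers are invented for illustration.

```python
# Toy sketch of inpainting-as-completion: mask the missing bottom-right cell,
# fill it from the surrounding context, then pick the answer option most
# similar to the completion. Classical inpainting and a synthetic "matrix"
# stand in for the generative models and real Raven's items.
import cv2
import numpy as np

CELL = 32
row = np.linspace(0, 255, 3 * CELL, dtype=np.uint8)
matrix = np.tile(row, (3 * CELL, 1))            # toy "matrix": horizontal gradient

y0, x0 = 2 * CELL, 2 * CELL                     # bottom-right cell is "missing"
mask = np.zeros_like(matrix)
mask[y0:y0 + CELL, x0:x0 + CELL] = 255
occluded = matrix.copy()
occluded[mask == 255] = 0

filled = cv2.inpaint(occluded, mask, 3, cv2.INPAINT_TELEA)
completion = filled[y0:y0 + CELL, x0:x0 + CELL].astype(float)

# Score invented answer options by similarity to the inpainted completion.
candidates = {
    "continues_gradient": matrix[y0:y0 + CELL, x0:x0 + CELL].astype(float),
    "solid_black": np.zeros((CELL, CELL)),
}
scores = {name: -np.mean((completion - patch) ** 2)   # higher = more similar
          for name, patch in candidates.items()}
print("chosen answer:", max(scores, key=scores.get))
```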
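For the second entry above (head-mounted cameras and computer vision to measure infants' visual access to social information), this is a minimal sketch of the per-frame measurement only, assuming an off-the-shelf face detector. The file name is hypothetical, and the original study used more capable detectors and also measured hands.

```python
# Minimal sketch: run a face detector over an egocentric video and report the
# proportion of frames in which at least one face is visible. OpenCV's Haar
# cascade is an illustrative stand-in; "headcam_session.mp4" is hypothetical.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("headcam_session.mp4")   # hypothetical egocentric recording
frames_total, frames_with_face = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    frames_total += 1
    frames_with_face += int(len(faces) > 0)
cap.release()

# Hands would need a separate detector; this sketch covers faces only.
if frames_total:
    print(f"faces visible in {frames_with_face / frames_total:.1%} of frames")
```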
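For the third entry above (learning vector representations of actions for CLEVR_HYP), this is a toy sketch of an encoder-decoder that compresses a tokenized action sentence into a single vector and reconstructs the sentence from that vector alone. The vocabulary, sentences, and sizes are invented; this is not the authors' architecture, only an illustration of actions-as-vectors.

```python
# Toy encoder-decoder for actions-as-vectors (not the CLEVR_HYP authors' model).
import torch
import torch.nn as nn

VOCAB = ["<pad>", "move", "the", "red", "blue", "cube", "sphere",
         "left", "right", "behind"]
tok = {w: i for i, w in enumerate(VOCAB)}

def encode(sentence, length=6):
    # Tokenize and pad a made-up action sentence to a fixed length.
    ids = [tok[w] for w in sentence.split()][:length]
    return torch.tensor(ids + [0] * (length - len(ids)))

class ActionAutoencoder(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        _, h = self.encoder(self.emb(x))                     # h: (1, B, dim)
        dec_in = h.transpose(0, 1).repeat(1, x.size(1), 1)   # decode from the vector only
        y, _ = self.decoder(dec_in, h)
        return self.out(y), h.squeeze(0)                     # logits, action vectors

data = torch.stack([encode("move the red cube left"),
                    encode("move the blue sphere behind")])
model = ActionAutoencoder(len(VOCAB))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    logits, action_vec = model(data)
    loss = loss_fn(logits.reshape(-1, len(VOCAB)), data.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(action_vec.shape)   # each action sentence is now a 32-dim vector
```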
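For the fourth entry above (Semantic Abstraction), this rough sketch illustrates only the notion of a language-conditioned relevancy map, by scoring image crops against a text query with CLIP. SemAbs itself extracts relevancy maps with an explainability method rather than crop scoring, and the image path and query here are hypothetical.

```python
# Rough stand-in for a language-conditioned relevancy map: score a grid of
# image crops against a text query with CLIP. Not the SemAbs method; the
# image path and query are hypothetical.
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
image = Image.open("scene_view.png").convert("RGB")   # hypothetical RGB view
text = clip.tokenize(["a photo of a coffee mug"])     # hypothetical query

GRID = 4
relevancy = torch.zeros(GRID, GRID)
W, H = image.size
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    for i in range(GRID):
        for j in range(GRID):
            crop = image.crop((j * W // GRID, i * H // GRID,
                               (j + 1) * W // GRID, (i + 1) * H // GRID))
            img_feat = model.encode_image(preprocess(crop).unsqueeze(0))
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            relevancy[i, j] = (img_feat @ text_feat.T).item()

print(relevancy)   # higher values = crops more relevant to the query
```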