
Title: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain
The ability to form and abstract concepts is key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven's Progressive Matrices and Bongard problems, but even when AI systems succeed on such problems, they are rarely evaluated in depth to see if they have actually grasped the concepts the problems are meant to capture. In this paper we describe an in-depth evaluation benchmark for the Abstraction and Reasoning Corpus (ARC), a collection of few-shot abstraction and analogy problems developed by Chollet [2019]. In particular, we describe ConceptARC, a new, publicly available benchmark in the ARC domain that systematically assesses abstraction and generalization abilities on a number of basic spatial and semantic concepts. ConceptARC differs from the original ARC dataset in that it is specifically organized around "concept groups" -- sets of problems that focus on specific concepts and that vary in complexity and level of abstraction. We report results of testing humans on this benchmark as well as three machine solvers: the top two programs from a 2021 ARC competition and OpenAI's GPT-4. Our results show that humans substantially outperform the machine solvers on this benchmark, demonstrating abilities to abstract and generalize concepts that are not yet captured by AI systems. We believe that this benchmark will spur improvements in the development of AI systems for conceptual abstraction and in the effective evaluation of such systems.
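To make the "concept group" organization concrete, the sketch below (a minimal illustration, not the authors' released code) loads ConceptARC-style tasks and scores a solver separately for each concept. It assumes the standard ARC JSON task format, in which each file holds "train" and "test" lists of input/output integer grids, plus a hypothetical one-folder-per-concept layout; the released corpus may be organized differently.

import json
from collections import defaultdict
from pathlib import Path

def load_concept_groups(root):
    """Group ARC-format task files by the concept-named folder they sit in (layout assumed)."""
    groups = defaultdict(list)
    for task_file in Path(root).glob("*/*.json"):
        concept = task_file.parent.name  # folder name stands in for the concept group
        groups[concept].append(json.loads(task_file.read_text()))
    return dict(groups)

def accuracy_by_concept(groups, solver):
    """For each concept, the fraction of test grids a solver reproduces exactly."""
    scores = {}
    for concept, tasks in groups.items():
        correct, total = 0, 0
        for task in tasks:
            for pair in task["test"]:
                total += 1
                if solver(task["train"], pair["input"]) == pair["output"]:
                    correct += 1
        scores[concept] = correct / total if total else 0.0
    return scores

Reporting accuracy per concept group, rather than over one pooled test set, is what lets a benchmark like this ask whether a solver has generalized the underlying concept rather than solved isolated tasks.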
Award ID(s): 2139983
PAR ID: 10448578
Author(s) / Creator(s): ; ;
Date Published:
Journal Name: Transactions on Machine Learning Research
ISSN: 2835-8856
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. A long-held objective in AI is to build systems that understand concepts in a humanlike way. Setting aside the difficulty of building such a system, even trying to evaluate one is a challenge, due to present-day AI's relative opacity and its proclivity for finding shortcut solutions. This is exacerbated by humans' tendency to anthropomorphize, assuming that a system that can recognize one instance of a concept must also understand other instances, as a human would. In this paper, we argue that understanding a concept requires the ability to use it in varied contexts. Accordingly, we propose systematic evaluations centered around concepts, by probing a system's ability to use a given concept in many different instantiations. We present case studies of such evaluations in two domains -- RAVEN (inspired by Raven's Progressive Matrices) and the Abstraction and Reasoning Corpus (ARC) -- that have been used to develop and assess abstraction abilities in AI systems. Our concept-based approach to evaluation reveals information about AI systems that conventional test sets would have left hidden.
  2. Conceptual abstraction and analogy‐making are key abilities underlying humans' capacity to learn, reason, and robustly adapt their knowledge to new domains. Despite a long history of research on constructing artificial intelligence (AI) systems with these abilities, no current AI system comes close to being capable of forming humanlike abstractions or analogies. This paper reviews the advantages and limitations of several approaches toward this goal, including symbolic methods, deep learning, and probabilistic program induction. The paper concludes with several proposals for designing challenge tasks and evaluation measures in order to make quantifiable and generalizable progress in this area.
  3. The Abstraction and Reasoning Corpus (ARC) is a set of tasks that tests an agent's ability to flexibly solve novel problems. While most ARC tasks are easy for humans, they are challenging for state-of-the-art AI. How do we build intelligent systems that can generalize to novel situations and understand human instructions in domains such as ARC? We posit that the answer may be found by studying how humans communicate to each other in solving these tasks. We present LARC, the Language-annotated ARC: a collection of natural language descriptions by a group of human participants, unfamiliar both with ARC and with each other, who instruct each other on how to solve ARC tasks. LARC contains successful instructions for 88% of the ARC tasks. We analyze the collected instructions as 'natural programs', finding that most natural program concepts have analogies in typical computer programs. However, unlike how one precisely programs a computer, we find that humans both anticipate and exploit ambiguities to communicate effectively. We demonstrate that a state-of-the-art program synthesis technique, which leverages the additional language annotations, outperforms its language-free counterpart.
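    As a rough illustration of the kind of record LARC collects, the sketch below pairs an ARC task with one describer's natural-language instructions and notes whether a separate builder solved the task from those instructions alone; the field and function names are illustrative assumptions, not the released LARC schema.

from dataclasses import dataclass

@dataclass
class LarcRecord:
    task_id: str             # identifier of the underlying ARC task
    description: str         # the describer's natural-language "program" (assumed field)
    builder_succeeded: bool  # did a builder reproduce the output from the text alone? (assumed field)

def successful_instructions(records):
    """Keep only descriptions that another participant executed correctly."""
    return [r for r in records if r.builder_succeeded]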
  4. Educational AI (AIEd) systems are increasingly designed and evaluated with an awareness of the hybrid nature of adaptivity in real-world educational settings. In practice, beyond being a property of AIEd systems alone, adaptivity is often jointly enacted by AI systems and human facilitators (e.g., teachers or peers). Despite much recent research activity, theoretical and conceptual guidance for the design of such human–AI systems remains limited. In this paper we explore how adaptivity may be shared across AIEd systems and the various human stakeholders who work with them. Based on a comparison of prior frameworks, which tend to examine adaptivity in AIEd systems or human coaches separately, we first synthesize a set of dimensions general enough to capture human–AI hybrid adaptivity. Using these dimensions, we then present a conceptual framework to map distinct ways in which humans and AIEd systems can augment each other’s abilities. Through examples, we illustrate how this framework can be used to characterize prior work and envision new possibilities for human–AI hybrid approaches in education. 
  5. Knowledge representation and reasoning (KRR) systems describe and reason with complex concepts and relations in the form of facts and rules. Unfortunately, wide deployment of KRR systems runs into the problem that domain experts have great difficulty constructing correct logical representations of their domain knowledge. Knowledge engineers can help with this construction process, but there is a deficit of such specialists. The earlier Knowledge Authoring Logic Machine (KALM) based on Controlled Natural Language (CNL) was shown to have very high accuracy for authoring facts and questions. More recently, KALMFL, a successor of KALM, replaced CNL with factual English, which is much less restrictive and requires very little training from users. However, KALMFL has limitations in representing certain types of knowledge, such as authoring rules for multi-step reasoning or understanding actions with timestamps. To address these limitations, we propose KALMRA to enable authoring of rules and actions. Our evaluation using the UTI guidelines benchmark shows that KALMRA achieves a high level of correctness (100%) on rule authoring. When used for authoring and reasoning with actions, KALMRA achieves more than 99.3% correctness on the bAbI benchmark, demonstrating its effectiveness in more sophisticated KRR jobs. Finally, we illustrate the logical reasoning capabilities of KALMRA by drawing attention to the problems faced by the recently made famous AI, ChatGPT.