

This content will become publicly available on October 5, 2026

Title: Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?
Language model (LM) agents are increasingly used as autonomous decision-makers that need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world, which is key to robust, scientifically grounded reasoning. Yet it remains unclear whether LMs possess this capability or instead exhibit systematic biases that lead to erroneous conclusions. In this work, we examine LMs’ ability to explore and infer causal relationships, using the well-established Blicket Test paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more strongly) evidenced conjunctive ones. This “disjunctive bias” persists across model families, sizes, and prompting strategies, and performance declines further as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To test this, we quantify similarities between LMs and humans and find that LMs exhibit adult-like (but not child-like) inference profiles. Finally, we propose a test-time sampling method that explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.
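The abstract describes the test-time method only at a high level. As a minimal illustrative sketch (not the authors' implementation), explicit hypothesis sampling and elimination on a Blicket-style task could look like the Python snippet below; the candidate rules, object names, and observed trials are assumptions made purely for illustration.

from itertools import combinations

# Observed trials: (objects placed on the machine, did the machine activate?)
# These observations are illustrative, not data from the paper.
trials = [
    ({"A"}, False),
    ({"B"}, False),
    ({"A", "B"}, True),
]

objects = {"A", "B", "C"}

def disjunctive(blickets, placed):
    # Machine activates if ANY blicket is placed (the common, intuitive rule).
    return len(blickets & placed) >= 1

def conjunctive(blickets, placed):
    # Machine activates only if ALL blickets are placed together.
    return len(blickets) > 0 and blickets <= placed

# Sample the hypothesis space: every non-empty subset of objects under each rule.
hypotheses = [
    (rule, frozenset(subset))
    for rule in (disjunctive, conjunctive)
    for r in range(1, len(objects) + 1)
    for subset in combinations(objects, r)
]

# Eliminate hypotheses inconsistent with any observed trial.
consistent = [
    (rule.__name__, blickets)
    for rule, blickets in hypotheses
    if all(rule(set(blickets), placed) == outcome for placed, outcome in trials)
]

for name, blickets in consistent:
    print(name, sorted(blickets))
# Only the conjunctive {A, B} hypothesis survives, even though a
# disjunctive prior would favor single-blicket explanations.

Enumerating and eliminating hypotheses in this way surfaces the conjunctive explanation explicitly rather than leaving the choice to the model's (biased) prior.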
Award ID(s): 1922658
PAR ID: 10649411
Author(s) / Creator(s):
Publisher / Repository: Conference on Language Modeling 2025
Date Published:
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. In recent years, large language models (LLMs) and vision language models (VLMs) have excelled at tasks requiring human-like reasoning, inspiring researchers in engineering design to use language models (LMs) as surrogate evaluators of design concepts. But do these models actually evaluate designs like humans? While recent work has shown that LM evaluations sometimes fall within human variance on Likert-scale grading tasks, those tasks often obscure the reasoning and biases behind the scores. To address this limitation, we compare LM word embeddings (trained to capture semantic similarity) with human-rated similarity embeddings derived from triplet comparisons (“is A closer to B than C?”) on a dataset of design sketches and descriptions. We assess alignment via local tripletwise similarity and embedding distances, allowing for deeper insights than raw Likert-scale scores provide. We also explore whether describing the designs to LMs through text or images improves alignment with human judgments. Our findings suggest that text alone may not fully capture the nuances humans key into, yet text-based embeddings outperform their multimodal counterparts at satisfying local triplets. On the basis of these insights, we offer recommendations for effectively integrating LMs into design evaluation tasks. (A minimal sketch of the tripletwise-agreement check appears after this list.)
  2. Answering open-domain questions requires world knowledge about in-context entities. As pre-trained language models (LMs) lack the capacity to store all required knowledge, external knowledge sources, such as knowledge graphs, are often used to augment LMs. In this work, we propose the knOwledge REasOning empowered Language Model (OREOLM), which consists of a novel Knowledge Interaction Layer that can be flexibly plugged into existing Transformer-based LMs to interact collaboratively with a differentiable Knowledge Graph Reasoning module. In this way, the LM guides the KG to walk towards the desired answer, while the retrieved knowledge improves the LM. By applying OREOLM to RoBERTa and T5, we show significant performance gains, achieving state-of-the-art results in the closed-book setting. The performance enhancement comes mainly from the KG reasoning module's capacity to infer missing relational facts. In addition, OREOLM provides reasoning paths as rationales to interpret the model's decisions.
  3. Do LMs infer the semantics of text from co-occurrence patterns in their training data? Merrill et al. (2022) argue that, in theory, sentence co-occurrence probabilities predicted by an optimal LM should reflect the entailment relationship of the constituent sentences, but it is unclear whether probabilities predicted by neural LMs encode entailment in this way, because of strong assumptions made by Merrill et al. (namely, that humans always avoid redundancy). In this work, we investigate whether their theory can be used to decode entailment relations from neural LMs. We find that a test similar to theirs can decode entailment relations between natural sentences well above random chance, though not perfectly, across many datasets and LMs. This suggests that LMs implicitly model aspects of semantics in order to predict semantic effects on sentence co-occurrence patterns. However, we find that the test which predicts entailment in practice works in the opposite direction to the theoretical test. We thus revisit the assumptions underlying the original test, finding that its derivation did not adequately account for redundancy in human-written text. We argue that better accounting for redundancy related to explanations might yield the observed flipped test and, more generally, improve computational models of speakers in linguistics.
  4. A model is considered well calibrated when its probability estimates align with the actual likelihood of its outputs being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating LM hallucinations as well as in building more trustworthy models. However, standard calibration techniques may not be suited to LM calibration. For instance, post-processing methods such as temperature scaling do not reorder the candidate generations, while training-based methods require fine-tuning the entire model, which is impractical for large-scale LMs. We present LITCAB, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term, which is then added to the LM output logits; a minimal sketch of this mechanism appears after this list. LITCAB improves model calibration while adding less than 2% of the original model parameters. For evaluation, we construct CAT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs. We test LITCAB with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%. We further conduct a comprehensive evaluation with multiple popular open-source LMs from the GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration on tasks with short generations, but not necessarily on longer ones. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models, despite having far fewer parameters. (iii) Fine-tuning a pretrained model (e.g., LLaMA) on samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of the fine-tuning setup for calibrating LMs.
  5. Fitch, T.; Lamm, C.; Leder, H.; Teßmar-Raible, K. (Ed.)
    Listening to music activates representations of movement and social agents. Why? We ask whether high-level causal reasoning about how music was generated can lead people to link musical sounds with animate agents. To test this, we asked whether people (N=60) make flexible inferences about whether an agent caused musical sounds, integrating information from the sounds’ timing and from the visual context in which they were produced. Using a 2x2 within-subject design, we found evidence of causal reasoning: in a context where producing a musical sequence would require self-propelled movement, people inferred that an agent had been present causing the sounds. When the context provided an alternative possible explanation, this ‘explained away’ the agent, reducing the tendency to infer that an agent was present for the same acoustic stimuli. People can use causal reasoning to infer whether an agent produced musical sounds, suggesting that high-level cognition can link music with social concepts.
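For item 1 above, the local tripletwise comparison can be made concrete with a small hypothetical sketch; the embeddings, item names, and choice of cosine distance are assumptions for illustration, not the paper's actual data or pipeline.

import numpy as np

# Hypothetical data: design-description embeddings (e.g., from an LM encoder)
# and human triplet judgments (anchor, closer, farther), meaning raters judged
# `closer` to be more similar to `anchor` than `farther` is.
rng = np.random.default_rng(0)
embeddings = {name: rng.normal(size=8) for name in ["chair_a", "chair_b", "chair_c", "chair_d"]}
human_triplets = [
    ("chair_a", "chair_b", "chair_c"),
    ("chair_b", "chair_d", "chair_a"),
    ("chair_c", "chair_a", "chair_d"),
]

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def triplet_agreement(embeddings, triplets):
    # Fraction of human triplets whose ordering the embedding distances reproduce.
    satisfied = 0
    for anchor, closer, farther in triplets:
        d_close = cosine_distance(embeddings[anchor], embeddings[closer])
        d_far = cosine_distance(embeddings[anchor], embeddings[farther])
        satisfied += d_close < d_far
    return satisfied / len(triplets)

print(f"local triplet agreement: {triplet_agreement(embeddings, human_triplets):.2f}")

Scoring embeddings by the fraction of satisfied human triplets, rather than by Likert-scale grades, is what allows the comparison described in the abstract to probe the structure behind the scores.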
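For item 4 above, the described LITCAB mechanism (a single linear layer mapping the input text representation to a bias added to the LM output logits) might be sketched as follows; the class name, pooling assumption, and toy dimensions are illustrative guesses, not the released implementation.

import torch
import torch.nn as nn

class LitCabStyleBias(nn.Module):
    # Minimal sketch: one linear layer maps a pooled input representation to a
    # per-vocabulary bias that is added to the frozen LM's output logits.
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.bias_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, lm_logits: torch.Tensor, input_repr: torch.Tensor) -> torch.Tensor:
        # lm_logits: (batch, vocab); input_repr: (batch, hidden) pooled text representation
        return lm_logits + self.bias_head(input_repr)

# Toy dimensions for illustration (a real LM would have e.g. hidden=4096, vocab=32000).
hidden, vocab = 16, 100
calibrator = LitCabStyleBias(hidden, vocab)
logits = torch.randn(2, vocab)
pooled = torch.randn(2, hidden)
calibrated = calibrator(logits, pooled)
print(calibrated.shape)  # torch.Size([2, 100])

Because only the single linear head is trained, the added parameter count stays small relative to the base model, consistent with the "less than 2% of the original model parameters" figure in the abstract.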