

Title: Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong
Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. Our experiments with 80 crowdworkers compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information - explaining both why the claim could be true and why it could be false - and then present both sides of the explanation to users. This contrastive explanation mitigates users' over-reliance on LLMs, but does not significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.
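As a rough illustration of the setup described in the abstract, the sketch below builds both a standard validation prompt and a contrastive prompt that argues each side of a claim. The prompt wording and the `query_llm` callable are illustrative assumptions, not the exact prompts or model interface used in the study.

```python
# Sketch of claim validation vs. contrastive explanation prompts.
# `query_llm` stands in for any chat-completion call (assumed: takes a
# prompt string and returns the model's text response).

def build_validation_prompt(claim: str) -> str:
    """Ask the model for a verdict plus a supporting explanation."""
    return (
        f"Claim: {claim}\n"
        "Is this claim true or false? Give a verdict and a short explanation "
        "citing the evidence you rely on."
    )

def build_contrastive_prompt(claim: str) -> str:
    """Ask for both sides so users see supporting and refuting arguments."""
    return (
        f"Claim: {claim}\n"
        "First, explain the strongest reasons why this claim could be TRUE.\n"
        "Then, explain the strongest reasons why this claim could be FALSE.\n"
        "Do not state a final verdict."
    )

def fact_check(claim: str, query_llm, contrastive: bool = False) -> str:
    """Return an LLM explanation for the claim in either condition."""
    prompt = build_contrastive_prompt(claim) if contrastive else build_validation_prompt(claim)
    return query_llm(prompt)
```

Presenting the output of the contrastive variant corresponds to the condition the abstract reports as reducing over-reliance, since users must weigh both sides rather than accept a single verdict.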
Award ID(s):
2229885
PAR ID:
10522356
Author(s) / Creator(s):
Publisher / Repository:
2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024)
Date Published:
Format(s):
Medium: X
Location:
Mexico City, Mexico
Sponsoring Org:
National Science Foundation
More Like this
  1. Background: Laypeople have easy access to health information through large language models (LLMs), such as ChatGPT, and search engines, such as Google. Search engines transformed health information access, and LLMs offer a new avenue for answering laypeople's questions. Objective: We aimed to compare the frequency of use and attitudes toward LLMs and search engines, as well as their comparative relevance, usefulness, ease of use, and trustworthiness in responding to health queries. Methods: We conducted a screening survey to compare the demographics of LLM users and nonusers seeking health information, analyzing results with logistic regression. LLM users from the screening survey were invited to a follow-up survey to report the types of health information they sought. We compared the frequency of use of LLMs and search engines using ANOVA and Tukey post hoc tests. Lastly, paired-sample Wilcoxon tests compared LLMs and search engines on perceived usefulness, ease of use, trustworthiness, feelings, bias, and anthropomorphism. Results: In total, 2002 US participants recruited on Prolific took part in the screening survey about the use of LLMs and search engines. Of these, 52% (n=1045) were female, with a mean age of 39 (SD 13) years. Participants were 9.7% (n=194) Asian, 12.1% (n=242) Black, 73.3% (n=1467) White, 1.1% (n=22) Hispanic, and 3.8% (n=77) of other races and ethnicities. Further, 1913 (95.6%) used search engines to look up health queries versus 642 (32.6%) who used LLMs. Men had higher odds (odds ratio [OR] 1.63, 95% CI 1.34-1.99; P<.001) of using LLMs for health questions than women. Black (OR 1.90, 95% CI 1.42-2.54; P<.001) and Asian (OR 1.66, 95% CI 1.19-2.30; P<.01) individuals had higher odds than White individuals. Those with excellent perceived health (OR 1.46, 95% CI 1.1-1.93; P=.01) were more likely to use LLMs than those with good health. Higher technical proficiency increased the likelihood of LLM use (OR 1.26, 95% CI 1.14-1.39; P<.001). In a follow-up survey of 281 LLM users for health, most participants used search engines first (n=174, 62%) to answer health questions, but the second most common first source consulted was LLMs (n=39, 14%). LLMs were perceived as less useful (P<.01) and less relevant (P=.07), but elicited fewer negative feelings (P<.001), appeared more human (LLM: n=160, vs search: n=32), and were seen as less biased (P<.001). Trust (P=.56) and ease of use (P=.27) showed no differences. Conclusions: Search engines are the primary source of health information, yet positive perceptions of LLMs suggest growing use. Future work could explore whether LLM trust and usefulness are enhanced by supplementing answers with external references and limiting persuasive language to curb overreliance. Collaboration with health organizations can help improve the quality of LLMs' health output.
  2. Recent years have seen large language models (LLMs) achieve impressive performance on many natural language processing tasks, including tasks that directly serve users such as question answering and text summarization. They open up unprecedented opportunities for transforming information retrieval (IR) research and applications. However, concerns such as hallucination undermine their trustworthiness, limiting their actual utility when deployed in real-world applications, especially high-stakes applications where trust is vital. How can we both exploit the strengths of LLMs and mitigate any risk caused by their weaknesses when applying LLMs to IR? What are the best opportunities for us to apply LLMs to IR? What are the major challenges that we will need to address in the future to fully exploit such opportunities? Given the anticipated growth of LLMs, what will future information retrieval systems look like? Will LLMs eventually replace an IR system? In this perspective paper, we examine these questions and provide provisional answers. We argue that LLMs will not be able to replace search engines, and that future LLMs would need to learn how to use a search engine so that they can interact with one on behalf of users. We conclude with a set of promising future research directions for applying LLMs to IR.
  3. Automatically generated explanations of how machine learning (ML) models reason can help users understand and accept them. However, explanations can have unintended consequences: promoting over-reliance or undermining trust. This paper investigates how explanations shape users' perceptions of ML models with or without the ability to provide feedback to them: (1) does revealing model flaws increase users' desire to "fix" them? (2) does providing explanations cause users to believe, wrongly, that models are introspective and will thus improve over time? Through two controlled experiments varying model quality, we show how the combination of explanations and user feedback impacted perceptions, such as frustration and expectations of model improvement. Explanations without an opportunity for feedback were frustrating with the lower-quality model, while interactions between explanation and feedback for the higher-quality model suggest that detailed feedback should not be requested without explanation. Users expected model correction, regardless of whether they provided feedback or received explanations.
  4. Generative AI, particularly Large Language Models (LLMs), has revolutionized human-computer interaction by enabling the generation of nuanced, human-like text. This presents new opportunities, especially for enhancing explainability in AI systems such as recommender systems, where explanations are a crucial factor for fostering user trust and engagement. LLM-powered AI chatbots can be leveraged to provide personalized explanations for recommendations. Although users often find these chatbot explanations helpful, they may not fully comprehend the content. Our research focuses on assessing how well users comprehend these explanations and identifying gaps in understanding. We also explore the key behavioral differences between users who effectively understand AI-generated explanations and those who do not. We designed a three-phase user study with 17 participants to explore these dynamics. The findings indicate that the clarity and usefulness of the explanations are contingent on the user asking relevant follow-up questions and having a motivation to learn. Comprehension also varies significantly based on users' educational backgrounds.
  5. This paper systematically explores how Large Language Models (LLMs) generate explanations of code examples of the type used in intro-to-programming courses. As we show, the nature of code explanations generated by LLMs varies considerably based on the wording of the prompt, the target code examples being explained, the programming language, the temperature parameter, and the version of the LLM. Nevertheless, they are consistent in two major respects for Java and Python: the readability level, which hovers around the 7th-8th grade level, and the lexical density, i.e., the proportion of meaningful words relative to the total length of the explanation. Furthermore, the explanations score very high on correctness but lower on three other metrics: completeness, conciseness, and contextualization.
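The lexical-density metric mentioned in item 5 above can be approximated with a short script. The function-word list below is a small stand-in for a proper stopword inventory, and the readability note assumes the third-party textstat package; neither reflects the paper's exact tooling.

```python
# Approximate lexical density: the share of tokens that are content words
# (i.e., not function words). FUNCTION_WORDS is an illustrative stand-in,
# not an exhaustive list.
import re

FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "if", "then", "of", "to", "in",
    "on", "for", "with", "is", "are", "was", "were", "be", "it", "this",
    "that", "as", "by", "at", "from", "we", "you", "they", "each", "over",
}

def lexical_density(text: str) -> float:
    """Fraction of word tokens that are not function words."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    if not tokens:
        return 0.0
    content_tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content_tokens) / len(tokens)

explanation = "The loop iterates over the list and prints each element in order."
print(f"lexical density: {lexical_density(explanation):.2f}")

# Readability (e.g., a 7th-8th grade level) can be estimated with the
# textstat package, assuming it is installed:
#   import textstat; textstat.flesch_kincaid_grade(explanation)
```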