With their increased capability, AI-based chatbots have become increasingly popular tools to help users answer complex queries. However, these chatbots may hallucinate, or generate incorrect but very plausible-sounding information, more frequently than previously thought. Thus, it is crucial to examine strategies to mitigate human susceptibility to hallucinated output. In a between-subjects experiment, participants completed a difficult quiz with assistance from either a polite or neutral-toned AI chatbot, which occasionally provided hallucinated (incorrect) information. Signal detection analysis revealed that participants interacting with the polite chatbot showed modestly higher sensitivity in detecting hallucinations and a more conservative response bias compared to those interacting with the neutral-toned chatbot. While the observed effect sizes were modest, even small improvements in users’ ability to detect AI hallucinations can have significant consequences, particularly in high-stakes domains or when aggregated across millions of AI interactions.
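For readers unfamiliar with the signal detection measures referenced above, the sketch below illustrates how sensitivity (d′) and response bias (criterion c) are conventionally derived from hit and false-alarm counts on hallucinated versus accurate answers. The function name, the example counts, and the log-linear correction are illustrative assumptions for exposition, not the paper's actual analysis code.

```python
from scipy.stats import norm

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d', c) for one participant.

    A "signal" trial is one where the chatbot hallucinated; a "hit" is a
    correctly flagged hallucination, and a "false alarm" is an accurate
    answer flagged as hallucinated. A log-linear correction (add 0.5 to
    each count) keeps the z-scores finite when a rate is 0 or 1.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_hit - z_fa              # higher = better hallucination detection
    criterion = -0.5 * (z_hit + z_fa)   # positive = more conservative responding
    return d_prime, criterion

# Hypothetical participant: flagged 7 of 10 hallucinated answers and
# falsely flagged 2 of 10 accurate ones.
print(sdt_measures(7, 3, 2, 8))   # roughly (1.22, 0.14)
```

Under this framing, a higher d′ in the polite-chatbot group corresponds to the sensitivity difference reported above, and a larger positive c to the more conservative response bias.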
On a Scale of 1 to 5, How Reliable Are AI User Studies? A Call for Developing Validated, Meaningful Scales and Metrics about User Perceptions of AI Systems
- Award ID(s): 2229885
- PAR ID: 10598097
- Publisher / Repository: 9th Workshop on Technology and Consumer Protection (ConPro ’25)
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Chen, Yan; Mello-Thoms, Claudia R. (Ed.): Tools for computer-aided diagnosis based on deep learning have become increasingly important in the medical field. Such tools can be useful, but they require effective communication of their decision-making process in order to safely and meaningfully guide clinical decisions. We present a user interface that incorporates the IAIA-BL model, which interpretably predicts both mass margin and malignancy for breast lesions. The user interface displays the most relevant aspects of the model’s explanation, including the predicted margin value, the AI confidence in the prediction, and the two most highly activated prototypes for each case. In addition, this user interface includes full-field and cropped images of the region of interest, as well as a questionnaire suitable for a reader study. Our preliminary results indicate that the model increases the readers’ confidence and accuracy in their decisions on margin and malignancy.
- Artificial Intelligence (AI) is an integral part of our daily technology use and will likely be a critical component of emerging technologies. However, negative user preconceptions may hinder the adoption of AI-based decision making. Prior work has highlighted the potential of factors such as transparency and explainability in improving user perceptions of AI. We further contribute to this line of work by demonstrating that bringing the user into the loop through mock model training can improve their perceptions of an AI agent’s capability and their comfort with the possibility of using technology employing that agent.