NSF PAR Search | NSF Public Access Repository

Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis

Chen, Chacha; Liu, Han; Yang, Jiamin; Mervak, Benjamin M; Kalaycioglu, Bora; Lee, Grace; Cakmakli, Emre; Bonatti, Matteo; Pudu, Sridhar; Kahraman, Osman; et al. (June 2025, ACM)

Despite the growing interest in human-AI decision making, experimental studies with domain experts remain rare, largely due to the complexity of working with domain experts and the challenges in setting up realistic experiments. In this work, we conduct an in-depth collaboration with radiologists in prostate cancer diagnosis based on MRI images. Building on existing tools for teaching prostate cancer diagnosis, we develop an interface and conduct two experiments to study how AI assistance and performance feedback shape the decision making of domain experts. In Study 1, clinicians were asked to provide an initial diagnosis (human), then view the AI's prediction, and subsequently finalize their decision (human-AI team). In Study 2 (after a memory wash-out period), the same participants first received aggregated performance statistics from Study 1, specifically their own performance, the AI's performance, and their human-AI team performance, and then directly viewed the AI's prediction before making their diagnosis (i.e., no independent initial diagnosis). These two workflows represent realistic ways that clinical AI tools might be used in practice; the second study simulates a scenario in which doctors can adjust their reliance on and trust in AI based on prior performance feedback. Our findings show that, while human-AI teams consistently outperform humans alone, they still underperform the AI due to under-reliance, similar to prior studies with crowdworkers. Providing clinicians with performance feedback did not significantly improve the performance of human-AI teams, although showing AI decisions in advance nudges people to follow AI more. Meanwhile, we observe that the ensemble of human-AI teams can outperform AI alone, suggesting promising directions for human-AI collaboration.
Free, publicly-accessible full text available June 23, 2026
Hypothesis Generation with Large Language Models

https://doi.org/10.18653/v1/2024.nlp4science-1.10

Zhou, Yangqiaoyu; Liu, Haokun; Srivastava, Tejes; Mei, Hongyuan; Tan, Chenhao (November 2024, Association for Computational Linguistics)

Full Text Available
Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

Kiciman, Emre; Ness, Robert; Sharma, Amit; Tan, Chenhao (October 2024, Transactions on machine learning research)

The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We conduct a "behavioral" study of LLMs to benchmark their capability in generating causal arguments. Across a wide range of tasks, we find that LLMs can generate text corresponding to correct causal arguments with high probability, surpassing the best-performing existing methods. Algorithms based on GPT-3.5 and GPT-4 outperform existing algorithms on a pairwise causal discovery task (97%, 13 points gain), counterfactual reasoning task (92%, 20 points gain) and event causality (86% accuracy in determining necessary and sufficient causes in vignettes). We perform robustness checks across tasks and show that the capabilities cannot be explained by dataset memorization alone, especially since LLMs generalize to novel datasets that were created after the training cutoff date. That said, LLMs exhibit unpredictable failure modes, and we discuss the kinds of errors that may be improved and the fundamental limits of LLM-based answers. Overall, by operating on the text metadata, LLMs bring capabilities so far understood to be restricted to humans, such as using collected knowledge to generate causal graphs or identifying background causal context from natural language. As a result, LLMs may be used by human domain experts to save effort in setting up a causal analysis, one of the biggest impediments to the widespread adoption of causal methods. Given that LLMs ignore the actual data, our results also point to a fruitful research direction of developing algorithms that combine LLMs with existing causal techniques. Code and datasets are available at https://github.com/py-why/pywhy-llm.
Full Text Available
Characterizing Multimodal Long-form Summarization: A Case Study on Financial Reports

Cao, Tianyu; Raman, Natraj; Dervovic, Danial; Tan, Chenhao (October 2024, COLM)

As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports are not only long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Cohere. We find that GPT-3.5 and Cohere fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude recognizes important information. We also conduct a comprehensive investigation on the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4. The generated summaries and evaluation code are available at https://github.com/ChicagoHAI/characterizing-multimodal-long-form-summarization.
Full Text Available
CHIME: LLM-Assisted Hierarchical Organization of Scientific Studies for Literature Review Support

https://doi.org/10.18653/v1/2024.findings-acl.8

Hsu, Chao-Chun; Bransom, Erin; Sparks, Jenna; Kuehl, Bailey; Tan, Chenhao; Wadden, David; Wang, Lucy; Naik, Aakanksha (August 2024, Association for Computational Linguistics)

Full Text Available
Explanation in the Era of Large Language Models

https://doi.org/10.18653/v1/2024.naacl-tutorials.3

Zhu, Zining; Chen, Hanjie; Ye, Xi; Lyu, Qing; Tan, Chenhao; Marasovic, Ana; Wiegreffe, Sarah (June 2024, Association for Computational Linguistics)

Full Text Available
Pragmatic Radiology Report Generation

Nguyen, Dang; Chen, Chacha; He, He; Tan, Chenhao (December 2023, Proceedings of the 3rd Machine Learning for Health Symposium)

When pneumonia is not found on a chest X-ray, should the report describe this negative observation or omit it? We argue that this question cannot be answered from the X-ray alone and requires a pragmatic perspective, which captures the communicative goal that radiology reports serve between radiologists and patients. However, the standard image-to-text formulation for radiology report generation fails to incorporate such pragmatic intents. Following this pragmatic perspective, we demonstrate that the indication, which describes why a patient comes for an X-ray, drives the mentions of negative observations. We thus introduce indications as additional input to report generation. With respect to the output, we develop a framework to identify uninferable information from the image, which could be a source of model hallucinations, and limit them by cleaning groundtruth reports. Finally, we use indications and cleaned groundtruth reports to develop pragmatic models, and show that they outperform existing methods not only in new pragmatics-inspired metrics (e.g., +4.3 Negative F1) but also in standard metrics (e.g., +6.3 Positive F1 and +11.0 BLEU-2).
Full Text Available
Ecologically Valid Explanations for Label Variation in NLI

https://doi.org/10.18653/v1/2023.findings-emnlp.712

Jiang, Nan-Jiang; Tan, Chenhao; de Marneffe, Marie-Catherine (November 2023, Findings of EMNLP)

Full Text Available
Selective Explanations: Leveraging Human Input to Align Explainable AI

https://doi.org/10.1145/3610206

Lai, Vivian; Zhang, Yiming; Chen, Chacha; Liao, Q. Vera; Tan, Chenhao (September 2023, Proceedings of the ACM on Human-Computer Interaction)

While a vast collection of explainable AI (XAI) algorithms has been developed in recent years, they have been criticized for significant gaps with how humans produce and consume explanations. As a result, current XAI techniques are often found to be hard to use and lack effectiveness. In this work, we attempt to close these gaps by making AI explanations selective (a fundamental property of human explanations), presenting only the subset of model reasoning that aligns with the recipient's preferences. We propose a general framework for generating selective explanations by leveraging human input on a small dataset. This framework opens up a rich design space that accounts for different selectivity goals, types of input, and more. As a showcase, we use a decision-support task to explore selective explanations based on what the decision-maker would consider relevant to the decision task. We conducted two experimental studies to examine three paradigms based on our proposed framework: in Study 1, we ask the participants to provide critique-based or open-ended input to generate selective explanations (self-input). In Study 2, we show the participants selective explanations based on input from a panel of similar users (annotator input). Our experiments demonstrate the promise of selective explanations in reducing over-reliance on AI and improving collaborative decision making and subjective perceptions of the AI system, but also paint a nuanced picture that attributes some of these positive effects to the opportunity to provide one's own input to augment AI explanations. Overall, our work proposes a novel XAI framework inspired by human communication behaviors and demonstrates its potential to encourage future work to make AI explanations more human-compatible.
Full Text Available
Learning to Ignore Adversarial Attacks

https://doi.org/10.18653/v1/2023.eacl-main.216

Zhang, Yiming; Zhou, Yangqiaoyu; Carton, Samuel; Tan, Chenhao (September 2023, Association for Computational Linguistics)

Full Text Available
