NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Generative AI enables medical image segmentation in ultra low-data regimes

https://doi.org/10.1038/s41467-025-61754-6

Zhang, Li; Jindal, Basu; Alaa, Ahmed; Weinreb, Robert; Wilson, David; Segal, Eran; Zou, James; Xie, Pengtao (July 2025, Nature Communications)

Semantic segmentation of medical images is pivotal in applications like disease diagnosis and treatment planning. While deep learning automates this task effectively, it struggles in ultra low-data regimes for the scarcity of annotated segmentation masks. To address this, we propose a generative deep learning framework that produces high-quality image-mask pairs as auxiliary training data. Unlike traditional generative models that separate data generation from model training, ours uses multi-level optimization for end-to-end data generation. This allows segmentation performance to guide the generation process, producing data tailored to improve segmentation outcomes. Our method demonstrates strong generalization across 11 medical image segmentation tasks and 19 datasets, covering various diseases, organs, and modalities. It improves performance by 10–20% (absolute) in both same- and out-of-domain settings and requires 8–20 times less training data than existing approaches. This greatly enhances the feasibility and cost-effectiveness of deep learning in data-limited medical imaging scenarios.
more » « less
Full Text Available
CollabLLM: From Passive Responders to Active Collaborators

Wu, Shirley; Galley, Michel; Peng, Baolin; Cheng, Hao; Li, Gavin; Dou, Yao; Cai, Weixin; Zou, James; Leskovec, Jure; Gao, Jianfeng (July 2025, International Conference on Machine Learning)

Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce COLLABLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responses using Multiturn-aware Rewards. By reinforcement fine-tuning these rewards, COLLABLLM goes beyond responding to user requests, and actively uncovers user intent and offers insightful suggestions—a key step towards more humancentered AI. We also devise a multiturn interaction benchmark with three challenging tasks such as document creation. COLLABLLM significantly outperforms our baselines with averages of 18.5% higher task performance and 46.3% improved interactivity by LLM judges. Finally, we conduct a large user study with 201 judges, where COLLABLLM increases user satisfaction by 17.6% and reduces user spent time by 10.4%.
more » « less
Full Text Available
Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

Wang, Jiachen T; Yang, Tianji; Zou, James; Kwon, Yongchan; Jia, Ruoxi (February 2025, Proceedings of the 41st International Conference on Machine Learning (ICML 2024))

Full Text Available
Avatar: Optimizing llm agents for tool usage via contrastive reasoning

Wu, Shirley; Zhao, Shiyu; Huang, Qian; Huang, Kexin; Yasunaga, Michihiro; Cao, Kaidi; Ioannidis, Vassilis N; Subbian, Karthik; Leskovec, Jure; Zou, James (December 2024, Advances in neural information processing systems)

Full Text Available
STARK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

Wu, Shirley; Zhao, Shiyu; Yasunaga, Michihiro; Huang, Kexin; Cao, Kaidi; Huang, Qian; Ioannidis, Vassilis N; Subbian, Karthik; Zou, James; Leskovec, Jure (December 2024, Advances in neural information processing systems)

Full Text Available
Learning and Forgetting Unsafe Examples in Large Language Models

Zhao, Jiachen; Deng, Zhun; Madras, David; Zou, James; Ren, Mengye (July 2024, International Conference on Machine Learning (ICML))

As the number of large language models (LLMs) released to the public grows, there is a pressing need to understand the safety implications associated with these models learning from third-party custom finetuning data. We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from the discrepancies in forgetting, we introduce the “ForgetFilter” algorithm, which filters unsafe data based on how strong the model’s forgetting signal is for that data. We demonstrate that the ForgetFilter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. ForgetFilter outperforms alternative strategies like replay and moral self-correction in curbing LLMs’ ability to assimilate unsafe content during custom finetuning, e.g. 75% lower than not applying any safety measures and 62% lower than using self-correction in toxicity score.
more » « less
Full Text Available
Simple linear attention language models balance the recall-throughput tradeoff

Arora, Simran; Eyuboglu, Sabri; Zhang, Michael; Timalsina, Aman; Alberti, Silas; Zou, James; Rudra, Atri; Re, Christopher (July 2024, Proceedings of the 41st International Conference on Machine Learning)

Full Text Available
Zoology: Measuring and Improving Recall in Efficient Language Models

Arora, Simran; Eyuboglu, Sabri; Timalsina, Aman; Johnson, Isys; Poli, Michael; Zou, James; Rudra, Atri; Ré, Christopher (May 2024, Proceedings of 12th International Conference on Learning Representations (ICLR))

Full Text Available
Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews.

Liang, Weixin; Izzo, Zachary; Zhang, Yaohui; Lepp, Haley; Cao, Hancheng; Zhao, Xuandong; Chen, Lingjiao; Ye, Haotian; Liu, Sheng; Huang, Zhi; et al (July 2024, ICML conference.)

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
more » « less
Full Text Available
Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis

https://doi.org/10.1056/AIoa2400196

Liang, Weixin; Zhang, Yuhui; Cao, Hancheng; Wang, Binglu; Ding, Daisy Yi; Yang, Xinyu; Vodrahalli, Kailas; He, Siyu; Smith, Daniel Scott; Yin, Yian; et al (July 2024, NEJM AI)

BACKGROUND Expert feedback lays the foundation of rigorous research. However, the rapid growth of scholarly production challenges the conventional scienti c feedback mechanisms. High-quality peer reviews are increasingly dif cult to obtain. METHODS We created an automated pipeline using Generative Pretrained Transformer 4 (GPT-4) to provide comments on scienti c papers. We evaluated the quality of GPT-4’s feedback through two large-scale studies. We rst quantitatively compared GPT-4’s gen- erated feedback with human peer reviewers’ feedback in general scienti c papers from 15 Nature family journals (3096 papers in total) and the International Conference on Learning Representations (ICLR) machine learning conference (1709 papers). To speci - cally assess GPT-4’s performance on biomedical papers, we also analyzed a subset of 425 health sciences papers from the Nature portfolio and a random sample of 666 sub- missions to eLife. Additionally, we conducted a prospective user study with 308 research- ers from 110 institutions in the elds of arti cial intelligence and computational biology to understand how researchers perceive feedback generated by our system on their own papers. RESULTS The overlap in the points raised by GPT-4 and by human reviewers (average overlap of 30.85% for Nature journals and 39.23% for ICLR) is comparable with the over- lap between two human reviewers (average overlap of 28.58% for Nature journals and 35.25% for ICLR). Results on eLife and a subset of health sciences papers as categorized by the Nature portfolio show similar patterns. In our prospective user study, more than half (57.4%) of the users found GPT-4–generated feedback helpful/very helpful, and 82.4% found it more bene cial than feedback from at least some human reviewers. We also identify several limitations of large language model (LLM)–generated feedback. CONCLUSIONS Through both retrospective and prospec- tive evaluation, we nd substantial overlap between LLM and human feedback as well as positive user perceptions regarding the usefulness of LLM feedback. Although human expert review should continue to be the foundation of the scienti c process, LLM feedback could bene t researchers, especially when timely expert feedback is not available and in earlier stages of manuscript preparation. (Funded by the Chan–Zuckerberg Initiative and the Stanford Interdisciplin- ary Graduate Fellowship.)
more » « less
Full Text Available

« Prev Next »

Search for: All records