Improving Data Efficiency via Curating LLM-Driven Rating Systems

Pang, Jinlong; Wei, Jiaheng; Shah, Ankit; Zhu, Zhaowei; Wang, Yaxuan; Qian, Chen; Liu, Yang; Bao, Yujia; Wei, Wei

Citation Details

This content will become publicly available on April 24, 2026

Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less."

Award ID(s): 2143895

PAR ID: 10630293

Author(s) / Creator(s): Pang, Jinlong; Wei, Jiaheng; Shah, Ankit; Zhu, Zhaowei; Wang, Yaxuan; Qian, Chen; Liu, Yang; Bao, Yujia; Wei, Wei

Publisher / Repository: The Thirteenth International Conference on Learning Representations

Date Published: 2025-04-24

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Proceeding:
The DOI is not currently available.
