DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

Patel, Ajay; Raffel, Colin; Callison-Burch, Chris

doi:10.18653/v1/2024.acl-long.208

Citation Details

DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

Large language models (LLMs) have become a dominant and important tool for NLP researchers in a wide range of tasks. Today, many researchers use LLMs in synthetic data generation, task evaluation, fine-tuning, distillation, and other model-in-the-loop research workflows. However, challenges arise when using these models that stem from their scale, their closed source nature, and the lack of standardized tooling for these new and emerging workflows. The rapid rise to prominence of these models and these unique challenges has had immediate adverse impacts on open science and on the reproducibility of work that uses them. In this ACL 2024 theme track paper, we introduce DataDreamer, an open source Python library that allows researchers to write simple code to implement powerful LLM workflows. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science and reproducibility. The library and documentation are available at: https://github.com/datadreamer-dev/DataDreamer. more »

Award ID(s):: 1928474

PAR ID:: 10563514

Author(s) / Creator(s):: Patel, Ajay; Raffel, Colin; Callison-Burch, Chris

Publisher / Repository:: Association for Computational Linguistics

Date Published:: 2024-01-01

Page Range / eLocation ID:: 3781 to 3799

Subject(s) / Keyword(s):: LLMs prompting frameworks synthetic data

Format(s):: Medium: X

Location:: Bangkok, Thailand

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Paper:
https://doi.org/10.18653/v1/2024.acl-long.208

More Like this