The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators

Huang, Tzu-Heng; Cao, Catherine; Bhargava, Vaishnavi; Sala, Frederic


Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling the distillation of generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, matches or exceeds the performance of large language model-based annotation across a range of tasks at a fraction of the cost: on average, performance improves by 12.9%, while the total labeling cost across all datasets is reduced by a factor of approximately 500×. We release our code here: https://github.com/SprocketLab/Alchemist
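The idea of storing model-generated labeling programs and applying them locally can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not taken from the paper or its repository: the spam/ham task, the three program bodies (hand-written stand-ins for what a model might generate), the `ABSTAIN` convention, and the use of simple majority voting to combine program outputs.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN, HAM, SPAM = -1, 0, 1


# Stand-ins for programs a pretrained model might be asked to write.
# Each maps one example to a label, or abstains when it has no opinion.
def lf_contains_link(text: str) -> int:
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_promo_words(text: str) -> int:
    lowered = text.lower()
    return SPAM if any(w in lowered for w in ("free", "winner", "click here")) else ABSTAIN


def lf_short_reply(text: str) -> int:
    return HAM if len(text.split()) <= 4 and "http" not in text else ABSTAIN


PROGRAMS = [lf_contains_link, lf_promo_words, lf_short_reply]


def majority_vote(text: str, programs=PROGRAMS) -> int:
    """Combine program outputs by majority vote; abstain on ties or no votes.

    Majority voting is an illustrative choice of aggregation, assumed here.
    """
    votes = [v for v in (p(text) for p in programs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_dataset(texts):
    """Label every example locally, with no per-example API call."""
    return [majority_vote(t) for t in texts]
```

In this sketch the model would be queried once per program rather than once per example, so the programs can be inspected, edited, and re-run on new data without further API calls.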

Award ID(s): 2106707

PAR ID: 10595224

Author(s) / Creator(s): Huang, Tzu-Heng; Cao, Catherine; Bhargava, Vaishnavi; Sala, Frederic

Publisher / Repository: NeurIPS

Date Published: 2024-12-10

Format(s): Medium: X

Location: Conference on Neural Information Processing Systems (NeurIPS)

Sponsoring Org: National Science Foundation

Conference Paper: The DOI is not currently available.
