Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach

Li, Dongyue; Zhang, Ziniu; Wang, Lu; Zhang, Hongyang R

This content will become publicly available on November 16, 2025

We study the problem of fine-tuning a language model (LM) for a target task by optimally using the information from n auxiliary tasks. This problem has broad applications in NLP, such as targeted instruction tuning and data selection in chain-of-thought fine-tuning. The key challenge is that not all auxiliary tasks help improve performance on the target task, so choosing the right subset of auxiliary tasks is crucial. Conventional subset selection methods, such as forward and backward selection, are unsuitable for LM fine-tuning because they require repeated training on subsets of auxiliary tasks. This paper introduces a new algorithm to estimate model fine-tuning performance without repeated training. Our algorithm first performs multitask training using the data of all the tasks to obtain a meta initialization. Then, we approximate the model fine-tuning loss of a subset using functional values and gradients from the meta initialization. Empirically, we find that this gradient-based approximation holds with remarkable accuracy for twelve transformer-based LMs. Thus, we can now estimate fine-tuning performance on CPUs within a few seconds. We conduct extensive experiments to validate our approach, delivering a speedup of 30× over conventional subset selection while incurring only 1% error relative to the true fine-tuning performance. In downstream evaluations of instruction tuning and chain-of-thought fine-tuning, our approach improves over prior methods that utilize gradient or representation similarity for subset selection by up to 3.8%.
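The general idea of a first-order approximation can be illustrated with a small sketch. It is not the authors' implementation. It only shows how a loss near an initialization θ0 can be estimated from the value and gradient at θ0, via L(θ0 + δ) ≈ L(θ0) + ∇L(θ0)·δ, without retraining. The function names, the toy gradients, and the rule that a subset moves the weights one step along its negative mean gradient are all assumptions made for illustration.

```python
import numpy as np


def first_order_estimate(loss0, grad0, delta):
    """First-order Taylor estimate of the loss at theta0 + delta,
    using only the loss value and gradient measured at theta0."""
    return float(loss0 + np.dot(grad0, delta))


def estimate_subset_losses(target_loss0, target_grad0, task_grads, subsets, lr=0.5):
    """Estimate the target-task loss after fine-tuning on each subset.

    Assumption (hypothetical, not taken from the paper): fine-tuning on a
    subset moves the weights by one step of size `lr` along the negative
    mean gradient of the tasks in that subset.
    """
    estimates = {}
    for subset in subsets:
        mean_grad = np.mean([task_grads[i] for i in subset], axis=0)
        delta = -lr * mean_grad
        estimates[tuple(subset)] = first_order_estimate(
            target_loss0, target_grad0, delta
        )
    return estimates


# Toy, made-up numbers: a 2-parameter model and 3 auxiliary tasks.
target_loss0 = 1.0                      # target loss at the initialization
target_grad0 = np.array([1.0, 0.0])     # target gradient at the initialization
task_grads = np.array([
    [1.0, 0.0],    # task 0: aligned with the target gradient
    [-1.0, 0.0],   # task 1: opposed to the target gradient
    [0.0, 1.0],    # task 2: orthogonal to the target gradient
])
subsets = [(0,), (1,), (0, 2)]

estimates = estimate_subset_losses(target_loss0, target_grad0, task_grads, subsets)
best_subset = min(estimates, key=estimates.get)
print(estimates, best_subset)
```

In this toy setting, every candidate subset is scored with a few dot products rather than a training run, which is why estimates of this kind are cheap enough to compute on a CPU.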

Award ID(s): 2412008

PAR ID: 10595812

Author(s) / Creator(s): Li, Dongyue; Zhang, Ziniu; Wang, Lu; Zhang, Hongyang R

Publisher / Repository: Association for Computational Linguistics

Date Published: 2024-11-16

Journal Name: Findings of the Association for Computational Linguistics: EMNLP 2024

ISSN: 0736-587X

Page Range / eLocation ID: 5608-5623

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Journal Article: The DOI is not currently available.
