
This content will become publicly available on January 20, 2026

Title: Improving CLIP Counting Accuracy via Parameter-Efficient Fine-Tuning
We focus on addressing the object counting limitations of vision-language models, with a particular emphasis on Contrastive Language-Image Pre-training (CLIP) models. Centered on our hypothesis that counting knowledge can be abstracted into linear vectors within the text embedding space, we develop a parameter-efficient fine-tuning method and several zero-shot methods to improve CLIP's counting accuracy. Through comprehensive experiments, we demonstrate that our learning-based method not only outperforms full-model fine-tuning in counting accuracy but also retains the broad capabilities of pre-trained CLIP models. Our zero-shot text embedding editing techniques are also effective in situations where training data is scarce, and can be extended to improve Stable Diffusion's ability to generate images with precise object counts. We also contribute two specialized datasets to train and evaluate CLIP's counting capabilities. Our code is available at https://github.com/UW-Madison-Lee-Lab/CLIP_Counting.
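The following is a minimal sketch of the zero-shot idea described in the abstract: scoring count-specific prompts against an image with an off-the-shelf CLIP model, plus a simple linear edit of the text embeddings along an estimated "counting direction". It is not the authors' released code; the model name, prompt templates, file names, and edit strength are assumptions for illustration.

```python
# Hedged sketch: zero-shot counting with CLIP via count-specific prompts,
# plus an illustrative linear edit of the text embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

counts = ["two", "three", "four", "five", "six"]
prompts = [f"a photo of {c} apples" for c in counts]  # assumed template

image = Image.open("apples.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Normalize and score each count prompt against the image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(0)
print("predicted count:", counts[scores.argmax().item()])

# Illustrative linear edit: shift every count prompt along the average
# "add one object" direction estimated from neighboring count pairs.
count_dir = (text_emb[1:] - text_emb[:-1]).mean(dim=0)
edited = text_emb + 0.5 * count_dir  # 0.5 is an arbitrary strength
edited = edited / edited.norm(dim=-1, keepdim=True)
```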
Award ID(s):
2339978
PAR ID:
10595988
Author(s) / Creator(s):
; ;
Publisher / Repository:
Journal of Machine Learning Research, Inc. / OpenReview
Date Published:
Journal Name:
Transactions on Machine Learning Research
ISSN:
2835-8856
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Transfer learning using ImageNet pre-trained models has been the de facto approach in a wide range of computer vision tasks. However, fine-tuning still requires task-specific training data. In this paper, we propose N3 (Neural Networks from Natural Language) - a new paradigm of synthesizing task-specific neural networks from language descriptions and a generic pre-trained model. N3 leverages language descriptions to generate parameter adaptations as well as a new task-specific classification layer for a pre-trained neural network, effectively “fine-tuning” the network for a new task using only language descriptions as input. To the best of our knowledge, N3 is the first method to synthesize entire neural networks from natural language. Experimental results show that N3 can outperform previous natural-language-based zero-shot learning methods across 4 different zero-shot image classification benchmarks. We also demonstrate a simple method to help identify keywords in language descriptions leveraged by N3 when synthesizing model parameters.
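As a rough sketch of the idea in this entry (not the N3 implementation), a tiny hypernetwork can map class-description embeddings to the weights of a new classification layer on top of a frozen image encoder. The encoder, text embedder, and all dimensions below are assumptions.

```python
# Hedged sketch: synthesize a task-specific classifier from description embeddings.
import torch
import torch.nn as nn

class DescriptionToClassifier(nn.Module):
    def __init__(self, text_dim=768, feat_dim=2048, hidden=512):
        super().__init__()
        # Maps one class-description embedding to one row of classifier weights.
        self.hyper = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, desc_emb, image_feats):
        # desc_emb: (num_classes, text_dim) pooled description embeddings
        # image_feats: (batch, feat_dim) features from a frozen backbone
        weights = self.hyper(desc_emb)   # (num_classes, feat_dim)
        return image_feats @ weights.T   # (batch, num_classes) logits

# Usage with dummy tensors standing in for real encoders:
head = DescriptionToClassifier()
logits = head(torch.randn(5, 768), torch.randn(8, 2048))  # 5 classes, 8 images
```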
  2. Fine-tuning pre-trained language models is a common practice in building NLP models for various tasks, including settings with limited supervision. We argue that under the few-shot setting, formulating fine-tuning closer to the pre-training objective can unleash more of the benefits of pre-trained language models. In this work, we take few-shot named entity recognition (NER) as a pilot study, where existing fine-tuning strategies differ substantially from pre-training. We propose a novel few-shot fine-tuning framework for NER, FFF-NER. Specifically, we introduce three new types of tokens, “is-entity”, “which-type” and “bracket”, so we can formulate NER fine-tuning as (masked) token prediction or generation, depending on the choice of the pre-training objective. In our experiments, we apply FFF-NER to fine-tune both BERT and BART for few-shot NER on several benchmark datasets and observe significant improvements over existing fine-tuning strategies, including sequence labeling, prototype meta-learning, and prompt-based approaches. We further perform a series of ablation studies, showing that few-shot NER performance is strongly correlated with the similarity between fine-tuning and pre-training.
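The sketch below illustrates (without claiming to be the FFF-NER code) how a span decision can be cast as masked token prediction with extra tokens such as "[is-entity]" and "[which-type]", so the fine-tuning objective mirrors BERT's pre-training. The token names, prompt layout, and example sentence are assumptions.

```python
# Hedged sketch: NER as masked token prediction with special decision tokens.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[is-entity]", "[which-type]"]})
model.resize_token_embeddings(len(tokenizer))

sentence = "Steve Jobs founded Apple in California ."
span = "Apple"
prompt = (f"{sentence} {span} [is-entity] {tokenizer.mask_token} "
          f"[which-type] {tokenizer.mask_token}")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Read off the predictions at the two mask positions; these become meaningful
# only after few-shot fine-tuning on labeled spans.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
for pos in mask_positions:
    print(tokenizer.decode([logits[0, pos].argmax().item()]))
```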
  3. Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP’s contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP’s relative improvement over supervised ReC models trained on real images is 8%.
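A minimal sketch of the crop-and-score idea from this entry (not the ReCLIP release): each box proposal is cropped from the image and scored against the referring expression with CLIP, and the highest-scoring box is selected. The image path, box coordinates, and expression are placeholders.

```python
# Hedged sketch: score cropped region proposals against a referring expression.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")                        # hypothetical image
proposals = [(10, 20, 120, 200), (150, 30, 300, 220)]  # hypothetical (x1, y1, x2, y2)
expression = "the person holding a red umbrella"

crops = [image.crop(box) for box in proposals]
inputs = processor(text=[expression], images=crops, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# logits_per_image: (num_crops, 1) similarity of each crop to the expression.
scores = out.logits_per_image.squeeze(-1)
print("selected box:", proposals[scores.argmax().item()])
```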
  4. Few-shot fine-tuning of text-to-image (T2I) generation models enables people to create unique images in their own style using natural language without requiring extensive prompt engineering. However, fine-tuning with only a handful of image-text pairs, as few as one, prevents fine-grained control of style attributes at generation. In this paper, we present FineStyle, a few-shot fine-tuning method that allows enhanced controllability for style-personalized text-to-image generation. To overcome the lack of training data for fine-tuning, we propose a novel concept-oriented data scaling that amplifies the number of image-text pairs, each of which focuses on different concepts (e.g., objects) in the style reference image. We also identify the benefit of parameter-efficient adapter tuning of key and value kernels of cross-attention layers. Extensive experiments show the effectiveness of FineStyle at following fine-grained text prompts and delivering visual quality faithful to the specified style, measured by CLIP scores and human raters.
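In the spirit of the adapter tuning described above, the sketch below attaches LoRA adapters only to the key/value projections of a diffusion UNet's cross-attention layers using the peft library. The model id, rank, and target-module names are assumptions, not the FineStyle configuration.

```python
# Hedged sketch: LoRA on cross-attention key/value projections of a UNet.
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet")  # placeholder checkpoint

# attn2 blocks are the text-conditioned cross-attention layers; only their
# key/value projections receive low-rank adapters, everything else stays frozen.
config = LoraConfig(r=8, lora_alpha=8,
                    target_modules=["attn2.to_k", "attn2.to_v"])
unet = get_peft_model(unet, config)
unet.print_trainable_parameters()  # a small fraction of the full UNet
```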
  5. Ruiz, Francisco; Dy, Jennifer; van de Meent, Jan-Willem (Ed.)
    We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information-efficient lower bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0 mAP absolute gain in performance on Pascal VOC classification, and a +22.1 top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite
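To illustrate how a single negative pair per positive can suffice, here is a sketch of a Jensen-Shannon-style mutual-information lower bound of the kind described above (not the authors' exact loss). The critic is a plain dot product of already-projected embeddings, and all dimensions are placeholders.

```python
# Hedged sketch: JSD-style MI lower bound using one shuffled negative per positive.
import torch
import torch.nn.functional as F

def jsd_mi_loss(image_emb, text_emb):
    # image_emb, text_emb: (batch, dim) aligned pairs; row i matches row i.
    pos = (image_emb * text_emb).sum(dim=-1)                   # positive pairs
    neg = (image_emb * text_emb.roll(1, dims=0)).sum(dim=-1)   # one negative each
    # Minimizing this maximizes the JSD lower bound on mutual information.
    return F.softplus(-pos).mean() + F.softplus(neg).mean()

loss = jsd_mi_loss(torch.randn(32, 512), torch.randn(32, 512))
```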