NSF PAR Search | NSF Public Access Repository

CALVIN: Improved Contextual Video Captioning via Instruction Tuning

https://doi.org/10.52202/079017-2952

Somepalli, Gowthami; Chowdhury, Arkabandhu; Basri, Ronen; Geiping, Jonas; Goldstein, Tom; Jacobs, David (December 2024, Neural Information Processing Systems Foundation, Inc. (NeurIPS))

The recent emergence of powerful Vision-Language models (VLMs) has significantly improved image captioning. Some of these models are extended to caption videos as well. However, their capabilities to understand complex scenes are limited, and the descriptions they provide for scenes tend to be overly verbose and focused on the superficial appearance of objects. Scene descriptions, especially in movies, require a deeper contextual understanding unlike general-purpose video captioning. To address this challenge, we propose a model, CALVIN, a specialized video LLM that leverages previous movie context to generate fully “contextual” scene descriptions. To achieve this, we train our model on a suite of tasks that integrate both image-based question-answering and video captioning within a unified framework, before applying instruction tuning to refine the model’s ability to provide scene captions. Lastly, we observe that our model responds well to prompt engineering and few-shot in-context learning techniques, enabling the user to adapt it to any new movie with very little additional annotation.
Full Text Available
CALVIN: Improved Contextual Video Captioning via Instruction Tuning

Somepalli, Gowthami; Chowdhury, Arkabandhu; Geiping, Jonas; Basri, Ronen; Goldstein, Tom; Jacobs, David W (November 2024, Advances in Neural Information Processing Systems)

Full Text Available
Transformers Can Do Arithmetic with the Right Embeddings

McLeish, Sean; Bansal, Arpit; Stein, Alex; Jain, Neel; Kirchenbauer, John; Bartoldson, Brian R; Kailkhura, Bhavya; Bhatele, Abhinav; Geiping, Jonas; Schwarzschild, Avi; et al (December 2024, ArXiv)

Full Text Available
A Watermark for Large Language Models

Kirchenbauer, John; Geiping, Jonas; Wen, Yuxin; Katz, Jonathan; Miers, Ian; Goldstein, Tom (July 2024, PMLR)

In this paper, Kirchenbauer et al. propose a novel watermarking technique for the text output of large language models (LLMs) such as ChatGPT, with the aim of mitigating the harms associated with the growing use of these technologies. They note that LLMs can write documents, create executable code, and answer questions, often with human-like capabilities. The harms they list include social engineering and election manipulation campaigns that exploit automated bots on social media platforms, the creation of fake news and web content, and the use of AI systems for cheating on academic writing and coding assignments. For policy makers, this technology offers a means to regulate and oversee the use of LLMs in public and social settings where AI-generated text could cause harm, such as those listed by the authors. (Methods and Metrics, watermarking LLM output)
Full Text Available
Coercing LLMs to do and reveal (almost) anything

Geiping, Jonas; Stein, Alex; Shu, Manli; Saifullah, Khalid; Wen, Yuxin; Goldstein, Tom (May 2024, ArXiv)

Full Text Available
Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust

Wen, Yuxin; Kirchenbauer, John; Geiping, Jonas; Goldstein, Tom (December 2023, NeurIPS 2023)

Watermarking the outputs of generative models is a crucial technique for tracing copyright and preventing potential harm from AI-generated content. In this paper, we introduce a novel technique called Tree-Ring Watermarking that robustly fingerprints diffusion model outputs. Unlike existing methods that perform post-hoc modifications to images after sampling, Tree-Ring Watermarking subtly influences the entire sampling process, resulting in a model fingerprint that is invisible to humans. The watermark embeds a pattern into the initial noise vector used for sampling. These patterns are structured in Fourier space so that they are invariant to convolutions, crops, dilations, flips, and rotations. After image generation, the watermark signal is detected by inverting the diffusion process to retrieve the noise vector, which is then checked for the embedded signal. We demonstrate that this technique can be easily applied to arbitrary diffusion models, including text-conditioned Stable Diffusion, as a plug-in with negligible loss in FID. Our watermark is semantically hidden in the image space and is far more robust than watermarking alternatives that are currently deployed. Code is available at https://github.com/YuxinWenRick/tree-ring-watermark.
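The embed-and-detect idea in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that): it omits the diffusion model and the inversion step entirely and only shows a ring-structured key being written into the Fourier transform of an initial noise array and then checked for. The function names, the 64x64 noise size, the ring radius, and the mean-absolute-difference score are all assumptions made for illustration.

```python
import numpy as np

def make_ring_key(size, radius, rng):
    """Hypothetical key: one real value per integer radius, so the pattern is
    constant along concentric rings around the centre of the shifted spectrum."""
    ys, xs = np.indices((size, size))
    centre = size // 2
    dist = np.rint(np.hypot(ys - centre, xs - centre)).astype(int)
    # Scale by `size` so key values have roughly the magnitude of FFT
    # coefficients of unit-variance noise (an assumption of this sketch).
    ring_values = rng.normal(scale=size, size=dist.max() + 1)
    key = ring_values[dist]
    mask = dist <= radius  # low-frequency disc that carries the watermark
    return key, mask

def embed_watermark(noise, key, mask):
    """Overwrite the masked Fourier coefficients of the noise with the key."""
    spectrum = np.fft.fftshift(np.fft.fft2(noise))
    spectrum[mask] = key[mask]
    # The key is real and point-symmetric, so the result is real up to rounding.
    return np.fft.ifft2(np.fft.ifftshift(spectrum)).real

def watermark_distance(noise, key, mask):
    """Mean absolute difference between the masked spectrum and the key.
    Small values suggest the watermark is present."""
    spectrum = np.fft.fftshift(np.fft.fft2(noise))
    return np.abs(spectrum[mask] - key[mask]).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    key, mask = make_ring_key(size=64, radius=10, rng=rng)
    clean = rng.normal(size=(64, 64))
    marked = embed_watermark(clean, key, mask)
    print("distance, unmarked noise:", watermark_distance(clean, key, mask))
    print("distance, marked noise:  ", watermark_distance(marked, key, mask))
```

In the setting the abstract describes, the noise array passed to the detector would be an estimate recovered by inverting the diffusion process from a generated image, so the distance would be compared against a threshold rather than expected to be near zero.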
Full Text Available
Understanding and Mitigating Copying in Diffusion Models

Somepalli, Gowthami; Singla, Vasu; Goldblum, Micah; Geiping, Jonas; Goldstein, Tom (December 2023, NeurIPS 2023)

This paper proposes methods for detecting and mitigating the blatant replication and memorization of data used to train text-to-image generators, especially Stable Diffusion. The potential for diffusion models to reproduce copyrighted or private images without user knowledge poses significant ethical and legal challenges. For lawmakers, this highlights the need for clear guidelines and regulations around the use of such models, especially in commercial applications.
Full Text Available
A Simple and Efficient Baseline for Data Attribution on Images

Singla, Vasu; Sandoval-Segura, Pedro; Goldblum, Micah; Geiping, Jonas; Goldstein, Tom (December 2023, NeurIPS 2023 Workshop ATTRIB)

Data attribution methods play a crucial role in understanding machine learning models, providing insight into which training data points are most responsible for model outputs during deployment. However, current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions. These approaches therefore come at a high computational cost, are memory intensive, and are hard to scale to large models or datasets. In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution. Our method is model-agnostic and scales easily to large datasets. We show results on CIFAR-10 and ImageNet, achieving strong performance that rivals or outperforms state-of-the-art approaches at a fraction of the compute or memory cost. Contrary to prior work, our results reinforce the intuition that a model's prediction on one image is most impacted by visually similar training samples. Our approach serves as a simple and efficient baseline for data attribution on images.
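A feature-space baseline of the kind this abstract describes might look like the sketch below, which ranks training samples by their similarity to a test sample. The abstract does not specify the similarity measure, so the use of cosine similarity here is an assumption, as are the function name and the precomputed feature arrays (in practice these would come from a self-supervised backbone; random vectors stand in for them in the demo).

```python
import numpy as np

def top_k_attributions(train_feats, test_feat, k=5):
    """Return indices and scores of the k training samples whose features are
    most similar (by cosine similarity, an assumption) to the test feature.

    train_feats: array of shape (n_train, dim), hypothetical backbone features
    test_feat:   array of shape (dim,)
    """
    train_unit = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_unit = test_feat / np.linalg.norm(test_feat)
    scores = train_unit @ test_unit      # cosine similarity to every training sample
    order = np.argsort(-scores)[:k]      # indices of the highest scores first
    return order, scores[order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_feats = rng.normal(size=(1000, 128))   # stand-in for real features
    # A test feature that is a slightly perturbed copy of training sample 7.
    test_feat = train_feats[7] + 0.01 * rng.normal(size=128)
    indices, scores = top_k_attributions(train_feats, test_feat, k=3)
    print("most similar training samples:", indices)
```

Because the ranking needs only one feature extraction pass and a matrix-vector product, it avoids training an ensemble of models, which is the cost the abstract contrasts against.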
Full Text Available
Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery

Wen, Yuxin; Jain, Neel; Kirchenbauer, John; Goldblum, Micah; Geiping, Jonas; Goldstein, Tom (December 2023, NeurIPS 2023)

The strength of modern generative models lies in their ability to be controlled through text-based prompts. Typical "hard" prompts are made from interpretable words and tokens, and must be hand-crafted by humans. There are also "soft" prompts, which consist of continuous feature vectors. These can be discovered using powerful optimization methods, but they cannot be easily interpreted, re-used across models, or plugged into a text-based interface. We describe an approach to robustly optimize hard text prompts through efficient gradient-based optimization. Our approach automatically generates hard text-based prompts for both text-to-image and text-to-text applications. In the text-to-image setting, the method creates hard prompts for diffusion models, allowing API users to easily generate, discover, and mix and match image concepts without prior knowledge on how to prompt the model. In the text-to-text setting, we show that hard prompts can be automatically discovered that are effective in tuning LMs for classification.
Full Text Available
Cramming: Training a Language Model on a single GPU in one day.

Geiping, Jonas; Goldstein, Tom (July 2023, PMLR)

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting. We provide code to reproduce all experiments at github.com/JonasGeiping/cramming.
Full Text Available
