Large language models (LLMs) are increasingly integrated into a variety of writing tasks. While these tools can help people by generating ideas or producing higher quality work, like many other AI tools they may risk causing a variety of harms, disproportionately burdening historically marginalized groups. In this work, we introduce and evaluate perceptual harm, a term for the harm caused to users when others perceive or suspect them of using AI. We examined perceptual harms in three online experiments, each of which entailed human participants evaluating the profiles for fictional freelance writers. We asked participants whether they suspected the freelancers of using AI, the quality of their writing, and whether they should be hired. We found some support for perceptual harms against for certain demographic groups, but that perceptions of AI use negatively impacted writing evaluations and hiring outcomes across the board.
more »
« less
A Watermark for Large Language Models
In this paper, Kirchenbauer et. al. use a novel watermarking technology to watermark the output of large language models (LLMs) like ChatGP, which is often in the form of AI-generated text, and mitigate the harms associated with the increasing usage of these technologies. They note some of the capabilities of these LLM models as writing documents, creating executable code, and answering questions, often with human-like capabilities. In addition, they list some of the harms as social engineering and election manipulation campaigns that exploit automated bots on social media platforms, creation of fake news and web content, and use of AI systems for cheating onacademic writing and coding assignments. As for implications for policy makers, this technology can be utilized as a means to regulate and oversee the use of these LLMs on all public and social fronts where their AI-generated text output could pose a potential harm, such as those listed by the authors. (Methods and Metrics, watermarking LLM output)
more »
« less
- Award ID(s):
- 2229885
- PAR ID:
- 10522346
- Publisher / Repository:
- PMLR
- Date Published:
- Format(s):
- Medium: X
- Location:
- Honolulu, HI
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Text watermarks for large language models (LLMs) have been commonly used to identify the origins of machine-generated content, which is promising for assessing liability when combating deepfake or harmful content. While existing watermarking techniques typically prioritize robustness against removal attacks, unfortunately, they are vulnerable to spoofing attacks: malicious actors can subtly alter the meanings of LLM-generated responses or even forge harmful content, potentially misattributing blame to the LLM developer. To overcome this, we introduce a bi-level signature scheme, Bileve, which embeds fine-grained signature bits for integrity checks (mitigating spoofing attacks) as well as a coarse-grained signal to trace text sources when the signature is invalid (enhancing detectability) via a novel rank-based sampling strategy. Compared to conventional watermark detectors that only output binary results, Bileve can differentiate 5 scenarios during detection, reliably tracing text provenance and regulating LLMs. The experiments conducted on OPT-1.3B and LLaMA-7B demonstrate the effectiveness of Bileve in defeating spoofing attacks with enhanced detectability.more » « less
-
Zhang, Zheng (Ed.)The integrity of peer review is fundamental to scientific progress, but the rise of large language models (LLMs) has introduced concerns that some reviewers may rely on these tools to generate reviews rather than writing them independently. Although some venues have banned LLM-assisted reviewing, enforcement remains difficult as existing detection tools cannot reliably distinguish between fully generated reviews and those merely polished with AI assistance. In this work, we address the challenge of detecting LLM-generated reviews. We consider the approach of performing indirect prompt injection via the paper’s PDF, prompting the LLM to embed a covert watermark in the generated review, and subsequently testing for presence of the watermark in the review. We identify and address several pitfalls in naïve implementations of this approach. Our primary contribution is a rigorous watermarking and detection framework that offers strong statistical guarantees. Specifically, we introduce watermarking schemes and hypothesis tests that control the family-wise error rate across multiple reviews, achieving higher statistical power than standard corrections such as Bonferroni, while making no assumptions about the nature of human-written reviews. We explore multiple indirect prompt injection strategies–including font-based embedding and obfuscated prompts–and evaluate their effectiveness under various reviewer defense scenarios. Our experiments find high success rates in watermark embedding across various LLMs. We also empirically find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice. In contrast, we find that Bonferroni-style corrections are too conservative to be useful in this setting.more » « less
-
The capabilities of large language models have grown significantly in recent years and so too have concerns about their misuse. It is important to be able to distinguish machine-generated text from human-authored content. Prior works have proposed numerous schemes to watermark text, which would benefit from a systematic evaluation framework. This work focuses on LLM output watermarking techniques—as opposed to image or model watermarks—and proposes MARKMYWORDS, a comprehensive benchmark for them under different natural language tasks. We focus on three main metrics: quality, size (i.e., the number of tokens needed to detect a watermark), and tamper resistance (i.e., the ability to detect a watermark after perturbing marked text). Current watermarking techniques are nearly practical enough for real-world use: Kirchenbauer et al.'s scheme can watermark models like Llama 2 7B-chat or Mistral-7B-Instruct with no perceivable loss in quality on natural language tasks, the watermark can be detected with fewer than 100 tokens, and their scheme offers good tamper resistance to simple perturbations. However, they struggle to efficiently watermark code generations. We publicly release our benchmark (https://github.com/wagner-group/MarkMyWords).more » « less
-
The powerful capabilities of LLMs stem from their rich training data and high-quality labeled datasets, making the training of strong LLMs a resource-intensive process, which elevates the importance of IP protection for such LLMs. Compared to gathering high-quality labeled data, directly sampling outputs from these fully trained LLMs as training data presents a more cost-effective approach. This practice—where a suspect model is fine-tuned using high-quality data derived from these LLMs, thereby gaining capabilities similar to the target model—can be seen as a form of IP infringement against the original LLM. In recent years, LLM watermarks have been proposed and used to detect whether a text is AI-generated. Intuitively, if data sampled from a watermarked LLM is used for training, the resulting model would also be influenced by this watermark. This raises the question: can we directly use such watermarks to detect IP infringement of LLMs? In this paper, we explore the potential of LLM watermarks for detecting model infringement. We find that there are two issues with direct detection: (1) The queries used to sample output from the suspect LLM have a significant impact on detectability. (2) The watermark that is easily learned by LLMs exhibits instability regarding the watermark's hash key during detection. To address these issues, we propose LIDet, a detection method that leverages available anchor LLMs to select suitable queries for sampling from the suspect LLM. Additionally, it adapts the detection threshold to mitigate detection failures caused by different hash keys. To demonstrate the effectiveness of this approach, we construct a challenging model set containing multiple suspect LLMs on which direct detection methods struggle to yield effective results. Our method achieves over 90% accuracy in distinguishing between infringing and clean models, demonstrating the feasibility of using LLM watermarks to detect LLM IP infringement.more » « less
An official website of the United States government

