NSF PAR Search | NSF Public Access Repository

Sampling from Your Language Model One Byte at a Time

Hayase, Jonathan; Liu, Alisa; Smith, Noah A; Oh, Sewoong (July 2025, cs.CL)

Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect languages such as Chinese as well as code generation, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals. Code is available at this https URL.
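The trailing-space example in the abstract can be reproduced with a toy tokenizer. The sketch below is not the authors' method and does not use any real model's vocabulary: `TOY_MERGES` is a hypothetical merge list, and `bpe_encode` is a minimal BPE encoder without pre-tokenization. It only shows why a prompt ending in a space can tokenize into a sequence that is not a prefix of the tokenization of the intended full text.

```python
# Hypothetical merge list, chosen so that "Hello" and " world" become single tokens.
TOY_MERGES = [
    ("l", "l"), ("H", "e"), ("He", "ll"), ("Hell", "o"),
    (" ", "w"), (" w", "o"), (" wo", "r"), (" wor", "l"), (" worl", "d"),
]


def bpe_encode(text, merges):
    """Encode text by applying each merge rule, in learned order, left to right."""
    tokens = list(text)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def is_token_prefix(prefix, full):
    """Return True if the token list `prefix` is a prefix of the token list `full`."""
    return full[: len(prefix)] == prefix


full_tokens = bpe_encode("Hello world", TOY_MERGES)   # ['Hello', ' world']
good_prompt = bpe_encode("Hello", TOY_MERGES)         # ['Hello']
bad_prompt = bpe_encode("Hello ", TOY_MERGES)         # ['Hello', ' ']

print(is_token_prefix(good_prompt, full_tokens))  # True
print(is_token_prefix(bad_prompt, full_tokens))   # False
```

With the trailing space, the prompt already commits to a standalone `' '` token, so a token-level model can no longer emit `' world'` as its next token even though the text "Hello world" is the intended continuation.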
Free, publicly-accessible full text available July 11, 2026
Constructions of locally recoverable codes with large availability

https://doi.org/10.1007/s10623-025-01624-w

Micheli, Giacomo; Pallozzi Lavorante, Vincenzo; Shukul, Abhi; Smith, Noah (April 2025, Designs, Codes and Cryptography)

Free, publicly-accessible full text available April 5, 2026
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

https://doi.org/10.48550/arXiv.2407.16607

Hayase, Jonathan; Liu, Alisa; Choi, Yejin; Oh, Sewoong; Smith, Noah A (November 2024, arXiv)

The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o and Mistral NeMo's tokenizers are much more multilingual than their predecessors, training on 39% and 47% non-English language data, respectively; Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
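The key insight in the abstract, that the order of BPE merge rules reflects pair frequencies in the training data, can be seen in a toy setting. The sketch below is not the authors' linear program; `learn_bpe_merges` is a hypothetical, minimal BPE trainer over a list of words, with ties broken lexicographically for determinism (an assumption made here, not taken from the source). It shows only that changing the mixture changes which merge is learned first.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy trainer)."""
    corpus = Counter(tuple(word) for word in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; break ties lexicographically (assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


# Two toy "mixtures" of an English-like word and a code-like word.
mostly_english = ["the"] * 3 + ["def"] * 1
mostly_code = ["the"] * 1 + ["def"] * 3

print(learn_bpe_merges(mostly_english, 2))  # [('t', 'h'), ('th', 'e')]
print(learn_bpe_merges(mostly_code, 2))     # [('e', 'f'), ('d', 'ef')]
```

In this toy case the first merges come from whichever category dominates the mixture, which is the kind of signal the paper's approach reads off a real tokenizer's merge list.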
Full Text Available
Tuning Language Models by Proxy

Liu, Alisa; Han, Xiaochuang; Wang, Yizhong; Tsvetkov, Yulia; Choi, Yejin; Smith, Noah A (October 2024, Conference on Language Modeling)

Full Text Available
How Language Model Hallucinations Can Snowball

Zhang, Muru; Press, Ofir; Merrill, William; Liu, Alisa; Smith, Noah (July 2024, International Conference on Machine Learning (ICML) 2024)

A major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying previously generated hallucinations, LMs output false claims that they can separately recognize as incorrect. We construct three question-answering datasets where ChatGPT and GPT-4 often state an incorrect answer and offer an explanation with at least one incorrect claim. Crucially, we find that ChatGPT and GPT-4 can identify 67% and 87% of their own mistakes, respectively. We refer to this phenomenon as hallucination snowballing: an LM over-commits to early mistakes, leading to more mistakes that it otherwise would not make.
Full Text Available
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

Min, Sewon; Gururangan, Suchin; Wallace, Eric; Shi, Weijia; Hajishirzi, Hannaneh; Smith, Noah; Zettlemoyer, Luke (May 2024, ICLR)

Full Text Available
Set the Clock: Temporal Alignment of Pretrained Language Models

https://doi.org/10.18653/v1/2024.findings-acl.892

Zhao, Bowen; Brumbaugh, Zander; Wang, Yizhong; Hajishirzi, Hannaneh; Smith, Noah (January 2024, Association for Computational Linguistics)

Full Text Available
How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

Wang, Yizhong; Ivison, Hamish; Dasigi, Pradeep; Hessel, Jack; Khot, Tushar; Chandu, Khyathi; Wadden, David; MacMillan, Kelsey; Smith, Noah; Beltagy, Iz; et al. (May 2024, NeurIPS)

Full Text Available
Measuring and Narrowing the Compositionality Gap in Language Models

https://doi.org/10.18653/v1/2023.findings-emnlp.378

Press, Ofir; Zhang, Muru; Min, Sewon; Schmidt, Ludwig; Smith, Noah A; Lewis, Mike (December 2023, Findings of the Association for Computational Linguistics: EMNLP 2023)

Full Text Available
What is in my big data?

Elazar, Yanai; Bhagia, Akshita; Magnusson, Ian; Ravichander, Abhilasha; Schwenk, Dustin; Suhr, Alane; Walsh, Pete; Groeneveld, Dirk; Soldaini, Luca; Singh, Sameer; et al. (May 2024, ICLR)

Full Text Available
