This content will become publicly available on October 10, 2026
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
- Award ID(s): 2313028
- PAR ID: 10633745
- Publisher / Repository: 2nd Conference on Language Modeling (COLM 2025)
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
