NSF PAR Search | NSF Public Access Repository

SuperBPE: Space Travel for Language Models

Liu, A; Hayase, J; Hofmann, V; Oh, S; Smith, N A; Choi, Y (April 2025, https://doi.org/10.48550/arXiv.2503.13423)

The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.
Free, publicly-accessible full text available April 14, 2026
Pairing interaction from three-dimensional acoustic plasmon demon modes in Sr2RuO4

Ihm, J; Choi, Y W; Cohen, M L (October 2024, Physical review B)

Full Text Available
Electronic reconstruction in confined SrRuO3 monolayers

https://doi.org/10.1103/PhysRevB.110.235104

Lamichhane, U; Sankhi, B; Kundu, N; Fabbris, G; Choi, Y; Haskel, D; McChesney, J L; Cao, Yue; Li, J; Bisogni, V; et al (December 2024, Physical Review B)

We report the observation of an electronic reconstruction in dimensionally controlled ruthenate heterostructures synthesized by pulsed laser deposition. High structural and electronic quality of superlattices comprised of a single SrRuO3 layer inter-spaced with varying thicknesses of insulating SrTiO3 layers was verified by reflection high energy electron diffraction, atomic force microscopy, x-ray diffraction, reciprocal space mapping, and x-ray absorption spectroscopy. X-ray absorption spectroscopy evidences a confinement-driven evolution of the Ru electronic configuration from the d5L to the d4 state. Significant increases of the spin-orbit coupling are observed in connection with the configuration changes supporting recent works identifying large enhancement of the magnetic anisotropy. The growth of high quality two-dimensional confined ruthenate layers under precisely controlled environments highlights the potential to directly manipulate interlayer coupling and selectively perturb the electronic state in ruthenates in analogy to superconducting Sr2RuO4.
Free, publicly-accessible full text available December 2, 2025
Generative Multi-Physics Models for System Power and Thermal Analysis Using Conditional Generative Adversarial Networks

https://doi.org/10.1109/EPEPS58208.2023.10314864

Kashyap, P.; Cheng, C.; Choi, Y.; Franzon, P. (October 2023, IEEE 32nd Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS))
Exploring magnetic anisotropy and robustness of the $J_{\mathrm{eff}} = 1/2$ state under substantial orthorhombic distortion in $\mathrm{Sr_2IrO_4}$ thin films

https://doi.org/10.1103/PhysRevB.109.104415

Shrestha, S.; Choi, Y.; Krautloher, M.; Zhu, M.; Hwang, J.; Keimer, B.; Seo, A.; Kim, J-W (March 2024, Physical Review B)
Design and Evaluation of Nanoscale Materials with Programmed Responsivity towards Epigenetic Enzymes

https://doi.org/10.1101/2024.03.26.585429

Ray, P.; Sedigh, A.; Confeld, M.; Alhalhooly, L.; Iduoku, K.; Casanola-Martin, G. M.; Pham-The, H.; Rasulev, B.; Choi, Y.; Yang, Z.; et al (March 2024, bioRxiv)

Self-assembled materials capable of modulating their assembly properties in response to specific enzymes play a pivotal role in advancing ‘intelligent’ encapsulation platforms for biotechnological applications. Here, we introduce a previously unreported class of synthetic nanomaterials that programmatically interact with histone deacetylase (HDAC) as the triggering stimulus for disassembly. These nanomaterials consist of co-polypeptides comprising poly(acetyl L-lysine) and poly(ethylene glycol) blocks. Under neutral pH conditions, they self-assemble into particles. However, their stability is compromised upon exposure to HDACs, depending on enzyme concentration and exposure time. Our investigation, utilizing HDAC8 as the model enzyme, revealed that the primary mechanism behind disassembly involves a decrease in amphiphilicity within the block copolymer due to the deacetylation of lysine residues within the particles’ hydrophobic domains. To elucidate the response mechanism, we encapsulated a fluorescent dye within these nanoparticles. Upon incubation with HDAC, the nanoparticle structure collapsed, leading to controlled release of the dye over time. Notably, this release was not triggered by denatured HDAC8, other proteolytic enzymes like trypsin, or the co-presence of HDAC8 and its inhibitor. We further demonstrated the biocompatibility and cellular effects of these materials and conducted a comprehensive computational study to unveil the possible interaction mechanism between enzymes and particles. By drawing parallels to the mechanism of naturally occurring histone proteins, this research represents a pioneering step toward developing functional materials capable of harnessing the activity of epigenetic enzymes such as HDACs.
Full Text Available
Spectroscopic Analysis of Pictor II: a very low metallicity ultra-faint dwarf galaxy bound to the Large Magellanic Cloud

https://doi.org/10.33232/001c.142989

Pace, Andrew B; Li, T S; Ji, A P; Simon, J D; Cerny, W; Senkevich, A M; Drlica-Wagner, A; Bechtol, K; Tan, C Y; Chiti, A; et al (January 2025, The Open Journal of Astrophysics)

We present Magellan/IMACS and Magellan/MIKE spectroscopy of the ultra-faint dwarf (UFD) galaxy Pictor II (Pic II) that is located only 12 kpc from the Large Magellanic Cloud (LMC). From the IMACS spectroscopy, we identify 13 member stars and measure a mean heliocentric velocity of , a velocity dispersion of , a mean metallicity of , and an upper limit on the metallicity dispersion of . We measure detailed elemental abundances for the brightest star, finding $\mathrm{[Fe/H]} = -3.3$, high [$\alpha$/Fe] ratios, and no detectable neutron capture elements, similar to stars in other UFDs. However, this star has an unusually high [Sc/Fe] ratio. The dynamical mass-to-light ratio ($M/L = 760_{-420}^{+910}\,M_{\odot}\,L_{\odot}^{-1}$), size, and chemical abundances confirm that Pic II is a dark matter-dominated dwarf galaxy. We perform detailed orbit modeling of Pic II in a combined Milky Way (MW) and LMC potential and find that Pic II is highly likely to be a long-term LMC satellite. Furthermore, we find that Pic II is likely still bound to the LMC today. Pic II is the seventh LMC-associated UFD and among the most metal-poor UFDs known. We further update the morphological parameters with deeper Dark Energy Camera (DECam) photometry, compute the dark matter properties for dark matter indirect detection searches, verify the extremely low metallicity with narrowband CaHK imaging, and briefly discuss tidal influences of the LMC and MW.
Free, publicly-accessible full text available January 1, 2026
Resonantly Enhanced Electromigration Forces for Adsorbates on Graphene

Choi, Y. W.; Cohen, M. L. (November 2022, Physical review letters)

Full Text Available
Relational abstraction in early childhood: Three cultures and three trajectories.

Carstensen, A.; Kim, M.; Kim, G.; Jin, M.; Kang, M.; Choi, Y.; & Walker C.M. (January 2023, Proceedings of the Annual Conference of the Cognitive Science Society)

Full Text Available
Relational abstraction in early childhood: Three cultures and three trajectories

Carstensen, A.; Kim, M.; Kim, G.; Jin, M.; Kang, M.; Choi, Y.; Walker, C. (January 2023, Proceedings of the Annual Conference of the Cognitive Science Society)

Full Text Available