NSF PAR Search | NSF Public Access Repository

SuperBPE: Space Travel for Language Models

Liu, A; Hayase, J; Hofmann, V; Oh, S; Smith, N A; Choi, Y (April 2025, https://doi.org/10.48550/arXiv.2503.13423)

The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.
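The two-stage curriculum described in the abstract (subwords first, then superwords that bridge whitespace) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the rule of attaching each space to the following word, the tie-breaking, and the merge budgets are all assumptions made for illustration, and the abstract does not specify the real pretokenization rules or the point at which training switches stages.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(chunks, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair, never across chunks."""
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for chunk in chunks:
            counts.update(zip(chunk, chunk[1:]))
        if not counts:
            break  # every chunk is already a single token
        best = max(counts, key=lambda p: (counts[p], p))  # deterministic tie-break (assumed)
        merges.append(best)
        chunks = [merge_pair(chunk, best) for chunk in chunks]
    return merges, chunks

def pretokenize(text):
    """Split on spaces, attaching each space to the word that follows (assumed convention)."""
    words = text.split(" ")
    return [list(words[0])] + [list(" " + w) for w in words[1:]]

def train_toy_superbpe(text, subword_merges, superword_merges):
    # Stage 1: merges are confined to whitespace-delimited chunks, as in ordinary BPE.
    stage1, chunks = learn_merges(pretokenize(text), subword_merges)
    # Stage 2: drop the chunk boundaries so that merges may bridge whitespace.
    flat = [tok for chunk in chunks for tok in chunk]
    stage2, (final,) = learn_merges([flat], superword_merges)
    return stage1 + stage2, final

text = "by the way by the way by the way"
_, bpe_tokens = train_toy_superbpe(text, subword_merges=10, superword_merges=0)
_, super_tokens = train_toy_superbpe(text, subword_merges=10, superword_merges=2)
print(len(bpe_tokens), bpe_tokens)
print(len(super_tokens), super_tokens)
```

On this repeated phrase, the second stage produces tokens that span several words, so the same text is encoded with fewer tokens than with the subword-only vocabulary. The toy example only shows the mechanism; it says nothing about the efficiency or accuracy figures reported in the abstract.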
Free, publicly-accessible full text available April 14, 2026
Learning the closest product state

Bakshi, A; Bostanci, J; Kretschmer, W; Landau, Z; Li, J; Liu, A; O'Donnell, R; Tang, E (March 2025, https://doi.org/10.48550/arXiv.2411.04283)

We study the problem of finding a (pure) product state with optimal fidelity to an unknown n-qubit quantum state ρ, given copies of ρ. This is a basic instance of a fundamental question in quantum learning: is it possible to efficiently learn a simple approximation to an arbitrary state? We give an algorithm which finds a product state with fidelity ε-close to optimal, using N = n^{poly(1/ε)} copies of ρ and poly(N) classical overhead. We further show that estimating the optimal fidelity is NP-hard for error ε = 1/poly(n), showing that the error dependence cannot be significantly improved. For our algorithm, we build a carefully-defined cover over candidate product states, qubit by qubit, and then demonstrate that extending the cover can be reduced to approximate constrained polynomial optimization. For our proof of hardness, we give a formal reduction from polynomial optimization to finding the closest product state. Together, these results demonstrate a fundamental connection between these two seemingly unrelated questions. Building on our general approach, we also develop more efficient algorithms in three simpler settings: when the optimal fidelity exceeds 5/6; when we restrict ourselves to a discrete class of product states; and when we are allowed to output a matrix product state.
Free, publicly-accessible full text available March 31, 2026
Influence of excess silicon on polytype selection during metal-mediated epitaxy of GaN nanowires

https://doi.org/10.1063/5.0210669

Liu, A; Xi, Z; Li, M; Yang, J C; Qi, L; Goldman, R S (July 2024, Applied Physics Letters)

We have examined the origins of polytype selection during metal-mediated molecular-beam epitaxy of GaN nanowires (NWs). High-angle annular dark-field scanning transmission electron microscopy reveals [111]-oriented zinc blende (ZB) NWs and [0001]-oriented wurtzite (WZ) NWs, with Si_xN_y at the interface between individual NWs and the Si (001) substrate. Quantitative energy dispersive x-ray spectroscopy reveals a notably higher Si concentration in ZB NWs (7.0% ± 2.3%) than in WZ NWs (2.3% ± 1.2%). Meanwhile, density functional theory calculations show that incorporation of 8 at. % Si on the Ga sublattice inverts the difference in formation energies between WZ and ZB GaN, such that the ZB polytype of GaN is stabilized. This identification of Si and other ZB polytype stabilizers will enable the development of polytype heterostructures in a wide variety of WZ-preferring compounds.
Full Text Available
Advancing Planar Magnetic Microswimmers: Swimming, Channel Navigation, and Surface Motion

Duygu, Yasin C; Kararsiz, G; Liu, A; Cheang, U K; Leshansky, Alexander M; Kim, Min Jun (June 2024, Korea Robotics Society)

Planar magnetic microswimmers are well-suited for in vivo biomedical applications due to their cost-effective mass production through standard photolithography techniques. The precise control of their motion in diverse environments is a critical aspect of their application. This study demonstrates the control of these swimmers individually and as a swarm, exploring navigation through channels and showcasing their functional capabilities for future biomedical settings. We also introduce the capability of microswimmers for surface motion, complementing their traditional fluid-based propulsion and extending their functionality. Our research reveals that microswimmers with varying magnetization directions exhibit unique trajectory patterns, enabling complex swarm tasks. This study further delves into the behavior of these microswimmers in intricate environments, assessing their adaptability and potential for advanced applications. The findings suggest that these microswimmers could be pivotal in areas such as targeted drug delivery and precision medical procedures, marking significant progress in the biomedical and micro-robotic fields and offering new insights into their control and behavior in diverse environments.
Full Text Available
Addressing the Binning Problem in Calibration Assessment through Scalar Annotations

Jiang, Z; Liu, A; Van Durme, Benjamin (February 2024, Transactions of the Association for Computational Linguistics)

Full Text Available
The Full Landscape of Robust Mean Testing: Sharp Separations between Oblivious and Adaptive Contamination

Canonne, C. L.; Hopkins, S. B.; Li, J.; Liu, A.; Narayanan, S. (November 2023, Foundations of Computer Science (FOCS))

We consider the question of Gaussian mean testing, a fundamental task in high-dimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the information-theoretic and computational landscapes for robust mean testing. In the exponential-time setting, we establish the tight sample complexity of testing $\mathcal{N}(0, I)$ against $\mathcal{N}(\alpha v, I)$, where $\|v\|_2 = 1$, with an $\varepsilon$-fraction of adversarial corruptions, to be $\tilde{\Theta}\left(\max\left(\frac{\sqrt{d}}{\alpha^2}, \frac{d\varepsilon^3}{\alpha^4}, \min\left(\frac{d^{2/3}\varepsilon^{2/3}}{\alpha^{8/3}}, \frac{d\varepsilon}{\alpha^2}\right)\right)\right)$, while the complexity against adaptive adversaries is $\tilde{\Theta}\left(\max\left(\frac{\sqrt{d}}{\alpha^2}, \frac{d\varepsilon^2}{\alpha^4}\right)\right)$, which is strictly worse for a large range of vanishing $\varepsilon, \alpha$. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models. In the polynomial-time setting, we close a gap in the literature by providing a polynomial-time algorithm against adaptive adversaries achieving the above sample complexity $\tilde{\Theta}\left(\max\left(\frac{\sqrt{d}}{\alpha^2}, \frac{d\varepsilon^2}{\alpha^4}\right)\right)$, and a low-degree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the oblivious-adversary setting.
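To get a feel for the separation, the two sample-complexity expressions from the abstract can be evaluated numerically. The sketch below is only an illustration: the function names are hypothetical, constants and polylogarithmic factors hidden by the tilde-Theta notation are ignored, and the first expression is taken to be the oblivious-adversary rate because the abstract contrasts it with the adaptive one.

```python
import math

def oblivious_bound(d, eps, alpha):
    """max( sqrt(d)/alpha^2, d*eps^3/alpha^4, min( d^(2/3)*eps^(2/3)/alpha^(8/3), d*eps/alpha^2 ) )."""
    return max(
        math.sqrt(d) / alpha**2,
        d * eps**3 / alpha**4,
        min(d ** (2 / 3) * eps ** (2 / 3) / alpha ** (8 / 3), d * eps / alpha**2),
    )

def adaptive_bound(d, eps, alpha):
    """max( sqrt(d)/alpha^2, d*eps^2/alpha^4 )."""
    return max(math.sqrt(d) / alpha**2, d * eps**2 / alpha**4)

# Example parameters (arbitrary choices for illustration).
d, eps, alpha = 10**6, 0.01, 0.1
print("oblivious:", oblivious_bound(d, eps, alpha))
print("adaptive: ", adaptive_bound(d, eps, alpha))
```

For these parameters the adaptive expression is larger than the oblivious one, in line with the abstract's statement that adaptive contamination is strictly harder over a range of small ε and α. Because hidden constants and logarithmic factors are dropped, only the relative scaling is meaningful, not the absolute sample counts.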
Full Text Available
Investigating mutual coupling in the hydrogen epoch of reionization array and mitigating its effects on the 21-cm power spectrum

https://doi.org/10.1093/mnras/staf1012

Rath, E; Pascua, R; Josaitis, A T; Ewall-Wice, A; Fagnoni, N; de Lera Acedo, E; Martinot, Z E; Abdurashidova, Z; Adams, T; Aguirre, J E; et al (July 2025, Monthly Notices of the Royal Astronomical Society)

Interferometric experiments designed to detect the highly redshifted 21-cm signal from neutral hydrogen are producing increasingly stringent constraints on the 21-cm power spectrum, but some k-modes remain systematics-dominated. Mutual coupling is a major systematic that must be overcome in order to detect the 21-cm signal, and simulations that reproduce effects seen in the data can guide strategies for mitigating mutual coupling. In this paper, we analyse 12 nights of data from the Hydrogen Epoch of Reionization Array and compare the data against simulations that include a computationally efficient and physically motivated semi-analytic treatment of mutual coupling. We find that simulated coupling features qualitatively agree with coupling features in the data; however, coupling features in the data are brighter than the simulated features, indicating the presence of additional coupling mechanisms not captured by our model. We explore the use of fringe-rate filters as mutual coupling mitigation tools and use our simulations to investigate the effects of mutual coupling on a simulated cosmological 21-cm power spectrum in a ‘worst case’ scenario where the foregrounds are particularly bright. We find that mutual coupling contaminates a large portion of the ‘EoR Window’, and the contamination is several orders-of-magnitude larger than our simulated cosmic signal across a wide range of cosmological Fourier modes. While our fiducial fringe-rate filtering strategy reduces mutual coupling by roughly a factor of 100 in power, a non-negligible amount of coupling cannot be excised with fringe-rate filters, so more sophisticated mitigation strategies are required.
Free, publicly-accessible full text available July 7, 2026
A Multi-level Supervised Contrastive Learning Framework for Low-Resource Natural Language Inference

https://doi.org/10.1109/TASLP.2023.3270771

Li, S.; Hu, X.; Lin, L.; Liu, A.; Wen, L.; Yu, Philip S. (January 2023, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Full Text Available
Inverse Scaling: When Bigger Isn’t Better

McKenzie, IR; Lyzhov, A; Pieler, M; Parrish, A; Mueller, A; Prabhu, A; McLean, E; Kirtland, A; Ross, A; Liu, A; et al (February 2024, Transactions on Machine Learning Research)

Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at https://inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.
Full Text Available
