Search for: All records

Creators/Authors contains: "Choi, Y"


  1. The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall. 
    Free, publicly-accessible full text available April 14, 2026
  2. We report the observation of an electronic reconstruction in dimensionally controlled ruthenate heterostructures synthesized by pulsed laser deposition. The high structural and electronic quality of superlattices composed of a single SrRuO3 layer interspaced with varying thicknesses of insulating SrTiO3 layers was verified by reflection high-energy electron diffraction, atomic force microscopy, x-ray diffraction, reciprocal space mapping, and x-ray absorption spectroscopy. X-ray absorption spectroscopy evidences a confinement-driven evolution of the Ru electronic configuration from the d5L to the d4 state. Significant increases in the spin-orbit coupling are observed in connection with the configuration changes, supporting recent works that identify a large enhancement of the magnetic anisotropy. The growth of high-quality, two-dimensionally confined ruthenate layers under precisely controlled environments highlights the potential to directly manipulate interlayer coupling and selectively perturb the electronic state in ruthenates, in analogy to superconducting Sr2RuO4.
    Free, publicly-accessible full text available December 2, 2025
  3. Self-assembled materials capable of modulating their assembly properties in response to specific enzymes play a pivotal role in advancing ‘intelligent’ encapsulation platforms for biotechnological applications. Here, we introduce a previously unreported class of synthetic nanomaterials that programmatically interact with histone deacetylase (HDAC) as the triggering stimulus for disassembly. These nanomaterials consist of co-polypeptides comprising poly(acetyl L-lysine) and poly(ethylene glycol) blocks. Under neutral pH conditions, they self-assemble into particles. However, their stability is compromised upon exposure to HDACs, depending on enzyme concentration and exposure time. Our investigation, utilizing HDAC8 as the model enzyme, revealed that the primary mechanism behind disassembly involves a decrease in amphiphilicity within the block copolymer due to the deacetylation of lysine residues within the particles’ hydrophobic domains. To elucidate the response mechanism, we encapsulated a fluorescent dye within these nanoparticles. Upon incubation with HDAC, the nanoparticle structure collapsed, leading to controlled release of the dye over time. Notably, this release was not triggered by denatured HDAC8, other proteolytic enzymes like trypsin, or the co-presence of HDAC8 and its inhibitor. We further demonstrated the biocompatibility and cellular effects of these materials and conducted a comprehensive computational study to unveil the possible interaction mechanism between enzymes and particles. By drawing parallels to the mechanism of naturally occurring histone proteins, this research represents a pioneering step toward developing functional materials capable of harnessing the activity of epigenetic enzymes such as HDACs.
  4. We present Magellan/IMACS and Magellan/MIKE spectroscopy of the ultra-faint dwarf (UFD) galaxy Pictor II (Pic II) that is located only 12 kpc from the Large Magellanic Cloud (LMC). From the IMACS spectroscopy, we identify 13 member stars and measure a mean heliocentric velocity of , a velocity dispersion of , a mean metallicity of , and an upper limit on the metallicity dispersion of . We measure detailed elemental abundances for the brightest star, finding [Fe/H] = −3.3, high [α/Fe] ratios, and no detectable neutron-capture elements, similar to stars in other UFDs. However, this star has an unusually high [Sc/Fe] ratio. The dynamical mass-to-light ratio ($M/L = 760^{+910}_{-420}\,M_\odot\,L_\odot^{-1}$), size, and chemical abundances confirm that Pic II is a dark matter-dominated dwarf galaxy. We perform detailed orbit modeling of Pic II in a combined Milky Way (MW) and LMC potential and find that Pic II is highly likely to be a long-term LMC satellite. Furthermore, we find that Pic II is likely still bound to the LMC today. Pic II is the seventh LMC-associated UFD and among the most metal-poor UFDs known. We further update the morphological parameters with deeper Dark Energy Camera (DECam) photometry, compute the dark matter properties for dark matter indirect detection searches, verify the extremely low metallicity with narrowband CaHK imaging, and briefly discuss tidal influences of the LMC and MW.
    Free, publicly-accessible full text available January 1, 2026
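
Illustration for entry 1 (SuperBPE): the abstract describes a pretokenization curriculum in which BPE first learns subwords inside whitespace boundaries and then learns superwords that bridge whitespace. The toy sketch below shows that two-stage idea at the character level. It is not the authors' implementation: the function names (`pretokenize`, `apply_merge`, `learn_merges`, `train_two_stage`, `encode`), the simplified regex, and the tie-breaking rule are all assumptions made for illustration, and a real tokenizer would operate on bytes with a far richer pretokenizer.

```python
import re
from collections import Counter

def pretokenize(text):
    # Assumed simplification: split into words, each keeping one leading space.
    # Runs of whitespace longer than one character are not preserved.
    return re.findall(r" ?\S+", text)

def apply_merge(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(chunks, budget):
    # Standard BPE loop: repeatedly merge the most frequent adjacent pair.
    # Pairs are only counted inside a chunk, never across chunk boundaries.
    merges = []
    for _ in range(budget):
        pairs = Counter()
        for chunk in chunks:
            pairs.update(zip(chunk, chunk[1:]))
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic ties
        merges.append(best)
        chunks = [apply_merge(chunk, best) for chunk in chunks]
    return merges, chunks

def train_two_stage(text, subword_budget, superword_budget):
    # Stage 1: chunks are whitespace-delimited words, so merges stay subwords.
    words = [list(w) for w in pretokenize(text)]
    sub_merges, words = learn_merges(words, subword_budget)
    # Stage 2: drop the word boundaries so merges may bridge whitespace.
    stream = [sym for w in words for sym in w]
    super_merges, _ = learn_merges([stream], superword_budget)
    return sub_merges + super_merges, len(sub_merges)

def encode(text, merges, n_subword):
    # Replay stage-1 merges per word, then later merges over the whole stream.
    words = [list(w) for w in pretokenize(text)]
    for pair in merges[:n_subword]:
        words = [apply_merge(w, pair) for w in words]
    stream = [sym for w in words for sym in w]
    for pair in merges[n_subword:]:
        stream = apply_merge(stream, pair)
    return stream

if __name__ == "__main__":
    corpus = "by the way by the way by the way"
    merges, n_sub = train_two_stage(corpus, 50, 2)
    print(encode(corpus, merges[:n_sub], n_sub))  # subword-only segmentation
    print(encode(corpus, merges, n_sub))          # with superword merges
```

On a repetitive corpus like the one in the usage lines, the second segmentation uses fewer tokens because frequent multi-word sequences collapse into single tokens, which is the encoding-efficiency effect the abstract reports at much larger scale.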