Search for: All records

Creators/Authors contains: "Choi, Y."

  1. The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall. 
  2. We report the observation of an electronic reconstruction in dimensionally controlled ruthenate heterostructures synthesized by pulsed laser deposition. High structural and electronic quality of superlattices comprised of a single SrRuO₃ layer inter-spaced with varying thicknesses of insulating SrTiO₃ layers was verified by reflection high energy electron diffraction, atomic force microscopy, x-ray diffraction, reciprocal space mapping, and x-ray absorption spectroscopy. X-ray absorption spectroscopy evidences a confinement-driven evolution of the Ru electronic configuration from the d⁵L to the d⁴ state. Significant increases of the spin-orbit coupling are observed in connection with the configuration changes supporting recent works identifying large enhancement of the magnetic anisotropy. The growth of high quality two-dimensional confined ruthenate layers under precisely controlled environments highlights the potential to directly manipulate interlayer coupling and selectively perturb the electronic state in ruthenates in analogy to superconducting Sr₂RuO₄.
  3. We present the photometric redshift characterization and calibration for the Dark Energy Camera All Data Everywhere (DECADE) weak lensing dataset: a catalog of 107 million galaxies observed by the Dark Energy Camera (DECam) in the northern Galactic cap. The redshifts are estimated from a combination of wide-field photometry, deep-field photometry with associated redshift estimates, and a transfer function between the wide field and deep field that is estimated using a source injection catalog. We construct four tomographic bins for the galaxy catalog, and estimate the redshift distribution, n(z), within each one using the Self-organizing Map Photo-Z (SOMPZ) methodology. Our estimates include the contributions from sample variance, zeropoint calibration uncertainties, and redshift biases, as quantified for the deep-field dataset. The total uncertainties on the mean redshifts are σ_⟨z⟩ ∼ 0.01. The SOMPZ estimates are then compared to those from the clustering redshift method, obtained by cross-correlating our source galaxies with galaxies in spectroscopic surveys, and are shown to be consistent with each other.
  4. We present the pipeline for the cosmic shear analysis of the Dark Energy Camera All Data Everywhere (DECADE) weak lensing dataset: a catalog consisting of 107 million galaxies observed by the Dark Energy Camera (DECam) in the northern Galactic cap. The catalog derives from a large number of disparate observing programs and is therefore more inhomogeneous across the sky compared to existing lensing surveys. First, we use simulated data-vectors to show the sensitivity of our constraints to different analysis choices in our inference pipeline, including sensitivity to residual systematics. Next we use simulations to validate our covariance modeling for inhomogeneous datasets. Finally, we show that our choices in the end-to-end cosmic shear pipeline are robust against inhomogeneities in the survey, by extracting relative shifts in the cosmology constraints across different subsets of the footprint/catalog and showing they are all consistent within 1σ to 2σ. This is done for forty-six subsets of the data and is carried out in a fully consistent manner: for each subset of the data, we re-derive the photometric redshift estimates, shear calibrations, survey transfer functions, the data vector, measurement covariance, and finally, the cosmological constraints. Our results show that existing analysis methods for weak lensing cosmology can be fairly resilient towards inhomogeneous datasets. This also motivates exploring a wider range of image data for pursuing such cosmological constraints.
  5. The metallicity distribution function (MDF) and internal chemical variations of a galaxy are fundamental to understanding its formation and assembly history. In this work, we analyze photometric metallicities for 3883 stars over 7 half-light radii (r_h) in the Sculptor (Scl) dwarf spheroidal (dSph) galaxy, using new narrowband imaging data from the Mapping the Ancient Galaxy in CaHK (MAGIC) survey conducted with the Dark Energy Camera (DECam) at the 4 m Blanco Telescope. This work demonstrates the scientific potential of MAGIC using the Scl dSph galaxy, one of the most well-studied satellites of the Milky Way. Our sample ranges from [Fe/H] ≈ –4.0 to [Fe/H] ≈ –0.6, includes six new extremely metal-poor candidates ([Fe/H] ≤ –3.0), and is almost 3 times larger than the largest spectroscopic metallicity data set in the Scl dSph. Our spatially unbiased sample of metallicities provides a more accurate representation of the MDF, revealing a more metal-rich peak than observed in the most recent spectroscopic sample. It also reveals a break in the metallicity gradient, with a strong change in the slope: from −3.26 ± 0.18 dex deg⁻¹ for stars inside ∼1 r_h to −0.55 ± 0.26 dex deg⁻¹ for the outer part of the Scl dSph. Our study demonstrates that combining photometric metallicity analysis with the wide field of view of DECam offers an efficient and unbiased approach for studying the stellar populations of dwarf galaxies in the Local Group.
  6. We present the Dark Energy Camera All Data Everywhere (DECADE) weak lensing dataset: a catalog of 107 million galaxies observed by the Dark Energy Camera (DECam) in the northern Galactic cap. This catalog was assembled from public DECam data including survey and standard observing programs. These data were consistently processed with the Dark Energy Survey Data Management pipeline as part of the DECADE campaign and serve as the basis of the DECam Local Volume Exploration survey (DELVE) Early Data Release 3 (EDR3). We apply the Metacalibration measurement algorithm to generate and calibrate galaxy shapes. After cuts, the resulting cosmology-ready galaxy shape catalog covers a region of 5,412 deg² with an effective number density of 4.59 arcmin⁻². The coadd images used to derive this data have a median limiting magnitude of r = 23.6, i = 23.2, and z = 22.6, estimated at S/N = 10 in a 2 arcsecond aperture. We present a suite of detailed studies to characterize the catalog, measure any residual systematic biases, and verify that the catalog is suitable for cosmology analyses. In parallel, we build an image simulation pipeline to characterize the remaining multiplicative shear bias in this catalog, which we measure to be m = (−2.454 ± 0.124) × 10⁻² for the full sample. Despite the significantly inhomogeneous nature of the data set, due to it being an amalgamation of various observing programs, we find the resulting catalog has sufficient quality to yield competitive cosmological constraints.
  7. Self-assembled materials capable of modulating their assembly properties in response to specific enzymes play a pivotal role in advancing ‘intelligent’ encapsulation platforms for biotechnological applications. Here, we introduce a previously unreported class of synthetic nanomaterials that programmatically interact with histone deacetylase (HDAC) as the triggering stimulus for disassembly. These nanomaterials consist of co-polypeptides comprising poly(acetyl L-lysine) and poly(ethylene glycol) blocks. Under neutral pH conditions, they self-assemble into particles. However, their stability is compromised upon exposure to HDACs, depending on enzyme concentration and exposure time. Our investigation, utilizing HDAC8 as the model enzyme, revealed that the primary mechanism behind disassembly involves a decrease in amphiphilicity within the block copolymer due to the deacetylation of lysine residues within the particles’ hydrophobic domains. To elucidate the response mechanism, we encapsulated a fluorescent dye within these nanoparticles. Upon incubation with HDAC, the nanoparticle structure collapsed, leading to controlled release of the dye over time. Notably, this release was not triggered by denatured HDAC8, other proteolytic enzymes like trypsin, or the co-presence of HDAC8 and its inhibitor. We further demonstrated the biocompatibility and cellular effects of these materials and conducted a comprehensive computational study to unveil the possible interaction mechanism between enzymes and particles. By drawing parallels to the mechanism of naturally occurring histone proteins, this research represents a pioneering step toward developing functional materials capable of harnessing the activity of epigenetic enzymes such as HDACs.
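
Illustrative note on record 1: the two-stage pretokenization curriculum described in that abstract (first learn subwords within word boundaries, then allow merges that bridge whitespace) can be sketched as a toy Python example. This is not the authors' implementation. It works on characters rather than bytes, uses a simple space split as the stage-one pretokenizer, and every function name (`pretokenize`, `learn_merges`, `train_superword_vocab`, `encode`) is hypothetical.

```python
from collections import Counter

def pretokenize(text, bridge_whitespace):
    """Split text into chunks that merges are not allowed to cross."""
    if bridge_whitespace:
        return [text]  # stage 2: one chunk, so merges may span spaces
    words = text.split(" ")
    # stage 1: one chunk per word, each later word keeps its leading space
    return [words[0]] + [" " + w for w in words[1:]]

def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in seq with one token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(chunks, num_merges, merges=()):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    merges = list(merges)
    seqs = [list(chunk) for chunk in chunks]
    for pair in merges:  # replay merges learned in an earlier stage
        seqs = [apply_merge(s, pair) for s in seqs]
    for _ in range(num_merges):
        counts = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break  # nothing left to merge
        best = max(counts, key=counts.get)
        merges.append(best)
        seqs = [apply_merge(s, best) for s in seqs]
    return merges

def train_superword_vocab(text, subword_merges, superword_merges):
    """Stage 1 learns subwords; stage 2 continues without word boundaries."""
    stage1 = learn_merges(pretokenize(text, False), subword_merges)
    return learn_merges(pretokenize(text, True), superword_merges, stage1)

def encode(text, merges):
    """Tokenize by replaying the merges in the order they were learned."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

text = "by the way by the way by the way by the way"
baseline = learn_merges(pretokenize(text, False), 20)
superword = train_superword_vocab(text, 8, 10)
print(len(encode(text, baseline)), "tokens with subword-only merges")
print(len(encode(text, superword)), "tokens with superword merges")
```

On this toy string the second stage can merge across spaces, so the superword encoding uses fewer tokens than the subword-only one. The sketch says nothing about the efficiency or accuracy figures reported in the abstract, which come from 200k-entry vocabularies and 8B-parameter models.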