skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Li, J"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Free, publicly-accessible full text available September 1, 2026
  2. Offline preference alignment for language models such as Direct Preference Optimization (DPO) is favored for its effectiveness and simplicity, eliminating the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet they lack a unified understanding. In this study, we introduce Pior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer and step-level annotations. We illustrate that DPO and KTO are special cases with different prior constraints within our framework. By integrating different types of prior information, we developed two variations of PIPA: PIPA-M and PIPA-N. Both algorithms demonstrate a 3∼10% performance enhancement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms. 
    more » « less
    Free, publicly-accessible full text available July 24, 2026
  3. Improvements in language models are often driven by improving the quality of the data we train them on, which can be limiting when strong supervision is scarce. In this work, we show that paired preference data consisting of individually weak data points can enable gains beyond the strength of each individual data point. We formulate the delta learning hypothesis to explain this phenomenon, positing that the relative quality delta between points suffices to drive learning via preference tuning--even when supervised finetuning on the weak data hurts. We validate our hypothesis in controlled experiments and at scale, where we post-train 8B models on preference data generated by pairing a small 3B model's responses with outputs from an even smaller 1.5B model to create a meaningful delta. Strikingly, on a standard 11-benchmark evaluation suite (MATH, MMLU, etc.), our simple recipe matches the performance of Tulu 3, a state-of-the-art open model tuned from the same base model while relying on much stronger supervisors (e.g., GPT-4o). Thus, delta learning enables simpler and cheaper open recipes for state-of-the-art post-training. To better understand delta learning, we prove in logistic regression that the performance gap between two weak teacher models provides useful signal for improving a stronger student. Overall, our work shows that models can learn surprisingly well from paired data that might typically be considered weak. 
    more » « less
    Free, publicly-accessible full text available July 8, 2026
  4. Free, publicly-accessible full text available April 1, 2026
  5. Placing a dataset A={ai}i∈[n]⊂ℝd in radial isotropic position, i.e., finding an invertible R∈ℝd×d such that the unit vectors {(Rai)‖Rai‖−12}i∈[n] are in isotropic position, is a powerful tool with applications in functional analysis, communication complexity, coding theory, and the design of learning algorithms. When the transformed dataset has a second moment matrix within a exp(±ϵ) factor of a multiple of Id, we call R an ϵ-approximate Forster transform. We give a faster algorithm for computing approximate Forster transforms, based on optimizing an objective defined by Barthe [Barthe98]. When the transform has a polynomially-bounded aspect ratio, our algorithm uses O(ndω−1(nϵ)o(1)) time to output an ϵ-approximate Forster transform with high probability, when one exists. This is almost the natural limit of this approach, as even evaluating Barthe's objective takes O(ndω−1) time. Previously, the state-of-the-art runtime in this regime was based on cutting-plane methods, and scaled at least as ≈n3+n2dω−1. We also provide explicit estimates on the aspect ratio in the smoothed analysis setting, and show that our algorithm similarly improves upon those in the literature. To obtain our results, we develop a subroutine of potential broader interest: a reduction from almost-linear time sparsification of graph Laplacians to the ability to support almost-linear time matrix-vector products. We combine this tool with new stability bounds on Barthe's objective to implicitly implement a box-constrained Newton's method [CMTV17, ALOW17]. 
    more » « less
    Free, publicly-accessible full text available April 8, 2026
  6. Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD-matching or exceeding accuracy on standard language modeling and downstream benchmarks-while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments. 
    more » « less
    Free, publicly-accessible full text available June 5, 2026
  7. Free, publicly-accessible full text available May 4, 2026
  8. We study the problem of finding a (pure) product state with optimal fidelity to an unknown n-qubit quantum state ρ, given copies of ρ. This is a basic instance of a fundamental question in quantum learning: is it possible to efficiently learn a simple approximation to an arbitrary state? We give an algorithm which finds a product state with fidelity ε-close to optimal, using N=npoly(1/ε) copies of ρ and poly(N) classical overhead. We further show that estimating the optimal fidelity is NP-hard for error ε=1/poly(n), showing that the error dependence cannot be significantly improved. For our algorithm, we build a carefully-defined cover over candidate product states, qubit by qubit, and then demonstrate that extending the cover can be reduced to approximate constrained polynomial optimization. For our proof of hardness, we give a formal reduction from polynomial optimization to finding the closest product state. Together, these results demonstrate a fundamental connection between these two seemingly unrelated questions. Building on our general approach, we also develop more efficient algorithms in three simpler settings: when the optimal fidelity exceeds 5/6; when we restrict ourselves to a discrete class of product states; and when we are allowed to output a matrix product state. 
    more » « less
    Free, publicly-accessible full text available March 31, 2026
  9. Diffusion models (DMs) create samples from a data distribution by starting from random noise and iteratively solving a reverse-time ordinary differential equation (ODE). Because each step in the iterative solution requires an expensive neural function evaluation (NFE), there has been significant interest in approximately solving these diffusion ODEs with only a few NFEs without modifying the underlying model. However, in the few NFE regime, we observe that tracking the true ODE evolution is fundamentally impossible using traditional ODE solvers. In this work, we propose a new method that learns a good solver for the DM, which we call Solving for the Solver (S4S). S4S directly optimizes a solver to obtain good generation quality by learning to match the output of a strong teacher solver. We evaluate S4S on six different pre-trained DMs, including pixel-space and latent-space DMs for both conditional and unconditional sampling. In all settings, S4S uniformly improves the sample quality relative to traditional ODE solvers. Moreover, our method is lightweight, data-free, and can be plugged in black-box on top of any discretization schedule or architecture to improve performance. Building on top of this, we also propose S4S-Alt, which optimizes both the solver and the discretization schedule. By exploiting the full design space of DM solvers, with 5 NFEs, we achieve an FID of 3.73 on CIFAR10 and 13.26 on MS-COCO, representing a 1.5× improvement over previous training-free ODE methods. 
    more » « less
    Free, publicly-accessible full text available February 24, 2026