Zeroth-order fine-tuning eliminates explicit back-propagation and reduces memory overhead for large language models (LLMs), making it a promising approach for on-device fine-tuning. However, existing memory-centric accelerators fail to fully leverage these benefits due to inefficiencies in balancing bit density, compute-in-memory capability, and the endurance-retention trade-off. We present a reliability-aware, analog multi-level-cell (MLC) eDRAM-RRAM compute-in-memory (CIM) solution co-designed with zeroth-order optimization for language model fine-tuning. An RRAM-assisted eDRAM MLC programming scheme is developed, along with a process-voltage-temperature (PVT)-robust, large-sensing-window time-to-digital converter (TDC). The MLC eDRAM integrating two-finger MOM provides a 12× improvement in bit density over state-of-the-art MLC designs. A further 5× density gain and a 2× retention benefit are obtained by adopting BEOL In2O3 FETs.
Zeroth-Order Optimization Finds Flat Minima
Zeroth-order methods are extensively used in machine learning applications where gradients are infeasible or expensive to compute, such as black-box attacks, reinforcement learning, and language model fine-tuning. Existing optimization theory focuses on convergence to an arbitrary stationary point, but less is known about the implicit regularization that provides a fine-grained characterization of which particular solutions are reached. This paper shows that zeroth-order optimization with the standard two-point estimator favors solutions with small trace of Hessian, a measure widely used to distinguish between sharp and flat minima. The authors provide convergence rates of zeroth-order optimization to approximate flat minima for convex and sufficiently smooth functions, defining flat minima as minimizers that achieve the smallest trace of Hessian among all optimal solutions. Experiments on binary classification tasks with convex losses and language model fine-tuning support the theoretical findings.
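For intuition, here is a minimal sketch of zeroth-order optimization with the standard two-point (Gaussian smoothing) gradient estimator referenced in the abstract. This is an illustration only, not the paper's code; the function names, step size, and smoothing radius are assumptions.

```python
# Two-point zeroth-order gradient estimator (illustrative sketch, not the paper's code).
import numpy as np

def zo_two_point_step(f, x, lr=0.01, mu=1e-3, rng=None):
    """One ZO-SGD step using a single random Gaussian direction u.

    g_hat = u * (f(x + mu*u) - f(x - mu*u)) / (2*mu) estimates the gradient of a
    Gaussian-smoothed version of f using only two function evaluations.
    """
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(x.shape)
    g_hat = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return x - lr * g_hat

# Toy usage: run ZO-SGD on a convex quadratic using only function evaluations.
A = np.diag([1.0, 10.0])
f = lambda z: 0.5 * z @ A @ z
x = np.array([5.0, 5.0])
rng = np.random.default_rng(0)
for _ in range(2000):
    x = zo_two_point_step(f, x, rng=rng)
```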
- Award ID(s): 2505865
- PAR ID: 10631828
- Publisher / Repository: https://doi.org/10.48550/arXiv.2506.05454
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
The training of over-parameterized neural networks has received much study in the recent literature. An important consideration is the regularization of over-parameterized networks, owing to their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss and lead to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing the noise injection by adding noise to the weight matrices before backpropagation yields only limited empirical improvement. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both the positive and negative directions of the random noise. In particular, this two-point estimate eliminates the variance contributed by the first-order Taylor-expansion term to the Hessian estimate. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it effectively regularizes the Hessian and improves generalization. First, our algorithm outperforms prior approaches to sharpness-reduced training, delivering up to a 2.4% test-accuracy increase for fine-tuning ResNets on six image classification datasets. Moreover, the trace of the Hessian is reduced by 15.8% and the largest eigenvalue by 9.7% with our approach. We also find that Hessian regularization can be combined with alternative regularization methods, such as weight decay and data augmentation, for stronger regularization. Second, our approach remains highly effective for improving generalization when pretraining multimodal CLIP models and in chain-of-thought fine-tuning.
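The two-point noise-injection idea above can be sketched in a few lines of PyTorch-style code. This is a minimal illustration under stated assumptions, not the authors' implementation; the function name and hyperparameters (e.g., sigma) are placeholders.

```python
# Two-point noise injection for Hessian-trace regularization: a minimal sketch of the
# idea described above, not the authors' implementation.
import torch

def two_point_noise_injection_step(model, loss_fn, batch, optimizer, sigma=1e-2):
    """One training step whose gradient is averaged over weights W + z and W - z.

    For isotropic Gaussian z, 0.5 * (L(W + z) + L(W - z)) ~= L(W) + 0.5 * z^T H z,
    so in expectation the objective gains a (sigma^2 / 2) * trace(H) penalty,
    while the first-order Taylor term cancels between the two signs.
    """
    inputs, targets = batch
    params = [p for p in model.parameters() if p.requires_grad]
    noise = [sigma * torch.randn_like(p) for p in params]

    optimizer.zero_grad()
    for sign in (1.0, -1.0):
        with torch.no_grad():
            for p, z in zip(params, noise):
                p.add_(sign * z)          # perturb: W -> W + sign * z
        loss = 0.5 * loss_fn(model(inputs), targets)
        loss.backward()                   # gradients accumulate across both signs
        with torch.no_grad():
            for p, z in zip(params, noise):
                p.sub_(sign * z)          # restore the original weights
    optimizer.step()
```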
-
In order to meet the safety and latency requirements of many IoT applications, intelligent decisions must be made in real time at the network edge, calling for edge intelligence. To facilitate fast edge learning, this work advocates a platform-aided federated meta-learning architecture, where a set of edge nodes join forces to learn a meta-model (i.e., a model initialization for adaptation to a new learning task) by exploiting the similarity among edge nodes as well as knowledge transfer from the cloud. The federated meta-learning problem is cast as a regularized stochastic optimization problem, using the Bregman divergence between the edge model and the cloud pre-trained model as the regularization. We then devise an alternating direction method of multipliers (ADMM) based, Hessian-free federated meta-learning algorithm, called ADMM-FedMeta, with inexact Hessian estimation. Further, we analyze the convergence properties and the rapid-adaptation performance of ADMM-FedMeta for the general nonconvex case. The theoretical results show that, under mild conditions, ADMM-FedMeta converges to an $\epsilon$-approximate first-order stationary point after at most $\mathcal{O}(1/\epsilon^2)$ communication rounds. Extensive experimental studies on benchmark datasets demonstrate the effectiveness and efficiency of ADMM-FedMeta and show that it outperforms existing baselines.
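To make the formulation above more concrete, one hypothetical way to write a Bregman-regularized, consensus-form federated meta-learning problem (which ADMM then splits across the consensus constraint) is sketched below; the notation is illustrative and the paper's exact formulation may differ.

```latex
% Hypothetical sketch of a Bregman-regularized federated meta-learning problem
% (illustrative notation; the paper's exact formulation may differ):
%   w_i : local model at edge node i     \theta : shared meta-model (initialization)
%   w_0 : cloud pre-trained model        F_i    : expected post-adaptation loss at node i
%   D_h : Bregman divergence induced by a strictly convex function h
\min_{\theta,\,\{w_i\}} \;\sum_{i=1}^{N} F_i(w_i) \;+\; \lambda \sum_{i=1}^{N} D_h\!\left(w_i, w_0\right)
\quad \text{s.t.} \quad w_i = \theta, \;\; i = 1,\dots,N.
```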
-
Functionally constrained stochastic optimization problems, where neither the objective function nor the constraint functions are analytically available, arise frequently in machine learning applications. In this work, assuming we only have access to noisy evaluations of the objective and constraint functions, we propose and analyze stochastic zeroth-order algorithms for solving this class of stochastic optimization problems. When the domain of the functions is [Formula: see text], assuming there are m constraint functions, we establish oracle complexities of order [Formula: see text] and [Formula: see text] in the convex and nonconvex settings, respectively, where ϵ represents the accuracy of the solutions required in appropriately defined metrics. The established oracle complexities are, to our knowledge, the first such results in the literature for functionally constrained stochastic zeroth-order optimization problems. We demonstrate the applicability of our algorithms by illustrating their superior performance on the problems of hyperparameter tuning for sampling algorithms and neural network training. Funding: K. Balasubramanian was partially supported by a seed grant from the Center for Data Science and Artificial Intelligence Research, University of California–Davis, and the National Science Foundation [Grant DMS-2053918].
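As a generic illustration of the setting (not the paper's algorithm), the sketch below builds zeroth-order gradient estimates for a black-box objective f and black-box constraints g_j from function evaluations alone and folds them into a simple quadratic-penalty step; all names and constants are assumptions.

```python
# Generic illustration (not the paper's algorithm): zeroth-order estimation when both
# the objective and the constraints are available only through evaluations.
import numpy as np

def zo_grad(h, x, mu=1e-3, n_samples=4, rng=None):
    """Average several two-point Gaussian estimates of grad h(x)."""
    rng = rng or np.random.default_rng()
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        g += (h(x + mu * u) - h(x - mu * u)) / (2.0 * mu) * u
    return g / n_samples

def penalized_zo_step(f, constraints, x, lr=0.05, rho=10.0, rng=None):
    """One step on the penalized objective f(x) + rho * sum_j max(0, g_j(x))^2,
    using only evaluations of f and of each constraint g_j."""
    def penalized(z):
        return f(z) + rho * sum(max(0.0, g(z)) ** 2 for g in constraints)
    return x - lr * zo_grad(penalized, x, rng=rng)
```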
-
Quasi-Newton algorithms are among the most popular iterative methods for solving unconstrained minimization problems, largely due to their favorable superlinear convergence property. However, existing results for these algorithms are limited, as they provide either (i) a global convergence guarantee with an asymptotic superlinear convergence rate, or (ii) a local non-asymptotic superlinear rate for the case in which the initial point and the initial Hessian approximation are chosen properly. In particular, no current analysis for quasi-Newton methods guarantees global convergence with an explicit superlinear convergence rate. In this paper, we close this gap and present the first globally convergent quasi-Newton method with an explicit non-asymptotic superlinear convergence rate. Unlike classical quasi-Newton methods, we build our algorithm upon the hybrid proximal extragradient method and propose a novel online learning framework for updating the Hessian approximation matrices. Specifically, guided by the convergence analysis, we formulate the Hessian approximation update as an online convex optimization problem in the space of matrices, and we relate the bounded regret of the online problem to the superlinear convergence of our method.
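For context, the classical quasi-Newton template that the abstract contrasts against is shown below; per the abstract, the paper's method instead selects the Hessian approximation via an online learning procedure built on the hybrid proximal extragradient method.

```latex
% Classical quasi-Newton background (context only; not the paper's new update rule):
% iterate update with step size \eta_k and Hessian approximation B_k, and the secant
% condition that classical updates such as BFGS impose on B_{k+1}.
x_{k+1} = x_k - \eta_k\, B_k^{-1} \nabla f(x_k), \qquad B_{k+1} s_k = y_k,
\quad s_k = x_{k+1} - x_k, \quad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).
```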
