This paper proposes fast polynomial evaluation methods for correctly rounded elementary functions generated using our RLibm approach. The resulting functions produce correct results for all inputs across multiple representations and rounding modes. Given an oracle, the RLibm approach approximates the correctly rounded result rather than the real value of an elementary function. A key observation is that there is an interval of real values around the correctly rounded result such that any real value in it rounds to the correct result. This interval is the maximum freedom available to RLibm's polynomial generation procedure. The problem of generating correctly rounded elementary functions from these intervals can then be structured as a linear programming problem. Our prior work on the RLibm approach uses Horner's method for polynomial evaluation. This paper explores polynomial evaluation techniques such as Knuth's coefficient adaptation procedure, parallel execution of operations using Estrin's procedure, and the use of fused multiply-add operations in the context of the RLibm approach. Applying these evaluation optimizations after the fact to a polynomial generated by the RLibm approach produces incorrect results, because they change the rounding errors incurred during polynomial evaluation. Hence, we propose to integrate the fast polynomial evaluation procedure into RLibm's polynomial generation process. Our new polynomial evaluation procedure, which combines parallel execution with fused multiply-add operations, outperforms Horner's method as used by RLibm's correctly rounded functions. We show that the resulting polynomials for 32-bit floats are not only correct but also 24% faster than the prior functions in RLibm.
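To make the two evaluation orders concrete, here is a minimal Python sketch (not RLibm's implementation; the coefficients are placeholders) contrasting Horner's sequential chain with Estrin's paired scheme for a degree-5 polynomial. In a C implementation each `r * x + c` step would be a single fused multiply-add via `fma()`, rounding once instead of twice; plain Python rounds each operation separately, which is exactly why the paper folds the evaluation scheme into polynomial generation rather than optimizing afterwards.

```python
# Minimal sketch: two evaluation orders for
# p(x) = c0 + c1*x + c2*x^2 + c3*x^3 + c4*x^4 + c5*x^5.
# The coefficients below are placeholders, not RLibm-generated values.

def horner(c, x):
    """Sequential evaluation: a chain of dependent multiply-adds."""
    r = c[5]
    for ci in reversed(c[:5]):       # c4, c3, c2, c1, c0
        r = r * x + ci               # in C: r = fma(r, x, ci)
    return r

def estrin(c, x):
    """Estrin's scheme: pairs up terms so independent multiply-adds
    can execute in parallel on a superscalar core."""
    x2 = x * x
    x4 = x2 * x2
    e01 = c[0] + c[1] * x            # the three pairs are independent
    e23 = c[2] + c[3] * x
    e45 = c[4] + c[5] * x
    return (e01 + e23 * x2) + e45 * x4

c = [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
# The two orders can differ in the last bits due to rounding,
# which is the source of the incorrect results the paper describes.
print(horner(c, 0.3), estrin(c, 0.3))
```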
This content will become publicly available on June 11, 2026
ON SPEEDING UP LANGUAGE MODEL EVALUATION
Developing prompt-based methods with Large Language Models (LLMs) requires making numerous decisions, which give rise to a combinatorial search problem over hyper-parameters. Evaluating every configuration exhaustively can be time-consuming and costly. In this paper, we propose an adaptive approach to explore this space. We exploit the fact that often only a few samples are needed to identify clearly superior or inferior settings, and that many evaluation tests are highly correlated. We lean on multi-armed bandits to sequentially identify the next (method, validation sample) pair to evaluate, and on low-rank matrix factorization to fill in missing evaluations. We carefully assess the efficacy of our approach on several competitive benchmark problems and show that it can identify the top-performing method using only 5-15% of the typical resources, resulting in 85-95% LLM cost savings. Our code is available at https://github.com/kilian-group/banditeval.
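The two ingredients can be sketched in a few lines. The following is an illustrative toy (synthetic scores, a simple UCB index, SVD-based imputation), not the authors' released code, which lives at the repository linked above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_methods, n_samples, rank, budget = 8, 50, 2, 80   # 80 of 400 cells = 20%

# Synthetic ground-truth score matrix (method x validation sample);
# in practice each observed entry would cost one LLM evaluation.
true_scores = rng.random((n_methods, rank)) @ rng.random((rank, n_samples))
observed = np.full((n_methods, n_samples), np.nan)

# Seed one observation per method so imputation is well defined.
for m in range(n_methods):
    j = int(rng.integers(n_samples))
    observed[m, j] = true_scores[m, j]

def impute(obs, rank, iters=20):
    """Fill missing entries with a truncated-SVD fit (hard-impute style)."""
    filled = np.where(np.isnan(obs), np.nanmean(obs), obs)
    for _ in range(iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        filled = np.where(np.isnan(obs), low_rank, obs)
    return filled

for t in range(budget):
    est = impute(observed, rank)
    pulls = (~np.isnan(observed)).sum(axis=1)
    # UCB index per method: estimated mean score + exploration bonus.
    ucb = est.mean(axis=1) + np.sqrt(2.0 * np.log(t + 2) / pulls)
    m = int(np.argmax(ucb))
    unseen = np.flatnonzero(np.isnan(observed[m]))
    j = int(rng.choice(unseen)) if unseen.size else int(rng.integers(n_samples))
    observed[m, j] = true_scores[m, j]          # spend one evaluation

best = int(np.argmax(impute(observed, rank).mean(axis=1)))
print("selected method:", best,
      "| true best:", int(np.argmax(true_scores.mean(axis=1))))
```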
- Award ID(s): 1934714
- PAR ID: 10615991
- Publisher / Repository: International Conference on Learning Representations
- Date Published:
- ISBN: 9798331320850
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Large language models are typically aligned with human preferences by optimizing reward models (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models, each of which captures a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to overoptimization, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these overoptimization points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method that uses gradient-free optimization to identify and optimize towards these proxy points during a single run.
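As a rough illustration of the constrained formulation, the sketch below (invented thresholds and scores, not the paper's implementation) maximizes a sum of component rewards subject to each RM staying below its usefulness threshold, with a projected dual ascent step on the Lagrange multipliers that serve as the dynamic weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-RM "thresholds of usefulness": past these proxy points,
# more reward from that component correlates with worse human ratings.
thresholds = np.array([0.7, 0.5, 0.9])
lambdas = np.zeros_like(thresholds)   # Lagrange multipliers = dynamic weights
lr_dual = 0.05

def lagrangian_reward(rm_scores):
    """Composite reward: sum of components minus multiplier-weighted
    penalties for exceeding each component's threshold."""
    excess = np.maximum(rm_scores - thresholds, 0.0)
    return rm_scores.sum() - float(np.dot(lambdas, excess))

for rollout in range(200):
    rm_scores = rng.random(3)          # stand-in for real component RM outputs
    r = lagrangian_reward(rm_scores)   # would drive the RL policy update
    # Projected dual ascent: grow lambda_i while RM i overshoots its
    # threshold, shrink it back toward zero otherwise.
    lambdas = np.maximum(lambdas + lr_dual * (rm_scores - thresholds), 0.0)

print("learned multipliers:", np.round(lambdas, 3))
```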
-
Metamorphic testing is an advanced technique for testing programs that lack a true test oracle, such as machine learning applications. Because these programs have no general oracle to identify their correctness, traditional testing techniques such as unit testing may not help developers detect potential bugs. This paper presents a novel system, KABU, which can dynamically infer properties of methods' states that describe the characteristics of a method before and after it transforms its input. These Metamorphic Properties (MPs) are pivotal to detecting potential bugs in programs without test oracles, but most previous work relies solely on human effort to identify them and only considers MPs between the input parameters and output result (return value) of a program or method. This paper also proposes a testing concept, Metamorphic Differential Testing (MDT): by detecting different sets of MPs between different versions of the same method, KABU reports potential bugs for human review. We have performed a preliminary evaluation of KABU by comparing the MPs detected by humans with the MPs detected by KABU. Our preliminary results are promising: KABU can find more MPs than human developers, and MDT is effective at detecting function changes in methods.
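A toy example of a metamorphic property and a differential check across two versions (the property and the "models" are invented for illustration; KABU infers such properties automatically from method states):

```python
# Toy metamorphic property for a function with no exact oracle:
# scaling all inputs by a positive constant should not change the
# index of the largest score (e.g., an argmax-based classifier).

def model_v1(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

def model_v2(xs):
    # "Buggy" second version: premature rounding can reorder close scores.
    return max(range(len(xs)), key=lambda i: round(xs[i], 1))

def holds_scaling_mp(model, xs, k=100.0):
    """MP: model(k * x) == model(x) for any k > 0."""
    return model([k * v for v in xs]) == model(xs)

inputs = [0.01, 0.04]
# Metamorphic differential testing: the MP holds for one version but
# not the other, so the change is flagged for human review.
print("v1 satisfies MP:", holds_scaling_mp(model_v1, inputs))  # True
print("v2 satisfies MP:", holds_scaling_mp(model_v2, inputs))  # False
```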
-
This paper proposes a metric that we call the structured saliency benchmark (SSBM) to evaluate importance maps computed for automatic speech recognizers on individual utterances. These maps indicate time-frequency points of the utterance that are most important for correct recognition of a target word. Our evaluation technique is not only suitable for standard classification tasks, but is also appropriate for structured prediction tasks like sequence-to-sequence models. Additionally, we use this approach to compare the importance maps created by our previously introduced "bubble noise" technique against a baseline approach based on smoothed speech energy and forced alignment, measuring the correlation between the maps. Our results show that the bubble analysis approach is better at identifying important speech regions than this baseline on 100 sentences from the AMI corpus.
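A minimal sketch of scoring agreement between two importance maps over a target word's time span (random stand-in maps and invented alignment boundaries; the real maps would come from the bubble-noise analysis and the smoothed-energy baseline):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical time-frequency importance maps for one utterance
# (frequency bins x time frames); stand-ins for real analysis outputs.
bubble_map = rng.random((64, 200))   # e.g., from bubble-noise analysis
energy_map = rng.random((64, 200))   # e.g., smoothed speech energy baseline

# Restrict to the target word's frames, e.g. from a forced alignment.
word_start, word_end = 80, 120       # hypothetical alignment boundaries
a = bubble_map[:, word_start:word_end].ravel()
b = energy_map[:, word_start:word_end].ravel()

# Score agreement between the two maps with Pearson correlation.
corr = np.corrcoef(a, b)[0, 1]
print(f"map correlation over target word: {corr:.3f}")
```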
-
We propose a novel trace-guided approach to tackle the challenges of ambiguity and generalization in the synthesis of recursive functional programs from input-output examples. Our approach augments the search space of programs with recursion traces consisting of the recursive subcalls of the programs. Our method is based on a new version space algebra (VSA) for succinct representation and efficient manipulation of pairs of recursion traces and programs that are consistent with each other. We have implemented this approach in a tool called SyRup and evaluated it on benchmarks from prior work. Our evaluation demonstrates that SyRup not only requires fewer examples to achieve a given success rate than existing synthesizers, but is also less sensitive to the quality of the examples.
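A drastically simplified illustration of the trace-guided idea, with two hand-written candidates standing in for the program space and a recursion-trace check alongside the input-output check (the real system represents consistent (trace, program) pairs compactly in a version space algebra rather than enumerating them):

```python
# Each candidate records its recursive subcalls in `trace` as it runs.

def double_rec(n, trace):
    trace.append(n)
    return 0 if n == 0 else 2 + double_rec(n - 1, trace)

def square_rec(n, trace):
    trace.append(n)
    return 0 if n == 0 else 2 * n - 1 + square_rec(n - 1, trace)

candidates = {"double": double_rec, "square": square_rec}
examples = [(0, 0), (1, 2), (3, 6)]   # input-output pairs for doubling

# Keep candidates whose outputs AND recursion traces are consistent
# with the examples (expected trace here: calls on n, n-1, ..., 0).
consistent = {}
for name, prog in candidates.items():
    ok = True
    for x, y in examples:
        trace = []
        if prog(x, trace) != y or trace != list(range(x, -1, -1)):
            ok = False
            break
    if ok:
        consistent[name] = prog

print(sorted(consistent))  # ['double']
```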