<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models</title></titleStmt>
			<publicationStmt>
				<publisher>ICML</publisher>
				<date>07/14/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10621225</idno>
					<idno type="doi"></idno>
					
					<author>Parshin Shojaee</author><author>Kazem Meidani</author><author>Shashank Gupta</author><author>Amir Farimani</author><author>Chandan Reddy</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Scientific equation discovery has long been a cornerstone of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, it is difficult to assess the true discovery capabilities of these methods because existing benchmarks often use well-known equations. This makes them vulnerable to memorization by LLMs and results in inflated performance metrics that do not reflect genuine discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Proceedings of the 42 nd International Conference on Machine <ref type="bibr">Learning, Vancouver, Canada. PMLR 267, 2025.</ref> Copyright 2025 by the author(s). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Equation discovery, the process of uncovering symbolic mathematical expressions from observational data, has been a cornerstone of scientific advancement. This task, also known as symbolic regression (SR), goes beyond mere datadriven predictive modeling by seeking interpretable mathematical relations that reveal the underlying mechanisms of natural phenomena. When scientists derive mathematical equations from empirical data, they gain more than just predictive power -they obtain insights into fundamental physical principles, enable extrapolation beyond observed data, and facilitate knowledge transfer across scientific domains <ref type="bibr">(Langley, 1981;</ref><ref type="bibr">Schmidt &amp; Lipson, 2009)</ref>.</p><p>Standard approaches to equation discovery have primarily relied on genetic programming (GP) and evolutionary algorithms <ref type="bibr">(Cranmer, 2023;</ref><ref type="bibr">La Cava et al., 2021)</ref>, which represent mathematical expressions as trees and navigate the vast space of possible equations through evolutionary search techniques. However, these methods face two fundamental challenges. First, the NP-hard nature of equation discovery <ref type="bibr">(Virgolin &amp; Pissis, 2022)</ref> makes their random mutation and crossover operations computationally prohibitive across</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Goal / Instruction -Discover the mathematical equation/law that describes [output variable] based on given [input features].</head><p>-Use domain-specific knowledge of [the scientific field] and provided data samples to find an equation that is scientifically valid and fits the data well. The benchmark tasks (left) combine scientific context with numerical data. The discovery process (middle) iteratively leverages LLM's scientific knowledge and data-driven reasoning to generate hypotheses for underlying equations. Discovered hypotheses, represented as equation strings, trees, or programs, are then evaluated (right) using multiple metrics including data fidelity, symbolic accuracy, and computational efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data</head><p>vast search spaces. Second, unlike human scientists who leverage their domain knowledge and expertise to guide hypothesis formation, these approaches are mostly purely data-driven, and isolated from existing scientific knowledge. These limitations have motivated researchers to develop methods that incorporate scientific domain knowledge into the equation discovery process.</p><p>Large Language Models (LLMs) have recently emerged as a promising solution to these challenges, offering a new paradigm for scientific equation discovery. LLMs, trained on vast corpora of scientific literature, possess extensive embedded scientific knowledge. This has sparked significant interest in leveraging LLMs for scientific equation discovery, with several recent works demonstrating their potential <ref type="bibr">(Shojaee et al., 2024b;</ref><ref type="bibr">Ma et al., 2024;</ref><ref type="bibr">Grayeli et al., 2024;</ref><ref type="bibr">Merler et al., 2024;</ref><ref type="bibr">Du et al., 2024;</ref><ref type="bibr">Reddy &amp; Shojaee, 2024;</ref><ref type="bibr">Zhang et al., 2024)</ref>. These LLM-based approaches have shown to enhance the equation hypothesis generation process by incorporating scientific priors, guiding the exploration of equation search spaces more efficiently, and providing interpretable reasoning for the search process.</p><p>Despite the promising potential of LLM-based equation discovery methods, their rigorous and robust evaluation still remains an open challenge. The current scientific equation discovery benchmarks are primarily represented by SRBench <ref type="bibr">(La Cava et al., 2021)</ref> and SRSD <ref type="bibr">(Matsubara et al., 2022)</ref>. 
SRBench incorporates two key data groups for this purpose: the Feynman physics equations <ref type="bibr">(Udrescu &amp; Tegmark, 2020)</ref>, and Strogatz dynamical systems <ref type="bibr">(La Cava et al., 2016;</ref><ref type="bibr">Strogatz, 2018)</ref>. A notable extension to this framework is SRSD <ref type="bibr">(Matsubara et al., 2022)</ref>, which enhances the Feynman benchmark by incorporating physically meaningful sampling ranges for data points. However, these benchmarks exhibit significant limitations for the evaluation of LLM-based methods. Their problems are mostly based on known physics equations from textbooks, which often makes them subject to memorization by LLMs.</p><p>As noted by <ref type="bibr">Shojaee et al. (2024b)</ref>, LLMs frequently succeed on these common equation discovery benchmarks through simple recitation based on variable names and problem descriptions, rather than through the actual process of data-driven discovery and reasoning. Our analysis (shown in Fig. <ref type="figure">1</ref>) also confirms this finding: the sudden drop in the numeric error curve within the first few iterations and the significantly lower symbolic error on Feynman problems indicate memorized solutions rather than a meaningful search towards discovery. To mitigate this issue, <ref type="bibr">(Shojaee et al., 2024b;</ref><ref type="bibr">Ma et al., 2024)</ref> introduced a small set of five custom-crafted problems designed to prevent memorization by manually modifying known physical models. 
While these efforts represent a step forward, the small scale and limited diversity of these problem sets are insufficient to provide a comprehensive evaluation framework for emerging LLM-based methods in scientific equation discovery.</p><p>A more robust and systematic benchmark is needed to enable standardized evaluation and foster the development of innovative methods in this emerging field.</p><p>In this paper, we introduce LLM-SRBench, a new benchmark designed to rigorously evaluate the capabilities of LLM-based scientific equation discovery methods. LLM-SRBench addresses the limitations of existing benchmarks by constructing problem sets that avoid trivial recitation while leveraging the scientific priors embedded in LLMs, simulating conditions akin to scientific discovery. The benchmark is structured around two main categories of problems, each targeting distinct aspects of equation discovery. The first category focuses on transforming common scientific problems, such as those from the Feynman equations, into different mathematical representations of the same underlying physical problem. By symbolically altering input-output mappings and generating less common mathematical forms for the same problem, we challenge LLM-based equation discovery methods to go beyond memorization of the common forms. This approach is motivated by recent findings on the fragility of LLMs' reasoning capabilities to unfamiliar representations of otherwise familiar problems <ref type="bibr">(Mirzadeh et al., 2024;</ref><ref type="bibr">Xie et al., 2024;</ref><ref type="bibr">Wu et al., 2023)</ref>.</p><p>The second category extends the approach introduced by <ref type="bibr">(Shojaee et al., 2024b)</ref>, which combines known terms in the underlying equation with synthetic, novel terms to create problems that go beyond memorization and demand data-driven reasoning. We expand this idea into a comprehensive set of benchmark problems spanning diverse scientific domains. 
These problems incorporate carefully designed synthetic terms that are both novel and plausible. We further verify the solvability of the generated equations using numerical solvers, ensuring that the benchmark problems remain grounded in physical feasibility while presenting meaningful challenges for LLM-based discovery methods.</p><p>LLM-SRBench comprises 111 problems in the first category (LSR-Transform) and 128 problems in the second category (LSR-Synth), spanning four scientific domains: chemistry (36), biology (24), physics (43), and material science (25). We comprehensively benchmark state-of-the-art LLM-based scientific equation discovery methods with several LLM backbones on these datasets. Our experiments reveal several key insights into the capabilities and limitations of current LLM-based scientific equation discovery methods. Results show that the best model can only solve 31.5% of problems on LSR-Transform and 28.1% on LSR-Synth. This underscores the challenging nature of the tasks in LLM-SRBench and highlights its potential as a critical evaluation foundation for future LLM-based scientific equation discovery methods. Overall, the contributions of this work are as follows:</p><p>&#8226; We introduce LLM-SRBench, the first comprehensive benchmark with 239 challenging problems across various scientific domains, designed to evaluate LLM-based scientific equation discovery methods.</p><p>&#8226; We propose a novel benchmark design built on alternative mathematical representations (LSR-Transform) and synthetic, discovery-driven problems (LSR-Synth) to ensure rigorous evaluation of scientific reasoning and discovery capabilities beyond LLM memorization.</p><p>&#8226; Extensive experiments on state-of-the-art methods reveal that performance peaks at only 31.5% symbolic accuracy, highlighting the benchmark's challenging nature and its potential for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">LLM-SRBench</head><p>We introduce LLM-SRBench, a novel benchmark designed to evaluate LLM-based methods for data-driven scientific equation discovery. As shown in Fig. <ref type="figure">2</ref>, in this benchmark, a "data-driven scientific equation discovery" task is defined as follows: Given a task dataset D, the corresponding scientific context C, the objective is to derive a hypothesis h that represents the underlying mathematical relations behind the data with high precision and scientific plausibility. This process resembles the iterative search and refinement undertaken by human scientists, where LLMs act as optimizers, proposing and refining hypotheses based on both scientific knowledge and empirical data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">LSR-Transform</head><p>This category is designed to evaluate whether LLM-based methods can discover equations in less common mathematical forms, avoiding reliance on memorization of well-known representations. This approach is motivated by the observation that LLMs often struggle with unfamiliar instantiations of otherwise familiar problems, as highlighted by recent studies on the fragility of LLM reasoning <ref type="bibr">(Mirzadeh et al., 2024;</ref><ref type="bibr">Xie et al., 2024;</ref><ref type="bibr">Wu et al., 2023)</ref>. By transforming existing benchmark problems into different mathematical representations, we challenge LLMs' capabilities in datadriven scientific equation discovery and reasoning.</p><p>We build on the Feynman <ref type="bibr">(Udrescu &amp; Tegmark, 2020)</ref> benchmark (current standard benchmark in scientific equation discovery), which consists of 100 physics equations, and systematically transform these equations into alternative mathematical forms (examples in App. A.1). As demonstrated in Fig. <ref type="figure">3</ref>(a), the transformation process involves seven key steps: 1) Equation Collection: We gather the original mathematical expressions, along with their input and output variables, and scientific problem descriptions from the Feynman benchmark. 2) Select Pivot Variable:</p><p>For each equation, we choose an input feature to become the new target variable. 3) Feature-Target Transformation: We transform the dataset by switching the roles of the selected input feature and the original target variable. 4) Symbolic Transformation: Using the SymPy library in Python on the parsed expressions, we solve each equation with respect to the selected input variable, treating it as the new output and the original output variable as an input in the transformed equation. 
5) Solvability Check: We retain only those transformations that are analytically solvable, ensuring the feasibility of the resulting equations. 6) Dataset Refinement: For transformed equations with altered data domains (e.g., due to square roots or denominators), we filter the original Feynman dataset to ensure all data points fall within the valid domains of the new equations. 7) Problem Reformulation: Using an LLM (GPT-4o), we generate a new natural language specification for each transformed problem. During this data generation process, we constrain the transformed equations' complexity (measured by expression tree node count) to the range of the original Feynman dataset distribution (full analysis in Fig. <ref type="figure">8</ref>, App. A.1). This allows us to focus on the semantic aspects of discovery, specifically the interplay between reasoning and memorization of mathematical forms, rather than conflating performance with the ability to handle syntactically complex and lengthy hypotheses. We also exclude transformed problems that the LLM can solve through direct sampling, without requiring access to data.</p><p>This process yields 111 transformed equations derived from the 100 original Feynman problems. Each transformed equation shares the same scientific context, problem description, and variables as its original counterpart but presents a less common mathematical form to be discovered. The goal of LSR-Transform is not to discover new equations but to evaluate whether LLM-based systems can derive non-trivial, data-driven transformations of known equations. To support scientific knowledge-guided discovery, each task in LSR-Transform is supplemented with a natural language description of the scientific problem and dataset, including variable names and their meanings. 
These descriptions are absent from the original Feynman benchmark, but LLM-based scientific equation discovery methods need them to provide scientific context in prompts for knowledge-guided equation discovery.</p></div>
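Steps 4 and 5 of the pipeline above (symbolic transformation with SymPy and the analytic solvability check) can be sketched as follows. The pendulum-period equation T = 2π√(l/g) is an illustrative Feynman-style example, not necessarily one of the 111 benchmark problems.

```python
import sympy as sp

T, l, g = sp.symbols("T l g", positive=True)
# Known form: period T as the output variable.
original = sp.Eq(T, 2 * sp.pi * sp.sqrt(l / g))

# Step 4: solve for the pivot variable l, making it the new target
# while T becomes an input of the transformed equation.
solutions = sp.solve(original, l)

# Step 5: retain the problem only if an analytical solution exists.
assert solutions, "transformation is not analytically solvable"
transformed = sp.Eq(l, solutions[0])  # l = g*T**2/(4*pi**2)
```

Declaring the symbols `positive=True` mirrors the dataset-refinement step: it restricts SymPy to the physically valid domain so that the square root inverts cleanly.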
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">LSR-Synth</head><p>This category is designed to assess whether LLMs can discover equations that incorporate new synthetic terms alongside known terms, requiring scientific as well as data-driven reasoning rather than reliance on memorization. The LSR-Synth dataset is motivated by the approach introduced in <ref type="bibr">(Shojaee et al., 2024b)</ref> for the handful of manually designed problems and systematically expands it into a comprehensive set of benchmark problems across diverse scientific domains. By combining known terms with synthetic, novel terms, LLMs are challenged to demonstrate discovery capabilities in unobserved contexts, yet leverage their knowledge in the process. The LSR-Synth dataset spans four scientific domains: chemistry, biology, physics, and material science, focusing on key scientific problems, including reaction kinetics in chemistry, population growth in biology, damped harmonic oscillators in physics, and stress-strain relationships in material science (examples in App. A.2).</p><p>The data generation process for LSR-Synth involves multi-ple steps , as illustrated in Fig. <ref type="figure">3</ref>(b), to ensure the creation of high-quality, challenging benchmark problems: 1) Select Scientific Problem: We select problems from different scientific domains, such as reaction kinetics in chemistry or population dynamics in biology. 2) Known Term Generation: Given the problem description, we prompt an LLM (GPT-4o) to generate a list of common and well-known mathematical terms that typically appear in the underlying models. 3) Synthetic Term Generation: Similarly, we prompt the LLM to generate a list of diverse novel synthetic terms for a given scientific problem, along with descriptions of the problem and variables. 
For example, in chemistry reaction kinetics, known terms for the reaction rate (dA/dt) based on concentration (A) and time (t) might include first-order (-kA) and second-order (-kA^2) kinetics or the exponential decay term -k exp(-k_s t), while synthetic terms could represent non-linear high-order saturation, e.g., kA^2/(1 + &#946;A^4), or non-linear quantum tunneling effects, e.g., kA exp(-&#947;t)/t^2. 4) Solvability Check: After sampling from the generated known and synthetic terms and combining them into a complete mathematical expression, we verify the solvability of these expressions using numerical solvers such as solve_ivp in Python. This step ensures that the expressions are feasible, providing a basis for generating datapoints. 5) Novelty Check: In the context of each scientific problem and the complete expression, we evaluate the novelty of the newly generated task using an LLM (GPT-4o) as a novelty evaluator. This step verifies that the synthetic terms are novel in the provided context and require data-driven reasoning, rather than established knowledge, to be discovered. 6) Datapoint Generation: For expressions that pass the solvability and novelty checks, we generate datapoints using numerical solvers based on the specified initial conditions and parameters. These datapoints are used to create the final task datasets. 7) Expert Validation: Finally, the filtered expressions, along with visualizations of their generated datapoints, are cross-checked by two subject matter experts to validate their plausibility. After these filtering steps, we finalize a candidate list of 128 problems across the four domains (chemistry: 36; biology: 24; physics: 43; and material science: 25). More detailed analysis of the LLM-SRBench datasets is provided in App. A.</p></div>
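Steps 4 and 6 (numerical solvability check and datapoint generation) can be sketched with SciPy's `solve_ivp`. The ODE below combines a first-order known term with the high-order saturation synthetic term from the reaction-kinetics example above; the parameter values and initial condition are hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

k, beta = 0.5, 2.0  # hypothetical rate and saturation parameters

def rhs(t, A):
    # known first-order term plus a synthetic high-order saturation term
    return -k * A + k * A**2 / (1.0 + beta * A**4)

# Step 4: solvability check -- the ODE must integrate without failure
# and produce finite values over the chosen time span.
sol = solve_ivp(rhs, t_span=(0.0, 10.0), y0=[1.0],
                t_eval=np.linspace(0.0, 10.0, 200), rtol=1e-8)
assert sol.success and np.all(np.isfinite(sol.y))

# Step 6: the (t, A) pairs form the task dataset.
dataset = np.column_stack([sol.t, sol.y[0]])
```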
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Evaluation</head><p>Evaluating LLM-based scientific equation discovery methods introduces unique challenges due to the open-ended nature of the task and diverse symbolic representation of hypotheses. A discovered equation can be assessed from two perspectives: (a) data fidelity, which measures how well the equation fits the observed and out-of-domain (OOD) data, and (b) symbolic accuracy, which evaluates the alignment with ground-truth symbolic equation hypotheses. Both perspectives are critical, as equations may exhibit similar symbolic forms but differ numerically, or vice versa.</p><p>Data Fidelity. We evaluate data-driven fidelity using two known metrics in equation discovery: (1) Acccuracy to tolerance &#964; (Acc &#964; ) <ref type="bibr">(Kamienny et al., 2022;</ref><ref type="bibr">Biggio et al., 2021)</ref>, and Normalized Mean Squared Error (NMSE). These metrics are computed on both in-domain test data and OOD data (when available) to assess generalization capacity, a crucial requirement for scientific equations.</p><p>Ntest i=1 (y i -&#563;) 2 Symbolic Accuracy. We evaluate symbolic accuracy with a model-based evaluation strategy using GPT-4o as an evaluator (prompt in App. B, Fig. <ref type="figure">11</ref>). This approach addresses the limitations of current symbolic metrics like recovery rate in symbolic regression <ref type="bibr">(La Cava et al., 2016)</ref>, which are very sensitive to exact symbolic matches and fail to account for mathematical equivalence, particularly in different hypothesis representations (e.g., equation as strings, expression trees, or Python programs). Here, GPT-4o evaluates mathematical equivalence by comparing the symbolic form of the predicted hypothesis versus the ground-truth equation after removing parameters and constants. 
The ability of LLMs to recognize semantic equivalence across different representations makes them particularly well-suited for evaluating LLM-based equation discovery methods, which often operate within a more diverse and open-ended hypothesis space. To validate this metric, two authors also independently evaluated symbolic equivalence on 130 sampled problems, finding 94.6% agreement between GPT-4o and human evaluators. App. B provides more details on the evaluation metrics.</p></div>
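A minimal sketch of the two data-fidelity metrics. It assumes the per-problem "accuracy to tolerance" criterion (a problem counts as solved when the worst-case relative error on held-out points is within &#964;) and variance-normalized squared error for NMSE; the small epsilon guarding division is an implementation detail of this sketch.

```python
import numpy as np

def acc_tau(y_true, y_pred, tau=0.1):
    """1.0 if the worst-case relative error over all test points is within
    tolerance tau (per-problem accuracy-to-tolerance criterion), else 0.0."""
    rel_err = np.abs(y_pred - y_true) / np.maximum(np.abs(y_true), 1e-12)
    return float(np.max(rel_err) <= tau)

def nmse(y_true, y_pred):
    """Squared error normalized by the spread of the ground-truth targets."""
    return np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
```

Aggregated over a benchmark, `acc_tau` averages to the fraction of problems solved within tolerance, while `nmse` is averaged directly across problems.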
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Experimental Setup</head><p>We benchmark state-of-the-art LLM-based scientific equation discovery methods using three LLM backbones: one open-source model (Llama-3.1-8B-Instruct) and two proprietary models (GPT-4o-mini and GPT-3.5-turbo). Each discovery task takes as input the problem description, variables, the corresponding dataset, and an instruction specifying the task. The discovery methods then generate and refine equation hypotheses through LLMs. To ensure fair comparison, we standardize each of the methods to use 1k LLM calls per problem while maintaining their core algorithmic designs and hyperparameter settings. Detailed implementation specifics and prompts of each method are provided in App. C. We </p><p>61 1.801 0.3697 0.0 0.0 0.0644 0.0 0.0 0.5481 0.0 0.0 0.0459 0.0 0.0 0.0826 GPT-3.5-turbo 2.10 1.801 0.3553 0.0 8.33 0.0023 0.0 4.16 0.5990 0.0 2.27 0.0274 0.0 0.0 0.0277 GPT-4o-mini 7.21 6.306 0.2631 0.0 13.88 0.0221 0.0 4.16 0.4648 4.54 9.09 0.0647 0.0 0.0 0.0484 SGA (Ma et al., 2024) Llama-3.1-8B-Instruct 2.70 0.909 0.3519 0.0 8.33 0.0458 0.0 0.0 0.2416 0.0 2.27 0.1549 0.0 12.12 0.0435 GPT-3.5-turbo 0.0 0.909 0.3465 0.0 8.33 0.0071 0.0 8.33 0.1279 2.27 4.54 0.0249 0.0 28.10 0.0019 GPT-4o-mini 9.91 8.11 0.2321 0.0 16.66 5.46e-4 4.16 12.51 0.0128 4.54 9.09 0.0511 0.0 36.11 6.02e-4 LaSR (Grayeli et al., 2024) Llama-3.1-8B-Instruct 5.41 45.94 0.0021 0.0 27.77 2.77e-4 4.16 16.66 2.73e-4 4.54 25.02 0.0018 8.21 64.22 7.44e-5 GPT-3.5-turbo 12.61 47.74 0.0015 0.0 38.89 1.51e-4 0.0 16.66 2.31e-4 6.81 22.71 0.0011 20.66 64.09 3.77e-5 GPT-4o-mini 6.31 50.45 0.0011 2.77 38.92 9.11e-5 8.33 20.83 1.53e-4 9.91 31.81 9.94e-4 28.12 72.04 9.23e-6 LLM-SR (Shojaee et al., 2024b) Llama-3.1-8B-Instruct 30.63 38.55 0.0101 8.33 66.66 8.01e-6 25.30 58.33 1.04e-6 6.97 34.09 1.23e-4 4.10 88.12 1.15e-7 GPT-3.5-turbo 10.81 10.81 0.1449 0.0 50.22 2.87e-5 0.0 25.03 2.33e-5 0.0 25.12 8.84e-4 12.42 82.14 2.75e-8 GPT-4o-mini 31.53 
39.64 0.0091 11.11 52.77 4.12e-6 16.66 29.16 3.06e-6 9.91 36.36 7.62e-5 20.24 88.28 3.21e-9</p><p>evaluate the following discovery methods:</p><p>LLM-SR <ref type="bibr">(Shojaee et al., 2024b)</ref>, a program search equation discovery method that generates hypotheses of equation skeleton as Python functions with the main idea of combining LLMs' scientific knowledge with multi-island evolutionary search guided by feedback from data.</p><p>LaSR <ref type="bibr">(Grayeli et al., 2024)</ref>, a concept learning equation discovery method that finds abstract textual concepts of mathematical relations from successful equation hypotheses with LLMs and uses these concepts to evolve new hypotheses through a hybrid approach of evolutionary search (with PySR <ref type="bibr">(Cranmer, 2023)</ref>) and LLM-guided search.</p><p>SGA <ref type="bibr">(Ma et al., 2024)</ref>, a bilevel optimization equation discovery method that iteratively combines LLMs for discrete hypothesis generation of scientific laws and physical simulations in PyTorch for continuous parameter optimization with respect to data.</p><p>Direct Prompting (DataBlind) serves as a baseline for generating hypotheses purely from contextual information without access to data. By not using data-driven reasoning and refinement in the hypothesis generation, this baseline helps to assess LLMs' memorization of the problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Main Results</head><p>Our experimental results (Table <ref type="table">1</ref>) reveals several key insights into the strengths and limitations of LLM-based scientific equation discovery methods. Overall, performance remains relatively low across both symbolic and numeric metrics, underscoring the fundamental challenges of this task. One key observation is the poor performance of direct prompting method (DataBlind), which only relies on LLMs' knowledge about the problem without access to data for data-driven refinement. This result underscores the necessity of combining LLM reasoning with observational data, as relying solely on prior knowledge proves insufficient for accurate equation discovery across different problems in LLM-SRBench. We observe that on LSR-Transform data group, LaSR achieves the highest numerical accuracy, leading in both Acc 0.1 and NMSE, while LLM-SR with GPT-</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chemistry</head><p>Biology Material Science Physics 4o-mini outperforms other methods in symbolic accuracy (&#8764;31%). This comparative advantage inverts in the LSR-Synth material science problems, where LaSR consistently yields better symbolic accuracy and LLM-SR achieves better numerical precision, suggesting that different equation discovery strategies may be better suited to different problems.</p><p>Another notable observation is the consistent outperformance of models using GPT-4o-mini and Llama-3.1-8B compared to those based on GPT-3.5-turbo. This may be due to improved reasoning architectures or better effectiveness of smaller, less opinionated models in the search and exploration needed for navigating space of possible equations. The lower performance on LSR-Synth compared to LSR-Transform tasks also indicates that the ability to find transformed variants of known problems does not necessarily extend to more challenging scenarios involving novel synthetic terms, where systematic data-driven exploration becomes essential.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Analysis</head><p>LSR-Transform vs. Feynman datasets. We analyze the performance gap between Feynman and LSR-Transform datasets across different equation complexity levels, measured by the number of nodes in the corresponding expression tree (La <ref type="bibr">Cava et al., 2021)</ref>. Fig. <ref type="figure">4</ref> shows the aggregated average performance (over all methods and LLM backbones) in terms of both symbolic accuracy (a) and numeric precision (b). It can be observed that even at the same complexity levels, LSR-Transform problems are substantially more challenging for current discovery methods than original Feynman problems. Also, this performance disparity persists even for simpler problems ([0-15] nodes), indicating that the challenging nature of LSR-Transform problems for LLM-based scientific equation discovery methods is not necessarily due to the structural complexity.</p><p>Performance on In-domain vs. OOD. Generalization to unseen data is a fundamental requirement for scientific laws and a critical aspect of equation discovery. A correct mathematical model of observations should not only fit observed data but also extrapolate accurately to out-of-domain (OOD) scenarios. However, current equation discovery benchmarks largely overlook this aspect. In this work, we advocate for explicit OOD assessment in scientific equation discovery by introducing held-out OOD test sets in our benchmark. To systematically evaluate generalization beyond observed data, we generate dedicated OOD test sets for synthetic problems in the LSR-Synth category (see App.</p><p>A for details on data generation). Fig. <ref type="figure">5</ref> provides a comparative analysis of ID vs. OOD results. As expected, all discovery methods exhibit higher NMSE in OOD settings, indicating degraded generalization compared to in-domain data. 
Among the evaluated methods, LLM-SR achieves the lowest NMSE in both ID and OOD settings, while direct prompting performs the worst. We also observe domain-specific variations in generalization performance: the gap between ID and OOD is more pronounced for chemistry and biology problems than for physics and material science, although problem complexity is designed to be similar across domains, as shown in Fig. <ref type="figure">10</ref>. This suggests that different scientific problems may pose distinct challenges for equation discovery methods, highlighting the need for future research to develop more robust approaches across scientific disciplines.</p><p>OOD generalization and symbolic accuracy. We further analyze the correlation between our proposed symbolic accuracy metric (Sec. 2.3) and data-driven extrapolation performance in OOD settings (averaged over all LSR-Synth domains). As shown in Fig. <ref type="figure">6</ref>, symbolic accuracy exhibits a strong positive correlation with numerical precision (Acc 0.1 ) on OOD data and a corresponding negative correlation with numerical error (NMSE). This strong correlation between symbolic and OOD performance provides two key insights: first, it establishes OOD evaluation as a powerful approach for assessing the discovery of generalizable equations, an aspect often underutilized in symbolic regression research; second, it validates our LLM-based symbolic evaluation approach through its strong alignment with numeric generalization performance.</p><p>More detailed experimental results, including both qualitative analyses of discovered equations and quantitative performance comparisons across scientific equation discovery methods and LLMs, are provided in App. D.</p></div>
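The complexity measure used in this analysis, the number of nodes in an equation's expression tree, can be sketched with SymPy. This minimal recursive count assumes every operator, symbol, and numeric constant contributes one node; the benchmark's exact counting convention may differ in details.

```python
import sympy as sp

def tree_nodes(expr):
    """Count nodes in a SymPy expression tree: each operator, symbol,
    and numeric constant contributes one node."""
    return 1 + sum(tree_nodes(arg) for arg in expr.args)

x, y = sp.symbols("x y")
assert tree_nodes(x + y) == 3          # Add node plus two leaves
assert tree_nodes(sp.sin(x) * y) == 4  # Mul, sin, x, y
```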
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Related Work</head><p>AI for Scientific Discovery. Recent advancements in AI for science highlight the ability of LLMs to generate scientific hypotheses by leveraging their extensive knowledge and reasoning capabilities <ref type="bibr">(Lu et al., 2024;</ref><ref type="bibr">Ji et al., 2024;</ref><ref type="bibr">Reddy &amp; Shojaee, 2024)</ref>. LLM agents, when augmented with external tools and scientific simulators, have shown promise in automated scientific data-driven analysis <ref type="bibr">(Majumder et al., 2024a)</ref>. While recent benchmarks have been developed to evaluate LLMs and agents in hypothesis generation and scientific question answering <ref type="bibr">(Majumder et al., 2024b;</ref><ref type="bibr">Chen et al., 2024)</ref>, evaluation for equation discovery and symbolic regression-one of the core tasks in scientific discovery-remains yet unexplored.</p><p>Symbolic Regression. Symbolic regression approaches fall into three main categories: search-based methods that explore equation spaces via evolutionary algorithms or reinforcement learning <ref type="bibr">(Schmidt &amp; Lipson, 2009;</ref><ref type="bibr">Cranmer, 2023;</ref><ref type="bibr">Petersen et al., 2021;</ref><ref type="bibr">Sun et al., 2023)</ref>, learning-based methods leveraging pre-trained Transformers on synthetic data <ref type="bibr">(Biggio et al., 2021;</ref><ref type="bibr">Kamienny et al., 2022)</ref>, and hybrid approaches that guide search using neural priors <ref type="bibr">(Landajuela et al., 2022;</ref><ref type="bibr">Shojaee et al., 2024a;</ref><ref type="bibr">Mundhenk et al., 2021;</ref><ref type="bibr">Meidani et al., 2023)</ref>. While these methods have advanced the field of automated symbolic function discovery from data, they mostly lack mechanisms to incorporate scientific domain knowledge into the discovery process.</p><p>LLMs for Equation Discovery. 
Recent work has leveraged LLMs' embedded knowledge in symbolic regression to enhance scientific equation discovery. LLM-SR <ref type="bibr">(Shojaee et al., 2024b)</ref> utilizes LLMs' embedded scientific knowledge to generate initial equation hypotheses in the form of Python functions, which are then refined through adaptive mutation and crossover operations with LLMs as evolutionary optimizers. In-Context Symbolic Regression (ICSR) <ref type="bibr">(Merler et al., 2024)</ref> employs an iterative few-shot learning paradigm over expression candidates, using previously tested successful expressions along with their fitness scores to guide the generation of improved candidates. LaSR <ref type="bibr">(Grayeli et al., 2024)</ref> alternates between hypothesis evolution, concept abstraction, and concept evolution phases to build a learned library of scientific concepts describing the mathematical relations needed to find the equation for a given dataset. The learned concepts are then used both with pure evolutionary search methods like PySR <ref type="bibr">(Cranmer, 2023)</ref> and with LLM-guided search to guide the equation hypothesis evolution. Scientific Generative Agent (SGA) (Ma et al., 2024) implements a bilevel optimization framework for equation discovery in which LLMs iteratively propose discrete hypotheses for scientific laws while physical simulations in PyTorch provide experimental validation and data-driven parameter optimization.</p><p>Symbolic Regression Benchmarks. Symbolic regression benchmarks can be broadly categorized into scientific discovery-oriented and general-purpose mathematical discovery collections. The scientific equation discovery benchmarks are primarily represented by the SRBench (La Cava et al., 2021) and SRSD (Matsubara et al., 2022) benchmarks. 
SRBench incorporates two key data groups for this purpose: the Feynman physics equations (Udrescu &amp; Tegmark, 2020) and the Strogatz dynamical systems (La Cava et al., 2016; Strogatz, 2018). A notable extension to this framework is presented in SRSD (Matsubara et al., 2022), which enhances the Feynman benchmark by incorporating physically meaningful sampling ranges for datapoints. The second category includes benchmarks like the Nguyen collection (Uy et al., 2011) and SRBench's black-box regression problems (La Cava et al., 2016), which include datasets without scientific context. However, these existing benchmarks are not well-suited for evaluating LLM-based equation discovery methods: the general-purpose benchmarks focus on data-driven discovery of abstract mathematical functions without scientific context, while the scientific benchmarks consist of well-known equations likely memorized by LLMs, enabling success through recitation rather than scientific reasoning and discovery. Our work extends this line of research by focusing on scientific equation discovery with LLMs, designing the first comprehensive benchmark to assess the discovery capabilities of LLM-based scientific equation discovery methods beyond memorization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>We introduce LLM-SRBench, the first comprehensive benchmark for LLM-driven scientific equation discovery, encompassing 239 tasks across two distinct categories: LSR-Transform (111 problems derived from transformations of established physical models) and LSR-Synth (128 novel synthetic problems spanning four scientific disciplines). Our benchmark provides a standardized and multi-faceted evaluation protocol for assessing scientific equation discovery with LLMs, accommodating diverse hypothesis representations, including expression strings and programs. Extensive experiments with state-of-the-art discovery methods and various LLM backbones on LLM-SRBench show a peak performance of only 31.5%, highlighting the significant challenges and open research opportunities in this domain. We envision that the LLM-SRBench datasets and evaluation protocol can serve as a foundation for future research, driving progress in automated equation discovery and advancing our understanding of the symbolic reasoning LLMs need for scientific discovery.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Impact Statement</head><p>The development and future adoption of LLM-SRBench as a benchmark for evaluating LLM-based scientific equation discovery has the potential to significantly impact the field of artificial intelligence for science and scientific discovery.</p><p>There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. LSR-Transform</head><p>LSR-Transform is the first category of datasets in LLM-SRBench, designed to evaluate the ability of LLM-based scientific equation discovery methods to reason over less common mathematical forms. This dataset challenges LLM-based discovery methods to avoid reliance on memorization of well-known representations and instead reason through unfamiliar instantiations of familiar problems. This approach is motivated by the observation that LLMs often struggle with unfamiliar instantiations of otherwise familiar problems, as highlighted by recent studies on the fragility of LLM reasoning <ref type="bibr">(Mirzadeh et al., 2024)</ref>. By transforming existing benchmark problems into alternative mathematical representations, LSR-Transform provides a rigorous testbed to evaluate how well LLM-based discovery methods perform in both (1) semantic scientific reasoning, which draws on LLMs' built-in scientific knowledge, and (2) data-driven reasoning, which utilizes experimental feedback for equation discovery. LSR-Transform builds on the Feynman benchmark (Udrescu &amp; Tegmark, 2020), a widely used standard benchmark in scientific equation discovery and symbolic regression. The Feynman benchmark consists of 100 physics equations from the Feynman Lectures on Physics<ref type="foot">foot_0</ref>, representing fundamental laws in physics. While the Feynman benchmark has been instrumental in evaluating symbolic regression methods, it primarily tests the ability to recover equations in their standard, well-known forms, which are mostly memorized by LLMs. 
However, real-world scientific equation discovery often involves reasoning about unknown equations based on domain expertise and knowledge from the literature as well as empirical data observations. To address this gap, LSR-Transform converts the original Feynman equations into less common alternative mathematical forms of the same physical problem by switching input-output variables and symbolically solving for the new target variables. For example, the harmonic oscillator energy E = (1/4) m (ω² + ω₀²) x² yields the transformed targets m = 4E / ((ω² + ω₀²) x²) and ω = √(4E / (m x²) − ω₀²); the electric dipole potential V_e = p_d cos(θ) / (4πϵ r²) yields p_d = 4πϵ r² V_e / cos(θ), r = √(p_d cos(θ) / (4πϵ V_e)), and θ = arccos(4πϵ r² V_e / p_d); and the diode current I = I₀ (exp(qV / (k_b T)) − 1) yields q = (k_b T / V) ln(I/I₀ + 1), I₀ = I / (exp(qV / (k_b T)) − 1), and T = qV / (k_b ln(I/I₀ + 1)).</p><p>Find a mathematical expression that represents the total energy (E) of a harmonic oscillator system, given data on the mass of the object (m), the angular frequency of the system (ω), the natural angular frequency of the system (ω₀), and the displacement from the equilibrium position (x).</p><p>Find a mathematical expression that represents the electric potential (V_e) at a point in space due to an electric dipole, given data on the dipole moment (p_d), the angle between the dipole axis and the radius vector to the point (θ), the distance from the dipole to the point (r), and the permittivity of free space (ϵ).</p><p>Find a mathematical expression that represents the current (I) in a semiconductor diode, given data on the saturation current (I₀), the elementary charge (q), the voltage across the diode (V), Boltzmann's constant (k_b), and the absolute temperature (T).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Transformed Problem Examples</head><p>For instance, the energy equation E = (1/4) m (ω² + ω₀²) x² is transformed to solve for the mass (m) or the angular frequency (ω). Similarly, the electric potential equation V_e = p_d cos(θ) / (4πϵ r²) is transformed into p_d = 4πϵ r² V_e / cos(θ) and r = √(p_d cos(θ) / (4πϵ V_e)), showcasing how the problem is reformulated to solve for the dipole moment (p_d) and the distance (r). These transformations introduce less common mathematical representations that are simple but not trivial for LLMs to find from the problem description and data. By systematically altering the input-output relationships into new analytically solvable symbolic forms, LSR-Transform challenges models to reason through unfamiliar mathematical forms, testing their ability to generalize beyond memorized representations and leverage data-driven reasoning to find new forms.</p><p>The transformed expressions generally exhibit higher complexity than the original physical laws in the Feynman benchmark. To maintain our focus on evaluating semantic complexity (reasoning and memorization capabilities) rather than syntactic complexity and lengthy hypotheses, we deliberately filtered out LSR-Transform expressions with significantly higher complexities from the dataset. This filtering ensures that the benchmark primarily challenges discovery models' ability to understand and conduct both scientific and data-driven reasoning rather than their capacity to model longer and more complex mathematical expressions. Figure <ref type="figure">8</ref> shows the complexity distribution of the original Feynman benchmark problems versus their transformed counterparts in LSR-Transform. Following <ref type="bibr">(La Cava et al., 2021)</ref>, the complexity of each hypothesis (i.e., expression) is quantified as the number of nodes in the expression tree representation of the equation. 
The expression tree is constructed by parsing the equation into its constituent unary and binary operators, variables, and constants.</p><p>Finally, we also exclude the transformed problems that an LLM (Llama-3.1-8B-Instruct) can solve through direct sampling without requiring access to data. This process creates a dataset of 111 transformed equations, each sharing the same scientific context and variables as its original counterpart but presenting a less common mathematical form. The goal of LSR-Transform is not to discover new equations but to evaluate whether LLM-based systems can recover non-trivial, data-driven transformations of known equations.</p></div>
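The node-count complexity measure described above can be sketched with SymPy; `complexity` is an illustrative helper, not the benchmark's own implementation.

```python
import sympy as sp

def complexity(expr_str: str) -> int:
    """Complexity = number of nodes (operators, variables, constants)
    in the expression tree, following the node-count definition of
    La Cava et al. (2021)."""
    expr = sp.sympify(expr_str)
    # preorder_traversal visits every node of the expression tree once.
    return sum(1 for _ in sp.preorder_traversal(expr))

# A Feynman-style law vs. one of its transformed counterparts:
print(complexity("m*(w**2 + w0**2)*x**2/4"))
print(complexity("sqrt(4*E/(m*x**2) - w0**2)"))
```

Note that SymPy auto-simplifies on parsing, so node counts reflect the canonicalized tree rather than the literal input string.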
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Details of Filtering Process</head><p>This section provides a comprehensive breakdown of the filtering steps applied during the LSR-Transform dataset generation, explaining how the 100 original Feynman problems yield 111 transformed equations. The LSR-Transform dataset generation involves multiple filtering stages that significantly reduce the number of candidate problems. Starting from 100 original Feynman problems, the transformation process initially generates 471 candidate equations by selecting different pivot variables for each equation and performing feature-target transformations. This expansion reflects an average of approximately 4.7 transformed candidates per original problem, demonstrating the diversity introduced by considering multiple input variables as potential targets. The first major filtering occurs during the solvability check using SymPy's symbolic solver (Step 5 in Figure <ref type="figure">3</ref>), which eliminates 53 problems (11.3% of candidates) that cannot be analytically solved for the target variable. These typically include transcendental equations without closed-form solutions, high-degree polynomial equations where symbolic solutions become intractable, and equations involving complex multi-valued functions. After this stage, 418 problems remain. Notably, no equations are eliminated during dataset refinement (Step 6 in Figure <ref type="figure">3</ref>). This stage focuses solely on filtering individual datapoints to ensure they fall within the valid domains of the transformed equations (e.g., ensuring positive values under square roots, avoiding division by zero), while the equations themselves remain intact. The most significant reduction occurs during complexity filtering, where 307 problems (73.4% of remaining candidates) are eliminated, resulting in the final 111 problems. 
This filtering serves a crucial purpose: to ensure that the challenging nature of LSR-Transform stems from semantic complexity (reasoning about the scientific problem and unfamiliar mathematical forms) rather than syntactic complexity (handling lengthy expressions). Following <ref type="bibr">La Cava et al. (2021)</ref>, complexity is measured as the number of nodes in the expression tree representation of each equation. Following this definition, we constrain the complexity distribution to match that of the original Feynman benchmark (Figure <ref type="figure">8</ref>). In other words, transformed equations with complexity significantly exceeding the original Feynman distribution are excluded. These design choices maintain focus on testing reasoning capabilities while preserving analytical tractability and scientific diversity across physics domains. As demonstrated in Figure <ref type="figure">8</ref>, even after filtering, LSR-Transform problems remain substantially more challenging than original Feynman problems at the same levels of complexity.</p></div>
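The feature-target transformation and solvability check (Step 5) can be sketched with SymPy's symbolic solver. This is an illustrative reconstruction on the harmonic oscillator example, not the benchmark's actual pipeline code.

```python
import sympy as sp

E, m, w, w0, x = sp.symbols('E m omega omega0 x', positive=True)

# Original Feynman form: E = (1/4) * m * (omega**2 + omega0**2) * x**2
original = sp.Eq(E, sp.Rational(1, 4) * m * (w**2 + w0**2) * x**2)

# Feature-target transformation: pick a pivot variable and solve for it
# symbolically; candidates SymPy cannot solve in closed form would be
# filtered out at this stage.
for pivot in (m, w):
    solutions = sp.solve(original, pivot)
    if solutions:  # solvability check
        print(sp.Eq(pivot, solutions[0]))
```

Each surviving solution becomes one transformed candidate problem, with the pivot variable as the new target.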
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. LSR-Synth</head><p>LSR-Synth, the second category of datasets in LLM-SRBench, is a collection of synthetic problems designed to benchmark the performance of LLMs in scientific equation discovery. This dataset is particularly focused on generating plausible yet challenging equation discovery problems that span multiple scientific domains, including chemistry, physics, biology, and material science. The problems in LSR-Synth are constructed by combining known terms, which are well-established in the scientific literature, with synthetic terms that introduce novel and plausible variations to the equations.</p><p>Figure <ref type="figure">9</ref> provides examples of problems from LSR-Synth. These examples demonstrate the dataset's design, which combines well-established mathematical and scientific expressions with novel, domain-specific variations to create challenging problems that resist trivial LLM memorization. Each equation is composed of both known and synthetic terms (highlighted in red). Known terms are terms that are commonly found in scientific equations and are well-documented in the literature for that domain and specific problem. For example, terms like −C₀A(t) and −C₀A(t)² are typical in chemical reaction models, representing first-order and second-order kinetics. These terms are included to ensure that the problems remain grounded in the established scientific context, providing a foundation for LLM-based methods to build upon for equation discovery in each scientific problem. Synthetic terms, on the other hand, are introduced to create novel variations that avoid trivial LLM memorization. For instance, terms like sin(A(t)) and cos(log(A(t) + 1)) in chemistry reaction kinetics are designed to challenge LLM-based discovery models by introducing non-linearities and interactions that are not commonly seen in standard models. 
These terms are critical for testing the ability of LLM-based equation discovery models to generalize beyond memorization of standard known formulations and to discover new patterns through data-driven reasoning and refinement. The combination of known and synthetic terms in LSR-Synth creates a dataset that is both challenging and representative of established scientific problems. This approach enables rigorous evaluation of models' capabilities in interpreting and discovering complex scientific equations, striking a balance between domain familiarity and innovative data-driven reasoning. To generate these known and synthetic terms across various domains, we leverage an LLM (GPT-4o) by providing problem domain context and descriptions, prompting it to generate candidate terms. The suggested terms and equations are then filtered based on solvability and novelty criteria, followed by domain expert validation.</p><p>Figure <ref type="figure">10</ref> provides an analysis of the complexity of the problems in the LSR-Synth dataset. Similar to Figure <ref type="figure">8</ref>, complexity is quantified as the number of nodes in the expression tree. This figure highlights the diverse nature of the LSR-Synth dataset, with complexity levels ranging from simple expressions to highly complex ones. By spanning a wide range of domains (chemistry, physics, biology, and material science) and hypothesis complexities, LSR-Synth serves as a comprehensive dataset for evaluating the capabilities of LLMs in scientific equation discovery.</p><p>Once the structure of equations is generated, their parameters (coefficients) are sampled randomly from specified and scientifically valid ranges, and then data are generated through different solution methods depending on the domain. 
For dynamical systems (chemical reactions, population dynamics, and physical oscillators), we employ numerical integration using SciPy's solve_ivp with the RK45 method, while static relationships (material stress-strain) are evaluated directly over predetermined input ranges. For each domain, we generate 5000 evenly spaced samples. In dynamical systems, these samples span the time interval t ∈ [0, 60], while for material stress-strain relationships, the samples cover strain ϵ ∈ [0, 0.6] and temperature T ∈ [273, 573] K. To evaluate out-of-distribution (OOD) generalization, for time-dependent systems we designate the last 500 time points as the OOD test set, with the remaining 4500 points used for in-domain (ID) training and validation. Similarly, for the stress-strain domain, the OOD test set comprises the last 500 points based on temperature values, maintaining a consistent evaluation framework across all domains.</p><p>(Figure 9 appears here: examples of LSR-Synth problems with known and synthetic terms, covering chemistry (reaction rate with respect to time and concentration), biology (growth rate with respect to time and population size), physics (acceleration with respect to time, displacement, and velocity), and material science (stress with respect to strain and temperature).)</p><p>The data generation process incorporates the same quality control criteria used in equation generation. Generated solutions must satisfy: (1) solvability within specified numerical tolerance, (2) meaningful physical behavior (avoiding divergence or constant solutions), and (3) uniqueness compared to existing solutions (using RMSE thresholds). These criteria ensure that the final dataset contains diverse, physically meaningful, and numerically stable solutions suitable for benchmarking equation discovery methods.</p></div>
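The data-generation recipe for dynamical systems can be sketched as follows. The rate law and coefficients are hypothetical stand-ins for an LSR-Synth chemistry problem (known first- and second-order terms plus a synthetic sinusoidal term), not actual benchmark values.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical LSR-Synth-style rate law: known terms -C0*A and -C1*A**2
# plus a synthetic term -C2*sin(A). Coefficients are illustrative.
C0, C1, C2 = 0.3, 0.1, 0.05

def rate(t, A):
    return -C0 * A - C1 * A**2 - C2 * np.sin(A)

t = np.linspace(0, 60, 5000)  # 5000 evenly spaced samples over t in [0, 60]
sol = solve_ivp(rate, (0, 60), y0=[1.0], t_eval=t, method='RK45')
A = sol.y[0]

# Last 500 time points -> OOD test set; first 4500 -> ID train/validation.
id_t, ood_t = t[:4500], t[4500:]
id_A, ood_A = A[:4500], A[4500:]
print(id_A.shape, ood_A.shape)
```

The quality-control checks described above (numerical solvability, non-divergence, non-constancy, uniqueness via RMSE) would then be applied to each generated trajectory.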
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Evaluation Details</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1. Data Fidelity</head><p>We evaluate the data-driven performance of discovered equations through multiple complementary metrics focusing on both predictive accuracy and generalization capability. The primary metrics are Accuracy to Tolerance (Acc_τ) and Normalized Mean Squared Error (NMSE). The Acc_τ metric provides a binary assessment of prediction accuracy based on point-wise relative error. An equation is considered accurate if the maximum relative error across all test points is within the tolerance τ. Formally:</p><p>Acc_τ = 1( max_{1 ≤ i ≤ N_test} |ŷ_i − y_i| / |y_i| ≤ τ ),</p><p>where ŷ_i represents the predicted value, y_i is the true value, and N_test is the number of test samples. The indicator function 1(·) returns 1 if the condition is satisfied and 0 otherwise. This metric is particularly useful for cases where maintaining a consistent level of accuracy across all predictions is crucial, as it identifies equations that might have occasional but significant deviations from the true values. NMSE provides a complementary continuous measure of the overall prediction quality, normalized by the scale of the true values: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>NMSE = Σ_{i=1}^{N_test} (y_i − ŷ_i)² / Σ_{i=1}^{N_test} (y_i − ȳ)²</head><p>This normalization makes the metric scale-invariant, allowing meaningful comparisons across different datasets and equation types. The NMSE ranges from 0 to ∞, where 0 indicates perfect prediction. Unlike Acc_τ, NMSE provides a more nuanced view of model performance by considering the magnitude of prediction errors across all test points rather than just their maximum relative error. Beyond standard predictive metrics, we also place particular emphasis on the evaluation of out-of-distribution (OOD) generalization, a critical requirement for scientific equations. For datasets in LSR-Synth, which have been generated synthetically, we evaluate the discovered hypotheses on held-out OOD test sets to also assess extrapolation capabilities. The performance gap between in-domain and OOD test sets (ΔNMSE and ΔAcc_τ) provides valuable insights into the generalizability of the discovered equations.</p></div>
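Both metrics can be sketched in a few lines of NumPy; `acc_tau` and `nmse` are illustrative helpers, with NMSE here normalized by the variance of the true values (an assumption consistent with the scale-invariance property described above).

```python
import numpy as np

def acc_tau(y_true, y_pred, tau=0.1):
    """Binary accuracy-to-tolerance: 1 if the maximum point-wise
    relative error over the test set is within tau, else 0."""
    rel_err = np.abs(y_pred - y_true) / np.abs(y_true)
    return float(np.max(rel_err) <= tau)

def nmse(y_true, y_pred):
    """Mean squared error normalized by the variance of the true values."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.05, 1.9, 3.1, 4.0])
print(acc_tau(y_true, y_pred, tau=0.1), nmse(y_true, y_pred))
```

A single large point-wise deviation flips Acc_τ to 0 while changing NMSE only marginally, which is exactly why the two metrics are complementary.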
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Symbolic Accuracy</head><p>We introduce a novel evaluation methodology for equation discovery that leverages an LLM (GPT-4o) as an evaluator for assessing mathematical equivalence between predicted and gold equation hypotheses. Traditional metrics in symbolic regression, such as recovery rate <ref type="bibr">(La Cava et al., 2016)</ref>, exact match, or normalized tree edit distance <ref type="bibr">(Matsubara et al., 2022)</ref>, often fail to capture the true semantic equivalence of mathematical expressions, especially when dealing with different representation formats or algebraically equivalent forms. Our approach employs GPT-4o as an automated evaluator, capable of analyzing symbolic equivalence across diverse representation formats including equation strings, expression trees, and executable programs. The evaluation process begins by pre-processing the hypotheses: (1) removing extraneous information (such as natural-language comments in the case of programs), and (2) replacing constants with placeholder parameter vectors, so that the comparison focuses solely on logical structure and mathematical relations. To assess the reliability of this LLM-based symbolic evaluation approach for equation discovery, we conducted a human evaluation study. Two of the authors independently assessed mathematical symbolic equivalence on a set of 130 randomly sampled problems. The validation study revealed a 94.6% agreement rate between GPT-4o and human evaluators (123 out of 130), where agreement rate is calculated as the percentage of cases in which both the LLM and human evaluators made the same judgment about the mathematical equivalence between predicted and ground-truth equations.</p><p>Figure <ref type="figure">11</ref> provides the prompt used for our GPT-4o-based evaluation of the mathematical symbolic equivalence between the generated hypothesis (in the form of a program or expression) and the ground-truth equation. 
In this setting, GPT-4o first articulates its mathematical reasoning before making a binary equivalence assessment. </p></div>
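The constant-replacement pre-processing step can be sketched with SymPy. This illustrative checker uses symbolic simplification as a stand-in for the GPT-4o judge described above, and naively replaces every numeric atom with one shared placeholder (a simplification: it also replaces integer exponents, which a fuller version would preserve).

```python
import sympy as sp

def equivalent_up_to_constants(expr1: str, expr2: str) -> bool:
    """Structural-equivalence sketch: replace numeric constants with a
    shared placeholder C, then test whether the difference simplifies
    to zero. The benchmark itself uses an LLM judge, not this check."""
    e1, e2 = sp.sympify(expr1), sp.sympify(expr2)
    C = sp.Symbol('C')
    # Replace every numeric atom with the placeholder C so that only
    # the logical structure and variable relations are compared.
    e1 = e1.replace(lambda a: a.is_Number, lambda a: C)
    e2 = e2.replace(lambda a: a.is_Number, lambda a: C)
    return sp.simplify(e1 - e2) == 0

print(equivalent_up_to_constants("2.5*sin(x) + 1.0", "C*sin(x) + C"))
```

A symbolic check like this handles algebraic rewrites but not representation-format differences (e.g., programs vs. expression strings), which is where the LLM evaluator is needed.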
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2.2. LASR</head><p>We use the default prompts from LaSR's <ref type="bibr">(Grayeli et al., 2024)</ref> public code repository (<ref type="url">https://github.com/  trishullab/LibraryAugmentedSymbolicRegression.jl</ref>), which includes:</p><p>1. The LLMINIT prompt, which is used in an LLM-augmented initialization operation.</p><p>2. LLMMUTATION prompt is used to mutate an expression based on a set of concepts.</p><p>3. LLMCROSSOVER prompt is used to construct a new expression from the crossover of two sampled expressions based on a set of concepts.</p><p>4. LLM Concept Abstraction prompt in CONCEPTABSTRACTION function, which extracts a natural language concept from current trends of hypotheses at each iteration.</p><p>5. LLM Concept Evolution prompt in CONCEPTEVOLUTION function, which creates a new concept that follows a set of ideas in the current library.</p><p>In the following, we provide examples of these prompts.</p><p>1. LLMINIT prompt.</p><p>&lt;System prompt&gt; You are a helpful assistant that proposes a mathematical expression by following three provided suggestions.</p><p>An expression must consist of the following variables: {{variables}}. All constants will be represented with the symbol C. Each expression will only use these operators: {{operators}}.</p><p>&lt;User prompt&gt; Suggestion 1: {{assump1}} Suggestion 2: {{assump2}} Suggestion 3: {{assump3}}</p><p>Propose {{N}} expressions that would be appropriate given the suggestions. Provide short commentary for each of your decisions. End with a JSON list that enumerates the proposed expressions following this format: '''json ["expr1", "expr2", ... "expr{{N}}" ] '''</p><p>2. LLMMUTATION prompt.</p><p>&lt;System prompt&gt; You are a helpful assistant that mutates a mathematical expression by following a few provided suggestions. You will be given three suggestions and a single reference expression to mutate. 
An expression must consist of the following variables: {{variables}}. All constants will be represented with the symbol C. Each expression will only use these operators: {{operators}}.</p><p>&lt;User prompt&gt; Suggestion 1: {{assump1}} Suggestion 2: {{assump2}} Suggestion 3: {{assump3}} Reference Expression: {{expr}} Propose {{N}} expressions that would be appropriate given the suggestions and references. Provide short commentary for each of your decisions. End with a JSON list that enumerates the proposed expressions following this format: '''json ["expr1", "expr2", ... "expr{{N}}" ] '''</p><p>3. LLMCROSSOVER prompt.</p><p>&lt;System prompt&gt; You are a helpful assistant that recombines two mathematical expressions by following a few provided suggestions. You will be given three suggestions and two reference expressions to recombine. An expression must consist of the following variables: {{variables}}. All constants will be represented with the symbol C. Each expression will only use these operators: {{operators}}.</p><p>&lt;User prompt&gt; Suggestion 1: {{assump1}} Suggestion 2: {{assump2}} Suggestion 3: {{assump3}} Reference Expression 1: {{expr1}} Reference Expression 2: {{expr2}} Propose {{N}} expressions that would be appropriate given the suggestions and references. Provide short commentary for each of your decisions. End with a JSON list that enumerates the proposed expressions following this format: '''json ["expr1", "expr2", ... "expr{{N}}" ] ''' 4. LLM Concept Abstraction prompt. &lt;System prompt&gt; You are a helpful assistant that hypothesizes about the underlying assumptions that generated a list of good and bad mathematical expressions in detailed ways. My ultimate goal is to discover what assumptions generated the observed good mathematical expressions and excludes the bad mathematical expressions. Focus more on the good expressions, their mathematical structure, and any relation to physical concepts. 
Note that capital C represents an arbitrary constant &lt;User prompt&gt; Good Expression 1: {{gexpr1}} Good Expression 2: {{gexpr2}} Good Expression 3: {{gexpr3}} Good Expression 4: {{gexpr4}} Good Expression 5: {{gexpr5}} Bad Expression 1: {{bexpr1}} Bad Expression 2: {{bexpr2}} Bad Expression 3: {{bexpr3}} Bad Expression 4: {{bexpr4}} Bad Expression 5: {{bexpr5}} Propose {{N}} hypotheses that would be appropriate given the expressions. Provide short commentary for each of your decisions. Do not talk about topics related to the simplicity or complexity of the expressions. I want ideas that are unique and interesting enough to amaze the world's best mathematicians. End with a JSON list that enumerates the proposed hypotheses following this format: '''json ["hyp1", "hyp2", ... "hyp{{N}}" ] ''' 5. LLM Concept Evolution prompt. &lt;System prompt&gt; You are an insightful assistant skilled in logical reasoning and deduction. Your task is to analyze a set of ideas and infer nontrivial conclusions that logically follow from them. The ultimate goal is to uncover underlying principles or properties of the hidden expressions. Focus on providing logical conclusions that are unique, interesting, and profound. &lt;User prompt&gt; Idea 1: {{idea1}} Idea 2: {{idea2}} Idea 3: {{idea3}} Idea 4: {{idea4}} Idea 5: {{idea5}} Based on these ideas, deduce {{N}} logical conclusions or hypotheses that directly follow from them. Provide a brief explanation for each conclusion, highlighting the logical connections between the ideas. Avoid discussing topics related to the simplicity or complexity of the expressions. Conclude with a JSON list that enumerates the proposed conclusions in the following format: '''json ["Conclusion 1", "Conclusion 2", ... "Conclusion {{N}}" ] ''' C.2.3. 
SGA</p><p>The following prompts are used in our implementation of SGA <ref type="bibr">(Ma et al., 2024)</ref> for scientific equation discovery tasks, following the original implementation in SGA's public code repository (<ref type="url">https://github.com/PingchuanMa/SGA</ref>), which includes:</p><p>System prompt for task.</p><p>comparing different equation discovery methods with GPT-4o-mini as the LLM backbone, and examining different LLM backbones when using the LLM-SR method. The substantial variance in NMSE performance across samples reflects the diverse complexity inherent in our benchmark, stemming from both the varying mathematical transformations in LSR-Transform and the different combinations of known and synthetic terms in the LSR-Synth datasets. Notably, the relative difficulty of datasets varies across methods and LLM backbones, suggesting that different methods and LLMs possess distinct capabilities in terms of leveraging domain knowledge, reasoning, and generating novel hypotheses.</p><p>Symbolic Accuracy and Generalization. For scientific equation discovery methods, both symbolic accuracy and out-of-domain generalization serve as crucial evaluation metrics, reflecting the methods' ability to uncover true governing equations. Figure <ref type="figure">13</ref> examines the relationship between these metrics, plotting symbolic accuracy against both OOD accuracy and OOD NMSE across all method-LLM-domain combinations in LSR-Synth. The strong correlation observed between symbolic and OOD performance yields two important insights: first, it establishes OOD evaluation as a powerful metric for assessing the discovery of generalizable equations, an approach historically underutilized in symbolic regression; second, it validates our LLM-based symbolic evaluation approach through its strong alignment with numeric generalization performance.</p><p>Qualitative Analysis of Outputs. 
To provide deeper insights into the behavior of different discovery methods, Figure <ref type="figure">14</ref> illustrates their final discovered hypotheses on a biological population growth problem (BPG0) using Llama-3.1-8B as the LLM backbone, including the hypothesis produced by Direct Prompting (Figure <ref type="figure">14</ref>). These examples demonstrate the diverse approaches methods take in balancing scientific interpretability with mathematical expressiveness when discovering equation structures.</p></div>
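For intuition about the NMSE metric reported in these comparisons, the following is a minimal sketch assuming the common convention of normalizing mean squared error by the variance of the targets; the benchmark's exact normalization may differ, and the example arrays are purely illustrative.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: MSE scaled by the variance of the targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# Example: a hypothesis that fits the data closely but not exactly
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(nmse(y, y_hat))  # ~0.02: small relative to the spread of the targets
```

Because the error is scaled by the target variance, NMSE is comparable across datasets with very different output ranges, which is why it is a convenient summary statistic when aggregating over heterogeneous benchmark problems.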
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Discussion and Future Directions</head><p>Our findings from LLM-SRBench reveal several key insights that inform the design of future LLMs for scientific discovery applications. Scientific equation discovery remains a challenging problem for LLMs, requiring a complex interplay of domain knowledge, search capabilities with data-driven feedback, and mathematical manipulation skills. Our results demonstrate that this problem poses significant challenges for LLM-based discovery frameworks across different model architectures, suggesting that current approaches may be fundamentally limited in their ability to perform genuine scientific discovery.</p><p>This work questions the current evaluation paradigm for equation discovery in emerging LLM-based techniques. We demonstrate that existing benchmarks for this task are susceptible to memorization and inadequate for evaluating these techniques' true scientific discovery capabilities. Motivated by these limitations, we designed LLM-SRBench to address the memorization issue through two key innovations: synthetic imaginary scenarios (LSR-Synth category) that are not based on existing scientific knowledge and require data-driven discovery for their solution, and transformed equations (LSR-Transform category) that convert common forms of scientifically known equations into less familiar formulations. The LSR-Synth category targets genuine innovation in LLM-based discovery techniques by eliminating the possibility of recalling memorized equations, while LSR-Transform problems are difficult to recall from memory and require reasoning over hypothesis generation steps, making them suitable candidates for evaluating recently emerging LLM-based scientific discovery agents. While the mathematical transformations in LSR-Transform are algebraically valid, their scientific meaningfulness varies considerably across contexts. 
Many transformations correspond to legitimate physics problems from the Feynman Lecture Series collection and represent alternative problem formulations with practical significance. For example, in the Harmonic Oscillator Energy problem, the original formulation E = (1/4) m (ω² + ω₀²) x² expresses energy as a function of system parameters, while the transformed version m = 4E / ((ω² + ω₀²) x²) determines the mass required for given energy storage. This transformation maintains scientific meaning by addressing the engineering question of what mass is needed to store a specific amount of energy in an oscillating system, and such inversions are common in engineering design problems where system parameters must be determined to achieve desired performance characteristics. Similarly, the Electric Potential problem transforms from V_e = p_d cos(θ) / (4πϵ r²) (the potential at a point due to a dipole) to r = √(p_d cos(θ) / (4πϵ V_e)) (the distance at which a given potential is observed), addressing the practical question of determining measurement distances in electrostatic experiments or sensor design.</p><p>However, not all transformations maintain clear physical interpretability. Some result in equations where the target variable appears in complex functional forms that may not correspond to natural physical questions, such as solving for angular frequency in oscillatory systems, which yields expressions involving square roots of differences that lack intuitive physical meaning. Additionally, certain transformations may obscure natural causal relationships: transforming from "force causes acceleration" to "acceleration determines force" maintains mathematical validity but may not reflect underlying physical causality. 
The LSR-Transform category strikes a deliberate balance between mathematical rigor and physical meaningfulness: the complexity of transformed problems is constrained to match that of the originals, the focus is placed on semantic rather than syntactic challenges in scientific equation discovery, and the original scientific context and variable meanings are preserved so that the underlying physics remains relevant even when the mathematical formulation changes. The varying scientific meaningfulness of transformations reflects broader challenges in automated scientific discovery that warrant future investigation. Automated discovery systems must incorporate mechanisms to evaluate not only data-driven correctness but also the scientific plausibility and interpretability of generated hypotheses, as mathematical validity alone is insufficient for a meaningful scientific contribution. The most effective approach to scientific equation discovery likely involves close collaboration between AI systems, which excel at exploring vast hypothesis spaces, and human domain scientists, who can assess scientific meaningfulness and guide discovery directions based on deep contextual understanding. Future equation discovery methods could improve by incorporating literature retrieval tools that ground hypotheses in scientific context and domain knowledge, helping to prioritize discoveries that are mathematically valid, data-consistent, novel, and scientifically meaningful. The field also needs evaluation frameworks that assess not just mathematical correctness but also the scientific novelty, interpretability, and practical applicability of discovered equations, moving beyond narrow accuracy metrics toward a more comprehensive understanding of what constitutes valuable scientific discovery in the age of LLMs with their vast stores of scientific knowledge.</p></div>
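The kind of algebraic inversion behind LSR-Transform problems, such as the harmonic oscillator energy example above, can be produced mechanically with a computer algebra system. The SymPy sketch below is illustrative only and is not the benchmark's actual transformation pipeline.

```python
import sympy as sp

# Declare the physical quantities as positive symbols so SymPy can
# pick the physically meaningful solution branch.
E, m, x, w, w0 = sp.symbols('E m x omega omega_0', positive=True)

# Original formulation: energy as a function of system parameters.
energy = sp.Eq(E, sp.Rational(1, 4) * m * (w**2 + w0**2) * x**2)

# Invert: solve for the mass required to store a given energy.
m_solutions = sp.solve(energy, m)
print(m_solutions[0])  # m = 4E / ((omega^2 + omega_0^2) x^2)
```

The same pattern (declare symbols, state the known equation, solve for a different target variable) yields the other transformed problems discussed above, such as solving the dipole potential for the distance r.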
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Comparison with Standard (non-LLM) Symbolic Regression Baselines</head><p>To further validate the utility of LLM-SRBench and demonstrate the advantages of LLM-based approaches, we conducted additional experiments comparing LLM-based methods with traditional symbolic regression techniques that do not incorporate domain knowledge. We evaluated PySR <ref type="bibr">(Cranmer, 2023)</ref>, a state-of-the-art symbolic regression method based on genetic programming, on all LLM-SRBench datasets. PySR operates purely on numerical data points, without access to the scientific context, variable descriptions, or domain knowledge that LLM-based methods can leverage in the discovery process. We used PySR's default configuration with the same computational budget (an equivalent number of evaluations) as the LLM-based methods to ensure a fair comparison. Table <ref type="table">3</ref> presents the performance comparison between the best-performing LLM-based method from Table <ref type="table">1</ref> and PySR across all LLM-SRBench datasets. The results reveal several key insights about the complementary strengths and limitations of non-LLM versus LLM-based approaches in equation discovery.</p><p>PySR demonstrates competitive, and sometimes even better, numerical accuracy (Acc_0.1) across all datasets. However, PySR consistently shows significantly lower symbolic accuracy, particularly struggling with non-physics domains, where it achieves 0% symbolic accuracy on the chemistry, biology, and material science datasets. The performance gap is most pronounced on problems that require specialized scientific knowledge. While PySR can fit mathematical patterns in the data, it lacks the scientific intuition to discover equations that align with established physical principles or domain-specific terminology. 
Interestingly, PySR shows relatively better performance on physics problems, achieving modest symbolic accuracy of 4.54% on LSR-Synth Physics and 8.11% on LSR-Transform (which is based on Feynman physics equations). This suggests that physics problems may contain mathematical patterns that are more closely aligned with PySR's operator dictionary, and can therefore be recovered more readily by its data-driven search pipeline. These findings strengthen the motivation for LLM-based scientific equation discovery and demonstrate that LLM-SRBench successfully captures challenges in equation discovery that traditional symbolic regression methods cannot adequately address through numerical data-driven optimization alone.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://space.mit.edu/home/tegmark/aifeynman.html</p></note>
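The gap between numerical fit and symbolic recovery discussed in this section can be illustrated with a naive algebraic equivalence check. Note that the paper's symbolic accuracy is computed with an LLM-based evaluation; the SymPy-based check below, with hypothetical example expressions, is only a simplified stand-in for intuition.

```python
import sympy as sp

x = sp.symbols('x')

def symbolically_equivalent(expr_a, expr_b):
    """Check algebraic equivalence by simplifying the difference to zero."""
    return sp.simplify(expr_a - expr_b) == 0

# Two hypotheses that can both fit data well numerically near x = 0:
ground_truth = sp.sin(x)**2
hypothesis_1 = (1 - sp.cos(2*x)) / 2   # same law, different algebraic form
hypothesis_2 = x**2 - x**4 / 3         # polynomial fit, close only near 0

print(symbolically_equivalent(ground_truth, hypothesis_1))  # True
print(symbolically_equivalent(ground_truth, hypothesis_2))  # False
```

This is why a method like PySR can score well on Acc_0.1 (numerical fit on the sampled domain) while scoring poorly on symbolic accuracy: a polynomial surrogate like hypothesis_2 matches the data locally but is not the governing equation and fails out of domain.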
		</body>
		</text>
</TEI>
