Title: Incorporating background knowledge in symbolic regression using a computer algebra system
Abstract: Symbolic regression (SR) can generate interpretable, concise expressions that fit a given dataset, allowing for more human understanding of the structure than black-box approaches. The addition of background knowledge (in the form of symbolic mathematical constraints) allows for the generation of expressions that are meaningful with respect to theory while also being consistent with data. We specifically examine the addition of constraints to traditional genetic algorithm (GA) based SR (PySR) as well as a Markov-chain Monte Carlo (MCMC) based Bayesian SR architecture (Bayesian Machine Scientist), and apply these to rediscovering adsorption equations from experimental, historical datasets. We find that, while hard constraints prevent GA and MCMC SR from searching, soft constraints can lead to improved performance both in terms of search effectiveness and model meaningfulness, with computational costs increasing by about an order of magnitude. If the constraints do not correlate well with the dataset or expected models, they can hinder the search for expressions. We find that incorporating these constraints in Bayesian SR (as the Bayesian prior) works better than modifying the fitness function in the GA.
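To make the soft-constraint idea concrete, here is a minimal sketch (the function names, the monotonicity constraint, and the penalty value are illustrative assumptions, not the paper's code): a computer algebra system (SymPy) tests a background-knowledge constraint on a candidate expression, and a violation worsens a GA-style fitness score; in the Bayesian setting the same term could instead be subtracted from the log-prior.

```python
# Minimal sketch, assuming a monotonicity constraint on an adsorption
# isotherm q(p); names and penalty value are illustrative.
import sympy as sp

p, K, q_max = sp.symbols("p K q_max", positive=True)

def violates_monotonicity(expr):
    # Background knowledge: loading q(p) should not decrease with
    # pressure p. Flag expressions whose derivative SymPy can prove
    # negative (a real check would be more thorough).
    return sp.ask(sp.Q.negative(sp.diff(expr, p))) is True

def penalized_fitness(expr, data_loss, penalty=10.0):
    # Soft constraint: worsen the score rather than reject the model.
    return data_loss + (penalty if violates_monotonicity(expr) else 0.0)

# Example: the Langmuir isotherm satisfies the constraint, so no penalty.
langmuir = q_max * K * p / (1 + K * p)
print(penalized_fitness(langmuir, data_loss=0.12))  # -> 0.12
```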
Award ID(s): 2138938
PAR ID: 10511871
Author(s) / Creator(s): ; ; ; ;
Publisher / Repository: IOP Publishing
Date Published:
Journal Name: Machine Learning: Science and Technology
Volume: 5
Issue: 2
ISSN: 2632-2153
Format(s): Medium: X; Size: Article No. 025057
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract Compact symbolic expressions have been shown to be more efficient than neural network (NN) models in terms of resource consumption and inference speed when implemented on custom hardware such as field-programmable gate arrays (FPGAs), while maintaining comparable accuracy (Tsoiet al2024EPJ Web Conf.29509036). These capabilities are highly valuable in environments with stringent computational resource constraints, such as high-energy physics experiments at the CERN Large Hadron Collider. However, finding compact expressions for high-dimensional datasets remains challenging due to the inherent limitations of genetic programming (GP), the search algorithm of most symbolic regression (SR) methods. Contrary to GP, the NN approach to SR offers scalability to high-dimensional inputs and leverages gradient methods for faster equation searching. Common ways of constraining expression complexity often involve multistage pruning with fine-tuning, which can result in significant performance loss. In this work, we propose S y m b o l N e t , a NN approach to SR specifically designed as a model compression technique, aimed at enabling low-latency inference for high-dimensional inputs on custom hardware such as FPGAs. This framework allows dynamic pruning of model weights, input features, and mathematical operators in a single training process, where both training loss and expression complexity are optimized simultaneously. We introduce a sparsity regularization term for each pruning type, which can adaptively adjust its strength, leading to convergence at a target sparsity ratio. Unlike most existing SR methods that struggle with datasets containing more than O ( 10 ) inputs, we demonstrate the effectiveness of our model on the LHC jet tagging task (16 inputs), MNIST (784 inputs), and SVHN (3072 inputs). 
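The adaptive sparsity idea can be pictured with the sketch below (the threshold, update rule, and names are assumptions, not the SymbolNet code): an L1 penalty whose strength is nudged up while the measured sparsity is below the target ratio and relaxed once it overshoots, so training settles near the target.

```python
# Minimal sketch of an adaptive sparsity penalty (illustrative only).
import torch

def sparsity(params, thresh=1e-3):
    # Fraction of weights that are effectively pruned (near zero).
    flat = torch.cat([p.detach().abs().flatten() for p in params])
    return (flat < thresh).float().mean().item()

def adaptive_l1(params, lam, target=0.9, lr_lam=1e-2):
    # Strengthen the penalty while below the target sparsity ratio,
    # relax it above, driving convergence toward the target.
    penalty = sum(p.abs().sum() for p in params)
    lam = max(0.0, lam + lr_lam * (target - sparsity(params)))
    return lam * penalty, lam

# In a training loop:
#   reg, lam = adaptive_l1(list(model.parameters()), lam)
#   loss = task_loss + reg
```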
  2. Machine learning at the extreme edge has enabled a plethora of intelligent, time-critical, and remote applications. However, deploying interpretable artificial intelligence systems that can perform high-level symbolic reasoning and satisfy the underlying system rules and physics within the tight platform resource constraints is challenging. In this paper, we introduce TinyNS, the first platform-aware neurosymbolic architecture search framework for joint optimization of symbolic and neural operators. TinyNS provides recipes and parsers to automatically write microcontroller code for five types of neurosymbolic models, combining the context awareness and integrity of symbolic techniques with the robustness and performance of machine learning models. TinyNS uses a fast, gradient-free, black-box Bayesian optimizer over discontinuous, conditional, numeric, and categorical search spaces to find the best synergy of symbolic code and neural networks within the hardware resource budget. To guarantee deployability, TinyNS talks to the target hardware during the optimization process. We showcase the utility of TinyNS by deploying microcontroller-class neurosymbolic models through several case studies. In all use cases, TinyNS outperforms purely neural or purely symbolic approaches while guaranteeing execution on real hardware.
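The hardware-in-the-loop search can be illustrated with the toy sketch below (TinyNS uses a Bayesian optimizer and real on-device profiling; both are mocked here with random search and a stand-in cost model, and all names and numbers are assumptions): candidates from a mixed categorical/numeric space are profiled and rejected if they exceed an assumed flash budget.

```python
# Minimal sketch of budget-constrained black-box architecture search.
import random

SPACE = {
    "model": ["neural", "symbolic", "neurosymbolic"],  # categorical
    "hidden": [8, 16, 32, 64],                         # numeric
}
FLASH_BUDGET_KB = 256

def measure(cfg):
    # Stand-in for compiling the candidate and profiling it on the
    # target microcontroller (the real system talks to hardware here).
    flash_kb = 40 + 3 * cfg["hidden"]
    accuracy = random.random()  # stand-in for validation accuracy
    return flash_kb, accuracy

best = None
for _ in range(100):
    cfg = {k: random.choice(v) for k, v in SPACE.items()}
    flash_kb, acc = measure(cfg)
    if flash_kb > FLASH_BUDGET_KB:
        continue  # infeasible on-device: reject the candidate
    if best is None or acc > best[0]:
        best = (acc, cfg)
print(best)
```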
  3. A key challenge in conformer sampling is finding low-energy conformations with a small number of energy evaluations. We recently demonstrated that the Bayesian Optimization Algorithm (BOA) is an effective method for finding the lowest-energy conformation of a small molecule. Our approach balances exploitation against exploration, and is more efficient than exhaustive or random search methods. Here, we extend strategies used on proteins and oligopeptides (e.g. Ramachandran plots of secondary structure) and study correlated torsions in small molecules. We use bivariate von Mises distributions to capture correlations, and use them to constrain the search space. We validate the performance of our new method, Bayesian Optimization with Knowledge-based Expected Improvement (BOKEI), on a dataset consisting of 533 diverse small molecules, using (i) a force field (MMFF94) and (ii) a semi-empirical method (GFN2) as the objective function. We compare the search performance of BOKEI, BOA with Expected Improvement (BOA-EI), and a genetic algorithm (GA), using a fixed number of energy evaluations. In more than 60% of the cases examined, BOKEI finds lower energy conformations than global optimization with BOA-EI or GA. More importantly, we find correlated torsions in up to 15% of small molecules in larger data sets, up to 8 times more often than previously reported. The BOKEI patterns not only describe steric clashes, but also reflect favorable intramolecular interactions such as hydrogen bonds and π–π stacking. Increasing our understanding of the conformational preferences of molecules will help improve our ability to find low-energy conformers efficiently, which will have an impact on a wide range of computational modeling applications.
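The role of the bivariate von Mises distribution can be sketched as follows (an assumed "cosine model" parameterization with illustrative names, not the BOKEI implementation): correlated torsion pairs are drawn by rejection sampling, so conformer proposals concentrate in the favored (φ, ψ) region rather than spreading uniformly over torsion space.

```python
# Minimal sketch: rejection sampling from a bivariate von Mises
# "cosine model" (assumes non-negative concentrations k1, k2, k3).
import math, random

def bvm_logpdf(phi, psi, mu1, mu2, k1, k2, k3):
    # Unnormalized log-density; the k3 term couples the two torsions.
    return (k1 * math.cos(phi - mu1) + k2 * math.cos(psi - mu2)
            - k3 * math.cos((phi - mu1) - (psi - mu2)))

def sample_torsion_pair(mu1, mu2, k1, k2, k3):
    log_max = k1 + k2 + k3  # bound on the unnormalized log-density
    while True:
        phi = random.uniform(-math.pi, math.pi)
        psi = random.uniform(-math.pi, math.pi)
        if math.log(random.random()) < bvm_logpdf(
                phi, psi, mu1, mu2, k1, k2, k3) - log_max:
            return phi, psi

# e.g. a pair favoring phi ≈ psi ≈ 0 with positive correlation:
phi, psi = sample_torsion_pair(0.0, 0.0, 2.0, 2.0, 1.0)
```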
  4. We recently developed an Effective Field Theory (EFT) for rotational bands in odd-mass nuclei. Here we use EFT expressions to perform a Bayesian analysis of data on the rotational energy levels of ⁹⁹Tc, ¹⁵⁵,¹⁵⁷Gd, ¹⁵⁹Dy, ¹⁶⁷,¹⁶⁹Er, ¹⁶⁷,¹⁶⁹Tm, ¹⁸³W, ²³⁵U and ²³⁹Pu. The error model in our Bayesian analysis includes both experimental and EFT truncation uncertainties. It also accounts for the fact that low-energy constants (LECs) at even and odd orders are expected to have different sizes. We use Markov Chain Monte Carlo (MCMC) sampling to explore the joint posterior of the EFT and error-model parameters and show that both the LECs and the breakdown scale can be reliably determined. We extract the LECs up to fourth order in the EFT and find that, provided we correctly account for EFT truncation errors in our likelihood, results for lower-order LECs are stable as we go to higher orders. LEC results are also stable with respect to the addition of higher-energy data. We extract the expansion parameter for all the nuclei listed above and find a clear correlation between the extracted and the expected value of the inverse breakdown scale, W, based on the single-particle and vibrational energy scales. However, the W that actually determines the convergence of the EFT expansion is markedly smaller than would be naively expected based on those scales.
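The truncation-aware error model can be sketched like this (notation and names are assumptions based on the abstract, not the paper's code): the theory error for an expansion truncated at order k is estimated from the first omitted order, scaling as W^(k+1), and added in quadrature to the experimental uncertainty inside a Gaussian log-likelihood.

```python
# Minimal sketch of a Gaussian log-likelihood with an EFT truncation
# term in the error model (illustrative notation).
import numpy as np

def log_likelihood(resid, sigma_exp, y_ref, W, k):
    # resid: data minus the order-k EFT prediction
    sigma_th = np.abs(y_ref) * W ** (k + 1)  # first-omitted-order estimate
    var = sigma_exp**2 + sigma_th**2         # combine in quadrature
    return -0.5 * np.sum(resid**2 / var + np.log(2 * np.pi * var))
```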
  5. ABSTRACT: Recovering credible cosmological parameter constraints in a weak lensing shear analysis requires an accurate model that can be used to marginalize over nuisance parameters describing potential sources of systematic uncertainty, such as the uncertainties on the sample redshift distribution n(z). Due to the challenge of running Markov chain Monte Carlo (MCMC) in the high-dimensional parameter spaces in which the n(z) uncertainties may be parametrized, it is common practice to simplify the n(z) parametrization or to combine MCMC chains that each have a fixed n(z) resampled from the n(z) uncertainties. In this work, we propose a statistically principled Bayesian resampling approach for marginalizing over the n(z) uncertainty using multiple MCMC chains. We self-consistently compare the new method to existing ones from the literature in the context of a forecasted cosmic shear analysis for the HSC three-year shape catalogue, and find that these methods recover statistically consistent error bars for the cosmological parameter constraints in the predicted HSC three-year analysis, implying that using the most computationally efficient of the approaches is appropriate. However, we find that for data sets with the constraining power of the full HSC survey data set (and, by implication, those upcoming surveys with even tighter constraints), the choice of method for marginalizing over the n(z) uncertainty among the several methods from the literature may modify the 1σ uncertainties on the Ωₘ–S₈ constraints by ∼4 per cent, and careful model selection is needed to ensure credible parameter intervals.
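A minimal sketch of the chain-combination idea appears below (illustrative only; the function name, weighting scheme, and equal-weight default are assumptions, not the paper's exact estimator): posterior samples from per-realization chains, each run at a fixed n(z) draw, are pooled, optionally with per-chain weights, to approximate marginalizing over the n(z) uncertainty.

```python
# Minimal sketch of marginalizing over n(z) by pooling fixed-n(z) chains.
import numpy as np

def marginalize_over_nz(chains, log_weights=None, seed=0):
    # chains: list of (n_samples, n_params) arrays, one per n(z) draw
    rng = np.random.default_rng(seed)
    if log_weights is None:
        log_weights = np.zeros(len(chains))  # equal-weight pooling
    w = np.exp(log_weights - np.max(log_weights))
    w /= w.sum()
    total = sum(len(c) for c in chains)
    # Resample each chain in proportion to its weight, then concatenate.
    pooled = [c[rng.choice(len(c), size=int(round(wi * total)), replace=True)]
              for c, wi in zip(chains, w)]
    return np.concatenate(pooled, axis=0)
```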