This content will become publicly available on December 1, 2025

Title: Anomaly-aware summary statistic from data batches
Abstract Signal-agnostic data exploration based on machine learning could unveil very subtle statistical deviations of collider data from the expected Standard Model of particle physics. The beneficial impact of a large training sample on machine learning solutions motivates the exploration of increasingly large and inclusive samples of acquired data with resource-efficient computational methods. In this work we consider the New Physics Learning Machine (NPLM), a multivariate goodness-of-fit test built on the Neyman-Pearson maximum-likelihood-ratio construction, and we address the problem of testing large samples under computational and storage resource constraints. We propose to perform parallel NPLM routines over batches of the data, and to combine them by locally aggregating the data-to-reference density ratios learnt by each batch. The resulting data hypothesis defining the likelihood-ratio test is thus shared across the batches, and complies with the assumption that the expected rate of new physical processes is time invariant. We show that this method outperforms the simple sum of the independent tests run over the batches, and can recover, or even surpass, the sensitivity of a single test run over the full data. Besides the significant advantage for the offline application of NPLM to large samples, the proposed approach offers new prospects toward the use of NPLM to construct anomaly-aware summary statistics in quasi-online data-streaming scenarios.
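To make the batch-combination step concrete, here is a minimal sketch of how per-batch learned density ratios could be aggregated into a single shared test statistic. It assumes each batch's NPLM fit exposes a callable returning the learned log of the data-to-reference density ratio; the function names (`aggregate_log_ratio`, `nplm_statistic`) and the simple averaging rule are illustrative, not the paper's exact prescription.

```python
import numpy as np

def aggregate_log_ratio(x, batch_models):
    """Locally aggregate the per-batch learned log density ratios.

    Each element of batch_models is a callable f_b(x) returning the
    log of the data-to-reference density ratio learned on batch b.
    Averaging the log ratios yields one data hypothesis shared across
    batches, consistent with a time-invariant new-physics rate.
    """
    return np.mean([f_b(x) for f_b in batch_models], axis=0)

def nplm_statistic(data, reference, ref_weights, batch_models):
    """Schematic NPLM extended likelihood-ratio test statistic,
    t = 2 [ sum_{x in D} f(x) - sum_{x in R} w (e^{f(x)} - 1) ],
    evaluated with the aggregated log ratio f."""
    f_data = aggregate_log_ratio(data, batch_models)
    f_ref = aggregate_log_ratio(reference, batch_models)
    return 2.0 * (f_data.sum() - np.sum(ref_weights * (np.exp(f_ref) - 1.0)))
```

Because the aggregated ratio is evaluated pointwise, each batch model can be trained and stored independently; only the learned functions need to be retained to test the full sample.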
Award ID(s): 2019786
PAR ID: 10570062
Author(s) / Creator(s):
Publisher / Repository: Springer Nature Link
Date Published:
Journal Name: Journal of High Energy Physics
Volume: 2024
Issue: 12
ISSN: 1029-8479
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract The wealth of high-quality observational data from the epoch of reionization that will become available in the next decade motivates further development of modeling techniques for their interpretation. Among the key challenges in modeling reionization are (1) its multi-scale nature, (2) the computational demands of solving the radiative transfer (RT) equation, and (3) the large size of reionization's parameter space. In this paper, we present and validate a new RT code designed to confront these challenges. FlexRT (Flexible Radiative Transfer) combines adaptive ray tracing with a highly flexible treatment of the intergalactic ionizing opacity. This gives the user control over how the intergalactic medium (IGM) is modeled, and provides a way to reduce the computational cost of a FlexRT simulation by orders of magnitude while still accounting for small-scale IGM physics. Alternatively, the user may increase the angular and spatial resolution of the algorithm to run a more traditional reionization simulation. FlexRT has already been used in several contexts, including simulations of the Lyman-α forest of high-z quasars and the redshifted 21 cm signal from reionization, as well as higher-resolution reionization simulations in smaller volumes. In this work, we motivate and describe the code, and validate it against a set of standard test problems from the Cosmological Radiative Transfer Comparison Project. We find that FlexRT is in broad agreement with a number of existing RT codes in all of these tests. Lastly, we compare FlexRT to an existing adaptive ray tracing code to validate FlexRT in a cosmological reionization simulation.
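As a rough illustration of the adaptive ray tracing idea FlexRT builds on (in the spirit of HEALPix-based ray splitting à la Abel & Wandelt), the toy sketch below splits a ray into four children whenever its angular footprint grows too large to sample the local grid. The data layout, threshold, and function names are hypothetical, not FlexRT's actual interface.

```python
import numpy as np

HEALPIX_BASE = 12  # HEALPix: 12 * 4**level pixels cover the sphere

def maybe_split(ray, cell_size, rays_per_cell=3.0):
    """Split a ray into four children at the next refinement level when
    its transverse footprint exceeds the sampling target (toy criterion).

    `ray` is a dict with 'level' (refinement level), 'flux' (photons
    carried), and 'r' (distance travelled from the source).
    """
    n_pix = HEALPIX_BASE * 4 ** ray["level"]
    solid_angle = 4.0 * np.pi / n_pix        # solid angle per ray
    footprint = solid_angle * ray["r"] ** 2  # transverse area at radius r
    if footprint > cell_size ** 2 / rays_per_cell:
        # refine: four children share the parent's photon budget
        return [
            {"level": ray["level"] + 1, "flux": ray["flux"] / 4.0, "r": ray["r"]}
            for _ in range(4)
        ]
    return [ray]

def deposit(ray, kappa, dl):
    """Attenuate the ray over path length dl given opacity kappa and
    return the photons absorbed in the cell (driving ionizations)."""
    absorbed = ray["flux"] * (1.0 - np.exp(-kappa * dl))
    ray["flux"] -= absorbed
    return absorbed
```

The flexible opacity treatment described in the abstract would enter through `kappa`, e.g. supplied by a sub-grid IGM model rather than resolved directly.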
  2. Abstract In this work, we address the question of how to enhance signal-agnostic searches by leveraging multiple testing strategies. Specifically, we consider hypothesis tests relying on machine learning, where model selection can introduce a bias towards specific families of new physics signals. Focusing on the New Physics Learning Machine, a methodology to perform a signal-agnostic likelihood-ratio test, we explore a number of approaches to multiple testing, such as combining p-values and aggregating test statistics. Our findings show that it is beneficial to combine different tests, characterised by distinct choices of hyperparameters, and that performance comparable to the best available test is generally achieved, while also providing a more uniform response to various types of anomalies. This study proposes a methodology that is valid beyond machine learning approaches and could in principle be applied to a larger class of model-agnostic analyses based on hypothesis testing.
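For illustration, two textbook ways of combining the outcomes of k tests are sketched below: Fisher's method on p-values and a minimum-p rule with a Šidák correction. Both assume independent tests; tests run on the same data are correlated, so in the NPLM setting the null distribution of any combined statistic would in practice be calibrated empirically on reference-only toy experiments.

```python
import numpy as np
from scipy import stats

def fisher_combine(p_values):
    """Fisher's method: for k independent tests under the null,
    T = -2 * sum(log p_i) follows a chi-squared with 2k dof."""
    p = np.asarray(p_values, dtype=float)
    T = -2.0 * np.sum(np.log(p))
    return stats.chi2.sf(T, df=2 * len(p))  # combined p-value

def min_p_combine(p_values):
    """Minimum-p rule with a Sidak correction for the look-elsewhere
    effect of taking the best of k tests (again assumes independence)."""
    p = np.asarray(p_values, dtype=float)
    return 1.0 - (1.0 - p.min()) ** len(p)
```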
  3. Doglioni, C.; Kim, D.; Stewart, G.A.; Silvestris, L.; Jackson, P.; Kamleh, W. (Ed.)
    This paper is based on a talk given at Computing in High Energy Physics in Adelaide, South Australia, Australia in November 2019. It is partially intended to explain the context of DUNE Computing for computing specialists. The Deep Underground Neutrino Experiment (DUNE) collaboration consists of over 180 institutions from 33 countries. The experiment is in preparation now, with commissioning of the first 10 kt fiducial-volume Liquid Argon TPC expected over the period 2025-2028 and a long data-taking run with 4 modules expected from 2029 and beyond. An active prototyping program is already in place, with a short test-beam run with a 700 t, 15,360-channel prototype of single-phase readout at the Neutrino Platform at CERN in late 2018 and tests of a similar-sized dual-phase detector scheduled for mid-2019. The 2018 test-beam run was a valuable live test of our computing model. The detector produced raw data at rates of up to 2 GB/s. These data were stored at full rate on tape at CERN and Fermilab and replicated at sites in the UK and the Czech Republic. In total, 1.2 PB of raw data from beam and cosmic triggers were produced and reconstructed during the six-week test-beam run. Baseline predictions for the full DUNE detector data, starting in the late 2020s, are 30-60 PB of raw data per year. In contrast to traditional HEP computational problems, DUNE's Liquid Argon TPC data consist of simple but very large (many GB) 2D data objects which share many characteristics with astrophysical images. This presents opportunities to use advances in machine learning and pattern recognition as a frontier user of High Performance Computing facilities capable of massively parallel processing.
  4. Abstract Machine learning models are susceptible to being misled by biases in training data that emphasize incidental correlations over the intended learning task. In this study, we demonstrate the impact of data bias on the performance of a machine learning model designed to predict the likelihood of synthesizability of crystal compounds. The model performs a binary classification on labeled crystal samples. Despite using the same architecture for the machine learning model, we showcase how the model’s learning and prediction behavior differs once trained on distinct data. We use two data sets for illustration: a mixed-source data set that integrates experimental and computational crystal samples and a single-source data set consisting of data exclusively from one computational database. We present simple procedures to detect data bias and to evaluate its effect on the model’s performance and generalization. This study reveals how inconsistent, unbalanced data can propagate bias, undermining real-world applicability even for advanced machine learning techniques. 
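A simple, generic way to surface this kind of source bias is to train the same architecture separately on each data source and score it on the other: a large gap between in-source and cross-source performance suggests the model keys on source-specific artifacts rather than synthesizability. The sketch below uses a stock scikit-learn classifier as a stand-in; the paper's actual model, features, and datasets are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def cross_source_check(X_a, y_a, X_b, y_b):
    """Train one fixed architecture on each source and score it on the
    other; a large in-source vs. cross-source gap is a symptom of the
    model learning incidental, source-specific correlations."""
    scores = {}
    for name, (X_tr, y_tr, X_te, y_te) in {
        "train_a_test_b": (X_a, y_a, X_b, y_b),
        "train_b_test_a": (X_b, y_b, X_a, y_a),
    }.items():
        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(X_tr, y_tr)
        scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return scores
```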
  5.
    Abstract In this study, we propose a scalable batch sampling scheme for the optimization of simulation models with spatially varying noise. The proposed scheme has two primary advantages: (i) reduced simulation cost by recommending batches of samples at carefully selected spatial locations, and (ii) improved scalability by actively considering replication at previously observed sampling locations. Replication improves the scalability of the proposed sampling scheme because the computational cost of adaptive sampling schemes grows cubically with the number of unique sampling locations. Our main consideration in allocating computational resources is the minimization of the uncertainty in the optimal design. We analytically derive the relationship between the "exploration versus replication decision" and the posterior variance of the spatial random process used to approximate the simulation model's mean response. Leveraging this reformulation in a novel objective-driven adaptive sampling scheme, we show that we can identify batches of samples that minimize the prediction uncertainty only in the regions of the design space expected to contain the global optimum. Finally, the proposed sampling scheme adopts a modified preposterior analysis that uses a zeroth-order interpolation of the spatially varying simulation noise to identify sampling batches. Through the optimization of three numerical test functions and one engineering problem, we demonstrate (i) the efficacy of the proposed sampling scheme in dealing with a wide array of stochastic functions, (ii) the superior performance of the proposed method on all test functions compared to existing methods, (iii) the empirical validity of using a zeroth-order approximation for the allocation of sampling batches, and (iv) its applicability to molecular dynamics simulations by optimizing the performance of an organic photovoltaic cell as a function of its processing settings.
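The following is a minimal sketch of the exploration-versus-replication trade-off described above, using a Gaussian process surrogate: the candidate set mixes fresh locations with copies of already-observed ones, so selecting an observed point is a replication that sharpens the noise estimate without adding a unique location to the cubic-cost GP solve. The kernel choice, the plausible-optimum rule, and the "kriging believer" fantasy updates are illustrative simplifications, not the paper's exact objective-driven criterion.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def pick_batch(X_obs, y_obs, candidates, batch_size=4):
    """Greedily assemble a batch that reduces posterior uncertainty only
    where the surrogate thinks the global minimum may lie.

    Replication enters through `candidates`: copies of rows of X_obs are
    legal picks, and the WhiteKernel separates observation noise (which
    replication shrinks) from spatial uncertainty (which exploration shrinks).
    """
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    X_f = np.array(X_obs, dtype=float)
    y_f = np.array(y_obs, dtype=float)
    batch = []
    for _ in range(batch_size):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_f, y_f)
        mu, sd = gp.predict(candidates, return_std=True)
        # plausible-optimum region: optimistic value beats the best mean
        plausible = (mu - 2.0 * sd) <= mu.min()
        score = np.where(plausible, sd, -np.inf)
        i = int(np.argmax(score))
        batch.append(candidates[i])
        # "kriging believer": fantasize the outcome so later picks spread out
        X_f = np.vstack([X_f, candidates[i]])
        y_f = np.append(y_f, mu[i])
    return np.array(batch)
```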