

Title: Automated Collider Event Selection, Plotting, & Machine Learning with AEACuS, RHADAManTHUS, & MInOS
A trio of automated collider event analysis tools is described and demonstrated, in the form of a quick-start tutorial. AEACuS interfaces with the standard MadGraph/MadEvent, Pythia, and Delphes simulation chain via the ROOT file output. An extensive algorithm library facilitates the computation of standard collider event variables and the transformation of object groups (including jet clustering and substructure analysis). Arbitrary user-defined variables and external function calls are also supported. An efficient mechanism is provided for sorting events into channels with distinct features. RHADAManTHUS generates publication-quality one- and two-dimensional histograms from event statistics computed by AEACuS, calling MatPlotLib on the back end. Large batches of simulation (representing distinct final states and/or oversampling of a common phase space) are merged internally, and per-event weights are handled consistently throughout. Arbitrary bin-wise functional transformations are readily specified, e.g. for visualizing signal-to-background significance as a function of cut threshold. MInOS implements machine learning on computed event statistics with XGBoost. Ensemble training against distinct background components may be combined to generate composite classifications with enhanced discrimination. ROC curves, as well as score distribution, feature importance, and significance plots, are generated on the fly. Each of these tools is controlled via instructions supplied in a reusable cardfile, employing a simple, compact, and powerful meta-language syntax.
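As a hedged illustration of the bin-wise transformation described above, the Python sketch below forms weighted signal and background histograms of a single event statistic and plots the cumulative significance S/sqrt(S+B) as a function of a lower cut threshold. This is not the cardfile meta-language the paper defines; the event statistics, weights, and distributions are invented stand-ins for quantities AEACuS would compute.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical per-event statistics and weights, standing in for values
# that AEACuS would compute and record for each simulated event.
sig_vals = rng.normal(300.0, 60.0, 5_000)     # e.g. missing energy [GeV], signal
bkg_vals = rng.exponential(120.0, 50_000)     # e.g. missing energy [GeV], background
sig_w = np.full(sig_vals.size, 0.01)          # per-event weights (xsec * lumi / N)
bkg_w = np.full(bkg_vals.size, 0.05)

# Weighted histograms of the event statistic.
edges = np.linspace(0.0, 600.0, 61)
s_hist, _ = np.histogram(sig_vals, bins=edges, weights=sig_w)
b_hist, _ = np.histogram(bkg_vals, bins=edges, weights=bkg_w)

# Cumulative yields above each lower threshold: sum bins from the right.
S = np.cumsum(s_hist[::-1])[::-1]
B = np.cumsum(b_hist[::-1])[::-1]
Z = np.divide(S, np.sqrt(S + B), out=np.zeros_like(S), where=(S + B) > 0)

plt.step(edges[:-1], Z, where="post")
plt.xlabel("Lower cut threshold [GeV]")
plt.ylabel(r"$S/\sqrt{S+B}$")
plt.savefig("significance_scan.png")
```

Summing bins from the right converts the differential histograms into the integrated yields that enter the significance at each threshold.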
Award ID(s): 2112799
NSF-PAR ID: 10351875
Editor(s): Arbey, Alexandre; Bélanger, G.; Desai, Nishita; Gonzalo, Tomas; Harlander, Robert V.
Journal Name: Computational Tools for High Energy Physics and Cosmology (CompTools2021)
Page Range / eLocation ID: 027
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Background

    Statistical geneticists employ simulation to estimate the power of proposed studies, test new analysis tools, and evaluate properties of causal models. Although there are existing trait simulators, there is ample room for modernization. For example, most phenotype simulators are limited to Gaussian traits or traits transformable to normality, while ignoring qualitative traits and realistic, non-normal trait distributions. Also, modern computer languages, such as Julia, that accommodate parallelization and cloud-based computing are now mainstream but rarely used in older applications. To meet the challenges of contemporary big studies, it is important for geneticists to adopt new computational tools.

    Results

    We present an open-source Julia package that makes it trivial to quickly simulate phenotypes under a variety of genetic architectures. This package is integrated into our OpenMendel suite for easy downstream analyses. Julia was purpose-built for scientific programming and provides tremendous speed and memory efficiency, easy access to multi-CPU and GPU hardware, and to distributed and cloud-based parallelization. The package is designed to encourage flexible trait simulation, including via the standard devices of applied statistics: generalized linear models (GLMs) and generalized linear mixed models (GLMMs). It also accommodates many study designs: unrelateds, sibships, pedigrees, or a mixture of all three. (Of course, for data with pedigrees or cryptic relationships, the simulation process must include the genetic dependencies among the individuals.) We consider an assortment of trait models and study designs to illustrate integrated simulation and analysis pipelines. Step-by-step instructions for these analyses are available in our electronic Jupyter notebooks on GitHub. These interactive notebooks are ideal for reproducible research.
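    To make the GLM-based simulation concrete, the sketch below (in Python for illustration; the package itself is written in Julia) draws additive genotypes and simulates both a binary trait through a logit link and a quantitative trait through an identity link. The genotype coding, effect sizes, and links are illustrative assumptions, not the package's API.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1_000, 5                          # individuals, causal variants
maf = rng.uniform(0.05, 0.5, p)          # minor-allele frequencies

# Additive genotype coding: minor-allele counts in {0, 1, 2},
# drawn under Hardy-Weinberg equilibrium for unrelated individuals.
G = rng.binomial(2, maf, size=(n, p)).astype(float)

beta = rng.normal(0.0, 0.3, p)           # per-variant effect sizes (assumed)
eta = -1.0 + G @ beta                    # linear predictor with an intercept

# Logit link -> binary (case/control) trait.
y_binary = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# Identity link plus Gaussian noise -> quantitative trait.
y_quantitative = eta + rng.normal(0.0, 1.0, n)
```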

    Conclusion

    The package has three main advantages. (1) It leverages the computational efficiency and ease of use of Julia to provide extremely fast, straightforward simulation of even the most complex genetic models, including GLMs and GLMMs. (2) It can be operated entirely within, but is not limited to, the integrated analysis pipeline of OpenMendel. And finally (3), by allowing a wider range of more realistic phenotype models, it brings power calculations and diagnostic tools closer to what investigators might see in real-world analyses.

     
  2. Abstract

     High energy collisions at the High-Luminosity Large Hadron Collider (LHC) produce a large number of particles along the beam collision axis, outside of the acceptance of existing LHC experiments. The proposed Forward Physics Facility (FPF), to be located several hundred meters from the ATLAS interaction point and shielded by concrete and rock, will host a suite of experiments to probe standard model (SM) processes and search for physics beyond the standard model (BSM). In this report, we review the status of the civil engineering plans and the experiments to explore the diverse physics signals that can be uniquely probed in the forward region. FPF experiments will be sensitive to a broad range of BSM physics through searches for new particle scattering or decay signatures and deviations from SM expectations in high-statistics analyses with TeV neutrinos in this low-background environment. High-statistics neutrino detection will also provide valuable data for fundamental topics in perturbative and non-perturbative QCD and in weak interactions. Experiments at the FPF will further exploit synergies between forward particle production at the LHC and astroparticle physics. We report here on these physics topics, on infrastructure, detector, and simulation studies, and on future directions to realize the FPF's physics potential.
  3. Abstract

     This paper describes a study of techniques for identifying Higgs bosons at high transverse momenta decaying into bottom-quark pairs, $H \rightarrow b\bar{b}$, for proton–proton collision data collected by the ATLAS detector at the Large Hadron Collider at a centre-of-mass energy $\sqrt{s} = 13\,\text{TeV}$. These decays are reconstructed from calorimeter jets found with the anti-$k_t$ $R = 1.0$ jet algorithm. To tag Higgs bosons, a combination of requirements is used: $b$-tagging of $R = 0.2$ track-jets matched to the large-$R$ calorimeter jet, and requirements on the jet mass and other jet substructure variables. The Higgs boson tagging efficiency and corresponding multijet and hadronic top-quark background rejections are evaluated using Monte Carlo simulation. Several benchmark tagging selections are defined for different signal efficiency targets. The modelling of the relevant input distributions used to tag Higgs bosons is studied in 36 fb$^{-1}$ of data collected in 2015 and 2016 using $g \rightarrow b\bar{b}$ and $Z(\rightarrow b\bar{b})\gamma$ event selections in data. Both processes are found to be well modelled within the statistical and systematic uncertainties.
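     The tagging strategy summarized above reduces to a few requirements per large-R jet; the sketch below encodes an illustrative version in Python. The thresholds and the choice of substructure variable are placeholders, not the benchmark working points evaluated in the paper.

```python
from dataclasses import dataclass

@dataclass
class LargeRJet:
    pt: float                 # transverse momentum [GeV]
    mass: float               # calibrated jet mass [GeV]
    n_btag_trackjets: int     # b-tagged R = 0.2 track-jets matched to the jet
    d2: float                 # an example substructure variable

def passes_higgs_tag(jet: LargeRJet) -> bool:
    """Illustrative boosted H -> bb selection; thresholds are placeholders."""
    return (
        jet.pt >= 250.0                   # boosted regime only
        and 75.0 <= jet.mass <= 145.0     # mass window around m_H
        and jet.n_btag_trackjets >= 2     # double b-tag requirement
        and jet.d2 <= 1.5                 # substructure cut
    )

print(passes_higgs_tag(LargeRJet(pt=450.0, mass=120.0, n_btag_trackjets=2, d2=0.9)))
```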
  4. Abstract

    Comprehensive and accurate analysis of respiratory and metabolic data is crucial to modelling congenital, pathogenic and degenerative diseases converging on autonomic control failure. A lack of tools for high-throughput analysis of respiratory datasets remains a major challenge. We present Breathe Easy, a novel open-source pipeline for processing raw recordings and associated metadata into operative outcomes, publication-worthy graphs and robust statistical analyses, including QQ and residual plots for assumption queries and data transformations. This pipeline uses a facile graphical user interface for uploading data files, setting waveform feature thresholds and defining experimental variables. Breathe Easy was validated against manual selection by experts, which represents the current standard in the field. We demonstrate Breathe Easy's utility by examining a 2-year longitudinal study of an Alzheimer's disease mouse model to assess contributions of forebrain pathology to disordered breathing. Whole body plethysmography has become an important experimental outcome measure for a variety of diseases with primary and secondary respiratory indications. Respiratory dysfunction, while not an initial symptom in many of these disorders, often drives disability or death in patient outcomes. Breathe Easy provides an open-source respiratory analysis tool for all respiratory datasets and represents a necessary improvement upon current analytical methods in the field.

    Key points

    Respiratory dysfunction is a common endpoint for disability and mortality in many disorders throughout life.

    Whole body plethysmography in rodents represents a high face‐value method for measuring respiratory outcomes in rodent models of these diseases and disorders.

    Analysis of key respiratory variables remains hindered by manual annotation and analysis, which leads to low-throughput results that often exclude a majority of the recorded data.

    Here we present a software suite, Breathe Easy, that automates the process of data selection from raw recordings derived from plethysmography experiments and the analysis of these data into operative outcomes and publication‐worthy graphs with statistics.

    We validate Breathe Easy with a terabyte‐scale Alzheimer's dataset that examines the effects of forebrain pathology on respiratory function over 2 years of degeneration.
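    As a rough sketch of the automated breath-selection step described in the points above, the code below finds candidate breaths as supra-threshold regions of the waveform and filters them by a minimum duration. The function and parameter names are illustrative assumptions, not Breathe Easy's actual interface.

```python
import numpy as np

def detect_breaths(trace, fs, min_amplitude=0.1, min_duration=0.05):
    """Return (onset, offset) sample-index pairs of candidate breaths.

    trace: 1-D flow/pressure signal; fs: sampling rate in Hz. Breaths are
    supra-threshold regions lasting at least min_duration seconds.
    """
    above = trace > min_amplitude
    edges = np.diff(above.astype(int))       # +1 rising edge, -1 falling edge
    onsets = np.flatnonzero(edges == 1) + 1
    offsets = np.flatnonzero(edges == -1) + 1
    if offsets.size and onsets.size and offsets[0] < onsets[0]:
        offsets = offsets[1:]                # drop a partial breath at the start
    return [(a, b) for a, b in zip(onsets, offsets)
            if (b - a) / fs >= min_duration]

# Example: a noisy 2 Hz "breathing" trace sampled at 1 kHz for 10 s.
fs = 1_000
t = np.arange(0, 10, 1 / fs)
trace = np.sin(2 * np.pi * 2 * t) + 0.05 * np.random.default_rng(2).normal(size=t.size)
print(len(detect_breaths(trace, fs)), "candidate breaths")
```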

     
  5. Free energies as a function of a selected set of collective variables are commonly computed in molecular simulation and are of significant value in understanding and engineering molecular behavior. These free energy surfaces are most commonly estimated using variants of histogramming techniques, but such approaches obscure two important facets of these functions. First, the empirical observations along the collective variable are defined by an ensemble of discrete observations, and the coarsening of these observations into a histogram bin incurs unnecessary loss of information. Second, the free energy surface is itself almost always a continuous function, and its representation by a histogram introduces inherent approximations due to the discretization. In this study, we relate the observed discrete observations from biased simulations to the inferred underlying continuous probability distribution over the collective variables and derive histogram-free techniques for estimating this free energy surface. We reformulate free energy surface estimation as minimization of a Kullback–Leibler divergence between a continuous trial function and the discrete empirical distribution and show that this is equivalent to likelihood maximization of a trial function given a set of sampled data. We then present a fully Bayesian treatment of this formalism, which enables the incorporation of powerful Bayesian tools such as the inclusion of regularizing priors, uncertainty quantification, and model selection techniques. We demonstrate this new formalism in the analysis of umbrella sampling simulations for the χ torsion of a valine side chain in the L99A mutant of T4 lysozyme with benzene bound in the cavity.
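     A minimal sketch of the histogram-free idea follows: represent the free energy surface by a continuous trial function and fit its parameters by maximizing the likelihood of the sampled collective-variable values (equivalently, minimizing the KL divergence to the empirical distribution). The polynomial basis, unit inverse temperature, and unbiased synthetic samples are simplifying assumptions; the paper treats biased umbrella-sampling data and a full Bayesian extension.

```python
import numpy as np
from scipy.optimize import minimize

beta = 1.0                                   # inverse temperature (assumed units)
rng = np.random.default_rng(3)
samples = rng.normal(0.5, 0.2, 2_000)        # stand-in collective-variable samples

grid = np.linspace(-0.5, 1.5, 400)           # quadrature grid for the partition sum
dx = grid[1] - grid[0]

def neg_log_likelihood(theta):
    """-sum_i log p(x_i) with p(x) = exp(-beta * F(x)) / Z(theta)."""
    F_samples = np.polyval(theta, samples)   # trial F(x): polynomial in x
    F_grid = np.polyval(theta, grid)
    logZ = np.log(np.sum(np.exp(-beta * F_grid)) * dx)
    return np.sum(beta * F_samples) + samples.size * logZ

theta0 = np.zeros(5)                         # quartic trial function
result = minimize(neg_log_likelihood, theta0, method="L-BFGS-B")

F_fit = np.polyval(result.x, grid)
F_fit -= F_fit.min()                         # F is defined only up to a constant
```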