skip to main content

Title: Particle Cloud Generation with Message Passing Generative Adversarial Networks
In high energy physics (HEP), jets are collections of correlated particles produced ubiquitously in particle collisions such as those at the CERN Large Hadron Collider (LHC). Machine-learning-based generative models, such as generative adversarial networks (GANs), have the potential to significantly accelerate LHC jet simulations. However, despite jets having a natural representation as a set of particles in momentum-space, a.k.a. a particle cloud, to our knowledge there exist no generative models applied to such a dataset. We introduce a new particle cloud dataset (JetNet), and, due to similarities between particle and point clouds, apply to it existing point cloud GANs. Results are evaluated using (1) the 1-Wasserstein distance between high- and low-level feature distributions, (2) a newly developed Fréchet ParticleNet Distance, and (3) the coverage and (4) minimum matching distance metrics. Existing GANs are found to be inadequate for physics applications, hence we develop a new message passing GAN (MPGAN), which outperforms existing point cloud GANs on virtually every metric and shows promise for use in HEP. We propose JetNet as a novel point-cloud-style dataset for the machine learning community to experiment with, and set MPGAN as a benchmark to improve upon for future generative models.
; ; ; ; ; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    In general-purpose particle detectors, the particle-flow algorithm may be used to reconstruct a comprehensive particle-level view of the event by combining information from the calorimeters and the trackers, significantly improving the detector resolution for jets and the missing transverse momentum. In view of the planned high-luminosity upgrade of the CERN Large Hadron Collider (LHC), it is necessary to revisit existing reconstruction algorithms and ensure that both the physics and computational performance are sufficient in an environment with many simultaneous proton–proton interactions (pileup). Machine learning may offer a prospect for computationally efficient event reconstruction that is well-suited to heterogeneous computingmore »platforms, while significantly improving the reconstruction quality over rule-based algorithms for granular detectors. We introduce MLPF, a novel, end-to-end trainable, machine-learned particle-flow algorithm based on parallelizable, computationally efficient, and scalable graph neural network optimized using a multi-task objective on simulated events. We report the physics and computational performance of the MLPF algorithm on a Monte Carlo dataset of top quark–antiquark pairs produced in proton–proton collisions in conditions similar to those expected for the high-luminosity LHC. The MLPF algorithm improves the physics response with respect to a rule-based benchmark algorithm and demonstrates computationally scalable particle-flow reconstruction in a high-pileup environment.

    « less
  2. The DeepLearningEpilepsyDetectionChallenge: design, implementation, andtestofanewcrowd-sourced AIchallengeecosystem Isabell Kiral*, Subhrajit Roy*, Todd Mummert*, Alan Braz*, Jason Tsay, Jianbin Tang, Umar Asif, Thomas Schaffter, Eren Mehmet, The IBM Epilepsy Consortium◊ , Joseph Picone, Iyad Obeid, Bruno De Assis Marques, Stefan Maetschke, Rania Khalaf†, Michal Rosen-Zvi† , Gustavo Stolovitzky† , Mahtab Mirmomeni† , Stefan Harrer† * These authors contributed equally to this work † Corresponding authors:,,,, ◊ Members of the IBM Epilepsy Consortium are listed in the Acknowledgements section J. Picone and I. Obeid are with Temple University, USA. T. Schaffter is with Sage Bionetworks, USA. E. Mehmetmore »is with the University of Illinois at Urbana-Champaign, USA. All other authors are with IBM Research in USA, Israel and Australia. Introduction This decade has seen an ever-growing number of scientific fields benefitting from the advances in machine learning technology and tooling. More recently, this trend reached the medical domain, with applications reaching from cancer diagnosis [1] to the development of brain-machine-interfaces [2]. While Kaggle has pioneered the crowd-sourcing of machine learning challenges to incentivise data scientists from around the world to advance algorithm and model design, the increasing complexity of problem statements demands of participants to be expert data scientists, deeply knowledgeable in at least one other scientific domain, and competent software engineers with access to large compute resources. People who match this description are few and far between, unfortunately leading to a shrinking pool of possible participants and a loss of experts dedicating their time to solving important problems. Participation is even further restricted in the context of any challenge run on confidential use cases or with sensitive data. Recently, we designed and ran a deep learning challenge to crowd-source the development of an automated labelling system for brain recordings, aiming to advance epilepsy research. A focus of this challenge, run internally in IBM, was the development of a platform that lowers the barrier of entry and therefore mitigates the risk of excluding interested parties from participating. The challenge: enabling wide participation With the goal to run a challenge that mobilises the largest possible pool of participants from IBM (global), we designed a use case around previous work in epileptic seizure prediction [3]. In this “Deep Learning Epilepsy Detection Challenge”, participants were asked to develop an automatic labelling system to reduce the time a clinician would need to diagnose patients with epilepsy. Labelled training and blind validation data for the challenge were generously provided by Temple University Hospital (TUH) [4]. TUH also devised a novel scoring metric for the detection of seizures that was used as basis for algorithm evaluation [5]. In order to provide an experience with a low barrier of entry, we designed a generalisable challenge platform under the following principles: 1. No participant should need to have in-depth knowledge of the specific domain. (i.e. no participant should need to be a neuroscientist or epileptologist.) 2. No participant should need to be an expert data scientist. 3. No participant should need more than basic programming knowledge. (i.e. no participant should need to learn how to process fringe data formats and stream data efficiently.) 4. No participant should need to provide their own computing resources. In addition to the above, our platform should further • guide participants through the entire process from sign-up to model submission, • facilitate collaboration, and • provide instant feedback to the participants through data visualisation and intermediate online leaderboards. The platform The architecture of the platform that was designed and developed is shown in Figure 1. The entire system consists of a number of interacting components. (1) A web portal serves as the entry point to challenge participation, providing challenge information, such as timelines and challenge rules, and scientific background. The portal also facilitated the formation of teams and provided participants with an intermediate leaderboard of submitted results and a final leaderboard at the end of the challenge. (2) IBM Watson Studio [6] is the umbrella term for a number of services offered by IBM. Upon creation of a user account through the web portal, an IBM Watson Studio account was automatically created for each participant that allowed users access to IBM's Data Science Experience (DSX), the analytics engine Watson Machine Learning (WML), and IBM's Cloud Object Storage (COS) [7], all of which will be described in more detail in further sections. (3) The user interface and starter kit were hosted on IBM's Data Science Experience platform (DSX) and formed the main component for designing and testing models during the challenge. DSX allows for real-time collaboration on shared notebooks between team members. A starter kit in the form of a Python notebook, supporting the popular deep learning libraries TensorFLow [8] and PyTorch [9], was provided to all teams to guide them through the challenge process. Upon instantiation, the starter kit loaded necessary python libraries and custom functions for the invisible integration with COS and WML. In dedicated spots in the notebook, participants could write custom pre-processing code, machine learning models, and post-processing algorithms. The starter kit provided instant feedback about participants' custom routines through data visualisations. Using the notebook only, teams were able to run the code on WML, making use of a compute cluster of IBM's resources. The starter kit also enabled submission of the final code to a data storage to which only the challenge team had access. (4) Watson Machine Learning provided access to shared compute resources (GPUs). Code was bundled up automatically in the starter kit and deployed to and run on WML. WML in turn had access to shared storage from which it requested recorded data and to which it stored the participant's code and trained models. (5) IBM's Cloud Object Storage held the data for this challenge. Using the starter kit, participants could investigate their results as well as data samples in order to better design custom algorithms. (6) Utility Functions were loaded into the starter kit at instantiation. This set of functions included code to pre-process data into a more common format, to optimise streaming through the use of the NutsFlow and NutsML libraries [10], and to provide seamless access to the all IBM services used. Not captured in the diagram is the final code evaluation, which was conducted in an automated way as soon as code was submitted though the starter kit, minimising the burden on the challenge organising team. Figure 1: High-level architecture of the challenge platform Measuring success The competitive phase of the "Deep Learning Epilepsy Detection Challenge" ran for 6 months. Twenty-five teams, with a total number of 87 scientists and software engineers from 14 global locations participated. All participants made use of the starter kit we provided and ran algorithms on IBM's infrastructure WML. Seven teams persisted until the end of the challenge and submitted final solutions. The best performing solutions reached seizure detection performances which allow to reduce hundred-fold the time eliptologists need to annotate continuous EEG recordings. Thus, we expect the developed algorithms to aid in the diagnosis of epilepsy by significantly shortening manual labelling time. Detailed results are currently in preparation for publication. Equally important to solving the scientific challenge, however, was to understand whether we managed to encourage participation from non-expert data scientists. Figure 2: Primary occupation as reported by challenge participants Out of the 40 participants for whom we have occupational information, 23 reported Data Science or AI as their main job description, 11 reported being a Software Engineer, and 2 people had expertise in Neuroscience. Figure 2 shows that participants had a variety of specialisations, including some that are in no way related to data science, software engineering, or neuroscience. No participant had deep knowledge and experience in data science, software engineering and neuroscience. Conclusion Given the growing complexity of data science problems and increasing dataset sizes, in order to solve these problems, it is imperative to enable collaboration between people with differences in expertise with a focus on inclusiveness and having a low barrier of entry. We designed, implemented, and tested a challenge platform to address exactly this. Using our platform, we ran a deep-learning challenge for epileptic seizure detection. 87 IBM employees from several business units including but not limited to IBM Research with a variety of skills, including sales and design, participated in this highly technical challenge.« less
  3. Abstract. Mixed-phase Southern Ocean clouds are challenging to simulate, and theirrepresentation in climate models is an important control on climatesensitivity. In particular, the amount of supercooled water and frozen massthat they contain in the present climate is a predictor of their planetaryfeedback in a warming climate. The recent Southern Ocean Clouds, Radiation, Aerosol Transport Experimental Study (SOCRATES) vastly increased theamount of in situ data available from mixed-phase Southern Ocean clouds usefulfor model evaluation. Bulk measurements distinguishing liquid and ice watercontent are not available from SOCRATES, so single-particle phaseclassifications from the Two-Dimensional Stereo (2D-S) probe are invaluablefor quantifying mixed-phase cloud properties. Motivatedmore »by the presence oflarge biases in existing phase discrimination algorithms, we develop a noveltechnique for single-particle phase classification of binary 2D-S images usinga random forest algorithm, which we refer to as the University of WashingtonIce–Liquid Discriminator (UWILD). UWILD uses 14 parameters computed frombinary image data, as well as particle inter-arrival time, to predict phase.We use liquid-only and ice-dominated time periods within the SOCRATES datasetas training and testing data. This novel approach to model training avoidsmajor pitfalls associated with using manually labeled data, including reducedmodel generalizability and high labor costs. We find that UWILD is wellcalibrated and has an overall accuracy of 95 % compared to72 % and 79 % for two existing phase classificationalgorithms that we compare it with. UWILD improves classifications of smallice crystals and large liquid drops in particular and has more flexibilitythan the other algorithms to identify both liquid-dominated and ice-dominatedregions within the SOCRATES dataset. UWILD misclassifies a small percentageof large liquid drops as ice. Such misclassified particles are typicallyassociated with model confidence below 75 % and can easily befiltered out of the dataset. UWILD phase classifications show that particleswith area-equivalent diameter (Deq)  < 0.17 mm are mostlyliquid at all temperatures sampled, down to −40 ∘C. Largerparticles (Deq>0.17 mm) are predominantly frozen at alltemperatures below 0 ∘C. Between 0 and 5 ∘C,there are roughly equal numbers of frozen and liquid mid-sized particles (0.170.33 mm) are mostly frozen. We also use UWILD's phaseclassifications to estimate sub-1 Hz phase heterogeneity, and we showexamples of meter-scale cloud phase heterogeneity in the SOCRATES dataset.« less
  4. We describe the outcome of a data challenge conducted as part of the Dark Machines ( initiative and the Les Houches 2019 workshop on Physics at TeV colliders. The challenged aims to detect signals of new physics at the Large Hadron Collider (LHC) using unsupervised machine learning algorithms. First, we propose how an anomaly score could be implemented to define model-independent signal regions in LHC searches. We define and describe a large benchmark dataset, consisting of >1 billion simulated LHC events corresponding to 10\, fb^{-1} 10 f b − 1 of proton-proton collisions at a center-of-mass energy of 13 TeV.more »We then review a wide range of anomaly detection and density estimation algorithms, developed in the context of the data challenge, and we measure their performance in a set of realistic analysis environments. We draw a number of useful conclusions that will aid the development of unsupervised new physics searches during the third run of the LHC, and provide our benchmark dataset for future studies at Code to reproduce the analysis is provided at« less
  5. The goal of generative models is to learn the intricate relations between the data to create new simulated data, but current approaches fail in very high dimensions. When the true data-generating process is based on physical processes, these impose symmetries and constraints, and the generative model can be created by learning an effective description of the underlying physics, which enables scaling of the generative model to very high dimensions. In this work, we propose Lagrangian deep learning (LDL) for this purpose, applying it to learn outputs of cosmological hydrodynamical simulations. The model uses layers of Lagrangian displacements of particles describingmore »the observables to learn the effective physical laws. The displacements are modeled as the gradient of an effective potential, which explicitly satisfies the translational and rotational invariance. The total number of learned parameters is only of order 10, and they can be viewed as effective theory parameters. We combine N-body solver fast particle mesh (FastPM) with LDL and apply it to a wide range of cosmological outputs, from the dark matter to the stellar maps, gas density, and temperature. The computational cost of LDL is nearly four orders of magnitude lower than that of the full hydrodynamical simulations, yet it outperforms them at the same resolution. We achieve this with only of order 10 layers from the initial conditions to the final output, in contrast to typical cosmological simulations with thousands of time steps. This opens up the possibility of analyzing cosmological observations entirely within this framework, without the need for large dark-matter simulations.

    « less