Abstract: Analysis of high-energy physics (HEP) data is an iterative process in which the results of one step often inform the next. In an exploratory analysis, it is common to perform one computation on a collection of events, then view the results (often as histograms) to decide what to try next. Awkward Array is a Scikit-HEP Python package that enables data analysis with array-at-a-time operations, implementing cuts as slices, combinatorics as composable functions, and so on. However, most C++ HEP libraries, such as FastJet, have an imperative, one-particle-at-a-time interface, which would be inefficient in Python and goes against the grain of the array-at-a-time logic of scientific Python. Therefore, we developed fastjet, a pip-installable Python package that provides FastJet C++ binaries, the classic (particle-at-a-time) Python interface, and a new array-oriented interface for use with Awkward Array. The new interface streamlines interoperability with scientific Python software beyond HEP, such as machine learning. In one case, adopting this library along with other array-oriented tools accelerated HEP analysis code by a factor of 20. The library was designed to integrate easily with other libraries in the Scikit-HEP ecosystem, including Uproot (file I/O), hist (histogramming), Vector (Lorentz vectors), and Coffea (high-level glue). We discuss the design of the fastjet Python library, including how it integrates the classic interface with the array-oriented interface and with the Vector library for Lorentz vector operations. The new interface was developed as open source.
hep_tables: Heterogeneous Array Programming for HEP
Array operations are one of the most concise ways of expressing the common filtering and simple aggregation operations that are the hallmark of a particle physics analysis: selection, filtering, basic vector operations, and filling histograms. The High Luminosity run of the Large Hadron Collider (HL-LHC), scheduled to start in 2026, will require physicists to regularly skim datasets that are over a PB in size and to run repeatedly over datasets that are hundreds of TB, too big to fit in memory. Declarative programming techniques separate the intent of the physicist from the mechanics of finding the data and of using distributed computing to process it and make histograms. This paper describes a prototype library that implements a declarative distributed framework based on array programming. The library provides a framework in which different sub-systems cooperate, via plug-ins, to produce plots. The prototype has a ServiceX data-delivery sub-system and an awkward array sub-system that cooperate to generate the requested data or plots. The ServiceX sub-system runs against ATLAS xAOD data and flat ROOT TTrees, and the awkward array sub-system runs on the columnar data produced by ServiceX.
- Award ID(s):
- 1836650
- PAR ID:
- 10354363
- Editor(s):
- Biscarat, C.; Campana, S.; Hegner, B.; Roiser, S.; Rovelli, C.I.; Stewart, G.A.
- Date Published:
- Journal Name:
- EPJ Web of Conferences
- Volume:
- 251
- ISSN:
- 2100-014X
- Page Range / eLocation ID:
- 03061
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Choreographic programming is an emerging paradigm for programming distributed systems. In choreographic programming, the programmer describes the behavior of the entire system as a single, unified program (a choreography), which is then compiled to individual programs that run on each node, via a compilation step called endpoint projection. We present a new model for functional choreographic programming where choreographies are expressed as computations in a monad. Our model supports cutting-edge choreographic programming features that enable modularity and code reuse: in particular, it supports higher-order choreographies, in which a choreography may be passed as an argument to another choreography, and location-polymorphic choreographies, in which a choreography can abstract over nodes. Our model is implemented in a Haskell library, HasChor, which lets programmers write choreographic programs while using the rich Haskell ecosystem at no cost, bringing choreographic programming within reach of everyday Haskellers. Moreover, thanks to Haskell's abstractions, the implementation of the HasChor library itself is concise and understandable, boiling down endpoint projection to its short and simple essence.
-
Access libraries such as ROOT[1] and HDF5[2] allow users to interact with datasets using high-level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage system interfaces and are generally unable to fully benefit from modern fast storage devices. For example, access libraries often implement buffering and data layouts that assume that large, single-threaded sequential access patterns cause less overall latency than small parallel random accesses: while this is true for spinning media, it is not true for flash media. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset mapping infrastructures that can integrate and scale out existing access libraries using Ceph’s extensible object model, avoiding re-implementation or even modification of these access libraries as much as possible. These programmable storage extensions, coupled with our distributed dataset mapping techniques, enable: 1) access library operations to be offloaded to storage system servers, 2) the independent evolution of access libraries and storage systems, and 3) full use of the existing load balancing, elasticity, and failure management of distributed storage systems like Ceph. They also create more opportunities for server-local optimizations. For example, storage servers might include local key/value stores combined with chunk stores that require different optimizations than a local file system. As storage servers evolve to support new storage devices like non-volatile memory, these server-local optimizations can be implemented while minimizing disruptions to applications.
We will report progress on the means by which distributed dataset mapping can be abstracted over particular access libraries, including access libraries for ROOT data, and on how we address some of the challenges revolving around data partitioning and composability of access operations.
-
Interactive visualization design and research have primarily focused on local data and synchronous events. However, for more complex use cases (e.g., remote database access and streaming data sources), developers must grapple with distributed data and asynchronous events. Currently, constructing these use cases is difficult and time-consuming; developers are forced to operationally program low-level details like asynchronous database querying and reactive event handling. This approach is in stark contrast to modern methods for browser-based interactive visualization, which feature high-level declarative specifications. In response, we present DIEL, a declarative framework that supports asynchronous events over distributed data. As in many declarative languages, DIEL developers specify only what data they want, rather than procedural steps for how to assemble it. Uniquely, DIEL models asynchronous events (e.g., user interactions, server responses) as streams of data that are captured in event logs. To specify the state of a visualization at any time, developers write declarative queries over the data and event logs; DIEL compiles and optimizes a corresponding dataflow graph, and automatically generates the necessary low-level distributed systems details. We demonstrate DIEL's performance and expressivity through example interactive visualizations that make diverse use of remote data and asynchronous events. We further evaluate DIEL's usability using the Cognitive Dimensions of Notations framework, revealing wins such as ease of change, and compromises such as premature commitments.
-
Biscarat, C.; Campana, S.; Hegner, B.; Roiser, S.; Rovelli, C.I.; Stewart, G.A. (Ed.) The cabinetry library provides a Python-based solution for building and steering binned template fits. It tightly integrates with the pythonic High Energy Physics ecosystem, and in particular with pyhf for statistical inference. cabinetry uses a declarative approach for building statistical models, with a JSON schema describing possible configuration choices. Model building instructions can additionally be provided via custom code, which is automatically executed when applicable at key steps of the workflow. The library implements interfaces for performing maximum likelihood fitting, upper parameter limit determination, and discovery significance calculation. cabinetry also provides a range of utilities to study and disseminate fit results. These include visualizations of the fit model and data, visualizations of template histograms and fit results, ranking of nuisance parameters by their impact, a goodness-of-fit calculation, and likelihood scans. The library takes a modular approach, allowing users to include some or all of its functionality in their workflow.