


Title: Ntuple Wizard: An Application to Access Large-Scale Open Data from LHCb
Abstract: Making the large data sets collected at the Large Hadron Collider (LHC) accessible to the world is a considerable challenge because of both the complexity and the volume of data. This paper presents the Ntuple Wizard, an application that leverages the existing computing infrastructure available to the LHCb collaboration in order to enable third-party users to request specific data. An intuitive web interface allows the discovery of accessible data sets and guides the user through the process of specifying a configuration-based request. The application allows for fine-grained control of the level of access granted to the public.
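The abstract above notes that requests are specified through configuration rather than code. As a purely illustrative sketch, one way such a request could be captured as structured data is shown below; every field name here is an assumption for illustration, not the actual Ntuple Wizard schema.

```python
import json

# Hypothetical sketch of a configuration-based ntuple request.
# All keys and values are illustrative assumptions, not the real Ntuple Wizard format.
request = {
    "dataset": "LHCb Run 2 open data (illustrative identifier)",
    "decay_descriptor": "B0 -> K+ pi-",      # decay of interest (example)
    "variables": ["PT", "ETA", "M"],         # quantities to store per candidate
    "selection": {"PT_min_MeV": 500},        # simple kinematic cut
    "output_format": "ROOT ntuple",
}

# Serialise the request so it could, in principle, be submitted through a web form or API.
print(json.dumps(request, indent=2))
```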
Award ID(s): 2012926
NSF-PAR ID: 10443525
Author(s) / Creator(s):
Date Published:
Journal Name: Computing and Software for Big Science
Volume: 7
Issue: 1
ISSN: 2510-2036
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Balakrishnan, Mahesh; Ghobadi, Manya (Eds.)
    Modern datacenter applications are concurrent, so they require synchronization to control access to shared data. Requests can contend for different combinations of locks, depending on application and request state. In this paper, we show that locks, especially blocking synchronization, can squander throughput and harm tail latency, even when the CPU is underutilized. Moreover, the presence of a large number of contention points, and the unpredictability in knowing which locks a request will require, make it difficult to prevent contention through overload control using traditional signals such as queueing delay and CPU utilization. We present Protego, a system that resolves these problems with two key ideas. First, it contributes a new admission control strategy that prevents compute congestion in the presence of lock contention. The key idea is to use marginal improvements in observed throughput, rather than CPU load or latency measurements, within a credit-based admission control algorithm that regulates the rate of incoming requests to a server. Second, it introduces a new latency-aware synchronization abstraction called Active Synchronization Queue Management (ASQM) that allows applications to abort requests if delays exceed latency objectives. We apply Protego to two real-world applications, Lucene and Memcached, and show that it achieves up to 3.3x more goodput and 12.2x lower 99th percentile latency than the state-of-the-art overload control systems while avoiding congestion collapse. 
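As a rough illustration of the credit-based idea described above, the following sketch hill-climbs on observed throughput to set an admission limit. It is an assumption-laden toy, not Protego's actual implementation; the interval structure, constants, and class name are invented for illustration.

```python
class CreditAdmissionController:
    """Illustrative sketch: admit at most `target_credits` requests per measurement
    interval, and adjust that target based on marginal changes in observed throughput."""

    def __init__(self, target_credits=100, step=10):
        self.target_credits = target_credits
        self.step = step
        self.available = target_credits
        self.last_throughput = 0.0

    def try_admit(self):
        # Admit a request only if a credit remains in the current interval.
        if self.available > 0:
            self.available -= 1
            return True
        return False

    def end_interval(self, observed_throughput):
        # Hill-climb on throughput: grow the credit target while admitting more work
        # keeps improving throughput, and back off once extra admissions stop paying off
        # (e.g. because requests are piling up behind contended locks).
        if observed_throughput >= self.last_throughput:
            self.target_credits += self.step
        else:
            self.target_credits = max(self.step, self.target_credits - self.step)
        self.last_throughput = observed_throughput
        self.available = self.target_credits
```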
  2.
    Research in astronomy is undergoing a major paradigm shift, transformed by the advent of large, automated sky surveys into a data-rich field where multi-TB to PB-sized spatio-temporal data sets are commonplace. For example, the Legacy Survey of Space and Time (LSST) is about to begin delivering observations of >10^10 objects, including a database with >4 x 10^13 rows of time series data. This volume presents a challenge: how should a domain scientist with little experience in data management or distributed computing access data and perform analyses at PB scale? We present a possible solution to this problem built on (adapted) industry-standard tools and made accessible through web gateways. We have i) developed Astronomy eXtensions for Spark (AXS), a series of astronomy-specific modifications to Apache Spark that allow astronomers to tap into its computational scalability; ii) deployed datasets in AXS-queriable format in Amazon S3, leveraging its I/O scalability; iii) developed a deployment of Spark on Kubernetes with auto-scaling configurations requiring no end-user interaction; and iv) provided a web-accessible Jupyter notebook front end via JupyterHub, including a rich library of pre-installed common astronomical software (accessible at http://hub.dirac.institute). We use this system to enable the analysis of data from the Zwicky Transient Facility, presently the closest precursor survey to the LSST, and discuss initial results. To our knowledge, this is a first application of cloud-based scalable analytics to astronomical datasets approaching LSST scale. The code is available at https://github.com/astronomy-commons.
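The following is a hedged sketch of the kind of catalogue query such a setup is meant to enable, written in plain PySpark rather than the AXS extensions themselves; the bucket path and column names are illustrative assumptions, not the actual ZTF schema or AXS API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical example: filter a large survey catalogue stored on S3 with Spark.
# Path and column names are placeholders; AXS itself adds astronomy-specific
# operations (e.g. cross-matching) on top of Spark that are not shown here.
spark = SparkSession.builder.appName("ztf-catalog-demo").getOrCreate()

catalog = spark.read.parquet("s3a://example-bucket/ztf-catalog/")  # hypothetical path

# Select bright sources in a small patch of sky.
bright = (catalog
          .filter((F.col("ra").between(150.0, 151.0)) &
                  (F.col("dec").between(2.0, 3.0)) &
                  (F.col("mag_mean") < 18.0))
          .select("object_id", "ra", "dec", "mag_mean"))

bright.show(10)
```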
  3. Summary

    One of the activities of the Pacific Rim Applications and Grid Middleware Assembly (PRAGMA) is fostering Virtual Biodiversity Expeditions by bringing domain scientists and cyberinfrastructure specialists together as a team. Over the past few years, PRAGMA members have been collaborating on virtualizing the Lifemapper software. Virtualization and cloud computing have introduced great flexibility and efficiency into IT projects. Virtualization refers to technologies that provide a layer of abstraction between the server hardware and the software that runs on it. This abstraction enables a logical view of computing resources and allows multiple virtual servers to run on the same hardware. With this project, we are virtualizing Lifemapper by enabling its installation and configuration on a virtual cluster. Virtualization provides application scalability, maximizes resource utilization, and creates a more efficient, agile, and automated infrastructure. However, the complexity inherent in these environments has downsides, including the need for special techniques to deploy cluster hosts, dependence on virtual environments, and more challenging application installation, management, and configuration. In this study, we report on the progress of the Lifemapper virtualization framework, focused on a reproducible and highly configurable infrastructure capable of fast deployment.

    Lifemapper is a distributed software application developed by the Biodiversity Institute at The University of Kansas. Lifemapper creates and maintains a publicly accessible archive of species distribution maps calculated from public specimen data. Lifemapper software also provides a suite of tools for biodiversity researchers that calculate single and multispecies distribution predictions and macroecological analyses through application programming interfaces. Our goal is to create a viable solution that can be easily adopted and reused by scientists from multiple institutions or projects. This solution (1) allows fast deployment of ready‐made cluster images, (2) reproduces the complete Lifemapper processing pipeline on demand at multiple sites and in different hosting environments, and (3) enables scientists to perform Lifemapper‐facilitated data processing on restricted‐use data, very large datasets, or other unique data.

    A key contribution of this work is describing the practical experience of taking a complex, clustered, domain-specific data analysis and simulation system and enabling its operation on a variety of system configurations. Uses of this portability range from whole-cluster replication to teaching and experimentation on a single laptop. System virtualization is used to practically define and make portable the full application stack, including its complex set of supporting software, and allows Lifemapper to be deployed in a variety of environments.
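The Lifemapper tools mentioned above are exposed through web application programming interfaces. As a purely hypothetical sketch of how a script might request a species distribution model from such an interface, consider the following; the base URL, endpoint path, and parameter names are invented placeholders, not Lifemapper's real API.

```python
import requests

# Hypothetical example of calling a Lifemapper-style web API.
# BASE_URL, the endpoint, and all payload keys are illustrative assumptions only.
BASE_URL = "https://lifemapper.example.org/api"   # placeholder deployment URL

payload = {
    "species": "Quercus alba",        # taxon to model (example)
    "algorithm": "maxent",            # hypothetical algorithm identifier
    "scenario": "observed-climate",   # hypothetical climate scenario name
}

response = requests.post(f"{BASE_URL}/sdm/projects", json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```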

     
  4. Many online learning platforms and MOOCs incorporate some amount of video-based content, but there are few randomized controlled experiments that evaluate the effectiveness of different methods of video integration. Given the large number of publicly available educational videos, an investigation into this content's impact on students could help lead to more effective and accessible video integration within learning platforms. In this work, a new feature was added to an existing online learning platform that allowed students to request skill-related videos while completing their online middle-school mathematics assignments. A total of 18,535 students participated in two large-scale randomized controlled experiments related to providing students with publicly available educational videos. The first experiment investigated the effect of providing students with the opportunity to request these videos, and the second investigated the effect of using a multi-armed bandit algorithm to recommend relevant videos. Additionally, this work investigated which features of the videos were significantly predictive of students' performance and which features could be used to personalize students' learning. Ultimately, students were mostly disinterested in the skill-related videos, preferring instead to use the platform's existing problem-specific support, and there were no statistically significant findings in either experiment. Additionally, while no video features were significantly predictive of students' performance, two video features had significant qualitative interactions with students' prior knowledge, showing that different content creators were more effective for different groups of students. These findings can be used to inform the design of future video-based features within online learning platforms and the creation of educational videos specifically targeting higher- or lower-knowledge students. The data and code used in this work can be found at https://osf.io/cxkzf/.
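For readers unfamiliar with the recommendation approach mentioned above, the following is a toy epsilon-greedy multi-armed bandit in Python. It illustrates the general technique only; the class, reward definition, and parameters are assumptions, not the algorithm actually deployed in the study.

```python
import random

class EpsilonGreedyBandit:
    """Toy epsilon-greedy bandit: pick one of several candidate videos and learn
    from binary feedback (e.g. whether the student solved the next problem)."""

    def __init__(self, video_ids, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {v: 0 for v in video_ids}    # times each video was shown
        self.values = {v: 0.0 for v in video_ids}  # running mean reward per video

    def recommend(self):
        # Explore occasionally, otherwise exploit the best-performing video so far.
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, video_id, reward):
        # Incrementally update the mean reward for the video that was shown.
        self.counts[video_id] += 1
        n = self.counts[video_id]
        self.values[video_id] += (reward - self.values[video_id]) / n

# Example usage: recommend among three videos and record an observed success.
bandit = EpsilonGreedyBandit(["video_a", "video_b", "video_c"])
chosen = bandit.recommend()
bandit.update(chosen, reward=1)
```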