
Title: Coffea-casa: an analysis facility prototype
Data analysis in HEP has often relied on batch systems and event loops; users are given a non-interactive interface to computing resources and consider data event-by-event. The “Coffea-casa” prototype analysis facility is an effort to provide users with alternate mechanisms to access computing resources and enable new programming paradigms. Instead of the command-line interface and asynchronous batch access, a notebook-based web interface and interactive computing are provided. Instead of writing event loops, the column-based Coffea library is used. In this paper, we describe the architectural components of the facility, the services offered to end users, and how it integrates into a larger ecosystem for data access and authentication.
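To make the contrast with event loops concrete, here is a minimal sketch of column-based analysis with the Coffea library (coffea 0.7-style API). The input file name and the dimuon selection are illustrative placeholders, not taken from the paper.

```python
# Minimal sketch of column-based analysis with coffea/awkward (0.7-style API),
# as contrasted with an explicit event loop. The file path is a placeholder.
import awkward as ak
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

events = NanoEventsFactory.from_root(
    "nano_dimuon.root",              # placeholder input file
    schemaclass=NanoAODSchema,
).events()

# Select events with at least two muons, operating on whole columns at once
# rather than iterating event by event.
muons = events.Muon
pair_mask = ak.num(muons) >= 2
dimuon = muons[pair_mask][:, 0] + muons[pair_mask][:, 1]
print(ak.to_list(dimuon.mass[:5]))   # invariant masses of the leading pair
```

Selections and kinematic calculations apply to whole columns of events at once, which is what lets a facility pair this style with interactive, scaled-out computing.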
Authors:
Editors:
Biscarat, C.; Campana, S.; Hegner, B.; Roiser, S.; Rovelli, C.I.; Stewart, G.A.
Award ID(s):
1836650
Publication Date:
NSF-PAR ID:
10354367
Journal Name:
EPJ Web of Conferences
Volume:
251
Page Range or eLocation-ID:
02061
ISSN:
2100-014X
Sponsoring Org:
National Science Foundation
More Like this
  1. Workflow management systems (WMSs) are commonly used to organize and automate sequences of tasks as workflows to accelerate scientific discoveries. During complex workflow modeling, a local interactive workflow environment is desirable, as users usually rely on their rich local environments for fast prototyping and refinement before they consider using more powerful computing resources. However, existing WMSs do not simultaneously support local interactive workflow environments and HPC resources. In this paper, we present an on-demand access mechanism to remote HPC resources from desktop/laptop-based workflow management software to compose, monitor, and analyze scientific workflows in the CyberWater project. CyberWater is an open-data and open-modeling software framework for the environmental and water communities. In this work, we extend the open-model, open-data design of CyberWater with on-demand HPC access. In particular, we design and implement the LaunchAgent library, which can be integrated into the local desktop environment to allow on-demand usage of remote resources for hydrology-related workflows. LaunchAgent manages authentication to remote resources, prepares computationally intensive or data-intensive tasks as batch jobs, submits jobs to remote resources, and monitors the quality of service for users (a hypothetical sketch of this submit-and-monitor pattern appears after this list). LaunchAgent interacts seamlessly with the other existing components of CyberWater, which can now provide the advantages of both a feature-rich desktop software experience and increased computational power through on-demand HPC/Cloud usage. In our evaluations, we demonstrate how a hydrology workflow consisting of both local and remote tasks can be constructed, and we show that the added on-demand HPC/Cloud usage helps speed up hydrology workflows while allowing intuitive workflow configuration and execution through a desktop graphical user interface.
  2. As the volume of data and the technical complexity of large-scale analysis increase, many domain experts want powerful computation and a familiar analysis interface so they can fully participate in the analysis workflow by focusing on individual datasets and leaving the large-scale computation to the system. Towards this goal, we investigate and benchmark a family of Divide-and-Conquer strategies that help domain experts perform large-scale simulations by scaling up analysis code written in R, the most popular language for data science and interactive analysis. We implement Divide-and-Conquer strategies that use R as the analysis (and computing) language, allowing advanced users to provide custom R scripts and variables that are fully embedded into the large-scale analysis workflow. The process divides large-scale simulation tasks and conquers them with Slurm array jobs and R: simulations and final aggregations are scheduled as array jobs in parallel to accelerate the knowledge discovery process (a schematic worker script in this style appears after this list). The objective is to provide a new analytics workflow for large-scale analysis loops in which expert users need only focus on the Divide-and-Conquer tasks that require their domain knowledge.
  3. R is the preferred language for data analytics due to its open-source development and high extensibility. Exponential growth in data has led to longer processing times and to the rise of parallel computing technologies for analysis. Using R together with high-performance computing resources, however, is a cumbersome task. This paper proposes a framework that gives users access to high-performance computing resources and simplifies configuration, programming, data upload, and job scheduling through a web user interface. In addition, it provides two modes of parallelization for data-intensive computing tasks, catering to a wide range of users. The case studies emphasize the utility and efficiency of the framework, which provides better performance, ease of use, and high scalability.
  4. Topographic differencing measures landscape change by comparing multi-temporal high-resolution topography data sets. Here, we focused on two types of topographic differencing: (1) vertical differencing, the subtraction of digital elevation models (DEMs) that span an event of interest, and (2) three-dimensional (3-D) differencing, which measures surface change by registering point clouds with a rigid deformation. We recently released topographic differencing in OpenTopography, where users perform on-demand vertical and 3-D differencing via an online interface. OpenTopography is a U.S. National Science Foundation–funded facility that provides access to topographic data and processing tools. While topographic differencing has been applied in numerous research studies, the lack of standardization, particularly for 3-D differencing, requires the customization of processing for individual data sets and hinders the community's ability to efficiently perform differencing on the growing archive of topography data. Our paper focuses on streamlined techniques for efficiently differencing data sets with varying spatial resolution and sensor type (i.e., optical vs. light detection and ranging [lidar]) and over variable landscapes. To optimize on-demand differencing, we considered algorithm choice and displacement resolution. The optimal resolution is controlled by point density, landscape characteristics (e.g., leaf-on vs. leaf-off), and data set quality. We provide processing options derived from metadata that allow users to produce optimal high-quality results, while experienced users can fine-tune the parameters to suit their needs (a minimal sketch of vertical differencing appears after this list). We anticipate that the differencing tool will expand access to this state-of-the-art technology, serve as a valuable educational tool, and act as a template for differencing the growing number of multi-temporal topography data sets.
  5. Many newcomers to programming and computational thinking have been brought up on interactive, gamified learning environments. Introductory computer science courses at the university level need to dig deeper into these topics, but must do so with similarly engaging technologies and projects. To address this need, we have built a framework for a grid-based game API with event-based blocking and continuous non-blocking interfaces. The framework abstracts away much of the complexity of input and rendering and exposes a simple game grid, similar to a 2D array indexed by rows and columns (a sketch of such an interface appears after this list). As such, our project helps reinforce basic computing concepts (arrays, loops, OOP, recursion) with a customizable and engaging game interface. We have discussed the valuable influence of visual representations of students' data structures using BRIDGES in previous publications, and we believe our game API can provide significance and intrigue for students in introductory courses and beyond. Our Bridges Games App website (http://bridges-games.herokuapp.com/) presents descriptions and instructions.
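For item 1, the following is a hypothetical Python sketch of the submit-and-monitor pattern the LaunchAgent library is described as implementing: authenticate to a remote HPC resource, stage a batch job, submit it, and poll the scheduler. None of these names come from CyberWater; the SSH host, key file, and Slurm commands are illustrative assumptions.

```python
# Hypothetical sketch of on-demand remote batch submission: authenticate,
# stage a job, submit it, and poll its status. Not the actual CyberWater API.
import time
import paramiko

def submit_remote_job(host: str, user: str, key_file: str, script: str) -> str:
    """Copy a batch script to an HPC login node, sbatch it, wait, return the job id."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username=user, key_filename=key_file)  # authentication
    sftp = ssh.open_sftp()
    sftp.put(script, "job.sh")                               # stage the job
    _, out, _ = ssh.exec_command("sbatch job.sh")            # submit
    job_id = out.read().split()[-1].decode()                 # "Submitted batch job <id>"
    # Poll until the scheduler no longer lists the job (quality-of-service check).
    while True:
        _, out, _ = ssh.exec_command(f"squeue -h -j {job_id}")
        if not out.read().strip():
            break
        time.sleep(30)
    ssh.close()
    return job_id
```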
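For item 2, this is a minimal sketch of a Divide-and-Conquer worker in the array-job style described there. The paper's implementation uses R; this Python stand-in only mirrors the structure, and the chunk/result file layout is an assumption.

```python
# Illustrative Divide-and-Conquer worker: each Slurm array task processes one
# pre-divided chunk; a separate final job would aggregate the results.
import os
import pickle

CHUNKS = "chunks"      # directory of pre-divided inputs (assumed layout)
RESULTS = "results"

def analyze(chunk):
    """Placeholder for the user-supplied analysis of one chunk."""
    return sum(chunk) / len(chunk)

if __name__ == "__main__":
    # SLURM_ARRAY_TASK_ID selects this task's chunk: the "divide" step was done
    # up front, and `sbatch --array=0-99 worker.sh` fans out the "conquer" step.
    task = int(os.environ["SLURM_ARRAY_TASK_ID"])
    with open(f"{CHUNKS}/chunk_{task}.pkl", "rb") as f:
        chunk = pickle.load(f)
    with open(f"{RESULTS}/result_{task}.pkl", "wb") as f:
        pickle.dump(analyze(chunk), f)
```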
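For item 4, vertical differencing is, at its core, a per-cell subtraction of co-registered DEMs. The sketch below shows that operation with rasterio; the file names are placeholders, and reprojection/alignment (which OpenTopography's service handles) is assumed to have been done already.

```python
# Minimal sketch of vertical differencing: subtract a pre-event DEM from a
# post-event DEM on a common grid. File names are placeholders.
import rasterio

with rasterio.open("dem_pre_event.tif") as pre, \
     rasterio.open("dem_post_event.tif") as post:
    assert pre.shape == post.shape          # assume co-registered grids
    change = post.read(1) - pre.read(1)     # positive = elevation gain
    profile = pre.profile

with rasterio.open("vertical_difference.tif", "w", **profile) as dst:
    dst.write(change.astype(profile["dtype"]), 1)
```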
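For item 5, the sketch below illustrates the shape of a grid-based game abstraction of the kind described: a 2D grid addressed by (row, column) plus a continuous, non-blocking update callback. It is not the actual BRIDGES Games API.

```python
# Illustrative grid-based game interface: a 2D grid indexed by rows and
# columns, with a per-frame update() callback. Not the real BRIDGES API.
class GameGrid:
    def __init__(self, rows: int, cols: int, fill: str = "."):
        self.rows, self.cols = rows, cols
        self.cells = [[fill] * cols for _ in range(rows)]

    def set(self, row: int, col: int, symbol: str) -> None:
        self.cells[row][col] = symbol     # indexed like a 2D array

    def render(self) -> str:
        return "\n".join("".join(r) for r in self.cells)

class Game:
    """Subclass and override update(); the framework calls it each frame."""
    def __init__(self, rows: int, cols: int):
        self.grid = GameGrid(rows, cols)

    def update(self) -> None:             # continuous, non-blocking callback
        raise NotImplementedError

    def run(self, frames: int) -> None:
        for _ in range(frames):
            self.update()
            print(self.grid.render(), end="\n\n")
```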