Title: Coffea-casa: an analysis facility prototype
Data analysis in HEP has often relied on batch systems and event loops; users are given a non-interactive interface to computing resources and consider data event-by-event. The “Coffea-casa” prototype analysis facility is an effort to provide users with alternate mechanisms to access computing resources and enable new programming paradigms. Instead of the command-line interface and asynchronous batch access, a notebook-based web interface and interactive computing are provided. Instead of writing event loops, users work with the column-based Coffea library. In this paper, we describe the architectural components of the facility, the services offered to end users, and how it integrates into a larger ecosystem for data access and authentication.
Award ID(s):
1836650
NSF-PAR ID:
10354367
Author(s) / Creator(s):
; ; ; ; ; ;
Editor(s):
Biscarat, C.; Campana, S.; Hegner, B.; Roiser, S.; Rovelli, C.I.; Stewart, G.A.
Date Published:
Journal Name:
EPJ Web of Conferences
Volume:
251
ISSN:
2100-014X
Page Range / eLocation ID:
02061
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Workflow management systems (WMSs) are commonly used to organize/automate sequences of tasks as workflows to accelerate scientific discoveries. During complex workflow modeling, a local interactive workflow environment is desirable, as users usually rely on their rich, local environments for fast prototyping and refinements before they consider using more powerful computing resources. However, existing WMSs do not simultaneously support local interactive workflow environments and HPC resources. In this paper, we present an on-demand access mechanism to remote HPC resources from desktop/laptop-based workflow management software to compose, monitor and analyze scientific workflows in the CyberWater project. CyberWater is an open-data and open-modeling software framework for environmental and water communities. In this work, we extend the open-model, open-data design of CyberWater with on-demand HPC access capability. In particular, we design and implement the LaunchAgent library, which can be integrated into the local desktop environment to allow on-demand usage of remote resources for hydrology-related workflows. LaunchAgent manages authentication to remote resources, prepares the computationally-intensive or data-intensive tasks as batch jobs, submits jobs to remote resources, and monitors the quality of services for the users. LaunchAgent interacts seamlessly with other existing components in CyberWater, which is now able to provide the advantages of both a feature-rich desktop software experience and increased computation power through on-demand HPC/Cloud usage. In our evaluations, we demonstrate how a hydrology workflow that consists of both local and remote tasks can be constructed and show that the added on-demand HPC/Cloud usage helps speed up hydrology workflows while allowing intuitive workflow configuration and execution using a desktop graphical user interface.
  2. As the volume of data and the technical complexity of large-scale analysis increase, many domain experts desire a powerful computational and familiar analysis interface so that they can fully participate in the analysis workflow by focusing on individual datasets, leaving the large-scale computation to the system. Towards this goal, we investigate and benchmark a family of Divide-and-Conquer strategies that can help domain experts perform large-scale simulations by scaling up their analysis code written in R, the most popular data science and interactive analysis language. We implement the Divide-and-Conquer strategies that use R as the analysis (and computing) language, allowing advanced users to provide custom R scripts and variables to be fully embedded into the large-scale analysis workflow in R. The whole process divides large-scale simulation tasks and conquers them with Slurm array jobs and R. Simulations and final aggregations are scheduled as array jobs in parallel to accelerate the knowledge discovery process. The objective is to provide a new analytics workflow for performing similar large-scale analysis loops where expert users only need to focus on the Divide-and-Conquer tasks with the domain knowledge.
  3. Abstract Summary

    Foldit Standalone is an interactive graphical interface to the Rosetta molecular modeling package. In contrast to most command-line or batch interactions with Rosetta, Foldit Standalone is designed to allow easy, real-time, direct manipulation of protein structures, while also giving access to the extensive power of Rosetta computations. Derived from the user interface of the scientific discovery game Foldit (itself based on Rosetta), Foldit Standalone has added more advanced features and removed the competitive game elements. Foldit Standalone was built from the ground up with a custom rendering and event engine, configurable visualizations and interactions driven by Rosetta. Foldit Standalone contains, among other features: electron density and contact map visualizations, multiple sequence alignment tools for template-based modeling, rigid body transformation controls, RosettaScripts support and an embedded Lua interpreter.

    Availability and Implementation

    Foldit Standalone is available for download at https://fold.it/standalone, under the Rosetta license, which is free for academic and non-profit users. It is implemented in cross-platform C++ and binary executables are available for Windows, macOS and Linux.
  4. Analysis description languages are declarative interfaces for HEP data analysis that allow users to avoid writing event loops, simplify code, and enable performance improvements to be decoupled from analysis development. One example is FuncADL, inspired by functional programming and developed using Python as a host language. FuncADL borrows concepts from database query languages to isolate the interface from the underlying physical and logical schemas. The same query can be used to select data from different sources and formats and with different execution mechanisms. FuncADL is one of the tools being developed by IRIS-HEP for highly scalable physics analysis for the LHC and HL-LHC. FuncADL is demonstrated by implementing example analysis tasks designed by HSF and IRIS-HEP. Another language example is ADL, which expresses the physics content of an analysis in a standard and unambiguous way, independent of computing frameworks. In ADL, analyses are described in human-readable text files composed of blocks with a keyword-expression structure. Two infrastructures are available to render ADL executable: CutLang, a runtime interpreter written in C++; and adl2tnm, a transpiler converting ADL into C++ or Python code. ADL/CutLang are already used in several physics studies and educational projects, and are adapted for use with LHC Open Data.
  5. R is the preferred language for data analytics due to its open source development and high extensibility. Exponential growth in data has caused longer processing times, leading to the rise of parallel computing technologies for analysis. Using R together with high performance computing resources is a cumbersome task. This paper proposes a framework that provides users with access to high-performance computing resources and simplifies configuration, programming, data upload and job scheduling through a web user interface. In addition, it provides two modes of parallelization of data-intensive computing tasks, catering to a wide range of users. The case studies emphasize the utility and efficiency of the framework. The framework provides better performance, ease of use and high scalability.