Title: The Scalable Systems Laboratory: a Platform for Software Innovation for HEP
The Scalable Systems Laboratory (SSL), part of the IRIS-HEP Software Institute, gives Institute participants and HEP software developers generally a means to transition their R&D from conceptual toys to testbeds to production-scale prototypes. The SSL supplies the tooling, infrastructure, and services that support innovation of novel analysis and data architectures, development of software elements and tool-chains, reproducible functional and scalability testing of service components, and foundational systems R&D for accelerated services developed by the Institute. The SSL is built around a core team with expertise in scale testing and in deploying services across a wide range of cyberinfrastructure. The core team embeds with, and partners with, other areas of the Institute, and with LHC and other HEP development and operations teams as appropriate, to define investigations and the required service deployment patterns. We describe the approach and our experiences with early application deployments, including analysis platforms and intelligent data delivery systems.
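A minimal sketch of the kind of reproducible scalability test the SSL supports, written in Python against a placeholder endpoint. The URL, concurrency steps, and request counts are illustrative assumptions, not details from the paper; the point is the pattern of ramping load and recording latency at each step.

```python
# Hedged sketch: ramp request concurrency against a service endpoint and
# record latency at each step. SERVICE_URL and the step sizes are
# placeholders, not values from the paper.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVICE_URL = "https://example.org/healthz"  # hypothetical endpoint

def one_request(url: str) -> float:
    """Issue a single GET and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return time.perf_counter() - start

def scale_step(url: str, concurrency: int, requests_per_worker: int = 10):
    """Drive the service at a fixed concurrency; return (median, worst) latency."""
    total = concurrency * requests_per_worker
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(url), range(total)))
    return statistics.median(latencies), max(latencies)

if __name__ == "__main__":
    # Ramp concurrency geometrically and watch where latency degrades.
    for concurrency in (1, 4, 16, 64):
        median, worst = scale_step(SERVICE_URL, concurrency)
        print(f"{concurrency:>3} workers: median={median:.3f}s worst={worst:.3f}s")
```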
Award ID(s):
1836650
PAR ID:
10257007
Editor(s):
Doglioni, C.; Kim, D.; Stewart, G.A.; Silvestris, L.; Jackson, P.; Kamleh, W.
Date Published:
Journal Name:
EPJ Web of Conferences
Volume:
245
ISSN:
2100-014X
Page Range / eLocation ID:
05019
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    One of the most costly factors in providing a global computing infrastructure such as the WLCG is the human effort spent deploying, integrating, and operating the distributed services that support collaborative computing, data sharing and delivery, and analysis of extreme-scale datasets. Furthermore, the time required to roll out global software updates, introduce new service components, or prototype novel systems requiring coordinated deployments across multiple facilities is often lengthened by communication latencies, limited staff availability, and, in many cases, the specialized expertise needed to operate bespoke services. While the WLCG (and the distributed systems implemented throughout HEP) is a global service platform, it lacks the capability and flexibility of a modern platform-as-a-service, including continuous integration/continuous delivery (CI/CD) methods, development-operations capabilities (DevOps, where developers assume a more direct role in the actual production infrastructure), and automation. Most importantly, tooling that reduces the required training, bespoke-service expertise, and operational effort throughout the infrastructure, most notably at the resource endpoints (sites), is entirely absent from the current model. In this paper, we explore ideas and questions around potential NoOps models in this context: what is realistic given organizational policies and constraints? How should operational responsibility be organized across teams and facilities? What are the technical gaps? What are the social and cybersecurity challenges? Conversely, what advantages does a NoOps model deliver for innovation and for accelerating the pace of delivery of new services needed for the HL-LHC era? We describe initial work along these lines in the context of providing a data delivery network supporting IRIS-HEP DOMA R&D.
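The automation gap described above is often closed with a declarative, GitOps-style reconciliation loop: no changes are applied by hand, and an agent continuously converges each endpoint toward a versioned desired state. The Python sketch below shows that loop in miniature; the service names, image tags, and in-memory state functions are invented for illustration (a real agent would read desired state from a git repository and observed state from an orchestrator such as Kubernetes).

```python
# Hedged sketch of a declarative reconciliation loop. All names and values
# are illustrative; real desired state would come from version control and
# real observed state from the site's orchestrator.
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceSpec:
    name: str
    image: str      # e.g. a container image tag pinned in a git repo
    replicas: int

def desired_state() -> dict[str, ServiceSpec]:
    """Stand-in for reading the pinned deployment manifests from git."""
    return {
        "cache": ServiceSpec("cache", "registry/xcache:1.2", replicas=4),
        "proxy": ServiceSpec("proxy", "registry/proxy:0.9", replicas=2),
    }

def observed_state() -> dict[str, ServiceSpec]:
    """Stand-in for querying what is actually running at the site."""
    return {
        "cache": ServiceSpec("cache", "registry/xcache:1.1", replicas=4),
    }

def reconcile() -> list[str]:
    """Compute the actions needed to converge observed state onto desired state."""
    desired, observed = desired_state(), observed_state()
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(f"deploy {name} ({spec.image} x{spec.replicas})")
        elif observed[name] != spec:
            actions.append(f"update {name} -> {spec.image} x{spec.replicas}")
    for name in observed.keys() - desired.keys():
        actions.append(f"remove {name}")
    return actions

if __name__ == "__main__":
    for action in reconcile():
        print(action)   # e.g. "update cache -> registry/xcache:1.2 x4"
```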
  2. Szumlak, T; Rachwał, B; Dziurda, A; Schulz, M; vom Bruch, D; Ellis, K; Hageboeck, S (Ed.)
    The IRIS-HEP software institute, as a contributor to the broader HEP Python ecosystem, is developing scalable analysis infrastructure and software tools to address the upcoming HL-LHC computing challenges with new approaches and paradigms, driven by our vision of what HL-LHC analysis will require. The institute uses a “Grand Challenge” format, constructing a series of increasingly large, complex, and realistic exercises that demonstrate this vision of HL-LHC analysis. Recently the focus has been on demonstrating the IRIS-HEP analysis infrastructure at scale and on evaluating technology readiness for production. As part of the Analysis Grand Challenge activities, the institute executed a “200 Gbps Challenge”, aiming to show sustained data rates into the event processing of multiple analysis pipelines. The challenge integrated teams internal and external to the institute, spanning operations and facilities, analysis software tools, innovative data delivery and management services, and scalable analysis infrastructure. It showcased the prototypes (software, services, and facilities) built to process around 200 TB of data in both the CMS NanoAOD and ATLAS PHYSLITE data formats with test pipelines. The teams were able to sustain the 200 Gbps target across multiple pipelines, and the pipelines focusing on event rate processed at over 30 MHz. These target rates are demanding; the activity revealed considerations for future testing at this scale and the changes physicists will need in order to work at such scales. The 200 Gbps Challenge has established a baseline on today’s facilities, setting the stage for the next exercise at twice the scale.
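The challenge’s headline figures admit a quick sanity check: 200 Gbps is 25 GB/s, so 200 TB can be streamed in roughly 8000 seconds, and a 30 MHz aggregate event rate leaves a budget of about 33 ns per event across the parallel pipelines. The snippet below just works that arithmetic through, using only the numbers quoted in the abstract:

```python
# Back-of-envelope arithmetic on the 200 Gbps Challenge targets.
DATA_TB = 200          # data volume quoted in the abstract
RATE_GBPS = 200        # sustained network target
EVENT_RATE_HZ = 30e6   # aggregate event-processing rate

bytes_total = DATA_TB * 1e12
bytes_per_second = RATE_GBPS * 1e9 / 8   # 200 Gbps = 25 GB/s

seconds = bytes_total / bytes_per_second
print(f"200 TB at 200 Gbps: {seconds:.0f} s (~{seconds / 3600:.1f} h)")
# -> 8000 s, roughly 2.2 hours of sustained transfer

print(f"per-event budget at 30 MHz: {1e9 / EVENT_RATE_HZ:.1f} ns/event")
# -> ~33 ns per event, amortized across the whole parallel pipeline
```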
  3. Szumlak, T; Rachwał, B; Dziurda, A; Schulz, M; vom Bruch, D; Ellis, K; Hageboeck, S (Ed.)
    This study explores enhancements in analysis speed, WAN bandwidth efficiency, and data storage management through an innovative data access strategy. The proposed model introduces specialized ‘delivery’ services for data preprocessing, which perform filtering and reformatting tasks on dedicated hardware located alongside the data repositories at CERN’s Tier-0, Tier-1, or Tier-2 facilities. Positioned near the source storage, these services limit redundant data transfers by sending only the data an analysis actually needs to distant analysis sites, optimizing network and storage use at those sites. Within the scope of the NSF-funded FABRIC Across Borders (FAB) initiative, we assess this model using an “in-network, edge” computing cluster at CERN, outfitted with substantial processing capabilities (CPU, GPU, and advanced network interfaces). This edge computing cluster has dedicated network peering arrangements linking CERN Tier-0, the FABRIC experimental network, and an analysis center at the University of Chicago, creating a solid foundation for our research. Central to our infrastructure is ServiceX, an R&D software project under the Data Organization, Management, and Access (DOMA) group of the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP). ServiceX is a scalable filtering and reformatting service, designed to operate within a Kubernetes environment and deliver output to an S3 object store at an analysis facility. Our study assesses the impact of server-side delivery services in augmenting the existing HEP computing model, particularly their possible integration within the broader WAN infrastructure. This model could empower Tier-1 and Tier-2 centers to become efficient data distribution nodes, offering a more cost-effective way to disseminate data to analysis sites and object stores and thereby improving data access. This research is experimental and serves as a demonstrator of the capabilities and improvements such integrated computing models could offer in the HL-LHC era.
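Rough numbers make the appeal of this filter-at-the-source model clear. In the sketch below, the dataset size and the event- and column-selection fractions are assumptions invented for illustration (not measurements from the FAB or ServiceX work); they show how two modest selection factors compound into a large WAN reduction:

```python
# Hedged estimate of WAN savings from server-side filtering/reformatting.
# All three inputs are illustrative assumptions, not measured values.
dataset_tb = 100.0        # hypothetical dataset stored at a Tier-0/1/2 site
event_selection = 0.10    # fraction of events passing the server-side filter
column_selection = 0.05   # fraction of branches the analysis actually reads

naive_wan_tb = dataset_tb   # ship everything, filter at the analysis site
delivered_wan_tb = dataset_tb * event_selection * column_selection

print(f"naive transfer:   {naive_wan_tb:.1f} TB over the WAN")
print(f"delivered subset: {delivered_wan_tb:.2f} TB over the WAN")
print(f"reduction:        {naive_wan_tb / delivered_wan_tb:.0f}x")
# -> 0.50 TB instead of 100 TB, a 200x reduction under these assumptions
```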
  4. To ensure high software quality for large-scale industrial software systems, traditional approaches to software quality assurance, such as software testing and performance engineering, have been widely used within Alibaba, the world's largest retailer and one of the largest Internet companies. However, there remains a strong demand for software quality assessment to sustain business growth and the engineering culture at Alibaba. To address this need, we developed an industrial solution for software quality assessment that follows the GQM paradigm in an industrial setting. Moreover, we integrate multiple assessment methods into our solution, ranging from metric selection to rating aggregation. Our solution has been implemented, deployed, and adopted at Alibaba: (1) it is used by Alibaba's Business Platform Unit to continually monitor the quality of 60+ core software systems; (2) it is used by Alibaba's R&D Efficiency Unit to support group-wide quality-aware code search and automatic code inspection. This paper presents our proposed industrial solution, including its techniques and industrial adoption, along with the lessons learned during its development and deployment.
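The "metric selection to rating aggregation" pipeline mentioned above has a simple GQM-shaped core: normalize each raw metric onto a common scale, then combine the normalized scores with per-metric weights into a single rating. The Python sketch below is a minimal illustration of that aggregation step; the metrics, thresholds, and weights are invented, not Alibaba's:

```python
# Hedged sketch of weighted metric aggregation in the GQM style.
# Metric names, thresholds, and weights are invented for illustration.
def normalize(value: float, worst: float, best: float) -> float:
    """Map a raw metric onto [0, 1], clamped; works when lower is better too."""
    score = (value - worst) / (best - worst)
    return min(1.0, max(0.0, score))

# metric name -> (observed value, worst threshold, best threshold, weight)
metrics = {
    "test_coverage_pct":   (78.0, 40.0, 90.0, 0.4),
    "defect_density_kloc": (0.8, 5.0, 0.1, 0.3),   # lower is better
    "build_success_pct":   (96.0, 80.0, 99.5, 0.3),
}

rating = sum(
    weight * normalize(value, worst, best)
    for value, worst, best, weight in metrics.values()
)
print(f"aggregated quality rating: {rating:.2f} / 1.00")   # ~0.81 here
```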
  5. Regression testing is an important but expensive activity in software development. Among the various types of tests, web service tests are usually among the most expensive (due to network communication) yet are widely adopted in commercial software development. Regression test selection (RTS) aims to reduce the number of tests to rerun by running only those affected by code changes. Although a large number of RTS techniques have been proposed over the past few decades, these techniques have not been adopted for large-scale web service testing. This is because most existing RTS techniques either require a direct code dependency between the tests and the code under test or cannot be applied to large-scale systems efficiently enough. In this paper, we present a novel RTS technique, TestSage, that performs RTS for web service tests on large-scale commercial software. With small overhead, TestSage collects fine-grained (function-level) dependencies between tests and the service under test, even when the two do not directly depend on each other. TestSage has also been successfully applied to large, complex systems with over a million functions. We conducted experiments with TestSage on a large-scale backend service at Google. Experimental results show that TestSage reduces testing time by 34% when running all AEC (Analysis, Execution, and Collection) phases, and by 50% when running without the collection phase. TestSage has been integrated with the internal testing framework at Google and runs daily at the company.
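The selection step at the core of dependency-based RTS like this is compact: record which functions each test exercised on its previous run, then rerun exactly the tests whose recorded set intersects the changed functions. The sketch below is a schematic illustration with hand-written dependency data, not TestSage's implementation (which collects function-level dependencies dynamically, across services that do not directly depend on each other's code):

```python
# Hedged sketch of function-level regression test selection.
# The dependency map and changed set are hand-written examples.
def select_tests(deps: dict[str, set[str]], changed: set[str]) -> list[str]:
    """Return tests whose exercised functions intersect the changed set.

    Tests with no recorded dependencies are rerun conservatively.
    """
    return sorted(
        test for test, funcs in deps.items()
        if not funcs or funcs & changed
    )

deps = {
    "test_checkout": {"cart.total", "payments.charge"},
    "test_search":   {"index.query", "ranker.score"},
    "test_login":    {"auth.verify"},
}
changed = {"payments.charge", "ranker.score"}

print(select_tests(deps, changed))   # ['test_checkout', 'test_search']
```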