Title: LIDC: A Location Independent Multi-Cluster Computing Framework for Data Intensive Science
Scientific communities increasingly rely on geographically distributed computing platforms. Current methods of compute placement depend on centralized controllers such as Kubernetes to match tasks with resources, a model that works poorly in multi-organizational collaborations. Workflows also depend on manual configurations tailored to a single platform and cannot adapt to changing infrastructure. This work introduces a decentralized control plane that places computations on distributed clusters using semantic names: computations are assigned semantic names so they can be matched with named Kubernetes service endpoints. This approach has two main benefits. It makes job placement independent of location, so any cluster with sufficient resources can run the computation. It also supports dynamic placement without requiring knowledge of cluster locations or predefined configurations.
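The abstract does not detail how semantic names are resolved. As a rough, hypothetical sketch only (the lidc/semantic-name label key and the idea of encoding names as Service labels are assumptions, not the published LIDC design), the snippet below shows how a computation's semantic name could be matched to named Kubernetes Service endpoints with the official Python client, without the caller ever naming a cluster or location.

```python
# Hypothetical sketch only: the "lidc/semantic-name" label convention is an
# assumption for illustration, not the published LIDC design.
from kubernetes import client, config


def endpoints_for(semantic_name: str):
    """Find Services (in whichever cluster the active context points at)
    that advertise the given semantic name, and return host/port pairs."""
    config.load_kube_config()  # use load_incluster_config() when running in a pod
    v1 = client.CoreV1Api()
    services = v1.list_service_for_all_namespaces(
        label_selector=f"lidc/semantic-name={semantic_name}")
    results = []
    for svc in services.items:
        host = f"{svc.metadata.name}.{svc.metadata.namespace}.svc"
        for port in svc.spec.ports or []:
            results.append((host, port.port))
    return results


if __name__ == "__main__":
    # The caller names the computation ("genome-alignment" is made up),
    # never a cluster or location.
    print(endpoints_for("genome-alignment"))
```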
Award ID(s):
2430341 2126148 2019012
PAR ID:
10647725
Author(s) / Creator(s):
 ;  
Publisher / Repository:
IEEE
Date Published:
Page Range / eLocation ID:
760 to 764
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As distributed applications become increasingly complex, so do their scheduling requirements. This development calls for cluster schedulers that are not only general, but also evolvable. Unfortunately, most existing cluster schedulers are not evolvable: when confronted with new requirements, they need major rewrites to support them. Examples include gang-scheduling support in Kubernetes [6, 39] or task affinity in Spark [39]. Some cluster schedulers [14, 30] address this by exposing physical resources to applications. While these approaches are evolvable, they push the burden of implementing both scheduling mechanisms and policies entirely onto the application. ESCHER is a cluster scheduler design that achieves both evolvability and application-level simplicity. ESCHER uses an abstraction exposed by several recent frameworks (which we call ephemeral resources) that lets the application express scheduling constraints as resource requirements. These requirements are then satisfied by a simple mechanism matching resource demands to available resources. We implement ESCHER on Kubernetes and Ray, and show that this abstraction can be used to express common policies offered by monolithic schedulers while allowing applications to easily create new custom policies hitherto unsupported.
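The ephemeral-resource abstraction is described only at a high level here. As a minimal sketch of the underlying idea (not ESCHER's actual mechanism), the snippet below uses Ray's stock custom resources: a placement constraint is expressed purely as a resource demand and satisfied by ordinary demand-to-supply matching. The rack_A resource name is an assumed, illustrative label.

```python
# Minimal sketch of the general idea (not ESCHER's implementation): a placement
# constraint expressed purely as a custom-resource demand in Ray. Nodes started
# with `ray start --resources='{"rack_A": 1}'` advertise the tag; only they can
# satisfy tasks that request it, so the policy lives in the resource demand.
import ray

ray.init(address="auto")  # assumes an existing Ray cluster is reachable


@ray.remote(resources={"rack_A": 0.001})  # fractional request: used as a tag, not a quota
def stage_in(path: str) -> str:
    # This task is matched only to nodes advertising the rack_A resource.
    return f"staged {path} on a rack_A node"


print(ray.get(stage_in.remote("/data/block-0001")))
```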
  2. Distributed cloud environments running data-intensive applications often slow down because of network congestion, uneven bandwidth, and data shuffling between nodes. Traditional host metrics such as CPU or memory do not capture these factors. Scheduling without considering network conditions causes poor placement, longer data transfers, and weaker job performance. This work presents a network-aware job scheduler that uses supervised learning to predict job completion time. The system collects real-time telemetry from all nodes, uses a trained model to estimate how long a job would take on each node, and ranks nodes to choose the best placement. The scheduler is evaluated on a geo-distributed Kubernetes cluster on the FABRIC testbed using network-intensive Spark workloads. Compared to the default Kubernetes scheduler, which uses only current resource availability, the supervised scheduler shows 34–54% higher accuracy in selecting the optimal node. The contribution is the demonstration of supervised learning for real-time, network-aware job scheduling on a multi-site cluster. 
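As a toy illustration of the ranking step described above (the feature set, model choice, and numbers are assumptions, not the paper's), the sketch below trains a regressor on node telemetry, predicts a completion time for each candidate node, and places the job on the fastest one.

```python
# Toy illustration (features, model, and numbers are assumptions, not the paper's):
# predict a completion time for each candidate node and pick the fastest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Historical samples: [cpu_free_frac, mem_free_gb, bandwidth_mbps, rtt_ms] -> runtime_s
X_train = np.array([[0.8, 32, 900, 5],
                    [0.5, 16, 200, 40],
                    [0.9, 64, 500, 20],
                    [0.3, 8, 100, 80]])
y_train = np.array([120.0, 310.0, 180.0, 550.0])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Live telemetry for the nodes eligible to run a new job.
candidates = {"site-a/node1": [0.7, 24, 850, 8],
              "site-b/node3": [0.9, 48, 150, 60],
              "site-c/node2": [0.6, 16, 600, 25]}
predicted = {node: float(model.predict([feats])[0]) for node, feats in candidates.items()}
print(predicted, "-> best:", min(predicted, key=predicted.get))
```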
  3. De Vita, R; Espinal, X; Laycock, P; Shadura, O (Ed.)
    The OSG-operated Open Science Pool is an HTCondor-based virtual cluster that aggregates resources from compute clusters provided by several organizations. Most of the resources are not owned by OSG, so demand-based dynamic provisioning is important for maximizing usage without incurring excessive waste. OSG has long relied on GlideinWMS for most of its resource provisioning needs but is limited to resources that provide a Grid-compliant Compute Entrypoint. To work around this limitation, the OSG Software Team has developed a glidein container that resource providers could use to directly contribute to the OSPool. The problem with that approach is that it is not demand-driven, relegating it to backfill scenarios only. To address this limitation, a demand-driven direct provisioner of Kubernetes resources has been developed and successfully used on the NRP. The setup still relies on the OSG-maintained backfill container image but automates the provisioning matchmaking and successive requests. That provisioner has also been extended to support Lancium, a green computing cloud provider with a Kubernetes-like proprietary interface. The provisioner logic has been intentionally kept very simple, making this extension a low-cost project. Both NRP and Lancium resources have been provisioned exclusively using this mechanism for many months. 
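The abstract notes the provisioner logic was kept intentionally simple. Purely as an assumed sketch of that shape (the namespace, labels, container image, and demand query are placeholders, not OSG's code), the loop below submits additional glidein pods through the Kubernetes Python client whenever reported idle demand exceeds the pods already running.

```python
# Assumed sketch of a demand-driven provisioning loop (namespace, labels, image,
# and the demand query are placeholders, not OSG's implementation).
import time
from kubernetes import client, config


def idle_demand() -> int:
    # Placeholder: a real provisioner would query the pool's matchmaking /
    # queue statistics for idle jobs that this resource could serve.
    return 3


def provision(namespace: str = "ospool-glideins") -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    while True:
        running = len(core.list_namespaced_pod(
            namespace, label_selector="app=glidein").items)
        for _ in range(max(0, idle_demand() - running)):
            pod = client.V1Pod(
                metadata=client.V1ObjectMeta(generate_name="glidein-",
                                             labels={"app": "glidein"}),
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="glidein",
                        image="example.org/osg-backfill:latest")]))  # placeholder image
            core.create_namespaced_pod(namespace, pod)
        time.sleep(60)
```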
  4. Szumlak, T; Rachwał, B; Dziurda, A; Schulz, M; vom Bruch, D; Ellis, K; Hageboeck, S (Ed.)
    We explore the adoption of cloud-native tools and principles to forge flexible and scalable infrastructures, aimed at supporting analysis frameworks being developed for the ATLAS experiment in the High Luminosity Large Hadron Collider (HL-LHC) era. The project culminated in the creation of a federated platform, integrating Kubernetes clusters from various providers, including Tier-2 centers, Tier-3 centers, and the IRIS-HEP Scalable Systems Laboratory, a National Science Foundation project. A unified interface was provided to streamline the management and scaling of containerized applications. Enhanced system scalability was achieved through integration with analysis facilities, enabling spillover of Jupyter/Binder notebooks and Dask workers to Tier-2 resources. We investigated flexible deployment options for a “stretched” (over the wide area network) cluster pattern, including a centralized “lights out management” model, remote administration of Kubernetes services, and a fully autonomous site-managed cluster approach, to accommodate varied operational and security requirements. The platform demonstrated its efficacy in multi-cluster demonstrators for low-latency analyses and advanced workflows with tools such as Coffea, ServiceX, Uproot and Dask, and RDataFrame, illustrating its ability to support various processing frameworks. The project also resulted in a robust user training infrastructure for ATLAS software and computing onboarding events.
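As a hedged sketch of the spillover pattern mentioned above (the gateway address and worker count are made up, and the use of Dask Gateway specifically is an assumption rather than the project's exact setup), the snippet below shows a notebook requesting extra Dask workers through a gateway backed by federated Kubernetes resources, so work spills beyond the local analysis facility.

```python
# Hedged sketch (gateway address and worker count are placeholders): request
# extra Dask workers through a Dask Gateway backed by federated Kubernetes
# resources, so a notebook spills over beyond its local analysis facility.
import dask.array as da
from dask_gateway import Gateway

gateway = Gateway("https://dask-gateway.example-facility.org")  # placeholder address
cluster = gateway.new_cluster()
cluster.scale(50)  # workers may be scheduled onto remote Tier-2 capacity
client = cluster.get_client()

x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
print(x.mean().compute())
```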
  5.
    Edge and fog computing encompass a variety of technologies that are poised to enable new applications across the Internet that support data capture, storage, processing, and communication across the networking continuum. These environments pose new challenges to the design and implementation of networks, as membership can be dynamic and devices are heterogeneous, widely distributed geographically, and in proximity to end-users, as is the case with mobile and Internet-of-Things (IoT) devices. We present a demonstration of EdgeVPN.io (Evio for short), an open-source, programmable, software-defined network that addresses challenges in the deployment of virtual networks spanning distributed edge and cloud resources, in particular highlighting its use in support of the Kubernetes container orchestration middleware. The demo highlights a deployment of unmodified Kubernetes middleware across a virtual cluster comprising virtual machines deployed both in cloud providers and in distinct networks at the edge, where all nodes are assigned private IP addresses and subject to different NAT (Network Address Translation) middleboxes, connected through an Evio virtual network. The demo includes an overview of the configuration of Kubernetes and Evio nodes and the deployment of Docker-based container pods, highlighting the seamless connectivity for TCP/IP applications deployed on the pods.