

Title: HTCondor data movement at 100 Gbps
HTCondor is a major workload management system used in distributed high throughput computing (dHTC) environments, e.g., the Open Science Grid. One of the distinguishing features of HTCondor is its native support for data movement, allowing it to operate without a shared filesystem. Coupling data handling and compute scheduling is both convenient for users and allows for significant infrastructure flexibility, but it does introduce some limitations. The default HTCondor data transfer mechanism routes both the input and output data through the submission node, making it a potential bottleneck. In this document we show that, by using a node equipped with a 100 Gbps network interface card (NIC), HTCondor can serve data at up to 90 Gbps, which is sufficient for most current use cases, as it would saturate the border network links of most research universities at the time of writing.
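To make the mechanism above concrete, here is a minimal sketch of a file-transfer job using the htcondor Python bindings (our illustration, not from the paper; the executable and file names are placeholders):

```python
import htcondor  # HTCondor Python bindings

# Minimal job description: HTCondor itself stages the input file to the
# execute node and brings outputs back, so no shared filesystem is needed.
# All file names here are placeholders.
job = htcondor.Submit({
    "executable": "process.sh",
    "transfer_input_files": "input.dat",   # staged through the submission node
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",  # outputs routed back the same way
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()
schedd.submit(job, count=1)  # queue one instance of the job
```

Because both directions of this traffic pass through the submission node, that node's NIC is the natural place to measure the 90 Gbps figure reported above.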
Award ID(s): 2030508
NSF-PAR ID: 10357989
Author(s) / Creator(s): ; ; ;
Date Published:
Journal Name: 2021 IEEE 17th International Conference on eScience (eScience)
Issue: September 2021
Page Range / eLocation ID: 239 to 240
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Ensuring high availability (HA) for software-based networks is a critical design feature that will help the adoption of software-based network functions (NFs) in production networks. It is important for NFs to avoid outages and maintain mission-critical operations. However, HA support for NFs on the critical data path can result in unacceptable performance degradation. We present REINFORCE, an integrated framework to support efficient resiliency for NFs and NF service chains. REINFORCE includes timely failure detection and consistent failover mechanisms. REINFORCE replicates state to standby NFs (local and remote) while enforcing correctness. It minimizes the number of state transfers by exploiting the concept of external synchrony, and leverages opportunistic batching and multi-buffering to optimize performance. Experimental results show that, even at line-rate packet processing (10 Gbps), REINFORCE achieves chain-level failover across servers in a LAN (or within the same node) within 10 ms (100 μs), incurring less than 10% (1%) performance overhead, and adds an average latency of only ~400 μs (5 μs), with a worst-case latency of less than 1 ms (10 μs).
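As a rough illustration of the external-synchrony idea described above (our sketch, not REINFORCE code): state updates are buffered and replicated to the standby in one batch just before the corresponding outputs become externally visible.

```python
class Standby:
    """Stand-in for a local or remote standby NF instance."""
    def __init__(self):
        self.state = []

    def apply(self, updates):
        self.state.extend(updates)  # catch up in a single batched transfer


class ExternalSyncReplicator:
    """Toy model of external synchrony: state changes are buffered and only
    replicated to the standby before output becomes externally visible."""
    def __init__(self, standby):
        self.standby = standby
        self.pending_updates = []   # state deltas not yet replicated
        self.held_packets = []      # outputs held until state is durable

    def on_packet(self, pkt, state_delta):
        # Internal processing: record the delta and hold the output packet.
        self.pending_updates.append(state_delta)
        self.held_packets.append(pkt)

    def release_outputs(self):
        # One batched state transfer covers many packets (opportunistic
        # batching); only then may the held packets leave the system.
        self.standby.apply(self.pending_updates)
        self.pending_updates = []
        released, self.held_packets = self.held_packets, []
        return released
```

The point of the sketch is the ordering: replication happens once per output batch, not once per packet.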
  2. State machine replication (SMR) is a core mechanism for building highly available and consistent systems. In this paper, we propose Waverunner, a new approach to accelerating SMR using FPGA-based SmartNICs. Our approach does not implement the entire SMR system in hardware; instead, it is a hybrid software/hardware system. We make the observation that, despite the complexity of SMR, the most common routine, data replication, is actually simple. The complex parts (leader election, failure recovery, etc.) are rarely used in modern datacenters, where failures are only occasional. These complex routines are not performance critical; their software implementations are fast enough and do not need acceleration. Therefore, our system uses FPGA assistance to accelerate data replication and leaves the rest to the traditional software implementation of SMR. Our Waverunner approach is beneficial in both the common and the rare case. In the common case, the system runs at the speed of the network, with a 99th percentile latency of 1.8 µs achieved without batching on minimum-size packets at network line rate (85.5 Gbps in our evaluation). In rare cases, to handle uncommon situations such as leader failure and failure recovery, the system uses traditional software to guarantee correctness, which is much easier to develop and maintain than hardware-based implementations. Overall, our experience confirms Waverunner as an effective and practical solution for hardware-accelerated SMR, achieving most of the benefits of hardware acceleration with minimal added complexity and implementation effort.
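A schematic sketch of the hybrid split described above (ours; Waverunner's actual fast path is an FPGA, modeled here by a stub):

```python
from enum import Enum, auto

class Mode(Enum):
    FAST = auto()   # stable leader: replication offloaded (FPGA in Waverunner)
    SLOW = auto()   # election / recovery: handled entirely in software

class FastPathStub:
    """Stand-in for the hardware offload: simply acknowledges appends."""
    def append_and_ack(self, entry):
        return ("ack", entry)

class SlowPathStub:
    """Stand-in for the traditional software SMR implementation."""
    def replicate(self, entry):
        return ("ack-sw", entry)

    def run_recovery(self):
        pass  # leader election / log repair would happen here

class HybridSMR:
    """Schematic hybrid node: common-case replication on the fast path,
    rare-case control logic (election, recovery) on the software path."""
    def __init__(self, fast_path, slow_path):
        self.fast_path = fast_path
        self.slow_path = slow_path
        self.mode = Mode.FAST

    def replicate(self, entry):
        if self.mode is Mode.FAST:
            return self.fast_path.append_and_ack(entry)  # hot loop, no complex logic
        return self.slow_path.replicate(entry)           # rare, correctness-first path

    def on_failure_detected(self):
        self.mode = Mode.SLOW         # drop to software on leader failure
        self.slow_path.run_recovery()
        self.mode = Mode.FAST         # resume offloaded replication
```

The division of labor mirrors the abstract: the hot path is kept trivially simple, and everything complex stays in easily maintained software.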
  3. With the ever-increasing size of training models and datasets, network communication has emerged as a major bottleneck in distributed deep learning training. To address this challenge, we propose an optical distributed deep learning (ODDL) architecture. ODDL utilizes a fast yet scalable all-optical network architecture to accelerate distributed training. One of the key features of the architecture is its flow-based transmit scheduling with fast reconfiguration. This allows ODDL to dynamically allocate dedicated optical paths for each traffic stream, resulting in low network latency and high network utilization. Additionally, ODDL provides physically isolated and tailored network resources for training tasks by reconfiguring the optical switch using LCoS-WSS technology. The ODDL topology also uses tunable transceivers to adapt to time-varying traffic patterns. To achieve accurate and fine-grained scheduling of optical circuits, we propose an efficient distributed control scheme that incurs minimal delay overhead. Our evaluation on real-world traces showcases ODDL's remarkable performance. When implemented with 1024 nodes and 100 Gbps bandwidth, ODDL accelerates VGG19 training by 1.6× and 1.7× compared to conventional fat-tree electrical networks and photonic SiP-Ring architectures, respectively. We further build a four-node testbed, and our experiments show that ODDL achieves training times comparable to those of an ideal electrical switching network.
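A greedy toy version of flow-based path allocation (ours, not ODDL's actual scheduler; the wavelength-slot model is a simplification):

```python
class OpticalScheduler:
    """Toy flow-based scheduler: grant each traffic stream a dedicated
    optical path, modeled as one wavelength slot per (src, dst) pair."""
    def __init__(self, wavelengths_per_link):
        self.capacity = wavelengths_per_link
        self.in_use = {}  # (src, dst) -> number of active flows

    def request_path(self, src, dst):
        used = self.in_use.get((src, dst), 0)
        if used < self.capacity:
            # "Reconfigure" the switch: dedicate one more slot to this pair.
            self.in_use[(src, dst)] = used + 1
            return True
        return False  # no free slot: the flow must wait

    def release_path(self, src, dst):
        self.in_use[(src, dst)] -= 1  # flow finished; the slot is reusable

# Example: two slots per link means a third concurrent flow must wait.
sched = OpticalScheduler(wavelengths_per_link=2)
assert sched.request_path("node0", "node1")
assert sched.request_path("node0", "node1")
assert not sched.request_path("node0", "node1")
```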
  4. Biscarat, C.; Campana, S.; Hegner, B.; Roiser, S.; Rovelli, C.I.; Stewart, G.A. (Eds.)
    CMS is tackling the exploitation of CPU resources at HPC centers where compute nodes do not have network connectivity to the Internet. Pilot agents and payload jobs need to interact with external services from the compute nodes: access to the application software (CernVM-FS) and conditions data (Frontier), management of input and output data files (data management services), and job management (HTCondor). Finding an alternative route to these services is challenging. Seamless integration into the CMS production system without causing any operational overhead is a key goal. The case of the Barcelona Supercomputing Center (BSC), in Spain, is particularly challenging due to its especially restrictive network setup. In this paper we describe the solutions developed within CMS to overcome these restrictions and to integrate this resource in production. Singularity containers with application software releases are built and pre-placed in the HPC facility's shared file system, together with conditions data files. HTCondor has been extended to relay communications between running pilot jobs and HTCondor daemons through the HPC shared file system. This operation mode also allows piping input and output data files through the HPC file system. Results, issues encountered during the integration process, and remaining concerns are discussed.
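A bare-bones sketch of relaying messages over a shared filesystem (ours; the directory layout and file naming are invented for illustration): the isolated node drops request files, and a gateway with outside connectivity polls for them and writes responses.

```python
import json
import time
import uuid
from pathlib import Path

RELAY_DIR = Path("/shared/relay")  # hypothetical directory on the HPC shared FS

def send_request(payload, timeout=300.0, poll=2.0):
    """Send a message from a network-isolated compute node and wait for a
    gateway process (with outside connectivity) to write the reply."""
    msg_id = uuid.uuid4().hex
    resp = RELAY_DIR / f"{msg_id}.resp"
    # The gateway polls for *.req files, forwards them, and writes *.resp.
    (RELAY_DIR / f"{msg_id}.req").write_text(json.dumps(payload))
    deadline = time.time() + timeout
    while time.time() < deadline:
        if resp.exists():
            return json.loads(resp.read_text())
        time.sleep(poll)  # the shared filesystem is the only channel
    raise TimeoutError(f"no relay response for message {msg_id}")
```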
  5. Causal structure learning (CSL) refers to the estimation of causal graphs from data. Causal versions of tools such as ROC curves play a prominent role in empirical assessment of CSL methods and performance is often compared with “random” baselines (such as the diagonal in an ROC analysis). However, such baselines do not take account of constraints arising from the graph context and hence may represent a “low bar”. In this paper, motivated by examples in systems biology, we focus on assessment of CSL methods for multivariate data where part of the graph structure is known via interventional experiments. For this setting, we put forward a new class of baselines called graph-based predictors (GBPs). In contrast to the “random” baseline, GBPs leverage the known graph structure, exploiting simple graph properties to provide improved baselines against which to compare CSL methods. We discuss GBPs in general and provide a detailed study in the context of transitively closed graphs, introducing two conceptually simple baselines for this setting, the observed in-degree predictor (OIP) and the transitivity-assuming predictor (TAP). While the former is straightforward to compute, for the latter we propose several simulation strategies. Moreover, we study and compare the proposed predictors theoretically, including a result showing that the OIP outperforms in expectation the “random” baseline on a subclass of latent network models featuring positive correlation among edge probabilities. Using both simulated and real biological data, we show that the proposed GBPs outperform random baselines in practice, often substantially. Some GBPs even outperform standard CSL methods (whilst being computationally cheap in practice). Our results provide a new way to assess CSL methods for interventional data.
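For intuition, a small sketch of the observed in-degree predictor (our reading of the one-line description above; names are ours): each candidate edge (u, v) is scored by the in-degree of v among the already-known edges.

```python
def oip_scores(known_edges, nodes):
    """Observed in-degree predictor (sketch): score a candidate edge (u, v)
    by the in-degree of v in the known part of the graph."""
    in_degree = {v: 0 for v in nodes}
    for _, target in known_edges:
        in_degree[target] += 1
    # A node that already receives many edges makes any new incoming
    # edge more plausible a priori, so it gets a higher score.
    return {(u, v): in_degree[v]
            for u in nodes for v in nodes if u != v}

# Tiny example: both known edges point into "c", so candidates into "c"
# score 2 while candidates into "a" or "b" score 0.
scores = oip_scores({("a", "c"), ("b", "c")}, ["a", "b", "c"])
assert scores[("a", "c")] == 2 and scores[("c", "a")] == 0
```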