NSF PAR Search | NSF Public Access Repository


Autoscaling High-Throughput Workloads on Container Orchestrators

https://doi.org/10.1109/CLUSTER49012.2020.00024

Zheng, Chao; Kremer-Herman, Nathaniel; Shaffer, Tim; Thain, Douglas (September 2020, IEEE Conference on Cluster Computing)
High-throughput computing (HTC) workloads seek to complete as many jobs as possible over a long period of time. Such workloads require efficient execution of many parallel jobs and can occupy a large number of resources for a long time. As a result, full utilization is the normal state of an HTC facility. The widespread use of container orchestrators eases the deployment of HTC frameworks across different platforms, which also provides an opportunity to scale up HTC workloads with almost infinite resources on the public cloud. However, the autoscaling mechanisms of container orchestrators are primarily designed to support latency-sensitive microservices, and result in unexpected behavior when presented with HTC workloads. In this paper, we design a feedback autoscaler, High Throughput Autoscaler (HTA), that leverages the unique characteristics of the HTC workload to autoscale the resource pools used by HTC workloads on container orchestrators. HTA takes into account a reference input, the real-time status of the job queue, as well as two feedback inputs: the resource consumption of jobs and the resource initialization time of the container orchestrator. We implement HTA using the Makeflow workload manager, WorkQueue job scheduler, and the Kubernetes cluster manager. We evaluate its performance on both CPU-bound and IO-bound workloads. The evaluation results show that, by using HTA, we improve resource utilization by 5.6× with a slight increase in execution time (about 15%) for a CPU-bound workload, and shorten the workload execution time by up to 3.65× for an IO-bound workload.
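As a rough illustration of how the three inputs named in this abstract could feed a sizing decision, the sketch below combines queue status, per-job resource consumption, and worker initialization time. It is not the HTA algorithm from the paper: the function name `desired_workers`, all of its parameters, and the sizing rule itself are assumptions made for illustration only.

```python
import math


def desired_workers(waiting_jobs, running_jobs, cores_per_job,
                    cores_per_worker, avg_job_seconds, init_seconds,
                    current_workers, max_workers):
    """Hypothetical sizing rule (an assumption, not the published HTA logic).

    Reference input: waiting_jobs and running_jobs (queue status).
    Feedback inputs: cores_per_job (measured job consumption) and
    init_seconds (measured time for a new worker to become usable).
    """
    if avg_job_seconds <= 0:
        raise ValueError("avg_job_seconds must be positive")

    # How many jobs fit on one worker, given measured job consumption.
    slots_per_worker = max(1, int(cores_per_worker // cores_per_job))
    current_slots = current_workers * slots_per_worker

    # Jobs the existing slots can drain from the queue while a new
    # worker is still starting; adding capacity for these is wasted.
    drained = int(current_slots * (init_seconds / avg_job_seconds))
    backlog = max(0, waiting_jobs - drained)

    # Size the pool for running jobs plus the backlog that remains.
    target = math.ceil((running_jobs + backlog) / slots_per_worker)
    return min(max_workers, target)


# Example call with made-up numbers.
print(desired_workers(waiting_jobs=100, running_jobs=8, cores_per_job=1,
                      cores_per_worker=4, avg_job_seconds=60,
                      init_seconds=30, current_workers=2, max_workers=50))
```

The point of the sketch is only that slow worker initialization reduces the benefit of scaling up for short jobs, so the backlog is discounted by what the existing pool can finish in the meantime.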
Full Text Available
Log Discovery for Troubleshooting Open Distributed Systems with TLQ

https://doi.org/10.1145/3311790.3396633

Kremer-Herman, Nathaniel; Thain, Douglas (July 2020, Practice and Experience of Advanced Research Computing (PEARC))
Troubleshooting a distributed system can be incredibly difficult. It is rarely feasible to expect a user to know the fine-grained interactions between their system and the environment configuration of each machine used in the system. Because of this, work can grind to a halt when a seemingly trivial detail changes. To address this, there is a plethora of state-of-the-art log analysis tools, debuggers, and visualization suites. However, a user may be executing in an open distributed system where the placement of their components is not known before runtime. This makes the process of tracking debug logs almost as difficult as troubleshooting the failures these logs have recorded, because the location of those logs is usually not transparent to the user (and by association the troubleshooting tools they are using). We present TLQ, a framework designed from first principles for log discovery to enable troubleshooting of open distributed systems. TLQ consists of a querying client and a set of servers which track relevant debug logs spread across an open distributed system. Through a series of examples, we demonstrate how TLQ enables users to discover the locations of their system’s debug logs and in turn use well-defined troubleshooting tools upon those logs in a distributed fashion. Both of these tasks were previously impractical to ask of an open distributed system without significant a priori knowledge. We also concretely verify TLQ’s effectiveness by way of a production system: a biodiversity scientific workflow. We note the potential storage and performance overheads of TLQ compared to a centralized, closed system approach.
Full Text Available
Solving the Container Explosion Problem for Distributed High Throughput Computing

https://doi.org/10.1109/IPDPS47924.2020.00048

Shaffer, Tim; Hazekamp, Nicholas; Blomer, Jakob; Thain, Douglas (May 2020, International Parallel and Distributed Processing Symposium)
Container technologies are seeing wider use at advanced computing facilities for managing highly complex applications that must execute at multiple sites. However, in a distributed high throughput computing setting, the unrestricted use of containers can result in the container explosion problem. If a new container image is generated for each variation of a job dispatched to a site, shared storage is soon exceeded. On the other hand, if a single large container image is used to meet multiple needs, the size of that container may become a problem for storage and transport. To address this problem, we observe that many containers have an internal structure generated by a structured package manager, and this information could be used to strategically combine and share container images. We develop LANDLORD to exploit this property and evaluate its performance through a combination of simulation studies and empirical measurement of high energy physics applications.
Full Text Available
Dynamic Sizing of Continuously Divisible Jobs for Heterogeneous Resources

https://doi.org/10.1109/eScience.2019.00026

Hazekamp, Nicholas; Tovar, Benjamin; Thain, Douglas (September 2019, IEEE International Conference on e-Science)
Many scientific applications operate on large datasets that can be partitioned and operated on concurrently. The existing approaches for concurrent execution generally rely on statically partitioned data. This static partitioning can lock performance in a sub-optimal configuration, leading to higher execution time and an inability to respond to dynamic resources. We present the Continuously Divisible Job abstraction which allows statically defined applications to have their component tasks dynamically sized responding to system behaviour. The Continuously Divisible Job abstraction defines a simple interface that dictates how work can be recursively divided, executed, and merged. Implementing this abstraction allows scientific applications to leverage dynamic job coordinators for execution. We also propose the Virtual File abstraction which allows read-only subsets of large files to be treated as separate files. In exploring the Continuously Divisible Job abstraction, two applications were implemented using the Continuously Divisible Job interface: a bioinformatics application and a high-energy physics event analysis. These were tested using an abstract job interface and several job coordinators. Comparing these against a previous static partitioning implementation we show comparable or better performance without having to make static decisions or implement complex dynamic application handling.
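The divide, execute, and merge interface described in this abstract might look roughly like the sketch below. The class names `DivisibleJob` and `SumJob`, the method names, and the `run` coordinator are all hypothetical, chosen for illustration; the paper's actual interface may differ, and a real coordinator would dispatch pieces to remote resources rather than recurse locally.

```python
from abc import ABC, abstractmethod


class DivisibleJob(ABC):
    """Hypothetical interface: work that can be split, run, and recombined."""

    @abstractmethod
    def size(self):
        """Return an estimate of how much work this job holds."""

    @abstractmethod
    def divide(self):
        """Return two smaller jobs that together cover this job."""

    @abstractmethod
    def execute(self):
        """Run this job as a single unit and return its result."""

    @abstractmethod
    def merge(self, left, right):
        """Combine the results of two sub-jobs."""


class SumJob(DivisibleJob):
    """Toy job: sum a list of numbers."""

    def __init__(self, data):
        self.data = list(data)

    def size(self):
        return len(self.data)

    def divide(self):
        mid = len(self.data) // 2
        return SumJob(self.data[:mid]), SumJob(self.data[mid:])

    def execute(self):
        return sum(self.data)

    def merge(self, left, right):
        return left + right


def run(job, max_size):
    """Toy coordinator: split recursively until pieces fit max_size."""
    # Jobs with fewer than two items cannot be divided further.
    if job.size() <= max_size or job.size() < 2:
        return job.execute()
    left, right = job.divide()
    return job.merge(run(left, max_size), run(right, max_size))


print(run(SumJob(range(10)), max_size=3))
```

Here `max_size` stands in for whatever the coordinator learns about available resources at run time; changing it alters the partitioning without touching the application code.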
Full Text Available
Flexible Partitioning of Scientific Workflows Using the JX Workflow Language

https://doi.org/10.1145/3332186.3338100

Shaffer, Tim; Kremer-Herman, Nathaniel; Thain, Douglas (January 2019, Practice and Experience of Advanced Research Computing (PEARC))

Full Text Available
A Lightweight Model for Right-Sizing Master-Worker Applications

https://doi.org/10.1109/SC.2018.00042

Kremer-Herman, Nathaniel; Tovar, Benjamin; Thain, Douglas (November 2018, International Conference for High Performance Computing, Networking, Storage and Analysis)

Full Text Available
A First Look at the JX Workflow Language

https://doi.org/10.1109/eScience.2018.00094

Shaffer, Tim; Sweeney, Kyle M.D.; Kremer-Herman, Nathaniel; Thain, Douglas (October 2018, IEEE International Conference on e-Science)

Full Text Available
An Algebra for Robust Workflow Transformations

https://doi.org/10.1109/eScience.2018.00031

Hazekamp, Nicholas; Thain, Douglas (October 2018, IEEE International Conference on e-Science)

Full Text Available
Reproducibility in Scientific Computing

https://doi.org/10.1145/3186266

Ivie, Peter; Thain, Douglas (July 2018, ACM Computing Surveys)

Full Text Available
Early Experience Using Amazon Batch for Scientific Workflows

https://doi.org/10.1145/3217880.3217885

Sweeney, Kyle M.; Thain, Douglas (June 2018, ScienceCloud Workshop at HPDC)

Full Text Available
