NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Log Discovery for Troubleshooting Open Distributed Systems with TLQ

https://doi.org/10.1145/3311790.3396633

Kremer-Herman, Nathaniel; Thain, Douglas (July 2020, Practice and Experience of Advanced Research Computing (PEARC))
null (Ed.)
Troubleshooting a distributed system can be incredibly difficult. It is rarely feasible to expect a user to know the fine-grained interactions between their system and the environment configuration of each machine used in the system. Because of this, work can grind to a halt when a seemingly trivial detail changes. To address this, there is a plethora of state-of-the-art log analysis tools, debuggers, and visualization suites. However, a user may be executing in an open distributed system where the placement of their components are not known before runtime. This makes the process of tracking debug logs almost as difficult as troubleshooting the failures these logs have recorded because the location of those logs is usually not transparent to the user (and by association the troubleshooting tools they are using). We present TLQ, a framework designed from first principles for log discovery to enable troubleshooting of open distributed systems. TLQ consists of a querying client and a set of servers which track relevant debug logs spread across an open distributed system. Through a series of examples, we demonstrate how TLQ enables users to discover the locations of their system’s debug logs and in turn use well-defined troubleshooting tools upon those logs in a distributed fashion. Both of these tasks were previously impractical to ask of an open distributed system without significant a priori knowledge. We also concretely verify TLQ’s effectiveness by way of a production system: a biodiversity scientific workflow. We note the potential storage and performance overheads of TLQ compared to a centralized, closed system approach.
more » « less
Full Text Available
Autoscaling High-Throughput Workloads on Container Orchestrators

https://doi.org/10.1109/CLUSTER49012.2020.00024

Zheng, Chao; Kremer-Herman, Nathaniel; Shaffer, Tim; Thain, Douglas (September 2020, IEEE Conference on Cluster Computing)
null (Ed.)
High-throughput computing (HTC) workloads seek to complete as many jobs as possible over a long period of time. Such workloads require efficient execution of many parallel jobs and can occupy a large number of resources for a longtime. As a result, full utilization is the normal state of an HTC facility. The widespread use of container orchestrators eases the deployment of HTC frameworks across different platforms,which also provides an opportunity to scale up HTC workloads with almost infinite resources on the public cloud. However, the autoscaling mechanisms of container orchestrators are primarily designed to support latency-sensitive microservices, and result in unexpected behavior when presented with HTC workloads. In this paper, we design a feedback autoscaler, High Throughput Autoscaler (HTA), that leverages the unique characteristics ofthe HTC workload to autoscales the resource pools used by HTC workloads on container orchestrators. HTA takes into account a reference input, the real-time status of the jobs’ queue, as well as two feedback inputs, resource consumption of jobs, and the resource initialization time of the container orchestrator. We implement HTA using the Makeflow workload manager, WorkQueue job scheduler, and the Kubernetes cluster manager. We evaluate its performance on both CPU-bound and IO-bound workloads. The evaluation results show that, by using HTA, we improve resource utilization by 5.6×with a slight increase in execution time (about 15%) for a CPU-bound workload, and shorten the workload execution time by up to 3.65×for an IO-bound workload.
more » « less
Full Text Available
Flexible Partitioning of Scientific Workflwos Using the JX Workflow Language

https://doi.org/10.1145/3332186.3338100

Shaffer, Tim; Kremer-Herman, Nathaniel; Thain, Douglas (January 2019, Practice and Experience of Advanced Research Computing (PEARC))

Full Text Available
A Lightweight Model for Right-Sizing Master-Worker Applications

https://doi.org/10.1109/SC.2018.00042

Kremer-Herman, Nathaniel; Tovar, Benjamin; Thain, Douglas (November 2018, International Conference for High Performance Computing, Networking, Storage and Analysis)

Full Text Available
A First Look at the JX Workflow Language

https://doi.org/10.1109/eScience.2018.00094

Shaffer, Tim; Sweeney, Kyle M.D.; Kremer-Herman, Nathaniel; Thain, Douglas (October 2018, IEEE International Conference on e-Science)

Full Text Available
Automatic Dependency Management for Scientific Applications on Clusters

https://doi.org/10.1109/IC2E.2018.00026

Tovar, Benjamin; Hazekamp, Nicholas; Kremer-Herman, Nathaniel; Thain, Douglas (April 2018, IEEE International Conference on Cloud Engineering (IC2E))

Full Text Available
SHADHO: Massively Scalable Hardware-Aware Distributed Hyperparameter Optimization

https://doi.org/10.1109/WACV.2018.00086

Kinnison, Jeffery; Kremer-Herman, Nathaniel; Thain, Douglas; Scheirer, Walter (March 2018, IEEE Winter Conference on Applications of Computer Vision)

Full Text Available
Combining Static and Dynamic Storage Management for Data Intensive Scientific Workflows

https://doi.org/10.1109/TPDS.2017.2764897

Hazekamp, Nicholas; Kremer-Herman, Nathaniel; Tovar, Benjamin; Meng, Haiyan; Choudhury, Olivia; Emrich, Scott; Thain, Douglas (October 2017, IEEE Transactions on Parallel and Distributed Systems)

Workflow management systems are widely used to express and execute highly parallel applications. For dataintensive workflows, storage can be the constraining resource: the number of tasks running at once must be artificially limited to not overflow the space available in the filesystem. It is all too easy for a user to dispatch a workflow which consumes all available storage and disrupts all system users. To address these issues, we present a three-tiered approach to workflow storage management: (1) A static analysis algorithm which analyzes the storage needs of a workflow before execution, giving a realistic prediction of success or failure. (2) An online storage management algorithm which accounts for the storage needed by future tasks to avoid deadlock at runtime. (3) A task containment system which limits storage consumption of individual tasks, enabling the strong guarantees of the static analysis and dynamic management algorithms. We demonstrate the application of these techniques on three complex workflows.
more » « less
Full Text Available

Search for: All records