NSF PAR Search | NSF Public Access Repository


Landlord: Coordinating Dynamic Software Environments to Reduce Container Sprawl

https://doi.org/10.1109/TPDS.2023.3241598

Shaffer, Tim; Phung, Thanh Son; Chard, Kyle; Thain, Douglas (May 2023, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
An Empirical Study of Package Dependencies and Lifetimes in Binder Python Containers

https://doi.org/10.1109/eScience51609.2021.00032

Shaffer, Tim; Chard, Kyle; Thain, Douglas (September 2021, 2021 IEEE 17th International Conference on eScience (eScience))

Full Text Available
Software Environments in Binder Containers

https://doi.org/10.5281/zenodo.4891790

Shaffer, Tim; Chard, Kyle; Thain, Douglas (January 2021, Zenodo)

Binder is a publicly accessible online service for executing interactive notebooks based on Git repositories. Binder dynamically builds and deploys containers following a recipe stored in the repository, then gives the user a browser-based notebook interface. The Binder group periodically releases a log of container launches from the public Binder service. Archives of launch records are available here. These records do not include identifiable information like IP addresses, but do give the source repo being launched along with some other metadata. The main content of this dataset is in the binder.sqlite file. This SQLite database includes launch records from 2018-11-03 to 2021-06-06 in the events table, which has the following schema.

    CREATE TABLE events(
        version INTEGER,
        timestamp TEXT,
        provider TEXT,
        spec TEXT,
        origin TEXT,
        ref TEXT,
        guessed_ref TEXT
    );
    CREATE INDEX idx_timestamp ON events(timestamp);

- version indicates the version of the record as assigned by Binder. The origin field became available with version 3, and the ref field with version 4. Older records where this information was not recorded will have the corresponding fields set to null.
- timestamp is the ISO timestamp of the launch.
- provider gives the type of source repo being launched ("GitHub" is by far the most common). The rest of the explanations assume GitHub; other providers may differ.
- spec gives the particular branch/release/commit being built. It consists of <github-id>/<repo>/<branch>.
- origin indicates which backend was used. Each has its own storage, compute, etc. so this info might be important for evaluating caching and performance. Note that only recent records include this field. May be null.
- ref specifies the git commit that was actually used, rather than the named branch referenced by spec. Note that this was not recorded from the beginning, so only the more recent entries include it. May be null.
- guessed_ref: For records where ref is not available, we attempted to clone the named reference given by spec rather than the specific commit (see below). The guessed_ref field records the commit found at the time of cloning. If the branch was updated since the container was launched, this will not be the exact version that was used, and instead will refer to whatever was available at the time (early 2021). Depending on the application, this might still be useful information. Selecting only records with version 4 (or non-null ref) will exclude these guessed commits. May be null.

The Binder launch dataset identifies the source repos that were used, but doesn't give any indication of their contents. We crawled GitHub to get the actual specification files in the repos which were fed into repo2docker when preparing the notebook environments, as well as filesystem metadata of the repos. Some repos were deleted/made private at some point, and were thus skipped. This is indicated by the absence of any row for the given commit (or absence of both ref and guessed_ref in the events table). The schema is as follows.

    CREATE TABLE spec_files (
        ref TEXT NOT NULL PRIMARY KEY,
        ls TEXT,
        runtime BLOB,
        apt BLOB,
        conda BLOB,
        pip BLOB,
        pipfile BLOB,
        julia BLOB,
        r BLOB,
        nix BLOB,
        docker BLOB,
        setup BLOB,
        postbuild BLOB,
        start BLOB
    );

Here ref corresponds to ref and/or guessed_ref from the events table. For each repo, we collected spec files into the following fields (see the repo2docker docs for details on what these are). The records in the database are simply the verbatim file contents, with no parsing or further processing performed.

- runtime: runtime.txt
- apt: apt.txt
- conda: environment.yml
- pip: requirements.txt
- pipfile: Pipfile.lock or Pipfile
- julia: Project.toml or REQUIRE
- r: install.R
- nix: default.nix
- docker: Dockerfile
- setup: setup.py
- postbuild: postBuild
- start: start

The ls field gives a metadata listing of the repo contents (excluding the .git directory). This field is JSON encoded with the following structure based on JSON types:

- Object: filesystem directory. Keys are file names within it. Values are the contents, which can be regular files, symlinks, or subdirectories.
- String: symlink. The string value gives the link target.
- Number: regular file. The number value gives the file size in bytes.

    CREATE TABLE clean_specs (
        ref TEXT NOT NULL PRIMARY KEY,
        conda_channels TEXT,
        conda_packages TEXT,
        pip_packages TEXT,
        apt_packages TEXT
    );

The clean_specs table provides parsed and validated specifications for some of the specification files (currently Pip, Conda, and APT packages). Each column gives either a JSON encoded list of package requirements, or null. APT packages have been validated using a regex adapted from the repo2docker source. Pip packages have been parsed and normalized using the Requirement class from the pkg_resources package of setuptools. Conda packages have been parsed and normalized using the conda.models.match_spec.MatchSpec class included with the library form of Conda (distinct from the command line tool). Users might want to use these parsers when working with the package data, as the specifications can become fairly complex.

The missing table gives the repos that were not accessible, and event_logs records which log files have already been added. These tables are used for updating the dataset and should not be of interest to users.
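As a rough illustration of how the tables described above fit together, the following Python sketch queries an in-memory SQLite database built from the events schema and a reduced subset of the spec_files columns. The sample rows, the origin value, the commit identifiers, and the helper names (demo_connection, top_repos, total_file_bytes, launch_sizes) are all hypothetical and are not part of the dataset; with the real data one would presumably open binder.sqlite instead.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE events(
    version INTEGER, timestamp TEXT, provider TEXT, spec TEXT,
    origin TEXT, ref TEXT, guessed_ref TEXT
);
CREATE INDEX idx_timestamp ON events(timestamp);
-- Reduced subset of the spec_files columns, for illustration only.
CREATE TABLE spec_files (ref TEXT NOT NULL PRIMARY KEY, ls TEXT, pip BLOB);
"""

def demo_connection():
    """Build a tiny in-memory stand-in for binder.sqlite (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (4, "2021-01-05T10:00:00", "GitHub", "alice/demo/main", "backend-a", "abc123", None),
            (2, "2019-06-01T09:30:00", "GitHub", "alice/demo/main", None, None, "abc123"),
            (4, "2021-02-01T12:00:00", "GitHub", "bob/tools/v1", "backend-a", "def456", None),
        ],
    )
    listing = {"requirements.txt": 12, "src": {"main.py": 300, "latest": "main.py"}}
    # No spec_files row for def456: the repo is treated as deleted or private.
    conn.execute(
        "INSERT INTO spec_files VALUES (?, ?, ?)",
        ("abc123", json.dumps(listing), b"numpy\n"),
    )
    return conn

def top_repos(conn, limit=5):
    """Count launches per spec for GitHub records, most launched first."""
    return conn.execute(
        "SELECT spec, COUNT(*) AS n FROM events WHERE provider = 'GitHub' "
        "GROUP BY spec ORDER BY n DESC, spec LIMIT ?",
        (limit,),
    ).fetchall()

def total_file_bytes(node):
    """Sum regular-file sizes in an ls listing.

    Object = directory, string = symlink (contributes 0), number = file size.
    """
    if isinstance(node, dict):
        return sum(total_file_bytes(child) for child in node.values())
    if isinstance(node, str):
        return 0
    return node

def launch_sizes(conn):
    """Pair each launch with its repo size, using ref or else guessed_ref."""
    rows = conn.execute(
        "SELECT e.spec, s.ls FROM events e "
        "JOIN spec_files s ON s.ref = COALESCE(e.ref, e.guessed_ref)"
    ).fetchall()
    return [(spec, total_file_bytes(json.loads(ls))) for spec, ls in rows]
```

The COALESCE join follows the description that ref in spec_files corresponds to ref and/or guessed_ref in events; launches whose repos were skipped simply drop out of the inner join.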
Autoscaling High-Throughput Workloads on Container Orchestrators

https://doi.org/10.1109/CLUSTER49012.2020.00024

Zheng, Chao; Kremer-Herman, Nathaniel; Shaffer, Tim; Thain, Douglas (September 2020, IEEE Conference on Cluster Computing)
High-throughput computing (HTC) workloads seek to complete as many jobs as possible over a long period of time. Such workloads require efficient execution of many parallel jobs and can occupy a large number of resources for a long time. As a result, full utilization is the normal state of an HTC facility. The widespread use of container orchestrators eases the deployment of HTC frameworks across different platforms, which also provides an opportunity to scale up HTC workloads with almost infinite resources on the public cloud. However, the autoscaling mechanisms of container orchestrators are primarily designed to support latency-sensitive microservices, and result in unexpected behavior when presented with HTC workloads. In this paper, we design a feedback autoscaler, High Throughput Autoscaler (HTA), that leverages the unique characteristics of the HTC workload to autoscale the resource pools used by HTC workloads on container orchestrators. HTA takes into account a reference input, the real-time status of the job queue, as well as two feedback inputs: the resource consumption of jobs and the resource initialization time of the container orchestrator. We implement HTA using the Makeflow workload manager, WorkQueue job scheduler, and the Kubernetes cluster manager. We evaluate its performance on both CPU-bound and IO-bound workloads. The evaluation results show that, by using HTA, we improve resource utilization by 5.6× with a slight increase in execution time (about 15%) for a CPU-bound workload, and shorten the workload execution time by up to 3.65× for an IO-bound workload.
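The three inputs named in the abstract (queue status, per-job resource consumption, and resource initialization time) can be combined in many ways. The Python sketch below shows one simple possibility and is not HTA's actual control law: the Feedback fields, the desired_workers function, and the rule of skipping a scale-up when the queue would drain before new workers initialize are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Feedback:
    queued_jobs: int        # reference input: jobs waiting in the queue
    running_jobs: int       # jobs currently executing
    cores_per_job: float    # feedback: measured resource consumption per job
    avg_job_seconds: float  # measured average job runtime
    init_seconds: float     # feedback: time the orchestrator needs to add a worker

def desired_workers(fb, current_workers, cores_per_worker, max_workers):
    """Return a target pool size (hypothetical rule, not the published one)."""
    demand_cores = (fb.queued_jobs + fb.running_jobs) * fb.cores_per_job
    needed = math.ceil(demand_cores / cores_per_worker)
    if needed <= current_workers:
        # Shrink (or hold) so the pool matches measured demand.
        return needed
    if current_workers > 0:
        # Estimate how long the existing pool needs to drain the queue.
        capacity = current_workers * cores_per_worker
        drain_seconds = fb.queued_jobs * fb.cores_per_job * fb.avg_job_seconds / capacity
        if drain_seconds <= fb.init_seconds:
            # New workers would arrive after the queue is already empty.
            return current_workers
    return min(needed, max_workers)
```

For example, with 100 queued single-core jobs of 60 seconds each, two 4-core workers, and a 120-second initialization time, the estimated drain time is 750 seconds, so this rule scales up to the configured maximum.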
Full Text Available
Lightweight Function Monitors for Fine-Grained Management in Large Scale Python Applications

https://doi.org/10.1109/IPDPS49936.2021.00088

Shaffer, Tim; Li, Zhuozhao; Tovar, Ben; Babuji, Yadu; Dasso, TJ; Surma, Zoe; Chard, Kyle; Foster, Ian; Thain, Douglas (May 2021, IEEE International Parallel and Distributed Processing Symposium)
Python has become a widely used programming language for research, not only for small one-off analyses, but also for complex application pipelines running at supercomputer-scale. Modern parallel programming frameworks for Python present users with a more granular unit of management than traditional Unix processes and batch submissions: the Python function. We review the challenges involved in running native Python functions at scale, and present techniques for dynamically determining a minimal set of dependencies and for assembling a lightweight function monitor (LFM) that captures the software environment and manages resources at the granularity of single functions. We evaluate these techniques in a range of environments, from campus cluster to supercomputer, and show that our advanced dependency management planning and dynamic resource management methods provide superior performance and utilization relative to coarser-grained management approaches, achieving a several-fold decrease in execution time for several large Python applications.
Full Text Available
Solving the Container Explosion Problem for Distributed High Throughput Computing

https://doi.org/10.1109/IPDPS47924.2020.00048

Shaffer, Tim; Hazekamp, Nicholas; Blomer, Jakob; Thain, Douglas (May 2020, International Parallel and Distributed Processing Symposium)
Container technologies are seeing wider use at advanced computing facilities for managing highly complex applications that must execute at multiple sites. However, in a distributed high throughput computing setting, the unrestricted use of containers can result in the container explosion problem. If a new container image is generated for each variation of a job dispatched to a site, shared storage is soon exceeded. On the other hand, if a single large container image is used to meet multiple needs, the size of that container may become a problem for storage and transport. To address this problem, we observe that many containers have an internal structure generated by a structured package manager, and this information could be used to strategically combine and share container images. We develop LANDLORD to exploit this property and evaluate its performance through a combination of simulation studies and empirical measurement of high energy physics applications.
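To make the trade-off between many small images and one large image concrete, the Python sketch below treats each image as a set of package names and either reuses, merges into, or creates an image for each request. This is an illustrative greedy policy and should not be read as the LANDLORD algorithm: the Jaccard similarity measure, the threshold value, and the names jaccard and place are assumptions introduced here.

```python
def jaccard(a, b):
    """Jaccard similarity of two package sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def place(request, images, threshold=0.5):
    """Pick an image for a requested package set, updating images in place.

    Returns the index of the image that now satisfies the request.
    """
    request = frozenset(request)
    # 1. Reuse an existing image that already contains every requested package.
    for i, image in enumerate(images):
        if request <= image:
            return i
    # 2. Otherwise merge into the most similar image, if it is similar enough.
    best, best_sim = None, threshold
    for i, image in enumerate(images):
        sim = jaccard(request, image)
        if sim >= best_sim:
            best, best_sim = i, sim
    if best is not None:
        images[best] = images[best] | request
        return best
    # 3. Fall back to creating a new image for this request.
    images.append(request)
    return len(images) - 1
```

A low threshold pushes this policy toward a few large shared images, while a high threshold pushes it toward one image per distinct request, which mirrors the two failure modes described in the abstract.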
Full Text Available
Flexible Partitioning of Scientific Workflows Using the JX Workflow Language

https://doi.org/10.1145/3332186.3338100

Shaffer, Tim; Kremer-Herman, Nathaniel; Thain, Douglas (January 2019, Practice and Experience of Advanced Research Computing (PEARC))

Full Text Available
A First Look at the JX Workflow Language

https://doi.org/10.1109/eScience.2018.00094

Shaffer, Tim; Sweeney, Kyle M.D.; Kremer-Herman, Nathaniel; Thain, Douglas (October 2018, IEEE International Conference on e-Science)

Full Text Available
Taming metadata storms in parallel filesystems with metaFS

https://doi.org/10.1145/3149393.3149401

Shaffer, Tim; Thain, Douglas (January 2017, Parallel Data Storage Workshop at SC)

Full Text Available
