Title: X-Containers: Breaking Down Barriers to Improve Performance and Isolation of Cloud-Native Containers
Award ID(s):
1700832
PAR ID:
10348071
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019
Page Range / eLocation ID:
121 to 135
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We describe the design and implementation of DetTrace, a reproducible container abstraction for Linux implemented in user space. All computation that occurs inside a DetTrace container is a pure function of the initial filesystem state of the container. Reproducible containers can be used for a variety of purposes, including replication for fault tolerance, reproducible software builds, and reproducible data analytics. We use DetTrace to achieve, in an automatic fashion, reproducibility for 12,130 Debian package builds, containing over 800 million lines of code, as well as bioinformatics and machine learning workflows. We show that, while software in each of these domains is initially irreproducible, DetTrace brings reproducibility without requiring any hardware, OS, or application changes. DetTrace's performance is dictated by the frequency of system calls: IO-intensive software builds have an average overhead of 3.49x, while the overhead for a compute-bound bioinformatics workflow is under 2%.
  2. Application containers, such as those provided by Docker, have recently gained popularity as a solution for agile and seamless software deployment. These lightweight virtualization environments run applications that are packaged together with their resources and configuration information, and thus can be deployed across various software platforms. Unfortunately, the ease with which containers can be created is oftentimes a double-edged sword, encouraging the packaging of logically distinct applications, and the inclusion of a significant number of unnecessary components, within a single container. These practices needlessly increase the container size - sometimes by orders of magnitude. They also decrease the overall security, as each included component - necessary or not - may bring in security issues of its own, and there is no isolation between multiple applications packaged within the same container image. We propose algorithms and a tool called Cimplifier, which address these concerns: given a container and simple user-defined constraints, our tool partitions it into simpler containers, which (i) are isolated from each other, only communicating as necessary, and (ii) only include enough resources to perform their functionality. Our evaluation on real-world containers demonstrates that Cimplifier preserves the original functionality, leads to reduction in image size of up to 95%, and processes even large containers in under thirty seconds.
  3. {"Abstract":["Binder is a publicly accessible online service for executing interactive notebooks based on Git repositories. Binder dynamically builds and deploys containers following a recipe stored in the repository, then gives the user a browser-based notebook interface. The Binder group periodically releases a log of container launches from the public Binder service. Archives of launch records are available here. These records do not include identifiable information like IP addresses, but do give the source repo being launched along with some other metadata. The main content of this dataset is in the binder.sqlite<\/code> file. This SQLite database includes launch records from 2018-11-03 to 2021-06-06 in the events<\/code> table, which has the following schema.<\/p>\n\nCREATE TABLE events(\n version INTEGER,\n timestamp TEXT,\n provider TEXT,\n spec TEXT,\n origin TEXT,\n ref TEXT,\n guessed_ref TEXT\n);\nCREATE INDEX idx_timestamp ON events(timestamp);\n<\/code>\n\nversion<\/code> indicates the version of the record as assigned by Binder. The origin<\/code> field became available with version 3, and the ref<\/code> field with version 4. Older records where this information was not recorded will have the corresponding fields set to null.<\/li>timestamp<\/code> is the ISO timestamp of the launch<\/li>provider<\/code> gives the type of source repo being launched ("GitHub" is by far the most common). The rest of the explanations assume GitHub, other providers may differ.<\/li>spec<\/code> gives the particular branch/release/commit being built. It consists of <github-id>/<repo>/<branch><\/code>.<\/li>origin<\/code> indicates which backend was used. Each has its own storage, compute, etc. so this info might be important for evaluating caching and performance. Note that only recent records include this field. May be null.<\/li>ref<\/code> specifies the git commit that was actually used, rather than the named branch referenced by spec<\/code>. 
Note that this was not recorded from the beginning, so only the more recent entries include it. May be null.<\/li>For records where ref<\/code> is not available, we attempted to clone the named reference given by spec<\/code> rather than the specific commit (see below). The guessed_ref<\/code> field records the commit found at the time of cloning. If the branch was updated since the container was launched, this will not be the exact version that was used, and instead will refer to whatever was available at the time (early 2021). Depending on the application, this might still be useful information. Selecting only records with version 4 (or non-null ref<\/code>) will exclude these guessed commits. May be null.<\/li><\/ul>\n\nThe Binder launch dataset identifies the source repos that were used, but doesn't give any indication of their contents. We crawled GitHub to get the actual specification files in the repos which were fed into repo2docker when preparing the notebook environments, as well as filesystem metadata of the repos. Some repos were deleted/made private at some point, and were thus skipped. This is indicated by the absence of any row for the given commit (or absence of both ref<\/code> and guessed_ref<\/code> in the events<\/code> table). The schema is as follows.<\/p>\n\nCREATE TABLE spec_files (\n ref TEXT NOT NULL PRIMARY KEY,\n ls TEXT,\n runtime BLOB,\n apt BLOB,\n conda BLOB,\n pip BLOB,\n pipfile BLOB,\n julia BLOB,\n r BLOB,\n nix BLOB,\n docker BLOB,\n setup BLOB,\n postbuild BLOB,\n start BLOB\n);<\/code>\n\nHere ref<\/code> corresponds to ref<\/code> and/or guessed_ref<\/code> from the events<\/code> table. For each repo, we collected spec files into the following fields (see the repo2docker docs for details on what these are). 
The records in the database are simply the verbatim file contents, with no parsing or further processing performed.<\/p>\n\nruntime<\/code>: runtime.txt<\/code><\/li>apt<\/code>: apt.txt<\/code><\/li>conda<\/code>: environment.yml<\/code><\/li>pip<\/code>: requirements.txt<\/code><\/li>pipfile<\/code>: Pipfile.lock<\/code> or Pipfile<\/code><\/li>julia<\/code>: Project.toml<\/code> or REQUIRE<\/code><\/li>r<\/code>: install.R<\/code><\/li>nix<\/code>: default.nix<\/code><\/li>docker<\/code>: Dockerfile<\/code><\/li>setup<\/code>: setup.py<\/code><\/li>postbuild<\/code>: postBuild<\/code><\/li>start<\/code>: start<\/code><\/li><\/ul>\n\nThe ls<\/code> field gives a metadata listing of the repo contents (excluding the .git<\/code> directory). This field is JSON encoded with the following structure based on JSON types:<\/p>\n\nObject: filesystem directory. Keys are file names within it. Values are the contents, which can be regular files, symlinks, or subdirectories.<\/li>String: symlink. The string value gives the link target.<\/li>Number: regular file. The number value gives the file size in bytes.<\/li><\/ul>\n\nCREATE TABLE clean_specs (\n ref TEXT NOT NULL PRIMARY KEY,\n conda_channels TEXT,\n conda_packages TEXT,\n pip_packages TEXT,\n apt_packages TEXT\n);<\/code>\n\nThe clean_specs<\/code> table provides parsed and validated specifications for some of the specification files (currently Pip, Conda, and APT packages). Each column gives either a JSON encoded list of package requirements, or null. APT packages have been validated using a regex adapted from the repo2docker source. Pip packages have been parsed and normalized using the Requirement class from the pkg_resources package of setuptools. Conda packages have been parsed and normalized using the conda.models.match_spec.MatchSpec<\/code> class included with the library form of Conda (distinct from the command line tool). 
Users might want to use these parsers when working with the package data, as the specifications can become fairly complex.<\/p>\n\nThe missing<\/code> table gives the repos that were not accessible, and event_logs<\/code> records which log files have already been added. These tables are used for updating the dataset and should not be of interest to users.<\/p>"]} 
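The events schema above can be queried directly with Python's built-in sqlite3 module. The sketch below is illustrative, not part of the dataset: it builds an in-memory copy of the published schema and inserts one hypothetical launch record (the repo name, timestamp, origin, and commit are invented), then counts launches per repo using only records with a known commit, as recommended above.

```python
import sqlite3

# Build an in-memory copy of the published `events` schema so the
# snippet is self-contained; against the real dataset you would
# connect to the binder.sqlite file instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events(
  version INTEGER,
  timestamp TEXT,
  provider TEXT,
  spec TEXT,
  origin TEXT,
  ref TEXT,
  guessed_ref TEXT
);
CREATE INDEX idx_timestamp ON events(timestamp);
""")

# One hypothetical version-4 record; all values here are invented.
conn.execute(
    "INSERT INTO events VALUES (4, '2021-01-15T12:00:00', 'GitHub', "
    "'someuser/somerepo/main', 'gke.mybinder.org', 'abc123', NULL)"
)

# Count launches per source repo, restricted to records where the
# exact commit is known (non-null ref), which excludes guessed commits.
rows = conn.execute("""
    SELECT provider, spec, COUNT(*) AS launches
    FROM events
    WHERE ref IS NOT NULL
    GROUP BY provider, spec
    ORDER BY launches DESC
""").fetchall()
```

The same connection can then be pointed at the spec_files and clean_specs tables, joining on the ref column.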
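The JSON structure of the ls field (object = directory, string = symlink target, number = file size in bytes) lends itself to a short recursive walk. The helper below is a minimal sketch; the function name and the sample listing are invented for illustration and are not part of the dataset.

```python
import json

def repo_size(node):
    """Total the bytes of regular files in an `ls` tree.

    Per the dataset description: a JSON object is a directory,
    a string is a symlink (its value is the link target), and
    a number is a regular file (its value is the size in bytes).
    """
    if isinstance(node, dict):   # directory: recurse into its entries
        return sum(repo_size(child) for child in node.values())
    if isinstance(node, str):    # symlink: contributes no bytes
        return 0
    return node                  # regular file: its size in bytes

# Hypothetical `ls` value exercising all three node types.
ls_json = '{"README.md": 1024, "docs": {"index.md": 2048}, "latest": "docs/index.md"}'
total = repo_size(json.loads(ls_json))  # 1024 + 2048 = 3072
```

The same traversal pattern works for other queries over the listing, such as counting files by extension or locating the spec files described above.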