skip to main content


Title: Cimplifier: automatically debloating containers
Application containers, such as those provided by Docker, have recently gained popularity as a solution for agile and seamless software deployment. These light-weight virtualization environments run applications that are packed together with their resources and configuration information, and thus can be deployed across various software platforms. Unfortunately, the ease with which containers can be created is oftentimes a double-edged sword, encouraging the packaging of logically distinct applications, and the inclusion of significant amount of unnecessary components, within a single container. These practices needlessly increase the container size - sometimes by orders of magnitude. They also decrease the overall security, as each included component - necessary or not - may bring in security issues of its own, and there is no isolation between multiple applications packaged within the same container image. We propose algorithms and a tool called Cimplifier, which address these concerns: given a container and simple user-defined constraints, our tool partitions it into simpler containers, which (i) are isolated from each other, only communicating as necessary, and (ii) only include enough resources to perform their functionality. Our evaluation on real-world containers demonstrates that Cimplifier preserves the original functionality, leads to reduction in image size of up to 95%, and processes even large containers in under thirty seconds.  more » « less
Award ID(s):
1722068
NSF-PAR ID:
10056569
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering
Page Range / eLocation ID:
476 to 486
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With close to native performance, Linux containers are becoming the de facto platform for cloud computing. While various solutions have been proposed to secure applications and containers in the cloud environment by leveraging Intel SGX, most cloud operators do not yet offer SGX as a service. This is likely due to a number of security, scalability, and usability concerns coming from both cloud providers and users. Cloud operators worry about the security guarantees of unofficial SDKs, limited support for remote attestation within containers, limited physical memory for the Enclave Page Cache (EPC) making it difficult to support hundreds of enclaves, and potential DoS attacks against EPC by malicious users. Meanwhile, end users need to worry about careful program partitioning to reduce the TCB and adapting legacy applications to use SGX. We note that most of these concerns are the result of an incomplete infrastructure, from the OS to the application layer. We address these concerns with lxcsgx, which allows SGX applications to run inside containers while also: enabling SGX remote attestation for containerized applications, enforcing EPC memory usage control on a per-container basis, providing a general software TPM using SGX to augment legacy applications, and supporting partitioning with a GCC plugin. We then retrofit Nginx/OpenSSL and Memcached using the software TPM and SGX partitioning to defend against known and potential attacks. Thanks to the small EPC footprint of each enclave, we are able to run up to 100 containerized Memcached instances without EPC swapping. Our evaluation shows the overhead introduced by lxcsgx is less than 6.9% for simple SGX applications, 9.5% for Nginx/OpenSSL, and 20.9% for containerized Memcached. 
    more » « less
  2. The Unmanned aerial vehicles (UAVs) sector is fast-expanding. Protection of real-time UAV applications against malicious attacks has become an urgent problem that needs to be solved. Denial-of-service (DoS) attack aims to exhaust system resources and cause important tasks to miss deadlines. DoS attack may be one of the common problems of UAV systems, due to its simple implementation. In this paper, we present a software framework that offers DoS attack-resilient control for real-time UAV systems using containers: Container Drone. The framework provides defense mechanisms for three critical system resources: CPU, memory, and communication channel. We restrict the attacker's access to the CPU core set and utilization. Memory bandwidth throttling limits the attacker's memory usage. By simulating sensors and drivers in the container, a security monitor constantly checks DoS attacks over communication channels. Upon the detection of a security rule violation, the framework switches to the safety controller to mitigate the attack. We implemented a prototype quadcopter with commercially off-the-shelf (COTS) hardware and open-source software. Our experimental results demonstrated the effectiveness of the proposed framework defending against various DoS attacks. 
    more » « less
  3. Binder is a publicly accessible online service for executing interactive notebooks based on Git repositories. Binder dynamically builds and deploys containers following a recipe stored in the repository, then gives the user a browser-based notebook interface. The Binder group periodically releases a log of container launches from the public Binder service. Archives of launch records are available here. These records do not include identifiable information like IP addresses, but do give the source repo being launched along with some other metadata. The main content of this dataset is in the binder.sqlite file. This SQLite database includes launch records from 2018-11-03 to 2021-06-06 in the events table, which has the following schema.

    CREATE TABLE events( version INTEGER, timestamp TEXT, provider TEXT, spec TEXT, origin TEXT, ref TEXT, guessed_ref TEXT ); CREATE INDEX idx_timestamp ON events(timestamp);
    • version indicates the version of the record as assigned by Binder. The origin field became available with version 3, and the ref field with version 4. Older records where this information was not recorded will have the corresponding fields set to null.
    • timestamp is the ISO timestamp of the launch
    • provider gives the type of source repo being launched ("GitHub" is by far the most common). The rest of the explanations assume GitHub, other providers may differ.
    • spec gives the particular branch/release/commit being built. It consists of <github-id>/<repo>/<branch>.
    • origin indicates which backend was used. Each has its own storage, compute, etc. so this info might be important for evaluating caching and performance. Note that only recent records include this field. May be null.
    • ref specifies the git commit that was actually used, rather than the named branch referenced by spec. Note that this was not recorded from the beginning, so only the more recent entries include it. May be null.
    • For records where ref is not available, we attempted to clone the named reference given by spec rather than the specific commit (see below). The guessed_ref field records the commit found at the time of cloning. If the branch was updated since the container was launched, this will not be the exact version that was used, and instead will refer to whatever was available at the time (early 2021). Depending on the application, this might still be useful information. Selecting only records with version 4 (or non-null ref) will exclude these guessed commits. May be null.

    The Binder launch dataset identifies the source repos that were used, but doesn't give any indication of their contents. We crawled GitHub to get the actual specification files in the repos which were fed into repo2docker when preparing the notebook environments, as well as filesystem metadata of the repos. Some repos were deleted/made private at some point, and were thus skipped. This is indicated by the absence of any row for the given commit (or absence of both ref and guessed_ref in the events table). The schema is as follows.

    CREATE TABLE spec_files ( ref TEXT NOT NULL PRIMARY KEY, ls TEXT, runtime BLOB, apt BLOB, conda BLOB, pip BLOB, pipfile BLOB, julia BLOB, r BLOB, nix BLOB, docker BLOB, setup BLOB, postbuild BLOB, start BLOB );

    Here ref corresponds to ref and/or guessed_ref from the events table. For each repo, we collected spec files into the following fields (see the repo2docker docs for details on what these are). The records in the database are simply the verbatim file contents, with no parsing or further processing performed.

    • runtime: runtime.txt
    • apt: apt.txt
    • conda: environment.yml
    • pip: requirements.txt
    • pipfile: Pipfile.lock or Pipfile
    • julia: Project.toml or REQUIRE
    • r: install.R
    • nix: default.nix
    • docker: Dockerfile
    • setup: setup.py
    • postbuild: postBuild
    • start: start

    The ls field gives a metadata listing of the repo contents (excluding the .git directory). This field is JSON encoded with the following structure based on JSON types:

    • Object: filesystem directory. Keys are file names within it. Values are the contents, which can be regular files, symlinks, or subdirectories.
    • String: symlink. The string value gives the link target.
    • Number: regular file. The number value gives the file size in bytes.
    CREATE TABLE clean_specs ( ref TEXT NOT NULL PRIMARY KEY, conda_channels TEXT, conda_packages TEXT, pip_packages TEXT, apt_packages TEXT );

    The clean_specs table provides parsed and validated specifications for some of the specification files (currently Pip, Conda, and APT packages). Each column gives either a JSON encoded list of package requirements, or null. APT packages have been validated using a regex adapted from the repo2docker source. Pip packages have been parsed and normalized using the Requirement class from the pkg_resources package of setuptools. Conda packages have been parsed and normalized using the conda.models.match_spec.MatchSpec class included with the library form of Conda (distinct from the command line tool). Users might want to use these parsers when working with the package data, as the specifications can become fairly complex.

    The missing table gives the repos that were not accessible, and event_logs records which log files have already been added. These tables are used for updating the dataset and should not be of interest to users.

     
    more » « less
  4. P4’s data-plane programmability allows for highly customizable and programmable packet processing, enabling rapid innovation in network applications, such as virtualization, security, load balancing, and traffic engineering. Researchers extensively use Mininet, a popular network emulator, integrated with BMv2, for fast and flexible prototyping of these P4-based applications, but due to its lower performance in terms of throughput and latency compared to a production-grade software switch like Open vSwitch, it is crucial to have an accurate and scalable emulation testbed. In this paper, we develop a lightweight virtual time system and integrate it into Mininet with BMv2 to enhance fidelity and scalability. By scaling the time of interactions between containers and the underlying physical machine by a time dilation factor (TDF), we can trade time with system resources, making the emulated P4 network appear to be faster from the viewpoint of the switch/host processes in the container. Our experimental results show that the testbed can accurately emulate much larger networks with high loads, scaled by a factor of TDF with extremely low system overhead. 
    more » « less
  5. null (Ed.)
    Large-scale, high-throughput computational science faces an accelerating convergence of software and hardware. Software container-based solutions have become common in cloud-based datacenter environments, and are considered promising tools for addressing heterogeneity and portability concerns. However, container solutions reflect a set of assumptions which complicate their adoption by developers and users of scientific workflow applications. Nor are containers a universal solution for deployment in high-performance computing (HPC) environments which have specialized and vertically integrated scheduling and runtime software stacks. In this paper, we present a container design and deployment approach which uses modular layering to ease the deployment of containers into existing HPC environments. This layered approach allows operating system integrations, support for different communication and performance monitoring libraries, and application code to be defined and interchanged in isolation. We describe in this paper the details of our approach, including specifics about container deployment and orchestration for different HPC scheduling systems. We also describe how this layering method can be used to build containers for two separate applications, each deployed on clusters with different batch schedulers, MPI networking support, and performance monitoring requirements. Our experience indicates that the layered approach is a viable strategy for building applications intended to provide similar behavior across widely varying deployment targets. 
    more » « less