Title: Low Overhead Security Isolation using Lightweight Kernels and TEEs
The next generation of supercomputing resources is expected to greatly expand the scope of HPC environments, in terms of both more diverse workloads and user bases and the integration of edge computing infrastructures. This will likely require new mechanisms and approaches at the operating system level to support these broader classes of workloads along with their different security requirements. We claim that a key mechanism needed for these workloads is the ability to securely compartmentalize the system software executing on a given node. In this paper, we present initial efforts in exploring the integration of secure and trusted computing capabilities into an HPC system software stack. As part of this work, we have ported the Kitten Lightweight Kernel (LWK) to the ARM64 architecture and integrated it with the Hafnium hypervisor, a reference implementation of a secure partition manager (SPM) that provides security isolation for virtual machines. By integrating Kitten with Hafnium, we are able to replace the commodity-oriented, Linux-based resource management infrastructure and reduce the overheads introduced by using a full weight kernel (FWK) as the node-level resource scheduler. While our results are very preliminary, we are able to demonstrate measurable performance improvements on small-scale ARM-based SoC platforms.
Award ID(s):
1704139
NSF-PAR ID:
10369127
Journal Name:
Proceedings of the 11th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2021)
Page Range or eLocation-ID:
42 to 49
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, Iyad ; Selesnick, Ivan ; Picone, Joseph (Ed.)
    The goal of this work was to design a low-cost computing facility that can support the development of an open source digital pathology corpus containing 1M images [1]. A single image from a clinical-grade digital pathology scanner can range in size from hundreds of megabytes to five gigabytes. A 1M image database requires over a petabyte (PB) of disk space. To do meaningful work in this problem space requires a significant allocation of computing resources. The improvements and expansions to our HPC (high-performance computing) cluster, known as Neuronix [2], required to support working with digital pathology fall into two broad categories: computation and storage. To handle the increased computational burden and increase job throughput, we are using Slurm [3] as our scheduler and resource manager. For storage, we have designed and implemented a multi-layer filesystem architecture to distribute a filesystem across multiple machines. These enhancements, which are entirely based on open source software, have extended the capabilities of our cluster and increased its cost-effectiveness. Slurm has numerous features that allow it to generalize to a number of different scenarios. Among the most notable is its support for GPU (graphics processing unit) scheduling. GPUs can offer a tremendous performance increase in machine learning applications [4], and Slurm’s built-in mechanisms for handling them were a key factor in making this choice. Slurm has a general resource (GRES) mechanism that can be used to configure and enable support for resources beyond the ones provided by the traditional HPC scheduler (e.g. memory, wall-clock time), and GPUs are among the GRES types that can be supported by Slurm [5]. In addition to being able to track resources, Slurm strictly enforces resource allocation.
This becomes very important as the computational demands of the jobs increase, so that they have all the resources they need and do not take resources from other jobs. It is a common practice among GPU-enabled frameworks to query the CUDA runtime library/drivers and iterate over the list of GPUs, attempting to establish a context on all of them. Slurm is able to affect the hardware discovery process of these jobs, which enables a number of these jobs to run alongside each other, even if the GPUs are in exclusive-process mode. To store large quantities of digital pathology slides, we developed a robust, extensible distributed storage solution. We utilized a number of open source tools to create a single filesystem, which can be mounted by any machine on the network. At the lowest layer of abstraction are the hard drives, which were split into four 60-disk chassis, using 8TB drives. To support these disks, we have two server units, each equipped with Intel Xeon CPUs and 128GB of RAM. At the filesystem level, we have implemented a multi-layer solution that: (1) connects the disks together into a single filesystem/mountpoint using the ZFS (Zettabyte File System) [6], and (2) connects filesystems on multiple machines together to form a single mountpoint using Gluster [7]. ZFS, initially developed by Sun Microsystems, provides disk-level awareness and a filesystem which takes advantage of that awareness to provide fault tolerance. At the filesystem level, ZFS protects against data corruption and the infamous RAID write-hole bug by implementing a journaling scheme (the ZFS intent log, or ZIL) and copy-on-write functionality. Each machine (1 controller + 2 disk chassis) has its own separate ZFS filesystem. Gluster, essentially a meta-filesystem, takes each of these and provides the means to connect them together over the network using distributed (similar to RAID 0 but without striping individual files) and mirrored (similar to RAID 1) configurations [8].
By implementing these improvements, it has been possible to expand the storage and computational power of the Neuronix cluster arbitrarily to support the most computationally-intensive endeavors by scaling horizontally. We have greatly improved the scalability of the cluster while maintaining its excellent price/performance ratio [1].
  2. For system logs to aid in security investigations, they must be beyond the reach of the adversary. Unfortunately, attackers that have escalated privilege on a host are typically able to delete and modify log events at will. In response to this threat, a variety of secure logging systems have appeared over the years that attempt to provide tamper-resistance (e.g., write once read many drives, remote storage servers) or tamper-evidence (e.g., cryptographic proofs) for system logs. These solutions expose an interface through which events are committed to a secure log, at which point they enjoy protection from future tampering. However, all proposals to date have relied on the assumption that an event's occurrence is concomitant with its commitment to the secured log. In this work, we challenge this assumption by presenting and validating a race condition attack on the integrity of audit frameworks. Our attack exploits the intrinsically asynchronous nature of I/O and IPC activity, demonstrating that an attacker can snatch events about their intrusion out of message buffers after they have occurred but before they are committed to the log, thus bypassing existing protections. We present a first step towards defending against our attack by introducing KennyLoggings, the first kernel-based tamper-evident logging system that satisfies the synchronous integrity property, meaning that it guarantees tamper-evidence of events upon their occurrence. We implement KennyLoggings on top of the Linux kernel and show that it imposes between 8% and 11% overhead on log-intensive application workloads.
  3. To accelerate the communication between nodes, supercomputers are now equipped with multiple network adapters per node, also referred to as HCAs (Host Channel Adapters), resulting in a “multi‐rail”/“multi‐HCA” network. For example, the ThetaGPU system at Argonne National Laboratory (ANL) has eight adapters per node; with this many networking resources available, utilizing all of them becomes non‐trivial. The Message Passing Interface (MPI) is a dominant model for high‐performance computing clusters. Not all MPI collectives utilize all resources, and this becomes more apparent with advances in bandwidth and adapter count in a given cluster. In this work, we provide a thorough performance analysis of existing multirail solutions and their implications on collectives and present the necessity for further enhancement. Specifically, we propose novel designs for hierarchical, multi‐HCA‐aware Allgather. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter‐node and intra‐node communication. At the micro‐benchmark level, we see large inter‐node improvements of up to 62% and 61% over HPC‐X and MVAPICH2‐X, respectively, for 1024 processes. Because Allgather is used in Ring‐Allreduce, our designs also improve its performance by 56% and 44% compared to HPC‐X and MVAPICH2‐X, respectively. At the application level, our enhanced Allgather shows an improvement in a matrix‐vector multiplication kernel when compared to HPC‐X and MVAPICH2‐X, and Allreduce performs up to 7.83% better in deep learning training against MVAPICH2‐X.
  4. On large-scale high performance computing (HPC) systems, applications are provisioned with aggregated resources to meet their peak demands for brief periods. This results in resource underutilization because application requirements vary a lot during execution. This problem is particularly pronounced for deep learning applications that are running on leadership HPC systems with a large pool of burst buffers in the form of flash or non-volatile memory (NVM) devices. In this paper, we examine the I/O patterns of deep neural networks and reveal their critical need of loading many small samples randomly for successful training. We have designed a specialized Deep Learning File System (DLFS) that provides a thin set of APIs. Particularly, we design the metadata management of DLFS through an in-memory tree-based sample directory and its file services through the user-level SPDK protocol that can disaggregate the capabilities of NVM Express (NVMe) devices to parallel training tasks. Our experimental results show that DLFS can dramatically improve the throughput of training for deep neural networks on NVMe over Fabric, compared with the kernel-based Ext4 file system. Furthermore, DLFS achieves efficient user-level storage disaggregation with very little CPU utilization.
  5. Recent advancements in energy-harvesting techniques provide an alternative to batteries for resource constrained IoT devices and lead to a new computing paradigm, the intermittent computing model. In this model, a software module continues its execution from where it left off when an energy shortage occurred. Enforcing security of an intermittent software module is challenging because its power-off state has to be protected from a malicious adversary in addition to its power-on state, while the security mechanisms put in place must have a low overhead on the performance, resource consumption, and cost of a device. In this paper, we propose SIA (Secure Intermittent Architecture), a security architecture for resource-constrained IoT devices. SIA leverages low-cost security features available in commercial off-the-shelf microcontrollers to protect both the power-on and power-off state of an intermittent software module. Therefore, SIA enables a host of secure intermittent computing applications such as self-attestation, remote attestation, and secure communication. Moreover, our architecture provides confidentiality and integrity guarantees to an intermittent computing module at no cost compared to previous approaches in the literature that impose significant overheads. The salient characteristic of SIA is that it does not require any hardware modifications, and hence, it can be directly applied to existing IoT devices. We implemented and evaluated SIA on a resource-constrained IoT device based on an MSP430 processor. Besides being secure, SIA is simple and efficient. We confirm the feasibility of SIA for resource-constrained IoT devices with experimental results of several intermittent computing applications. Our prototype implementation outperforms by two to three orders of magnitude the secure intermittent computing solution of Suslowicz et al. presented at IGSC 2018.