skip to main content


Title: Greyfish: An Out-of-the-Box, Reusable, Portable Cloud Storage Service
A scalable storage system is an integral requirement for supporting large-scale cloud computing jobs. The raw space on storage systems is made usable with the help of a software layer which is typically called a filesystem (e.g., Google's Cloud Filestore). In this paper, we present the design and implementation of an open-source and free cloud-based filesystem named as "Greyfish" that can be installed on the Virtual Machines (VMs) hosted on different cloud computing systems, such as Jetstream and Chameleon. Greyfish helps in: (1) storing files and directories for different user-accounts in a shared space on the cloud, (2) managing file-access permissions, and (3) purging files when needed. It is currently being used in the implementation of the Gateway-In-A-Box (GIB) project. A simplified version of Greyfish, known as Reef, is already in production in the BOINC@TACC project. Science gateway developers will find Greyfish useful for creating local filesystems that can be mounted in containers. By doing so, they can independently do quick installations of self-contained software solutions in development and test environments while mounting the filesystems on large-scale storage platforms in the production environments only.  more » « less
Award ID(s):
1664022
NSF-PAR ID:
10152965
Author(s) / Creator(s):
;
Date Published:
Journal Name:
PEARC '19: Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning)
Page Range / eLocation ID:
1 to 6
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, Iyad ; Selesnick, Ivan ; Picone, Joseph (Ed.)
    The goal of this work was to design a low-cost computing facility that can support the development of an open source digital pathology corpus containing 1M images [1]. A single image from a clinical-grade digital pathology scanner can range in size from hundreds of megabytes to five gigabytes. A 1M image database requires over a petabyte (PB) of disk space. To do meaningful work in this problem space requires a significant allocation of computing resources. The improvements and expansions to our HPC (highperformance computing) cluster, known as Neuronix [2], required to support working with digital pathology fall into two broad categories: computation and storage. To handle the increased computational burden and increase job throughput, we are using Slurm [3] as our scheduler and resource manager. For storage, we have designed and implemented a multi-layer filesystem architecture to distribute a filesystem across multiple machines. These enhancements, which are entirely based on open source software, have extended the capabilities of our cluster and increased its cost-effectiveness. Slurm has numerous features that allow it to generalize to a number of different scenarios. Among the most notable is its support for GPU (graphics processing unit) scheduling. GPUs can offer a tremendous performance increase in machine learning applications [4] and Slurm’s built-in mechanisms for handling them was a key factor in making this choice. Slurm has a general resource (GRES) mechanism that can be used to configure and enable support for resources beyond the ones provided by the traditional HPC scheduler (e.g. memory, wall-clock time), and GPUs are among the GRES types that can be supported by Slurm [5]. In addition to being able to track resources, Slurm does strict enforcement of resource allocation. This becomes very important as the computational demands of the jobs increase, so that they have all the resources they need, and that they don’t take resources from other jobs. It is a common practice among GPU-enabled frameworks to query the CUDA runtime library/drivers and iterate over the list of GPUs, attempting to establish a context on all of them. Slurm is able to affect the hardware discovery process of these jobs, which enables a number of these jobs to run alongside each other, even if the GPUs are in exclusive-process mode. To store large quantities of digital pathology slides, we developed a robust, extensible distributed storage solution. We utilized a number of open source tools to create a single filesystem, which can be mounted by any machine on the network. At the lowest layer of abstraction are the hard drives, which were split into 4 60-disk chassis, using 8TB drives. To support these disks, we have two server units, each equipped with Intel Xeon CPUs and 128GB of RAM. At the filesystem level, we have implemented a multi-layer solution that: (1) connects the disks together into a single filesystem/mountpoint using the ZFS (Zettabyte File System) [6], and (2) connects filesystems on multiple machines together to form a single mountpoint using Gluster [7]. ZFS, initially developed by Sun Microsystems, provides disk-level awareness and a filesystem which takes advantage of that awareness to provide fault tolerance. At the filesystem level, ZFS protects against data corruption and the infamous RAID write-hole bug by implementing a journaling scheme (the ZFS intent log, or ZIL) and copy-on-write functionality. Each machine (1 controller + 2 disk chassis) has its own separate ZFS filesystem. Gluster, essentially a meta-filesystem, takes each of these, and provides the means to connect them together over the network and using distributed (similar to RAID 0 but without striping individual files), and mirrored (similar to RAID 1) configurations [8]. By implementing these improvements, it has been possible to expand the storage and computational power of the Neuronix cluster arbitrarily to support the most computationally-intensive endeavors by scaling horizontally. We have greatly improved the scalability of the cluster while maintaining its excellent price/performance ratio [1]. 
    more » « less
  2. null (Ed.)
    Parallel filesystems (PFSs) are one of the most critical high-availability components of High Performance Computing (HPC) systems. Most HPC workloads are dependent on the availability of a POSIX compliant parallel filesystem that provides a globally consistent view of data to all compute nodes of a HPC system. Because of this central role, failure or performance degradation events in the PFS can impact every user of a HPC resource. There is typically insufficient information available to users and even many HPC staff to identify the causes of these PFS events, impeding the implementation of timely and targeted remedies to PFS issues. The relevant information is distributed across PFS servers; however, access to these servers is highly restricted due to the sensitive role they play in the operations of a HPC system. Additionally, the information is challenging to aggregate and interpret, relegating diagnosis and treatment of PFS issues to a select few experts with privileged system access. To democratize this information, we are developing an open-source and user-facing Parallel FileSystem TRacing and Analysis SErvice (PFSTRASE) that analyzes the requisite data to establish causal relationships between PFS activity and events detrimental to stability and performance. We are implementing the service for the open-source Lustre filesystem, which is the most commonly used PFS at large-scale HPC sites. Server loads for specific PFS I/O operations (IOPs) will be measured and aggregated by the service to automatically estimate an effective load generated by every client, job, and user. The infrastructure provides a realtime, user accessible text-based interface and a publicly accessible web interface displaying both real-time and historical data. To democratize this information, we are developing an open-source and user-facing Parallel FileSystem TRacing and Analysis SErvice (PFSTRASE) that analyzes the requisite data to establish causal relationships between PFS activity and events detrimental to stability and performance. We are implementing the service for the open-source Lustre filesystem, which is the most commonly used PFS at large-scale HPC sites. Server loads for specific PFS I/O operations (IOPs) will be measured and aggregated by the service to automatically estimate an effective load generated by every client, job, and user. The infrastructure provides a realtime, user accessible text-based interface and a publicly accessible web interface displaying both real-time and historical data. 
    more » « less
  3. Scientific workflows drive most modern large-scale science breakthroughs by allowing scientists to define their computations as a set of jobs executed in a given order based on their data dependencies. Workflow management systems (WMSs) have become key to automating scientific workflows-executing computational jobs and orchestrating data transfers between those jobs running on complex high-performance computing (HPC) platforms. Traditionally, WMSs use files to communicate between jobs: a job writes out files that are read by other jobs. However, HPC machines face a growing gap between their storage and compute capabilities. To address that concern, the scientific community has adopted a new approach called in situ, which bypasses costly parallel filesystem I/O operations with faster in-memory or in-network communications. When using in situ approaches, communication and computations can be interleaved. In this work, we leverage the Decaf in situ dataflow framework to accelerate task-based scientific workflows managed by the Pegasus WMS, by replacing file communications with faster MPI messaging. We propose a new execution engine that uses Decaf to manage communications within a sub-workflow (i.e., set of jobs) to optimize inter-job communications. We consider two workflows in this study: (i) a synthetic workflow that benchmarks and compares file- and MPI-based communication; and (ii) a realistic bioinformatics workflow that computes mu-tational overlaps in the human genome. Experiments show that in situ communication can improve the bioinformatics workflow execution time by 22% to 30% compared with file communication. Our results motivate further opportunities and challenges for bridging traditional WMSs with in situ frameworks. 
    more » « less
  4. null (Ed.)
    Reproducibility in the sciences is critical to reliable inquiry, but is often easier said than done. In the computer architecture community, research may require modifying systems from low-level circuits to operating systems and highlevel applications. All of these moving parts make reproducible experiments on full-stack systems challenging to design. Furthermore, the computing ecosystem evolves quickly, leading to rapidly obsolete artifacts. This is especially true in the realm of software where applications are often updated on a monthly, or even daily, cadence. In this paper we introduce FireMarshal, a software workload management tool for RISC-V based full-stack hardware development and research. FireMarshal automates workload generation (constructing boot binaries and filesystem images), development (with functional simulation), and evaluation (with cycle-exact RTL simulation). It also ensures, to the extent possible, that the exact same software runs deterministically across all phases of development, providing confidence in correctness and accuracy while minimizing time spent on slow and expensive RTL-level simulation. To ease workload specification, FireMarshal provides sane defaults for common components like firmware and operating systems, freeing users to focus only on project-specific components. Beyond reproducibility, Fire- Marshal enables continued development of workloads through the use of inheritance, where new workloads can be derived from established and continually updated base workloads. Users communicate their designs through the use of simple JSON configuration files that can be easily version controlled, reused, and shared. In this paper, we describe the design of FireMarshal along with the associated software management methodology for architectural research and development. 
    more » « less
  5. Cybersecurity continues to be a critical aspect within every computing division, especially in the realm of operating system (OS) development. The OS resides at the lower layer above the hardware in the computing hierarchy. If the layers above the OS are well hardened, a security flaw in the OS will compromise the resources in those higher layers. Although several learning resources and courses are available for OS security, they are taught in advanced UG or graduate-level computer security classes. In this work, we develop cybersecurity educational modules that instructors can adoptin their OS courses to emphasize security in OS while teaching its concepts. The goal of this work is to engage students in learning security aspects in OS, while learning its concepts. It will give students a good understanding of different security concepts and how they are implemented in the OS. Towards this, we develop security educational modules for an OS course that will be available to the instructors for adoption in their courses. These modules are designed to be used in a UG-level OS course. To work on these modules, students should be familiar with C programming and OS concepts taught in the class. The modules are intended to be completed within the course of a semester. To achieve this goal, we organize them into three mini-projects witheach can be completed within a few weeks. We chose xv6 as the platform due to its popularity as an educational OS for developing the modules. To develop the modules, we referred to the recent version of a popular OS textbook for the security concepts. The topics discussed in it include authentication, authorization, cryptography, and distributed system security. We kept our educational modules mostly aligned with these topics except distributed system security. We also included a module for implementing a defense mechanism against buffer-overflow attacks, a famous software vulnerability. We created three mini-projects for these modules, each accompanied by proper documentation and a GitHub repository. Two versions are created for each project, one for a student’s assignment available in the repository and another as a solution version for instructors. The first project implements a user authentication system in xv6. Students will implement various specifications such as password structure with encryption and programs such as useradd, passwd, whoami, and login. The implementation guidelines are provided in the documentation, along with skeleton code. The authorization project implements the Unix-style access control system. In this project, students will modify and create various structures and functions within the xv6 kernel. The last project is to build a defense mechanism against buffer-overflow using Address Space Layout Randomization (ASLR). Students are expected to implement a random number generator and modify the executable file loader in xv6. The submission for each project is expected to demonstrate the module behavior comparable to relevant systems present in production grade OS, such as Linux. 
    more » « less