skip to main content


Search for: All records

Award ID contains: 1705021

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Access libraries such as ROOT[1] and HDF5[2] allow users to interact with datasets using high level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage systems interfaces and are generally unable to fully benefit from modern fast storage devices. For example, access libraries often implement buffering and data layout that assume that large, single-threaded sequential access patterns are causing less overall latency than small parallel random access: while this is true for spinning media, it is not true for flash media. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset mapping infrastructures that can integrate and scale out existing access libraries using Ceph’s extensible object model, avoiding re-implementation or even modifications of these access libraries as much as possible. These programmable storage extensions coupled with our distributed dataset mapping techniques enable: 1) access library operations to be offloaded to storage system servers, 2) the independent evolution of access libraries and storage systems and 3) fully leveraging of the existing load balancing, elasticity, and failure management of distributed storage systems like Ceph. They also create more opportunities to conduct storage server-local optimizations specific to storage servers. For example, storage servers might include local key/value stores combined with chunk stores that require different optimizations than a local file system. As storage servers evolve to support new storage devices like non-volatile memory, these server-local optimizations can be implemented while minimizing disruptions to applications. We will report progress on the means by which distributed dataset mapping can be abstracted over particular access libraries, including access libraries for ROOT data, and how we address some of the challenges revolving around data partitioning and composability of access operations. 
    more » « less
  2. The incubator and research projects sponsored by the Center for Research in Open Source Software (CROSS, cross.ucsc.edu) at UC Santa Cruz have been very effective at promoting the professional and technical development of research software engineers. Carlos Maltzahn founded CROSS in 2015 with a generous gift of $2,000,000 from UC Santa Cruz alumnus Dr. Sage Weil and founding memberships of Toshiba America Electronic Components, SK Hynix Memory Solutions, and Micron Technology. Over the past five years, CROSS funding has enabled PhD students to not only create re- search software projects but also learn how to draw in new contributors and leverage established open source software communities. This position paper will present CROSS fellowships as case studies for how university-led open source projects can create a real- world, reproducible model for effectively training, funding and sup- porting research software engineers. 
    more » « less
  3. The Skyhook Data Management project (SkyhookDM.com) at the Center for Research in Open Source Software (cross.ucsc.edu) at UC Santa Cruz implements customized extensions through Ceph's object class interface that enables offloading database operations to the storage system. In our previous Vault '19 talk, we showed how SkyhookDM can transparently scale out databases. The SkyhookDM Ceph extensions are an example of our 'programmable storage' research efforts at UCSC, and can be accessed through commonly available external/foreign table database interfaces. Utilizing fast in-memory serialization libraries such as Google Flatbuffers and Apache Arrow, SkyhookDM currently implements common database functions such as SELECT, PROJECT, AGGREGATE, and indexing inside Ceph, along with lower-level data manipulations such as transforming data from row to column formats on RADOS servers. In this talk, we will present three of our latest developments on the SkyhookDM project since Vault '19. First, SkyhookDM can be used to also offload operations of access libraries that support plugins for backends, such as HDF5 and its Virtual Object Layer. Second, in addition to row-oriented data format using Google's Flatbuffers, we have added support for column-oriented data formats using the Apache Arrow library within our Ceph extensions. Third, we added dynamic switching between row and column data formats within Ceph objects, a first step towards physical design management in storage systems, similar to physical design tuning in database systems. 
    more » « less
  4. In the resource-rich environment of data centers most failures can quickly failover to redundant resources. In contrast, failure in edge infrastructures with limited resources might require maintenance personnel to drive to the location in order to fix the problem. The operational cost of these "truck rolls" to locations at the edge infrastructure competes with the operational cost incurred by extra space and power needed for redundant resources at the edge. Computational storage devices with network interfaces can act as network-attached storage servers and offer a new design point for storage systems at the edge. In this paper we hypothesize that a system consisting of a larger number of such small "embedded" storage nodes provides higher availability due to a larger number of failure domains while also saving operational cost in terms of space and power. As evidence for our hypothesis, we compared the possibility of data loss between two different types of storage systems: one is constructed with general-purpose servers, and the other one is constructed with embedded storage nodes. Our results show that the storage system constructed with general-purpose servers has 7 to 20 times higher risk of losing data over the storage system constructed with embedded storage devices. We also compare the two alternatives in terms of power and space using the Media-Based Work Unit (MBWU) that we developed in an earlier paper as a reference point. 
    more » « less
  5. With ever larger data sets and cloud-based storage systems, it becomes increasingly more attractive to move computation to data, a common principle in big data systems. Historically, data management systems have pushed computation nearest to the data in order to reduce data moving through query execution pipelines. Computational storage approaches address the problem of both data reduction nearest the source as well as offloading some processing to the storage layer. 
    more » « less
  6. Ceph is an open source distributed storage system that is object-based and massively scalable. Ceph provides developers with the capability to create data interfaces that can take advantage of local CPU and memory on the storage nodes (Ceph Object Storage Devices). These interfaces are powerful for application developers and can be created in C, C++, and Lua. Skyhook is an open source storage and database project in the Center for Research in Open Source Software at UC Santa Cruz. Skyhook uses these capabilities in Ceph to create specialized read/write interfaces that leverage IO and CPU within the storage layer toward database processing and management. Specifically, we develop methods to apply predicates locally as well as additional metadata and indexing capabilities using Ceph's internal indexing mechanism built on top of RocksDB. Skyhook's approach helps to enable scale-out of a single node database system by scaling out the storage layer. Our results show the performance benefits for some queries indeed scale well as the storage layer scales out. 
    more » « less