Secure architectures are becoming an increasingly common demand. This is due in large part to the rise of cloud computing, as users are trusting their data with hardware that they do not own. Unfortunately, many secure computation and isolation techniques are still susceptible to side-channel attacks. While various defenses to side-channel attacks exist, each tends to be targeted to a specific vulnerability and comes with a high runtime overhead, making it difficult to combine these defenses together in a performant manner.This work proposes an efficient design for preventing a large range of cache side-channel attacks by leveraging a near-memory processing (NMP) architecture. Specifically, the proposed design stores all sensitive data in isolated NMP vaults and performs all computation involving that sensitive data on NMP cores. Our approach eliminates possible cache side-channels while also minimizing runtime overhead when the parallelizability of NMP architecture is leveraged. Simulation results from a cycle-accurate architecture model shows that offloading secure computation to NMP cores can have as little as 0.26% overhead.
more »
« less
HybriDS: Cache-Conscious Concurrent Data Structures for Near-Memory Processing Architectures
In recent years, the ever0increasing impact of memory access bottlenecks has brought forth a renewed interest in near-memory processing (NMP) architectures. In this work, we propose and empirically evaluate hybrid data structures, which are concurrent data structures custom-designed for these new NMP architectures. We focus on cache-optimized data structures, such as skiplists and B+ trees, that are often used as index structures in online transaction processing (OLTP) systems to enable fast key-based lookups. These data structures are hierarchical, where lookups begin at a small number of top-level nodes and diverge to many different node paths as they move down the hierarchy, such that nodes in higher levels benefit more from caching. Our proposed hybrid data structures split traditional hierarchical data structures into a host-managed portion consisting of higher-level nodes and an NMP-managed portion consisting of the remaining lower-level nodes, thus retaining and further enhancing the cache-conscious optimizations of their conventional implementations. Although the idea might seem relatively simple, the splitting of the data structure prompts new synchronization problems, and careful implementation is required to ensure high concurrency and correctness. We provide implementations of a hybrid skiplist and a hybrid B+ tree, and we empirically evaluate them on a cycle-accurate full-system architecture simulator. Our results show that the hybrid data structures have the potential to improve performance by more than 2X compared to state-of-the-art concurrent data structures.
more »
« less
- PAR ID:
- 10346864
- Date Published:
- Journal Name:
- ACM Symposium on Parallelism in Algorithms and Architectures
- Page Range / eLocation ID:
- 321 to 332
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Multicopy search structures such as log-structured merge (LSM) trees are optimized for high insert/update/delete (collectively known as upsert) performance. In such data structures, an upsert on key k , which adds ( k , v ) where v can be a value or a tombstone, is added to the root node even if k is already present in other nodes. Thus there may be multiple copies of k in the search structure. A search on k aims to return the value associated with the most recent upsert. We present a general framework for verifying linearizability of concurrent multicopy search structures that abstracts from the underlying representation of the data structure in memory, enabling proof-reuse across diverse implementations. Based on our framework, we propose template algorithms for (a) LSM structures forming arbitrary directed acyclic graphs and (b) differential file structures, and formally verify these templates in the concurrent separation logic Iris. We also instantiate the LSM template to obtain the first verified concurrent in-memory LSM tree implementation.more » « less
-
We propose a new data structure called CachedEmbeddings for training large scale deep learning recommendation models (DLRM) efficiently on heterogeneous (DRAM + non-volatile) memory platforms. CachedEmbeddings implements an implicit software-managed cache and data movement optimization that is integrated with the Julia programming framework to optimize the implementation of large scale DLRM implementations with multiple sparse embedded tables operations. In particular we show an implementation that is 1.4X to 2X better than the best known Intel CPU based implementations on state-of-the-art DLRM benchmarks on a real heterogeneous memory platform from Intel, and 1.32X to 1.45X improvement over Intel’s 2LM implementation that treats the DRAM as a hardware managed cache.more » « less
-
Recent advances in memory architectures have provoked renewed interest in near-data-processing (NDP) as way to alleviate the "memory wall" problem. An NDP architecture places logic circuits, such as simple processors, in close proximity to memory. Effective use of NDP architectures requires rethinking data structures and their algorithms. Here, we provide an empirical evaluation of several NDP-aware algorithms for general-purpose concurrent data structures such as linked-lists, skiplists, and FIFO queues. The empirical analysis reveals that the potential benefits of NDP-based concurrent data structures are less than what had been expected in earlier studies. In turn, we introduce lightweight NDP hardware modifications, inspired by initial observations on data access patterns and underlying DRAM activity. Even the minimal changes to hardware significantly improve the performance and energy consumption of NDP-based concurrent data structures, and in many cases, the resulting data structures outperform state-of-the-art concurrent data structures.more » « less
-
Manycore GPU architectures have become the mainstay for accelerating graph computations. One of the primary bottlenecks to performance of graph computations on manycore architectures is the data movement. Since most of the accesses in graph processing are due to vertex neighborhood lookups, locality in graph data structures plays a key role in dictating the degree of data movement. Vertex reordering is a widely used technique to improve data locality within graph data structures. However, these reordering schemes alone are not sufficient as they need to be complemented with efficient task allocation on manycore GPU architectures to reduce latency due to local cache misses. Consequently, in this article, we introduce a software/hardware co-design framework for accelerating graph computations. Our approach couples an architecture-aware vertex reordering with a priority-based task allocation technique. As the task allocation aims to reduce on-chip latency and associated energy, the choice of Network-on-Chip (NoC) as the communication backbone in the manycore platform is an important parameter. By leveraging emerging three-dimensional (3D) integration technology, we propose design of a small-world NoC (SWNoC)-enabled manycore GPU architecture, where the placement of the links connecting the streaming multiprocessors (SMs) and the memory controllers (MCs) follow a power-law distribution. The proposed 3D SWNoC-enabled software/hardware co-design framework achieves 11.1% to 22.9% performance improvement and 16.4% to 32.6% less energy consumption depending on the dataset and the graph application, when compared to the default order of dataset running on a conventional planar mesh architecture.more » « less