Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
The performance of today's in-memory indexes is bottlenecked by the memory latency/bandwidth wall. Processing-in-memory (PIM) is an emerging approach that potentially mitigates this bottleneck, by enabling low-latency memory access whose aggregate memory bandwidth scales with the number of PIM nodes. There is an inherent tension, however, between minimizing inter-node communication and achieving load balance in PIM systems, in the presence of workload skew. This paper presents PIM-tree , an ordered index for PIM systems that achieves both low communication and high load balance, regardless of the degree of skew in data and queries. Our skew-resistant index is based on a novel division of labor between the host CPU and PIM nodes, which leverages the strengths of each. We introduce push-pull search , which dynamically decides whether to push queries to a PIM-tree node or pull the node's keys back to the CPU based on workload skew. Combined with other PIM-friendly optimizations ( shadow subtrees and chunked skip lists ), our PIM-tree provides high-throughput, (guaranteed) low communication, and (guaranteed) high load balance, for batches of point queries, updates, and range scans. We implement PIM-tree, in addition to prior proposed PIM indexes, on the latest PIM system from UPMEM, with 32 CPU cores and 2048 PIM nodes. On workloads with 500 million keys and batches of 1 million queries, the throughput using PIM-trees is up to 69.7X and 59.1x higher than the two best prior PIM-based methods. As far as we know these are the first implementations of an ordered index on a real PIM system.more » « less
-
This article introduces the first open-source FPGA-based infrastructure, MetaSys, with a prototype in a RISC-V system, to enable the rapid implementation and evaluation of a wide range of cross-layer techniques in real hardware. Hardware-software cooperative techniques are powerful approaches to improving the performance, quality of service, and security of general-purpose processors. They are, however, typically challenging to rapidly implement and evaluate in real hardware as they require full-stack changes to the hardware, system software, and instruction-set architecture (ISA). MetaSys implements a rich hardware-software interface and lightweight metadata support that can be used as a common basis to rapidly implement and evaluate new cross-layer techniques. We demonstrate MetaSys’s versatility and ease-of-use by implementing and evaluating three cross-layer techniques for: (i) prefetching in graph analytics; (ii) bounds checking in memory unsafe languages, and (iii) return address protection in stack frames; each technique requiring only ~100 lines of Chisel code over MetaSys. Using MetaSys, we perform the first detailed experimental study to quantify the performance overheads of using a single metadata management system to enable multiple cross-layer optimizations in CPUs. We identify the key sources of bottlenecks and system inefficiency of a general metadata management system. We design MetaSys to minimize these inefficiencies and provide increased versatility compared to previously proposed metadata systems. Using three use cases and a detailed characterization, we demonstrate that a common metadata management system can be used to efficiently support diverse cross-layer techniques in CPUs. MetaSys is completely and freely available at https://github.com/CMU-SAFARI/MetaSys .more » « less
-
RACOD is an algorithm/hardware co-design for mobile robot path planning. It consists of two main components: CODAcc, a hardware accelerator for collision detection; and RASExp, an algorithm extension for runahead path exploration. CODAcc uses a novel MapReduce-style hardware computational model and massively parallelizes individual collision checks. RASExp predicts future path explorations and proactively computes its collision status ahead of time, thereby overlapping multiple collision detections. By affording multiple cheap CODAcc accelerators and overlapping collision detections using RASExp, RACOD significantly accelerates planning for mobile robots operating in arbitrary environments. Evaluations of popular benchmarks show up to 41.4× (self-driving cars) and 34.3× (pilotless drones) speedup with less than 0.3% area overhead. While the performance is maximized when CODAcc and RASExp are used together, they can also be used individually. To illustrate, we evaluate CODAcc alone in the context of a stationary robotic arm and show that it improves performance by 3.4×–3.8×. Also, we evaluate RASExp alone on commodity many-core CPU and GPU platforms by implementing it purely in software and show that with 32/128 CPU/GPU threads, it accelerates the end-to-end planning time by 8.6×/2.9×.more » « less
-
The emergence of “robotics in the wild” has triggered a wave of recent research in hardware and software to boost robots’ compute capabilities. Nevertheless, research in this area is hindered by the lack of a comprehensive benchmark suite. In this paper, we present RTRBench, a benchmark suite for robotic kernels. RTRBench includes 16 kernels, spanning the entire software pipeline of a wide swath of robots, all implemented in C++ for fast execution. Together with the suite, we conduct an evaluation of the workloads at the architecture level. We pinpoint the sources of inefficiencies in a modern robotic processor when executing the robotic kernels, along with the opportunities for improvements. The source code of the benchmark suite is available in https://cmu-roboarch.github.io/rtrbench/.more » « less