skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks
The pin count largely determines the cost of a chip package, which is often comparable to the cost of a die. In 3D processor-memory designs, power and ground (P/G) pins can account for the majority of the pins. This is because packages include separate pins for the disjoint processor and memory power delivery networks (PDNs). Supporting separate PDNs and P/G pins for processor and memory is inefficient, as each set has to be provisioned for the worst-case power delivery requirements. In this paper, we propose to reduce the number of P/G pins of both processor and memory in a 3D design, and dynamically and opportunistically divert some power between the two PDNs on demand. To perform the power transfer, we use a small bidirectional on-chip voltage regulator that connects the two PDNs. Our concept, called Snatch, is effective. It allows the computer to execute code sections with high processor or memory power requirements without having to throttle performance. We evaluate Snatch with simulations of an 8-core multicore stacked with two memory dies. In a set of compute-intensive codes, the processor snatches memory power for 30% of the time on average, speeding-up the codes by up to 23% over advanced turbo-boosting; in memory-intensive codes, the memory snatches processor power. Alternatively, Snatch can reduce the package cost by about 30%.  more » « less
Award ID(s):
1649432
PAR ID:
10025425
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Date Published:
Journal Name:
International Conference on Microarchitecture (MICRO)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Graph application workloads are dominated by random memory accesses with poor locality. To tackle the irregular and sparse nature of computation, ReRAM-based Processing-in-Memory (PIM) architectures have been proposed recently. Most of these ReRAM architecture designs have focused on mapping graph computations into a set of multiply-and-accumulate (MAC) operations. ReRAMs also offer a key advantage in reducing memory latency between cores and memory by allowing for processing-in-memory (PIM). However, when implemented on a ReRAM-based manycore architecture, graph applications still pose two key challenges – significant storage requirements (particularly due to wasted zero cell storage), and significant amount of on-chip traffic. To tackle these two challenges, in this paper we propose the design of a 3D NoC-enabled ReRAM-based manycore architecture. Our proposed architecture incorporates a novel crossbar-aware node reordering to reduce ReRAM storage requirements. Secondly, its 3D NoC-enabled design reduces on-chip communication latency. Our architecture outperforms the state-of-the-art in ReRAM-based graph acceleration by up to 5x in performance while consuming up to 10.3x less energy for a range of graph inputs and workloads. 
    more » « less
  2. As more critical applications move to the cloud, there is a pressing need to provide privacy guarantees for data and computation. While cloud infrastructures are vulnerable to a variety of attacks, in this work, we focus on an attack model where an untrusted cloud operator has physical access to the server and can monitor the signals emerging from the processor socket. Even if data packets are encrypted, the sequence of addresses touched by the program serves as an information side channel. To eliminate this side channel, Oblivious RAM constructs have been investigated for decades, but continue to pose large overheads. In this work, we make the case that ORAM overheads can be significantly reduced by moving some ORAM functionality into the memory system. We first design a secure DIMM (or SDIMM) that uses commodity low-cost memory and an ASIC as a secure buffer chip. We then design two new ORAM protocols that leverage SDIMMs to reduce bandwidth, latency, and energy per ORAM access. In both protocols, each SDIMM is responsible for part of the ORAM tree. Each SDIMM performs a number of ORAM operations that are not visible to the main memory channel. By having many SDIMMs in the system, we are able to achieve highly parallel ORAM operations. The main memory channel uses its bandwidth primarily to service blocks requested by the CPU, and to perform a small subset of the many shuffle operations required by conventional ORAM. The new protocols guarantee the same obliviousness properties as Path ORAM. On a set of memory-intensive workloads, our two new ORAM protocols - Independent ORAM and Split ORAM - are able to improve performance by 1.9x and energy by 2.55x, compared to Freecursive ORAM. 
    more » « less
  3. Single-photon 3D cameras can record the time of arrival of billions of photons per second with picosecond accuracy. One common approach to summarize the photon data stream is to build a per-pixel timestamp histogram, resulting in a 3D histogram tensor that encodes distances along the time axis. As the spatio-temporal resolution of the histogram tensor increases, the in-pixel memory requirements and output data rates can quickly become impractical. To overcome this limitation, we propose a family of linear compressive representations of histogram tensors that can be computed efficiently, in an online fashion, as a matrix operation. We design practical lightweight compressive representations that are amenable to an in-pixel implementation and consider the spatio-temporal information of each timestamp. Furthermore, we implement our proposed framework as the first layer of a neural network, which enables the joint end-to-end optimization of the compressive representations and a downstream SPAD data processing model. We find that a well-designed compressive representation can reduce in-sensor memory and data rates up to 2 orders of magnitude without significantly reducing 3D imaging quality. Finally, we analyze the power consumption implications through an on-chip implementation. 
    more » « less
  4. The complexity of manycore System-on-chips (SoCs) is growing faster than our ability to manage them to reduce the overall energy consumption. Further, as SoC design moves towards 3D-architectures, the core's power density increases leading to unacceptable high peak chip temperatures. In this paper, we consider the optimization problem of dynamic power management (DPM) in manycore SoCs for an allowable performance penalty (say 5%) and admissible peak chip temperature. We employ a machine learning (ML) based DPM policy, which selects the voltage/frequency (V/F) levels for different cluster of cores as a function of the application workload features such as core computation and inter-core traffic etc. We propose a novel learning-to-search (L2S) framework to automatically identify an optimized sequence of DPM decisions from a large combinatorial space for joint energy-thermal optimization for one or more given applications. The optimized DPM decisions are given to a supervised learning algorithm to train a DPM policy, which mimics the corresponding decision-making behavior. Our experiments on two different manycore architectures designed using wireless interconnect and monolithic 3D demonstrate that principles behind the L2S framework are applicable for more than one configuration. Moreover, L2S-based DPM policies achieve up to 30 energy-delay product savings and reduce the peak chip temperature by up to 17 °C compared to the state-of-the-art ML methods for an allowable performance overhead of only 5 . 
    more » « less
  5. Superconducting qubits provide a promising approach to large-scale fault-tolerant quantum computing. However, qubit connectivity on a planar surface is typically restricted to only a few neighboring qubits. Achieving longer-range and more flexible connectivity, which is particularly appealing in light of recent developments in error-correcting codes, however, usually involves complex multilayer packaging and external cabling, which is resource intensive and can impose fidelity limitations. Here, we propose and realize a high-speed on-chip quantum processor that supports reconfigurable all-to-all coupling with a large on-off ratio. We implement the design in a four-node quantum processor, built with a modular design comprising a wiring substrate coupled to two separate qubit-bearing substrates, each including two single-qubit nodes. We use this device to demonstrate reconfigurable controlled- Z gates across all qubit pairs, with a benchmarked average fidelity of 96.00 % ± 0.08 % and best fidelity of 97.14 % ± 0.07 % , limited mainly by dephasing in the qubits. We also generate multiqubit entanglement, distributed across the separate modules, demonstrating GHZ-3 and GHZ-4 states with fidelities of 88.15 % ± 0.24 % and 75.18 % ± 0.11 % , respectively. This approach promises efficient scaling to larger-scale quantum circuits and offers a pathway for implementing quantum algorithms and error-correction schemes that benefit from enhanced qubit connectivity. Published by the American Physical Society2024 
    more » « less