- Publication Date:
- NSF-PAR ID: 10167923
- Journal Name: IEEE International Symposium on High Performance Computer Architecture
- Page Range or eLocation-ID: 275 to 286
- Sponsoring Org: National Science Foundation
More Like this
-
Magneto-Electric FET (MEFET) is a recently developed post-CMOS FET that offers intriguing characteristics for high-speed, low-power design in both logic and memory applications. In this article, we present MeF-RAM, a non-volatile cache memory design based on a 2-Transistor-1-MEFET (2T1M) memory bit-cell with separate read and write paths. We show that with proper co-design across the MEFET device, memory cell circuit, and array architecture, MeF-RAM is a promising candidate for fast non-volatile memory (NVM). To evaluate its cache performance in the memory system, we build, for the first time, a device-to-architecture cross-layer evaluation framework to quantitatively analyze and benchmark the MeF-RAM design against other memory technologies, including volatile memories (i.e., SRAM, eDRAM) and popular emerging non-volatile memories (i.e., ReRAM, STT-MRAM, and SOT-MRAM). Experimental results for the PARSEC benchmark suite indicate that, as an L2 cache, MeF-RAM reduces the Energy-Area-Latency (EAT) product on average by ~98% and ~70% compared with typical 6T-SRAM and 2T1R SOT-MRAM counterparts, respectively.
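To make the Energy-Area-Latency (EAT) figure of merit concrete, the sketch below computes and compares EAT products for a few hypothetical L2-cache design points; the energy, area, and latency values are illustrative placeholders, not numbers reported for MeF-RAM or its counterparts.

```python
# Minimal sketch of an Energy-Area-Latency (EAT) product comparison for
# L2-cache design points. All energy/area/latency numbers below are
# illustrative placeholders, not values reported for MeF-RAM.

def eat_product(energy_nj: float, area_mm2: float, latency_ns: float) -> float:
    """Return the Energy x Area x Latency product for one design point."""
    return energy_nj * area_mm2 * latency_ns

# Hypothetical design points: (energy per access, array area, access latency).
designs = {
    "6T-SRAM":       eat_product(energy_nj=0.50, area_mm2=2.0, latency_ns=2.0),
    "2T1R SOT-MRAM": eat_product(energy_nj=0.30, area_mm2=1.0, latency_ns=3.0),
    "MeF-RAM":       eat_product(energy_nj=0.05, area_mm2=0.8, latency_ns=2.5),
}

baseline = designs["6T-SRAM"]
for name, eat in designs.items():
    saving = 100.0 * (1.0 - eat / baseline)
    print(f"{name:14s} EAT = {eat:6.3f}   ({saving:5.1f}% lower than 6T-SRAM)")
```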
-
In this paper, we propose ReDRAM, a reconfigurable DRAM-based processing-in-memory (PIM) accelerator that transforms the current DRAM architecture into massively parallel computational units by exploiting the high internal bandwidth of modern memory chips. ReDRAM uses the analog operation of DRAM sub-arrays and elevates it to implement a full set of 1- and 2-input bulk bit-wise operations (NOT, (N)AND, (N)OR, and even X(N)OR) between operands stored in the same bit-line, based on a new dual-row activation mechanism that requires only a modest change to peripheral circuits such as sense amplifiers. ReDRAM can be leveraged to greatly reduce the energy consumption and latency of complex in-DRAM logic computations relying on state-of-the-art mechanisms based on triple-row activation, dual-contact cells, row initialization, NOR style, etc. Extensive circuit-architecture simulations show that ReDRAM achieves on average 54× and 7.1× higher throughput for bulk bit-wise operations compared with a CPU and a GPU, respectively. In addition, ReDRAM outperforms recent processing-in-DRAM platforms with up to 3.7× better performance.
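For reference, the following minimal sketch emulates the ReDRAM bulk bit-wise operation set on NumPy bit vectors; it reproduces only the logical results, not the dual-row activation circuit or its timing and energy behavior.

```python
import numpy as np

# Functional sketch of the 1- and 2-input bulk bit-wise operations that
# ReDRAM performs inside DRAM sub-arrays. This emulates only the logical
# results on NumPy bit vectors; it does not model the dual-row activation
# mechanism itself.

def bulk_bitwise(row_a: np.ndarray, row_b: np.ndarray) -> dict:
    """Compute the full operation set over two operand rows of 0/1 values."""
    return {
        "NOT(A)": ~row_a & 1,
        "AND":    row_a & row_b,
        "NAND":   ~(row_a & row_b) & 1,
        "OR":     row_a | row_b,
        "NOR":    ~(row_a | row_b) & 1,
        "XOR":    row_a ^ row_b,
        "XNOR":   ~(row_a ^ row_b) & 1,
    }

# Two operand rows (e.g., 8 Kb each, as if stored along the same bit-lines).
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=8192, dtype=np.uint8)
b = rng.integers(0, 2, size=8192, dtype=np.uint8)

for op, out in bulk_bitwise(a, b).items():
    print(f"{op:7s} first bits: {out[:8].tolist()}")
```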
-
One of the most challenging obstacles to realizing exascale computing is minimizing the energy consumption of L2 cache, main memory, and the interconnects to that memory. For promising cryogenic computing schemes utilizing Josephson-junction superconducting logic, this obstacle is exacerbated by cryogenic system requirements that expose the technology's lack of high-density, high-speed, and power-efficient memory. Here we demonstrate an array of cryogenic memory cells consisting of a non-volatile three-terminal magnetic tunnel junction element driven by the spin Hall effect, combined with a superconducting heater-cryotron bit-select element. The write energy of these memory elements is roughly 8 pJ with a bit-select element designed to achieve a minimum overhead power consumption of about 30%. Individual magnetic memory cells measured at 4 K show reliable switching with write error rates below 10⁻⁶, and a 4 × 4 array can be fully addressed with bit-select error rates of 10⁻⁶. This demonstration is a first step towards a full cryogenic memory architecture targeting energy and performance specifications appropriate for applications in superconducting high-performance and quantum computing control systems, which require significant memory resources operating at 4 K.
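A back-of-envelope sketch of the reported figures (roughly 8 pJ per write, about 30% bit-select overhead, and a 10⁻⁶ write error rate) is shown below; the write count is hypothetical, and treating the power overhead as a fixed energy fraction is a deliberate simplification for illustration only.

```python
# Back-of-envelope sketch using the figures quoted above: ~8 pJ per write,
# ~30% bit-select overhead, and a ~1e-6 write error rate. The write count is
# hypothetical, and treating the 30% power overhead as a fixed energy
# fraction is a simplification made purely for illustration.

WRITE_ENERGY_PJ = 8.0        # per-cell write energy (from the abstract)
BIT_SELECT_OVERHEAD = 0.30   # minimum bit-select overhead (from the abstract)
WRITE_ERROR_RATE = 1e-6      # per-write error rate (from the abstract)

def write_cost(n_writes: int) -> tuple[float, float]:
    """Return (total write energy in nJ, expected write errors) for n_writes."""
    energy_pj = n_writes * WRITE_ENERGY_PJ * (1.0 + BIT_SELECT_OVERHEAD)
    return energy_pj / 1e3, n_writes * WRITE_ERROR_RATE

energy_nj, expected_errors = write_cost(n_writes=1_000_000)
print(f"1M writes: ~{energy_nj:.0f} nJ total, ~{expected_errors:.0f} expected write error(s)")
```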
-
This paper uses a mutual-information maximization paradigm to optimize the voltage levels written to cells in a Flash memory. To enable low latency, each page of Flash memory stores only one coded bit in each Flash memory cell. For example, three-level cell (TLC) Flash has three bit channels, one for each of three pages, that together determine which of eight voltage levels is written to each cell. Each Flash page is required to store the same number of data bits, but the various bits stored in a cell typically do not have to provide the same mutual information. A modified version of dynamic-assignment Blahut-Arimoto (DAB) moves the constellation points and adjusts the probability mass function for each bit channel to increase the mutual information of the worst bit channel, with the goal of each bit channel providing the same mutual information. The resulting constellation provides essentially the same mutual information to each page while negligibly reducing the mutual information of the overall constellation. The optimized constellations feature points that are neither equally spaced nor equally likely. However, modern shaping techniques such as probabilistic amplitude shaping can provide coded modulations that support such constellations.
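The quantity being equalized is the per-page (bit-channel) mutual information. The sketch below computes it numerically for a Gray-labeled, equally spaced, uniformly likely 8-level constellation read through additive Gaussian noise; the labeling, spacing, priors, and noise level are all simplifying assumptions, not the paper's optimized DAB output.

```python
import numpy as np

# Sketch: per-page (bit-channel) mutual information for an 8-level Flash
# constellation read through additive Gaussian noise. This is the quantity
# the modified DAB procedure equalizes across pages. The Gray labeling,
# equal spacing, uniform priors, and noise level are simplifying
# assumptions, not the paper's optimized constellation.

LEVELS = np.arange(8, dtype=float)                                   # write levels
LABELS = [0b000, 0b001, 0b011, 0b010, 0b110, 0b111, 0b101, 0b100]    # 3-bit Gray map
SIGMA = 0.45                                                         # assumed read-noise std dev
PRIOR = np.full(8, 1.0 / 8.0)                                        # uniform level probabilities

y = np.linspace(LEVELS.min() - 5 * SIGMA, LEVELS.max() + 5 * SIGMA, 4000)
dy = y[1] - y[0]
# likelihood[k, :] = p(y | level k), a Gaussian centered on each level
likelihood = np.exp(-(y[None, :] - LEVELS[:, None]) ** 2 / (2 * SIGMA**2))
likelihood /= SIGMA * np.sqrt(2 * np.pi)
p_y = PRIOR @ likelihood                                             # marginal read density

def bit_channel_mi(bit: int) -> float:
    """Mutual information I(B_bit; Y) in bits for one page's bit channel."""
    mi = 0.0
    for b in (0, 1):
        mask = np.array([((lab >> bit) & 1) == b for lab in LABELS])
        p_b = PRIOR[mask].sum()
        p_y_given_b = (PRIOR[mask] / p_b) @ likelihood[mask]
        integrand = p_y_given_b * np.log2(np.maximum(p_y_given_b, 1e-300) / p_y)
        mi += p_b * integrand.sum() * dy                             # simple Riemann sum
    return mi

for bit in range(3):
    print(f"bit channel {bit} (one page): I = {bit_channel_mi(bit):.3f} bits")
```

With a tighter noise level the three values converge toward 1 bit per page; the DAB variant described above instead reshapes the level positions and probabilities so the worst page catches up.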
-
Eye tracking has become an essential human-machine interaction modality for providing an immersive experience in numerous virtual and augmented reality (VR/AR) applications that demand high throughput (e.g., 240 FPS), a small form factor, and enhanced visual privacy. However, existing eye tracking systems are still limited by (1) their large form factor, largely due to the bulky lens-based cameras they adopt; (2) the high communication cost between the camera and the backend processor; and (3) low visual privacy that raises potential concerns, which together prohibit their wider deployment. To this end, we propose, develop, and validate a lensless FlatCam-based eye tracking algorithm and accelerator co-design framework dubbed EyeCoD to enable eye tracking systems with a much reduced form factor and boosted system efficiency without sacrificing tracking accuracy, paving the way for next-generation eye tracking solutions. On the system level, we advocate the use of lensless FlatCams instead of lens-based cameras to facilitate the small-form-factor need of mobile eye tracking systems, which also leaves room for a dedicated sensing-processor co-design to reduce the required camera-processor communication latency. On the algorithm level, EyeCoD integrates a predict-then-focus pipeline that first predicts the region-of-interest (ROI) via segmentation and then focuses only on the ROI parts to estimate gaze directions, greatly reducing redundant computations and […]
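The predict-then-focus idea can be illustrated with a minimal sketch: segment the full frame to predict the ROI, then run gaze estimation only on that crop. The two model stubs below are placeholders standing in for EyeCoD's segmentation and gaze networks, and the frame is a random stand-in for a reconstructed FlatCam image.

```python
import numpy as np

# Minimal sketch of a predict-then-focus eye-tracking pipeline in the spirit
# of EyeCoD: first predict a region-of-interest (ROI) from the full frame,
# then run gaze estimation only on that ROI crop. Both model stubs below are
# placeholders, not EyeCoD's actual networks.

def segment_roi(frame: np.ndarray) -> tuple[int, int, int, int]:
    """Placeholder ROI predictor: returns (top, left, height, width) of the eye region."""
    h, w = frame.shape
    return h // 4, w // 4, h // 2, w // 2          # stub: central crop stands in for the eye ROI

def estimate_gaze(roi: np.ndarray) -> tuple[float, float]:
    """Placeholder gaze regressor: maps an ROI crop to (yaw, pitch) in degrees."""
    cy, cx = np.unravel_index(np.argmax(roi), roi.shape)   # brightest pixel as a crude proxy
    return (cx / roi.shape[1] - 0.5) * 60.0, (cy / roi.shape[0] - 0.5) * 40.0

def predict_then_focus(frame: np.ndarray) -> tuple[float, float]:
    """Segment the full frame once, then estimate gaze from the ROI only."""
    top, left, height, width = segment_roi(frame)
    roi = frame[top:top + height, left:left + width]
    return estimate_gaze(roi)

frame = np.random.default_rng(1).random((480, 640))  # stand-in for a reconstructed eye image
yaw, pitch = predict_then_focus(frame)
print(f"gaze estimate: yaw ~ {yaw:.1f} deg, pitch ~ {pitch:.1f} deg")
```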