Title: A General-Purpose Compute-in-Memory Processor Combining CPU and Deep Learning with Elevated CPU Efficiency and Enhanced Data Locality
Award ID(s): 2008906
NSF-PAR ID: 10464006
Author(s) / Creator(s):
Date Published:
Journal Name: Symposium on VLSI Technology and Circuits
Page Range / eLocation ID: 1 to 2
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The special computational challenges of simulating 3-D hydrodynamics in deep stellar interiors are discussed, and numerical algorithmic responses described. Results of recent simulations carried out at scale on the NSF's Blue Waters machine at the University of Illinois are presented, with a special focus on the computational challenges they address. Prospects for future work using GPU-accelerated nodes such as those on the DoE's new Summit machine at Oak Ridge National Laboratory are described, with a focus on numerical algorithmic accommodations that we believe will be necessary. 
  2. Deep Learning Recommendation Models (DLRMs) are very popular in personalized recommendation systems and are a major contributor to data-center AI cycles. Due to the high computational and memory-bandwidth needs of DLRMs, specifically the embedding stage of DLRM inference, both CPUs and GPUs are used for hosting such workloads. This is primarily because of the heavy irregular memory accesses in the embedding stage of computation, which lead to significant stalls in the CPU pipeline. As model and parameter sizes keep increasing with newer recommendation models, the computational dominance of the embedding stage also grows, thereby bringing into question the suitability of CPUs for inference. In this paper, we first quantify the cause of irregular accesses and their impact on caches, and observe that off-chip memory access is the main contributor to high latency. Therefore, we exploit two well-known techniques: (1) software prefetching, to hide the memory-access latency suffered by demand loads, and (2) overlapping computation and memory accesses via hyperthreading, to reduce CPU stalls and minimize overall execution time. We evaluate our work on single-core and 24-core configurations with the latest recommendation models and recently released production traces. Our integrated techniques speed up inference by up to 1.59x and by 1.4x on average. (A minimal sketch of the prefetching idea follows this list.)
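Below is a minimal C sketch of the software-prefetching technique described in the second abstract: issuing non-binding prefetches a fixed lookahead distance ahead of an irregular embedding-table gather so that the demand loads hit in cache rather than stalling on DRAM. The function name embedding_gather, the sum-pooling reduction, and the PREFETCH_DISTANCE of 8 are illustrative assumptions for this sketch, not details taken from the paper.

    /* Minimal illustration of software prefetching for an embedding-table
       gather with sum pooling. PREFETCH_DISTANCE and the function name are
       assumptions for this sketch, not parameters from the paper. */
    #include <stddef.h>

    #define PREFETCH_DISTANCE 8   /* assumed lookahead; tuned empirically in practice */

    void embedding_gather(const float *table, size_t dim,
                          const int *indices, size_t n, float *out)
    {
        for (size_t d = 0; d < dim; ++d)
            out[d] = 0.0f;                       /* clear the pooled output row */

        for (size_t i = 0; i < n; ++i) {
            if (i + PREFETCH_DISTANCE < n) {
                /* Non-binding hint: start fetching a row needed a few iterations
                   from now, so the demand load below finds it in cache. */
                __builtin_prefetch(&table[(size_t)indices[i + PREFETCH_DISTANCE] * dim],
                                   0 /* read */, 1 /* low temporal locality */);
            }
            const float *row = &table[(size_t)indices[i] * dim];
            for (size_t d = 0; d < dim; ++d)
                out[d] += row[d];                /* sum-pool the gathered row */
        }
    }

The second technique from the abstract, overlapping computation and memory access via hyperthreading, is not shown here; roughly, it amounts to scheduling memory-bound lookup work and compute-bound work on sibling hardware threads so one hides the stalls of the other.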