Title: Making BRAMs Compute: Creating Scalable Computational Memory Fabric Overlays
The increasing density of distributed BRAMs diffused throughout modern Field Programmable Gate Arrays (FPGAs) is ideal for forming processor in/near memory architectures. Such architectures break the traditional von Neumann memory bottleneck that limits concurrency and degrades energy efficiency. Ideally, processing density should scale linearly with BRAM capacity, and clock frequencies should be set by the read/write access times of the BRAM. In this paper, we present a PIM overlay that achieves these goals. Compared to prior published work, we observe a 2.25× improvement in performance, a 2× improvement in logic resource utilization, and a 17× reduction in accumulation delay.
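
The central idea, placing a small amount of compute logic next to each BRAM so that processing elements scale with memory blocks, can be illustrated with a behavioral model. Below is a minimal sketch in Python, assuming a hypothetical bit-serial processing element; the class name BramPE and all parameters are illustrative, not the paper's RTL:

# Hypothetical behavioral model of a bit-serial processor-in-memory element.
# This is an illustrative sketch, not the overlay architecture from the paper.

class BramPE:
    """One processing element co-located with a small memory block."""

    def __init__(self, depth=32, width=16):
        self.mem = [0] * depth        # stands in for one BRAM
        self.width = width
        self.acc = 0                  # running accumulator
        self.carry = 0                # carry bit of the serial adder

    def write(self, addr, value):
        self.mem[addr] = value & ((1 << self.width) - 1)

    def accumulate_bit_serial(self, addr):
        """Add mem[addr] into acc one bit per 'cycle', LSB first."""
        operand = self.mem[addr]
        result, self.carry = 0, 0
        for i in range(self.width + 1):   # +1 cycle to flush the final carry
            a = (self.acc >> i) & 1
            b = (operand >> i) & 1
            s = a ^ b ^ self.carry
            self.carry = (a & b) | (a & self.carry) | (b & self.carry)
            result |= s << i
        self.acc = result & ((1 << (self.width + 1)) - 1)

# Processing density scales with the number of memory blocks: every PE
# works on its own data in the same cycles, with no shared memory bus.
pes = [BramPE() for _ in range(4)]
for pe in pes:
    pe.write(0, 3)
    pe.write(1, 5)
    pe.accumulate_bit_serial(0)
    pe.accumulate_bit_serial(1)
assert all(pe.acc == 8 for pe in pes)

Because every element owns its data and its adder, adding more memory blocks adds compute in the same proportion, which is the linear scaling the abstract argues for.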
Award ID(s):
1956071
PAR ID:
10435539
Author(s) / Creator(s):
Date Published:
Journal Name:
Proc. of the 31st IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2023)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The dominance of machine learning and the ending of Moore’s law have renewed interest in Processor in Memory (PIM) architectures. This interest has produced several recent proposals to modify an FPGA’s BRAM architecture to form a next-generation PIM reconfigurable fabric [1], [2]. PIM architectures can also be realized within today’s FPGAs as overlays without the need to modify the underlying FPGA architecture. To date, there has been no study of the comparative advantages of the two approaches. In this paper, we present a study that compares two proposed custom architectures with a PIM overlay running on a commodity FPGA. We created PiCaSO, a Processor in/near Memory Scalable and Fast Overlay architecture, as a representative PIM overlay. The results of this study show that the PiCaSO overlay achieves up to 80% of the peak throughput of the custom designs with 2.56× shorter latency and 25% – 43% better BRAM memory utilization efficiency. We then show how several key features of the PiCaSO overlay can be integrated into the custom PIM designs to further improve their throughput by 18%, latency by 19.5%, and memory efficiency by 6.2%.
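
The reported ratios can be combined into a quick back-of-the-envelope comparison. The Python sketch below uses only the factors quoted in the abstract; the absolute baseline values are placeholders, not measurements from the study:

# Back-of-the-envelope comparison using the ratios quoted in the abstract.
# Baseline (custom PIM) numbers are hypothetical placeholders; only the
# relative factors come from the study.

custom_peak_throughput = 100.0   # arbitrary units (placeholder)
custom_latency = 1.0             # arbitrary units (placeholder)

# PiCaSO overlay: up to 80% of custom peak throughput, 2.56x shorter latency.
overlay_throughput = 0.80 * custom_peak_throughput
overlay_latency = custom_latency / 2.56

# Custom designs after adopting PiCaSO's key features: +18% throughput,
# -19.5% latency (memory efficiency similarly improves by 6.2%).
improved_throughput = custom_peak_throughput * 1.18
improved_latency = custom_latency * (1 - 0.195)

print(f"overlay:         {overlay_throughput:.1f} units, {overlay_latency:.3f} latency")
print(f"improved custom: {improved_throughput:.1f} units, {improved_latency:.3f} latency")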
  2. The paper summarizes the single-event upset (SEU) results obtained from neutron testing of the UltraScale+ MPSoC ZU9EG device. This complex device contains a large amount of programmable logic and multiple processor cores. Tests were performed on the programmable logic and the processing system simultaneously. Estimates of the single-event upset neutron cross section were obtained for the programmable logic CRAM, BRAM, OCM, and cache memories. During the test, no processor crashes or silent data corruptions were observed. In addition, a processor failure cross section was estimated for several software benchmarks operating on the various processor cores. Several FPGA CRAM scrubbers were tested, including an external JTAG scrubber, the Xilinx “SEM” IP, and the PCAP operating in bare metal. In parallel with these tests, single-event-induced high-current events were monitored using an external power supply and monitoring scripts.
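
For context, a per-bit upset cross section is conventionally estimated as the number of observed upsets divided by the particle fluence and the number of bits under test. The sketch below applies that textbook formula with made-up inputs; none of these numbers come from the paper:

# Standard per-bit SEU cross-section estimate:
#   sigma_bit = N_upsets / (fluence * N_bits)   [cm^2 / bit]
# All values below are made-up placeholders, not results from the test.

n_upsets = 1200     # upsets observed during the run (placeholder)
fluence = 3.0e10    # neutrons per cm^2 delivered (placeholder)
n_bits = 80e6       # CRAM bits under test (placeholder)

sigma_bit = n_upsets / (fluence * n_bits)
print(f"per-bit cross section: {sigma_bit:.3e} cm^2/bit")
# -> 5.000e-16 cm^2/bit for these placeholder inputs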
  3.
    We present Fleet, a framework that offers a massively parallel streaming model for FPGAs and is effective in a number of domains well-suited for FPGA acceleration, including parsing, compression, and machine learning. Fleet requires the user to specify RTL for a processing unit that serially processes every input token in a stream, a far simpler task than writing a parallel processing unit. It then takes the user’s processing unit and generates a hardware design with many copies of the unit as well as memory controllers to feed the units with separate streams and drain their outputs. Fleet includes a Chisel-based processing unit language. The language maintains Chisel’s low-level performance control while adding a few productivity features, including automatic handling of ready-valid signaling and a native and automatically pipelined BRAM type. We evaluate Fleet on six different applications, including JSON parsing and integer compression, fitting hundreds of Fleet processing units on the Amazon F1 FPGA and outperforming CPU implementations by over 400× and GPU implementations by over 9× in performance per watt while requiring a similar number of lines of code. 
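
Fleet's programming model, write one strictly serial unit and let the framework replicate it across independent streams, can be mimicked in a few lines. The following is a loose Python analogy using generator functions as stand-ins for processing units; Fleet itself consumes Chisel RTL, and the run-length example here is purely illustrative:

# Loose Python analogy of Fleet's model: the user writes ONE serial
# processing unit; the framework replicates it across independent streams.
# Fleet itself takes Chisel RTL units, not Python functions.

from typing import Callable, Iterable, Iterator, List, Tuple

def run_length_unit(tokens: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """A serial unit: consumes one token at a time, emits (value, count)."""
    current, count = None, 0
    for t in tokens:                 # strictly one-token-at-a-time processing
        if t == current:
            count += 1
        else:
            if current is not None:
                yield (current, count)
            current, count = t, 1
    if current is not None:
        yield (current, count)

def replicate(unit: Callable, streams: List[Iterable[int]]) -> List[list]:
    """Stand-in for Fleet's replication: one unit instance per input stream."""
    return [list(unit(s)) for s in streams]

streams = [[1, 1, 2], [5, 5, 5, 9]]
print(replicate(run_length_unit, streams))
# -> [[(1, 2), (2, 1)], [(5, 3), (9, 1)]]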
  4.
    Background: Patients with uncomplicated cases of concussion are thought to fully recover within several months as symptoms resolve. However, at the group level, undergraduates reporting a history of concussion (mean: 4.14 years post-injury) show lasting deficits in visual working memory performance. To clarify what predicts long-term visual working memory outcomes given heterogeneous performance across group members, we investigated factors surrounding the injury, including gender, number of mild traumatic brain injuries, time since mild traumatic brain injury (mTBI), loss of consciousness (LOC; yes or no), and mTBI etiology (non-sport, team sport, high-impact sport, and individual sport). We also collected a low-density resting-state electroencephalogram (EEG) to test whether spectral power was correlated with performance. Aim: The purpose of this study was to identify predictors of poor visual working memory outcomes in current undergraduates with a history of concussion. Methods: Participants provided a brief history of their injury and symptoms. Participants also completed an experimental visual working memory task. Finally, low-density resting-state EEG was collected. Results: The key observation was that LOC at the time of injury predicted superior visual working memory years later. In contrast, visual working memory performance was not predicted by other factors, including etiology, high-impact sports, or EEG spectral power. Conclusions: Visual working memory deficits are apparent at the group level in current undergraduates with a history of concussion. LOC at the time of concussion predicts less impaired visual working memory performance, whereas no significant links were found for other factors. One interpretation is that patients who experience LOC are more likely to seek medical advice than those without LOC. Relevance for patients: Concussion is a head injury associated with future cognitive changes in some people. Concussion should be taken seriously, and medical treatment sought whenever a head injury occurs.
  5. Remote memory techniques are gaining traction in datacenters because they can significantly improve memory utilization. A popular approach is to use kernel-level, page-based memory swapping to deliver remote memory; because it is transparent, existing applications can benefit without modification. Unfortunately, current implementations suffer from high software overheads, resulting in significantly worse tail latency and throughput relative to local memory. Hermit is a redesigned swap system that overcomes this limitation through a novel technique called adaptive, feedback-directed asynchrony. It takes non-urgent but time-consuming operations (e.g., swap-out, cgroup charging, and I/O deduplication) off the fault-handling path and executes them asynchronously. Unlike prior work such as Fastswap, Hermit collects runtime feedback and uses it to direct how asynchrony is performed: whether asynchronous operations should be enabled, the level of asynchrony, and how asynchronous operations should be scheduled. We implemented Hermit in Linux 5.14. An evaluation with a set of latency-critical applications shows that Hermit delivers low-latency remote memory. For example, it reduces the 99th-percentile latency of Memcached by 99.7%, from 36 ms to 91 µs. Running Hermit over batch applications improves their overall throughput by 1.24× on average. These results are achieved without changing a single line of user code.
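
The core mechanism, deciding at runtime whether and how aggressively to move work off the fault-handling path, can be sketched as a small feedback controller. The Python model below is a hypothetical illustration of feedback-directed asynchrony; the class, thresholds, and levels are invented for exposition and are not Hermit's kernel implementation:

# Hypothetical model of feedback-directed asynchrony: a controller watches
# recent fault-handling latency and adjusts how much non-urgent work
# (e.g., swap-out) is moved off the critical path. An illustration of the
# idea only, not Hermit's Linux implementation.

from collections import deque

class AsyncController:
    def __init__(self, target_us: float = 100.0, window: int = 64):
        self.target_us = target_us           # latency goal for fault handling
        self.samples = deque(maxlen=window)  # recent fault latencies (feedback)
        self.async_level = 0                 # 0 = all sync .. 4 = max offload

    def record_fault_latency(self, us: float) -> None:
        self.samples.append(us)

    def adjust(self) -> int:
        """Raise asynchrony when faults run slow, back off when they run fast."""
        if not self.samples:
            return self.async_level
        avg = sum(self.samples) / len(self.samples)
        if avg > self.target_us and self.async_level < 4:
            self.async_level += 1            # offload more work asynchronously
        elif avg < 0.5 * self.target_us and self.async_level > 0:
            self.async_level -= 1            # async overhead no longer pays off
        return self.async_level

ctrl = AsyncController()
for lat in [250, 300, 220, 180]:             # a burst of slow faults
    ctrl.record_fault_latency(lat)
print("async level:", ctrl.adjust())         # -> 1: start offloading swap-out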