skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Greatly Accelerated Scaling of Streaming Problems with A Migrating Thread Architecture
Applications where continuous streams of data are passed through large data structures are becoming of increasing importance. However, their execution on conventional architectures, especially when parallelism is desired to boost performance, is highly inefficient. The primary issue is often with the need to stream large numbers of disparate data items through the equivalent of very large hash tables distributed across many nodes. This paper builds on some prior work on the Firehose streaming benchmark where an emerging architecture using threads that can migrate through memory has shown to be much more efficient at such problems. This paper extends that work to use a second generation system to not only show that same improved efficiency ( 10X) for larger core counts, but even significantly higher raw performance (with FPGA-based cores running at 1/10th the clock of conventional systems). Further, this additional data yields insight into what resources represent the bottlenecks to even more performance, and make a reasonable projection that implementation of such an architecture with current technology would lead to 10X performance gain on an apples-to-apples basis with conventional systems.  more » « less
Award ID(s):
1822939 2016701
PAR ID:
10402606
Author(s) / Creator(s):
;
Date Published:
Journal Name:
2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)
Page Range / eLocation ID:
11 to 18
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Applications where streams of data are passed through large data structures are becoming of increasing importance.Unfortunately, when implemented on conventional architectures such applications become horribly inefficient, especially when attempts are made to scale up performance via some sort of parallelism. This paper discusses the implementation of the Firehose streaming benchmark on a novel parallel architecture with greatly enhanced multi-threading characteristics that avoids the conventional inefficiencies. Results are promising, with both far better scaling and increased performance over previously reported implementations, on a prototype platform with considerably less intrinsic hardware computational resources. 
    more » « less
  2. null (Ed.)
    Applications where streams of data are passed through large data structures are becoming of increasing importance.Unfortunately, when implemented on conventional architectures such applications become horribly inefficient, especially when attempts are made to scale up performance via some sort of parallelism. This paper discusses the implementation of the Firehose streaming benchmark on a novel parallel architecture with greatly enhanced multi-threading characteristics that avoids the conventional inefficiencies. Results are promising, with both far better scaling and increased performance over previously reported implementations, on a prototype platform with considerably less intrinsic hardware computational resources. 
    more » « less
  3. null (Ed.)
    Overcoming the conventional trade-off between throughput and bit error rate (BER) performance, versus computational complexity is a long-term challenge for uplink Multiple-Input Multiple-Output (MIMO) detection in base station design for the cellular 5G New Radio roadmap, as well as in next generation wireless local area networks. In this work, we present ParaMax, a MIMO detector architecture that for the first time brings to bear physics-inspired parallel tempering algorithmic techniques [28, 50, 67] on this class of problems. ParaMax can achieve near optimal maximum-likelihood (ML) throughput performance in the Large MIMO regime, Massive MIMO systems where the base station has additional RF chains, to approach the number of base station antennas, in order to support even more parallel spatial streams. ParaMax is able to achieve a near ML-BER performance up to 160 × 160 and 80 × 80 Large MIMO for low-order modulations such as BPSK and QPSK, respectively, only requiring less than tens of processing elements. With respect to Massive MIMO systems, in 12 × 24 MIMO with 16-QAM at SNR 16 dB, ParaMax achieves 330 Mbits/s near-optimal system throughput with 4--8 processing elements per subcarrier, which is approximately 1.4× throughput than linear detector-based Massive MIMO systems. 
    more » « less
  4. null (Ed.)
    In the past we have seen two major \walls" (memory and power) whose vanquishing required significant advances in architecture. This paper discusses evidence of a third wall dealing with data locality, which is prevalent in data intensive applications where computation is dominated by memory access and movement - not flops, Such apps exhibit large sets of often persistent data, with little reuse during computation, no predictable regularity, significantly different scaling characteristics, and where streaming is becoming important. Further, as we move to highly parallel algorithms (as in running in the cloud), these issues will get even worse. Solving such problems will take a new set of innovations in architecture. In addition to data on the new wall, this paper will look at one possible technique: the concept of migrating threads, and give evidence of its potential value based on several benchmarks that have scaling difficulties on conventional architectures. 
    more » « less
  5. null (Ed.)
    Applications where streams of data are passed through large data structures are becoming of increasing importance. For instance network intrusion detection and cyber security as a whole rely on real time analysis of network traffic. Unfortunately, when implemented on conventional architectures such applications become horribly inefficient, especially when attempts are made to scale up performance via some sort of parallelism. An earlier paper discussed streaming anomaly detection within a stream having an unbounded range of keys on the Lucata migrating thread architecture. In this paper we introduce \textit{Deluge}, a new implementation that addresses several inadequacies of previous designs and seeks to more directly target the hardware efficiencies inherent to migratory execution within a PGAS address space. Deluge achieves major improvements in hardware efficiency, throughput, and scalability over previous implementations. 
    more » « less