

Search for: All records

Award ID contains: 1822939


  1. Heterogeneity in computing systems is clearly increasing, especially as “accelerators” burrow deeper and deeper into different parts of an architecture. What is new, however, is a rapid change in not only the number of such heterogeneous processors, but in their connectivity to other structures, such as cores with different ISAs or smart memory interfaces. Technologies such as chiplets are accelerating this trend. This paper is focused on the problem of how to architect efficient systems that combine multiple heterogeneous concurrent threads, especially when the underlying heterogeneous cores are separated by networks or have no shared-memory access paths. The goal is to eliminate today’s need to invoke significant software stacks to cross any of these boundaries. A suggestion is made of using migrating threads as the glue. Two experiments are described: using a heterogeneous platform where all threads share the same memory to solve a rich ML problem, and a fast PageRank approximation that mirrors the kind of computation for which thread migration may be useful. Architectural “lessons learned” are developed that should help guide future development of such systems.
    Free, publicly-accessible full text available March 2, 2025
  2. The conventional model of parallel programming today involves either copying data across cores (and then having to track its most recent value), or not copying and requiring deep software stacks to perform even the simplest operation on data that is “remote”, i.e., out of the range of loads and stores from the current core. As application requirements grow to larger data sets, with more irregular access to them, both conventional approaches start to exhibit severe scaling limitations. This paper reviews some growing evidence of the potential value of a new model of computation that skirts between the two: data does not move (i.e., is not copied), but computation instead moves to the data. Several different applications involving large sparse computations, streaming of data, and complex mixed mode operations have been coded for a novel platform where thread movement is handled invisibly by the hardware. The evidence to date indicates that parallel scaling for this paradigm can be significantly better than any mix of conventional models. 
  3. Applications where continuous streams of data are passed through large data structures are becoming of increasing importance. However, their execution on conventional architectures, especially when parallelism is desired to boost performance, is highly inefficient. The primary issue is often with the need to stream large numbers of disparate data items through the equivalent of very large hash tables distributed across many nodes. This paper builds on some prior work on the Firehose streaming benchmark where an emerging architecture using threads that can migrate through memory has been shown to be much more efficient at such problems. This paper extends that work to use a second generation system to not only show that same improved efficiency (10X) for larger core counts, but even significantly higher raw performance (with FPGA-based cores running at 1/10th the clock of conventional systems). Further, this additional data yields insight into what resources represent the bottlenecks to even more performance, and supports a reasonable projection that implementation of such an architecture with current technology would lead to a 10X performance gain on an apples-to-apples basis with conventional systems.
  4. Stochastic Gradient Descent (SGD) is a valuable algorithm for large-scale machine learning, but has proven difficult to parallelize on conventional architectures because of communication and memory access issues. The Hogwild series of mixed logically distributed and physically multi-threaded algorithms overcomes these issues for problems with sparse characteristics by using multiple local model vectors with asynchronous atomic updates. While this approach has proven effective for several reported examples, there are others, especially very sparse cases, that do not scale as well. This paper discusses an SGD Support Vector Machine (SVM) on a cacheless migrating thread architecture using the Hogwild algorithms as a framework. Our implementations on this novel architecture achieved superior hardware efficiency and scalability over that of a conventional cluster using MPI. Furthermore, these improvements were gained using naive data partitioning techniques and hardware with substantially less compute capability than that present in conventional systems.
  5. Applications where streams of data are passed through large data structures are becoming of increasing importance. For instance, network intrusion detection and cyber security as a whole rely on real-time analysis of network traffic. Unfortunately, when implemented on conventional architectures such applications become horribly inefficient, especially when attempts are made to scale up performance via some sort of parallelism. An earlier paper discussed streaming anomaly detection within a stream having an unbounded range of keys on the Lucata migrating thread architecture. In this paper we introduce Deluge, a new implementation that addresses several inadequacies of previous designs and seeks to more directly target the hardware efficiencies inherent to migratory execution within a PGAS address space. Deluge achieves major improvements in hardware efficiency, throughput, and scalability over previous implementations.
  6. In the past we have seen two major “walls” (memory and power) whose vanquishing required significant advances in architecture. This paper discusses evidence of a third wall dealing with data locality, which is prevalent in data intensive applications where computation is dominated by memory access and movement, not flops. Such applications exhibit large sets of often persistent data, with little reuse during computation, no predictable regularity, significantly different scaling characteristics, and where streaming is becoming important. Further, as we move to highly parallel algorithms (as in running in the cloud), these issues will get even worse. Solving such problems will take a new set of innovations in architecture. In addition to data on the new wall, this paper will look at one possible technique: the concept of migrating threads, and give evidence of its potential value based on several benchmarks that have scaling difficulties on conventional architectures.
  7. Communication overhead has been identified as the primary factor in overall performance degradation for sparse and irregular problems such as SpMV. Many works have shown significant communication reductions, but only for matrices with specific characteristics and by dramatically reworking the computations. This study develops and evaluates a communication avoiding distributed heterogeneous implementation for strong scaling of SpMV on the Sierra supercomputer architecture. To address the far bigger matrices characteristic of real problems, we utilize a hypergraph partitioning package HYPE to determine workload distribution and reduce inter-node communication. Additionally, we investigated the performance impact of performing hypergraph partitioning on scale-free graphs which had undergone a vertex delegation pre-processing step. We achieved up to 97% reduction in average message size per process at scale when using the HYPE partitioner. Despite this, we show that optimizing SpMV on existing GPU architectures increases computational performance yet does not address the dominant communication overhead at scale, even when communication is avoided where possible.
  8. Applications where streams of data are passed through large data structures are becoming of increasing importance. Unfortunately, when implemented on conventional architectures such applications become horribly inefficient, especially when attempts are made to scale up performance via some sort of parallelism. This paper discusses the implementation of the Firehose streaming benchmark on a novel parallel architecture with greatly enhanced multi-threading characteristics that avoids the conventional inefficiencies. Results are promising, with both far better scaling and increased performance over previously reported implementations, on a prototype platform with considerably less intrinsic hardware computational resources.
  9. Applications where streams of data are passed through large data structures are becoming of increasing importance. Unfortunately, when implemented on conventional architectures such applications become horribly inefficient, especially when attempts are made to scale up performance via some sort of parallelism. This paper discusses the implementation of the Firehose streaming benchmark on a novel parallel architecture with greatly enhanced multi-threading characteristics that avoids the conventional inefficiencies. Results are promising, with both far better scaling and increased performance over previously reported implementations, on a prototype platform with considerably less intrinsic hardware computational resources.
  10. Sparse matrix dense vector multiplication (SpMV) exhibits the memory-bandwidth-driven and communication-driven nature of many sparse linear algebra operations. Irregular memory accesses from the non-zero structure within a sparse matrix wreak havoc on performance. This paper presents strong scaling for communication avoiding SpMV implementations on a migrating thread system intended to address the lack of locality in sparse problems. We developed communication avoiding SpMV code to attempt to reduce off-node thread migration by using the hypergraph partitioning package HYPE to determine workload distribution. Additionally, we investigate the performance impact of overlapping communication and computation through the use of remote memory operations supported by the architecture. Incorporating remote memory operations with hypergraph partitioning, we achieved a 6.18X speedup for overall performance.
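The SpMV abstracts above turn on one idea: how rows and vector entries are assigned to nodes determines how many accesses leave the node. The following is a minimal Python sketch of that idea, not code from any of the listed papers. It assumes a CSR matrix layout and a hypothetical `owner` array that maps index `i` to the node holding both row `i` and `x[i]`; the hand-picked assignments stand in for what a partitioner might produce, and HYPE itself is not used. The function names `spmv_csr` and `count_remote_reads` are illustrative only.

```python
from typing import List, Sequence


def spmv_csr(row_ptr: Sequence[int], col_idx: Sequence[int],
             values: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Non-zeros of this row occupy positions row_ptr[row]..row_ptr[row+1]-1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def count_remote_reads(row_ptr: Sequence[int], col_idx: Sequence[int],
                       owner: Sequence[int]) -> int:
    """Count non-zeros whose x entry lives on a different node than the row.

    owner[i] is a hypothetical node id holding both row i and x[i].
    """
    remote = 0
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            if owner[col_idx[k]] != owner[row]:
                remote += 1
    return remote


# 4x4 example matrix:
# [[2, 0, 0, 1],
#  [0, 3, 0, 0],
#  [0, 0, 4, 0],
#  [1, 0, 0, 5]]
row_ptr = [0, 2, 3, 4, 6]
col_idx = [0, 3, 1, 2, 0, 3]
values = [2.0, 1.0, 3.0, 4.0, 1.0, 5.0]
x = [1.0, 1.0, 1.0, 1.0]

y = spmv_csr(row_ptr, col_idx, values, x)

# Two hand-picked assignments of indices to two nodes (assumptions for
# illustration): a contiguous block split, and one that co-locates the
# coupled rows 0 and 3.
block_owner = [0, 0, 1, 1]
grouped_owner = [0, 1, 1, 0]

remote_block = count_remote_reads(row_ptr, col_idx, block_owner)
remote_grouped = count_remote_reads(row_ptr, col_idx, grouped_owner)
print(y, remote_block, remote_grouped)
```

On this toy matrix the block split leaves two off-node reads while the grouped assignment leaves none, which is the kind of reduction a partitioner aims for. On a migrating thread system each off-node read would correspond roughly to a thread migration or a remote memory operation, although the actual cost model is hardware specific and is not captured here.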