Deep neural networks (DNNs) are increasingly used for real-time inference, requiring low latency, but require significant computational power as they continue to increase in complexity. Edge clouds promise to offer lower latency due to their proximity to end-users and having powerful accelerators like GPUs to provide the computation power needed for DNNs. But it is also important to ensure that the edge-cloud resources are utilized well. For this, multiplexing several DNN models through spatial sharing of the GPU can substantially improve edge-cloud resource usage. Typical GPU runtime environments have significant interactions with the CPU, to transfer data to the GPU,more »
GraphPulse: An Event-Driven Hardware Accelerator for Asynchronous Graph Processing
Graph processing workloads are memory intensive with irregular access patterns and large memory footprint resulting in low data locality. Their popular software implementations typically employ either Push or Pull style propagation of changes through the graph over multiple iterations that follow the Bulk Synchronous Model. The performance of these algorithms on traditional computing systems is limited by random reads/writes of vertex values, synchronization overheads, and additional overheads for tracking active sets of vertices or edges across iterations. In this paper, we present GraphPulse, a hardware framework for asynchronous graph processing with event-driven scheduling that overcomes the performance limitations of software frameworks. Event-driven computation model enables a parallel dataflow-style execution where atomic updates and active sets tracking are inherent to the model; thus, scheduling complexity is reduced and scalability is enhanced. The dataflow nature of the architecture also reduces random reads of vertex values by carrying the values in the events themselves. We capitalize on the update properties commonly present in graph algorithms to coalesce in-flight events and substantially reduce the event storage requirement and the processing overheads incurred. GraphPulse event-model naturally supports asynchronous graph processing, enabling substantially faster convergence by exploiting available parallelism, reducing work, and eliminating synchronization at iteration more »
- Publication Date:
- NSF-PAR ID:
- 10267643
- Journal Name:
- 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
- Page Range or eLocation-ID:
- 908 to 921
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We describe GraFBoost, a flash-based architecture with hardware acceleration for external analytics of multi-terabyte graphs. We compare the performance of GraFBoost with 1 GB of DRAM against various state-of-the-art graph analytics software including FlashGraph, running on a 32-thread Xeon server with 128 GB of DRAM. We demonstrate that despite the relatively small amount of DRAM, GraFBoost achieves high performance with very large graphs no other system can handle, and rivals the performance of the fastest software platforms on sizes of graphs that existing platforms can handle. Unlike in-memory and semi-external systems, GraFBoost uses a constant amount of memory for allmore »
-
Graph processing recently received intensive interests in light of a wide range of needs to understand relationships. It is well-known for the poor locality and high memory bandwidth requirement. In conventional architectures, they incur a significant amount of data movements and energy consumption which motivates several hardware graph processing accelerators. The current graph processing accelerators rely on memory access optimizations or placing computation logics close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle ofmore »
-
With slowing technology scaling, specialized accelerators are increasingly attractive solutions to continue expected generational scaling of performance. However, in order to accelerate more advanced algorithms or those from challenging domains, supporting \emph{data-dependence} becomes necessary. This manifests as either data-dependent control (eg. join two sparse lists), or data-dependent memory accesses (eg. hash-table access). These forms of data-dependence inherently couple compute with memory, and also preclude efficient vectorization -- defeating the traditional mechanisms of programmable accelerators (eg. GPUs). Our goal is to develop an accelerator which is broadly applicable across algorithms with and without data-dependence. To this end, we first identify formsmore »
-
Because of the importance of graph workloads and the limitations of CPUs/GPUs, many graph processing accelerators have been proposed. The basic approach of prior accelerators is to focus on a single graph algorithm variant (eg. bulk-synchronous + slicing). While helpful for specialization, this leaves performance potential from flexibility on the table and also complicates understanding the relationship between graph types, workloads, algorithms, and specialization. In this work, we explore the value of flexibility in graph processing accelerators. First, we identify a taxonomy of key algorithm variants. Then we develop a template architecture (PolyGraph) that is flexible across these variants whilemore »