With technology scaling slowing, specialized accelerators are an increasingly attractive way to sustain the expected generational scaling of performance. However, to accelerate more advanced algorithms or those from challenging domains, supporting \emph{data-dependence} becomes necessary. This manifests as either data-dependent control (e.g., joining two sparse lists) or data-dependent memory access (e.g., hash-table access). These forms of data-dependence inherently couple compute with memory and preclude efficient vectorization, defeating the traditional mechanisms of programmable accelerators (e.g., GPUs). Our goal is to develop an accelerator that is broadly applicable across algorithms with and without data-dependence. To this end, we first identify forms of data-dependence that are both common and possible to exploit with specialized hardware: specifically, stream-join and alias-free indirection. Then, we create an accelerator with an interface to support these, called the Sparse Processing Unit (SPU). SPU supports alias-free indirection with a compute-enabled scratchpad and aggressive stream reordering, and supports stream-join with a novel dataflow control model for a reconfigurable systolic compute fabric. Finally, we add robustness across datatypes by adding decomposability across the compute and memory pipelines. SPU achieves speedups of 16.5$\times$, 10.3$\times$, and 14.2$\times$ over a 24-core SKL CPU on ML, database, and graph algorithms, respectively. SPU achieves similar performance to domain-specific accelerators. For ML, SPU achieves a 1.8-7$\times$ speedup against a similarly provisioned GPGPU, with much less area and power.
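The two forms of data-dependence named in the abstract above can be illustrated with a minimal software sketch. This is not SPU's interface or implementation; the function names `sparse_dot_join` and `indirect_accumulate` are hypothetical, chosen only for illustration. The first function shows data-dependent control (a join over two sorted sparse index lists), and the second shows a data-dependent memory access. The sketch does not attempt to model the alias-free restriction or any hardware behavior.

```python
def sparse_dot_join(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors via a two-pointer join.

    Assumes idx_a and idx_b are sorted in ascending order with no repeats.
    Which pointer advances depends on the data, so the control flow cannot
    be decided ahead of time (data-dependent control).
    """
    i = j = 0
    total = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            # Matching index: both vectors are nonzero here.
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return total


def indirect_accumulate(table, keys, updates):
    """Apply table[keys[k]] += updates[k] for every k.

    The address touched by each update is only known once the key is read
    (data-dependent memory access).
    """
    for key, update in zip(keys, updates):
        table[key] += update
    return table
```

In both functions the next step depends on values read from memory, which is the property the abstract identifies as an obstacle to straightforward vectorization.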
Talk to My Neighbors Transport: Decentralized Data Transfer and Scheduling Among Accelerators
The demise of Dennard scaling has ushered in an era of unprecedented and ever-increasing heterogeneity, in pursuit of increasing performance via specialization. While CMOS scaling is believed to be approaching its end, continued increases in the number of transistors available on a chip have made specialized hardware an attractive alternative to increasing core counts or cache sizes. GPUs are commonplace in many computing domains, FPGAs are arriving in the cloud, and smart storage and networking hardware are commercially available. This paper argues for separating transport, the actual physical management of data, from the rest of the control plane by adding simple hardware specialized purely for this task, called TRANSPORTERS. TRANSPORTERS facilitate offloading accelerator scheduling, data movement, and inter-accelerator communication and coordination through a management protocol called TALK TO MY NEIGHBORS TRANSPORT (TMNT).
- Award ID(s):
- 1700512
- PAR ID:
- 10060951
- Date Published:
- Journal Name:
- Proceedings of the 9th Workshop on Systems for Multi-core and Heterogeneous Architectures
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Cyanobacteria are responsible for up to 80% of aquatic carbon dioxide fixation and have evolved a specialized carbon concentrating mechanism to increase photosynthetic yield. As such, cyanobacteria are attractive targets for synthetic biology and engineering approaches to address the demands of global energy security, food production, and climate change for a growing world population. The bicarbonate transporter BicA is a sodium-dependent, low-affinity, high-flux bicarbonate symporter expressed in the plasma membrane of cyanobacteria. Despite extensive biochemical characterization of BicA, including the resolution of the BicA crystal structure, the dynamic understanding of the bicarbonate transport mechanism remains elusive. To this end, we have collected over 1 ms of all-atom molecular dynamics simulation data of the BicA dimer to elucidate the structural rearrangements involved in the substrate transport process. We further characterized the energetics of the transition of BicA protomers and investigated potential mutations that are shown to decrease the free energy barrier of conformational transitions. In all, our study illuminates a detailed mechanistic understanding of the conformational dynamics of bicarbonate transporters and provides atomistic insights to engineering these transporters for enhanced photosynthetic production.
-
Cyanobacteria are responsible for up to 80% of aquatic carbon dioxide fixation and have evolved a specialized carbon concentrating mechanism to increase photosynthetic yield. As such, cyanobacteria are attractive targets for synthetic biology and engineering approaches to address the demands of global energy security, food production, and climate change for a growing world population. The bicarbonate transporter BicA is a sodium-dependent, low-affinity, high-flux bicarbonate symporter expressed in the plasma membrane of cyanobacteria. Despite extensive biochemical characterization of BicA, including the resolution of the BicA crystal structure, the dynamic understanding of the bicarbonate transport mechanism remains elusive. To this end, we have collected over 1 ms of all-atom molecular dynamics simulation data of the BicA dimer to elucidate the structural rearrangements involved in the substrate transport process. We further characterized the energetics of the cooperativity between BicA protomers and investigated potential mutations that are shown to decrease the free energy barrier of conformational transitions. In all, our study illuminates a detailed mechanistic understanding of the conformational dynamics of bicarbonate transporters and provides atomistic insights to engineering these transporters for enhanced photosynthetic production.
-
High-performance kernel libraries are critical to exploiting accelerators and specialized instructions in many applications. Because compilers are difficult to extend to support diverse and rapidly-evolving hardware targets, and automatic optimization is often insufficient to guarantee state-of-the-art performance, these libraries are commonly still coded and optimized by hand, at great expense, in low-level C and assembly. To better support development of high-performance libraries for specialized hardware, we propose a new programming language, Exo, based on the principle of exocompilation: externalizing target-specific code generation support and optimization policies to user-level code. Exo allows custom hardware instructions, specialized memories, and accelerator configuration state to be defined in user libraries. It builds on the idea of user scheduling to externalize hardware mapping and optimization decisions. Schedules are defined as composable rewrites within the language, and we develop a set of effect analyses which guarantee program equivalence and memory safety through these transformations. We show that Exo enables rapid development of state-of-the-art matrix-matrix multiply and convolutional neural network kernels, for both an embedded neural accelerator and x86 with AVX-512 extensions, in a few dozen lines of code each.
-
Data Acquisition (DAQ) workloads form an important class of scientific network traffic that by its nature (1) flows across different research infrastructure, including remote instruments and supercomputer clusters, (2) has ever-increasing throughput demands, and (3) has ever-increasing integration demands; for example, observations at one instrument could trigger a reconfiguration of another instrument. Today’s DAQ transfers rely on UDP and (heavily tuned) TCP, but this is driven by convenience rather than suitability. The mismatch between Internet transport protocols and scientific workloads becomes more stark with the steady increase in link capacities, data generation, and integration across research infrastructure. This position paper argues the importance of developing specialized transport protocols for DAQ workloads. It proposes a new transport feature for this kind of elephant flow: multi-modality, in which the network actively configures the transport protocol to change how DAQ flows are processed across the different underlying networks that connect scientific research infrastructure. Multi-modality is a layering violation that is proposed as a pragmatic technique for DAQ transport protocol design. It takes advantage of programmable network hardware that is increasingly being deployed in scientific research infrastructure. The paper presents an initial evaluation through a pilot study that includes a Tofino2 switch and Alveo FPGA cards and uses data from a particle detector.