NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Infinity Stream: Portable and Programmer-Friendly In-/Near-Memory Fusion

https://doi.org/10.1145/3582016.3582032

Wang, Zhengrong; Liu, Christopher; Arora, Aman; John, Lizy; Nowatzki, Tony (March 2023, ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

In-memory computing with large last-level caches is promising to dramatically alleviate data movement bottlenecks and expose massive bitline-level parallelization opportunities. However, key challenges from its unique execution model remain unsolved: automated parallelization, transparently orchestrating data transposition/alignment/broadcast for bit-serial logic, and mixing in-/near-memory computing. Most importantly, the solution should be programmer friendly and portable across platforms. Our key innovation is an execution model and intermediate representation (IR) that enables hybrid CPU-core, in-memory, and near-memory processing. Our IR is the tensor dataflow graph (tDFG), which is a unified representation of in-memory and near-memory computation. The tDFG exposes tensor-data structure information so that the hardware and runtime can automatically orchestrate data management for bitserial execution, including runtime data layout transformations. To enable microarchitecture portability, we use a two-phase, JIT-based compilation approach to dynamically lower the tDFG to in-memory commands. Our design, infinity stream, is evaluated on a cycle-accurate simulator. Across data-processing workloads with fp32, it achieves 2.6× speedup and 75% traffic reduction over a state-of-the-art near-memory computing technique, with 2.4× energy efficiency.
more » « less
Full Text Available
Infinity Stream: Enabling Transparent and Automated In-Memory Computing

https://doi.org/10.1109/LCA.2022.3203064

Wang, Zhengrong; Liu, Christopher; Nowatzki, Tony (July 2022, IEEE Computer Architecture Letters)

Full Text Available
Affinity Alloc: Taming Not-So Near-Data Computing

https://doi.org/10.1145/3613424.3623778

Wang, Zhengrong; Liu, Christopher; Beckmann, Nathan; Nowatzki, Tony (October 2023, ACM)

Full Text Available
OverGen: Improving FPGA Usability through Domain-specific Overlay Generation

https://doi.org/10.1109/MICRO56248.2022.00018

Liu, Sihao; Weng, Jian; Kupsh, Dylan; Sohrabizadeh, Atefeh; Wang, Zhengrong; Guo, Licheng; Liu, Jiuyang; Zhulin, Maxim; Mani, Rishabh; Zhang, Lucheng; et al (October 2022, MICRO: IEEE/ACM International Symposium on Microarchitecture)

Full Text Available
The Signature of Metasomatized Subcontinental Lithospheric Mantle in the Basaltic Magmatism of the Payenia Volcanic Province, Argentina

https://doi.org/10.1029/2021GC010071

Chilson‐Parks, Benjamin H.; Calabozo, Fernando M.; Saal, Alberto E.; Wang, Zhengrong; Mallick, Soumen; Petrinovic, Ivan A.; Frey, Frederick A. (January 2022, Geochemistry, Geophysics, Geosystems)

Abstract The Payenia region of Argentina (34.5–38°S) is a large Pliocene‐Quaternary volcanic province of basaltic compositions in the Andean Cordillera foothills representing the northernmost extent of back‐arc volcanism in the Andean Southern Volcanic Zone (SVZ). Although the chemical diversity of the Payenia basalts has been characterized previously, the processes and sources responsible for such variation remain controversial. Here, we report new whole‐rock major and trace element concentrations, Sr‐, Nd‐, Hf‐, and Pb‐isotope ratios and high‐precision olivine oxygen‐isotope ratios in a suite of 35 alkaline basalts from Payenia. These lavas have major and trace elements that define a compositional range from arc‐influenced to intraplate signature. Variable crustal contamination and/or recent slab‐derived inputs inadequately account for elemental and isotopic systematics and spatial compositional variations of Payenia lavas. We present a simple forward model indicating that early metasomatism and subsequent melting of the metasomatized subcontinental lithospheric mantle (SCLM) has significantly contributed to the Payenia lava compositional range. Isotopic ingrowth calculations of radiogenic Sr, Nd, Hf, and Pb suggest that the SCLM metasomatism occurred at 50–150 Ma, consistent with the timing of the breakup of Gondwana and the development of the proto‐Pacific Andean arc. Variations in δ¹⁸O_olivinevalues from modeled melts indicate that the metasomatism and melting within the SCLM can fractionate oxygen isotopes even when the metasomatizing melt has MORB‐like δ¹⁸O values, providing a different explanation for the low‐δ¹⁸O signatures observed in continental arc settings.
more » « less
Full Text Available
Stream Floating: Enabling Proactive and Decentralized Cache Optimizations

https://doi.org/10.1109/HPCA51647.2021.00060

Wang, Zhengrong; Weng, Jian; Lowe-Power, Jason; Gaur, Jayesh; Nowatzki, Tony (February 2021, 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA))
null (Ed.)
As multicore systems continue to grow in scale and on-chip memory capacity, the on-chip network bandwidth and latency become problematic bottlenecks. Because of this, overheads in data transfer, the coherence protocol and replacement policies become increasingly important. Unfortunately, even in well-structured programs, many natural optimizations are difficult to implement because of the reactive and centralized nature of traditional cache hierarchies, where all requests are initiated by the core for short, cache line granularity accesses. For example, long-lasting access patterns could be streamed from shared caches without requests from the core. Indirect memory access can be performed by chaining requests made from within the cache, rather than constantly returning to the core. Our primary insight is that if programs can embed information about long-term memory stream behavior in their ISAs, then these streams can be floated to the appropriate level of the memory hierarchy. This decentralized approach to address generation and cache requests can lead to better cache policies and lower request and data traffic by proactively sending data before the cores even request it. To evaluate the opportunities of stream floating, we enhance a tiled multicore cache hierarchy with stream engines to process stream requests in last-level cache banks. We develop several novel optimizations that are facilitated by stream exposure in the ISA, and subsequent exposure to caches. We evaluate using a cycle-level execution-driven gem5-based simulator, using 10 data-processing workloads from Rodinia and 2 streaming kernels written in OpenMP. We find that stream floating enables 52% and 39% speedup over an inorder and OOO core with state of art prefetcher design respectively, with 64% and 49% energy efficiency advantage.
more » « less
Full Text Available
DSAGEN: Synthesizing Programmable Spatial Accelerators

https://doi.org/10.1109/ISCA45697.2020.00032

Weng, Jian; Liu, Sihao; Dadu, Vidushi; Wang, Zhengrong; Shah, Preyas; Nowatzki, Tony (July 2020, 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA))

Full Text Available
A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms

https://doi.org/10.1109/HPCA47549.2020.00063

Weng, Jian; Liu, Sihao; Wang, Zhengrong; Dadu, Vidushi; Nowatzki, Tony (February 2020, HPCA)

Abstract—Dense linear algebra kernels are critical for wireless, and the oncoming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine grain producer/consumer interaction with iteratively changing dependence distance, reuse rate, and memory access patterns. This causes a high overhead both for multi-threading due to fine-grain synchronization, and for vectorization due to the nonrectangular iteration domains. CPUs, DSPs, and GPUs perform order-of-magnitude below peak. Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, then a spatial architecture could efficiently execute parallel code regions. To this end, we first extend the traditional dataflow model with first class primitives for inductive dependences and memory access patterns (streams). Second, we develop a hybrid spatial architecture combining systolic and dataflow execution to attain high utilization at low energy and area cost. Finally, we create a scalable design through a novel vector-stream control model which amortizes control overhead both in time and spatially across architecture lanes. We evaluate our design, REVEL, with a full stack (compiler, ISA, simulator, RTL). Across a suite of linear algebra kernels, REVEL outperforms equally-provisioned DSPs by 4.6×—37×. Compared to state-of-the-art spatial architectures, REVEL is mean 3.4× faster. Compared to a set of ASICs, REVEL is only 2× the power and half the area.
more » « less
Full Text Available
Stream-based memory access specialization for general purpose processors

https://doi.org/10.1145/3307650.3322229

Wang, Zhengrong; Nowatzki, Tony (January 2019, ISCA)

Because of severe limitations in technology scaling, architects have innovated in specializing general purpose processors for computation primitives (e.g. vector instructions, loop accelerators). The general principle is exposing rich semantics to the ISA. An opportunity to explore is whether richer semantics of memory access patterns could also be used to improve the efficiency of memory and communication. Two important open questions are how to convey higher level memory information and how to take advantage of this information in hardware. We find that a majority of memory accesses follow a small number of simple patterns; we term these streams (e.g. affine, indirect). Streams can often be decoupled from core execution, and patterns persist long enough to express useful behavior. Our approach is therefore to express streams as ISA primitives, which we argue can enable: prefetch stream accesses to hide memory latency, semi-binding decoupled access to remove address computation and optimize the memory interface, and finally inform cache policies. In this work, we propose ISA-extensions for decoupled-streams, which interact with the core using a FIFO-based interface. We implement optimizations for each of the aforementioned opportunities on an aggressive wide-issue OOO core and evaluate with SPEC CPU 2017 and CortexSuite[1, 2]. Across all workloads, we observe about 1.37× speedup and energy efficiency improvement over hardware stride prefetching.
more » « less
Full Text Available
Oxygen isotope trajectories of crystallizing melts: Insights from modeling and the plutonic record

https://doi.org/10.1016/j.gca.2017.03.027

Bucholz, Claire E.; Jagoutz, Oliver; VanTongeren, Jill A.; Setera, Jacob; Wang, Zhengrong (June 2017, Geochimica et Cosmochimica Acta)

Full Text Available

« Prev Next »

Search for: All records