NSF PAR Search | NSF Public Access Repository

Cases When Communicating More is Faster

https://doi.org/10.1145/3746238.3746243

Hati, Souvadra; Vuduc, Richard (July 2025, ACM)

Free, publicly-accessible full text available July 28, 2026
Brief Announcement: Optimality Conditions for Parallel Communication-Avoiding Matrix Multiplication with Overlapped Communication

https://doi.org/10.1145/3694906.3743357

Isaev, Mikhail; Eswar, Srinivas; Vuduc, Richard (July 2025, ACM)

Free, publicly-accessible full text available July 16, 2026
On Rank Selection for Nonnegative Matrix Factorization

https://doi.org/10.1109/BigData62323.2024.10825324

Eswar, Srinivas; Hayashi, Koby; Cobb, Benjamin; Kannan, Ramakrishnan; Ballard, Grey; Vuduc, Richard; Park, Haesun (December 2024, IEEE)

Rank selection, i.e. the choice of factorization rank, is the first step in constructing Nonnegative Matrix Factorization (NMF) models. It is a long-standing problem which is not unique to NMF, but arises in most models which attempt to decompose data into its underlying components. Since these models are often used in the unsupervised setting, the rank selection problem is further complicated by the lack of ground truth labels. In this paper, we review and empirically evaluate the most commonly used schemes for NMF rank selection.
Full Text Available
A Workflow for the Synthesis of Irregular Memory Access Microbenchmarks

https://doi.org/10.1145/3695794.3695816

Sheridan, Kevin; Dominguez-Trujillo, Jered; Shipman, Galen; Lavin, Patrick; Scott, Christopher; Vaca_Valverde, Agustin; Vuduc, Richard; Young, Jeffrey (September 2024, ACM)

Full Text Available
Multifidelity Memory System Simulation in SST

https://doi.org/10.1145/3631882.3631890

Lavin, Patrick; Young, Jeffrey; Vuduc, Richard (October 2023, ACM)
Distributed-Memory Parallel JointNMF

https://doi.org/10.1145/3577193.3593733

Eswar, Srinivas; Cobb, Benjamin; Hayashi, Koby; Kannan, Ramakrishnan; Ballard, Grey; Vuduc, Richard; Park, Haesun (June 2023, Proceedings of the 37th International Conference on Supercomputing)

Joint Nonnegative Matrix Factorization (JointNMF) is a hybrid method for mining information from datasets that contain both feature and connection information. We propose distributed-memory parallelizations of three algorithms for solving the JointNMF problem based on Alternating Nonnegative Least Squares, Projected Gradient Descent, and Projected Gauss-Newton. We extend well-known communication-avoiding algorithms using a single processor grid case to our coupled case on two processor grids. We demonstrate the scalability of the algorithms on up to 960 cores (40 nodes) with 60% parallel efficiency. The more sophisticated Alternating Nonnegative Least Squares (ANLS) and Gauss-Newton variants outperform the first-order gradient descent method in reducing the objective on large-scale problems. We perform a topic modelling task on a large corpus of academic papers that consists of over 37 million paper abstracts and nearly a billion citation relationships, demonstrating the utility and scalability of the methods.
Full Text Available
Online model swapping for architectural simulation

https://doi.org/10.1145/3457388.3458670

Lavin, Patrick; Young, Jeffrey; Vuduc, Richard; Beard, Jonathan (May 2021, The 18th International Conference on Computing Frontiers)
As systems and applications grow more complex, detailed computer architecture simulation takes an ever increasing amount of time. Longer simulation times result in slower design iterations which then force architects to use simpler models, such as spreadsheets, when they want to iterate quickly on a design. Simple models are not easy to work with though, as architects must rely on intuition to choose representative models, and the path from the simple models to a detailed hardware simulation is not always clear. In this work, we present a method of bridging the gap between simple and detailed simulation by monitoring simulation behavior online and automatically swapping out detailed models with simpler statistical approximations. We demonstrate the potential of our methodology by implementing it in the open-source simulator SVE-Cachesim to swap out the level one data cache (L1D) within a memory hierarchy. This proof of concept demonstrates that our technique can train simple models to match real program behavior in the L1D and can swap them in without destructive side-effects for the performance of downstream models. Our models introduce only 8% error in the overall cycle count, while being used for over 90% of the simulation and using models that require two to eight times less computation per cache access.
Full Text Available
“Smarter” NICs for faster molecular dynamics: a case study

https://doi.org/10.1109/IPDPS53621.2022.00063

Karamati, Sara; Hughes, Clayton; Hemmert, K Scott; Grant, Ryan E; Schonbein, W Whit; Levy, Scott; Conte, Thomas M; Young, Jeffrey; Vuduc, Richard W (May 2022, IEEE)
Distributed-Memory Parallel Symmetric Nonnegative Matrix Factorization

https://doi.org/10.1109/SC41405.2020.00078

Eswar, Srinivas; Hayashi, Koby; Ballard, Grey; Kannan, Ramakrishnan; Vuduc, Richard; Park, Haesun (November 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis)
Full Text Available
A supernodal all-pairs shortest path algorithm

https://doi.org/10.1145/3332466.3374533

Sao, Piyush; Kannan, Ramakrishnan; Gera, Prasun; Vuduc, Richard (February 2020, Symposium on Principles and Practice of Parallel Programming)
We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (APSP) problem. Floyd-Warshall is an attractive choice for APSP on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if sparsity of the input graph is not properly exploited, Floyd-Warshall will perform unnecessary asymptotic work and thus may not be a suitable choice for many input graphs. To overcome this limitation, the key idea in our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import several algorithmic techniques from sparse Cholesky factorization, namely, fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism. When combined, these techniques reduce computation, improve locality and enhance parallelism. We implement these ideas in an efficient shared memory parallel prototype that is orders of magnitude faster than an efficient multi-threaded baseline Floyd-Warshall that does not exploit sparsity. Our experiments suggest that the Floyd-Warshall algorithm can compete with Dijkstra’s algorithm (the algorithmic core of Johnson’s algorithm) for several classes of sparse graphs.
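For orientation, the structural similarity to Gaussian elimination mentioned in the abstract can be seen in a plain dense Floyd-Warshall loop: step k "eliminates" vertex k by updating every pair (i, j) through it. The sketch below is only the textbook dense baseline, not the supernodal method of the paper. The function name `floyd_warshall`, the adjacency-matrix input, and the assumption of no negative cycles are illustrative choices. The skip of infinite entries is a trivial shortcut and does not stand in for the paper's sparsity techniques.

```python
import math


def floyd_warshall(dist):
    """Dense textbook Floyd-Warshall (hypothetical helper, not the paper's code).

    dist is an n x n list of lists; dist[i][j] is the edge weight from i to j,
    math.inf if there is no edge, and 0 on the diagonal.
    Assumes the graph has no negative cycles. Returns a new matrix.
    """
    n = len(dist)
    d = [row[:] for row in dist]  # copy so the input is left untouched
    for k in range(n):            # "eliminate" vertex k, like a pivot step
        dk = d[k]
        for i in range(n):
            dik = d[i][k]
            if dik == math.inf:
                continue          # no path i -> k, so row i cannot improve via k
            di = d[i]
            for j in range(n):
                cand = dik + dk[j]  # min-plus analogue of a rank-1 update
                if cand < di[j]:
                    di[j] = cand
    return d


INF = math.inf
example = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
shortest = floyd_warshall(example)
```

The triple loop has the same shape as dense LU factorization with (min, +) in place of (+, ×), which is the algebraic relationship the abstract refers to.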
Full Text Available
