We investigate the power of randomness in two-party communication complexity. In particular, we study the model where the parties can make a constant number of queries to a function with an efficient one-sided-error randomized protocol. The complexity classes defined by this model comprise the Randomized Boolean Hierarchy, which is analogous to the Boolean Hierarchy but defined with one-sided-error randomness instead of nondeterminism. Our techniques connect the Nondeterministic and Randomized Boolean Hierarchies, and we provide a complete picture of the relationships among complexity classes within and across these two hierarchies. In particular, we prove that the Randomized Boolean Hierarchy does not collapse, and we prove a query-to-communication lifting theorem for all levels of the Nondeterministic Boolean Hierarchy and use it to resolve an open problem stated in the paper by Halstenberg and Reischuk (CCC 1988) which initiated the study of this hierarchy.
more »
« less
A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing
The convergence of extremely high levels of hardware concurrency and the effective overlap of computation and communication in asynchronous executions has resulted in increasing nondeterminism in High-Performance Computing (HPC) applications. Nondeterminism can manifest at multiple levels: from low-level communication primitives to libraries to application-level functions. No matter its source, nondeterminism can drastically increase the cost of result reproducibility, debugging workflows, testing parallel programs, or ensuring fault-tolerance. Nondeterministic executions of HPC applications can be modeled as event graphs, and the applications’ nondeterministic behavior can be understood and, in some cases, mitigated using graph comparison algorithms. However, a connection between graph comparison algorithms and approaches to understanding nondeterminism in HPC still needs to be established. This survey article moves the first steps toward establishing a connection between graph comparison algorithms and nondeterminism in HPC with its three contributions: it provides a survey of different graph comparison algorithms and a timeline for each category’s significant works; it discusses how existing graph comparison methods do not fully support properties needed to understand nondeterministic patterns in HPC applications; and it presents the open challenges that should be addressed to leverage the power of graph comparisons for the study of nondeterminism in HPC applications.
more »
« less
- Award ID(s):
- 1900888
- PAR ID:
- 10432143
- Editor(s):
- Sage
- Date Published:
- Journal Name:
- The International Journal of High Performance Computing Applications
- ISSN:
- 1094-3420
- Page Range / eLocation ID:
- 109434202311666
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
To deliver scalable performance to large-scale scientific and data analytic applications, HPC cluster architectures adopt the distributed-memory model. The performance and scalability of parallel applications on such systems are limited by the communication cost across compute nodes. Therefore, projecting the minimum communication cost and maximum scalability of the user applications plays a critical role in assessing the benefits of porting these applications to HPC clusters as well as developing efficient distributed-memory implementations. Unfortunately, this task is extremely challenging for end users, as it requires comprehensive knowledge of the target application and hardware architecture and demands significant effort and time for manual system analysis. To streamline the process of porting user applications to HPC clusters, this paper presents CommAnalyzer, an automated framework for estimating the communication cost on distributed-memory models from sequential code. CommAnalyzer uses novel dynamic program analyses and graph algorithms to capture the inherent flow of program values (information) in sequential code to estimate the communication when this code is ported to HPC clusters. Therefore, CommAnalyzer makes it possible to project the efficiency/scalability upper-bound (i.e., Roofline) of the effective distributed-memory implementation before even developing one. The experiments with real-world, regular and irregular HPC applications demonstrate the utility of CommAnalyzer in estimating the minimum communication of sequential applications on HPC clusters. In addition, the optimized MPI+X implementations achieve more than 92% of the efficiency upper-bound across the different workloads.more » « less
-
We consider large-scale implicit solvers for the numerical solution of partial differential equations (PDEs). The solvers require the high-bandwith networks of an HPC system for a fast time to solution. The increasing variability in performance of the HPC systems, most likely caused by variable communication latencies and network congestion, however, makes the execution time of solver algorithms unpredictable and hard to measure. In particular, the performance variability of the underlying system makes the reliable comparison of different algorithms and implementations difficult or impossible on HPC. We propose the use of statistical methods relying on hidden Markov models (HMM) to separate variable performance data into regimes corresponding to different levels of system latency. This allows us to, for ex- ample, identify and remove time periods when extremely high system latencies throttle application performance and distort performance measurements. We apply HMM to the careful analysis of implicit conjugate gradient solvers for finite-element discretized PDE, in particular comparing several new communication hiding methods for matrix-free operators of a PDE, which are critical for achieving peak performance in state-of-the-art PDE solvers. The HMM analysis allows us to overcome the strong performance variability in the HPC system. Our performance results for a model PDE problem discretized with 135 million degrees of freedom parallelized over 7168 cores of the Anvil supercomputer demonstrate that the communication hiding techniques can achieve up to a 10% speedup for the matrix-free matrix-vector product.more » « less
-
Santhanam, Rahul (Ed.)The following question arises naturally in the study of graph streaming algorithms: Is there any graph problem which is "not too hard", in that it can be solved efficiently with total communication (nearly) linear in the number n of vertices, and for which, nonetheless, any streaming algorithm with Õ(n) space (i.e., a semi-streaming algorithm) needs a polynomial n^Ω(1) number of passes? Assadi, Chen, and Khanna [STOC 2019] were the first to prove that this is indeed the case. However, the lower bounds that they obtained are for rather non-standard graph problems. Our first main contribution is to present the first polynomial-pass lower bounds for natural "not too hard" graph problems studied previously in the streaming model: k-cores and degeneracy. We devise a novel communication protocol for both problems with near-linear communication, thus showing that k-cores and degeneracy are natural examples of "not too hard" problems. Indeed, previous work have developed single-pass semi-streaming algorithms for approximating these problems. In contrast, we prove that any semi-streaming algorithm for exactly solving these problems requires (almost) Ω(n^{1/3}) passes. The lower bound follows by a reduction from a generalization of the hidden pointer chasing (HPC) problem of Assadi, Chen, and Khanna, which is also the basis of their earlier semi-streaming lower bounds. Our second main contribution is improved round-communication lower bounds for the underlying communication problems at the basis of these reductions: - We improve the previous lower bound of Assadi, Chen, and Khanna for HPC to achieve optimal bounds for this problem. - We further observe that all current reductions from HPC can also work with a generalized version of this problem that we call MultiHPC, and prove an even stronger and optimal lower bound for this generalization. These two results collectively allow us to improve the resulting pass lower bounds for semi-streaming algorithms by a polynomial factor, namely, from n^{1/5} to n^{1/3} passes.more » « less
-
With the ever growing complexity of high performance computing (HPC) systems to satisfy emerging application requirements (e.g., high memory bandwidth requirement for machine learning applications), the performance bottleneck in such systems has moved from being computation-centric to be more communication-centric. Silicon photonic interconnection networks have been proposed to address the aggressive communication requirements in HPC systems, to realize higher bandwidth, lower latency, and better energy efficiency. There have been many successful efforts on developing silicon photonic devices, integrated circuits, and architectures for HPC systems. Moreover, many efforts have been made to address and mitigate the impact of different challenges (e.g., fabrication process and thermal variations) in silicon photonic interconnects. However, most of these efforts have focused only on a single design layer in the system design space (e.g., device, circuit or architecture level). Therefore, there is often a gap between what a design technique can improve in one layer, and what it might impair in another one. In this paper, we discuss the promise of cross-layer design methodologies for HPC systems integrating silicon photonic interconnects. In particular, we discuss how such cross-layer design solutions based on cooperatively designing and exchanging design objectives among different system design layers can help achieve the best possible performance when integrating silicon photonics into HPC systemsmore » « less
An official website of the United States government

