

Title: Parallelism in Randomized Incremental Algorithms
In this article, we show that many sequential randomized incremental algorithms are in fact parallel. We consider algorithms for several problems, including Delaunay triangulation, linear programming, closest pair, smallest enclosing disk, least-element lists, and strongly connected components. We analyze the dependencies between iterations in an algorithm and show that the dependence structure is shallow with high probability, or that by violating some dependencies the structure becomes shallow without significantly increasing the work. We identify three types of algorithms based on their dependencies and present a framework for analyzing each type. Using the framework gives work-efficient polylogarithmic-depth parallel algorithms for most of the problems that we study. This article presents the first incremental Delaunay triangulation algorithm with optimal work and polylogarithmic depth. This result is important, since most implementations of parallel Delaunay triangulation use the incremental approach. Our results also improve bounds on strongly connected components and least-element lists and significantly simplify parallel algorithms for several problems.
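As a toy illustration of the dependence-structure idea (this is not one of the algorithms analyzed in the article), the Python sketch below measures the iteration-dependence chain of the simplest randomized incremental computation: maintaining a running minimum over a randomly ordered input. An iteration depends on the most recent earlier iteration that changed the minimum, so the longest dependence chain is the number of minimum updates, which is the harmonic number H_n = O(log n) in expectation. A shallow chain of this kind is what allows the iterations to be grouped into a small number of rounds and processed in parallel.

```python
import math
import random

def dependence_chain_length(n, trials=100):
    """Average length of the chain of minimum-updating iterations over random orders."""
    total = 0
    for _ in range(trials):
        values = list(range(n))
        random.shuffle(values)        # the random insertion order of the incremental algorithm
        current_min = math.inf
        chain = 0
        for v in values:
            if v < current_min:       # this iteration updates shared state, so it depends
                current_min = v       # on the previous updating iteration
                chain += 1
        total += chain
    return total / trials

if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        print(n, round(dependence_chain_length(n), 1), "vs. ln n =", round(math.log(n), 1))
```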
Award ID(s):
1919223 1901381 1910030
NSF-PAR ID:
10222586
Author(s) / Creator(s):
Date Published:
Journal Name:
Journal of the ACM
Volume:
67
Issue:
5
ISSN:
0004-5411
Page Range / eLocation ID:
1 to 27
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Computing strongly connected components (SCC) is among the most fundamental problems in graph analytics. Given the large size of today's real-world graphs, parallel SCC implementation is increasingly important. SCC is challenging in the parallel setting and is particularly hard on large-diameter graphs. Many existing parallel SCC implementations can be even slower than Tarjan's sequential algorithm on large-diameter graphs.

    To tackle this challenge, we propose an efficient parallel SCC implementation using a new parallel reachability approach. Our solution is based on a novel idea referred to as vertical granularity control (VGC). It breaks the synchronization barriers to increase parallelism and hide scheduling overhead. To use VGC in our SCC algorithm, we also design an efficient data structure called the parallel hash bag. It uses parallel dynamic resizing to avoid redundant work in maintaining frontiers (vertices processed in a round).

    We implement the parallel SCC algorithm by Blelloch et al. (J. ACM, 2020) using our new parallel reachability approach. We compare our implementation to the state-of-the-art systems, including GBBS, iSpan, Multi-step, and our highly optimized implementation of Tarjan's (sequential) algorithm, on 18 graphs, including social, web, k-NN, and lattice graphs. On a machine with 96 cores, our implementation is the fastest on 16 out of 18 graphs. On average (geometric mean) over all graphs, our SCC is 6.0× faster than the best previous parallel code (GBBS), 12.8× faster than Tarjan's sequential algorithm, and 2.7× faster than the best existing implementation on each graph.

    We believe that our techniques are of independent interest. We also apply our parallel hash bag and VGC scheme to other graph problems, including connectivity and least-element lists (LE-lists). Our implementations improve the performance of the state-of-the-art parallel implementations for these two problems.
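    As a point of reference for the reachability approach mentioned above, the following sequential Python sketch (a hedged illustration, not the paper's implementation) shows the round-synchronous, frontier-based structure that such SCC algorithms build on. The vertical granularity control and parallel hash bag described above target exactly the per-round frontier maintenance and synchronization that this naive version performs with an ordinary list and set.

```python
def reachability(graph, sources):
    """Return (reachable set, number of rounds) for multi-source reachability.
    `graph` maps a vertex to an iterable of out-neighbors."""
    visited = set(sources)
    frontier = list(sources)                 # vertices processed in the current round
    rounds = 0
    while frontier:
        next_frontier = []                   # the parallel hash bag plays this role
        for u in frontier:
            for v in graph.get(u, ()):
                if v not in visited:         # done with a compare-and-swap in parallel
                    visited.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
        rounds += 1                          # one synchronization barrier per round;
    return visited, rounds                   # large-diameter graphs need many rounds

# Example: a 5-vertex directed graph with a cycle and a tail.
g = {0: [1], 1: [2], 2: [0, 3], 3: [4], 4: []}
print(reachability(g, [0]))   # ({0, 1, 2, 3, 4}, 5)
```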

     
  2.
    Geologic processes at convergent plate margins control geochemical cycling, seismicity, and deep biosphere activity in subduction zones and suprasubduction zone lithosphere. International Ocean Discovery Program (IODP) Expedition 366 was designed to address the nature of these processes in the shallow to intermediate depth of the Mariana subduction channel. Although no technology is available to permit direct sampling of the subduction channel of an intraoceanic convergent margin at depths up to 18 km, the Mariana forearc region (between the trench and the active volcanic arc) provides a means to access this zone. Active conduits, resulting from fractures in the forearc, are prompted by along- and across-strike extension that allows slab-derived fluids and materials to ascend to the seafloor along associated faults, resulting in the formation of serpentinite mud volcanoes. Serpentinite mud volcanoes of the Mariana forearc are the largest mud volcanoes on Earth. Their positions adjacent to or atop fault scarps on the forearc are likely related to the regional extension and vertical tectonic deformation in the forearc. Serpentinite mudflows at these volcanoes include serpentinized forearc mantle clasts, crustal and subducted Pacific plate materials, a matrix of serpentinite muds, and deep-sourced formation fluid. Mud volcanism on the Mariana forearc occurs within 100 km of the trench, representing a range of depths and temperatures to the downgoing plate and the subduction channel. These processes have likely been active for tens of millions of years at this site and for billions of years on Earth. At least 10 active serpentinite mud volcanoes have been located in the Mariana forearc. Two of these mud volcanoes are Conical and South Chamorro Seamounts, which are the furthest from the Mariana Trench at 86 and 78 km, respectively. Both seamounts were cored during Ocean Drilling Program (ODP) Legs 125 and 195, respectively. Data from these two seamounts represent deeper, warmer examples of the continuum of slab-derived materials as the Pacific plate subducts, providing a snapshot of how slab subduction affects fluid release, the composition of ascending fluids, mantle hydration, and the metamorphic paragenesis of subducted oceanic lithosphere. Data from the study of these two mud volcanoes constrain the pressure, temperature, and composition of fluids and materials within the subduction channel at depths of about 18 to 19 km. Understanding such processes is necessary for elucidating factors that control seismicity in convergent margins, tectonic and magma genesis processes in the forearc and volcanic arc, fluid and material fluxes, and the nature and variability of environmental conditions that impact subseafloor microbial communities. Expedition 366 centered on data collection from cores recovered from three serpentinite mud volcanoes that define a continuum of subduction-channel processes defined by the two previously cored serpentinite mud volcanoes and the trench. Three serpentinite mud volcanoes (Yinazao, Fantangisña, and Asùt Tesoro) were chosen at distances 55 to 72 km from the Mariana Trench. Cores were recovered from active sites of eruption on their summit regions and on the flanks where ancient flows are overlain by more recent ones. Recovered materials show the effects of dynamic processes that are active at these sites, bringing a range of materials to the seafloor, including materials from the lithosphere of the Pacific plate and from subducted seamounts (including corals). 
Most of the recovered material consists of serpentinite mud containing lithic clasts, which are derived from the underlying forearc crust and mantle and the subducting Pacific plate. Cores from each of the three seamounts drilled during Expedition 366, as well as those from Legs 125 and 195, include material from the underlying Pacific plate. A thin cover of pelagic sediment was recovered at many Expedition 366 sites, and at Site U1498 we cored through serpentinite flows to the underlying pelagic sediment and volcanic ash deposits. Recovered serpentinites are largely uniform in major element composition, with serpentinized ultramafic rocks and serpentinite muds spanning a limited range in SiO2, MgO, and Fe2O3 compositions. However, variation in trace element composition reflects pore fluid composition, which differs as a function of the temperature and pressure of the underlying subduction channel. Dissolved gases H2, CH4, and C2H6 are highest at the site furthest from the trench, which also has the most active fluid discharge of the Expedition 366 serpentinite mud volcanoes. These dissolved gases and their active discharge from depth likely support active microbial communities, which were the focus of in-depth subsampling and preservation for shore-based analytical and culturing procedures. The effects of fluid discharge were also registered in the porosity and GRA density data, as indicated by higher-than-expected values at some of the summit sites. These higher values are consistent with overpressured fluids that minimize compaction of serpentinite mud deposits. In contrast, flank sites have significantly greater decreases in porosity with depth, suggesting that processes in addition to compaction are required to explain the observed data. Thermal measurements reveal higher heat flow values on the flanks (~31 mW/m2) than on the summits (~17 mW/m2) of the seamounts. The new 2G Enterprises superconducting rock magnetometer (liquid helium free) revealed relatively high values of both magnetization and bulk magnetic susceptibility of discrete samples related to ultramafic rocks, particularly in dunite. Magnetite, a product of serpentinization, and authigenic carbonates were observed in the mudflow matrix materials.

In addition to coring operations, Expedition 366 focused on the deployment and remediation of borehole casings for future observatories and set the framework for in situ experimentation. Borehole work commenced at South Chamorro Seamount, where the original-style CORK was partially removed. Work then continued at each of the three summit sites following coring operations. Cased boreholes with at least three joints of screened casing were deployed, and a plug of cement was placed at the bottom of each hole. Water samples were collected from two of the three boreholes, revealing significant inputs of formation fluids. This suggests that each of the boreholes tapped a hydrologic zone, making these boreholes suitable for experimentation with the future deployment of a CORK-lite. An active education and outreach program connected with many classrooms on shore and with the general public through social media.
  3. In this article, we show how a single function, join, can be used to implement parallel balanced binary search trees (BSTs) simply and efficiently. Based on join, our approach applies to multiple balanced tree data structures and a variety of functions for ordered sets and maps. We describe our technique as an algorithmic framework called join-based algorithms. We show that the join function fully captures what is needed for rebalancing trees for a variety of tree algorithms, as long as the balancing scheme satisfies certain properties, which we refer to as joinable trees. We discuss four balancing schemes that are joinable: AVL trees, red-black trees, weight-balanced trees, and treaps. We present a variety of tree algorithms that apply to joinable trees, including insert, delete, union, intersection, difference, split, range, filter, and so on, most of them also parallel. These algorithms are generic across balancing schemes. Many algorithms are optimal in the comparison model, and we provide a general proof to show the efficiency in work for joinable trees. The algorithms are highly parallel, all with polylogarithmic span (parallel dependence). Specifically, the set-set operations union, intersection, and difference have work \( O(m\log (\frac{n}{m}+1)) \) and polylogarithmic span for input set sizes \( n \) and \( m\le n \).

    We implemented and tested our algorithms on the four balancing schemes. In general, all four schemes have quite similar performance, but the weight-balanced tree slightly outperforms the others. They have the same speedup characteristics, achieving around a 73\( \times \) speedup on 72 cores (144 hyperthreads). Experimental results also show that our implementation outperforms existing parallel implementations, and our sequential version achieves performance close to or much better than that of the sequential merging algorithm in the C++ Standard Template Library (STL) on various input sizes.
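    To make the join-based idea concrete, here is a hedged sequential Python sketch specialized to one of the four joinable schemes (treaps): split and union are written purely in terms of join, and join is the only place where the balancing rule (treap priorities) is consulted. In the actual framework the two recursive calls in union run in parallel, and the same union code works for any joinable scheme.

```python
import random

# Sequential sketch: treaps as one joinable balancing scheme.
class Node:
    def __init__(self, key):
        self.key = key
        self.prio = random.random()           # treap priority (the balancing data)
        self.left = self.right = None

def _prio(t):
    return t.prio if t is not None else -1.0  # real priorities lie in [0, 1)

def join(t1, mid, t2):
    """Join treaps t1 and t2 around node mid, assuming keys(t1) < mid.key < keys(t2).
    This is the only function that inspects priorities, i.e., the balancing scheme."""
    if _prio(t1) > _prio(mid) and _prio(t1) >= _prio(t2):
        t1.right = join(t1.right, mid, t2)
        return t1
    if _prio(t2) > _prio(mid):
        t2.left = join(t1, mid, t2.left)
        return t2
    mid.left, mid.right = t1, t2
    return mid

def split(t, key):
    """Split t by key into (keys < key, key present?, keys > key), using only join."""
    if t is None:
        return None, False, None
    if key < t.key:
        less, found, between = split(t.left, key)
        return less, found, join(between, t, t.right)
    if key > t.key:
        between, found, greater = split(t.right, key)
        return join(t.left, t, between), found, greater
    return t.left, True, t.right

def union(t1, t2):
    """Set union; in the parallel framework the two recursive calls run in parallel."""
    if t1 is None:
        return t2
    if t2 is None:
        return t1
    less, _, greater = split(t1, t2.key)
    return join(union(less, t2.left), t2, union(greater, t2.right))

def insert(t, key):
    less, _, greater = split(t, key)
    return join(less, Node(key), greater)

def keys(t):
    return [] if t is None else keys(t.left) + [t.key] + keys(t.right)

a = None
for k in (1, 3, 5, 7):
    a = insert(a, k)
b = None
for k in (2, 3, 6):
    b = insert(b, k)
print(keys(union(a, b)))   # [1, 2, 3, 5, 6, 7]
```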

     
  4. This article presents I/O-efficient algorithms for topologically sorting a directed acyclic graph and for the more general problem of identifying and topologically sorting the strongly connected components of a directed graph \( G = (V, E) \). Both algorithms are randomized and have I/O cost \( O(\mathrm{sort}(E) \cdot \mathrm{poly}(\log V)) \) with high probability, where \( \mathrm{sort}(E) = O(\frac{E}{B} \log_{M/B} \frac{E}{B}) \) is the I/O cost of sorting an \( |E| \)-element array on a machine with size-\( B \) blocks and a size-\( M \) cache/internal memory. These are the first algorithms for these problems that do not incur at least one I/O per vertex, and as such they are the first I/O-efficient algorithms for sparse graphs. By applying the technique of time-forward processing, these algorithms also imply I/O-efficient algorithms for most problems on directed acyclic graphs, such as shortest paths, as well as for the single-source reachability problem on arbitrary directed graphs.
  5. Obeid, I. ; Selesnik, I. ; Picone, J. (Ed.)
    The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should produce identical results. (3) A job should produce comparable results if the data is presented in a different order. System optimization requires an ability to directly compare error rates for algorithms evaluated under comparable operating conditions. However, it is a difficult task to exactly reproduce the results for large, complex deep learning systems that often require more than a trillion calculations per experiment [5]. This is a fairly well-known issue and one we will explore in this poster. Researchers must be able to replicate results on a specific data set to establish the integrity of an implementation. They can then use that implementation as a baseline for comparison purposes. A lack of reproducibility makes it very difficult to debug algorithms and validate changes to the system. Equally important, since many results in deep learning research are dependent on the order in which the system is exposed to the data, the specific processors used, and even the order in which those processors are accessed, it becomes a challenging problem to compare two algorithms since each system must be individually optimized for a specific data set or processor. This is extremely time-consuming for algorithm research in which a single run often taxes a computing environment to its limits. Well-known techniques such as cross-validation [5,6] can be used to mitigate these effects, but this is also computationally expensive. These issues are further compounded by the fact that most deep learning algorithms are susceptible to the way computational noise propagates through the system. GPUs are particularly notorious for this because, in a clustered environment, it becomes more difficult to control which processors are used at various points in time. Another equally frustrating issue is that upgrades to the deep learning package, such as the transition from TensorFlow v1.9 to v1.13, can also result in large fluctuations in error rates when re-running the same experiment. Since TensorFlow is constantly updating functions to support GPU use, maintaining an historical archive of experimental results that can be used to calibrate algorithm research is quite a challenge. This makes it very difficult to optimize the system or select the best configurations. 
The overall impact of all of these issues described above is significant as error rates can fluctuate by as much as 25% due to these types of computational issues. Cross-validation is one technique used to mitigate this, but it is expensive since you need to do multiple runs over the data, which further taxes a computing infrastructure already running at max capacity. GPUs are preferred when training a large network since these systems train at least two orders of magnitude faster than CPUs [7]. Large-scale experiments are simply not feasible without using GPUs. However, there is a tradeoff for this performance gain. Since all our GPUs use the NVIDIA CUDA® Deep Neural Network library (cuDNN) [8], a GPU-accelerated library of primitives for deep neural networks, it adds an element of randomness into the experiment. When a GPU is used to train a network in TensorFlow, it automatically searches for a cuDNN implementation. NVIDIA's cuDNN implementation provides algorithms that increase the performance and help the model train more quickly, but they are non-deterministic algorithms [9,10]. Since our networks have many complex layers, there is no easy way to avoid this randomness. Instead of comparing each epoch, we compare the average performance of the experiment because it gives us a hint of how our model is performing per experiment and whether the changes we make are effective. In this poster, we will discuss a variety of issues related to reproducibility and introduce ways we mitigate these effects. For example, TensorFlow uses a random number generator (RNG) which is not seeded by default. TensorFlow determines the initialization point and how certain functions execute using the RNG. The solution for this is seeding all the necessary components before training the model. This forces TensorFlow to use the same initialization point and sets how certain layers work (e.g., dropout layers). However, seeding all the RNGs will not guarantee a controlled experiment. Other variables can affect the outcome of the experiment, such as training using GPUs, allowing multi-threading on CPUs, using certain layers, etc. To mitigate our problems with reproducibility, we first make sure that the data is processed in the same order during training. Therefore, we save the data from the last experiment to make sure the newer experiment follows the same order. If we allow the data to be shuffled, it can affect the performance due to how the model was exposed to the data. We also specify the float data type to be 32-bit since Python defaults to 64-bit. We try to avoid using 64-bit precision because the numbers produced by a GPU can vary significantly depending on the GPU architecture [11-13]. Controlling precision somewhat reduces differences due to computational noise even though technically it increases the amount of computational noise. We are currently developing more advanced techniques for preserving the efficiency of our training process while also maintaining the ability to reproduce models. In our poster presentation we will demonstrate these issues using some novel visualization tools, present several examples of the extent to which these issues influence research results on electroencephalography (EEG) and digital pathology experiments, and introduce new ways to manage such computational issues.
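The snippet below is a minimal sketch (not the authors' actual experimental setup) of the seeding and 32-bit precision controls described above. It uses the TensorFlow 2.x API (TensorFlow 1.x uses tf.set_random_seed rather than tf.random.set_seed), and the SEED value and toy model are illustrative only. As noted above, seeding alone does not remove the nondeterminism introduced by cuDNN kernels or multi-threaded execution.

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 1337

# Only takes effect if exported before the Python process starts; shown for completeness.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)                          # Python's built-in RNG
np.random.seed(SEED)                       # NumPy RNG (data generation, shuffling helpers)
tf.random.set_seed(SEED)                   # TensorFlow graph- and op-level seeds

tf.keras.backend.set_floatx("float32")     # keep model computations in 32-bit floats

# Keep the data order fixed across runs instead of reshuffling each experiment.
x = np.random.rand(1000, 20).astype("float32")
y = (x.sum(axis=1) > 10).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)  # note: no .shuffle()

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5, seed=SEED),   # seed the stochastic layer explicitly
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(dataset, epochs=2, verbose=0)
```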