
Title: Parallelism in Randomized Incremental Algorithms
In this article, we show that many sequential randomized incremental algorithms are in fact parallel. We consider algorithms for several problems, including Delaunay triangulation, linear programming, closest pair, smallest enclosing disk, least-element lists, and strongly connected components. We analyze the dependencies between the iterations of an algorithm and show either that the dependence structure is shallow with high probability, or that by violating some dependencies the structure becomes shallow without significantly increasing the work. We identify three types of algorithms based on their dependencies and present a framework for analyzing each type. Using the framework gives work-efficient, polylogarithmic-depth parallel algorithms for most of the problems that we study. This article presents the first incremental Delaunay triangulation algorithm with optimal work and polylogarithmic depth. This result is important, since most implementations of parallel Delaunay triangulation use the incremental approach. Our results also improve the bounds on strongly connected components and least-element lists and significantly simplify parallel algorithms for several problems.
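The round-by-round execution idea behind this analysis can be illustrated with a short sketch. The code below is not any of the article's algorithms; it is a schematic serial simulation in which `conflicts(i, j)` and `apply_update(i)` are hypothetical, problem-specific routines supplied by the caller. Each round executes every remaining iteration whose earlier conflicting iterations have already finished, so the number of rounds equals the depth of the iteration-dependence structure, which the article argues is shallow with high probability.

```python
import random

def incremental_in_rounds(n, conflicts, apply_update, seed=None):
    """Schematic round-based execution of a randomized incremental algorithm.

    n               -- number of iterations (one per input element)
    conflicts(i, j) -- hypothetical oracle: True if iteration i must observe
                       the result of the earlier iteration j
    apply_update(i) -- hypothetical routine that performs iteration i
    Returns the number of rounds, i.e., the depth of the dependence structure.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)      # random insertion order
    done = [False] * n
    rounds = 0
    while not all(done):
        # An iteration is ready once every earlier iteration it conflicts
        # with has already been executed.
        ready = [i for pos, i in enumerate(order)
                 if not done[i] and all(done[j] or not conflicts(i, j)
                                        for j in order[:pos])]
        for i in ready:                     # these would run in parallel
            apply_update(i)
        for i in ready:
            done[i] = True
        rounds += 1
    return rounds
```

With a trivial oracle (`lambda i, j: False`) the sketch finishes in one round, while a full chain of dependencies degrades it to n rounds; that round count is exactly the quantity the framework bounds for each type of algorithm.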
Award ID(s):
1919223 1901381 1910030
NSF-PAR ID:
10222586
Journal Name:
Journal of the ACM
Volume:
67
Issue:
5
Page Range or eLocation-ID:
1 to 27
ISSN:
0004-5411
Sponsoring Org:
National Science Foundation
More Like this
  1. Geologic processes at convergent plate margins control geochemical cycling, seismicity, and deep biosphere activity in subduction zones and suprasubduction zone lithosphere. International Ocean Discovery Program (IODP) Expedition 366 was designed to address the nature of these processes in the shallow to intermediate depth of the Mariana subduction channel. Although no technology is available to permit direct sampling of the subduction channel of an intraoceanic convergent margin at depths up to 18 km, the Mariana forearc region (between the trench and the active volcanic arc) provides a means to access this zone. Active conduits, resulting from fractures in the forearc, are prompted by along- and across-strike extension that allows slab-derived fluids and materials to ascend to the seafloor along associated faults, resulting in the formation of serpentinite mud volcanoes. Serpentinite mud volcanoes of the Mariana forearc are the largest mud volcanoes on Earth. Their positions adjacent to or atop fault scarps on the forearc are likely related to the regional extension and vertical tectonic deformation in the forearc. Serpentinite mudflows at these volcanoes include serpentinized forearc mantle clasts, crustal and subducted Pacific plate materials, a matrix of serpentinite muds, and deep-sourced formation fluid. Mud volcanism on the Mariana forearc occurs within 100 km of the trench, representing a range of depths and temperatures to the downgoing plate and the subduction channel. These processes have likely been active for tens of millions of years at this site and for billions of years on Earth. At least 10 active serpentinite mud volcanoes have been located in the Mariana forearc. Two of these mud volcanoes are Conical and South Chamorro Seamounts, which are the furthest from the Mariana Trench at 86 and 78 km, respectively. Both seamounts were cored during Ocean Drilling Program (ODP) Legs 125 and 195, respectively. Data from these two seamounts represent deeper, warmer examples of the continuum of slab-derived materials as the Pacific plate subducts, providing a snapshot of how slab subduction affects fluid release, the composition of ascending fluids, mantle hydration, and the metamorphic paragenesis of subducted oceanic lithosphere. Data from the study of these two mud volcanoes constrain the pressure, temperature, and composition of fluids and materials within the subduction channel at depths of about 18 to 19 km. Understanding such processes is necessary for elucidating factors that control seismicity in convergent margins, tectonic and magma genesis processes in the forearc and volcanic arc, fluid and material fluxes, and the nature and variability of environmental conditions that impact subseafloor microbial communities. Expedition 366 centered on data collection from cores recovered from three serpentinite mud volcanoes that, together with the two previously cored serpentinite mud volcanoes and the trench, define a continuum of subduction-channel processes. Three serpentinite mud volcanoes (Yinazao, Fantangisña, and Asùt Tesoro) were chosen at distances of 55 to 72 km from the Mariana Trench. Cores were recovered from active sites of eruption on their summit regions and on the flanks where ancient flows are overlain by more recent ones.
Recovered materials show the effects of dynamic processes that are active at these sites, bringing a range of materials to the seafloor, including materials from the lithosphere of the Pacific plate and from subducted seamounts (including corals). Most of the recovered material consists of serpentinite mud containing lithic clasts, which are derived from the underlying forearc crust and mantle and the subducting Pacific plate. Cores from each of the three seamounts drilled during Expedition 366, as well as those from Legs 125 and 195, include material from the underlying Pacific plate. A thin cover of pelagic sediment was recovered at many Expedition 366 sites, and at Site U1498 we cored through serpentinite flows to the underlying pelagic sediment and volcanic ash deposits. Recovered serpentinites are largely uniform in major element composition, with serpentinized ultramafic rocks and serpentinite muds spanning a limited range in SiO2, MgO, and Fe2O3 compositions. However, variation in trace element composition reflects pore fluid composition, which differs as a function of the temperature and pressure of the underlying subduction channel. Dissolved gases H2, CH4, and C2H6 are highest at the site furthest from the trench, which also has the most active fluid discharge of the Expedition 366 serpentinite mud volcanoes. These dissolved gases and their active discharge from depth likely support active microbial communities, which were the focus of in-depth subsampling and preservation for shore-based analytical and culturing procedures. The effects of fluid discharge were also registered in the porosity and GRA density data, as indicated by higher-than-expected values at some of the summit sites. These higher values are consistent with overpressured fluids that minimize compaction of serpentinite mud deposits. In contrast, flank sites have significantly greater decreases in porosity with depth, suggesting that processes in addition to compaction are required to explain the observed data. Thermal measurements reveal higher heat flow values on the flanks (~31 mW/m2) than on the summits (~17 mW/m2) of the seamounts. The new 2G Enterprises superconducting rock magnetometer (liquid helium free) revealed relatively high values of both magnetization and bulk magnetic susceptibility of discrete samples related to ultramafic rocks, particularly in dunite. Magnetite, a product of serpentinization, and authigenic carbonates were observed in the mudflow matrix materials. In addition to coring operations, Expedition 366 focused on the deployment and remediation of borehole casings for future observatories and set the framework for in situ experimentation. Borehole work commenced at South Chamorro Seamount, where the original-style CORK was partially removed. Work then continued at each of the three summit sites following coring operations. Cased boreholes with at least three joints of screened casing were deployed, and a plug of cement was placed at the bottom of each hole. Water samples were collected from two of the three boreholes, revealing significant inputs of formation fluids. This suggests that each of the boreholes tapped a hydrologic zone, making these boreholes suitable for experimentation with the future deployment of a CORK-lite. An active education and outreach program connected with many classrooms on shore and with the general public through social media.
  2. Obeid, I.; Selesnik, I.; Picone, J. (Eds.)
    The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], which manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer-grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should produce identical results. (3) A job should produce comparable results if the data is presented in a different order. System optimization requires an ability to directly compare error rates for algorithms evaluated under comparable operating conditions. However, it is a difficult task to exactly reproduce the results for large, complex deep learning systems that often require more than a trillion calculations per experiment [5]. This is a fairly well-known issue and one we will explore in this poster. Researchers must be able to replicate results on a specific data set to establish the integrity of an implementation. They can then use that implementation as a baseline for comparison purposes. A lack of reproducibility makes it very difficult to debug algorithms and validate changes to the system. Equally important, since many results in deep learning research are dependent on the order in which the system is exposed to the data, the specific processors used, and even the order in which those processors are accessed, it becomes a challenging problem to compare two algorithms since each system must be individually optimized for a specific data set or processor. This is extremely time-consuming for algorithm research in which a single run often taxes a computing environment to its limits. Well-known techniques such as cross-validation [5,6] can be used to mitigate these effects, but this is also computationally expensive. These issues are further compounded by the fact that most deep learning algorithms are susceptible to the way computational noise propagates through the system. GPUs are particularly notorious for this because, in a clustered environment, it becomes more difficult to control which processors are used at various points in time. Another equally frustrating issue is that upgrades to the deep learning package, such as the transition from TensorFlow v1.9 to v1.13, can also result in large fluctuations in error rates when re-running the same experiment. Since TensorFlow is constantly updating functions to support GPU use, maintaining a historical archive of experimental results that can be used to calibrate algorithm research is quite a challenge. This makes it very difficult to optimize the system or select the best configurations.
The overall impact of all of these issues described above is significant, as error rates can fluctuate by as much as 25% due to these types of computational issues. Cross-validation is one technique used to mitigate this, but it is expensive since you need to do multiple runs over the data, which further taxes a computing infrastructure already running at maximum capacity. GPUs are preferred when training a large network since these systems train at least two orders of magnitude faster than CPUs [7]. Large-scale experiments are simply not feasible without using GPUs. However, there is a tradeoff to gain this performance. Since all our GPUs use the NVIDIA CUDA® Deep Neural Network library (cuDNN) [8], a GPU-accelerated library of primitives for deep neural networks, it adds an element of randomness into the experiment. When a GPU is used to train a network in TensorFlow, it automatically searches for a cuDNN implementation. NVIDIA’s cuDNN implementation provides algorithms that increase performance and help the model train more quickly, but they are non-deterministic algorithms [9,10]. Since our networks have many complex layers, there is no easy way to avoid this randomness. Instead of comparing each epoch, we compare the average performance of the experiment because it gives us an indication of how our model is performing in each experiment and whether the changes we make are effective. In this poster, we will discuss a variety of issues related to reproducibility and introduce ways we mitigate these effects. For example, TensorFlow uses a random number generator (RNG) which is not seeded by default. TensorFlow determines the initialization point and how certain functions execute using the RNG. The solution for this is seeding all the necessary components before training the model. This forces TensorFlow to use the same initialization point and sets how certain layers work (e.g., dropout layers). However, seeding all the RNGs will not guarantee a controlled experiment. Other variables can affect the outcome of the experiment, such as training using GPUs, allowing multi-threading on CPUs, using certain layers, etc. To mitigate our problems with reproducibility, we first make sure that the data is processed in the same order during training. Therefore, we save the data order from the last experiment and use it to make sure the newer experiment follows the same order. If we allow the data to be shuffled, it can affect the performance due to how the model was exposed to the data. We also specify the float data type to be 32-bit since Python defaults to 64-bit. We try to avoid using 64-bit precision because the numbers produced by a GPU can vary significantly depending on the GPU architecture [11-13]. Controlling precision somewhat reduces differences due to computational noise even though technically it increases the amount of computational noise. We are currently developing more advanced techniques for preserving the efficiency of our training process while also maintaining the ability to reproduce models. In our poster presentation we will demonstrate these issues using some novel visualization tools, present several examples of the extent to which these issues influence research results on electroencephalography (EEG) and digital pathology experiments, and introduce new ways to manage such computational issues.
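As a concrete illustration of the seeding, data-ordering, and precision controls described in the item above, here is a minimal sketch. It is not the poster's actual setup: it uses the TensorFlow 2.x API (the poster discusses v1.9/v1.13), and the `load_features`/`make_dataset` helper names are hypothetical.

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 1337

# Seed every RNG that TensorFlow and its Python/NumPy surroundings use.
os.environ["PYTHONHASHSEED"] = str(SEED)  # only fully effective if set before Python starts
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Request deterministic TensorFlow/cuDNN kernels where they exist
# (recognized in TF >= 2.1; newer releases also provide
# tf.config.experimental.enable_op_determinism()).
os.environ["TF_DETERMINISTIC_OPS"] = "1"

def load_features(path):
    """Hypothetical loader; the float32 cast is the relevant part,
    since NumPy/Python default to 64-bit floats."""
    return np.load(path).astype(np.float32)

def make_dataset(features, labels, batch_size=32):
    """Present the data in a fixed order: shuffle once with a fixed seed
    and do not reshuffle between epochs (or skip shuffling entirely)."""
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    ds = ds.shuffle(len(features), seed=SEED, reshuffle_each_iteration=False)
    return ds.batch(batch_size)
```

Even with all of these controls, non-deterministic cuDNN kernels can still introduce run-to-run variation, which is why the poster compares averaged performance across experiments rather than per-epoch numbers.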
  3. This article presents I/O-efficient algorithms for topologically sorting a directed acyclic graph and for the more general problem of identifying and topologically sorting the strongly connected components of a directed graph $G = (V, E)$. Both algorithms are randomized and have I/O cost $O(\mathrm{sort}(E) \cdot \mathrm{poly}(\log V))$ with high probability, where $\mathrm{sort}(E) = O\left(\frac{E}{B} \log_{M/B} \frac{E}{B}\right)$ is the I/O cost of sorting an $|E|$-element array on a machine with size-$B$ blocks and a size-$M$ cache/internal memory. These are the first algorithms for these problems that do not incur at least one I/O per vertex, and as such they are the first I/O-efficient algorithms for sparse graphs. By applying the technique of time-forward processing, these algorithms also imply I/O-efficient algorithms for most problems on directed acyclic graphs, such as shortest paths, as well as for the single-source reachability problem on arbitrary directed graphs.
  4. While there has been significant work on parallel graph processing, there has been surprisingly little work on high-performance hypergraph processing. This paper presents a collection of efficient parallel algorithms for hypergraph processing, including algorithms for betweenness centrality, maximal independent set, k-core decomposition, hypertrees, hyperpaths, connected components, PageRank, and single-source shortest paths. For these problems, we either provide new parallel algorithms or more efficient implementations than prior work. Furthermore, our algorithms are theoretically efficient in terms of work and depth. To implement our algorithms, we extend the Ligra graph processing framework to support hypergraphs, and our implementations benefit from graph optimizations including switching between sparse and dense traversals based on the frontier size, edge-aware parallelization, using buckets to prioritize processing of vertices, and compression. Our experiments on a 72-core machine show that our algorithms obtain excellent parallel speedups and are significantly faster than algorithms in existing hypergraph processing frameworks.
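To illustrate the frontier-based traversal pattern that such hypergraph frameworks build on, here is a small serial sketch; it is not the paper's Ligra extension, and the `incidence`/`members` adjacency maps are hypothetical inputs. Parallel frameworks additionally switch between sparse (list) and dense (bitmap) frontier representations depending on frontier size, which this sketch omits.

```python
def hyper_bfs(incidence, members, source):
    """Level-synchronous BFS on a hypergraph, alternating between vertex
    frontiers and hyperedge frontiers.

    incidence -- dict mapping each vertex to the hyperedges containing it
    members   -- dict mapping each hyperedge to its member vertices
    Returns a dict of hop distances (in vertex->hyperedge->vertex steps).
    """
    v_dist = {source: 0}
    e_visited = set()
    frontier = [source]
    level = 0
    while frontier:
        # Vertices -> incident hyperedges not yet visited.
        edge_frontier = []
        for v in frontier:
            for e in incidence[v]:
                if e not in e_visited:
                    e_visited.add(e)
                    edge_frontier.append(e)
        # Hyperedges -> member vertices not yet visited.
        level += 1
        next_frontier = []
        for e in edge_frontier:
            for u in members[e]:
                if u not in v_dist:
                    v_dist[u] = level
                    next_frontier.append(u)
        frontier = next_frontier
    return v_dist
```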
  5. Parallel radix sorting has been extensively studied in the literature for decades. However, the most efficient implementations require auxiliary memory proportional to the input size, which can be prohibitive for large inputs. The standard serial in-place radix sorting algorithm is based on swapping each key to its correct place in the array on each level of recursion. However, it is not straightforward to parallelize this algorithm due to dependencies among the swaps. This paper presents Regions Sort, a new parallel in-place radix sorting algorithm that is efficient both in theory and in practice. Our algorithm uses a graph structure to model dependencies across elements that need to be swapped, and generates independent tasks from this graph that can be executed in parallel. For sorting $n$ integers from a range $r$, and a parameter $K$, Regions Sort requires only $O(K\log r\log n)$ auxiliary memory. Our algorithm requires $O(n\log r)$ work and $O((n/K+\log K)\log r)$ span, making it work-efficient and highly parallel. In addition, we present several optimizations that significantly improve the empirical performance of our algorithm. We compare the performance of Regions Sort to existing parallel in-place and out-of-place sorting algorithms on a variety of input distributions and show that Regions Sort is faster than optimized out-of-place radix sorting and comparison sorting algorithms, and is almost always faster than the fastest publicly available in-place sorting algorithm.
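To make the swap dependencies in that last item concrete, here is a hedged Python sketch of the standard serial in-place MSD radix sort ("American flag sort") that the abstract refers to; it handles non-negative integers only and is not Regions Sort itself. Each swap moves the key at position i into its bucket and pulls an arbitrary key back to position i, so later swaps depend on earlier ones; this is the dependency structure that Regions Sort models with a graph and breaks into independent parallel tasks.

```python
def msd_inplace_radix_sort(a, lo=0, hi=None, shift=24, radix=256):
    """Serial in-place MSD radix sort ("American flag sort").

    Assumes non-negative integer keys below 2**(shift + 8); with the
    defaults this means keys below 2**32, sorted one byte at a time.
    """
    if hi is None:
        hi = len(a)
    if hi - lo <= 1:
        return
    mask = radix - 1
    # Count how many keys fall into each bucket at this digit position.
    counts = [0] * radix
    for i in range(lo, hi):
        counts[(a[i] >> shift) & mask] += 1
    # Prefix sums give each bucket's starting offset in the array.
    starts = [0] * radix
    s = lo
    for d in range(radix):
        starts[d] = s
        s += counts[d]
    next_free = starts[:]                 # next unfilled slot of each bucket
    # Permute in place: swap each out-of-place key into its bucket.  Every
    # swap brings a new key back to position i, which must be examined in
    # turn -- these are the serial dependencies among swaps.
    for d in range(radix):
        end_d = starts[d] + counts[d]
        i = next_free[d]
        while i < end_d:
            dig = (a[i] >> shift) & mask
            if dig == d:
                i += 1                    # already in the right bucket
            else:
                j = next_free[dig]
                a[i], a[j] = a[j], a[i]   # place a[i] into bucket `dig`
                next_free[dig] = j + 1
        next_free[d] = i
    # Recurse on each bucket for the next (less significant) byte.
    if shift > 0:
        for d in range(radix):
            msd_inplace_radix_sort(a, starts[d], starts[d] + counts[d],
                                   shift - 8, radix)
```

For example, `msd_inplace_radix_sort(xs)` sorts a list `xs` of non-negative 32-bit integers in place. As the abstract describes, Regions Sort instead models which keys must move between regions of the array as a graph and schedules the resulting moves as independent tasks, which is what allows the permutation step to run in parallel.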