skip to main content


Title: Accelerating Distributed Graphical Fluid Simulations with Micro‐partitioning
Abstract

Graphical fluid simulations are CPU‐bound. Parallelizing simulations on hundreds of cores in the computing cloud would make them faster, but requires evenly balancing load across nodes. Good load balancing depends on manual decisions from experts, which are time‐consuming and error prone, or dynamic approaches that estimate and react to future load, which are non‐deterministic and hard to debug.

This paper proposes Birdshot scheduling, an automatic and purely static load balancing algorithm whose performance is close to expert decisions and reactive algorithms without their difficulty or complexity. Birdshot scheduling's key insight is to leverage the high‐latency, high‐throughput, full bisection bandwidth of cloud computing nodes. Birdshot scheduling splits the simulation domain into many micro‐partitions and statically assigns them to nodes randomly. Analytical results show that randomly assigned micro‐partitions balance load with high probability. The high‐throughput network easily handles the increased data transfers from micro‐partitions, and full bisection bandwidth allows random placement with no performance penalty. Overlapping the communications and computations of different micro‐partitions masks latency.

Experiments with particle‐level set, SPH, FLIP and explicit Eulerian methods show that Birdshot scheduling speeds up simulations by a factor of 2‐3, and can out‐perform reactive scheduling algorithms. Birdshot scheduling performs within 21% of state‐of‐the‐art dynamic methods that require running a second, parallel simulation. Unlike speculative algorithms, Birdshot scheduling is purely static: it requires no controller, runtime data collection, partition migration or support for these operations from the programmer.

 
more » « less
NSF-PAR ID:
10455180
Author(s) / Creator(s):
 ;  ;  ;  
Publisher / Repository:
Wiley-Blackwell
Date Published:
Journal Name:
Computer Graphics Forum
Volume:
39
Issue:
1
ISSN:
0167-7055
Page Range / eLocation ID:
p. 375-388
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Spatial join is an important operation for combining spatial data. Parallelization is essential for improving spatial join performance. However, load imbalance due to data skew limits the scalability of parallel spatial join. There are many work sharing techniques to address this problem in a parallel environment. One of the techniques is to use data and space partitioning and then scheduling the partitions among threads/processes with the goal of minimizing workload differences across threads/processes. However, load imbalance still exists due to differences in join costs of different pairs of input geometries in the partitions. For the load imbalance problem, we have designed a work stealing spatial join system (WSSJ-DM) on a distributed memory environment. Work stealing is an approach for dynamic load balancing in which an idle processor steals computational tasks from other processors. This is the first work that uses work stealing concept (instead of work sharing) to parallelize spatial join computation on a large compute cluster. We have evaluated the scalability of the system on shared and distributed memory. Our experimental evaluation shows that work stealing is an effective strategy. We compared WSSJ-DM with work sharing implementations of spatial join on a high performance computing environment using partitioned and un-partitioned datasets. Static and dynamic load balancing approaches were used for comparison. We study the effect of memory affinity in work stealing operations involved in spatial join on a multi-core processor. WSSJ-DM performed spatial join using ST_Intersection on Lakes (8.4M polygons) and Parks (10M polygons) in 30 seconds using 35 compute nodes on a cluster (1260 CPU cores). A work sharing Master-Worker implementation took 160 seconds in contrast. 
    more » « less
  2. Cloud computing, which helps in sharing resources through networks, has become one of the most widely used technologies in recent years. Vast numbers of organizations are moving to the cloud since it is more cost-effective and easy to maintain. An increase in the number of consumers using the cloud, however, results in increased traffic, which leads to the problem of balancing tasks on the loads. Numerous dynamic algorithms [1] have been proposed and implemented to handle these loads in different ways. The performance of these dynamic algorithms are scaled with different parameters, such as response time, throughput, utilization, efficiency, etc. The weighted round-robin algorithm is one of the most widely used load balancing algorithms. The proposed algorithm is an improvement of the weighted round-robin algorithm, which considers the priority of every task before assigning the tasks to different virtual machines (VMs). The proposed algorithm uses the priority of tasks to decide to which VMs the tasks should be assigned dynamically. The same process is used to migrate the tasks from overloaded VMs to under-loaded VMs. The simulations are conducted using CloudSim by varying cloud resources. Simulation results show that the proposed algorithm performs equivalent to the dynamic weighted round robin algorithm in all the QoS factors, but it shows significant improvement in handling high-priority tasks. 
    more » « less
  3. There is a rich theory and plethora of algorithms in the literature aiming at the efficient scheduling of stochastic networks. These solutions are predominantly designed under the assumption of traffic demands that are independently generated at network nodes, without any requirement for synchronization among their received services. In this work, we note that many applications, including cloud computing, virtual reality, gaming, autonomous vehicular networks and collaborative design, generate traffic simultaneously at multiple nodes when they arrive, with possibly non-uniform file sizes, whose performance relies on the synchronous completion of the traffic across the network. This calls for the design of new scheduling algorithms that aims to coordinate the service of packets of the same traffic across the network. Towards this end, we propose a novel scheduling algorithm that not only accounts for the heterogeneity of the file size distributions, but also works towards synchronizing the completion time of the same traffic stream across the network. This is achieved by employing two insights that emanate from key motivating examples we develop: (1) the normalization of traffic load with respect to the non-uniform file sizes; and (2) the incorporation of deviation of normalized loads across network nodes that serve synchronized traffic. After establishing the throughput-optimality of our algorithm in general stochastic networks, we perform extensive simulations under various (spanning both wired and wireless) settings to reveal the potential completion time gains that it yields over other throughput-optimal strategies designed under the assumption of independent traffic generation. 
    more » « less
  4. There is a rich theory and plethora of algorithms in the literature aiming at the efficient scheduling of stochastic networks. These solutions are predominantly designed under the assumption of traffic demands that are independently generated at network nodes, without any requirement for synchronization among their received services. In this work, we note that many applications, including cloud computing, virtual reality, gaming, autonomous vehicular networks and collaborative design, generate traffic simultaneously at multiple nodes when they arrive, with possibly non-uniform file sizes, whose performance relies on the synchronous completion of the traffic across the network. This calls for the design of new scheduling algorithms that aims to coordinate the service of packets of the same traffic across the network. Towards this end, we propose a novel scheduling algorithm that not only accounts for the heterogeneity of the file size distributions, but also works towards synchronizing the completion time of the same traffic stream across the network. This is achieved by employing two insights that emanate from key motivating examples we develop: (1) the normalization of traffic load with respect to the non-uniform file sizes; and (2) the incorporation of deviation of normalized loads across network nodes that serve synchronized traffic. After establishing the throughput-optimality of our algorithm in general stochastic networks, we perform extensive simulations under various (spanning both wired and wireless) settings to reveal the potential completion time gains that it yields over other throughput-optimal strategies designed under the assumption of independent traffic generation. 
    more » « less
  5. Millimeter-wave wireless LANs are targeted for use with bandwidth-intensive applications such as virtual/augmented reality and real-time high-definition video. To maintain high throughput while addressing mmWave signal blockages, multiple access points (APs) within one room to improve line-of-sight conditions is considered a promising approach. In a scenario with fixed and mobile (human) obstacles, we mathematically analyze LoS blockages produced by mobility, and use the analysis to develop a multi-AP association scheme. Our scheme statically assigns primary and backup APs in order to maximize blockage robustness and perform load balancing among APs. Simulation results show that: 1) our static approach can provide blockage tolerance close to that of an expensive dynamic probing approach while achieving higher throughput, 2) the use of client mobility patterns, if known, can improve our static approach even further, and 3) our approach achieves significantly better fairness and load balancing than existing approaches. 
    more » « less