The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for multi-GPU platforms. However, existing multi-GPU GNN systems optimize computation and communication individually, following the conventional practice of scaling dense DNNs. For irregularly sparse and fine-grained GNN workloads, such solutions miss the opportunity to jointly schedule and optimize computation and communication operations for high performance.
To this end, we propose MGG, a novel system design to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its novel dynamic software pipeline, which enables fine-grained computation-communication overlapping within a GPU kernel. Specifically, MGG introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to facilitate workload balancing and operation overlapping. MGG also incorporates an intelligent runtime design with analytical modeling and optimization heuristics to dynamically improve execution performance. Extensive evaluation shows that MGG outperforms state-of-the-art full-graph GNN systems across various settings: on average 4.41×, 4.81×, and 10.83× faster than DGL, MGG-UVM, and ROC, respectively.
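For context on the pattern MGG targets, below is a minimal, hedged CUDA sketch of the conventional coarse-grained way to overlap communication with computation: chunked asynchronous transfers and kernels issued on separate streams. It is not MGG's method (MGG overlaps at a finer granularity inside a single kernel via its software pipeline, using remote GPU memory accesses rather than host-device copies); the kernel, chunk count, and buffer sizes are illustrative placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Toy kernel standing in for per-chunk computation (placeholder, not MGG's GNN kernel).
__global__ void scale(float* data, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

int main() {
    const int kChunks = 4;
    const int kChunkElems = 1 << 20;
    float *h_buf, *d_buf;

    // Pinned host memory is required for the copies to be truly asynchronous.
    cudaMallocHost((void**)&h_buf, (size_t)kChunks * kChunkElems * sizeof(float));
    cudaMalloc((void**)&d_buf, (size_t)kChunks * kChunkElems * sizeof(float));
    for (int i = 0; i < kChunks * kChunkElems; ++i) h_buf[i] = 1.0f;

    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&streams[c]);

    // Each chunk gets its own stream, so the transfer of chunk c+1 can
    // overlap with the kernel that is still working on chunk c.
    for (int c = 0; c < kChunks; ++c) {
        size_t off = (size_t)c * kChunkElems;
        cudaMemcpyAsync(d_buf + off, h_buf + off, kChunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        scale<<<(kChunkElems + 255) / 256, 256, 0, streams[c]>>>(d_buf + off, kChunkElems, 2.0f);
        cudaMemcpyAsync(h_buf + off, d_buf + off, kChunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();
    printf("h_buf[0] = %f\n", h_buf[0]);  // expect 2.0

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

Because the overlap here happens only at stream granularity, irregular and fine-grained GNN workloads leave gaps between copies and kernels; MGG's intra-kernel pipelining is designed to close exactly those gaps.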
Chauffeur: Benchmark Suite for Design and End-to-End Analysis of Self-Driving Vehicles on Embedded Systems
Self-driving systems execute an ensemble of different self-driving workloads on embedded systems in an end-to-end manner, subject to functional and performance requirements. To enable exploration, optimization, and end-to-end evaluation on different embedded platforms, system designers critically need a benchmark suite that enables flexible and seamless configuration of self-driving scenarios and realistically reflects the unique characteristics of real-world self-driving workloads. Existing CPU and GPU embedded benchmark suites typically (1) consider isolated applications, (2) are not sensor-driven, and (3) are unable to support emerging self-driving applications that simultaneously utilize CPUs and GPUs under stringent timing requirements. On the other hand, full-system self-driving simulators (e.g., AUTOWARE, APOLLO) focus on functional simulation but lack the ability to evaluate the self-driving software stack on various embedded platforms. To address these design needs, we present Chauffeur, the first open-source end-to-end benchmark suite for self-driving vehicles with configurable, representative workloads. Chauffeur is easy to configure and run, enabling researchers to evaluate different platform configurations and explore alternative instantiations of the self-driving software pipeline. Chauffeur runs on diverse emerging platforms and exploits heterogeneous onboard resources. Our initial characterization of Chauffeur on different embedded platforms, NVIDIA Jetson TX2 and Drive PX2, enables comparative evaluation of these GPU platforms in executing an end-to-end self-driving computational pipeline, assessing end-to-end response times while also creating opportunities to form application gangs for better response times. Chauffeur enables researchers to benchmark representative self-driving workloads and flexibly compose them for different self-driving scenarios to explore end-to-end tradeoffs among design constraints, power budget, real-time performance requirements, and application accuracy.
- Award ID(s): 1704859
- NSF-PAR ID: 10297482
- Date Published:
- Journal Name: ACM Transactions on Embedded Computing Systems
- Volume: 20
- Issue: 5s
- ISSN: 1539-9087
- Page Range / eLocation ID: 1 to 22
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Recently, with the advent of the Internet of everything and 5G networks, the amount of data generated by edge scenarios such as autonomous vehicles, smart industry, 4K/8K video, virtual reality (VR), and augmented reality (AR) has exploded. These trends bring real-time, hardware-dependence, low-power, and security requirements to edge facilities and have rapidly popularized edge computing. Meanwhile, artificial intelligence (AI) workloads have dramatically shifted the computing paradigm from cloud services to mobile applications. Unlike AI in the cloud or on mobile platforms, which is widely deployed and well studied, the performance of AI workloads and their resource impact on edge platforms are not yet well understood; an in-depth analysis and comparison of their advantages, limitations, performance, and resource consumption in edge environments is still lacking. In this paper, we perform a comprehensive study of representative AI workloads on edge platforms. We first summarize modern edge hardware and popular AI workloads. We then quantitatively evaluate three categories (i.e., classification, image-to-image, and segmentation) of the most popular and widely used AI applications in realistic edge environments based on the Raspberry Pi, Nvidia TX2, and other devices. We find that the interaction between hardware and neural network models incurs non-negligible impact and overhead on AI workloads at the edge. Our experiments show that performance variation and differences in resource footprint limit the suitability of certain workloads and algorithms for edge platforms, and users need to select the appropriate workload, model, and algorithm based on the requirements and characteristics of edge environments.
-
Message queues are a way for different software components or applications to communicate with each other asynchronously by passing messages through a shared buffer. This allows a sender to send a message without waiting for an immediate response from the receiver, which can help improve system performance, reduce latency, and allow components to operate independently. In this paper, we compared and evaluated the performance of four popular message queues: Redis, ActiveMQ Artemis, RabbitMQ, and Apache Kafka. The aim of this study was to provide insights into the strengths and weaknesses of each technology and to help practitioners choose the most appropriate solution for their use case. We primarily evaluated each technology in terms of latency and throughput. Our experiments were conducted using a diverse array of workloads to test the message queues under various scenarios, enabling practitioners to evaluate the performance of these systems and choose the one that best meets their needs. The results show that each technology has its own pros and cons. Specifically, Redis performed the best in terms of latency, whereas Kafka significantly outperformed the other three technologies in terms of throughput; the optimal choice depends on the specific requirements of the use case. This paper presents valuable insights for practitioners and researchers working with message queues. Furthermore, the results of our experiments are provided in JSON format as a supplement to this paper.
-
The recent introduction of Unified Virtual Memory (UVM) in GPUs offers a new programming model that allows GPUs and CPUs to share the same virtual memory space, which shifts complex memory management from programmers to the GPU driver/hardware and enables kernel execution even when memory is oversubscribed. Meanwhile, UVM may also incur considerable performance overhead due to tracking and data migration, along with the special handling of page faults and page-table walks. As UVM is attracting significant attention from the research community to develop innovative solutions to these problems, in this paper we propose a comprehensive UVM benchmark suite named UVMBench to facilitate future research on this important topic. The proposed UVMBench consists of 32 representative benchmarks from a wide range of application domains. The suite also features a unified programming implementation and diverse memory access patterns across benchmarks, thus allowing thorough evaluation and comparison with the current state of the art. A set of experiments has been conducted on real GPUs to verify and analyze the benchmark suite's behavior under various scenarios.
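For readers unfamiliar with the programming model that UVMBench exercises, here is a minimal, hedged CUDA sketch of UVM: a single managed allocation is touched by both the CPU and the GPU, and the driver migrates pages on demand instead of requiring explicit cudaMemcpy calls. This is a generic illustration of UVM, not code taken from UVMBench; the array size and kernel are placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel: increments every element of a managed array.
__global__ void add_one(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 24;   // 64 MB of floats (placeholder size)
    float* data;

    // One allocation, visible to both CPU and GPU; pages migrate on demand
    // via page faults handled by the driver/hardware.
    cudaMallocManaged((void**)&data, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // CPU touches the pages first

    add_one<<<(unsigned)((n + 255) / 256), 256>>>(data, n);  // GPU faults pages over
    cudaDeviceSynchronize();                                  // required before the CPU reads

    printf("data[0] = %f\n", data[0]);  // expect 2.0, read back on the CPU with no memcpy
    cudaFree(data);
    return 0;
}
```

The on-demand migration and fault handling shown here are precisely the overhead sources the abstract mentions; benchmarks in this space typically also exercise oversubscription and prefetch hints such as cudaMemPrefetchAsync, which this sketch omits.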
-
The rising popularity of self-driving cars has led to the emergence of a new research field in recent years: autonomous racing. Researchers are developing software and hardware for high-performance race vehicles that aim to operate autonomously at the edge of the vehicle's limits: high speeds, high accelerations, low reaction times, and highly uncertain, dynamic, and adversarial environments. This paper presents the first holistic survey covering research in the field of autonomous racing. We focus on autonomous racecars and present the algorithms, methods, and approaches used in perception, planning, and control, as well as end-to-end learning. Further, with an increasing number of autonomous racing competitions, researchers now have access to a range of high-performance platforms to test and evaluate their autonomy algorithms. This survey presents a comprehensive overview of current autonomous racing platforms, emphasizing the software-hardware co-evolution that has led to the current stage. Finally, based on additional discussions with leading researchers in the field, we conclude with a summary of open research challenges to guide future work in this area.