Title: X-Stream: Accelerating streaming segments on MPSoCs for real-time applications
We are witnessing a race to meet the ever-growing computation requirements of emerging AI applications that provide perception and control in autonomous vehicles, e.g., self-driving cars and UAVs. To remain competitive, vendors are packing more processing units (CPUs, programmable logic, GPUs, and hardware accelerators) into next-generation multiprocessor systems-on-a-chip (MPSoCs). As a result, modern embedded platforms are achieving new heights in peak computational capacity. However, the inevitable accompanying increase in complexity is a major obstacle to the development of correct-by-design, safety-critical real-time applications. Because of the ever-growing gap between fast-paced hardware evolution and the comparatively slower evolution of real-time operating systems (RTOS), there is a need for real-time-oriented, full-platform management frameworks to complement traditional RTOS designs. In this work, we propose one such framework, X-Stream, for the definition, synthesis, and analysis of real-time workloads targeting state-of-the-art accelerator-augmented embedded platforms. X-Stream is designed around two cardinal principles. First, computation and data movement are orchestrated to achieve predictability by design. To this end, iterative computation over large data chunks is divided into successive segments, which are then streamed using the three-phase execution model (load, execute, and unload). Second, the framework is workflow-centric: system designers specify their workflow, and the code needed to orchestrate it is generated automatically. In addition to automating the deployment of user-defined hardware-accelerated workloads, X-Stream supports the deployment of some computation segments on traditional CPUs. Finally, X-Stream allows the definition of real-time partitions. Each partition groups applications that belong to the same criticality level and share the same set of hardware resources, with support for preemptive priority-driven scheduling. Conversely, freedom from interference between applications deployed in different partitions is guaranteed by design. We provide a full-system implementation that includes RTOS integration and showcase X-Stream on a Xilinx UltraScale+ platform, focusing on a matrix-multiplication-and-addition kernel use case.
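The segment-streaming idea in the abstract can be illustrated with a small sketch. This is not X-Stream code: the function names, the row-wise segmentation, and the choice to keep matrix B resident across segments are all assumptions made for illustration. It computes A x B + C one row segment at a time, with each segment passing through explicit load, execute, and unload phases, so that the execute phase touches only local buffers.

```python
from typing import Iterator, List, Tuple

Matrix = List[List[float]]


def split_rows(n_rows: int, seg_rows: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row ranges, one per segment (hypothetical helper)."""
    for start in range(0, n_rows, seg_rows):
        yield start, min(start + seg_rows, n_rows)


def stream_matmul_add(a: Matrix, b: Matrix, c: Matrix, seg_rows: int):
    """Compute A @ B + C segment by segment using load/execute/unload phases.

    Assumption: B is small enough to stay resident, so only rows of A and C
    are streamed. Returns the result and a trace of the phases executed.
    """
    n_cols = len(b[0])
    out: Matrix = [[0.0] * n_cols for _ in a]
    trace = []
    for start, stop in split_rows(len(a), seg_rows):
        # Load: copy the segment into local buffers (stand-in for a transfer
        # into accelerator-local memory).
        a_seg = [row[:] for row in a[start:stop]]
        c_seg = [row[:] for row in c[start:stop]]
        trace.append(("load", start, stop))

        # Execute: compute using only the local buffers and resident B.
        r_seg = [
            [
                sum(a_row[k] * b[k][j] for k in range(len(b))) + c_row[j]
                for j in range(n_cols)
            ]
            for a_row, c_row in zip(a_seg, c_seg)
        ]
        trace.append(("execute", start, stop))

        # Unload: write the segment's result back to the output matrix.
        out[start:stop] = r_seg
        trace.append(("unload", start, stop))
    return out, trace


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    B = [[1, 0], [0, 1]]
    C = [[1, 1], [1, 1], [1, 1]]
    result, phases = stream_matmul_add(A, B, C, seg_rows=2)
    print(result)
    for phase in phases:
        print(phase)
```

Because memory traffic is confined to the load and unload phases, the phases of different segments could in principle be scheduled so that they never contend for the same resource; the sketch runs them strictly in sequence and makes no timing claims.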
Award ID(s):
2008799
PAR ID:
10482048
Author(s) / Creator(s):
Publisher / Repository:
Elsevier
Date Published:
Journal Name:
Journal of Systems Architecture
Volume:
138
Issue:
C
ISSN:
1383-7621
Page Range / eLocation ID:
102857
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Self-driving systems execute an ensemble of different self-driving workloads on embedded systems in an end-to-end manner, subject to functional and performance requirements. To enable exploration, optimization, and end-to-end evaluation on different embedded platforms, system designers critically need a benchmark suite that supports flexible and seamless configuration of self-driving scenarios and realistically reflects the unique characteristics of real-world self-driving workloads. Existing CPU and GPU embedded benchmark suites typically (1) consider isolated applications, (2) are not sensor-driven, and (3) cannot support emerging self-driving applications that simultaneously utilize CPUs and GPUs under stringent timing requirements. On the other hand, full-system self-driving simulators (e.g., AUTOWARE, APOLLO) focus on functional simulation but lack the ability to evaluate the self-driving software stack on various embedded platforms. To address these design needs, we present Chauffeur, the first open-source end-to-end benchmark suite for self-driving vehicles with configurable representative workloads. Chauffeur is easy to configure and run, enabling researchers to evaluate different platform configurations and explore alternative instantiations of the self-driving software pipeline. Chauffeur runs on diverse emerging platforms and exploits heterogeneous onboard resources. Our initial characterization of Chauffeur on two embedded platforms, NVIDIA Jetson TX2 and Drive PX2, enables a comparative evaluation of these GPU platforms when executing an end-to-end self-driving computational pipeline. It assesses end-to-end response times on these platforms and creates opportunities to form application gangs for better response times. Chauffeur enables researchers to benchmark representative self-driving workloads and flexibly compose them for different self-driving scenarios to explore end-to-end tradeoffs between design constraints, power budget, real-time performance requirements, and accuracy of applications.
  2. Modern automotive systems feature dozens of electronic control units (ECUs) for chassis, body, and powertrain functions. These systems are costly and inflexible to upgrade, requiring ever-increasing numbers of ECUs to support new features such as advanced driver assistance systems (ADAS), autonomous technologies, and infotainment. To counter these challenges, we propose DriveOS, a safe, secure, extensible, and timing-predictable system for modern vehicle management in a centralized platform. DriveOS is based on a separation kernel, where timing- and safety-critical ECU functions are implemented in a real-time OS (RTOS) alongside non-critical software in Linux or Android. The system enforces the separation, or partitioning, of both software and hardware among different OSes. DriveOS runs on a relatively low-cost embedded PC-class platform that supports multiple cores and hardware virtualization. Instrument cluster, in-vehicle infotainment, and ADAS services are implemented in a Yocto Linux guest, which communicates with critical real-time services via secure shared memory. The RTOS manages a real-time controller area network (CAN) interface that is inaccessible to Linux services except via well-defined and legitimate communication channels. In this work, we integrate three Qt-based services written for Yocto Linux, running in parallel with a real-time longitudinal controller task and multiple CAN bus concentrators, for vehicular sensor data processing and actuation. We demonstrate the benefits and performance of DriveOS with a hardware-in-the-loop CARLA simulation using a real car dataset.
  3. The ever-growing parameter size and computation cost of Convolutional Neural Network (CNN) models hinder their deployment onto resource-constrained platforms. Network pruning techniques have been proposed to remove the redundancy in CNN parameters and produce a sparse model. Sparse-aware accelerators have also been proposed to reduce the computation cost and memory bandwidth requirements of inference by leveraging model sparsity. The irregularity of sparse patterns, however, limits the efficiency of those designs. Researchers have proposed to address this issue by creating a regular sparsity pattern through hardware-aware pruning algorithms. However, the pruning rate of these solutions is largely limited by the enforced sparsity patterns. This limitation motivates us to explore other compression methods beyond pruning. We found that kernel decomposition, with its two decoupled computation stages, could potentially take the processing of the sparse pattern off the critical path of inference and achieve a high compression ratio without enforcing sparsity patterns. To exploit these advantages, we propose ESCALATE, an algorithm-hardware co-design approach based on kernel decomposition. At the algorithm level, ESCALATE reorganizes the two computation stages of the decomposed convolution to enable stream processing of the intermediate feature map. We propose a hybrid quantization scheme to exploit the different reuse frequency of each part of the decomposed weight. At the architecture level, ESCALATE proposes a novel ‘Basis-First’ dataflow and its corresponding microarchitecture design to maximize the benefits brought by the decomposed convolution.
  4. Real-time applications such as autonomous and connected cars, surveillance, and online learning have to train on streaming data. They require low-latency, high-throughput machine learning (ML) functions resident in the network and in the cloud to perform learning and inference. NFV on edge cloud platforms can support these applications by providing heterogeneous computing, including GPUs and other accelerators, to offload ML-related computation. GPUs provide the speedup in learning and inference needed to meet the requirements of these latency-sensitive real-time applications. Supporting ML inference and learning efficiently for streaming data in NFV platforms poses several challenges. In this paper, we present a framework, NetML, that runs existing ML applications on a heterogeneous NFV platform that includes both CPUs and GPUs. NetML efficiently transfers the appropriate packet payload to the GPU, minimizing overheads, avoiding locks, and avoiding CPU-based data copies. Additionally, NetML minimizes latency by maximizing overlap between the data movement and GPU computation. We evaluate the efficiency of our approach for training and inference using popular object detection algorithms on our platform. NetML reduces the latency for inferring images by more than 20% and increases the training throughput by 30% while reducing CPU utilization compared to other state-of-the-art alternatives.
  5. With the growing demand for enhanced performance and scalability in cloud applications and systems, data center architectures are evolving to incorporate heterogeneous computing fabrics that leverage CPUs, GPUs, and FPGAs. Unlike traditional processing platforms like CPUs and GPUs, FPGAs offer the unique ability for hardware reconfiguration at run-time, enabling improved and tailored performance, flexibility, and acceleration. FPGAs excel at executing large-scale search optimization, acceleration, and signal processing tasks while consuming low power and minimizing latency. Major public cloud providers, such as Amazon, Huawei, Microsoft, Alibaba, and others, have already begun integrating FPGA-based cloud acceleration services into their offerings. Although FPGAs in cloud applications facilitate customized hardware acceleration, they also introduce new security challenges that demand attention. Granting cloud users the capability to reconfigure hardware designs after deployment may create potential vulnerabilities for malicious users, thereby jeopardizing entire cloud platforms. In particular, multi-tenant FPGA services, where a single FPGA is divided spatially among multiple users, are highly vulnerable to such attacks. This paper examines the security concerns associated with multi-tenant cloud FPGAs, provides a comprehensive overview of the related security, privacy and trust issues, and discusses forthcoming challenges in this evolving field of study. 