

Title: Holistic Resource Allocation Under Federated Scheduling for Parallel Real-time Tasks
With the technology trend of hardware and workload consolidation for embedded systems and the rapid development of edge computing, there has been increasing interest in supporting parallel real-time tasks to better utilize multi-core platforms while meeting stringent real-time constraints. For parallel real-time tasks, the federated scheduling paradigm, which assigns each parallel task a set of dedicated cores, achieves good theoretical bounds by ensuring exclusive use of processing resources to reduce interference. However, because cores share the last-level cache and memory bandwidth, in practice tasks may still interfere with each other despite executing on dedicated cores. Such resource interference due to concurrent accesses can be even more severe on embedded platforms or edge servers, where computing power and cache/memory space are limited. To tackle this issue, we present a holistic resource allocation framework for parallel real-time tasks under federated scheduling. Under the proposed framework, each parallel task is assigned not only dedicated cores but also dedicated cache and memory bandwidth resources. Further, we propose a holistic resource allocation algorithm that balances the allocation across the different resources to achieve good schedulability. Additionally, we provide a full implementation of our framework by extending the federated scheduling system with Intel’s Cache Allocation Technology and MemGuard. Finally, we demonstrate the practicality of our proposed framework via extensive numerical evaluations and empirical experiments using real benchmark programs.
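To make the resource-dedication idea concrete, below is a minimal sketch of how a per-task cache partition can be configured on Linux through the resctrl interface to Intel's Cache Allocation Technology. This is not the paper's implementation: the group name, core list, way mask, and PIDs are hypothetical, root privileges and a mounted resctrl filesystem are assumed, and MemGuard's per-core bandwidth interface is only noted in a comment since it varies across versions.

```python
# Illustrative sketch: give a parallel task dedicated cores and a private
# slice of the last-level cache via Linux resctrl (Intel CAT).
import os

RESCTRL = "/sys/fs/resctrl"  # assumes resctrl is mounted; requires root

def allocate(group_name, pids, cores, l3_way_mask, cache_id=0):
    """Assign `pids` the given cores and an exclusive L3 way mask (contiguous on Intel)."""
    group = os.path.join(RESCTRL, group_name)
    os.makedirs(group, exist_ok=True)
    # Dedicate cores: the group's schemata applies to these CPUs.
    with open(os.path.join(group, "cpus_list"), "w") as f:
        f.write(",".join(str(c) for c in cores))
    # Dedicate cache ways, e.g. 0x00F reserves ways 0-3 of L3 cache `cache_id`.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:{cache_id}={l3_way_mask:x}\n")
    # Attach the task's threads (resctrl accepts one PID per write).
    for pid in pids:
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(str(pid))
    # A per-core memory bandwidth budget would analogously be written to
    # MemGuard's debugfs interface (exact path/format depends on the version).

# Hypothetical usage: task tau_1 gets cores 2-3 and L3 ways 0-3.
# allocate("tau1", [1234, 1235], [2, 3], 0x00F)
```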
Award ID(s): 1948457
NSF-PAR ID: 10325157
Author(s) / Creator(s): ; ; ; ; ;
Date Published:
Journal Name: ACM Transactions on Embedded Computing Systems
Volume: 21
Issue: 1
ISSN: 1539-9087
Page Range / eLocation ID: 1 to 29
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like This
  1. In this paper, we present RT-Gang: a novel real-time gang scheduling framework that enforces a one-gang-at-a-time policy. We find that, on a multicore platform, co-scheduling multiple parallel real-time tasks would require highly pessimistic worst-case execution time (WCET) estimates and schedulability analysis, even when there are enough cores, due to contention in shared hardware resources such as the cache and the DRAM controller. In RT-Gang, all threads of a parallel real-time task form a real-time gang, and the scheduler globally enforces the one-gang-at-a-time policy to guarantee tight and accurate task WCETs. To minimize under-utilization, we integrate a state-of-the-art memory bandwidth throttling framework to allow safe execution of best-effort tasks. Specifically, any idle cores, if they exist, are used to schedule best-effort tasks, but their maximum memory bandwidth usage is strictly throttled to tightly bound the interference to real-time gang tasks. We implement RT-Gang in the Linux kernel and evaluate it on two representative embedded multicore platforms using both synthetic and real-world DNN workloads. The results show that RT-Gang dramatically improves system predictability with negligible overhead.
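As a rough illustration of the scheduling rule (hypothetical types, not RT-Gang's kernel implementation), the sketch below picks the single highest-priority ready gang and hands any leftover cores to bandwidth-throttled best-effort work; the budget value and the assumption that a gang occupies the lowest-numbered cores are simplifications.

```python
# One-gang-at-a-time decision rule: only the highest-priority ready
# real-time gang occupies cores; cores it leaves idle may run best-effort
# tasks under a strict memory bandwidth budget.
from dataclasses import dataclass

@dataclass
class Gang:
    name: str
    priority: int      # higher value = higher priority
    num_threads: int   # cores the gang occupies while it runs

def schedule(ready_gangs, num_cores, best_effort_budget_mbps=100):
    """Return (running gang, cores freed for best-effort work, per-core budget)."""
    if not ready_gangs:
        return None, list(range(num_cores)), best_effort_budget_mbps
    gang = max(ready_gangs, key=lambda g: g.priority)   # one gang at a time
    spare = list(range(gang.num_threads, num_cores))    # simplification: gang uses cores 0..n-1
    # Best-effort tasks on `spare` are throttled so the gang's WCET bound holds.
    return gang, spare, best_effort_budget_mbps

gangs = [Gang("dnn_infer", priority=2, num_threads=3),
         Gang("ctrl_loop", priority=5, num_threads=2)]
print(schedule(gangs, num_cores=4))  # ctrl_loop runs; cores 2-3 host throttled best-effort work
```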
  2. When scheduling multi-mode real-time systems on multi-core platforms, a key question is how to dynamically adjust shared resources, such as cache and memory bandwidth, when resource demands change, without jeopardizing schedulability during mode changes. This paper presents Omni, the first end-to-end solution to this problem. Omni consists of a novel multi-mode resource allocation algorithm and a resource-aware schedulability test that supports general mode-change semantics as well as dynamic cache and bandwidth resource allocation. Omni's resource allocation leverages the platform's concurrency and the diversity of the tasks' demands to minimize overload during mode transitions; it does so by intelligently co-distributing tasks and resources across cores. Omni's schedulability test ensures predictable mode transitions, and it takes into account mode-change effects on the resource demands on different cores, so as to best match their dynamic needs using the available resources. We have implemented a prototype of Omni and evaluated it using randomly generated multi-mode systems with several real-world benchmarks as the workload. Our results show that Omni has low overhead and that it is substantially more effective in improving schedulability than the state of the art.
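The intuition behind co-distribution can be shown with a toy check (this is not Omni's actual algorithm or test; the utilization model and numbers are purely illustrative): during a transition, a core may briefly carry demand from both the outgoing and the incoming mode, so spreading each mode's demand across cores lowers the transient peak.

```python
# Toy feasibility check for a mode transition: each core must absorb the
# overlap of its old-mode and new-mode demand without exceeding capacity.
def transition_feasible(old_demand, new_demand, capacity=1.0):
    """old_demand/new_demand: per-core utilization lists of equal length."""
    return all(o + n <= capacity for o, n in zip(old_demand, new_demand))

# Co-distributing the two modes across cores keeps the transient peak low:
print(transition_feasible([0.6, 0.1], [0.1, 0.6]))  # True: peaks on different cores
print(transition_feasible([0.6, 0.1], [0.6, 0.1]))  # False: core 0 overloads in transition
```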
  3. Multiprocessor scheduling of hard real-time tasks modeled by directed acyclic graphs (DAGs) exploits the inherent parallelism presented by the model. For DAG tasks, a node represents a request to execute an object on one of the available processors. In one DAG task, there may be multiple execution requests for one object, each represented by a distinct node. These distinct execution requests offer an opportunity to reduce their combined cache overhead through coordinated scheduling of objects as threads within a parallel task. The goal of this work is to realize this opportunity by incorporating the cache-aware BUNDLE-scheduling algorithm into federated scheduling of sporadic DAG task sets. This is the first work to incorporate instruction cache sharing into federated scheduling. The result is a modification of the DAG model named the DAG with objects and threads (DAG-OT). Under the DAG-OT model, descriptions of nodes explicitly include their underlying executable object and number of threads. When possible, nodes assigned the same executable object are collapsed into a single node, joining their threads when BUNDLE-scheduled. Compared to the DAG model, the DAG-OT model with cache-aware scheduling reduces the number of cores allocated to individual tasks by approximately 20 percent in the synthetic evaluation and up to 50 percent on a novel parallel computing platform implementation. By reducing the number of allocated cores, the DAG-OT model is able to schedule a subset of previously infeasible task sets.
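The collapsing step can be sketched in a few lines (hypothetical data structures, not the paper's implementation): nodes of a DAG task that execute the same object are merged into one multi-threaded node so their threads can be BUNDLE-scheduled together and share instruction-cache contents. The real algorithm collapses nodes only "when possible", i.e., when precedence constraints allow; the sketch ignores edges for brevity.

```python
# Merge DAG nodes that share an executable object into one DAG-OT node
# carrying the combined thread count (precedence constraints omitted).
from collections import defaultdict

def collapse_by_object(nodes):
    """nodes: list of (node_id, executable_object, num_threads) tuples."""
    merged = defaultdict(int)
    for _, obj, threads in nodes:
        merged[obj] += threads          # join threads of the shared object
    return [(obj, threads) for obj, threads in merged.items()]

dag = [(1, "fft", 1), (2, "filter", 1), (3, "fft", 1), (4, "fft", 1)]
print(collapse_by_object(dag))  # [('fft', 3), ('filter', 1)]
```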
  4. Federated scheduling is a generalization of partitioned scheduling for parallel tasks on multiprocessors and has been shown to be a competitive scheduling approach. However, federated scheduling may waste resources due to its dedicated allocation of processors to parallel tasks. In this work we introduce a novel algorithm for scheduling parallel tasks that require more than one processor to meet their deadlines (i.e., heavy tasks). The proposed algorithm computes a deterministic schedule for each heavy task based on its internal graph structure. It efficiently exploits the processors allocated to each task and thus reduces the number of processors the task requires. Experimental evaluation shows that our new federated scheduling algorithm significantly outperforms other state-of-the-art federated-based scheduling approaches, including semi-federated scheduling and reservation-based federated scheduling, which were developed to tackle resource waste in federated scheduling, as well as a stretching algorithm that also uses the tasks' graph structures.
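For context, the dedicated-core bound that these refinements improve on is the classic federated scheduling result of Li et al.: a heavy task with total work C, critical-path length L, and deadline D meets its deadline under any greedy (work-conserving) scheduler on ceil((C - L) / (D - L)) dedicated cores. A short sketch with illustrative numbers:

```python
# Classic federated core-allocation bound for a heavy task.
import math

def federated_cores(work, span, deadline):
    """work = total execution time C, span = critical path L, deadline = D."""
    assert span < deadline < work, "heavy task with a feasible critical path"
    return math.ceil((work - span) / (deadline - span))

print(federated_cores(work=12.0, span=2.0, deadline=4.0))  # 5 cores
```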
  5. Newly emerging multiprocessor system-on-a-chip (MPSoC) platforms provide hard processing cores together with programmable logic (PL) for high-performance computing applications. In this article, we take a deep look into these commercially available heterogeneous platforms and show how to design mixed-criticality applications such that different processing components can be isolated to avoid contention on shared resources such as the last-level cache and main memory. Our approach involves software/hardware co-design to achieve isolation between the different criticality domains. At the hardware level, we use a scratchpad memory (SPM) with dedicated interfaces inside the PL to avoid conflicts in main memory. At the software level, we employ a hypervisor to support cache coloring such that conflicts at the shared L2 cache can be avoided. In order to move tasks in and out of the SPM, we rely on a DMA engine and propose a new CPU-DMA co-scheduling policy, called Lazy Load, for which we also derive the response time analysis. The results of a case study on image processing demonstrate that contention on the shared memory subsystem can be avoided when running with our proposed architecture. Moreover, comprehensive schedulability evaluations show that the newly proposed Lazy Load policy outperforms the existing CPU-DMA scheduling approaches and is effective in mitigating main memory interference in our proposed architecture.
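The abstract does not spell out Lazy Load's mechanics, but the general benefit of CPU-DMA co-scheduling it builds on can be sketched: with a double-buffered SPM, the DMA engine can prefetch the next task's code and data while the CPU executes the current one, hiding DMA latency. The pipeline model and numbers below are entirely hypothetical, not the paper's policy or its response-time analysis.

```python
# Toy CPU-DMA pipeline: compare serialized load+execute against a
# double-buffered schedule where DMA loads overlap CPU execution.
def makespan(tasks, prefetch=True):
    """tasks: list of (dma_load_time, cpu_exec_time) in abstract time units."""
    cpu_free = dma_free = 0.0
    for load, exe in tasks:
        load_done = dma_free + load                 # DMA copies into the SPM
        cpu_free = max(cpu_free, load_done) + exe   # CPU runs once loaded
        # With double buffering the DMA may fetch the next task immediately;
        # with a single buffer it must wait until the CPU consumes this one.
        dma_free = load_done if prefetch else cpu_free
    return cpu_free

jobs = [(1.0, 3.0)] * 4
print(makespan(jobs, prefetch=False))  # 16.0: loads and execution serialized
print(makespan(jobs, prefetch=True))   # 13.0: later loads hidden behind execution
```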