

Title: Latency analysis of self-suspending task chains
Many cyber-physical systems offload computation-heavy programs to hardware accelerators (e.g., GPUs and TPUs) to reduce execution time. These applications self-suspend between offloading data to the accelerator and obtaining the returned results. Previous efforts have shown that self-suspending tasks can cause scheduling anomalies, but none has examined inter-task communication. This paper explores the data chain latency of self-suspending tasks with periodic activation and asynchronous message passing. We first present the cause of suspension-induced delays and a worst-case latency analysis. We then propose a rule for utilizing hardware co-processors to reduce data chain latency, together with a schedulability analysis. Simulation results show that the proposed strategy can improve overall latency while preserving system schedulability.
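As a rough point of reference (this is the classical bound for chains of periodic tasks with asynchronous, register-style communication, not the paper's suspension-aware analysis), the worst-case end-to-end data chain latency can be bounded by the sum of each stage's period and worst-case response time. A minimal sketch with made-up numbers:

```python
# Hypothetical illustration: classical worst-case latency bound for a chain of
# periodic tasks communicating asynchronously (latency <= sum of T_i + R_i).
# The paper's suspension-aware analysis refines this; the values are made up.

def chain_latency_bound(stages):
    """stages: list of (period, worst_case_response_time) per task in the chain.
    Returns the classical bound: sum of (T_i + R_i) over all chain stages."""
    return sum(period + wcrt for period, wcrt in stages)

if __name__ == "__main__":
    # Example chain: sensor task -> GPU offload task -> actuator task (ms)
    chain = [(10.0, 3.0), (20.0, 12.0), (10.0, 4.0)]
    print(f"Worst-case data chain latency <= {chain_latency_bound(chain)} ms")
```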
Award ID(s):
1815891
NSF-PAR ID:
10394082
Author(s) / Creator(s):
Date Published:
Journal Name:
Latency analysis of self-suspending task chains
Page Range / eLocation ID:
1299 to 1304
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Modern latency-sensitive and real-time systems often use multi-core platforms; thus, tasks on different cores share certain hardware resources, such as the memory bus and certain cache levels. This has two undesirable consequences: (1) tasks can interfere with each other, causing high latency for the system as a whole, and (2) it becomes difficult to meet deadlines, since the worst-case timing of a given task depends on the worst task it might have to compete with. Static partitioning isolates tasks from each other by allocating a certain fraction of the resources to each; however, many tasks execute in different phases (e.g., memory-intensive and CPU-intensive) that have different requirements. Thus, system designers are left with a choice between overprovisioning, based on the most demanding phase, or suboptimal performance. In this paper, we propose a pair of techniques, called DNA and DADNA, to address the above challenge. DNA increases throughput and decreases latency, by building an execution profile of each task to identify the phases, and then dynamically allocating resources based on which task can benefit the most; DADNA further adds support for soft real-time workloads by taking deadlines into account. We have built a prototype of both techniques in the Xen hypervisor; our experimental results show that, compared to a state-of-the-art solution, DNA and DADNA can substantially improve schedulability, reduce job deadline miss ratios, and cut latencies by more than a factor of two even in extremely overloaded situations. 
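A minimal sketch of the core idea of phase-aware dynamic allocation, assuming a hypothetical per-phase benefit profile (task names, profile format, and numbers are illustrative; this is not the DNA/DADNA Xen prototype):

```python
# Greedy phase-aware allocation sketch: each task reports the marginal benefit
# of one more resource unit in its *current* phase; the allocator repeatedly
# gives the next unit to the task with the largest marginal gain.

def allocate(tasks, total_units):
    """tasks: dict task -> list of marginal benefits for the 1st, 2nd, ... unit
    in that task's current phase. Returns dict task -> units allocated."""
    alloc = {t: 0 for t in tasks}
    for _ in range(total_units):
        candidates = [t for t in tasks if alloc[t] < len(tasks[t])]
        if not candidates:
            break  # every task has exhausted its profiled benefit
        best = max(candidates, key=lambda t: tasks[t][alloc[t]])
        alloc[best] += 1
    return alloc

if __name__ == "__main__":
    profiles = {
        "taskA_memory_phase": [0.30, 0.20, 0.05],  # memory-intensive: big early gains
        "taskB_cpu_phase": [0.02, 0.01, 0.01],     # CPU-intensive: barely benefits
    }
    print(allocate(profiles, total_units=4))
```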
  2.
    Analytic workloads on terabyte datasets are often run in the cloud, where application and storage servers are separate and connected via network. In order to saturate the storage bandwidth and to hide the long storage latency, such a solution requires an expensive server cluster with sufficient aggregate DRAM capacity and hardware threads. An alternative solution is to push the query computation into storage servers. In this paper we present an in-storage Analytics QUery Offloading MAchiNe (AQUOMAN) to “offload” most SQL operators, including multi-way joins, to SSDs. AQUOMAN executes Table Tasks, which apply a static dataflow graph of SQL operators to relational tables to produce an output table. Table Tasks use a streaming computation model, which allows AQUOMAN to process queries with a reasonable amount of DRAM for intermediate results. AQUOMAN is a general analytic query processor that can be integrated into the database software stack transparently. We have built a prototype of AQUOMAN in FPGAs and, using TPC-H benchmarks on 1TB datasets, shown that a single instance of a 1TB AQUOMAN disk can, on average, free up 70% of CPU cycles and reduce DRAM usage by 60%. One way to visualize this saving is as follows: if we run queries sequentially and ignore inter-query page cache reuse, MonetDB running on a 4-core, 16GB-DRAM machine with AQUOMAN-augmented SSDs performs, on average, as well as MonetDB running on a 32-core, 128GB-DRAM machine with standard SSDs.
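A toy, host-side illustration of the streaming computation model a Table Task implies: SQL-like operators composed as a static dataflow over row streams, so intermediate results are never fully materialized (operator and table names are made up; AQUOMAN's FPGA dataflow is not reproduced here):

```python
# Streaming operator pipeline sketch: rows flow one at a time through a fixed
# graph of operators, keeping intermediate-state memory small.

def scan(table):
    for row in table:          # in-storage, this would stream pages off the SSD
        yield row

def where(rows, predicate):
    return (r for r in rows if predicate(r))

def project(rows, columns):
    return ({c: r[c] for c in columns} for r in rows)

if __name__ == "__main__":
    lineitem = [{"qty": 10, "price": 4.0}, {"qty": 60, "price": 2.5}]
    pipeline = project(where(scan(lineitem), lambda r: r["qty"] > 20), ["price"])
    for out_row in pipeline:   # one output row at a time, no materialized table
        print(out_row)
```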
  3.
    Edge cloud data centers (Edge) are deployed to provide responsive services to end-users. Edge can host more powerful CPUs and DNN accelerators such as GPUs, and may be used for offloading tasks from end-user devices that require more significant compute capabilities. But Edge resources may also be limited and must be shared across multiple applications that process requests concurrently from several clients. However, multiplexing GPUs across applications is challenging. With edge cloud servers needing to process large volumes of streaming data, and with the advent of multi-GPU systems, getting that data from the network to the GPU can be a bottleneck, limiting the amount of work the GPU cluster can do. The lack of prompt notification of job completion from the GPU can also result in poor GPU utilization. We build on our recent work on controlled spatial sharing of a single GPU to support multi-GPU systems and propose a framework that addresses these challenges. Unlike the state-of-the-art uncontrolled spatial sharing currently available with systems such as CUDA-MPS, our controlled spatial sharing approach uses each GPU in the cluster efficiently by removing interference between applications, resulting in much better, more predictable inference latency. We also use each cluster GPU's DMA engines to offload data transfers to the GPU complex, thereby preventing the CPU from being the bottleneck. Finally, our framework uses the CUDA event library to give timely, low-overhead GPU notifications. Our evaluations show we can achieve low DNN inference latency and improve DNN inference throughput by at least a factor of 2.
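A purely conceptual sketch of the dispatch pattern described above: requests are routed to the least-loaded GPU's queue, and completion is signaled by an event rather than by polling. The GPU work is faked with a sleep and all names are hypothetical; per-GPU DMA engines and CUDA events are not modeled here:

```python
# Multi-GPU request dispatch sketch with event-based completion notification.
import threading, queue, time

class GpuWorker:
    def __init__(self, gpu_id):
        self.gpu_id = gpu_id
        self.jobs = queue.Queue()
        self.inflight = 0
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, job, done_event):
        self.inflight += 1
        self.jobs.put((job, done_event))

    def _run(self):
        while True:
            job, done_event = self.jobs.get()
            time.sleep(0.01)      # stand-in for data transfer + kernel execution
            self.inflight -= 1
            done_event.set()      # prompt completion notice, no polling needed

if __name__ == "__main__":
    gpus = [GpuWorker(i) for i in range(2)]
    events = []
    for req in range(6):
        target = min(gpus, key=lambda g: g.inflight)   # least-loaded GPU
        ev = threading.Event()
        target.submit(f"inference-{req}", ev)
        events.append(ev)
    for ev in events:
        ev.wait()
    print("all inference requests completed")
```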
  4. Edge computing allows end-user devices to offload heavy computation to nearby edge servers for reduced latency, maximized profit, and/or minimized energy consumption. Data-dependent tasks that analyze locally-acquired sensing data are among the most common candidates for task offloading in edge computing. As a result, the total latency and network load are affected by the total amount of data transferred from end-user devices to the selected edge servers. Most existing solutions for task allocation in edge computing do not take into consideration that some user tasks may actually operate on the same data items. Making the task allocation algorithm aware of the data-sharing characteristics of tasks can help reduce network load at a negligible profit loss by allocating more tasks that share data on the same server. In this paper, we formulate the data sharing-aware task allocation problem, which makes decisions on task allocation for maximized profit and minimized network load by taking into account the data-sharing characteristics of tasks. In addition, because the problem is NP-hard, we design the DSTA algorithm, which finds a solution to the problem in polynomial time. We analyze the performance of the proposed algorithm against a state-of-the-art baseline that only maximizes profit. Our extensive analysis shows that DSTA yields about 8 times lower data load on the network while keeping total profit within a factor of 1.03 of the state-of-the-art on average.
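A hypothetical greedy sketch of the allocation idea (not the paper's DSTA algorithm, which also models profit and server capacity): steer tasks that need the same data items to the same edge server, so each shared item crosses the network once per server rather than once per task:

```python
# Data sharing-aware placement sketch: place each task where it reuses the most
# already-transferred data, and count only newly transferred items as load.

def allocate(tasks, servers):
    """tasks: dict task -> set of data item ids it needs.
    servers: list of server ids. Returns (assignment, network_load)."""
    assignment = {}
    cached = {s: set() for s in servers}   # data items already on each server
    load = 0
    for t, items in tasks.items():
        best = max(servers, key=lambda s: len(items & cached[s]))
        new_items = items - cached[best]
        load += len(new_items)             # only missing items are transferred
        cached[best] |= new_items
        assignment[t] = best
    return assignment, load

if __name__ == "__main__":
    tasks = {"t1": {"d1", "d2"}, "t2": {"d2", "d3"}, "t3": {"d4"}}
    print(allocate(tasks, ["edge0", "edge1"]))
```

Note that this toy greedy ignores server capacity and per-task profit, which is exactly the trade-off a real allocator must balance.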
  5.
    In ROS (Robot Operating System), most applications in time- and safety-critical domains are constructed in the form of callback chains with data dependencies. Due to shortcomings in its real-time support, ROS does not provide strong timing guarantees, which can lead to disastrous results. Although ROS2 claims to enhance real-time capability, ensuring predictable end-to-end chain latency remains a challenging problem. In this paper, we propose a new priority-driven chain-aware scheduler for the ROS2 framework and present an end-to-end latency analysis for the proposed scheduler. With our scheduler, callbacks are prioritized based on the given timing requirements of the corresponding chains, so that the end-to-end latency of critical chains can be improved with a predictable bound. The proposed scheduling design includes priority assignment and resource allocation considering all ROS2 scheduling-related abstractions, e.g., callbacks, nodes, and executors. To the best of our knowledge, this is the first work to address the inherent limitations of ROS2 in end-to-end latency by proposing a new scheduler design. We have implemented our scheduler in ROS2 running on an NVIDIA Xavier NX and have conducted case studies and schedulability experiments. The results show that the proposed scheduler yields a substantial improvement in end-to-end latency over the default ROS2 scheduler and the latest work in real-world scenarios.
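A simplified sketch of the chain-aware prioritization idea: each callback inherits the priority of the most critical chain it participates in, and the executor always dispatches the highest-priority ready callback. Chain and callback names are made up, and the paper's ROS2 executor design and analysis are not reproduced here:

```python
# Chain-aware callback priority assignment and a toy highest-priority-first executor.
import heapq

def assign_priorities(chains):
    """chains: list of (chain_priority, [callback names]), smaller = more critical.
    Returns dict callback -> effective priority (min over its chains)."""
    prio = {}
    for chain_prio, callbacks in chains:
        for cb in callbacks:
            prio[cb] = min(prio.get(cb, chain_prio), chain_prio)
    return prio

def run(ready_callbacks, prio):
    heap = [(prio[cb], cb) for cb in ready_callbacks]
    heapq.heapify(heap)
    while heap:
        _, cb = heapq.heappop(heap)   # always run the most critical ready callback
        print("executing", cb)

if __name__ == "__main__":
    chains = [(0, ["camera_cb", "detect_cb", "brake_cb"]),   # critical chain
              (5, ["log_cb", "detect_cb"])]                  # best-effort chain
    prio = assign_priorities(chains)
    run(["log_cb", "brake_cb", "detect_cb", "camera_cb"], prio)
```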