The BUNDLE and BUNDLEP scheduling algorithms are cache-cognizant thread-level scheduling algorithms and associated worst case execution time and cache overhead (WCETO) techniques for hard real-time multi-threaded tasks. The BUNDLE-based approaches utilize the inter-thread cache benefit to reduce WCETO values for jobs. Currently, the BUNDLE-based approaches are limited to scheduling a single task. This work aims to expand the applicability of BUNDLE-based scheduling to multiple task multi-threaded task sets. BUNDLE-based scheduling leverages knowledge of potential cache conflicts to selectively preempt one thread in favor of another from the same job. This thread-level preemption is a requirement for the run-time behavior and WCETO calculation to receive the benefit of BUNDLE-based approaches. This work proposes scheduling BUNDLE-based jobs non-preemptively according to the earliest deadline first (EDF) policy. Jobs are forbidden from preempting one another, while threads within a job are allowed to preempt other threads. An accompanying schedulability test is provided, named Threads Per Job (TPJ). TPJ is a novel schedulability test, input is a task set specification which may be transformed (under certain restrictions); dividing threads among tasks in an effort to find a feasible task set. Enhanced by the flexibility to transform task sets and taking advantage of the inter-thread cache benefit, the evaluation shows TPJ scheduling task sets fully preemptive EDF cannot.
more »
« less
Slipstream Processors Revisited: Exploiting Branch Sets
Delinquent branches and loads remain key performance limiters in some applications. One approach to mitigate them is pre-execution. Broadly, there are two classes of pre-execution: one class repeatedly forks small helper threads, each targeting an individual dynamic instance of a delinquent branch or load; the other class begins with two redundant threads in a leader-follower arrangement, and speculatively reduces the leading thread. The objective of this paper is to design a new pre-execution microarchitecture that meets four criteria: (i) retains the simpler coordination of a leader-follower microarchitecture, (ii) is fully automated with just hardware, (iii) targets both branches and loads, (iv) and is effective. We review prior preexecution proposals and show that none of them meet all four criteria. We develop Slipstream 2.0 to meet all four criteria. The key innovation in the space of leader-follower architectures is to remove the forward control-flow slices of delinquent branches and loads, from the leading thread. This innovation overcomes key limitations in the only other hardware-only leader-follower prior works: Slipstream and Dual Core Execution (DCE). Slipstream removes backward slices of confident branches to pre-execute unconfident branches, which is ineffective in phases dominated by unconfident branches when branch pre-execution is most needed. DCE is very effective at tolerating cache-missed loads, unless their dependent branches are mispredicted. Removing forward control-flow slices of delinquent branches and delinquent loads enables two firsts, respectively: (1) leader-follower-style branch pre-execution without relying on confident instruction removal, and (2) tolerance of cache-missed loads that feed mispredicted branches. For SPEC 2006/2017 SimPoints wherein Slipstream 2.0 is auto-enabled, it achieves geomean speedups of 67%, 60%, and 12%, over baseline (one core), Slipstream, and DCE.
more »
« less
- Award ID(s):
- 1823517
- PAR ID:
- 10194560
- Date Published:
- Journal Name:
- 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
- Page Range / eLocation ID:
- 105 to 117
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Record-and-replay systems are useful tools for debugging non-deterministic parallel programs by first recording an execution and then replaying that execution to produce the same access pattern. Existing record-and-replay systems generally target thread-based execution models, and record the behaviors and interleavings of individual threads. Dynamic multithreaded languages and libraries, such as the Cilk family, OpenMP, TBB, etc., do not have a notion of threads. Instead, these languages provide a processor-oblivious model of programming, where programs expose task-parallelism using high-level constructs such as spawn/sync without regard to the number of threads/cores available to run the program. Thread-based record-and-replay would violate the processor-oblivious nature of these programs, as they incorporate the number of threads into the recorded information, constraining the replayed execution to the same number of threads. In this paper, we present a processor-oblivious record-and-replay scheme for such languages where record and replay can use different number of processors and both are scheduled using work stealing. We provide theoretical guarantees for our record and replay scheme --- namely that record is optimal for programs with one lock and replay is near-optimal for all cases. In addition, we implemented this scheme in the Cilk Plus runtime system and our evaluation indicates that processor-obliviousness does not cause substantial overheads.more » « less
-
Modern supercomputers have millions of cores, each capable of executing one or more threads of program execution. In these computers the site of execution for program threads rarely, if ever, changes from the node in which they were born. This paper discusses the advantages that may accrue when thread states migrate freely from node to node, especially when migration is managed by hardware without requiring software intervention. Emphasis is on supporting the growing classes of algorithms where there is significant sparsity, irregularity, and lack of locality in the memory reference patterns. Evidence is drawn from reformulation of several kernels into a migrating thread context approximating that of an emerging architecture with such capabilities.more » « less
-
Abstract—Network slicing is a key capability for next gen- eration mobile networks. It enables infrastructure providers to cost effectively customize logical networks over a shared infrastructure. A critical component of network slicing is resource allocation, which needs to ensure that slices receive the resources needed to support their services while optimizing network effi- ciency. In this paper, we propose a novel approach to slice-based resource allocation named Guaranteed seRvice Efficient nETwork slicing (GREET). The underlying concept is to set up a con- strained resource allocation game, where (i) slices unilaterally optimize their allocations to best meet their (dynamic) customer loads, while (ii) constraints are imposed to guarantee that, if they wish so, slices receive a pre-agreed share of the network resources. The resulting game is a variation of the well-known Fisher mar- ket, where slices are provided a budget to contend for network resources (as in a traditional Fisher market), but (unlike a Fisher market) prices are constrained for some resources to ensure that the pre-agreed guarantees are met for each slice. In this way, GREET combines the advantages of a share-based approach (high efficiency by flexible sharing) and reservation-based ones (which provide guarantees by assigning a fixed amount of resources). We characterize the Nash equilibrium, best response dynamics, and propose a practical slice strategy with provable convergence properties. Extensive simulations exhibit substantial improvements over network slicing state-of-the-art benchmarks.more » « less
-
Many species of fish gather in dense collectives or schools where there are significant flow interactions from their shed wakes. Commonly, these swimmers shed a classic reverse von Kármán wake, however, schooling eels produce a bifurcated wake topology with two vortex rings shed per oscillation cycle. To examine the schooling interactions of a hydrofoil with a bifurcated wake topology, we present tomographic particle image velocimetry (tomo PIV) measurements of the flow interactions and direct force measurements of the performance of two low-aspect-ratio hydrofoils ( A R = 0.5 ) in an in-line and a staggered arrangement. Surprisingly, when the leader and follower are interacting in either arrangement there are only minor alterations to the flowfields beyond the superposition of the flowfields produced by the isolated leader and follower. Motivated by this finding, Garrick’s linear theory, a linear unsteady hydrofoil theory based on a potential flow assumption, was adapted to predict the lift and thrust performance of the follower. Here, the follower hydrofoil interacting with the leader’s wake is considered as the superposition of an isolated pitching foil with a time-varying cross-stream velocity derived from the wake flow measurements of the isolated leader. Linear theory predictions accurately capture the time-averaged lift force and some of the major peaks in thrust derived from the follower interacting with the leader’s wake in a staggered arrangement. The thrust peaks that are not predicted by linear theory are likely driven by spatial variations in the flowfield acting on the follower or nonlinear flow interactions; neither of which are accounted for in the simple theory. This suggests that unsteady potential flow theory that does account for spatial variations in the flowfield acting on a hydrofoil can provide a relatively simple framework to understand and model the flow interactions that occur in schooling fish. Additionally, schooling eels can derive thrust and efficiency increases of 63-80% in either a in-line or a staggered arrangement where the follower is between two branched momentum jets or with one momentum jet branch directly impinging on it, respectively.more » « less
An official website of the United States government

