Data center workloads are composed of multiresource jobs requiring a variety of computational resources including CPU cores, memory, disk space, and hardware accelerators. Modern servers can run multiple jobs in parallel, but a set of jobs can only run in parallel if the server has sufficient resources to satisfy the demands of each job. It is generally hard to find sets of jobs that perfectly utilize all server resources, and choosing the wrong set of jobs can lead to low resource utilization. This raises the question of how to allocate resources across a stream of arriving multiresource jobs to minimize the mean response time across jobs, the mean time from when a job arrives to the system until it is complete. Current policies for scheduling multiresource jobs are complex to analyze and hard to implement. We propose a class of simple policies, called Markovian Service Rate (MSR) policies. We show that the class of MSR policies is throughput-optimal, in that if a policy exists that can stabilize the system, then an MSR policy exists that stabilizes the system. We derive bounds on the mean response time under an MSR policy, and show how our bounds can be used to choose an MSR policy that minimizes mean response time.
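The defining feature of this policy class can be illustrated with a toy discrete-time simulation: a background Markov chain, not the queue lengths, decides which set of job classes ("configuration") the server runs each step. Everything concrete below (class names, rates, the switching probability) is invented for illustration; this is a sketch of the idea, not the paper's algorithm or its response-time bounds.

```python
import random

random.seed(0)
STEPS = 100_000
ARRIVAL_RATE = {"cpu_heavy": 0.30, "mem_heavy": 0.25}   # jobs per time step
SERVICE_PROB = {"cpu_heavy": 0.50, "mem_heavy": 0.50}   # completion prob. per step
# Candidate configurations: sets of job classes that fit on the server at once.
CONFIGS = [("cpu_heavy",), ("mem_heavy",), ("cpu_heavy", "mem_heavy")]
SWITCH_PROB = 0.1    # chance per step that the modulating chain jumps

queues = {c: 0 for c in ARRIVAL_RATE}
config = CONFIGS[-1]
in_system = completed = 0

for _ in range(STEPS):
    for c in ARRIVAL_RATE:                       # Bernoulli arrivals
        if random.random() < ARRIVAL_RATE[c]:
            queues[c] += 1
    if random.random() < SWITCH_PROB:            # the Markov chain picks the config,
        config = random.choice(CONFIGS)          # independent of queue lengths
    for c in config:                             # serve only classes in the config
        if queues[c] > 0 and random.random() < SERVICE_PROB[c]:
            queues[c] -= 1
            completed += 1
    in_system += sum(queues.values())            # running total for Little's law

mean_jobs = in_system / STEPS                    # average number in system
throughput = completed / STEPS                   # jobs finished per step
print(f"mean response time (Little's law): {mean_jobs / throughput:.2f} steps")
```

With these invented rates, each class's time-average service capacity exceeds its arrival rate, so the toy system is stable and the Little's-law estimate converges.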
On Studying CPU Performance of CloudLab Hardware
Empirical performance measurements of computer systems almost always exhibit variability and anomalies. Run-to-run and server-to-server variations are common for CPU, memory, disk, and network performance characteristics. In our previous work, we focused on taming performance variability for memory, disk, and network and established an interactive analysis service at https://confirm.fyi/ to help users of the CloudLab testbed better plan and conduct their experiments. In this paper, we describe our analysis of CPU variability based on over 1.3M performance measurements from nearly 1,800 servers and present our initial findings. The focus of this work is on capturing hardware variability, which can make repeatable experiments more difficult and can impact conclusions; it is therefore important for systems researchers to understand. (We note that, though we do not study it in this work, multi-tenancy and resource sharing in the cloud can exacerbate the problem.) Variability also inevitably impacts the performance and operation of middleware and higher-level applications, contributing to straggler problems in many domains, including HPC, Big Data, and Machine Learning, and on many types of cyberinfrastructures. We analyze data from CloudLab servers allocated in an exclusive fashion, with no virtualization. While our analysis focuses on a testbed that aims to promote reproducible research, we believe our approach and findings can be of value to people who manage, analyze, and utilize shared computing resources in supercomputers, clouds, and datacenters.
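As a minimal sketch of the kind of per-server summary such an analysis produces, the snippet below computes run-to-run variability (coefficient of variation) per server and flags outliers. The input file name, column names, and the 5% threshold are assumptions for illustration, not the study's actual dataset or methodology.

```python
import csv
import statistics
from collections import defaultdict

scores = defaultdict(list)
with open("cpu_benchmarks.csv") as f:            # hypothetical input file
    for row in csv.DictReader(f):                # columns assumed: server_id, score
        scores[row["server_id"]].append(float(row["score"]))

for server, vals in sorted(scores.items()):
    mean = statistics.fmean(vals)
    cov = statistics.stdev(vals) / mean if len(vals) > 1 else 0.0
    flag = "  <-- high run-to-run variability" if cov > 0.05 else ""
    print(f"{server}: n={len(vals)} mean={mean:.2f} CoV={cov:.2%}{flag}")
```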
- Award ID(s):
- 1743363
- PAR ID:
- 10197399
- Date Published:
- Journal Name:
- Proceedings of the Workshop on Midscale Education and Research Infrastructure and Tools (MERIT)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- Fast networks and the desire for high resource utilization in data centers and the cloud have driven disaggregation. Application compute is separated from storage, but this leads to high overheads when data must move over the network for simple operations on it. Alternatively, systems could allow applications to run application logic within storage via user-defined functions. Unfortunately, this ties provisioning and utilization of storage and compute resources together again. We present a new approach to executing storage-level functions in an in-memory key-value store that avoids this problem by dynamically deciding where to execute functions over data. Users write storage functions that are logically decoupled from storage, but storage servers choose where to run invocations of these functions physically. By using a server-internal cost model and observing function execution, servers choose to directly run inexpensive functions, while preferring to execute functions with high CPU cost at client machines. We show that with this approach storage servers can reduce network request processing costs, avoid server compute bottlenecks, and improve aggregate storage system throughput. We realize our approach on an in-memory key-value store that executes 3.2 million strictly serializable user-defined storage functions per second with 100 µs response times. When running a mix of logic from different applications, it provides better throughput than running that logic purely at storage servers (85% more) or purely at clients (10% more). For our workloads, it also reduces latency (up to 2x) and transactional aborts (up to 33%) over pure client-side execution. (A sketch of the cost-based placement idea appears after this list.)
- Web-based speed tests are popular among end-users for measuring their network performance. Thousands of measurement servers have been deployed in diverse geographical and network locations to serve users worldwide. However, most speed tests have opaque methodologies, which makes it difficult for researchers to interpret their highly aggregated test results, let alone leverage them for various studies. In this paper, we propose WebTestKit, a unified and configurable framework for facilitating automatic test execution and cross-layer analysis of test results for five major web-based speed test platforms. Capturing only the packet headers of traffic traces, WebTestKit performs in-depth analysis by carefully extracting HTTP and timing information from test runs. Our testbed experiments showed that WebTestKit is lightweight and accurate in interpreting encrypted measurement traffic. We applied WebTestKit to compare the use of HTTP requests across speed tests and to investigate the root causes impeding the accuracy of latency measurements, which play a vital role in test server selection and throughput estimation. (A header-only RTT-extraction sketch appears after this list.)
- Distributed key-value stores today require frequent key-value shard migration between nodes to react to dynamic workload changes for load balancing, data locality, and service elasticity. In this paper, we propose NetMigrate, a live migration approach for in-memory key-value stores based on programmable network data planes. NetMigrate migrates shards between nodes with zero service interruption and minimal performance impact. During migration, the switch data plane monitors the migration process in a fine-grained manner and directs client queries to the right server in real time, eliminating the overhead of pulling data between nodes. We implement a NetMigrate prototype on a testbed consisting of a programmable switch and several commodity servers running Redis, and evaluate it under YCSB workloads. Our experiments demonstrate that NetMigrate improves query throughput by 6.5% to 416% and maintains low access latency during migration, compared to state-of-the-art migration approaches. (A simplified query-steering sketch appears after this list.)
- All computing infrastructure suffers from performance variability, be it bare-metal or virtualized. This phenomenon originates from many sources: some transient, such as noisy neighbors, and others more permanent but sudden, such as changes or wear in hardware, changes in the underlying hypervisor stack, or even undocumented interactions between the policies of the computing resource provider and the active workloads. Thus, performance measurements obtained on clouds, HPC facilities, and, more generally, datacenter environments are almost guaranteed to exhibit performance regimes that evolve over time, which leads to undesirable nonstationarities in application performance. In this paper, we present our analysis of the performance of the bare-metal hardware available on the CloudLab testbed, where we focus on quantifying the evolving performance regimes using changepoint detection. We describe our findings, backed by a dataset with nearly 6.9M benchmark results collected from over 1600 machines over a period of 2 years and 9 months. These findings yield a comprehensive characterization of real-world performance variability patterns in one computing facility, a methodology for studying such patterns on other infrastructures, and contribute to a better understanding of performance variability in general. (A minimal changepoint-scan sketch appears after this list.)
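For the dynamic function-placement entry above, here is a minimal Python sketch of the general idea: the storage server profiles each function's observed CPU cost and pushes expensive invocations back to clients. The budget, the exponential moving average, and all names are assumptions for illustration, not the paper's actual cost model.

```python
from dataclasses import dataclass
import time

CPU_BUDGET_US = 50.0   # hypothetical per-invocation server CPU budget
EMA_ALPHA = 0.2        # weight for the moving average of observed cost

@dataclass
class FunctionProfile:
    avg_cost_us: float = 0.0   # smoothed observed CPU cost per invocation

profiles: dict[str, FunctionProfile] = {}

def invoke(name, func, *args):
    """Run `func` at the server if it looks cheap; otherwise tell the
    client to fetch the data and run the function itself."""
    prof = profiles.setdefault(name, FunctionProfile())
    if prof.avg_cost_us > CPU_BUDGET_US:
        return ("RUN_AT_CLIENT", None)          # server returns only raw records
    start = time.perf_counter()
    result = func(*args)                        # execute at the server
    cost_us = (time.perf_counter() - start) * 1e6
    # Update the profile so future placement decisions reflect observed cost.
    prof.avg_cost_us = (1 - EMA_ALPHA) * prof.avg_cost_us + EMA_ALPHA * cost_us
    return ("RAN_AT_SERVER", result)
```

A cold function (no profile yet) runs at the server once so its cost can be measured before any pushback decision is made.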
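For the WebTestKit entry, a small sketch of the header-only flavor of analysis it describes: estimating latency from TCP handshake timing alone, with no payload inspection, using scapy. The capture filename is hypothetical, and this illustrates the technique rather than WebTestKit's implementation.

```python
from scapy.all import rdpcap, IP, TCP   # pip install scapy

SYN, SYNACK = 0x02, 0x12
syn_times = {}

for pkt in rdpcap("speedtest_run.pcap"):      # hypothetical capture file
    if not (IP in pkt and TCP in pkt):
        continue
    ip, tcp = pkt[IP], pkt[TCP]
    flags = int(tcp.flags) & 0x3F             # keep only the classic flag bits
    if flags == SYN:                          # client SYN: remember send time
        syn_times[(ip.src, tcp.sport, ip.dst, tcp.dport)] = pkt.time
    elif flags == SYNACK:                     # server SYN/ACK: one RTT sample
        key = (ip.dst, tcp.dport, ip.src, tcp.sport)
        if key in syn_times:
            rtt_ms = float(pkt.time - syn_times.pop(key)) * 1000.0
            print(f"{ip.src}:{tcp.sport} handshake RTT = {rtt_ms:.1f} ms")
```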
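For the NetMigrate entry, a rough, simulated sketch of query steering during a live shard migration. NetMigrate itself implements this in the switch data plane (P4 on the switch ASIC); the Python below only illustrates the per-key routing decision, and the two-set state model and addresses are assumptions.

```python
SOURCE, DEST = "10.0.0.1", "10.0.0.2"   # hypothetical server addresses

migrated = set()     # keys whose copies have fully landed at the destination
in_flight = set()    # keys currently being copied

def route_query(op: str, key: str) -> str:
    """Decide which server should answer a query mid-migration."""
    if key in migrated:
        return DEST                  # the destination is now authoritative
    if op == "PUT" and key in in_flight:
        return DEST                  # send new writes ahead so none are lost
    return SOURCE                    # everything else stays at the source

# Example: as the migration engine moves a key, routing flips for that key only.
in_flight.add("user:42")
assert route_query("GET", "user:42") == SOURCE
assert route_query("PUT", "user:42") == DEST
migrated.add("user:42"); in_flight.discard("user:42")
assert route_query("GET", "user:42") == DEST
```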
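For the changepoint entry, a minimal single-changepoint scan over a synthetic benchmark series: pick the split that minimizes total within-segment squared error, the classic least-squares criterion for a shift in mean. The paper's analysis covers multiple evolving regimes; this sketch, with invented data, only shows the core idea.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic "performance regimes": a step change after sample 600.
series = np.concatenate([rng.normal(100, 2, 600), rng.normal(93, 2, 400)])

def best_changepoint(x):
    """Return the split index that minimizes within-segment squared error."""
    costs = []
    for k in range(10, len(x) - 10):             # keep both segments non-trivial
        left, right = x[:k], x[k:]
        cost = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        costs.append((cost, k))
    return min(costs)[1]

k = best_changepoint(series)
print(f"regime change detected at sample {k}: "
      f"{series[:k].mean():.1f} -> {series[k:].mean():.1f}")
```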