Content delivery networks (CDNs) carry much of today's Internet traffic by caching and serving the content that users request. A major goal of a CDN is to improve the hit probabilities of its caches, thereby reducing WAN traffic and user-perceived latency. In this paper, we develop a new approach for caching in CDNs that learns from optimal caching to make decisions. To attain this goal, we first propose HRO to compute an upper bound on optimal caching in an online manner, and then leverage HRO to inform future content admission and eviction. We call this new cache design LHR. We show that LHR is efficient, since it combines a detection mechanism for model updates, an auto-tuned threshold-based model for content admission, and a simple eviction rule. We have implemented an LHR simulator as well as prototypes within Apache Traffic Server and Caffeine. Our experimental results using four production CDN traces show that LHR consistently outperforms the state of the art, with an increase in hit probability of up to 9% and a reduction in WAN traffic of up to 15% compared to a typical production CDN cache. Our evaluation of the LHR prototype shows that it imposes only moderate overhead and can be deployed on today's CDN servers.
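As a rough illustration of the admission/eviction split described in the abstract, the sketch below shows a toy cache that admits an object only when a learned score clears a threshold and otherwise evicts by recency. The names (`ThresholdAdmissionCache`, `theta`, `score`) are illustrative assumptions, not the paper's LHR implementation; the score function is left as a stub where LHR would consult the model informed by the HRO upper bound.

```python
from collections import OrderedDict

class ThresholdAdmissionCache:
    """Toy LRU cache with threshold-based admission (illustrative, not LHR itself)."""

    def __init__(self, capacity, theta=0.5):
        self.capacity = capacity    # total bytes available
        self.theta = theta          # admission threshold; auto-tuned in the real design
        self.store = OrderedDict()  # key -> object size, ordered by recency

    def score(self, key):
        # Stub for the learned admission score; LHR derives this from the
        # online upper bound on optimal caching (HRO).
        return 1.0

    def request(self, key, size):
        """Serve one request; return True on a hit, False on a miss."""
        if key in self.store:
            self.store.move_to_end(key)         # hit: refresh recency
            return True
        if size <= self.capacity and self.score(key) >= self.theta:
            while self.store and sum(self.store.values()) + size > self.capacity:
                self.store.popitem(last=False)  # simple eviction rule: drop the LRU object
            self.store[key] = size
        return False
```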
Challenges and Opportunities of DNN Model Execution Caching
We explore the opportunities and challenges of model execution caching, a nascent research area that promises to improve the performance of cloud-based deep inference serving. Broadly, model execution caching relies on servers that are geographically close to the end-device to service inference requests, resembling a traditional content delivery network (CDN). However, unlike a CDN, such schemes cache execution rather than static objects. We identify the key challenges inherent to this problem domain and describe the similarities and differences with existing caching techniques. We further introduce several emergent concepts unique to this domain, such as memory-adaptive models and multi-model hosting, which allow us to make dynamic adjustments to the memory requirements of model execution.
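As a hedged sketch of what a memory-adaptive model might mean in practice, the snippet below picks the most accurate model variant that fits in the memory currently available on a serving node; the variant names and footprints are made up for illustration and are not from the paper.

```python
# Hypothetical catalogue of model variants, ordered from most to least accurate.
MODEL_VARIANTS = [
    ("resnet152", 2300),    # (name, approximate memory footprint in MB) -- illustrative numbers
    ("resnet50", 900),
    ("mobilenet_v2", 300),
]

def pick_variant(free_memory_mb):
    """Return the most accurate variant that fits in the free memory, else None."""
    for name, footprint_mb in MODEL_VARIANTS:
        if footprint_mb <= free_memory_mb:
            return name
    return None  # nothing fits locally; fall back to forwarding the request upstream
```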
- PAR ID: 10159004
- Date Published:
- Journal Name: DIDL '19: Proceedings of the Workshop on Distributed Infrastructures for Deep Learning
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Web services rely on caching at nearly every layer of the system architecture. Commonly, each cache is implemented and maintained independently by a distinct team and is highly specialized to its function. For example, an application-data cache would be independent from a CDN cache. However, this approach ignores the difficult challenges that different caching systems have in common, greatly increasing the overall effort required to deploy, maintain, and scale each cache. This paper presents a different approach to cache development, successfully employed at Facebook, which extracts a core set of common requirements and functionality from otherwise disjoint caching systems. CacheLib is a general-purpose caching engine, designed based on experiences with a range of caching use cases at Facebook, that facilitates the easy development and maintenance of caches. CacheLib was first deployed at Facebook in 2017 and today powers over 70 services including CDN, storage, and application-data caches. This paper describes our experiences during the transition from independent, specialized caches to the widespread adoption of CacheLib. We explain how the characteristics of production workloads and use cases at Facebook drove important design decisions. We describe how caches at Facebook have evolved over time, including the significant benefits seen from deploying CacheLib. We also discuss the implications our experiences have for future caching design and research.
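To make the idea of a single general-purpose engine behind many caching use cases concrete, here is a deliberately minimal toy interface; it is not CacheLib's actual API (CacheLib is a C++ library), only an illustration of one engine shared across otherwise disjoint services.

```python
class CacheEngine:
    """Toy byte-budgeted cache with FIFO eviction (illustrative only)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.items = {}  # key -> (value, size); dicts preserve insertion order

    def put(self, key, value, size):
        if size > self.capacity:
            return False
        if key in self.items:                        # replacing an entry frees its space
            self.used -= self.items.pop(key)[1]
        while self.items and self.used + size > self.capacity:
            oldest = next(iter(self.items))          # FIFO eviction for simplicity
            self.used -= self.items.pop(oldest)[1]
        self.items[key] = (value, size)
        self.used += size
        return True

    def get(self, key):
        entry = self.items.get(key)
        return entry[0] if entry else None

# The same engine code can back very different use cases:
cdn_cache = CacheEngine(capacity_bytes=64 * 2**20)       # CDN objects
app_data_cache = CacheEngine(capacity_bytes=16 * 2**20)  # application data
```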
-
There has been a growing interest in using GPUs to accelerate data analytics due to their massive parallelism and high memory bandwidth. The main constraint of using GPUs for data analytics is the limited capacity of GPU memory. Heterogeneous CPU-GPU query execution is a compelling approach to mitigate the limited GPU memory capacity and PCIe bandwidth. However, the design space of heterogeneous CPU-GPU query execution has not been fully explored. We aim to improve state-of-the-art CPU-GPU data analytics engines by optimizing data placement and heterogeneous query execution. First, we introduce a semantic-aware, fine-grained caching policy that takes into account various aspects of the workload, such as query semantics, data correlation, and query frequency, when determining data placement between CPU and GPU. Second, we introduce a heterogeneous query executor that can fully exploit data in both CPU and GPU and coordinate query execution at a fine granularity. We integrate both solutions in Mordred, our novel hybrid CPU-GPU data analytics engine. Evaluation on the Star Schema Benchmark shows that the semantic-aware caching policy can outperform the best traditional caching policy by up to 3x. Compared to existing GPU DBMSs, Mordred can outperform them by an order of magnitude.
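The following sketch shows one way a semantic-aware placement policy could rank columns for GPU residency using query frequency and co-access; the scoring weights and field names are assumptions for illustration and do not reproduce Mordred's actual policy.

```python
def placement_score(column_name, query_log):
    """Score a column by how often queries touch it and how often it is co-accessed."""
    freq = sum(1 for q in query_log if column_name in q["columns"])
    co_accessed = sum(1 for q in query_log
                      if column_name in q["columns"] and len(q["columns"]) > 1)
    return freq + 0.5 * co_accessed   # illustrative weighting

def choose_gpu_resident(columns, query_log, gpu_budget_bytes):
    """Greedily keep the highest-scoring columns on the GPU until the budget is spent."""
    ranked = sorted(columns,
                    key=lambda c: placement_score(c["name"], query_log),
                    reverse=True)
    chosen, used = [], 0
    for col in ranked:
        if used + col["bytes"] <= gpu_budget_bytes:
            chosen.append(col["name"])
            used += col["bytes"]
    return chosen
```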
-
Caches are at the heart of latency-sensitive systems. In this paper, we identify a growing challenge for the design of latency-minimizing caches called delayed hits. Delayed hits occur at high throughput, when multiple requests to the same object queue up before an outstanding cache miss is resolved. This effect increases latencies beyond the predictions of traditional caching models and simulations; in fact, caching algorithms are designed as if delayed hits simply didn't exist. We show that traditional caching strategies -- even so-called 'optimal' algorithms -- can fail to minimize latency in the presence of delayed hits. We design a new, latency-optimal offline caching algorithm called belatedly, which reduces average latencies by up to 45% compared to the traditional, hit-rate-optimal Belady's algorithm. Using belatedly as our guide, we show that incorporating an object's 'aggregate delay' into online caching heuristics can improve latencies for practical caching systems by up to 40%. We implement a prototype, Minimum-AggregateDelay (mad), within a CDN caching node. Using a CDN production trace and backends deployed in different geographic locations, we show that mad can reduce latencies by 12-18% depending on the backend RTTs.
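As a hedged illustration of folding an object's aggregate delay into an online heuristic, the ranking below keeps objects longer when misses on them would strand many queued requests behind a slow backend; the constants and field names are assumptions, not the mad prototype's actual logic.

```python
def aggregate_delay_ms(obj):
    """Latency accumulated by requests queued behind an outstanding miss for this object."""
    return obj["queued_requests"] * obj["backend_rtt_ms"]

def eviction_rank(obj, now_ms):
    # Lower rank = evicted sooner. Pure recency is discounted for objects whose
    # misses are expensive (large aggregate delay), so they are retained longer.
    recency_ms = now_ms - obj["last_access_ms"]
    return aggregate_delay_ms(obj) / (1.0 + recency_ms)
```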
-
Content Distribution Networks (CDNs) manage their own caching or routing overlay networks to provide reliable and efficient content delivery services. CDNs have become one of the most important tools on the Internet and are responsible for the majority of today's Internet traffic. The performance of CDNs directly influences the experience of end users. In this paper, we develop several analyses to identify the key factors influencing the overall performance of a CDN. The primary results demonstrate that the caching overlays and the routing overlays each have their own influential factors affecting CDN performance. Our results also show that the transmission latency between a surrogate and a content owner is a critical factor determining the overall performance of routing overlays. Furthermore, we argue that the surrogate assignment policy of a routing overlay needs to take this latency seriously into account. Our analysis results provide a context for the CDN community on preferable surrogate assignment solutions.
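A minimal sketch of a surrogate-assignment rule that accounts for the surrogate-to-origin latency highlighted above: the expected latency of serving a client combines the client-to-surrogate RTT with the surrogate-to-origin RTT weighted by the miss probability. The latency tables and the fixed hit probability are illustrative inputs, not results from the paper.

```python
def assign_surrogate(client, surrogates, rtt_client_ms, rtt_origin_ms, hit_prob=0.8):
    """Pick the surrogate with the lowest expected latency for this client.

    rtt_client_ms: dict mapping (client, surrogate) -> RTT in ms
    rtt_origin_ms: dict mapping surrogate -> RTT to the content owner in ms
    """
    def expected_latency(s):
        # On a hit the surrogate answers directly; on a miss it must also reach the origin.
        return rtt_client_ms[(client, s)] + (1.0 - hit_prob) * rtt_origin_ms[s]

    return min(surrogates, key=expected_latency)
```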