skip to main content


Title: Improving Cache Performance for Large-Scale Photo Stores via Heuristic Prefetching Scheme
Photo service providers are facing critical challenges of dealing with the huge amount of photo storage, typically in a magnitude of billions of photos, while ensuring national-wide or world-wide satisfactory user experiences. Distributed photo caching architecture is widely deployed to meet high performance expectations, where efficient still mysterious caching policies play essential roles. In this work, we present a comprehensive study on internet-scale photo caching algorithms in the case of QQPhoto from Tencent Inc., the largest social network service company in China. We unveil that even advanced cache algorithms can only perform at a similar level as simple baseline algorithms and there still exists a large performance gap between these cache algorithms and the theoretically optimal algorithm due to the complicated access behaviors in such a large multi-tenant environment. We then expound the reasons behind this phenomenon via extensively investigating the characteristics of QQPhoto workloads. Finally, in order to realistically further improve QQPhoto cache efficiency, we propose to incorporate a prefetcher in the cache stack based on the observed immediacy feature that is unique to the QQPhoto workload. The prefetcher proactively prefetches selected photos into cache before are requested for the first time to eliminate compulsory misses and promote hit ratios. Our extensive evaluation results show that with appropriate prefetching we improve the cache hit ratio by up to 7.4%, while reducing the average access latency by 6.9% at a marginal cost of 4.14% backend network traffic compared to the original system that performs no prefetching.  more » « less
Award ID(s):
1717660 1702474
NSF-PAR ID:
10106222
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Date Published:
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
ISSN:
1045-9219
Page Range / eLocation ID:
1 to 1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In parallel with big data processing and analysis dominating the usage of distributed and cloud infrastructures, the demand for distributed metadata access and transfer has increased. In many application domains, the volume of data generated exceeds petabytes, while the corresponding metadata amounts to terabytes or even more. In this paper, we propose a novel solution for efficient and scalable metadata access for distributed applications across wide-area networks, dubbed SMURF. Our solution combines novel pipelining and concurrent transfer mechanisms with reliability, provides distributed continuum caching and prefetching strategies to sidestep fetching latency, and achieves scalable and high-performance metadata fetch/prefetch services in the cloud. We also study the phenomenon of semantic locality in real trace logs which is not well utilized in metadata access prediction. We implement our predictor based on this observation and compare it with three existing state-of-the-art prefetch schemes on Yahoo! Hadoop audit traces. By effectively caching and prefetching metadata based on the access patterns, our continuum caching and prefetching mechanism greatly improves local cache hit rate and reduces the average fetching latency. We replayed approximately 20 Million metadata access operations from real audit traces, in which our system achieved 80% accuracy during prefetch prediction and reduced the average fetch latency 50% compared to the state-of-the-art mechanisms. 
    more » « less
  2. In parallel with big data processing and analysis dominating the usage of distributed and Cloud infrastructures, the demand for distributed metadata access and transfer has increased. The volume of data generated by many application domains exceeds petabytes, while the corresponding metadata amounts to terabytes or even more. This paper proposes a novel solution for efficient and scalable metadata access for distributed applications across wide-area networks, dubbed SMURF. Our solution combines novel pipelining and concurrent transfer mechanisms with reliability, provides distributed continuum caching and semantic locality-aware prefetching strategies to sidestep fetching latency, and achieves scalable and high-performance metadata fetch/prefetch services in the Cloud. We incorporate the phenomenon of semantic locality awareness for increased prefetch prediction rate using real-life application I/O traces from Yahoo! Hadoop audit logs and propose a novel prefetch predictor. By effectively caching and prefetching metadata based on the access patterns, our continuum caching and prefetching mechanism significantly improves the local cache hit rate and reduces the average fetching latency. We replay approximately 20 Million metadata access operations from real audit traces, where SMURF achieves 90% accuracy during prefetch prediction and reduced the average fetch latency by 50% compared to the state-of-the-art mechanisms. 
    more » « less
  3. Machine learning algorithms have shown potential to improve prefetching performance by accurately predicting future memory accesses. Existing approaches are based on the modeling of text prediction, considering prefetching as a classification problem for sequence prediction. However, the vast and sparse memory address space leads to large vocabulary, which makes this modeling impractical. The number and order of outputs for multiple cache line prefetching are also fundamentally different from text prediction. We propose TransFetch, a novel way to model prefetching. To reduce vocabulary size, we use fine-grained address segmentation as input. To predict unordered sets of future addresses, we use delta bitmaps for multiple outputs. We apply an attention-based network to learn the mapping between input and output. Prediction experiments demonstrate that address segmentation achieves 26% - 36% higher F1-score than delta inputs and 15% - 24% higher F1-score than page & offset inputs for SPEC 2006, SPEC 2017, and GAP benchmarks. Simulation results show that TransFetch achieves 38.75% IPC improvement compared with no prefetching, outperforming the best-performing rule-based prefetcher BOP by 10.44% and ML-based prefetcher Voyager by 6.64%. 
    more » « less
  4. Low interaction response times are crucial to the experience that mobile apps provide for their users. Unfortunately, existing strategies to alleviate the network latencies that hinder app responsiveness fall short in practice. In particular, caching is plagued by challenges in setting expiration times that match when a resource's content changes, while prefetching hinges on accurate predictions of user behavior that have proven elusive. We present Marauder, a system that synergizes caching and prefetching to improve the speedups achieved by each technique while avoiding their inherent limitations. Key to Marauder is our observation that, like web pages, apps handle interactions by downloading and parsing structured text resources that entirely list (i.e., without needing to consult app binaries) the set of other resources to load. Building on this, Marauder introduces two low-risk optimizations directly from the app's cache. First, guided by cached text files, Marauder prefetches referenced resources during an already-triggered interaction. Second, to improve the efficacy of cached content, Marauder judiciously prefetches about-to-expire resources, extending cache lives for unchanged resources, and downloading updates for lightweight (but crucial) text files. Across a wide range of apps, live networks, interaction traces, and phones, Marauder reduces median and 90th percentile interaction response times by 27.4% and 43.5%, while increasing data usage by only 18%. 
    more » « less
  5. The Von Neumann bottleneck is a persistent problem in computer architecture, causing stalls and wasted CPU cycles. The Van Neumann bottleneck is particularly relevant for memory-intensive workloads whose working set does not fit into the microprocessor’s cache and hence memory accesses suffer the high access latency of DRAM. One technique to address this bottleneck is to prefetch data from memory into on-chip caches. While prefetching has proven successful, for simple access patterns such as strides, existing prefetchers are incapable of providing benefit for applications with complex, irregular access patterns. A neural network-based prefetcher shows promise for these challenging workloads. We provide a better understanding of what type of memory access patterns an LSTM neural network can learn by training individual models on microbenchmarks with well-characterized memory access patterns. We explore a range of model parameters and provide a better understanding of what model is ideal to use. We achieve over 95% accuracy on the microbenchmarks and find a strong relationship between lookback (history window) size and the ability of the model to learn the pattern. We find also an upper limit on the number of concurrent distinct memory access streams that can be learned by a model of a given size. 
    more » « less