Apache Spark arguably is the most prominent Big Data processing framework tackling the scalability challenge of a wide variety of modern workloads. A key to its success is caching critical data in memory, thereby eliminating wasteful computations of regenerating intermediate results. While critical to performance, caching is not automated. Instead, developers have to manually handle such a data management task via APIs, a process that is error-prone and labor-intensive, yet may still yield sub-optimal performance due to execution complexities. Existing optimizations rely on expensive profiling steps and/or application-specific cost models to enable a postmortem analysis and a manual modification to existing applications. This paper presents CACHEIT, built to take the guesswork off the users while running applications as-is. CACHEIT analyzes the program’s workflow, extracting important features such as dependencies and access patterns, using them as an oracle to detect high-value data candidates and guide the caching decisions at run time. CACHEIT liberates users from low-level memory management requirements, allowing them to focus on the business logic instead. CACHEIT is application-agnostic and requires no profiling or a cost model. A thorough evaluation with a broad range of Spark applications on real-world datasets shows that CACHEIT is effective in maintaining satisfactory performance, incurring only marginal slowdown compared to the manually well-tuned counterparts
more »
« less
Intermediate Data Caching Optimization for Multi-Stage andParallel Big Data Frameworks
In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such data processing at scale. Specifically, Spark leverages distributed memory to cache the intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks for implementations of iterative machine learning and data mining algorithms, by avoiding repeated computation or hard disk accesses to retrieve RDDs. By default, caching decisions are left at the programmer’s discretion, and the LRU policy is used for evicting RDDs when the cache is full. However, when the objective is to minimize total work, LRU is woefully inadequate, leading to arbitrarily suboptimal caching decisions. In this paper, we design an algorithm for multi-stage big data processing platforms to adaptively determine and cache the most valuable intermediate datasets that can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying nodes in a direct acyclic graph (DAG) representing computations whose outputs should persist in the memory. Our experiment results show that our proposed cache optimization solution can improve the performance of machine learning applications on Spark decreasing the total work to recompute RDDs by 12%.
more »
« less
- Award ID(s):
- 1718355
- PAR ID:
- 10067678
- Date Published:
- Journal Name:
- IEEE International Conference on Cloud Computing
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
To efficiently scale data caching infrastructure to support emerging big data applications, many caching systems rely on consistent hashing to group a large number of servers to form a cooperative cluster. These servers are organized together according to a random hash function. They jointly provide a unified but distributed hash table to serve swift and voluminous data item requests. Different from the single least-recently-used (LRU) server that has already been extensively studied, theoretically characterizing a cluster that consists of multiple LRU servers remains yet to be explored. These servers are not simply added together; the random hashing complicates the behavior. To this end, we derive the asymptotic miss ratio of data item requests on a LRU cluster with consistent hashing. We show that these individual cache spaces on different servers can be effectively viewed as if they could be pooled together to form a single virtual LRU cache space parametrized by an appropriate cache size. This equivalence can be established rigorously under the condition that the cache sizes of the individual servers are large. For typical data caching systems this condition is common. Our theoretical framework provides a convenient abstraction that can directly apply the results from the simpler single LRU cache to the more complex LRU cluster with consistent hashing.more » « less
-
Apache Spark is a popular cluster computing framework for iterative analytics workloads due to its use of Resilient Distributed Datasets (RDDs) to cache data for in-memory processing. We have revealed that the performance of Spark RDD cache can be severely limited if its capacity falls short to the needs of the workloads. In this paper, we have explored different memory hybridization strategies to leverage emergent Non-Volatile Memory (NVM) devices for Spark's RDD cache. We have found that a simple layered hybridization approach does not offer an effective solution. Therefore, we have designed a flat hybridization scheme to leverage NVM for caching RDD blocks, along with several architectural optimizations such as dynamic memory allocation for block unrolling, asynchronous migration with preemption, and opportunistic eviction to disk. We have performed an extensive set of experiments to evaluate the performance of our proposed flat hybridization strategy and found it to be robust in handling different system and NVM characteristics. Our proposed approach uses DRAM for a fraction of the hybrid memory system and yet manages to keep the increase in execution time to be within 10% on average. Moreover, our opportunistic eviction of blocks to disk improves performance by up to 7.5% when utilized alongside the current mechanism.more » « less
-
State-intensive network and distributed applications rely heavily on online caching heuristics for high performance. However, there remains a fundamental performance gap between online caching heuristics and the optimal offline caching algorithm due to the lack of visibility into future state access requests in an online setting. Driven by the observation that state access requests in network and distributed applications are often carried in incoming network packets, we present Seer, an online caching solution for networked systems, that exploits the delays experienced by a packet inside a network - most prominently, transmission and queuing delays - to notify in advance of future packet arrivals to the target network nodes (switches/routers/middleboxes/end-hosts) implementing caching. Using this as a building block, Seer presents the design of an online cache manager that leverages visibility into (partial) set of future state access requests to make smarter prefetching and cache eviction decisions. Our evaluations show that Seer achieves up to 65% lower cache miss ratio and up to 78% lower flow completion time compared to LRU for key network applications over realistic workloads.more » « less
-
In parallel with big data processing and analysis dominating the usage of distributed and Cloud infrastructures, the demand for distributed metadata access and transfer has increased. The volume of data generated by many application domains exceeds petabytes, while the corresponding metadata amounts to terabytes or even more. This paper proposes a novel solution for efficient and scalable metadata access for distributed applications across wide-area networks, dubbed SMURF. Our solution combines novel pipelining and concurrent transfer mechanisms with reliability, provides distributed continuum caching and semantic locality-aware prefetching strategies to sidestep fetching latency, and achieves scalable and high-performance metadata fetch/prefetch services in the Cloud. We incorporate the phenomenon of semantic locality awareness for increased prefetch prediction rate using real-life application I/O traces from Yahoo! Hadoop audit logs and propose a novel prefetch predictor. By effectively caching and prefetching metadata based on the access patterns, our continuum caching and prefetching mechanism significantly improves the local cache hit rate and reduces the average fetching latency. We replay approximately 20 Million metadata access operations from real audit traces, where SMURF achieves 90% accuracy during prefetch prediction and reduced the average fetch latency by 50% compared to the state-of-the-art mechanisms.more » « less
An official website of the United States government

