Transactional memory has been receiving much attention from both academia and industry. In transactional memory, program code is split into transactions, blocks of code that appear to execute atomically. Transactions are executed speculatively, and the speculative execution is supported through data versioning and conflict detection and resolution mechanisms. Lazy versioning makes aborts fast but penalizes commits, whereas eager versioning makes commits fast but penalizes aborts. In this paper, we present an adaptive versioning approach that dynamically switches between eager and lazy versioning at runtime, based on appropriate system parameters, so that the performance of a transactional memory system is always better than that obtained using either eager or lazy versioning alone. We implemented our adaptive versioning approach in the latest TinySTM distribution and extensively evaluated it through 5 micro-benchmarks and 8 complex benchmarks from the STAMP and STAMPEDE suites. The results show significant benefits of our approach, with performance improvements of up to 6.3x in execution time and up to 170x in the number of aborts.
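As a concrete illustration of the switching idea, the sketch below shows one way a TM runtime could flip between eager and lazy versioning based on the abort rate observed over a fixed window of transactions. The window size, threshold, and the `AdaptivePolicy`/`record` hooks are assumptions for illustration, not the policy or parameters used in the paper.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical policy knobs; the paper's actual system parameters are not
// specified in the abstract above.
constexpr uint64_t kWindow         = 10000;  // transactions per decision window
constexpr double   kAbortThreshold = 0.25;   // switch point on the abort ratio

enum class Versioning { Eager, Lazy };

class AdaptivePolicy {
 public:
  // Called once per transaction outcome by the TM runtime (assumed hook).
  void record(bool aborted) {
    if (aborted) aborts_.fetch_add(1, std::memory_order_relaxed);
    if (total_.fetch_add(1, std::memory_order_relaxed) + 1 == kWindow) decide();
  }

  Versioning current() const { return mode_.load(std::memory_order_relaxed); }

 private:
  void decide() {
    double rate = static_cast<double>(aborts_.load()) / kWindow;
    // Eager versioning makes commits cheap but aborts expensive, so prefer
    // lazy versioning when aborts dominate and eager versioning otherwise.
    mode_.store(rate > kAbortThreshold ? Versioning::Lazy : Versioning::Eager,
                std::memory_order_relaxed);
    // Counters are reset without synchronization; good enough for a sketch.
    aborts_.store(0);
    total_.store(0);
  }

  std::atomic<uint64_t>   aborts_{0};
  std::atomic<uint64_t>   total_{0};
  std::atomic<Versioning> mode_{Versioning::Eager};
};
```

A runtime would consult `current()` when starting a transaction and report each commit or abort through `record()`, mirroring the trade-off stated above: lazy versioning when aborts dominate, eager versioning when commits dominate.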
Jenga: Efficient Fault Tolerance for Stacked DRAM
In this paper, we introduce Jenga, a new scheme for protecting 3D DRAM, specifically high bandwidth memory (HBM), from failures in bits, rows, banks, channels, dies, and TSVs. By providing redundancy at the granularity of a cache block—rather than across blocks, as in the current state of the art—Jenga achieves greater error-free performance and lower error recovery latency. We show that Jenga's runtime is on average only 1.03× the runtime of our Baseline across a range of benchmarks. Additionally, for memory-intensive benchmarks, Jenga is on average 1.11× faster than prior work.
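To make "redundancy at the granularity of a cache block" concrete, the following sketch stores one parity symbol alongside the data symbols of each 64-byte block, so a single lost symbol (for example, a failed channel or die) can be rebuilt from that block alone. The symbol count, symbol size, and simple XOR parity are illustrative assumptions, not Jenga's actual code layout.

```cpp
#include <array>
#include <cstdint>

// Illustrative parameters only: a 64-byte cache block striped as 8-byte
// symbols across 8 channels, plus one parity symbol kept with the block.
constexpr int kSymbols    = 8;   // data symbols per cache block
constexpr int kSymbolSize = 8;   // bytes per symbol

struct ProtectedBlock {
  std::array<std::array<uint8_t, kSymbolSize>, kSymbols> data;
  std::array<uint8_t, kSymbolSize> parity;  // XOR of the data symbols
};

// Recompute the per-block parity; with block-granularity redundancy the
// whole block (data plus parity) can be fetched and checked in one access.
void update_parity(ProtectedBlock& b) {
  b.parity.fill(0);
  for (const auto& sym : b.data)
    for (int i = 0; i < kSymbolSize; ++i) b.parity[i] ^= sym[i];
}

// Rebuild one lost symbol from the surviving symbols and the parity.
void reconstruct(ProtectedBlock& b, int lost) {
  std::array<uint8_t, kSymbolSize> rebuilt = b.parity;
  for (int s = 0; s < kSymbols; ++s) {
    if (s == lost) continue;
    for (int i = 0; i < kSymbolSize; ++i) rebuilt[i] ^= b.data[s][i];
  }
  b.data[lost] = rebuilt;
}
```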
- Award ID(s):
- 1717602
- PAR ID:
- 10063748
- Date Published:
- Journal Name:
- IEEE International Conference on Computer Design (ICCD 2017)
- Page Range / eLocation ID:
- 361 to 368
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
The memory system is vulnerable to a number of security breaches, e.g., an attacker can interfere with program execution by disrupting values stored in memory. Modern Intel® Software Guard Extensions (SGX) systems already support integrity trees to detect such malicious behavior. However, in spite of recent innovations, the bandwidth overhead of integrity+replay protection is non-trivial; state-of-the-art solutions like Synergy introduce average slowdowns of 2.3× for memory-intensive benchmarks. Prior work also implements a tree that is shared by multiple applications, thus introducing a potential side channel. In this work, we build on the Synergy and SGX baselines and introduce three new techniques. First, we isolate each application by implementing a separate integrity tree and metadata cache for each application; this improves metadata cache efficiency and boosts performance by 39%, while eliminating the potential side channel. Second, we reduce the footprint of the metadata. Synergy uses a combination of integrity and error correction metadata to provide low-overhead support for both. We share error correction metadata across multiple blocks, thus lowering its footprint (by 16×) while forgoing error correction only in rare corner cases. However, we discover that shared error correction metadata, even with caching, does not improve performance. Third, we observe that, thanks to its lower footprint, the error correction metadata can be embedded into the integrity tree. This reduces the metadata blocks that must be accessed to support both integrity verification and chipkill reliability. The proposed Isolated Tree with Embedded Shared Parity (ITESP) yields an overall performance improvement of 64%, relative to baseline Synergy.
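The sketch below illustrates the two structural ideas in the abstract (a per-application integrity tree and metadata cache, and error-correction parity embedded in tree nodes) using invented field names, widths, and a made-up sharing factor; it is a data-layout illustration under those assumptions, not the actual ITESP encoding.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

// Illustrative sharing factor: one parity word covers 16 data blocks.
constexpr int kBlocksPerParity = 16;

// A single integrity-tree node that also embeds the shared error-correction
// parity, so one metadata fetch serves both verification and chipkill repair.
struct TreeNode {
  std::array<uint64_t, 8> counters;  // version counters for child nodes/blocks
  uint64_t mac;                      // MAC over the counters, verified on fetch
  uint64_t shared_parity;            // parity shared by kBlocksPerParity blocks
};

// Each application gets its own tree root and its own metadata cache,
// eliminating cross-application sharing and the associated side channel.
struct PerAppTree {
  uint64_t root_mac;
  std::unordered_map<uint64_t, TreeNode> metadata_cache;  // keyed by node address
};

std::unordered_map<int /*application id*/, PerAppTree> integrity_trees;
```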
To process real-world datasets, modern data-parallel systems often require extremely large amounts of memory, which are both costly and energy inefficient. Emerging non-volatile memory (NVM) technologies offer high capacity compared to DRAM and low energy compared to SSDs. Hence, NVMs have the potential to fundamentally change the dichotomy between DRAM and durable storage in Big Data processing. However, most Big Data applications are written in managed languages and executed on top of a managed runtime that already performs various dimensions of memory management. Supporting hybrid physical memories adds a new dimension, creating unique challenges in data replacement. This article proposes Panthera, a semantics-aware, fully automated memory management technique for Big Data processing over hybrid memories. Panthera analyzes user programs on a Big Data system to infer their coarse-grained access patterns, which are then passed to the Panthera runtime for efficient data placement and migration. For Big Data applications, the coarse-grained data division information is accurate enough to guide the GC for data layout, and it incurs little overhead for data monitoring and movement. We implemented Panthera in OpenJDK and Apache Spark. Based on Big Data applications’ memory access patterns, we also implemented a new profiling-guided optimization strategy, which is transparent to applications. With this optimization, our extensive evaluation demonstrates that Panthera reduces energy by 32–53% at less than 1% time overhead on average. To show Panthera’s applicability, we extend it to QuickCached, a pure Java implementation of Memcached. Our evaluation results show that Panthera reduces energy by 28.7% at 5.2% time overhead on average.
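The following sketch shows the general placement idea in miniature: a coarse-grained hint, standing in for the access pattern Panthera infers per data structure, decides up front whether an allocation goes to DRAM or NVM. The `AccessHint`, `alloc_dram`, and `alloc_nvm` names are hypothetical, and the real system performs this placement inside the JVM's garbage collector rather than through a C++ allocator.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Coarse-grained placement hints, standing in for the access patterns that
// Panthera's analysis infers for whole data structures (e.g., Spark RDDs).
enum class AccessHint { HotMutable, ColdCached };

// Placeholder arenas: in the real system these would be DRAM and NVM heaps
// managed by the GC; here they are just labeled allocators.
void* alloc_dram(std::size_t n) { return std::malloc(n); }
void* alloc_nvm(std::size_t n)  { return std::malloc(n); }  // stand-in for an NVM pool

// Place an entire data structure based on its coarse-grained hint, rather
// than monitoring and migrating individual objects at runtime.
void* place(std::size_t bytes, AccessHint hint) {
  void* p = (hint == AccessHint::HotMutable) ? alloc_dram(bytes)
                                             : alloc_nvm(bytes);
  if (!p) throw std::bad_alloc();
  return p;
}
```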
Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal GPU. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability for which UNIX applications long ago switched to assuming the presence of virtual memory. Therefore, checkpointing of UVM has become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings use NVIDIA GPUs, with a current trend of ten additional NVIDIA-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple compute nodes. The support for UVM is particularly attractive for programs requiring more memory than resides on the GPU, since the alternative to UVM is for the application to directly copy memory between device and host. Furthermore, CRUM supports fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. The runtime overhead of using CRUM is 6% on average, and the time for forked checkpointing is seen to be a factor of up to 40 times less than traditional, synchronous checkpointing.
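A minimal sketch of the forked-checkpointing idea follows: the parent forks, the child writes out a copy-on-write snapshot of host memory while the parent resumes the computation, and that overlap is what shrinks checkpoint latency. CRUM additionally drains UVM/device state to the host and coordinates across MPI ranks, which is omitted here; the `forked_checkpoint` helper is an assumption for illustration.

```cpp
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <cstddef>

// Child process persists a copy-on-write snapshot while the parent continues.
void forked_checkpoint(const void* image, std::size_t len, const char* path) {
  pid_t pid = fork();
  if (pid == 0) {                       // child: write the snapshot to storage
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd >= 0) {
      const char* p = static_cast<const char*>(image);
      std::size_t off = 0;
      while (off < len) {
        ssize_t n = write(fd, p + off, len - off);
        if (n <= 0) break;
        off += static_cast<std::size_t>(n);
      }
      fsync(fd);
      close(fd);
    }
    _exit(0);
  }
  // Parent: resume the CUDA/MPI computation immediately; the checkpoint
  // write overlaps with it, which is where the speedup over synchronous
  // checkpointing comes from. (A real runtime would later reap the child.)
}
```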
Lawall, Julia; Williams, Dan (Eds.)
Persistent memory (PMEM) allows direct access to fast storage at byte granularity. Previously, processor caches backed by persistent memory were not persistent, complicating the design of persistent applications and reducing their performance. A new generation of systems with flush-on-fail semantics effectively offers persistent caches, opening the door to much simpler, faster PMEM programming models. This work proposes Whole Process Persistence (WPP), a new programming model for systems with persistent caches. In the WPP model, all process state is made persistent. On restart after power failure, this state is reloaded and execution resumes in an application-defined interrupt handler. We also describe the Zhuque runtime, which transparently provides WPP by interposing on the C bindings for system calls in userspace. It requires little or no programmer effort to run applications on Zhuque. Our measurements show that Zhuque outperforms state-of-the-art PMEM libraries, demonstrating mean speedups across all benchmarks of 5.24x over PMDK, 3.01x over Mnemosyne, 5.43x over Atlas, and 4.11x over Clobber-NVM. More importantly, unlike existing systems, Zhuque places no restrictions on how applications implement concurrency, allowing us to run a newer version of Memcached on Zhuque and gain more than 7.5x throughput over the fastest existing persistent implementations.
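To illustrate the interposition mechanism the abstract describes, the sketch below wraps a libc binding (`mmap`) from an `LD_PRELOAD`-ed shared library. The redirection policy mentioned in the comments is hypothetical, and the wrapper simply forwards the call; it does not reproduce Zhuque's actual persistence logic.

```cpp
// Build as a shared library and preload it, e.g.:
//   g++ -shared -fPIC -o libwpp.so wpp_interpose.cpp -ldl
//   LD_PRELOAD=./libwpp.so ./app
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <cstddef>

extern "C" void* mmap(void* addr, size_t len, int prot, int flags,
                      int fd, off_t off) {
  using mmap_fn = void* (*)(void*, size_t, int, int, int, off_t);
  static mmap_fn real_mmap =
      reinterpret_cast<mmap_fn>(dlsym(RTLD_NEXT, "mmap"));

  // Hypothetical redirection point: a WPP-style runtime could serve anonymous
  // private mappings (heap growth, thread stacks) from a persistent-memory
  // pool so the whole process image survives power failure. This sketch only
  // forwards the call to the real libc implementation.
  return real_mmap(addr, len, prot, flags, fd, off);
}
```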