<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>ASAP: A Speculative Approach to Persistence</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>04/01/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10346388</idno>
					<idno type="doi">10.1109/HPCA53966.2022.00070</idno>
					<title level='j'>2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Sujay Yadalam</author><author>Nisarg Shah</author><author>Xiangyao Yu</author><author>Michael Swift</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Persistent memory enables a new class of applications that have persistent in-memory data structures. Recoverability of these applications imposes constraints on the ordering of writes to persistent memory. However, the cache hierarchy and memory controllers in modern systems may reorder writes to persistent memory. Therefore, programmers have to use expensive flush and fence instructions that stall the processor to enforce such ordering. While prior efforts circumvent stalling on long latency flush instructions, these designs under-perform in large-scale systems with many cores and multiple memory controllers. We propose ASAP, an architectural model in which the hardware takes an optimistic approach by persisting data eagerly, thereby avoiding any ordering stalls and utilizing the total system bandwidth efficiently. ASAP avoids stalling by allowing writes to be persisted out-of-order, speculating that all writes will eventually be persisted. For correctness, ASAP saves recovery information in the memory controllers, which is used to undo the effects of speculative writes to memory in the event of a crash. Over a large number of representative workloads, ASAP improves performance over current Intel systems by 2.3× on average and performs within 3.9% of an ideal system.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Emerging non-volatile memory (NVM) technologies such as Intel's Optane Persistent Memory (PM) <ref type="bibr">[1]</ref> pair the high performance and byte-addressability of DRAM with the durability of disks and Flash memory. These technologies enable recoverable applications, which store critical data structures in persistent memory to survive system crashes and power failures <ref type="bibr">[2]</ref>- <ref type="bibr">[4]</ref>. However, current persistent memory systems do not save the contents of the processor cache on a failure: only data in the persistence domain survives. On systems supporting Optane, this domain includes both data in NVM and data that has reached a memory controller <ref type="bibr">[5]</ref>. As a result, developers must ensure that data has been flushed from the processor or caches to the memory controller using either non-temporal stores that write directly to memory or cache write-back instructions. These instructions incur long stalls in today's NVM systems <ref type="bibr">[6]</ref>.</p><p>Ensuring that applications can correctly recover from a failure is further complicated by the need to maintain data consistency. To achieve high performance, current proces-sors reorder data written out to memory. Yet, ensuring consistency requires ordering certain updates, such as a pointer update only after the data it references. Current hardware provides a crude mechanism to enforcing order through expensive fences that stall processor execution until all prior flushes have reached the persistence domain.</p><p>Multi-threaded applications require persist ordering across threads, which is hard to achieve efficiently. While early studies of persistent memory usage found cross-thread dependencies rare <ref type="bibr">[6]</ref>, they are frequent in a new class of concurrent applications: recently developed high-performance, scalable concurrent data structures <ref type="bibr">[4]</ref>, <ref type="bibr">[7]</ref>- <ref type="bibr">[12]</ref> specifically for persistence. For these structures, efficient support of concurrent accesses and subsequent cross-thread dependencies is critical for performance.</p><p>Several prior proposals have looked to reduce the high cost of ordering by either entrusting the cache hierarchy to enforce ordering lazily <ref type="bibr">[13]</ref>, <ref type="bibr">[14]</ref>, or using buffering <ref type="bibr">[6]</ref>, <ref type="bibr">[15]</ref>- <ref type="bibr">[19]</ref> or through speculation <ref type="bibr">[20]</ref>. Some of these designs do not support ordering across multiple memorycontrollers <ref type="bibr">[15]</ref>, <ref type="bibr">[20]</ref>, and others <ref type="bibr">[6]</ref>, <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[17]</ref> suffer long flushing stalls while persisting writes ordered across memory controllers. Also, these designs employ conservative techniques that stall flushing to enforce ordering in the presence of cross-thread dependencies. Other designs <ref type="bibr">[16]</ref> either increase the hardware complexity or require batteries for flushing data on a failure <ref type="bibr">[19]</ref>.</p><p>This paper introduces ASAP (A Speculative Approach to Persistence), a novel persistence architecture that provides the ability to order writes to NVM with almost no stalls even in the presence of cross-thread dependencies and on systems with multiple memory controllers. 
The key insight of our work is that in the absence of failure, it is acceptable to write data out to NVM in any order. Only when a failure occurs must the system ensure that ordering was enforced. ASAP therefore flushes writes optimistically and stores enough information to correct the contents of memory when a failure occurs. For example, when a write arrives early, ASAP allows a speculative write to update NVM, but also saves the old value at the address. If a failure occurs, ASAP can restore the old value. In effect, ASAP maintains an undo log for anything updated out of order.</p><p>While writing an undo record to NVM would be expensive and lead to high write amplification, ASAP avoids additional writes by leveraging Intel's Asynchronous DRAM Refresh (ADR) technology to save recovery information in the memory controller. ADR allows a small amount of data to be written back from the memory controller to NVM following a failure but before power to the processor is completely lost. On failure, ASAP uses the undo information at the memory controller to restore NVM to the correct state.</p><p>Storing recovery information at the memory controller differs from logging employed in hardware approaches that provide atomic durability <ref type="bibr">[21]</ref>-<ref type="bibr">[28]</ref>. First, these hardware-assisted logging mechanisms generate additional writes to NVM as they create a log entry on every write, while ASAP stores recovery information only when memory is updated speculatively. Second, these designs can incur long latency to enforce ordering between log and data writes, while ASAP can reorder writes. Third, they may require large buffers to hold out-of-place updates, while ASAP performs well with small buffers. ASAP does not aim to provide atomicity and instead enforces ordering of writes efficiently. Many PM applications <ref type="bibr">[4]</ref>, <ref type="bibr">[7]</ref>-<ref type="bibr">[10]</ref> do not use transactions but still require ordering primitives from the hardware. These applications would therefore benefit from ASAP but not from the hardware transactional models. If applications do require atomicity, ASAP can be coupled with techniques such as shadow paging or software transactions. This flexibility allows ASAP to reorder writes to NVM, speculatively update memory and use small buffers.</p><p>ASAP extends the processor core with persist buffers that queue data waiting to be flushed to NVM. Queued writes are flushed speculatively and out-of-order. Ordering dependencies across threads are tracked by augmenting the coherence protocol and are resolved through inter-core communication. A small buffer called the recovery table at the memory controller stores the recovery information.</p><p>To study the performance benefits of ASAP, we compare it against HOPS <ref type="bibr">[6]</ref>, one of the prior designs that supports ordering across multiple memory controllers, and against BBB <ref type="bibr">[19]</ref>, which has close-to-ideal performance. We evaluate ASAP on a wide range of benchmarks including transactional applications using Intel's PMDK <ref type="bibr">[29]</ref>, applications written to natively use PM, data structures using the Atlas framework <ref type="bibr">[30]</ref>, and hand-written persistent data structures. On average, ASAP is 22.8% faster than HOPS and within 3.9% of BBB <ref type="bibr">[19]</ref>. 
In addition, we show that ASAP's performance scales with increasing thread counts and offers greater benefit as NVM write bandwidth increases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND A. Persistency models</head><p>To ensure safety and correct recovery of persistent applications, the program state stored in persistent memory must be in a consistent state. Persistency models <ref type="bibr">[31]</ref>, like memory consistency models, define what constitutes a consistent state and hence how a system is allowed to reorder operations. We refer to stores in the volatile memory order as writes and stores in the persistent order as persists. ISA-level persistency models define what persist orderings a processor exposes to programs. Strict persistency couples the order of persists to the order of writes in volatile memory order. It avoids the necessity for barriers to express ordering but frequent ordering operations and the inability to coalesce can hurt performance. Epoch persistency is more relaxed as it allows some persists to be combined and/or reordered. Every thread's execution is divided into sequences of persists called epochs, which are separated by fences (also called persist barriers). Persists within an epoch can be reordered or occur in parallel, allowing persists to coalesce. Epoch persistency enforces two ordering constraints: (1) any two memory accesses belonging to different epochs, i.e., separated by a persist barrier, are ordered. (2) Conflicting memory accesses (accesses to the same address) within and across threads assume the persist order from their volatile memory order. The latter constraint is also called strong persist atomicity. Strand persistency is the most relaxed model. It divides a thread's execution into strands. Writes in different strands can be flushed concurrently but inter-strand dependencies can still arise due to strong persist atomicity. Language-level persistency. Reasoning about persist ordering across threads can be challenging. Acquire-Release Persistency (ARP) is a language-level persistency model that uses explicit acquire and release synchronization primitives to specify ordering across threads <ref type="bibr">[32]</ref>. When an acquire synchronizes with a release, all writes preceding the release should become durable before writes following the acquire. Ordering within a thread is achieved by using barriers.</p><p>Along with the ordering that ARP enforces, Release Persistency (RP) imposes ordering between writes and the synchronization instructions themselves (acquire/release) <ref type="bibr">[18]</ref>, allowing persistent synchronization variables. Writes preceding a release are persisted before the release operation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Synchronous ordering</head><p>Current Intel systems support PM through instructions for durability and ordering. Programs flush data to NVM using clflushopt and clwb to force write-back of cache lines into memory <ref type="bibr">[33]</ref>. These instructions are weakly ordered, so a program that wants to enforce ordering must use a sfence between epochs to wait for the preceding flushes to complete, which stalls the CPU until the memory controller acknowledges the flushes. Frequent ordering hurts performance by stalling on every fence instruction <ref type="bibr">[34]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Asynchronous DRAM Refresh (ADR) and enhanced ADR</head><p>Intel platforms that support Optane Persistent Memory provide a feature called Asynchronous DRAM Refresh <ref type="bibr">[5]</ref>, which flushes data queued at the memory controller to    Recently, Intel announced enhanced ADR (eADR) that flushes processor caches on power failure <ref type="bibr">[5]</ref>. If eADR is enabled, applications do not have to issue flush instructions as all the data in the caches would be flushed to NVM in the event of a crash. The operating system must be able to handle a power fail interrupt and detect if the reserve power is insufficient to flush caches, indicating that eADR may not always guarantee data is flushed <ref type="bibr">[35]</ref>  <ref type="bibr">[36]</ref>. However, the details of eADR are not available yet, so the implications on performance are not clear. eADR is expected to be costly as it requires a large battery to back the entire cache hierarchy <ref type="bibr">[36]</ref>, <ref type="bibr">[37]</ref>. Adding a battery to every system supporting NVM is not generally practical or cost-effective. ASAP is able to achieve performance close to that of eADR (Section VII) without requiring additional battery.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. MOTIVATION</head><p>The motivation for ASAP is two-fold: (1) Avoid flushing stalls in a multi-core multi-memory controller system and (2) handle cross-dependencies efficiently. Multi memory-controller systems. Having multiple memory controllers in a single processor is common for serverclass processors, such as Intel Xeon processors (the only ones to currently support NVM). With multiple memory controllers, enforcing ordering requires ensuring that persists are ordered correctly across multiple controllers. So persists at one controller may have to wait for persists to arrive and complete at another controller. Compounding the problem, the lower bandwidth of Intel Optane memory encourages interleaving physical addresses across controllers to increase memory parallelism. Recent studies <ref type="bibr">[38]</ref>, <ref type="bibr">[39]</ref> have shown that interleaving can improve write bandwidth by up to 5.6&#215;. This makes it more likely that data structures will span memory controllers.</p><p>Currently available hardware and previously proposed designs either do not address this problem or are not efficient in doing so. Intel hardware tackles the problem through synchronous fence operations: the processor stalls at every epoch boundary to wait for preceding persists to complete, which increases latency and fails to utilize the available bandwidth to NVM. Many prior works such as DPO <ref type="bibr">[15]</ref> and PMEM-Spec <ref type="bibr">[20]</ref> do not support multiple memory controllers. Most other proposals such as LB++ <ref type="bibr">[14]</ref>, HOPS <ref type="bibr">[6]</ref>, StrandWeaver <ref type="bibr">[17]</ref>, and LRP <ref type="bibr">[18]</ref> implement inefficient conservative flushing strategies. These designs order writes across memory controllers by waiting for writes in the current epoch to be acknowledged before flushing writes from later epochs. Figure <ref type="figure">1a</ref> illustrates this case, where the second epoch writing to MC 2 must stall until the first epoch is acknowledged by MC 1. These designs incur flushing stalls when one memory controller is slower to respond than the other and leads to poor bandwidth utilization.</p><p>To address this issue, Vorpal <ref type="bibr">[16]</ref> proposed distributed algorithms using vector clocks. Implementing such complex protocols can involve significant overheads: 1) high tag cost: every store needs to be tagged with a vector timestamp (a vector containing as many timestamps as cores) which can be quite large in a server with many cores, 2) communication overheads: memory controllers need to broadcast their clocks frequently as the broadcast frequency determines the rate of forward progress. There is a need for an efficient implementation for ordering persists in a multi-memory controller system. Handling cross-thread dependencies. Efficiently ordering persists correctly across threads is important for performance. The WHISPER workload analysis <ref type="bibr">[6]</ref> showed that applications had few cross-dependencies, where one thread  <ref type="bibr">[7]</ref>). Figure <ref type="figure">2</ref> shows the total number of epochs and number of cross-dependencies in 1ms of simulated program execution with release persistency. Cross-dependencies are not common in workloads in the WHISPER benchmark and PMDK-based applications such as Vacation and Memcached. 
However, they are frequent in new concurrent data structures such as CCEH <ref type="bibr">[7]</ref>, Dash <ref type="bibr">[8]</ref> and RECIPE <ref type="bibr">[4]</ref>. Thus, there is a need for a system that handles these dependencies efficiently.</p><p>To handle these dependencies, designs that employ conservative flushing stall flushing in the dependent thread until the earlier thread's persists complete. Figure <ref type="figure">1b</ref> illustrates this problem. Writes in epoch E1,1 of thread T1 cannot be flushed until source thread T2 completes flushing its epoch E2,0. PMEM-Spec <ref type="bibr">[20]</ref> speculates that persists will occur in order, but has an expensive recovery mechanism if this speculation is wrong.</p><p>We measure the frequency of such flushing stalls for various workloads with HOPS (see Section VII for workload details) to demonstrate the extent of the problem. Figure <ref type="figure">3</ref> shows the percent of cycles the persist buffers are blocked without flushing writes for different workloads. Overall, the persist buffers are unable to flush for 26% of cycles on average. Many of these stalls come from cross-dependencies, which is why the stall cycles are higher in the new concurrent persistent data structures.</p><p>Although earlier works alleviate processor stalls by decoupling persistence from core execution with caches/buffers, they suffer from frequent flushing stalls. Such frequent stalls cause the finite buffers to fill up and exert back-pressure on the core pipeline, ultimately stalling the processor.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. ASAP OVERVIEW</head><p>The main design goals of ASAP are:  &#8226; Maximizing common case performance: Persist ordering has to be preserved to ensure consistent state during crashes. In the absence of failures, ordering is not a necessity. Crashes are rare compared to the frequency of ordering events during program execution. ASAP, therefore, tries to improve the performance during normal operation by speculatively updating memory. &#8226; Handling cross-thread dependencies efficiently: ASAP employs cross-thread dependency tracking and resolution strategies to overcome limitations of earlier works. &#8226; Scaling to large scale servers: ASAP avoids global shared state, global broadcast operations, or centralized structures with per-core state in order to scale to many cores and multiple memory controllers. ASAP is a buffered persistency system that uses a separate data path for persists. Writes are queued in a hardware buffer and are later flushed from the buffer to the memory controller. Buffering decouples the program execution from data persistence, allowing the core to continue executing instructions while the hardware handles making the writes durable. Buffering also enables coalescing of writes which reduces the number of persists issued to NVM.</p><p>ASAP flushes all persists automatically, as soon as possible. Flushes could, however, arrive out-of-order at the memory controller, which could lead the system into an inconsistent state during crashes. ASAP handles this by storing fixup information in the memory controller, which is flushed on failure using ADR. The recovery information is used to resolve the inconsistency during a crash. ASAP tracks cross-thread dependencies using coherence, and resolves dependencies using new inter-thread messages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Persistency model</head><p>ASAP builds on the ofence and dfence primitives from HOPS to enable applications to build high-level atomicity mechanisms. Writes within a thread are ordered using ofence while dfence guarantees that all earlier writes in that thread are persisted. ofence can be used to order log updates before data updates and dfence can be used at the end of transactions or after the completion of a data structure update to ensure durability before responding to the client.</p><p>The key design aspects such as eager flushing, speculative updates to memory, and cross-thread dependency handling are not tied to the persistency model. We discuss and evaluate the design of ASAP that supports 1) epoch persistency or 2) release persistency <ref type="bibr">[18]</ref>. The key difference between the two models is when cross-thread dependencies arise. With epoch persistency, a cross-thread dependency arises when a thread issues a conflicting access to any address which has been modified by another thread. With release persistency, a cross-thread dependency is detected only when a thread's acquire synchronizes with another thread's release, whether or not the address refers to persistent or volatile memory. Figure <ref type="figure">4</ref> shows the ordering semantics of the two persistency models. Persists within a thread are ordered using 2-sided persist barriers (ofence) while acquire and release act as 1-sided barriers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Nomenclature</head><p>We define some of the terms useful for understanding the design of ASAP. Writes refer to stores issued to addresses in NVM. We use the term flush to indicate a write was sent to the memory controller, where ADR will ensure it is stored to NVM on a failure. We use the term persist to indicate that a write has reached the persistence domain. All flushes and persists occur at cache-line granularity.</p><p>&#8226; An epoch is a region of code separated by persist barriers within a thread. Writes in an epoch can be flushed concurrently. Writes in a later epoch should not survive a failure unless all writes from its preceding epochs also survive. We use thread to refer to a CPU core that supports a single thread. We leave hyperthreading to future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Eager Flushing</head><p>ASAP employs eager flushing wherein it flushes writes in different epochs concurrently. ASAP speculates that writes from earlier epochs will eventually become durable and so it eagerly flushes writes from future epochs. This enables ASAP to overlap flushes from different epochs and use the total available system bandwidth more efficiently. ASAP differentiates between writes that are in the current flushing epoch and writes in future epochs. When ASAP flushes a persist in a future epoch, it marks the persist as early. Eager flushing reduces latency and improves bandwidth by removing stalls needed to order epochs across multiple memory controllers, and to order dependent epochs on different threads.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Speculative updates to memory</head><p>When the memory controller receives an early flush, it speculatively persists the write in memory. The system might crash after the update (mis-speculation) and leave memory in an inconsistent state. To handle this, the memory controller saves recovery information in the memory controller before speculatively persisting the write. Specifically, it creates an undo record for the speculatively updated address by reading the value at that address from memory before issuing a write. ASAP includes the recovery information in the ADR domain, which can be flushed to NVM on a power outage. If the system crashes, ASAP reverts the state of memory using the undo record in the memory controller.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Cross-thread dependencies</head><p>Similar to HOPS <ref type="bibr">[6]</ref>, ASAP extends the coherence protocol to track cross-thread dependencies. ASAP augments coherence messages with dependence information. When a thread receives a coherence request for a cache-line, it replies with the data and current epoch number and then starts a new epoch. The core requesting the cache-line starts a new epoch that is dependent on the remote thread's epoch. Note that with release persistency, dependence information is sent only if the coherence request is for a cache-line touched by an acquire/release. With epoch persistency, by creating new epochs on cross-thread dependencies, circular dependencies can be avoided. This approach is borrowed from the epoch deadlock avoidance mechanism proposed earlier <ref type="bibr">[14]</ref>. Since ordering is cheap in ASAP, this has little effect on performance. Furthermore, it is always safe to split an epoch into several smaller epochs, as they ensure the same ordering requirement. Note that, with release persistency, regular coherence requests do not establish a dependency, so ASAP requires race-free code.</p><p>ASAP applies eager flushing to cross dependencies. ASAP allows writes from a dependent thread to be flushed before or concurrently with writes from a source thread. This lets the dependent thread drain its buffer eagerly without having to wait for the source thread to complete its epoch. Like before, the dependent thread marks the writes as early when it sends the packets to the memory controllers.  <ref type="formula">2</ref>) it avoids costly snooping mechanisms used in DPO <ref type="bibr">[15]</ref>, and (3) it reduces the resolution latency and saves energy compared to polling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Write collision</head><p>Since ASAP flushes writes eagerly, writes to the same address can be flushed concurrently from multiple threads. A problem arises when multiple early flushes to the same address arrive at the memory controller out-of-order. For instance, a write with an older value could arrive at the memory controller after writes with newer values. Updating the memory speculatively and maintaining recovery information won't work as it could leave the memory with a stale value.</p><p>Consider the example in Figure <ref type="figure">5</ref>. Suppose initially A = 0. All three threads write to A, but early flushes A = 3 from thread 3 arrives at the memory controller, followed by A = 2. When the MC speculatively updates memory with value A = 3, it stores an undo record of A = 0. When it speculatively persists A = 2, the value in memory is 3, so the MC stores the undo record A = 3. Memory now has the incorrect value due to the reordering of the two writes. Furthermore, If the system fails now, the correct value A = 0 has been lost, so ASAP cannot recover to a consistent state.</p><p>ASAP handles such write collision with additional record keeping. An MC creates delay record when an early write arrives and an undo record for that address already exists. A delay record contains the value of a write from an epoch that has not yet committed; once it is committed, the value from the delay record becomes the safe value and is copied into the undo record. In the above example, the MC speculatively updates memory with value A = 3 but creates a delay record for A = 2. The MC applies the delay record to update the undo value when the epoch containing A = 2 commits. Note that more than one delay record may be created if multiple early writes arrive at the memory controller.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. ASAP IMPLEMENTATION</head><p>We implement ASAP for the x86-64 architecture to allow a comparison against Intel's currently available systems. The x86-64 ISA enforces TSO (Total Store Order) consistency model and lacks support for explicit acquire and release operations in the ISA. For release persistency, ASAP requires the programs to provide acquire/release information either through extensions to existing instructions or through annotations. We use acquire/release annotations in our programs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Hardware Structures</head><p>We first describe the hardware structures used to implement ASAP; later, we describe the overall operations. Persist Buffers (PBs): These are per-core circular buffers alongside the private caches similar to HOPS <ref type="bibr">[6]</ref> and DPO <ref type="bibr">[15]</ref>. Writes to NVM are buffered and tracked in the PB. When a core issues a store to an address in NVM, the write is simultaneously updated in the caches and enqueued in the PB. PBs flush writes in the background without the core's intervention. Memory is updated by flushing data from the PBs. Cache-lines for NVM evicted from the LLC are dropped and are not written back to memory.</p><p>ASAP uses logical timestamps to label epochs. Each core has a timestamp register for the current active epoch. When an entry is added to the buffer, it is assigned the timestamp register's value. PBs also track the status of the flush. While flushing, PBs use the timestamp of the entry to look up in the Epoch Table (described below) to determine if a flush is safe or early. To notify the memory controller if a flush is early, PB sets a bit in the packet sent to the MC. PBs delete entries after they have been acknowledged by the MC. Epoch Tables (ETs): The per-core epoch tables are CAMs that hold metadata about in-flight epochs. An epoch table resides next to the core's persist buffer. It tracks information about the status of in-flight epochs and cross-thread dependencies. For each ongoing epoch, the epoch table tracks the number of writes waiting to be flushed or awaiting acknowledgment. When a cross-thread dependency is detected via coherence message, the core records the source thread and epoch in the ET. The source thread adds the dependent thread to its epoch table as well.</p><p>Execution of an ofence increments the current epoch timestamp and adds a new entry to the ET. When the core issues a dfence instruction, the ET waits to respond until all the in-flight epochs in the table have been committed. This ensures that all the writes prior to the dfence have been persisted. The core stalls until the ET responds. Recovery Tables (RTs) are CAMs residing in the memory controller that hold undo and delay records. Early flushes could create either undo or delay records when they reach the MC. Undo and delay records store the address, data, threadID, and timestamp of the flush that created it.</p><p>Undo records are used to store the safe state for an address, i.e., value in memory prior to the speculative persist or the value written by the most recently safe flush. Undo records are created by reading the value from memory before speculatively updating it. Such a read-modify-write approach is acceptable because (1) NVM has read/write asymmetry, with read bandwidth much higher than write bandwidth, (2) only the number of reads increase (the number of writes  <ref type="formula">3</ref>) XPBuffer <ref type="bibr">[1]</ref> in Intel Optane Persistent memory caches most recently accessed lines. Write would mostly hit in this cache.</p><p>Delay records hold updates that belong to epochs that have not yet been committed. The value in the delay record is the value flushed by the PB. Delay records are processed when the epoch they belong to commits (see Section V-C).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Handling incoming flushes</head><p>Eager flushing can reorder flushes to the memory controller. The flushes cannot be written out to memory na&#239;vely. The memory controller re-orders some of the writes using delay records. Table <ref type="table">I</ref> shows the different ways in which the memory controller handles incoming flushes. We briefly explain the actions here:</p><p>1) Incoming safe flush without matching undo record: This is the normal case where the flush updates the memory normally. 2) Incoming safe flush with matching undo record: Since an undo record exists, the address in memory has been speculatively updated with a later value. The older value in the incoming safe flush is not copied to memory to avoid losing the newer value already in memory, and instead is stored in the undo record. 3) Incoming early flush without matching undo record:</p><p>The memory controller creates an undo record and speculatively updates memory. 4) Incoming early flush with matching undo record: Presence of an undo record implies memory is in a specula-tive state. The memory controller delays the incoming early flush by creating a delay record to process this flush later when its epoch commits.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Epoch commit and cross dependency resolution</head><p>ETs determines when an epoch is safe, complete, or can be committed. First, it recognizes an epoch E n as safe when (i) E n 's preceding epoch has committed, and (ii) E n has no cross dependency, or if it does, the ET has received a Cross Dependency Resolved (CDR) message from the source thread (described below). Second, ET recognizes an epoch as complete when the PB has received ACKs for all the writes in the epoch from the memory controller. ET commits an epoch when it is both safe and complete.</p><p>When an ET detects that an epoch can be committed, it sends commit messages to the memory controllers. In order to avoid unnecessary communication between the ET and the memory controllers, the ET records the memory controllers to which early flushes were issued.</p><p>A memory controller, on receiving a commit message, searches the RT and deletes any undo records belonging to that epoch, making space for new records. If any delay records exist for the epoch, they are processed as if the corresponding flushes just arrived at the MC (Section V-B): if no undo record exists, the delay value is persisted to NVM; otherwise the undo record is updated with the delay value.</p><p>After receiving an ACK for the commit message from the memory controllers, the ET considers the epoch to be committed. It then sends a CDR (Cross-thread Dependency Resolved) message to the dependent thread (if it exists). The dependent thread resolves the dependency (clears the metadata in its ET) when it receives the CDR message.</p><p>Unlike prior designs <ref type="bibr">[15]</ref>, <ref type="bibr">[16]</ref>, the recovery tables in the memory controller do not have to track epoch ordering.</p><p>Instead, ETs track ordering information and send commit messages to the memory controller in the correct order. This strategy allows the recovery tables to commit writes in the right order without having to compare epoch timestamps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Handling insufficient Recovery Table space</head><p>One of the challenges in ASAP is that RTs are finite. There may be no space in the RT to handle an incoming early flush. ASAP uses negative acknowledgments (NACK) to reject early flushes if there is no space in the RT.</p><p>Such a mechanism lets the memory controller exert backpressure on the PBs. If a flush is NACKed, PBs pause eager flushing and fall back to conservative flushing wherein only safe flushes are issued. These never allocate space in the RT. PBs waits until the epoch becomes safe and then retries the flush but this time as a safe flush instead of an early flush. Eager flushing resumes after the current epoch commits.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Crash handling</head><p>On failure, memory controllers are notified of a pending shutdown. MCs flush all writes in their WPQ (Write Pending Queue). Along with these writes, memory controllers write the values in the undo records to memory, thereby unwinding the effects of speculative updates. After this, no data from unsafe epochs remain in memory. delay records don't play any role here since they belong to epochs that were not completed before the crash and had not updated memory yet. Memory is restored to a consistent state during the crash and doesn't require additional recovery in hardware after restart.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Discussion</head><p>Handling private cache evictions. Cross-thread dependencies are established when a coherence request is forwarded to the thread that last wrote to it. It is possible that the cache-line is evicted from the private caches while preceding writes are still enqueued in the PB. Past designs <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref> solved this problem by managing the cache evictions in a small buffer. The cache-line eviction is delayed until all the preceding writes are persisted. ASAP uses a similar writeback buffer (WBB) <ref type="bibr">[17]</ref> to prevent cache evictions before preceding writes are persisted. WBB records the tail index of the persist buffer when the cache initiates the eviction. The cache-line is written back from the WBB when the persist buffer flushes the corresponding index entry. Handling early LLC cache-line evictions. In ASAP, cachelines for persistent memory are dropped when they are evicted from the LLC since writes take the persist path through the persist buffer. A cache-line might be queued in the PB while it is evicted from the LLC. Loads to this address cannot read from memory since the latest value resides in the PBs and not in memory. However, these events are very rare. Some previous designs <ref type="bibr">[28]</ref>, <ref type="bibr">[40]</ref> leverage the fact that writes in non-temporal paths almost always complete before the temporal counterparts. Nevertheless, this problem could arise in ASAP when flushes are NACKed. When a flush is NACKed, the data sits in the persist buffer until it is safe to be reissued. The cache-line pertaining to this flush might face an LLC eviction during this time. To</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Cross-thread dependency</head><p>Intra-thread ordering overcome this, ASAP uses a counting Bloom filter at the memory controller, similar to HOPS <ref type="bibr">[6]</ref>. The Bloom filter is populated with flush addresses that were NACKed. ASAP uses the Bloom filter to check address of the cache-line being evicted from the LLC. If it hits, the cache eviction is delayed since the corresponding write in enqueued in the PB. When the write is retried, the address is removed from the Bloom filter and the cache-line is evicted. DMA coherence. To ensure that DMA reads and writes are coherent with the CPU caches, software needs to issue dfence before initiating DMA operations. This ensures that all writes queued in the persist buffers are persisted. Context switches. To ensure correct ordering when the operating system migrates a software thread from one core to another, the OS must issue a dfence instruction to ensure the thread's data have been safely persisted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CORRECTNESS</head><p>Eager flushing avoids flushing stalls by allowing overlapping flushes from different epochs across many threads. Memory controllers in ASAP handle the incoming writes differently based on the state and type of flush as shown in Table <ref type="table">I</ref>. In this section, we use formal methods to prove that the eager flushing and the memory controller actions 1) do not cause deadlocks and 2) restore memory to a consistent state during a crash.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Forward progress</head><p>ASAP ensures forward progress all the time. It is easy to reason about forward progress when the buffers are not completely full. ASAP handles filled buffers carefully:</p><p>1) Persist buffers: When there is no free space in the persist buffer, the incoming write from the core is stalled. Entries in the persist buffer are freed when the writes are flushed successfully. This allows the stalled write to be issued, thereby ensuring forward progress. 2) Epoch tables: If the number of ongoing epochs for the thread exceed the number of entries in the epoch table, the upcoming sfence or dfence would stall. Space is created when an epoch completes. 3) Recovery tables: Section V-D describes how ASAP handles insufficient space in the recover table by using NACKs. Entries in the recovery table are deleted when memory controllers receive commit messages. Below we describe how ASAP guarantees that all the epochs will become safe eventually even if buffers are full. This differentiates ASAP from hardware transactional memory, in which, if the hardware log buffer is full, transactions are aborted. Lemma 0.1. Epoch dependencies can be represented as a Directed Acyclic Graph (DAG) where the epochs form the nodes and edges represent the dependencies.</p><p>Proof: There are 2 types of epoch dependencies: <ref type="bibr">(1)</ref> intra-thread dependencies that arise due to persist barriers (ofence and dfence), and (2) cross-thread dependencies. These dependencies can be represented as a graph as shown in Figure <ref type="figure">7</ref> with epochs as nodes and dependencies as directed edges.</p><p>The first epoch in the system is always safe as there are no earlier epochs that it can depend on. This epoch forms the starting node of the DAG because it has no incoming edges. There are no cyclic dependencies in ASAP because new epochs are created in both the threads when a cross-thread dependency is detected <ref type="bibr">[14]</ref>. Therefore, the dependence graph is acyclic, i.e., a Directed Acyclic Graph.</p><p>Theorem 1. At any instance, there always exists an epoch in the system which is safe, i.e., all writes in preceding epochs in that thread are complete and cross dependency, if any, has been resolved. Proof: Every directed acyclic graph G(V, E) has at least one ordering. A topological ordering of a DAG is an ordering of its vertices V into a sequence S, such that for every edge e &#8712; E, the start vertex of e occurs earlier in the sequence S than the ending vertex of e.</p><p>The topological order in the DAG of epoch dependencies represent the order in which epochs become safe. This means that when an epoch completes, the next epoch in the topological order would become safe because all epochs it was dependent on are complete. Therefore, there always exists an epoch that would be safe. Therefore, the system would never run into a deadlock because there would always be atleast one epoch which is safe and can complete. Such a safe epoch would issue safe flushes which would not be NACKed by the memory controller. Once the epoch commits, dependent epochs become safe to complete and this goes on.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Recovery correctness</head><p>For some incoming flushes, the memory controller speculatively updates memory and for some others, it delays the update by creating a delay record. When the memory controller receives an epoch commit message, it deletes all the undo records and acts on the delay records. The values in the delay records could be written out to memory or copied to an existing undo record. Here, we try to explain how all of these actions help keep the memory in a consistent state even during failures. Proof: Incoming flushes at the memory controller could be safe or early. When a safe write arrives, the flushed data is either persisted to memory or is copied to an existing undo record in the URT. The presence of an undo records indicates that memory has been updated speculatively with a later value. If the system crashes at this time, the undo record would be used to restore memory to a consistent state. This means that memory would now hold the value written by a committed epoch. Similarly, an early write could cause the memory controller to speculatively update memory or create a delay record. When an epoch is being committed, the delay record value is copied to memory (or to an undo record which leads to scenario described above). Therefore, all the writes in a committed epoch are available even after a crash. Lemma 1.2. If there are no records for an address in the RT, then the value at the address in memory belongs to an epoch that is safe or has committed.</p><p>Proof: Only early writes result in the creation of undo or delay records. When an epoch commits (an epoch commits when it is safe and complete), undo records are deleted and values in delay records are copied to memory. Therefore, if there are no records for an address in the RT, then all epochs that wrote to that address have either committed or are safe. Theorem 2. When the system recovers from a crash, memory is in a consistent state.</p><p>Proof: When a system crashes, the memory controller copies the values in the undo records to rewind the speculative memory state; all the delay records are discarded. On a system restart, recovery tables in the memory controllers are empty. From lemma 1.2, empty RT implies consistent state of memory. From Lemma 1.1, all the writes in committed epochs are available even after a crash. Combining the two results, we get that memory is in a consistent state and has all the values written by committed epochs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. EVALUATION</head><p>We implemented and evaluated ASAP using full-system simulation on gem5 <ref type="bibr">[41]</ref>. We simulated a modern multicore system with 2 memory controllers similar to Intel Xeon processors. The hardware configuration is summarized in Table <ref type="table">II</ref>. We modeled NVM characteristics based on a study on Intel Optane <ref type="bibr">[38]</ref>. Each core has a 32-entry persist buffer  Table <ref type="table">III</ref> shows the workloads used to evaluate the performance of ASAP. We chose workloads from a variety of sources with a variety of programming models, emphasizing multi-threaded workloads. We use 3 classes of applications:</p><p>1) Benchmarks from the WHISPER suite with a mix of native (Nstore and Echo) and PMDK code (Vacation and Memcached) <ref type="bibr">[6]</ref>. 2) Hand-written data structures using the ATLAS persistence framework <ref type="bibr">[30]</ref>. 3) New concurrent persistent data structures, both handwritten (CCEH, Fast_Fair) and converted from DRAM structures (RECIPE, Dash). We configure all applications to be update-intensive in order to stress PM write performance. For Nstore, Echo and Vacation we use default parameters used in WHISPER <ref type="bibr">[6]</ref>. For the rest of the workloads, key and value sizes vary from 16B to 128B. Data is interleaved across memory controllers.</p><p>We compare the following designs in our evaluation:</p><p>&#8226; Baseline: This model replicates current Intel machines that support synchronous ordering through clwb and sfence instructions. &#8226; HOPS: HOPS_EP <ref type="bibr">[6]</ref> implements epoch persistency and HOPS_RP implements a variation of HOPS that provides release persistency. We make changes to the polling implementation in HOPS. The original implementation unrealistically polled every cycle and assumed read took a single cycle. We updated HOPS to poll every 500 cycles with each access of the global TS register taking 50 cycles. &#8226; ASAP: ASAP_EP supports epoch persistency and ASAP_RP implements release persistency. These designs implement PBs, ETs and RTs to provide speculative persistence. &#8226; eADR/BBB: This model implements a system with eADR. We also implemented an optimistic version of BBB <ref type="bibr">[19]</ref>. BBB's performance is very close to that of a system with eADR. We therefore use a single graph to represent the performance of both eADR and BBB. For all models, we assume ADR, i.e. the Write Pending Queues in the controllers are part of the persistence domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Performance study</head><p>Figure <ref type="figure">8</ref> compares the performance of all the models in a 4-core 2-MC system. Speedups are normalized to the Intel baseline. ASAP outperforms the baseline and HOPS while performing close to eADR/BBB for almost all workloads. Comparison to baseline. As expected, the baseline model is the slowest as it stalls the CPU frequently waiting for cache flushes to complete. By decoupling ordering and durability, ASAP can overlap flushing with useful computation. ASAP_EP and ASAP_RP offers a speedup of 2.1&#215; and 2.29&#215; on average respectively. The speedup is significant in applications without frequent durability fences as it allows ASAP to make writes durable in the background without stalling the CPU. We see that ASAP performs well for applications with smaller critical sections. Vacation uses a coarse-grained lock while performing a query on the reservation system, and performs bookkeeping of volatile data before releasing the lock. By the time another thread acquires the lock, writes have been flushed out so early flushing is not beneficial. Comparison to HOPS. ASAP outperforms HOPS for both persistency models. ASAP_EP improves performance by 37% on average over HOPS_EP while ASAP_RP improves by 23% on average over HOPS_RP. The performance improvement is significant for concurrent data structures such as CCEH, Dash and RECIPE. These workloads have smaller epochs with writes to different memory controllers. They also exhibit high cross-thread dependencies as shown in Figure <ref type="figure">2</ref>, which leads to frequent flushing stalls in HOPS which ultimately results in core stalls because of insufficient buffering capacity. Instead of stalling, ASAP flushes writes early, making space in the persist buffers for newer writes. HOPS also stalls longer on dfence as it takes longer to drain the persist buffer completely. Having flushed writes early, ASAP has fewer durability stalls.</p><p>Comparison of persistency models. Both HOPS and ASAP perform better with release persistency than epoch persistency. This is mainly because the number of cross-thread dependencies is much higher with epoch persistency than release persistency. However, the difference between the performance is not significant with ASAP because ASAP is optimized to handle frequent cross-thread dependencies.</p><p>On the other hand, HOPS_EP's performance drops below baseline for concurrent data-structures such as queue, CCEH, DASH and P-ART. These applications have small epochs and frequent cross-thread dependencies. HOPS uses polling to resolve cross-thread dependencies, and if the polling period is longer than the time it takes to flush all writes in an epoch, HOPS stalls longer than baseline. For the rest of this section, we present results of models supporting release persistency as it performs better. We use HOPS to refer to HOPS_RP and ASAP to refer to ASAP_RP.</p><p>Comparison to eADR/BBB. ASAP_RP's performance is very close to eADR, on average within 3.9%. The primary cause of stalls in ASAP are dfence instructions used to ensure durability, and even then it rarely stalls because ASAP flushes writes early. Similar to BBB, cores could stall if persist buffers get filled up. This could happen if the rate at which the core issues writes is greater than the memory bandwidth. 
However, this does not happen often and, as shown in Figure <ref type="figure">11</ref>, the average PB occupancy is well below maximum capacity.</p><p>The downside of BBB is that on failure, it requires a larger battery to ensure that all the entries in the buffers are persisted along with any in-flight inter-core communication. ASAP can achieve comparable performance by using smaller, consolidated buffers in the memory controller instead.</p><p>Write endurance. PM has limited write endurance compared to DRAM. It is therefore important to reduce write traffic. Buffering enables coalescing, which reduces the number of writes issued to memory and improves the write endurance.</p><p>Figure <ref type="figure">9</ref> shows the number of write operations in HOPS and ASAP normalized to HOPS. For most applications, ASAP has fewer write operations compared to HOPS. For some, such as LH and Dash-EH, we observe that concurrent flushes from different threads can be coalesced in the WPQ (Write Pending Queue) in the memory controller. Therefore, along with improving performance, ASAP improves the write endurance of NVM by improving coalescing. However, ASAP incurs additional cost in creating undo records before updating memory speculatively. On average, the number of PM reads increases by 5.3% in ASAP over HOPS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Sensitivity studies</head><p>Number of cores. One of the key design goals for ASAP is to scale to larger servers. In that aspect, we evaluate the performance of ASAP and HOPS for release persistency by varying the number of cores. We varied the number of software threads in the applications accordingly. We fix the number of MCs to 2.</p><p>For a single thread, on average, ASAP improves performance by 18% over HOPS. With a single thread, there are no cross-thread dependencies. ASAP's improvement in throughput can be attributed to eager flushing, which allows concurrent flushes to both the MCs. ASAP is therefore able to utilize the system bandwidth better than HOPS. ASAP scales better than HOPS as the number of cores increases. Due to limited space, we show the scalability for workloads that scale best (P-ART) and worst (Skiplist) along with the average in Figure <ref type="figure">10</ref>. On average, ASAP achieves a speedup of 1.18&#215;, 1.79&#215;, 2.51&#215; and 2.85&#215; with 1, 2, 4 and 8 threads over a single thread of HOPS. HOPS is only able to achieve a speedup of 1.36&#215;, 1.94&#215; and 2.15&#215; by increasing the threads to 2, 4 and 8. As the number of cores increases, the probability of crossthread dependencies increases. HOPS falls off when the number of cores is increased because of its inefficiency in handling cross-thread dependencies. HOPS stalls frequently as it employs conservative flushing. ASAP on the other hand, avoids stalls as it employs eager flushing to flush writes from dependent threads early.</p><p>PB and RT occupancy Buffer sizes are crucial for two reasons: (1) it determines the cost (area, power) of implementing the design changes, and (2) it impacts the maximum performance achievable. ASAP adds 3 buffers in the form of PBs, ETs and RTs. ETs are very small since they neither store addresses nor data. We study the occupancy of PBs and RTs to understand their impact on performance.</p><p>Figure <ref type="figure">11</ref> plots the average and 99 th percentile occupancies of the PBs in HOPS and ASAP. Since ASAP flushes writes eagerly, writes are queued for less time in the PB, leading to lower occupancy. Both the average and the 99 th percentile occupancy are much lower in ASAP. Although we simulate performance with 32 entry PB, we expect to observe similar performance with smaller PBs.</p><p>Figure <ref type="figure">12</ref> shows the maximum occupancy of the recovery table for both 4 threads and 8 threads. The max occupancy does not increase significantly from 4 threads to 8 threads. We believe that a small RT would improve performance even as applications scale. ASAP handles full RTs by falling back to conservative flushing. Therefore, ASAP's performance does not drop below that of HOPS even if the RT gets filled up. Nstore is an exception, in that it sometimes filled the RT and triggered NACKs to the persist buffer. However, NACKs did not hurt performance as the persist buffers were still able to flush data conservatively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. System bandwidth utilization</head><p>One of the advantages of eager flushing and speculative memory updates is the overlap of work across memory controllers. With eager flushing, ASAP can utilize more of the system bandwidth efficiently. To understand how well ASAP utilizes the available write bandwidth, we ran a custom bandwidth micro-benchmark. The benchmark issues 256-byte writes alternating across 2 MCs and the writes are ordered using an ofence. The results of the experiment are plotted in Figure <ref type="figure">13</ref>. HOPS fails to utilize the system bandwidth efficiently while ASAP performs 2x better than HOPS on average owing to eager flushing which overlapped the writes to the 2 MCs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Hardware cost analysis</head><p>We ran CACTI <ref type="bibr">[42]</ref> simulations to study the area, access latency and energy overheads. At the time of writing this paper, CACTI supported 22nm node but nothing smaller. Table <ref type="table">V</ref> summarizes the hardware cost for the persist buffer, epoch table and recovery table. We also present the numbers for a typical L1 cache for comparison. The sizes of the buffers were as specified earlier and the size of each field in the buffers is shown in Figure <ref type="figure">6b</ref>.</p><p>Epoch Tables are small and do not add significant overhead. The benefits of having a persist buffer and recovery table outweigh the hardware cost they incur. Buffering improves performance significantly and enables coalescing which improves the write endurance. Recovery table enables speculative updates of memory which also speeds up the performance and reduces write operations. Draining energy cost: Unlike BBB <ref type="bibr">[19]</ref> and eADR <ref type="bibr">[36]</ref>, ASAP does not require a backup battery to flush data during crashes. ASAP extends the ADR domain in modern processors to store recovery information. With eADR, all the dirty cache blocks in the entire cache hierarchy needs to be flushed. Considering a server class CPU with 32 cores and cache sizes in Table <ref type="table">II</ref> and assuming 50% of the cache blocks as dirty, about 42MB of data has to be flushed from the caches to NVM on power failure. BBB <ref type="bibr">[19]</ref> reduces the amount of data to be flushed to about 64KB. ASAP requires less than 4KB of data to be flushed from the recovery tables LB++ <ref type="bibr">[14]</ref> DPO <ref type="bibr">[15]</ref> HOPS <ref type="bibr">[6]</ref> LRP <ref type="bibr">[18]</ref> Strand-Weaver <ref type="bibr">[17]</ref> PMEM-Spec <ref type="bibr">[20]</ref> Vorpal <ref type="bibr">[16]</ref> BBB <ref type="bibr">[19]</ref>   <ref type="bibr">[14]</ref>. LB++ augments the cache tag arrays to track ordering. Unlike ASAP, flushing begins only after an epoch is completed and all earlier epochs are complete. Since the design couples persist path to cache management, the system suffers from stalls on cache evictions and replacements. It also lacks durability guarantees required for ACID transactions and would suffer from long stalls even if it were to support it. We expect LB++'s performance to be lower than that of HOPS and ASAP. Comparison to DPO <ref type="bibr">[15]</ref>. Similar to HOPS and ASAP, DPO uses buffers alongside private caches to enqueue writes to NVM. DPO uses conservative flushing and stalls flushing on cross-thread dependencies. It uses broadcasts to resolve cross-thread dependencies which would be costly in a large system. Moreover, DPO does not support multiple memorycontrollers. DPO's performance could be comparable to that of HOPS and lesser than that of ASAP. Comparison to LRP <ref type="bibr">[18]</ref>. LRP enforces release persistency by extending the cache tag array to include epoch numbers. LRP stalls on certain cache coherence state transitions. For instance, a forward request for a released cache-line could block until previous writes in the cache persist. ASAP instead records the dependency information and persists writes speculatively without stalling. Hence, ASAP would perform better than LRP. Comparison to StrandWeaver <ref type="bibr">[17]</ref>. StrandWeaver is the only design that provides strand persistency. 
It performs better than HOPS as it allows epochs from different strands to be flushed concurrently, but it still uses conservative flushing to handle cross-strand dependencies. ASAP should outperform StrandWeaver because it allows writes from different epochs to be flushed concurrently, including those dependent on other threads. ASAP could be integrated with StrandWeaver to support strand persistency and achieve higher performance.</p><p>Comparison to PMEM-Spec <ref type="bibr">[20]</ref>. PMEM-Spec speculatively flushes any PM access without stalling or buffering. It speculates that all accesses obey the ordering constraints and flushes them as they arrive at the MC. Mis-speculations have high overhead as they are treated as failures and handled in software. For a single-MC system, PMEM-Spec performs similarly to ASAP as it never stalls. However, in a multi-MC system, out-of-order writes are common, leading to very high overhead from expensive recovery.</p><p>Comparison to Vorpal <ref type="bibr">[16]</ref>. Vorpal is one of the few works that address multi-memory-controller systems. It uses distributed algorithms based on vector clocks to order persists across multiple memory controllers. Unlike ASAP, Vorpal delays a write until the memory controller deems it safe. As stated earlier in Section III, Vorpal incurs large overheads because of vector timestamps and requires frequent communication among the memory controllers.</p></div>
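<p>For illustration, the sketch below shows how the recovery information kept at the memory controllers could be consumed after a crash: every speculative write belonging to an epoch that had not fully persisted is rolled back to the old value saved in the recovery table. The entry layout, the notion of a first incomplete epoch, and the word-granularity NVM array are assumptions made for this sketch and do not reflect ASAP's exact RT format.</p>

```c
/* Hedged sketch: post-crash undo of speculative persists using RT entries.
 * Layout and epoch-completion test are illustrative assumptions. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define NVM_WORDS 16
static uint64_t nvm[NVM_WORDS];   /* stand-in for the NVM address space */

typedef struct {
    uint64_t addr;       /* persisted word address          */
    uint64_t old_data;   /* pre-write contents, for undo    */
    uint32_t epoch;      /* epoch of the speculative write  */
} rt_entry_t;

/* Roll back every speculative write from a possibly-incomplete epoch. */
static void asap_recover(const rt_entry_t *rt, size_t n,
                         uint32_t first_incomplete_epoch) {
    /* Apply undos newest-first so that, if an address was written in
     * several rolled-back epochs, its oldest saved value is restored. */
    for (size_t i = n; i-- > 0; )
        if (rt[i].epoch >= first_incomplete_epoch)
            nvm[rt[i].addr] = rt[i].old_data;
}

int main(void) {
    /* Writes from epochs 0-2 persisted speculatively; epoch 1 was still
     * incomplete at the crash, so epochs 1 and 2 must be undone. */
    rt_entry_t rt[] = { {3, 0, 0}, {3, 7, 1}, {5, 0, 2} };
    nvm[3] = 9; nvm[5] = 4;               /* speculative values found on NVM */
    asap_recover(rt, 3, /*first_incomplete_epoch=*/1);
    printf("nvm[3]=%llu nvm[5]=%llu\n",   /* prints nvm[3]=7 nvm[5]=0 */
           (unsigned long long)nvm[3], (unsigned long long)nvm[5]);
    return 0;
}
```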
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. RELATED WORK</head><p>Related work on enforcing ordering was discussed previously, so we focus here on other closely related topics.</p><p>There has been growing interest in designing systems that provide software-transparent atomicity. These works <ref type="bibr">[21]</ref>- <ref type="bibr">[28]</ref>, <ref type="bibr">[43]</ref> provide hardware-assisted atomicity either through logging, out-of-place updates or hardware transactions. ThyNVM <ref type="bibr">[44]</ref> provides software transparent crash consistency using a periodic hardware-assisted checkpointing mechanism. ASAP, in contrast, focuses only on ordering, which is component of many atomicity protocols. Themis <ref type="bibr">[40]</ref> proposes lightweight extensions to the default x86 persistency model to provide ordering guarantees without explicitly requiring barriers but only supports programs that use undo-logging. LAD <ref type="bibr">[45]</ref> provides atomically durable transactions in a multi-MC system. It uses buffers in the memory controllers to accumulate all updates within a transaction and uses a distributed 2PC-like protocol to atomically commit the transaction. ASAP reduces complexity by providing ordering instead of atomicity guarantees.</p><p>Recent research has also focused on developing data structures optimized for NVM <ref type="bibr">[11]</ref>, <ref type="bibr">[46]</ref>, <ref type="bibr">[47]</ref>. FAST and FAIR <ref type="bibr">[9]</ref> is a crash-consistent B-+tree, while CCEH <ref type="bibr">[7]</ref> and Dash <ref type="bibr">[8]</ref> implement persistent hash tables. Recipe <ref type="bibr">[4]</ref>, Pronto <ref type="bibr">[48]</ref> and TIPS <ref type="bibr">[12]</ref> are approaches to convert concurrent DRAM indexes to crash-consistent data structures with minimal changes. These structures do not rely on transactions but do benefit from ASAP's faster ordering.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IX. CONCLUSION</head><p>NVM promises high-performance persistent data structures. Yet, the need to ensure ordering for consistency causes expensive stalls in current and proposed platforms, or limits the use of multiple memory controllers. Adding to that is the necessity for ordering persists across threads. ASAP relies on a novel early flushing mechanism that speculatively persists data out of order, and only ensures proper ordering on failure. ASAP maintains a small amount of recovery information in the memory controller to unroll the speculatively persisted data. This approach performs almost 23% better than past solutions to this problem, and within 3.9% of an ideal system.</p></div></body>
		</text>
</TEI>
