<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>TDRAM: Tag-enhanced DRAM for Efficient Caching</title></titleStmt>
			<publicationStmt>
				<publisher>arxiv</publisher>
				<date>04/22/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10519589</idno>
					<idno type="doi"></idno>
					
					<author>Maryam Babaie</author><author>Ayaz Akram</author><author>Wendy Elsasser</author><author>Brent Haukness</author><author>Michael Miller</author><author>Taeksang Song</author><author>Thomas Vogelsang</author><author>Steven Woo</author><author>Jason Lowe-Power</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[As SRAM-based caches are hitting a scaling wall, manufacturers are integrating DRAM-based caches into system designs to continue increasing cache sizes. While DRAM caches can improve the performance of memory systems, existing DRAM cache designs suffer from high miss penalties, wasted data movement, and interference between misses and demand requests. In this paper, we propose TDRAM, a novel DRAM microarchitecture tailored for caching. TDRAM enhances HBM3 by adding a set of small low-latency mats to store tags and metadata on the same die as the data mats. These mats enable fast parallel tag and data access, on-DRAM-die tag comparison, and conditional data response based on the comparison result (reducing wasted data transfers), akin to the mechanism of SRAM caches. TDRAM further optimizes the hit and miss latencies by performing opportunistic early tag probing. Moreover, TDRAM introduces a flush buffer to store conflicting dirty data on write misses, eliminating turnaround delays on the data bus. We evaluate TDRAM using a full-system simulator and a set of HPC workloads with large memory footprints, showing that TDRAM provides at least 2.6× faster tag check, 1.2× speedup, and 21% less energy consumption, compared to the state-of-the-art commercial and research designs.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Today's computers, equipped with significant processing capabilities and memory capacities, aim to fulfill the requirements of high-performance computing (HPC) tasks such as machine learning and artificial intelligence. To find the right balance between performance and capacity, manufacturers have embraced heterogeneous memory systems. These systems combine high-performance memories like HBM with high-capacity memories with lower performance. The introduction of interconnect technologies like Compute Express Link (CXL) is increasing memory heterogeneity with local and remote memory pools. Intel's Sapphire Rapids CPU demonstrates this strategy by employing on-package HBM DRAMs as cache <ref type="bibr">[25]</ref>, <ref type="bibr">[61]</ref>. This approach effectively boosts cache capacities and tackles the scaling limitations encountered by SRAM <ref type="bibr">[5]</ref>. The expanded cache capacity and enhanced bandwidth provided by HBM DRAMs offer the potential for improved data locality, without programmer intervention.</p><p>This work is licensed under the Creative Commons Attribution 4.0 International (CC-BY-SA 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate websites with the appropriate attribution.</p><p>However, the potential benefits of DRAM caches have not borne fruit. Previous studies of DRAM caches have shown that using standard DRAM devices as a cache can slow down applications with large memory footprints and high miss rates <ref type="bibr">[38]</ref>, <ref type="bibr">[59]</ref>. Current DRAM cache designs, such as Intel's Cascade Lake DRAM cache <ref type="bibr">[8]</ref>, <ref type="bibr">[38]</ref>, store tag and metadata together with the cache line data in the same DRAM. 
Storing tags and data together reduces hit time for read demands <ref type="bibr">[58]</ref>, but significantly increases miss penalties since a separate DRAM read is necessary to retrieve tag and metadata for hit/miss and status information, causing contention with demand reads. Also, all write requests, including those that hit in the cache, require a DRAM read to fetch tag and data to ensure dirty data is not overwritten, further exacerbating the contention and causing expensive turnaround bubbles on the data bus <ref type="bibr">[19]</ref>. These extra accesses for read misses and write demands increase: (i) latency for demand misses, (ii) contention in the read buffer, which extends the queue occupancy time, and (iii) wasted data movement and energy consumption.</p><p>Today's applications that require large-capacity memories have high miss rates in DRAM caches, which causes these issues to significantly affect workload performance. SRAM caches cannot scale to the capacities required by today's applications, and thus it becomes imperative to enhance existing DRAM cache designs to address these challenges.</p><p>In this work, we introduce TDRAM (Tag-enhanced DRAM), a DRAM microarchitecture specifically tailored for caching purposes.<ref type="foot">foot_0</ref> TDRAM enhances HBM3 by adding a set of small low-latency mats to store tags and metadata on the same die as the data mats. These mats enable faster access than the data mats through the reduction of wordline and bitline lengths. The additional on-die storage is sufficiently large to accommodate the tag and metadata for all DRAM cache lines; thus, the tag store scales with the data capacity. 
By placing the tags in separate mats on-die, TDRAM enables rapid on-die tag checking, which reduces the latency for demand misses, mitigates contention on the DRAM read queue, and decreases wasted data transfers (and thus energy).</p><p>TDRAM extends HBM3's interface in three ways: (1) It adds a unidirectional hit-miss (HM) bus to its interface to transfer the tag check result and metadata to the controller. (2) It adds two new commands to the HBM3 protocol: ActRd and ActWr, which access both tag and data mats in lockstep. These commands check the tag for the block and only send data to the controller when it is needed. (3) It adds a flush buffer to store conflicting dirty data from write misses, which eliminates costly turnaround delays on the data bus and immediate cache line data transfer to the controller for write requests. The overhead of this new design compared to HBM3 is 192 pins (out of 1,972 existing pins, a 10% increase) and 8.24% total die area.</p><p>TDRAM further improves performance by implementing early tag probing, which opportunistically performs tag checks (without data access) in otherwise unused command and HM bus slots. Tag probing returns an early hit-miss and status indication for a demand access, allowing certain operations (e.g., main memory access for read demand misses) to start earlier.</p><p>Early tag probing also reduces request queue occupancy time by removing misses from the queue early, allowing other demands to proceed with fewer stalls.</p><p>TDRAM is orthogonal to many prior works that focus on improving DRAM caching performance by adding predictors, prefetchers, tag caches, modifying coherence protocols, bypass policies, and other application-specific mechanisms <ref type="bibr">[23]</ref>, <ref type="bibr">[24]</ref>, <ref type="bibr">[28]</ref>, <ref type="bibr">[36]</ref>, <ref type="bibr">[68]</ref>. TDRAM is designed in a way that all of these techniques can be applied on top of it to further improve its performance. 
Overall, TDRAM enables a perfectly scalable HBM-based cache with a cohesive caching paradigm akin to processors' SRAM-based caches. Thus, TDRAM focuses on optimizing the core and fundamental DRAM cache operations by eliminating inefficiencies in existing designs.</p><p>We have extensively modeled TDRAM in the gem5 simulator <ref type="bibr">[51]</ref> for a detailed full-system cycle-level timing analysis. Our evaluations using scientific and graph analytics applications with large memory footprints show that TDRAM provides 2.6&#215; faster tag checks, a 1.2&#215; speedup, and at least 21% less energy consumption compared to commercial and research designs such as Intel's Cascade Lake and Alloy.</p><p>In this paper, we make the following contributions:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Microarchitecture</head><p>&#8226; We propose a new HBM3-based DRAM microarchitecture, TDRAM, designed for caching with on-DRAM-die tag management, to enable perfectly scalable DRAM caching where the tag storage scales with data capacity.</p><p>&#8226; We extend the HBM3 interface with a unidirectional Hit-Miss bus to transfer tag check results and metadata from DRAM to the controller, decoupling them from data transfer.</p><p>&#8226; We add a flush buffer to hold conflicting dirty data from write misses, which eliminates both costly turnaround delays on the data bus and immediate cache line data transfer to the controller for write requests. TDRAM opportunistically sends this data to the controller when the data bus is idle or in the read state.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Protocol</head><p>&#8226; We integrate two new commands into HBM3's protocol, ActRd and ActWr, which enable parallel independent access to both tag and data banks. The protocol selectively streams data to the controller only when necessary based on tag comparison, reducing bandwidth bloat.</p><p>&#8226; We propose an opportunistic early tag check mechanism that uses unused HM and command bus slots to minimize queueing delay for tag checks, reducing buffer contention and optimizing miss latencies. This mechanism does not access data banks. If this tag check results in a miss, then TDRAM initiates backing store access immediately, if necessary, and avoids future cache line data access, if not necessary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Evaluation</head><p>&#8226; We demonstrate that DRAM caching with existing designs causes slowdowns, while TDRAM provides a 1.11&#215; speedup.</p><p>&#8226; We show that TDRAM reduces energy consumption by 21%, since its conditional data response to the controller removes wasted data transfers.</p><p>&#8226; We analyze the performance of DRAM caching in disaggregated systems with remote main memories and show that TDRAM provides a 1.14&#215; speedup compared to the state-of-the-art commercial DRAM cache for low miss ratio applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND AND MOTIVATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. HBM3 Architecture</head><p>This section provides an overview of HBM DRAMs as the basis of TDRAM. HBM provides the highest bandwidth and capacity of any single DRAM package in mass production. HBM3 DRAMs stack multiple DRAM dies into a single package, and can support up to 64 GiB capacity using 12- to 16-high stacks <ref type="bibr">[47]</ref>. These devices provide up to 1024 GiB/s of bandwidth when running at 8 Gbps across 16 independent channels with 64b data (DQ) and 10b Row command (R) and 8b Column command (C) buses. Each DQ channel can be split into two 32-bit pseudo-channels (PCs) that share the same R and C buses, with each PC providing 32B access granularity <ref type="bibr">[9]</ref>, <ref type="bibr">[14]</ref>- <ref type="bibr">[16]</ref>. Each channel includes 38 additional signals for clocks, strobes, ECC, redundancy, and other functions. The wires connecting the high pin count interface between the DRAM and host (1024 DQs, 288 command/address (CA) buses, and more than 650 pins for additional channel and global functions) are implemented in technologies such as TSMC's InFO or silicon (e.g., a silicon interposer) to support the high pin and trace densities required for this technology.</p><p>HBM DRAMs are hierarchically organized, storing data bits in arrays of capacitors. DRAM bit cells are grouped into rows or pages, with multiple rows aggregated into mats, and mats organized into 2D structures called banks. Decoded row and column addresses identify bits within a bank. Multiple banks form a bank group that shares some resources. Back-to-back accesses to the same bank group require longer latencies to allow these shared resources to free up, while accesses to different bank groups enable lower latencies. set, tag comparisons can be performed in parallel if each way has its own comparator. A signal from the matching way is sent to the internal control logic to select the proper column in the data mats. 
Implementations without on-die tag comparators send all tags in the set back to the controller, incurring additional latency and energy, and the controller subsequently sends a request for the proper column to the DRAM, again incurring additional latency and energy consumption <ref type="bibr">[48]</ref>.</p><p>5) Tag Mats Timing Values: The TDRAM architecture minimizes tag access latencies using small low-latency mats, as discussed in &#167;III-B2. In our evaluation, we use timing parameters for the tag mats that are loosely based on RLDRAM technology. We base our timings on public datasheets because the timing parameters for RLDRAM are close to the proprietary analysis we conducted. Through discussions with DRAM designers, we validated the internal timing values. Table <ref type="table">III</ref> shows a list of timing values we used. The RLDRAM spec values (e.g., tRL = 15ns and tRC = 8ns) match, or are more optimistic than, our values (e.g., tRCD_TAG + tHM = 15ns and tRC_TAG = 12ns). Furthermore, these values and internal TDRAM timings were also correlated with prior work analyzing the use of smaller mats <ref type="bibr">[64]</ref>.</p><p>Additionally, tHM_int = tCCD_L + tHM_detect (which is a fast equality comparison). Address comparisons are already done in DRAMs today to quickly determine if every row or column address that the DRAM receives is a repaired row or column. We set tHM_detect to 0.5ns (one 2GHz clock cycle) for the fast equality comparison based on discussions with expert DRAM designers. The use of tHM_int depends on tRCD since a read operation cannot occur until tRCD is met. In our design, tRCD = 12ns, which is longer than tRCD_TAG + tHM_int = 10ns, effectively hiding the tag access and hit/miss detection latency. tHM_int was also correlated with prior work <ref type="bibr">[64]</ref>, which breaks ACT-to-data delay into: 47%-sensing, 26%-address-decode, 20%-MUXing (transfer+rate-conversion), and 7%-IO. 
The column decode is done in parallel to sensing (tRCD) with our ActRd/ActWr commands. Finally, the I/O delay is not relevant to internal timing. Only a portion of the MUXing delay, the delay to move data out of the IOSA, is relevant to internal timing, and this delay is optimized with smaller mats to achieve tHM_int = 2.5ns, including HM detect logic.</p><p>To ensure dirty data is not overwritten in the SA, tRL_CORE (used in write operations as illustrated in Figure <ref type="figure">7</ref>) needs to be less than or equal to intRD-to-WR data Delay + tBURST/2 = 9ns. We performed our evaluation using tRL_CORE = tCCD_L = 2ns.</p><p>6) Tag Storage Area Overhead: A 64 GiB direct-mapped DRAM cache can support 1 petabyte address space using a 14-bit tag. We assume 3B of tag and metadata for each 64B cache line. Tags are stored only in one bank group of the pair (the even-numbered bank groups), and the result of the tag comparison is communicated to the other (odd-numbered) bank group through an internal bus.</p><figure type="table"><head>TABLE II: TDRAM's cache operations on different accesses.</head><table><row role="label"><cell>Cache Access</cell><cell>CMD</cell><cell>DQ Activity</cell><cell>HM Bus</cell><cell>Later Actions</cell></row><row><cell>Read hit to clean</cell><cell>ActRd</cell><cell>Hit Data</cell><cell>Hit</cell><cell>None</cell></row><row><cell>Read hit to dirty</cell><cell>ActRd</cell><cell>Hit Data</cell><cell>Hit</cell><cell>None</cell></row><row><cell>Read to invalid</cell><cell>ActRd</cell><cell>None</cell><cell>Miss</cell><cell>Read main mem &amp; fill</cell></row><row><cell>Read miss to clean</cell><cell>ActRd</cell><cell>None</cell><cell>Miss</cell><cell>Read main mem &amp; fill</cell></row><row><cell>Read miss to dirty</cell><cell>ActRd</cell><cell>Dirty Data</cell><cell>Miss, Dirty Tag</cell><cell>Read main mem &amp; fill; Writeback dirty data</cell></row><row><cell>Write to invalid</cell><cell>ActWr</cell><cell>Wr Data</cell><cell>Miss</cell><cell>None</cell></row><row><cell>Write miss to clean</cell><cell>ActWr</cell><cell>Wr Data</cell><cell>Miss</cell><cell>None</cell></row><row><cell>Write miss to dirty</cell><cell>ActWr</cell><cell>Wr Data</cell><cell>Miss, Dirty Tag</cell><cell>Dirty data to flush buffer</cell></row><row><cell>Write hit to clean</cell><cell>ActWr</cell><cell>Wr Data</cell><cell>Hit</cell><cell>None</cell></row><row><cell>Write hit to dirty</cell><cell>ActWr</cell><cell>Wr Data</cell><cell>Hit</cell><cell>None</cell></row></table></figure><p>We estimate the die size impact of tag storage, control logic, and on-die comparison as follows. HBM3 DRAMs store an additional 6B of information (2B metadata and 4B parity) for every 32B of data (i.e., total column size is 38B) across 19 mats as shown by Park et al. <ref type="bibr">[57]</ref>. 
The HBM3 die photo shows that banks (including mats, BLSAs, and Sub-WL drivers) occupy about 66% of die area. The remaining 34% of die area includes shared resources like through-silicon vias (TSVs), IOSAs, per-bank group ECC, and column decoders.</p><p>For the tag mats, we use four smaller tag mats per data mat to reduce the row cycle time by reducing the bit line and word line lengths. Son et al. show that the overhead of changing the aspect ratio by a factor of 4 is 19% <ref type="bibr">[64]</ref>; however, we estimate a more pessimistic 24.3% when we scale by 1/2 in each dimension, based on our discussions with DRAM designers. Additionally, we only need tag mats in the even banks, reducing our overall area overhead. Thus, even banks require 24.3% additional area for the tags, and the data banks occupy 66% of the die. The overall impact on die size is therefore 24.3% &#215; 0.5 (only even banks) &#215; 0.66 (area for banks) = 8.02%. We also add area for wire routing (for example, to route Hit/Miss signals from the even bank to the odd bank), resulting in our estimate of 8.24% die area impact.</p><p>7) Tag Storage Power Overhead: On-die tag storage and tag checking increase the power of TDRAM over standard HBM3 DRAMs. However, TDRAM provides power benefits at the system level, as (i) tags are not communicated back to the controller to perform the tag check, and (ii) the protocol minimizes data movement amplification, reducing the number of commands and cache line packets sent between the TDRAM and the controller. In the HBM2 generation, 62.6% of the power is spent moving data between the DRAM and the controller <ref type="bibr">[12]</ref>. Avoiding transfers of tags and cache lines that are not needed (for example, when the tag comparison result is a read miss to a clean line) can be beneficial for overall memory system power. &#167;V-C provides an extensive power analysis of TDRAM.</p></div>
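<div xmlns="http://www.tei-c.org/ns/1.0"><p>The tag-width, timing, and area figures quoted in this subsection follow from simple arithmetic, which the minimal Python sketch below re-derives as a sanity check. The function and variable names are illustrative assumptions and not part of TDRAM; only the numeric inputs are taken from the text. The value used for tRCD_TAG is inferred from tRCD_TAG + tHM_int = 10ns rather than quoted from Table III, and the total tag store size is derived from the stated 3B per 64B line.</p><p><code lang="python"><![CDATA[
```python
# Sanity checks for figures quoted in the text. All names are illustrative;
# only the numeric inputs come from the paper.

GIB = 2**30
PIB = 2**50


def tag_bits(address_space_bytes: int, cache_bytes: int) -> int:
    """Tag width of a direct-mapped cache; both sizes must be powers of two."""
    for size in (address_space_bytes, cache_bytes):
        if size <= 0 or size & (size - 1):
            raise ValueError("sizes must be positive powers of two")
    # Address bits minus the bits already implied by the location in the cache.
    return (address_space_bytes.bit_length() - 1) - (cache_bytes.bit_length() - 1)


def tag_store_bytes(cache_bytes: int, line_bytes: int = 64,
                    tag_meta_bytes: int = 3) -> int:
    """Total tag and metadata storage, assuming 3B per 64B cache line."""
    return (cache_bytes // line_bytes) * tag_meta_bytes


def die_area_overhead(tag_mat_overhead: float, bank_fraction_with_tags: float,
                      bank_area_fraction: float) -> float:
    """Fraction of the die added by tag mats, before extra wire routing."""
    return tag_mat_overhead * bank_fraction_with_tags * bank_area_fraction


# Timing values in nanoseconds.
T_CCD_L = 2.0
T_HM_DETECT = 0.5
T_RCD = 12.0
T_HM_INT = T_CCD_L + T_HM_DETECT
T_RCD_TAG = 10.0 - T_HM_INT  # inferred from tRCD_TAG + tHM_int = 10ns


def tag_check_hidden(t_rcd_tag: float, t_hm_int: float, t_rcd: float) -> bool:
    """True when tag access plus hit/miss detection completes within tRCD."""
    return t_rcd_tag + t_hm_int <= t_rcd


if __name__ == "__main__":
    print("tag bits:", tag_bits(PIB, 64 * GIB))
    print("tag store (GiB):", tag_store_bytes(64 * GIB) / GIB)
    print("area overhead (%):", round(100 * die_area_overhead(0.243, 0.5, 0.66), 2))
    print("tHM_int (ns):", T_HM_INT)
    print("tag check hidden:", tag_check_hidden(T_RCD_TAG, T_HM_INT, T_RCD))
```
]]></code></p><p>With the inputs from the text, the sketch reproduces the 14-bit tag, the 8.02% area impact before wire routing, and tHM_int = 2.5ns, and it confirms that the 10ns tag path fits within the 12ns tRCD.</p></div>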
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Protocol</head><p>TDRAM's command protocol is similar to traditional DRAM protocols, with modifications to minimize tag access latency and bandwidth bloat. TDRAM provides new combined ActRd and ActWr commands that activate a row and read/write a column at both tag and data banks, with auto-precharge for a close-page policy. These combined commands include the row and column addresses, bank group, bank, and tag address needed to determine cache hit/miss. Internal state machines in TDRAM handle sequencing and timing of the</p></div>			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>In the recent past, it was uneconomical for DRAM manufacturers to modify the core DRAM microarchitecture. However, specialty DRAMs are becoming increasingly common (e.g., Samsung's Aquabolt <ref type="bibr">[4]</ref>, Micron's Automata Processor <ref type="bibr">[67]</ref> and RLDRAM <ref type="bibr">[11]</ref>).</p></note>
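<div xmlns="http://www.tei-c.org/ns/1.0"><p>The conditional responses that ActRd and ActWr trigger, as listed in Table II, can be sketched as a small lookup function. This is an illustrative model only: the class and function names are hypothetical, the strings mirror the table entries, and the sketch treats any access to an invalid line as a miss without a dirty victim, which is our reading of the table and not a description of the TDRAM state machines.</p><p><code lang="python"><![CDATA[
```python
# Illustrative model of the cache operations in Table II.
# Class and function names are hypothetical, not part of the TDRAM interface.
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class LineState(Enum):
    INVALID = "invalid"
    CLEAN = "clean"
    DIRTY = "dirty"


@dataclass(frozen=True)
class Response:
    dq_activity: str                # traffic on the data (DQ) bus
    hm_bus: str                     # result reported on the hit-miss (HM) bus
    later_actions: Tuple[str, ...]  # follow-up work after the access


def respond(command: str, tag_match: bool, state: LineState) -> Response:
    """Return the Table II row for one ActRd or ActWr access."""
    if command not in ("ActRd", "ActWr"):
        raise ValueError(f"unknown command: {command}")

    # A hit needs a matching tag on a valid line.
    hit = tag_match and state is not LineState.INVALID
    # A miss that lands on a dirty line must preserve the victim's data.
    dirty_victim = (not hit) and state is LineState.DIRTY

    if hit:
        hm_bus = "Hit"
    elif dirty_victim:
        hm_bus = "Miss, Dirty Tag"
    else:
        hm_bus = "Miss"

    if command == "ActRd":
        if hit:
            return Response("Hit Data", hm_bus, ())
        if dirty_victim:
            # The dirty victim is returned on DQ so it can be written back.
            return Response("Dirty Data", hm_bus,
                            ("Read main mem & fill", "Writeback dirty data"))
        # Clean or invalid miss: no data is transferred at all.
        return Response("None", hm_bus, ("Read main mem & fill",))

    # ActWr always carries write data on DQ; a dirty victim is parked
    # in the flush buffer instead of being sent immediately.
    later = ("Dirty data to flush buffer",) if dirty_victim else ()
    return Response("Wr Data", hm_bus, later)


if __name__ == "__main__":
    for cmd in ("ActRd", "ActWr"):
        for match in (True, False):
            for line_state in LineState:
                print(cmd, match, line_state.value, respond(cmd, match, line_state))
```
]]></code></p><p>In this model only a read hit or a read miss to a dirty line moves data from the DRAM to the controller, which reflects the conditional data response described above.</p></div>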
		</body>
		</text>
</TEI>
