<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Max-PIM: Fast and Efficient Max/Min Searching in DRAM</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/05/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10348301</idno>
					<idno type="doi">10.1109/DAC18074.2021.9586096</idno>
					<title level='j'>2021 58th ACM/IEEE Design Automation Conference (DAC)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Fan Zhang</author><author>Shaahin Angizi</author><author>Deliang Fan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Recently, in-DRAM computing is becoming one promising technique to address the notorious ‘memory-wall’ issue for big data processing. In this work, for the first time, we propose a novel ‘Min/Max-in-memory’ algorithm based on iterative XNOR bit-wise comparison, which supports parallel inmemory searching for minimum and maximum of bulk data stored in DRAM as unsigned & signed integers, fixed-point and floating numbers. We then develop a new processing-in-DRAM architecture, called Max-PIM, that supports complete bit-wise Boolean logic and beyond. Differentiating from prior works, Max-PIM is optimized with one-cycle fast XNOR logicin-DRAM operation and in-memory data transpose, which are heavily used and keys to accelerate the proposed Min/Max-in-memory algorithm efficiently. Extensive experiments of utilizing Max-PIM in big data sorting and graph processing applications show that it could speed up ~ 50X and ~ 1000X than GPU and CPU, while only consuming 10% and 1% energy, respectively. Moreover, comparing with recent representative In-DRAM computing platforms, i.e., Ambit [1], DRISA [2], our design could speed up ~ 3X - 10X.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>In the era of big data, min/max searching from bulk data array is one of the most important and widely used fundamental operation in many data-intensive applications, including but not limited to sorting, ranking, bioinformatics, data mining, graph processing, route planning, etc. <ref type="bibr">[3]</ref>- <ref type="bibr">[5]</ref>. For example, online social and news media need real-time ranking to evaluate the hottest information to show on their website, which requires fast min/max searching from massive data. Another example: min/max searching is the most timeconsuming computation (over 40%) in many large scale graph processing algorithms, such as popular Dijkstra's algorithm to find the shortest path, Prim's algorithm to find the minimum spanning tree, maximum flow, or to solve the famous traveling salesman problem. In those widely used "best-first algorithm", min/max searching operation is called for every node to find the minimum/maximum value in the large scale graph.</p><p>However, implementing fast and efficient min/max searching for big data faces significant challenges in conventional computer system from both memory architecture and computing algorithm. <ref type="bibr">(1)</ref> From memory architecture, the well-known 'memory-wall' challenge is causing issues, like long off-chip memory access latency, data congestion due to limited memory bandwidth, two orders higher energy consumption in data movement than data processing, etc. <ref type="bibr">[6]</ref>. <ref type="bibr">(2)</ref> From computing algorithm, the min/max searching is in general comparisonbased algorithm, where CPU needs to compare every element serially for colossal raw data. Such computing property makes it demand ultra-high computing resource and power.</p><p>Above challenges naturally motivate researchers to explore implementing fast and efficient min/max searching operations within memory where bulk data is stored, to greatly minimize power-hungry and low speed massive off-chip data communication, aligning with the emerging 'processing-inmemory'(PIM) concept. Among various types of PIM platforms, 'in-DRAM computing' platform is a natural choice for this problem due to its large memory capacity to store bulk data and off-chip data transfer reduction <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>. However, if directly deploying min/max searching in the most popular existing in-DRAM computing platforms, e.g. Ambit <ref type="bibr">[1]</ref> or DRISA <ref type="bibr">[2]</ref>, it faces two challenges. (1) First, unlike general-purpose CPU/GPU providing complex and complete computing instructions, those in-DRAM computing platform only supports bulk bit-wise Boolean logic and very limited instructions, which requires a new design of 'min/max-inmemory' algorithm to make it compatible with and fully leverage the in-DRAM computing hardware supported operations. (2) Second, from logic computing perspective, min/max searching function naturally relies on X(N)OR-based comparison operations. Although existing in-DRAM computing platforms could provide such function, their in-memory-logic designs are mainly depending on charge sharing based majority gate, which requires multiple cycles to implement X(N)OR logic <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>. For example, Ambit <ref type="bibr">[1]</ref> requires seven cycles to realize X(N)OR logic. It will take high cost in power and time due to large intermediate data write back, which reduces the benefits of in-memory computing.</p><p>To address above two challenges, in this work, we follow a principle of software &amp; hardware co-design to develop a parallel and efficient in-DRAM computing platform, called Max-PIM, where the main technical contributions are summarized below: <ref type="bibr">(1)</ref> We first propose a novel Min/Max-in-memory searching algorithm based on iterative XNOR bit-wise parallel comparison, which supports in-memory searching for minimum and maximum of bulk data stored in DRAM as unsigned &amp; signed integers, fixed-point and floating numbers.</p><p>(2) A new in-DRAM computing circuit and architecture, termed as Max-PIM, is then proposed to support complete bulk bit-wise Boolean logic and optimized for our proposed min/max-in-memory algorithm. Differentiating from prior works, the novelty comes from a new micro-architecture enabling an efficient in-memory data-transposing as well as a new in-DRAM-logic circuit to perform computationallyintensive XNOR logic operation in only one cycle.</p><p>(3) Detailed cross-layer experiments in a real-world dataset are conducted to demonstrate the efficiency of our design comparing with other state-of-the-art in-DRAM computing platforms (e.g. Ambit <ref type="bibr">[1]</ref>, DRISA <ref type="bibr">[2]</ref>), CPU and GPU. (4) Sorting and Dijkstra's algorithm in graph processing are used as study cases to demonstrate the improvement of Max-PIM.</p><p>II. PRELIMINARY</p><p>The Rowclone <ref type="bibr">[9]</ref>, Ambit <ref type="bibr">[1]</ref> and DRISA <ref type="bibr">[2]</ref> are among the most recognized in-DRAM computing designs with bulk bitwise in-DRAM-logic operations. We will discuss their design and compare them with ours in later experiments.</p><p>Rowclone <ref type="bibr">[9]</ref> presents a high-speed (&lt; 100ns) in-memory copy operation within DRAM sub-arrays as opposed to &#8764; 1&#956;s traditional operation in Von-Neumann computing architecture. RowClone-Fast Parallel Mode (FPM) <ref type="bibr">[9]</ref> essentially indicates an in-memory copying technique that does not require to transmit data back and forth to the processing units. In this technique, a multi-kilo byte in-memory copy operation can be implemented by issuing two back-to-back ACTIVATE commands to the source and destination DRAM rows without PRECHARGE command in between <ref type="bibr">[9]</ref>.</p><p>Ambit <ref type="bibr">[1]</ref> focuses on performing bit-wise logic operations in DRAM with minor modification. The basic AND, OR functions are implemented through a 3-input charge sharing based majority gates, called Triple Row Activation (TRA) in memory, by simultaneously sending the ACTIVATE command to three rows and then a PRECHARGE command. NOT function is also implemented considering two rows of Dual-Contact Cells (DCC). Ambit incurs 1% area over commodity DRAM chip <ref type="bibr">[1]</ref>. But, it suffers from high-latency multi-cycle logic operations to realize XOR2/XNOR2 on top of TRA.</p><p>DRISA <ref type="bibr">[2]</ref> is a re-configurable in-DRAM computing platform with different variations. DRISA not only has the ability to perform bit-wise logic operation, but also shows the ability to do arithmetical operations, such as ADD, MUL, and vectorwise inner-product. DRISA-3T1C triples the DRAM cell area to enable functionally complete NOR2 function on read bitline. However, it also requires multi-cycle operations to implement XOR2/XNOR2 operation. DRISA-1T1C is designed to execute one particular logic function via upgrading the sense amplifier (SA) unit by adding a CMOS logic gate in conjunction with a latch. Therefore, a single function (here, XNOR) could be implemented in two consecutive cycles with add-on CMOS circuitry. So, to support such a great variety of operations, DRISA requires much more modifications on standard DRAM architecture than Ambit.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROPOSED MAX-PIM</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Min/Max-in-Memory Algorithm</head><p>As discussed in <ref type="bibr">[10]</ref>, <ref type="bibr">[11]</ref>, finding the minimum/maximum (Min/Max) number equals to finding the last/first ranked number in a given list. To fully leverage the parallel bit-wise in-DRAM computing capability due to parallel sensing of multiple bit-lines <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[7]</ref>, our core idea of developing min/maxin-memory algorithm is to convert traditional software-based sequential comparison operations to XNOR based bit-wise parallel comparison for all the data stored in the same memory array. Note that, the algorithm we discuss in this section refers to min/max function within one memory array. A large dataset will be first partitioned into multiple memory arrays to get memory-array level min/max in parallel, then an overall min/max will be identified similarly. Algorithm-1 shows the pseudo code of our proposed bit-wise parallel Min/Max-inmemory algorithm. It mainly leverages the property that each bit in the binary format has different significance, and parallel in-DRAM-XNOR logic based comparison operation could gradually exclude smaller (or larger) data from MSB (Most Significant Bit) to LSB (Least Significant Bit). Note that, our algorithm has a constant searching time determined by the bit length of numbers. It is compatible with unsigned and signed integers, fixed-point numbers and floating numbers, which will be discussed in separate sub-sections below. For unsigned bit, we do not need to distinguish sign bit and others. 12 end 13 while current bit position &lt; N do Go through every bit from MSB to LSB.; 1) Unsigned Integer: Fig. <ref type="figure">1</ref> gives an example to find the minimum value in a 4-bit unsigned int array. Since 4 bits are used to represent unsigned value, 4 iterations are required to compute the index of minimum value(s). First, it initializes a matching vector with all '1's, whose length equals to the number of data in the array, and this matching vector will be updated with the outputs of parallel XNOR logic in every</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Index of Minimum Value</head><p>Step 1</p><p>Step 2</p><p>Step iteration except when the parallel XNOR outputs are all '0's (line 18 to 22 in algorithm-1). Each element in the matching vector corresponds to one number in the array, where '1' indicates this number will be compared in the following iteration, while '0' indicates the corresponding number will be excluded. Starting from MSB, it applies parallel XNOR logic to MSB of every number with constant '0' for min function (XNOR with constant '1' for max function), corresponding to line 3 to 11 in algorithm-1. Next, the matching vector value of '1111' is updated with current iteration parallel XNOR output-'1011', since XNOR outputs are not all '0's (line 18 to 22). As shown in Fig. <ref type="figure">1</ref>, in the second iteration, the new matching vector value-'1011' indicates that the second row (marked as '0') should be excluded from XNOR based comparison operation, and this position will remain as '0' in all following iterations. The next iteration will compute parallel XNOR between the second MSBs of non-excluded numbers and constant '0' (line 16). In this example, in the second iteration, after computing XNOR between the rest bits-'111' with constant '0', the outputs are all '0'. Based on our algorithm-1 line-18, it will skip the update of matching vector of this iteration and continue to next iteration. Similar process will continue based on the same rule until LSB. As our example shown in Fig. <ref type="figure">1</ref>, when the comparison in LSBs is finished, the position(s) remained as '1' in the matching vector is(are) the identified minimum number(s). Theoretically, we could stop the process when there is only one '1' left in the matching vector (e.g. iteration-3 in our example) to save searching time. However, our target in this work is that all the operations in the algorithm-1 will be implemented by a corresponding circuit module in our proposed Max-PIM platform. Designing such early stop detection will incur extra overhead to the hardware. Thus, we choose to simplify the hardware design with no early stop, leading to a constant searching time determined by the bit representation length.</p><p>2) Signed Integer: For signed numbers in the min/max-inmemory algorithm, the basic procedure is the same as the above discussed unsigned numbers except sign bit. In signed number, the first bit represents the sign, where '1' and '0' indicate negative and positive, respectively. When for min (or max) function, the algorithm will first check the sign bit. If negative (or positive) values are found, it will exclude all positive (or negative) numbers for the following computation. To do so, as shown in line-3 to -8 in algorithm-1, a parallel XNOR between sign bits with constant-'0' (or '1') for max (or min) function is needed. The rest procedure will be similar.</p><p>3) Fixed-Point Number: Fig. <ref type="figure">2</ref> provides an example to implement min-in-memory function for a group of four 4-bit fix numbers. To start, matching vector will be initialized with all '1's. Then, XNOR logic is applied to all sign bits with '1', where the matching vectors will be updated based on XNOR outputs to exclude any positive number for next iterations. The rest 3 iterations are similar to the above discussed unsigned integer number process. In this specific example, it can be seen that if there are multiple minimum values in the array, our final matching vector will also indicate with multiple '1's.</p><p>4) Floating Number: Our algorithm is also compatible with floating numbers. For example, the IEEE754 <ref type="bibr">[12]</ref> floating number representation is a popular floating-point arithmetic standard. It contains three parts: signed bit, exponent bits, and significand precision. According to the IEEE754, the binary floating number may take 16 bits to 256 bits to represent one floating number. Taking the 64 bit floating number as an example, the representation (-1) sign &#215;(1+ 52 i=1 b 52-i 2 -i )&#215; 2 e-1023 can be written as (-1) sign &#215; 2 e-1023 &#215; 1.f raction. That means the bits in the exponent parts have higher significance than the bits in the fraction parts (significand precision). And the bits within exponent parts and significand precision obey the same rule as unsigned integer number, where the bit significance decreases along bit position. Thanks to the bit organization that exponent is on the left side of the fraction part in the endianness(little-endian) sequencing, our proposed Min/Max searching function can be directly applied to the floating number without modification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Max-PIM Logic Circuit and Architecture</head><p>It can be summarized that several important functions are required to be implemented efficiently for in-DRAM computing platform to fully support above algorithm: 1) transpose data copy; 2) fast and parallel in-DRAM XNOR logic; 3) matching vector update; 4) identified min/max index decode.</p><p>The overall architecture diagram of our proposed Max-PIM is given in Fig. <ref type="figure">3</ref>. It keeps the original memory hierarchy by dividing every DRAM chip into multiple banks that share I/O and buffers. Each bank contains multiple normal memory matrices (MATs) consisting of typical DRAM subarrays and a single computational array shown in blue. The computational array is developed similar as Ambit <ref type="bibr">[1]</ref>  <ref type="bibr">[13]</ref>, with all Ambit supported logic and instructions, but enhanced to support a new Dual Row Activation (DRA) mechanism to perform XNOR operation in addition to the typical Triple Row Activation (TRA). Each computational array consists of: two row decoders (one for data rows and one for all 1/0 rows), one column decoder (to enable transpose copy), modified logic sense-amplifier (to support XNOR logic) with Typical Sense Amplifier (TSA, for memory sensing and charge sharing based majority gate), one latch per bit-line (to store the matching vector and control the logic-SA), pseudo-OR gate (to detect all-zero XNOR outputs), one priority encoder (to return final index of identified min/max locations).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1) Transpose</head><p>Copy: Due to DRAM's destructive read property, any in-DRAM computing needs to make a copy of operand data to perform in-DRAM-logic <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[7]</ref>. More importantly, based on our min/max-in-memory algorithm, one important operation of each iteration is that we need to implement parallel XNOR logic between the bits in the same location of all numbers with constant-'1'/'0', which brings two requirements for in-DRAM computing hardware: 1) for single XNOR logic, both operands (i.e. one specific bit from one number and constant-'1'/'0') need to be stored in the same bit-line based on the circuit design of in-DRAM-logic that will be explained in next paragraph; 2) to fully leverage the parallelism of in-DRAM computing, similar as prior works, multiple logic-SAs could be simultaneously activated, which is determined by the size of memory array. Therefore, to conclude, in the computational memory array, the binary data of a number needs to be stored along the bit-line direction, not along the traditional word-line direction, which requires a transpose copy from data memory to computational memory.</p><p>To support the transpose copy operation, we adopt the similar dual-contact cell (DCC) design (Fig. <ref type="figure">3</ref>) as <ref type="bibr">[1]</ref>. In DCC, one transistor shares the gate with other transistors on the same row to form the normal row-wise word-line. Another transistor shares the gate with other transistors on the same column to form the column-wise write word-line, which is controlled by the column decoder. The drain and source of the second transistor connect to the cell capacitor and the other normal DRAM array, respectively (the Data port in Fig. <ref type="figure">3(c)</ref>). Copying operand data from data storage array to computational array requires two steps: (1) reading entire word-line from the data storage array and ( <ref type="formula">2</ref>) writing them to one bit-line (column) of the computational array with the help of DCC and column decoder. Fig. <ref type="figure">4</ref> gives a demonstration for this transpose copy.</p><p>2) Parallel XNOR logic in one cycle: Max-PIM supports all traditional charge sharing based majority gate logic and optimize heavily used XNOR logic with only one cycle. In our min/max-in-memory algorithm, since one operand of XNOR comes from one bit in specific location of one number in the array, and the other one is constant-'1' or '0' for min or max function, respectively, we reserve several rows in the computational array for storing such constant bits. Traditionally, due to the destructive read in DRAM, we have to wait for data refresh of all ones/zeros rows after every computation. Thanks to the DCC cell design, which separates write and read, we could only use 2 rows for all '1's and 2 rows for all '0's without introducing extra refresh time by refreshing one row while using the other row for XNOR logic. For the all '1/0' rows, we connect the data port of DCC cells to VDD/GND.</p><p>Regarding in-DRAM XNOR logic circuit, it is based on the popular Dual-Row Activation scheme <ref type="bibr">[7]</ref>. First, the bit-lines will be pre-charged to V DD 2 . Then, two word-lines storing the operand data are activated simultaneously, which leads to charge sharing between the two cells in the same bitline and generates different sensing voltage at the associated logic-sense-amplifier (logic-SA) circuit depending on different values stored in these two cells. Then this sensed voltage will be compared with different reference voltages to implement different logic functions. In this work, we present a logic-SA circuit design that could implement XNOR logic in only one cycle as shown in Fig. <ref type="figure">3(c)</ref>, where typically multiple cycles are needed in most prior works <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>. In the circuit, the plus sign marked inverter has higher threshold voltage, and the minus sign marked inverter has lower threshold voltage than normal inverter. It is designed as, for the high threshold inverter, the output goes to zero only when both activated cells have high voltage (storing '11'), implementing NAND function. The low threshold inverter works on the opposite situation that it only generates a high output when both cells have low voltage (storing '00'), implementing NOR function. Therefore, based on XNOR = NAND(OR, NAND), the XNOR logic is implemented by adding an inverter and NAND logic circuit following the two skewed inverter as shown in Fig. <ref type="figure">3(c)</ref>.</p><p>3) Matching Vector Update: As described in our algorithm-1, matching vector needs to be updated in every searching iteration, except all-zero XNOR outputs. To avoid writing back the XNOR outputs to memory for matching vector update, we incorporate latches in the logic-SA circuits as shown in Fig. <ref type="figure">3(c</ref>) to store and update matching vector. Moreover, to exclude update in case of all-zero XNOR outputs, one allzero detection unit circuit is designed based on a pseudo OR gate (dotted box of Fig. <ref type="figure">3(c</ref>)) to control the matching vector latch update.</p><p>4) Identified Min/Max decode: When the iterative XNOR comparison is done, the final matching vector indicates the indices/ location(s) of found min/max values. Max-PIM is able to return the identified min/max data address to user through a priority encoder.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EXPERIMENTS AND RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Experiment Setup</head><p>In our experiments, we set each bank has 16 MATs and each MAT consists of 2 sub-arrays with size of 1024&#215;256. Each bank contains one computational array shared by all the sub-arrays within the same bank. The computational array size is 260&#215;1024, where 256 rows are used to store data and 4 reserved rows store constant '0/1'. Thus, one bank stores &#8764;1MB data. Considering 1GB DRAM capacity, there are 1024 banks for the whole DRAM. As the competitors, we built Ambit and DRISA platforms with the same organization as Max-PIM, but with their own in-DRAM logic circuits. For a fair comparison, all designs use DCC cells for entire computational array to support transpose write.</p><p>To evaluate the performance, we use the similar crosslayer evaluation framework in <ref type="bibr">[7]</ref>, starting from circuit-level evaluation with TSMC 65nm technology. We use the same process node to re-implement all counterpart designs. The memory peripherals are simulated using the same technology library in Synopsys Design Compiler. We then feed the circuit simulation results into architecture-level tools. We then extensively modify CACTI7.0 <ref type="bibr">[14]</ref> to extract the performance results. An in-house mapping algorithm is then developed on top of architecture level to parse the input vector coming from various data-sets and evaluate the performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Comparing with other in-DRAM computing platforms</head><p>We compare Max-PIM with two start-of-the-art in-DRAM computing platforms: Ambit <ref type="bibr">[1]</ref> and DRISA <ref type="bibr">[2]</ref>. To conduct    fair comparison in min/max searching, we re-implement those designs with all required peripherals, including DCC array, column decoder, multiple row decoder, pseudo OR gate, latch (Matching vector), and priority encoder. Fig. <ref type="figure">5</ref>(a) shows the area overhead of a single computational array. All the peripheral circuits required to support min/maxin-memory algorithm is the same. The area difference comes from different logic-SA circuit designs. It shows Max-PIM has similar area overhead as Ambit and DRISA(1T1C) design, but DRISA(3T1C) design has much larger area overhead. Fig. <ref type="figure">5</ref>(b) provides the power comparison and cycles for XNOR logic. The DRISA(1T1C) uses XNOR gate at the bottom of every bit-line instead of leveraging the DRA to perform the XNOR, requiring two cycles. Both Ambit and DRISA(3T1C) leverage the multi-row activation to achieve the logic operations, such as AND, OR, and NOT. Ambit needs up to 7 cycles for XNOR. DRISA(3T1C) needs to do NOR 4 times and stores back the intermediate data. The Max-PIM could implement XNOR logic in only one cycle. It avoids intermediate data write back compared with other designs, and thus greatly reducing power and latency. Fig. <ref type="figure">6</ref> summarizes the distribution and comparison of latency and energy for Min/Max searching within one computational sub-array.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Min/Max searching in real word dataset</head><p>To evaluate the performance of our Max-PIM platform in min/max searching in a real world dataset, we adopt a popular data-set T10I4D100K <ref type="bibr">[15]</ref> containing 1010228 numbers, where each number is represented as 256-bit unsigned integer numbers stored in DRAM. We report the execution time and energy consumption measurement in Fig. <ref type="figure">7</ref> for different computing platforms including Ambit, DRISA, CPU, GPU. For all in-DRAM computing platforms, including our Max-PIM, Ambit and DRISA, such large array cannot fit into a single computational array, requiring necessary data partitioning. The total 1010228 numbers are partitioned into 987 computational sub-arrays, where each sub-array will compute in parallel to return a local minimal number. Then, such local minimal numbers will be wrote into another computational array to get the global minimal number. For CPU&amp;GPU evaluation, the reported time is total execution time, including data loading from main memory and processing. For energy estimation, similar as DRISA <ref type="bibr">[2]</ref>, we scale down 50% of CPU&amp;GPU average power to exclude the power cost by cooling, voltage regulators, etc. Among in-DRAM computing platforms, our Max-PIM achieves the minimal latency and energy consumption. Max-PIM could achieve &#8764;50 -&#8764;1000&#215; speed improvement, and at least one order smaller energy consumption than GPU&amp;CPU.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Applications in Sorting and Graph Processing</head><p>In this section, we utilize our Max-PIM in real-word applications of data sorting and Dijkstra's algorithm in graph processing to further evaluate the performance.</p><p>For the sorting evaluation, the dataset is the same as the one used in section IV-C. Leveraging the parallel min/max searching provided by Max-PIM, we adopt the min/max sorting algorithm, where each round will find the min or max to iteratively sort the data. We report the performance of different in-DRAM computing platforms and software implementation in CPU in table I. It can be seen that all in-DRAM computing platforms outperform software implementation in CPU by three orders mainly due to save of large amount of offchip data movement and ultra-parallel processing capability. Aligning with prior experiments in single min/max searching, our Max-PIM achieves the best performance compared with other in-DRAM computing platforms due to its optimized XNOR and peripheral circuits.</p><p>Dijkstra's algorithm <ref type="bibr">[16]</ref> is a popular and widely used algorithm in graph processing to find the shortest path in large graph, where min/max searching dominates overall computation. Three different dataset are used here: geom has 7343 nodes, foldoc has 13356 nodes and EAT SR has 23219 nodes <ref type="bibr">[17]</ref>. Table <ref type="table">II</ref> reports the Min/Max searching speed improvement over CPU for different in-DRAM computing platforms. Similar performance improvement trend could be observed. In general, in-DRAM computing platform could all achieve one to two orders speed up than CPU and our Max-PIM still outperforms all other in-DRAM computing platforms significantly. Meanwhile, it is also observed that larger speed up is achieved for larger dataset, clearly proving  again memory-wall becomes the bottleneck when dealing with larger dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSION</head><p>In this work, we first propose a min/max-in-memory algorithm to find the minimum/maximum of an array stored in main memory. A DRAM based in-memory computing design, named Max-PIM, is presented to support this algorithm with massive parallelism and minimized data movement. Comparing to other DRAM based in-memory computing designs, CPU, and GPU platforms, Max-PIM's software&amp;hardware codesign requires the shortest time and smallest energy for an identical task. We have demonstrated that such proposed Min/Max searching algorithm and hardware have great potential to be applied in many applications, such as sorting, ranking, graph processing, etc.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Authorized licensed use limited to: ASU Library. Downloaded on August 11,2022 at 20:00:16 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
