<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Batched Predecessor and Sorting with Size-Priced Information in External Memory</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/10/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10297508</idno>
					<idno type="doi">10.1007/978-3-030-61792-9_13</idno>
					<title level='j'>Latin American Symposium on Theoretical Informatics</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Michael A Bender</author><author>Mayank Goswami</author><author>Dzejla Medjedovic</author><author>Pablo Montes</author><author>Kostas Tsichlas</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The fundamental problems of sorting and searching, traditionally studied in the unit-cost comparison model, have been generalized to include priced information, where different pairs of items have different comparison costs. These costs can be arbitrary (Charikar et al. STOC 2000), structured (Gupta et al. FOCS 2001), or stochastic (Angelov et al. LATIN 2008). Motivated by the database setting where the comparison cost depends on the sizes of the records, we consider the problems of sorting and batched predecessor where two non-uniform sets of items A and B are given as input. In the RAM model, pairwise comparisons (A-A, A-B and B-B) have respective comparison costs a, b and c. We give upper and lower bounds for the case a<= b <= c, which serves as a warmup for the generalization to the external-memory model. In the Disk-Access Model (DAM), where transferring elements between disk and RAM is the main bottleneck, we consider the scenario where elements in B are larger than elements in A. All items are required in their entirety for comparisons in RAM. A key observation is that the complexity of sorting depends on the interleaving of the small and large items in the final sorted order, and with a high degree of interleaving, the lower bound is dominated by an associated batched predecessor problem. We give output-sensitive bounds on the batched predecessor and sorting; our bounds are tight in most cases. Our lower bounds require novel generalizations of lower bound techniques in external memory to accommodate non-uniform keys.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In most published literature on sorting and other comparison-based problems (e.g., searching and selection), the traditional assumption is that a comparison between any two elements costs one unit, and the efficiency of an algorithm depends on the total number of comparisons required to solve the problem. In this paper, we study a natural extension to sorting, where the cost of a comparison between a pair of elements can vary, and the comparison cost is the function of the elements being compared. We derive worst-case lower and upper bounds in the random-access-machine (RAM) and the disk-access-machine (DAM) <ref type="bibr">[1]</ref> models for sorting and the batched predecessor problems.</p><p>In the RAM model, we assume that comparisons between a pair of keys have an associated cost that depends on the type of keys involved. As a toy problem, consider n red and n blue keys, where a comparison between a pair of red keys costs a, between a red key and a blue key costs b, and between a pair of blue keys costs c. Without loss of generality we can assume a &lt; c, which gives rise to three cases to be considered, a &lt; b &lt; c, a &lt; c &lt; b, and b &lt; a &lt; c (when b = 1 but a = c = &#8734; corresponds to the well-known nuts and bolts problem <ref type="bibr">[2]</ref>.) In this paper we consider the setting of <ref type="bibr">[16]</ref>, where the comparison cost depends on the length of the keys being compared. However, our analysis considers the worst-case cost parameterized by the specific distribution (or the "interleaving") of the elements in the final sorted order.</p><p>Then we turn to the disk access machine (DAM ) model (also called the external-memory model or the I/O model ) <ref type="bibr">[1]</ref>, designed to capture the key aspect of data-intensive applications, where transferring data is the main bottleneck, as oppose to CPU computation. In this simplified model of modern memory hierarchy, data is transferred from an external disk of infinite capacity to the main memory of size M in blocks of size B, where M &gt; B and input size N &gt;&gt; M ; the cost of the algorithm is measured by the number of block transfers (I/Os) that it needs; once data is in memory, all computation comes for free.</p><p>In the DAM model, the notion of the comparison cost naturally comes into play when elements have different sizes (or lengths) because the larger the elements are, the fewer of them can fit in a block transfer. Specifically, if a key has length w, where w &#8804; B, then up to B/w keys can be fetched with one I/O; similarly, if w &#8805; B, then it takes w/B I/Os to bring that key into memory. Moreover, a long element, when brought into RAM, will displace a larger volume of keys than a short element, thus reducing the parallelism that many external-memory algorithms such as external merge sort benefit from. <ref type="foot">6</ref> In the DAM model, we are given two sets of keys, S keys of unit size (short keys) and L/w (long) keys of size w each (total volume L). We express our results parameterized by the interleaving of the elements in their final sorted order. Let the interleaving parameter k denote the number of consecutive runs of large keys (i.e., stripes) in the final sorted order. Our goal is to express the performance of the sorting algorithm, as a function of S, L, w, and k.</p><p>Sorting with two key lengths illustrates a special connection between sorting and the batched searching problem that we call PLE (Placement of Large Elements). Given S keys of unit size (short keys) in the sorted order, and L/w (long) keys of size w each, the objective in the PLE is to find the short key that is the immediate predecessor of each long key. The PLE problem is a lower bound on sorting because it starts off with more information than the original sorting and asks to do less; in many cases, it dominates the sorting cost.</p><p>However, obtaining lower bounds on the PLE presents several challenges. First, having records of different sizes implies that standard information-theoretic lower bounds on the unit-sized case do not apply, because now different types of I/Os can contribute different amounts of information. Second, depending on the interleaving of the small and large elements, expressed in our bounds with the interleaving parameter k, and the size of the large element w, the PLE bounds substantially change. And lastly, PLE is a batched searching problem with a nontrivial preprocessing-query tradeoff <ref type="bibr">[9]</ref>, an interesting problem on its own. Related Work. In RAM, algorithms for inputs with priced information have been studied in the context of competitive analysis <ref type="bibr">[11,</ref><ref type="bibr">12,</ref><ref type="bibr">16,</ref><ref type="bibr">3]</ref>. Another example of varying comparison costs is the well-known nuts-and-bolts problem <ref type="bibr">[2]</ref>. Interleaving-sensitive lower bounds and batched searching are related to lower bounds for sorting multisets <ref type="bibr">[19]</ref> and distribution-sensitive set-partitioning <ref type="bibr">[14]</ref>.</p><p>Aggarwal and Vitter <ref type="bibr">[1]</ref> introduced the external-memory (DAM) model. The fundamental lower bounds were further generalized in <ref type="bibr">[6]</ref> and <ref type="bibr">[15]</ref> to the external algebraic decision tree model. Prominent examples studying lower bounds on batched and predecessor searching are found in <ref type="bibr">[4,</ref><ref type="bibr">7,</ref><ref type="bibr">9]</ref>.</p><p>Most previous work that considers variable-length keys does so in the context of B-trees <ref type="bibr">[18,</ref><ref type="bibr">13,</ref><ref type="bibr">17,</ref><ref type="bibr">20,</ref><ref type="bibr">8]</ref>, and string sorting <ref type="bibr">[5]</ref>, where in the latter, authors derive upper and lower bounds under different models of key divisibility. Strings are divisible, and that brings down the complexity of sorting, but the lower bounds in our paper also imply the worst-case bounds for the string scenario where the tie is broken at the last character. Indivisible keys are found in practical settings, where sorting and searching libraries such as GNU Sort <ref type="bibr">[21]</ref> or Oracle Berkeley DB <ref type="bibr">[10]</ref> allow the developer to pass in a comparison function as a parameter.</p><p>Organization. In Section 2 we present the RAM version of our problem. We present the sorting problem in external-memory in Section 3 and relate it to the batched predecessor problem. We then derive lower and upper bounds on the batched predecessor problem in sections 5 and end with open problems in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Warmup: the RAM Version</head><p>Two types, RAM version (2RAMSORT). The input is n red and n blue keys, and the output is the sorted sequence of all keys. A comparison between a pair of red keys costs a, between a red key and a blue key costs b, and between a pair of blue keys costs c. Without loss of generality we can assume that a &lt; c.</p><p>The optimal sorting cost in RAM depends on the final interleaving of the elements in the final sorted order. If in the final sorted order all red keys come before all blue keys, then &#920;(an log n + cn log n) is the optimal total comparison cost, and the algorithm that separately sorts the two sets, and uses only one red-blue comparison to concatenate is optimal. However, if the red and blue keys alternate in the final sorted order, then no blue-blue comparisons are ever required to sort, rendering the previous algorithm suboptimal. Stripes and the interleaving parameter k. A consecutive run of red or blue keys in the final sorted order is called a stripe. For simplicity we assume we have as many red stripes as blue. Define k to be the number of stripes, and let i (respectively s i ) be the number of blue (respectively red) keys in stripe i. The notation and s are chosen to correspond with the later sections when red elements will be small and blue elements will be large.</p><p>Below we prove the tight bounds for the case a &lt; b &lt; c: this is the most natural case to serve as a warmup for the I/O-model, because if we consider the red elements small, blue elements large, then the comparisons involving red elements cost less than those involving blue elements, thus a &lt; b &lt; c. Theorem 1. 2RAMSORT has the following comparison cost complexity for the comparison-cost case a &lt; b &lt; c:</p><p>Proof. The lower bounds follow by counting the number of permutations needed to solve the following subproblems of 2RAMSORT: 1. The total number of permutations any algorithm for 2RAMSORT must achieve is at least n!. A binary comparison reduces the number of permutations by a factor of at most 2, and the cheapest comparison cost is a, thus giving a lower bound a log n! = &#8486;(an log n), our first term. 2. Consider the following instance of the problem, where the red elements are given as sorted, and the locations and the contents of stripes of blue elements are also provided but stripes are unsorted inside. This gives us a lower bound of &#8486;(c k i=1 i log i ), our third term. 3. Proving the second term as a lower bound involves addressing the following instance: the red elements are given sorted, and the algorithm is just required to discover the contents and the locations of k blue stripes. Once the algorithm solves this batched predecessor problem, the stripes are individually sorted for free.</p><p>There are at least P = n k S(n, k) permutations to consider, first term corresponding to finding k positions for stripes, and the second term, S(n, k) is the Stirling number of the second kind, i.e., the number of ways to partition a set of size n into k non-empty, disjoint subsets, which corresponds to distributing contents among the k blue stripes. Considering that S(n, k) &#8805; k n-k , we obtain that P &#8805; (n/k) k k n-k . Since red elements are already sorted, comparisons of cost a are useless, and the cheapest available comparisons are those costing b. Thus we get a lower bound of &#8486;(b(k log(n/k) + (n -k) log k)), which equals &#8486;(b(k log n + n log k -2k log k)) . One can then show by a case analysis that the last -2k log k term can be removed from the above expression, a detail we allocate to the full version.</p><p>Since we derived the lower bound on a constant number of instances of 2RAMSORT, we can claim a lower bound of the maximum complexity of these instances, that in turn equals the sum of the complexities in &#8486;(.) notation.</p><p>Due to space constraints, we refer the reader to the full version of the paper for the description of the upper bound. The steps in the upper bound match the respective lower-bound terms, by 1) sorting the red elements, 2) using two balanced binary trees, one for discovering new stripes of blue elements, and the other for assigning blue elements to existing stripes, and lastly, 3) sorting the stripes of blue elements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Sorting and Batched Predecessor in External Memory</head><p>with Size-Priced Information</p><p>The input to the two-sized sorting and batched predecessor problems are S = {s * } (the small records) and L = { * } (the large records, each of size 1 &lt; w &#8804; M/2). A set of large elements forms a stripe if for each pair of large elements i and j in the stripe, there does not exist a small element between i and j in the final sorted order. Let k be the number of large-element stripes, and let the large-element stripes be L 1 , L 2 , . . . , L k , as they are encountered in the ascending sorted order. The parameters in the complexity analysis of sorting and batched predecessor are thus S = |S|, L = |L|, w, k, and {L i } k i=1 , where</p><p>Definition 1 (Two-Sized Sorting Sort (S, L)). The input is an (unsorted) set of elements N = S &#8746; L. Set S consists of S unit-size elements, and L consists of L/w elements, each of size w. <ref type="foot">7</ref> The output comprises the elements in N , sorted and stored contiguously in external memory.</p><p>Definition 2 (PLE-Placement of Large Elements:). The input is the sorted set of small elements S = {s 1 , s 2 , . . . , s S }, and the unsorted set of large elements L = { 1 , 2 , . . . , L/w }. In the output, elements in S are sorted, and elements in L are sorted according to which stripe they belong to, but arbitrarily ordered within their stripe.</p><p>Theorem 2 (Sorting complexity). Denote by PLE (S, L) the complexity of the PLE problem. Then the I/O complexity of Two-Sized Sorting Sort (S, L) is</p><p>The first term refers to the sorting the short elements, and the third term refers to sorting the individual stripes of large elements. The first term is identical to the conventional O( N B log M/B N B ) bound by Aggarwal and Vitter <ref type="bibr">[1]</ref>. The third term requires a slight generalization of that bound, because large elements are larger than a block, a proof we include in our full version.</p><p>As in the RAM setting, since we have three subproblems, their maximum complexity, and hence the complexity of their sum, is a lower bound on Sort (S, L). We have thus reduced the sorting problem to the batched predecessor problem, which will occupy the rest of this article. The proof of the lower bound for sorting N unit-sized keys in <ref type="bibr">[15]</ref> proceeds in the following fashion: assuming that all blocks are sorted (using a linear scan costing N/B I/Os), there are N !/(B!) (N/B) permutations required to achieve, and the transfer of a block of B sorted elements into the main memory containing M -B sorted elements reduce the number of permutations by at most a factor M B (the "fan-out," since this is the degree of the node in the decision tree). Standard algebra gives a lower bound of &#8486;( N B log M/B N B ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Main Challenges in the Batched Predecessor Problem</head><p>small elements. Naively the maximum fan-out can be upper bounded by B p , which is not tight. Our main aim is to get a better bound on this fan-out. Both our upper and lower bounds are a minimum of two terms, where one dominates the other depending on how large the large elements are. 2. Requiring output-sensitive lower bound limits adversarial arguments: Lower bounds on the unit-sized batched predecessor problem in external memory were recently obtained in <ref type="bibr">[9]</ref>. In our setting, a more complicated adversarial analysis is required, that forms exactly k stripes at the end. Using this, we can argue a fan-out of at most 2 B on most small block I/Os.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Complexity of the Batched Predecessor Problem: Lower Bounds</head><p>The following theorem presents the PLE (S, L) lower bound:</p><p>Theorem 3 (PLE Lower Bound).</p><p>In order to prove Theorem 3, we first divide the PLE (S, L) problem further into three subproblems for which we develop a common adversary argument framework:</p><p>1. S-k: An instance with only one large element in each large-element stripe.</p><p>-Input: Set S of unit-sized elements s 1 , . . . , s S (sorted), where s 1 = -&#8734; and s S = &#8734;, and large elements 1 , . . . , k (volume kw) unsorted. -Output: For each i output s j such that s j &#8804; i &#8804; s j+1 . It is guaranteed that no other k satisfies s j &#8804; k &#8804; s j+1 (one large element per stripe). 2. k-k: An instance with only one small element in each small-element stripe.</p><p>-Input: Unit-sized elements s 1 , . . . , s k+1 sorted, where s 1 = -&#8734; and s k+1 = &#8734;, and large elements 1 , . . . , k (volume kw) unsorted. -Output: For each i , output its predecessor and successor in S.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">k-k:</head><p>An instance with only one element in each stripe, large or small.</p><p>-Input: Unit-sized elements s 1 , . . . , s k+1 sorted, where s 1 = -&#8734; and s k+1 = &#8734;, and large elements 1 , . . . , k (volume kw) unsorted. -Output: The entire set in the sorted order. The format of lower bounds for S-k, k-k and k-k is as follows: let X be the logarithm of the total number of permutations that an algorithm needs to achieve in order to solve the problem. As is easily observed, the values of X (modulo constant factors) for these three subproblems are k log(S/k), k log k, and k log k, respectively. Lemma 1 below is the most technical part of this paper, and it helps us quantify the behavior of the adversary during small-block and largeelement inputs for all three subproblems. We use this lemma to prove the lower bounds for the individual three subproblems (found in Lemma 2, Lemma 3, and Lemma 4). Then we put the three lemmas together to obtain the expression from Theorem 3. Lemma 1. Consider any algorithm for the S-k, k-k, or the k-k problem. There exists an adversary such that:</p><p>-On the input of any block of B short elements, the adversary answers comparisons between all elements in main memory such that the fan-out of this I/O is at most 2 B . In other words, the number of permutations the algorithm needs to check is reduced by a factor at most B . -On the input of any large element (costing w/B I/Os), the adversary answers comparisons between all elements in main memory such that the fan-out of this I/O is at most O(M ). In other words, the number of permutations the algorithm needs to check is reduced by a factor at most O(M ).</p><p>Proof of Lemma 1: We prove this lemma by describing the adversary. We capture the information learned at every point of the algorithm by assigning a search interval R( ) = (s i , s j ) for a large element at step t as the narrowest interval of small elements where can possibly land in the final sorted order, given the information the algorithm has learned so far. Bits of information is the logarithm of the search interval (e.g., halving the search interval means learning one bit of information.) It will be useful to consider the binary tree T on the set S. The search interval of any large element at any point during the execution of the algorithm is a contiguous collection of leaves in T .</p><p>For simplicity we will assume that the size of S is a power of 2, and hence T is perfectly balanced. Also, if R( ) = (s i , s j ) is the range of a large element, we will make sure the adversary "rounds off" the search space so that the new range corresponds exactly to a subtree of some node in T . This is accomplished by first finding the least common ancestor lca of s i and s j , and then shrinking the search space of to either the search space in the left subtree of lca or to the search space in the right subtree of lca, whichever is larger. Thus each large element at any time has an associated node in T , which we denote by v( ). We also denote the interval corresponding to v( ) (this is just the interval of its subtree) as I(v( )). Mechanics of the adversary's strategy: Our adversary will try to maintain the following invariant at all times during the execution of the algorithm. Invariant: The search intervals of large elements in main memory are disjoint.</p><p>We denote by { p-1 i } M/w i=1 the set of at most M/w large elements in memory before the pth I/O. By hypothesis, the nodes in T belonging to the set {v( p-1 i )} M/w i=1 have no ancestor-descendant relationships between them. We write</p><p>)), the search interval of large element p-1 i at step p-1. Small-block input. Consider the incoming block. We denote n p,i as the number of incoming small elements that belong to S p-1 i . These elements divide S p-1 i into n p,i +1 parts {P 1 , . . . , P npi+1 }, some of them possibly empty. The largest of these parts (say P j ) is of size at least 1/(n p,i + 1) times the size of S p-1 i . The new search interval of p i is defined to be the highest node in T such that I(v) &#8834; P j . Large element input. On an input of a large element p new (with search interval S p-1 new ), the adversary uses a strategy similar to that one on a small-block input to compare p new with the (at most) M small elements present in memory. These M small elements divide S p-1 new into at most M parts, and the new search interval of p new corresponds to the highest node in T that contains the largest part. This is the temporary search interval S new , with the corresponding node v new . S new can be related to the search intervals of large elements in memory in three ways:</p><p>-Case 1: The element p new shares a node with another large element p i . The conflict is resolved by sending p new and p i to the left and right children of v new , respectively.</p><p>-Case 2: The element p new has an ancestor in memory. The ancestor is sent one level down, to the child that does not contain v new in its subtree. Thus the conflict is resolved while giving at most O(1) bit.</p><p>-Case 3: The element p new has descendants in memory. Denote the nodes that are descendants of v new in T as v 1 , . . . , v M/w . Let the corresponding search intervals be S p-1 . In doing this we have given at most O(log M ) bits. Now we proceed as in Case 1 to resolve the conflict with at most O(1) extra bits. Otherwise, if Z = Y i for some i, then the adversary allots p new to the highest node v in T such that I(v) &#8838; Z. We show that the above strategy produces the following guarantees (proof in full version):</p><p>1. On a small-block input, the adversary gives at most O(log(n p,i + 1)) bits to p i . 2. On a small-block input, the adversary gives at most O(B) bits. 3. On the input of a large element, the adversary gives at most O(log M ) bits.</p><p>4.1 Putting It All Together: Lower Bounds for S-k, k-k and k-k 1) S-k Lower Bound. The proof rests on the following action of the adversary: in the very beginning, the adversary gives the algorithm the extra information that the ith largest large element lies somewhere between s (i-1)&#945; and s i&#945; , where &#945; = S/k. In other words, the adversary tells the algorithm that the large elements are equally distributed across S, one in each chunk of size S/k in S. This deems the invariant of large elements in main memory having disjoint search intervals automatically satisfied. Since any algorithm that solves S-k must achieve &#8486;(k log(S/k)) bits of information, we have that</p><p>2) k-k Lower Bound. To solve k-k, an algorithm needs to learn k log k bits of information. Using the adversary strategy we described, we observe:</p><p>3) k-k Lower Bound. To solve k-k, an algorithm needs to learn k log k bits of information. In the k-k problem, we expect to produce the perfect interleaving of the small and large elements in the final sorted order. That is, each element lands in its own leaf of T . Therefore, the adversary does not posses the freedom to route elements down the tree at all times using the strategy we described. Instead, the strategy is used for a fraction of total bits the algorithm learns, and the remaining fraction is used to make up for the potential imbalance created by sending more elements to the left or to the right. We call these type one and type two bits, respectively. Type two bits are effectively given away for free by the adversary.</p><p>More formally, we define the node capacity (c T (v)) as the number of large elements that pass through v during the execution of an algorithm. If the kk algorithm runs in T I/Os, then the node capacity of v at a level h of T is designated by c T (v) = k/2 h . Definition 3 (type one and type two bits). A bit gained by a large element is an type one bit if, when moves from v to one of v's children, at most c T (v)/4 -1 other large elements have already passed through v. The remainder of the bits are type two bits.</p><p>Because a small-block input gives O(B) bits and a large-element input gives O(log M ) bits, and we need to achieve all type one bits to solve the problem (there are (k log k)/4 of them), we obtain the following lower bound:</p><p>Proof of Theorem 3 The lower bounds for k-k, k-k and S-k are each a minimum of two terms; it is safe to add the respective terms as the transition between which term dominates occurs at exactly the same value of w for each of the subproblems. Adding the terms for the lower bounds of k-k and S-k provides the k B log S and kw B log M S terms in Theorem 3. Adding the terms for the lower bounds of k-k and k-k, and using that k + k = L/w provides the L wB log S and L B log M S terms in Theorem 3. The theorem also holds for other item sizes (w 1 &#8804; w 2 &#8804; B) and we prove the generalized result in the full version.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Upper Bounds</head><p>Our algorithm for Sort (S, L) works in three steps:1) sort the short elements using traditional multi-way external memory merge-sort <ref type="bibr">[1]</ref>, 2) solve the associated PLE (S, L) problem, and 3) sort the long stripes obtained again using multi-way mergesort. The first and third steps give the first and third terms in the sorting complexity in Theorem 2.</p><p>We give two algorithms to solve PLE (S, L): PLE-DFS and PLE-BFS. The final upper bound is the minimum of the two terms, as presented in Theorem 4. PLE-DFS: PLE-DFS builds a static B-tree T on S, and searches for large elements in T one by one. This approach is preferred in the case of really large elements, and it is better to input them fewer times.</p><p>We dynamically maintain a smaller B-tree T that contains only border elements (the two small elements sandwiching each large element in the final sorted order) and has depth at most log B k. All large elements first travel down T to locate their stripe. Only those elements for which their stripe has not yet been discovered need to travel down T . After a new stripe is discovered in T , it is then added to T . The total cost becomes</p><p>PLE-BFS: Our second algorithm for PLE uses a batch-searching tree with fanout &#920;(M ). When a node of the tree is brought into memory, we route all large elements via the node to the next level. We process the nodes of the Mtree level by level so all large elements proceed at an equal pace from the root to leaves. This is helpful when large elements are sufficiently small so that bringing them many times into memory does not hurt while they benefit from a large fanout. The analysis is as follows: at each level of M-tree, the algorithm spends &#920;(L/B) I/Os in large-element inputs. Every node of the tree is brought in at most once, which results in total O(S/B) I/Os in small-element inputs. The total number of memory transfers for PLE-BFS then becomes O L B log M S + S B . Our final upper bound is the better of the two algorithms: Substituting the lower and upper bounds of the batched predecessor problem (PLE (S, L)) derived in Theorems 3 and 4 into the complexity of sorting in Theorem 2 gives us lower and upper bounds on the sorting problem Sort (S, L). Remark 1: One observes that in Theorem 4 (PLE (S, L) lower bound), the transition between the two terms in the minimum happens at w = B log M . This is because when large elements are very large, the bound obtained by algorithms that do not input the large elements too often (PLE-DFS) is smaller than algorithms that input large elements multiple times (e.g., PLE-BFS). Remark 2: The upper and lower bounds on PLE (S, L) are tight for a wide range of parameters. Moreover, if the first and third terms in the complexity of sorting (Theorem 2) dominate the complexity of the associated PLE (S, L) problem, our sorting algorithms are tight.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and Open Problems</head><p>We derived upper and lower bounds on sorting and batched predecessor in the RAM and DAM models, when comparison or I/O costs depend on the length of the items being compared. In many settings, we show that the optimal sorting algorithm involves the optimal batched predecessor problem as a subroutine, and develop algorithms for the batched predecessor problem.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_0"><p>DAM model can represent any two levels of memory, which is related to the record size in our problem. If the two levels are the disk and the main memory, elements could be larger than B but are much smaller than M . If the levels are cache and main memory, then the elements could have a length that is a nontrivial fraction of M .</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_1"><p>We overload notation for convenience of presentation. We assume w &#8805; B also for the convenience of presentation. Our bounds hold for any 1 &lt; w &#8804; M/2 and are presented in the full version of the paper.</p></note>
		</body>
		</text>
</TEI>
