<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Provably Good Randomized Strategies for Data Placement in Distributed Key-Value Stores</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>02/21/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10410457</idno>
					<idno type="doi">10.1145/3572848.3577501</idno>
					<title level='j'>Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel (PPoPP)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zhe Wang</author><author>Jinhao Zhao</author><author>Kunal Agrawal</author><author>He Liu</author><author>Meng Xu</author><author>Jing Li</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Distributed storage systems are used widely in clouds, databases, and file systems. These systems store a large amount of data across multiple servers. When a request to access data comes in, it is routed to the appropriate server, queued, and eventually processed. If the server's queue is full, then requests may be rejected. Thus, one important challenge when designing the algorithm for allocating data to servers is the fact that the request pattern may be unbalanced, unpredictable, and may change over time. If some servers get a large fraction of the requests, they are overloaded, leading to many rejects. In this paper, we analyze this problem theoretically under adversarial assumptions. In particular, we assume that the request sequence is generated by an adversarial process to maximize the number of rejects and analyze the performance of various algorithmic strategies in terms of the fraction of the requests rejected. We show that no deterministic strategy can perform well. On the other hand, a simple randomized strategy guarantees that at most a constant fraction of requests are rejected in expectation. We also show that moving data to load balance is essential if we want to reject a very small fraction (1/𝑚 where 𝑚 is the number of servers) of requests. We design a strategy with randomization and data transfer to achieve this performance with speed augmentation. Finally, we conduct experiments and show that our algorithms perform well in practice.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Distributed key-value stores are used extensively on modern platforms, particularly in cloud applications <ref type="bibr">[1,</ref><ref type="bibr">8,</ref><ref type="bibr">11,</ref><ref type="bibr">17,</ref><ref type="bibr">18]</ref>, but also in large-scale databases <ref type="bibr">[12,</ref><ref type="bibr">20,</ref><ref type="bibr">24,</ref><ref type="bibr">36,</ref><ref type="bibr">41]</ref> and distributed file systems <ref type="bibr">[15,</ref><ref type="bibr">22,</ref><ref type="bibr">39]</ref>. In these applications, there is a large corpus of data items (e.g., key-value pairs, objects, files) stored across many servers. Clients of the system send requests, which access some particular data. Each request is routed to the server that has the items, and the server is responsible for processing this request and responding to the client. If the server is busy, the request may be queued.</p><p>For such systems, an important optimization criterion is throughput -the number of requests that the system can process on average per time period. If most requests end up accessing a small number of servers, most system capacity might be idle while a few servers' queues grow unboundedly. The client may care about latency -the amount of time they wait between sending a request and receiving a reply. Since long queues impact latency, distributed storage systems implement bounded queues for servers such that a server with a full queue rejects future requests.</p><p>The goal of the system is to accept or consume as many requests as possible, or inversely to reject as few requests as possible while keeping the queue size small -this is equivalent to maximizing throughput while keeping latency small. To avoid rejecting many requests (thereby reducing the system throughput), distributed storage systems try to balance load across servers. However, they have no direct control over the load, since requests are generated by clients and the request access pattern may change over time. Therefore, if all the "hot items" -data items that are being accessed frequently -are on the same server, that server will experience a high load. Avoiding overload may require load balancing via moving these items to other servers dynamically.</p><p>Load balancing algorithms for distributed storage systems generally have the following steps: <ref type="bibr">(1)</ref> initially distribute data across servers in some manner; <ref type="bibr">(2)</ref> each server has a queue with bounded size (decided by the algorithm) and rejects incoming requests if its queue is full; (3) do dynamic data distribution by choosing some data to move from an "overloaded" servers to other servers. The question considered in this work is how one should do these steps. This problem has been studied empirically using various heuristics based on past system behaviors or theoretically using queuing theoretic assumptions on request arrivals <ref type="bibr">[9,</ref><ref type="bibr">30]</ref>.</p><p>In this paper, we consider this problem from a theoretical perspective. More formally, assume that the data is divided into &#119899; chunks that must be housed on &#119898; servers. At every time step, each server can process one request. We assume that the requests are generated by an adversarial process that generates &#119898; requests per time step and maximizes the number of rejected requests. Our goal is to understand what strategies with small queues may work against various adversarial assumptions. We have the following results: Simple deterministic strategies work against random clients, but no deterministic strategy is good against an adversarial client: If the client is not adversarial and picks a random chunk to request in every time step, then any simple strategy that divides the chunks evenly across servers works well; however, if the client is adversarial, then, for any deterministic algorithm, there exists a request sequence such that the algorithm rejects most requests. Simple randomized strategy works reasonably well against oblivious adversary: A simple randomized strategy that places chunks on servers uniformly at random and then never moves them performs does ok -that is, it accepts at least a constant fraction of the requests in expectation. In addition, this strategy provides a strong bound of rejecting only &#119874; (1/&#119898;) fraction of the requests if either of the following is true: (1) if the adversary is weakened so that there is a bound on how frequently it can access the same chunk; or <ref type="bibr">(2)</ref> servers have speed augmentation and each server consumes &#119874; (log &#119898;) requests per time step instead of 1. Data transfer is useful: We show that no strategy without data transfer and constant speed can consume 1 -1/&#119898; fraction of requests with an adversarial client. We also design an algorithm that starts with a randomized allocation and then moves chunks out of heavily loaded servers to balance the load. Given constant speedups of processing and data transfer, we show that this strategy consumes 1 -1/&#119898; fraction of the requests even against a fully adversarial client. Empirical evaluations are promising: To demonstrate the practicality of the system model considered in this work, we construct a case study using FoundationDB <ref type="bibr">[41]</ref> to run realistic benchmarks on a real cluster. We conduct simulation experiments using adversarial and realistic workloads to evaluate the empirical performance of our proposed algorithms. Results show that randomized algorithms with and without data transfers only need a small speedup to achieve the same performance as an offline optimal algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">System Model and Performance Metrics</head><p>In distributed data stores, we have &#119898; servers that store a large collection of items, such as key-value pairs, objects, or files. We can partition the data into &#119899; (&#119899; &gt; &#119898; 2 ) data chunks, where a chunk is a collection of keys or a contiguous range of keys. The system distributes chunks to servers according to its load balancing policies. In this paper, we are going to abstract away the details of exactly how the data partition is done and assume that &#119899; and &#119898; are given and do not change over time. We also leave other problems such as data replication (and other database mechanisms, such as write-ahead log and multi-version control) as future work.</p><p>Clients send requests to the cluster online and we assume that each request accesses a single data chunk. When a request arrives, it is delivered to the server that hosts the appropriate data chunk. For analysis purposes, we divide the timeline into identical time slots, which have a length equal to the time to process a request. For simplicity, we do not distinguish the read, write, and read-range operations. Each server has a FIFO queue (first-in-first-out) with the maximum length of &#119902; to store the unserved client requests. When requests arrive, they are stored in the respective server's queue unless there is no space in which case the request is rejected. Since a server can consume one request in a slot, the cluster can consume at most &#119898; requests in the ideal case where the &#119898; requests access chunks at &#119898; different servers. Therefore, to make the workload feasible, we assume that at most &#119898; requests arrive at the cluster in a time slot and each of these requests accesses a different chunk.</p><p>During system initialization, the system allocates chunks to servers according to some load balancing algorithms, after which client requests are served. For the first few results in this paper, we assume that the allocation is fixed after initialization. However, some servers may be overloaded, and a load-balancing algorithm may want to move data from one server to another; therefore, later sections of the paper analyze algorithms that move data. We formally model the process of the data movement, named data transfer, as follows. At any time, a server can be involved in one data transfer. In each transfer, at most &#119904; chunks to one other server, and this takes &#119904; time slots. In other words, transferring &#119904; or fewer chunks from one server to another takes &#119904; time; therefore, the transfer time is a stepped function, not linear. This models the fact that the latency of moving data from one server to another is often large, but the bandwidth is also large -therefore, sending or receiving 1 chunk vs. several chunks up to the bandwidth takes the same amount of time. On the other hand, moving two sets of chunks from or to two different servers takes 2&#119904; time slots, since the server must prepare the transfer and initiate the transfer. We use the single parameter &#119904; for modeling both the maximum number of chunks in a data transfer (with a total size below the bandwidth) and the latency of a data transfer. Our case study in Section 6 indicates that this model is reasonable on a real platform.</p><p>As noted in Section 1, the goal is to minimize the number of rejected requests, which is a measure of throughput.</p><p>Definition 1 (Throughput). Given a sequence &#120590; of requests arriving over time, &#119860;(&#120590;) denotes the number of requests that are accepted by algorithm &#119860; on sequence &#120590;.</p><p>We assume the input sequence of client requests is generated by an oblivious adversary defined as follows.</p><p>Definition 2 (Oblivious adversary). The oblivious adversary knows how the online algorithm works, but it does not know the random choices made by the algorithm.</p><p>For deterministic data allocation, the oblivious adversary always knows the exact locations of chunks at any point in time; therefore, unsurprisingly, we were able to show that no deterministic algorithm performs well. Most of this paper analyzes the performance of various randomized algorithms.</p><p>The theoretical performance of an algorithm can be analyzed by comparing its throughput with optimal throughput. Definition 3 (Constant competitive). An online algorithm &#119860; is (constant) &#119888;-competitive, if &#119860;(&#120590;) &#8805; &#119888; &#8226; &#119874;&#119875;&#119879; (&#120590;) for any finite input sequence &#120590;, where &#119874;&#119875;&#119879; is the offline optimal. An algorithm's performance is better when &#119888; gets closer to 1. While constant competitiveness is nice, we would prefer not to reject a constant fraction of the requests. Thus, we define the following, stronger performance criterion.</p><p>Definition 4 (Almost optimal). An online algorithm &#119860; is almost optimal, if &#119860;(&#120590;) &#8805; (1 -&#119874; (1/&#119898;))&#119874;&#119875;&#119879; (&#120590;) for any finite input sequence &#120590;, where &#119874;&#119875;&#119879; is the optimal offline scheduler.</p><p>For randomized algorithms, similar definitions apply except that we compare the expected throughput of the algorithm with the throughput of &#119874;&#119875;&#119879; . Note that in the best case for an offline optimal where all requests arriving in a time slot access different servers and are processed in this slot, all the |&#120590; | requests in a finite input sequence &#120590; can be accepted by the optimal. Hence, an online algorithm is near optimal if it can accept (1 -1/&#119898;)|&#120590; | requests for all input sequences.</p><p>Finally, some of our algorithms will require resource augmentation or speedup. Resource augmentation implies that we allow the algorithm to perform certain operations faster than the optimal algorithm can. We consider two types of resource augmentation: (1) speed of processing requests on servers and (2) speed of data transfer. Often resource augmentation is necessary to achieve nontrivial results. However, the smaller the resource augmentation required, the better the algorithm. We generally want the resource augmentation factor to be no more than a constant.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Deterministic Policies</head><p>This section analyzes deterministic policies for allocating data chunks to servers. For warmup, we will first show that if the client requests access to random chunks, then any policy that evenly balances the number of chunks on each server performs well. On the other hand, no deterministic policy performs well against an oblivious adversary -for any deterministic policy, there exists an access pattern that causes the system to accept only a small number of requests.</p><p>Simple policies work well against random clients. We now do a simple analysis showing that all reasonable allocations work for random clients where each request picks a chunk to access uniformly at random and independently. Without loss of generality, we relax the assumption and allow multiple requests to access the same chunk even in the same time slot. For this random client, we consider a simple even allocation, where &#119899;/&#119898; chunks are allocated to each server. The allocation is generated once and need not change. The following lemma is easy to prove using Chernoff bounds.</p><p>Lemma 1. For a random client and an even allocation, in any log &#119898; consecutive time slots, the probability that more than 3 log &#119898; requests arrive at a particular server is less than 1/&#119898; 2 .</p><p>Lemma 1 leads to the following theorem. Proof. Divide the execution into rounds, where each round has &#119898; log &#119898; requests -therefore, each round has at least log &#119898; time steps, so a server can process 3 log &#119898; requests with speed 3 from its queue. From Lemma 1, the probability that a server receives more than 3 log &#119898; requests in a round is at most 1/&#119898; 2 . By union bound, in a particular round &#119903; , the probability that at least one server receives more than 3 log &#119898; requests is at most 1/&#119898;.</p><p>For a particular round &#119903; , we will prove via induction that the number of leftover requests from the previous round in any server's queue is at most 3 log &#119898; for all servers at the start of every round. We have two cases. First, In round &#119903; , no server gets more than 3 log &#119898; requests. In this case, the queue size of server &#119886; at the end of the round does not increase, since server &#119886; can process 3 log &#119898; requests during a round if available. During this round, there can be at most 6 log &#119898; requests in any queue and no requests are rejected. Second, in round &#119903; , some server get more than 3 log &#119898; requests. We can pessimistically assume that all the 3 log &#119898; requests that arrived in &#119903; are rejected maintaining the inductive hypothesis. Since requests are rejected only in the second type of rounds, which occur with a probability at most 1/&#119898;, we reject at most 1/&#119898; fraction of requests in expectation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9633;</head><p>No deterministic policy works against an adversarial client. We now consider an oblivious adversary. Intuitively, since the algorithm is deterministic and the adversary knows the allocation, it can always overload some server. Here we argue that no deterministic policy, even if it moves data to load balance, can work well with small queues.</p><p>Theorem 2. Consider a deterministic algorithm &#119863; running on a system with &#119898; servers, &#119899; chunks, max queue length &#119902;, data transfer time &#119904;, and constant speedup for both request processing and data transfer, where &#119899; &gt; &#119898; 2 . There exists a request sequence &#120590; such that &#119874;&#119875;&#119879; (&#120590;)/&#119863; (&#120590;) = &#937;(&#119898;&#119904;/(&#119902; +&#119888;&#119904;)) where &#119874;&#119875;&#119879; is the offline optimal.</p><p>Proof. Since &#119899; &gt; &#119898; 2 , at any time instant, there exists a server with more than &#119898; chunks. Since the algorithm is deterministic, the adversary always knows which server has more than &#119898; chunks and can send all &#119898; requests to these chunks.</p><p>Since the server can process only &#119888; chunks per time step, the queue will soon fill up causing most requests to be rejected. Now, say that the &#119863; moves chunks. We divide the timeline into rounds. Each round has &#119898;&#119904; time slots, where &#119904; is the number of slots to perform a data transfer. Here is the adversary's strategy. At the start of round &#119894;, pick &#119898; chunks that are all in a particular server for &#119863;, say &#119886;, and send requests to these chunks only for the next &#119904; time steps. By the end of &#119904; steps, &#119863; can move &#119904; of these chunks away from &#119886; with &#119902; requests. With speed &#119888; for request processing, &#119863; can process at most &#119888;&#119904; requests in the &#119904; time steps and store at most &#119902; requests in the queue. Therefore, it accepts at most &#119888;&#119904; + 2&#119902; of these &#119898;&#119904; requests. After this, for the rest of the round (the remaining &#119898;&#119904; -&#119904; time steps), the adversary does not send any requests.</p><p>Since OPT knows the sequence, in the previous round &#119894; -1, it will set up to ensure that all these &#119898; chunks were on different servers so it can accept all requests. In the remaining (&#119898; -1)&#119904; time steps of round &#119894;, OPT (in collaboration with the adversary) sets up for the next round. It knows which server will have at least &#119898; chunks at the start of the next round for &#119863;, so it distributes these &#119898; chunks across &#119898; servers for OPT so that OPT can answer all &#119898;&#119904; requests in the next round. The adversary can then repeat for the next round. &#9633;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Randomized Policies with No Transfers</head><p>In this section, we consider a simple randomized strategy that randomly allocates chunks to servers, which is similar to the game of throwing randomized balls into bins. We will show that it is constant competitive. We will also prove that no strategy that doesn't move data can be almost optimal against an oblivious adversary. In the next section, we will see a strategy that is almost optimal.</p><p>Definition 5 (M &#119887;&#119886;&#119897;&#119897;-&#119887;&#119894;&#119899; ). For &#119899; chunks and &#119898; servers, the algorithm picks a server for each chunk uniformly at random and independently. A M &#119887;&#119886;&#119897;&#119897;-&#119887;&#119894;&#119899; mapping denotes the mapping between chunks and servers.</p><p>Upper Bounds for Balls into Bins. We will now prove a theorem against an adversarial client.</p><p>Theorem 3. Given a system with &#119898; servers and &#119899; requests. For any sequence &#120590;, say &#119864; [&#119861; &#120590; ] is the expected number of requests accepted by balls into bins allocation &#119861;. Then, &#119864; [&#119861; &#120590; ]/|&#120590; | &#8805; 1-1/&#119890; even with queue size &#119902; = 1 and no resource augmentation.</p><p>Proof. Consider the balls into bins allocation and consider a particular time step &#119905; when the client sends requests to &#119896; &#8804; &#119898; distinct chunks. These &#119896; chunks were randomly thrown on &#119898; servers. Now consider a particular server &#119886;. The probability that exactly &#119894; of these &#119896; requests hit &#119886; is:</p><p>Therefore, the expected number of servers that get at least 1 request is at least</p><p>Any server that gets at least one request processes at least one request -therefore, at least &#119909; requests out of &#119896; total requests were consumed at this time step. Therefore, by adding over all time steps, we get the result. &#9633;</p><p>We see that the balls into bins allocation is constant competitive with no speed augmentation and with very small queues. What if we give it speed augmentation? We now show that with sufficient (&#920;(log &#119898;)) speed augmentation, it is almost optimal. Proof. Consider a particular time step &#119905; when the client sends requests to &#119896; &#8804; &#119898; distinct chunks. These &#119896; chunks were randomly thrown on &#119898; servers when the allocation was done. Now consider a particular server &#119886;. The probability that at least &#119909; requests access this server at this time step is at most 2 for &#119909; = &#119888; log &#119898; with large enough &#119888;. At any time step, with &#920;(log &#119898;) speed, &#119909; = &#920;(log &#119898;) requests can be processed by a server. Therefore, the balls into bins allocation can process all the &#119896; requests that arrive on that time step with a probability at least 1-1/&#119898;. Even assuming pessimistically that all requests at other times are rejected, summing over time still gives us the result. &#9633; Lower bound on algorithms with no data transfer. We saw that the balls into bins algorithm is almost optimal with &#920;(log &#119898;) speed; however, what about constant speed? We now show a lower bound, saying that any algorithm without moving data cannot be almost optimal with constant speed. Theorem 5. Given any policy &#119863; for allocating chunks, for any queue with length &#119902;, a constant speedup of processing, if &#119899; &gt; &#119898; 2 , then there exists an input sequence, such that</p><p>Here is the adversary's strategy: It simply selects &#119898; different chunks at random and repeatedly requests these chunks in every time slot. The challenge is to show that there is no randomized strategy that can provide an almost optimal acceptance ratio in expectation for this sequence. To do this, we will define a concept of a group, which will allow us to extract a common property of all distributions. Definition 6 (group). Let a &#119879; -group consists of exactly &#119879; chunks that are in the same server, and different groups do not have any chunks in common.</p><p>We want all groups to have the same number of chunks in order to compute probabilities. If a server has at least &#119879; chunks, we create groups with &#119879; chunks each and leave the leftover chunks ungrouped. Repeatedly put &#119879; ungrouped chunks into groups until we have fewer than &#119879; ungrouped chunks. The following observation just says that there are a large number of groups.</p><p>Observation 1. For any position of &#119899; chunks in &#119898; servers, the number of (&#119899;/2&#119898;)-groups is at least &#119898;.</p><p>We now only consider grouped chunks and we will show that a large number of requests sent to grouped chunks will be rejected. In particular, we will first show that the probability that a group gets a large number of requests is not too small. Lemma 2. Given a specific distribution of the chunks, and suppose the adversary randomly selects &#119898; chunks. Then for &#119896; = &#119900; (&#119898;) and &#119899; &gt; &#119898; 2 , the probability that an (&#119899;/2&#119898;)-group has at least &#119896; requests is greater than</p><p>We can say that these groups are overloaded.</p><p>Proof. Since the adversary randomly chooses &#119898; distinct chunks, the number of chunks in each group follows a Hypergeometric Distribution. In particular, consider &#119896; = &#119900; (&#119898;) and group size &#119879; = &#119899;/2&#119898;. We can assume that &#119898; -&#119896; &gt; &#119898;/2 and &#119879; -&#119896; &gt; &#119879; /2 and &#119898; -&#119896; &lt; &#119899;/&#119898;. We can then bound the probability that a particular group has &#119896; chunks among the &#119898; chunks by the following:</p><p>This proves the lemma. &#9633;</p><p>Now we are able to give a proof for Theorem 5. Since the adversary repeatedly requests the same &#119898; chunks, overloaded groups remain overloaded. Therefore, regardless of the queue size, servers with overloaded groups will eventually reject many of their requests.</p><p>Proof. According to lemma 2, each (&#119899;/2&#119898;)-group is overloaded with probability &#119901; -they get at least &#119896; requests on every time step. Therefore, with any speedup smaller than &#119896;, these groups remain overloaded forever since their total capacity is smaller than their average load. Therefore, these chunks will reject at least one request per time step. Since the expected number of overloaded groups is &#119898;&#119901; = &#119898; 8 &#119896; &#119896;! &#8226;&#119890; -3 2 , with speed &lt; &#119896;, at least &#119898;&#119901; requests are rejected on every time step within expectation while the optimal algorithm can handle all &#119898; requests. Therefore, the fraction of the requests that any algorithm accepts is &#119901;, and &#119901; is a constant for any constant &#119896;. &#9633;</p><p>Constraining the request sequence. The argument in the previous section indicates that the worst-case workload for any randomized strategy that does not move data is an adversary which requests the same &#119898; chucks repeatedly. However, this is often not what happens in the real world. In this section, we show that we can do better if the adversary can not request the same chunks repeatedly -we assume that the client issues at most one request to a particular chunk in any consecutive log &#119898; time steps and show that balls into bins strategy is almost optimal against this adversary.</p><p>We divide time into phases of size log &#119898;. The following Lemma is similar to Lemma 1, except that the randomness comes from the randomized allocation rather than the client. Lemma 3. Against a constraint adversary which can not send more than one request to the same server in one phase, the probability that any server gets more than 3 log &#119898; requests in a phase is at most 1/&#119898; 2 .</p><p>With Lemma 3, we can prove Theorem 6 in a manner very similar to the proof of Theorem 1. Theorem 6. Given a queue with size &#119902; = &#119874; (log &#119898;) and a constrained adversarial input &#120590;, the expected number of requests consumed by balls into bins policy is</p><p>The result indicates that a workload is great if requests are not successively and repeatedly access to the same chunk.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Randomized Policies with Transfers</head><p>In this section, our goal is to design a policy that is almost optimal (consumes 1 -&#119874; (1/&#119898;) fraction of the requests) with constant speedup and without placing any restrictions on the input sequence. In particular, we propose an algorithm, say &#119884; which satisfies the following strong guarantee.</p><p>Theorem 7. Say we have a constant speedup of data movement and a constant speedup of processing on the system &#119884; with the queue length &#119902; = &#920;(&#119904; log 2 &#119898;), where &#119904; is the number of time slots to complete a data transfer. For any input sequence &#120590;, the expected number of requests consumed by &#119884; is at least</p><p>An overview of the system &#119884; : In prior sections, we have seen that while deterministic policies can not be shown to be constant competitive (even allowing data movement) (Theorem 2), simple randomized policies such as allocating chunks uniformly at random can achieve constant competitiveness (Theorem 3). In particular, with balls into bins randomization, a constant fraction of the requests can be consumed on each time step. However, since some servers get more than a constant number of requests (some servers get &#937;(log &#119898;) requests) on each time step, these servers are overloaded and the adversary can keep them overloaded -therefore, with at most constant speed and with no restrictions on the input sequence, no randomized policy without data movement can be almost optimal (Theorem 5).</p><p>To address this issue, we will design a system, we call it, &#119884; which moves data to get a good load balance. In particular, we start with a randomized balls into bins allocation just like the previous sections. Therefore, all chunks have a home server where they were allocated through balls into bins. However, when the queue of the server contains a request that has been in the system for &#920;(&#119904; log &#119898;) time, this indicates that this server is overloaded and can not handle all the requests being sent to it. At this point, a batch movement process is triggered on this server and the chunks in the server that have pending requests are distributed to other servers so that these old pending requests can be handled efficiently by other servers. Once these pending requests are handled, these chunks are moved back to their home server so that the original randomized allocation is restored.</p><p>Modeling assumptions: For simplicity in analysis, we will make some modeling assumptions that do not impact the overall result. In particular, we will assume that each server has two processors: a primary processor which consumes the requests that arrive at this server from the client and a secondary processor which consumes the requests that sent to it from other servers for load balancing purposes from other processors. Each server has its own queues, where the primary and secondary queues are both of size &#920;(&#119904; log 2 &#119898;). Note that this does not impact the theorem statement since we allow for constant speedup -if we allow for processor speed of &#120588; &#119901; , then each of the processors can run at half the speed. Similarly, the queue length can be split among the two queues while only impacting constant factors. In this analysis, we will not try to optimize constant factors; therefore, the constant factors computed will be large. In the evaluation section, we will see that the algorithm requires a quite small constant in practice to perform very well.</p><p>To restate the algorithm using these terms: when a request arrives at a server from a client, it is put in the primary queue. If the primary queue is full, then the request is rejected. If the age of the oldest request in the primary queue is 6&#119904; log &#119898;, the server triggers a batch movement process, and any chunk that has a request in the primary queue is moved to other servers and their corresponding requests are moved to those servers' secondary queues (move out phase). These moved requests are processed by the target processor's secondary server. Once a server's secondary queue is empty, the chunks that have been moved there are sent back to their home servers (move back phase). We will define the precise policy of movement momentarily, but let us first consider what the challenges are in designing this policy.</p><p>Challenges: Intuitively, the system &#119884; should work since (1) it moves the requests out of the primary queue in a timely manner to avoid filling the primary queue; (2) it spreads the requests among the secondary queues, which efficiently consumes the requests and avoids filling secondary queue. However, there are a few challenges. First, a batch movement process might take a long time if the batches are large. When a server triggers a batch movement, many chunks in that server may have pending requests in the primary queue and all these chunks must be moved. Recall, from Section 2, that it takes &#119904; time to move at most &#119904; chunks between two particular servers -we call this one operation a transfer. Therefore, depending on how many transfers are required, a full batch movement may take time. This can cause the queue of the server to get longer while the batch process is executing as well as blocking other batch processes from starting. Second, we may get unlucky and a server may get a very large number of requests from the client at the same time step. When this happens, the primary queue can become overloaded very quickly, potentially causing downstream effects. Third, the batch process is deterministic -therefore, in principle, the adversary can guess the location of chunks when they are being processed away from their home server and can overload the servers where these chunks are located by sending too many requests there.</p><p>Algorithm Description. We can now describe &#119884; and how it handles these challenges. The queue size of each server is &#920;(&#119904; log 2 &#119898;) for a sufficiently large constant hidden in the &#920;-notation.</p><p>(I) Batch trigger: A batch process at a particular server &#119886; is triggered when the oldest request at this server that is not already a part of an older batch process is 6&#119904; log &#119898; old.</p><p>(II) Batch start: In Y, batch processes may not start as soon as they are triggered; they are executed in order. Say, the batch process &#119894; is triggered before the batch process &#119895; is triggered. If the processes are triggered on the same server, then &#119895; executes after &#119894; completes since the same chunks may be involved in both. If they are on different servers, they can execute in parallel. Our distribution algorithm ensures that a server is involved in at most transfer at a time.</p><p>(III) Data distribution during moveout: Once a batch is triggered, we know which chunks are part of the batchthat is which chunks have pending requests which are part of this batch. These chunks are divided into log &#119898; packages for transfer so that each bucket has &#119874; (&#119904;) chunks and at most &#119874; (&#119904; log &#119898;) requests. We later show that this is possible with high probability. Therefore, each of these packages can be moved to destination servers using one transfer each. When the batch starts, these buckets are moved to log &#119898; different servers and the corresponding requests are moved to those server's secondary queues.</p><p>(IV) Process the chunks and move back: The secondary processor of the target server processes the requests from the secondary queue in order. Once all requests of a particular transfer have been processed, the chunks are moved back to their home server.</p><p>(V) Request handling during batch: Even when a chunk has been moved to a different server for processing its pending requests, any new requests to this chunk are still sent to its home server's primary queue and these are processed once the chunk has moved back to its home server.</p><p>(VI) Flow control and batch cut-off: In order to avoid some boundary conditions, we perform some controls: (1) If a single server receives more than 2 log &#119898; requests in a single time step, then some of the requests are immediately rejected to ensure that only 2 log &#119898; requests are added any primary queue on any single time step. ( <ref type="formula">2</ref>) If the server has more than 24&#119904; log &#119898; chunks with requests in the primary queue, the server only moves 24&#119904; log &#119898; chunks with the most requests, and the server rejects the requests of unmoved chunks.</p><p>Causes of rejection. We want to show that this process rejects at most &#119874; (1/&#119898; 2 ) fraction of the requests in expectation. Note that requests can be rejected due to the following reasons: (1) flow control and batch cut; (2) rejection from the primary queue if it becomes full; and (3) rejection from the secondary queue if it becomes full. We will show that the first two reasons cause few rejects in expectation and the last reason causes no rejections for a good load balancing policy. However, in order to show this, we must first bound the time it needs to complete a batch process.</p><p>Bounding the execution time of a batch process. We first bound the execution time of a batch process -the time between the triggering and completion of a batch process. We will divide time into phases of size 3&#119904; log &#119898;. We say that &#119883; &#119894; is the set of batch processes that were triggered during phase &#119894;. We will prove the following key Lemma. Lemma 4. Any batch process that was triggered during phase &#119894; will be completed by the end of phase &#119894; + 1 with constant speedup on both processing speed and transfer speed.</p><p>In order to prove this lemma, we first prove some supporting lemmas. First, we recall that a batch process is triggered when the oldest request in the queue which is not already part of the batch process has been in the queue for 6&#119904; log &#119898; time. Therefore, a batch process can contain requests from at most 6&#119904; log &#119898; time steps. We can bound the number of requests that are part of a batch process as follows: Lemma 5. Given the balls into bins allocation, constant &#119888; &#8805; 1 and &#119904; &lt; &#119898;/(&#119888; log &#119898;). In 6&#119904; log &#119898; consecutive time slots, the probability that a particular server receives 12&#119904; log 2 &#119898; or more requests is less than 1/&#119898; 2 .</p><p>Proof. Note that the client can send requests to the same chunk in different time steps, but in one time slot it must send &#119898; different requests. In each time step, for a particular server &#119886;, say &#119883; &#119894; &#119886; is the random variable representing the number of requests that hit this server at time step &#119894;. For &#119909; = 2 log &#119898;, we have</p><p>The last equation works when &#119898; &gt; 4. Therefore, we can sum up &#119883; &#119894; &#119886; through all &#119888;&#119904; log &#119898; steps and use Bernoulli's inequality to get the bound we need. &#9633;</p><p>We can also bound the number of chunks involved in a batch process by a method very similar to Lemma 1. Lemma 6. Given a balls into bins allocation, the total number of chunks from a particular server that can be requested within 6&#119904; log &#119898; time is 24&#119904; log &#119898; with probability of at least 1 -1/&#119898; 2 . Therefore, the total number of chunks involved in a batch process is 24&#119904; log &#119898; with probability of 1 -1/&#119898; 2 . Hence, each batch process causes log &#119898; transfers to different targets.</p><p>Given these lemmas, we can define a transfer packaging policy. Recall that each data transfer can transfer up to &#119904; chunks to a particular server. Once we have defined a batch with at most &#119874; (&#119904; log 2 &#119898;) requests and &#119874; (&#119904; log &#119898;) chunks, we package these into log &#119898; transfers greedily. We keep adding chunks to a transfer until either the number of requests in the transfer is greater than &#119888; 1 &#119904; log &#119898; for some constant &#119888; 1 or the number of chunks in the transfer is greater than &#119888; 2 &#119904;. At this point, this transfer is complete and we start a new transfer. Due to the previous two lemmas, the total number of transfers for a particular batch process is at most log &#119898; for suitable choices of &#119888; 1 and &#119888; 2 . Therefore, we have the following corollary:</p><p>Corollary 7.1. The amount of time it takes to move all the chunks that are part of the same batch process to log &#119898; target servers is &#119904; log &#119898; assuming that we have data movement speedup of 24 -that, we can move 24&#119904; chunks in &#119904; time.</p><p>We can also bound the number of requests in the secondary queue under certain conditions. Consider batch processes in &#119883; &#119894; &#8746; &#119883; &#119894;+1 -the batch processes triggered during two consecutive phases. We want to bound the number of requests that end up in the same server's secondary queue due to these batch processes. Lemma 7. Considering only requests which are part of batches from &#119883; &#119894; &#8746; &#119883; &#119894;+1 , the number of these requests that end up in a particular secondary server's queue is at most &#119888; 3 &#119904; log &#119898; for some constant &#119888; 3 .</p><p>Proof. All the requests that are part of batch processes in &#119883; &#119894; &#8746; &#119883; &#119894;+1 arrive within a time interval of 12&#119904; log &#119898; since any request which is in batch from &#119883; &#119894; must have arrived at most 6&#119904; log &#119898; time before the beginning of &#119883; &#119894; . Therefore, these batches can contain a maximum of 12&#119904;&#119898; log &#119898; requests. Since requests are evenly balanced across servers during the batch movements and we have &#119898; servers, no server has more than &#119888; 3 &#119904; log &#119898; requests for a suitably large &#119888; 3 .</p><p>&#9633;</p><p>We now observe the following invariant about consecutive phases since a batch is triggered when the oldest request which is not part of a batch arrived 6&#119904; log &#119898; time ago. Observation 2. Consider any two consecutive phases &#119894; and &#119894; + 1, and consider the set &#119883; &#119894; &#8746; &#119883; &#119894;+1 -the set of batch processes triggered during these phases. No two batch processes in this set can be triggered by the same server.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Now we can prove Lemma 4 via induction.</head><p>Proof. The base case is trivial since the first phase doesn't trigger any batch processes. For Inductive Hypothesis, assume that all batch processes triggered during phase &#119894; pleted by the end of phase &#119894; +1. Therefore, all batch processes triggered during phase &#119894; + 1 are ready to start at the beginning of phase &#119894; + 2 since all prior batch processes at their respective servers have completed.</p><p>None of the batch processes triggered during phase &#119894; + 1 have the same source (Observation 2). In addition, they trigger log &#119898; transfers each (Corollary 6). Therefore, there are a total of at most &#119898; sources and at most &#119898; log &#119898; targets of the transfers -these can be scheduled in &#119904; log &#119898; time without source or destination conflicts with sufficient large constant speedup in data transfer speed. Now consider the processing of these requests by the secondary server. The secondary server can only contain requests due to batch processes triggered during phase &#119894; + 1 and phase &#119894; +2 (since phase &#119894; +2 has started, some of the batch processes in this phase may have also started). By Lemma 7, the total number of requests in any secondary queue due to requests from these two phases is at most &#119888; 3 &#119904; log &#119898; for some constant &#119888; 3 . Therefore, again, with sufficient constant speedup on processing, all these requests can be processed by the secondary server in the next &#119904; log &#119898; steps. Finally, all the chunks from batch processes in phase &#119894; + 1 must move back and again, which can be done in &#119904; log &#119898; time by using a reverse schedule from the move-out schedule. Therefore, in at most 3&#119904; log &#119898; time after the start of phase &#119894; + 1 -that is, at the end of phase &#119894; + 2 -all these batches have completed. &#9633; Bounding the number of rejected requests. We can now show that the number of rejected requests is small. Below is the proof of Theorem 7.</p><p>Proof. Recall that requests can be rejected due to the two control strategies. First, Lemma 6 implies that the probability that a batch has more than 24&#119904; log &#119898; chunks involved is less than 1/&#119898; 2 . Since rejecting a request due to the second policy happens only on these batches, this probability is also less than 1/&#119898; 2 . Second, consider the case where 2 log &#119898; requests arrive at the same server. For a particular server &#119886;, say &#119883; &#119886; is the random variable representing the number of requests that hit this server at this time step. Denote &#119909; = 2 log &#119898;, we have</p><p>For &#119898; &gt; 4 we have 1 &#119909;! &lt; 1/&#119898; 3 . According to the union bound through all the servers, the probability that it is rejected due to the first policy is less than 1/&#119898; 2 .</p><p>In addition to the control strategies, a request may be rejected if the primary queue is full. However, recall that each batch process completes in time at most &#119874; (&#119904; log &#119898;) and due to Lemma 5 the total number of requests that arrive in this time period is at most &#119874; (&#119904; log 2 &#119898;) with probability 1 -1/&#119898; 2 . Since the primary queue has this capacity, the probability of a request being rejected due to this is small. Finally, Lemma 7 indicates that requests can never be rejected from the secondary queue since the number of items in it are at most &#119874; (&#119904; log &#119898;). &#9633;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Empirical Evaluations</head><p>In this section, we evaluate the practicality of our theoretical study on load balancing in distributed key-value stores via a case study real-world database cluster and extensive simulations based on statistics measured from realistic workloads.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">A Case Study</head><p>Our case study platform is an Amazon Kubernetes Service cluster running FoundationDB <ref type="bibr">[41]</ref>, which is an open-source distributed database supporting transactional key-value stores. We develop experimental components in FoundationDB to measure the runtime statistics and customize workloads to mimic real-world applications. The case study aims to verify the applicability of the theoretical model for the load balancing problem in distributed key-value stores and collect statistics to be used in simulation experiments.</p><p>Experimental design. We set up a FoundationDB cluster on the Amazon Kubernetes Service with 70 storage servers and one data distributor, managed by the FoundationDB Operator. Each storage server runs in a dedicated Kubernetes pod with Amazon EBS storage and contains more than 100 data chunks. The RocksDB <ref type="bibr">[5]</ref> is used as the storage engine.</p><p>The storage layer of the FoundationDB implementation that supports snapshots, including RocksDB, works as follows. The keys (values) are persistent in RocksDB checkpoints. Each checkpoint, denoted as a data chunk in previous sections, is stored as a physical file on the disk. Each key is persistent in one checkpoint file. A data transfer from server &#119894; to server &#119895; is implemented as fetching and transmitting at most &#119904; checkpoint files from server &#119894; to &#119895;.</p><p>In FoundationDB, a client uses transactions to interact with the storage. A transaction contains a series of requests (operations) on keys, such as read, write, and scan (read range) requests. Since this work focuses on the storage layer, we regard transactions as the proxy of clients.</p><p>Our case study runs the realistic workload using the Yahoo! Cloud Serving Benchmark (YCSB) <ref type="bibr">[10]</ref>, with an extension that client requests have a 90/10 Read/Write ratio, which mimics the pattern in real-world applications. We run the workload for one hour containing more than one million requests, from which we randomly sample and measure 100K requests with specific properties under consideration.</p><p>Since the data distributor issues a data transfer and monitors it until the data chunks complete moving, we measure the data transfer latency inside the data distributor. Hence, the latency measurement is accurate without needing clock synchronization between storage servers. When recording a data transfer, we collect the number of bytes in the moved data chunks, its end-to-end latency, the source storage server, and the destination storage server.</p><p>Data transfer latency. Recall that our theoretical model assumes that the latency of a data transfer does not increase with the number of moved chunks, as long as the total size is below the network bandwidth. To validate this assumption, we use our FoundationDB cluster to measure the data transfer latencies under different settings: (1) moving chunks with different total data sizes from one server to another server, (2) receiving chunks in multiple data transfers from different servers at the same destination server, and (3) sending chunks in multiple data transfers from the same source server to different servers.</p><p>Figure <ref type="figure">1</ref>(a) presents the latency of standalone data transfers with different total sizes. We can see that the latency does not increase with larger data sizes. This is mainly because the network bandwidth between storage servers is approximately above 1GB per second, which is larger than the total size. Thus, the latency is dominated by the CPU overhead of the storage server to prepare for the data transfer, which is independent of the chunk size.</p><p>For multiple data transfers to the same destination server or from the same source server in Figures <ref type="figure">1(b</ref>) and 1(c), we can see that the latency increases roughly linearly with an increasing number of data transfers. This is because the storage server can only process the data transfers sequentially, which is consistent with our theoretical model. Data transfer latency vs. request processing latency. For analyzing theoretical bounds, the time to process a client request is considered as 1 time slot, while the time to perform data transfer is denoted as &#119904; time slots. Our empirical measurements show that the request processing latency and data transfer latency are at the magnitude of 0.001 and 0.1 seconds, respectively. Besides, the read and scan requests have smaller latencies than that of write. Hence, the ratio &#119904; is around 100 to 1000 in practice. Thus, our simulation experiments below use &#119904; = 100.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Simulation Experiments</head><p>We conduct simulation experiments with different workloads to evaluate our proposed data distribution policies.</p><p>Experimental design. We implement a simulator for a FoundationDB cluster with &#119898; storage servers, where &#119899; data chunks are distributed among the servers according to the chosen policy. During a simulation, the simulator generates requests that access some servers following some workload pattern. When adding a generated request to the server's queue, the simulator rejects (drops) the request if the number of requests waiting in the queue is already equal to the queue length &#119902;. Note that the policy in Section 5 utilizes two queues. In this case, a request is rejected if both queues with a total size &#119902; are full.</p><p>We experimented with two request workloads: Adversary and Zipfian. The adversarial workload always generates requests that access the same set of &#119898; chunks during the simulation, which is especially difficult for policies without knowing this access pattern. In comparison, the Zipfian workload generates requests that access chunks following the Zipfian distribution <ref type="bibr">[30]</ref> with the parameter &#119886; = 2, which represents more realistic workloads. For each setting, we run 10 simulations, each with 4.5 million generated requests, measure the request rejection ratios and report the median.</p><p>Policies and baselines. We implemented our proposed randomized policy without data transfers (labeled Random) and with data transfers (labeled DataMove). To examine how speedup of request processing and data transfers improves the performance of our policies given the same workload, we also enable the simulator to have different speeds. In particular, at speed 1, a server processes one request in a time slot. At speed 2, for instance, a server processes two requests in a time slot. In the simulation, there is no speedup of data transfer.</p><p>As a baseline comparison, we implemented and evaluated a policy that deterministically distributes data chunks to servers. Additionally, if requests evenly access all servers, the requests generated at the beginning of a time slot can all be fully processed in this time slot at speed 1. Therefore, the lower bound of an optimal policy can achieve zero rejection.</p><p>Evaluation results. Figure <ref type="figure">2</ref> shows the rejection ratios under the adversarial and ZipFian workloads, where the queue lengths are set to &#119902; = 200&#119904; log &#119898; and &#119902; = 20&#119904; log &#119898;, respectively. The queue length is chosen so that we can clearly observe the performance differences under different policies. Simulation results show that randomized policy significantly outperforms deterministic policy. Adding data transfers to the randomized policy further reduces the rejection ratios for both workloads. Moreover, we can see that our proposed Random and DataMove policies can already achieve zero rejection with a speedup of 2 and 3, respectively. This result reveals that the required speedup of our policies in practice   can be significantly smaller than that in the theoretical results. We have also run experiments with larger &#119904;, &#119898;, and &#119899; and observed similar performance trends.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Related work</head><p>Both the problem and the solution are related to many topics studied by researchers. Here we provide a brief overview of some of the related work.</p><p>Distrbuted key-value stores have been designed both for academia and commercial use <ref type="bibr">[12,</ref><ref type="bibr">15,</ref><ref type="bibr">23,</ref><ref type="bibr">24,</ref><ref type="bibr">36,</ref><ref type="bibr">41]</ref>. Most of these systems are evaluated empirically and most of this work finds that handling online requests that might be skewed towards certain keys is often an important challenge <ref type="bibr">[2,</ref><ref type="bibr">10]</ref>. Most of the theoretical analysis of these systems is done by making stochastic assumptions on the arrival pattern of requests <ref type="bibr">[28,</ref><ref type="bibr">29,</ref><ref type="bibr">32]</ref>. In this paper, we take a different tack and analyze these systems under adversarial inputs.</p><p>Another related area of research is distributed hash tables, which have received extensive theoretical and empirical investigation. For instance, consistent hashing <ref type="bibr">[21,</ref><ref type="bibr">27]</ref> is widely used to partition the data in distributed hash maps <ref type="bibr">[12,</ref><ref type="bibr">24,</ref><ref type="bibr">34]</ref>. Randomization is widely used to build and maintain the distributed hash tables <ref type="bibr">[13,</ref><ref type="bibr">25,</ref><ref type="bibr">38]</ref>. In addition, some distributed hash tables consider network topology in the design of the hash table itself <ref type="bibr">[3,</ref><ref type="bibr">33]</ref>. The problem we consider in this paper is also related to general load balancing problems where tasks can be dispatched to various servers.</p><p>The techniques used in this paper such as balls-into-bins analysis were originally developed in the load balancing context. These games have been extensively analyzed for various metrics <ref type="bibr">[4,</ref><ref type="bibr">16,</ref><ref type="bibr">31]</ref>. In addition, variations of these games such as power of two choices <ref type="bibr">[14,</ref><ref type="bibr">26,</ref><ref type="bibr">29]</ref> have also been analyzed extensively and sometimes used in the design of distributed hash tables. The problem we considered in this paper is different but related, to the distributed hash table problem as well as the load balancing problem. In particular, the model for queuing and the model for moving data from one location to another is generally different in distributed hash tables which leads to different algorithmic and analytical challenges.</p><p>Since most online systems are dynamic where the system load changes and the keys which are accessed frequently change over time, data migration is an important tool in this system design. There has been extensive empirical work for designing data migration schemes and evaluating them in a diverse range of systems <ref type="bibr">[6, 7, 19, 23, 35-37, 40, 41]</ref> for various distributed workloads in key-value stores and distributed databases. Most of this work is empirical or uses stochastic assumptions on data arrivals for analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Conclusion</head><p>We have explored the problem of load balancing in distributed storage systems, assuming an oblivious adversary sends requests. With reasonable queue sizes and good algorithm designs, it is possible to consume almost all requests in expectation. Several open problems remain. First, we assume that each chunk is stored in at most one server. Real systems often store each chunk on multiple servers via replication to aid the load balancing and recovery from faults. We will design algorithms to support replication. Second, we assume an oblivious adversary that gets no information from the system once it runs. One can imagine an adaptive adversary that can glean the requests' distribution from the rejection pattern or reply latencies. Finally, we would like to explore if there are algorithms that can provide results similar to ours but using smaller queue sizes and thus smaller latencies.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>PPoPP<ref type="bibr">'23,</ref> Feb 25-Mar 01, 2023, Montreal, Canada Zhe Wang, Jinhao Zhao, Kunal Agrawal, He Liu, Meng Xu, and Jing Li</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>PPoPP'23, Feb 25-Mar 01, 2023, Montreal, Canada   </p></note>
		</body>
		</text>
</TEI>
