<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Private Proximity Retrieval</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019 July</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10191083</idno>
					<idno type="doi">10.1109/ISIT.2019.8849249</idno>
					<title level='j'>2019 IEEE International Symposium on Information Theory (ISIT)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Tuvi Etzion</author><author>Oliver W. Gnilke</author><author>David Karpuk</author><author>Eitan Yaakobi</author><author>Yiwei Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[A private proximity retrieval (PPR) scheme is a protocol which allows a user to retrieve the identities of all records in a database that are within some distance r from the user's record x. The user's privacy at each server is given by the fraction of the record x that is kept private. The distortion of a PPR scheme measures how accurately the user can calculate the identities of the desired files. We assume that each server stores a copy of the database. This paper studies protocols that offer trade-offs between perfect privacy and low computational complexity and storage.In this paper, this study is initiated. The work focuses on the case when the records are binary vectors together with the Hamming distance. In particular, for a given privacy level, we investigate the minimum number of servers that guarantee a prescribed distortion value. The collusions of pairs of servers as well as other distance measures are investigated.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>The growing amount of available information, which mostly resides in the web and on the cloud, has made information retrieval (IR) one of the more important computing tasks. In fact, web search has become the standard source of information search and in general, IR refers to information access in order to obtain data, in any form, from any available information resources. However, this form of communication also poses a risk to the user privacy, since the servers can monitor the user's requests in order to deduce important information on the user and his interests. Therefore, an important aspect of IR is hiding the information the user is searching for.</p><p>Private information retrieval (PIR) is one of the well-known problems that provide privacy to user's requests. This problem was introduced by Chor et al. in <ref type="bibr">[5]</ref>. PIR protocols make it possible to retrieve a data item from a database without disclosing any information about the identity of the item being retrieved. This problem has attracted considerable attention since its inception, see e.g. <ref type="bibr">[6]</ref>, <ref type="bibr">[22]</ref>. The classic PIR model of <ref type="bibr">[5]</ref> views the database as a collection of bits and assumes that the user wishes to retrieve the ith bit without revealing any information about the index i. This problem has received recently significant attention from an information-theoretic perspective, wherein the database consists of large records and the goal is to minimize the number of bits that are downloaded from the servers <ref type="bibr">[2]</ref>. Since then, extensions of this model for several more setups have been rigorously studied; see e.g. <ref type="bibr">[1]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[9]</ref>, <ref type="bibr">[18]</ref>- <ref type="bibr">[20]</ref> and references therein.</p><p>PIR protocols have been rigorously studied mostly for the basic IR problem when the user knows which data records are stored in the database (but not their content) and simply asks for the content of one of them. However, the user may be interested in other forms of information from the database rather than just asking for a particular record. The recently-introduced problem of private computation is a generalization of PIR which allows a user to compute an arbitrary function of a database, without revealing the identity of the function to the database. This problem has been studied for linear functions in <ref type="bibr">[12]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[18]</ref> and for polynomial functions in <ref type="bibr">[8]</ref>, <ref type="bibr">[17]</ref>. Many open questions about private computation remain, especially for non-linear functions, one of which is the topic of this paper.</p><p>Another important IR problem is that of proximity searching. One example of proximity searching is the K-nearest neighbor search (K-NNS), where the goal is to find the K elements from the database that minimize the distance to a given query <ref type="bibr">[3]</ref>, <ref type="bibr">[11]</ref>. Proximity searching has several applications, among them are classification, searching for similar objects in multimedia databases, searching for similar documents in information retrieval, similar biological sequences in computational biology, and more. While proximity searching has been well-studied in the literature, to the best of our knowledge, there are no existing solutions which offer both proximity searching and privacy simultaneously. The assumption in proximity searching algorithms is that there is only a single server which stores the database. However, modern data storage systems are stored across several servers in a distributed manner, and thus we can take advantage of this setup in order to provide privacy for proximity retrieval.</p><p>In our setup of the problem we assume the user has a record and is interested in knowing the identities of the records in the database that are close to his record, according to some distance measure. This setup can fit to the case when the records stored in the database are attributes of different users. Given the user's attribute, he is only interested to know the users which have similar attributes to his according to some distance measure between the attributes. For example, a record may be a user's location and the database may consist of the locations of agents. In this context, the user seeks to determine the identity of the agents nearest to him, without necessarily knowing their exact location, while minimizing the amount of information that is exposed to the servers about his location. This example is related to the private proximity testing problem in which two users seek to determine whether they are close to each other, without revealing any information about each other's location <ref type="bibr">[13]</ref>, <ref type="bibr">[14]</ref>, <ref type="bibr">[16]</ref>. Yet another example assumes that each record is a file (song, movie, DNA sequence etc.), and the user is interested in determining the records which are similar to his. There are several more related problems to private proximity retrieval such as private keyword search <ref type="bibr">[10]</ref>, <ref type="bibr">[21]</ref> and private search <ref type="bibr">[4]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Setup</head><p>Let V be a finite set, assume an M -file database X = {x 1 , . . . , x M } &#10003; V, where the x i are i.i.d. samples of a random variable X, is stored on N different non-colluding servers. Let</p><p>The user has a record x 2 V chosen from X and seeks to know the identity of every record in B(x, r) \ X , i.e., the identity of every record in the database similar to his, which is given by the set I(x, r) , {m 2 {1, . . . , M}| x m 2 B(x, r)}.</p><p>As opposed to the classical PIR problem, our solutions do not provide full privacy, but only partial privacy which will depend on the accuracy of calculating the set I(x, r). Definition 1. Given the above setup, a private proximity retrieval (PPR) scheme consists of three algorithms.</p><p>1) A randomized algorithm Q that forms N queries, q 1 , . . . , q N , depending on X, M , the user's record x, and the search radius r.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2) An algorithm A that calculates answers</head><p>given a query q n generated by the algorithm Q, and the database X . 3) An algorithm R, that calculates an approximation b I(x, r) of the set I(x, r), given X, x, r, the set of queries q 1 , . . . , q N , and the corresponding answers A 1 , . . . , A N . The user's privacy at server n is defined, roughly speaking, to be the uncertainty in X after the server sees the query q n , relative to the entropy of X. That is,</p><p>While the goal in our schemes is to accurately calculate the set I(x, r), i.e., b I(x, r) = I(x, r), it may happen that it is only possible to approximately calculate the set, i.e., I(x, r) ( b I(x, r). The distortion of a PPR scheme is the smallest integer &#9999; s.t. b I(x, r) &#10003; I(x + &#9999;, r), in other words, b I(x, r) contains an index of at least one element of distance r + &#9999; from x (where we now assume X to be uniform on V or at least all elements of V to have a positive probability).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark 1. It is possible to define a lower and an upper distor</head><p>where &#9999; 1 and &#9999; 2 are minimal. We reinterpret the approximation b I(x, r) as a solution to the search around x with radius r 0 := r &#9999; 1 . Then</p><p>, and it is therefore sufficient to study schemes that have lower distortion 0 and upper distortion</p><p>We denote by H(&#8226;) the binary entropy function. For a discrete random variable Y supported on a set Y, we define its entropy by</p><p>y2Y P r(y) log 2 (P r(y)). It will always be clear from context whether the function H(&#8226;) is taking a scalar or a random variable as its argument. For a positive integer n, the set {1, 2, . . . , n} is denoted by <ref type="bibr">[n]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BASIC PROPERTIES AND CONSTRUCTIONS OF PRIVATE PROXIMITY RETRIEVAL SCHEMES</head><p>A. A Simple Error-Free Scheme for the Hamming Space The main part of the current work concentrates on the case of V = F L 2 , d is the Hamming distance on F L 2 , and the random variable X equal to the uniform distribution on F L 2 , and this is the case to which we now specify unless stated explicitly otherwise. For two vectors u and v of the same length, the Hamming distance between u and v is denoted by d H (u, v) and the Hamming weight of u is denoted by w H (u). The support of u will be denoted by supp(u). For s &gt; 0, let W s be the set of all vectors of weight exactly s in</p><p>The following lemma will motivate our first example.</p><p>Lemma 2. For all r, s, L such that r + 2s + 1 6 L, and x 2 F L 2 , it holds that B(x, r) = T z2Ws B(x + z, r + s), and hence</p><p>Example 1. Lemma 2 suggests a PPR scheme for retrieving I(x, r). Assign each vector z 2 W s to one server at random and send the query q z = (x + z, r + s). The server computes the set A z = I(x + z, r + s), and sends it as its response back to the user. Finally, the user computes the intersection of all responses and therefore the set I(x, r), by Lemma 2. Hence this scheme has distortion &#9999; = 0. We discuss the privacy at each server in more generality in the next section. 2</p><p>B. A General PPR Scheme and Some Basic Properties While the PPR scheme described in Example 1 using the set W s has the advantage of being easy to describe, the number of servers is L s , which even for small values of L is unreasonable. Our general strategy for improving on this construction will be to consider subsets Z &#10003; W s which satisfy, or approximately satisfy, the equation of Lemma 2, but for which |Z| &#8999; |W s |. We formalize the studied family of PPR schemes in the following definition. Definition 3. Given a search radius r, a parameter s, and a set of query vectors Z &#10003; W s of size N , the PPR scheme P P R(r, s, Z) is defined to consist of the following algorithms:</p><p>1) The algorithm Q applies a uniform random permutation of [L] to the coordinates of all vectors z n 2 Z, and then sends the query q n = (x + z n , r + s) to server n = 1, . . . , N, where x is the user's record.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2) The algorithm A computes</head><p>The matrix M 2 F N &#8677;L 2 whose rows are the vectors z 2 Z is referred to as the query matrix. Proposition 4. For the PPR scheme P P R(r, s, Z), the privacy at</p><p>In particular, if = s/L is constant with respect to L, then P n ! H( ) as L ! 1.</p><p>Let P = P n for the PPR scheme P P R(r, s, Z), which by Proposition 4 is independent of n. It is not hard to show that P &lt; H( ), thus approximating P &#8673; H( ) for large L slightly overestimates the privacy level. Nevertheless, to maximize privacy, one wants to be as close as possible to 1/2, so that H( ) &#8673; 1. Using the scheme constructions which we will outline in the next few sections, we can obtain H( ) &gt; 1 for any &gt; 0 and search radius r = 0. On the other hand, it will be shown in the sequel that it is impossible to attain privacy of exactly 1 for any PPR scheme of the form P P R(r, s, Z).</p><p>Apart from its role in the above brief analysis of the privacy of P P R(r, s, Z), the random coordinate permutation used by the algorithm Q is largely immaterial to the analysis of the scheme, and in what follows we will largely ignore it. To begin our analysis of the family P P R(r, s, Z) of PPR schemes, we state several elementary but useful lemmas.</p><p>In case s &gt; (L r)/2 we have the following negative result on the distortion value. Proof: Let 1 be the all ones vector, and z 2 W s . We see</p><p>T z2Ws B(z, r + s) and the distortion for any P P R(r, s, Z) scheme is maximal, i.e., &#9999; = L r.</p><p>By Lemma 5, the PPR scheme P P R(r, s, Z) has non-trivial distortion only when s 6 (L r 1)/2, or equivalently, 6 1 2 r+1 2L . Hence H( ) &lt; 1 for any such scheme as long as we insist on non-trivial distortion, that is, we will always fall short of perfect privacy. On the other hand, by Proposition 4, we need s to be a sizable fraction of L to achieve a good privacy level. For the rest of the paper, we assume that s 6 (L r 1)/2. We will repeatedly use the following two lemmas to compute the distortion of a P P R(r, s, Z) scheme. Proof: Suppose there exists a vector y 2 T z2Z B(z, r + s) with weight w H (y) = r + where is a positive odd integer. By Lemma 6, we have /2 6 | supp(y) \ supp(z)| for all z 2 Z, and since the right-hand side is an integer, it follows that ( + 1)/2 6 | supp(y) \ supp(z)|. Now consider a vector y 0 of weight w H (y 0 ) = r + + 1 given by adding a single 1 to y in any coordinate not in supp(y). We have</p><p>for all z 2 Z, which by Lemma 6 shows that y 0 2 T z2Z B(z, r+ s). Thus there exists a vector y 0 2 T z2Z B(z, r + s) with weight r + + 1, and hence &#9999; is even.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Lower Bounds on the Number of Servers</head><p>Every set Z &#10003; W s gives us a PPR scheme P P R(r, s, Z) and according to Proposition 4 the privacy at each server of this scheme is P = log 2 ( L s )</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>L</head><p>. Thus, for a given distortion value &#9999; we try to find a set Z of minimal size that guarantees this distortion level. This motivates us to study the following design problem.</p><p>Definition 8. Let L, s, r, and &#9999; be positive integers such that &#9999;, s, r 6 L. Let N (L, s, r, &#9999;) be the minimal value of N such that there exists a set of query vectors Z &#10003; W s of size N satisfying</p><p>Clearly N (L, s, r, &#9999;) 6 L s for s 6 (L r 1)/2, since by Lemma 2 taking Z = W s satisfies the above with the left-hand inclusion an equality. Our goal in this section is to prove lower bounds on the value of N (L, s, r, &#9999;). Lemma 9. Suppose that |Z| = 2, so Z = {z 1 , z 2 } &#10003; W s . Then the distortion of P P R(r, s, Z) satisfies 2s 6 &#9999;. In particular if s &gt; &#9999;/2 + 1 then N (L, s, r, &#9999;) &gt; 3.</p><p>Proof: Let y 2 F L 2 be a vector of Hamming weight 2s + r such that supp(z 1 ) [ supp(z 2 ) &#10003; supp(y). Then, d H (y, z 1 ) = d H (y, z 2 ) = s + r, and hence y 2 B(z 1 , r + s) \ B(z 2 , r + s). Therefore &#9999; &gt; 2s and the result follows.</p><p>Lemma 10. Let Z &#10003; W s be any set of query vectors of size |Z| = N such that P P R(r, s, Z) has distortion &#9999;. Suppose that there exists a subset S &#10003; [L] such that | supp(z) \ S| &gt; &#9999;/2 + 1 for all z 2 Z. Then, it holds that |S| &gt; r + &#9999; + 2.</p><p>Proof: Assume to the contrary that |S| 6 r + &#9999; + 2, and let y is a vector with w H (y) = r + &#9999; + 2 such that S &#10003; supp(y). By construction, we have | supp(y) \ supp(z)| &gt; &#9999;/2 + 1 for all z 2 Z. Thus, by Lemma 6 we have y 2</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>T</head><p>z2Z B(z, r + s). But this contradicts the assumption that Z has distortion &#9999;.</p><p>Thus effective lower bounds on N (L, s, r, &#9999;) will come from constructing small sets S which intersect the support of every z 2 Z in at least &#9999;/2 + 1 coordinates. The following theorem does exactly that by recursively finding columns of large weight in the query matrix M and repeatedly applying Lemma 10. Theorem 11. Suppose that s &gt; &#9999;/2 + 1. Then the quantity N (L, s, r, &#9999;) is lower bounded by</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. UPPER BOUNDS ON THE NUMBER OF SERVERS</head><p>In this section we derive upper bounds on the value of N (L, s, r, &#9999;) by explicitly constructing query matrices. Our first construction uses matrices of fixed row and column weight and achieves the following result.</p><p>Theorem 12.</p><p>For the rest of this section we focus on the case &#9999; = 0 and for shorthand, we denote the value of N (L, s, r, 0) by N (L, s, r). In order to accomplish this task, a new family of codes, called anticovering codes, is presented which will be used as a building block on the construction of sets Z that achieve zero distortion.</p><p>Given L, s, r we say that a length-L code C is an (r, s) L anti-covering code if 1) C &#10003; W s , 2) for any length-L vector v such that w H (v) &gt; r, there exists a codeword z 2 C such that d H (v, z) &gt; r + s. For all L, s, r we denote by C A (L, s, r) the minimum size of any (r, s) L anti-covering code. We see that for all L, s, r, it holds that N (L, s, r) = C A (L, s, r), and based upon this relation we present our main construction of anti-covering codes in the next theorem. This construction provides an upper bound on the value of N (L, s, r). Proof Sketch: We sketch the proof for the case when C = 2 and r is odd. In this case it holds that A = d s(r+1)</p><p>L 2s e and we assume that s is a multiple of A. First we generate 2A + r + 1 disjoint subsets of the set [L], each of size s A (note that (2A + r + 1) &#8226; s A 6 L). We denote these sets by S 1 , S 2 , . . . , S 2A+r+1 , and they are further partitioned into two families sets</p><p>2 +1 , . . . , S 2A+r+1 }. Next we define a set of N vectors z 1 , z 2 , . . . , z N as follows. For every</p><p>2 , we let z i1,i2,...,iA be a vector whose support set is</p><p>. The rest of the proof follows by proving that C = {z 1 , z 2 , . . . , z N } is an (r, s) L anti-covering code. The first property clearly holds and the second one is verified by proving that for any vector v of Hamming weight greater than r there exists z 2 C such that</p><p>While the condition in Theorem 13 requires that s is a multiple A, for L large enough and = s/L this condition is negligible so it is possible to conclude with the following corollary. Corollary 14. For all constant = s/L &lt; 0.5, it holds that</p><p>; , and for r large enough</p><p>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. SERVER COLLUSIONS</head><p>Previous sections assumed that the servers do not communicate with each other in an attempt to deduce more information about the user's vector of interest x. In this section, we analyze the privacy loss from pairs of servers colluding to determine the record x, and construct a scheme which is particularly resistant to pairwise server collusion.</p><p>In general, we let T &#10003; [N ] be a subset of servers of size t. The privacy with respect to T = {j 1 , . . . , j t } is given by</p><p>= H(X|q j1 , . . . , q jt ) L .</p><p>In this section, we study the case of two-server collusions. Remember that in our PPR scheme P P R(r, s, Z), when a server receives a request z, it is possible to deduce that the user's record x is any vector which is of distance s from z, that is, any vector in S(x + z, r) = {y 2 F L 2 | d H (x + z, y) = r}. The next lemma determines the privacy level based upon the size of the intersection of the support sets of their queries.</p><p>Lemma 15. Let (x + z 1 , r + s) and (x + z 2 , r + s) be queries received by two colluding servers T = {j 1 , j 2 }, where w H (z 1 ) = w H (z 2 ) = s and | supp(z 1 )\supp(z 1 )| = t. The privacy at these</p><p>In particular, if = s/L, = t/s and = 1 are constants with respect to L, then P T ! 2 + (1 2 )H(( )/(1 2 )) as L ! 1.</p><p>Given a fixed number of servers N and a column weight c &lt; N we build the set Z as a matrix containing all vectors of length N and weight c as columns. We see that L = N c and s = N 1 c 1 . Hence = c/N and any two rows of Z share a support of size t = N 2 c 2 leading to = (c 1)/(N 1). Applying Lemma 15 shows that when L approaches infinity, the privacy when two servers collude is</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. GENERALIZATION TO OTHER METRICS</head><p>The definitions for a private private proximity retrieval scheme given in Sections I are general for any type of database and any metric defined on it. Fortunately, other results can be generalized to other metrics and in particular to metrics which are based on distance regular graphs (DRG metrics in short). A metric is called a DRG metric if for any</p><p>of the choice of u 1 and u 2 . The diameter of the metric, denoted by L, is the maximum distance between two elements in V. Furthermore, let r be the proximity's radius and s the search's radius parameters for the scheme. An immediate result from this definition is the following result which will be used in the sequel. It holds for DRG metric as well as other important metrics like the L 1 metric.</p><p>Lemma 16 does not hold for all metrics and it is essential for generalizing the proof of Lemma 2 which motivated our approach. For an element x 2 V, let W x s denote the set of words in V of distance s from x, i.e. W x s , {z : d(z, x) = s}. Lemma 17. For all r, s, L such that L &gt; r + 2s + 1 and x 2 V, we have</p><p>There are many DRG metrics, but the most important and interesting ones, which can be used for the private proximity scheme, are the Hamming scheme (over any finite field), the Johnson scheme, and the Grassmann scheme.</p><p>The Johnson scheme J(n, L) consists of all L-subsets of an n-set. The Johnson distance d J (x, y) between two L-subsets x and y is defined by, d J (x, y) , |x \ y|.</p><p>The Grassmann scheme G q (n, L) consists of all L-subspaces of an n-space over F q . The Grassmann distance d G (x, y) between two L-subspaces x and y is defined by, d G (x, y) , L dim(x \ y). VI. WITH PARALLEL WORK In this section we compare our scheme with both a trivial scheme, analogous to the trivial scheme in private information retrieval of downloading the entire database, and the scheme of Chen et al. <ref type="bibr">[4]</ref>.</p><p>We compare the three strategies with respect to the achieved privacy level P , the required storage, the upload cost, and the download cost. For the present scheme, we only consider the case where &#9999; = 0, since the other two schemes assume this condition.</p><p>In the trivial scheme, the user uses a single server which is storing the entire database X . The user simply retrieves the indices of the files in B(z, r) \ X for every z 2 F L 2 . In other words, the user sends 2 L queries to a single server, each of which is a bit string of length L.</p><p>The private search scheme of <ref type="bibr">[4]</ref> applies not only for balls in Hamming space, but any set of subsets of a given set. In this scheme, every server stores 2 L vectors of length M . The PIR scheme requires the user to upload a binary vector of length 2 L to each server, and each server responds by transmitting a vector of length M . For the analysis done in <ref type="bibr">[4]</ref> we must have M 2 L . That is, the database contains an enormous amount of identical files.</p><p>We compare these two schemes with our scheme in Table <ref type="table">I</ref>. In summary, our scheme loses to that of <ref type="bibr">[4]</ref> in terms of privacy and download rate, but saves in terms of storage and upload cost. The download cost, that is, the total amount of downloaded data for both schemes, is approximately equal. Note lastly that if L is especially small, then the user is likely better off simply using the trivial scheme.  </p></div></body>
		</text>
</TEI>
