<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Near-Neighbor Methods in Random Preference Completion</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/21/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10108717</idno>
					<idno type="doi">https://doi.org/10.1609/aaai.v33i01.33014336</idno>
					<title level='j'>Proceedings of the ... AAAI Conference on Artificial Intelligence</title>
<idno>2159-5399</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Ao Liu</author><author>Qiong Wu</author><author>Zhenming Liu</author><author>Lirong Xia</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper studies a stylized, yet natural, learning-to-rank problem and points out the critical incorrectness of a widely used nearest neighbor algorithm. We consider a model with n agents (users) {xi}i∈[n] and m alternatives (items) {yl}l∈[m], each of which is associated with a latent feature vector. Agents rank items nondeterministically according to the Plackett-Luce model, where the higher the utility of an item to the agent, the more likely this item will be ranked high by the agent. Our goal is to identify near neighbors of an arbitrary agent in the latent space for prediction.We first show that the Kendall-tau distance based kNN produces incorrect results in our model. Next, we propose a new anchor-based algorithm to find neighbors of an agent. A salient feature of our algorithm is that it leverages the rankings of many other agents (the so-called “anchors”) to determine the closeness/similarities of two agents. We provide a rigorous analysis for one-dimensional latent space, and complement the theoretical results with experiments on synthetic and real datasets. The experiments confirm that the new algorithm is robust and practical.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In a learning-to-rank problem, there is a set of agents (users) X = {x 1 , . . . x n } and a set of alternatives (items) Y = {y 1 , . . . y m }. Each agent reveals her preferences over a subset of alternatives. The goal is to infer agents' preferences over all alternatives, including those that are not rated or ranked. This fundamental machine learning problem has many practical applications. For example, recommender systems use an agent' revealed preferences to discover other alternatives she might be interested in; product designers learn from consumers' past choices to estimate the demand curve of a new product; defenders can predict terrorists' preferences based on their past behavior; and political parties can evaluate campaign options based on voters' preferences. See <ref type="bibr">(Liu and others 2009)</ref> for a recent survey. Rating vs. ranking. Agents' preferences can be represented by either a rating for each alternative (e.g., an integer rating in Netflix), or a ranking over the alternatives (i.e., complete ordering). Rating-based approaches have many known drawbacks <ref type="bibr">(Liu and Yang 2008;</ref><ref type="bibr">Katz-Samuels and Scott</ref> Copyright c 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 2018), including (i) agents often have different scales for ratings; and (ii) numeric values are often less robust than ranking-based approaches. In fact, rating data can always be converted to ranking data (e.g., y 1 is ranked higher than y 2 if y 1 has a higher rating) and thus ranking-based models and algorithms are more general. We focus on ranking data.</p><p>A common approach to infer an agent's preference is to first identify near neighbors of the agent in terms of the Kendall-Tau (KT) distance, then aggregate their rankings to produce a prediction. The KT distance is a metric that counts the number of pairwise disagreements between two ranking lists. This approach was proposed by <ref type="bibr">Liu and Yang (2008)</ref>, and their algorithm will be referred to as KT-kNN in this paper. Many subsequent work are based on the following assumption <ref type="bibr">(Hwang and Lee 2009;</ref><ref type="bibr">Wang et al. 2012;</ref><ref type="bibr">Fan and Lin 2013;</ref><ref type="bibr">Wang et al. 2014;</ref><ref type="bibr">Park et al. 2015)</ref>.</p><p>Assumption 1. KT distance is a good measure of similarity between agents.</p><p>No theoretical justification for this assumption was known until recently. Katz-Samuels and Scott (2018) proposed a latent utility model to justify the assumption. In their model, each agent or alternative is associated with a latent feature. Alternative j's utility to agent i is controlled by a deterministic function of the similarity in their latent features. Under this model, consistency result is established for the KT-kNN algorithms.</p><p>However, this model assumes that agents' preferences are deterministic, which is unrealistic in many settings. For example, an agent can exhibit irrational behavior, or provide only a noisy version of her preferences. In fact, human preferences are often highly non-deterministic. Various statistical models have been built to model such randomness, pioneered by the Nobel Laureate <ref type="bibr">McFadden (McFadden 2000)</ref> among many other researchers. Therefore, the following question remains open.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>How can we learn an agent's random preferences from other agents' random preferences?</head><p>This question can be answered by designing algorithms for two closely-related problems: (i) preference completion (PC): given each agent's preferences over a subset of alternatives, the goal is to estimate its preference over all the alternatives. (ii) near neighbors (NN): given an agent x i , the goal is to find agents x i &#8242; close to x i in the latent space.</p><p>Standard techniques exist to use algorithms for the NN problem to solve a PC problem <ref type="bibr">( (Katz-Samuels and Scott 2018;</ref><ref type="bibr">Liu and Yang 2008)</ref>; see also Appendix D). Therefore, we focus on the NN problem in this paper.</p><p>Our Contributions. Our main conceptual contribution is the combination of a distance-based latent model and random preferences for learning to rank. To the best of our knowledge, while there is a large literature in each component, we are the first to consider both. See related work for more discussions.</p><p>Our model is called distance-based random preference model. Let the latent feature of agent i (alternative j) be x i (y j ). Agent i's preferences are determined by a utility function u(x i , y j ) = &#952;(x i , y j ) + &#491; i,j , where &#952;(x i , y j ) is a deterministic monotonically decreasing distance-based function and &#491; i,j is a zero mean independent random variable. Our model captures two pervasive characteristics of ranking datasets: Ch1. Economically meaningful &#952;(&#8226;, &#8226;) function. u(x i , y j ) is high in expectation when x i and y j are close. An agent is more likely to prefer alternatives with similar latent features to itself. Ch2. Random preference model. The function u(x i , y j ) contains a noise term &#491; i,j to capture uncertainties in agents' behaviors.</p><p>Our technical contributions are two-fold. First, we prove that Assumption 1 does not hold anymore in our distancebased random preference model. More precisely, we prove that the agents found by the KT-kNN algorithm <ref type="bibr">(Liu and Yang 2008</ref>) is far away from the given agents with high probability, even when n, m &#8594; &#8734;.</p><p>Second, we design an "anchor-based" algorithm for finding an agent's near neighbors under random preferences. The algorithm is based on the following natural idea: if two agents i 1 and i 2 are close, then their KT distance to any other agent j (an anchor) should also be close. The algorithm proceeds by using the KT distance to other agents as an agent's feature, and measures the closeness between two agents by the L 1 distance of their features. We prove that asymptotically our algorithm identifies an agent's near neighbors with high probability when the latent space is 1dimensional. Many techniques we developed can be generalized to high-dim settings.</p><p>Experiments on synthetic data verify our theoretical findings, and demonstrate that our algorithm is robust in highdim spaces. Experiments on Netflix data shows that our anchor-based algorithm is superior to the KT-kNN algorithm and a standard collaborative filter (using the cosinesimilarities to determine neighbors). Related Work and Discussions. While using random utility models in learning-to-rank problems is not new (Lu and Negahban 2015; <ref type="bibr">Park et al. 2015;</ref><ref type="bibr">Oh, Thekumparampil, and Xu 2015;</ref><ref type="bibr">Zhao, Piech, and Xia 2016;</ref><ref type="bibr">Zhao, Villamil, and Xia 2018;</ref><ref type="bibr">Liu et al. 2019</ref>; Katz-Samuels and Scott 2018), we are not aware of any that simultaneously achieves both Ch1 and Ch2.</p><p>Random utility-based ranking algorithms (Lu and Negahban 2015; <ref type="bibr">Park et al. 2015;</ref><ref type="bibr">Oh, Thekumparampil, and Xu 2015)</ref> address Ch2, but the function &#952;(x i , y j ) often does not have an explicit economics interpretation. For example, let &#920; &#8712; R n&#215;m be a matrix such that &#920; i,j = &#952;(x i , y j ). <ref type="bibr">(Park et al. 2015;</ref><ref type="bibr">Oh, Thekumparampil, and Xu 2015)</ref> assume that &#920; is low rank. But the low rank assumption does not have explicit economically interpretation.</p><p>While recent non-parametric models (e.g., (Katz-Samuels and Scott 2018)) allow one to use economically interpretable functions &#952; (addressing Ch1), they operate only under deterministic utility models.</p><p>Parametric preference learning has been extensively studied in machine learning, especially learning to rank <ref type="bibr">(Azari Soufiani et al. 2013;</ref><ref type="bibr">Azari Soufiani, Parkes, and Xia 2013;</ref><ref type="bibr">2014;</ref><ref type="bibr">Cheng, H&#252;llermeier, and</ref>  <ref type="table">Dembczynski  2010; Hughes, Hwang,</ref> and<ref type="table">Xia 2015; Khetan</ref> and<ref type="table">Oh 2016;  Maystre</ref> and<ref type="table">Grossglauser 2015)</ref>. These works are different with ours as it is often assumed that agents' preferences are generated from a parametric model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries</head><p>Distance-Based Random Preference Model. Let X = {x 1 , . . . , x n } &#8834; R d denote the set of agents and let Y = {y 1 , . . . , y m } &#8834; R d denote the set of alternatives. We slightly abuse the notation and use x i to refer to both agent i and her latent features. Each agent x i has a ranking (preference list)</p><p>Utility functions and the random utility model. Agent i's expected utility on alternative j is determined by a function &#952;(x i , y j ). Throughout this paper, we use &#952;(x i , y j ) = exp(-x i -y j 2 ), where . 2 is the &#8467; 2 -norm. Agent i's ranking R i is determined by the widely-used Plackett-Luce model <ref type="bibr">(Plackett 1975;</ref><ref type="bibr">Luce 1977)</ref>. The realized utility of alternative y j for agent i is generated by u(x i , y j ) &#8801; &#952;(x i , y j ) + &#491; i,j , where &#491; i,j is a zero mean independent random variable that follows the Gumbel distribution. Then, agent i ranks the alternatives in decreasing order of their realized utilities. The density function of the Plackett-Luce model has a closed-form formula. Let y j1 &#8827; i y j2 represent that y j1 is ahead of y j2 in R i and let j 1 , j 2 , . . . j m be a permutation of <ref type="bibr">[m]</ref>. We have</p><p>.</p><p>(1)</p><p>The marginal distribution between alternatives j 1 and j 2 is</p><p>&#952;(xi,yj 1 )+&#952;(xi,yj 2 ) . Distributions of X and Y. x i and y j are i.i.d. generated from distributions D X and D Y . The supports of D X and</p><p>where c is a constant. We adopt the standard "near uniform" assumption for D X and D Y <ref type="bibr">(Abraham et al. 2015;</ref><ref type="bibr">Hoff, Raftery, and Handcock 2002;</ref><ref type="bibr">Kleinberg 2000;</ref><ref type="bibr">Sarkar, Chakrabarti, and Moore 2011)</ref>.</p><p>. Observation model. We observe only agent i's ranking over a subset O i &#8838; [m] of alternatives. Each alternative j is in O i independently with probability p. The O i 's are also independently generated across different agents. Let R Oi be the ordered list over O i &#8838; Y that is consistent with R (i.e., R Oi is the partial ranking of R over O i ). For each agent i, we observe R Oi i . The near neighbor problem. Here, an algorithm needs to find near neighbors of an input agent. An algorithm is a k(n, m)-NN solver with parameter &#964; (n) if &#8226; for any input agent i, the algorithm outputs k agents i 1 , i 2 , . . . i k , and</p><p>We often write k-NN or kNN instead of k(n, m)-NN when k's dependencies on m and n are not critical. Additional notations and examples. For an arbitrary ordered list R, we use it calligraphic form R to extract the rank of an alternative. For example, suppose y j is the top-ranked alternative in R, then R(y j ) = 1. Let I(v) be an indicator that sets to 1 if the argument v is true; if false, it sets to 0.</p><p>Let |R| be the length of the list R. Let R 1 and R 2 be two ordered lists over the same set of alternatives. The normalized Kendall-Tau distance between R 1 and R 2 is</p><p>When R 1 and R 2 do not have the same support, the normalized KT distance is defined as</p><p>To facilitate analysis, sometimes we need to introduce new agents outside X . For a new agent with latent features x, let R x denote its ranking over Y and let R Ox x denote the observed ranking. Conditional probability and expectations. There are multiple levels of randomness for producing the rankings R i 's: (i) x i and y j are random and (ii) u(x i , y j ) consists of a random component (i.e., randomness from the Plackett-Luce model). Care must be taken when operating the conditional random variables defined in our process. For example, </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Inefficacy of KT-kNN</head><p>In this section, we will prove the inefficacy of KT-kNN algorithm by <ref type="bibr">Liu and Yang (2008)</ref> (Algorithm 1) in our distanced-based random preference model. This implies that Assumption 1 does not hold in our model.</p><p>Recall that KT-kNN uses KT distances to find an agent's neighbors based on the intuition that when x i and x j are close, their "opinion" on alternatives' utilities should be similar. The next theorem show that this intuition does not hold Algorithm 1: KT-kNN (it produces incorrect results)</p><p>, k, and an agent x i . 2 Output: k neighbors near agent i in the latent space.</p><p>jn-1 ) 4 Return X KT-kNN &#8592; {j 1 , . . . j k } in our model, by proving that KT-kNN does not return any near neighbors for a large fraction of x i . Theorem 1. Consider Algorithm KT-kNN under distancebased random preference model in which d = 1 and p = 1. Let D Y and D X be uniform distributions on <ref type="bibr">[-1, 1]</ref>. For any constant &#491;, any x i &#8712; [-1 + &#491;, -0.5] &#8746; [0.5, 1 -&#491;], and any k &#8804; n/ ln 5 n, we have</p><p>with high probability. The probability comes from random X /{x i }, random Y, and random preferences.</p><p>Remarks. Theorem 1 states that KT-kNN fails to work even for the simple case where d = 1 and <ref type="formula">3</ref>) is a strong result because trivial algorithms exist to find an agent x j whose distance to x i is &#920;(1) (just picking up an arbitrary x j ). In addition, this result continues to hold for large populations (e.g., when n, m &#8594; &#8734; and p = 1), suggesting that the limitation of the KT-based approach roots at the structural properties of the NK function. In addition, if we use KT-kNN to solve PC problem by applying standard techniques, it will also produce poor results (see Appendix D and Lemma 8 there).</p><p>Comparison to (Katz-Samuels and Scott 2018). Katz-Samuels and Scott (2008) proved that KT-kNN is effective under the deterministic utility model. This suggests that with the presence of uncertainties in the utility function (a more realistic assumption), the algorithmic structure of the NN problem is significantly altered. Intuitions behind Theorem 1. The following example highlights the salient structures of KT-distances. Example 1 (Near-neighbors in expectation). Let x 1 = 0, y 1 = -0.5, and y</p><p>, where would we place an agent that minimizes its KT distance to x 1 ?). One would hope that when x * and x 1 are close, R x * and R 1 is close, but here x * = -0.5. Specifically, let a be the probability x 1 prefers y 1 to y 2 (i.e., a &#8801; &#952;(x1,y1) &#952;(x1,y1)+&#952;(x1,y2) ) and let b(x) be the probability x prefers y 1 to y 2 . Recall that agents' support is <ref type="bibr">[-1, 1]</ref>. We need to solve</p><p>Here, we aim to minimize the weighted sum of a &#8712; [0, 1] and (1 -a) via controlling b(x). The optimal solution has a simple structure: when a &gt; (1 -a) (equivalently, a &gt; 0.5), we need to set the weight associated with a as small as possible, which means setting b(x) to the largest possible value. When a &lt; (1 -a), b(x) needs to be minimized. Thus, the optimal solution uses the following threshold rules (assume a = 0.5 for simplicity).</p><p>x * &#8712; (-1, y 1 ) if a &gt; 0.5 (y 2 , 1) if a &lt; 0.5 .</p><p>This minimizer is far from x 1 .</p><p>See also Example 3 in Appendix B.5 for another small and concrete example, in which KT-kNN produces poor output.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Proof sketch of Theorem 1</head><p>We use intuitions from Example 1 to prove the theorem. Specifically, define</p><p>First, note that NK(R i , R x ) concentrates at G i (x) when m is sufficiently large. This comes from the concentration behavior of the NK function:</p><p>See Appendix B.1 for the proof. The terms in NK are not independent terms so we cannot directly apply Chernoff bounds. Our proof uses the combinatorial structure of the NK function to decouple the dependencies among terms. The technique we develop can be of independent interests.</p><p>Let x * = arg min x G i (x). Below is our main lemma:</p><p>Lemma 2. Let D Y be uniform distribution on <ref type="bibr">[-1, 1]</ref>. Let x i be any agent in [-1, -0.5]. We have</p><p>Similarly, when x i &#8712; [0.5, 1], arg min x G i (x) = 1.</p><p>For any x i &#8712; [-1, -0.5]&#8746;[0.5, 1], Lemma 1 and Lemma 2 give us:</p><p>, where the first approximation comes from the concentration bound of NK and the second approximation comes from the fact that there must exist one agent close to -1 when the number of agent is large. Therefore, all the neighbors produced by KT-kNN are far from x i (see Appendix B.4 for a rigorous analysis).</p><p>Proof of Lemma 2. By linearity of expectation, we have</p><p>The last equality holds because y j 's are i.i.d. samples from D Y . Define</p><p>When the context is clear, we shall refer to p i (y 1 , y 2 ) and p x (y 1 , y 2 ) as p i and p x , respectively. We have</p><p>One can see that G i (x) is a smooth function (the first derivative exists). Our proof consists of three parts. Part</p><p>The proof for part 3 is similar to part 1. Proving part 2 is also simpler. Therefore, we focus only on the proof for part 1. Proof for part 2 and 3 is deferred to Appendix B.2.</p><p>We now show that when x &#8712; (-1, x i ], &#8706;G i (x)/&#8706;x &gt; 0. We have (see Fact 2 in Appendix B.2):</p><p>where</p><p>&#8226; sign(y 1 -x)-sign(y 2 -x)</p><p>Here,</p><p>1,2 ) measures whether x i (x) is closer to y 2 or y 1 . Similar to Example 1, they serve as important quantities determining the structure of &#8706;G i (x)/&#8706;x (and therefore the optimal solution).</p><p>One can check that &#934;(y 1 , y 2 , x, x i ) = &#934;(y 2 , y 1 , x, x i ). Therefore,</p><p>Central to our analysis is carefully partitioning the event y 1 &#8804; y 2 into four disjoint (sub)-events. Under each event, the conditional expectation of &#934; can be computed in a straightforward manner. Specifically, define</p><p>. Figure <ref type="figure">3</ref>(a) in Appendix A visualizes the events to complement the analysis. We now interpret the meaning of these events. Event E 1 . Event E 1 represents the case in which y 1 and y 2 are on the same side of x. In this case, any movement Algorithm 2: Anchor-kNN</p><p>, k, and agent x i . 2 Output: k neighbors near agent i.</p><p>with high probability. The probability comes from random X /{x i }, random Y, and random preferences.</p><p>Remark. &#964; (n) cannot be too small because our function D(x i , x j ) cannot measure the distance of two agents well if they are too close. m needs to grow when p (fewer samples) or &#964; (n) decreases (higher quality requirement), which is intuitive. k measures the number of numbers an algorithm can find so that their distance is within &#964; (n); larger k means the algorithm is more powerful.</p><p>Let</p><p>In the remainder of this section, we analyze the behavior of of D. The function D concentrates at D and can be shown by using simple Chernoff bounds (see Appendix C.2 for a complete analysis). Lemma 3. For any near-uniform D X , D Y on [-c, c] and any two agents x i , x j , we have</p><p>where c 3 is a constant that depends only on c.</p><p>Upper bound proof for Lemma 3. The upper bound requires only a straightforward calculation. Recall that p</p><p>The last inequality is shown in Fact 6 in Appendix C.1. Lower bound bound proof for Lemma 3. Here we analyze only the case |x i -x j | &#8804; 2c -2 ln n. When |x i -x j | &gt; 2c -2 ln n (e.g., x i and x j are around the boundaries -c and c, respectively), the result is trivial. Wlog, assume that x i &lt; x j . We partition [-c, c] into three intervals and consider anchor agents in each of these intervals. Specifically, define (see also Figure <ref type="figure">1</ref></p><p>The agents in I 2 are "less effective anchors" (C2). We use trivial bound for terms in I 2 (|F i,t -F j,t | &#8805; 0). Focusing on I 1 and I 3 , we have</p><p>x j is at least in the order of |x i -x j | 2 . The analysis for the other term is similar. Below is the lemma we need (related to C2):</p><p>&#8706;x dx , we aim to give a bound for &#8706;Gt(x)  &#8706;x . Specifically,</p><p>when x &#8805; 2x t + c (or equivalently, x t &#8804; x-c 2 ). Re-cycle the definition of &#934;, &#8710; (i) 1,2 , and &#8710; (x) 1,2 used in the analysis of Theorem 1 (see also Appendix A for the notation summary) so that &#8706;G(x)  &#8706;x = E y1&lt;y2 [&#934;(y 1 , y 2 , x, x t ) | x, x t ] We partition the positions of {y 1 , y 2 } into events (see also Fig. <ref type="figure">1c</ref> for a visualization):</p><p>&#8226; E 1 : when x &#8804; y 1 &#8804; y 2 or y 1 &#8804; y 2 &#8804; x.</p><p>&#8226; E 2 : when y 1 &#8712; [ x-7c 8 , 3x-5c 8 ] and y 2 &#8805; x+c 2 . We have</p><p>&#8226; E 3 : when (y 1 , y 2 ) / &#8712; E 1 &#8746; E 2 . As explained below, we do not need to explicitly calculate Pr[E 3 | y 1 &#8804; y 2 ].</p><p>We now explain the intuition associated with these events.</p><p>Event E 1 . Because y 1 and y 2 are on the same side of x, we have E[&#934; | E 1 , x, x t ] = 0 (see also Lemma 2). Event E 2 and event E 3 . When (y 1 , y 2 ) &#8712; E 2 &#8746; E 3 , we have |y 1 -x t | &#8804; |y 2 -x t | (x t is closer to y 1 than to y 2 ). This is because (i) |y 2 -x t | &#8805; x -x t when y 2 &#8805; x and x &#8712; I 1 , and (ii) |y 1 -x t | &#8804; max{x t -c, x -x t } &#8804; x -x t since y 1 &#8804; x and x &#8805; 2x t + c. This conclusion is trivial when the positions of the points are visualized (Figure <ref type="figure">1(c)</ref>).</p><p>In these events, an increment in x will result in an increment in G t (x). Therefore E[&#934; | E 2 &#8746; E 3 , x, x t ] &#8805; 0. Event E 2 . Knowing &#934; can be arbitrarily close to 0 when E 2 &#8746; E 3 happens. Here, we need also identify an event so that &#934; is at least a positive constant. Event E 2 serves for this based kNN algorithm failed to find similar agents (users), challenging the assumptions made in many prior preference completion algorithms. To fix the problem, we introduced a new anchor-based algorithm Anchor-kNN that uses all the agents' ranking data to determine the closeness of two agents. Our approach is in sharp contrast to most existing feature engineering methods. We provided a rigorous analysis for Anchor-kNN for 1-dim latent space, and performed experiments on both synthetic and real datasets. Our experiments showed that Anchor-kNN works in high dim space and promises to outperform other widely used techniques.</p></div></body>
		</text>
</TEI>
