<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Interactive Linear Regression with Pairwise Comparisons</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2018 October</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10091686</idno>
					<idno type="doi"></idno>
					<title level='j'>52nd Asilomar Conference on Signals, Systems, and Computers</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yichong Xu</author><author>Sivaraman Balakrishnan</author><author>Aarti Singh</author><author>Artur Dubrawski</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[A general goal of interactive learning is toinvestigate broad ways of leveraging human feedback, andunderstand the benefits of learning from potentially complex feedback. We study a special case of linear regressionwith access to comparisons between pairs of samples.Learning from such queries is motivated by several important applications, where obtaining comparisons can bemuch easier than direct labels, and/or when comparisonscan be more reliable. We develop an interactive algorithmthat utilizes both labels and comparisons to obtain a linearestimator, and show that it only requires a very smallamount of direct labels to achieve low error. We also giveminimax lower bounds for the problem, showing that ouralgorithm is optimal up to log factors. Finally, experimentsshow that our algorithm outperforms label-only algorithmswhen labels are scarce, and it can be practical for real world applications]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Interactive learning has been drawing attention from the machine learning community, both theoretically and practically, as it helps enrich the feedback that ML systems can handle, and reduce the learning effort <ref type="bibr">[2]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[24]</ref>. Different than traditional learning and crowdsourcing schemes, interactive learning relies on complex or structured feedback from the labelers, to provide more information or more actionable information about each sample. Another important factor of interactive learning is being active: Instead of dealing with a fixed dataset, interactive learning selectively chooses the most informative questions throughout the learning process.</p><p>We investigate a special case of interactive regression that uses pairwise comparisons. In many applications <ref type="bibr">[20]</ref>, <ref type="bibr">[23]</ref>, people can provide more accurate results when they compare the objective for two different samples, than giving direct labels for individual samples. Comparisons also often cost less effort for humans. For example in clinical settings, assessing the health condition of an individual may require laborious review of complex medical data, but comparing the relative health status of two patients can be easier for trained clinicians and carry less subjective bias. Another example involves evaluating face images (e.g., estimating people's age): Crowdsource workers typically have difficulty giving direct evaluations about "how old" is a person, but they are often better at comparing two face images and deciding who looks older. We conduct experiments on this task in Section VI.</p><p>We consider interactive pairwise comparisons in the special case of linear regression, one of the most important methods of statistical learning. It can be very effective when we expect a simple relationship between the sample and response. However, often the limited number of labeled training data leads to overfitting models, especially when n&lt;d , where n is the number of labels and d is the dimensionality. In such a case, certain sparsity or additive assumptions have been suggested, such as LASSO <ref type="bibr">[22]</ref> and SpAM <ref type="bibr">[18]</ref>. We take a different approach, showing that when we have enough comparisons, it is possible to learn a linear function even when n&lt;d, without making any special assumptions. The basic idea behind our algorithm is very simple: we first learn to predict the comparison, which is intrinsically linear classification; with comparison queries we can infer the underlying weights up to scaling transformations <ref type="foot">1</ref> . After that, we use labels to infer the scale of the weights, and the label complexity of such inference is independent of d. Our contributions come in threefold:</p><p>&#8226; We develop and analyze an interactive learning algorithm that efficiently learns a linear estimator, using both label and comparison queries. Given n label queries and m comparison queries, we show that the mean squared error (MSE) of our algorithm decays at a rate of O 1 n +exp(-m d ) . When sufficient comparisons are available, the label complexity of our algorithm is independent of dimension d. This establishes that we can achieve good MSE rate without making additional assumptions about the underlying function.</p><p>&#8226; We give complementary lower bounds to show that our results are almost optimal, up to log factors. In particular, we show that the rate of O( 1 n ), and the total number of queries, are not improvable up to log factors.</p><p>&#8226; Finally, we conduct experiments on synthetic and real datasets, and show that our algorithm can outperform label-only algorithms when labels are scarce. Experiments on real datasets demonstrate the practicality of our algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>Typically, the focus of current research on pairwise comparison is to elicit a ranking from (noisy) pairwise comparisons on a finite set of items. For instance, <ref type="bibr">[8]</ref> gives an efficient algorithm of error O( 1 n ) under the assumption that each comparison is flipped with a bounded probability. More recently, <ref type="bibr">[17]</ref>, <ref type="bibr">[20]</ref> give algorithms that work under the Bradley-Terry <ref type="bibr">[7]</ref> and Thurstone <ref type="bibr">[21]</ref> models. In contrast, our goal is to elicit a regression function from comparisons, which leads to quite different models and methods.</p><p>Active learning <ref type="bibr">[5]</ref>, <ref type="bibr">[9]</ref> focuses on actively selecting questions to present to the user, but primarily focuses on label queries only. The first part of our algorithm relies on previous results in active learning <ref type="bibr">[4]</ref>, but our setting differs from that of active learning since we introduce comparisons.</p><p>Interactive learning has been considered in various circumstances for different kinds of oracles. For example, <ref type="bibr">[11]</ref> considers learning from partial corrections, where the user points out the error of prediction with respect to certain components of the input. <ref type="bibr">[25]</ref> considers feature discovery through comparison queries. Two recent papers <ref type="bibr">[15]</ref>, <ref type="bibr">[24]</ref> are most closely related to our work. They consider the benefits of learning from comparisons for active classification. However, we focus on regression and our method is very different from theirs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROBLEM STATEMENT</head><p>We assume our samples X come from a distribution P X on R d . Following literature in active learning <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>, we assume that P X is isotropic and log-concave; that is, features of X are independent, centered around 0, have covariance I d ; and log of the density function of X is concave. The first assumption can be achieved through standard preprocessing like ICA, and the latter is true for many prevalent choices of distributions, such as uniform and Gaussian <ref type="bibr">[16]</ref>. Note that this assumption is weaker than in several regression papers <ref type="bibr">[14]</ref>, which assumes P X is independent Gaussian. In addition, let B(v, r) denote the ball of radius r around vector v.</p><p>Following previous literature (e.g., <ref type="bibr">[9]</ref>), we assume access to a label oracle O l , which takes a sample x &#8712; R d and outputs a label Y &#8712; R. We assume a linear relation between the label and features:</p><p>In addition to traditional label queries, we assume access to a (potentially cheaper) comparison oracle O c . On each query, O c receive a pair of samples (X, X ) &#8764; P X &#215; P X , and returns a random variable Z &#8712;{-1, +1}, where Z =1indicates that the user thinks f (X) &gt; f (X ), and Z = -1 otherwise. We assume an agnostic noise<ref type="foot">foot_1</ref> &#957; for Z:</p><p>That is, a randomly sampled triplet (X, X ,Z) is wrong with probability at most &#957;. Note that the error for a particular example (X, X )=(x, x ) can be arbitrary.</p><p>Given a (arbitrarily large) set of unlabeled instances U = {X 1 ,X 2 , ...} coming from P X , we aim to estimate w * by querying O l and O c with samples in U , using a label and comparison budget of n and m respectively. We characterize the quality of any such estimator w in terms of the mean squared error (MSE)</p><p>where the expectation is taken over randomness in estimator w, oracles O l , O c and random sample X respectively. We also study information-theoretic limits of any estimator by examining the minimax risk:</p><p>As a final remark of this section, the classical minimax rate for ordinary least squares(OLS) is of order O( d n ), where n is the number of label queries. This rate cannot be improved by active label queries (c.f. <ref type="bibr">[10]</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. ALGORITHM &amp;ANALYSIS</head><p>Our algorithm is described in Algorithms 1 and 2. We first consider comparisons as a classification problem with samples ((X -X ),Z), and use active linear classification to learn an estimated v &#8776; v * . The active classification algorithm we use comes from <ref type="bibr">[4]</ref>, and is presented in Algorithm 2. In each iteration, it find the best weights that minimize the hinge loss</p><p>where W = {(x i ,x i ,z i )} mk i=1 is the labeled dataset. This vector is normalized and then used as the criterion for selecting pairs. The final result of the classification is a unit vector v &#8776; v * .</p><p>After classification, we use the estimated v along with actual label queries to learn an estimated weight norm r, through OLS. Combining the results we get w = r &#8226; v.  l &#964;k (u, W )+&#954;/8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5:</head><p>v k &#8592; uk uk 2 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>6:</head><p>Sample another dataset U of m k unlabeled sample pairs. 7:</p><p>Ask O c to label all samples in U and obtain labeled set W . 9:</p><p>Theorem 1. There exists some constants C, N such that if n&gt;N, the MSE of Algorithm 1 satisfies 3</p><p>Several remarks are in order before we turn to details of the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remarks. (1)</head><p>The MSE rate for classical ordinary least 3 We use O to represent expressions without the log log(&#8226;) terms. square (OLS) is d n (see e.g., <ref type="bibr">[12]</ref>). Theorem 1 reduces this to 1  n , which is independent of n. This is critical for high-dimensional linear regression, where we typically have d&gt;n.</p><p>(2) The dependence on m, however, is dependent on d, and we generally require m&gt;dto obtain a low MSE. This suggests that comparisons are most helpful when they are ample.</p><p>(3) Somewhat surprisingly, the dependence on m d is exponential, making our algorithm query efficient once it reaches m&gt;d . Suppose we aim at a MSE of &#947; for some small &#947;&gt;2&#957; 2 , classical OLS requires O(d/&#947;) labels, whereas Algorithm 2 only needs a much-less n + m = O(1/&#947; + d log(d/&#947;)) queries in total. We show in Section V that this quantity is optimal up to log factors. ( <ref type="formula">4</ref>) Finally, we remark that Theorem 1 indicates an upper bound on the minimax risk of (1).</p><p>The full proof is deferred to the Appendix due to space limits; we give a sketch here.</p><p>Proof Sketch. First, using results in <ref type="bibr">[4]</ref>, we obtain an estimator vv * 2 &#8804; &#949; = O exp -m C2d log 3 (nd) + &#957; when we finish the active-comparison in Algorithm 2. Now let T i = v T X i , and we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>And thus</head><p>The first term can be bounded using vv * 2 &#8804; &#949;; for the latter two terms, using Hoeffding bounds we can show that</p><p>Then by decomposing the sums in the latter two terms, we can bound the MSE.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. L OWER BOUNDS</head><p>Now we turn to information-theoretic lower bounds of the minimax risk <ref type="bibr">(1)</ref>. We consider any active estimator w with access to the two oracles O c , O l , using m comparisons and n labels, and show that the upper bound of MSE in Theorem 1 is optimal up to log factors. Our results come in two parts: Theorem 2 shows a lower bound that captures dependency on n, for any w using comparisons. Then, Theorem 3 shows lower bound on the total number of queries (m + n). Theorem 2. Suppose X is uniform in [-1, 1] d , and &#949; &#8764; N (0,&#963; 2 ). Then for any (active) estimator w with access to both label and comparison oracles, there is a constant</p><p>Theorem 2 shows that the O( 1 n ) term in Theorem 1 is necessary. The proof is quite standard using Le Cam's method for two increasing functions when d =1, and is included in the Appendix. Theorem 3. For any (active) estimator w with access to n labels and m comparisons, there exists a ground truth weight w and a global constant C, such that when w * = w and 2m + n&lt;d,</p><p>Theorem 3 shows a lower bound on the total number of queries in order to get low error. Combining with Theorem 2, in order to get a MSE of for some &#947;&lt; C, we need to make at least O(1/&#947; + d) queries (i.e., labels+comparisons). Note that for the upper bound in Theorem 1, we need n + m = O(1/&#947; + d log(d/&#947;)) for Algorithm 1 to reach &#947; MSE. So Algorithm 1 is optimal in terms of total queries, up to log factors.</p><p>The proof of Theorem 3 is done by considering an estimator with access to n+2m noiseless labels {(x i ,w * &#8226; x i )} n+2m i=1 , which can be used to generate m comparisons and n labels. We sample w * from a prior distribution in B(0, 1), and show that the expectation of MSE in this case is at least a constant. Thus there exists a weight vector w that leads to constant error. The full proof is deferred to the Appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. EXPERIMENTS</head><p>We test our algorithm through experiments to verify its practical use. We compare Algorithm 1 with two strong baselines in linear regression: LASSO <ref type="bibr">[22]</ref> and support vector regression. We first conduct experiments in a controlled setting using synthetic data. After that, we consider a practical task of estimating people's ages from portraits. We repeat each experiment 20 times and report the average MSE.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Synthetic Dataset</head><p>For synthetic data, we set d =5 0and generate both X and w * from N (0,I d ), and &#949; &#8764;N (0, 0.5 2 ). The comparison oracle generates response using the same noise model:</p><p>for input (x, x ), with &#949;, &#949; &#8764;N(0, 0.5 2 ). We generate a training set of size n train = 4000 and test set of size n test = 1000. At test time we compute the empirical MSE Model performances are compared in Figure <ref type="figure">1a</ref>. We vary the number of labels n from 5 to 100, and the number of comparisons m in {100, 200, 500}. When the number of labels is small, our algorithm consistently outperforms the baselines. For larger numbers of labels, our algorithm achieves similar performance as SVR, when m = 500.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Age Estimation</head><p>In this section, we consider a practical task of estimating people's ages from their portraits. The APPA-REAL dataset <ref type="bibr">[1]</ref> contains 7,591 images each associated with a biological age and an apparent age. The biological age is the person's actual age, whereas the apparent ages are crowdsourced by human. The images are divided into 4113 train, 1500 validation and 1978 test samples, and we only use the train and validation samples for our experiments. We extract the 128-dim feature for each image using the last layer of FaceNet <ref type="bibr">[19]</ref>. The features are centralized to have zero mean and unit variance.</p><p>We consider the task of predicting biological ages; typically it is hard to obtain direct biological labels of people in portraits, but it is relatively easy to crowdsource comparisons. Note that comparisons in this case are based on apparent ages instead of biological ages. So in our experiments, the comparison oracle O c returns labels according to the apparent age, and the label oracle O l returns biological ages directly.</p><p>We vary the number of labels from n = 100 to 4,063 (all labels) and number of comparisons in m &#8712; {1000, 2500, 5000}. Results (Figure <ref type="figure">1b</ref>) show that our algorithm requires fewer labels than baseline methods to achieve low MSE. Also, the total number of queries that our algorithm makes is smaller than that of baseline methods. This verifies our theoretical results in Section IV and V, and demonstrates practicality of our method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. DISCUSSION AND CONCLUSION</head><p>We develop interactive learning algorithms for linear regression with access to both label and comparison oracles. Our results show that when comparison labels are copious, only a very small amount of direct labels is required to learn a linear estimator with good accuracy. We also provide complementary lower bounds to show the optimality of our algorithm. Experiments on both synthetic and real-world datasets show the practicality of our algorithm. In applications, comparisons are typically easier to obtain than labels, and our algorithm becomes more effort-efficient than label-only algorithms in these cases.</p><p>We analyze our method under the assumption of agnostic comparisons. The same results can be easily obtained for the case where each comparison is flipped (w.r.t. ground truth) with some probability &#951;&lt;1/2, using algorithm in <ref type="bibr">[13]</ref>, however that algorithm is computationally inefficient. Currently the best efficient active learning algorithm under such "bounded noise" setting requires m &#8805; O d</p><p>comparisons <ref type="bibr">[3]</ref>. So it remains open to solve such cases using around O(d) comparisons. It would also be interesting to consider other comparison models such as BTL <ref type="bibr">[7]</ref> and Thurstone <ref type="bibr">[21]</ref>.</p><p>Other interesting directions include extending our method to more complex models, such as generalized linear and graphical models. A broader goal would be to understand the fundamental benefits and limits of utility of a general class of indirect queries, along with their applications.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Note that we cannot learn the underlying weights perfectly with comparisons only; for comparisons y = w T x and y =5 w T x gives exactly the same results. ,((( $VLORPDU</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Our model can also be adapted to the bounded noise model case using a different algorithm from active learning; See Section VII for details.</p></note>
		</body>
		</text>
</TEI>
