<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Distributed Multi-Task Learning with Shared Representation</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>03/07/2016</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10025959</idno>
					<idno type="doi"></idno>
					<title level='j'>arXiv.org</title>
<idno>2331-8422</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Jialei Wang</author><author>Mladen Kolar</author><author>Nathan Srebro</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We study the problem of distributed multitask learning with shared representation,where each machine aims to learn a separate, but related, task in an unknown sharedlow-dimensional subspaces, i.e. when the predictor matrix has low rank. We consider asetting where each task is handled by a different machine, with samples for the taskavailable locally on the machine, and study communication-efficient methods for exploitingthe shared structure.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Multi-task learning is widely used learning framework in which similar tasks are considered jointly for the purpose of improving performance compared to learning the tasks separately <ref type="bibr">[13]</ref>. By transferring information between related tasks it is hoped that samples will be better utilized, leading to improved generalization performance. Multi-task learning has been successfully applied, for example, in natural language understanding <ref type="bibr">[15]</ref>, speech recognition <ref type="bibr">[32]</ref>, remote sensing <ref type="bibr">[44]</ref>, image classification <ref type="bibr">[25]</ref>, spam filtering <ref type="bibr">[43]</ref>, web search <ref type="bibr">[14]</ref>, disease prediction <ref type="bibr">[49]</ref>, and eQTL mapping <ref type="bibr">[23]</ref> among other applications.</p><p>Here, we study multi-task learning in a distributed setting, where each task is handled by a different machine and communication between machines is expensive. That is, each machine has access to data for a different task and needs to learn a predictor for that task, where machines communicate with each other in order to leverage the relationship between the tasks. This situation lies between a homogeneous distributed learning setting <ref type="bibr">[e.g. 36]</ref>, where all machines have data from the same source distribution, and inhomogeneous consensus problems <ref type="bibr">[e.g. 30, 11, 6]</ref> where the goal is to reach a single consensus predictor or iterate which is the same on all machines. The main argument for this setting is that if each machine indeed has access to different data (e.g. from a different geographical region or different types of users), as in the consensus problems studied by <ref type="bibr">Balcan et al. [6]</ref>, then we should allow a different predictor for each distribution, instead of insisting on a single consensus predictor, while still trying to leverage the relationship and similarity between data distributions, as in classical multi-task learning. As was recently pointed out by <ref type="bibr">Wang et al. [41]</ref>, allowing separate predictors for each task instead of insisting on a consensus predictor changes the fundamental nature of the distributed learning problem, allows for different optimization methods, and necessitates a different analysis approach, more similar to homogeneous distributed learning as studied by <ref type="bibr">Shamir and Srebro [36]</ref>.</p><p>The success of multi-task learning relies on the relatedness between tasks. While <ref type="bibr">Wang et al. [41]</ref> studied tasks related through shared sparsity, here we turn to a more general, powerful and empirically more successful model of relatedness, where the predictors for different tasks lie in some (a-priori unknown) shared lowdimensional subspace and so the matrix of predictors is of low rank <ref type="bibr">[3,</ref><ref type="bibr">2,</ref><ref type="bibr">45,</ref><ref type="bibr">4]</ref>. In a shared sparsity model, information from all tasks is used to learn a subset of the input features which are then used by all tasks. In contrast, in a shared subspace model, novel features, which are linear functions of the input features, are learned. The model can thus be viewed as a two-layer neural network, with the bottom layer learned jointly across tasks and the top layer task-specific. Being arguably the most complex multi-layer network that we can fully analyze, studying such models can also serve as a gateway to using deeper networks for learning shared representations.</p><p>Multi-task learning with a shared subspace is wellstudied in a centralized setting, where data for all tasks are on the same machine, and some global centralized procedure is used to find a good predictor for each task. In such a situation, nuclear norm regularization is often used to leverage the low rank structure [e.g. 4, 2] and learning guarantees are known <ref type="bibr">([28]</ref> and see also Section 2). With the growth of modern massive data sets, where tasks and data often too big to handle on a single machine, it is important to develop methods also for the distributed setting. Unfortunately, the distributed multi-task setting is largely unexplored and we are not aware of any prior for on distributed multi-task learning with shared subspaces.</p><p>In this paper we focus on methods with efficient communication complexity (i.e. with as small as possible communication between machines), that can still leverage most of the statistical benefit of sharedsubspace multi-task learning. Although all our methods are also computationally tractable and can be implemented efficiently, we are less concerned here with minimizing the runtime on each machine separately, considering communication, instead, as the main bottleneck and the main resource to be minimized <ref type="bibr">[8]</ref>. This is similar to the focus in distributed optimization approaches such as ADMM <ref type="bibr">[11]</ref> and DANE <ref type="bibr">[37]</ref> where optimization within each machine is taken as an atomic step.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contribution</head><p>The main contributions of this article are:</p><p>&#8226; Present and formalize the shared-subspace multitask learning <ref type="bibr">[4]</ref> in the novel distributed multitask setting, identifying the relevant problems and possible approaches. We analyze two baselines, several representative first-order distributed optimization methods, with careful sample and communication complexity analysis.</p><p>&#8226; We proposed and analyzed two subspace pursuit approaches which learns the shared representation in a greedy fashion, which leverage the lowdimensional predictive structure in a communication efficient way.</p><p>&#8226; We conducted comprehensive experimental comparisons of the discussed approaches on both simulated and real datasets, where we demonstrated that the proposed approaches are more communication efficient than first-order convex optimization methods.</p><p>Table <ref type="table">1</ref> summarized the approaches studied in this paper, which will be discussed in detail in the following sections.</p><p>Homogeneous, Inhomogeneous and Multi-Task Distributed Learning. We briefly review the relationship between homogeneous, inhomogeneous and multi-task learning, as recently presented by <ref type="bibr">Wang et al. [41]</ref>.</p><p>A typical situation considered in the literature is one in which data on different machines are all drawn i.i.d from the same source distribution. In this setting, tasks on different machines are all the same, which should be taken advantage of in optimization <ref type="bibr">[37]</ref>. Furthermore, as each machine has access to samples from the source distribution it can perform computations locally, without ever communicating with other machines. While having zero communication cost, this approach does not compare favorably with the centralized approach, in which all data are communicated to the central machine and used to obtain one predictor, when measured in terms of statistical efficiency. The goal in this setting is to obtain performance close to that of the centralized approach, using the same number of samples, but with low communication and computation costs <ref type="bibr">[36,</ref><ref type="bibr">20,</ref><ref type="bibr">48,</ref><ref type="bibr">47,</ref><ref type="bibr">26]</ref>. Another setting considered in the distributed optimization literature is that of consensus optimization. Here each machine has data from a different distribution and the goal is to find one vector of coefficients that is good for all the separate learning or optimization problems <ref type="bibr">[11,</ref><ref type="bibr">30,</ref><ref type="bibr">6]</ref>.</p><p>The difficulty of consensus problems is that the local objectives might be rather different, and, as a result, one can obtain lower bounds on the amount of communication that must be exchanged in order to reach a joint optimum.</p><p>In this paper we suggest a novel setting that combines aspects of the above two settings. On one hand, we assume that each machine has a different source distributions D j , corresponding to a different task, as in consensus problems. For example, each machine serves a different geographical location, or each is at a different hospital or school with different characteristics. But if indeed there are differences between the source distributions, it is natural to learn different predictors w j for each machine, so that w j is good for the distribution typical to that machine. In this regard, our distributed multi-task learning problem is more similar to single-source problems, in that machines could potentially learn on their own given enough samples and enough time. Furthermore, availability of other machines just makes the problem easier by allowing  transfer between the machine, thus reducing the sample complexity and potentially runtime. The goal, then, is to leverage as much transfer as possible, while limiting communication and runtime. As with singlesource problems, we compare our method to the two baselines, where we would like to be much better than the local approach, achieving performance nearly as good as the centralized approach, but with minimal communication and efficient runtime.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Setting, Formulation and Baselines</head><p>We consider a setting with m tasks, each characterized by a source distribution D j (X, Y ) over feature vectors X &#8712; R p and associated labels Y , and out goal is to find linear predictors w 1 , . . . , w m &#8712; R p minimizing the overall expected loss (risk) across tasks:</p><p>where for convenience we denote W &#8712; R p&#215;m for the matrix with columns w i , and &#8467;(&#8226;, &#8226;) is some specified instantaneous loss function.</p><p>In the learning setting, we cannot observe L(W ) directly and only have access to i.i.d. sample {x ji , y ji } nj i=1</p><p>from each distribution D j , j = 1, . . . , m. For simplicity of presentation, we will assume that n j = n, j = 1, . . . , m, throughout the paper. We will denote the empirical loss L n (W ) =<ref type="foot">foot_0</ref> m m j=1 L nj (w j ) where</p><p>is the local (per-task) empirical loss.</p><p>We consider a distributed setting, where each task is handled on one of m separate machines, and each machine j has access only to the samples drawn from D j . Communication between the machines is by sending real-valued vectors. Our methods work either in a broadcast communication setting, where at each iteration each machine sends a vector which is received by all other machines, or in a master-at-the-center topology where each machine sends a vector to the master node, whom in turn performs some computation and broadcasts some other vectors to all machines. Either way, we count to total number of vectors communicated.</p><p>As in standard agnostic-PAC type analysis, our goal will be to obtain expected loss L(W ) which is not much larger then the expected loss of some (unknown) reference predictor 1 W * , and we will measure the excess error over this goal. To allow obtaining such guarantees we will assume:</p><p>Assumption 2.1. The loss function &#8467;(&#8226;) is 1-Lipschitz and bounded 2 by 1, be twice differentiable and Hsmooth, that is</p><p>All the data points are bounded by unit length, i.e.</p><p>and the reference predictors have bounded norm:</p><p>The simplest approach, which we refer to as Local, is to learn a linear predictor on each machine independently of other machines. This single task learning approach ignores the fact that the tasks are related and that sharing information between them could improve statistical performance. However, the communication cost for this procedure is zero, and with enough samples it can still drive the excess error to zero. However, compared to procedures discussed later, sample complexity (number of samples n required to achieve small excess error) is larger. A standard Rademacher complexity argument <ref type="bibr">[7]</ref> gives the following generalization guarantee, which is an extension of Theorem 26.12 in Shalev-Shwartz and Ben-David [33].</p><p>Proposition 2.2. Suppose Assumption 2.1 holds.</p><p>Then with probability at least 1&#948;,</p><p>where</p><p>That is, in order to ensure &#491; excess error, we need</p><p>samples from each task.</p><p>At the other extreme, if we ignore all communication costs, and, e.g. communicate all data to a single machine, we can significantly leverage the shared subspace. To understand this, we will first need to introduce two assumptions: one about the existence of a shared subspace (i.e. that the reference predictor is indeed low-rank), and the other about the spread of the data:</p><p>Since the data is bounded, we always have 1 &#8804; p &#8804; p, with p being a measure of how spread out the data is in different direction. A value of 1 = p indicates the data is entirely contained in a one-dimensional line. In this case, the predictor matrix will also always be rankone, imposing a low-rank structure is meaningless and we can't expect to gain from it. However, when p is close to p, or at least high, the data is spread in many directions and the low-rank assumption is meaningful. We can think of p as the "effective dimensionality" of the data, and hope to gain when r &#8810; p.</p><p>With these two assumptions in hand, we can think of minimizing the empirical error subject to a rank constraint on W . This is a hard and non-convex optimization task, but we can instead use the nuclear norm (aka trace-norm) ||W || * as a convex surrogate for the rank. This is because if Assumptions 2.1 and 2.3 hold, then we also have:</p><p>With this in mind, we can define the following centralized predictor:</p><p>which achieves the improved excess error guarantee:</p><p>Proposition 2.5. (Theorem 1 in <ref type="bibr">Maurer and Pontil [28]</ref>) Suppose Assumptions 2.1, 2.3 and 2.4 hold.</p><p>Then with probability at least 1&#948;,</p><p>The sample complexity per task, up to logarithmic factors, is thus only:</p><p>When p &#8811; m, this is a reduction by a factor of r/m. That is, it is as if we needed to only learn r linear predictors instead of m.</p><p>The problem is that a naive computation of W centralize requires collecting all data on a single machine, i.e. communicating O(n</p><p>p samples per machine. In the next Sections, we aim at developing methods of approximating W centralized using communication efficient methods, or computing an alternate predictor with similar statistical properties but using much less communication.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Distributed Convex Optimization</head><p>In this section, we study how to obtain the sharing benefit of the centralized approach using distributed convex optimization techniques, while keeping the communication requirements at low.</p><p>To enjoy the benefit of nuclear-norm regularization while avoid heavy communication cost of Centralize, a flexible strategy is to solve the convex objective (2.3) via distributed optimization techniques. Let W (t) be the solution at t-iteration for some iterative distributed optimization algorithm for the following constrained objective:</p><p>By the generalization error decompsition [10], t) will have the generalization error of order O(&#491;). Therefore in order to study the generalization performance, we will study how the optimization error decreases as the function of the number of iterations t.</p><p>Constrained vs Regularized Objective Note that the constrained objective (3.1) is equivalent to the following regularized objective with a proper choice of &#955;:</p><p>Though they are equivalent, specific optimization algorithms might sometimes be more suitable for one particular type of objectives<ref type="foot">foot_1</ref> . For convenience in the following discussion we didn't distinguish between these two formulations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Distributed Proximal Gradient</head><p>Maybe the simplest distributed optimization algorithm for (3.2) is the proximal gradient descent. It is not hard to see that computation of the gradient &#8711;L n (W ) can be easily done in a distributed way as the losses are decomposable across machines:</p><p>where</p><p>Thus each machine j needs to compute the gradient &#8711;L nj (w j ) on the local dataset and send it to the master. The master concatenates the gradient vectors to form the gradient matrix &#8711;L n (W ). Finally, the master computes the proximal step</p><p>which has the following closed form solution [12]: let</p><p>The algorithm is summarized in Algorithm 4 (in Appendix), which has well established convergence rates [5]:</p><p>To obtain &#949;-generalization error, the distributed proximal gradient descent requires O mHA 2 &#949; rounds of communication, with a total O mHA 2 p &#949; bits communications per machine.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Distributed Accelerated Gradient</head><p>It is also possible to use Nesterov's acceleration idea <ref type="bibr">[29]</ref>  &#8226; p bits communicated per machine to achieve &#949;-generalization error. The algorithm is summarized in Algorithm 5 (in Appendix), where the master maintains two sequences: W and Z. First, a proximal gradient update of W is done based on Z</p><p>and then Z is updated based on a combination of the current W and the difference with previous W</p><p>ADMM and DFW We also discuss the implementation and guarantees for two other popular optimization methods: ADMM and Frank-Wolfe, which are presented in the Appendix A and B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Greedy Representation Learning</head><p>In this section we propose two distributed algorithms which select the subspaces in a greedy fashion, instead of solving the nuclear norm regularized convex program.</p><p>Algorithm 1: DGSP: Distributed Gradient Subspace Pursuit.</p><p>for t = 1, 2, . . . do Workers:</p><p>for j = 1, 2, . . . , m do Each worker compute the its gradient direction &#8711;L nj (w Solve the projected ERM problem: Our greedy approach is inspired by the methods used for sparse signal reconstruction <ref type="bibr">[39,</ref><ref type="bibr">34]</ref>. Under the assumption that the optimal model W * is low-rank, say rank r, we can write W * as a sum of r rank-1 matrices:</p><p>where</p><p>In the proposed approach, the projection matrix U is learned in a greedy fashion. At every iteration, a new one-dimensional subspace is identified that leads to an improvement in the objective. This subspace is then included into the existing projection matrix. Using the new expanded projection matrix as the current feature representation, we refit the model to obtain the coefficient vectors V . In the distributed setting, there is a master that gathers local gradient information from each task. Based on this information, it then computes the subspace to be added to the projection matrix and sends it to each machine. The key step in the distributed greedy subspace pursuit algorithm is the addition of the subspace. One possible choice is the principle component of the gradient direction; after the master collected the gradient matrix &#8711;L n (W (t) ), it computes the top left and right singular vectors of &#8711;L n (W (t) ). Let (u, v) = SV(&#8711;L n (W (t) )) be the largest singular vectors of &#8711;L n (W (t) ). The left singular vector u is used as a new subspace to be added to the projection matrix U . This vector is sent to each machine, which then concatenate it to the projection matrix and refit the model with the new representation. Algorithm 1 details the steps.</p><p>Distributed gradient subspace pursuit (DGSP), detailed in Algorithm 1, creates subspaces that are orthogonal to each other, as shown in the following proposition which is proved in Appendix D:</p><p>Proposition 4.1. At every iteration of Algorithm 1, the columns of U are orthonormal.</p><p>Both the distributed gradient subspace pursuit and the distributed Frank-Wolfe use the leading singular vector of the gradient matrix iteratively. Moreover, leading singular vectors of the gradient matrix have been used in greedy selection procedures for solving low-rank matrix learning problems <ref type="bibr">[35,</ref><ref type="bibr">42]</ref>. However, DGSP utilize the learned subspace in a very different way: GECO [35] re-fit the low-rank matrix under a larger subspace which is spanned by all left and right singular vectors; while OR1MP [42] only adjust the linear combination parameters {a i } r i=1 of the rank-1 matrices. The DGSP algorithm do not restrict on the joint subspaces {u i v T i }, but focused on the low-dimensional subspace induced the projection matrix U , and estimate the task specific predictors V based on the learned representation.</p><p>Next, we present convergence guarantees for the distributed gradient subspace pursuit. First, note that the smoothness of &#8467;(&#8226;) implies the smoothness property for any rank-1 update.</p><p>Proposition 4.2. Suppose Assumption 2.1 holds. Then for any W and unit length vectors u &#8712; R p and v &#8712; R m , we have</p><p>We defer the proof in Appendix E. The following theorem states the number of iterations needed for the distributed gradient subspace pursuit to find an &#949;suboptimal solution.</p><p>Theorem 4.3. Suppose Assumption 2.1 holds. Then the distributed gradient subspace pursuit finds</p><p>We defer the proof in Appendix F. Theorem 4.3 tells us that for the distributed gradient subspace pursuit requires O mHA 2 &#949; iterations to reach &#491; accuracy. Since each iteration requires communicating p number, the communication cost per machine is O mHA 2 &#949; &#8226; p . In some applications this communication cost might be still too high and in order to improve it we will try to reduce the number of rounds of communication.</p><p>To that end, we develop a procedure that utilizes the second-order information to improve the convergence. Algorithm 6 describes the Distributed Newton Subspace Pursuit algorithm (DNSP). Note that distributed optimization with second-order information have been studied recently to achieve communication efficiency <ref type="bibr">[37,</ref><ref type="bibr">46]</ref>. Compared to the gradient based methods, the DNSP algorithm uses second-order information to find subspaces to work with. At each iteration, each machine computes the Newton direction</p><p>based on the current solution and sends it to the master. The master computes the overall Newton direction by concatenating the Newton direction for each task &#8710;L n (W ) = [&#8710;L n1 (w 1 ), &#8710;L n2 (w 2 ), . . . , &#8710;L nm (w m )] and computes the top singular vectors of &#8710;L n (W ). The top left singular vector u is is sent back to every machine, which is then concatenated to the current projection matrix. Each machine re-fits the predictors using the new representation. Note that at every iteration a Gram-Schmidt step is performed to ensure that the learned basis are orthonormal.</p><p>DNSP is a Newton-like method which uses second-order information, thus its generic analysis is not immediately apparent. However empirical results in the next section illustrate good performance of the proposed DNSP.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head><p>We first illustrate performance of different procedures on simulated data. We generate data according to</p><p>for regression problems and</p><p>for classification problems. We generate the lowrank W * as follows. We first generate two matrices A &#8712; R p&#215;r , B &#8712; R m&#215;r with entries sampled independently from a standard normal distribution. Then we extract the left and right singular vectors of AB T , denoted as U, V . Finally, we set W * = U SV T , where S is a diagonal matrix with exponentially decaying entries: diag(S) = [1, 1/1.5, 1/(1.5) 2 , . . . , 1/(1.5) r ].</p><p>The feature vectors x ji are sampled from a mean zero multivariate normal with the covariance matrix &#931; = (&#931; ab ) a,b&#8712;[p] , &#931; ab = 2 -|a-b| . The regularization parameters for all approaches were optimized to give the best prediction performance over a held-out validation dataset. For ProxGD and AccProxGD, we initialized the solution from Local. Our simulation results are averaged over 10 independent runs.</p><p>We investigate how the performance of various procedures changes as a function of problem parameters (n, p, m, r). We compare the following procedures: i) Local, where each machine solves an empirical risk minimization problem (ordinary least squares or logistic regression) . ii) Nuclear-norm regularization: which is a popular Centralize approach: all machines send their data to the master, the master solves a nuclear-norm regularized loss minimization problem. iii) Learning with the best representation (BestRep): which assumes the true projection matrix U is known, and just fit ordinal least squares or logistic regression model in the projected low-dimensional subspace . Note that this is not a practical approach since in practice we do not know the best low-dimensional representations of the data. iv) Convex optimization approach which runs distributed optimization algorithms over the nuclear norm-regularized objective: here we implemented and compared the following algorithms: distributed proximal gradient (ProxGD); distributed accelerated proximal gradient, (AccProxGD); distributed alternating direction method of multipliers (ADMM); distributed Frank-Wolfe (DFW) . the shared representation in multi-task learning.</p><p>&#8226; ADMM and AccProxGD perform reasonably well , especially ADMM. One reason for the effectiveness of ADMM is that for the problem of nuclear norm regularized multi-task learning considered here, the ADMM update solves regularized ERM problems at every iteration. ADMM and AccProxGD clearly outperform ProxGD.</p><p>&#8226; ProxGD and DGSP perform similarly. DGSP usually becomes worse as the iterations increases , while ProxGD converges to a global optimum of the nuclear norm regularized objective.</p><p>&#8226; DNSP is the most communication-efficient method, and usually converges to a solution that is slightly better compared to the optimum of the nuclear regularization. This shows that second-order information helps a lot in reducing the communication cost.</p><p>&#8226; The DFW performs the worst in most cases, even though DFW shares some similarity with DGSP in learning the subspace. The empirical results suggest the re-fitting step in DGSP is very important.</p><p>One-shot SVD truncation A natural question to ask is whether there exists a one-shot communication method for the shared representation problem considered here, that still matches the performance of centralized methods. One reasonable solution is to consider the following SVD truncation approach, which is based on the following derivation: consider the following well specified linear regression model:</p><p>where &#491; ji is drawn from mean-zero Gaussian noise. It is easy to verify the following equation for OLS estimation:</p><p>Since W local is just W * plus some mean-zero Gaussian noise, it is natural to consider the following low-rank matrix denoising estimator:</p><p>where the solution is a simple SVD truncation, and can be implemented in a one-shot way: each worker send its Local solution to the master, which then performs an SVD truncation step to maintain the top-r components and send the resulting estimation back to each worker, where U r , S r , V r are top-r components of U, S, V . Though this approach might work well for some simple scenarios, but will generally fail when the features are highly correlated: although the Local solution W local can output normal estimation of W * , the estimation noise i x ji x T ji -1 ( i &#491; ji x ji ) might be highly correlated (depend on the correlation between features), which makes the SVD truncation estimation not reliable. To illustrate this, consider a more complex simulation which follows the same setup as above setting, except that now the feature vectors x ji are sampled from a higher correlation matrix &#931; = (&#931; ab ) a,b&#8712;[p] , &#931; ab = 2 -0.1|a-b| . The regression simulation results are shown in Figure <ref type="figure">3</ref>, where we see that the one-shot SVD truncation approach does not significantly outperforms Local, sometimes even slightly worse.</p><p>Besides simulation, we also conducted extensive experiments on real world datasets, which are presented in Appendix H due to space limitation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>We studied the problem of distributed representation learning for multiple tasks, discussed the implementation and guarantees for distributed convex optimization methods, and presented two novel algorithms to learn low-dimensional projection in a greedy way, which can be communication more efficient than distributed convex optimization approaches. All approaches are extensively evaluated on simulation and real world datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Distributed Frank-Wolfe Method</head><p>Another approach we consider is the distributed <ref type="bibr">19,</ref><ref type="bibr">9)</ref>. This methods does not require performing SVD, which might bring additional computational advantages. Instead of directly minimizing the nuclear norm regularized objective, the Frank-Wolfe algorithm considers the equivalent constrained minimization problem min</p><p>At each step, Frank-Wolfe algorithm considers the following direction to update</p><p>)) is the leading singular vectors of &#8711;L n (W (t) ). The next iterate is obtained as</p><p>where &#947; is a step size parameter. To implement this algorithm in a distributed way, the master first collects the gradient matrix &#8711;L n (W (t) ) and computes u and v.</p><p>The vector v j u is sent to j-th machine, which performs the following update:</p><p>The algorithm is summarized in Algorithm 3. Similar to the distributed (accelerated) proximal gradient descent, the distributed Frank-Wolfe only requires communication of two p-dimensional vectors per round. Though computationally cheaper compared to other methods considered in this section, the distributed Frank-Wolfe algorithm enjoys similar convergence guarantees to the distributed proximal gradient descent (19), that is, after O mHA 2 &#949; iterations, the solution will be &#949; suboptimal.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Pseudocode of the algorithms D Proof of Proposition 4.1</head><p>Proof. It is sufficient to prove that at every iteration, the current projection matrix U and the subspace to be added u are orthogonal to each other. Note that by the optimality condition:</p><p>Since u is the leading left singular vector of &#8711;L n (W (t) ), we have U T u = 0. Each column of U has unit length, since it is a left singular vector of some matrix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Proof of Proposition 4.2</head><p>Proof. It is sufficient to prove that the largest eigenvalue of &#8711; 2 L n (W ) does not exceed H. Since &#8711; 2 L n (W ) is a block diagonal matrix, it is sufficient to show that for every block j &#8712; [m], the largest eigenvalue of the block &#8711; 2 L nj (w j ) is not larger than H. This is true by the H-smoothness of &#8467;(&#8226;) and the fact that the data points have bounded length:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F Proof of Theorem 4.3</head><p>Proof. By the smoothness of L n , we know</p><p>= 0 and therefore W (t) , &#8711;L n (W (t) ) = trace(V U T &#8711;L n (W (t) )) = 0. From convexity of L n (&#8226;), we have</p><p>Combining with the display above</p><p>Using Lemma G.1 in Appendix we know that after</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G An auxiliary lemma</head><p>Lemma G.1. (Lemma B.2 of Shalev-Shwartz et al.</p><p>(34)) Let x &gt; 0 and let &#949; 0 , &#949; 1 , ... be a sequence such that &#949; &#8804; &#949; tr&#949; 2 t for all t. Let &#949; be a positive scalar and t be a positive integer such that t &#8805; &#8968; 1 x&#949; &#8969;. Then &#949; t &#8804; &#949;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H Evaluation on Real World Datasets</head><p>We also evaluate discussed algorithms on several real world data sets, with 20% of the whole dataset as training set, 20% as held-out validation, then report the testing performance on the remaining 60%. For the real data, we have observed that adding &#8467; 2 regularization usually helps improving the generalization performance. For the Local procedure we added an &#8467; 2 regularization term (leads to ridge regression or &#8467; 2 regularized logistic regression). For DGSP and DNSP, we also add an &#8467; 2 regularization in finding the subspaces and refitting . We have worked on the following multitask learning datasets:</p><p>School. 5 The dataset consists of examination scores of students from London's secondary schools during the years <ref type="bibr">1985, 1986, 1987.</ref> There are 27 schoolspecific and student-specific features to describe each student.</p><p>The instances are divided by different schools, and the task is to predict the students' performance. We only considered schools with at least 100 records, which results in 72 tasks in total. The maximum number of records for each individual school is 260.</p><p>Computer Survey. The data is taken from a conjoint analysis experiment (27) which surveyed 180 persons about the probability of purchasing 20 kinds of personal computers. There are 14 variables for each computer, the response is an integer rating with scale 0 -10.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ATP. 6</head><p>The task here is to predict the airline ticket price (38). We are interested in the minimum prices next day for some specific observation date and departure date pairs. Each case is described by 411 features, and there are 6 target minimum prices for different airlines to predict. The sample size is 337.</p><p>Protein. Given the amino acid sequence, we are interested predicting the protein secondary structure (31). We tackle the problem by considering the following three binary classification tasks: coil vs helix, helix vs  Landmine. The data is collected from 19 landmine detection tasks (44). Each landmine field is represented by a 9-dimensional vector extracted from radar images, containing moment-based, correlation-based, energy ratio, and spatial variance features. The sample size for each task varies from 445 to 690. Cal500. 7 This music dataset (40) consists of 502 songs, where for each song 68 features are extracted. Each task is to predict whether a particular musically relevant semantic keyword should be an annotation for the song. We only consider tags with at least 50 times apperance, which results in 78 prediction tasks.</p><p>We compared various approaches as in the simulation study, except the BestRep as the best low-dimensional representation is unknown. We also compared with AltMin, which learns low-rank prediction matrix using the alternating minimization (21). The results are shown in Figure <ref type="figure">4</ref>. Since the labels for the real world classification datasets are often unbalanced, we report averaged area under the curve (AUC) instead of classification accuracy. We have the following observations:</p><p>&#8226; The distributed first-order approaches converge much slower than in simulations, especially on </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Despite the notation, W * need not be the minimizer of the expected loss. We can think of it as the minimizer inside some restricted hypothesis class, though all analysis and statements hold for any chosen reference predictor W * .</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>e.g. ADMM for regularized objective and Frank-Wolfe for constrained objective. Gradient descent methods can be adopted for both, leads to proximal and projected methods, respectively.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>For better visualization, here we omit the plot for DFW as its performance is significantly worse than others.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_3"><p>&#949; rounds of communication are needed to obtain &#949;-generalization error.</p></note>
		</body>
		</text>
</TEI>
