<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Vertical Federated Learning with Missing Features During Training and Inference</title></titleStmt>
			<publicationStmt>
				<publisher>The Thirteenth International Conference on Learning Representations</publisher>
				<date>04/24/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10592991</idno>
					<idno type="doi"></idno>
					
					<author>Pedro Valdeira</author><author>Shiqiang Wang</author><author>Yuejie Chi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Vertical federated learning trains models from feature-partitioned datasets across multiple clients, who collaborate without sharing their local data. Standard approaches assume that all feature partitions are available during both training and inference. Yet, in practice, this assumption rarely holds, as for many samples only a subset of the clients observe their partition. However, not utilizing incomplete samples during training harms generalization, and not supporting them during inference limits the utility of the model. Moreover, if any client leaves the federation after training, its partition becomes unavailable, rendering the learned model unusable. Missing feature blocks are therefore a key challenge limiting the applicability of vertical federated learning in real-world scenarios. To address this, we propose LASER-VFL, a vertical federated learning method for efficient training and inference of split neural network-based models that is capable of handling arbitrary sets of partitions. Our approach is simple yet effective, relying on the sharing of model parameters and on task-sampling to train a family of predictors. We show that LASER-VFL achieves an O(1/√T) convergence rate for nonconvex objectives and, under the Polyak-Łojasiewicz inequality, it achieves linear convergence to a neighborhood of the optimum. Numerical experiments show improved performance of LASER-VFL over the baselines. Remarkably, this is the case even in the absence of missing features. For example, for CIFAR-100, we see an improvement in accuracy of 19.3% when each of four feature blocks is observed with a probability of 0.5 and of 9.5%]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>In federated learning (FL), a set of clients collaborates to jointly train a model using their local data without sharing it <ref type="bibr">(Kairouz et al., 2021)</ref>. In horizontal FL, data is distributed by samples: each client holds a different set of samples, but all clients share the same feature space. In contrast, in vertical FL (VFL), data is distributed by features: each client holds a different part of the feature space for overlapping sets of samples. Whether an application is horizontal or vertical FL is dictated by how the data arises, as it is not possible to redistribute the data. This work focuses on VFL.</p><p>In VFL <ref type="bibr">(Liu et al., 2024)</ref>, the global dataset $\mathcal{D} := \{x^1, \dots, x^N\}$, where $N$ is the number of samples, is partitioned across clients $\mathcal{K} := \{1, \dots, K\}$. Each client $k \in \mathcal{K}$ typically holds a local dataset $\mathcal{D}_k := \{x_k^1, \dots, x_k^N\}$, where $x_k^n$ is the block of features of sample $n$ observed by client $k$. We have that $x^n = (x_1^n, \dots, x_K^n)$. Unlike horizontal FL, in VFL, different clients collect local datasets with distinct types of information (features). Such setups (e.g., an online retail company and a social media platform holding different features on shared users) typically involve entities from different sectors, reducing competition and increasing the incentive to collaborate. To train VFL models without sharing the local datasets, split neural networks <ref type="bibr">(Ceballos et al., 2020)</ref> are often considered.</p><p>In split neural networks, each client $k$ has a representation model $f_k$, parameterized by $\theta_{f_k}$. The representations extracted by the clients are then used as input to a fusion model $g$, parameterized by $\theta_g$. This fusion model can be at one of the clients or at a server. Thus, to learn the parameters $\theta_{\mathcal{K}} := (\theta_{f_1}, \dots, \theta_{f_K}, \theta_g)$ of the resulting predictor $h$, we can solve the following problem:</p><p>$$\min_{\theta_{\mathcal{K}}} \; \frac{1}{N} \sum_{n=1}^{N} \ell\big(h(x^n; \theta_{\mathcal{K}}), y^n\big) \quad \text{where} \quad h(x^n; \theta_{\mathcal{K}}) := g\big(f_1(x_1^n; \theta_{f_1}), \dots, f_K(x_K^n; \theta_{f_K}); \theta_g\big), \tag{1}$$</p><p>with $\ell$ denoting a loss function and $y^n$ the label of sample $n$, which we assume to be held by the same entity (client or server) as the fusion model. We illustrate this family of models in Figure <ref type="figure">1a</ref>.</p><p>To address missing feature blocks during both training and inference, we propose LASER-VFL, a simple yet effective solution that leverages the sharing of model parameters alongside a task-sampling mechanism. By fully utilizing all data (see Figure <ref type="figure">1b</ref>), our method leads to significant performance improvements, as demonstrated in Figure <ref type="figure">1c</ref>. Remarkably, our approach even surpasses standard VFL when all samples are fully observed. We attribute this to a dropout-like regularization effect introduced by task sampling.</p><p>Our contributions. The main contributions of this work are as follows.</p><p>&#8226; We propose LASER-VFL, a simple, hyperparameter-free, and efficient VFL method that is flexible to varying sets of feature blocks during training and inference. To the best of our knowledge, this is the first method to achieve such flexibility without wasting either training or test data.</p><p>&#8226; We show that LASER-VFL converges at an $O(1/\sqrt{T})$ rate for nonconvex objectives. Furthermore, under the Polyak-&#321;ojasiewicz (PL) inequality, it achieves linear convergence to a neighborhood around the optimum.</p><p>&#8226; Numerical experiments show that LASER-VFL consistently outperforms baselines across multiple datasets and varying data availability patterns. It is more robust to missing features and, notably, still outperforms standard VFL when all features are available.</p><p>Related work. 
VFL shares challenges with horizontal FL, such as communication efficiency <ref type="bibr">(Liu et al., 2022;</ref><ref type="bibr">Valdeira et al., 2025)</ref> and privacy preservation <ref type="bibr">(Yu et al., 2024)</ref>, but also faces unique obstacles, such as missing feature blocks. During training, these unavailable partitions render the observed blocks of other clients unusable, and at test time, they can prevent inference altogether. Therefore, most VFL literature assumes that all features are available for both training and inference, an often unrealistic assumption that has hindered broader adoption of VFL <ref type="bibr">(Liu et al., 2024)</ref>.</p><p>To address the problem of missing features during VFL training, some works use nonoverlapping, or nonaligned, samples (that is, samples with missing feature blocks) to improve generalization. In particular, <ref type="bibr">Feng (2022)</ref> and <ref type="bibr">He et al. (2024)</ref> apply self-supervised learning to leverage nonaligned samples locally for better representation learning, while using overlapping samples for collaborative training of a joint predictor. Alternatively, <ref type="bibr">Kang et al. (2022b)</ref>, <ref type="bibr">Yang et al. (2022)</ref>, and <ref type="bibr">Sun et al. (2023)</ref> employ semi-supervised learning to take advantage of nonaligned samples. Although these methods enable VFL to utilize data that conventional approaches would discard, they still train a joint predictor and thus require all features to be available for inference, which remains a limitation.</p><p>A recent line of research leverages information from the entire federation to train local predictors. This can be achieved via transfer learning, as in the works by <ref type="bibr">Feng &amp; Yu (2020)</ref> and Kang et al. (2022a) for overlapping training samples and in the works by Liu et al. (2020) and Feng et al. (2022) for handling missing features. 
Knowledge distillation is another approach, as used by Ren et al. (2022) and Li et al. (2023b) for overlapping samples and by Li et al. (2023a) and Huang et al. (2023), who leverage nonaligned samples when training local predictors (only the latter considers scenarios with more than two clients). Xiao et al. (2024) recently proposed using a distributed generative adversarial network for collaborative training on nonoverlapping data and synthetic data generation. These methods yield local predictors that outperform naive local approaches, but fail to utilize valuable information when other feature blocks are available during inference.</p><p>A few recent works enable a varying number of clients to collaborate during inference. Sun et al. (2024) employ party-wise dropout during training to mitigate performance drops from missing feature blocks at inference; Gao et al. (2024) introduce a multi-stage approach with complementary knowledge distillation, enabling inference with different subsets of clients for binary classification tasks; and Ganguli et al. (2024) extend Sun et al. (2024) to deal with communication failures during inference in cross-device VFL. However, all of these methods assume fully observed training data, and they all lack convergence guarantees.</p><p>In contrast to prior work, LASER-VFL can handle any subset of feature blocks being present during both training and inference without wasting data, while requiring only minor modifications to the standard VFL approach. It differs from conventional VFL solely in its use of parameter-sharing and task-sampling mechanisms, without the need for additional stages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">DEFINITIONS AND PRELIMINARIES</head><p>Our global dataset $\mathcal{D}$ is drawn from an input space $\mathcal{X}$ which is partitioned into $K$ feature spaces</p><p>$$\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_K, \quad \text{such that} \quad \mathcal{D}_k \subseteq \mathcal{X}_k \;\; \forall k,$$</p><p>where, recall, $\mathcal{D}_k$ is the local dataset of client $k$. Ideally, we would train a general, unconstrained predictor $\tilde{h} : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y}$ is the label space, by solving $\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \ell(\tilde{h}(x^n; \theta), y^n)$. However, VFL brings the additional constraint that this must be achieved without sharing the local datasets. That is, for all $k$, the local dataset $\mathcal{D}_k$ must remain at client $k$. As mentioned in Section 1, standard VFL methods approximate $\tilde{h}$ by $h : \mathcal{X} \to \mathcal{Y}$, as defined in (1), allowing for collaborative training without sharing the local datasets, but they cannot handle missing features during either training or inference.</p><p>Another way to train predictors without sharing local data is to learn local predictors. In particular, each client $k \in \mathcal{K}$ can learn the parameters $\theta_k$ of a predictor $h_k : \mathcal{X}_k \to \mathcal{Y}$ to approximate $\tilde{h}$:</p><p>$$\min_{\theta_k} \; \frac{1}{N} \sum_{n=1}^{N} \ell\big(h_k(x_k^n; \theta_k), y^n\big) \quad \text{where} \quad h_k(x_k^n; \theta_k) := g_k\big(f_k(x_k^n; \theta_{f_k}); \theta_{g_k}\big). \tag{2}$$</p><p>The representation models $f_k : \mathcal{X}_k \to \mathcal{E}_k$, where $\mathcal{E}_k$ is the representation space of client $k$, are as in (<ref type="formula">1</ref>), yet the fusion models $g_k : \mathcal{E}_k \to \mathcal{Y}$ differ from $g$ in (<ref type="formula">1</ref>) and $h_k$ differs from $h$. This approach is useful in that $h_k$ allows client $k$ to perform inference (and be trained) independently of other clients, but it does not make use of the features observed by clients $j \neq k$.</p><p>More generally, to achieve robustness to missing blocks in the test data while avoiding wasting other features, we wish to be able to perform inference based on any possible subset of blocks, $\mathcal{P}(\mathcal{K}) \setminus \{\emptyset\}$, where $\mathcal{P}(\mathcal{K})$ denotes the power set of $\mathcal{K}$. Standard VFL allows us to obtain a predictor for the blocks $\mathcal{K}$ and the local approach provides us with predictors for the singletons $\{\{i\} : i \in \mathcal{K}\}$ in the power set. However, none of the other subsets of $\mathcal{K}$ is covered by either approach.</p><p>A naive way to achieve this would be to train a predictor for each set $\mathcal{I} \in \mathcal{P}(\mathcal{K}) \setminus \{\emptyset\}$ in a decoupled manner. That is, we could train each of the following predictors $\{h_{\mathcal{I}} : \prod_{k \in \mathcal{I}} \mathcal{X}_k \to \mathcal{Y}\}$ independently using the collaborative training approach of standard VFL:</p><p>$$\min_{\theta_{\mathcal{I}}} \; \frac{1}{N} \sum_{n=1}^{N} \ell\big(h_{\mathcal{I}}((x_i^n)_{i \in \mathcal{I}}; \theta_{\mathcal{I}}), y^n\big), \quad \forall \mathcal{I} \in \mathcal{P}(\mathcal{K}) \setminus \{\emptyset\}, \tag{3}$$</p><p>where each $h_{\mathcal{I}}$ is a split neural network as in (1), with its own representation models for the blocks in $\mathcal{I}$ and its own fusion model $g_{\mathcal{I}}$, jointly parameterized by $\theta_{\mathcal{I}}$.</p><p>The fusion models $\{g_{\mathcal{I}}\}$ can either be at one of the clients or at the server. The set of predictors in (3) includes the local predictors $\{h_k\}$ in (2) and the standard VFL predictor $h$ in (1), but also the predictors for all the other nonempty sets in the power set $\mathcal{P}(\mathcal{K})$. This approach addresses the issue of limited flexibility and robustness in prior methods, which struggle with varying numbers of available or participating clients during inference. However, by requiring the independent training of $2^K - 1$ distinct predictors, it introduces a new challenge: the number of models would grow exponentially with the number of clients, $K$. Consequently, the associated memory, computation, and communication costs would also increase exponentially.</p><p>Further, if the fusion models in (3) are all held by a single entity, this setup introduces a dependency of all clients on that entity. To enhance robustness against clients dropping from the federation, we would like each client to be able to ensure inference whenever it has access to its corresponding block of the sample. That is, we want each client $k \in \mathcal{K}$ to train a predictor for every nonempty subset of $\mathcal{K}$ that includes $k$. We define this family of subsets as $\mathcal{P}_k(\mathcal{K}) := \{\mathcal{I} \in \mathcal{P}(\mathcal{K}) : k \in \mathcal{I}\}$. 
This allows client k to make predictions even if the information from blocks observed by other clients is not available and thus cannot be leveraged to improve performance.</p><p>In the next section, we present our method, LASER-VFL, which enables the efficient training of a family of predictors that can handle scenarios where features from an arbitrary set of clients are missing. We have included a table of notation and a diagram illustrating our method in Appendix B.</p></div>
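<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the counting argument of this section concrete, the following is a minimal Python sketch (an illustration only, not part of the method) that enumerates the nonempty subsets of the clients and the family of subsets containing a given client, for four clients. The helper names nonempty_subsets and subsets_containing are hypothetical and introduced here only for illustration.</p><p><![CDATA[
```python
from itertools import combinations


def nonempty_subsets(clients):
    """All nonempty subsets of `clients`, i.e., the power set minus the empty set."""
    clients = sorted(clients)
    return [frozenset(c)
            for size in range(1, len(clients) + 1)
            for c in combinations(clients, size)]


def subsets_containing(k, clients):
    """The family P_k(K): every subset of `clients` that contains client k."""
    return [s for s in nonempty_subsets(clients) if k in s]


K = 4
clients = range(1, K + 1)
all_tasks = nonempty_subsets(clients)
per_client = {k: subsets_containing(k, clients) for k in clients}

print(len(all_tasks))                           # 15, i.e., 2**K - 1
print(len(per_client[1]))                       # 8, i.e., 2**(K - 1)
print(sum(len(v) for v in per_client.values())) # 32, i.e., K * 2**(K - 1)
```
]]></p><p>Training one decoupled predictor per entry of all_tasks is the exponential cost discussed above; the per-client families are the sets each client would ideally be able to predict from.</p></div>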
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">OUR METHOD</head><p>Key idea. The main challenge, when striving for each client to be able to perform inference from any subset of blocks that includes its own, is to circumvent the exponential complexity arising from the combinatorial explosion of possible combinations of blocks. To address this, in LASER-VFL, we share model parameters across the predictors leveraging different subsets of blocks and train them so that the representation models allow for good performance across the different combinations of blocks. Further, we train the fusion model at each client to handle any combination of representations that includes its own. To train the predictors on an exponential number of combinations of missing blocks while avoiding an exponential computational complexity, we employ a sampling mechanism during training, which allows us to essentially estimate an exponential combination of objectives with a subexponential complexity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">A TRACTABLE FAMILY OF PREDICTORS SHARING MODEL PARAMETERS</head><p>In our approach, each client $k \in \mathcal{K}$ resorts to a set of predictors $\{h_{(k,\mathcal{I})} : \mathcal{I} \in \mathcal{P}_k(\mathcal{K})\}$. Each predictor $h_{(k,\mathcal{I})} : \prod_{i \in \mathcal{I}} \mathcal{X}_i \to \mathcal{Y}$ has the same target, but a different domain. Therefore, we say that each predictor performs a distinct task. We define the predictors of client $k$ as follows:</p><p>$$h_{(k,\mathcal{I})}\big((x_i)_{i \in \mathcal{I}}; \theta_{(k,\mathcal{I})}\big) := g_k\Big(\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} f_i(x_i; \theta_{f_i}); \theta_{g_k}\Big), \tag{4}$$</p><p>where $\theta_{(k,\mathcal{I})} = ((\theta_{f_i} : i \in \mathcal{I}), \theta_{g_k})$ and $|\mathcal{I}|$ denotes the cardinality of set $\mathcal{I}$.</p><p>Sharing model parameters. Note that, although we can see (4) as $K \times 2^{K-1}$ predictors, the predictors are made up of different combinations of only $K$ representation models and $K$ fusion models. That is, we only require parameters $\theta := (\theta_{f_1}, \dots, \theta_{f_K}, \theta_{g_1}, \dots, \theta_{g_K})$.</p><p>In particular, in contrast to (3), where the predictors used different representation models for each task, in (4), we share the representation models across predictors. Similarly to the representation models, we have a single fusion model per client.</p><p>It is also important to note that each fusion model $g_k$ must handle the representations of any set $\mathcal{I} \in \mathcal{P}_k(\mathcal{K})$, regardless of its size.</p><p>By employing a nonparameterized aggregation mechanism on the extracted representations whose output does not depend on the number of aggregated representations, the same fusion model can handle different sets of representations of different sizes. In particular, we opt for an average (rather than, say, a sum) because neural networks perform better when their inputs have similar distributions <ref type="bibr">(Ioffe &amp; Szegedy, 2015)</ref>. 
Note that the representation model can be adjusted so that using averaging instead of concatenation as the aggregation mechanism does not reduce the flexibility of the overall model, but simply shifts it from the fusion model to the representation models, as explained in Appendix A.2.</p><p>With this approach, we have avoided an exponential memory complexity by sharing the weights across an exponential number of predictors such that we have K representation models and K fusion models. In the next subsection, we will go over the efficient training of this family of predictors.</p></div>
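<div xmlns="http://www.tei-c.org/ns/1.0"><p>The parameter sharing described in this section can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: representation and fusion models are replaced by random linear maps, and the names make_linear, apply_linear, average, and predict are hypothetical. The point is only that, because representations are averaged, one fusion model per client accepts any set of blocks that includes its own.</p><p><![CDATA[
```python
import random


def make_linear(in_dim, out_dim, rng):
    """Toy linear map (a hypothetical stand-in for a neural network)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Multiply the weight matrix (a list of rows) by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def average(vectors):
    """Element-wise mean; the output size does not depend on len(vectors)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict(k, blocks, x, rep_models, fusion_models):
    """Predictor of client k for the set `blocks`, which must contain k."""
    if k not in blocks:
        raise ValueError("client k can only use sets of blocks that include its own")
    reps = [apply_linear(rep_models[i], x[i]) for i in sorted(blocks)]
    return apply_linear(fusion_models[k], average(reps))


rng = random.Random(0)
K, block_dim, rep_dim, n_classes = 3, 5, 4, 2
rep_models = {i: make_linear(block_dim, rep_dim, rng) for i in range(1, K + 1)}
fusion_models = {i: make_linear(rep_dim, n_classes, rng) for i in range(1, K + 1)}
x = {i: [rng.gauss(0.0, 1.0) for _ in range(block_dim)] for i in range(1, K + 1)}

# The same 2K models serve every task: client 1 alone, or with clients 2 and 3.
local_logits = predict(1, {1}, x, rep_models, fusion_models)
joint_logits = predict(1, {1, 2, 3}, x, rep_models, fusion_models)
```
]]></p><p>Here all representation models are assumed to map into a common representation dimension, which is what allows the average to be taken across clients.</p></div>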
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">EFFICIENT TRAINING VIA TASK-SAMPLING</head><p>The family of predictors introduced in Section 3.1 can efficiently perform inference for an exponential number of tasks. We now discuss how to train this family of predictors while avoiding exponential computation and communication complexity during training. The key lies in carefully sampling tasks at each gradient step.</p><p>Our optimization problem. As noted in Section 1, we want to address missing features not only during inference, but also during training. Let $\mathcal{K}_o^n$ be the set of observed feature blocks of sample $n$; we assume that $\mathcal{K}_o := \{\mathcal{K}_o^n : \forall n\}$ follows a random distribution. Recalling (4), we denote the loss $\ell$ of sample $n$, with label $y^n$, for the predictor of client $k$ using blocks $\mathcal{I}$ by</p><p>$$\ell_{nk\mathcal{I}}(\theta) := \ell\big(h_{(k,\mathcal{I})}((x_i^n)_{i \in \mathcal{I}}; \theta_{(k,\mathcal{I})}), y^n\big).$$</p><p>For sample $n$ and client $k$, the (weighted) prediction loss across all observed subsets of blocks that include $k$ is denoted by:</p><p>$$\mathcal{L}^o_{nk}(\theta) := \sum_{\mathcal{I} \in \mathcal{P}_k(\mathcal{K}_o^n)} \frac{1}{|\mathcal{I}|} \, \ell_{nk\mathcal{I}}(\theta),$$</p><p>where superscript $o$ indicates that this loss concerns the observed dataset. The normalization by $|\mathcal{I}|$ avoids assigning a larger weight to tasks concerning a larger number of blocks, since the set $\mathcal{I}$ is used for prediction at $|\mathcal{I}|$ different clients, whose losses are combined in the loss of sample $n$:</p><p>$$\mathcal{L}^o_n(\theta) := \sum_{k \in \mathcal{K}_o^n} \mathcal{L}^o_{nk}(\theta).$$</p><p>Now, to train the predictors in (4), we consider the following optimization problem:</p><p>$$\min_{\theta} \; \Big\{ \mathcal{L}(\theta) := \mathbb{E}\big[\mathcal{L}^o(\theta)\big] \Big\}, \quad \text{with} \quad \mathcal{L}^o(\theta) := \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}^o_n(\theta), \tag{5}$$</p><p>where the expectation is over the set of observed feature blocks, $\mathcal{K}_o$.</p><figure type="algorithm"><head>Algorithm 1: LASER-VFL Training</head><p>Input: initial point $\theta^0$, training data $\mathcal{D}$.</p><list><item>for $t = 0, \dots, T-1$ do<list><item>Clients $\mathcal{K}$ sample the same mini-batch $\mathcal{B}^t$ with $\mathcal{K}_o^{(t)} = \mathcal{K}_o^n$, $\forall n \in \mathcal{B}^t$, via a shared seed.</item><item>for $k \in \mathcal{K}_o^{(t)}$ in parallel do<list><item>Broadcast $f_k^t$ and receive $\{f_i^t : i \in \mathcal{K}_o^{(t)}, i \neq k\}$.</item><item>Sample a task-defining set of blocks $\mathcal{S}^t_{kj} \sim \mathcal{U}(\mathcal{P}^j_k(\mathcal{K}_o^{(t)}))$ for $j = 1, \dots, K_o^t$.</item><item>Compute $\hat{\mathcal{L}}_{\mathcal{B}^t \mathcal{S}^t_k}(\theta^t)$, as in (8), and backpropagate over fusion model $g_k$.</item><item>Send the derivative of $\hat{\mathcal{L}}_{\mathcal{B}^t \mathcal{S}^t_k}(\theta^t)$ with respect to the representation of each client $i \in \mathcal{K}_o^{(t)}$ and receive theirs.</item><item>Sum the received gradients and backpropagate over $f_k$.</item></list></item><item>Update the weights $\theta^{t+1} = \theta^t - \eta \nabla \hat{\mathcal{L}}_{\mathcal{B}^t \mathcal{S}^t}(\theta^t)$, where $\hat{\mathcal{L}}_{\mathcal{B}^t \mathcal{S}^t}(\theta^t)$ is as in (<ref type="formula">7</ref>).</item></list></item></list></figure><p>Note that, if different blocks of features have different probabilities of being observed, we can also normalize the loss with respect to the different probabilities, which can be computed during the entity alignment stage of VFL (see below), before taking the expectation in (5). We omit this here for simplicity. We assume that the data is missing completely at random <ref type="bibr">(Zhou et al., 2024)</ref>.</p><p>Looking at (5), we see that, while we are interested in tackling $K \times 2^{K-1}$ different tasks, which falls in the realm of multi-objective optimization and multi-task learning <ref type="bibr">(Caruana, 1997)</ref>, we opt for combining them into a single objective, rather than employing a specialized multi-task optimizer. This corresponds to a weighted form of unitary scalarization <ref type="bibr">(Kurin et al., 2022)</ref>.</p><p>Our optimization method. To solve (5), we employ a gradient-based method, summarized in Algorithm 1, which we now describe. At iteration $t$, with the current iterate $\theta^t$, we select mini-batch indices $\mathcal{B}^t \subseteq [N]$, where $[N] := \{1, \dots, N\}$, such that the set of observed feature blocks for each sample $n \in \mathcal{B}^t$, denoted by $\mathcal{K}_o^n$, is identical. We denote this common set by $\mathcal{K}_o^{(t)}$ and its size by $K_o^t := |\mathcal{K}_o^{(t)}|$. After the clients broadcast their representations, each client in $\mathcal{K}_o^{(t)}$ holds the representations corresponding to the set of observed feature blocks of the current mini-batch, $\{f_i^t : i \in \mathcal{K}_o^{(t)}\}$. To train the model on all possible combinations of feature blocks without incurring exponential computational complexity at each iteration, each client $k \in \mathcal{K}_o^{(t)}$ samples $K_o^t$ tasks from $\mathcal{P}_k(\mathcal{K}_o^{(t)})$, the subset of the power set of $\mathcal{K}_o^{(t)}$ whose elements contain $k$. These sampled tasks estimate the mini-batch loss $\mathcal{L}^o_{\mathcal{B}^t}(\theta^t)$ without computing predictions for an exponential number of tasks (note that $K_o^t \leq K$). We use this estimate to update the model parameters. More precisely, the task-defining feature blocks sampled by client $k$ at step $t$ are given by $\mathcal{S}^t_k := \{\mathcal{S}^t_{k1}, \dots, \mathcal{S}^t_{kK_o^t}\}$, where set $\mathcal{S}^t_{ki}$ contains $i$ feature blocks. Each set $\mathcal{S}^t_{ki}$ is sampled from a uniform distribution over the subset of $\mathcal{P}_k(\mathcal{K}_o^{(t)})$ that contains its sets of size $i$. That is:</p><p>$$\mathcal{S}^t_{ki} \sim \mathcal{U}\big(\mathcal{P}^i_k(\mathcal{K}_o^{(t)})\big), \quad \text{where} \quad \mathcal{P}^i_k(\mathcal{K}_o^{(t)}) := \{\mathcal{I} \in \mathcal{P}_k(\mathcal{K}_o^{(t)}) : |\mathcal{I}| = i\}. \tag{6}$$</p><p>This task-sampling mechanism allows us to estimate the loss with linear complexity, instead of the exponential cost of exact computation. More precisely, we use the following loss estimate:</p><p>$$\hat{\mathcal{L}}_{\mathcal{B}^t \mathcal{S}^t}(\theta^t) := \sum_{k \in \mathcal{K}_o^{(t)}} \hat{\mathcal{L}}_{\mathcal{B}^t \mathcal{S}^t_k}(\theta^t), \tag{7}$$</p><p>where $\mathcal{S}^t := \{\mathcal{S}^t_1, \dots, \mathcal{S}^t_K\}$ and each $\hat{\mathcal{L}}_{\mathcal{B}^t \mathcal{S}^t_k}(\theta^t)$, computed at client $k$, is given by:</p><p>$$\hat{\mathcal{L}}_{\mathcal{B}^t \mathcal{S}^t_k}(\theta^t) := \frac{1}{|\mathcal{B}^t|} \sum_{n \in \mathcal{B}^t} \sum_{i=1}^{K_o^t} \frac{a_i^t}{i} \, \ell_{nk\mathcal{S}^t_{ki}}(\theta^t), \quad \text{with} \quad a_i^t := \binom{K_o^t - 1}{i - 1}. \tag{8}$$</p><p>The weighting $a_i^t$ counteracts the probability of each task being sampled, allowing us to obtain an unbiased estimate of the observed loss. Now, we use this to minimize (5) by doing the following update:</p><p>$$\theta^{t+1} = \theta^t - \eta \nabla \hat{\mathcal{L}}_{\mathcal{B}^t \mathcal{S}^t}(\theta^t). \tag{9}$$</p><p>After completing the training described in Algorithm 1, we can perform inference according to Algorithm 2. In this inference algorithm, each client $k \in \mathcal{K}_o'$ that observes a partition of the test sample $x'$ shares its corresponding representation of $x'$. Then, each such client $k$ uses $h_{(k,\mathcal{K}_o')}$ to predict the label.</p><figure type="algorithm"><head>Algorithm 2: LASER-VFL Inference</head><p>Input: test sample $x'$, trained parameters $\theta^T$.</p><list><item>The blocks of features $\mathcal{K}_o'$ of sample $x'$ are observed.</item><item>for $k \in \mathcal{K}_o'$ in parallel do<list><item>Broadcast the representation of $x_k'$ and receive those of the other clients in $\mathcal{K}_o'$.</item><item>Client $k$ predicts the label using $h_{(k,\mathcal{K}_o')}$.</item></list></item></list></figure><p>Block availability-aware entity alignment. Entity alignment is a process that precedes training in VFL where the samples of the different local datasets are matched so that their features are correctly aligned, thus allowing for collaborative model training <ref type="bibr">(Liu et al., 2024)</ref>. 
The identification of the available blocks for each sample can be performed during this prelude to VFL training, allowing for a batch selection procedure that ensures that each batch contains samples with the same observed blocks, which we use in our method.</p><p>Extensions to our method. It is important to highlight that our method can also be easily employed in situations where different clients hold different labels, as it naturally fits multi-task learning problems. Further, it can easily be applied to setups where any subset of K holds the labels, sufficing to drop fusion models from the clients which do not hold the labels or have them tackle unsupervised or self-supervised learning tasks instead.</p></div>
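<div xmlns="http://www.tei-c.org/ns/1.0"><p>The task-sampling mechanism of this section can be illustrated for a single client and a single sample with the sketch below. It is a simplified illustration, not the authors' code: the per-task loss is an arbitrary toy function, the function names are hypothetical, and the weight is assumed to be the number of size-i subsets containing client k, which is the inverse of the uniform sampling probability and is what makes the estimate unbiased.</p><p><![CDATA[
```python
from itertools import combinations
from math import comb, isclose
import random


def exact_client_loss(k, observed, task_loss):
    """Sum of task_loss(I) / |I| over every subset I of `observed` containing k."""
    others = sorted(b for b in observed if b != k)
    total = 0.0
    for size in range(1, len(observed) + 1):
        for rest in combinations(others, size - 1):
            total += task_loss(frozenset((k,) + rest)) / size
    return total


def sample_tasks(k, observed, rng):
    """One uniformly sampled set per size i = 1, ..., |observed|, each containing k."""
    others = sorted(b for b in observed if b != k)
    return [frozenset([k] + rng.sample(others, size - 1))
            for size in range(1, len(observed) + 1)]


def sampled_client_loss(k, observed, task_loss, rng):
    """Estimate of exact_client_loss that evaluates only |observed| tasks."""
    n_obs = len(observed)
    total = 0.0
    for size, task in enumerate(sample_tasks(k, observed, rng), start=1):
        weight = comb(n_obs - 1, size - 1)  # inverse of the sampling probability
        total += weight * task_loss(task) / size
    return total


observed = {1, 2, 3, 4}
toy_loss = lambda task: 0.1 * sum(task) + 1.0 / len(task)  # hypothetical per-task loss
rng = random.Random(0)
estimates = [sampled_client_loss(1, observed, toy_loss, rng) for _ in range(20000)]
# The running average of the estimates should approach the exact weighted loss.
print(exact_client_loss(1, observed, toy_loss), sum(estimates) / len(estimates))
```
]]></p><p>The exact loss evaluates a number of tasks that is exponential in the number of observed blocks, whereas each estimate evaluates one task per subset size.</p></div>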
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">CONVERGENCE GUARANTEES</head><p>In this section, we present convergence guarantees for our method. Let us start by stating the assumptions used in our results.</p><p>Assumption 1 ($L$-smoothness and finite infimum). Function $\mathcal{L}^o : \Theta \to \mathbb{R}$ is differentiable and there exists a positive constant $L$ such that:</p><p>The inequality above holds if $\mathcal{L}$ is $L$-smooth. We assume and define $\mathcal{L}^\star := \inf_{\theta} \mathcal{L}(\theta) &gt; -\infty$.</p><p>Assumption 2 (Unbiasedness). The mini-batch gradient estimate is unbiased. That is, we have that:</p><p>where the expectation is with respect to the mini-batch $\mathcal{B}$.</p><p>Assumption 3 (Bounded variance). There exist positive constants $\sigma_b$ and $\sigma_s$ such that, for all $\theta \in \Theta$ and $k \in \mathcal{K}$, we have that:</p><p>and the (conditional) expectations are with respect to mini-batch $\mathcal{B}$ and with respect to the tasks sampled at each client $k$, $\mathcal{S}_k$, respectively.</p><p>We use the following set in our results:</p><p>Let us start by presenting a lemma which we resort to in our proof of Theorem 1.</p><p>Lemma 1 (Unbiased update vector). If the mini-batch gradient estimate is unbiased (A2), then, for all $t \geq 0$:</p><p>where the expectation is with respect to mini-batch $\mathcal{B}^t$ and the sampled set of tasks $\mathcal{S}^t$.</p><p>Theorem 1 (Main result). Let $\{\theta^t\}$ be a sequence generated by Algorithm 1. If $\mathcal{L}$ is $L$-smooth and has a finite infimum (A1), the mini-batch estimate of the gradient is unbiased (A2), and our update vector has a bounded variance (A3), we then have that, for $\eta \in (0, 1/L]$:</p><p>where $\Delta := \mathcal{L}(\theta^0) - \mathcal{L}^\star$ and the expectation is with respect to $\{\mathcal{B}^t\}$, $\{\mathcal{S}^t\}$, and $\mathcal{K}_o$.</p><p>If we take the stepsize to be &#951;</p><p>-1 , we get from ( <ref type="formula">10</ref>) that</p><p>thus achieving an $O(1/\sqrt{T})$ convergence rate. Note the effect of the variance induced by the task-sampling mechanism, which grows with $K$.</p><p>We can further establish linear convergence to a neighborhood around the optimum under the Polyak-&#321;ojasiewicz inequality <ref type="bibr">(Polyak, 1963)</ref>.</p><p>Assumption 4 (PL inequality). We assume that there exists a positive constant $\mu$ such that $\forall \theta \in \Theta$:</p><p>We use the expected suboptimality $\delta^t := \mathbb{E}\mathcal{L}(\theta^t) - \mathcal{L}^\star$, where the expectation is with respect to $\{\mathcal{B}^t\}$, $\{\mathcal{S}^t\}$, and $\mathcal{K}_o$, as our Lyapunov function in the following result.</p><p>Theorem 2 (Linear convergence). Let $\{\theta^t\}$ be a sequence generated by Algorithm 1. If $\mathcal{L}$ is $L$-smooth and has a finite infimum (A1), the mini-batch estimate of the gradient is unbiased (A2), our update vector has a bounded variance (A3), and the PL inequality (A4) holds for $\mathcal{L}$, we then have that, for $\eta \in (0, 1/L]$:</p><p>It follows from $\mu \leq L$ that $1 - \mu\eta \in (0, 1)$. Therefore, we have linear convergence to an $O(\sigma_b^2 + K \cdot \sigma_s^2)$ neighborhood of the global optimum. We defer the proofs of Lemma 1, Theorem 1, and Theorem 2 to Appendix C.</p></div>
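<div xmlns="http://www.tei-c.org/ns/1.0"><p>To give some intuition for what linear convergence to a neighborhood means, the sketch below iterates a generic contraction-plus-noise recursion. This is only an assumed, simplified form used for illustration: the recursion, the function names, and all constants are hypothetical, and the exact constants of Theorem 2 are not reproduced here.</p><p><![CDATA[
```python
def suboptimality_sequence(delta0, mu, eta, noise, steps):
    """Iterate delta_{t+1} = (1 - mu * eta) * delta_t + noise (assumed generic form)."""
    rate = 1.0 - mu * eta
    deltas = [delta0]
    for _ in range(steps):
        deltas.append(rate * deltas[-1] + noise)
    return deltas


def neighborhood_radius(mu, eta, noise):
    """Fixed point of the recursion: the level at which the sequence plateaus."""
    return noise / (mu * eta)


# Hypothetical constants chosen only for illustration.
deltas = suboptimality_sequence(delta0=10.0, mu=0.5, eta=0.1, noise=0.01, steps=2000)
print(deltas[0], deltas[10], deltas[-1], neighborhood_radius(0.5, 0.1, 0.01))
```
]]></p><p>The gap to the fixed point shrinks by the factor 1 - mu * eta at every step, which is the geometric (linear) rate, while the noise term, which plays the role of the variance terms, sets the size of the neighborhood that is reached.</p></div>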
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTS</head><p>In this section, we describe our numerical experiments and analyze their results, comparing the empirical performance of LASER-VFL against baseline approaches. We begin with a brief overview of the baseline methods and tasks considered, followed by a the presentation and discussion of the experimental results.</p><p>We compare our method to the following baselines:</p><p>&#8226; Local: as in (2); the local approach leverages its block for training and inference whenever it is observed, but ignores the remaining blocks. &#8226; Standard VFL <ref type="bibr">(Liu et al., 2022)</ref>: as in (1); the standard VFL model can only be trained on and used for inference for fully-observed samples. When the model is unavailable for prediction, a random prediction is made. 91.6 &#177; 0.9 67.3 &#177; 3.5 16.0 &#177; 4.1 90.0 &#177; 1.4 66.0 &#177; 2.8 15.8 &#177; 3.9 84.6 &#177; 2.5 62.2 &#177; 2.2 15.3 &#177; 3.5 VFedTrans 82.5 &#177; 0.4 82.5 &#177; 0.4 82.3 &#177; 0.7 82.9 &#177; 0.3 82.9 &#177; 0.2 82.8 &#177; 0.7 82.2 &#177; 0.5 82.2 &#177; 0.5 82.2 &#177; 0.8 Ensemble 90.9 &#177; 0.9 90.3 &#177; 0.9 84.8 &#177; 1.1 89.8 &#177; 0.4 89.3 &#177; 0.5 84.1 &#177; 1.6 88.7 &#177; 1.5 88.1 &#177; 1.8 83.0 &#177; 2.5 Combinatorial 92.5 &#177; 0.2 92.4 &#177; 0.3 88.5 &#177; 1.7 90.0 &#177; 1.6 90.0 &#177; 1.4 87.3 &#177; 1.6 84.2 &#177; 2.5 84.7 &#177; 1.9 83.0 &#177; 1.3 PlugVFL 91.1 &#177; 1.4 88.5 &#177; 1.6 75.1 &#177; 6.3 90.9 &#177; 0.9 88.6 &#177; 2.6 78.3 &#177; 4.8 87.7 &#177; 1.6 87.2 &#177; 1.5 82.8 &#177; 1.6 LASER-VFL (ours) 94.2 &#177; 0.1 93.9 &#177; 0.2 89.0 <ref type="table">&#177; 1.1 92.4 &#177; 1.3 92.0 &#177; 1.3 86.6 &#177; 1.2 90.1 &#177; 3.2 89.5 &#177; 3.1 84.8 &#177; 2.6</ref> Credit (F1-score&#215;100) Local 37.7 &#177; 3.1 37.6 &#177; 3.1 37.5 &#177; 3.2 37.6 &#177; 3.0 37.6 &#177; 2.9 37.5 &#177; 3.2 36.2 &#177; 3.5 36.</p><p>2 &#177; 3.6 36.0 &#177; 3.6 Standard VFL 45.7 
&#177; 2.6 38.8 &#177; 2.5 31.0 &#177; 0.7 42.4 &#177; 1.2 38.3 &#177; 0.7 31.0 &#177; 0.7 32.4 &#177; 2.4 32.5 &#177; 1.3 30.3 &#177; 0.5 VFedTrans 37.7 &#177; 1.5 37.6 &#177; 1.4 37.2 &#177; 1.6 39.5 &#177; 0.8 39.4 &#177; 0.8 39.2 &#177; 0.9 35.7 &#177; 0.3 35.6 &#177; 0.3 35.6 &#177; 0.4 Ensemble 42.1 &#177; 1.0 42.1 &#177; 1.0 40.1 &#177; 0.8 41.4 &#177; 1.3 40.8 &#177; 0.9 39.2 &#177; 1.1 42.4 &#177; 2.1 41.8 &#177; 1.7 40.1 &#177; 1.1 Combinatorial 42.8 &#177; 2.9 42.7 &#177; 2.5 41.7 &#177; 1.5 44.4 &#177; 1.0 41.8 &#177; 2.2 41.7 &#177; 1.5 32.8 &#177; 3.6 36.3 &#177; 0.6 37.6 &#177; 2.1 PlugVFL 44.4 &#177; 3.7 41.3 &#177; 4.0 32.7 &#177; 4.6 40.8 &#177; 2.0 39.8 &#177; 1.6 37.9 &#177; 1.7 38.6 &#177; 2.1 37.8 &#177; 0.9 35.4 &#177; 3.4 LASER-VFL (ours) 46.5 &#177; 2.8 45.0 &#177; 2.5 43.7 &#177; 1.2 43.1 &#177; 4.2 41.9 &#177; 4.0 41.3 &#177; 1.9 41.5 &#177; 4.2 40.9 &#177; 4.0 41.4 &#177; 1.7 MIMIC-IV (F1-score&#215;100) Local 74.9 &#177; 0.5 74.9 &#177; 0.5 74.8 &#177; 0.5 75.0 &#177; 0.3 75.0 &#177; 0.2 75.0 &#177; 0.3 73.0 &#177; 0.5 73.0 &#177; 0.5 73.0 &#177; 0.5 Standard VFL 91.6 &#177; 0.2 77.0 &#177; 1.3 52.5 &#177; 2.2 90.9 &#177; 0.3 76.8 &#177; 1.3 52.5 &#177; 2.2 81.1 &#177; 0.4 70.2 &#177; 0.8 51.8 &#177; 1.8 Ensemble 84.2 &#177; 0.3 83.9 &#177; 0.3 77.9 &#177; 1.6 84.1 &#177; 0.4 83.7 &#177; 0.5 78.1 &#177; 1.1 82.1 &#177; 0.3 81.8 &#177; 0.5 76.4 &#177; 1.1 Combinatorial 91.3 &#177; 0.3 87.2 &#177; 1.7 83.6 &#177; 0.4 91.0 &#177; 0.3 86.7 &#177; 1.4 83.4 &#177; 0.4 80.5 &#177; 1.2 80.6 &#177; 1.7 78.9 &#177; 0.5 PlugVFL 91.2 &#177; 0.5 89.0 &#177; 0.8 79.6 &#177; 2.8 90.6 &#177; 0.2 88.8 &#177; 0.5 79.8 &#177; 2.6 86.5 &#177; 0.7 85.3 &#177; 0.9 77.9 &#177; 2.8 LASER-VFL (ours) 92.0 &#177; 0.2 91.2 &#177; 0.3 85.5 &#177; 1.0 91.1 &#177; 0.5 90.2 &#177; 0.3 84.7 &#177; 0.9 87.7 &#177; 0.2 86.9 &#177; 0.2 82.0 &#177; 1.0 CIFAR-10 (accuracy, %) Local 76.1 &#177; 0.1 76.0 &#177; 0.1 76.2 &#177; 0.2 75.6 &#177; 0.1 75.6 &#177; 0.1 75.7 &#177; 
0.3 71.2 &#177; 0.4 71.1 &#177; 0.4 71.3 &#177; 0.7 Standard VFL 89.7 &#177; 0.2 60.5 &#177; 12.1 11.6 &#177; 3.5 87.0 &#177; 0.6 58.8 &#177; 11.4 11.5 &#177; 3.4 54.9 &#177; 10.0 38.6 &#177; 9.1 10.9 &#177; 2.2 Ensemble 86.4 &#177; 0.2 85.2 &#177; 0.7 78.3 &#177; 0.9 86.1 &#177; 0.3 85.0 &#177; 0.5 77.8 &#177; 0.7 82.4 &#177; 0.2 81.0 &#177; 0.8 73.8 &#177; 0.6 Combinatorial 89.4 &#177; 0.1 88.7 &#177; 0.4 83.4 &#177; 1.2 87.1 &#177; 0.8 86.7 &#177; 0.4 82.4 &#177; 0.8 54.9 &#177; 9.7 58.6 &#177; 7.7 68.4 &#177; 3.1 PlugVFL 89.4 &#177; 0.2 88.1 &#177; 0.8 81.8 &#177; 1.7 88.1 &#177; 0.5 86.9 &#177; 0.6 81.2 &#177; 1.2 76.5 &#177; 2.1 75.6 &#177; 1.5 72.4 &#177; 1.0 LASER-VFL (ours) 91.5 &#177; 0.1 90.5 &#177; 0.4 83.8 &#177; 1.5 91.2 &#177; 0.2 90.2 &#177; 0.4 83.3 &#177; 1.4 87.4 &#177; 0.3 86.4 &#177; 0.6 79.4 &#177; 1.6 CIFAR-100 (accuracy, %) Local 50.3 &#177; 0.1 50.3 &#177; 0.1 50.4 &#177; 0.1 49.1 &#177; 0.2 49.2 &#177; 0.2 49.1 &#177; 0.3 41.3 &#177; 0.7 41.3 &#177; 0.6 41.3 &#177; 0.7 Standard VFL 64.7 &#177; 0.4 41.5 &#177; 9.7 2.4 &#177; 2.9 57.2 &#177; 2.2 36.6 &#177; 7.6 2.3 &#177; 2.8 15.8 &#177; 7.0 10.5 &#177; 4.6 1.4 &#177; 1.0 Ensemble</p><p>62.1 &#177; 0.3 60.2 &#177; 1.2 52.3 &#177; 0.9 60.8 &#177; 0.3 58.8 &#177; 1.0 51.0 &#177; 0.5 52.6 &#177; 0.6 50.4 &#177; 0.7 43.5 &#177; 0.8 Combinatorial 64.4 &#177; 0.5 64.0 &#177; 0.4 58.5 &#177; 1.3 57.2 &#177; 2.2 57.8 &#177; 1.8 55.5 &#177; 0.7 16.4 &#177; 6.9 19.5 &#177; 5.9 31.6 &#177; 4.3 PlugVFL 66.0 &#177; 0.3 64.1 &#177; 1.2 55.3 &#177; 2.1 63.2 &#177; 1.1 61.5 &#177; 0.7 53.1 &#177; 1.9 44.6 &#177; 2.3 43.9 &#177; 1.7 41.2 &#177; 1.6 LASER-VFL (ours) 72.3 &#177; 0.1 71.0 &#177; 0.7 61.3 &#177; 1.8 70.4 &#177; 0.3 68.9 &#177; 0.6 59.8 &#177; 1.6 61.4 &#177; 0.9 59.9 &#177; 0.5 51.9 &#177; 1.2</p><p>&#8226; Ensemble: we train local predictors as in Local. 
During inference, the clients share their predictions, and a joint prediction is selected by majority vote, with ties broken at random.</p><p>&#8226; Combinatorial: as in (3); during training, each batch is used by all the models corresponding to subsets of the observed blocks. During inference, we use the predictor corresponding to the set of observed blocks. We discuss the scalability problems of this approach below.</p><p>&#8226; PlugVFL <ref type="bibr">(Sun et al., 2024)</ref>: we extend PlugVFL to handle missing features during training by replacing missing representations with zeros. This method coincides with the one proposed by <ref type="bibr">(Ganguli et al., 2024)</ref> for scenarios where all clients have reliable links to one another.</p><p>Our experiments focus on the following tasks:</p><p>&#8226; HAPT (Reyes-Ortiz et al., 2016): a human activity recognition dataset.</p><p>&#8226; Credit <ref type="bibr">(Yeh &amp; Lien, 2009)</ref>: a dataset for predicting credit card payment defaults.</p><p>&#8226; MIMIC-IV <ref type="bibr">(Johnson et al., 2020)</ref>: a time-series healthcare dataset. We focus on the task of predicting in-hospital mortality from ICU data corresponding to patients admitted with chronic kidney disease. We use the data processing pipeline of <ref type="bibr">Gupta et al. (2022)</ref>.</p><p>&#8226; CIFAR-10 &amp; CIFAR-100 <ref type="bibr">(Krizhevsky et al., 2009)</ref>: two popular image datasets.</p><p>For all datasets, we split the samples into K = 4 feature blocks. For example, for the CIFAR datasets, each image is partitioned into four quadrants, each assigned to a different client. Given our interest in ensuring that all clients perform well at inference time, we average the metrics across clients. 
In Appendix D, we provide detailed descriptions of how we compute accuracy and F1-score.</p><p>Experiments on HAPT and Credit show that VFedTrans does not significantly outperform Local despite being considerably more complex and slower. The training pipeline of VFedTrans includes three stages: federated representation learning, local representation distillation, and training a local model. The first two stages are repeated K - 1 times for each client pair. This makes VFedTrans time-consuming both to run and to tune, as seen in Appendix D. For this reason, we limit our experiments with VFedTrans to these simpler tasks.</p></div>
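<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the setup described above, the following minimal sketch shows one way to split an image into K = 4 quadrant blocks and to combine client predictions by majority vote with random tie-breaking, as in the Ensemble baseline. The function names, the list-of-rows image representation, and the order in which quadrants are assigned to clients are assumptions made for this example, not details taken from the experiments.</p>
<p>
```python
import random
from collections import Counter


def split_into_quadrants(image):
    """Split an image, given as a list of pixel rows, into K = 4 quadrant blocks."""
    half_h = len(image) // 2
    half_w = len(image[0]) // 2
    top, bottom = image[:half_h], image[half_h:]
    return [
        [row[:half_w] for row in top],      # client 0: top-left (assumed order)
        [row[half_w:] for row in top],      # client 1: top-right
        [row[:half_w] for row in bottom],   # client 2: bottom-left
        [row[half_w:] for row in bottom],   # client 3: bottom-right
    ]


def majority_vote(predictions, rng=random):
    """Return the most frequent label, breaking ties uniformly at random."""
    counts = Counter(predictions)
    best = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == best)
    return rng.choice(tied)


image = [[4 * r + c for c in range(4)] for r in range(4)]
blocks = split_into_quadrants(image)
print(blocks[0])                     # top-left 2x2 block
print(majority_vote([1, 2, 2, 0]))   # label 2 wins outright
```
</p>
<p>Passing a seeded random.Random instance as rng makes the tie-breaking reproducible.</p></div>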
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A ADDITIONAL DISCUSSIONS A.1 A MORE DETAILED COMPARISON WITH CLOSEST RELATED WORKS</head><p>As mentioned in Section 1, a few recent works have addressed the problem of missing features during VFL training. In particular, one line of work uses all samples that are not fully observed locally to train the representation models via self-supervised learning <ref type="bibr">(Feng, 2022;</ref><ref type="bibr">He et al., 2024)</ref>. This is followed by collaborative training of a joint predictor for the whole federation using the samples without missing features. Another line of work uses a similar framework but resorts to semi-supervised learning instead of self-supervised learning in order to take advantage of nonaligned samples <ref type="bibr">(Kang et al., 2022b;</ref><ref type="bibr">Yang et al., 2022;</ref><ref type="bibr">Sun et al., 2023)</ref>. In both cases, these works fall short in three respects that our method addresses: 1) they require the existence of fully observed samples; 2) they do not allow for missing features at inference time; and 3) they do not leverage collaborative training when more than one, but not all, of the clients observe their feature blocks.</p><p>Another approach leverages information across the entire federation to train local predictors, thereby enabling inference in the presence of missing features. <ref type="bibr">Feng &amp; Yu (2020)</ref> and <ref type="bibr">Kang et al. (2022a)</ref> propose following standard, collaborative vertical FL with a transfer learning method that allows each client to have its own predictor. These works assume that the training samples are fully observed.</p><p>The following quadratic upper bound follows from the L-smoothness of L o (A1):</p><p>$$\mathcal{L}_o(\theta') \le \mathcal{L}_o(\theta) + \nabla \mathcal{L}_o(\theta)^\top (\theta' - \theta) + \frac{L}{2} \|\theta' - \theta\|^2 \quad \text{for all } \theta, \theta'. \tag{11}$$</p><p>Let X denote a random variable and let F be a convex function. Jensen's inequality states that</p><p>$$F(\mathbb{E}[X]) \le \mathbb{E}[F(X)]. \tag{12}$$</p><p>We define the following shorthand notation for conditional expectations used in our proofs:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 PROOF OF LEMMA 1</head><p>From the law of total expectation, we have that</p><p>Focusing on the inner expectation, we have from the definition of LB t S t (&#952; t ) in (<ref type="formula">7</ref>) and (<ref type="formula">8</ref>) that</p><p>Now, from the fact that $\mathcal{S}^t_{ki} \sim \mathcal{U}(\mathcal{P}^i_k(\mathcal{K}^t_o))$, we have that</p><p>So, we have from (<ref type="formula">14</ref>) and from the fact that $a^t_i = \binom{K^t_o - 1}{i - 1} / i$, as defined in (8), that</p><p>So, from (<ref type="formula">13</ref>) and (A2), it follows that</p><p>Therefore, we have from the dominated convergence theorem, which allows us to interchange the expectation and the gradient operators, that</p></div>
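<div xmlns="http://www.tei-c.org/ns/1.0"><p>The role of the weights in the argument above can be checked numerically. The sketch below assumes that the weight for subsets of size i is the binomial coefficient "K o - 1 choose i - 1" divided by i, with K o the number of observed blocks, and that the sampled set is uniform over the size-i subsets of observed clients containing client k. Under these assumptions, the weighted expectation equals a deterministic sum over all subsets containing k, each weighted by the reciprocal of its size. That 1/|S| form is what the assumed weights imply, not a formula quoted from the text, and toy_loss is a hypothetical stand-in for the per-client loss.</p>
<p>
```python
from fractions import Fraction
from itertools import combinations
from math import comb


def subsets_with(k, clients, size):
    """All subsets of `clients` with `size` elements that contain client k."""
    others = [c for c in clients if c != k]
    return [frozenset((k,) + rest) for rest in combinations(others, size - 1)]


def weighted_expectation(k, clients, loss):
    """Sum over sizes i of a_i * E[loss(k, S)], S uniform over size-i subsets with k."""
    total = Fraction(0)
    for i in range(1, len(clients) + 1):
        candidates = subsets_with(k, clients, i)
        a_i = Fraction(comb(len(clients) - 1, i - 1), i)  # assumed weight
        mean_loss = sum(loss(k, s) for s in candidates) / len(candidates)
        total += a_i * mean_loss
    return total


def full_sum(k, clients, loss):
    """Deterministic counterpart: every subset S containing k, weighted by 1/|S|."""
    return sum(
        Fraction(1, i) * loss(k, s)
        for i in range(1, len(clients) + 1)
        for s in subsets_with(k, clients, i)
    )


def toy_loss(k, subset):
    """Hypothetical per-client loss; any function of (k, subset) works here."""
    return Fraction(1 + 3 * k + sum(c * c for c in subset), len(subset) + 1)


observed = [0, 1, 2, 3]
print(all(weighted_expectation(k, observed, toy_loss) == full_sum(k, observed, toy_loss)
          for k in observed))
```
</p>
<p>Exact rational arithmetic is used so that the equality is verified without floating-point tolerance.</p></div>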
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.3 PROOF OF THEOREM 1</head><p>Let us show the convergence of our method. It follows from the quadratic upper bound in (11), which ensues from the L-smoothness assumption (A1), that</p><p>Using the fact that &#952; t+1 = &#952; t - &#951;&#8711; LB t S t (&#952; t ), we get that:</p><p>We now take the conditional expectation E t [&#183;], defined in Appendix C.2, arriving at:</p><p>Now, under the unbiasedness assumption (A2), we can use Lemma 1 to arrive at:</p><p>From Lemma 1, we also have that the following equation holds</p><p>Further, we have that</p><p>Recalling that E t + &#8711; LB t S t (&#952; t ) = &#8711;L o B t (&#952; t ), as seen in Appendix C.2, we get</p><p>Further, it follows from the bounded variance assumption in (A3) that</p><p>Therefore, we arrive at</p><p>Using the inequality above in (<ref type="formula">16</ref>), we get that</p><p>which we can use in (<ref type="formula">15</ref>) to arrive at</p><p>Therefore, for a stepsize &#951; &#8712; (0, 1/L], we have that</p><p>Taking the unconditional expectation (over K o and F t+1 ) on both sides of this inequality, we get:</p><p>Now, using the law of total expectation, we get that</p><p>Therefore, using the fact that L(&#952;) = E[L o (&#952;)] and Jensen's inequality (12), we have</p><p>Using the dominated convergence theorem, we can interchange the expectation and the gradient operators, arriving at the following descent lemma in expectation:</p><p>Taking the average of the inequality above for t &#8712; {0, 1, ..., T - 1}, we get that:</p><p>Lastly, rearranging the terms and using the existence and definition of the infimum of L (A1), we have that</p><p>thus arriving at the result in (10).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.4 PROOF OF THEOREM 2</head><p>Following the same steps as in the proof of Theorem 1, we can arrive at the same descent lemma in expectation (17):</p><p>Rearranging the terms and subtracting the infimum on both sides, we get that</p><p>Now, using the definition of our Lyapunov function, &#948; t = E L(&#952; t ) - L*, where L* denotes the infimum of L, we arrive at the following inequality:</p><p>Now, from the PL inequality (A4), we have that:</p><p>Therefore, again using the definition of our Lyapunov function, we arrive at</p><p>Unrolling the inequality above, in which each step contracts &#948; t by the factor (1 - &#956;&#951;), we get that</p><p>Finally, using the sum of a geometric series, we arrive at</p><p>which corresponds to the result that we set out to prove.</p></div>
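<div xmlns="http://www.tei-c.org/ns/1.0"><p>The unrolling and geometric-series steps used in the proof above can be sketched generically as follows. Here c is a placeholder for the nonnegative constant term of the one-step recursion (its exact form is not reproduced here), and the product of the PL constant and the stepsize is assumed to lie in (0, 1], so that the contraction factor lies in [0, 1).</p>
<p>
```latex
% c >= 0 is a placeholder for the constant term of the recursion.
\begin{gather*}
\delta^{t+1} \le (1 - \mu\eta)\,\delta^{t} + c
\quad \text{for all } t \ge 0, \\
\delta^{T} \le (1 - \mu\eta)^{T}\,\delta^{0} + c \sum_{t=0}^{T-1} (1 - \mu\eta)^{t}, \\
\sum_{t=0}^{T-1} (1 - \mu\eta)^{t}
= \frac{1 - (1 - \mu\eta)^{T}}{\mu\eta} \le \frac{1}{\mu\eta}, \\
\delta^{T} \le (1 - \mu\eta)^{T}\,\delta^{0} + \frac{c}{\mu\eta}.
\end{gather*}
```
</p>
<p>The first term decays geometrically in T, while the second is the radius of the neighborhood of the optimum to which the iterates converge linearly.</p></div>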
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D EXPERIMENT DETAILS</head><p>Notes on VFedTrans. For the VFedTrans <ref type="bibr">(Huang et al., 2023)</ref> pipeline, we use FedSVD <ref type="bibr">(Chai et al., 2022)</ref> for federated representation learning and an autoencoder for local representation learning, as these are the best performing methods in <ref type="bibr">Huang et al. (2023)</ref>. For the local predictor, we use a multilayer perceptron, as in <ref type="bibr">Huang et al. (2023)</ref>. As seen in Table <ref type="table">1</ref>, the performance of VFedTrans is nearly identical to that of Local. However, given that it requires a multi-stage pipeline with operations for each pair of clients, its runtime is much longer. In particular, for the Credit dataset, with p miss = 0.0 for both training and inference, running VFedTrans takes 2843.0 &#177; 50.6 s, while running Local takes only 127.8 &#177; 3.0 s, roughly a 22-fold difference.</p><p>Notes on PlugVFL. We also note that the original PlugVFL paper <ref type="bibr">(Sun et al., 2024)</ref> 1) does not address missing training data and 2) only considers settings with two clients. In our experiments, we extended PlugVFL to handle more general scenarios. To address missing training data fairly, we replaced representations for missing feature blocks with zeros at the fusion model. This is similar to the dropout mechanism in the original paper, except that, for missing data, these representations are zeroed throughout training rather than per iteration. To generalize PlugVFL to more than two clients, we assigned a dropout probability to each passive party (that is, each client not holding the fusion model).</p><p>Notes on MIMIC-IV. We use ICU data of MIMIC-IV v1.0 <ref type="bibr">(Johnson et al., 2020)</ref> and follow the data processing pipeline in <ref type="bibr">Gupta et al. (2022)</ref>. We focus on the task of predicting in-hospital mortality for patients admitted with chronic kidney disease. 
We perform feature selection, using as input features the diagnosis data and the selected chart-event features (labs and vitals) from the "Proof of concept experiments" of <ref type="bibr">Gupta et al. (2022)</ref>.</p><p>Computing metrics. As mentioned in the main text of the paper, we want our metrics to capture the fact that we want all clients to perform well at inference time. Therefore, we consider the average metrics across the clients. More precisely, we compute the metrics for Table <ref type="table">1</ref> as follows:</p><p>&#8226; For the test accuracy, let &#375; n (k) denote the prediction of client k for sample n. We compute:</p><p>&#8226; For the F1-score, we compute:</p><p>where P k and R k are the precision and recall of client k, with respect to its predictor and (only) the samples it observed.</p><p>Models used. For the HAPT and Credit datasets, we trained simple multilayer perceptrons; for the MIMIC-IV dataset, we trained an LSTM <ref type="bibr">(Hochreiter, 1997)</ref>; and, for the CIFAR-10 and CIFAR-100 experiments, we trained ResNet-18 <ref type="bibr">(He et al., 2016)</ref> models.</p></div>
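<div xmlns="http://www.tei-c.org/ns/1.0"><p>The client-averaged metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions: each client's metric is computed only on the samples that client observed, the per-client scores are averaged uniformly over clients, the F1-score is the binary harmonic mean of precision P k and recall R k, and a score of zero is returned when there are no true positives. The function names and the toy data are hypothetical.</p>
<p>
```python
def client_accuracy(labels, preds):
    """Accuracy of one client over the samples it observed."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


def client_f1(labels, preds, positive=1):
    """Binary F1 of one client from its precision P_k and recall R_k."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    if tp == 0:
        return 0.0  # convention for degenerate cases (assumption)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_over_clients(metric, per_client_data):
    """Uniform average of a per-client metric across the K clients."""
    scores = [metric(labels, preds) for labels, preds in per_client_data]
    return sum(scores) / len(scores)


# Each entry holds (labels, predictions) restricted to the samples that client observed.
data = [
    ([1, 0, 1, 1], [1, 0, 0, 1]),
    ([0, 1, 1], [0, 1, 1]),
]
print(average_over_clients(client_accuracy, data))  # (0.75 + 1.0) / 2
print(average_over_clients(client_f1, data))        # (0.8 + 1.0) / 2
```
</p>
<p>Multiplying the averaged F1-score by 100 gives values on the scale reported for MIMIC-IV in Table 1.</p></div>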
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E ADDITIONAL EXPERIMENTS</head><p>In this appendix, we present additional experimental results.</p><p>Nonuniform missing features. In Table <ref type="table">3</ref>, we present an experiment where the probability of each feature block k not being observed, p miss (k), is drawn independently and identically distributed as p miss (k) &#8764; Beta(2.0, 2.0). Note that, for X &#8764; Beta(2.0, 2.0), we have that E[X] = 0.5, which motivates our comparison with p miss (k) = 0.5 for all k. For the column where the missing probabilities are nonuniform for both training and inference data, these probabilities are sampled independently for training and inference.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>The notation in (3) differs slightly from (1) and (2), which use a simpler, more specific formulation.</p></note>
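<div xmlns="http://www.tei-c.org/ns/1.0"><p>The nonuniform missing-features setup of Appendix E can be sketched as follows: one missing probability per feature block is drawn from Beta(2.0, 2.0), separately for training and inference, and each block of each sample is then dropped independently with its block's probability. The function names, the seed, and the per-sample independence of the masks are assumptions made for this illustration. The sketch does not treat samples for which every block ends up missing in any special way.</p>
<p>
```python
import random


def sample_missing_probabilities(num_blocks, rng, alpha=2.0, beta=2.0):
    """Draw p_miss(k) ~ Beta(alpha, beta) independently for each feature block k."""
    return [rng.betavariate(alpha, beta) for _ in range(num_blocks)]


def sample_observation_mask(p_miss, rng):
    """Return one boolean per block: True if the block is observed for this sample."""
    return [rng.random() >= p for p in p_miss]


rng = random.Random(0)
p_miss_train = sample_missing_probabilities(4, rng)
p_miss_test = sample_missing_probabilities(4, rng)  # drawn independently of training
masks = [sample_observation_mask(p_miss_train, rng) for _ in range(5)]
print(p_miss_train)
print(masks)
```
</p>
<p>Averaging many draws from sample_missing_probabilities gives a value close to 0.5, in line with the expectation used to motivate the uniform comparison.</p></div>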
		</body>
		</text>
</TEI>
