<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Iterative Supervised Learning for Regression with Constraints</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10334590</idno>
					<idno type="doi"></idno>
					<title level='j'>International Conference on Ubiquitous Robots</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Tejaswi K. C. and Taeyoung Lee</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Regression in supervised learning often requires the enforcement of constraints to ensure that the trained models are consistent with the underlying structures of the input and output data. This paper presents an iterative procedure to perform regression under arbitrary constraints. It is achieved by alternating between a learning step and a constraint enforcement step, to which an affine extension function is incorporated. We show this leads to a contraction mapping under mild assumptions, from which the convergence is guaranteed analytically. The presented proof of convergence in regression with constraints is the unique contribution of this paper. Furthermore, numerical experiments illustrate improvements in the trained model in terms of the quality of regression, the satisfaction of constraints, and also the stability in training, when compared to other existing algorithms.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Enforcing constraints on supervised learning is critical when the underlying structures of the data should be respected in the trained model, or when it is required to overcome a bias in the data set. For instance, in <ref type="bibr">[1]</ref>, constraints caused by length, angle, or collision are studied with projection when predicting the motion of a physical system with neural networks. In <ref type="bibr">[2]</ref>, fairness with respect to protected features, such as race or gender, is addressed in socially sensitive decision making. Further, it has been illustrated by <ref type="bibr">[3]</ref> that the performance of deep learning can be improved by integrating the domain knowledge in the form of constraints. As such, imposing constraints is desirable in injecting our prior knowledge of the model, which is encoded indirectly in the data, to supervised learning explicitly.</p><p>One of the common techniques to implement constraints is augmenting the loss function with an additional penalty on the violation of the constraints, as presented by <ref type="bibr">[4]</ref> and <ref type="bibr">[5]</ref>. On the other hand, constraints have also been implemented directly as hard constraints that should be satisfied strictly. Imposing hard constraints on deep neural network is presented by <ref type="bibr">[6]</ref> after customizing large-scale optimization techniques. Alternatively, <ref type="bibr">[7]</ref> handles output label restrictions through a Lagrangian based formulation. Both of these approaches based on additional regularization terms or hard constrained optimization involve the process of actively adjusting model parameters in training. In other words, the possibly conflicting goals of regression and constraint enforcement should be addressed simultaneously. This may hinder the efficiency of the training procedure, while making it susceptible to various numerical issues.</p><p>Recently, an iterative procedure has been proposed by <ref type="bibr">[8]</ref>, where the constraints are enforced by adjusting the Mechanical and Aerospace Engineering, The George Washington University, Washington DC 20052 kctejaswi999@gmail.com,tylee@gwu.edu</p><p>The research was partially supported by NSF under the grants CNS-1837382, CMMI-1760928 and by AFOSR FA9550-18-1-0288. target, instead of manipulating the model parameters directly, thereby addressing the aforementioned challenges. The desirable feature is that any supervised learning technique developed without constraint consideration can be adopted in conjunction with nonlinear constrained optimization tools. However, this approach is heuristic in the sense that there is no analytical assurance for convergence through iterations, while its performance is illustrated with several numerical examples. In fact, it is challenging to present a convergence property in any supervised learning with constraints.</p><p>The main objective of this paper is to establish a certain convergence guarantee in regression with constraints. Towards this end, we utilize the procedure of adjusting the target to satisfy constraints. More specifically, the ideal target is projected to the intersection between the set of possible outputs from the chosen model and the set of feasible outputs. Then the model parameters are optimized to the adjusted target, and these two steps are repeated. The proposed approach is motivated by the alternating projections <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref> and Dykstra's algorithm <ref type="bibr">[11]</ref>. In particular, the two steps of iterations in adjusting the target, and in training the model are considered as certain projection operators, from which convergence is established by the Banach fixed point theorem <ref type="bibr">[12]</ref>.</p><p>The desirable feature is that we have a certain assurance of convergence in regression with constraints. Another interesting feature is its general formulation: as discussed above, this framework can be integrated with any supervised learning technique. And, it further addresses the challenges of adjusting the model parameters to the satisfaction of the constraints, while performing regression simultaneously. One downside is that we cannot enforce the constraint strictly as hard constraints, but there is a design parameter that provides a trade-off between regression and constraint satisfaction.</p><p>Numerical experiments demonstrate that the proposed approach improves the regression performance in the similar level of constraint violation. More importantly, it exhibits more consistent results over five-fold validations. As such, the proposed convergence proof is actually beneficial in numerical implementations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. ITERATIVE LEARNING WITH CONSTRAINTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Problem Formulation</head><p>Consider a regression problem where we should predict the ideal output y &#8712; R n given the inputs X &#8712; R n&#215;d . Here n denotes the number of points in the dataset, and d corresponds to the number of features in each data point. Let the model for supervised learning be denoted by &#375; = f (X, &#952;), where &#952; &#8712; R p are the model parameters and &#375; &#8712; R n is the output predicted by the current model parameters. The goal of regression is to identify the optimal model parameters &#952; * that minimize a given loss function, L(y, &#375;), L : R n &#215; R n &#8594; R. In addition, we enforce constraints on the predicted output so that it belongs to a feasible set denoted by C &#8834; R n , i.e., &#375; &#8712; C. Thus, the optimization problem for regression with constraints can be formulated as</p><p>Alternatively, this can be reorganized into an optimization on the output space as</p><p>where B = {&#375; | &#375; = f (X, &#952;), &#952; &#8712; R p } is the set of all possible outputs under the current model. In other words, (1) is to find an alternative optimal target z &#8712; R n that is closest to the ideal target y under the restriction of the given constraint and the model bias. Next, in (2), the model parameters are optimized such that the predicted output matches to the optimal target z, not the ideal target y. The intriguing feature is that the supervised learning in (2) corresponds to the usual supervised learning without constraints, as the constraints are enforced indirectly through <ref type="bibr">(1)</ref>. As such, any supervised learning scheme can be utilized for <ref type="bibr">(2)</ref>. For (1), standard tools in nonlinear constrained optimization can be applied.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Iterative Learning Algorithm with Constraints</head><p>In <ref type="bibr">[8]</ref>, this problem is tackled by a clever combination of two iterations, which is verified by various numerical examples. But it might be heuristic in the sense that no convergence property is established. Here we propose the following alternative iterative scheme for (1) and ( <ref type="formula">2</ref>), summarized by Algorithm 1, which provides a certain convergence property in regression. Here, &#945;, &#946; are non-negative parameters in the adjustment step, and N i is the total number of iterations of this procedure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 Regression with constraints</head><p>end if</p><p>end for Output: &#375;Ni</p><p>In the first step of initial training, supervised learning is performed without considering the constraint. The next iterations are composed of two parts of target adjustment and unconstrained training, and the target adjustment step has two sub-cases depending on the output of the previous step. In particular, the most critical step is when the output of the trained model does not satisfy the constraint. In the step 4, denoted by infeasible adjustment, the target is adjusted to minimize L(z, (1 -&#945;)y + &#945;&#375; i ). That is, we find a feasible target z &#8712; C that is closest to a point on the line connecting y and &#375; in terms of the loss function.</p><p>Next, when the output of the trained model satisfies the constraint, in the step of feasible adjustment, the target z is moved closer to the original target y within a ball of radius &#946; measured in terms of the loss. Finally, the model is trained with the adjusted target, and the whole procedure is repeated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Convergence Property</head><p>Now we present a convergence property of Algorithm 1, which has motivated the proposed form of the objective function in the infeasible adjustment step.</p><p>Denote norms on the Euclidean space by &#8226; : R n &#8594; R, with the L 2 and the L 1 norms represented by &#8226; 2 , and &#8226; 1 , respectively. A projection operator P Z,L : R n &#8594; R n on the set Z with respect to the loss L is defined as</p><p>In other words, x &#8712; R n is projected to z &#8712; Z such that the distance between x, z is minimized in terms of the loss L. Consider a convex subset Z &#8838; X of a finite-dimensional normed vector space (X, &#8226; ). There exists a unique projection</p><p>if the underlying geometric constraint is satisfied (see <ref type="bibr">[13,</ref><ref type="bibr">Proposition 3.2]</ref>). That is, P Z, &#8226; should not be contained in some non-degenerate line segment of &#8706;Z which is parallel to some non-degenerate line segment in the boundary of the unit</p><p>&#8226; ball.</p><p>Assumption 1. We make the following assumptions.</p><p>&#8226; The sets B and C are convex.</p><p>&#8226; The projection operator in B and C is Lipschitz, i.e., there exists a norm</p><p>When the loss function in the projection (3) is MSE, the above two statements are actually equivalent <ref type="bibr">[13]</ref>.</p><p>Now we are concerned with convergence of the sequence, (&#375; i ) &#8712; B generated after the training step. In other words, we wish to show that &#375;i &#8594; &#563; as i &#8594; &#8734; for some &#563; &#8712; R n . The convergence of Algorithm 1 is established as follows.</p><p>Theorem 1. Suppose &#945; &lt; 1/K 2 , where K &#8805; 1 is the Lipschitz constant introduced in Assumption 1. The iterations of Algorithm 1 has a unique fixed point in B, which is the limit of the sequence (&#375; i ) for an initial &#375;1 &#8712; B, when &#946; is sufficiently small. Proof. When &#946; &#8594; 0, Algorithm 1 iterates between the infeasible adjustment step and unconstrained training, and it can be written as</p><p>Therefore, Algorithm 1 corresponds to a concatenation of two projections as</p><p>where h : R n &#8594; R n is the affine extension function defined as h(&#375; i ) = (1 -&#945;)y + &#945;&#375; i . Consider any two points &#375;1 , &#375;2 &#8712; B. We have</p><p>If K 2 &#945; &lt; 1, then each iteration is a contraction mapping on B with the metric induced by this norm, d(x, y) = y -x . Since (B, d) is a complete metric space, the series of iterations has a unique fixed point &#563; = f (&#563;) according to the Banach fixed point theorem <ref type="bibr">[12]</ref>. Moreover, the sequence {&#375; 1 , &#375;2 , . . .} converges to &#563; for any &#375;1 &#8712; B.</p><p>In short, the Lipschitz properties of Assumption 1 ensures that each iteration is a contraction. The critical question is how we can ensure the Lipschitz property of the projection operators.</p><p>Corollary 1. The convergence of Algorithm 1 is guaranteed as described in Theorem 1 for the following loss functions.</p><p>&#8226; For the mean squared error (MSE) given by L(z, y) = Proof. For MSE, the projection in (3) corresponds to</p><p>which is equal to minimization with respect to the standard Euclidean norm, &#8226; 2 . The proximity map for a closed convex set in the Hilbert space with Euclidean inner product satisfies the condition P x -P y &#8804; x -y [9], <ref type="bibr">[13]</ref>. Since its Lipschitz constant is K = 1, according to Theorem 1, Algorithm 1 converges for &#945; &#8712; [0, 1).</p><p>Next, for the MAE loss, the projection becomes</p><p>which is optimization with respect to the L 1 norm, &#8226; 1 . It is shown by <ref type="bibr">[14]</ref> that the Lipschitz constant is K = 2 with respect to the L 1 norm. Hence convergence is guaranteed if &#945; &#8712; [0, 0.25).</p><p>Corollary 1 is the main result of this paper establishing the convergence of iterative algorithm for regression with constraints. The key idea which facilitated it is selection of the objective function of the infeasible adjustment as L(z, (1 -&#945;)y + &#945;&#375; i ). Interestingly, for a specific choice of the loss function, namely mean squared error (MSE) loss, it is equivalent to the form of L(z, y) + 1 &#945; L(z, &#375;) as described below.</p><p>Remark 1. If the loss function is mean squared error, the procedure in Algorithm 1 and the Moving Targets algorithm in <ref type="bibr">[8]</ref> are equivalent after adjusting the parameter &#945;.</p><p>Proof. In the infeasible adjustment case of the Moving Targets, a vector is obtained that considers the original label, y and the current prediction, &#375; separately in the form of L(z, y) + 1 &#945; L(z, &#375;). Since the main difference between the two algorithms is in this formulation, we compare the corresponding optimization problems. With L(z, y) = (1/n) n k=1 (z k -y k ) 2 as the MSE loss, Algorithm 1 addresses</p><p>Whereas the master step from Moving Targets is,</p><p>Here, the subscript a represents our algorithm while variables with m as the subscript are from Moving Targets. The objective functions in z a , z m differ only by a scale if,</p><p>Hence, the solutions that are obtained from them will be identical, i.e., z a = z m .</p><p>Next, we show that the proposed algorithm further exhibits improved numerical properties in several examples, beyond providing mathematical assurance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. NUMERICAL SIMULATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Constrained Learning Problem</head><p>We evaluate the performance of the proposed algorithm with various datasets, parameter values, and loss functions. First, we underscore that this section is meant to be an exercise in understanding an algorithmic procedure, and the resulting output is supposed to be interpreted as purely technical results. The type of constraints considered in this section is called fairness constraints in socially sensitive decision making (see [15]), which is measured in the form of Disparate Impact Discrimination Index as</p><p>Here D p is the set of values for the p-th protected feature, such as gender or disability, from the set P , and X p,v represents the inputs whose p-th feature has value v. Roughly speaking, it represents the difference between the mean of the output and the mean conditioned by the protected feature, and the higher DIDI, the more the dataset suffers from disparate impact. The constraint on the DIDI value, , is taken to be a fraction (0.2) of the DIDI value for the training set.</p><p>Three different datasets are considered for this regression problem with fairness constraints:</p><p>&#8226; student dataset (n = 649 points, d = 33 attributes) for Portuguese class from the UCI repository which has been used to predict secondary school student performance in <ref type="bibr">[16]</ref>. We are going to protect the feature, sex, and try to estimate the final grade of each student, G3, after removing unrelated features. &#8226; crime dataset also from the UC Irvine Machine Learning repository <ref type="bibr">[17]</ref> which has n = 2, 215, d = 147.</p><p>Since the target variable is violentPerPop representing per capita violent crimes, we want to impose fairness constraints w.r.t. the protected feature race. &#8226; blackfriday dataset which is available online at <ref type="bibr">[18]</ref>.</p><p>Considering computational challenges, we select a sample of data from the start with size, n = 50, 000, among the original training data (n &#8776; 550, 000, d = 12). Here, the goal is to estimate the amount of money spent, Purchase, while ensuring that the predictions are fair with respect to feature, Gender. A new attribute, Product_ID_Count, which is the value count of Product_ID is introduced since it represents the number of times a product has been purchased. All the categorical features in the data are encoded into an integer array using an Ordinal Encoder. Finally, obtained values are normalized to be between 0 and 1 to ensure balanced regression.</p><p>Next the values of parameters &#945;, &#946; are chosen as follows. For the parameter &#945;, three different values are chosen, and the corresponding values of &#945; m are calculated from (5) in Remark 1. These are listed at Table I where as &#945; a is increased, more weight is assigned to &#375; that has been adjusted for the constraint, compared with the original target y. Therefore, there is more emphasis on the satisfaction of the constraint. We choose &#946; = 0.1 according to the empirical study presented in <ref type="bibr">[8]</ref>.</p><p>For the machine learning model of regression, a gradient boosted tree is chosen as it ensures repeatability. It also achieves higher accuracy as well as better constraint satisfaction. To study the convergence property, the algorithms are executed for the total of N i = 30 iterations. A five-fold cross validation is performed to obtain a reliable estimate of performance as well as the standard deviation. We utilize a computer with Intel i7 CPU and Nvidia GK107 GPU with 16 GB RAM. To solve the optimization problems in the adjustment step of Algorithm 1, we utilize the IBM software CPLEX <ref type="bibr">[19]</ref> for MSE and MAE. Additionally, we consider the mean Huber loss (MHL) L(z, y)</p><p>with M = 0.1, which is implemented by CVXPY <ref type="bibr">[20]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Numerical Results</head><p>Table <ref type="table">II</ref> presents the results for varying loss functions, datasets and &#945; values. Performance is measured through the regression coefficient R 2 , and the ratio C of DIDI ( <ref type="formula">6</ref>) of the predicted output to training data. We also compare between our algorithm (denoted by A) and the Moving Targets (M) for the corresponding &#945; values from Table <ref type="table">I</ref>. According to Remark 1, both methods are equivalent for MSE. For the blackfriday dataset, two cases of &#945; are left out since they could not be solved with the available computing resources. The trade-off between constraint satisfaction and regression is well reflected in Table <ref type="table">II</ref>: as &#945; a is increased, C decreases at the cost of reduced R 2 .</p><p>Next, the bold fonts in Table <ref type="table">II</ref> represent the cases for which our algorithm performs better than Moving Targets in a statistically meaningful manner, and the italic fonts represent the opposite case. The statistical importance is assumed to occur when |&#181; a -&#181; m | &#8805; &#963; a + &#963; m , i.e., the difference between the mean figures is greater than the sum of their standard deviations. It can be observed that our procedure performs better in terms of both R 2 and C in more cases for both crime and blackfriday datasets. For the student dataset, which is the smallest one (n = 649), the results are mostly comparable.</p><p>Beyond the regression results summarized by Table <ref type="table">II</ref>, the advantages of the proposed approach are well illustrated by investigating the learning process. Figures <ref type="figure">1</ref> and<ref type="figure">2</ref> presents the evolution of R 2 and C over iterations for crime and blackfriday data, respectively. When &#945; a is small (0.1), both the algorithms yield very similar results as seen in Figures 1.(a) and (d).</p><p>However, once &#945; a is increased to 0.5 and 0.9 for more emphasis on constraint satisfaction, the proposed Algorithm 1 performs noticeably better. As shown in   [('dataset', 'crime'), ('loss', 'mhl')] movtar 0.1111 affine 0.9 (f) C for &#945;a = 0.9, using MHL This suggests that the presented proof of convergence is in fact beneficial in numerical implementations as well, and the improved numerical properties in iterations may be more important for the scalability of regression and the complexity of constraints.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. CONCLUSIONS</head><p>We have proposed an iterative algorithm for regression with constraints, composed of feasible/infeasible adjustments and training. A convergence guarantee is also provided with an affine extension function in the infeasible adjustment step. Later, the results of numerical experiments are presented with varying datasets and parameter values. The proposed convergence proof in supervised learning with constraints is the unique contribution, and it is further shown that the </p></div></body>
		</text>
</TEI>
