<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Regularization for Shuffled Data Problems via Exponential Family Priors on the Permutation Group</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>04/25/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10446725</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of Machine Learning Research</title>
<idno>2640-3498</idno>
<biblScope unit="volume">206</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zhenbang Wang</author><author>Emanuel Ben-David</author><author>Martin Slawski</author><author>Francisco Ruiz</author><author>Jennifer Dy</author><author>Jan-Willem van de Meent</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In the analysis of data sets consisting of (X, Y)-pairs, a tacit assumption is that each pair corresponds to the same observational unit. If, however, such pairs are obtained via record linkage of two files, this assumption can be violated as a result of mismatch error rooting, for example, in the lack of reliable identifiers in the two files. Recently, there has been a surge of interest in this setting under the term “Shuffled Data” in which the underlying correct pairing of (X, Y)-pairs is represented via an unknown permutation. Explicit modeling of the permutation tends to be associated with overfitting, prompting the need for suitable methods of regularization. In this paper, we propose an exponential family prior on the permutation group for this purpose that can be used to integrate various structures such as sparse and local shuffling. This prior turns out to be conjugate for canonical shuffled data problems in which the likelihood conditional on a fixed permutation can be expressed as product over the corresponding (X,Y)-pairs. Inference can be based on the EM algorithm in which the E-step is approximated by sampling, e.g., via the Fisher-Yates algorithm. The M-step is shown to admit a reduction from n^2 to n terms if the likelihood of (X,Y)-pairs has exponential family form. Comparisons on synthetic and real data show that the proposed approach compares favorably to competing methods.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Shuffled data problems refer broadly to situations in which the goal is to perform inference for a functional of the joint distribution of a pair of random variables (x, y) (such as, e.g., their covariance) based on separate samples {x i } n and {y i } m i=1 that involve matching pairs {(x &#8673; &#8676; (i) , y i )} m i=1 pertaining to the same statistical unit, where the map &#8673; &#8676; : {1, . . . , m} ! {1, . . . , n} may only be observed incompletely. This is a rather common scenario in data integration problems in which different pieces of information about a shared set of entities reside in multiple data sources that need to be combined in order to perform a given data analysis task. The process of identifying matching parts across two or more files is often far from trivial in the absence of unique identifiers, and has thus grown into a vast and active field of research known as record linkage, e.g., <ref type="bibr">Binette and Steorts (2020)</ref>. The above shuffled data model represents a direct approach to account for mismatches in record linkage and the impact on downstream data analysis. Shuffled data problems were first systematically discussed in <ref type="bibr">DeGroot et al. (1971)</ref>, with little progress until a few years ago given advances in computation <ref type="bibr">(Gutman et al., 2013)</ref>. Recently, shuffled data problems have generated widespread interest, fueled by applications in signal processing <ref type="bibr">(Unnikrishnan et al., 2018;</ref><ref type="bibr">Pananjady et al., 2018)</ref>, correspondence problems in computer vision <ref type="bibr">(Pananjady et al., 2017;</ref><ref type="bibr">Li et al., 2021)</ref> and NLP <ref type="bibr">(Grave et al., 2019;</ref><ref type="bibr">Shi et al., 2021)</ref>, biomedical data analysis <ref type="bibr">(Ma et al., 2021a;</ref><ref type="bibr">Abid and Zou, 2018)</ref>, and data privacy <ref type="bibr">(Domingo-Ferrer and Muralidhar, 2016;</ref><ref type="bibr">Gordon et al., 2021)</ref>.</p><p>On the theoretical side, several papers have investigated the statistical limits of signal estimation and permutation recovery in unlabeled sensing in which the goal is to recover a signal &#10003; &#8676; from n noisy linear measurements</p><p>, where &#8673; &#8676; is an unknown permutation, i.e., m = n and &#8673; &#8676; is one-to-one <ref type="bibr">(Unnikrishnan et al., 2018;</ref><ref type="bibr">Pananjady et al., 2018;</ref><ref type="bibr">Hsu et al., 2017;</ref><ref type="bibr">Abid et al., 2017;</ref><ref type="bibr">Tsakiris and Peng, 2019)</ref>. Another line of research has studied the setting in which x and y are scalar and related by a monotone map <ref type="bibr">(Carpentier and Schl&#252;ter, 2016;</ref><ref type="bibr">Rigollet and Weed, 2019;</ref><ref type="bibr">Flammarion et al., 2019;</ref><ref type="bibr">Ma et al., 2020;</ref><ref type="bibr">Balabdaoui et al., 2021)</ref>.</p><p>A common conclusion from these works is that shuffled data problems are generally plagued by both statistical and computational challenges. First, the combinatorial nature of &#8673; &#8676; makes it hard to devise computationally tractable approaches with provable guarantees. Existing algorith-mic "solutions" involve integer programming <ref type="bibr">(Tsakiris and Peng, 2019;</ref><ref type="bibr">Peng and Tsakiris, 2020;</ref><ref type="bibr">Mazumder and Wang, 2021)</ref>, the EM algorithm <ref type="bibr">(Gutman et al., 2013;</ref><ref type="bibr">Abid and Zou, 2018;</ref><ref type="bibr">Tsakiris et al., 2020)</ref>, sampling and approximate inference <ref type="bibr">(McVeigh et al., 2019;</ref><ref type="bibr">Steorts et al., 2016;</ref><ref type="bibr">Klami, 2012)</ref>. Regardless of the computational challenges, shuffled data problems tend to be highly susceptible to noise and prone to overfitting. In fact, statistical guarantees typically involve unrealistically stringent signal-tonoise requirements <ref type="bibr">(Pananjady et al., 2018;</ref><ref type="bibr">Hsu et al., 2017)</ref>. Loosely speaking, this issue results from the fact that the set of permutations grows rapidly in size with n. This observation suggests that suitable forms of regularization hinging on prior information on &#8673; &#8676; are needed to constrain the size of the parameter space under consideration. Several papers consider partial shufflings in which varying fractions of (x i , y i )-pairs are already observed with the correct correspondence <ref type="bibr">(Slawski and Ben-David, 2019;</ref><ref type="bibr">Slawski et al., 2019</ref><ref type="bibr">Slawski et al., , 2020;;</ref><ref type="bibr">Zhang and Li, 2020;</ref><ref type="bibr">Peng et al., 2021)</ref>, and only the remaining portion of the data is subject to shuffling. Another constraint commonly encountered in record linkage is that &#8673; &#8676; is block-structured with known composition of the blocks based on auxiliary variables that are required to agree for matching records. In domains such as signal processing and computer vision, &#8673; &#8676; is often constrained to act locally in the sense that indices are shuffled only within small time windows or image regions <ref type="bibr">(Ma et al., 2021b;</ref><ref type="bibr">Abbasi et al., 2021)</ref>.</p><p>The goal of the present paper is the development of a regularization framework for shuffled data problems that integrates those and other constraints in a unified way. To that end, we introduce an exponential family prior on the permutation group that is flexible enough to accommodate any kind of prior information that can be expressed solely in terms of index pairs (i, j). This prior turns out to be conjugate for canonical shuffled data problems in which the likelihood conditional on a fixed permutation can be expressed as product over the corresponding (x, y)-pairs. Inference is based on the MC-EM algorithm considered in <ref type="bibr">Wu (1998)</ref> and <ref type="bibr">Abid and Zou (2018)</ref>. We show that for exponential family likelihood, the resulting M-step is particularly scalable since it involves n instead of n 2 terms. Moreover, computation of the MAP estimator of &#8673; &#8676; with the remaining parameters fixed reduces to a linear assignment problem, and hence remains computationally tractable. Theoretical results and a collection of experiments for various shuffled data setups demonstrate the usefulness of regularization based on the proposed prior in comparison to the unregularized counterpart and other baselines.</p><p>Notations. We denote by D = {(x i , y i )} n i=1 the observed merged data, subject to shuffling. We use (x, y) for a generic pair of matching records. We use X and Y for the row-wise concatenation of {x i } n i=1 and {y i } n i=1 , respectively. We let p(&#8226;) denote the density (PDF) of a list of variables in (&#8226;), and accordingly p(&#8226; | &#8226;) is used for conditional PDFs. We write u &#8672; p to express that random variable u has density p. The symbol E (...) <ref type="bibr">[&#8226;]</ref> is the expectation w.r.t. (. . .). The Hamming distance on the permutation group P(n) of [n] = {1, . . . , n} is denoted by d H . The symbol tr is used for the matrix trace, and I n denotes the identity matrix of dimension n. The cardinality of a set is denoted by | &#8226; |, and I denotes indicator function.</p><p>Conventions. We often refer to a permutation via the underlying map &#8673; and the associated matrix &#8679; = (&#8673; ij ) in an interchangeable fashion, and accordingly P(n) and subsets thereof may refer to both maps and matrices. Asterisked symbols such as &#8673; &#8676; , &#10003; &#8676; , &#8676; etc. refer to ground truth parameters; non-asterisked symbols such as &#8673;, &#10003;, etc. refer to generic elements of the associated parameter spaces.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Approach</head><p>Our approach will be presented as follows: we start with a brief motivation, followed by a more formal systematic introduction, and conclude with technical details pertaining to computation and model fitting. </p><p>) n i=1 and corresponding amplified slope b ML . R: Re-paired data (x b &#8673;(i) , y i ) n i=1 based on the proposed Hamming prior.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Motivating examples</head><p>Consider the simple linear regression setup y i =</p><p>x &#8673; &#8676; (i) &#8676; + &#8676; &#9999; i , where x i and &#9999; i are independent standard normal random variables, 1 &#63743; i &#63743; n, and &#8673; &#8676; permutes 10% of the indices uniformly at random. Suppose that the sign of &#8676; is known to be positive. Then the ML estimator of &#8673; &#8676; (or equivalently, the MAP estimator under a uniform prior over P(n)) is given by the permutation b &#8673; ML that matches the corresponding order statistics in {x i } n i=1 and {y i } n i=1 :</p><p>As shown in Figure <ref type="figure">1</ref>, b &#8673; ML performs rather poorly. The scatterplot of the matching of corresponding order statistics is far from that of the underlying correct pairing. In fact, b &#8673; ML is associated with massive overfitting. Specifically, let</p><p>denote the resulting ML estimators of &#8676; and 2 &#8676; , respectively. It is straightforward to show that </p><p>) based on the proposed prior.</p><p>in probability as n ! 1. This is alarming since it implies that the least squares fit absorbs all the noise.</p><p>Figure <ref type="figure">1</ref> shows that the ML estimator is too aggressive in forming "corrected" pairs (x b &#8673;ML(i) , y i ) given that only 10% of the observations are actually mismatched, and among those 10%, only a fraction contributes substantial mismatch that exceeds the noise inherent in the problem. Sparsity of &#8673; &#8676; is often a reasonable assumption in post-linkage data analysis <ref type="bibr">(Chambers and Diniz da Silva, 2020;</ref><ref type="bibr">Slawski and Ben-David, 2019)</ref>, where sparsity here means that set of mismatches {i 2 [n] : &#8673; &#8676; (i) 6 = i} has significantly smaller cardinality than n. Given an upper bound on the number of mismatches, say k, it is appropriate to consider the following constrained ML estimator of &#8673; &#8676; :</p><p>where id is the identity map on [n] and d H (&#8673;, &#8673; 0 ) = P n i=1 I(&#8673;(i) 6 = &#8673; 0 (i)) denotes the Hamming distance on P(n). To the best of our knowledge, there is no efficient algorithm for computing the maximizer directly. However, there exists a Lagrangian multiplier &gt; 0 such that (3) is equivalent to the optimization problem</p><p>which is a linear assignment problem with cost matrix C = I(i 6 = j) x j y i , which is computationally tractable according to the discussion following (7) below. The right panel of Figure <ref type="figure">1</ref> highlights the improvement that can be achieved by the resulting estimator which here only makes a small number of re-pairings of (x, y) capturing pairs that correspond to massive mismatch error in the left panel.</p><p>Figure <ref type="figure">2</ref> illustrates scenarios in which &#8673; &#8676; is not sparse (with a mismatch rate exceeding 80%), but constrained to be a "local shuffling" in the sense that max i2[n] |&#8673; &#8676; (i) i| &#63743; r, i.e., the corresponding permutation matrix is a band matrix with bandwidth at most r. This scenario is particularly relevant when the data is recorded sequentially (e.g., over different time points) or across a spatial domain endowed with a notion of distance, and it is known that &#8673; &#8676; can only mix up the order of data inside a specific time window or within a local neighborhood. There are numerous applications in which &#8673; &#8676; is locally constrained such as genome se-quencing <ref type="bibr">(Abid et al., 2017</ref><ref type="bibr">), signal processing (Balakhrisnan, 1962;</ref><ref type="bibr">Abbasi et al., 2021)</ref>, or computer vision <ref type="bibr">(Ma et al., 2021b)</ref>. The illustrative example in Figure <ref type="figure">2</ref> can be thought of a regression problem in which the signal is a sine with known frequency but unknown (positive) amplitude &#8676; , i.e., y ti = &#8676; sin(10&#8673;t i )+0.1&#9999; i , 1 &#63743; i &#63743; n (left panel). However, the observed data is of the form (y t &#8673; &#8676; (i) ) n i=1 for some unknown (local) permutation &#8673; &#8676; (middle panel). If</p><p>&#8676; is known to be positive, then the (unconstrained) ML estimator b &#8673; ML of &#8673; &#8676; matches the order statistics {&#181; (i) } n i=1 and {y (i) } n i=1 , where</p><p>In order to improve over the ML estimator using the prior knowledge of local shuffling, we impose the constraint that the alternative estimator b &#8673; does not pair any indices more than r = 3 apart. This estimator can be obtained as solution of the optimization problem max</p><p>where c ij = 0 if |i j| &#63743; r and c ij = +1 otherwise. As in (4), the problem in (5) side is a linear assignment problem and hence computationally tractable, and corresponds to MAP estimation under the family of priors considered below. The corrected, i.e., repaired data</p><p>based on this approach are depicted in Figure <ref type="figure">2</ref> (R).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Exponential family prior on P(n)</head><p>The priors discussed in the two examples of the previous subsection can be understood as specific instances of a more general family of prior distributions over P(n). Specifically, we consider the family of priors</p><p>where &gt; 0 is the concentration parameter, and the matrix M (which is not required to have any specific properties) defines the mode(s) of the distribution argmax &#8679;2P(n) h&#8679;, Mi, where h&#8226;, &#8226;i here represents the trace inner product on the space of matrices. In the same vein, the mode(s) of the distribution correspond to the set of matrices closest to M with respect to the same norm. Moreover, the distribution specified by ( <ref type="formula">6</ref>) is of exponential family form with respect to the trace inner product <ref type="bibr">(Wainwright and Jordan, 2008)</ref>.</p><p>Linear Assignment Problems (LAPs). Linear assignment problems are a class of optimization problems for computing optimal one-to-one matchings of two sets of items <ref type="bibr">(Burkard et al., 2009)</ref>. LAPs are of the form</p><p>where C is a given cost matrix. By the Birkhoff-von Neumann theorem <ref type="bibr">(Ziegler, 1995)</ref>, the minimum over P(n) can be replaced by the minimum over DS(n), the set of n-by-n doubly stochastic matrices. Therefore, (7) reduces to a linear program in n 2 variables and n 2 + 2n linear constraints. This implies that computing a mode of ( <ref type="formula">6</ref>) is tractable.</p><p>Specific examples. Below, we consider a few examples of interest that are special cases of ( <ref type="formula">6</ref>).</p><p>(I) Hamming prior. Consider the choice M = I n . For any &#8679; 2 P(n), we then have h&#8679;,</p><p>denotes the Hamming distance on P(n). Since n does not depend on &#8673;, Eq. ( <ref type="formula">6</ref>) reduces to</p><p>which appeared in the first example of the preceding section, cf. ( <ref type="formula">4</ref>), in which the goal was to take into account the underlying low rate of mismatches. The prior ( <ref type="formula">8</ref>) is a specific Mallows' prior p(&#8673;) / exp( d(&#8673;, &#8673; 0 )) for a base permutation &#8673; 0 and a metric d on P(n) <ref type="bibr">(Mallows, 1957)</ref>.</p><p>(II) Local shuffling prior. As in the second example in &#167;2.1, suppose we want to have the prior p place most of its mass on permutations that move indices within small windows, i.e., |&#8673;(i) i| tends to be small. This can be achieved by choosing the entries of M in (6) as M ij = (|i j|) for a non-decreasing function . The choice (u) = 0 if |u| &#63743; r for a positive integer r and (u) = +1 otherwise yields the approach (5) that underlies the example in Fig. <ref type="figure">2</ref> above.</p><p>(III) Block prior. In record linkage, it is common that</p><p>) is block diagonal with known block composition. For example, suppose that gender, ethnicity, and age group are used as matching variables, and that these three categorical variables are free of errors. In this case, mismatches can only involve pairs (i, j) falling into the same block corresponding to a specific combination of the above variables. Such known block structure can be encoded via prior ( <ref type="formula">6</ref>) by choosing M ij = 1 if (i, j) are not contained in the same block and M ij = 0 otherwise. This corresponds to a uniform prior for each block, i.e., p(</p><p>The prior for each block does not have to be uniform; e.g., a Hamming prior as in Example (I) above can be used instead. Moreover, the hard block constraint can be relaxed.</p><p>(IV) Lahiri-Larsen prior. In their seminal work on linear regression in the presence of mismatch errors, <ref type="bibr">Lahiri and Larsen (2005)</ref> and <ref type="bibr">Chambers (2009)</ref>  <ref type="formula">6</ref>). An example for Q is the so-called exchangeable linkage model <ref type="bibr">(Chambers, 2009;</ref><ref type="bibr">Zhang and Tuoto, 2021)</ref> </p><p>In this case, the resulting prior is equivalent to the Hamming prior considered in Example (I). More complex priors are obtained depending on the structure of Q.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Integration in Shuffled Data Problems</head><p>We now outline how the above prior can be integrated into generic shuffled data problems. The proposed Monte-Carlo EM <ref type="bibr">(Wei and Tanner, 1990)</ref> framework builds upon the paper by <ref type="bibr">Wu (1998)</ref> that has been rediscovered in the more recent work <ref type="bibr">Abid and Zou (2018)</ref>. The MC-EM scheme in <ref type="bibr">Wu (1998)</ref> was further developed in <ref type="bibr">Gutman et al. (2013)</ref> based on the concept of data augmentation <ref type="bibr">(Tanner and Wong, 1987)</ref>. None of <ref type="bibr">Wu (1998)</ref>; <ref type="bibr">Abid and Zou (2018)</ref>; <ref type="bibr">Gutman et al. (2013)</ref> consider informative priors for &#8673;.</p><p>Conditional &amp; Integrated Likelihood. Suppose we are given data D = {(x i , y i )} n i=1 potentially contaminated by mismatch error. Let p(x j , y i ; &#10003;) be the likelihood (depending on a parameter &#10003;) for the pair (x j , y i ), (i, j) 2 [n] 2 . The likelihood for &#10003; resulting from D conditional on a specific &#8673; 2 P(n) is given by</p><p>Conjugacy. It is worth noting that under (9), the posterior p(&#8673;|D, &#10003;) is a member of the family of distributions specified by p(&#8673;), i.e., the latter is a conjugate prior. This follows from the observation that</p><p>with M D,&#10003;, = log(p(x j ,</p><p>The (conditional) likelihood ( <ref type="formula">9</ref>) can be maximized with respect to both &#10003; and &#8673; as, e.g., in <ref type="bibr">Pananjady et al. (2018)</ref>; <ref type="bibr">Abid et al. (2017)</ref>; <ref type="bibr">Slawski and Ben-David (2019)</ref>. Alternatively, &#10003; is considered as the quantity of primary interest, which suggests the integrated likelihood</p><p>As seen in &#167;2.1, maximizing the conditional likelihood is prone to overfitting, prompting a need for regularization. The use of the integrated likelihood mitigates that problem at best slightly, but not substantially (cf. supplement for details), hence regularization remains relevant.</p><p>MC-EM scheme. The Expectation-Maximization (EM) algorithm <ref type="bibr">(Dempster et al., 1977)</ref> is an established heuristic for minimizing the negative log-likelihood `(&#10003;) = log L(&#10003;) corresponding to (11) via a sequence of surrogates { e `(t) (&#10003;; &#10003; (t) )} t 0 that are minimized successively:</p><p>where e `(t) (&#10003;;</p><p>the so-called expected complete data negative loglikelihood. The surrogates { e `(t) (&#8226; ; &#10003; (t) )} tend to be easier to minimize since they are linear combinations of standard likelihood terms as encountered for fixed and known &#8673;. Surrogates are updated according to</p><p>Here, the main challenge of this scheme is the E-step, i.e, the calculation of the expectation on the right term in (12). For any pair (i, j), we have</p><p>Since the summation over P(n) is not computationally tractable, the expectation needs to be approximated, e.g., via Monte Carlo simulation. Since for the same reason, the posterior p(&#8673;|D, &#10003; (t) ) is only accessible up to an unknown constant (cf. ( <ref type="formula">10</ref>)), it is appropriate to resort to Markov Chain Monte Carlo (MCMC) <ref type="bibr">(Gelman et al., 2013)</ref>. The Metropolis-Hastings (MH) algorithm can be used to generate a Markov Chain {&#8673; (k) } k 1 whose stationary distribution equals p(&#8673;|D, &#10003; (t) ). This yields the approximation</p><p>where b denotes the length of the "burn-in" period, and m denotes the total length of the Markov chain. Substituting (13) into ( <ref type="formula">12</ref>) then yields what is known as MC-EM scheme, cf. Algorithm 1. Conveniently, there is a proposal distribution for the MH algorithm that is easy to work with, known as Fisher-Yates sampling: it generates a new permutation from the current one by swapping the assignments of a pair of indices (cf. Algorithm 2).</p><p>Initialization. The choice of the initial iterate &#10003; (0) can critically impact the quality of the solution that is returned by EM schemes given that the latter is a local strategy that finds a stationary point of a (in general) non-convex objective near the initial iterate. Several consistent initial estimators are known for regression setups depending on the structure of &#8673; (Lahiri and Larsen, 2005; Chambers and Diniz da Silva, 2020; <ref type="bibr">Slawski et al., 2021;</ref><ref type="bibr">Peng et al., 2021)</ref>, and those naturally lend themselves as initial iterate.</p><p>Careful initialization of the MH subroutine is important in order to ensure that p(&#8673;|D, &#10003; (t) ) is explored well given that |P(n)| = n! while the number of MCMC iterations m is limited. Fortunately, under the prior (6), computing the mode argmax &#8673; p(&#8673;|D, &#10003; (t) ) reduces to an LAP of the form (7) in virtue of (10). Initialization via the mode has the advantage that the Markov chain is started in a high density region. The hope is that the resulting iterates (generated according to a localized proposal distribution) will pick up most of the mass of p(&#8673;|D, &#10003; (t) ) so that (13) will well approximate the underlying expectation.</p><p>Algorithm 1 Monte Carlo EM (MC-EM) algorithm</p><p>r(e &#8673;, &#8673; (k) ) min n p(e &#8673;|D,&#10003;; )</p><p>Reduction under exponential family likelihood. For a variety of exponential family models, the expected complete data negative log-likelihood (12) involves n instead of n 2 terms. Specifically, (12) will be</p><p>Examples of interest are presented in the sequel.</p><p>(i) Least squares regression. In this case, we take log p(x, y; , 2 ) = (y x &gt; ) 2 /(2 2 ), which corresponds to the negative likelihood of a linear regression model with i.i.d. zero-mean Gaussian errors with variance 2 . This yields the following expression for the expected complete data negative log-likelihood:</p><p>which is identical to a least squares objective with design matrix X and response vector E[&#8679; &gt; |D, &#10003; (t) ]Y.</p><p>(ii) Generalized linear models. In this case, we have log p(x, y; ,</p><p>, where a, and c denote scale, cumulant, and partition function, respectively. Similar to above, one shows that 1 a( )</p><p>While the canonical link is assumed above, this is not necessary to achieve the reduction from n 2 to n terms.</p><p>where z = [x &gt; y &gt; ] &gt; ; zz &gt; has diagonal blocks xx &gt; , yy &gt; and off-diagonal blocks xy &gt; , yx &gt; . Thus tr(&#8998;zz &gt; ) = tr(&#8998; xx xx &gt; ) + tr(&#8998; yy yy &gt; ) + 2tr(&#8998; yx xy &gt; ), where &#8998; xx , &#8998; yy etc. are the corresponding sub-matrices of &#8998;. Hence,</p><p>where</p><p>Computational complexity of Algorithm 1. For exponential family models benefitting from the above reduction, updating &#10003; in the M-step involves n terms, and is computationally equivalent to a standard estimation problem. Apart from the initialization of the Markov chain, the approximate E-step has complexity O(m), where m denotes the length of the Markov chain. Computing the acceptance probability, updating &#8673; (k) , and keeping track of b E[&#8679;|D, &#10003; (t) ] &gt; Y within Algorithm 2 can be done in time O(1) since the proposal distribution only changes &#8673; (k) at two positions. However, m is recommended to be of the order &#8998;(n), heuristically justified by the fact that in the worst case a permutation is the product of n 1 transpositions.</p><p>Remarks. (i) Following <ref type="bibr">Tanner and Wong (1987)</ref> and Gutman et al. ( <ref type="formula">2013</ref>), the MC-EM approach can be converted into a fully Bayesian approach: given that MC-EM involves sampling from p(&#8673;|D, &#10003;), one can as well sample from p(&#10003;|D, &#8673;) in an alternating fashion, which yields a Gibbs sampler for the joint posterior p(&#10003;, &#8673;|D). (ii) The framework herein is not limited to permutations: we may be given D x = {x i } N i=1 and D y = {y i } n i=1 , N &gt; n (w.l.o.g.), and then consider maps &#8673; : [n] ! [N ] represented by a matrix &#8679; 2 {0, 1} n&#8677;N with unit row sums. Priors of the form (6) given a mode M 2 R n&#8677;N as well as conditional and integrated likelihoods can be defined analogously to (9) and ( <ref type="formula">11</ref>). (iii) We think of the sampling scheme as a template rather than an efficient approach; improving efficiency, e.g., along the lines of Zanella (2020); <ref type="bibr">Grathwohl et al. (2021)</ref>, is left for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Theoretical Insights</head><p>In this section, we present some analysis of the proposed prior from a regularization perspective and provide guidance on the choice of the hyperparameter . Data-driven selection of based on Empirical and Hierarchical Bayes approaches are detailed in the supplement.</p><p>Hamming prior. Our first results concerns the MAP estimator of &#8679; &#8676; under the Hamming prior (8). Specifically, we consider the linear regression setup</p><p>as considered in prior work on shuffled linear regression <ref type="bibr">(Pananjady et al., 2018;</ref><ref type="bibr">Hsu et al., 2017)</ref>. The theorem be-low considers the sparse setting in which the underlying &#8673; &#8676; satisfies the constraint d H (&#8673; &#8676; , id) &#63743; k for k "small enough" as made precise below. For simplicity, it is assumed that &#8676; and 2 &#8676; are known; various estimators for the regression parameter in this scenario have been proposed <ref type="bibr">(Zhang and Li, 2020;</ref><ref type="bibr">Slawski et al., 2021;</ref><ref type="bibr">Peng et al., 2021)</ref>. </p><p>with probability at least 1 2/n and 1 3/n, respectively, where &#181; = (&#181; i (x)) n i=1 and SNR = k &#8676; k 2 2 / 2 &#8676; . Theorem 3.1 implies that if is chosen larger than the threshold 0 , the MAP estimator b &#8679; will be 2k-sparse, which matches the sparsity of &#8679; &#8676; up to the factor 2. By the triangle inequality, d</p><p>&#8679; and &#8679; &#8676; will be close in Hamming distance. Moreover, for values such that 3 0 &#63743; &#63743; C 0 for C &gt; 3, we obtain</p><p>The dependence on the signal-to-noise ratio SNR is improved compared to the naive estimator b &#8679; 0 = I n , whose error scales as &#8676; p k log(en/k)SNR 1/2 ; for small SNR, one cannot hope for improvements over b &#8679; 0 in general. In light of the discussion in &#167;2.1, the improvement over the maximum likelihood estimator b &#8679; ML whose error scales as &#8676; p n, is substantial as long as k is small relative to n. The next result yields a lower bound on ensuring a prescribed level of sparsity k based on the prior only.</p><p>Proposition 3.2. Suppose that &#8673; follows the Hamming prior (8) with parameter . Then for all 2 &#63743; k &lt; n</p><p>Proposition 3.2 asserts that the hyperparameter of the prior (8) should be chosen proportional to log(n k) &#8672; log n as n gets large in order to ensure that the prior places essentially no mass outside the Hamming ball {&#8673; : d H (&#8673;, id) &#63743; k}. The threshold &#8672; log n is sharp in the sense that if &#63743; log(n k), the prior will place at least mass &#8998;(1) = 1 4 !k/k! &#8672; 1 4e for not too small k outside that Hamming ball. The likelihood p(D|&#8673;) favors permutations with best fit to the given data, so that the posterior mass P &#8673;|D ({&#8673; : d H (&#8673;, id) &#63743; k}) will be less than the prior mass. It is thus natural to consider &#8672; log n as initial point.</p><p>Local shuffling prior. The next statement addresses the scenario in Fig. <ref type="figure">2</ref> for Lipschitz functions. In particular, the level of penalty needed for the MAP solution to satisfy the condition max i |b &#8673;(i) i| &#63743; r is provided.</p><p>&#8672; N (0, 1), where</p><p>, with probability at least 1 exp( ( p 2 1) 2 /2) 2/n. Under the same event,</p><p>In theory, the assertion of the above proposition can be achieved by setting = 1. Established solvers of LAPs require the entries of the cost matrix to be finite. In addition, solver accuracy can degrade with the magnitude of the entries <ref type="bibr">(Bernhard, 2021)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section, we present the results of experiments conducted with synthetic and real data. In the supplement, we also show an example demonstrating the use of the data augmentation approach discussed at the end of &#167;2.3 as an alternative to the Monte-Carlo EM scheme.</p><p>Synthetic data. We consider data generation according to the following three models ( 1 &#63743; i &#63743; n):</p><p>Linear Regression (LR):</p><p>i=1 and {&#9999; i } n i=1 are i.i.d random samples from the N (0, I d ) and N (0, 1) distributions, respectively. The regression parameter &#8676; is sampled uniformly from the sphere { 2 R d : k k 2 = 3}, and &#8676; 0 &#8672; N (0, 1). For MVN, we let &#181; &#8676; = 0 and &#8998;</p><p>Finally, &#8673; &#8676; is a permutation selected uniformly at random from one of the following constraint sets:</p><p>(ii) r-Banded:</p><p>where P(B) denotes the set of block-structured permutations corresponding to B blocks of uniform size n/B, i.e., {1, . . . , n/B}, . . . , {(B 1)(n/B) + 1, . . . , n}. Note that in (i), k refers to the number total of mismatches, whereas in (iii) k refers to the number of mismatches per block. We fix n = 1, 000, d = 20, &#8676; = 1, &#8674; &#8676; = 0.8, p = q = 5, B = 50. The mismatch rates k/n and k &#8226; B/n in (15)(i) and (15)(iii), respectively, are varied between 0.2 and 0.5 in steps of 0.05, and the bandwidth r in (15)(ii) is varied between 3 and 10. For each setup and each value of k and r, 100 independent replications are performed. The following approaches are compared:</p><p>(I) naive. Standard maximum likelihood estimation as used for parameter estimation in the absence of mismatches, which corresponds to fixing &#8673; = id as the identity.</p><p>(II) oracle. The unknown permutation &#8673; &#8676; is considered as known, and standard maximum likelihood estimation is used for parameter estimation after fixing &#8673; = &#8673; &#8676; .</p><p>(III) robust [for setting k-Sparse only]. For setup LR, the regression parameter is estimated on the robustfit function in <ref type="bibr">(MATLAB, 2019)</ref>. For setup GLM, the regression parameter is estimated based on the robust GLM estimation method <ref type="bibr">(Wang et al., 2020)</ref> that uses observationspecific dummy variables and penalization. For setup MVN, &#8998; 1 &#8676; is estimated according to the robustcov function in MATLAB which implements the minimum covariance determinant estimator <ref type="bibr">(Rousseeuw and Driessen, 1999)</ref>.</p><p>(IV) EM, EMH, EML, EMB. Algorithm 1 using uniform, Hamming, local shuffling, and block-Hamming prior, respectively, which reflect the constraint sets (i) to (iii) in ( <ref type="formula">15</ref>). The EM iterations are initialized by setting &#8673; = id, and the number of EM iterations is limited to 400. The number of MCMC iterations per EM iteration is set to 8k, half of which are counted towards the "burn-in period". We note that a modified MH algorithm is used for EML (cf. supplement); for EMB, the MH scheme in Algorithm 2 is applied blockwise. For the Sparse and SparseBlock settings, the hyperparameter is chosen based on Proposition 3.2, which suggests / log(n). For the Banded setting, the choice = 1 was found to yield good performance.  <ref type="formula">2009</ref>) and <ref type="bibr">Lahiri and Larsen (2005)</ref> with the choice Real data. We consider seven benchmark data sets for shuffled data problems. The data sets are preprocessed versions of their original counterparts (details on data processing can be found in the supplement). Even though the data sets themselves are real, the permutations that scramble the given matching pairs {(x i , y i )} n i=1 are synthetic; for each data set, we consider 100 independent random permutations depending on the underlying setting. We consider the same list of competitors and associated configurations as for the synthetic data experiments. Asterisked ground truth parameters here refer to oracle estimates based on knowledge of &#8673; &#8676; , and relative estimation error (REE) is defined accordingly in terms of those ground truth parameters.</p><p>Hamming &amp; Block prior. As shown in Fig. <ref type="figure">4</ref>, the proposed approach consistently improves over naive least squares once the fraction of mismatches exceeds 0.2, and yields substantial improvements as that fraction increases. The regularized EM approach based on the priors in EMB, EMH, EML noticeably reduces error induced by shuffling.</p><p>Local shuffling prior. As shown in Table <ref type="table">1</ref>, the EM approach with local shuffling prior achieves significant error reductions compared to the naive approach and the EM approach without regularization. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper, we have proposed a framework for regularized estimation in shuffled data problems by means of an exponential family prior on the permutation group. The exponential family form is convenient for computational purposes yet sufficiently rich to incorporate common forms of prior knowledge. The proposed prior is not tailored to specific data analysis problems, but can be applied generically. The results in this paper confirm the importance of regularization in shuffled data problems given the inherent danger of overfitting already with little noise. While the approach covers various constraints that can be imposed on the underlying permutation, it is not exhaustive. For example, suppose we have information on the cycles of the permutation (numbers or lengths). Such information cannot be expressed in terms of index pairs, and hence requires a different paradigm. <ref type="bibr">Kondor et al. (2007)</ref>; <ref type="bibr">Huang et al. (2009)</ref> use Fourier analysis on the permutation group <ref type="bibr">(Diaconis, 1988)</ref> to facilitate learning of permutations, and it is an interesting direction of future research to study how that approach can be leveraged for the type of shuffled data problems considered in the present paper.</p><p>A Detailed derivations of the expected complete data negative log-likelihood for selected models</p><p>In this section, we provide detailed steps for deriving the expected complete data negative likelihood for least squares regression and precision matrix estimation for multivariate Normal data.</p><p>Recall that the expected complete data negative likelihood is given by</p><p>In this case, we take log p(x, y; , 2 ) = (y x &gt; ) 2 /(2 2 ), and we have that</p><p>(ii) Precision matrix estimation for multivariate normal data. In this case, log p(x, y; &#8998;) = log det &#8998; + tr(&#8998;zz &gt; ), where z = [x &gt; y &gt; ] &gt; ; zz &gt; has diagonal blocks xx &gt; , yy &gt; and off-diagonal blocks</p><p>and analyzing each of the terms accordingly.</p><p>C Proof of Theorem 3.1</p><p>In light of relation ( <ref type="formula">10</ref>), straightforward manipulations and omission of terms not depending on &#8679; show that the MAP estimator b &#8679; is the minimizer of the optimization problem</p><p>Since b &#8679; minimizes (S.2), the following basic inequality holds true:</p><p>Decomposing Y = &#181; + &#8672; with &#8672; = &#8676; &#8679; &#8676; &#9999; and re-arranging terms in the above inequality yields that</p><p>where we have substituted d H (&#8679; &#8676; , I n ) = k. By the Cauchy-Schwarz inequality, h&#8679; &#8676; &#181;, b &#8679;&#181;i &#63743; k&#8679; &#8676; &#181;k 2 2 , which implies that the first term in the previous inequality is non-negative. This in turn yields that h&#8672;, ( b</p><p>In the sequel, we will derive a probabilistic lower bound on the first term on the left hand side.</p><p>For any integer 1 &#63743; s &#63743; n, consider the event where for any integer 1 &#63743; `&#63743; n, B 0 (`) here denotes the set of all unit vectors having at most `non-zero entries.</p><p>Denote by w(B(`)) = E g&#8672;N (0,In) [sup u2B(`) hu, gi] the corresponding Gaussian width <ref type="bibr">(Vershynin, 2018, &#167;7.5)</ref>. Choosing t = w(B(m s )) + c 1 p 2 log n in (S.5) for c 1 1, standard tail bounds for the suprema of Gaussian processes <ref type="bibr">(Boucheron et al., 2013, Theorem 5.8)</ref> yield</p><p>Using that w(B 0 (m s )) &#63743; 4 p m s log(en/m s ) (e.g., <ref type="bibr">Plan Vershynin, 2013, Lemma 2.3</ref>) and the fact that s log(en/s) log n for any n s 1, we have with c 1 = p 2</p><p>Combining this with the definition of v yields that</p><p>, and note that</p><p>Using the same argument as above, we choose t = &#8676; &#8999; s {w(B 0 (m s )) + c 2 p 2 log n} with c 2 = p 2. Putting together the pieces as above, we obtain</p><p>Now let 0 = 72 p SNR log(en/k) 72 p SNR log(en/m s ) and define the event</p><p>Observe that conditional on E s \ G s , the earlier inequality (S.4) implies that (recalling that m s = s + k)</p><p>Combination of the left and right hand sides and re-arranging terms implies the inequality 0 s k s + k Now for any s 2k, the right hand side is lower bounded by (1/3) . This in turn yields a contradiction whenever is chosen such that &gt; 3 0 . In order to conclude that d H ( b &#8679;, I n ) &#63743; 2k with the stated probability in that case, it remains to provide a corresponding lower bound on the probability of the event S n s=1 (E s \ G s ), i.e., at least one of the events {E s \ G s } n s=1 occurs. Since the events inside the union are disjoint, we obtain that</p><p>(1 2/n 2 ) P(E s ) 1 2/n.</p><p>In order to prove the second part of Theorem 3.1, we first invoke the following basic inequality equivalent to (S.3)</p><p>Expanding the squares and re-arranging yields conditional on</p><p>where in the second inequality, we have used that if &gt; 3 0 , conditional on on</p><p>After elementary manipulations, we obtain the inequality</p><p>with probability at least 1 2/n 1/n = 1 3/n, where the term sup u2B0(3k) is controlled similarly to (S.6).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Proof of Proposition 3.2</head><p>The probability mass function of ( <ref type="formula">8</ref>) is given by <ref type="bibr">(Fligner and Verducci, 1986)</ref>:</p><p>In the sequel, let us write {D(&#8673;) = d} as a shortcut for the event {d H (&#8673;, id) = d}. We then have</p><p>where !d denotes the number of derangements of {1, . . . , d}, i.e., the number of permutations &#8999; of d objects such that &#8999; (j) 6 = j for all 1 &#63743; j &#63743; d. Straightforward manipulations yield</p><p>For x 0 and integer m 1, define the (upper) incomplete Gamma function and its "normalized" counterpart by</p><p>where (m) = (m, 0) = (m 1)! denotes the Gamma function. It can be shown that <ref type="bibr">(Abramowitz and Stegun, 1964</ref></p><p>, we obtain the following for the right hand side of (S.8):</p><p>e (n + 1, exp( ) 1) .</p><p>At this point, we consider the upper bound on the probability of interest as stated in the proposition. We have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>provided</head><p>(1 + ) log n, which concludes the proof of the upper bound.</p><p>Regarding the lower bound, observe that in view of relation (S.9), the ratio of normalized incomplete Gamma functions can be expressed via the ratio of CDFs of two independent Poisson random variables, that is</p><p>where X 1 and X 2 are two independent Poisson random variables with parameters exp( ) and exp( ) 1, respectively. Setting = log(n k) yields that the right hand side is a function of the form c 1 (n, k) that is lower and upper bounded by 1/4 and 1, respectively, as</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Proof of Proposition 3.3</head><p>Similar to Eq. (S.3) in the proof of Proposition 3.1, we have the basic inequality</p><p>In the sequel, we will show that under the stated conditions, the left hand side must exceed the right hand side unless b &#8679; ij = 0 for all (i, j) such that |i j| &gt; r. Expanding Y = &#8679; &#8676; &#181; + &#8676; &#9999; and re-arranging terms yields the inequality</p><p>where we have used that k&#8679; &#8676; &#181;k 2 2 h b &#8679;&#181;, &#8679; &#8676; &#181;i 0. For the second term on the right hand side, the triangle inequality yields that for all indices i that are summed over, we have |b &#8673;(i) &#8673; &#8676; (i)| &#63743; 2r. Using the Cauchy-Schwarz inequality in combination with the Lipschitz property of the underlying function, we obtain that</p><p>By standard concentration results (e.g., Wainwright, 2019, &#167; 2.3), the event E 1 = {k&#9999;k 2 &#63743; p 2n} holds with probability at least 1 exp(( p 2 1) 2 /2). We now turn to the first term on the right hand side of (S.11). We have the upper bound X &#8676; ensures that the left hand side exceeds the right hand side, which is a contradiction, and hence it must hold that |b &#8673;(i) i| &#63743; r, 1 &#63743; i &#63743; n.</p><p>Observe that conditional on the event {|b &#8673;(i) i| &#63743; r, 1 &#63743; i &#63743; n}, the basic inequality (S.10) reduces to hY, b &#8679;&#181;i &#63743; hY, &#8679; &#8676; &#181;i () kY b &#8679;&#181;k 2 2 &#63743; kY &#8679; &#8676; &#181;k 2 2 Substituting Y = &#8679; &#8676; &#181; + &#9999; in the inequality of the right hand side and expanding squares, we obtain that conditional on {|b &#8673;(i) i| &#63743; r, 1 &#63743; i &#63743; n} and E 1</p><p>with the same arguments as used for (S.11) and (S.12). Dividing both sides by n yields the assertion. if r(e &#8673;, &#8673; (k) ) &gt; u: &#8673; (k+1) e &#8673;.</p><p>else: &#8673; (k+1) &#8673; (k) .</p><p>end for return b E[&#8673;|D, &#10003;] as in ( <ref type="formula">13</ref>) with m replaced by m invalid-mcmc-steps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G Additional information regarding real data analysis</head><p>In this section, we provide references of each data set and regression model used in the real data analysis. A summary of each data set is shown in Table <ref type="table">S</ref>.1 below.  <ref type="bibr">(Slawski et al., 2021)</ref> 2011 2 LR hamming 2k El Nino data(END) <ref type="bibr">(Slawski et al., 2021)</ref> 93935 5 LR block 1.5k* CPS wages(CPS) <ref type="bibr">(Slawski et al., 2021)</ref> 534 11 LR hamming 2k Bike sharing data(BSD) <ref type="bibr">(Wang et al., 2020)</ref> 731 16 GLM block 1.5k* Flight Ticket Prices(FTP) <ref type="bibr">(Slawski et al., 2020)</ref> 335 30 6 MVN hamming 2k Supply Chain Management(SCM) <ref type="bibr">(Slawski et al., 2020)</ref>   where Inv-2 (&#9003;, a 2 ) refers to the scaled inverse 2 -distribution with scale parameter a &gt; 0 and &#9003; degrees of freedom (cf. &#167;14 in <ref type="bibr">Gelman et al. (2013)</ref> for more details on Bayesian inference for linear regression models).</p><p>For this illustration, we use m = 100, where each sequence {&#8673; (j) } is generated by uniform thinning of Markov chains of length 4, 000 generated by Algorithm 2. The number of samples ( (k) , 2 (k) ) obtained via the above scheme is taken as 1,000. The sampling procedure is initialized from step (II) with the identity permutation. We compare both the unregularized case with the uniform prior for &#8673; as well as the regularized case with the Hamming prior (8) ( = log n in view of Proposition 3.2). Figure S.1 confirms that the proposed approach achieves visible improvements over the unregularized approach which suffers from serious amplification bias affecting the slope parameter 1 and serious underestimation of the error variance, as predicted by the brief analysis accompanying the first introductory example in &#167;2.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>J Empirical and Hierarchical Bayes approaches</head><p>In this section, we outline how the hyperparameter of the proposed prior on &#8673; can be selected based on Empirical and Hierarchical Bayes approaches. For simplicity, these approaches are presented for the linear regression model ( <ref type="formula">14</ref>) in an exemplary fashion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>J.1 Hierarchical Bayes</head><p>Consider the following hierarchical model specification: where ( ) denotes the terms in the normalization constant in the prior p(&#8673;| ) that depend on .</p></div></body>
		</text>
</TEI>
