<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Optimal sample acquisition for optimally weighted PCA from heterogeneous quality sources</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE Signal Processing Letters</publisher>
				<date>01/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10578897</idno>
					<idno type="doi">10.1109/LSP.2025.3550280</idno>
					<title level='j'>IEEE Signal Processing Letters</title>
<idno>1070-9908</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>David Hong</author><author>Laura Balzano</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Modern high-dimensional datasets are often formed by acquiring samples from multiple sources having heterogeneous quality, i.e., some sources are noisier than others. Collecting data in this manner raises the following natural question: what is the best way to collect the data (i.e., how many samples should be acquired from each source) given constraints (e.g., on time or energy)? In general, the answer depends on what analysis is to be performed. In this paper, we study the foundational signal processing task of estimating underlying low-dimensional principal components. Since the resulting dataset will be high-dimensional and will have heteroscedastic noise, we focus on the recently proposed optimally weighted PCA, which is designed specifically for this setting. We develop an efficient method for designing sample acquisitions that optimize the asymptotic performance of optimally weighted PCA given resource constraints, and we illustrate the proposed method through various case studies.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>M ODERN high-dimensional datasets are often formed by combining samples acquired from multiple sources, where the sources have heterogeneous quality and cost. Namely, some sources are noisier than others, and acquiring samples from higher-quality (less noisy) sources is typically more costly (whether in time or energy). For example, air quality data are currently acquired using both low-cost consumergrade sensors and high-precision instruments that are carefully maintained by government agencies <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>. In general, one can imagine deploying sensor networks with a heterogeneous mix of sensors of varying cost and corresponding quality.</p><p>One often seeks to find underlying low-dimensional structure revealed by the data. Principal component analysis (PCA) is a foundational technique for this task, and it is a workhorse method in modern signal processing. For example, PCA has been used to identify sources of air pollution from air quality data <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>. However, conventional PCA is not designed for samples with heterogeneous amounts of noise; it treats all samples uniformly. A weighted PCA that gives noisier samples less weight is a more suitable method, and recent work <ref type="bibr">[5]</ref> derived optimal weights for this scenario. The resulting optimally D. <ref type="bibr">Hong</ref> was supported in part by NSF Graduate Research Fellowship DGE 1256260, NSF Grant ECCS-1508943, NSF BIGDATA grant IIS 1837992, the Dean's Fund for Postdoctoral Research of the Wharton School, and NSF Mathematical Sciences Postdoctoral Research Fellowship DMS 2103353. L. Balzano was supported in part by DARPA-16-43-D3M-FP-037, NSF CAREER award CCF-1845076, NSF Grant ECCS-1508943, and ARO YIP Award W911NF1910027. D. Hong is with the Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19716 USA (email: hong@udel.edu). L. Balzano is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA.</p><p>weighted PCA optimally downweights noisier samples to best recover the underlying low-dimensional components.</p><p>This paper tackles the following question: what is the best way to acquire samples for optimally weighted PCA? Namely, given multiple sources of data, how many samples should be acquired from each source to maximize the performance of optimally weighted PCA? Naturally, the more data, the better the performance of optimally weighted PCA. However, in practice there are typically constraints. In particular, acquiring samples often has associated costs, resulting in constraints of the following form:</p><p>where n 1 , . . . , n L are the number of samples acquired from each of L available data sources, &#954; 1 , . . . , &#954; L are the per-sample costs for each source, and &#964; is the corresponding budget. Each type of cost (e.g., in time or energy) adds a constraint of this form. Moreover, each source often has a finite quantity of samples it can provide, resulting in additional constraints:</p><p>where q 1 , . . . , q L are the quantity of samples that each source can provide. Both (1) and ( <ref type="formula">2</ref>) are linear constraints, so can be captured in general as follows:</p><p>where n := (n 1 , . . . , n L ) is the number of samples to acquire from each source, denotes entrywise inequality, and A and b define the constraints. The goal is to choose n to maximize the performance of optimally weighted PCA subject to the constraints <ref type="bibr">(3)</ref>. A number of recent works have studied the topic of PCA for high-dimensional heterogeneous-quality data <ref type="bibr">[5]</ref>- <ref type="bibr">[21]</ref>, but they generally focus on how to use the given data rather than on how to acquire it. The question of how to best acquire data falls within the field of experimental design, for which there are numerous works (many more than can be reviewed here, see, e.g., the textbooks <ref type="bibr">[22]</ref>- <ref type="bibr">[27]</ref>). However, to the best of our knowledge, the question we consider here (optimal design for optimally weighted PCA) has not yet been addressed. This paper tackles this open problem.</p><p>Section II describes the data model and reviews optimally weighted PCA. Section III states the main result: a characterization of the optimal sample acquisition designs that enables us to develop a computationally efficient algorithm for finding an optimal design. Sections IV and V demonstrate the method through an illustrative example and case studies. Section VI concludes the paper with a proof of the main result. II. DATA MODEL AND OPTIMALLY WEIGHTED PCA This paper considers acquiring n 1 , . . . , n L samples from L data sources with associated noise variances v 1 , . . . , v L &gt; 0, where the sources share k underlying orthonormal components u 1 , . . . ,</p><p>Precisely put, we model the data block Y &#8467; &#8712; C d&#215;n &#8467; obtained by acquiring n &#8467; samples from the &#8467;th source as in <ref type="bibr">[5]</ref>:</p><p>where d is the number of features,</p><p>is a deterministic factor matrix common to all the sources, &#8226; Z &#8467; &#8712; C k&#215;n &#8467; is a coefficient matrix with IID entries that have zero mean and unit variance, &#8226; E &#8467; &#8712; C d&#215;n &#8467; is a noise matrix with IID entries that have zero mean and variance v &#8467; &gt; 0, and the noise further satisfies a technical condition: bounded a-th moment for some a &gt; 4, i.e., sup i,j E|(E &#8467; ) i,j | a &lt; &#8734;. <ref type="foot">1</ref>Since the data sources have different amounts of noise, one should use a weighted PCA to account for their heterogeneous quality. Given weights w := (w 1 , . . . , w L ), weighted PCA estimates the ith underlying component u i as</p><p>i.e., &#251;i (w, Y ) is the ith leading eigenvector of the w-weighted sample covariance matrix. Naturally, one wants to use weights that maximize the recovery of u i . Such weights were recently found in <ref type="bibr">[5]</ref>. Namely, asymptotically optimal weights w i and their corresponding asymptotic performance r i are given by</p><p>r i when the number of features d and the numbers of samples n 1 , . . . , n L are all sufficiently large. Specifically, r i is the optimal performance in the limit where d and n 1 , . . . , n L all grow to infinity with fixed aspect ratios n &#8467; /d; 3 see <ref type="bibr">[5]</ref> for further details. This highdimensional regime corresponds to many big data settings.</p><p>Note that the asymptotic performance of optimally weighted PCA is component-specific, so the optimal sample acquisition strategy may differ from component to component. Note also that computing r i requires either knowing the signal and noise variances &#955; i and v 1 , . . . , v L a priori or estimating them from data, e.g., using the methods described in <ref type="bibr">[5,</ref><ref type="bibr">Example 7.2]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. MAIN RESULT: OPTIMAL SAMPLE ACQUISITION</head><p>The problem is to choose n := (n 1 , . . . , n L ), the numbers of samples to acquire from each source, to optimize the performance <ref type="bibr">(7)</ref> of optimally weighted PCA subject to linear constraints on n. Precisely put,</p><p>where R + denotes the set of nonnegative real numbers, is entrywise inequality, A and b define the linear constraints, and we have made the dependence of r i on n explicit. Throughout this paper, we will assume that A and b define nontrivial constraints in the sense that {n &#8712; R L + : An b} is not only nonempty but also bounded. Otherwise, one can of course take n &#8594; &#8734; and achieve perfect performance. However, one can always expect to have nontrivial constraints in practice since collecting and processing samples takes both time and space.</p><p>Solving the optimization problem ( <ref type="formula">8</ref>) is nontrivial because r i (n) is not given as an explicit expression in n. It is instead defined implicitly in <ref type="bibr">(7)</ref> via the roots of a rational function with coefficients that depend on n. Furthermore, r i (n) is a nonlinear function of n. Fortunately, as our main result shows, solutions occur at extreme points of the constraint region.</p><p>Theorem 1 (Optimal sample acquisition). The optimal sample acquisition problem (8) has a solution (not necessarily unique) that is an extreme point of the constraint polyhedron</p><p>Remark 1 (Non-integer values). Note that we have relaxed n to have nonnegative real-valued (rather than integer-valued) entries. This relaxation should have little impact in the highdimensional regime of interest; simply round the solution of (8) to obtain integer numbers of samples to acquire. One can also avoid this technicality by instead formulating <ref type="bibr">(8)</ref> in terms of the (already real-valued) asymptotic aspect ratios n &#8467; /d.</p><p>It follows from Theorem 1 that global optimization of (8) can be accomplished by simply choosing the best among the extreme points of P (a finite set). This leads naturally to the following simple and efficient algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 Sample acquisition optimization</head><p>choose best candidate 4: return n Remark 2 (Computing extreme points). The extreme points of P in Line 2 of Algorithm 1 can be efficiently obtained from (A, b) using standard polyhedral software; see, e.g., <ref type="bibr">[29]</ref>, <ref type="bibr">[30]</ref>.</p><p>The runtime of Algorithm 1 can grow dramatically when the number of sources is large; developing algorithms that scale better is an interesting direction for future work. That said, we found that cases with tens to even hundreds of constraints can often complete in fractions of a second when the number of sources is moderate (e.g., around ten or less), as is the case in many applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. ILLUSTRATIVE EXAMPLE</head><p>We illustrate our main result (Theorem 1) and the resulting algorithm (Algorithm 1) with the following example of two sources with limited quantities and a single budget constraint. Example 1. Consider acquiring d = 100 dimensional samples from L = 2 sources with an underlying component u i having signal variance &#955; i = 10, where &#8226; source 1 samples have noise variance v 1 = 2, a limited quantity of 180 samples, and a per-sample cost of 1, &#8226; source 2 samples have noise variance v 2 = 1, a limited quantity of 90 samples, and a per-sample cost of 3, and the total budget is 300. This yields the linear constraints defined by A = &#63726; &#63728; 1 0 0 1 1 3 &#63737; &#63739; , b = &#63726; &#63728; 180 90 300 &#63737; &#63739; , illustrated in Fig. 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>An b</head><p>To find the optimal sample acquisition, we proceed according to Algorithm 1 (based on Theorem 1): 1) Find the extreme points E of P = {n &#8712; R L + : An b}: E = {(0, 0), (0, 90), <ref type="bibr">(30,</ref><ref type="bibr">90)</ref>, (180, 40), (180, 0)}.</p><p>2) Evaluate r i (n) for each n &#8712; E and choose the best:</p><p>Note that (0, 0), (0, 90), and (180, 0) can all be skipped because either n 1 or n 2 can be increased for each choice, trivially yielding an improvement in performance. The optimal sample acquisition n = (180, 40) corresponds to collecting all the available samples from source 1 (i.e., all the cheaper, noisier samples that are available) then spending the rest of the budget on samples from source 2 (i.e., the more costly, higher quality samples).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CASE STUDIES AND INSIGHTS</head><p>The following case studies demonstrate Theorem 1 through a few interesting scenarios that provide some new insights into optimal sample acquisition for optimally weighted PCA.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Optimal sample acquisition under a single constraint</head><p>Suppose the constraint region reduces to a single constraint. For example, consider the setting of Example 1 but with over 300 source 1 samples and over 100 source 2 samples available. In this case, the constraints reduce to the single constraint</p><p>with three extreme points E = {(0, 0), (300, 0), (0, 100)}. As before, (0, 0) can be skipped, so either (300, 0) or (0, 100) is optimal. Namely, it is optimal to use only one source. In this case, (300, 0) is better, i.e., it is optimal to acquire samples from only the noisier less-costly source. Indeed, as the following corollary states, it is always optimal to use only one source when there is effectively one constraint.</p><p>Corollary 2 (Optimality for a single constraint). If the constraint region reduces to a single constraint, i.e., &#8707; a&#8712;R L ,b&#8712;R P = {n &#8712; R L + : a n b}, then the optimal sample acquisition problem (8) has a solution using only one source, i.e., a 1-sparse solution of the form n = (0 &#8467;-1 , b/a &#8467; , 0 L-&#8467; ) where &#8467; &#8712; {1, . . . , L}. <ref type="bibr">(10)</ref> Corollary 2 follows straightforwardly from Theorem 1 by observing that the nonzero extreme points are of the form <ref type="bibr">(10)</ref>.</p><p>While this result is perhaps natural given Theorem 1, note that it was not obvious a priori. It would have been natural to expect the optimal sample acquisition to require a precise mix of both high-quality high-cost samples and low-quality low-cost samples, especially given the nonlinearity of r i (n).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Optimal sample acquisition for multiple components</head><p>The performance of optimally weighted PCA depends nontrivially on the signal variance, so may vary from component to component. Thus, the optimal sample acquisition strategy could also differ among components. However, unlike the optimal weights (6), the optimal sample acquisition can also be the same for components that have different signal variances.</p><p>Consider the setting of Example 1 but where we sweep &#955; i from 1 to 10. Fig. <ref type="figure">2</ref> shows the optimal sample acquisition strategies computed by Algorithm 1 (omitting the small interval &#955; i &#8712; [1, 1.013] where the optimal performance was zero). Notably, the optimal sample acquisition strategy is constant over large intervals. Thus, there can be a single optimal strategy for all the components, as long as their signal variances lie in the same interval. Interestingly, these results also suggest that the optimal strategy may be somewhat robust to potential errors in estimating the signal and noise variances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. PROOF OF MAIN RESULT (THEOREM 1)</head><p>Let Q be the solution set for the optimization problem <ref type="bibr">(8)</ref>, and let &#961; be the associated optimal value, i.e.,</p><p>The goal is to show that</p><p>Note first that <ref type="bibr">(12)</ref> holds trivially if &#961; = 0 because Q = P in that case and P always has at least one extreme point (by <ref type="bibr">[31,</ref><ref type="bibr">Theorem 2.6]</ref>) since it is a nonempty bounded polyhedron (by assumption). So it remains to show <ref type="bibr">(12)</ref> for &#961; &gt; 0.</p><p>The proof for &#961; &gt; 0 proceeds in two stages: (i) show that Q has extreme points, and (ii) show that extreme points of Q are also extreme points of P. (12) then follows immediately.</p><p>Stage 1. This stage shows that Q has extreme points when &#961; &gt; 0. To begin, we establish the following properties of Q: (nonempty) r i (n) is a continuous function on the domain P and P is both compact (it is a bounded polyhedron) and nonempty. Thus, r i (n) attains its maximum &#961; on P, and Q is nonempty. (bounded) Q &#8838; P and P is bounded (by assumption), so Q is also bounded. (polyhedron) Note that</p><p>where</p><p>because R i,n (x) has exactly one root in the interval (0, 1) unless L &#8467;=1 (n &#8467; /d)(&#955; i /v &#8467; ) 2 &#8804; 1, in which case it has no roots in the interval. Thus, Q can be expressed as follows:</p><p>where the first line follows from the definition of Q, the second line follows from (13), the third line follows by substituting <ref type="bibr">(14)</ref>, and the fourth line is rearranging. It follows from the form of (15) that Q is a polyhedron. Hence, Q is a nonempty bounded polyhedron, and it follows from <ref type="bibr">[31,</ref><ref type="bibr">Theorem 2.6</ref>] that Q has at least one extreme point.</p><p>Stage 2. This stage shows that any extreme point of Q must also be an extreme point of P when &#961; &gt; 0. We proceed by proving the contrapositive, i.e., that any non-extreme point of P cannot be an extreme point of Q.</p><p>Note that the statement trivially holds for any non-extreme point of P that is not in Q. So, it remains to consider the case where the non-extreme point of P is in Q</p><p>Diagram illustrating the relative positions of s, t, n , and t. Namely, n is between s and t, and t is between s and n .</p><p>be a non-extreme point of P. Then, there must exist points s, t &#8712; P \ {n } and &#947; &#8712; (0, 1) so that</p><p>Moreover, since r i (n) is continuous and P is convex, we can always choose the points s and t to be close enough to n so that r i (s), r i (t) &gt; &#961; /2. Without loss of generality, suppose that r i (s) &#8804; r i (t). Since n maximizes r i (n), this yields</p><p>Thus, it follows by the intermediate value theorem that r i (n) takes on the value &#961; := r i (t) at a point between s and n , i.e., there exists some &#947; &#8712; [0, 1] so that r i ( t) = &#961;, where t := &#947;</p><p>Note next that s, t, n , and t are collinear (as shown in Fig. <ref type="figure">3</ref>), so it follows that n and s are both affine combinations of t and t. Specifically, solving ( <ref type="formula">16</ref>) and ( <ref type="formula">18</ref>) for n and s yields n = &#181; &#8226; t + (1 -&#181;) &#8226; t, s = &#956; &#8226; t + (1 -&#956;) &#8226; t, <ref type="bibr">(19)</ref> where &#181; = &#947;/(&#947; + &#947; -&#947;&#947;) and &#956; = 1/(&#947; + &#947; -&#947;&#947;).</p><p>Substituting the first equation of ( <ref type="formula">19</ref>) into (14) yields</p><p>Likewise, substituting the second equation of ( <ref type="formula">19</ref>) into ( <ref type="formula">14</ref>) yields</p><p>Since r i ( t) = r i (t) = &#961;, it follows that R i, t( &#961;) = R i,t (&#961;) = 0 and hence R i,n (&#961;) = R i,s (&#961;) = 0. Noting that &#961; &gt; &#961; /2 &gt; 0 and applying <ref type="bibr">(13)</ref> then yields that r i (n ) = r i (s) = &#961;, and as a result</p><p>i.e., r i (s) = r i (t) = &#961; so s, t &#8712; Q \ {n }, implying that n is not an extreme point of Q and completing the proof.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>This technical condition is satisfied by numerous distributions including the sub-Gaussian and sub-Exponential families<ref type="bibr">[28,</ref> Prop.  </p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>2.5.2 and 2.7.1], notably including the real-valued Gaussian setting considered in<ref type="bibr">[8]</ref>,<ref type="bibr">[9]</ref>.<ref type="bibr">2</ref> When L &#8467;=1 (n &#8467; /d)(&#955; i /v &#8467; ) 2 &gt; 1, the solution x &#8712; (0, 1) to the equation in<ref type="bibr">(7)</ref> can be found simply and efficiently via bisection since it is unique.<ref type="bibr">3</ref> The number of components k and the number of data sources L, as well as the signal and noise variances &#955; 1 , . . . , &#955; k and v 1 , . . . , v L , are also considered fixed with respect to n &#8467; and d.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>This article has been accepted for publication in IEEE Signal Processing Letters. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/LSP.2025.3550280 &#169; 2025 IEEE. All rights reserved, including rights for text and data mining and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information. Authorized licensed use limited to: University of Michigan Library. Downloaded on March 25,2025 at 18:43:42 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
