<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Detecting Low-Degree Truncation</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>06/26/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10501485</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the annual ACM Symposium on Theory of Computing</title>
					<idno type="ISSN">0737-8017</idno>
					<biblScope unit="volume"></biblScope>
					<biblScope unit="issue"></biblScope>

					<author>Anindya De</author><author>Huan Li</author><author>Shivam Nadimpalli</author><author>Rocco A. Servedio</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[One of the most basic and natural ways that a probability distribution D can be altered is by truncating it, i.e. conditioning on some subset of possible outcomes. Indeed, the study of truncated distributions is one of the oldest topics in probability and statistics: already in the 19th century, Galton [Gal97] attempted to estimate the mean and standard deviation of the running times of horses on the basis of sample data that did not include data for horses that were slower than a particular cutoff value. Since the running times were assumed to be normally distributed, this was an early attempt to infer the parameters of an unknown normal distribution given samples from a truncated version of the distribution. Subsequent early work by other statistical pioneers applied the method of moments [Pea02, Lee14] and maximum likelihood techniques [Fis31] to the same problem of estimating the parameters of an unknown univariate normal distribution from truncated samples. The study of truncation continues to be an active area in contemporary statistics (see [Sch86, BC14, Coh16] for recent books on this topic). Quite recently, a number of research works in theoretical computer science have tackled various algorithmic problems that deal with high-dimensional truncated data. Much of this work attempts to learn a parametric description of an unknown distribution that has been subject to truncation. For example, in [DGTZ19] Daskalakis, Gouleakis, Tzamos and Zampetakis gave an efficient algorithm for high-dimensional truncated linear regression, and in [DGTZ18] Daskalakis, Gouleakis, Tzamos and Zampetakis gave an efficient algorithm for estimating the mean and covariance of an unknown]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>multivariate normal distribution from truncated samples given access to a membership oracle for the truncation set. In <ref type="bibr">[FKT20]</ref> Fotakis, Kalavasis and Tzamos gave a similar result for the setting in which the unknown background distribution is a product distribution over {0, 1}^d instead of a multivariate normal distribution, and in [KTZ19] Kontonis, Tzamos and Zampetakis extended the results of <ref type="bibr">[DGTZ18]</ref> to the case of an unknown truncation set satisfying certain restrictions. In summary, the recent work described above has focused on learning (parameter estimation) of truncated normal distributions and product distributions over {0, 1}^d. This Work: Detecting Truncation. In the current paper, rather than the learning problem, we study what is arguably the most basic problem that can be considered in the context of truncated data, namely, detecting whether or not truncation has taken place at all. A moment's thought shows that some assumptions are required in order for this problem to make sense: for example, if the truncation set is allowed to be arbitrarily tiny (so that only an arbitrarily small fraction of the distribution is discarded by truncation), then it can be arbitrarily difficult to detect whether truncation has taken place. It is also easy to see that truncation cannot be detected if the unknown truncation set is allowed to be arbitrarily complex. Thus, it is natural to consider a problem formulation in which there is a fixed class of possibilities for the unknown truncation set; this is the setting we consider.</p><p>We note that the truncation detection problem we consider has a high-level resemblance to the standard hypothesis testing paradigm in statistics, in which the goal is to distinguish between a "null hypothesis" and an "alternate hypothesis." 
In our setting the null hypothesis corresponds to no truncation of the known distribution having taken place, and the alternate hypothesis is that the known distribution has been truncated by some unknown truncation set belonging to the fixed class of possibilities. However, there does not appear to be work in the statistics literature which deals with the particular kinds of truncation problems considered in this work, let alone computationally efficient algorithms for those problems.</p><p>Prior Work: Convex Truncation of Normal Distributions. Recent work <ref type="bibr">[DNS23]</ref> considered the truncation detection problem in a setting where the background distribution D is the standard multidimensional normal distribution N (0, 1) n and the truncation set is assumed to be an unknown convex set in R n . This specific problem formulation enabled the use of a variety of sophisticated tools and results from high-dimensional convex geometry, such as Gaussian isoperimetry and the Brascamp-Lieb inequality <ref type="bibr">[BL76b]</ref> and extensions thereof due to Vempala <ref type="bibr">[Vem10]</ref>. Using these tools, <ref type="bibr">[DNS23]</ref> gave several different algorithmic results and lower bounds. Chief among these were (i) a polynomial-time algorithm that uses O(n/&#949; 2 ) samples and distinguishes the non-truncated standard normal distribution N (0, 1) n from N (0, 1) n conditioned on a convex set of Gaussian volume at most 1 -&#949;; and (ii) a &#937;( &#8730; n)-sample lower bound for detecting truncation by a convex set of constant volume. The results of <ref type="bibr">[DNS23]</ref> provide a "proof of concept" that in sufficiently well-structured settings it can sometimes be possible to detect truncation in a computationally and statistically efficient way. 
This serves as an invitation for a more general study of truncation detection; in particular, it is natural to ask whether strong structural or geometric assumptions like those made in <ref type="bibr">[DNS23]</ref> (normal distribution, a convex truncation set) are required in order to achieve nontrivial algorithmic results. Can efficient algorithms detect truncation for broader classes of "background" distributions beyond the standard normal distribution, or for other natural families of truncations besides convex sets? This question is the motivation for the current work. This Work: "Low-Degree" Truncation of Hypercontractive Product Distributions. In this paper we consider</p><p>&#8226; A broader range of possibilities for the background distribution D over R n , encompassing many distributions which may be either continuous or discrete; and &#8226; A family of non-convex truncation sets corresponding to low-degree polynomial threshold functions.</p><p>Recall that a Boolean-valued function f : R n &#8594; {0, 1} is a degree-d polynomial threshold function (PTF) if there is a real multivariate polynomial p(x) with deg(p) &#8804; d such that f (x) = 1 if and only if p(x) &#8805; 0. Low-degree polynomial threshold functions are a well-studied class of Boolean-valued functions which arise naturally in diverse fields such as computational complexity, computational learning theory, and unconditional derandomization, see e.g. [BS92, GL94, HKM14, DRST14, Kan14a, DOSW11, CDS20, DKPZ21, BHYY22, DKN10, Kan11b, Kan11a, Kan12, KM13, MZ13, Kan14b, DDS14, DS14, Kan15, KKL17, KL18, KR18, ST18, BV19, OST20] among many other references.</p><p>Our main results, described in the next subsection, are efficient algorithms and matching information-theoretic lower bounds for detecting truncation by low-degree polynomial threshold functions for a wide range of background distributions and parameter settings.</p></div>
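To make the PTF definition concrete, here is a minimal Python sketch (not from the paper; the polynomial p(x) = x_1 x_2 − 1/2 is a hypothetical example) of a degree-2 PTF together with an exact computation of its volume under the uniform distribution on {-1, +1}^n:

```python
import itertools

def ptf(p, x):
    """The PTF defined by p: f(x) = 1 if and only if p(x) >= 0."""
    return 1 if p(x) >= 0 else 0

# A hypothetical degree-2 polynomial; its satisfying region is non-convex.
p = lambda x: x[0] * x[1] - 0.5

# Vol(f) = Pr_{x ~ U_n}[f(x) = 1], computed exactly by enumerating {-1,+1}^n.
n = 10
count = sum(ptf(p, x) for x in itertools.product([-1, 1], repeat=n))
vol = count / 2 ** n
print(vol)  # 0.5, since x[0] * x[1] = +1 for exactly half of all points
```

Note that the accepting region here is a union of two opposite "quadrants," which already illustrates the non-convexity that rules out the convex-geometry tools of [DNS23].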
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Our Results</head><p>To set the stage for our algorithmic results, we begin with the following observation:</p><p>Observation 1. For any fixed ("known") background distribution P over R^n, O_d(n^d/ε^2) samples from an unknown distribution D are sufficient to distinguish (with high probability) between the two cases that (i) D is the original distribution P, versus (ii) D is P | f, i.e. P conditioned on f^{-1}(1), where f is an unknown degree-d PTF satisfying Vol(f) ≤ 1 − ε.</p><p>This is an easy consequence of a standard uniform convergence argument using the well-known fact that the Vapnik-Chervonenkis dimension of the class of all degree-d polynomial threshold functions over R^n is O(n^d). For the sake of completeness, we give a proof in Appendix A of the full version of this paper.</p><p>While the above observation works for any fixed background distribution P, several drawbacks are immediately apparent. One is that a sample complexity of O(n^d) is quite high, in fact high enough to information-theoretically learn an unknown degree-d PTF; are this many samples actually required for the much more modest goal of merely detecting whether truncation has taken place? A second and potentially more significant issue is that the above VC-based algorithm is computationally highly inefficient, involving a brute-force enumeration over "all" degree-d PTFs; for a sense of how costly this may be, recall that even in the simple discrete setting of the uniform distribution over the Boolean cube {-1, 1}^n there are 2^{Ω(n^{d+1})} distinct degree-d PTFs over {-1, 1}^n for constant d [Sak93, Theorem 2.34]. 
So it is natural to ask whether there exist more efficient (either in terms of running time or sample complexity) algorithms for interesting cases of the truncation detection problem, and to ask about lower bounds for this problem.</p><p>On the lower bounds side, it is natural to first consider arguably the simplest case, in which P is the uniform distribution U over the Boolean hypercube {-1, +1} n . In this setting we have the following observation:</p><p>Observation 2. If the truncating PTF f is permitted to have as few as n d/2 satisfying assignments, then any algorithm that correctly decides whether its samples come from D = U versus from D = U| f must use &#8486;(n d/4 ) samples.</p><p>This lower bound can be established using only basic linear algebra and simple probabilistic arguments; it is inspired by the "voting polynomials" lower bound of Aspnes et al. <ref type="bibr">[ABFR94]</ref> against MAJ-of-AC 0 circuits. We give the argument in Appendix B of the full version of this paper.</p><p>Taken together, there is a quartic gap between the (computationally inefficient) upper bound given by Observation 1 and the information-theoretic lower bound of Observation 2 for PTFs with extremely few satisfying assignments. Our main result is a proof that the true complexity of the truncation distinguishing problem lies exactly in the middle of these two extremes. We: (i) Give a computationally efficient distinguishing algorithm which has sample complexity O(n d/2 ) for a wide range of product distributions and values of Vol(f ), and (ii) Show that even for the uniform background distribution U over {-1, +1} n , distinguishing whether or not U has been truncated by a degree-d PTF of volume &#8776; 1/2 requires &#8486;(n d/2 ) samples.</p><p>We now describe our results in more detail. 
for which there are two constants 0 &lt; c &lt; C such that everywhere on [a, b] the pdf is between c/(b − a) and C/(b − a).</p><p>An informal statement of our main positive result is below:</p><p>Theorem 3 (Efficiently detecting PTF truncation, informal theorem statement). Let 0 &lt; ε &lt; 1. Fix any constant d and any hypercontractive i.i.d. product distribution μ^⊗n over R^n. Let f : R^n → {0, 1} be an unknown degree-d PTF such that Vol(f) ≤ 1 − ε. There is an efficient algorithm that uses Θ(n^{d/2}/ε^2) samples from D and successfully (w.h.p.) distinguishes between the following two cases: (i) D is μ^⊗n, i.e. the "un-truncated" distribution; versus (ii) D is μ^⊗n | f, i.e. μ^⊗n truncated by f.</p><p>Note that ε is a lower bound on the probability mass of the distribution μ^⊗n which has been "truncated;" as remarked earlier, without a lower bound on ε, it can be arbitrarily difficult to detect that truncation has taken place. Thus, as long as the background distribution is a "nice" i.i.d. product distribution and the truncating PTF's volume is "not too tiny", in polynomial time we can achieve a square-root improvement in sample complexity over the naive brute-force computationally inefficient algorithm.</p><p>A Matching Lower Bound. It is natural to wonder whether Theorem 3 is optimal: Can we establish lower bounds on the sample complexity of determining whether a "nice" distribution has been truncated by a PTF? And can we do this when the truncating PTF (unlike in Observation 2) has volume which is not extremely small?</p><p>Our main lower bound achieves these goals; it shows that even for the uniform distribution U over {-1, 1}^n and for PTFs of volume ≈ 1/2, the sample complexity achieved by our algorithm in Theorem 3 is best possible up to constant factors.</p><p>Theorem 4 (Lower bound for detecting PTF truncation, informal theorem statement). 
Fix any constant d. Any algorithm that uses samples from D and successfully (w.h.p.) distinguishes between the cases that (i) D is the uniform distribution U over {-1, 1}^n, versus (ii) D is U | f for some unknown degree-d PTF f with Vol(f) ≈ 1/2, must use Ω(n^{d/2}) samples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Techniques</head><p>We now give a technical overview of both the upper bound (Theorem 3) and the lower bound (Theorem 4), starting with the former.</p><p>1.2.1 Overview of Theorem 3. For simplicity, we start by considering the case when the background distribution P = μ^⊗n is the uniform measure on the Boolean hypercube.</p><p>The Boolean Hypercube {-1, +1}^n. Let us denote the uniform measure over {-1, +1}^n by U_n. Recall that our goal is to design an algorithm with the following performance guarantee: Given i.i.d. sample access to an unknown distribution D, the algorithm w.h.p. (i) Outputs "un-truncated" when D = U_n; and (ii) Outputs "truncated" when D = U_n | f for an unknown degree-d PTF f with Vol(f) ≤ 1 − ε.</p><p>To avoid proliferation of parameters, we set ε = 0.1 for the rest of the discussion. We thus have Pr_{x∼U_n}[f(x) = 1] ≤ 0.9.</p><p>For any point x ∈ {-1, +1}^n, let x̃ ∈ {-1, +1}^N, where N := (n choose 1) + · · · + (n choose d), be the vector given by x̃_S := Π_{i∈S} x_i for each set S ⊆ [n] with 1 ≤ |S| ≤ d. In other words, every coordinate of x̃ corresponds to a non-constant monomial in x of (multilinear) degree at most d. Note that the map x ↦ x̃ can be viewed as a feature map corresponding to the "polynomial kernel" in learning theory.</p><p>The main idea underlying our algorithm, which is given in Theorem 3, is the following:</p><p>(1) When D = U_n, then it is easy to see that E[x̃] = 0 (the all-0 vector). This is immediate from the fact that the expectation of any non-constant monomial under U_n is 0.</p><p>(2) On the other hand, suppose D = U_n | f^{-1}(1) for a degree-d PTF f as above. In this case, it can be shown that ‖E[x̃]‖_2 ≥ c_d for some constant c_d &gt; 0 depending only on d. This is done by relating the quantity on the LHS above to the Fourier spectrum of degree-d PTFs, which has been extensively studied in concrete complexity theory (see for example [GL94, DRST14, HKM14, Kan13]). In particular, we obtain this lower bound on ‖E[x̃]‖_2 from an anti-concentration property of low-degree polynomials over the Boolean hypercube. 
This in turn is a consequence of hypercontractivity of the uniform measure over {-1, +1}^n, a fundamental tool in discrete Fourier analysis (see Section 9.5 of [O'D14]).</p><p>Items 1 and 2 above together imply that estimating ‖E_{x∼D}[x̃]‖_2^2 up to an additive error of ±c_d^2/2 suffices to distinguish between the un-truncated and truncated cases. Using the idea of "U-statistics" <ref type="bibr">[Hoe94]</ref>, this suggests a natural unbiased estimator, namely drawing 2T points x^{(1)}, . . . , x^{(T)} and y^{(1)}, . . . , y^{(T)} for some T which we will fix later, and then setting M := (1/T^2) Σ_{i=1}^{T} Σ_{j=1}^{T} ⟨x̃^{(i)}, ỹ^{(j)}⟩. In particular, we have E[M] = ‖E_{x∼D}[x̃]‖_2^2.</p><p>In order to be able to distinguish between the un-truncated and truncated distributions by estimating M (and then appealing to Chebyshev's inequality), it therefore suffices to upper bound the variance of M in both the un-truncated and the truncated settings. When D = U_n, then Var[M] is straightforward to calculate, and it turns out that Var[M] = O_d(n^d/T^2). However, in the truncated setting, Var[M] is significantly trickier to analyze; at a high level, our analysis expresses Var[M] in terms of the "weights" of various levels of the Fourier spectrum of f (see Section 2.1 for a formal definition).</p><p>The key technical ingredient we use to control the variance is the so-called "level-k inequality" for Boolean functions, which states that for any Boolean function f : {-1, +1}^n → {0, 1}, writing W^k[f] := Σ_{|S|=k} f̂(S)^2 for the "Fourier weight at level k," we have W^k[f] ≤ α^2 · (O(log(1/α)))^k, where α := E[f] and 1 ≤ k ≤ 2 log(1/α). Note that f̂(∅) = E[f], and so the level-k inequality bounds higher-level Fourier weight in terms of the mean of the function. We remark that the level-k inequality is also a consequence of hypercontractivity over the Boolean hypercube (as before, see Section 9.5 of [O'D14]).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With this in hand, we can bound Var[M] in the truncated setting as well. Finally, taking T = Θ_d(n^{d/2}) implies that the standard deviation of each of our estimators is comparable to the difference in means (which was c_d^2), allowing us to distinguish between the un-truncated and truncated settings.</p><p>Hypercontractive Product Distributions μ^⊗n. We use the same high-level approach (as well as the same estimator M) in order to distinguish low-degree truncation of a hypercontractive product measure μ^⊗n, but the analysis becomes more technically involved. To explain the principal challenge, note that over the Boolean hypercube {-1, +1}^n, the Fourier basis functions χ_S(x) := Π_{i∈S} x_i for S ⊆ [n] form a multiplicative group. This group structure is useful because it means that the product of two basis functions is another basis function: For α, β ⊆ [n], we have the product formula χ_α · χ_β = χ_{α △ β}, where α △ β denotes the symmetric difference of α and β.</p><p>Over an arbitrary hypercontractive measure μ^⊗n, this may no longer be the case; as a concrete example, this fails for the Gaussian measure and the Hermite basis (cf. Chapter 11 of [O'D14]). Over a general product space μ^⊗n the Fourier basis functions are now indexed by multi-subsets of [n] (as opposed to subsets of [n] over {-1, +1}^n); see the discussion following Definition 5. More importantly, there is no simple formula for the product of two Fourier basis functions, and this makes the analysis technically more involved. We remark that this problem, which is known as the linearization problem, has been well studied for various classes of orthogonal polynomials (see Section 6.8 of <ref type="bibr">[AAR99]</ref>). 
Lemmas 16 and 17 establish a weak version of a "product formula" between two Fourier basis functions for μ^⊗n that suffices for our purposes and lets us carry out an analysis similar to the above sketch for the Boolean hypercube {-1, +1}^n.</p><p>1.2.2 Overview of Theorem 4. We turn to an overview of our lower bound, Theorem 4. As in the previous section, we write U_n to denote the uniform distribution over the n-dimensional Boolean hypercube {-1, +1}^n and (χ_S)_S for the Fourier basis over {-1, +1}^n. To prove Theorem 4, it suffices to construct a distribution F_d over degree-d PTFs over {-1, +1}^n with the following properties:</p><p>(1) The distribution F_d is supported on thresholds of homogeneous degree-d polynomials over {-1, +1}^n. Note that such polynomials are necessarily multilinear; in particular, each PTF f ∼ F_d can be expressed as f(x) = 1{p(x) ≥ 0} for a polynomial p(x) = Σ_{|S|=d} p̂(S) χ_S(x). The coefficients p̂(S) will be i.i.d. random variables drawn from the standard Gaussian distribution N(0, 1). Such polynomials are known in the literature as Gaussian random polynomials, and have been extensively studied (with an emphasis on the behavior of their roots) [IZ97, Ham56, BS14]. We will however be interested in a certain "pseudorandom-type" behavior of these polynomials.</p><p>In particular, we first reduce the problem of proving indistinguishability of D_1 and D_2 to proving the following: Suppose u_1, . . . , u_m are m randomly chosen points from {-1, 1}^n (which we fix). Then, with probability 1 - o(1) over the choice of these m points, the distribution of (f(u_1), . . . , f(u_m)) for f ∼ F_d is o(1)-close to that of (b_1, . . . , b_m), where each b_i is an independent unbiased random bit. In other words, we aim to show that if the evaluation points u_1, . . . , u_m are randomly chosen (but subsequently known to the algorithm), then f(u_1), . . . , f(u_m) are close to independent unbiased random bits.</p><p>We establish this last statement by proving something even stronger. Namely, we first observe that the R^m-valued random variable (p(u_1), . . . , p(u_m)) is an m-dimensional normal random variable for any fixed outcome of u_1, . . . , u_m. Subsequently, we show that this random variable (p(u_1), . . . , p(u_m)) is o(1)-close to the standard m-dimensional normal random variable N(0, I_m), where I_m is the identity matrix in m dimensions. This exploits a recent bound on total variation distance between multivariate normal distributions <ref type="bibr">[DMR20]</ref> in terms of their covariance matrices, and involves bounding the trace of the Gram matrix generated by random points on the hypercube; details are deferred to the main body of the paper.</p></div>
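The first observation above can be checked directly: for fixed points u_1, . . . , u_m, the vector of evaluations of a Gaussian random polynomial is a centered Gaussian whose covariance matrix is the Gram matrix of the (normalized) degree-d monomial feature vectors of the u_i. The sketch below, with toy parameters n, d, m chosen purely for illustration, computes that Gram matrix and checks that it has unit diagonal; its off-diagonal entries are typically small, which is the heart of the indistinguishability argument.

```python
import itertools
import math
import random

random.seed(0)
n, d, m = 12, 2, 5  # toy parameters for illustration

# Random evaluation points u_1, ..., u_m from {-1,+1}^n.
U = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(m)]

# Normalized degree-d features chi_S(u) / sqrt(#subsets), |S| = d, so that the
# normalized polynomial p(u) = scale * sum_S phat(S) chi_S(u), with the
# coefficients phat(S) i.i.d. N(0,1), has variance exactly 1 at every point u.
subsets = list(itertools.combinations(range(n), d))
scale = 1 / math.sqrt(len(subsets))

def features(u):
    return [scale * math.prod(u[i] for i in S) for S in subsets]

F = [features(u) for u in U]

# Covariance of the Gaussian vector (p(u_1), ..., p(u_m)) is the Gram matrix of F.
cov = [[sum(a * b for a, b in zip(F[i], F[j])) for j in range(m)] for i in range(m)]

for i in range(m):
    assert abs(cov[i][i] - 1) < 1e-9  # unit variances, by the normalization
off_diag = max(abs(cov[i][j]) for i in range(m) for j in range(m) if i != j)
print("largest off-diagonal covariance entry:", off_diag)
```

When the off-diagonal entries are uniformly small, the covariance matrix is close to I_m, which is exactly the regime in which the [DMR20] total variation bound applies.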
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Related Work</head><p>As mentioned earlier, "truncated statistics" has been a topic of central interest in statistics for more than a century, and recently in theoretical computer science as well. Starting with the work of Daskalakis et al. <ref type="bibr">[DGTZ18]</ref>, several works have looked at the problem of learning an unknown high-dimensional distribution in settings where the algorithm only gets samples from a truncated set [FKT20, <ref type="bibr">KTZ19,</ref><ref type="bibr">BDNP21]</ref>. We note that there have also been several recent works on truncation in statistics related to supervised learning scenarios [DSYZ21, DGTZ19, DRZ20], but the models and techniques in those works are somewhat distant from the topic of the current paper. Finally, in retrospect, some earlier works on "learning from positive samples" [CDS20, DDS15, DGL05] also have a similar flavor. In particular, the main result of [CDS20] is a poly(n)-time algorithm which, given access to samples from a Gaussian truncated by an unknown degree-2 PTF, approximately recovers the truncation set; and one of the main results of <ref type="bibr">[DDS15]</ref> is an analogous poly(n)-time algorithm but for degree-1 PTF (i.e. LTF) truncations of the uniform distribution over {-1, 1}^n. Note that while the settings of <ref type="bibr">[CDS20,</ref><ref type="bibr">DDS15]</ref> are somewhat related to the current paper, the goals and results of those works are quite different; in particular, the focus is on learning (as opposed to testing / determining whether truncation has taken place), and the sample complexities of the algorithms in [CDS20, DDS15], albeit polynomial in n, are polynomials of high degree.</p><p>In terms of the specific problem we study, the work most closely related to the current paper is that of <ref type="bibr">[DNS23]</ref>. 
In particular, as noted earlier, in <ref type="bibr">[DNS23]</ref>, the algorithm gets access to samples from either (i) N(0, 1)^n or (ii) N(0, 1)^n conditioned on a convex set. Besides the obvious difference in the truncation sets which are considered-convex sets in <ref type="bibr">[DNS23]</ref> vis-à-vis PTFs in the current paper-the choice of the background distribution in <ref type="bibr">[DNS23]</ref> is far more restrictive. Namely, <ref type="bibr">[DNS23]</ref> requires the background distribution to be the normal distribution N(0, 1)^n, whereas the results in the current paper hold for the broad family of hypercontractive product distributions (which includes many other distributions as well as the normal distribution). The difference in the problem settings is also reflected in the techniques employed in these two papers. In particular, the algorithm and analysis of <ref type="bibr">[DNS23]</ref> heavily rely on tools from convex geometry, including Gaussian isoperimetry <ref type="bibr">[Bor85]</ref>, the Brascamp-Lieb inequality <ref type="bibr">[BL76a,</ref><ref type="bibr">BL76b]</ref>, and recent structural results for convex bodies over Gaussian space <ref type="bibr">[DNS21,</ref><ref type="bibr">DNS22]</ref>. In our setting, truncation sets defined by PTFs even of degree two need not be convex, so we must take a very different approach. The algorithm in the current paper uses techniques originating from the study of PTFs in concrete complexity theory, in particular the hypercontractivity of low-degree polynomials, anti-concentration, and the "level-k" inequalities [O'D14]. 
So to summarize the current work vis-à-vis <ref type="bibr">[DNS23]</ref>, the current work studies a different class of truncations under a significantly less restrictive assumption on the background distribution, and our main algorithm, as well as its analysis, are completely different from those of <ref type="bibr">[DNS23]</ref>.</p><p>Our lower bound argument extends and strengthens an Ω(n^{1/2}) lower bound, given in <ref type="bibr">[DNS23]</ref>, for distinguishing the standard normal distribution N(0, 1)^n from N(0, 1)^n | f^{-1}(1) where f is an unknown origin-centered LTF (i.e. a degree-1 PTF); both arguments use a variation distance lower bound between a standard multivariate normal distribution and a multivariate normal distribution with a suitable slightly perturbed covariance matrix. Our lower bound argument in the current paper combines tools from the LTF lower bound mentioned above with ingredients (in particular, the use of a "shadow sample"; see Section 4 of the full version) from a different lower bound from <ref type="bibr">[DNS23]</ref> for symmetric slabs; extends the <ref type="bibr">[DNS23]</ref> analysis from degree-1 to degree-d for any constant d; and gives a tighter analysis than <ref type="bibr">[DNS23]</ref> which does not lose any log factors.</p><p>We end this discussion of related work with the following overarching high-level question, which we hope will be investigated in future work: Suppose P is a background distribution and F is a class of Boolean functions. Under what conditions can we distinguish between D = P versus D = P | f (for some f ∈ F) with sample complexity asymptotically smaller than the sample complexity of learning F? We view our results on distinguishing truncation by PTFs as a step towards answering this question.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PRELIMINARIES</head><p>We write N := {0, 1, . . .} and 1{&#8226;} for the 0/1 indicator function. We will also write [n] := {1, . . . , n}.</p><p>Let (R, μ) be a probability space. For n ∈ N, we write L^2(R^n, μ^⊗n) for the (real) inner-product space of functions f : R^n → R with the inner product ⟨f, g⟩ := E_{x∼μ^⊗n}[f(x)g(x)]. Here μ^⊗n denotes the product probability distribution on R^n. For q &gt; 0 we write ‖f‖_q := E_{x∼μ^⊗n}[|f(x)|^q]^{1/q}. In particular, for f : R^n → {0, 1}, we write Vol(f) := E_{x∼μ^⊗n}[f(x)] = Pr_{x∼μ^⊗n}[f(x) = 1].</p><p>We say that a function f : R^n → {0, 1} is a degree-d polynomial threshold function (PTF) if there exists a polynomial p : R^n → R of degree at most d such that f(x) = 1 if and only if p(x) ≥ 0.</p><p>The primary class of distributions we will consider throughout is that of truncations of an i.i.d. product distribution μ^⊗n by a degree-d PTF of at least some minimum volume; more precisely, we will consider the following class of truncations: C_PTF(d, α) := { μ^⊗n | f : f is a degree-d PTF with Vol(f) ≥ α }, where α = α(n) may depend on n (in fact we will be able to take α as small as 2^{-Θ(√n)}). Throughout the paper we will assume that d (the degree of the PTFs we consider) is a fixed constant independent of the ambient dimension n.</p></div>
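A distribution from the class C_PTF(d, α) can be simulated by rejection sampling from μ^⊗n, which is efficient whenever Vol(f) is not too small. A minimal sketch, taking μ to be uniform on {-1, +1} and a hypothetical degree-2 PTF as the truncation set:

```python
import random

random.seed(1)

def sample_product(n):
    """One draw from mu^{(x)n}, with mu = uniform on {-1,+1}."""
    return [random.choice([-1, 1]) for _ in range(n)]

def sample_truncated(n, f, max_tries=100_000):
    """One draw from mu^{(x)n} | f by rejection; expected cost ~ 1/Vol(f) draws."""
    for _ in range(max_tries):
        x = sample_product(n)
        if f(x) == 1:
            return x
    raise RuntimeError("Vol(f) appears to be extremely small")

# A hypothetical degree-2 PTF; a short case analysis gives Vol(f) = 3/4 here.
f = lambda x: 1 if x[0] * x[1] + x[2] >= 0 else 0

xs = [sample_truncated(8, f) for _ in range(2000)]
assert all(f(x) == 1 for x in xs)  # every sample lands in the truncation set
```

Note that rejection sampling is only a way of generating test data; the distinguishing algorithms in the paper see the truncated samples as a black box.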
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Harmonic Analysis over Product Spaces</head><p>Our notation and terminology in this section closely follow those of O'Donnell [O'D14]; in particular, we refer the reader to Chapter 8 of [O'D14] for further background.</p><p>Definition 5. A Fourier basis for L^2(R, μ) is an orthonormal basis B = {χ_0, χ_1, . . .} with χ_0 ≡ 1.</p><p>It is well known that if L^2(R, μ) is separable, 2 then it has a Fourier basis (see e.g. Section I.4 of <ref type="bibr">[Con19]</ref>). Note that we can obtain a Fourier basis for L^2(R^n, μ^⊗n) by taking all possible n-fold products of elements of B; more formally, for a multi-index α ∈ N^n, we define χ_α(x) := Π_{i=1}^{n} χ_{α_i}(x_i). Then the collection B^n := {χ_α : α ∈ N^n} forms a Fourier basis for L^2(R^n, μ^⊗n); this lets us write f = Σ_{α∈N^n} f̂(α) χ_α, where f̂(α) := ⟨f, χ_α⟩.</p><p>We can assume without loss of generality that the basis elements of L^2(R, μ), namely χ_0, χ_1, . . . , are polynomials with deg(χ_i) = i. This is because a polynomial basis can be obtained for L^2(R, μ) by running the Gram-Schmidt process. By extending this basis to L^2(R^n, μ^⊗n) by taking products, it follows that we may assume without loss of generality that for a multi-index α ∈ N^n, we have deg(χ_α) = |α|, where |α| := α_1 + · · · + α_n. We will also write #α := |supp(α)|, where supp(α) := {i : α_i ≠ 0}.</p><p>Remark 6. While the Fourier coefficients {f̂(α)} depend on the choice of basis {χ_α}, we will always work with some fixed (albeit arbitrary) polynomial basis, and hence there should be no ambiguity in referring to the coefficients as though they were unique. We assume that the orthogonal basis {χ_α} is "known" to the algorithm; this is certainly a reasonable assumption for natural examples of hypercontractive distributions (e.g. 
distributions with finite support, the uniform distribution on intervals, the Gaussian distribution, etc.), and is in line with our overall problem formulation of detecting whether a known background distribution has been subjected to truncation.</p><p>As a consequence of orthonormality, we get that for any f ∈ L^2(R^n, μ^⊗n), we have f = Σ_{α∈N^n} f̂(α) χ_α and E_{x∼μ^⊗n}[f(x)^2] = Σ_{α∈N^n} f̂(α)^2, with the latter called Parseval's formula. We also have Plancherel's formula, which says that E_{x∼μ^⊗n}[f(x)g(x)] = Σ_{α∈N^n} f̂(α)ĝ(α) for all f, g ∈ L^2(R^n, μ^⊗n).</p><p>2 Recall that a metric space is separable if it contains a countable dense subset.</p></div>
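The Gram-Schmidt construction of a polynomial Fourier basis, together with Parseval's formula, can be made concrete for a small finite-support measure. Below is an illustrative sketch in which μ is (purely as an assumption for the demo) uniform on the three points {-1, 0, 1}; it builds χ_0, χ_1, χ_2 from the monomials 1, t, t^2 and verifies orthonormality and Parseval's formula numerically.

```python
from fractions import Fraction
from math import sqrt, isclose

# Illustrative choice of mu: uniform on the 3-point support {-1, 0, 1}.
support = [Fraction(-1), Fraction(0), Fraction(1)]
weights = [Fraction(1, 3)] * 3

def inner(f, g):
    """The L^2(R, mu) inner product <f, g> = E_{t ~ mu}[f(t) g(t)]."""
    return sum(w * f(t) * g(t) for w, t in zip(weights, support))

# Gram-Schmidt on the monomials 1, t, t^2 yields a polynomial Fourier basis
# chi_0, chi_1, chi_2 with chi_0 = 1 and deg(chi_i) = i.
monomials = [lambda t: Fraction(1), lambda t: t, lambda t: t * t]
basis = []
for mono in monomials:
    proj = [inner(mono, chi) for chi in basis]
    resid = {t: mono(t) - sum(c * chi(t) for c, chi in zip(proj, basis)) for t in support}
    nrm = sqrt(sum(w * resid[t] ** 2 for w, t in zip(weights, support)))
    basis.append(lambda t, v={t: float(r) / nrm for t, r in resid.items()}: v[t])

# Orthonormality: <chi_i, chi_j> = 1 if i == j else 0.
for i, ci in enumerate(basis):
    for j, cj in enumerate(basis):
        assert isclose(inner(ci, cj), 1.0 if i == j else 0.0, abs_tol=1e-9)

# Parseval's formula for a degree-2 function f: E[f^2] equals the sum of the
# squared Fourier coefficients f_hat(i) = <f, chi_i>.
f = lambda t: float(3 * t * t - t + 1)
fhat = [inner(f, chi) for chi in basis]
assert isclose(inner(f, f), sum(c * c for c in fhat), abs_tol=1e-9)
print("Fourier coefficients:", fhat)
```

The same recipe applied coordinate-wise (taking products of the one-dimensional basis elements) yields the basis {χ_α} for L^2(R^n, μ^⊗n) used throughout.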
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Product Distribution</head><p>Table <ref type="table">1</ref>: Examples of hypercontractive distributions, along with their accompanying hypercontractivity constants. Here min(μ) denotes the minimal non-zero probability of any element in the support of the (finitely supported) distribution μ.</p><p>Finally, we write W^k[f] := Σ_{|α|=k} f̂(α)^2 and W^{≤k}[f] := Σ_{|α|≤k} f̂(α)^2 for the Fourier weight of f at level k and the Fourier weight of f up to level k, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Hypercontractive Distributions</head><p>The primary analytic tools we will require in both our upper and lower bounds are consequences of hypercontractive estimates for functions in L^2(R^n, μ^⊗n); we refer the reader to Chapters 9 and 10 of [O'D14] for further background on hypercontractivity and its applications.</p><p>Definition 7. We say that (R, μ) is hypercontractive if for every q ≥ 2, there is a fixed constant C_q(μ) such that for every n ≥ 1 and every multivariate degree-d polynomial p : R^n → R we have ‖p‖_q ≤ C_q(μ)^d · ‖p‖_2, (2) where C_q(μ) is independent of n and satisfies C_q(μ) ≤ K√q (3) for a constant K depending only on μ (and not on q or n). When the product distribution μ^⊗n is clear from context, we will sometimes simply write C_q := C_q(μ) instead.</p><p>It is clear from the monotonicity of norms that C_q ≥ 1; see Table <ref type="table">1</ref> for examples of hypercontractive distributions with accompanying hypercontractivity constants C_q(μ).</p><p>Remark 8. We note that Definition 7 is not the standard definition of a hypercontractive product distribution (cf. Chapters 9 and 10 of [O'D14]), but is in fact an easy consequence of hypercontractivity that is sometimes referred to as the "Bonami lemma." The guarantees of Equations (<ref type="formula">2</ref>) and (3) are all we require for our purposes, and so we choose to work with this definition instead.</p><p>Remark 9. While Equation (3) may seem extraneous, we note that the "level-k inequalities" (Proposition 11) crucially rely on this bound on the hypercontractivity constant C_q.</p><p>We turn to record several useful consequences of hypercontractivity which will be crucial to the analysis of our estimator as well as to our lower bound. 
We defer the proofs of Propositions 10 and 11 to the full version of this paper.</p><p>The following anti-concentration inequality is a straightforward consequence of hypercontractivity. A similar result for arbitrary product distributions with finite support was obtained by Austrin and Håstad <ref type="bibr">[AH09]</ref>, and a similar result for functions over {-1,+1}^n with the uniform measure was obtained by Dinur et al. <ref type="bibr">[DFKO06]</ref>. The proof of the following proposition closely follows that of Proposition 9.7 in [O'D14]:</p><p>Proposition 10 (Anti-concentration of low-degree polynomials). Suppose (R^n, µ^{⊗n}) is a hypercontractive probability space. Then for any degree-d polynomial p : R^n → R with E[p] = 0 and Var[p] = 1, we have</p><p>The following proposition bounds the Fourier weight up to level k (i.e. W^{≤k}[f]) in terms of the bias (i.e. the degree-0 Fourier coefficient) of the function. We note that an analogous result for functions over {-1,+1}^n with the uniform measure is sometimes known as "Chang's Lemma" or "Talagrand's Lemma" <ref type="bibr">[Cha02,</ref> <ref type="bibr">Tal96]</ref>; see also Section 9.5 of [O'D14].</p><p>Proposition 11 (Level-k inequalities). Suppose (R, µ) is hypercontractive and f : R^n → {0, 1} is a Boolean function. Then for all</p><p>where K is a constant independent of n.</p><p>Remark 12. We note that Proposition 11 also holds for bounded functions f : R^n → [-1, 1] with Vol(f) := E[|f|], although we will not require this.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">AN O(n^{d/2})-SAMPLE ALGORITHM FOR DEGREE-d PTFS</head><p>In this section, we present an O(n^{d/2})-sample algorithm for distinguishing a hypercontractive product distribution µ^{⊗n} from µ^{⊗n} truncated by the satisfying assignments of a degree-d PTF. More precisely, we prove the following in Section 3.2:</p><p>Theorem 13. Let ε &gt; 0 and let (R, µ) be hypercontractive. There is an algorithm, PTF-Distinguisher (Algorithm 1), with the following performance guarantee: Given access to independent samples from any unknown distribution D ∈ {µ^{⊗n}} ∪ C_PTF(d, 2^{-√n}), the algorithm uses T samples where (1) If D = µ^{⊗n}, then with probability at least 9/10 the algorithm outputs "un-truncated;"</p><p>Output: "Un-truncated" or "truncated"</p><p>PTF-Distinguisher(D):</p><p>(1) Draw 2T independent sample points x^{(1)}, ..., x^{(T)}, y^{(1)}, ..., y^{(T)} ∼ D, where</p><p>(2) Compute the statistic M where</p><p>for 1 ≤ |α| ≤ d, and ỹ^{(j)} is defined similarly.</p><p>(3) Output "truncated" if the statistic M exceeds the threshold, and "un-truncated" otherwise.</p><p>Before proceeding to the proof of Theorem 13, we give a brief high-level description of Algorithm 1. The algorithm draws 2T independent samples {x^{(i)}, y^{(i)}}_{i∈[T]} where T is as above, and then performs a feature expansion to obtain the 2T vectors {x̃^{(i)}, ỹ^{(i)}}_{i∈[T]} where x̃^{(i)} := (χ_α(x^{(i)}))_{1≤|α|≤d} and ỹ^{(i)} is defined similarly. 
The statistic M employed by the algorithm to distinguish between the un-truncated and truncated settings is then computed by (1) first computing the average of the kernelized x̃^{(i)} vectors and the average of the kernelized ỹ^{(i)} vectors; and then (2) taking the inner product between the two averaged kernel vectors.</p><p>An easy calculation, given below, relates the statistic M to the low-degree (but not degree-0) Fourier weight of the truncation function (note that if no truncation is applied then the truncation function is identically 1). The analysis then proceeds by using anti-concentration of low-degree polynomials to show that the means of the estimators differ by Ω_ε(1) between the two settings. We bound the variance of the estimator in both the un-truncated and truncated settings (using the level-k inequalities at a crucial point in the analysis of the truncated setting), and given a separation between the means and a bound on the variances, it is straightforward to distinguish between the two settings using Chebyshev's inequality.</p><p>Remark 14. We note that the trick of drawing a "bipartite" set of samples, i.e. drawing 2T samples {x^{(i)}, y^{(i)}}_{i∈[T]}, was recently employed in the algorithm of Diakonikolas, Kane, and Pensia <ref type="bibr">[DKP22]</ref> for the problem of Gaussian mean testing. For our problem we could have alternately used the closely related estimator M′ given by averaging ⟨x̃^{(i)}, x̃^{(j)}⟩ over distinct pairs i ≠ j to distinguish between the un-truncated and truncated distributions via a similar but slightly more cumbersome analysis. We note that the main technical tool used in the analysis of Diakonikolas, Kane, and Pensia <ref type="bibr">[DKP22]</ref> is the Carbery-Wright inequality for degree-2 polynomials in Gaussian random variables, whereas our argument uses the above-mentioned kernelization approach and other consequences of hypercontractivity, namely the level-k inequalities, beyond just anti-concentration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Useful Preliminaries</head><p>The following lemma will be crucial in obtaining a lower bound on the expectation of our test statistic E[M] in the truncated setting; we note that an analogous statement in the setting of the Boolean hypercube was obtained by Gotsman and Linial <ref type="bibr">[GL94]</ref>.</p><p>for an absolute constant c := c(µ) ∈ (0, 1].</p><p>Proof. We may assume that f(x) = 1{p(x) ≥ θ} where p : R^n → R is a degree-d polynomial with E[p(x)] = p̂(0^n) = 0 and ‖p‖²₂ = Var[p] = Σ_α p̂(α)² = 1. By Cauchy-Schwarz and Plancherel, we get</p><p>where we made use of the fact that p is a degree-d polynomial with p̂(0^n) = 0 and Var[p(x)] = 1. Note that by Proposition 10, we have that either</p><p>(5) Suppose that it is the former. (As we will briefly explain later, the argument is symmetric in the latter case.) We further break the analysis into cases depending on the magnitude of θ.</p><p>Case 1: θ ≥ 1/2. In this case, we have by Equation (4) that</p><p>and so the result follows.</p><p>Case 2: 0 ≤ θ &lt; 1/2. In this case, we have by Proposition 10 that</p><p>Once again by Equation (4), we have</p><p>where the second inequality follows from f · p being always non-negative and at least 1/2 with probability Pr[p(x) ≥ 1/2].</p><p>Case 3: θ &lt; 0. Consider the degree-d PTF f† := 1 - f. Note that f̂†(α) = -f̂(α) for all α ≠ ∅, and that Vol(f†) = 1 - Vol(f). Repeating the above analysis then gives that</p><p>Putting Cases 1 through 3 together, we get that</p><p>completing the proof. Recall, however, that we assumed that</p><p>in Equation (5). Suppose that we instead have</p><p>Then the same trick used in Case 3, of considering f† instead of f and repeating the three cases, completes the proof. 
□</p><p>We will also require the following two lemmas, which are closely linked to the linearization problem for orthogonal polynomials (see Section 6.8 of <ref type="bibr">[AAR99]</ref>). The first lemma bounds the magnitude of the Fourier coefficients of a product of basis functions; while the estimate below relies on hypercontractivity, we note that exact expressions for these Fourier coefficients are known for various classes of orthogonal polynomials, including the Chebyshev, Hermite, and Laguerre polynomials.</p><p>Proof. Using Cauchy-Schwarz, we have that</p><p>due to the orthonormality of χ_α. Using Cauchy-Schwarz once again, we have that</p><p>where the final inequality uses hypercontractivity and the fact that χ_β (respectively χ_γ) is a polynomial of two-norm 1 and degree at most d.</p><p>where β, γ ∈ N^n.</p><p>Proof. We will upper bound the number of pairs (β, γ) such that |β|, |γ| ≤ d and ⟨χ_α, χ_β χ_γ⟩ ≠ 0; so fix such a pair (β, γ). We first note that for any i ∉ supp(α), we must have β_i = γ_i, for otherwise ⟨χ_α, χ_β χ_γ⟩ = 0 due to the orthonormality of χ_{β_i} and χ_{γ_i}. Also, for each i ∈ supp(α), we must have β_i + γ_i ≥ α_i</p><p>by orthonormality and the fact that χ_{β_i} · χ_{γ_i} is a linear combination of basis functions {χ_j}_{j &lt; α_i}.</p><p>Summing over all i ∈ supp(α), we get that Σ_{i∈supp(α)} (β_i + γ_i) ≥ |α|,</p><p>and so it follows that either Σ_{i∈supp(α)} β_i ≥ |α|/2 or Σ_{i∈supp(α)} γ_i ≥ |α|/2;</p><p>without loss of generality we suppose it is the former. It follows that the total number of ways of choosing such a pair (β, γ) is bounded by</p><p>completing the proof. □</p></div>
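In the special case of the uniform measure on {-1,+1}^n, the linearization problem is trivial: χ_B · χ_C = χ_{B△C} (symmetric difference), so ⟨χ_A, χ_B χ_C⟩ ≠ 0 exactly when A = B△C, in which case it equals 1. This can be checked by direct enumeration; a minimal sketch with illustrative helper names:

```python
from itertools import product

def chi(x, S):
    """Character chi_S(x) = prod_{i in S} x_i on the hypercube {-1,+1}^n."""
    v = 1
    for i in S:
        v *= x[i]
    return v

def inner_product_chi(n, A, B, C):
    """<chi_A, chi_B * chi_C> under the uniform measure on {-1,+1}^n,
    computed exactly by averaging over all 2^n points."""
    pts = list(product((-1, 1), repeat=n))
    return sum(chi(x, A) * chi(x, B) * chi(x, C) for x in pts) / len(pts)
```

Since χ_A χ_B χ_C = χ_{A△B△C}, the average is 1 when A△B△C = ∅ (i.e. A = B△C) and 0 otherwise, by orthonormality.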
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Proof of Theorem 13</head><p>Proof of Theorem 13. Throughout, we will write</p><p>Recalling that there are at most (n+k-1 choose k) multi-indices α with |α| = k,</p><p>We compute the mean and variance of the estimator M in each case (i.e. when D = µ^{⊗n} and when D ∈ C_PTF(d, 2^{-√n})) separately.</p><p>Case 1: D = µ^{⊗n}. For brevity (and to distinguish the calculations in this case from the following one), we denote the un-truncated distribution by D_u := µ^{⊗n}.</p><p>When x^{(1)}, ..., x^{(T)}, y^{(1)}, ..., y^{(T)} ∼ D_u, we have by bilinearity of the inner product that</p><p>as x, y ∼ D_u are independent samples and also because of the orthonormality of {χ_α}.</p><p>Turning to the variance, we have</p><p>where x, y ∼ D are independent samples. In this case, we have by our previous calculation (Equation (<ref type="formula">7</ref>)) that this is in fact equal to</p><p>where we used orthonormality as well as the fact that E_{D_u}[ỹ ỹ^T] = Id_m.</p><p>Case 2: D ∈ C_PTF(d, 2^{-√n}). In this case, f : R^n → {0, 1} is a degree-d PTF with</p><p>We may assume that f(x) = 1{p(x) ≥ θ} where p : R^n → R is a degree-d polynomial with p̂(0^n) = 0 and ‖p‖²₂ = Var[p] = Σ_α p̂(α)² = 1. We have the following easy relation between the Fourier coefficients of f and the means of the characters {χ_α} under the truncated distribution D_t:</p><p>We thus have</p><p>where the second line follows from Equation (11) and the final inequality is due to Lemma 15; here c := c(µ) is as in Proposition 10. Turning to the variance of the estimator, we have as before by Equation (9) that</p><p>where the final inequality is due to Lemma 17. 
Finally, we have that</p><p>where Equation (<ref type="formula">15</ref>) is by the level-k inequalities (Proposition 11) and Equation (<ref type="formula">16</ref>) uses the fact that Vol(f) ≥ 2^{-√n} (which is by assumption). Putting everything together, we get that</p><p>To summarize, when D = µ^{⊗n}, we have from Equations (<ref type="formula">7</ref>) and (<ref type="formula">10</ref>) that</p><p>and</p><p>For the number of samples being T = O(n^{d/2}) · min{1, ε/(1-ε), c^{Θ(d)}/(1-ε)}^{-2}, the correctness of the distinguishing algorithm Algorithm 1 follows directly from Equation (18) and Equations (<ref type="formula">19</ref>) and (20) by a simple application of Chebyshev's inequality. □</p></div></body>
		</text>
</TEI>
