<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Detecting multiple replicating signals using adaptive filtering procedures</title></titleStmt>
			<publicationStmt>
				<publisher>The Annals of Statistics</publisher>
				<date>08/01/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10553800</idno>
					<idno type="doi">10.1214/21-aos2139</idno>
					<title level='j'>The Annals of Statistics</title>
<idno>0090-5364</idno>
<biblScope unit="volume">50</biblScope>
<biblScope unit="issue">4</biblScope>					

					<author>Jingshu Wang</author><author>Lin Gui</author><author>Weijie J Su</author><author>Chiara Sabatti</author><author>Art B Owen</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Replicability is a fundamental quality of scientific discoveries: we are interested in those signals that are detectable in different laboratories, different populations, across time etc. Unlike meta-analysis which accounts for experimental variability but does not guarantee replicability, testing a partial conjunction (PC) null aims specifically to identify the signals that are discovered in multiple studies. In many contemporary applications, e.g., comparing multiple high-throughput genetic experiments, a large number M of PC nulls need to be tested simultaneously, calling for a multiple comparisons correction. However, standard multiple testing adjustments on the M PC p-values can be severely conservative, especially when M is large and the signals are sparse. We introduce AdaFilter, a new multiple testing procedure that increases power by adaptively filtering out unlikely candidates of PC nulls. We prove that AdaFilter can control FWER and FDR as long as data across studies are independent, and has much higher power than other existing methods. We illustrate the application of AdaFilter with three examples: microarray studies of Duchenne muscular dystrophy, single-cell RNA sequencing of T cells in lung cancer tumors and GWAS for metabolomics.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>1. Introduction. Replication is "the cornerstone of science" <ref type="bibr">[34]</ref>. An important scientific finding should be supported by further evidence from similar conditions, by other researchers or with new samples. In the last decade, however, both the popular <ref type="bibr">[29]</ref> and the scientific press <ref type="bibr">[3,</ref><ref type="bibr">1]</ref> have reported the lack of replicability in modern research. While there are many reasons behind this phenomenon, one important factor is that many scientific discoveries are obtained from complicated large-scale experiments with biases from various sources. Even when the data are carefully analyzed, idiosyncratic aspects of a single experiment can fail to extend to other settings, and any finding from just one study can easily lack external validity. Thus, it is crucial to have a statistical framework to objectively and precisely evaluate the consistency of scientific discoveries across multiple studies, while properly accounting for experimental heterogeneity.</p><p>The partial conjunction (PC) test, which was introduced by <ref type="bibr">[15]</ref> and further studied in <ref type="bibr">[4]</ref>, provides such a framework. Given n null hypotheses (base nulls) and a number r ∈ {2, 3, …, n}, the PC null states that there are fewer than r base non-nulls. In the setting where each base hypothesis represents a test from one study, rejecting a PC null explicitly guarantees that the signal replicates at least r times. 
The PC framework has been used to identify replicating signals in neuroimaging <ref type="bibr">[36]</ref>, to detect genes that show consistent effects across genetic experiments <ref type="bibr">[21]</ref>, and recently to study mediation effects <ref type="bibr">[32]</ref> and find evidence factors <ref type="bibr">[27]</ref> in causal inference.</p><p>In high-throughput genetic experiments, there is a special need to identify replicating signals across multiple studies. For instance, for gene expression data, it is important to find stable gene markers for a disease or cell type, which remain differentially expressed across similar experiments or in multiple patients. In multi-tissue expression quantitative trait loci (eQTL) studies, scientists are interested in identifying DNA loci with consistent regulation over tissues <ref type="bibr">[14,</ref><ref type="bibr">43]</ref>. With a growing trend in multi-omics data sharing <ref type="bibr">[20]</ref>, there is also active research in finding replicating signals across platforms <ref type="bibr">[47]</ref>, ethnic groups <ref type="bibr">[33,</ref><ref type="bibr">16]</ref> and even species. Though the PC framework fits all of the above scenarios, simultaneously performing a large number of PC tests for thousands of genes or millions of DNA loci typically suffers from extremely low power.</p><p>Specifically, let M denote the number of hypotheses in one study and suppose that we compare across n related studies. Then, to find replicating signals across the n studies, we have M PC nulls to test, each with n base nulls. The above framework gives us an n × M matrix of base p-values, with one column per PC null and one row per study. Now, as we want to identify signals whose PC nulls are false, a "direct approach" is to first get a combined p-value for each PC null and then apply a standard multiple testing adjustment to the M PC p-values. 
However, this "direct approach" for testing multiple PC nulls has been shown to have extremely low power <ref type="bibr">[23,</ref><ref type="bibr">41]</ref>. Both <ref type="bibr">[23]</ref> and <ref type="bibr">[7]</ref> suggest procedures to counter that power loss. Unfortunately, the approach in <ref type="bibr">[7]</ref> is designed only for n = r = 2 and the empirical Bayes approach repfdr in <ref type="bibr">[23]</ref> encounters both accuracy and computational barriers for n as large as 8, as shown in our simulations. There is thus a need for a powerful and fast method that can guarantee simultaneous error control and can handle a larger number of studies.</p><p>In this paper, we introduce AdaFilter, an adaptive filtering multiple testing procedure for multiple PC hypotheses. We propose different versions of AdaFilter to control simultaneous error rates including FDR (false discovery rate) and FWER (familywise error rate). AdaFilter can control FWER and FDR when all nM base p-values are independent. In addition, it asymptotically controls FDR when M goes to infinity, allowing base p-values to be weakly associated within each study. The weak dependence only assumes that within each study, the number of pairs (j, j′) where the base p-values p_j and p_{j′} are dependent is o(M^2), which is reasonable for most genetics and genomics data. Using simulations and real data applications, we show that AdaFilter is robust to dependence of p-values within each study and can have much higher power than the "direct approach" or repfdr.</p><p>Deferring precise statements to later sections, we give an intuitive explanation for how AdaFilter gains power. The low power of the "direct approach" is due to the fact that partial conjunction has a composite null. AdaFilter's power gain is linked to its ability to borrow information across studies and learn from the data which PC hypotheses are likely to be least favorable nulls. 
Intuitively, AdaFilter filters the set of hypotheses down to a number m &lt; M of candidate least favorable nulls, which are the nulls that have exactly r − 1 base non-nulls. The PC p-values are still "valid" conditional on filtering, and the decreased number of hypotheses lowers the multiplicity burden. More surprisingly, the power gain is also linked to a lack of "monotonicity" of the number of rejections in the base p-values, where increasing some base p-values can result in more rejections. In the extreme case, combining multiple studies while requiring replicability can even lead to more rejections than the union of rejections from testing each individual study separately.</p><p>The structure of the paper is as follows. Section 2 precisely defines the PC framework, and illustrates the power limitation of the "direct approach". Section 3 introduces our AdaFilter procedures. Section 4 discusses theoretical properties of AdaFilter. Section 5 explores the performance with simulations. Section 6 applies AdaFilter to several real studies. Section 7 has conclusions. An R package implementing AdaFilter is available at <ref type="url">https://github.com/jingshuw/adaFilter</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Multiple testing for partial conjunctions.</head><p>In this section, we briefly introduce partial conjunction hypotheses and explain why the "direct approach" has low power when testing multiple PC hypotheses.</p><p>2.1. Problem setup. We consider the problem where M null hypotheses are tested in n studies. The base null hypotheses are (H_{0ij})_{n×M}. In high-throughput experiments, M is the number of genes or DNA loci. We work with summary statistics that are base p-values (p_{ij})_{n×M} for (H_{0ij})_{n×M}. Each p_{ij} is the realization of a random variable P_{ij}. We assume that each base P-value is valid, satisfying P(P_{ij} ≤ γ) ≤ γ under its null. Also, let P_{(1)j} ≤ P_{(2)j} ≤ ⋯ ≤ P_{(n)j} be the sorted P-values for each j = 1, 2, …, M.</p><p>The PC null H^{r/n}_{0j} states that fewer than r of the base nulls H_{01j}, …, H_{0nj} are false. The case r = 1, H^{1/n}_{0j}, is the commonly tested global null for meta-analysis. Rejecting it would not guarantee replicability. In high-throughput experiments, for each DNA locus or gene j ∈ {1, 2, …, M}, we test a PC null H^{r/n}_{0j} to evaluate whether genetic signals have been replicated at least r times across n studies. Throughout the paper, we assume that p-values across studies are independent. This can be assumed when samples do not overlap across studies.</p><p>For a multiple testing procedure on {H^{r/n}_{01}, …, H^{r/n}_{0M}}, denote the decision function as φ_j = 1 if we reject H^{r/n}_{0j} and φ_j = 0 otherwise. The total number of discoveries is then R = Σ_{j=1}^{M} φ_j. Among these, the number of false discoveries is</p><p><formula notation="TeX">V = \sum_{j=1}^{M} \varphi_j \, 1_{\{v_j = 0\}},</formula></p><p>where v_j = 0 if H^{r/n}_{0j} is true and v_j = 1 otherwise. There are many measures of the simultaneous error rate <ref type="bibr">[11]</ref>, with FWER and FDR being the most common ones. In addition, we consider the per-family error rate (PFER), as it provides a motivation for our procedures. With the notation introduced, we have</p><p><formula notation="TeX">\mathrm{PFER} = E(V), \qquad \mathrm{FWER} = P(V \ge 1), \qquad \mathrm{FDR} = E(\mathrm{FDP}),</formula></p><p>where FDP = V/(R ∨ 1) is the false discovery proportion.</p></div>
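<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a hedged illustration of the notation in Section 2.1, the following Python sketch computes R, V and the FDP from a vector of decisions and a known truth pattern. The function names and the boolean-matrix encoding of base non-nulls are assumptions made for this example only; they are not taken from the AdaFilter R package.</p>
<p><![CDATA[
```python
import numpy as np

def pc_null_is_true(base_nonnull, r):
    """Return a length-M boolean vector, True where the PC null H^{r/n}_{0j} holds.

    base_nonnull is an n x M boolean array (hypothetical encoding) whose
    (i, j) entry is True when the base null H_{0ij} is false.
    """
    base_nonnull = np.asarray(base_nonnull, dtype=bool)
    # The PC null holds when fewer than r base nulls are false.
    return base_nonnull.sum(axis=0) < r

def error_counts(reject, pc_null_true):
    """Return (R, V, FDP) for decisions phi_j = reject[j]."""
    reject = np.asarray(reject, dtype=bool)
    pc_null_true = np.asarray(pc_null_true, dtype=bool)
    R = int(reject.sum())                    # total number of discoveries
    V = int((reject & pc_null_true).sum())   # discoveries whose PC null is true
    fdp = V / max(R, 1)                      # FDP = V / (R v 1)
    return R, V, fdp

# Made-up truth pattern with n = 2 studies and M = 3 hypotheses.
truth = pc_null_is_true([[True, True, False],
                         [True, False, False]], r=2)
R, V, fdp = error_counts([True, True, False], truth)
```
]]></p>
<p>Averaging V, the indicator of V ≥ 1, and the FDP over repeated simulated data sets would give Monte Carlo estimates of the PFER, FWER and FDR, respectively.</p></div>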
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2.2.</head><p>The "direct approach". We start with a brief review of p-value construction for a single PC null; more details can be found in <ref type="bibr">[44]</ref> and <ref type="bibr">[4]</ref>. Consider a single PC null H^{r/n}_0 with a vector of base P-values (P_1, P_2, …, P_n) and let P_{r/n} = f(P_1, P_2, …, P_n) be the combined P-value for H^{r/n}_0. Benjamini and Heller <ref type="bibr">[4]</ref> discussed three approaches, which we report here using the sorted base P-values P_{(1)} ≤ P_{(2)} ≤ ⋯ ≤ P_{(n)}:</p><p>1. Simes' method: <formula notation="TeX">P^{S}_{r/n} = \min_{r \le i \le n} \frac{n - r + 1}{i - r + 1} P_{(i)}</formula></p><p>2. Fisher's method: <formula notation="TeX">P^{F}_{r/n} = P\left( \chi^2_{2(n-r+1)} \ge -2 \sum_{i=r}^{n} \log P_{(i)} \right)</formula></p><p>3. Bonferroni's method: <formula notation="TeX">P^{B}_{r/n} = (n - r + 1) P_{(r)}</formula></p><p>The idea is to apply meta-analysis to the largest n − r + 1 base P-values. For instance, if n = r = 2, then P^{S}_{2/2} = P^{F}_{2/2} = P^{B}_{2/2} = max(p_1, p_2). All three methods construct valid PC P-values for H^{r/n}_0 under independence, and <ref type="bibr">[44]</ref> showed that they also provide the most powerful tests for a single PC null. For M hypotheses, we denote by P_{r/n,j} the PC p-value for the jth PC null.</p><p>The "direct approach" is to simply apply standard multiple testing adjustment procedures to the M PC P-values. For example, to control the FWER at level α, we could use the Bonferroni rule, rejecting H^{r/n}_{0j} if P_{r/n,j} ≤ α/M, which also controls the PFER at level α <ref type="bibr">[42]</ref>.</p><p>To control the FDR we could apply the BH procedure <ref type="bibr">[5]</ref> to {P_{r/n,j}, j = 1, …, M}.</p><p>However, this direct approach is often too conservative, as we illustrate now for the case r = n. To quantify how the performance depends on the composite nature of a PC null, define the sets</p><p><formula notation="TeX">I_k = \{\, j \in \{1, \dots, M\} \mid \text{exactly } k \text{ of } H_{01j}, \dots, H_{0nj} \text{ are false} \,\} \qquad (1)</formula></p><p>for k = 0, …, n. The sets {I_k, k = 0, …, n} define a partition of {1, …, M}. 
If a false rejection of H^{n/n}_{0j} happens, then the jth column must belong to one of the I_k where k ≤ n − 1.</p><p>Thus, if we use Bonferroni to control the FWER at a nominal level α, the true FWER instead satisfies</p><p><formula notation="TeX">\mathrm{FWER} \le E(V) = \sum_{k=0}^{n-1} \sum_{j \in I_k} P\left( P_{(n)j} \le \frac{\alpha}{M} \right) \le \sum_{k=0}^{n-1} |I_k| \left( \frac{\alpha}{M} \right)^{n-k},</formula></p><p>where the second inequality is close to an equality when all the tests for non-nulls H_{1ij} have high power. Let π_k = |I_k|/M be the proportion of hypotheses in each partition. Then we have</p><p><formula notation="TeX">E(V) \le \sum_{k=0}^{n-1} \pi_k \, \alpha^{n-k} M^{-(n-k-1)}, \qquad (2)</formula></p><p>which in the limit is dominated by π_{n−1} α (when π_{n−1} ≠ 0) or is of order O(M^{−1}) (when π_{n−1} = 0) for large M. Thus, when π_{n−1} ≈ 0, a typical scenario in genetics problems with sparse signal, the expected number of false rejections E(V) would be much smaller than α and the "direct approach" can become highly deficient, in fact much more conservative than Bonferroni usually is.</p><p>The point is that if we do not account for the fact that the PC null is composite, we will control the simultaneous error rates under the worst case scenario (π_{n−1} = 1), which is unnecessary. For general r ≤ n, the level of E(V) for Bonferroni correction will depend mainly on π_{r−1} in the large M setting. The same holds for BH control of the FDR.</p><p>It is clear that there can be more efficient procedures if the fractions π_k were known or if good estimates of π_k can be obtained. This is what motivates the Bayesian methods <ref type="bibr">[23,</ref><ref type="bibr">14]</ref>. In this paper we take a frequentist perspective. Rather than estimating π_k, AdaFilter works directly on an alternative estimate of V, implicitly and adaptively adjusting for the size of π_{r−1}, the fraction of the least favorable nulls.</p><p>3. The idea of AdaFilter. In Section 2.2, we showed that a PC null hypothesis is composite, thus the inequality P(P_{r/n} ≤ γ) ≤ γ for a given γ is only tight for the least favorable null, while standard multiple testing procedures are designed to control error when P(P_{r/n} ≤ γ) = γ is always true. 
To overcome this, AdaFilter leverages a region A_γ ⊂ [0, 1]^n such that the much tighter inequality P(P_{r/n,j} ≤ γ | (P_{1j}, …, P_{nj}) ∈ A_γ) ≤ γ holds for any configuration in the PC null space. Figure <ref type="figure">1</ref> illustrates the construction of the filtering region A_γ for r = n = 2. The PC test j has base p-values P_{1j} and P_{2j}, and its PC p-value is P_{2/2,j} = max(P_{1j}, P_{2j}). The null H^{2/2}_{0j} contains three configurations: (H_{01j}, H_{02j}) being (True, True), (True, False) or (False, True). It is easy to see that P(P_{2/2,j} ≤ γ) ≤ γ^2 under (True, True), while P(P_{2/2,j} ≤ γ) can be close to γ under the other two, less favorable, configurations. Let us consider, instead, conditioning on (P_{1j}, P_{2j}) being in the "L"-shaped filtering region A_γ = {(p_1, p_2) | min(p_1, p_2) ≤ γ}. We get that P(P_{2/2,j} ≤ γ | (P_{1j}, P_{2j}) ∈ A_γ) ≤ γ holds for all three null scenarios, which is much tighter than P(P_{r/n} ≤ γ) ≤ γ. The inequality holds since at least one of P_{1j} and P_{2j} is stochastically greater than uniform under all three configurations.</p><p>Since the Bonferroni and BH procedures are based on an implicit estimate of the number of false rejections V associated with a threshold γ, namely V̂ = Mγ, we can improve their efficiency with a smaller estimate of V using the new inequality. Specifically, the estimated number of false rejections becomes</p><p><formula notation="TeX">\widehat{V}_A = \gamma \sum_{j=1}^{M} 1_{\{(P_{1j}, P_{2j}) \in A_\gamma\}},</formula></p><p>where M is replaced by the number of hypotheses falling into the "L"-shaped region, a possibly much smaller number than M. Alternatively, the quantity (1/M) Σ_{j=1}^{M} 1_{(P_{1j}, P_{2j}) ∈ A_γ} is our "estimate" of π_{r−1}, the fraction of least favorable nulls. Hypotheses that fall outside of the "L"-shaped filtering region are not counted towards the multiplicity of the PC hypotheses.</p><p>To control the FWER (and PFER) at level α, we can adaptively choose the largest γ satisfying V̂_A ≤ α. 
Similarly, to control the FDR at level α, we estimate the FDP as V̂_A/(R ∨ 1) and select the largest γ such that V̂_A/(R ∨ 1) ≤ α. These are essentially the Bonferroni or BH procedure with an alternative estimate of V.</p><p>3.1. Definition of AdaFilter procedures. Now we formally define AdaFilter for general n and r. It is convenient to first introduce the notion of filtering and selection "P-values". These are</p><p><formula notation="TeX">F_j := (n - r + 1) P_{(r-1)j} \qquad (3)</formula></p><p>and</p><p><formula notation="TeX">S_j := (n - r + 1) P_{(r)j}. \qquad (4)</formula></p><p>DEFINITION 3.1 (AdaFilter Bonferroni). For a level α, and with F_j and S_j given by ( <ref type="formula">3</ref>) and ( <ref type="formula">4</ref>) respectively, reject H^{r/n}_{0j} if S_j &lt; γ_0^{Bon}, where γ_0^{Bon} is the largest γ satisfying <formula notation="TeX">\gamma \sum_{j=1}^{M} 1_{\{F_j &lt; \gamma\}} \le \alpha.</formula></p><p>DEFINITION 3.2 (AdaFilter BH). For a level α, and with F_j and S_j given by ( <ref type="formula">3</ref>) and ( <ref type="formula">4</ref>) respectively, reject H^{r/n}_{0j} if S_j &lt; γ_0^{BH}, where γ_0^{BH} is the largest γ satisfying <formula notation="TeX">\gamma \sum_{j=1}^{M} 1_{\{F_j &lt; \gamma\}} \le \alpha \left( \sum_{j=1}^{M} 1_{\{S_j &lt; \gamma\}} \vee 1 \right).</formula></p><p>We define the filtering region as {F_j &lt; γ} instead of {F_j ≤ γ} to guarantee that γ_0^{Bon} and γ_0^{BH} themselves satisfy the corresponding inequalities. This is important for showing the theoretical properties of AdaFilter procedures, especially when base p-values are discrete. The rejection criterion is set to S_j &lt; γ_0 instead of S_j ≤ γ_0, where γ_0 is either γ_0^{Bon} or γ_0^{BH} accordingly (for Lemma 4.1).</p><p>We also introduce AdaFilter adjusted "p-values" like those commonly computed for the standard Bonferroni and BH procedures. They provide the same sets of rejections as the above definitions, while being more efficient to compute.</p></div>
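<div xmlns="http://www.tei-c.org/ns/1.0"><p>The threshold search in Definitions 3.1 and 3.2 can be sketched in Python as follows. This is a minimal sketch, not the implementation in the authors' R package: the function names are hypothetical, and it assumes that the threshold is the largest γ in (0, α] satisfying the defining inequality. Restricting the search to γ ≤ α is an assumption of the sketch; it should not change the rejection set, because any feasible γ with at least one filtered hypothesis is at most α. The search evaluates the inequality on each interval between consecutive values of F_j and S_j, where both counts are constant.</p>
<p><![CDATA[
```python
import numpy as np

def filter_select_pvalues(p, r):
    """Compute F_j and S_j from an n x M array of base p-values (2 <= r <= n)."""
    p = np.asarray(p, dtype=float)
    n = p.shape[0]
    if not 2 <= r <= n:
        raise ValueError("r must satisfy 2 <= r <= n")
    p_sorted = np.sort(p, axis=0)          # row k holds P_(k+1)j
    F = (n - r + 1) * p_sorted[r - 2]      # (n - r + 1) * P_(r-1)j
    S = (n - r + 1) * p_sorted[r - 1]      # (n - r + 1) * P_(r)j
    return F, S

def adafilter_threshold(F, S, alpha, method="bonferroni"):
    """Largest gamma in (0, alpha] with gamma * #{F_j < gamma} <= budget(gamma).

    budget is alpha for "bonferroni" and alpha * max(#{S_j < gamma}, 1) for "bh".
    """
    if method not in ("bonferroni", "bh"):
        raise ValueError("method must be 'bonferroni' or 'bh'")
    F = np.asarray(F, dtype=float)
    S = np.asarray(S, dtype=float)
    # Both counts are constant on (left, t] for consecutive knots left < t.
    knots = np.unique(np.concatenate([F, S, [alpha]]))
    knots = knots[(knots > 0) & (knots <= alpha)]
    best, left = 0.0, 0.0
    for t in knots:
        n_filtered = int(np.sum(F < t))
        budget = alpha
        if method == "bh":
            budget = alpha * max(int(np.sum(S < t)), 1)
        # Largest feasible point of (left, t], if there is one.
        c = t if n_filtered == 0 else min(t, budget / n_filtered)
        if c > left:
            best = max(best, float(c))
        left = t
    return best

def adafilter_reject(p, r, alpha, method="bonferroni"):
    """Return (boolean rejection vector, threshold gamma_0)."""
    F, S = filter_select_pvalues(p, r)
    gamma0 = adafilter_threshold(F, S, alpha, method)
    return S < gamma0, gamma0

# Made-up base p-values: n = 2 studies (rows), M = 2 hypotheses (columns).
p_toy = [[0.03, 0.50],
         [0.04, 0.60]]
rej_bon, gamma_bon = adafilter_reject(p_toy, r=2, alpha=0.05)
rej_bh, gamma_bh = adafilter_reject(p_toy, r=2, alpha=0.05, method="bh")
```
]]></p>
<p>In this made-up example only the first hypothesis passes the filter, so the multiplicity count is 1 rather than M = 2 and the first PC null is rejected at level 0.05 under both variants.</p></div>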
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DEFINITION 3.3 (AdaFilter adjusted p-values).</head><p>Rank the selection p-values as S_{(1)} ≤ S_{(2)} ≤ ⋯ ≤ S_{(M)}, where S_{(j)} is for the null hypothesis H^{r/n}_{0(j)}. For each j, define an AdaFilter adjustment number</p><p>Then the AdaFilter Bonferroni adjusted P-value for H^{r/n}_{0(j)} is</p><p>and the AdaFilter BH adjusted P-value for H^{r/n}_{0(j)} is</p><p>For any level α &gt; 0, we reject the hypotheses whose AdaFilter adjusted p-values are smaller than α. We can verify that the AdaFilter adjusted p-values give the same set of rejections as Definition 3.1 and Definition 3.2.</p><p>PROPOSITION 3.4. For any level α &gt; 0, the set of rejections defined as {j : P^{Bon}_j &lt; α} is equivalent to the set of rejections from Definition 3.1. Similarly, the set of rejections defined as {j : P^{BH}_j &lt; α} is equivalent to the set of rejections from Definition 3.2.</p><p>In practice, the AdaFilter adjusted p-values can be computed more easily than γ_0^{Bon} and γ_0^{BH}. Our simulations and real data applications in Sections 5 and 6 also use these adjusted p-values to obtain the rejections of AdaFilter procedures.</p></div>
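<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way such adjusted p-values could be computed is sketched below in Python. The specific forms used here are assumptions of this sketch rather than formulas confirmed by the text: the adjustment number is taken to be m_j = #{k : F_k ≤ S_(j)}, the Bonferroni adjusted p-value is min(m_j S_(j), 1), and the BH adjusted p-value is the running minimum of m_k S_(k)/k over k ≥ j, capped at 1. The function name is hypothetical.</p>
<p><![CDATA[
```python
import numpy as np

def adafilter_adjusted_pvalues(F, S):
    """Return (Bonferroni-type, BH-type) adjusted p-values in the input order.

    Assumed forms (not confirmed by the source text):
      m_j     = #{k : F_k <= S_(j)}
      P_Bon   = min(m_j * S_(j), 1)
      P_BH    = min over k >= j of m_k * S_(k) / k, capped at 1
    """
    F = np.asarray(F, dtype=float)
    S = np.asarray(S, dtype=float)
    order = np.argsort(S, kind="stable")
    S_sorted = S[order]
    F_sorted = np.sort(F)
    # Number of filtering p-values not exceeding each ranked selection p-value.
    m = np.searchsorted(F_sorted, S_sorted, side="right")
    bon_sorted = np.minimum(m * S_sorted, 1.0)
    # rank = #{k : S_k <= S_(j)}; equals j when there are no ties.
    rank = np.searchsorted(S_sorted, S_sorted, side="right")
    ratio = m * S_sorted / rank
    # Running minimum from the largest selection p-value downwards.
    bh_sorted = np.minimum(np.minimum.accumulate(ratio[::-1])[::-1], 1.0)
    bon = np.empty_like(S)
    bh = np.empty_like(S)
    bon[order] = bon_sorted
    bh[order] = bh_sorted
    return bon, bh

# Made-up filtering and selection p-values for M = 2 hypotheses.
bon_adj, bh_adj = adafilter_adjusted_pvalues(F=[0.03, 0.50], S=[0.04, 0.60])
```
]]></p>
<p>Under these assumed forms, rejecting the hypotheses whose adjusted p-value is below α = 0.05 selects only the first hypothesis in this made-up example.</p></div>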
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.2.</head><p>A heuristic comparison with the "direct approach". Before we discuss the theoretical properties of AdaFilter procedures in Section 4, we revisit the case of r = n in Section 2.2 to understand the level of power gain from AdaFilter procedures compared with the "direct approach". When r = n, the PC p-values for the "direct approach" are P_{r/n,j} = P_{(n)j}, which are the same as the selection p-values of AdaFilter procedures. As a consequence, AdaFilter procedures would not change the ordering/ranking of the individual PC hypotheses. AdaFilter gains power by selecting a much less conservative PC p-value threshold than the "direct approach" for the same nominal FWER/FDR level.</p><p>If one controls the FWER at level α, then the PC p-value threshold from the "direct approach" using Bonferroni adjustment is α/M. We now give an approximation of the threshold from AdaFilter Bonferroni. When r = n, at any given threshold γ, the estimate of the number of false discoveries used in AdaFilter is</p><p><formula notation="TeX">\widehat{V}(\gamma) = \gamma \sum_{j=1}^{M} 1_{\{P_{(n-1)j} &lt; \gamma\}}.</formula></p><p>AdaFilter Bonferroni finds the largest γ so that V̂(γ) ≤ α. As defined in (1), let I_k ⊂ {1, …, M} be the set of hypotheses with exactly k base non-nulls and let π_k = |I_k|/M. When M is large, the expected value of V̂(γ) satisfies</p><p>The first inequality is due to the fact that all base null p-values are independent and for each j ∈ I_k, we can decompose P(P_{(n−1),j} &lt; γ) into the events that all n − k base nulls i satisfy P_{ij} ≤ γ and exactly n − k − 1 base nulls satisfy this constraint. So roughly, the AdaFilter Bonferroni threshold γ^{Bon} will be around some value that is at least α/(M(π_n + π_{n−1})) + o(1/M). Compared with the Bonferroni threshold α/M in the "direct approach", AdaFilter Bonferroni increases this threshold by a factor of 1/(π_n + π_{n−1}). In our motivating applications, both π_n and π_{n−1} are typically small, and so such an increase would be substantial. 
The resulting actual FWER is also less conservative. If we use a fixed threshold at γ = α/(M(π_n + π_{n−1})), then</p><p>Compared to the bound α π_{n−1} + O(1/M) in (2) from the "direct approach", we can now be much less conservative, especially when the proportion of least favorable PC nulls π_{n−1} is small.</p><p>4. Theoretical properties of AdaFilter. Now we prove that AdaFilter procedures control simultaneous error rates under various conditions. As stated in Section 2.1, all the following results assume that p-values across the n studies are independent. The key property that AdaFilter relies on is the following conditional validity lemma:</p><p>LEMMA 4.1 (Conditional validity). When H^{r/n}_{0j} is true, for any fixed γ &gt; 0, <formula notation="TeX">P(S_j &lt; \gamma \mid F_j &lt; \gamma) \le \gamma \qquad (5)</formula> holds whenever P(F_j &lt; γ) &gt; 0. Here F_j and S_j are given by (3) and (4), respectively.</p><p>Inequality ( <ref type="formula">5</ref>) can be equivalently written as P(S_j &lt; γ) ≤ γ P(F_j &lt; γ), which holds even when P(F_j &lt; γ) = 0, as S_j ≥ F_j is always true. Intuitively, the "conditional validity" guarantees that for a fixed threshold γ, the estimated upper bound on the number of false rejections V is γ Σ_j 1_{F_j &lt; γ}. However, AdaFilter uses a data-dependent γ, so extra assumptions on the base p-values within one study are needed to prove simultaneous error control of AdaFilter.</p><p>REMARK 4.1. Though we name our method AdaFilter Bonferroni, we can only prove FWER/PFER control under independence of the p-values within each study. Simulations in Section 5 nevertheless show that FWER/PFER control can also be achieved in practice for dependent p-values within each study.</p><p>REMARK 4.2. To control the FWER, one can combine AdaFilter Bonferroni with the sequential rejection principle <ref type="bibr">[18]</ref> to further increase the number of rejections while controlling the FWER at the same level. Intuitively, this is similar to improving the standard Bonferroni procedure with Holm's procedure. 
For a more detailed discussion, see Section S1.</p><p>For AdaFilter BH, however, we can only prove that it controls FDR at the nominal level of αC(M), where C(M) = Σ_{j=1}^{M} 1/j ≈ log M. In other words, adjusting the threshold to be α/C(M) can guarantee control of the FDR at level α. The inflation factor C(M) in Theorem 4.3 for the AdaFilter BH procedure is due to a technical difficulty encountered when proving FDR control for finite M. In Section 5, we find in simulations that the AdaFilter BH procedure adjusted by C(M) still achieves higher power than other benchmarking approaches. Our simulations also suggest that the adjustment C(M) is actually not needed in practice. In Section 4.2, we will show that AdaFilter BH can asymptotically control FDR without using the inflation factor C(M) when M → ∞. The asymptotic results also do not require independence among p-values within each study.</p><p>4.2. Asymptotic FDR control when M → ∞. Now we discuss FDR control of AdaFilter BH when the number of hypotheses M is very large, the usual case in high-throughput genetic experiments. Inspired by <ref type="bibr">[13]</ref>, we make the following three assumptions.</p><p>First, instead of requiring independent p-values within each study, we only assume a weak dependence structure among the p-values within each study.</p><p>ASSUMPTION 1 (Weak dependence). Within any study i, the p-values P_{ij} for j = 1, 2, …, M satisfy weak dependence where for any fixed γ</p><p>One scenario where the weak dependence holds is that, within each study i, the number of pairs (P_{ij}, P_{ij′}) where P_{ij} and P_{ij′} are not independent is o(M^2). For microarrays or RNA-seq experiments, gene-gene networks are typically sparser than O(M^2). For GWAS or eQTLs, DNA loci are usually associated only when they are close enough along the DNA chain, say when |j − j′| &lt; b for some constant b. 
The weak dependence assumption is reasonable for both of the above scenarios.</p><p>Now let ℋ^{r/n}_0 = {j : H^{r/n}_{0j} is true} be the set of true PC nulls and M_0 be its cardinality. Similarly, define ℋ^{r/n}_1 = {j : H^{r/n}_{1j} is true} to be the set of true PC non-nulls and let M_1 be its cardinality. Besides weak dependence, we also assume that certain limits exist when M → ∞:</p><p>ASSUMPTION 2 (Existence of limits). The following limits exist:</p><p>For a given n, there are 2^n combinations of base hypotheses being null or non-null. A special case where Assumption 2 is satisfied is when each of these combinations has a limiting proportion and, within each study, the base p-values have identical distributions under the null and identical distributions under the non-null, such as a mixture driven by random underlying effect sizes. Specifically, for any c ∈ {0, 1}^n representing one of the 2^n combinations, let m_c be the number of PC hypotheses that fall into this combination. Also, let ℋ_{0i} and ℋ_{1i} be the sets of true nulls and true non-nulls for the ith study. If (a) lim_{M→∞} m_c/M exists for all c and (b) for each i, {P_{ij} : j ∈ ℋ_{0i}} have identical distributions across j and {P_{ij} : j ∈ ℋ_{1i}} also have identical distributions across j, then Assumption 2 is satisfied.</p><p>Under Assumption 2, we denote</p><p>and further define the "asymptotic FDR" for a given γ as</p><p>and the largest γ_0^∞ such that f_∞(γ) ≤ α, i.e., γ_0^∞ = sup{γ : f_∞(γ) ≤ α}. Then f_∞(γ) is 0 when γ = 0 and exceeds 1 when γ = 1, thus the above set is not empty. We make a final technical assumption on the functions f_∞(·), S_0(·) and S_1(·) around γ_0^∞:</p><p>ASSUMPTION 3 (Technical conditions). 
The following two conditions hold:</p><p>(a) There exists ε &gt; 0 such that f_∞(γ) is monotonically increasing in the interval (γ_0^∞ − ε, γ_0^∞], and (b) S_0(γ) and S_1(γ) are both continuous at the point γ_0^∞.</p><p>Intuitively, (a) guarantees that the limit of the AdaFilter threshold γ_0^{BH} is unique when M → ∞ and (b) is satisfied if there are sufficient points (selection p-values) around γ_0^∞ when M is large. Now we are ready to state the asymptotic FDR control of AdaFilter BH. Notice that Assumption 3(a) implies that f_∞(γ_0^∞) &gt; 0, thereby guaranteeing S(γ_0^∞) &gt; 0.</p><p>REMARK 4.3. Theorem 4.4 still holds if Assumption 2 is weakened to allow π_0 = 0 while M_0 → ∞ and Assumption 1 is modified to: for any fixed γ,</p><p>for both s = 0, 1. We cannot deal with π_0 = 1, as that would lead to S(γ_0^∞) = 0 and violate Assumption 3(a). In Section 5, we show with simulations that both simultaneous error rates can be controlled in practice even when M_0/M = 0.99.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4.3.</head><p>Lack of complete monotonicity. The increased power of AdaFilter can lead to an unexpected power gain when combining multiple similar studies. Suppose that we test the involvement of M genes in a disease with two studies. One researcher uses BH or Bonferroni separately on the M base p-values in each study and claims that a gene is important for the pathology if it is rejected in either of the two studies. Another researcher runs AdaFilter with r = 2 on the same data while claiming that a gene is selected only when its nulls are false in both studies. The second researcher has a stricter goal; however, it is possible that she makes more discoveries than the first.</p><p>To see how this could happen, consider the toy example in Table <ref type="table">1a</ref> where M = 2. In both studies, neither of the two hypotheses can be rejected at significance level α = 0.05 when using either Bonferroni or BH on each study separately. However, both AdaFilter Bonferroni and AdaFilter BH can reject H^{2/2}_{01} at the same nominal level. This interesting phenomenon arises from the lack of monotonicity of the number of rejections in the base p-values. A multiple testing procedure has "complete monotonicity" if reducing any base p-values can never cause any of the decisions on the null hypotheses to switch from 'reject' to 'accept'.</p><p>Table 1. (b) A counterexample to show that AdaFilter violates "complete monotonicity". The significance level is α = 0.05.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DEFINITION 4.5 (Complete monotonicity). A multiple testing procedure has complete monotonicity if each decision function φ_j is a non-increasing function in all the elements of the base p-value matrix (p_{·1}, …, p_{·M}).</head><p>Simes', Fisher's and Bonferroni's meta-analyses have complete monotonicity. So does the BH procedure with n = 1. Heller, Bogomolov and Benjamini <ref type="bibr">[21]</ref> call this property "stability" and it holds for the PC tests of <ref type="bibr">[22]</ref>. However, AdaFilter does not satisfy complete monotonicity: lowering one of the p-values for gene j can change the rejection of H^{r/n}_{0j′} to acceptance for some j′ ≠ j. Table <ref type="table">1b</ref> shows how AdaFilter fails to have complete monotonicity. Compared with Table 1a, the second hypothesis has a decreased p-value in study 1 while all other p-values are kept fixed. In Table <ref type="table">1a</ref>, γ_0^{Bon} = γ_0^{BH} = 0.05, so the first PC hypothesis is rejected. In contrast, in Table <ref type="table">1b</ref>, γ_0^{Bon} = γ_0^{BH} = 0.03, so that none of the hypotheses can be rejected even though the p-value matrix is elementwise no larger.</p><p>This lack of complete monotonicity, which might appear undesirable, is in fact at the core of the efficiency of AdaFilter. A larger P_{ij} can increase F_j and thereby reduce the multiplicity burden. When only a few hypotheses are non-null, as in a sparse genomics setting, we expect many large P_{ij}. This gives AdaFilter a substantial advantage in identifying the few non-null PC hypotheses. From another perspective, increased base p-values may make the signal configuration across genes more similar among studies. AdaFilter can implicitly learn such similarity and utilize it to allow more rejections.</p><p>Though lacking "complete monotonicity", AdaFilter retains a "partial monotonicity" property: reducing one of the n base p-values for test j can never change the decision from rejecting H^{r/n}_{0j} to accepting it.</p><p>DEFINITION 4.6 (Partial monotonicity). 
A multiple testing procedure has partial monotonicity if for all j ∈ {1, …, M}, its decision function φ_j(p_{·1}, …, p_{·M}) is non-increasing in all elements of (p_{1j}, p_{2j}, …, p_{nj}).</p><p>Partial monotonicity only requires the test of hypothesis j to be monotone in the p-values for that same hypothesis. It allows a reduction in p_{ij′} for j′ ≠ j to reverse a rejection of H^{r/n}_{0j}. We have the following result:</p><p>COROLLARY 4.7. Both the AdaFilter Bonferroni and the AdaFilter BH procedures satisfy partial monotonicity for all null hypotheses H^{r/n}_{0j}, j = 1, 2, …, M.</p><p>Corollary 4.7 indicates that AdaFilter is reasonable in the sense that reducing the base p-values of the jth PC hypothesis indeed strengthens the evidence of replicability for the jth PC hypothesis, though possibly weakening the evidence of replicability for other PC hypotheses.</p></div>
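<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two monotonicity notions can be illustrated numerically. The Python sketch below uses made-up base p-values (they are not the entries of Table 1) with n = r = 2 and M = 2, and a hypothetical helper that assumes the AdaFilter Bonferroni threshold is the largest γ in (0, α] with γ · #{F_j &lt; γ} ≤ α. Lowering a p-value of the second hypothesis removes the rejection of the first, whereas lowering a p-value of the first hypothesis itself keeps it rejected.</p>
<p><![CDATA[
```python
import numpy as np

def adafilter_bonferroni_2of2(p, alpha):
    """AdaFilter Bonferroni sketch for n = r = 2 (hypothetical helper).

    p is a 2 x M array of base p-values. With n = r = 2 the filtering
    p-value is F_j = min(p_1j, p_2j) and the selection p-value is
    S_j = max(p_1j, p_2j). Returns (rejections, threshold).
    """
    p = np.asarray(p, dtype=float)
    F, S = p.min(axis=0), p.max(axis=0)
    # The count #{F_j < gamma} is constant between consecutive knots.
    knots = np.unique(np.concatenate([F, [alpha]]))
    knots = knots[(knots > 0) & (knots <= alpha)]
    best, left = 0.0, 0.0
    for t in knots:
        n_filtered = int(np.sum(F < t))
        # Largest feasible point of (left, t], if there is one.
        c = t if n_filtered == 0 else min(t, alpha / n_filtered)
        if c > left:
            best = max(best, float(c))
        left = t
    return S < best, best

alpha = 0.05
# (a) Made-up starting matrix: rows are studies, columns are hypotheses.
p_a = np.array([[0.03, 0.50],
                [0.04, 0.60]])
# (b) Lower the study-1 p-value of hypothesis 2 only.
p_b = p_a.copy()
p_b[0, 1] = 0.035
# (c) Lower the study-1 p-value of hypothesis 1 only.
p_c = p_a.copy()
p_c[0, 0] = 0.01

rej_a, gamma_a = adafilter_bonferroni_2of2(p_a, alpha)  # hypothesis 1 rejected
rej_b, gamma_b = adafilter_bonferroni_2of2(p_b, alpha)  # nothing rejected
rej_c, gamma_c = adafilter_bonferroni_2of2(p_c, alpha)  # hypothesis 1 still rejected
```
]]></p>
<p>In case (b) the second hypothesis enters the filtering region, the multiplicity count rises from 1 to 2 and the threshold falls below S_1 = 0.04, which is the mechanism behind the failure of complete monotonicity described above.</p></div>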
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4.4.</head><p>Extensions and discussion of related literature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4.4.1.</head><p>Comparison with other strategies. Two methods directly related to AdaFilter are that of <ref type="bibr">[7]</ref> for n = r = 2 and the empirical Bayes approach of <ref type="bibr">[23]</ref> for controlling the Bayes FDR. Both are designed to test multiple PC nulls and were developed to improve on the efficiency of the "direct approach" we described. AdaFilter is similar to the method of <ref type="bibr">[7]</ref> but works for any n and r. It provides a frequentist approach comparable to and sometimes better than <ref type="bibr">[23]</ref>.</p><p>The procedures of <ref type="bibr">[7]</ref> use a filtering step for each study based on the p-values in the other study and a selection step that rejects hypotheses that have small enough p-values in both studies. To maximize efficiency, the authors suggest a data-adaptive threshold. For instance, to control FWER, they chose two thresholds to satisfy</p><p>Thus $\gamma_0^{\mathrm{Bon}}$ is approximately equal to both thresholds, and AdaFilter becomes similar to their procedure. Their method applies only to n = r = 2; this restriction makes the approach less widely applicable, despite its strong theoretical guarantees. In addition, for n = r = 2, other methods <ref type="bibr">[10,</ref><ref type="bibr">9]</ref> provide powerful multiple testing procedures controlling FWER, and in [? ] the authors proposed a new procedure controlling local FDR.</p><p>In repfdr <ref type="bibr">[23]</ref>, the authors learn the proportion of each of the $2^n$ (or $3^n$ for sign replicability) configurations of base hypotheses, along with the distribution of some Z-values under each configuration. This has cost at least $O(M 2^n)$ while AdaFilter has cost $O(Mn \log n)$. 
There are other multiple testing procedures that aim to find consistent signals across conditions <ref type="bibr">[43,</ref><ref type="bibr">45,</ref><ref type="bibr">48]</ref>, all of which use an empirical Bayes framework as in <ref type="bibr">[23]</ref>. Compared to these methods, AdaFilter is typically faster, guarantees simultaneous error rate control and is more robust to the dependence of p-values within each study.</p><p>Finally, there is a substantial recent literature on efficient FDR control that uses some special data structure as prior knowledge <ref type="bibr">[30,</ref><ref type="bibr">31,</ref><ref type="bibr">2,</ref><ref type="bibr">6]</ref> and then adaptively determines the selection threshold. AdaFilter shares some similar adaptive filtering ideas, but works directly from an $n \times M$ matrix of p-values without assuming any special structure and is tailored to the special nature of the PC hypotheses.</p><p>4.4.2. Variable r and n. In many genetic problems, the M genes or DNA loci can have varying $r_j$ or $n_j$ as they may not be present in every experiment. Then the jth PC null hypothesis is $H^{r_j/n_j}_{0j}$. AdaFilter procedures still work in this scenario because Lemma 4.1 still holds. We only need to replace formulas (<ref type="formula">3</ref>) and (<ref type="formula">4</ref>) by $F_j = (n_j - r_j + 1) P_{(r_j - 1)j}$ and $S_j = (n_j - r_j + 1) P_{(r_j)j}$, respectively.</p><p>4.4.3. Requiring sign replicability. Partial conjunctions with two-sided test statistics can reject $H^{r/n}_{0j}$ in settings where some of the significant findings have test statistics with positive signs and others negative. It is more natural to think of replication as having concordant signs, either consistently positive or consistently negative. In meta-analysis, one can pool n one-sided tests for positive alternatives, repeat that for negative alternatives and double the smaller of the resulting one-sided p-values <ref type="bibr">[35]</ref>. 
This approach is very effective when either the most likely or most useful alternatives to the null have concordant signs. We can adapt this approach to PC tests and AdaFilter as follows.</p><p>We start with two base p-value matrices, $(P^+_{ij})_{n \times M}$ and $(P^-_{ij})_{n \times M}$, for null hypotheses $(H^+_{0ij})_{n \times M}$ and $(H^-_{0ij})_{n \times M}$ respectively. The rejection of $H^+_{0ij}$ is for a positive sign of the signal and the rejection of $H^-_{0ij}$ is for a negative sign. We also define two vectors of PC hypotheses $\{H^{r/n,+}_{01}, \dots, H^{r/n,+}_{0M}\}$ and $\{H^{r/n,-}_{01}, \dots, H^{r/n,-}_{0M}\}$. The PC null $H^{r/n,+}_{0j}$ is rejected if the signal j is positive in at least r studies, and $H^{r/n,-}_{0j}$ is rejected if the signal j is negative in at least r studies. If $r &gt; n/2$ then it will be impossible to reject both $H^{r/n,+}_{0j}$ and $H^{r/n,-}_{0j}$ for the same j. We can apply AdaFilter twice, separately on $\{H^{r/n,+}_{01}, \dots, H^{r/n,+}_{0M}\}$ and $\{H^{r/n,-}_{01}, \dots, H^{r/n,-}_{0M}\}$, controlling the simultaneous error rate (FWER, PFER or FDR) at levels $\alpha_1$ and $\alpha_2$ respectively, with $\alpha_1 + \alpha_2 = \alpha$.</p><p>Let the sets of rejected PC nulls be $R^+$ and $R^-$, respectively. Rejecting the union of these two sets, $R^+ \cup R^-$, controls the corresponding error rate at level $\alpha = \alpha_1 + \alpha_2$ for the null hypotheses $\{H^{r/n,\pm}_{01}, \dots, H^{r/n,\pm}_{0M}\}$. If $r \le n/2$, then there might be some $j \in R^+ \cap R^-$. While such findings are not what we usually have in mind with replication, they could nonetheless be scientifically interesting.</p></div>
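<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the formulas in Section 4.4.2, the following Python sketch computes the filtering and selection p-values $F_j$ and $S_j$ for a single hypothesis when the gene is missing from some studies. The function name filter_and_selection and the use of None to mark a study in which the gene is absent are assumptions made for this example; they are not taken from any AdaFilter software.</p><p><![CDATA[
```python
from typing import Optional, Sequence, Tuple


def filter_and_selection(pvals: Sequence[Optional[float]], r: int) -> Tuple[float, float]:
    """Return (F_j, S_j) for one hypothesis j.

    pvals: base p-values of hypothesis j across studies; None marks a study
           in which the gene is absent (hypothetical convention).
    r:     required replicability level r_j, with 2 <= r_j <= n_j.
    """
    # Keep only the studies where the gene was measured; n_j is their count.
    observed = sorted(p for p in pvals if p is not None)
    n = len(observed)
    if not 2 <= r <= n:
        raise ValueError("need 2 <= r <= number of observed studies")

    scale = n - r + 1              # the factor (n_j - r_j + 1)
    f = scale * observed[r - 2]    # (r_j - 1)-th smallest p-value
    s = scale * observed[r - 1]    # r_j-th smallest p-value
    return f, s


# One gene measured in three of four studies, tested at r_j = 2.
f_j, s_j = filter_and_selection([0.01, None, 0.04, 0.2], r=2)
```
]]></p><p>In this example $n_j = 3$, so the scale factor is $n_j - r_j + 1 = 2$, giving $F_j = 2 \times 0.01 = 0.02$ and $S_j = 2 \times 0.04 = 0.08$.</p></div>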
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4.4.4.</head><p>Testing for all possible values of r. The partial conjunction null $H^{r/n}_0$ can be meaningfully defined whenever $2 \le r \le n$, and sometimes it is of interest to test for all possible r values, adding another layer of multiplicity. In <ref type="bibr">[4]</ref>, it is shown that because the PC p-values $P^{r/n}_j$ are monotone increasing in r, the "direct approach" can control for multiple r values simultaneously, without any further multiplicity adjustment for r. Unfortunately, this is not true for AdaFilter. As the filtering information learnt by AdaFilter varies with r, a signal that is rejected at a larger r using AdaFilter is not guaranteed to also be rejected at a smaller replicability level. The current formulation of AdaFilter is therefore not suited to data-dependent selection of r, which must instead be specified by the user.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5.</head><p>Simulations. We benchmark the performance of AdaFilter against the "direct approach" with the three forms of PC p-values in Section 2.2. For FDR control, we also include <ref type="bibr">[23]</ref>, using their R package repfdr. Within each study, we assume a block dependence structure and vary the block size to create two scenarios: weak dependence with a small block size and strong dependence with a large block size.</p><p>We set M = 10,000 and consider six different configurations of n and r, as listed in Table 2a. For a given n, there are $2^n$ combinations of base hypotheses. In generating different configurations of the truth, we use two parameters to control the probability of each combination: $\pi_{00}$ is the probability of the global null combination and $\pi_1$ is the probability of the combinations not belonging to $H^{r/n}_{0j}$. We set $\pi_1 = 0.01$ and consider two values for $\pi_{00}$, 0.8 and 0.98, to mimic the signal sparsity in gene expression and genetic regulation studies. All PC null combinations except for the global null have equal probabilities adding up to $1 - \pi_{00} - \pi_1$. All non-null PC combinations also have equal probabilities.</p><p>We assume that p-values belonging to different studies are independent and, within one study, the correlation matrix of the M Z-values is $I_{b \times b} \otimes \Sigma_\rho$ where $\otimes$ is the Kronecker product. The covariance block $\Sigma_\rho \in \mathbb{R}^{M/b \times M/b}$ has 1s on the diagonal and common value $\rho = 0.5$ off the diagonal. We set the number of blocks b = 100 for weak dependence and b = 10 for strong dependence, which should cover the spectrum of what is typically expected in genomics. 
When the base hypothesis is non-null, we sample the mean of its Z-value uniformly and independently from $I = \{\pm\mu_1, \pm\mu_2, \pm\mu_3, \pm\mu_4\}$ where the four signal levels $\{\mu_1, \mu_2, \mu_3, \mu_4\}$ correspond to detection power of 0.02, 0.2, 0.5 and 0.95 respectively.</p><p>In the analysis, we target controlling PFER at the nominal level $\alpha = 1$, FDR at the nominal level $\alpha = 0.2$, and Bayes FDR at the same level $\alpha = 0.2$ for repfdr. Bayes FDR is the posterior probability that a hypothesis is null given that its test statistic falls into the rejection region, and it has been shown to be similar to the frequentist FDR under independence <ref type="bibr">[12]</ref>. For PFER control, we compare four methods: AdaFilter Bonferroni and three forms of the "direct approach". For FDR control, we compare six methods: AdaFilter BH, AdaFilter BH with the inflation factor $C(M) = \sum_{j=1}^{M} 1/j \approx \log M$, repfdr and the three "direct approaches". For each parameter configuration, we run B = 100 random experiments and calculate the average power, number of false discoveries and false discovery proportion of each procedure.</p><p>Table <ref type="table">2b</ref> shows the average PFER and recall over the six combinations of n and r for each setting of b and $\pi_{00}$. More detailed results for each n and r separately are shown in Figures <ref type="figure">S1-S2</ref>. All methods that target PFER successfully control it at the nominal level, but the direct approaches are much more conservative, especially when both n and r are large. The gain in power is more pronounced when $\pi_{00}$ is higher, which is expected in many genetics applications.</p><p>Table <ref type="table">2c</ref> shows the average FDR and recall over the six combinations of n and r for each setting of b and $\pi_{00}$. More detailed results for each n and r separately are shown in Figures <ref type="figure">S3</ref>-S4. 
AdaFilter BH, even when not inflated, and the "direct approach" control FDR at the nominal level. However, as with PFER control, the "direct approach" procedures are too conservative. The inflated AdaFilter BH has lower power than AdaFilter BH, but its power still exceeds that of the "direct approach", especially for large r. The repfdr method fails to consistently control FDR, especially when n is large; we believe that this is due to the large number of parameters that need to be estimated in these scenarios. In the cases where repfdr does control FDR, its power is comparable to that of AdaFilter when $\pi_{00} = 0.8$ but lower when $\pi_{00} = 0.98$, and it decreases further as dependence increases.</p><p>Finally, we point out that our simulations only compare different methods for a pre-defined r value. As discussed in Section 4.4.4, AdaFilter needs another layer of multiplicity adjustment if multiple r values are tested simultaneously. In practice, if one aims to test multiple replicability levels or is interested in obtaining a lower bound on r for each hypothesis <ref type="bibr">[26]</ref>, the "direct approach" may still be preferred as it automatically controls the error rates for multiple r values simultaneously. 
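</p><table>
<row role="label"><cell>Method</cell><cell>PFER</cell><cell>Recall(%)</cell><cell>PFER</cell><cell>Recall(%)</cell><cell>PFER</cell><cell>Recall(%)</cell><cell>PFER</cell><cell>Recall(%)</cell></row>
<row><cell>Bon-$P^B_{r/n}$</cell><cell>0.04</cell><cell>14.72</cell><cell>0.05</cell><cell>14.87</cell><cell>0.00</cell><cell>14.72</cell><cell>0.00</cell><cell>14.83</cell></row>
<row><cell>Bon-$P^F_{r/n}$</cell><cell>0.05</cell><cell>19.30</cell><cell>0.06</cell><cell>19.50</cell><cell>0.01</cell><cell>19.18</cell><cell>0.00</cell><cell>19.38</cell></row>
<row><cell>Bon-$P^S_{r/n}$</cell><cell>0.04</cell><cell>14.80</cell><cell>0.05</cell><cell>14.93</cell><cell>0.00</cell><cell>14.78</cell><cell>0.00</cell><cell>14.88</cell></row>
<row><cell>AdaFilter Bonferroni</cell><cell>0.73</cell><cell>28.71</cell><cell>0.76</cell><cell>28.93</cell><cell>0.29</cell><cell>38.10</cell><cell>0.21</cell><cell>38.25</cell></row>
</table><table>
<head>(c) Comparison of methods targeting a nominal FDR of $\alpha = 0.2$</head>
<row role="label"><cell></cell><cell cols="4">$\pi_{00} = 0.8$</cell><cell cols="4">$\pi_{00} = 0.98$</cell></row>
<row role="label"><cell></cell><cell cols="2">b = 100</cell><cell cols="2">b = 10</cell><cell cols="2">b = 100</cell><cell cols="2">b = 10</cell></row>
<row role="label"><cell>Method</cell><cell>FDR</cell><cell>Recall(%)</cell><cell>FDR</cell><cell>Recall(%)</cell><cell>FDR</cell><cell>Recall(%)</cell><cell>FDR</cell><cell>Recall(%)</cell></row>
<row><cell>BH-$P^B_{r/n}$</cell><cell>0.01</cell><cell>29.50</cell><cell>0.01</cell><cell>29.55</cell><cell>0.00</cell><cell>29.04</cell><cell>0.00</cell><cell>29.10</cell></row>
<row><cell>BH-$P^F_{r/n}$</cell><cell>0.01</cell><cell>32.94</cell><cell>0.01</cell><cell>32.80</cell><cell>0.00</cell><cell>32.68</cell><cell>0.00</cell><cell>32.74</cell></row>
<row><cell>BH-$P^S_{r/n}$</cell><cell>0.01</cell><cell>29.68</cell><cell>0.01</cell><cell>29.70</cell><cell>0.00</cell><cell>29.16</cell><cell>0.00</cell><cell>29.28</cell></row>
<row><cell>repfdr</cell><cell>0.33</cell><cell>59.39</cell><cell>0.29</cell><cell>23.53</cell><cell>0.14</cell><cell>24.31</cell><cell>0.13</cell><cell>11.56</cell></row>
<row><cell>AdaFilter BH</cell><cell>0.15</cell><cell>58.64</cell><cell>0.14</cell><cell>58.71</cell><cell>0.06</cell><cell>71.27</cell><cell>0.06</cell><cell>71.49</cell></row>
<row><cell>Inflated AdaFilter BH</cell><cell>0.02</cell><cell>34.39</cell><cell>0.01</cell><cell>34.22</cell><cell>0.01</cell><cell>45.70</cell><cell>0.01</cell><cell>46.17</cell></row>
</table><p>6. Case studies. We apply AdaFilter to analyze two datasets: one investigates the replication of differential gene expression results in four microarray experiments on Duchenne muscular dystrophy, and the other focuses on identifying marker genes of one T cell subtype from lung cancer tumors using single-cell RNA-sequencing (scRNA-seq) data. In Section S2, we also discuss the application of AdaFilter BH to a third dataset, testing for consistently significant signals across different metabolic super-pathways within one study.</p><p>6.1. Duchenne Muscular Dystrophy microarray studies. Following <ref type="bibr">[28]</ref>, we investigate four independent Duchenne muscular dystrophy (DMD)-related microarray datasets in the Gene Expression Omnibus (GEO) database (GDS 214, GDS 563, GDS 1956 and GDS 3027, Table <ref type="table">3a</ref>), to understand the signature genes for the disease. The goal here is to find differentially expressed marker genes for DMD that show replicating signals in multiple datasets. For each experiment, the data are preprocessed using RMA <ref type="bibr">[25]</ref>, a standard preprocessing tool for microarrays. 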
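Within each study, we find genes that are differentially expressed between the disease and healthy groups using the popular software Limma <ref type="bibr">[40]</ref>, adjusting for covariates such as batch and patients' age and gender when they are available.</p><p>The four datasets are from three different microarray platforms where different probe-sets are used. In order to compare across platforms, we map probe-sets to common gene names. When multiple probe-sets map to the same gene, a Bonferroni rule is applied to combine the p-values of these probe-sets into a single p-value for the gene. There are only M = 1871 genes present in all four studies, with M = 9848 genes shared in at least 3 studies and M = 13912 genes in at least two studies. As discussed in Section 4.4.2, AdaFilter can work with varying $n_j$ and thus allows missing entries in the p-value matrix.</p><table>
<head>(a) GEO datasets information</head>
<row role="label"><cell>GEO ID</cell><cell>Platform</cell><cell>Description</cell><cell>Source</cell></row>
<row><cell>GDS 214</cell><cell>custom Affymetrix</cell><cell>4 healthy, 26 DMD</cell><cell>Muscle</cell></row>
<row><cell>GDS 563</cell><cell>Affymetrix U95A</cell><cell>11 healthy, 12 DMD</cell><cell>Quadriceps Muscle</cell></row>
<row><cell>GDS 1956</cell><cell>Affymetrix U133A</cell><cell>18 healthy, 10 DMD</cell><cell>Muscle</cell></row>
<row><cell>GDS 3027</cell><cell>Affymetrix U133A</cell><cell>14 healthy, 23 DMD</cell><cell>Quadriceps Muscle</cell></row>
</table><table>
<head>(b) AdaFilter BH rejections</head>
<row role="label"><cell>r</cell><cell>M</cell><cell>Rejected</cell></row>
<row><cell>2</cell><cell>13912</cell><cell>494</cell></row>
<row><cell>3</cell><cell>9848</cell><cell>142</cell></row>
<row><cell>4</cell><cell>1871</cell><cell>32</cell></row>
</table><table>
<head>(c) Known marker genes detected by AdaFilter at r = 4</head>
<row role="label"><cell>Gene Symbol</cell><cell>GDS 214</cell><cell>GDS 563</cell><cell>GDS 1956</cell><cell>GDS 3027</cell></row>
<row><cell>MYH3</cell><cell>5.47e-14</cell><cell>2.18e-69</cell><cell>3.31e-07</cell><cell>2.49e-20</cell></row>
<row><cell>MYH8</cell><cell>5.74e-06</cell><cell>9.09e-11</cell><cell>2.58e-03</cell><cell>5.16e-33</cell></row>
<row><cell>MYL5</cell><cell>8.97e-04</cell><cell>3.06e-06</cell><cell>1.87e-03</cell><cell>6.63e-08</cell></row>
<row><cell>MYL4</cell><cell>1.48e-06</cell><cell>7.94e-08</cell><cell>1.21e-02</cell><cell>2.66e-08</cell></row>
</table><p>TABLE 3. Replicability analysis for DMD microarrays</p><p>The application of AdaFilter BH at level $\alpha = 0.05$ leads to the discovery of many consistently differentially expressed genes at r = 2, 3, 4 (Table <ref type="table">3b</ref>). Specifically, at r = 4, AdaFilter BH finds 32 significant genes (Table <ref type="table">S2</ref>). 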
By contrast, a BH adjustment on the Fisher combined PC p-values ($P^F_{r/n,j}$) only detects two genes (MYH3 and S100A4), and repfdr reports no significant genes, as M = 1871 is too small for it to estimate the distribution of the p-values. Table <ref type="table">3c</ref> shows four of the 32 genes that are known to play important roles in muscle contraction (Table <ref type="table">S1</ref>). Notice that, apart from MYH3, none of the other three markers has a small enough p-value in the third study (GDS 1956, which is the least powerful study) to be detected when BH is applied to that study alone at a nominal FDR level of 0.05. However, AdaFilter can compensate for this deficiency by leveraging the overall similarity of the results in this study to those in the other studies.</p><p>6.2. scRNA-seq of T cells in lung cancer tumors. Understanding T cell heterogeneity in tumors provides key information for cancer immunotherapies, and recent single-cell RNA-sequencing (scRNA-seq) technology enables measurement of gene expression levels at single-cell resolution. In <ref type="bibr">[19]</ref>, the authors sequenced tumor T cells from 14 treatment-na&#239;ve non-small-cell lung cancer patients; one main finding is the discovery of a new subtype of the CD4+ regulatory T cells (Tregs), named the suppressive tumor-resident Tregs (CD4-C9-CTLA4), that is different from the normal Tregs (CD4-C8-FOXP3). We download data from the GEO database (GSE99254), where cell type labels are also provided.</p><p>In order to characterize the new cell type CD4-C9-CTLA4, one needs to identify a list of reliable marker genes that are consistently highly expressed in CD4-C9-CTLA4 across multiple patients. Thus we apply AdaFilter treating each patient as a "study". For each patient, we obtain a p-value for each gene, testing whether its expression is higher in CD4-C9-CTLA4 than in CD4-C8-FOXP3. 
These one-sided base p-values are calculated using the Wilcoxon rank-sum test, a standard test for analyzing scRNA-seq data. Two patients who have fewer than 10 Treg cells in either of the two groups are excluded from the analysis. In summary, we obtain a p-value matrix for 23459 genes and n = 12 patients.</p><p>We vary the replicability level r, and Figure <ref type="figure">2a</ref> compares the number of genes detected using different methods. For large r ($r \ge 8$), AdaFilter is more powerful than the "direct approach" with Fisher's PC p-values. However, it is less powerful when r is relatively small, as the power gain from using Fisher's combination to construct PC p-values may exceed the power gain from AdaFilter, whose selection p-values come from Bonferroni's combination. The other two forms of the "direct approach" show limited power for all r, and repfdr runs out of memory for $r \ge 6$ even with 300G of RAM. In Table <ref type="table">S3</ref>, we list the 20 genes that are detected at r = 10, most of which are known to be linked to immune response in tumors. To further show the benefit of requiring replicability in marker gene selection, we compare, for a list of genes, their base p-values per patient, their standard BH adjusted merged p-values and their AdaFilter BH adjusted p-values at r = 4 (Figure <ref type="figure">2b</ref>). All 10 genes in Figure <ref type="figure">2b</ref> would be selected in the original paper as their adjusted merged p-values are far less than 0.05. However, the top 5 genes have only one or two patients whose base p-values are less than 0.01. Intuitively, they are less convincing markers as there is no replicability across patients. While the merged p-values cannot distinguish the more convincing markers, the AdaFilter BH adjusted p-values separate them easily.</p></div>
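<div xmlns="http://www.tei-c.org/ns/1.0"><p>The within-study dependence used in the simulations of Section 5 can be sketched in a few lines of Python. The sketch below draws M Z-values whose correlation matrix is $I_{b \times b} \otimes \Sigma_\rho$ by adding one shared Gaussian factor to each block of size M/b. This one-factor construction, the function names, the optional means argument and the conversion to two-sided p-values are assumptions chosen for illustration; they are not taken from the authors' simulation code.</p><p><![CDATA[
```python
import math
import random
from typing import List, Optional, Sequence


def block_equicorrelated_z(M: int, b: int, rho: float,
                           means: Optional[Sequence[float]] = None,
                           rng: Optional[random.Random] = None) -> List[float]:
    """Draw M unit-variance Z-values in b independent blocks of size M/b.

    Within a block, every pair of Z-values has correlation rho.
    `means` (hypothetical argument) shifts each Z-value; default is zero.
    """
    if M % b != 0:
        raise ValueError("M must be divisible by the number of blocks b")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1] for this construction")
    if means is not None and len(means) != M:
        raise ValueError("means must have length M")
    if rng is None:
        rng = random.Random()

    size = M // b
    z = []
    for _ in range(b):
        shared = rng.gauss(0.0, 1.0)  # factor common to the whole block
        for _ in range(size):
            noise = rng.gauss(0.0, 1.0)
            # Variance is rho + (1 - rho) = 1; within-block covariance is rho.
            z.append(math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise)
    if means is not None:
        z = [zi + mi for zi, mi in zip(z, means)]
    return z


def two_sided_p(z: float) -> float:
    """Two-sided normal p-value 2 * (1 - Phi(|z|)), computed via erfc."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Strong-dependence setting from the text: M = 10,000, b = 10, rho = 0.5.
z_null = block_equicorrelated_z(10_000, 10, 0.5, rng=random.Random(0))
p_null = [two_sided_p(z) for z in z_null]
```
]]></p><p>Using a shared factor avoids forming the full M/b by M/b covariance block, so the cost is linear in M. Setting b = 100 instead of b = 10 gives the weak-dependence setting.</p></div>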
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion.</head><p>Testing PC hypotheses provides a framework to detect consistently significant signals across multiple studies, leading to an explicit assessment of the replicability of scientific findings. We introduced AdaFilter, a multiple testing procedure that greatly increases the power in simultaneous testing of PC hypotheses over existing methods. AdaFilter implicitly learns and utilizes the overall similarity of results across studies and exhibits a lack of complete monotonicity.</p><p>We proved that AdaFilter procedures control FWER and FDR under independence of all p-values for a given finite number of hypotheses, and further showed that AdaFilter BH asymptotically controls FDR under weak dependence within each study. In our simulations, we demonstrated that both AdaFilter Bonferroni and AdaFilter BH are robust in practice to the dependence of p-values within each study, even when such dependence is not weak. On the other hand, the validity of AdaFilter does require independence of the base p-values across different studies, as Lemma 4.1 can easily be violated when these base p-values are dependent.</p><p>We applied AdaFilter to three case studies, encompassing gene expression and genetic association. Other types of applications include eQTL studies and multi-ethnic GWAS (such as the new Population Architecture using Genomics and Epidemiology (PAGE) study), where it is of great interest to understand which genetic regulations are shared and which are tissue- or population-specific. PC tests can also be useful in even broader contexts. According to Hume <ref type="bibr">[24]</ref>, "constant conjunction" is a characteristic of causal effects. If some hypotheses are rejected repeatedly under various distinct settings, that can be supportive evidence for a causal mechanism instead of simple associations. These directions can be further investigated in future research.</p></div></body>
		</text>
</TEI>
