<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Understanding the Eluder Dimension</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10439496</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in neural information processing systems</title>
<idno type="issn">1049-5258</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Gene Li</author><author>Pritish Kamath</author><author>Dylan J. Foster</author><author>Nati Srebro</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We provide new insights on eluder dimension, a complexity measure that has been extensively used to bound the regret of algorithms for online bandits and reinforcement learning with function approximation. First, we study the relationship between the eluder dimension for a function class and a generalized notion of rank, defined for any monotone "activation" σ : R → R, which corresponds to the minimal dimension required to represent the class as a generalized linear model. It is known that when σ has derivatives bounded away from 0, σ-rank gives rise to an upper bound on eluder dimension for any function class; we show however that eluder dimension can be exponentially smaller than σ-rank. We also show that the condition on the derivative is necessary; namely, when σ is the relu activation, the eluder dimension can be exponentially larger than σ-rank. For binary-valued function classes, we obtain a characterization of the eluder dimension in terms of star number and threshold dimension, quantities which are relevant in active learning and online learning respectively.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Russo and Van Roy <ref type="bibr">[37]</ref> introduced the notion of eluder dimension for a function class and used it to analyze algorithms (based on the Upper Confidence Bound (UCB) and Thompson Sampling paradigms) for the multi-armed bandit problem with function approximation. Since then, eluder dimension has been extensively used to construct and analyze the regret of algorithms for contextual bandits and reinforcement learning (RL) with function approximation [see, e.g., <ref type="bibr">46,</ref><ref type="bibr">36,</ref><ref type="bibr">44,</ref><ref type="bibr">6,</ref><ref type="bibr">14,</ref><ref type="bibr">19,</ref><ref type="bibr">12,</ref><ref type="bibr">16,</ref><ref type="bibr">26,</ref><ref type="bibr">24]</ref>. Even though the eluder dimension has become a central technique for reinforcement learning theory, little is known about when exactly it is bounded. This paper makes progress toward filling this gap in our knowledge.</p><p>Russo and Van Roy <ref type="bibr">[37]</ref> established upper bounds on eluder dimension for (i) function classes for which inputs have finite cardinality (the "tabular" setting), (ii) linear functions over R d of bounded norm, and (iii) generalized linear functions over R d of bounded norm, with any activation that has derivatives bounded away from 0. Apart from these function classes (and those that can be embedded into these), understanding of eluder dimension has been limited. Indeed, one might wonder whether a function class has "small" eluder dimension only if it can be realized as a class of (generalized) linear functions! This leads us to our first motivating question: Question 1. Are all function classes with small eluder dimension essentially generalized linear models?</p><p>Answering this question has substantial ramifications on the scope of prior work. 
An answer of "yes" would imply that the results in the aforementioned works, which give regret guarantees in terms of eluder dimension, do not go beyond already-established regret guarantees for generalized linear bandit or RL settings [see, e.g., <ref type="bibr">17,</ref><ref type="bibr">32,</ref><ref type="bibr">45,</ref><ref type="bibr">31</ref>]. An answer of "no" can be construed as a positive result for RL theory, as it would indicate that existing (and future) results which use the eluder dimension apply to a richer set of function classes than generalized linear models.</p><p>To answer Question 1, we first formally define what it means for a function class to be written as a generalized linear model (GLM). Informally, for an activation &#963; : R &#8594; R and a function class F &#8838; (X &#8594; R), we define the &#963;-rank to be the smallest dimension d needed to express every function in F as a generalized linear function in R d with activation &#963; (see <ref type="bibr">Definition 3</ref>). Intuitively, the &#963;-rank captures the best possible upper bound on eluder dimension that the results from Russo and Van Roy <ref type="bibr">[37]</ref> can give for a given F by treating it as a GLM with activation &#963;. We ask how the eluder dimension of any function class relates to its &#963;-rank for various activations &#963;. We show that the answer to Question 1 is indeed "no", i.e., there exists a function class with eluder dimension d whose &#963;-rank for any monotone &#963; is at least exp(&#8486;(d)). 
Thus, while Russo and Van Roy <ref type="bibr">[37]</ref> show that the set of function classes with small eluder dimension is a superset of the set of GLM function classes, we (roughly speaking) show that the set of function classes with small eluder dimension is strictly larger than the set of GLM function classes.</p><p>We also prove that the requirement from Russo and Van Roy <ref type="bibr">[37]</ref> that the derivative of the activation is bounded away from 0 is necessary in order to bound the eluder dimension of GLMs. The upper bound in the paper <ref type="bibr">[37]</ref> becomes vacuous when the activation function has zero derivative; we show a lower bound which indicates this requirement cannot be dropped. Namely, when &#963; is the relu activation, we show that eluder dimension can be exponentially larger than &#963;-rank.</p><p>In a second line of inquiry, we study a combinatorial version of eluder dimension. The original definition of Russo and Van Roy <ref type="bibr">[37]</ref> is stated for real-valued function classes, but one can specialize the definition to binary-valued function classes, leading to a so-called combinatorial eluder dimension. Thus, our second motivating question is: Question 2. Can we bound the combinatorial eluder dimension, perhaps in terms of more familiar learning-theoretic quantities?</p><p>One might wonder: if the combinatorial eluder dimension is just a special case of the scale-sensitive version, why study it at all? Our reasons are threefold.</p><p>&#8226; The first and most immediate reason is that we are able to show new characterizations of eluder dimension once we move to the combinatorial definition. 
We elucidate a fundamental connection between eluder dimension and two other well-studied learning-theoretic quantities: (1) star number, a quantity that characterizes the label complexity of pool-based active learning <ref type="bibr">[21]</ref>, and (2) threshold dimension, a quantity that characterizes the regret of online learning <ref type="bibr">[3]</ref>. <ref type="foot">1</ref> We believe that this new result may help us better understand how different learning tasks relate to each other.</p><p>&#8226; The second reason is that the combinatorial eluder dimension (or a multi-class variant of it) has already been studied for policy-based learning for contextual bandits and RL [see, e.g., <ref type="bibr">19,</ref><ref type="bibr">34]</ref>. Thus, understanding the combinatorial eluder dimension may shed light on the challenges of policy-based RL.</p><p>&#8226; Our last reason has a more philosophical bent. Historically, the discovery of VC dimension placed statistical learning theory on solid footing. Specifically, the fundamental theorem of statistical learning allows us to precisely characterize the statistical complexity of PAC learnability in terms of the combinatorial VC dimension. The insights from understanding the role of VC dimension in classification have led researchers to develop scale-sensitive complexity measures such as fat-shattering dimension to provide sharper guarantees on learning. While we do not claim that the eluder dimension is fundamental to online RL and bandit settings, we believe that a better combinatorial understanding can lead to a better understanding of online decision making. In some sense, we are "working backwards" from the original scale-sensitive definition of eluder dimension to understand its combinatorial properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Main contributions</head><p>In this work, we provide several results which show when eluder dimension can be bounded (or is unbounded). Our results can be separated into two categories: (1) those pertaining to the (scale-sensitive) eluder dimension (as originally defined by Russo and Van Roy <ref type="bibr">[37]</ref>); (2) those pertaining to the combinatorial eluder dimension, defined for binary-valued function classes.</p><p>First, we investigate the relationship between eluder dimension and our notion of generalized rank that captures the realizability of any function class as a GLM. In Section 2, we formally introduce the eluder dimension (as well as a related quantity, the scale-sensitive star number). In Section 3, we formalize the notion of "generalized rank". In Section 4, we provide several results.</p><p>1. We show that eluder dimension can be exponentially smaller than &#963;-rank for any monotone activation &#963;, not just those with derivatives bounded away from 0 (Theorem 6). 2. We show that the condition on the derivative being bounded away from 0 is necessary for &#963;-rank to be an upper bound on eluder dimension. Namely, when &#963; is the relu activation, we show that eluder dimension can be exponentially larger than &#963;-rank (Theorem 7).<ref type="foot">2</ref> In Section 5, we specialize the eluder dimension to the binary-valued setting and present our results on the combinatorial eluder dimension.</p><p>1. We show that eluder dimension is finite if and only if both star number and threshold dimension are finite. Specifically, in Theorem 8 we show the following:</p><p>Furthermore, we demonstrate that both inequalities can be tight (see Theorem 9 and discussion above it). 2. 
We investigate the comparison between eluder dimension and sign-rk and prove stronger separations than in the scale-sensitive case; namely we show examples where one quantity is finite while the other is infinite (Theorem 10).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Related work</head><p>In this section, we highlight several related works.</p><p>Bounds on eluder dimension. Several papers provide bounds on eluder dimension for various function classes. The original bounds on tabular, linear, and generalized linear functions were proved by Russo and Van Roy <ref type="bibr">[37]</ref> (and later generalized by Osband and Van Roy <ref type="bibr">[36]</ref>). Mou et al. <ref type="bibr">[34]</ref> provide several bounds for the combinatorial eluder dimension, mostly focusing on linear function classes. When the function class lies in an RKHS, Huang et al. <ref type="bibr">[25]</ref> show that the eluder dimension is equivalent to the notion of information gain <ref type="bibr">[41,</ref><ref type="bibr">38]</ref>, which can be seen as an infinite-dimensional generalization of the fact that the eluder dimension for linear functions over R d is &#920;(d). In concurrent work, Dong et al. <ref type="bibr">[12]</ref> also prove an exponential lower bound on the eluder dimension for ReLU networks.</p><p>Applications of eluder dimension. The main application of the eluder dimension is to design algorithms and prove regret guarantees for contextual bandits and reinforcement learning. A few examples include the papers <ref type="bibr">[46,</ref><ref type="bibr">36,</ref><ref type="bibr">44,</ref><ref type="bibr">6,</ref><ref type="bibr">14,</ref><ref type="bibr">19,</ref><ref type="bibr">28,</ref><ref type="bibr">16,</ref><ref type="bibr">26,</ref><ref type="bibr">24,</ref><ref type="bibr">34]</ref>. While the majority of papers prove upper bounds via eluder dimension, Foster et al. <ref type="bibr">[19]</ref> provided lower bounds for contextual bandits in terms of eluder dimension when one hopes for instance-dependent regret bounds. 
In addition, several works observe that eluder dimension sometimes does not characterize the sample complexity, as the guarantee via eluder dimension can be too loose <ref type="bibr">[23,</ref><ref type="bibr">20]</ref>. Beyond the online RL setting, eluder dimension has been applied to risk-sensitive RL <ref type="bibr">[15]</ref>, Markov games <ref type="bibr">[24,</ref><ref type="bibr">29]</ref>, representation learning <ref type="bibr">[47]</ref>, and active online learning <ref type="bibr">[10]</ref>.</p><p>Other complexity measures for RL. We also touch upon other complexity measures which have been suggested for RL. One category is the Bellman/Witness rank approach [see, e.g., <ref type="bibr">27,</ref><ref type="bibr">11,</ref><ref type="bibr">42]</ref>, which is generalized to bilinear classes <ref type="bibr">[13]</ref>. These complexity measures capture an interplay between the MDP dynamics and the function approximator class; in contrast, eluder dimension is purely a property of the function approximator class and can be stated (and studied) without referring to an online RL problem. Jin et al. <ref type="bibr">[28]</ref> define a Bellman-Eluder dimension which captures function classes which have small Bellman rank or eluder dimension. Lastly, Foster et al. <ref type="bibr">[20]</ref> propose the Decision-Estimation Coefficient and prove that it is necessary and sufficient for sample-efficient interactive learning.</p><p>Notions of rank. The notion of rank we propose is a generalization of the classical notion of sign rank, also called dimension complexity. Sign rank has been studied extensively in combinatorics, learning theory, and communication complexity [see, e.g., 1, 18, 5, 2, and references therein]. The norm requirements in our definition of &#963;-rank are related to the notion of margin complexity <ref type="bibr">[5,</ref><ref type="bibr">7,</ref><ref type="bibr">30]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Eluder dimension and star number</head><p>Eluder dimension is a "sequential" notion of complexity for function classes, originally defined by Russo and Van Roy <ref type="bibr">[37]</ref>. Informally speaking, it characterizes the longest sequence of adversarially chosen points one must observe in order to accurately estimate the function value at any other point. We consider a variant of the original definition, proposed by Foster et al. <ref type="bibr">[19]</ref>, that is never larger and is sufficient to analyze all the applications of eluder dimension in the literature.</p><p>Definition 1. For any function class F &#8838; (X &#8594; R), f &#8712; F, and scale &#949; &#8805; 0, the exact eluder dimension Edim f (F, &#949;) is the largest m such that there exists (x 1 , f 1 ), . . . , (x m , f m ) &#8712; X &#215; F satisfying for all i &#8712; [m]: |f i (x i ) &#8722; f (x i )| &gt; &#949;, and &#931; j&lt;i (f i (x j ) &#8722; f (x j )) 2 &#8804; &#949; 2 .</p><p>Then for all &#949; &gt; 0: the eluder dimension is Edim f (F, &#949;) = sup &#949;&#8242;&#8805;&#949; Edim f (F, &#949;&#8242;), and Edim(F, &#949;) := sup f &#8712;F Edim f (F, &#949;).</p><p>This definition is never larger than the original definition of Russo and Van Roy <ref type="bibr">[37]</ref>, which asks for a witnessing pair of functions f i , f i &#8242; &#8712; F (the above restricts f i &#8242; = f ). Hence, all lower bounds on our variant of eluder dimension immediately apply to the original definition. Moreover, all upper bounds on eluder dimension in this paper can also be shown to hold for the other definition (unless stated otherwise).</p><p>We also consider the closely related notion of star number defined by Foster et al. <ref type="bibr">[19]</ref>, which generalizes a combinatorial parameter introduced in the active learning literature by Hanneke and Yang <ref type="bibr">[21]</ref> (we will denote it as Sdim for consistency). We study the combinatorial star number in more detail in Section 5. The only difference between the definitions of eluder dimension and star number is that the constraint over j &lt; i is replaced by one over j &#8800; i, which makes the star number a "non-sequential" notion of complexity.</p><p>Definition 2. 
For any function class F &#8838; (X &#8594; R), f &#8712; F, and scale &#949; &#8805; 0, the exact star number Sdim f (F, &#949;) is the largest m such that there exists (x 1 , f 1 ), . . . , (x m , f m ) &#8712; X &#215; F satisfying for all i &#8712; [m]: |f i (x i ) &#8722; f (x i )| &gt; &#949;, and &#931; j&#8800;i (f i (x j ) &#8722; f (x j )) 2 &#8804; &#949; 2 .</p><p>Then for all &#949; &gt; 0: the star number is Sdim f (F, &#949;) = sup &#949;&#8242;&#8805;&#949; Sdim f (F, &#949;&#8242;), and Sdim(F, &#949;) := sup f &#8712;F Sdim f (F, &#949;).</p><p>It immediately follows from these definitions that the star number is never larger than eluder dimension. On the other hand, the star number can be arbitrarily smaller than eluder dimension.</p><p>Proposition 1. For all F, f * &#8712; F and scale &#949; &#8805; 0, it holds that Sdim f * (F, &#949;) &#8804; Edim f * (F, &#949;).<ref type="foot">foot_3</ref> </p><p>Proposition 2 (simplified from Foster et al. <ref type="bibr">[19,</ref><ref type="bibr">Prop 2.3]</ref>). For the class of threshold functions given as F := {f 1 , . . . , f n+1 }, where f i (x) := 1 {x &#8805; i}, and for f = f n+1 , it holds for all &#949; &lt; 1 that Sdim f (F, &#949;) = 2 and Edim f (F, &#949;) = n.</p></div>
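The gap in Proposition 2 can be observed concretely by brute force. The sketch below (our own illustration; the helper names and the choice n = 4 are not from the paper) computes the exact combinatorial versions of Definitions 1 and 2 for the threshold class, taking the supremum over the base function:

```python
from itertools import product, combinations

def thresholds(n):
    """F = {f_1, ..., f_{n+1}} on X = {1, ..., n}, with f_i(x) = 1{x >= i}.
    Each function is stored as the tuple of its values on X."""
    return [tuple(1 if x >= i else 0 for x in range(1, n + 1))
            for i in range(1, n + 2)]

def eluder_dim(F, f):
    """Longest sequence (x_1, g_1), ..., (x_m, g_m) with g_i(x_i) != f(x_i)
    and g_i(x_j) = f(x_j) for all j < i (combinatorial Definition 1)."""
    def extend(agree):  # `agree` = set of points earlier witnesses must match
        best = 0
        for x, g in product(range(len(f)), F):
            if g[x] != f[x] and all(g[y] == f[y] for y in agree):
                best = max(best, 1 + extend(agree | {x}))
        return best
    return extend(frozenset())

def star_number(F, f):
    """Largest set of points such that each one is flipped by some g in F
    that agrees with f on every other chosen point (combinatorial Def. 2)."""
    best = 0
    for m in range(1, len(f) + 1):
        for S in combinations(range(len(f)), m):
            if all(any(g[x] != f[x] and
                       all(g[y] == f[y] for y in S if y != x) for g in F)
                   for x in S):
                best = m
    return best

n = 4
F = thresholds(n)
edim = max(eluder_dim(F, f) for f in F)   # sup over the base function f
sdim = max(star_number(F, f) for f in F)
print(edim, sdim)  # -> 4 2: eluder dimension grows with n, star number stays 2
```

The eluder witness is the increasing sequence x_i = i with g_i = f_i, while monotonicity of thresholds caps the star number at two flippable points (one on each side of the base threshold).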
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Generalized rank</head><p>Dimension complexity has been studied extensively in combinatorics, learning theory, and communication complexity [see, e.g., <ref type="bibr">1,</ref><ref type="bibr">18,</ref><ref type="bibr">5,</ref><ref type="bibr">2]</ref>. The classical notion of dimension complexity, also known as sign rank, corresponds to the smallest dimension required to embed the input space such that all hypotheses in the function class under consideration are realizable as halfspaces. We consider a generalized notion of rank that is specified for any particular activation &#963; : R &#8594; R, and captures the smallest dimension required to represent the function class as a GLM when &#963; is the activation. In what follows, we let B d (R) denote the Euclidean ball of radius R in R d .</p><p>Definition 3. For any &#963; : R &#8594; R, the &#963;-rank of a function class F &#8838; (X &#8594; R) at scale R &gt; 0, denoted as &#963;-rk(F, R), is the smallest dimension d for which there exist mappings &#966; : X &#8594; B d (1) and w : F &#8594; B d (R) such that f (x) = &#963;(&#9001;w(f ), &#966;(x)&#9002;) for all f &#8712; F and x &#8712; X , or &#8734; if no such d exists. For a collection of activation functions &#931;, we define &#931;-rk(F, R) := inf &#963;&#8712;&#931; &#963;-rk(F, R).</p><p>Examples. We present some examples of &#931;-rk that motivate our definition above.</p><p>1. Threshold activation. sign(z) yields the classic notion of sign-rank (equivalent to dimension complexity, as already mentioned). In this case, the scale R is irrelevant, so we denote sign-rk(F) := sign-rk(F, R) for any R. Note that this quantity is meaningful only for F &#8838; (X &#8594; {-1, 1}).</p><p>2. Identity activation. For id(z) := z, id-rk(F, R) is the smallest dimension needed to represent each f &#8712; F as a (norm-bounded) linear function. We abbreviate rk := id-rk, as this corresponds to the standard notion of rank of the matrix (f (x)) x,f (albeit with the additional norm constraint).</p><p>3. Monotone activations. 
For L &#8805; &#181; &#8805; 0, M L &#181; consists of all activations &#963; such that for all z &lt; z&#8242;, it holds that &#181; &#8804; (&#963;(z&#8242;) &#8722; &#963;(z))/(z&#8242; &#8722; z) &#8804; L (for differentiable &#963;, this is equivalent to &#181; &#8804; &#963;&#8242;(z) &#8804; L for all z &#8712; R). <ref type="foot">5</ref> An important special case is when &#181; = 0, and a particular &#963; &#8712; M 1 0 of interest is the rectified linear unit (ReLU) defined as relu(z) := max {z, 0}. For ease of notation, we denote M &#181; := M 1 &#181; . While we will always be explicit about the Lipschitz constant, note that the scale of the Lipschitz constant L (and &#181;) is interchangeable with the scale of R.</p><p>4. All activations. &#931; all consists of all activations &#963;. We mention this notion of rank only in passing, and we will not focus on it for the rest of the paper.</p><p>We present a result which relates the aforementioned quantities (proof in Appendix A). Figure <ref type="figure">1</ref> summarizes the relationships between these quantities for all F, where the dependence on R and &#949; is suppressed for clarity (see Propositions 1, 3, 4 for precise bounds); whenever an arrow M 2 &#8594; M 1 is missing, there is an example of a class F separating M 1 (F) from M 2 (F).</p><p>Proposition 3. For all F &#8838; (X &#8594; R), R &gt; 0 and &#181; &#8712; (0, 1], we have:</p><p>where the last inequality is meaningful only for F &#8838; (X &#8594; {-1, 1}). Moreover, for each inequality above, there exists a function class F which exhibits an infinite gap between the two quantities.</p></div>
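For the identity activation (Example 2), id-rk coincides, modulo the norm constraints which we ignore here, with the rank of the matrix (f (x)) x,f . A quick numerical sketch (our own construction) with the parity class that reappears in Theorem 6:

```python
import numpy as np
from itertools import product

d = 3
vecs = list(product([0, 1], repeat=d))  # doubles as X = {0,1}^d and the index set
# Parity class: f_a(x) = (-1)^{<a, x>}, one function per a in {0,1}^d.
M = np.array([[(-1) ** (sum(ai * xi for ai, xi in zip(a, x)) % 2)
               for a in vecs]
              for x in vecs])
# (f_a(x))_{x,a} is a 2^d x 2^d Hadamard-type matrix with orthogonal columns,
# so it has full rank: representing parities as plain *linear* functions
# requires dimension 2^d, even though (Theorem 6) their eluder dimension is d.
print(np.linalg.matrix_rank(M))  # -> 8, i.e. 2^d
```

This illustrates why rk can be exponentially larger than eluder dimension, foreshadowing the separation of Section 4.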
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Eluder dimension versus generalized rank</head><p>In this section, we compare eluder dimension and star number with each notion of generalized rank: rk, M &#181; -rk (for &#181; &gt; 0) and M 0 -rk. Our results are summarized in Figure <ref type="figure">1</ref>.</p><p>Eluder vs. rk and M &#181; -rk. Russo and Van Roy <ref type="bibr">[37]</ref> and Osband and Van Roy <ref type="bibr">[36]</ref> provided upper bounds on eluder dimension for linear and generalized linear function classes. For completeness, we restate this result with a slight improvement, and include the proof with precise dependence on problem parameters in Appendix B. Intuitively, the generalized linear rank allows us to capture the tightest possible upper bound that the guarantees in the papers <ref type="bibr">[37,</ref><ref type="bibr">36]</ref> can provide on eluder dimension. Proposition 4 (cf. <ref type="bibr">[37]</ref>, Prop. 6, 7; <ref type="bibr">[36]</ref>, Prop. 2, 4). For any function class F &#8838; (X &#8594; R) and &#949; &gt; 0:</p><p>This result has been used to prove upper bounds on eluder dimension of various function classes beyond GLMs; for example, the class of bounded-degree polynomials, by taking the feature map &#966;(x) to be the vector of low-degree monomials. The upper bound in Part (i) is in fact tight (up to constants) for the class of linear functions, as shown in Proposition 5 below. This trivially implies the optimality of the bound in Part (ii) up to a factor of (L/&#181;) 2 ; whether this factor is necessary is, to the best of our knowledge, open. Proposition 5 (<ref type="bibr">[33]</ref>). For any R &gt; 0, X := B d (1) and</p><p>For completeness, we include the proof in Appendix C.</p><p>Eluder vs. M 0 -rk. It turns out that eluder dimension and M 0 -rk are incomparable. That is, there exists a function class for which eluder dimension is exponentially smaller than M 0 -rk (and hence M &#181; -rk and rk by Proposition 3). 
Moreover, there exists a different function class for which eluder dimension (even the star number) is exponentially larger than relu-rk (and hence M 0 -rk).</p><p>First, we show that the eluder dimension can be exponentially smaller than M 0 -rk for the class of parities over d bits. Thus, parities over d bits exhibit an example where the eluder dimension is exponentially smaller than the best possible bound one can derive using the existing results of Proposition 4. Theorem 6. For X = {-1, 1} d and</p><p>Proof. Part (i). From Proposition 3, we have that M 0 -rk(F &#8853; , R) &#8805; sign-rk(F &#8853; ) -1 for any &#963; &#8712; M 0 . The proof is now complete by noting a well-known result that sign-rk(F &#8853; ) &#8805; 2 d/2 <ref type="bibr">[18]</ref>.</p><p>Part (ii). For any x &#8712; {-1, 1} d consider its 0-1 representation x̄ &#8712; F d 2 (representing +1 by 0 and -1 by 1). All functions in F &#8853; can be simply viewed as linear functions over F 2 . Namely, any parity function is indexed by a vector a &#8712; F d 2 , with f a (x) := (&#8722;1) &#9001;a, x̄&#9002; . Note that Edim(F &#8853; , &#949;) = 0 for all &#949; &#8805; 2. For any &#949; &lt; 2, suppose Edim f (F &#8853; , &#949;) = m, witnessed by (x 1 , f a 1 ), . . . , (x m , f a m ) &#8712; X &#215; F &#8853; and f = f a . We have</p><p>&#8226; f a i (x j ) = f a (x j ) for all j &lt; i : since &#931; j&lt;i (f a i (x j ) -f a (x j )) 2 &#8804; &#949; 2 &lt; 4, and each term is either 0 or 4, all terms must be 0. Thus, we have &#9001;a i &#8722; a, x̄&#9002; = 0 for all x̄ &#8712; F 2 -span({x̄ 1 , . . . , x̄ i&#8722;1 }). But &#9001;a i &#8722; a, x̄ i &#9002; = 1 and hence x̄ i is linearly independent of {x̄ 1 , . . . , x̄ i&#8722;1 } over F d 2 . Thus, {x̄ 1 , . . . , x̄ m } are all linearly independent over F d 2 , and hence m &#8804; d.</p><p>The bound in part (ii) of Theorem 6 was also calculated by Mou et al. <ref type="bibr">[34,</ref><ref type="bibr">Prop. 3]</ref>.</p><p>Next, we show a separation in the other direction for eluder dimension vs. M 0 -rk using the ReLU function class. 
Thus, we cannot hope to remove the requirement for the activation function &#963; to be strictly monotonically increasing in Proposition 4 (ii) for bounding the eluder dimension. Theorem 7. Let R &gt; 0 and X = B d (1). Define</p><p>It holds that</p><p>Proof. Part (i) is immediate from the definition. We show Part (ii) in the special case of R = 2; the general case follows by rescaling &#949;, since relu is homogeneous, namely, relu(az) = a &#183; relu(z) for all a &#8805; 0.</p><p>A standard sphere packing argument shows that such a set U exists with |U | &#8805; (1/2&#949;) d/2 for all &#949; &lt; 1/2 (that is, the &#949;-packing number of the unit sphere is exponential in d). We remark that while we considered the variant of eluder dimension as defined by Foster et al. <ref type="bibr">[19]</ref>, the lower bound on eluder dimension in Theorem 7 immediately holds for the notion defined by Russo and Van Roy <ref type="bibr">[37]</ref>. On the other hand, the upper bound on eluder dimension in Theorem 6 can be shown to hold even with the definition of Russo and Van Roy <ref type="bibr">[37]</ref> (by replacing every instance of a by a i ). </p><p>Figure <ref type="figure">2</ref>: Illustration of witnessing sequences of length 6 for eluder dimension, star number and threshold dimension with respect to f = 0 (we use the 0/1 representation of functions for clarity). '*' in the eluder witness sequence refers to a free value, either 0 or 1.</p></div>
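The upper bound of Theorem 6(ii) can be checked by exhaustive search for small d. The sketch below is our own illustration (it uses the 0/1 representation of parities, which only relabels function values) and searches for the longest witnessing sequence of the combinatorial eluder dimension:

```python
from itertools import product

d = 3
X = list(product([0, 1], repeat=d))
# Parity class: f_a(x) = <a, x> mod 2, one function per index a in {0,1}^d.
parities = [tuple(sum(ai * xi for ai, xi in zip(a, x)) % 2 for x in X)
            for a in X]

def eluder_dim(F, f):
    """Longest (x_1, g_1), ..., (x_m, g_m) with g_i(x_i) != f(x_i) and g_i
    agreeing with f on all earlier points (combinatorial Definition 1)."""
    memo = {}
    def extend(agree):
        if agree not in memo:
            best = 0
            for i, g in product(range(len(f)), F):
                if g[i] != f[i] and all(g[j] == f[j] for j in agree):
                    best = max(best, 1 + extend(agree | {i}))
            memo[agree] = best
        return memo[agree]
    return extend(frozenset())

edim_parity = max(eluder_dim(parities, f) for f in parities)
print(edim_parity)  # -> 3, i.e. exactly d, despite sign-rank >= 2^{d/2}
```

The maximal witness picks the standard basis vectors as points, mirroring the F 2 linear-independence argument in the proof.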
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Relationships between combinatorial measures</head><p>In this section, we specialize the eluder dimension to binary-valued function classes and prove several additional characterizations that relate this combinatorial eluder dimension to other learning-theoretic quantities. First, we (re)define these learning-theoretic quantities. The first two definitions (of combinatorial eluder dimension and star number respectively) are the specialization of the scale-sensitive versions (cf. Definitions 1 and 2) to the binary-valued function setting. We abuse notation by dropping the argument &#949; from previous definitions in order to be consistent. Definition 4. Fix any function class F &#8838; (X &#8594; {1, -1}) and any f &#8712; F.</p><p>&#8226; The combinatorial eluder dimension w.r.t. f , denoted Edim f (F), is defined as the largest m such that there exists (x 1 , f 1 ), . . . , (x m , f m ) &#8712; X &#215; F satisfying for all i &#8712; [m]: f i (x i ) &#8800; f (x i ), and for all j &lt; i : f i (x j ) = f (x j ).</p><p>&#8226; The star number w.r.t. f , denoted Sdim f (F), is defined as the largest m such that there exists (x 1 , f 1 ), . . . , (x m , f m ) &#8712; X &#215; F satisfying for all i &#8712; [m]: f i (x i ) &#8800; f (x i ), and for all j &#8800; i : f i (x j ) = f (x j ).</p><p>&#8226; The threshold dimension w.r.t. f , denoted Tdim f (F), is defined as the largest m such that there exists (x 1 , f 1 ), . . . , (x m , f m ) &#8712; X &#215; F satisfying for all i &#8712; [m]: f i (x j ) &#8800; f (x j ) for all j &#8805; i, and for all j &lt; i : f i (x j ) = f (x j ).</p><p>As before, we define the combinatorial eluder dimension (resp. star number and threshold dimension) to be Edim(F) := sup f &#8712;F Edim f (F) (Sdim(F) and Tdim(F) are defined similarly).</p><p>Let us pause to unpack these definitions and give some background. 
In fact, the star number definition stated above is the original definition of Hanneke and Yang <ref type="bibr">[21]</ref>, who give tight upper and lower bounds on the label complexity of pool-based active learning via the star number Sdim(F) and show that almost every previously proposed complexity measure for active learning takes a worst-case value equal to the star number. Roughly speaking, the star number corresponds to the number of "singletons" one can embed in a function class; that is, the maximum number of functions that differ from a base function f at exactly one point among a subset of the domain {x 1 , . . . , x m } &#8838; X .</p><p>The threshold dimension has recently gained attention due to its role in proving an equivalence relationship between private PAC learning and online learning [see, e.g., <ref type="bibr">3,</ref><ref type="bibr">9]</ref>. We slightly generalize the definition of Alon et al. <ref type="bibr">[3]</ref> to allow for any base function f , in the spirit of the other two definitions. A classical result in model theory provides a link between the threshold dimension and Littlestone dimension (which we denote Ldim), a quantity which is both necessary and sufficient for online learnability <ref type="bibr">[8,</ref><ref type="bibr">4]</ref>. In particular, results by Shelah <ref type="bibr">[40]</ref> and Hodges et al. <ref type="bibr">[22]</ref> show that for any binary-valued F and any f &#8712; F: &#8970;log 2 Tdim f (F)&#8971; &#8804; Ldim(F) and &#8970;log 2 Ldim(F)&#8971; &#8804; Tdim f (F).</p><p>A combinatorial proof of this fact can be found in Thm. 3 of Alon et al. <ref type="bibr">[3]</ref>. <ref type="foot">6</ref> Thus, finiteness of threshold dimension is necessary and sufficient for online learnability (albeit in a much weaker, "qualitative" sense).</p></div>
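The logarithmic relationship between threshold dimension and Littlestone dimension can be observed concretely on the threshold class itself. The computation below is our own sketch (the choice n = 7 and the helper names are illustrative), using the standard recursive characterization of Ldim via mistake trees:

```python
def thresholds(n):
    # F = {f_1, ..., f_{n+1}} on X = {1, ..., n}, f_i(x) = 1{x >= i}
    return [tuple(1 if x >= i else 0 for x in range(1, n + 1))
            for i in range(1, n + 2)]

def littlestone_dim(F):
    """Depth of the deepest complete mistake tree, by direct recursion:
    pick a point on which F splits, and recurse on both label classes."""
    best = 0
    npoints = len(F[0]) if F else 0
    for x in range(npoints):
        F0 = [f for f in F if f[x] == 0]
        F1 = [f for f in F if f[x] == 1]
        if F0 and F1:
            best = max(best, 1 + min(littlestone_dim(F0), littlestone_dim(F1)))
    return best

n = 7
ldim = littlestone_dim(thresholds(n))
# The n+1 = 8 thresholds form a chain, so an adversary can only force
# binary-search-many mistakes, while Tdim of this class is n itself.
print(ldim)  # -> 3 = floor(log2(n + 1))
```

This matches the qualitative statement above: Ldim is logarithmic in the number of embedded thresholds.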
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">A qualitative equivalence</head><p>Now, we prove a rather surprising characterization that closely ties all three quantities in Definition 4 together. Our result implies that for any binary-valued function class, finiteness of the combinatorial eluder dimension is equivalent to finiteness of both star number and threshold dimension. Theorem 8. For any function class F &#8838; (X &#8594; {1, -1}) and any f &#8712; F, the following holds:</p><p>The proof of Theorem 8 can be found in Appendix D. The lower bound is trivial by examining Definition 4. To prove the upper bound, we rely on a novel connection to Ramsey theory. In particular, we show that sequences (x 1 , f 1 ), . . . , (x m , f m ) which witness Edim f = m form a bijection with red-blue colorings of the complete graph K m , while subsequences of the witnessing eluder sequence that are valid star number witnesses or threshold dimension witnesses can be interpreted as monochromatic colorings of subgraphs of K m (see Figure <ref type="figure">2</ref>). Thus, applying classical bounds from Ramsey theory implies the result. </p><p>We also show that the upper bound cannot be improved, for example, to Edim(F) &#8804; poly(Sdim(F), Tdim(F)) in general. Theorem 9. For every N &gt; 0, there exists a function class F N such that Edim(F N ) = N and max{Sdim(F N ), Tdim(F N )} &lt; c &#183; log 2 N , where c &gt; 1/2 is some absolute numerical constant.</p><p>The proof of Theorem 9 can be found in Appendix E. It relies on the probabilistic method to show the existence of a randomly constructed F which satisfies the desired properties.</p></div>
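The trivial direction of Theorem 8 — max{Sdim f (F), Tdim f (F)} &#8804; Edim f (F) — holds because any star or threshold witness is in particular an eluder witness. The randomized sanity check below is our own sketch (all helper names are illustrative; it uses the 0/1 representation, a relabeling of {1, &#8722;1}):

```python
import random
from itertools import product, combinations, permutations

def eluder_dim(F, f):
    """Combinatorial eluder dimension w.r.t. base function f (Definition 4)."""
    memo = {}
    def extend(agree):
        if agree not in memo:
            best = 0
            for x, g in product(range(len(f)), F):
                if g[x] != f[x] and all(g[y] == f[y] for y in agree):
                    best = max(best, 1 + extend(agree | {x}))
            memo[agree] = best
        return memo[agree]
    return extend(frozenset())

def star_number(F, f):
    """Each chosen point is flipped by some g in F that agrees with f
    on every other chosen point."""
    best = 0
    for m in range(1, len(f) + 1):
        for S in combinations(range(len(f)), m):
            if all(any(g[x] != f[x] and
                       all(g[y] == f[y] for y in S if y != x) for g in F)
                   for x in S):
                best = m
    return best

def threshold_dim(F, f):
    """Some g_i disagrees with f exactly on the suffix x_i, ..., x_m
    of the chosen ordered points (Definition 4, third bullet)."""
    best = 0
    for m in range(1, len(f) + 1):
        for xs in permutations(range(len(f)), m):
            if all(any(all((g[xs[j]] != f[xs[j]]) == (j >= i)
                           for j in range(m)) for g in F)
                   for i in range(m)):
                best = m
    return best

random.seed(0)
for _ in range(10):  # small random binary classes on 4 points
    F = [tuple(random.randint(0, 1) for _ in range(4)) for _ in range(5)]
    for f in F:
        e, s, t = eluder_dim(F, f), star_number(F, f), threshold_dim(F, f)
        assert max(s, t) <= e  # the trivial direction of Theorem 8
print("ok")
```

The Ramsey-theoretic upper bound is the substantive part of the theorem and is not reproduced here.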
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Comparisons with sign-rk</head><p>In this section, we investigate the comparison of these combinatorial quantities (eluder, star, threshold) with sign-rk, and examine whether we can prove stronger separations. One direction is clear; for the class of linear classifiers in R d , the combinatorial eluder dimension, star number, and threshold dimension are all infinite. However, we ask if the other separation is also possible: can we construct F where eluder/star/threshold are finite but sign-rk = &#8734;? We have already provided an explicit exponential separation: Theorem 6 shows that for the function class F &#8853; , we had Edim(F &#8853; ) = Sdim(F &#8853; ) = d but sign-rk(F &#8853; ) &#8805; 2 d/2 . (One can also show that Tdim(F &#8853; ) = d.) We are able to show stronger (but nonconstructive) separations by extending the probabilistic techniques of Alon et al. <ref type="bibr">[2]</ref>, who recently provided similar separations for VC dimension versus sign-rk. In view of Proposition 3, this result also provides a separation for the scale-sensitive Definition 1. Theorem 10. For every N &gt; 0, there exists a function class F N &#8838; ([N ] &#8594; {1, -1}) such that Edim 1 (F N ) = 4 and sign-rk(F N ) &#8805; &#8486;(N 1/9 / log N ), where 1 is shorthand for the all 1s function.</p><p>The proof of Theorem 10 can be found in Appendix F. It is straightforward to replace the reference function f (x) = 1 with any fixed reference function f : [N ] &#8594; {1, -1}. First, we use Lemma 22 of Alon et al. 
<ref type="bibr">[2]</ref>, which bounds the number of distinct matrices with sign-rk at most r; then, using a probabilistic argument, we show that there are many more matrices with Edim_1 = 4, so at least one of them must have large sign-rk.</p><p>The careful reader might notice that we do not prove the existence of a function class where Edim(F) is constant and sign-rk is infinite; instead we prove the weaker statement that there exists a function class whose eluder dimension with respect to a fixed reference function f is bounded. We conjecture that the stronger statement holds; see Appendix F.1 for more details.</p></div>
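The exponential gap of Theorem 6 can be sanity-checked numerically for a plausible instantiation of F_⊕; the paper's exact construction is not reproduced in this excerpt, so the class below, parities x ↦ (-1)^{⟨w,x⟩ mod 2} on {0,1}^d, is an assumption. (Its ±1 inner-product matrix is the classical example of a matrix with sign-rank at least 2^{d/2}, by Forster's bound.) The sketch brute-forces the eluder dimension with respect to the all-ones reference.

```python
from itertools import product

# Hedged illustration: one natural instantiation of a "parity" class is
#   F = { x -> (-1)^{<w,x> mod 2} : w in {0,1}^d }  on  X = {0,1}^d.
# (An assumption: the paper's F_xor is not reproduced in this excerpt.)
d = 3
X = list(product([0, 1], repeat=d))

def parity(w, x):
    return -1 if sum(a * b for a, b in zip(w, x)) % 2 else 1

# Reference f = parity with w = 0, i.e. the all-ones function.
def edim():
    """Brute-force Edim_f: longest sequence x_1, x_2, ... such that some
    parity disagrees with f at x_i while agreeing at every earlier x_j."""
    best = 0
    def dfs(xs):
        nonlocal best
        best = max(best, len(xs))
        for x in X:
            if any(parity(w, x) == -1 and
                   all(parity(w, xj) == 1 for xj in xs)
                   for w in X):  # weight vectors also range over {0,1}^d
                dfs(xs + [x])
    dfs([])
    return best

# Over GF(2), a point x can extend the sequence iff it lies outside the span
# of the earlier points, so the longest sequence has length exactly d,
# even though |X| = 2^d.
print(edim())
```

The GF(2) linear-algebra picture is what keeps the eluder dimension at d here: each step consumes one dimension of the weight space, while the domain (and the sign-rank lower bound) grows exponentially in d.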
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Back to scale-sensitive?</head><p>It is natural to ask whether our results can be extended back to the scale-sensitive definitions. First, we require a scale-sensitive version of threshold dimension. One proposal is the following: Definition 5. For any function class F ⊆ (X → R), f ∈ F, and scale ε ≥ 0, the exact threshold dimension Tdim^E_f(F, ε) is the largest m such that there exist (x_1, f_1), ..., (x_m, f_m) ∈ X × F satisfying, for all i ∈ [m]: |f_i(x_j) - f(x_j)| &gt; ε for all j ≥ i, and Σ_{j&lt;i} (f_i(x_j) - f(x_j))^2 ≤ ε^2.</p><p>Then for all ε &gt; 0:</p><p>• the threshold dimension is Tdim_f(F, ε) := sup_{ε' ≥ ε} Tdim^E_f(F, ε').</p><p>• Tdim(F, ε) := sup_{f ∈ F} Tdim_f(F, ε) and Tdim^E(F, ε) := sup_{f ∈ F} Tdim^E_f(F, ε).</p><p>Definition 5 mirrors the scale-sensitive definitions for eluder and star; it also recovers the combinatorial definition when F is binary-valued. By definition, the relationship Edim_f(F, ε) ≥ max{Sdim_f(F, ε), Tdim_f(F, ε)} for every F, f ∈ F, and ε &gt; 0 is trivial. However, one cannot hope to prove the corresponding upper bound under this definition. For example, for any ε &gt; 0, take the function class represented by the N × N matrix:</p><p>It is easy to see that Edim_0(F, ε) = N, while Sdim_0(F, ε) = 2 and Tdim_0(F, ε) = 1. Notice that this class is still "threshold-like", but Definition 5 does not capture this at the given value of ε. 
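Since the matrix display is missing from this excerpt, the sketch below uses one hypothetical N × N matrix consistent with all three claimed values: 2ε on the diagonal, ε below it, and 0 above it, with rows as the functions and the reference f ≡ 0. The quantities are brute-forced directly from the scale-sensitive sequence conditions.

```python
from itertools import product

# Hedged reconstruction: the displayed N x N matrix is missing from this
# excerpt. One matrix consistent with the claimed values
# (Edim_0 = N, Sdim_0 = 2, exact Tdim_0 = 1) is the assumed
#   M[i][j] = 2*eps if i == j, eps if j < i, 0 otherwise,
# whose rows are "threshold-like" functions on X = [N]; reference f = 0.
N, eps = 5, 1.0
M = [[2 * eps if i == j else (eps if j < i else 0.0) for j in range(N)]
     for i in range(N)]
val = lambda f, x: M[f][x]  # f indexes a row (function), x a column (point)

def longest(valid):
    best = 0
    def dfs(seq):
        nonlocal best
        best = max(best, len(seq))
        for cand in product(range(N), range(N)):  # (point, function) pairs
            if valid(seq + [cand]):
                dfs(seq + [cand])
    dfs([])
    return best

# Edim_0(F, eps): |f_i(x_i)| > eps and sum_{j<i} f_i(x_j)^2 <= eps^2.
edim = longest(lambda s: all(
    abs(val(s[i][1], s[i][0])) > eps and
    sum(val(s[i][1], s[j][0]) ** 2 for j in range(i)) <= eps ** 2
    for i in range(len(s))))

# Sdim_0(F, eps): |f_i(x_i)| > eps and sum_{j != i} f_i(x_j)^2 <= eps^2.
sdim = longest(lambda s: all(
    abs(val(s[i][1], s[i][0])) > eps and
    sum(val(s[i][1], s[j][0]) ** 2
        for j in range(len(s)) if j != i) <= eps ** 2
    for i in range(len(s))))

# Exact Tdim_0(F, eps), as in Definition 5: |f_i(x_j)| > eps for all j >= i,
# and sum_{j<i} f_i(x_j)^2 <= eps^2.
tdim = longest(lambda s: all(
    all(abs(val(s[i][1], s[j][0])) > eps for j in range(i, len(s))) and
    sum(val(s[i][1], s[j][0]) ** 2 for j in range(i)) <= eps ** 2
    for i in range(len(s))))

print(edim, sdim, tdim)
```

Under this assumed matrix, only the diagonal entries exceed ε, so no single row can witness two exact-threshold disagreements, even though the rows are thresholds in every qualitative sense; meanwhile a descending point sequence paired with its own rows is a full-length eluder sequence.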
Generally speaking, it is unclear how to carry over the intuition from Ramsey theory that applies in the combinatorial case to the scale-sensitive case; we leave this to future work.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Finiteness of the threshold dimension is equivalent to finiteness of the Littlestone dimension <ref type="bibr">[40, 22, 3]</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>This result was independently established by Dong et al. <ref type="bibr">[12]</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3"><p>For the definition of eluder dimension considered by <ref type="bibr">[37]</ref>, an upper bound of min{|X|, |F|^2} holds, which can be tight. This upper bound holds because the witnessing pair of functions (f_i, f_i') must be distinct for each i.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_4"><p>Note that only the product of the scales of φ and w is relevant. The definition remains equivalent if we let φ : X → B_d(R_φ) and w : F → B_d(R_w) for any R_φ and R_w such that R = R_φ · R_w.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_5"><p>To prove upper bounds on eluder dimension, it suffices for this condition to hold only when |z| &#8804; R, [see e.g. 37]. Since we fix &#963; in our definition first and then consider &#963;-rank at different scales R, this weaker condition complicates our definitions. Note that at any specific scale R, we can always modify &#963; to satisfy the required constraint everywhere by extending it linearly whenever |z| &gt; R.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_6"><p>Alon et al. [3] prove the result for f(x) = -1, but it is easy to extend their proof to hold for any f.</p></note>
		</body>
		</text>
</TEI>
