<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Neural network approximation: Three hidden layers are enough</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>09/01/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10230772</idno>
					<idno type="doi">10.1016/j.neunet.2021.04.011</idno>
					<title level='j'>Neural Networks</title>
<idno type="ISSN">0893-6080</idno>
<biblScope unit="volume">141</biblScope>
<biblScope unit="issue">C</biblScope>					

					<author>Zuowei Shen</author><author>Haizhao Yang</author><author>Shijun Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[A three-hidden-layer neural network with super approximation power is introduced. This network is built with the floor function (⌊x⌋), the exponential function (2^x), the step function (1_{x≥0}), or their compositions as the activation function in each neuron, and hence we call such networks Floor-Exponential-Step (FLES) networks. For any width hyper-parameter N ∈ N+, it is shown that FLES networks with width max{d, N} and three hidden layers can uniformly approximate a Hölder continuous function f on [0, 1]^d with an exponential approximation rate 3λ(2√d)^α 2^{-αN}, where α ∈ (0, 1] and λ > 0 are the Hölder order and constant, respectively. More generally, for an arbitrary continuous function f on [0, 1]^d with a modulus of continuity ω_f(·), the constructive approximation rate is 2ω_f(2√d)2^{-N} + ω_f(2√d·2^{-N}). Moreover, we extend such a result to general bounded continuous functions on a bounded set E ⊆ R^d. As a consequence, this new class of networks overcomes the curse of dimensionality in approximation power when the variation of ω_f(r) as r → 0 is moderate (e.g., ω_f(r) ≲ r^α for Hölder continuous functions), since the major term to be concerned in our approximation rate is essentially √d times a function of N independent of d within the modulus of continuity. Finally, we extend our analysis to derive similar approximation results in the L^p-norm for p ∈ [1, ∞) via replacing Floor-Exponential-Step activation functions by continuous activation functions.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Keywords: Floor-Exponential-Step Activation Function, Continuous Function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>This paper studies the approximation power of neural networks and shows that three hidden layers are enough for neural networks to achieve super approximation capacity. In particular, leveraging the power of advanced yet simple activation functions, we will introduce new theories and network architectures with only three hidden layers achieving exponential convergence and avoiding the curse of dimensionality simultaneously for (Hölder) continuous functions with an explicit approximation bound. The theories established in this paper would provide new insights to explain why deeper neural networks are better than one-hidden-layer neural networks for large-scale and high-dimensional problems. The approximation theories here are constructive (i.e., with explicit formulas to specify network parameters) and quantitative (i.e., results valid for essentially arbitrary width and/or depth without lower bound constraints) with explicit error bounds working for three-hidden-layer networks with arbitrary width.</p><p>Constructive approximation with quantitative results and explicit error bounds would provide important guides for deciding the network sizes in deep learning. For example, the (nearly) optimal approximation rates of deep ReLU networks with width O(N) and depth O(L) for a Lipschitz continuous function and a C^s function f on [0, 1]^d were established in <ref type="bibr">Shen et al. (2020)</ref> and <ref type="bibr">Lu et al. (2020)</ref>, respectively. For results in terms of the number of nonzero parameters, the reader is referred to <ref type="bibr">Yarotsky (2017)</ref>; Schmidt-Hieber (2020); <ref type="bibr">Petersen and Voigtlaender (2018)</ref>; <ref type="bibr">Yarotsky (2018)</ref>; <ref type="bibr">Gühring et al. (2019)</ref>; <ref type="bibr">Yarotsky and Zhevnerchuk (2020)</ref> and the references therein. Obviously, the curse of dimensionality exists in ReLU networks for these generic functions and, therefore, ReLU networks would need to be exponentially large in d to maintain a reasonably good approximation accuracy. The curse could be lessened when target function spaces are smaller. To name a few examples, see <ref type="bibr">Poggio et al. (2017)</ref>; <ref type="bibr">Barron and Klusowski (2018)</ref>; <ref type="bibr">E et al. (2019)</ref>; <ref type="bibr">Montanelli et al. (2020)</ref>; <ref type="bibr">Chen et al. (2019a)</ref>; <ref type="bibr">Hutzenthaler et al. (2020)</ref> and the references therein for ReLU networks. The limitation of ReLU networks motivated the work in <ref type="bibr">Shen et al. (2021)</ref> to introduce Floor-ReLU networks built with either a Floor (⌊x⌋) or ReLU (max{0, x}) activation function in each neuron. It was shown by construction in <ref type="bibr">Shen et al. (2021)</ref> that Floor-ReLU networks with width max{d, 5N + 13} and depth 64dL + 3 can uniformly approximate a Hölder continuous function f on [0, 1]^d with a root-exponential approximation rate 3λd^{α/2} N^{-α√L} without the curse of dimensionality, where α ∈ (0, 1] and λ &gt; 0 are the Hölder order and constant, respectively.</p><p>The most important message of <ref type="bibr">Shen et al. (2021)</ref> (and probably also of <ref type="bibr">Yarotsky and Zhevnerchuk (2020)</ref>) is that the combination of simple activation functions can create super approximation power. 
In the Floor-ReLU networks mentioned above, the power of depth is fully reflected in the approximation rate 3λd^{α/2} N^{-α√L}, which is root-exponential in depth. However, the power of width is much weaker and the approximation rate is polynomial in width if depth is fixed. This seems to be inconsistent with recent developments of network optimization theory <ref type="bibr">Jacot et al. (2018)</ref>; <ref type="bibr">Du et al. (2019)</ref>; <ref type="bibr">Mei et al. (2018)</ref>; <ref type="bibr">Wu et al. (2018)</ref>; <ref type="bibr">Chen et al. (2019b)</ref>; <ref type="bibr">Lu et al. (2020)</ref>; <ref type="bibr">Luo and Yang (2020)</ref>, where larger width instead of depth can ease the challenge of highly nonconvex optimization. The mystery of the power of width and depth remains, and it motivates us to demonstrate that width can also enable super approximation power when armed with appropriate activation functions.</p><p>In particular, we explore the floor function, the exponential function (2^x), the step function (1_{x≥0}), or their compositions as activation functions to build fully-connected feed-forward neural networks. These networks are called Floor-Exponential-Step (FLES) networks. As we shall prove by construction, Theorem 1.1 below shows that FLES networks with width max{d, N} and three hidden layers can uniformly approximate a continuous function f on [0, 1]^d with an exponential approximation rate 2ω_f(2√d)2^{-N} + ω_f(2√d·2^{-N}), where ω_f(·) is the modulus of continuity defined as ω_f(r) := sup{|f(x) − f(y)| : ‖x − y‖_2 ≤ r, x, y ∈ [0, 1]^d}, for any r ≥ 0.</p><p>In particular, there are three kinds of activation functions denoted as σ_1, σ_2, and σ_3 in FLES networks (see Figure <ref type="figure">1</ref> for an illustration): σ_1(x) := ⌊x⌋, σ_2(x) := 2^x, and σ_3(x) := T(x − ⌊x⌋ − 1/2), for any x ∈ R, where T(x) := 1_{x≥0} is the step function, i.e., T(x) = 1 for x ≥ 0 and T(x) = 0 for x &lt; 0.</p><p>Theorem 1.1. Given an arbitrary continuous function f defined on [0, 1]^d, for any N ∈ N+, there exist a_1, a_2, ⋯, a_N such that |φ(x) − f(x)| ≤ 2ω_f(2√d)2^{-N} + ω_f(2√d·2^{-N}) for any x ∈ [0, 1]^d, where φ is defined by an explicit formula in a_1, a_2, ⋯, a_N given by the construction in Section 3.</p><p>We remark that φ in Theorem 1.1 is essentially determined by N parameters a_1, a_2, ⋯, a_N, which can be trained by a (σ_1, σ_2, σ_3)-activated network with width max{d, N}, three hidden layers, and 2(d + N + 1) nonzero parameters. See Figure <ref type="figure">1</ref> for an illustration.</p><p>Figure <ref type="figure">1</ref>: An illustration of the desired three-hidden-layer network in Theorem 1.1 for any x = (x_1, x_2, ⋯, x_d) ∈ R^d. Each of the red functions “σ_1”, “σ_2”, and “σ_3” above the network is the activation function of the corresponding hidden layer. The number of neurons in each hidden layer is indicated by the red number below it.</p><p>The rate in ω_f(2√d·2^{-N}) implicitly depends on N through the modulus of continuity of f, while the rate in 2ω_f(2√d)2^{-N} is explicit in N. Simplifying the implicit approximation rate to make it explicitly depend on N is challenging in general. However, if f is a Hölder continuous function on [0, 1]^d of order α ∈ (0, 1] with a Hölder constant λ &gt; 0, i.e., f(x) satisfying |f(x) − f(y)| ≤ λ‖x − y‖_2^α for any x, y ∈ [0, 1]^d, (1) then ω_f(r) ≤ λr^α for any r ≥ 0. 
Therefore, in the case of Hölder continuous functions, the approximation rate is simplified to 3λ(2√d)^α 2^{-αN}, as shown in the following corollary. In the special case of Lipschitz continuous functions with a Lipschitz constant λ &gt; 0, the approximation rate is simplified to 6λ√d·2^{-N}.</p><p>Corollary 1.2. Given any Hölder continuous function f on [0, 1]^d of order α ∈ (0, 1] with a Hölder constant λ &gt; 0, for any N ∈ N+, there exist a_1, a_2, ⋯, a_N such that |φ(x) − f(x)| ≤ 3λ(2√d)^α 2^{-αN} for any x ∈ [0, 1]^d, where φ is defined by the same formula in a_1, a_2, ⋯, a_N as in Theorem 1.1.</p><p>First, Theorem 1.1 and Corollary 1.2 show that the approximation capacity of three-hidden-layer neural networks with simple activation functions for continuous functions can be exponentially improved by increasing the network width, and the approximation error can be explicitly characterized in terms of the width O(N). Second, this new class of networks overcomes the curse of dimensionality in the approximation power when the modulus of continuity is moderate, since the approximation order is essentially √d times a function of N independent of d within the modulus of continuity. Therefore, three hidden layers are enough for neural networks to achieve exponential convergence and avoid the curse of dimensionality for generic functions. The width is also powerful in network approximation.</p><p>The rest of this paper is organized as follows. In Section 2, we discuss the application scope of our theory, study the connection between the approximation error and the Vapnik-Chervonenkis (VC) dimension, establish Corollary 2.3 to extend our analysis to general bounded continuous functions on a bounded set, and compare related works in the literature. We will prove Theorem 1.1 and Corollary 2.3 in Section 3. In Section 4, we explore alternative continuous activation functions other than σ_1, σ_2, and σ_3 for super approximation power. Finally, we conclude this paper in Section 5.</p></div>
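<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three FLES activation functions and the Hölder-case error bound above are simple enough to evaluate directly. The minimal Python sketch below is our own illustration (the helper names are not from the paper); it computes σ_1, σ_2, σ_3 and the bound 3λ(2√d)^α 2^{-αN} of Corollary 1.2.</p><p><![CDATA[
import math

def sigma1(x):          # floor activation
    return math.floor(x)

def sigma2(x):          # exponential activation 2^x
    return 2.0 ** x

def step(x):            # step function 1_{x >= 0}
    return 1.0 if x >= 0 else 0.0

def sigma3(x):          # composition T(x - floor(x) - 1/2) with T the step function
    return step(x - math.floor(x) - 0.5)

def holder_rate(d, N, alpha=1.0, lam=1.0):
    # bound 3 * lam * (2 * sqrt(d))**alpha * 2**(-alpha * N) from Corollary 1.2
    return 3.0 * lam * (2.0 * math.sqrt(d)) ** alpha * 2.0 ** (-alpha * N)

if __name__ == "__main__":
    print(sigma1(2.7), sigma2(3), sigma3(0.75))   # 2, 8.0, 1.0
    print(holder_rate(d=10, N=20, alpha=1.0))     # exponentially small in N
]]></p></div>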
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Discussion</head><p>In this section, we will further interpret our results and discuss related research in the field of neural network approximation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Application scope of our theory in machine learning</head><p>Let φ(x; θ) denote a function computed by a (fully-connected) network with θ as the set of parameters. Given a target function f, consider the expected error/risk of φ(x; θ), R_D(θ) := E_{x∼U(X)} ℓ(φ(x; θ), f(x)), with a loss function ℓ typically taken as ℓ(y, y′) = (1/2)|y − y′|^2, where U(X) is an unknown data distribution over X. For example, when ℓ(y, y′) = (1/2)|y − y′|^2, we have R_D(θ) = (1/2) E_{x∼U(X)} |φ(x; θ) − f(x)|^2.</p><p>The expected risk minimizer θ_D is defined as θ_D := argmin_θ R_D(θ). It is unachievable in practice since f and U(X) are not available. Instead, we only have samples of f. Given samples {(x_i, f(x_i))}_{i=1}^{n}, the empirical risk is defined as R_S(θ) := (1/n) Σ_{i=1}^{n} ℓ(φ(x_i; θ), f(x_i)), and it is usually used to approximate/model the expected risk R_D(θ). The goal of supervised learning is to identify the empirical risk minimizer θ_S := argmin_θ R_S(θ) (2) to obtain φ(x; θ_S) ≈ f(x). When a numerical optimization method is applied to solve (2), it may result in a numerical solution (denoted as θ_N) that is not a global minimizer. Hence, the actually learned function generated by a neural network is φ(x; θ_N). The discrepancy between the target function f and the actually learned function φ(x; θ_N) is measured by the inference error R_D(θ_N) = E_{x∼U(X)} ℓ(φ(x; θ_N), f(x)) = (1/2)‖φ(·; θ_N) − f‖^2_{L^2([0, 1]^d)}, where the second equality holds when ℓ(y, y′) = (1/2)|y − y′|^2 and U is a uniform distribution over X = [0, 1]^d.</p><p>Since R_D(θ_N) is the expected inference error over all possible data samples, it can quantify how good the learned function φ(x; θ_N) is. Note that R_D(θ_N) = [R_D(θ_N) − R_S(θ_N)] + [R_S(θ_N) − R_S(θ_S)] + [R_S(θ_S) − R_S(θ_D)] + [R_S(θ_D) − R_D(θ_D)] + R_D(θ_D) ≤ R_D(θ_D) + [R_S(θ_N) − R_S(θ_S)] + [R_S(θ_D) − R_D(θ_D)] + [R_D(θ_N) − R_S(θ_N)], (3) where the first term R_D(θ_D) is the approximation error (AE), the second term [R_S(θ_N) − R_S(θ_S)] is the optimization error (OE), the last two terms together form the generalization error (GE), and the inequality comes from the fact that [R_S(θ_S) − R_S(θ_D)] ≤ 0 by Eq. (<ref type="formula">2</ref>) since θ_S is a global minimizer of R_S(θ). Constructive approximation provides an upper bound of R_D(θ_D) in terms of the network size, e.g., in terms of the network width and depth, or in terms of the number of parameters.</p><p>The second term of Equation (<ref type="formula">3</ref>) is bounded by the optimization error of the numerical algorithm applied to solve the empirical loss minimization problem in Equation (<ref type="formula">2</ref>). Note that one only needs to make R_S(θ_N) − R_S(θ_S) small, but not ‖θ_N − θ_S‖. The study of the bounds for the third and fourth terms is referred to as the generalization error analysis of neural networks. See Figure <ref type="figure">2</ref> for the intuitions of these three errors. One of the key targets in the area of deep learning is to develop algorithms to reduce R_D(θ_N). The constructive approximation established in this paper and in the literature provides upper bounds of the approximation error R_D(θ_D) for several function spaces, which is crucial to estimate an upper bound of R_D(θ_N). Instead of deriving an approximator to attain the approximation error bound, deep learning algorithms aim to identify a solution φ(x; θ_N) reducing the generalization and optimization errors in Equation (3). Solutions minimizing both generalization and optimization errors will lead to a good solution only if we also have a good upper bound estimate of R_D(θ_D) as shown in Equation (3). 
Independent of whether our analysis here leads to a good approximator, which is an interesting topic to pursue, the theory here does provide a key ingredient in the error analysis of deep learning algorithms.</p><p>Theorem 1.1 and Corollary 1.2 provide an upper bound of R_D(θ_D). This bound only depends on the given budget of neurons and layers of FLES networks. Hence, this bound is independent of the empirical loss minimization in Equation (2) and the optimization algorithm used to compute the numerical solution of Equation (2). In other words, Theorem 1.1 and Corollary 1.2 quantify the approximation power of FLES networks with a given size. Designing efficient optimization algorithms and analyzing the generalization bounds for FLES networks are two other separate future directions.</p><p>Figure 2: The intuitions of the approximation error (AE), the optimization error (OE), and the generalization error (GE). DL is short for deep learning. One needs to control AE, OE, and GE in order to bound the discrepancy between the target function f and the numerical solution φ(x; θ_N) (what we can get in practice), measured by R_D(θ_N).</p></div>
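<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the roles of R_S(θ) and R_D(θ) above concrete, the following Python sketch is a toy example of our own (the target f, the linear "network" φ, and the parameter choices are illustrative assumptions, not from the paper): it evaluates the empirical risk on n samples and a Monte Carlo estimate of the expected risk under the uniform distribution, both with the squared loss ℓ(y, y′) = (1/2)|y − y′|^2.</p><p><![CDATA[
import numpy as np

rng = np.random.default_rng(0)

def f(x):                        # toy target function on [0, 1]^d
    return np.sin(2 * np.pi * x[:, 0])

def phi(x, theta):               # toy "network": a linear model with parameters theta
    return x @ theta[:-1] + theta[-1]

def empirical_risk(theta, xs, ys):
    # R_S(theta) = (1/n) sum_i 0.5 * |phi(x_i; theta) - f(x_i)|^2
    return 0.5 * np.mean((phi(xs, theta) - ys) ** 2)

def expected_risk(theta, d, n_mc=100_000):
    # Monte Carlo estimate of R_D(theta) with U = uniform distribution on [0, 1]^d
    xs = rng.random((n_mc, d))
    return empirical_risk(theta, xs, f(xs))

d, n = 2, 50
xs = rng.random((n, d))          # training samples {(x_i, f(x_i))}
ys = f(xs)
theta = np.zeros(d + 1)          # some candidate parameters
print(empirical_risk(theta, xs, ys), expected_risk(theta, d))
]]></p></div>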
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Connection between approximation error and VC-dimension</head><p>The approximation error and the Vapnik-Chervonenkis (VC) dimension are two important measures of the capacity (complexity) of a set of functions. In this section, we discuss the connection between them.</p><p>Let us first present the definitions of VC-dimension and related concepts. Assume H is a class of functions mapping from a general domain X to {0, 1}. We say H shatters a set of points {x_1, x_2, ⋯, x_m} ⊆ X if |{(h(x_1), h(x_2), ⋯, h(x_m)) : h ∈ H}| = 2^m, where |·| means the size of a set. The above equation means that, given any labels y_1, y_2, ⋯, y_m ∈ {0, 1}, there exists h ∈ H such that h(x_i) = y_i for all i ∈ {1, 2, ⋯, m}. For a general function set F with its elements mapping from X to R, we say F shatters {x_1, x_2, ⋯, x_m} ⊆ X if T∘F does, where T∘F := {T∘f : f ∈ F} and T(t) := 1_{t≥0}. For any m ∈ N+, the growth function of H is defined as Π_H(m) := max_{x_1, x_2, ⋯, x_m ∈ X} |{(h(x_1), h(x_2), ⋯, h(x_m)) : h ∈ H}|.</p><p>Definition 2.1 (VC-dimension). Assume H is a class of functions from X to {0, 1}. The VC-dimension of H, denoted by VCDim(H), is the size of the largest shattered set, namely, VCDim(H) := sup{m ∈ N+ : Π_H(m) = 2^m}. Let F be a class of functions from X to R. The VC-dimension of F, denoted by VCDim(F), is defined by VCDim(F) := VCDim(T∘F),<ref type="foot">foot_0</ref> where T∘F := {T∘f : f ∈ F} with T(t) := 1_{t≥0} as above.</p><p>In particular, the expression "the VC-dimension of a network (architecture)" means the VC-dimension of the function set that consists of all functions computed by this network (architecture).</p><p>As shown in <ref type="bibr">Yarotsky (2017, 2018)</ref>; <ref type="bibr">Shen et al. (2019, 2020)</ref>; <ref type="bibr">Lu et al. (2020)</ref>; <ref type="bibr">Shen et al. (2021)</ref>; Zhang (2020); <ref type="bibr">Shen et al. (to appear)</ref>, the VC-dimension essentially determines the lower bound of the approximation errors of networks. For simplicity, we use Hölder([0, 1]^d, α, λ) as an example, where Hölder([0, 1]^d, α, λ) denotes the space of Hölder continuous functions of order α ∈ (0, 1] with a Hölder constant λ &gt; 0. Without loss of generality, we assume λ = 1. Theorem 2.2 below shows that the best possible approximation error of functions in Hölder([0, 1]^d, α, 1) approximated by functions in F is bounded by a formula characterized by VCDim(F).</p><p>Theorem 2.2 (Theorem 2.4 of <ref type="bibr">Shen et al. (to appear)</ref> or Theorem 4.17 of Zhang (2020)). Assume F is a function set with all elements defined on [0, 1]^d. Then the best possible approximation error of functions in Hölder([0, 1]^d, α, 1) approximated by elements of F admits a lower bound determined by VCDim(F).</p><p>This theorem investigates the connection between the VC-dimension of F and the approximation errors of functions in Hölder([0, 1]^d, α, 1) approximated by elements of F. Denote the best approximation error of functions in Hölder([0, 1]^d, α, 1) approximated by the elements of F as E(Hölder([0, 1]^d, α, 1), F) := sup_{f ∈ Hölder([0, 1]^d, α, 1)} inf_{φ ∈ F} ‖φ − f‖_{L^∞([0, 1]^d)}. Then, Theorem 2.2 implies that E(Hölder([0, 1]^d, α, 1), F) is controlled from below by a quantity of order (VCDim(F))^{-α/d} (essentially (VCDim(F))^{-α/d}/9), which means that the best possible approximation error cannot beat this VC-dimension-based bound. A typical application of this theorem is to prove the optimality of approximation errors when using ReLU networks to approximate functions in Hölder([0, 1]^d, α, 1). It is shown in <ref type="bibr">Harvey et al. (2017)</ref> that the VC-dimension of F_{N,L} is bounded by O(N^2 L^2 log(NL)), where F_{N,L} is the space consisting of all functions implemented by ReLU networks with width N and depth L. 
It is shown in Section 4.4.1 of Zhang (2020) that this best approximation error for F_{N,L} is bounded from above and below by expressions of the same order in N and L involving two positive constants C_1(α, d) and C_2(α, d), where C_1(α, d) and C_2(α, d) are determined by α and d.</p><p>Finally, we would like to point out that a large VC-dimension of the hypothesis space F is a necessary condition for a good approximation error, but cannot guarantee a good approximation error, which also relies on other properties of the hypothesis space F. For example, it is easy to check by Proposition 4.2 that VCDim({φ : φ(x) = cos(ax), a ∈ R}) = ∞. However, the family {φ : φ(x) = cos(ax), a ∈ R} cannot achieve a good approximation error when approximating Hölder continuous functions. Designing a hypothesis space with a large VC-dimension is the first step toward a good approximation tool, but realizing the desired approximation power requires a refined design of the hypothesis space, which is also the philosophy we followed in this paper. Our initial goal is to design a network architecture with a fixed depth (e.g., three hidden layers) to generate a hypothesis space with a sufficiently large VC-dimension (∞). As we shall see later, Proposition 3.2 implies that the VC-dimension of FLES networks is infinity, which is a necessary condition for our FLES networks to attain super approximation power.</p></div>
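<div xmlns="http://www.tei-c.org/ns/1.0"><p>The shattering condition behind Definition 2.1 can be checked numerically for small point sets. The brute-force Python sketch below is an illustrative toy of our own (not part of the paper): it tests whether a finite family of binary classifiers realizes all 2^m labelings of m given points, which is exactly the shattering requirement used to define VCDim.</p><p><![CDATA[
from itertools import product

def shatters(hypotheses, points):
    """hypotheses: iterable of functions mapping a point to 0/1.
    Returns True if every labeling in {0,1}^m is realized on `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(lab in realized for lab in product((0, 1), repeat=len(points)))

# Example: thresholds 1_{x >= t} on the real line shatter any single point,
# but cannot produce the labeling (1, 0) on two points x1 < x2.
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in [i / 10 for i in range(-20, 21)]]
print(shatters(thresholds, [0.5]))        # True
print(shatters(thresholds, [0.2, 0.7]))   # False: labeling (1, 0) is unrealizable
]]></p></div>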
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Further interpretation of our theory</head><p>In the interpretation of our theory, three more aspects are important to discuss. The first one is whether it is possible to extend our theory to functions on a more general domain, e.g., E ⊆ [−R, R]^d for any R &gt; 0, because R &gt; 1 may cause an implicit curse of dimensionality in some existing theory. The second one is how bad the modulus of continuity could be, since it is related to a high-dimensional function f and may lead to an implicit curse of dimensionality in our approximation rate. The last one is the discussion of overcoming the zero-derivative issue in training FLES networks.</p><p>First, we can generalize Theorem 1.1 to the function space C(E) with E ⊆ [−R, R]^d for any R &gt; 0 in the following corollary with the modulus of continuity ω_f^E(·) defined as follows. For an arbitrary set E ⊆ R^d, ω_f^E(r) := sup{|f(x) − f(y)| : ‖x − y‖_2 ≤ r, x, y ∈ E}, for any r ≥ 0. As defined earlier, ω_f(·) = ω_f^E(·) in the case E = [0, 1]^d. The proof of this corollary will be presented in Section 3.2.</p><p>Corollary 2.3. Given an arbitrary bounded continuous function f on E ⊆ [−R, R]^d, where R is an arbitrary positive real number, for any N ∈ N+, there exist a_1, a_2, ⋯, a_N such that |φ(x) − f(x)| ≤ C_f·2^{-N} + ω_f^E(3R√d·2^{-N}) for any x ∈ E, where C_f is a constant determined by f.</p><p>Hence, the volume of the function domain E ⊆ [−R, R]^d only has a mild influence on the approximation rate of our FLES networks. FLES networks can still avoid the curse of dimensionality and achieve exponential convergence for continuous functions on E ⊆ [−R, R]^d when R &gt; 1. For example, in the case of Hölder continuous functions of order α ∈ (0, 1] with a constant λ &gt; 0, the radius R enters the rate only through a factor of order (3R√d)^α.</p><p>Second, most interesting continuous functions in practice have a good modulus of continuity such that there is no implicit curse of dimensionality hidden in ω_f(·). For example, we have discussed the case of Hölder continuous functions previously. We would like to remark that the class of Hölder continuous functions implicitly depends on d through its definition in Equation (1), but this dependence is moderate since the ℓ_2-norm in Equation (<ref type="formula">1</ref>) is the square root of a sum with d terms. Let us now discuss several cases of ω_f(·) for which we cannot achieve exponential convergence or cannot avoid the curse of dimensionality. The first example is ω_f(r) = 1/ln(1/r) for small r &gt; 0, which leads to an approximation rate essentially of order 1/(N ln 2 − ln(2√d)). Apparently, the above approximation rate still avoids the curse of dimensionality, but there is no exponential convergence, which has been canceled out by the "ln" in ω_f(·). The second example is ω_f(r) = 1/ln^{1/d}(1/r) for small r &gt; 0, which leads to an approximation rate essentially of order 1/(N ln 2 − ln(2√d))^{1/d}. The power 1/d further weakens the approximation rate and hence the curse of dimensionality exists. The last example we would like to discuss is ω_f(r) = r^{α/d} for small r &gt; 0, which results in an approximation rate essentially of order (2√d)^{α/d} 2^{-αN/d}, which achieves the exponential convergence and avoids the curse of dimensionality when we use very wide networks. Though we have provided several examples of immoderate ω_f(·), to the best of our knowledge, we are not aware of useful continuous functions with an immoderate ω_f(·). Finally, we would like to point out that the training of FLES networks in practice may encounter two issues. 
First, network weights in our main theorems require high-precision computation that might not be available in existing computer systems when the dimension d and the network size parameter N are large. But there is no theoretical evidence to exclude the possibility that similar approximation results can be achieved with reasonable weights in practical computation. Second, the vanishing gradient of piecewise constant activation functions makes standard SGD infeasible. There are two possible directions to solve the optimization problem for FLES networks: 1) gradient-free optimization methods, e.g., the Nelder-Mead method <ref type="bibr">Nelder and Mead (1965)</ref>, genetic algorithms <ref type="bibr">Holland (1992)</ref>, simulated annealing <ref type="bibr">Kirkpatrick et al. (1983)</ref>, and particle swarm optimization <ref type="bibr">Kennedy and Eberhart (1995)</ref>; and 2) gradient-based methods with surrogate gradients for the piecewise constant activation functions, e.g., <ref type="bibr">et al. (2019)</ref>. For example, an empirical way is to use a straight-through estimator (STE), as in the sketch below, by setting the incoming gradients to the activation function (e.g., Floor) equal to its outgoing gradients, disregarding the derivative of the activation function itself. It would be interesting future work to explore efficient learning algorithms based on the FLES network.</p></div>
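<div xmlns="http://www.tei-c.org/ns/1.0"><p>The straight-through estimator mentioned above can be written in a few lines. The PyTorch-style sketch below is a generic illustration of the idea (our own sketch, not an implementation from the paper): the forward pass applies the exact floor function, while the backward pass passes the incoming gradient through unchanged, ignoring the zero derivative of floor.</p><p><![CDATA[
import torch

class FloorSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # exact (piecewise constant) floor in the forward pass
        return torch.floor(x)

    @staticmethod
    def backward(ctx, grad_output):
        # straight-through: treat floor as the identity when differentiating
        return grad_output

x = torch.tensor([0.3, 1.7, 2.9], requires_grad=True)
y = FloorSTE.apply(x).sum()
y.backward()
print(x.grad)   # tensor([1., 1., 1.]) instead of all zeros
]]></p></div>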
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Kolmogorov-Arnold Superposition Theorem</head><p>A closely related research topic is the Kolmogorov-Arnold representation theorem (KST) <ref type="bibr">Kolmogorov (1956)</ref>; <ref type="bibr">Arnold (1957)</ref>; <ref type="bibr">Kolmogorov (1957)</ref> and its approximation in the form of modern neural networks. Our FLES networks admit super approximation power with a fixed number of layers for continuous functions, and the KST exactly represents continuous functions using two hidden layers and O(d) neurons. More specifically, given any f ∈ C([0, 1]^d), the KST shows that there exist continuous functions {φ_q} and {ψ_{q,p}} such that f(x_1, ⋯, x_d) = Σ_{q=0}^{2d} φ_q(Σ_{p=1}^{d} ψ_{q,p}(x_p)). (4)</p><p>Note that the activation functions {φ_q} (also called outer functions) of the neural network in Equation (<ref type="formula">4</ref>) have to depend on the target function f, though {ψ_{q,p}} (also called inner functions) can be independent of f. The modulus of continuity of {ψ_{q,p}} can be constructed such that they moderately depend on d, but the modulus of continuity of {φ_q} would be exponentially bad in d. In sum, the outer functions are so pathological that there are no existing numerical algorithms to evaluate these activation functions, even though they are shown to exist by iterative construction <ref type="bibr">Braun and Griebel (2009)</ref>.</p><p>There has been an active research line to develop more practical network approximation based on the KST <ref type="bibr">Kůrková (1991, 1992)</ref>; <ref type="bibr">Maiorov and Pinkus (1999)</ref>; <ref type="bibr">Guliyev and Ismailov (2018)</ref>; <ref type="bibr">Montanelli and Yang (2020)</ref>; <ref type="bibr">Igelnik and Parikh (2003)</ref>; Schmidt-Hieber (2021) by relaxing the exact representation to network approximation with an ε-error. The key issue these KST-related networks attempt to address is the f-dependency of the activation functions, and the main goal is to construct neural networks conquering the curse of dimensionality in a more practical way computationally. The main idea of these variants is to apply computable activation functions independent of f to construct neural networks to approximate the outer and inner functions of the KST, resulting in a larger network that can approximate a continuous function with the desired accuracy. Using this idea, the seminal work in <ref type="bibr">Kůrková (1992)</ref> applied sigmoid activation functions and constructed two-hidden-layer networks to approximate f ∈ C([0, 1]^d). Though the activation functions are independent of f, the number of neurons scales exponentially in d and the curse of dimensionality exists. Cubic splines and piecewise linear functions have also been used to approximate the outer and inner functions of the KST in <ref type="bibr">Igelnik and Parikh (2003)</ref>; <ref type="bibr">Montanelli and Yang (2020)</ref>; Schmidt-Hieber (2021), resulting in cubic-spline networks or deep ReLU networks to approximate f ∈ C([0, 1]^d). But the approximation bounds in these works still suffer from the curse of dimensionality unless f has simple outer functions in the KST. 
It is still an open problem to characterize the class of functions with a moderate outer function in the KST.</p><p>To the best of our knowledge, the most successful construction of neural networks with f-independent activation functions conquering the curse of dimensionality is in <ref type="bibr">Maiorov and Pinkus (1999)</ref>; <ref type="bibr">Guliyev and Ismailov (2018)</ref>, where a two-hidden-layer network with O(d) neurons can approximate f ∈ C([0, 1]^d) within an arbitrary error ε. Let us briefly summarize their main ideas for obtaining such an exciting result here. 1) Identify a dense and countable subset {u_k}_{k ∈ N+} of C([−1, 1]), e.g., polynomials with rational coefficients. 2) Construct an activation function to "store" all u_k(x) for x ∈ [−1, 1]. For example, divide the domain of the activation function into countably many pieces, each of which is a connected interval of length 2 associated with a u_k; on each piece the activation function is built from u_k with carefully chosen constants a_k, b_k, and c_k such that the resulting activation function can be a sigmoid function. 3) By construction, there exists a one-hidden-layer network with width 3 and this activation function that approximates any outer or inner function in the KST with an arbitrary accuracy parameter δ. Only the parameters of the one-hidden-layer network depend on the target function and accuracy. 4) Replace the inner and outer functions in the KST with these one-hidden-layer networks to obtain a two-hidden-layer network with this activation function and width O(d) that approximates an arbitrary f ∈ C([0, 1]^d) within an arbitrary error ε. Unfortunately, the construction of the parameters of this magic network relies on the evaluation of the outer and inner functions of the KST, which is not computationally feasible even if computation with arbitrary precision is allowed.</p><p>We would like to remark that, though the approximation rate of FLES networks in this paper is worse than the approximation rate in <ref type="bibr">Maiorov and Pinkus (1999)</ref>; <ref type="bibr">Guliyev and Ismailov (2018)</ref>, our activation functions are much simpler and there are explicit formulas to specify the parameters of FLES networks. If computation with an arbitrary precision is allowed and the target function f can be arbitrarily sampled, we can specify all the weights in FLES networks. Besides, our approximation rate is sufficiently attractive since it is exponential and avoids the curse of dimensionality. For a large dimension d, the width parameter of our FLES network can be chosen as N = d, which leads to a FLES network of size O(d) with an approximation accuracy O(2^{-d}) for Lipschitz continuous functions, which is sufficiently attractive. In practice, when d is very large, N could be much smaller than d and our approximation rate is still attractive.</p><p>Finally, we list several KST-related results in Table <ref type="table">1</ref> for a quick comparison.<ref type="foot">foot_1</ref> As shown in Table <ref type="table">1</ref>, there exists a trade-off between the complexity of activation functions and the network size when the approximation error is fixed. A key advantage of our FLES networks is to use simple and explicit activation functions to attain an exponential convergence rate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">Discussion on the literature</head><p>In this section, we will discuss other recent developments of neural network approximation. Our discussion will be divided into mainly three parts according to the analysis methodology in the references: 1) functions admitting integral representations; 2) linear approximation; 3) bit extraction. In the seminal work of <ref type="bibr">Barron (1993)</ref>, its variants or generalizations <ref type="bibr">Barron and Klusowski (2018)</ref>; <ref type="bibr">E et al. (2019)</ref>; Chen and Wu (2019); <ref type="bibr">Montanelli et al. (2020)</ref>, and related references therein, d-dimensional functions of the form f(x) = ∫_Ω g(x, w) dμ(w) were considered, where Ω ⊆ R^d, μ(w) is a Lebesgue measure in w, g is an integrand determined by f, and x ∈ Ω ⊆ R^d. The above integral representation is equivalent to the expectation of a high-dimensional random function when w is treated as a random variable. By the law of large numbers, the average of N samples of the integrand leads to an approximation of f(x) with an approximation error of order N^{-1/2} with a prefactor involving C_f and μ(Ω) (see <ref type="bibr">Barron (1993)</ref>), where O(N) is the total number of parameters in the network, C_f is a d-dimensional integral with an integrand related to f, and μ(Ω) is the Lebesgue measure of Ω.</p><p>As discussed in <ref type="bibr">Barron (1993)</ref>, μ(Ω) and C_f would be exponential in d, and standard smoothness properties of f alone are not enough to remove the exponential dependence of C_f on d. Therefore, the curse of dimensionality exists in the whole approximation error, while the curse does not exist in the approximation rate in N. Linear approximation is an efficient approximation tool for smooth functions that computes the approximant of a target function via a linear projection to a Hilbert space or a Banach space as the approximant space. Typical examples include approximation via orthogonal polynomials, Fourier series expansion, etc. Inspired by the seminal work in <ref type="bibr">Yarotsky (2017)</ref>, where deep ReLU networks were constructed to approximate polynomials with exponential convergence, subsequent works in <ref type="bibr">E and Wang (2018)</ref>; <ref type="bibr">Opschoor et al. (2019)</ref>; <ref type="bibr">Montanelli and Du (2019)</ref>; <ref type="bibr">Chen and Wu (2019)</ref>; <ref type="bibr">Montanelli et al. (2020)</ref>; <ref type="bibr">Yarotsky and Zhevnerchuk (2020)</ref>; <ref type="bibr">Lu et al. (2020)</ref>; <ref type="bibr">Montanelli and Yang (2020)</ref>; <ref type="bibr">Yang and Wang (2020)</ref> have constructed deep ReLU networks to approximate various smooth function spaces. The main idea of these works is to approximate smooth functions via (piecewise) polynomial approximation first and then construct deep ReLU networks to approximate the ensemble of polynomials. But the curse of dimensionality exists since polynomial approximation cannot avoid the curse. Finally, a different approach is used in <ref type="bibr">Li et al. (to appear)</ref>. The authors of Li et al. (to appear) use a dynamical-systems-based approach to obtain a universal approximation property of residual neural networks.</p><p>The bit extraction proposed in <ref type="bibr">Bartlett et al. (1998)</ref> has been a very important technique to develop nearly optimal approximation rates of deep ReLU neural networks <ref type="bibr">Yarotsky (2018)</ref>; <ref type="bibr">Shen et al. 
(2020)</ref>; <ref type="bibr">Lu et al. (2020)</ref>; <ref type="bibr">Yang and Wang (2020)</ref>; <ref type="bibr">Zhang (2020)</ref>; <ref type="bibr">Shen et al. (to appear)</ref>, and the optimality is based on the nearly optimal VC-dimension bound of ReLU networks in <ref type="bibr">Harvey et al. (2017)</ref>. The bit extraction was also applied in <ref type="bibr">Shen et al. (2021)</ref>; <ref type="bibr">Schmidt-Hieber (2021)</ref> and this paper to develop network approximation theories. In the first step, an efficient projection map in the form of a ReLU, or a Floor-ReLU, or a FLES network is constructed to project high-dimensional points to one-dimensional points such that the high-dimensional approximation problem is reduced to a one-dimensional approximation problem. In the second step, the one-dimensional approximation problem is solved by constructing a ReLU, or a Floor-ReLU, or a FLES network, which can be efficiently compressed via the bit extraction. Although shallower neural networks can also carry out the above two steps, bit extraction can take full advantage of the power of depth and construct deep neural networks with a nearly optimal number of parameters or neurons to fulfill the above two steps.</p></div>
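<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate the sampling argument behind integral representations (the first methodology above), here is a small Python toy of our own (not taken from any cited work): for f(x) = E_{w∼N(0,I)}[cos(w·x)] = exp(−‖x‖^2/2), averaging N random features cos(w_i·x) approximates f with an error that decays roughly like N^{-1/2}, while the prefactor may still depend on d.</p><p><![CDATA[
import numpy as np

rng = np.random.default_rng(1)

def f_exact(x):
    # f(x) = E_w[cos(w . x)] for w ~ N(0, I_d) equals exp(-||x||^2 / 2)
    return np.exp(-0.5 * np.dot(x, x))

def f_montecarlo(x, n_samples):
    # average of n_samples random features cos(w_i . x)
    ws = rng.standard_normal((n_samples, x.size))
    return np.mean(np.cos(ws @ x))

x = np.array([0.3, -0.2, 0.5, 0.1])
for n in (10, 1_000, 100_000):
    err = abs(f_montecarlo(x, n) - f_exact(x))
    print(n, err)   # error shrinks roughly like 1/sqrt(n)
]]></p></div>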
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Theoretical Analysis</head><p>In this section, we first introduce basic notations in this paper in Section 3.1. Then we prove Theorem 1.1 and Corollary 2.3 in Section 3.2 based on Theorem 3.1, which is proved in Section 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Notations</head><p>The main notations of this paper are listed as follows.</p><p>• Vectors and matrices are denoted in a bold font. Standard vectorization is adopted in the matrix and vector computation. For example, adding a scalar and a vector means adding the scalar to each entry of the vector.</p><p>• Let R denote the set of real numbers.</p><p>• Let Z, N, and N+ denote the set of integers, the set of natural numbers, and the set of all positive integers, respectively, i.e., Z = {0, 1, 2, ⋯} ∪ {−1, −2, −3, ⋯}, N = {0, 1, 2, ⋯}, and N+ = {1, 2, 3, ⋯}.</p><p>• The floor function (Floor) is defined as ⌊x⌋ := max{n : n ≤ x, n ∈ Z} for any x ∈ R.</p><p>• For θ ∈ [0, 1), suppose its binary representation is θ = Σ_{ℓ=1}^{∞} θ_ℓ 2^{-ℓ} with θ_ℓ ∈ {0, 1}; we introduce a special notation bin0.θ_1θ_2⋯θ_L to denote the L-term binary representation of θ, i.e., bin0.θ_1θ_2⋯θ_L := Σ_{ℓ=1}^{L} θ_ℓ 2^{-ℓ}.</p><p>• The expression "a network with width N and depth L" means -The maximum width of this network over all hidden layers is no more than N.</p><p>-The number of hidden layers of this network is no more than L.</p></div>
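<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal Python helper (our own, mirroring the notation above) for the L-term binary representation bin0.θ_1θ_2⋯θ_L:</p><p><![CDATA[
def bin0(bits):
    """bin0.theta_1 theta_2 ... theta_L := sum_{l=1}^{L} theta_l * 2**(-l)."""
    assert all(b in (0, 1) for b in bits)
    return sum(b * 2.0 ** (-(l + 1)) for l, b in enumerate(bits))

print(bin0([1, 0, 1]))   # 0.625 = 1/2 + 0/4 + 1/8
]]></p></div>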
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Proof of Theorem 1.1 and Corollary 2.3</head><p>In this section, we will prove Theorem 1.1 and Corollary 2.3. To this end, we first introduce Theorem 3.1, which works only on [0, 1)^d and can be regarded as a weaker variant of Theorem 1.1. Theorem 3.1. Given an arbitrary continuous function f on [0, 1]^d, for any N ∈ N+, there exist a_1, a_2, ⋯, a_N such that |φ(x) − f(x)| ≤ 2ω_f(√d)2^{-N} + ω_f(√d·2^{-N}) for any x ∈ [0, 1)^d, where φ is defined by a formula in a_1, a_2, ⋯, a_N given in the proof in Section 3.3.</p><p>We will prove Theorem 1.1 and Corollary 2.3 based on Theorem 3.1, the proof of which can be found in Section 3.3.</p><p>Let us first prove Theorem 1.1. Define f̃(x) := f(2x) for any x such that 2x ∈ [0, 1]^d; then ω_{f̃}(r) = ω_f(2r) for any r ≥ 0. By applying Theorem 3.1 to f̃, there exist a_1, a_2, ⋯, a_N and a corresponding function φ̃ such that |φ̃(x) − f̃(x)| ≤ 2ω_{f̃}(√d)2^{-N} + ω_{f̃}(√d·2^{-N}) for any x ∈ [0, 1)^d, (5) where ω_{f̃}(r) = ω_f(2r). Define φ(x) := φ̃(x/2) for any x ∈ R^d. Then by Equation (<ref type="formula">5</ref>), for any x ∈ [0, 1]^d (so that x/2 ∈ [0, 1/2]^d ⊆ [0, 1)^d), |φ(x) − f(x)| = |φ̃(x/2) − f̃(x/2)| ≤ 2ω_f(2√d)2^{-N} + ω_f(2√d·2^{-N}), where φ(x) := φ̃(x/2) can be represented by a (σ_1, σ_2, σ_3)-activated network with width max{d, N} and three hidden layers.</p><p>With the discussion above, we have proved Theorem 1.1.</p><p>Next, we present the proof of Corollary 2.3 below. The idea is to map E ⊆ [−R, R]^d into [0, 1)^d via the change of variables x ↦ (x + R)/(3R) and to apply Theorem 3.1 to the correspondingly rescaled function f̃, for which there exist a_1, a_2, ⋯, a_N and a corresponding function φ̃ satisfying the error bound of Theorem 3.1 with ω_{f̃} in place of ω_f, (6) where ω_{f̃}(r) = ω_f^E(3Rr), for any r ≥ 0. Define φ(x) := φ̃((x + R)/(3R)) for any x ∈ R^d. Then by Equation (<ref type="formula">6</ref>), for any x ∈ E, |φ(x) − f(x)| ≤ C_f·2^{-N} + ω_f^E(3R√d·2^{-N}), where C_f (essentially 2ω_f^E(3R√d)) is a constant essentially determined by f.</p><p>With the discussion above, we have proved Corollary 2.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Proof of Theorem 3.1</head><p>To prove Theorem 3.1, we first present the proof sketch. In short, we construct piecewise constant functions to approximate continuous functions. There are five key steps in our construction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>1. Divide [0, 1)^d into J^d cubes {Q_β}_β for β ∈ {0, 1, ⋯, J − 1}^d and denote x_β as the vertex of Q_β with minimum ‖·‖_1 norm, where J is an integer determined later. See Figure <ref type="figure">3</ref> for the illustrations of Q_β and x_β.</p><p>2. Construct a vector-valued function Φ_1 mapping each x ∈ Q_β to β. 3. Construct a function φ_2 bijectively mapping each β ∈ {0, 1, ⋯, J − 1}^d to an index in {1, 2, ⋯, J^d}. 4. Construct a function φ_3 mapping each index φ_2(β) approximately to the value of (a normalized version of) f at x_β. 5. Define φ̃ := φ_3 ∘ φ_2 ∘ Φ_1 and, finally, re-scale and shift φ̃ to obtain the final function φ approximating f well.</p><p>Steps 1 and 5 are straightforward. To implement Step 2, we introduce σ_1 since it can help to significantly simplify the construction of the vector-valued projecting function Φ_1. The implementation of Step 3 is based on the J-ary representation, namely, we define φ_2(x) := 1 + Σ_{i=1}^{d} J^{i−1} x_i. The implementation of Step 4 relies on a bit-extraction idea in the spirit of <ref type="bibr">Yarotsky (2018)</ref>. To extract sufficiently many bits with a limited neuron budget, we introduce two powerful activation functions σ_2 and σ_3, as shown in the proposition below.</p><p>Proposition 3.2. Given any K ∈ N+ and arbitrary θ_1, θ_2, ⋯, θ_K ∈ {0, 1}, it holds that σ_3(2^{k−1}·bin0.θ_1θ_2⋯θ_K) = θ_k for any k ∈ {1, 2, ⋯, K}.</p><p>Proof. Since θ_j ∈ {0, 1} for j ∈ {1, 2, ⋯, K}, we have 2^{k−1}·bin0.θ_1θ_2⋯θ_K = Σ_{j=1}^{k−1} 2^{k−j−1}·θ_j + θ_k/2 + Σ_{j=k+1}^{K} 2^{k−j−1}·θ_j.<ref type="foot">foot_2</ref> (7)</p><p>Clearly, the first term in Equation (7), Σ_{j=1}^{k−1} 2^{k−j−1}·θ_j, is a non-negative integer since θ_j ∈ {0, 1} for any j ∈ {1, 2, ⋯, K}. As for the third term in Equation (7), we have 0 ≤ Σ_{j=k+1}^{K} 2^{k−j−1}·θ_j ≤ Σ_{j=k+1}^{K} 2^{k−j−1} = 1/2 − 2^{k−K−1} &lt; 1/2. Therefore, by Equation (7), we have 2^{k−1}·bin0.θ_1θ_2⋯θ_K − ⌊2^{k−1}·bin0.θ_1θ_2⋯θ_K⌋ ∈ [θ_k/2, θ_k/2 + 1/2). (8)</p><p>If θ_k = 0, by Equation (8), the fractional part of 2^{k−1}·bin0.θ_1θ_2⋯θ_K lies in [0, 1/2), so σ_3(2^{k−1}·bin0.θ_1θ_2⋯θ_K) = 0 = θ_k. Similarly, if θ_k = 1, by Equation (8), the fractional part lies in [1/2, 1), so σ_3(2^{k−1}·bin0.θ_1θ_2⋯θ_K) = 1 = θ_k. So we finish the proof.</p><p>We would like to point out that Proposition 3.2 indicates that the VC-dimension of the function space {f : f(x) = σ_3(a·x), for a ∈ R} is infinity, which implies that the VC-dimension of FLES networks is also infinity. As discussed previously in Section 2.2, having an infinite VC-dimension is a necessary condition for FLES networks to attain super approximation power.</p><p>With Proposition 3.2 in hand, we are ready to prove Theorem 3.1.</p><p>Proof of Theorem 3.1. The proof consists of five steps.</p><p>Step 1: Set up. Assume f is not a constant function since it is a trivial case. Then ω_f(r) &gt; 0 for any r &gt; 0. Clearly, f̃ := (f − f(0) + ω_f(√d))/(2ω_f(√d)) satisfies f̃(x) ∈ [0, 1] for any x ∈ [0, 1]^d and ω_{f̃}(r) = ω_f(r)/(2ω_f(√d)) for any r ≥ 0. Set J = 2^N and divide [0, 1)^d into J^d cubes {Q_β}_β. To be exact, define x_β := β/J and Q_β := {x = (x_1, x_2, ⋯, x_d) : x_i ∈ [β_i/J, (β_i + 1)/J) for i = 1, 2, ⋯, d} for each β = (β_1, β_2, ⋯, β_d) ∈ {0, 1, ⋯, J − 1}^d. See Figure <ref type="figure">3</ref> for illustrations of Q_β and x_β for β ∈ {0, 1, 2, 3}^2.</p><p>Step 2: Construct Φ_1 mapping x ∈ Q_β to β. Define Φ_1(x) := (σ_1(Jx_1), σ_1(Jx_2), ⋯, σ_1(Jx_d)) for any x = (x_1, x_2, ⋯, x_d) ∈ R^d. Then, for any x ∈ Q_β and each β ∈ {0, 1, ⋯, J − 1}^d, we have Φ_1(x) = β.</p><p>Step 3: Construct φ_2 bijectively mapping β ∈ {0, 1, ⋯, J − 1}^d to φ_2(β) ∈ {1, 2, ⋯, J^d}.</p><p>Inspired by the J-ary representation, we define a linear function φ_2(x) := 1 + Σ_{i=1}^{d} J^{i−1} x_i for each x = (x_1, x_2, ⋯, x_d) ∈ R^d. Then φ_2 is a bijection from {0, 1, ⋯, J − 1}^d to {1, 2, ⋯, J^d}.</p><p>Step 4: Construct φ_3 mapping φ_2(β) ∈ {1, 2, ⋯, J^d} approximately to f̃(x_β).</p><p>For each k ∈ {1, 2, ⋯, J^d}, there exists a unique β ∈ {0, 1, ⋯, J − 1}^d such that φ_2(β) = k. 
Thus, define ξ_k := f̃(x_β) for this unique β. For each k ∈ {1, 2, ⋯, J^d}, since ξ_k ∈ [0, 1], there exist θ_{k,1}, θ_{k,2}, ⋯, θ_{k,N} ∈ {0, 1} such that |ξ_k − bin0.θ_{k,1}θ_{k,2}⋯θ_{k,N}| ≤ 2^{-N}. For each j ∈ {1, 2, ⋯, N}, by Proposition 3.2 (set K = J^d therein), there exists a_j = bin0.θ_{1,j}θ_{2,j}⋯θ_{J^d,j} such that σ_3(2^{k−1}·a_j) = θ_{k,j} for any k ∈ {1, 2, ⋯, J^d}. Then, for any k ∈ {1, 2, ⋯, J^d}, we have |φ_3(k) − ξ_k| ≤ 2^{-N}, where φ_3(k) := Σ_{j=1}^{N} 2^{-j}·σ_3(2^{k−1}·a_j) = bin0.θ_{k,1}θ_{k,2}⋯θ_{k,N}.</p><p>Step 5: Define φ̃ := φ_3 ∘ φ_2 ∘ Φ_1 approximating f̃ well, and re-scale and shift φ̃ to obtain φ approximating f well.</p><p>Define φ̃ := φ_3 ∘ φ_2 ∘ Φ_1. By the estimates in Steps 2-4, we have, for any x ∈ Q_β and each β ∈ {0, 1, ⋯, J − 1}^d, |φ̃(x) − f̃(x)| ≤ |φ̃(x) − f̃(x_β)| + |f̃(x_β) − f̃(x)| ≤ 2^{-N} + ω_{f̃}(√d/J). Defining φ := 2ω_f(√d)·φ̃ + f(0) − ω_f(√d), it follows from J = 2^N and the definitions of Φ_1, φ_2, and φ_3 that |φ(x) − f(x)| ≤ 2ω_f(√d)·2^{-N} + ω_f(√d·2^{-N}) for any x ∈ [0, 1)^d and that φ can be computed by a (σ_1, σ_2, σ_3)-activated network with width max{d, N} and three hidden layers. So we finish the proof.</p></div>
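<div xmlns="http://www.tei-c.org/ns/1.0"><p>Proposition 3.2 is easy to verify numerically: encode the bits θ_1, ⋯, θ_K into the single number a = bin0.θ_1θ_2⋯θ_K and recover θ_k as σ_3(2^{k−1}a). The short Python check below is an illustrative sketch under the definitions of σ_3 and bin0 used above; it relies on exact dyadic arithmetic, which double-precision floats provide only for moderate K.</p><p><![CDATA[
import math, random

def step(x):
    return 1 if x >= 0 else 0

def sigma3(x):
    # sigma3(x) = T(x - floor(x) - 1/2) with T the step function
    return step(x - math.floor(x) - 0.5)

def bin0(bits):
    return sum(b * 2.0 ** (-(j + 1)) for j, b in enumerate(bits))

random.seed(0)
K = 20
bits = [random.randint(0, 1) for _ in range(K)]
a = bin0(bits)
extracted = [sigma3(2.0 ** (k - 1) * a) for k in range(1, K + 1)]
print(extracted == bits)   # True: sigma3(2^{k-1} a) recovers theta_k
]]></p></div>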
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Approximation with continuous activation functions</head><p>As discussed previously, our FLES networks can attain super approximation power. However, two activation functions in FLES networks are piecewise constant functions, which would lead to challenges in numerical algorithm design. It is interesting to explore continuous activation functions achieving similar results. To this end, we introduce three new activation functions as follows. First, for any δ ∈ (0, 1), we define ς_{1,δ} as the continuous, piecewise linear function that agrees with ⌊x⌋ on [n, n + 1 − δ] and increases linearly from n to n + 1 on [n + 1 − δ, n + 1], for any n ∈ Z.</p><p>In fact, ς_{1,δ} can be regarded as a "continuous version" of the floor function.</p><p>Next, we define ς_2(x) := 3^x and ς_3(x) := T(cos(2πx)), for any x ∈ R, where T is a continuous piecewise linear function. ς_2 plays the same role as σ_2(x) = 2^x, and ς_3 is essentially a "continuous version" of σ_3 in FLES networks. With these three activation functions in hand, we have the following theorem.</p><p>Theorem 4.1. Let f be an arbitrary continuous function defined on [0, 1]^d. For any δ ∈ (0, 1), N ∈ N+, and p ∈ [1, ∞), there exist a_1, a_2, ⋯, a_N such that the corresponding (ς_{1,δ}, ς_2, ς_3)-activated network φ with width max{d, N} and three hidden layers approximates f in the L^p-norm on [0, 1]^d with an error that decays exponentially in N, up to a term that vanishes as δ → 0, where φ is defined by a formula in a_1, a_2, ⋯, a_N given in the proof below.</p><p>The approximation error in Theorem 4.1 is characterized by the L^p-norm for p ∈ [1, ∞) instead of a pointwise error estimate as in Theorem 1.1. By using ideas in <ref type="bibr">Lu et al. (2020)</ref>; <ref type="bibr">Zhang (2020)</ref>, we can extend this result to the L^∞-norm. However, this extension requires 2d + 3 hidden layers instead of 3 hidden layers. Since our focus here is the approximation using three hidden layers, we will leave this extension as future work.</p><p>To prove Theorem 4.1, we need the following proposition.</p><p>Proposition 4.2. Given any K ∈ N+ and arbitrary θ_1, θ_2, ⋯, θ_K ∈ {0, 1}, there exists a ∈ [0, 1] such that ς_3(3^k·a) = θ_k for any k ∈ {1, 2, ⋯, K}.</p><p>Proof. Since θ_j ∈ {0, 1} for j ∈ {1, 2, ⋯, K}, the argument parallels that of Proposition 3.2, with the binary representation replaced by a ternary one adapted to ς_3.</p><p>Proof of Theorem 4.1. The proof consists of five steps.</p><p>Step 1: Set up.</p><p>Assume f is not a constant function since it is a trivial case. Then ω_f(r) &gt; 0 for any r &gt; 0. Define f̃ := (f − f(0) + ω_f(√d))/(2ω_f(√d)). (15) It follows that f̃(x) ∈ [0, 1] for any x ∈ [0, 1]^d. Set J = 2^N and divide [0, 1]^d into J^d cubes {Q_β}_β and a small region Λ([0, 1]^d, J, δ). To be exact, define x_β := β/J and let Q_β be a closed cube with x_β as the vertex closest to the origin, chosen so that the cubes are pairwise disjoint and the remaining region Λ([0, 1]^d, J, δ) has a measure controlled by δ. See Figure <ref type="figure">5</ref> for illustrations of Λ([0, 1]^d, J, δ), Q_β, and x_β for J = 4 and d = 2.</p><p>Step 2: Construct Φ_1 mapping x ∈ Q_β to β, with ς_{1,δ} in place of σ_1.</p><p>Step 3: Construct φ_2 bijectively mapping β ∈ {0, 1, ⋯, J − 1}^d to φ_2(β) ∈ {1, 2, ⋯, J^d}.</p><p>Inspired by the J-ary representation, we define an affine linear map φ_2(x) := 1 + Σ_{i=1}^{d} J^{i−1} x_i, for each x = (x_1, x_2, ⋯, x_d) ∈ R^d.</p><p>Then φ_2 is a bijection from {0, 1, ⋯, J − 1}^d to {1, 2, ⋯, J^d}.</p><p>Step 4: Construct φ_3 mapping φ_2(β) ∈ {1, 2, ⋯, J^d} approximately to f̃(x_β).</p><p>For each k ∈ {1, 2, ⋯, J^d}, there exists a unique β ∈ {0, 1, ⋯, J − 1}^d such that φ_2(β) = k. 
Thus, define ξ_k := f̃(x_β) for this unique β. For each k ∈ {1, 2, ⋯, J^d}, since ξ_k ∈ [0, 1], there exist θ_{k,1}, θ_{k,2}, ⋯, θ_{k,N} ∈ {0, 1} such that |ξ_k − bin0.θ_{k,1}θ_{k,2}⋯θ_{k,N}| ≤ 2^{-N}. For each j ∈ {1, 2, ⋯, N}, by Proposition 4.2 (set K = J^d therein), there exists a_j ∈ [0, 1] such that ς_3(3^k·a_j) = θ_{k,j} for any k ∈ {1, 2, ⋯, J^d}, and hence φ_3(k) := Σ_{j=1}^{N} 2^{-j}·ς_3(3^k·a_j) = bin0.θ_{k,1}θ_{k,2}⋯θ_{k,N} satisfies |φ_3(k) − ξ_k| ≤ 2^{-N} for any k ∈ {1, 2, ⋯, J^d}.</p><p>Step 5: Define φ̃ := φ_3 ∘ φ_2 ∘ Φ_1 approximating f̃ well, and re-scale and shift φ̃ to obtain φ approximating f well. Define φ̃ := φ_3 ∘ φ_2 ∘ Φ_1. By the estimates in Steps 2-4, we have, for any x ∈ Q_β and each β ∈ {0, 1, ⋯, J − 1}^d, |φ̃(x) − f̃(x)| ≤ 2^{-N} + ω_{f̃}(√d/J).</p><p>Finally, define φ := 2ω_f(√d)·φ̃ + f(0) − ω_f(√d). Equation (<ref type="formula">15</ref>) implies ω_f(r) = 2ω_f(√d)·ω_{f̃}(r) for any r ≥ 0, deducing |φ(x) − f(x)| ≤ 2ω_f(√d)·2^{-N} + ω_f(√d·2^{-N}) for any x ∈ ⋃_{β ∈ {0, 1, ⋯, J − 1}^d} Q_β. By the definition of Λ([0, 1]^d, J, δ), the remaining region has a small measure controlled by δ, on which the error remains bounded, and the desired L^p-norm estimate follows. By the definitions of Φ_1, φ_2, and φ_3, we have that φ can be computed by a (ς_{1,δ}, ς_2, ς_3)-activated network with width max{d, N} and three hidden layers. So we finish the proof.</p></div>
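<div xmlns="http://www.tei-c.org/ns/1.0"><p>The continuous activation functions of this section can be prototyped directly. The Python sketch below implements one natural realization of the "continuous floor" ς_{1,δ} described above, together with ς_2(x) = 3^x; it is an illustrative sketch only, and the precise piecewise definition used in the paper may differ.</p><p><![CDATA[
import math

def varsigma1(x, delta):
    # agrees with floor(x) on [n, n + 1 - delta] and ramps linearly up to n + 1
    # on [n + 1 - delta, n + 1]; a hypothetical realization of the description above
    n = math.floor(x)
    t = x - n
    if t <= 1.0 - delta:
        return float(n)
    return n + (t - (1.0 - delta)) / delta

def varsigma2(x):
    return 3.0 ** x

delta = 0.1
for x in (0.2, 0.93, 1.0, 2.5):
    print(x, varsigma1(x, delta), math.floor(x))   # deviates from floor only near integers
]]></p></div>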
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This paper has introduced a theoretical framework to show that three hidden layers are enough for neural network approximation to achieve exponential convergence and avoid the curse of dimensionality for approximating functions as general as (Hölder) continuous functions. The key idea is to leverage the power of multiple simple activation functions: the floor function (⌊x⌋), the exponential function (2^x), the step function (1_{x≥0}), or their compositions. This new class of networks is called the FLES network. Given a Lipschitz continuous function f on [0, 1]^d, it was shown by construction that FLES networks with width max{d, N} and three hidden layers admit a uniform approximation rate 6λ√d·2^{-N}, where λ is the Lipschitz constant of f. More generally, for an arbitrary continuous function f on [0, 1]^d with a modulus of continuity ω_f(·), the constructive approximation rate is 2ω_f(2√d)2^{-N} + ω_f(2√d·2^{-N}). We also extend such a result to general bounded continuous functions on a bounded set E ⊆ R^d. The results in this paper provide a theoretical lower bound of the power of FLES networks. Whether or not this bound is achievable in actual computation relies on advanced algorithm design as a separate line of research. Finally, we have also derived similar approximation results in the L^p-norm for p ∈ [1, ∞) using continuous activation functions.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>One may also define VCDim(F) := VCDim(T∘F), where T(t) := 1 if t &gt; 0 and T(t) := 0 if t ≤ 0.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>The result in<ref type="bibr">Shen et al. (2019)</ref> is for H&#246;lder functions, but can be easily generalized to general continuous functions.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>By convention, Σ_{j=n}^{m} a_j = 0 if n &gt; m, no matter what a_j is for each j.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3"><p>ς_3(3^k · a_j) = θ_{k,j}, for any k ∈ {1, 2, ⋯, J^d}.</p></note>
		</body>
		</text>
</TEI>
