<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>On Enhancing Expressive Power via Compositions of Single Fixed-Size ReLU Network</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the 40th International Conference on Machine Learning</publisher>
				<date>10/26/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10552232</idno>
					<idno type="doi"></idno>
					
					<author>Shijun Zhang</author><author>Jianfeng Lu</author><author>Hongkai Zhao</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper explores the expressive power of deep neural networks through the framework of function compositions. We demonstrate that the repeated compositions of a single fixed-size ReLU network exhibit surprising expressive power, despite the limited expressive capabilities of the individual network itself. Specifically, we prove by construction that L_2 ∘ g^{∘r} ∘ L_1 can approximate 1-Lipschitz continuous functions on [0, 1]^d with an error O(r^{−1/d}), where g is realized by a fixed-size ReLU network, L_1 and L_2 are two affine linear maps matching the dimensions, and g^{∘r} denotes the r-times composition of g. Furthermore, we extend such a result to generic continuous functions on [0, 1]^d with the approximation error characterized by the modulus of continuity. Our results reveal that a continuous-depth network generated via a dynamical system has immense approximation power even if its dynamics function is time-independent and realized by a fixed-size ReLU network.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years, there has been a notable increase in the exploration of the expressive power of deep neural networks, driven by their impressive success in various learning tasks. The increasing size of deep neural network models poses significant challenges in terms of training and computational requirements. Consequently, numerous techniques have emerged to compress and expedite these models, with the goal of alleviating the associated computational complexity. These techniques predominantly center around parameter-sharing schemes, which efficiently reduce the number of parameters, leading to reductions in memory usage and computation costs. This paper explores the expressive power of deep neural networks, approaching it from the standpoint of function compositions. We focus on a novel network architecture constructed through the repeated compositions of a single fixed-size network, enabling parameter sharing. To illustrate our ideas and concepts, we specifically utilize the rectified linear unit (ReLU) activation function. Our investigation reveals that the repeated compositions of a single fixed-size ReLU network demonstrate surprising expressive power, even though the individual network itself has limited expressive capabilities. These findings provide new insights into the potential of parameter-sharing schemes in neural networks, showcasing their ability to reduce computational complexity while preserving a high level of expressive power.</p><p>For ease of notation, we employ NN{N, L; R d1 &#8593; R d2 } to represent the set of functions &#969; : R d1 &#8593; R d2 that can be realized by ReLU networks of width N &#8595; N + and depth L &#8595; N + . In our context, the width of a network means the maximum number of neurons in a hidden layer and the depth refers to the number of hidden layers. Let g &#8594;r denote the r-times composition of g, e.g., g &#8594;3 = g &#8594; g &#8594; g. In the degenerate case, g &#8594;0 represents the identity map. We use C([0, 1] d ) to denote the set of continuous functions on [0, 1] d and define the modulus of continuity of a continuous function f &#8595; C([0, 1] d ) via &#969; f (t) := sup |f (x) &#8594; f (y)| : &#8593;x &#8594; y&#8593; 2 &#8595; t, x, y &#8596; [0, 1] d for any t &#8596; 0. Under these settings, we can construct L 2 &#8594; g &#8594;r &#8594; L 1 to approximate a continuous function f &#8595; C([0, 1] d ) with an error O &#969; f (r &#8593;1/d ) , where L 1 and L 2 are two affine linear maps and g is realized by a fixed-size ReLU network, as shown in the theorem below. Theorem 1.1. Given any f &#8595; C([0, 1] d ), r &#8595; N + , and p &#8595; [1, &#8599;), there exist g &#8595; NN{69d + 48, 5; R 5d+5 &#8593; R 5d+5 } and two affine linear maps L 1 : R d &#8593; R 5d+5 and L 2 : R 5d+5 &#8593; R such that</p><p>It should be noted that in Theorem 1.1, the affine linear maps L 1 and L 2 are used to ensure matching dimensions and can be replaced by various other functions that achieve the desired input and output dimensions. In our research, we choose a straightforward approach by considering them as affine linear maps. In Theorem 1.1, we propose a novel network architecture constructed via repeated compositions of a single sub-network, which will be referred to as repeated-composition networks (RCNets). 
The hypothesis space of the RCNet corresponding to g is defined as H(g) := {L_2 ∘ g^{∘r} ∘ L_1 : r ∈ N, L_1 and L_2 are affine}.</p><p>Then we have an immediate corollary as follows.</p><p>Corollary 1.2. Given any p ∈ [1, ∞), suppose H(g) is defined as mentioned above and set G = NN{69d + 48, 5; R^{5d+5} → R^{5d+5}}. Then H = ∪_{g∈G} H(g) is dense in L^p([0, 1]^d) in terms of the L^p-norm.</p><p>The proof of Corollary 1.2 is straightforward. Theorem 1.1 implies H is dense in C([0, 1]^d) in terms of the L^p-norm for any p ∈ [1, ∞). Recall that C([0, 1]^d) is dense in the Lebesgue space L^p([0, 1]^d) for any p ∈ [1, ∞). Therefore, we can conclude that H is dense in L^p([0, 1]^d) in terms of the L^p-norm for any p ∈ [1, ∞). Furthermore, it should be noted that the set G in Corollary 1.2 is generated by a fixed-size ReLU network. As a result, G is a set of continuous piecewise linear functions with (at most) a fixed number of pieces.</p><p>It is important to note that the approximation error in Theorem 1.1 is quantified by the L^p-norm for any p ∈ [1, ∞). However, it is possible to extend this result to the L^∞-norm as well, although the associated constants will be significantly larger.</p><p>Theorem 1.3. Given any f ∈ C([0, 1]^d) and r ∈ N_+, there exist g ∈ NN{4^{d+5} d̃, 3 + 2d; R^d̃ → R^d̃} and two affine linear maps L_1 : R^d → R^d̃ and L_2 : R^d̃ → R such that |L_2 ∘ g^{∘r} ∘ L_1(x) − f(x)| ≤ 6√d · ω_f(r^{−1/d})</p><p>for any x ∈ [0, 1]^d, where d̃ = 3^d(5d + 4) − 1.</p><p>The main ideas for proving Theorems 1.1 and 1.3 are provided in Section 3 and the detailed proofs of these two theorems can be found in Section A of the appendix.</p><p>In general, it is challenging to simplify the approximation error in Theorem 1.1 (or 1.3) due to the complexity of ω_f(·). However, in the case of special target function spaces such as Hölder continuous function spaces, one can simplify the approximation error to make its dependence on r explicit. If f is a Hölder continuous function on [0, 1]^d of order α ∈ (0, 1] with a Hölder constant ν &gt; 0, we have |f(x) − f(y)| ≤ ν‖x − y‖_2^α for any x, y ∈ [0, 1]^d,</p><p>implying ω_f(t) ≤ νt^α for any t ≥ 0. Thus, the approximation error in Theorem 1.1 (or 1.3) can be simplified to 6ν√d · r^{−α/d}. In the special case of α = 1, where f is a Lipschitz continuous function with a Lipschitz constant ν &gt; 0, the approximation error can be further simplified to 6ν√d · r^{−1/d}.</p><p>A constant-width ReLU network of depth O(r) can be represented as L_2 ∘ g_r ∘ ··· ∘ g_2 ∘ g_1 ∘ L_1, where L_1 and L_2 are affine linear maps and each g_i is a fixed-size ReLU network. It has been shown in <ref type="bibr">(Shen et al., 2020;</ref><ref type="bibr">Yarotsky, 2018;</ref><ref type="bibr">Zhang, 2020)</ref> that the optimal approximation error is O(r^{−2/d}) when using L_2 ∘ g_r ∘ ··· ∘ g_2 ∘ g_1 ∘ L_1 to approximate 1-Lipschitz continuous functions on [0, 1]^d. 
In contrast, our RCNet architecture L_2 ∘ g^{∘r} ∘ L_1 can approximate 1-Lipschitz continuous functions on [0, 1]^d with an error O(r^{−1/d}), where g is a fixed-size ReLU network.</p><p>That means, at the price of a slightly worse approximation error, our RCNet architecture L_2 ∘ g^{∘r} ∘ L_1 essentially shares most of the parameters in L_2 ∘ g_r ∘ ··· ∘ g_2 ∘ g_1 ∘ L_1 and reduces the number of trainable parameters to a constant. Furthermore, our RCNet architecture L_2 ∘ g^{∘r} ∘ L_1 is anticipated to exhibit improved gradient behavior compared to L_2 ∘ g_r ∘ ··· ∘ g_2 ∘ g_1 ∘ L_1, as the gradient with respect to the parameters in g is less likely to vanish for larger values of r.</p><p>Next, we point out some relations between our approximation results and dynamical systems. Our results reveal that a continuous-depth network generated via a dynamical system has enormous approximation power even if the dynamics is time-invariant and realized by a fixed-size ReLU network. Let us now delve into further details regarding this matter. A dynamical system is generally described by an ordinary differential equation (ODE) of the form z'(t) = F(z(t), t, θ) with z(0) = y,</p><p>where F : R^{n+1} × Θ → R^n is the dynamics function of this dynamical system, parameterized with θ ∈ Θ, where Θ is the parameter space.</p><p>For any y = z_0 ∈ R^n, z(T) can be regarded as a function of y and we denote this function by Φ(·, θ) : R^n → R^n. Such a map is known as the flow map (or Poincaré map) of the dynamical system (1). Then we can use L_2 ∘ Φ(·, θ) ∘ L_1 to approximate the target function,</p><p>where L_1 and L_2 are two affine linear maps matching the dimensions.</p><p>Choose a large S ∈ N_+ and set τ = T/S. It follows from ODE (1) that</p><p>We denote z_s as the numerical solution and use it to approximate the true solution z(τs) for s = 0, 1, ···, S. By using the forward Euler method to discretize ODE (1), we have z_{s+1} = z_s + τ F(z_s, τs, θ).</p><p>Such an iteration step can be regarded as a residual network <ref type="bibr">(He et al., 2016)</ref>. Thus, a dynamical system can be viewed as a continuous-time version of a residual network. The network generated via a dynamical system is generally called a continuous-depth network. The function L_2 ∘ Φ(·, θ) ∘ L_1 mentioned above is indeed generated by a continuous-depth network. As we know, z_S can approximate z(τS) = z(T) arbitrarily well for sufficiently large S under some proper conditions on the dynamics function F.</p><p>Suppose</p><p>Then, we have</p><p>Our results imply that L_2 ∘ g_θ^{∘S} ∘ L_1 has immense approximation power even if g_θ is realized by a fixed-size ReLU network, where L_1 and L_2 are affine maps matching the dimensions. Define</p><p>where g_θ is realized by a fixed-size ReLU network. Then, the function L_2 ∘ Φ(·, θ) ∘ L_1, modelled by a continuous-depth network, can approximate L_2 ∘ g_θ^{∘S} ∘ L_1 well and hence also has immense approximation power. The definition of the dynamics function F in Equation (2) implies that F(y, t, θ) is independent of t for any (y, θ) ∈ R^n × Θ and F can also be realized by a fixed-size ReLU network. 
In short, we have shown that a continuous-depth network can also have immense approximation power even if its dynamics function is time-independent and realized by a fixed-size ReLU network. One may refer to Section 2.1 for a further discussion on dynamical systems.</p><p>The remaining sections of this paper are structured as follows. In Section 2, we discuss the connections between our results and existing work. Section 3 outlines the main ideas behind the proofs of Theorems 1.1 and 1.3. Next, in Section 4, we provide two simple experiments to numerically validate our theoretical results. Finally, Section 5 concludes this paper with a brief discussion.</p></div>
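<p>To make the repeated-composition idea concrete, the following minimal PyTorch-style sketch applies one shared ReLU block g a total of r times between two affine maps. The widths, block depth, and class name here are illustrative choices rather than the specific sizes (e.g., width 69d + 48 and depth 5) used in Theorem 1.1.</p>
<p><![CDATA[
import torch
import torch.nn as nn

class RCNet(nn.Module):
    """L2 o g^{o r} o L1 with a single ReLU block g shared across all r compositions."""
    def __init__(self, d_in: int, d_hidden: int, d_out: int, r: int, block_depth: int = 5):
        super().__init__()
        self.r = r
        self.L1 = nn.Linear(d_in, d_hidden)    # affine map L1 : R^{d_in} -> R^{d_hidden}
        self.L2 = nn.Linear(d_hidden, d_out)   # affine map L2 : R^{d_hidden} -> R^{d_out}
        layers = []
        for _ in range(block_depth):           # fixed-size ReLU block g : R^{d_hidden} -> R^{d_hidden}
            layers += [nn.Linear(d_hidden, d_hidden), nn.ReLU()]
        self.g = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.L1(x)
        for _ in range(self.r):                # repeated composition: the same block g applied r times
            z = self.g(z)
        return self.L2(z)

# Increasing r deepens the network without adding any parameters.
model = RCNet(d_in=2, d_hidden=16, d_out=1, r=4)
y = model(torch.rand(8, 2))
print(y.shape)  # torch.Size([8, 1])
]]></p>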
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In this section, we will provide a comprehensive overview of previous research that is pertinent to our results. We commence by emphasizing the correlation between deep learning and dynamical systems. Subsequently, we delve into the subject of parameter-sharing schemes in neural networks. Finally, we compare our results with existing research from the standpoint of function approximation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Deep Learning via Dynamical Systems</head><p>A dynamical system is a mathematical framework that describes the evolution of a system over time. Its origins can be traced back to Newtonian mechanics. For a comprehensive overview of the history of dynamical systems, one may refer to <ref type="bibr">(Holmes, 2007)</ref>. In general, a dynamical system consists of two fundamental components. First, we have the state variable(s), which represent the variables that fully describe the state of the system. These variables capture the relevant properties or quantities of interest in the system. The second component is the time evolution rule, which specifies how the future states of the system evolve from the current state. It provides the mathematical equations or rules that govern the dynamics of the system over time. By studying the behavior and properties of dynamical systems, we gain insights into how systems change and develop over time. This framework has found applications in various fields, including physics, biology, economics, and computer science. In the context of deep learning, the connection to dynamical systems highlights the temporal aspect of learning and the potential for capturing complex dynamics in neural networks.</p><p>In recent years, there has been a growing body of research establishing connections between dynamical systems and deep learning. One such work (E, 2017) introduces a novel concept that interprets the discretization of a continuous dynamical system as a continuous-depth residual network. This approach utilizes continuous dynamical systems to model high-dimensional nonlinear functions commonly encountered in machine learning tasks. Another notable contribution by the authors in <ref type="bibr">(Chen et al., 2018)</ref> parameterizes the derivative of the hidden state using a neural network, introducing continuous-depth residual networks. This work highlights several advantages of continuousdepth models, including constant memory cost. The study in <ref type="bibr">(Li et al., 2023)</ref> establishes general sufficient conditions for the universal approximation property of continuousdepth residual networks, further connecting the dynamical systems approach to deep learning. A similar result is demonstrated in <ref type="bibr">(Li et al., 2022)</ref>, where the focus shifts to specific invariant functions instead of generic continuous functions. Additionally, the universal approximation property of deep fully convolutional networks is explored from the perspective of dynamical systems in <ref type="bibr">(Lin et al., 2022)</ref>. The authors demonstrate that deep residual fully convolutional networks, along with their continuous-depth counterparts of constant channel width, can achieve the universal approximation of specific symmetric functions. These studies serve as exemplary demonstrations of the endeavors made to establish connections between dynamical systems and deep learning. They explore the potential benefits and theoretical foundations of continuous-depth models in various contexts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Parameter Sharing in Neural Networks</head><p>In recent years, deep neural network models have demonstrated notable accomplishments across various domains. Nonetheless, the growing size of deep neural network models frequently introduces complexities in terms of computation and memory usage. To tackle these challenges, several techniques for model compression and acceleration have been developed, many of which involve the concept of parameter sharing. Parameter-sharing schemes are utilized in neural networks to minimize the total number of parameters, resulting in reduced memory and computational requirements. Our network architecture, which involves the repeated compositions of a single fixed-size network, can be viewed as a particular instance of a parameter-sharing scheme in neural networks.</p><p>To the best of our knowledge, parameter-sharing schemes in neural networks can be broadly categorized into three basic cases. The first case involves sharing parameters within the same layer, as seen in convolutional neural networks (CNNs), where kernels (filters) are shared across all image positions. The second case entails sharing parameters among different layers of neural networks, as in recurrent neural networks (RNNs). Our network architecture follows this second scheme by sharing parameters through repeated compositions of a single fixed-size network. We demonstrate that this approach can yield immense approximation capabilities by repeating a fixed number of parameters. In addition to these two parameter-sharing schemes, there is also the practice of sharing parameters across different neural networks or models, which is often employed in multitask learning scenarios. For further insights into parameter sharing in neural networks, interested readers can refer to the references <ref type="bibr">(Savarese &amp; Maire, 2019;</ref><ref type="bibr">Wang et al., 2020;</ref><ref type="bibr">Plummer et al., 2022;</ref><ref type="bibr">Wang et al., 2020;</ref><ref type="bibr">Zhang et al., 2022;</ref><ref type="bibr">Wallingford et al., 2022)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Discussion from an Approximation Perspective</head><p>The approximation power of neural networks has been extensively studied, with numerous publications focusing on constructing various neural networks to approximate a wide range of target functions. Some notable exam-ples include <ref type="bibr">(Cybenko, 1989;</ref><ref type="bibr">Hornik et al., 1989;</ref><ref type="bibr">Barron, 1993;</ref><ref type="bibr">Yarotsky, 2018;</ref><ref type="bibr">2017;</ref><ref type="bibr">B&#246;lcskei et al., 2019;</ref><ref type="bibr">Zhou, 2020;</ref><ref type="bibr">Chui et al., 2018;</ref><ref type="bibr">Gribonval et al., 2022;</ref><ref type="bibr">G&#252;hring et al., 2020;</ref><ref type="bibr">Suzuki, 2019;</ref><ref type="bibr">Nakada &amp; Imaizumi, 2020;</ref><ref type="bibr">Chen et al., 2019;</ref><ref type="bibr">Bao et al., 2023;</ref><ref type="bibr">Li et al., 2023;</ref><ref type="bibr">Montanelli &amp; Yang, 2020;</ref><ref type="bibr">Shen et al., 2019;</ref><ref type="bibr">2020;</ref><ref type="bibr">Lu et al., 2021;</ref><ref type="bibr">Zhang, 2020;</ref><ref type="bibr">Shen et al., 2022b;</ref><ref type="bibr">a)</ref>. In the early stages of this field, the focus was on exploring the universal approximation power of one-hidden-layer networks. The universal approximation theorem <ref type="bibr">(Cybenko, 1989;</ref><ref type="bibr">Hornik, 1991;</ref><ref type="bibr">Hornik et al., 1989)</ref> demonstrated that a sufficiently large neural network can approximate a certain type of target function arbitrarily well, without explicitly estimating the approximation error in terms of the network size. Subsequent work, such as <ref type="bibr">(Barron, 1993;</ref><ref type="bibr">Barron &amp; Klusowski, 2018)</ref>, analyzed the approximation error of one-hiddenlayer networks of width n and showed an asymptotic approximation error of O(n &#8593;1/2 ) in the L 2 -norm for target functions with certain smoothness properties.</p><p>Recent research has placed significant emphasis on the approximation of deep neural networks. Notably, the findings presented in <ref type="bibr">(Shen et al., 2020;</ref><ref type="bibr">Yarotsky, 2018;</ref><ref type="bibr">Zhang, 2020)</ref> indicate that ReLU networks with n parameters can achieve an optimal approximation error of O(n &#8593;2/d ) when approximating 1-Lipschitz continuous functions on [0, 1] d . However, it is crucial to recognize that this optimal approximation rate suffers from the curse of dimensionality. To overcome the limitations imposed by the curse of dimensionality and achieve better approximation errors, various approaches have been proposed and explored. These approaches aim to enhance the quality of approximation or even directly address the challenges arising from the curse of dimensionality. One approach is to consider smaller function spaces, such as smooth functions <ref type="bibr">(Lu et al., 2021;</ref><ref type="bibr">Yarotsky &amp; Zhevnerchuk, 2020)</ref>, band-limited functions <ref type="bibr">(Montanelli et al., 2021)</ref>, and Barron spaces <ref type="bibr">(Barron &amp; Klusowski, 2018;</ref><ref type="bibr">Barron, 1993;</ref><ref type="bibr">E et al., 2022)</ref>. By restricting the class of functions being approximated, it is possible to achieve better approximation errors with neural networks. Another approach is to design new network architectures that can improve the approximation capabilities. 
Examples of such architectures include Floor-ReLU networks <ref type="bibr">(Shen et al., 2021a)</ref>, Floor-Exponential-Step networks <ref type="bibr">(Shen et al., 2021b)</ref>, (Sin, ReLU, 2^x)-activated networks <ref type="bibr">(Jiao et al., 2021)</ref>, and three-dimensional networks <ref type="bibr">(Shen et al., 2022)</ref>. By exploring different function spaces and designing novel network architectures, researchers have been able to push the limits of approximation accuracy in neural networks, providing more flexibility and better performance for various tasks. It is important to note that the literature on the approximation analysis of deep neural networks is vast, and the publications mentioned here represent only a subset of the existing research.</p><p>Many other approaches, techniques, and architectures have been proposed to address the challenge of improving the approximation error for specific function classes.</p><p>In this paper, we propose a specific neural network architecture generated by repeated compositions of a single fixed-size network. Theorems 1.1 and 1.3 demonstrate that repeating a small ReLU network block can enhance the approximation power of our network architecture. We will conduct experiments in Section 4 to numerically verify our theoretical results and evaluate the approximation capabilities of our network architecture.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Ideas for Proving Theorems 1.1 and 1.3</head><p>Let us outline the main ideas behind the proofs of Theorems 1.1 and 1.3. During the proofs, our main approach involves constructing a piecewise constant function to approximate the desired continuous function. However, the continuity of ReLU networks poses a challenge in uniformly approximating piecewise constant functions. To bridge this gap, we first design ReLU networks to realize piecewise constant functions outside a sufficiently small region to approximate the target function well. Then, we will introduce a theorem to deal specifically with the approximation inside this small region for achieving uniform approximation.</p><p>Based on the aforementioned ideas, let us delve into the specific details. We divide [0, 1] d into a union of "important" cubes {Q &#949; } &#949;&#8595;{0,1,&#8226;&#8226;&#8226;,K&#8593;1} d and a small region !, where K is a proper integer determined later. Each Q &#949; is associated with a representative <ref type="figure">1</ref> for an illustration of x &#949; , !, and Q &#949; . Then, the construction of the desired network approximating the target function can be divided into three steps as follows.</p><p>1. First, we design a sub-network to realize a vectorvalued function " 1 mapping the whole cube Q &#949; to its index &#977; for each &#977;. That is, "</p><p>2. Next, we design a sub-network to realize a function &#1009; 2 mapping &#977; approximately to f</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Finally, by defining</head><p>Additionally, we must also address the approximation occurring within ! and demonstrate that &#1009; = &#1009; 2 &#8594; " 1 can be represented in the desired form &#1009; = L 2 &#8594; g &#8594;r &#8594; L 1 , where L 1 and L 2 are affine linear maps and g is realized by a fixed-size ReLU network.</p><p>0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00</p><p>A set of function values at representatives: </p><p>See Figure <ref type="figure">1</ref> for an illustration of these three steps. More details on these three steps can be found below.</p><p>Step 1: Constructing " 1 .</p><p>As mentioned previously, the aim of</p><p>, where &#1009; 1 : R &#8593; R is a step function outside a small region and hence can be realized by a ReLU network. It is generally challenging to design a ReLU network with a limited budget and the required architecture to realize such a function &#1009; 1 . Thus, we establish a proposition, Proposition 3.1 below, to do this step and place its proof in Section C of the appendix.</p><p>Proposition 3.1. Given any &#982; &#8595; (0, 1) and n, m &#8595; N + with n &#8771; m, there exist g &#8595; NN 9, 1; R 5 &#8593; R 5 and two affine linear maps L 1 : R &#8593; R 5 and L 2 :</p><p>Step 2:</p><p>which is a key point to ease the design of a ReLU network realizing &#1009; 2 . Indeed, if we can define a proper affine linear map L : R d &#8593; R, then we only need to construct</p><p>It is still challenging to construct a ReLU network with a limited budget and the required architecture to realize &#1009; 2 . Thus, we establish Proposition 3.2 below to simplify the construction of &#1009; 2 . The proof of Proposition 3.2 is complicated and hence is placed in Section D of the appendix.</p><p>Proposition 3.2. Given any &#962; &gt; 0, n, m &#8595; N + with n &#8771; m, and</p><p>there exist g &#8595; NN{16, 2; R 6 &#8593; R 6 } and two affine linear maps L 1 : R &#8593; R 6 and L 2 : R 6 &#8593; R such that</p><p>Step 3: Representing &#1009; = &#1009; 2 &#8594; " 1 properly.</p><p>With " 1 and &#1009; 2 constructed in the first two steps, we can define &#1009; := &#1009; 2 &#8594; " 1 and we have</p><p>That means &#1009; can approximate f well outside !. By making &#1009; bounded and ! sufficiently small, we can easily control the L p -norm approximation error to prove Theorem 1.1 for any p &#8595; [1, &#8599;). To prove Theorem 1.3, we require &#1009; to pointwise approximate f well. To this end, we use the idea of Lemma 3.11 in (Zhang, 2020) (or Lemma 3.4 in <ref type="bibr">(Lu et al., 2021)</ref>) to control the approximation error inside a small region.</p><p>Apart from a good approximation error, we also need to show that &#1009; can be represented as the desired form L 2 &#8594; g &#8594;r &#8594; L 1 , where L 1 and L 2 are two affine linear maps and g is realized by a fixed-size ReLU network. Note that " 1 and &#1009; 2 are constructed based on Propositions 3.1 and 3.2, respectively. Thus, both " 1 and &#1009; 2 are expected to have the following form:</p><p>where L 1 , L 2 are affine linear maps and g is realized by fixed-size ReLU networks. Then, &#1009; = &#1009; 2 &#8594;" 1 are expected to have the following form:</p><p>where L 1 , L 2 , L 3 are affine linear maps and g 1 , g 2 are realized by fixed-size ReLU networks. It is not trivial to convert the form in Equation (3) to the desired form. 
A proposition, Proposition 3.3 below, is established to facilitate such a conversion.</p><p>Proposition 3.3. Let L_1 : R^{d_0} → R^{d_1}, L_2 : R^{d_1} → R^{d_2}, and L_3 : R^{d_2} → R^{d_3} be three affine linear maps. Suppose</p><p>The proof of Proposition 3.3 is technical and hence is deferred to Section E of the appendix.</p></div>
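<p>As a rough illustration of the piecewise-constant strategy behind Steps 1 and 2 (and not of the ReLU construction itself), the following NumPy sketch maps a point x to the index β of its cube and then looks up the target value at a representative point. The choice of representative x_β = β/K and the target function below are illustrative assumptions.</p>
<p><![CDATA[
import numpy as np

def piecewise_constant_approx(f, x, K):
    """Approximate f on [0,1]^d by the two-step map x -> beta -> f(x_beta):
    beta indexes the cube Q_beta containing x, and x_beta = beta / K is its representative corner."""
    x = np.asarray(x, dtype=float)
    beta = np.minimum(np.floor(K * x), K - 1)   # Step 1: index of the cube containing x
    x_beta = beta / K                            # representative point of Q_beta
    return f(x_beta)                             # Step 2: look up f at the representative

f = lambda p: np.sin(2 * np.pi * p).sum()        # illustrative Lipschitz target on [0,1]^2
x = np.array([0.37, 0.81])
for K in (2, 8, 32, 128):
    print(K, abs(piecewise_constant_approx(f, x, K) - f(x)))   # error shrinks roughly like O(1/K)
]]></p>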
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Numerical Experiments</head><p>The primary objective of this section is to numerically validate the theoretical results presented in Theorems 1.1 and 1.3. To achieve this goal, we have designed two distinct experiments. The first experiment, described in Section 4.1, focuses on a function approximation task. Our aim is to demonstrate that increasing r in our RCNet architecture, denoted as L 2 &#8594; g &#8594;r &#8594; L 1 , improves the error of function approximation. In this architecture, L 1 and L 2 represent affine linear maps, while g represents a ReLU network block. The second experiment, outlined in Section 4.2, focuses on a classification task. We intend to illustrate that increasing the value of r in our RCNet architecture L 2 &#8594;g &#8594;r &#8594;L 1 leads to enhanced classification performance. By evaluating the accuracy of the classification results, we can empirically validate the advantages of incorporating multiple compositions of the fixed-size ReLU network block. Through these experiments, our goal is to provide numerical evidence that supports the theoretical claims and showcases the potential of our RCNet architecture in terms of improving approximation power. The results obtained from these experiments will contribute to a comprehensive understanding of the practical implications and advantages of our network design.</p><p>Next, let us briefly discuss the expected impact of increasing r in our RCNet architecture L 2 &#8594;g &#8594;r &#8594;L 1 on the experiment results. In this discussion, we will focus on the ReLU activation function and consider the three main sources of errors in the test results: approximation error, generalization error, and optimization error. In our experiments where r is small, we assume that the optimization error is wellcontrolled due to the utilization of a good optimization algorithm. Thus, we can primarily focus on the effects of increasing r on the approximation and generalization errors. Increasing r has the potential to reduce the approximation error, as demonstrated by our theoretical results. However, it may also lead to an increase in the generalization error. Therefore, the performance improvement associated with larger values of r depends on whether the approximation error or the generalization error dominates. If the approximation error is the leading term, then increasing r can be beneficial. To emphasize the approximation error, we design our first experiment to involve a sufficiently complex target function. By choosing a challenging binary classification problem for our second experiment, we create conditions where the approximation error is relatively large. In both experiments, a sufficient number of samples are generated to control the generalization error, ensuring that it does not overshadow the effects of the approximation error. By carefully setting up these experiments and controlling the different sources of errors, we can gain insights into the impact of increasing r in our RCNet architecture.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Function Approximation</head><p>To evaluate and compare the approximation capabilities of our RCNet architecture L 2 &#8594; g &#8594;r n &#8594; L 1 across various values of r and n, we will utilize it for a function approximation task. This architecture comprises two affine linear maps, L 1 and L 2 , along with a ReLU network block g n . To facilitate a comprehensive comparison, we have specifically chosen a target function f that exhibits a high degree of complexity. The function f :</p><p>, where the coefficient matrices are given by</p><p>, and (d i,j ) = 4&#966; 6&#966; 8&#966; 6&#966; .</p><p>To visually represent the target function f , we have included illustrations in Figure <ref type="figure">2</ref>. By choosing this intricate function as our target, we can effectively assess the approximation capability of our network architecture across various values of r and n. In this experiment, we will employ the RCNet architecture L 2 &#8594; g &#8594;r n &#8594; L 1 to approximate the target function f for different values of r and n. Specifically, we consider r = 1, 2, 3, 4 and n = 100, 200. The ReLU network block g n is constructed by combining an affine linear map and the ReLU activation function, i.e., g n is defined as</p><p>for any x &#8595; R n , where A &#8595; R n&#8600;n and b &#8595; R n are parameters and &#8636; is the ReLU activation function that can be applied element-wise to a vector. Then, the dimensions of the input and output for the two affine linear maps, L 1 : R 2 &#8593; R n and L 2 : R n &#8593; R, are determined accordingly. Notably, the ReLU network block g n consists of n 2 + n parameters. The affine linear map L 1 contains 2n + n = 3n parameters, while L 2 has n + 1 parameters. Consequently, the total number of parameters in L 2 &#8594; g &#8594;r n &#8594; L 1 is (n 2 + n) + 3n + (n + 1) = n 2 + 5n + 1 for different values of r and n. It is essential to note that when r &#8596; 2, the parameters of L 2 &#8594; g &#8594;r n &#8594; L 1 are partially shared through repetitions of the ReLU network block g n . Our objective is to provide numerical evidence demonstrating that increasing the value of r results in improved test losses for each fixed n.</p><p>Before presenting the numerical results, let us delve into the hyperparameters utilized for training our network architecture L 2 &#8594; g &#8594;r n &#8594; L 1 for varying values of r and n, specifically r = 1, 2, 3, 4 and n = 100, 200. To generate the training and test samples, we employ the uniform distribution, resulting in 10 6 training samples and 10 5 test samples in [0, 1] 2 . During the training process, we utilize the RAdam optimization method <ref type="bibr">(Liu et al., 2020)</ref>, which aids in optimizing the network parameters. We set the minibatch size for training to 500, which signifies the number of training samples processed in each iteration. Our training process comprises a total of 500 epochs, representing complete passes through the training dataset. The learning rate is adjusted every 5 epochs. More specifically, the learning rate during epochs 5(i &#8600; 1) + 1 to 5i is set to 0.002 &#8659; 0.9 i&#8593;1 for i = 1, 2, &#8226; &#8226; &#8226; , 100. This adaptive adjustment allows for fine-tuning the model during training. To train our model, we employ the mean squared error (MSE) loss function, which measures the average squared difference between the network-generated function and the target function. 
To ensure the reliability of our experiment, we repeat it 12 times. From these repetitions, we discard the 3 top-performing and 3 bottom-performing trials based on the average test loss of the last 100 epochs. The target loss is then determined by taking the average of the test losses from the remaining 6 trials for each epoch.</p><p>We are now prepared to present the experiment results that compare the numerical performances of L_2 ∘ g_n^{∘r} ∘ L_1 for different values of r and n, specifically r = 1, 2, 3, 4 and n = 100, 200. The test losses over the last 100 epochs are averaged to obtain the target losses, considering two types of loss functions: the mean squared error (MSE) and maximum (MAX) loss functions. Table <ref type="table">1</ref> provides a comprehensive comparison of the numerical results obtained from L_2 ∘ g_n^{∘r} ∘ L_1 for the given values of r and n. Additionally, Figure <ref type="figure">3</ref> illustrates the test losses measured in MSE on a logarithmic scale, allowing for an intuitive comparison. The values presented in Table <ref type="table">1</ref> and the trends observed in Figure <ref type="figure">3</ref> clearly indicate a significant improvement in test losses with increasing values of r. These findings align with the theoretical results stated in Theorems 1.1 and 1.3, providing further confirmation of the effectiveness of increasing r.</p><p>Lastly, it is crucial to acknowledge that further increasing the value of r may not lead to additional improvements in the results due to the inherent challenges involved in optimizing deep learning models. Issues such as local minima, saddle points, and vanishing gradients make it increasingly difficult to identify the global minimizer, particularly for larger values of r.</p></div>
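<p>For readers who wish to reproduce a scaled-down version of this experiment, the following PyTorch-style sketch trains a small RCNet with the optimizer, mini-batch size, and learning-rate schedule described above. The stand-in target function, the shortened number of epochs, and the block width are illustrative assumptions, since the exact target used in the paper is not reproduced here.</p>
<p><![CDATA[
import torch
import torch.nn as nn

# Illustrative stand-in target; the paper's exact trigonometric target is not reproduced here.
def target(x):
    return torch.sin(4 * torch.pi * x[:, :1]) * torch.cos(6 * torch.pi * x[:, 1:2])

def make_rcnet(n: int, r: int):
    L1, L2 = nn.Linear(2, n), nn.Linear(n, 1)
    g = nn.Sequential(nn.Linear(n, n), nn.ReLU())   # block g_n(x) = sigma(Ax + b)
    def forward(x):
        z = L1(x)
        for _ in range(r):                          # shared block applied r times
            z = g(z)
        return L2(z)
    params = list(L1.parameters()) + list(L2.parameters()) + list(g.parameters())
    return forward, params

forward, params = make_rcnet(n=100, r=3)
opt = torch.optim.RAdam(params, lr=0.002)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.9)  # decay every 5 epochs, as in the text
loss_fn = nn.MSELoss()
for epoch in range(20):                             # the paper trains for 500 epochs
    for _ in range(100):                            # mini-batches of 500 uniform samples
        x = torch.rand(500, 2)
        loss = loss_fn(forward(x), target(x))
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()
]]></p>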
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Classification</head><p>To assess and compare the approximation capabilities of our RCNet architecture L 2 &#8594;g &#8594;r n &#8594;L 1 across different values of r and n, we will employ it for a classification task. This architecture consists of two affine linear maps, L 1 and L 2 , along with a ReLU network block g n . For a comprehensive comparison, we have selected a complex binary classification experiment utilizing the Archimedean spiral, as proposed in <ref type="bibr">(Shen et al., 2022)</ref>. The objective of this classification problem is to accurately classify samples from two distinct sets, denoted as S 0 and S 1 . These two sets are constructed based on the Archimedean spiral, as illustrated in Figure <ref type="figure">4</ref>.</p><p>Let us delve into the construction details of the sets S 0 and S 1 . An Archimedean spiral can be represented by the equation r = a + b&#8637; in polar coordinates (r, &#8637;) for proper a, b &#8595; R. We begin by defining two curves s = 24. Next, we normalize</p><p>, where C i is defined as</p><p>With C 0 and C 1 defined, we can construct the target sets S 0 and S 1 as</p><p>, where &#962; = 0.006 in our experiments. Refer to Figure <ref type="figure">4</ref> for illustrations of S 0 and S 1 .</p><p>In this experiment, we will employ the network architecture L 2 &#8594;g &#8594;r n &#8594;L 1 to classify samples in S 0 &#8658;S 1 for different values of r and n, specifically r = 1, 2, 3, 4 and n = 30, 40.</p><p>Here, L 1 and L 2 represent two affine linear maps, and g n corresponds to a ReLU network block. The construction of the ReLU network block g n involves combining an affine linear map with the ReLU activation function. Mathematically, g n is defined as</p><p>for any x &#8595; R n , where A &#8595; R n&#8600;n and b &#8595; R n are parameters and &#8636; denotes the ReLU activation function that can be applied element-wise to a vector. Subsequently, the input and output dimensions of the two affine linear maps, L 1 : R 2 &#8593; R n and L 2 : R n &#8593; R 2 , are appropriately determined. The total number of parameters in L 2 &#8594; g &#8594;r n &#8594; L 1 can be easily verified to be (2n+2)+(n 2 +n)+(2n+n) = n 2 +6n+2 for different values of r and n. An important observation is that for r &#8596; 2, the parameters in L 2 &#8594; g &#8594;r n &#8594; L 1 are partially shared due to the repeated utilization of the ReLU network block g n . Our objective is to provide numerical evidence demonstrating that increasing the value of r results in improved test accuracies for each fixed n. Before proceeding with the numerical results, let us provide an overview of the hyperparameters employed in training our network architecture L 2 &#8594; g &#8594;r n &#8594; L 1 for different values of r and n, specifically r = 1, 2, 3, 4 and n = 30, 40. First, we generate training and test samples from S 0 and S 1 using the uniform distribution. Specifically, we randomly generate 3 &#8659; 10 5 training samples and 3 &#8659; 10 4 test samples for each class. These 6 &#8659; 10 5 training samples are used for network training, while 6 &#8659; 10 4 test samples are utilized to compute the test accuracy. For optimization, we employ the RAdam method <ref type="bibr">(Liu et al., 2020)</ref>. The training process consists of 1000 epochs with a mini-batch size of 300. 
The learning rate during epochs 5(i − 1) + 1 to 5i is set to 0.001 × 0.95^{i−1} for i = 1, 2, ···, 200. To evaluate the model output, we apply the softmax activation function to the network output and employ the cross-entropy loss function to measure the loss between the target labels and the network output. To guarantee the standardization of training and test samples, we perform a rescaling procedure to adjust their mean to 0 and standard deviation to 1. To ensure the reliability of our results, we conduct the experiment 12 times. Among these trials, we exclude the 3 top-performing and 3 bottom-performing trials based on the average test accuracy over the last 100 epochs. The target accuracy is then determined by averaging the test accuracies from the remaining 6 trials for each epoch.</p><p>Let us now present the results of our experiments by comparing the numerical performances of L_2 ∘ g_n^{∘r} ∘ L_1 for varying values of r and n, specifically r = 1, 2, 3, 4 and n = 30, 40. The target test accuracy is computed by averaging the test accuracies over the last 100 epochs. Table <ref type="table">2</ref> provides a comprehensive overview of the test accuracies obtained by L_2 ∘ g_n^{∘r} ∘ L_1 for different values of r and n. To enhance the information presented in Table <ref type="table">2</ref>, Figure <ref type="figure">5</ref> serves as a complementary visual representation. It showcases graphical depictions of the data, enabling an intuitive comparison and facilitating a visual analysis of the performance trends. The results presented in Table <ref type="table">2</ref>, combined with the trends observed in Figure <ref type="figure">5</ref>, confirm our initial expectation that increasing the value of r leads to improved test accuracies. These experiment results provide compelling numerical evidence demonstrating the effectiveness of increasing r, which aligns with the theoretical results stated in Theorems 1.1 and 1.3.</p><p>Finally, it is important to note that further increasing r may not necessarily result in additional improvements in the results. This is because optimizing deep learning models is notoriously challenging, as it involves various issues such as local minima, saddle points, and vanishing gradients. In our experiments, the primary difficulty lies in identifying the global minimizer, especially when dealing with large values of r.</p><p>Table 2. Test accuracy comparison. </p></div>
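<p>The following NumPy sketch generates two spiral-shaped classes in the spirit of the construction above. The spiral constants a and b, the angular range, and the normalization are illustrative assumptions, as the exact values used in the experiment are not reproduced here.</p>
<p><![CDATA[
import numpy as np

def spiral_classes(n_per_class: int, eps: float = 0.006, seed: int = 0):
    """Two interleaved Archimedean spirals r = a + b*theta, perturbed by eps.
    The constants a, b and the normalization below are illustrative, not the paper's exact values."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 8 * np.pi, size=(2, n_per_class))   # angles along each curve
    a, b = 0.0, 1.0 / (8 * np.pi)                                 # keeps radii roughly in [0, 1]
    data, labels = [], []
    for cls in (0, 1):
        r = a + b * theta[cls] + eps * rng.standard_normal(n_per_class)
        phase = cls * np.pi                                       # second spiral rotated by pi
        x = np.stack([r * np.cos(theta[cls] + phase), r * np.sin(theta[cls] + phase)], axis=1)
        data.append(x)
        labels.append(np.full(n_per_class, cls))
    return np.concatenate(data), np.concatenate(labels)

X, y = spiral_classes(300000)    # the paper uses 3 x 10^5 training samples per class
print(X.shape, y.shape)          # (600000, 2) (600000,)
]]></p>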
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This paper investigates the expressive power of deep neural networks from the perspective of function compositions. We demonstrate that the repeated compositions of a single fixed-size ReLU network exhibit surprising expressive power, despite the limited expressive capabilities of the individual network itself. As shown in Theorems 1.1 and 1.3, our RCNet architecture L 2 &#8594; g &#8594;r &#8594; L 1 can approximate any continuous function f &#8595; C([0, 1] d ) with an error O &#969; f (r &#8593;1/d ) . Here, g represents a fixed-size ReLU network, while L 1 and L 2 correspond to two affine linear maps matching the dimensions. Furthermore, we explore the connection between our findings and dynamical systems. Our results reveal that a continuous-depth network generated through a dynamical system possesses enormous approximation capabilities, even when the dynamics function is time-independent and realized by a fixed-size ReLU network. Finally, we conduct experiments to provide numerical evidence that validates the theoretical results stated in Theorems 1.1 and 1.3.</p><p>It is worth mentioning that our analysis is currently focused on the ReLU activation function and fully connected network architectures. Extending our results to other activation functions, such as the sigmoid and tanh functions, as well as different neural network architectures, such as convolutional neural networks, would be of great interest for future research. Additionally, the numerical examples presented in this paper are relatively simple. Further exploration of the numerical performance of our network architecture and its application to real-world problems would be an intriguing direction for future studies. As we shall see later in the proofs of Theorems 1.1 and 1.3, our main approach involves constructing a piecewise constant function to approximate the desired continuous function. However, the inherent continuity of ReLU networks hinders their ability to uniformly approximate piecewise constant functions effectively. To address this limitation, we introduce the concept of the trifling region !([0, 1] d , K, &#982;) as defined in Equation (4). By utilizing ReLU networks, we can accurately represent piecewise constant functions outside the trifling region.</p><p>To streamline the proofs of Theorems 1.1 and 1.3, we introduce an auxiliary theorem, referred to as Theorem A.1 below, where we disregard the approximation within the trifling region !([0, 1] d , K, &#982;). Theorem A.1. Given a continuous function f &#8595; C([0, 1] d ), for any r &#8595; N + , there exist g &#8595; NN 39d + 24, 3; R 5d+3 &#8593; R 5d+3 and two affine linear maps L 1 : R d &#8593; R 5d+3 and L 2 : R 5d+3 &#8593; R such that</p><p>where K = &#8601;r 1/d &#8733; and &#982; is an arbitrary number in (0, 1 3K ].</p><p>The proof of Theorem A.1 will be presented in Section B. Assuming the validity of Theorem A.1, we will provide the detailed proofs of Theorems 1.1 and 1.3 in Sections A.2 and A.3, respectively. To enhance clarity, Section A.1 offers a concise overview of the notations employed throughout this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. Notations</head><p>Below is a summary of the fundamental notations employed in this paper.</p><p>&#8226; The set difference of two sets A and B is denoted as A\B := {x : x &#8595; A, x / &#8595; B}.</p><p>&#8226; The sets of natural numbers (including 0), integers, rational numbers, and real numbers are denoted as N, Z, Q, and R, respectively. Set N + = N\{0}.</p><p>&#8226; The indicator (characteristic) function of a set A is denoted as 1 A , which takes the value 1 on elements of A and 0 otherwise.</p><p>&#8226; The floor and ceiling functions of a real number x are denoted as &#8601;x&#8733; = max{n : n &#8771; x, n &#8595; Z} and &#8242;x&#8734; = min{n : n &#8596; x, n &#8595; Z}.</p><p>&#8226; Vectors and matrices are represented by bold lowercase and uppercase letters, respectively. For example, a = (a 1 , &#8226; &#8226; &#8226; , a d ) &#8595; R d , A &#8595; R m&#8600;n is a real matrix of size m &#8659; n, and A T denotes the transpose of A.</p><p>&#8226; Slicing notation is used for a vector </p><p>&#8226; By convention, &#63737; m j=n a j = 0 if n &gt; m, no matter what a j is for each j.</p><p>&#8226; For any</p><p>In the degenerate case  0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 &#8226; The rectified linear unit (ReLU) is denoted as &#8636;(x) = max{0, x} for any x &#8595; R. With a slight abuse of notation, we allow &#8636; to be applied element-wise to a vector, i.e., &#8636;(x</p><p>&#8226; Suppose &#969; is a function realized by a ReLU network, whether scalar or vector-valued. Then, &#969; can be expressed as</p><p>where W i &#8595; R Ni+1&#8600;Ni and b i &#8595; R Ni+1 are the weight matrix and the bias vector in the i-th affine linear map L i , respectively, i.e.,</p><p>and</p><p>Furthermore, &#969; can be expressed as a composition of functions. Specifically, it can be written as</p><p>Refer to Figure <ref type="figure">7</ref> for an illustration. &#8226; A network is referred to as "a network of width N and depth L" if it satisfies the following conditions.</p><p>-The number of neurons in each hidden layer of this network is less than or equal to N .</p><p>-The number of hidden layers of this network is less than or equal to L.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Proof of Theorem 1.1 with Theorem A.1</head><p>By assuming the validity of Theorem A.1, we can proceed to prove Theorem 1.1.</p><p>Proof of Theorem 1.1. We assume that f is not a constant function, as considering constant functions would lead to a trivial case. Therefore, for any t &gt; 0, we have &#969; f (t) &gt; 0. Set K = &#8601;r 1/d &#8733; and let &#982; &#8595; (0, 1 3K ] be an arbitrary number determined later. By Theorem A.1, there exist g 1 &#8595; NN 39d + 24, 3; R 5d+3 &#8593; R 5d+3</p><p>and two affine linear maps &#63738; L 1 : R d &#8593; R 5d+3 and &#63738; L 2 : R 5d+3 &#8593; R such that</p><p>That means the approximation error is well controlled outside the trifling region</p><p>To this end, we define</p><p>where</p><p>We claim g 2 &#8595; NN{4, 2; R &#8593; R}. To see this, we need to show how to realize g 2 by a ReLU network. Clearly, we have</p><p>where the last equality comes from</p><p>As shown in Figure <ref type="figure">8</ref>, g 2 &#8595; NN{4, 2; R &#8593; R} as desired.</p><p>x &#969;(x + M ) Let &#63738; L 3 : R &#8593; R as the identity map. Then, by Proposition 3.3 with</p><p>and two affine linear maps L 1 : R d &#8593; R 5d+5 and L 2 : R 5d+5 &#8593; R such that</p><p>Thus, we have</p><p>, where the last inequality comes from Equation ( <ref type="formula">5</ref>).</p><p>Observe that the Lebesgue measure of !([0, 1] d , K, &#982;) is bounded by Kd&#982;. Hence, by choosing a small &#982; &#8595; (0, 1 3K ] with</p><p>Therefore, we can conclude that</p><p>). Thus, we have completed the proof of Theorem 1.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3. Proof of Theorem 1.3 with Theorem A.1</head><p>To establish Theorem 1.3, we will rely on Theorem A.1, which permits unbounded approximation errors in the trifling region !([0, 1] d , K, &#982;). However, when it comes to proving Theorem 1.3 using pointwise approximation, it becomes essential to control the approximation error within the trifling region. To address this, we introduce a separate theorem that specifically deals with the approximation within the trifling region.</p><p>Theorem A.2 (Lemma 3.11 of <ref type="bibr">(Zhang, 2020)</ref> or Lemma 3.4 of <ref type="bibr">(Lu et al., 2021)</ref>). Given any &#962; &gt; 0, K &#8595; N + , and</p><p>where &#1009; := &#1009; d is defined by induction through &#1009; 0 := g and</p><p>where {e i } d i=1 is the standard basis in R d and mid(&#8226;, &#8226;, &#8226;) is the function returning the middle value of three inputs.</p><p>Now, we are prepared to provide the detailed proof of Theorem 1.3 by assuming the validity of Theorem A.1.</p><p>Proof of Theorem 1.3. We may assume f is not a constant function since it is a trivial case. Then &#969; f (t) &gt; 0 for any t &gt; 0.</p><p>Set K = &#8601;r 1/d &#8733; and choose a sufficiently small &#982; &#8595; (0, 1 3K ] such that</p><p>By Theorem A.1, there exist g 0 &#8595; NN 39d + 24, 3; R 5d+3 &#8593; R 5d+3 and two affine linear maps L 0,1 : R d &#8593; R 5d+3 and L 0,2 : R 5d+3 &#8593; R such that</p><p>where &#1009; := &#1009; d is defined by induction through</p><p>Here, {e i } d i=1 is the standard basis in R d and mid(&#8226;, &#8226;, &#8226;) is the function returning the middle value of three inputs. It remains to show &#1009; = &#1009; d can be represented as the desired form. We claim that &#1009; i can be represented as</p><p>, and g i satisfy the following conditions:</p><p>&#8226; r i = 3r + 2i &#8600; 1 and A i = d + 1 &#8600; i;</p><p>&#8226; L i,1 : R d &#8593; R di and L i,2 : R di &#8593; R are two affine linear maps with d i = 3 i (5d + 4) &#8600; 1;</p><p>We will prove this claim by induction. First, let us consider the base case i = 0. Clearly,</p><p>, where d 0 = 3 0 (5d + 4) &#8600; 1 = 5d + 3, L 0,1 : R d &#8593; R d0 and L 0,2 : R d0 &#8593; R are two affine linear maps and</p><p>Next, let us assume the claim holds for the case i = j &#8595; {0, 1, &#8226; &#8226; &#8226; , d &#8600; 1}. We will prove the claim for the case i = j + 1. By the induction hypothesis, &#1009; i can be represented as</p><p>where L j,1 : R d &#8593; R dj and L j,2 : R dj &#8593; R are two affine linear maps and g j &#8595; NN{N j , L j ; R dj &#8593; R dj }.</p><p>and &#63738; L 3 : R 3 &#8593; R via &#63738; L 3 (y 1 , y 2 , y 3 ) := y 1 for any (y 1 , y 2 , y 3 ) &#8595; R 3 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Note that</head><p>Clearly, <ref type="bibr">(Shen et al., 2021b)</ref>, mid(&#8226;, &#8226;, &#8226;) can be realized by a ReLU network of width 14 and depth 2, implying &#63738; G &#8595; NN{14, 2; R 3 &#8593; R 3 }.</p><p>Then, by Proposition 3.3 with</p><p>and two affine linear maps L j+1,1 : R d &#8593; R dj+1 and L j+1,2 : R dj+1 &#8593; R such that</p><p>where the last equality comes from r j +1+1 = (3r+2j&#8600;1)+1+1 = 3r+2(j+1)&#8600;1 = r j+1 . Therefore, for any x &#8595; [&#8600;A j+1 , A j+1 ], we have</p><p>By the principle of mathematical induction, we finish the proof of the claim.</p><p>Then, by the claim and setting d = 3 d (5d + 4) &#8600; 1, &#1009; = &#1009; d can be represented as</p><p>where L d,1 : R d &#8593; R d and L i,2 : R d &#8593; R are two affine linear maps and</p><p>By defining L 1 := L d,1 , g := g d , and</p><p>Thus, we finish the proof of Theorem 1.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Proof of Theorem A.1 with Propositions</head><p>In this section, we will provide the proof of the auxiliary theorem, Theorem A.1, by relying on Propositions 3.1, 3.2, and 3.3. The detailed proofs of these propositions can be found in Sections C, D, and E, respectively. By assuming the validity of these three propositions, we now proceed to prove Theorem A.1.</p><p>Proof of Theorem A.1. We may assume &#969; f (t) &gt; 0 for any t &gt; 0 since &#969; f (t 0 ) = 0 for some</p><p>Set K = &#8601;r 1/d &#8733; and let &#982; be an arbitrary number in (0, 1 3K ]. The proof can be divided into four main steps as follows.</p><p>is the trifling region defined in Equation (4).</p><p>2. Use Proposition 3.1 to construct a vector function</p><p>, " 1 (x) = &#977; for all x &#8595; Q &#949; , where &#63738; L 1 and &#63738; L 2 are affine linear maps and G 1 is realized by a fixed-size ReLU network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Construct a function &#1009;</head><p>L 4 , and &#63738; L 5 are affine linear maps and g 2 is realized by a fixed-size ReLU network. This core step can be further divided into two sub-steps: 3.1. Design an affine linear map &#63738; L 3 bijectively mapping the index set {0, 1, <ref type="figure">10</ref> for an illustration. 3.2. Apply Proposition 3.2 to design a sub-network to realize a function &#63738; L 5 &#8594; g</p><p>) and we use Proposition 3.3 to show &#1009; can be represented as L 2 &#8594; g &#8594;(3r&#8593;1) &#8594; L 1 , where L 1 and L 2 are affine linear maps and g is realized by a fixed-size ReLU network. Then we have &#1009;</p><p>The details of the above steps are presented below.</p><p>Step 1:</p><p>where !([0, 1] d , K, &#982;) is the trifling region defined in Equation (4). See Figure <ref type="figure">9</ref> for illustrations of !([0, 1] d , K, &#982;), Q &#949; , and</p><p>0.00 0.25 0.50 0.75 1.00 </p><p>Step 2: Construct " 1 mapping x &#8595; Q &#949; to &#977;.</p><p>By Proposition 3.1 with m = r and n = K = &#8601;r 1/d &#8733; &#8771; r = m therein and setting &#982; = K&#982;, there exist g 1 &#8595; NN 9, 1; R 5 &#8593; R 5 and two affine linear maps L 1 : R &#8593; R 5 and L 2 : R 5 &#8593; R such that</p><p>It is easy to verify that G 1 &#8595; NN 9d, 1; R 5d &#8593; R 5d .</p><p>For any</p><p>Therefore, for any</p><p>By defining</p><p>Step 3: Construct &#1009; 2 mapping &#977; approximately to f (x &#949; ).</p><p>We will use Proposition 3.2 to construct the desired &#1009; 2 . To meet the requirements of applying Proposition 3.2, we first define two auxiliary sets A 1 and A 2 as</p><p>Clearly,</p><p>See Figure <ref type="figure">9</ref> for an illustration of A 1 and A 2 . Next, we further divide this step into two sub-steps.</p><p>Step 3.1:</p><p>Inspired by the base-K representation, we define</p><p>Then &#63738; L 3 is a linear function bijectively mapping the index set {0, 1,</p><p>Step 3.2: Apply Proposition 3.2 to construct a sub-network mapping &#63738; L 3 (&#977;) approximate to f (x &#949; ).</p><p>Recall that</p><p>We will use a set of</p><p>to construct a continuous piecewise linear function h :</p><p>Precisely, we design h by making it satisfy the following two conditions.</p><p>&#8226; First, we set h(1) = f (1) and h &#63738;</p><p>&#8226; Next, we let h be linear between any two adjacent points in A 1 &#8658; {1}.</p><p>See Figure <ref type="figure">10</ref> for an illustration of h.</p><p>It is easy to verify that</p><p>0.00 0.25 0.50 0.75 1.00 0 1 </p><p>By defining &#63738; L 4 (x) := L 4 (2K d x) for any x &#8595; R, we have</p><p>By Equation ( <ref type="formula">9</ref>) and &#63738;</p><p>(10)</p><p>Step 4: Construct the desired function &#1009; and show it can be represented by the desired form.</p><p>We are ready to define the desired function &#1009; via</p><p>By defining &#63738; L</p><p>Recall that G 1 &#8595; NN 9d, 1; R 5d &#8593; R 5d and g 2 &#8595; NN 16, 2; R 6 &#8593; R 6 . By Proposition 3.3 with N 1 = 9d, N 2 = 16, L 1 = 1, L 2 = 2, d 0 = d, d 1 = 5d, d 2 = 6 and d 3 = 1 therein and setting d = 5d + 1 &#8596; max{5d, 6}, there exist</p><p>and two affine linear maps L 1 : R d &#8593; R 5d+3 and L 2 : R 5d+3 &#8593; R such that</p><p>Next, let us estimate the approximation error. Recall that</p><p>. 
<p>By Equations ( <ref type="formula">7</ref>) and ( <ref type="formula">10</ref>), for any</p><p>where the last inequality comes from</p><p>where the last equality comes from an elementary fact that holds for any n ∈ N + . So we finish the proof of Theorem A.1.</p></div>
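<p>As a purely illustrative aside, the logic of Steps 1–4 can be checked numerically without any neural network. The short Python sketch below follows the proof's notation (K, β, x β ) but is not the paper's construction: a dictionary lookup stands in for the piecewise linear function h, and the realization by repeated compositions of a fixed-size ReLU network, which is exactly what Propositions 3.1–3.3 supply, is not reproduced.</p><p><![CDATA[
# A toy numerical sketch (not the paper's network construction) of the strategy
# behind Theorem A.1: quantize x to a cube index beta, encode beta by a base-K
# style map, and return the tabulated value f(x_beta).  The error is governed
# by the modulus of continuity of f at scale sqrt(d)/K.
import numpy as np

d, r = 2, 64
K = int(np.floor(r ** (1.0 / d)))          # K = floor(r^(1/d))
f = lambda x: np.sin(3 * x[0]) * x[1]      # an arbitrary continuous test function

def beta_of(x):
    """Steps 1-2: index of the cube Q_beta containing x."""
    return tuple(min(int(K * t), K - 1) for t in x)

def encode(beta):
    """Step 3.1: an injective base-K style linear encoding of the index set."""
    return sum(b * K ** i for i, b in enumerate(beta)) / K ** d

# Step 3.2: tabulate f at the cube representatives x_beta (here: cube corners).
table = {encode(beta): f(np.array(beta) / K) for beta in np.ndindex(*([K] * d))}

def phi(x):
    """Step 4: the composed approximant phi(x) = h(encode(beta_of(x)))."""
    return table[encode(beta_of(x))]

xs = np.random.rand(1000, d)
err = max(abs(phi(x) - f(x)) for x in xs)
print(f"K = {K}, max error on samples = {err:.3f}, sqrt(d)/K = {np.sqrt(d) / K:.3f}")
]]></p>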
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Proof of Proposition 3.1</head><p>The main idea behind proving Proposition 3.1 lies in the composition architecture of neural networks. To streamline the proof, we begin by introducing a lemma, Lemma C.1 below, which can be seen as a weaker version of Proposition 3.1. Lemma C.1. Given any &#982; &#8595; (0, 1) and n &#8595; N + with n &#8596; 2, there exist g &#8595; NN 9, 1; R 5 &#8593; R 5 and two affine linear maps L 1 : R &#8593; R 5 and L 2 : R 5 &#8593; R such that Proof of Proposition 3.1. We may assume m &#8596; n &#8596; 2 since n = 1 is a trivial case. Set &#982; = (1&#8593;&#949;)&#949; n &#8595; (0, 1). By Lemma C.1, there exist g &#8595; NN 9, 1; R 5 &#8593; R 5 and two affine linear maps L 1 : R &#8593; R 5 and L 2 : R 5 &#8593; R such that</p><p>Define L 0 (x) := n&#8593;&#949;&#8593; &#949; n</p><p>x + &#982; for any x &#8595; R and L 1 := L 1 &#8594; L 0 . We claim</p><p>Then, by Equations ( <ref type="formula">11</ref>) and ( <ref type="formula">12</ref>), for any</p><p>It remains to prove Equation ( <ref type="formula">12</ref>). Clearly,</p><p>implying L 0 is increasing. To prove Equation ( <ref type="formula">12</ref>), we only need to prove</p><p>Let us first prove Equation ( <ref type="formula">13</ref>). Clearly, for k</p><p>where the inequality comes from the fact &#8600;kn &#8600;</p><p>Next, let us prove Equation ( <ref type="formula">14</ref>). In the case of k = n &#8600; 1, we have</p><p>So we finish the proof of Proposition 3.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2. Proof of Lemma C.1</head><p>To ensure the completeness of the proof of Proposition 3.1, we now provide the proof of Lemma C.1.</p><p>Proof of Lemma C.1. Define</p><p>It is easy to verify that</p><p>Obviously, for any</p><p>It remains to construct g, L 1 , and L 2 such that</p><p>Now we are ready to construct g : R 5 &#8593; R 5 . Define</p><p>for any (y 1 , y 2 , y 3 , y 4 , y 5 ) &#8595; R 5 , where</p><p>). See an illustration of the ReLU network realizing g : R 5 &#8593; R 5 in Figure <ref type="figure">12</ref>. Clearly, g &#8595; NN{9, 1; R 5 &#8593; R 5 }.</p><p>Define L 1 : R &#8593; R 5 via L 1 (x) := (x, 1, 1, x, 0) and L 2 : R 5 &#8593; R via L 2 (x 1 , x 2 , x 3 , x 4 , x 5 ) := x 5 . Then, we have</p><p>from which we deduce</p><p>So we finish the proof of Lemma C.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Proof of Proposition 3.2</head><p>The bit extraction technique proposed in <ref type="bibr">(Bartlett et al., 1998)</ref> plays a crucial role in proving Proposition 3.2. Before we delve into the proof of Proposition 3.2, we first establish Lemma D.1, which serves as a key intermediate step in the proof of Proposition 3.2.</p><p>Lemma D.1. Given any r &#8595; N + , there exist g &#8595; NN 8, 2; R 3 &#8593; R 3 and two affine linear maps L 1 : R 2 &#8593; R 5 and L 2 : R 5 &#8593; R such that: For any</p><p>We will prove Proposition 3.2 by assuming the validity of Lemma D.1, which will be proved later in Section D.2.</p><p>D.1. Proof of Proposition 3.2 with Lemma D.1</p><p>Now we are ready to give the proof of Proposition 3.2 by assuming Lemma D.1 is true.</p><p>Proof of Proposition 3.2. We may assume n = m since we can set</p><p>Then, for any k &#8595; {1, 2, &#8226; &#8226; &#8226; , n &#8600; 1}, we have</p><p>By Lemma D.1 with r = n &#8600; 1 therein, there exist g, &#63738; g &#8595; NN{8, 2; R 3 &#8593; R 3 } and four affine linear maps L</p><p>On Enhancing Expressive Power via Compositions of Single Fixed-Size ReLU Network It is easy to verify that g &#8595; NN{16, 2; R 6 &#8593; R 6 }. Moreover, we have g &#8594;(n&#8593;1) (x, y) = &#63739; g &#8594;(n&#8593;1) (x), &#63738; g &#8594;(n&#8593;1) (y) &#63727; for any x, y &#8595; R 3 .</p><p>Therefore, for k = 0, 1, &#8226; &#8226; &#8226; , n &#8600; 1, we have</p><p>where the last equality comes from Equation ( <ref type="formula">17</ref>). It follows that, for k = 0, 1,</p><p>So we finish the proof of Proposition 3.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2. Proof of Lemma D.1</head><p>To make the proof of Proposition 3.2 complete, we now provide the proof of Lemma D.1.</p><p>Proof of Lemma D.1. Set &#982; = 2 &#8593;r and define</p><p>See an illustration of T in Figure <ref type="figure">13</ref>.</p><p>It is easy to verify that By setting &#8641; r+1 = 2&#8641; r &#8600; T (&#8641; r &#8600; 1 2 ) = 0, we have </p><p>Define g : R &#8659; [0, &#8599;) 2 &#8593; R 3 via g(x 1 , x 2 , x 3 ) :=</p><p>for any (x 1 , x 2 , x 3 ) &#8595; R &#8659; [0, &#8599;) 2 .</p><p>For &#8640; = 1, 2, &#8226; &#8226; &#8226; , r + 1, we set</p><p>Then, for &#8640; = 1, 2, &#8226; &#8226; &#8226; , r, we have</p><p>implying &#982; r+1 = g(&#982; r ) = &#8226; &#8226; &#8226; = g &#8594;r (&#982; 1 ).</p><p>Define L 1 : R 2 &#8593; R 3 via L 1 (x 1 , x 2 ) := (x 1 &#8600;1, x 2 , 0) and L 2 : R 3 &#8593; R via L 2 (x 1 , x 2 , x 3 ) := x 3 for any x 1 , x 2 , x 3 &#8595; R.</p><p>Then, we have Clearly, g 2 &#8595; NN{N 2 , L 2 ; R d2 &#8593; R d2 } implies &#63738; g 2 &#8595; NN{N 2 , L 2 ; R d+1 &#8593; R d+1 }.</p><p>By Lemma E.1, there exists , 2r 1 + 3</p><p>&#63739; y, 0, 2r 1 + 1, 2r 1 + 3</p><p>Define L 1 : R d0 &#8593; R d+2 via L 1 (x) := &#63739; L 1 (x), 0, 2r 1 + 1, 2r 1 + 3 &#63727; &#8595; R d+2 for any x &#8595; R d0 Step 2: Constructing &#1009;.</p><p>Next, let construct a selector function &#1009; to "select" g 1 or g 2 in G.</p><p>By Lemma E.2, there exists where " &#8594;0 means the identity map.</p><p>Observe that, for k = 1, 2, &#8226; &#8226; &#8226; , r 1 + r 2 , &#982; k , t k = " &#8594;k (x, 2r 1 + 1) = " &#8594; " &#8594;(k&#8593;1) (x, 2r 1 + 1)</p><p>= "(&#982; k&#8593;1 , t k&#8593;1 ) = &#1009; &#8594; G(&#982; k&#8593;1 , t k&#8593;1 ) = &#1009; &#63739; g 1 (&#982; k&#8593;1 ), g 2 (&#982; k&#8593;1 ), g 0 (t k&#8593;1 ) &#63727; .</p><p>Then, by Equation (20), it is easy to verify that t k = g 0 (t k&#8593;1 ) = t k&#8593;1 &#8600; 2 and g 1 (&#982; k&#8593;1 ), g 2 (&#982; k&#8593;1 ) &#8595; [&#8600;M, M ] d for k = 1, 2, &#8226; &#8226; &#8226; , r 1 + r 2 , from which we deduce</p><p>and</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Department of Mathematics, Duke University, USA. Correspondence to: Shijun Zhang &lt;shijun.zhang@duke.edu&gt;.</p></note>
		</body>
		</text>
</TEI>
