<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Estimation of the Mean Function of Functional Data via Deep Neural Networks</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/07/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10271671</idno>
					<idno type="doi">10.1002/sta4.393</idno>
					<title level='j'>Stat</title>
<idno>2049-1573</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Shuoyang Wang</author><author>Guanqun Cao</author><author>Zuofeng Shang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In this work, we propose a deep neural networks based method to perform nonparametric regression for functional data. The proposed estimators are based on sparsely connected deep neural networks with ReLU activation function. We provide the convergence rate of the proposed deep neural networks estimator in terms of the empirical norm. Through Monte Carlo simulation studies we examine the finitesample performance of the proposed method. Finally, the proposed method is applied to analyze positron emission tomography images of patients with Alzheimer disease obtained from the Alzheimer Disease Neuroimaging Initiative database.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>INTRODUCTION</head><p>Functional data refer to curves or functions, i.e. the data for each variable are viewed as smooth curves, surfaces, or hypersurfaces evaluated at a finite subset of some interval in 1D and 2D (e.g., some period of time, some range of pixels or voxels and so on). Functional data means intrinsically infinite-dimensional but are usually measured discretely. The high intrinsic dimensionality of these data poses challenges both for theory and computation. Functional data analysis (FDA) has been a topic of increasing interest in the statistics community for recent decades. <ref type="bibr">Ramsay and Silverman (2005)</ref> and J. <ref type="bibr">Wang, Chiou, and M&#252;ller (2016)</ref> gave a comprehensive overview of FDA.</p><p>In FDA problems, estimation of mean functions is the fundamental first step; see <ref type="bibr">Cardot (2000)</ref>; <ref type="bibr">Ferraty and Vieu (2006)</ref>; <ref type="bibr">Rice and Wu (2001)</ref> for example. Various methods exist that allow to estimate the regression function nonparametrically. <ref type="bibr">Rice and Wu (2001)</ref> adopted the mixed effect models where the mean function and the eigenfunctions were represented with B-splines and the spline coefficients were estimated by the EM algorithm; <ref type="bibr">Yao, M&#252;ller, and Wang (2005a)</ref> applied the local linear smoothers to estimate the mean and the covariance functions. <ref type="bibr">Morris and Carroll (2006)</ref> generalized the linear mixed model to the functional mixed model framework, with model fitting done by using a Bayesian wavelet-based approach. In <ref type="bibr">Cao, Yang, and Todem (2012)</ref>, a polynomial spline estimator is proposed for the mean function of functional data together with a simultaneous confidence band.</p><p>These nonparametric methods apply the pre-specified basis expansion, e.g., polynomial spline, local linear smoother, wavelet and 0 Abbreviations: FDA, functional data analysis; FPCA, functional principal component analysis; <ref type="bibr">DNN, Deep Neural Networks.</ref> so on, to fit the unknown mean function. The convergence rates achieve either optimal nonparametric rate or parametric rate dependents on how dense of the observed points for each subject.</p><p>Even though FDA has received considerable attention over the last decade, most approaches still focus on 1D functional data. The high intrinsic dimensionality of these data poses challenges both for theory and computation; these challenges vary with how the functional data were sampled. Hence, few are developed for general multi-dimensional functional data. Recently, several attempts have been made to extend these nonparametric methods for spatial and image data. Y. <ref type="bibr">Wang, Wang, Wang, and Ogden (2020)</ref> used bivariate splines over triangulations to handle an irregular domain of the images that is common in brain imaging studies. The proposed spline estimators of the mean functions are shown to be consistent and asymptotically normal. However, the triangularized bivariate splines are designed for 2D functions only. Extending spline basis functions for general d-dimensional data observed on an irregular domain is very sophisticated and becomes extremely complex as d increases. B. <ref type="bibr">Wang, Nan, Zhu, and Koeppe (2014)</ref> proposed a regularized Haar wavelet-based approach for the analysis of 3D brain image data in the framework of functional linear regression model.</p><p>Another popular method is functional principal component analysis (FPCA) which is an extension of multivariate principal component analysis, see <ref type="bibr">Hall, M&#252;ller, and Wang (2006)</ref>; <ref type="bibr">Yao, M&#252;ller, and Wang (2005b)</ref> for example. Recently, there are a few studies on 2D FDA. <ref type="bibr">Zhou and Pan (2014)</ref> proposed a smooth FPCA for 2D functions on irregular planar domains; their approach is based on a mixed effects model that specifies the principal component functions as bivariate splines on triangulations and the principal component scores as random effects. <ref type="bibr">Lila, Aston, and Sangalli (2016)</ref> proposed a FPCA model that can handle real functions observable on a 2D manifold. <ref type="bibr">Chen and Jiang (2017)</ref> extended it to analyze functional/longitudinal data observed on a general ddimensional domain. When applying FPCA, how to choose the number of eigenfunctions is an important practical issue without a satisfactory theoretical solution. Presumably, the larger the number of eigenfunctions, the more flexible the approximation would be, and hence, the closer to the true curve. However, a large number of eigenfunctions always result in a complex model which introduces difficulties to follow-up analysis.</p><p>For many years, the use of neural networks has been one of the most promising approaches in connection with applications related to approximation and estimation of multivariate functions (see, e.g., <ref type="bibr">Anthony and Bartlett (2009)</ref>; <ref type="bibr">Ripley (2014)</ref>). Recently, the focus is on multilayer neural networks, which use many hidden layers, and the corresponding techniques are called deep learning.</p><p>Under the nonparametric regression model, via sparsely connected deep neural networks, Schmidt-Hieber (2020) and <ref type="bibr">Liu, Shang, and Cheng (2021)</ref> showed that the L 2 risk of the least squares neural network regression estimator achieves the same minimax rate of convergence (up to a logarithmic factor) as proposed in <ref type="bibr">Stone (1982)</ref>. Furthermore, this neural network estimator does not suffer the curse-of-dimensionality which is a classical drawback in the traditional nonparametric regression framework. <ref type="bibr">Bauer and Kohler (2019)</ref> has also obtained the similar results under deep learning framework via a different activation function. <ref type="bibr">Liu, Boukai, and Shang (2019)</ref> further removed the logarithmic factors to achieve exact optimal nonparametric rate.</p><p>Although considerable advances have been achieved in deep learning research, from the statistical perspective its application and theoretical research is still in its infancy stage <ref type="bibr">(Fan, Ma, &amp; Zhong 2019)</ref>. There are many technical challenges left for statisticians.</p><p>For example, the availability of scalable computing and stochastic optimization techniques are challenge for developing statistical asymptotic properties. <ref type="bibr">Rossi, Delannay, Conan-Guez, and Verleysen (2005)</ref> extended the Radial-Basis function networks and multilayer perceptron models to functional data inputs. However, neither the consistency of the proposed estimator nor deep neural network architectures have been considered in their work. Recently, there are some works proposed for deep learning algorithms from statistical point of view <ref type="bibr">(Thind, Multani, &amp; Cao 2020a</ref><ref type="bibr">2020b)</ref>. Motivated by these desiderata, the main goal of this article is to provide a novel method of FDA in the deep neural network framework.</p><p>The contributions of the present paper are four-fold. First, to our best knowledge, this is the first work on proposing deep neural networks (DNN) based estimator for FDA. An R package "FDADNN" has been developed and is available from GitHub website.</p><p>Second, we develop the convergence rate (in empirical norm) of the proposed neural networks estimator. It is well-known that when the observed points come from a hypercube, i.e., [0, 1] d , d = 3 for 3D imaging study, the nonparametric convergence rates are slower than the optimal nonparametric rate. This means that no statistical procedure can perfectly recover the signal pointwisely. However, by borrowing the advantage from the deep learning domain, the convergence rate of the proposed DNN estimator does not depend on the dimension d. Third, our proposed DNN estimator is unified for any dimensional functional data, which has broader and more flexible applications. Finally, we do not assume additional or complex structure for the true mean function, for example, additive models or single-index models. As in the deep learning domain, the true regression functions are assumed to be constructed in a modular form and the modularity of the system can be fairly complex, which resolve the misspecification issue.</p><p>Different from the existing neural network literature on nonparametric regression <ref type="bibr">Bauer and Kohler (2019)</ref>, Schmidt-Hieber (2020) and <ref type="bibr">Liu et al. (2021)</ref> which only handle i.i.d. data, we focus on FDA, where each subject is a random curve in a hypercube.</p><p>Because of this special data structure, the major challenge becomes to deal with the correlation among the N evaluation points in the framework of neural network, which has not been achieved in the existing works. It is not surprising that the convergence rate decreases with n (number of subjects) as well as N.</p><p>The paper is structured as follows. Section 2 provides the model setting in FDA and introduces multilayer feed-forward artificial neural networks and discusses mathematical modeling. The implementation on hyperparameter selections also be included in Section 2. The theoretical properties of the proposed DNN estimator can be found in Section 4. In Section 5, it is shown that the finite sample performance of proposed neural network estimator. The proposed method is applied to the spatially normalized positron emission tomography (PET) data from Alzheimer Disease Neuroimaging Initiative (ADNI) in Section 6 and make some concluding remarks in Section 7. The proof of the main result together with additional discussion can be found in Appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">MODELS AND DNN ESTIMATOR</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">FDA model</head><p>Denote by Y ij the j-th observation of the random curve</p><p>For simple notations, we examine the equally spaced design, in other words, X ij = X j = j/N. The main results can be extended to irregularly spaced design.</p><p>Without loss of generality, let</p><p>The error term i (X j ) has mean zero and finite variance.</p><p>In this work, we consider the following classical FDA model:</p><p>where</p><p>is a Gaussian process characterizing individual curve variations from f 0 (&#8226;) with mean zero and Cov(&#951;(X j ), &#951;(X j )) := G(X j , X j ). Let i X j = &#964; X j &#949; ij , where &#949; ij 's are independent normal random variables and &#964; (X) is the standard deviation function bounded above zero for any X &#8712; [0, 1] d . By Mercer's Theorem, covariance function G(X, X ) has the following spectrum decomposition</p><p>where {&#955; k } &#8734; k=1 and {&#968; k (X)} &#8734; k=1 are the eigenvalues and eigenfunctions of G(X, X ), and</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Deep Neural Networks</head><p>Before conducting the estimation of the mean function f 0 in (1) via the DNN, let us briefly introduce the necessary notations and terminologies used in the neural networks. From the high level, typical DNN use a composition of a series of simple nonlinear functions to model nonlinearity, i.e.,</p><p>where &#8226; denotes composition of two functions and L is called the number of hidden layers or depth of a DNN model. One can define</p><p>for each 1 &#8804; l &#8804; L recursively and h 0 = x. "Deep" in deep neural networks refers to the use of multiple layers in the network. In the feed-forward neural network, the information moves in only one direction-forward-from the input layers, through the hidden layers and to the output nodes layers. In this kind of neural nets, there is a specific choice of g l : g l (h l-1 ) = &#963;(W l h l-1 ), l = 1, . . . , L, where W l is a p l &#215; p l+1 weight matrix in the l-th layer and p = (p 0 , . . . , p L+1 ) is the width vector. The nonlinear function &#963; is called the activation function. Here, we study the popular rectifier linear unit (ReLU) activation function applied element-wise [&#963;(x)] j = max(x j , 0), j = 1, . . . , d. For any vector</p><p>We call v the activation vector. The DNN model is then defined as</p><p>To fit networks with data generated from the d-dimensional hypercube functional data model, we must have p 0 = d and p L+1 = 1.</p><p>Without loss of generality, we assume for any f &#8712; F , its empirical L 2 norm is bounded , i.e.,</p><p>Although the depth and width of the neural nets can be extremely deep and wide, overfitting and computational burden are serious problems in such networks. To overcome these issues, the networks are modeled by assuming that each unit will be active only for a small fraction of the data to avoid overfitting. Smaller weights in a neural network can result in a model that is more stable and less likely to overfit the training dataset, in turn having better performance when making a prediction on new data <ref type="bibr">(Srivastava, Hinton, Krizhevsky, Sutskever, &amp; Salakhutdinov 2014)</ref>. Therefore, we assume that there are only few non-zero network parameters. Equivalently, we define the sparse neural networks and add constrains on the maximum-entry norm and non-zero entries of weight matrix W l and activation vector v l . The sparse neural networks for our functional data model are given by</p><p>where s &gt; 0, &#8226; &#8734; denotes the maximum-entry norm and &#8226; 0 denotes the number of non-zero entries, respectively. Let v 0 be a zero vector for simply notation. The selecting procedures of unknown tuning parameters (L, p, s) shall be given in the Section 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Deep neural network estimator</head><p>In the functional data regression model, the common objective is to find an optimal estimator by least-square loss function. In the neural network setting, this coincides with training neural networks by minimizing the empirical risk over all the training data. In particularly, given the networks in (3) and denote F = F (L, p, s), the proposed DNN estimator is defined as</p><p>where</p><p>Different from classical nonparametric estimators, f has no analytical expression or basis expansion expression. Hence, to better understand the reasons that this DNN estimator has excellent performance, we first project f 0 onto the network space F , namely, f * := arg min f&#8712;F f 0 -f &#8734;. In other words, f * is the best possible approximation of f 0 in F . Note</p><p>where &#961;</p><p>Hence, we follow the conventional approximation-estimation decomposition (or bias-variance trade-off) to decompose the empirical norm</p><p>The above equation indicates that the empirical norm of the estimator is bounded by two items. The first item is the approximation error and essentially determined by the distance between the network class F and true function class f 0 , which can be arbitrarily small according to <ref type="bibr">Yarotsky (2017)</ref>. From statistical point of view, the second item is the estimation error and is a weighted average of a random process. It is affected by the parameters in F , true mean function class, and the characteristic of the error terms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">IMPLEMENTATION</head><p>In this section, we discuss the detailed computational procedure for the proposed DNN estimator in (4). The following proposed computational procedure can be easily realized via R package "FDADNN" which is available at <ref type="url">https://github.com/     FDASTATAUBURN/FDADNN</ref> .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Neural network's architecture selection</head><p>Tuning parameters are crucial as they control the overall behavior of the proposed estimator and the learning process. In machine learning, those parameters are called network architecture parameters. A neural network's architecture can simply be defined as the number of layers L, and the number of hidden neurons within these layers p. In our considered sparse neural network space F , sparse parameter s should also be carefully selected. Note that in the practice, it's unrealistic to control the exact number of inactive nodes, so instead of using sparse parameter s, we add an L 1 penalty to control the number of active nodes in each layer during optimization procedure. Denote &#950; as the L 1 regularization factor. In the following, we utilize &#950; to replace sparse parameter s in the numerical analysis. The ultimate goal is to find an optimal combination of (L, p, &#950;) that minimizes a pre-defined loss function to give better results. There are fairly large numbers of literature discussing the optimization selection, such as grid search, random search, and Bayesian optimization. Considering the computational efficiency and statistical properties, we recommend the following data-adaptive selection procedure in the practical application. The further justification of the optimization algorithm is beyond the scope of this work and shall be anther interesting and challenge topic for the future work.</p><p>We set the same neuron numbers for each layer for simplicity, i.e. p = (p, . . . , p), and we follow the rule that p is increasing as n and N are increasing. We use K-fold cross-validation to choose (L, p, &#950;), i.e., (Lopt, popt, &#950;opt) = arg min</p><p>where &#920; is a architecture parameter space which contains pre-selected choices of (L, p, &#950;). Typically, K = 5 or 10. For the kth cross-validation, at the j-th grid point, Y (-k)</p><p>&#8226;j (L, p, &#950;) denotes the estimated output given (L, p, &#950;) and Y (k)</p><p>&#8226;j is the average of observations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Training neural networks</head><p>The minimization in ( <ref type="formula">4</ref>) is usually done via stochastic gradient descent (SGD). In a way similar to gradient descent, in each update, a small sub-sample called a batch which is typically of size B = 32 to 512, is randomly drawn and the gradient calculation is only on the sub-sample instead of the whole training dataset. This saves considerably the computational cost in calculation of gradient.</p><p>By the law of large numbers, this stochastic gradient should be close to the full sample one, albeit with some random fluctuations.</p><p>We choose B = 32 or 64 batches depending on the performance of convergence. A pass of the whole training set is called an There are certainly some challenges for SGD to train DNN. For example, albeit good theoretical guarantees for well-behaved problems, SGD might converge very slowly; the learning rates are difficult to tune (Allen-Zhu, Li, &amp; Song 2019). To address these challenges, several variants gradient-based optimization algorithms are introduced, such as Adam, RMSprop and Adadelta.</p><p>Instead of the classical SGD procedure, Adam is a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. Hence, it is well suited for problems when there are large sample size and parameters <ref type="bibr">(Kingma &amp; Ba 2015)</ref>. In our numerical studies, Adam provides the best results and is the most computationally efficient among these candidates. We recommend Adam in the real life applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">THEORETICAL PROPERTIES OF THE DNN ESTIMATOR</head><p>In this section, we develop the convergence rate of the proposed DNN estimator in (4). For simple notations, log means the logarithmic function with base 2. For sequences (an)n and (bn)n, an bn means an &#8804; c 1 bn and an &#8805; c 2 bn where c 1 and c 2 are absolute constants for any n. Let C N = [G(X j , X j )/N] N j,j =1 be the N &#215; N kernel matrix corresponding to covariance function G. We now introduce the main assumptions:</p><p>Moreover, the maximal eigenvalue of the kernel matrix C N satisfies &#955; 1,N = O(N -) for some constant &#8805; 0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(A4)</head><p>The DNN estimator f &#8712; F (L, p, s), where L log(nN ), s (nN ) <ref type="formula">A1</ref>) is a natural definition for neural network, which is fairly flexible and many well known function classes are contained in it. For example, the additive model f 0 (x) = d i=1 f i (x i ), can be written as a composition of two functions <ref type="figure">d</ref>, <ref type="figure">d</ref>, <ref type="figure">1</ref>) and t = (1, d). The generalized additive model f 0 (x) = h d i=1 f i (x i ) , it can be written as a composition of three functions f 0 = g 2 &#8226; g 1 &#8226; g 0 , with g 0 , g 1 described above, and g 2 = h. Assumption (A2) is a standard assumption for the variance of measurement errors, which requires the bounded variance of measurement error over the whole space. This assumption has been widely used in functional data nonparametric regression literature, see <ref type="bibr">Cao et al. (2012)</ref>; <ref type="bibr">Yao et al. (2005a)</ref> for example. Assumption (A3) is a standard eigenvalue assumption for Mercer kernel and it is widely used assumption for covariance functions in FDA literature, see <ref type="bibr">Cao, Wang, Li, and Yang (2016)</ref>; <ref type="bibr">Li and Hsing (2010)</ref> for example. We also provide two examples to demonstrate (A3) is a reasonable assumption in the supplementary file. By <ref type="bibr">Braun (2006)</ref>, Assumption (A3) trivially holds for = 0 (see Propositions ?? and ?? in the supplementary file), and may even hold for some positive as revealed by Examples 1 in Section ?? in the supplementary file. Assumption (A4) depicts the architecture and parameters' setting in the network space.</p><p>We assume a natural compositional function class for the true mean function f 0 :</p><p>, i = 1, . . . , q, with unknown parameters d i and q.</p><p>The following theorem establishes the convergence rate of the DNN estimator f under the empirical norm. Its proof and some technical lemmas will be provided in the supplementary file.</p><p>Theorem 1. Under Assumptions (A1)-(A4), with probability greater than (1 -2 nN ) log(nN ) +1 &#8594; 1, we have</p><p>where &#8805; 0, &#952; = min i=0,...,q</p><p>, c is a constant only depends on t, d, &#946; which are defined in (??) in the Appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">SIMULATION</head><p>To illustrate how the introduced nonparametric regression estimators based on our proposed neural networks method behave in case of finite sample sizes, we conduct substantial simulations for both 2D and 3D functional data. We use empirical L 2 risk</p><p>as the criterion to evaluate the performance of estimators.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">2D simulation</head><p>In this simulation, the 2D images are generated from the model:</p><p>where</p><p>To demonstrate the practical performance of our theoretical results, we consider the following two mean functions:</p><p>and the corresponding images are shown in the first row of Figure <ref type="figure">1</ref>. To simulate the within-subject dependence for each subject i, we generate &#951; (&#8226;) from a Gaussian process, with mean 0, and covariance function</p><p>We consider sample size n = 50, 100, 200 and for each image, let N 2 = 15 or 25, which means for each 2D image, the number of observational points (pixels) is set to be N = N 2 2 = 225 or 625. The neural network ( <ref type="formula">4</ref>) is trained through optimizer Adam with architecture parameters (L, p, &#950;) selected from L &#8712; {3, 4}, p &#8712; {100, 300, 500, 1000, 2000}, &#950; &#8712; {10 -4 , 10 -5 , 10 -6 , 10 -7 } and learning rate 0.001. 10-fold cross-validation method discussed in Section 3 is applied to select the optimal architecture parameters in each Monte Calro simulation. Epochs are selected from 300 to 500 and batch size is chosen as 32. We find the convergence of algorithms is promising.</p><p>The alternative approach for 2D case we considered is a 2D regression spline method (bivariate spline). With regard to the variety of modifications of this approach known in the literature, we focus on the version for 2D FDA in <ref type="bibr">Lai and Wang (2003)</ref>. Let B (x) = {Bm(x)} m&#8712;M be the set of bivariate Bernstein basis polynomials, where M stands for an index set of Bernstein basis polynomials. Then we can represent any bivariate function f(x) by f(x) &#8776; B (x)&#947; where &#947; = (&#947;m, m &#8712; M) is the bivariate spline coefficient vector. The estimator f BS is implemented by the R package BPST, which was developed by the authors of <ref type="bibr">Lai and Wang (2003)</ref>.</p><p>The second and the third rows in Figure <ref type="figure">1</ref> depicts the proposed neural network estimator f DNN and bivariate spline estimator f BS when n = 200, N = 625 and &#963; = 1. Table <ref type="table">1</ref> summarizes the empirical L 2 risk and standard deviation of estimators f DNN and f BS under 100 simulations for two different noise levels. From the above figures and table, one can see that our method and the bivariate spline method have fairly similar estimation performances. As the bivariate spline estimator is able to achieve the optimal nonparametric convergence rate Y. <ref type="bibr">Wang et al. (2020)</ref>, the comparable estimation results in Tables <ref type="table">1</ref> and <ref type="table">2</ref> also support the asymptotic convergence rate of our proposed estimator f DNN in Theorem 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">3D simulation</head><p>For 3D simulation, the images are generated from the model ( <ref type="formula">7</ref>) in 2D case. The true mean function is</p><p>spaced grid points in each dimension on [0, 1] 3 and N 3 N 3 N 3 = N. Here, we mimic the number of voxels of the real data, which usually have different values for N 3 , N 3 and N 3 . For each subject, the within-imaging dependence &#951; (&#8226;) is generated from a Gaussian process with mean 0, and covariance function G 0 x j , x j = 3 k=1 cos 2&#960;(x kj -x kj ) , j, j = 1, . . . , N. Measurement errors i (&#8226;) are generated the same as 2D case. We consider sample size n = 50, 100, 200 and N = 3, 000 (20 &#215; 15 &#215; 10) and 4, 500 (30&#215;15&#215;10). Results of each setting are based on 100 simulations. The training of neural networks architecture (L, p, s) follows the same procedures as in 2D case. Architecture parameters (L, p, &#950;) selected from L &#8712; {3, 4}, p &#8712; {100, 300, 500, 1000, 2000, 5000}, &#950; &#8712; {10 -4 , 10 -5 , 10 -6 , 10 -7 }. The triangularized bivariate splines method proposed in Y. <ref type="bibr">Wang et al. (2020)</ref> are designed for 2D functions only. Extending spline basis functions for 3D functional data is very sophisticated and to our best knowledge, it is not available for 3D FDA yet. Hence, we only conduct 3D numerical analysis with our proposed DNN method. To exam the performance of the estimator f, we also summarizes the empirical L 2 risk and standard deviation of estimators f DNN in Table <ref type="table">3</ref>. It is clear to find that the empirical risk decrease when sample sizes or observed voxels numbers increase for both noise levels, which supports our theoretical findings. The mean function f 0 and its DNN estimator are presented in Figure <ref type="figure">2</ref>. It is easy to conclude that the DNN estimator follows the the same pattern as the true mean function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>6</head><p>ADNI PET DATA ANALYSIS</p><p>The dataset used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). The ADNI is a longitudinal multicenter study designed to develop clinical, imaging, genetic, and biochemical biomarkers for the early detection and tracking of AD. From this database, we collect PET data from 79 patients in AD group. This PET dataset has been spatially normalized and post-processed. These AD patients have three to six times doctor visits and we only select the PET scans obtained in the third visits. Patients' age ranges from 59 to 88 and average age is 76.49. There are 33 females and 46 males among these 79 subjects. All scans were reoriented into 79 &#215; 95 &#215; 69 voxels, which means each patient has 69 sliced 2D images with 79 &#215; 95</p><p>pixels. For 2D case, it means each subject has N = 7, 505 = 79 &#215; 95 observed pixels for each selected image slice. For 3D case, the observed number of voxels for each patient's brain sample is N = 79 &#215; 95 &#215; 69, which is more than 0.5 million.</p><p>For 2D case, we select the 20-th, 40-th and 60-th slices from 69 slices for each patient. We first take average across 79 patients for each slices (the first row in Figure <ref type="figure">3</ref>). Then, based on the averaged images, we obtain the proposed DNN estimators for each slice (the second row in Figure <ref type="figure">3</ref>). We also recover the image with higher resolutions 512 &#215; 512 pixels, instead of the original 95 &#215; 69 pixels for each slice (the third row in Figure <ref type="figure">3</ref>). The neural network (4) is trained through optimizer Adam with architecture parameters (L, p, &#950;) selected from L &#8712; {3, 4}, p &#8712; {500, 1000}, &#950; &#8712; {10 -5 , 10 -6 , 10 -7 } and learning rate 0.001. 10-fold crossvalidation method selects the optimal architecture parameters Lopt = 3, popt = 1000 and &#950;opt = 10 -7 . We used 300 to 500 epochs and 2 to 8 as batch size given different slices.</p><p>In 3D case, on 79 patients, and total 79 &#215; 95 &#215; 69 voxels. Same as 2D case, we first average the total 79 3D scans into one 3D scan, and then perform neural network to train the model based on the averaged 3D image. In the bottom row of Figure <ref type="figure">3</ref>, we break down the recovered 3D image and show the recovered 20-th, 40-th and 60-th slices. The neural network (4) is trained through optimizer Adam with architecture parameters (L, p, &#950;) selected from L &#8712; {3, 4}, p &#8712; {1000, 1500}, &#950; &#8712; {10 -5 , 10 -6 , 10 -7 }. 10fold cross-validation method selects the optimal architecture parameters Lopt = 4, popt = 1500 and &#950;opt = 10 -7 . According to our numerical experience, we find 300 epochs and 64 batch size providing the reasonable well results.</p><p>In Figure <ref type="figure">4</ref>, we also recover the image in higher resolutions 128 &#215; 128 &#215; 128 voxels, which means instead of the original 79 &#215; 95 &#215; 69 voxels, we can provide the estimated image slices with higher resolution (128 &#215; 128 pixels, instead of the original 79 &#215; 95 pixels) at finer grid points (128 points, instead of the original 69 points).</p><p>For model assessment, we split the data into to training and testing groups, with 60 patients and 19 patients separately. Table <ref type="table">4</ref> summarizes the performance of the proposed estimator based on empirical L 2 risks. We conclude that the proposed functional regression model and DNN estimator have satisfied numerical results in this real application.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">DISCUSSION</head><p>In this work, we resolve the model misspecification issue in multi-dimensional FDA via the promising technique from the deep learning domain. By properly choosing network architecture, our estimator achieves the optimal nonparametric convergence rate in empirical norm. To our best knowledge, this is the first piece of work in FDA, which yields attractive empirical convergence rate for multi-dimensional FDA, and at meanwhile is free from model misspecification. Numerical analysis demonstrates that our approach is useful in recovering the signal for imaging data. Some interesting future works may include the functional linear regression model and classification problems in the framework of DNN.  f 0 fDNN FIGURE 2 Two different angles (Left and Right panels) to view the true mean function and the DNN estimator in 3D simulation  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>FIGURES AND TABLES</head></div></body>
		</text>
</TEI>
