Kernel-based learning algorithms have been extensively studied over the past two decades for their successful applications in scientific research and industrial problem-solving. In classical kernel methods, such as kernel ridge regression and support vector machines, an unregularized offset term naturally appears. While the offset is clearly useful in some situations, its value is debatable in others. It is commonly agreed, however, that the offset term introduces substantial challenges to the optimization and theoretical analysis of these algorithms. In this paper, we demonstrate that Kernel Ridge Regression (KRR) with an offset is closely connected to regularization schemes involving centered reproducing kernels. With the aid of this connection and the theory of centered reproducing kernels, we establish generalization error bounds for KRR with an offset, which show that the algorithm can achieve minimax optimal rates.
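As a concrete illustration of the centering connection described in this abstract, the following minimal sketch fits KRR with an unregularized offset by centering the kernel matrix and the labels. The Gaussian kernel choice and all names (`rbf_kernel`, `fit_krr_offset`) are illustrative assumptions, not code or notation from the paper.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_krr_offset(X, y, lam=1e-2, gamma=1.0):
    """KRR with an unregularized intercept:
    min_{alpha, b} (1/n) ||y - K alpha - b 1||^2 + lam * alpha^T K alpha.
    Centering K and y reduces this to an ordinary KRR with a centered kernel."""
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    Kc = H @ K @ H                                   # centered kernel matrix
    alpha = np.linalg.solve(Kc + n * lam * np.eye(n), y - y.mean())
    b = float(np.mean(y - K @ alpha))                # optimal offset given alpha
    return alpha, b

def predict(X_train, X_test, alpha, b, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha + b
```

Eliminating the offset via its first-order condition leaves exactly the centered system solved above, which is the kind of connection between the offset problem and centered-kernel regularization that the abstract refers to.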
Overparameterized Random Feature Regression with Nearly Orthogonal Data
We investigate the properties of random feature ridge regression (RFRR) given by a two-layer neural network with random Gaussian initialization. We study the non-asymptotic behavior of RFRR with nearly orthogonal deterministic unit-length input data vectors in the overparameterized regime, where the width of the first layer is much larger than the sample size. Our analysis shows high-probability non-asymptotic concentration results for the training errors, cross-validation errors, and generalization errors of RFRR centered around their respective values for a kernel ridge regression (KRR). This KRR is derived from an expected kernel generated by a nonlinear random feature map. We then approximate the performance of the KRR by a polynomial kernel matrix obtained from the Hermite polynomial expansion of the activation function, whose degree only depends on the orthogonality among different data points. This polynomial kernel determines the asymptotic behavior of the RFRR and the KRR. Our results hold for a wide variety of activation functions and input data sets that exhibit nearly orthogonal properties. Based on these approximations, we obtain a lower bound for the generalization error of the RFRR for a nonlinear student-teacher model.
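The comparison at the heart of this abstract, RFRR concentrating around a KRR with the expected kernel, can be illustrated numerically. The sketch below assumes a ReLU activation and unit-length inputs, for which the expected kernel has the closed arc-cosine form; the function names and hyperparameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def relu_features(X, W):
    """phi(x)_j = relu(w_j . x) / sqrt(N): hidden layer of a width-N two-layer net."""
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])

def rfrr_predict(X_tr, y_tr, X_te, N=8192, lam=1e-3, seed=0):
    """Random feature ridge regression in the overparameterized regime (N >> n)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, X_tr.shape[1]))      # random Gaussian first layer
    Phi_tr, Phi_te = relu_features(X_tr, W), relu_features(X_te, W)
    n = len(y_tr)
    # Dual form avoids the N x N solve: theta = Phi^T (Phi Phi^T + lam I)^{-1} y
    alpha = np.linalg.solve(Phi_tr @ Phi_tr.T + lam * np.eye(n), y_tr)
    return Phi_te @ (Phi_tr.T @ alpha)

def expected_relu_kernel(X, Z):
    """E_w[relu(w.x) relu(w.z)] for w ~ N(0, I) and unit-length x, z
    (the order-1 arc-cosine kernel)."""
    G = np.clip(X @ Z.T, -1.0, 1.0)                  # cosines for unit-length rows
    theta = np.arccos(G)
    return (np.sin(theta) + (np.pi - theta) * G) / (2.0 * np.pi)

def krr_predict(X_tr, y_tr, X_te, lam=1e-3):
    """KRR with the expected kernel, around which RFRR concentrates."""
    K = expected_relu_kernel(X_tr, X_tr)
    alpha = np.linalg.solve(K + lam * np.eye(len(y_tr)), y_tr)
    return expected_relu_kernel(X_te, X_tr) @ alpha
```

On nearly orthogonal unit-length inputs, the two predictors returned by `rfrr_predict` and `krr_predict` should be close once the width N is much larger than the sample size, which is the concentration phenomenon the paper quantifies.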
- Award ID(s): 2154099
- PAR ID: 10540436
- Editor(s): Ruiz, Francisco; Dy, Jennifer; van de Meent, Jan-Willem
- Publisher / Repository: Proceedings of Machine Learning Research
- Date Published:
- Volume: 206
- ISSN: 2640-3498
- Page Range / eLocation ID: 8463-8493
- Format(s): Medium: X
- Location: Valencia, Spain
- Sponsoring Org: National Science Foundation
More Like this
- Nyström approximation is a fast randomized method that rapidly solves kernel ridge regression (KRR) problems through sub-sampling the n-by-n empirical kernel matrix appearing in the objective function. However, the performance of such a sub-sampling method heavily relies on correctly estimating the statistical leverage scores used to form the sampling distribution, which can be as costly as solving the original KRR. In this work, we propose a linear-time (modulo poly-log terms) algorithm that accurately approximates the statistical leverage scores in stationary-kernel-based KRR with theoretical guarantees. In particular, by analyzing the first-order condition of the KRR objective, we derive an analytic formula, depending on both the input distribution and the spectral density of the stationary kernel, that captures the non-uniformity of the statistical leverage scores. Numerical experiments demonstrate that, at the same prediction accuracy, our method is orders of magnitude more efficient than existing methods at selecting representative sub-samples in the Nyström approximation. (A generic sketch of Nyström KRR with leverage-score sampling follows this list.)
- Tuning parameter selection is of critical importance for kernel ridge regression. To date, a data-driven tuning method for divide-and-conquer kernel ridge regression (d-KRR) has been lacking in the literature, which limits the applicability of d-KRR to large datasets. In this article, by modifying the generalized cross-validation (GCV) score, we propose a distributed generalized cross-validation (dGCV) as a data-driven tool for selecting the tuning parameters in d-KRR. Not only is the proposed dGCV computationally scalable for massive datasets, it is also shown, under mild conditions, to be asymptotically optimal in the sense that minimizing the dGCV score is equivalent to minimizing the true global conditional empirical loss of the averaged function estimator, extending the existing optimality results of GCV to the divide-and-conquer framework. Supplemental materials for this article are available online. (A pooled GCV-style sketch for divide-and-conquer KRR follows this list.)
- We study the optimization of wide neural networks (NNs) via gradient flow (GF) in setups that allow feature learning while admitting non-asymptotic global convergence guarantees. First, for wide shallow NNs under the mean-field scaling and with a general class of activation functions, we prove that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF. Building upon this analysis, we study a model of wide multi-layer NNs whose second-to-last layer is trained via GF, for which we also prove a linear-rate convergence of the training loss to zero, but regardless of the input dimension. We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart. (A toy mean-field gradient-descent sketch follows this list.)
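For the Nyström entry above, here is a generic sketch of Nyström-restricted KRR with ridge-leverage-score sampling. It computes exact leverage scores from the full kernel matrix purely for illustration; that work's contribution is precisely a near-linear-time approximation of these scores for stationary kernels, which is not reproduced here. The Gaussian kernel and all function names are assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def ridge_leverage_scores(K, lam):
    """l_i = [K (K + n*lam*I)^{-1}]_{ii}: the non-uniform importance of each sample."""
    n = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + n * lam * np.eye(n)))

def nystrom_krr(X, y, m=100, lam=1e-2, gamma=1.0, seed=0):
    """Restrict f to the span of m sampled landmarks and solve the reduced KRR."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = rbf_kernel(X, X, gamma)                      # full kernel, for illustration only
    p = ridge_leverage_scores(K, lam)
    idx = rng.choice(n, size=m, replace=False, p=p / p.sum())
    K_nm, K_mm = K[:, idx], K[np.ix_(idx, idx)]
    # Normal equations of (1/n)||y - K_nm beta||^2 + lam * beta^T K_mm beta
    A = K_nm.T @ K_nm + n * lam * K_mm + 1e-10 * np.eye(m)
    beta = np.linalg.solve(A, K_nm.T @ y)
    return idx, beta                                 # f(x) = sum_j beta_j k(x, x_idx[j])
```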
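For the divide-and-conquer KRR entry, the sketch below fits KRR independently on each data block and scores a candidate regularization parameter with a pooled GCV-type criterion. The exact dGCV score is defined in that article; the aggregation used here (pooled residual sum of squares over a pooled degrees-of-freedom correction) and the function names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gcv_style_score(blocks, lam, gamma=1.0):
    """blocks: list of (X_j, y_j) partitions. Pool per-block residuals and
    effective degrees of freedom into a single GCV-type score for this lam."""
    n_total, rss, df = 0, 0.0, 0.0
    for X_j, y_j in blocks:
        n_j = len(y_j)
        K = rbf_kernel(X_j, X_j, gamma)
        A = K @ np.linalg.inv(K + n_j * lam * np.eye(n_j))   # per-block smoother matrix
        r = y_j - A @ y_j
        rss += float(r @ r)
        df += float(np.trace(A))
        n_total += n_j
    return (rss / n_total) / (1.0 - df / (len(blocks) * n_total)) ** 2

def select_lambda(blocks, lam_grid, gamma=1.0):
    """Pick the lambda minimizing the pooled score; each block's KRR is then refit
    with it and the block estimators averaged (divide-and-conquer KRR)."""
    scores = [gcv_style_score(blocks, lam, gamma) for lam in lam_grid]
    return lam_grid[int(np.argmin(scores))]
```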
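For the last entry, here is a toy mean-field-scaled shallow network trained by full-batch gradient descent (a discrete-time stand-in for gradient flow), tracking the squared training loss whose linear-rate decay that work analyzes. The tanh activation, the step-size scaling by the width m, and all names are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def forward(X, W, a):
    """Mean-field-scaled shallow net: f(x) = (1/m) * sum_j a_j * tanh(w_j . x)."""
    return np.tanh(X @ W.T) @ a / W.shape[0]

def train_shallow(X, y, m=2048, lr=0.2, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W, a = rng.standard_normal((m, d)), rng.standard_normal(m)
    losses = []
    for _ in range(steps):
        H = np.tanh(X @ W.T)                         # n x m hidden activations
        r = H @ a / m - y                            # residuals of the averaged output
        losses.append(float(r @ r / n))
        grad_a = 2.0 / (m * n) * (H.T @ r)
        grad_W = 2.0 / (m * n) * a[:, None] * (((1.0 - H**2) * r[:, None]).T @ X)
        # Per-particle updates; scaling the step by m is a common mean-field
        # time parameterization (a modeling choice here, not taken from the paper).
        a -= lr * m * grad_a
        W -= lr * m * grad_W
    return W, a, losses
```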