Abstract The rapid development of modeling techniques has brought many opportunities for data‐driven discovery and prediction. However, this also leads to the challenge of selecting the most appropriate model for any particular data task. Information criteria, such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC), have been developed as a general class of model selection methods with profound connections with foundational thoughts in statistics and information theory. Many perspectives and theoretical justifications have been developed to understand when and how to use information criteria, which often depend on particular data circumstances. This review article will revisit information criteria by summarizing their key concepts, evaluation metrics, fundamental properties, interconnections, recent advancements, and common misconceptions to enrich the understanding of model selection in general. This article is categorized under: Data: Types and Structure > Traditional Statistical Data; Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods; Statistical and Graphical Methods of Data Analysis > Information Theoretic Methods; Statistical Models > Model Selection
Individualized inference through fusion learning
Abstract Fusion learning methods, developed for the purpose of analyzing datasets from many different sources, have become a popular research topic in recent years. Individualized inference approaches through fusion learning extend fusion learning approaches to individualized inference problems over a heterogeneous population, where similar individuals are fused together to enhance the inference over the target individual. Both classical fusion learning and individualized inference approaches through fusion learning are established based on weighted aggregation of individual information, but the weight used in the latter is localized to the target individual. This article provides a review on two individualized inference methods through fusion learning, iFusion and iGroup, that are developed under different asymptotic settings. Both procedures guarantee optimal asymptotic theoretical performance and computational scalability. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Manifold Learning; Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods; Statistical and Graphical Methods of Data Analysis > Nonparametric Methods; Data: Types and Structure > Massive Data
- PAR ID: 10449116
- Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name: WIREs Computational Statistics
- Volume: 12
- Issue: 5
- ISSN: 1939-5108
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Abstract Optimal transport (OT) methods seek a transformation map (or plan) between two probability measures, such that the transformation has the minimum transportation cost. Such a minimum transport cost, with a certain power transform, is called the Wasserstein distance. Recently, OT methods have drawn great attention in statistics, machine learning, and computer science, especially in deep generative neural networks. Despite its broad applications, the estimation of high‐dimensional Wasserstein distances is a well‐known challenging problem owing to the curse‐of‐dimensionality. There are some cutting‐edge projection‐based techniques that tackle high‐dimensional OT problems. Three major approaches of this kind are introduced: the slicing approach, the iterative projection approach, and the projection robust OT approach. Open challenges are discussed at the end of the review. This article is categorized under: Statistical and Graphical Methods of Data Analysis > Dimension Reduction; Statistical Learning and Exploratory Methods of the Data Sciences > Manifold Learning
- Abstract A fundamental problem in functional data analysis is to classify a functional observation based on training data. The application of functional data classification has gained immense popularity and utility across a wide array of disciplines, encompassing biology, engineering, environmental science, medical science, neurology, social science, and beyond. The phenomenal growth of the application of functional data classification indicates the urgent need for a systematic approach to develop efficient classification methods and scalable algorithmic implementations. We therefore conduct a comprehensive review of classification methods for functional data. The review aims to bridge the gap between the functional data analysis community and the machine learning community, and to inspire new principles for functional data classification. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification; Statistical Models > Classification Models; Data: Types and Structure > Time Series, Stochastic Processes, and Functional Data
- Biologists routinely fit novel and complex statistical models to push the limits of our understanding. Examples include, but are not limited to, flexible Bayesian approaches (e.g. BUGS, Stan), frequentist and likelihood‐based approaches (e.g. the package lme4) and machine learning methods. These software packages and programs afford the user greater control and flexibility in tailoring complex hierarchical models. However, this level of control and flexibility places a higher degree of responsibility on the user to evaluate the robustness of their statistical inference. To determine how often biologists are running model diagnostics on hierarchical models, we reviewed 50 papers published in 2021 in the journal Nature Ecology & Evolution, and we found that the majority of published papers did not report any validation of their hierarchical models, making it difficult for the reader to assess the robustness of their inference. This lack of reporting likely stems from a lack of standardized guidance for best practices and standard methods. Here, we provide a guide to understanding and validating complex models using data simulations. To determine how often biologists use data simulation techniques, we also reviewed 50 papers published in 2021 in the journal Methods Ecology & Evolution. We found that 78% of the papers that proposed a new estimation technique, package or model used simulations or generated data in some capacity (18 of 23 papers); but very few of those papers (5 of 23 papers) included either a demonstration that the code could recover realistic estimates for a dataset with known parameters or a demonstration of the statistical properties of the approach. To distil the variety of simulation techniques and their uses, we provide a taxonomy of simulation studies based on the intended inference. We also encourage authors to include a basic validation study whenever novel statistical models are used, which, in general, is easy to implement. Simulating data helps a researcher gain a deeper understanding of the models and their assumptions and establish the reliability of their estimation approaches. Wider adoption of data simulations by biologists can improve statistical inference, reliability and open science practices.
- Abstract Graphs representing complex systems often share a partial underlying structure across domains while retaining individual features. Thus, identifying common structures can shed light on the underlying signal, for instance, when applied to scientific discovery or clinical diagnoses. Furthermore, growing evidence shows that the shared structure across domains boosts the estimation power of graphs, particularly for high‐dimensional data. However, building a joint estimator to extract the common structure may be more complicated than it seems, most often due to data heterogeneity across sources. This manuscript surveys recent work on statistical inference of joint Gaussian graphical models, identifying model structures that fit various data generation processes. This article is categorized under: Data: Types and Structure > Graph and Network Data; Statistical Models > Graphical Models
