

Title: Individualized inference through fusion learning
Abstract

Fusion learning methods, developed to analyze datasets from many different sources, have become a popular research topic in recent years. Individualized inference through fusion learning extends these methods to inference problems for individuals in a heterogeneous population, where similar individuals are fused together to enhance inference for the target individual. Both classical fusion learning and its individualized counterparts are built on weighted aggregation of individual information, but the weights used in the latter are localized to the target individual. This article provides a review of two individualized inference methods through fusion learning, iFusion and iGroup, which are developed under different asymptotic settings. Both procedures guarantee optimal asymptotic theoretical performance and computational scalability.
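
As a rough illustration of the underlying idea, the hypothetical Python sketch below fuses preliminary individual estimates with kernel weights localized at a target individual. It is not the iFusion or iGroup procedure itself; the simulated data, the Gaussian kernel, and the bandwidth are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a heterogeneous population: K individuals, each with a small sample
# drawn from a normal distribution with an individual-specific mean.
K, n = 50, 20
true_means = np.concatenate([rng.normal(0.0, 0.1, K // 2),   # one group of similar individuals
                             rng.normal(3.0, 0.1, K - K // 2)])
data = [rng.normal(mu, 1.0, n) for mu in true_means]

# Preliminary individual estimates (here simply the sample means).
prelim = np.array([sample.mean() for sample in data])

def fused_estimate(target, prelim, bandwidth=0.5):
    """Weighted aggregation localized at the target individual: individuals
    whose preliminary estimates are close to the target's get larger weights,
    so only 'similar' individuals contribute appreciably to the fused estimate."""
    dist = np.abs(prelim - prelim[target])
    w = np.exp(-(dist / bandwidth) ** 2)   # Gaussian kernel weights, centered at the target
    w /= w.sum()
    return float(np.sum(w * prelim))

target = 0
print("individual-only estimate:", prelim[target])
print("fused estimate:          ", fused_estimate(target, prelim))
print("true individual mean:    ", true_means[target])
```

Borrowing strength from similar individuals typically reduces the variance of the target's estimate relative to using that individual's own small sample alone.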

This article is categorized under:

Statistical Learning and Exploratory Methods of the Data Sciences > Manifold Learning

Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods

Statistical and Graphical Methods of Data Analysis > Nonparametric Methods

Data: Types and Structure > Massive Data

 
Award ID(s):
1737857 1741390 1812048 1934924
NSF-PAR ID:
10449116
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
WIREs Computational Statistics
Volume:
12
Issue:
5
ISSN:
1939-5108
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Searching for patterns in data is important because it can lead to the discovery of sequence segments that play a functional role. The complexity of the pattern statistics used in data analysis, and the need for the sampling distributions of those statistics for inference, make efficient computation methods paramount. This article gives an overview of the main methods used to compute distributions of statistics of overlapping pattern occurrences, specifically generating functions, correlation functions, the Goulden‐Jackson cluster method, recursive equations, and Markov chain embedding. The underlying data sequence is assumed to be higher‐order Markovian, which includes sparse Markov models and variable length Markov chains as special cases. Also considered are recent developments that extend the computational capabilities of the Markov chain‐based method through an algorithm for minimizing the size of the chain's state space, as well as improved data modeling capabilities through sparse Markov models. An application to computing a distribution used as a test statistic in sequence alignment illustrates the usefulness of the methodology.
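
    As a small, hedged illustration of Markov chain embedding (not the article's own algorithms), the Python sketch below computes the exact distribution of the number of possibly overlapping occurrences of a short pattern in an i.i.d. sequence, the simplest special case of the Markovian setting. The pattern, sequence length, and symbol probabilities are arbitrary choices for the example; the embedded chain's states track how much of the pattern is currently matched.

    ```python
    import numpy as np

    def prefix_function(pattern):
        """KMP failure function: for each prefix, the length of its longest
        proper suffix that is also a prefix of the pattern."""
        pi = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = pi[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            pi[i] = k
        return pi

    def count_distribution(pattern, n, probs):
        """Distribution of the number of (possibly overlapping) occurrences of
        `pattern` in an i.i.d. sequence of length n with symbol probabilities
        `probs`, via Markov chain embedding: the embedded chain's state is the
        length of the pattern prefix currently matched."""
        m = len(pattern)
        pi = prefix_function(pattern)

        def step(state, symbol):
            # advance the partial-match state after reading `symbol`
            while state and symbol != pattern[state]:
                state = pi[state - 1]
            if symbol == pattern[state]:
                state += 1
            if state == m:                    # full match: count it, follow the failure link
                return pi[m - 1], 1
            return state, 0

        # dp[s, c] = probability of being in embedded state s with c matches so far
        dp = np.zeros((m, n + 1))
        dp[0, 0] = 1.0
        for _ in range(n):
            new = np.zeros_like(dp)
            for s in range(m):
                if not dp[s].any():
                    continue
                for sym, p in probs.items():
                    s2, hit = step(s, sym)
                    if hit:
                        new[s2, 1:] += p * dp[s, :-1]
                    else:
                        new[s2, :] += p * dp[s, :]
            dp = new
        return dp.sum(axis=0)                 # marginal distribution of the count

    dist = count_distribution("HTH", n=10, probs={"H": 0.5, "T": 0.5})
    print("P(no occurrence) =", round(dist[0], 4))
    print("expected count   =", round(sum(k * p for k, p in enumerate(dist)), 4))
    ```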

    This article is categorized under:

    Statistical Learning and Exploratory Methods of the Data Sciences > Pattern Recognition

    Data: Types and Structure > Categorical Data

    Statistical and Graphical Methods of Data Analysis > Modeling Methods and Algorithms

     
  2. Abstract

    Cluster analysis remains one of the most challenging yet fundamental tasks in unsupervised learning. This is due in part to the fact that there are no labels or gold standards by which performance can be measured. Moreover, the wide range of clustering methods available is governed by different objective functions, different parameters, and different dissimilarity measures. The purpose of clustering is versatile, often playing critical roles in the early stages of exploratory data analysis and as an endpoint for knowledge and discovery. Thus, understanding the quality of a clustering is of critical importance. The concept of stability has emerged as a strategy for assessing the performance and reproducibility of data clustering. The key idea is to produce perturbed data sets that are very close to the original and to cluster them. If the clustering is stable, then the clusters found in the original data will be preserved in the perturbed data clusterings. The nature of the perturbation, and the methods for quantifying similarity between clusterings, are nontrivial and are ultimately what set many of the stability estimation methods apart. In this review, we provide an overview of the very active research area of cluster stability estimation and discuss some of the open questions and challenges that remain in the field.
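
    As a minimal, hedged sketch of the key idea (not any particular method from the review), the Python example below perturbs the data by bootstrap resampling, re-clusters each resample with k-means, and scores stability as the average adjusted Rand index against a reference clustering. The data, the clustering algorithm, and the similarity measure are all arbitrary choices for illustration.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    # Toy data with a clear three-group structure (purely illustrative).
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

    def bootstrap_stability(X, k, n_boot=50, seed=0):
        """Rough stability score for k-means with k clusters: cluster bootstrap
        resamples of the data and compare each perturbed clustering with the
        reference clustering on the resampled points via the adjusted Rand index."""
        rng = np.random.default_rng(seed)
        ref = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores = []
        for _ in range(n_boot):
            idx = rng.choice(len(X), size=len(X), replace=True)
            boot = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx])
            scores.append(adjusted_rand_score(ref[idx], boot))
        return float(np.mean(scores))

    for k in (2, 3, 4, 5):
        print(f"k={k}: mean adjusted Rand index = {bootstrap_stability(X, k):.3f}")
    ```

    Higher average scores indicate clusterings that are better preserved under the perturbation.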

    This article is categorized under:

    Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification

     
  3. Abstract

    The rapid development of modeling techniques has brought many opportunities for data‐driven discovery and prediction. However, this also leads to the challenge of selecting the most appropriate model for any particular data task. Information criteria, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), have been developed as a general class of model selection methods with profound connections to foundational ideas in statistics and information theory. Many perspectives and theoretical justifications have been developed to understand when and how to use information criteria, often depending on the particular data circumstances. This review article revisits information criteria by summarizing their key concepts, evaluation metrics, fundamental properties, interconnections, recent advancements, and common misconceptions to enrich the understanding of model selection in general.
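
    As a hedged illustration of how the two criteria are computed and compared in practice (the quadratic data-generating model and the Gaussian polynomial-regression setting are assumptions made only for this example), the Python sketch below scores candidate polynomial degrees with AIC = 2k - 2 log L and BIC = k log n - 2 log L.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated data from a quadratic trend; candidate models are polynomial
    # regressions of increasing degree, compared under a Gaussian likelihood.
    n = 200
    x = rng.uniform(-2, 2, n)
    y = 1.0 - 0.5 * x + 0.8 * x**2 + rng.normal(0, 1, n)

    def information_criteria(x, y, degree):
        """Fit a polynomial of the given degree by least squares and return
        (AIC, BIC), where the parameter count k is the number of regression
        coefficients plus one for the noise variance."""
        X = np.vander(x, degree + 1)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / n                     # Gaussian MLE of the noise variance
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        k = degree + 2                                 # degree+1 coefficients plus sigma^2
        return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

    for d in range(1, 6):
        aic, bic = information_criteria(x, y, d)
        print(f"degree {d}: AIC = {aic:8.2f}   BIC = {bic:8.2f}")
    ```

    With both criteria, smaller values are preferred; BIC's heavier log n penalty makes it favor more parsimonious models than AIC as the sample size grows.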

    This article is categorized under:

    Data: Types and Structure > Traditional Statistical Data

    Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods

    Statistical and Graphical Methods of Data Analysis > Information Theoretic Methods

    Statistical Models > Model Selection

     
  4. Abstract

    With the explosion in available technologies for measuring many biological phenomena on a large scale, there have been concerted efforts in a variety of biological and medical settings to perform systems biology analyses. A crucial question then becomes how to combine data across the various large‐scale data types. This article reviews the data types that can be considered and treats so‐called horizontal and vertical integration analyses. The focus is on the use of multiple testing approaches to perform integrative analyses. Two questions help to clarify the class of procedures that should be used. The first is whether a horizontal or a vertical integration is being performed. The second is whether a given platform takes priority. Based on the answers to these questions, we review various methodologies that could be applied.
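
    As one generic, hedged example of a multiple testing step in a vertical integration (not a specific procedure from this review), the Python sketch below combines per-gene p-values from two hypothetical platforms by taking the maximum p-value, so a gene is flagged only when both platforms show evidence, and then applies the Benjamini-Hochberg procedure to control the false discovery rate. The simulated p-values and the combination rule are assumptions made purely for illustration.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical per-gene p-values from two platforms measured on the same genes
    # (a vertical-integration setting); most genes are null, a few carry signal.
    m = 1000
    p_expr = rng.uniform(size=m)
    p_meth = rng.uniform(size=m)
    signal = rng.choice(m, size=50, replace=False)
    p_expr[signal] = rng.beta(0.5, 20.0, size=50)      # enriched for small p-values
    p_meth[signal] = rng.beta(0.5, 20.0, size=50)

    def benjamini_hochberg(pvals, alpha=0.05):
        """Boolean rejection vector from the Benjamini-Hochberg step-up procedure."""
        m = len(pvals)
        order = np.argsort(pvals)
        below = pvals[order] <= alpha * np.arange(1, m + 1) / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            cutoff = np.nonzero(below)[0].max()        # largest i with p_(i) <= i*alpha/m
            reject[order[: cutoff + 1]] = True
        return reject

    # Flag a gene only if both platforms show evidence, then apply BH at 5% FDR.
    hits = benjamini_hochberg(np.maximum(p_expr, p_meth), alpha=0.05)
    print("genes flagged on both platforms:", int(hits.sum()))
    ```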

    This article is categorized under:

    Statistical Learning and Exploratory Methods of the Data Sciences > Knowledge Discovery

    Statistical and Graphical Methods of Data Analysis > Nonparametric Methods

    Applications of Computational Statistics > Genomics/Proteomics/Genetics

     
  5. Abstract

    This paper provides a review of the literature on methods for constructing prediction intervals for counting variables, with particular focus on counts whose distributions are Poisson or derived from the Poisson with an over‐dispersion property. Both independent and identically distributed models and regression models are considered. The motivating problem for this review is that of predicting the number of daily and cumulative cases or deaths attributable to COVID‐19 at a future date.
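
    As a minimal, hedged illustration of the i.i.d. Poisson case (the simulated data, the normal approximation, and the parametric bootstrap are choices made only for this example, not a specific method from the review), the Python sketch below constructs an approximate 95% prediction interval for the next count that accounts for both the Poisson variability of the future observation and the estimation error in the rate.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    # Observed i.i.d. Poisson counts (e.g., daily counts over a stable period).
    lam_true, n = 12.0, 30
    x = rng.poisson(lam_true, n)
    lam_hat = x.mean()

    # Normal-approximation prediction interval: if Y ~ Poisson(lambda) is the next
    # count, Y - lam_hat has variance approximately lambda * (1 + 1/n), combining
    # the variability of Y with the estimation error in lam_hat.
    half = 1.96 * np.sqrt(lam_hat * (1 + 1 / n))
    print(f"normal-approximation 95% PI: [{lam_hat - half:.1f}, {lam_hat + half:.1f}]")

    # Parametric-bootstrap alternative: resample the data from the fitted model,
    # re-estimate the rate, and simulate the future count from that rate.
    B = 5000
    future = np.array([rng.poisson(rng.poisson(lam_hat, n).mean()) for _ in range(B)])
    lo, hi = np.quantile(future, [0.025, 0.975])
    print(f"parametric-bootstrap 95% PI: [{lo:.1f}, {hi:.1f}]")
    ```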

    This article is categorized under:

    Applications of Computational Statistics > Clinical Trials

    Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods

    Statistical Models > Generalized Linear Models

     