skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on January 1, 2026

Title: Bayesian Multi-Group Gaussian Process Models for Heterogeneous Group-Structured Data
Gaussian processes are pervasive in functional data analysis, machine learning, and spatial statistics for modeling complex dependencies. Scientific data are often heterogeneous in their inputs and contain multiple known discrete groups of samples; thus, it is desirable to leverage the similarity among groups while accounting for heterogeneity across groups. We propose multi-group Gaussian processes (MGGPs) defined over Rp×C , where C is a finite set representing the group label, by developing general classes of valid (positive definite) covariance functions on such domains. MGGPs are able to accurately recover relationships between the groups and efficiently share strength across samples from all groups during inference, while capturing distinct group-specific behaviors in the conditional posterior distributions. We demonstrate inference in MGGPs through simulation experiments, and we apply our proposed MGGP regression framework to gene expression data to illustrate the behavior and enhanced inferential capabilities of multi-group Gaussian processes by jointly modeling continuous and categorical variables.  more » « less
Award ID(s):
2113778
PAR ID:
10611614
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
https://www.jmlr.org/papers/v26/23-0291.html
Date Published:
Journal Name:
Journal of machine learning research
Volume:
26
Issue:
30
ISSN:
1532-4435
Page Range / eLocation ID:
1-34
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. After selection with the Group LASSO (or generalized variants such as the overlapping, sparse, or standardized Group LASSO), inference for the selected parameters is unreliable in the absence of adjustments for selection bias. In the penalized Gaussian regression setup, existing approaches provide adjustments for selection events that can be expressed as linear inequalities in the data variables. Such a representation, however, fails to hold for selection with the Group LASSO and substantially obstructs the scope of subsequent post-selective inference. Key questions of inferential interest—for example, inference for the effects of selected variables on the outcome—remain unanswered. In the present paper, we develop a consistent, post-selective, Bayesian method to address the existing gaps by deriving a likelihood adjustment factor and an approximation thereof that eliminates bias from the selection of groups. Experiments on simulated data and data from the Human Connectome Project demonstrate that our method recovers the effects of parameters within the selected groups while paying only a small price for bias adjustment. 
    more » « less
  2. Multilevel modeling and multi-task learning are two widely used approaches for modeling nested (multi-level) data, which contain observations that can be clustered into groups, characterized by their group-level features. Despite the similarity of the problems they address, the explicit relationship between multilevel modeling and multi-task learning has not been carefully examined. In this paper, we present a comparative analysis between the two methods to illustrate their strengths and limitations when applied to two-level nested data. We provide a detailed analysis demonstrating the equivalence of their formulations under a mild condition from an optimization perspective. We also demonstrate their limitations in terms of their predictive performance and especially, their difficulty in identifying potential cross-scale interactions between the local and group-level features when applied to datasets with either a small number of groups or limited training examples per group. To overcome these limitations, we propose a novel method for disaggregating the coarse-scale values of the group-level features in the nested data. Experimental results on both synthetic and real-world data show that the disaggregated group-level features can help enhance the prediction accuracy of the models significantly and identify the cross-scale interactions more effectively. 
    more » « less
  3. The study of complex human interactions and group activities has become a focal point in human-centric computer vision. However, progress in related tasks is often hindered by the challenges of obtaining large-scale labeled datasets from real-world scenarios. To address the limitation, we introduce M3Act, a synthetic data generator for multi-view multi-group multi-person human atomic actions and group activities. Powered by Unity Engine, M3Act features multiple semantic groups, highly diverse and photorealistic images, and a comprehensive set of annotations, which facilitates the learning of human-centered tasks across singleperson, multi-person, and multi-group conditions. We demonstrate the advantages of M3Act across three core experiments. The results suggest our synthetic dataset can significantly improve the performance of several downstream methods and replace real-world datasets to reduce cost. Notably, M3Act improves the state-of-the-art MOTRv2 on DanceTrack dataset, leading to a hop on the leaderboard from 10th to 2nd place. Moreover, M3Act opens new research for controllable 3D group activity generation. We define multiple metrics and propose a competitive baseline for the novel task. Our code and data are available at our project page: http://cjerry1243.github.io/M3Act. 
    more » « less
  4. Abstract MotivationGenome-wide maps of epigenetic modifications are powerful resources for non-coding genome annotation. Maps of multiple epigenetics marks have been integrated into cell or tissue type-specific chromatin state annotations for many cell or tissue types. With the increasing availability of multiple chromatin state maps for biologically similar samples, there is a need for methods that can effectively summarize the information about chromatin state annotations within groups of samples and identify differences across groups of samples at a high resolution. ResultsWe developed CSREP, which takes as input chromatin state annotations for a group of samples. CSREP then probabilistically estimates the state at each genomic position and derives a representative chromatin state map for the group. CSREP uses an ensemble of multi-class logistic regression classifiers that predict the chromatin state assignment of each sample given the state maps from all other samples. The difference in CSREP’s probability assignments for the two groups can be used to identify genomic locations with differential chromatin state assignments. Using groups of chromatin state maps of a diverse set of cell and tissue types, we demonstrate the advantages of using CSREP to summarize chromatin state maps and identify biologically relevant differences between groups at a high resolution. Availability and implementationThe CSREP source code and generated data are available at http://github.com/ernstlab/csrep. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract Longitudinal studies with binary or ordinal responses are widely encountered in various disciplines, where the primary focus is on the temporal evolution of the probability of each response category. Traditional approaches build from the generalized mixed effects modeling framework. Even amplified with nonparametric priors placed on the fixed or random effects, such models are restrictive due to the implied assumptions on the marginal expectation and covariance structure of the responses. We tackle the problem from a functional data analysis perspective, treating the observations for each subject as realizations from subject-specific stochastic processes at the measured times. We develop the methodology focusing initially on binary responses, for which we assume the stochastic processes have Binomial marginal distributions. Leveraging the logits representation, we model the discrete space processes through continuous space processes. We utilize a hierarchical framework to model the mean and covariance kernel of the continuous space processes nonparametrically and simultaneously through a Gaussian process prior and an Inverse-Wishart process prior, respectively. The prior structure results in flexible inference for the evolution and correlation of binary responses, while allowing for borrowing of strength across all subjects. The modeling approach can be naturally extended to ordinal responses. Here, the continuation-ratio logits factorization of the multinomial distribution is key for efficient modeling and inference, including a practical way of dealing with unbalanced longitudinal data. The methodology is illustrated with synthetic data examples and an analysis of college students’ mental health status data. 
    more » « less