NSF PAR Search | NSF Public Access Repository

Testing for association in multiview network data

https://doi.org/10.1111/biom.13464

Gao, Lucy L.; Witten, Daniela; Bien, Jacob (April 2021, Biometrics)

In this paper, we consider data consisting of multiple networks, each composed of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multiview network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two‐view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein–protein interaction data from the HINT database. We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to cocomplex association data. We also extend this proposal to the setting of a network with node covariates. The proposed methods extend readily to three or more network/multivariate data views.
Selective Inference for Hierarchical Clustering

https://doi.org/10.1080/01621459.2022.2116331

Gao, Lucy L.; Bien, Jacob; Witten, Daniela (October 2022, Journal of the American Statistical Association)

Full Text Available
Tree-based Node Aggregation in Sparse Graphical Models

Wilms, Ines; Bien, Jacob (September 2022, Journal of Machine Learning Research)

Full Text Available
Interactive Exploration of Large Dendrograms with Prototypes

https://doi.org/10.1080/00031305.2022.2087734

Kaplan, Andee; Bien, Jacob (July 2022, The American Statistician)

Full Text Available
Ocean mover’s distance: using optimal transport for analysing oceanographic data

https://doi.org/10.1098/rspa.2021.0875

Hyun, Sangwon; Mishra, Aditya; Follett, Christopher L.; Jonsson, Bror; Kulk, Gemma; Forget, Gael; Racault, Marie-Fanny; Jackson, Thomas; Dutkiewicz, Stephanie; Müller, Christian L.; et al. (June 2022, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences)

Remote sensing observations from satellites and global biogeochemical models have combined to revolutionize the study of ocean biogeochemical cycling, but comparing the two data streams to each other and across time remains challenging due to the strong spatial-temporal structuring of the ocean. Here, we show that the Wasserstein distance provides a powerful metric for harnessing these structured datasets for better marine ecosystem and climate predictions. The Wasserstein distance complements commonly used point-wise difference methods such as the root-mean-squared error, by quantifying differences in terms of spatial displacement in addition to magnitude. As a test case, we consider chlorophyll (a key indicator of phytoplankton biomass) in the northeast Pacific Ocean, obtained from model simulations, in situ measurements, and satellite observations. We focus on two main applications: (i) comparing model predictions with satellite observations, and (ii) temporal evolution of chlorophyll both seasonally and over longer time frames. The Wasserstein distance successfully isolates temporal and depth variability and quantifies shifts in biogeochemical province boundaries. It also exposes relevant temporal trends in satellite chlorophyll consistent with climate change predictions. Our study shows that optimal transport vectors underlying the Wasserstein distance provide a novel visualization tool for testing models and better understanding temporal dynamics in the ocean.
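The contrast the abstract draws between point-wise error and the Wasserstein distance can be illustrated with a minimal one-dimensional sketch. This is a toy example, not the authors' method: the paper works with spatially structured ocean data, while the code below assumes two non-negative profiles on the same equally spaced grid, normalizes each to unit mass, and uses the fact that in one dimension the Wasserstein-1 distance equals the summed absolute difference of the cumulative distributions. The function names and the example profiles are hypothetical.

```python
import math


def rmse(u, v):
    """Point-wise root-mean-squared error between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)) / len(u))


def wasserstein_1d(u, v, spacing=1.0):
    """Wasserstein-1 distance between two non-negative profiles on a shared,
    equally spaced grid. Each profile is normalized to unit mass first
    (an assumption of this sketch), so only the shape and location matter."""
    total_u, total_v = sum(u), sum(v)
    cdf_u = cdf_v = 0.0
    distance = 0.0
    for x, y in zip(u, v):
        cdf_u += x / total_u
        cdf_v += y / total_v
        # In 1D, W1 is the area between the two cumulative distributions.
        distance += abs(cdf_u - cdf_v) * spacing
    return distance


# Hypothetical profiles: one unit of mass placed at different grid cells.
base = [0.0, 1.0, 0.0, 0.0, 0.0]
near = [0.0, 0.0, 1.0, 0.0, 0.0]  # displaced by one cell
far = [0.0, 0.0, 0.0, 0.0, 1.0]   # displaced by three cells

print(rmse(base, near), rmse(base, far))                      # identical values
print(wasserstein_1d(base, near), wasserstein_1d(base, far))  # 1.0 and 3.0
```

In this toy case the RMSE cannot tell the small displacement from the large one, whereas the Wasserstein distance grows with the size of the shift.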
Full Text Available
Controlling costs: Feature selection on a budget

https://doi.org/10.1002/sta4.427

Yu, Guo; Witten, Daniela; Bien, Jacob (March 2022, Stat)

The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost‐conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
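The weighted false discovery proportion described in the abstract, the fraction of selected feature cost spent on unimportant features, can be sketched as follows. This is only an illustration of that quantity under stated assumptions, not the cheap knockoffs procedure itself: the function name, the feature names, the costs, and the convention of returning 0 when nothing is selected are all hypothetical.

```python
def weighted_fdp(selected, costs, important):
    """Fraction of the total cost of the selected features that is spent on
    features outside the set of truly important ones.

    selected:  iterable of selected feature names
    costs:     dict mapping feature name to a non-negative cost
    important: set of feature names that are truly important
    """
    selected = list(selected)
    total_cost = sum(costs[name] for name in selected)
    if total_cost == 0:
        # Convention assumed for this sketch: nothing spent, nothing wasted.
        return 0.0
    wasted_cost = sum(costs[name] for name in selected if name not in important)
    return wasted_cost / total_cost


# Hypothetical features with very different measurement costs.
costs = {"age": 1.0, "blood_panel": 20.0, "mri": 100.0}
important = {"age", "blood_panel"}

# One cheap true feature plus one expensive unimportant feature:
# the unweighted proportion would be 1/2, but almost all cost is wasted.
print(weighted_fdp(["age", "mri"], costs, important))          # 100/101
print(weighted_fdp(["age", "blood_panel"], costs, important))  # 0.0
```

The first call shows why weighting matters: a single expensive false discovery dominates the proportion even when half of the selected features are genuinely important.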
Tree-aggregated predictive modeling of microbiome data

https://doi.org/10.1038/s41598-021-93645-3

Bien, Jacob; Yan, Xiaohan; Simpson, Léo; Müller, Christian L. (December 2021, Scientific Reports)

Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
Full Text Available
Sparse Identification and Estimation of Large-Scale Vector AutoRegressive Moving Averages

https://doi.org/10.1080/01621459.2021.1942013

Wilms, Ines; Basu, Sumanta; Bien, Jacob; Matteson, David S. (August 2021, Journal of the American Statistical Association)

Full Text Available
Rare Feature Selection in High Dimensions

https://doi.org/10.1080/01621459.2020.1796677

Yan, Xiaohan; Bien, Jacob (April 2021, Journal of the American Statistical Association)

Full Text Available
High dimensional forecasting via interpretable vector autoregression

Nicholson, W. B.; Wilms, I.; Bien, J.; Matteson, D. S. (September 2020, Journal of Machine Learning Research)
Full Text Available
