NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

A unified hypothesis-free feature extraction framework for diverse epigenomic data

https://doi.org/10.1093/bioadv/vbaf013

Balcı, Ali_Tuğrul; Chikina, Maria; Mahony, ed., Shaun (March 2025, Bioinformatics Advances)

Abstract MotivationEpigenetic assays using next-generation sequencing have furthered our understanding of the functional genomic regions and the mechanisms of gene regulation. However, a single assay produces billions of data points, with limited information about the biological process due to numerous sources of technical and biological noise. To draw biological conclusions, numerous specialized algorithms have been proposed to summarize the data into higher-order patterns, such as peak calling and the discovery of differentially methylated regions. The key principle underlying these approaches is the search for locally consistent patterns. ResultsWe propose L0 segmentation as a universal framework for extracting locally coherent signals for diverse epigenetic sources. L0 serves to compress the input signal by approximating it as a piecewise constant. We implement a highly scalable L0 segmentation with additional loss functions designed for sequencing epigenetic data types including Poisson loss for single tracks and binomial loss for methylation/coverage data. We show that the L0 segmentation approach retains the salient features of the data yet can identify subtle features, such as transcription end sites, missed by other analytic approaches. Availability and implementationOur approach is implemented as an R package “l01segmentation” with a C++ backend. Available at https://github.com/boooooogey/l01segmentation.
more » « less
Heterogeneous pseudobulk simulation enables realistic benchmarking of cell-type deconvolution methods

https://doi.org/10.1186/s13059-024-03292-w

Hu, Mengying; Chikina, Maria (December 2024, Genome Biology)

Abstract BackgroundComputational cell type deconvolution enables the estimation of cell type abundance from bulk tissues and is important for understanding tissue microenviroment, especially in tumor tissues. With rapid development of deconvolution methods, many benchmarking studies have been published aiming for a comprehensive evaluation for these methods. Benchmarking studies rely on cell-type resolved single-cell RNA-seq data to create simulated pseudobulk datasets by adding individual cells-types in controlled proportions. ResultsIn our work, we show that the standard application of this approach, which uses randomly selected single cells, regardless of the intrinsic difference between them, generates synthetic bulk expression values that lack appropriate biological variance. We demonstrate why and how the current bulk simulation pipeline with random cells is unrealistic and propose a heterogeneous simulation strategy as a solution. The heterogeneously simulated bulk samples match up with the variance observed in real bulk datasets and therefore provide concrete benefits for benchmarking in several ways. We demonstrate that conceptual classes of deconvolution methods differ dramatically in their robustness to heterogeneity with reference-free methods performing particularly poorly. For regression-based methods, the heterogeneous simulation provides an explicit framework to disentangle the contributions of reference construction and regression methods to performance. Finally, we perform an extensive benchmark of diverse methods across eight different datasets and find BayesPrism and a hybrid MuSiC/CIBERSORTx approach to be the top performers. ConclusionsOur heterogeneous bulk simulation method and the entire benchmarking framework is implemented in a user friendly packagehttps://github.com/humengying0907/deconvBenchmarkingandhttps://doi.org/10.5281/zenodo.8206516, enabling further developments in deconvolution methods.
more » « less
Free, publicly-accessible full text available December 1, 2025
A hybrid constrained continuous optimization approach for optimal causal discovery from biological data

https://doi.org/10.1093/bioinformatics/btae411

Zhu, Yuehua; Benos, Panayiotis_V; Chikina, Maria (September 2024, Bioinformatics)

Abstract MotivationUnderstanding causal effects is a fundamental goal of science and underpins our ability to make accurate predictions in unseen settings and conditions. While direct experimentation is the gold standard for measuring and validating causal effects, the field of causal graph theory offers a tantalizing alternative: extracting causal insights from observational data. Theoretical analysis has shown that this is indeed possible, given a large dataset and if certain conditions are met. However, biological datasets, frequently, do not meet such requirements but evaluation of causal discovery algorithms is typically performed on synthetic datasets, which they meet all requirements. Thus, real-life datasets are needed, in which the causal truth is reasonably known. In this work we first construct such a large-scale real-life dataset and then we perform on it a comprehensive benchmarking of various causal discovery methods. ResultsWe find that the PC algorithm is particularly accurate at estimating causal structure, including the causal direction which is critical for biological applicability. However, PC does only produces cause-effect directionality, but not estimates of causal effects. We propose PC-NOTEARS (PCnt), a hybrid solution, which includes the PC output as an additional constraint inside the NOTEARS optimization. This approach combines PC algorithm’s strengths in graph structure prediction with the NOTEARS continuous optimization to estimate causal effects accurately. PCnt achieved best aggregate performance across all structural and effect size metrics. Availability and implementationhttps://github.com/zhu-yh1/PC-NOTEARS.
more » « less
InstaPrism: an R package for fast implementation of BayesPrism

https://doi.org/10.1093/bioinformatics/btae440

Hu, Mengying; Chikina, Maria; Nikolski, ed., Macha (July 2024, Bioinformatics)

Abstract SummaryComputational cell-type deconvolution is an important analytic technique for modeling the compositional heterogeneity of bulk gene expression data. A conceptually new Bayesian approach to this problem, BayesPrism, has recently been proposed and has subsequently been shown to be superior in accuracy and robustness against model misspecifications by independent studies; however, given that BayesPrism relies on Gibbs sampling, it is orders of magnitude more computationally expensive than standard approaches. Here, we introduce the InstaPrism package which re-implements BayesPrism in a derandomized framework by replacing the time-consuming Gibbs sampling step with a fixed-point algorithm. We demonstrate that the new algorithm is effectively equivalent to BayesPrism while providing a considerable speed and memory advantage. Furthermore, the InstaPrism package is equipped with a precompiled, curated set of references tailored for a variety of cancer types, streamlining the deconvolution process. Availability and implementationThe package InstaPrism is freely available at: https://github.com/humengying0907/InstaPrism. The source code and evaluation pipeline used in this paper can be found at: https://github.com/humengying0907/InstaPrismSourceCode.
more » « less
Personal transcriptome variation is poorly explained by current genomic deep learning models

https://doi.org/10.1038/s41588-023-01574-w

Huang, Connie; Shuai, Richard W; Baokar, Parth; Chung, Ryan; Rastogi, Ruchir; Kathail, Pooja; Ioannidis, Nilah M (December 2023, Nature Genetics)

Abstract Genomic deep learning models can predict genome-wide epigenetic features and gene expression levels directly from DNA sequence. While current models perform well at predicting gene expression levels across genes in different cell types from the reference genome, their ability to explain expression variation between individuals due tocis-regulatory genetic variants remains largely unexplored. Here, we evaluate four state-of-the-art models on paired personal genome and transcriptome data and find limited performance when explaining variation in expression across individuals. In addition, models often fail to predict the correct direction of effect ofcis-regulatory genetic variation on expression.
more » « less
Full Text Available
Temporal dynamics of the multi-omic response to endurance exercise training

https://doi.org/10.1038/s41586-023-06877-w

Amar, David; Gay, Nicole R; Jean-Beltran, Pierre M; Bae, Dam; Dasari, Surendra; Dennis, Courtney; Evans, Charles R; Gaul, David A; Ilkayeva, Olga; Ivanova, Anna A; et al (May 2024, Nature)

Abstract Regular exercise promotes whole-body health and prevents disease, but the underlying molecular mechanisms are incompletely understood^1–3. Here, the Molecular Transducers of Physical Activity Consortium⁴profiled the temporal transcriptome, proteome, metabolome, lipidome, phosphoproteome, acetylproteome, ubiquitylproteome, epigenome and immunome in whole blood, plasma and 18 solid tissues in male and femaleRattus norvegicusover eight weeks of endurance exercise training. The resulting data compendium encompasses 9,466 assays across 19 tissues, 25 molecular platforms and 4 training time points. Thousands of shared and tissue-specific molecular alterations were identified, with sex differences found in multiple tissues. Temporal multi-omic and multi-tissue analyses revealed expansive biological insights into the adaptive responses to endurance training, including widespread regulation of immune, metabolic, stress response and mitochondrial pathways. Many changes were relevant to human health, including non-alcoholic fatty liver disease, inflammatory bowel disease, cardiovascular health and tissue injury and recovery. The data and analyses presented in this study will serve as valuable resources for understanding and exploring the multi-tissue molecular effects of endurance training and are provided in a public repository (https://motrpac-data.org/).
more » « less
Full Text Available
Methylation Data Analysis and Interpretation

https://doi.org/10.1146/annurev-biodatasci-120924-091033

Zhu, Yuehua; Mao, Weiguang; Hosseini, Rezwan; Chikina, Maria (August 2025, Annual Review of Biomedical Data Science)

DNA methylation, a covalent modification, fundamentally shapes mammalian gene regulation and cellular identity. This review examines methylation's biochemical underpinnings, genomic distribution patterns, and analytical approaches. We highlight three distinctive aspects that separate methylation from other epigenetic marks: its remarkable stability as a silencing mechanism, its capacity to maintain distinct states independently of DNA sequence, and its effectiveness as a quantitative trait linking genotype to disease risk. We also explore the phenomenon of methylation clocks and their biological significance. The review addresses technical considerations across major assay types—both array-based technologies and sequencing approaches—with emphasis on data normalization, quality control, cell proportion inference, and the specialized statistical models required for next-generation sequencing analysis.
more » « less
Free, publicly-accessible full text available August 11, 2026
The T cell receptor landscape of childhood brain tumors

https://doi.org/10.1126/scitranslmed.adp0675

Raphael, Itay; Xiong, Zujian; Sneiderman, Chaim T; Raphael, Rebecca A; Mash, Moshe; Schwegman, Lance; Jackson, Sydney A; O’Brien, Casey; Anderson, Kevin J; Sever, ReidAnn E; et al (March 2025, Science Translational Medicine)

The diverse T cell receptor (TCR) repertoire confers the ability to recognize an almost unlimited array of antigens. Characterization of antigen specificity of tumor-infiltrating lymphocytes (TILs) is key for understanding antitumor immunity and for guiding the development of effective immunotherapies. Here, we report a large-scale comprehensive examination of the TCR landscape of TILs across the spectrum of pediatric brain tumors, the leading cause of cancer-related mortality in children. We show that a T cell clonality index can inform patient prognosis, where more clonality is associated with more favorable outcomes. Moreover, TCR similarity groups’ assessment revealed patient clusters with defined human leukocyte antigen associations. Computational analysis of these clusters identified putative tumor antigens and peptides as targets for antitumor T cell immunity, which were functionally validated by T cell stimulation assays in vitro. Together, this study presents a framework for tumor antigen prediction based on in situ and in silico TIL TCR analyses. We propose that TCR-based investigations should inform tumor classification and precision immunotherapy development.
more » « less
Free, publicly-accessible full text available March 19, 2026
Best holdout assessment is sufficient for cancer transcriptomic model selection

https://doi.org/10.1016/j.patter.2024.101115

Crawford, Jake; Chikina, Maria; Greene, Casey S (December 2024, Patterns)

Free, publicly-accessible full text available December 1, 2025
Quick and effective approximation of in silico saturation mutagenesis experiments with first-order taylor expansion

https://doi.org/10.1016/j.isci.2024.110807

Sasse, Alexander; Chikina, Maria; Mostafavi, Sara (September 2024, iScience)

Full Text Available

« Prev Next »

Search for: All records