skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, June 13 until 2:00 AM ET on Friday, June 14 due to maintenance. We apologize for the inconvenience.


Search for: All records

Creators/Authors contains: "Birol, ed., Inanc"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract Motivation

    Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs’ composition and abundance profiles and are impaired by the paucity of enough samples to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species only using composition information. As an alternative binning approach, Hi-C-based binning employs metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts inevitably generated by incorrect ligations of DNA fragments between species link the contigs from varying genomes, weakening the purity of final draft genomic bins. Therefore, it is imperative to develop a binning pipeline to overcome the shortcomings of both types of binning methods on a single sample.

    Results

    We develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs.

    Availability and implementation

    HiFine is available at https://github.com/dyxstat/HiFine.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  2. Abstract Motivation

    Gene expression imputation has been an essential step of the single-cell RNA-Seq data analysis workflow. Among several deep-learning methods, the debut of scGNN gained substantial recognition in 2021 for its superior performance and the ability to produce a cell–cell graph. However, the implementation of scGNN was relatively time-consuming and its performance could still be optimized.

    Results

    The implementation of scGNN 2.0 is significantly faster than scGNN thanks to a simplified close-loop architecture. For all eight datasets, cell clustering performance was increased by 85.02% on average in terms of adjusted rand index, and the imputation Median L1 Error was reduced by 67.94% on average. With the built-in visualizations, users can quickly assess the imputation and cell clustering results, compare against benchmarks and interpret the cell–cell interaction. The expanded input and output formats also pave the way for custom workflows that integrate scGNN 2.0 with other scRNA-Seq toolkits on both Python and R platforms.

    Availability and implementation

    scGNN 2.0 is implemented in Python (as of version 3.8) with the source code available at https://github.com/OSU-BMBL/scGNN2.0.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  3. Abstract Summary

    Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3–5× compared to other formats, and bringing interoperability across tools.

    Availability and implementation

    Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Abstract Motivation

    Double minute (DM) chromosomes are acentric extrachromosomal DNA artifacts that are frequently observed in the cells of numerous cancers. They are highly amplified and contain oncogenes and drug-resistance genes, making their presence a challenge for effective cancer treatment. Algorithmic discovery of DM can potentially improve bench-derived therapies for cancer treatment. A hindrance to this task is that DMs evolve, yielding circular chromatin that shares segments from progenitor DMs. This creates DMs with overlapping amplicon coordinates. Existing DM discovery algorithms use whole genome shotgun sequencing (WGS) in isolation, which can potentially incorrectly classify DMs that share overlapping coordinates.

    Results

    In this study, we describe an algorithm called ‘HolistIC’ that can predict DMs in tumor genomes by integrating WGS and Hi–C sequencing data. The consolidation of these sources of information resolves ambiguity in DM amplicon prediction that exists in DM prediction with WGS data used in isolation. We implemented and tested our algorithm on the tandem Hi–C and WGS datasets of three cancer datasets and a simulated dataset. Results on the cancer datasets demonstrated HolistIC’s ability to predict DMs from Hi–C and WGS data in tandem. The results on the simulated data showed the HolistIC can accurately distinguish DMs that have overlapping amplicon coordinates, an advance over methods that predict extrachromosomal amplification using WGS data in isolation.

    Availability and implementation

    Our software, named ‘HolistIC’, is available at http://www.github.com/mhayes20/HolistIC.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  5. Abstract Motivation

    De novo transcriptome analysis using RNA-seq offers a promising means to study gene expression in non-model organisms. Yet, the difficulty of transcriptome assembly means that the contigs provided by the assembler often represent a fractured and incomplete view of the transcriptome, complicating downstream analysis. We introduce Grouper, a new method for clustering contigs from de novo assemblies that are likely to belong to the same transcripts and genes; these groups can subsequently be analyzed more robustly. When provided with access to the genome of a related organism, Grouper can transfer annotations to the de novo assembly, further improving the clustering.

    Results

    On de novo assemblies from four different species, we show that Grouper is able to accurately cluster a larger number of contigs than the existing state-of-the-art method. The Grouper pipeline is able to map greater than 10% more reads against the contigs, leading to accurate downstream differential expression analyses. The labeling module, in the presence of a closely related annotated genome, can efficiently transfer annotations to the contigs and use this information to further improve clustering. Overall, Grouper provides a complete and efficient pipeline for processing de novo transcriptomic assemblies.

    Availability and implementation

    The Grouper software is freely available at https://github.com/COMBINE-lab/grouper under the 2-clause BSD license.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  6. Abstract Motivation

    Genomic sequencing studies, including RNA sequencing and bisulfite sequencing studies, are becoming increasingly common and increasingly large. Large genomic sequencing studies open doors for accurate molecular trait heritability estimation and powerful differential analysis. Heritability estimation and differential analysis in sequencing studies requires the development of statistical methods that can properly account for the count nature of the sequencing data and that are computationally efficient for large datasets.

    Results

    Here, we develop such a method, PQLseq (Penalized Quasi-Likelihood for sequencing count data), to enable effective and efficient heritability estimation and differential analysis using the generalized linear mixed model framework. With extensive simulations and comparisons to previous methods, we show that PQLseq is the only method currently available that can produce unbiased heritability estimates for sequencing count data. In addition, we show that PQLseq is well suited for differential analysis in large sequencing studies, providing calibrated type I error control and more power compared to the standard linear mixed model methods. Finally, we apply PQLseq to perform gene expression heritability estimation and differential expression analysis in a large RNA sequencing study in the Hutterites.

    Availability and implementation

    PQLseq is implemented as an R package with source code freely available at www.xzlab.org/software.html and https://cran.r-project.org/web/packages/PQLseq/index.html.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  7. Abstract Motivation

    Metagenomics studies microbial genomes in an ecosystem such as the gastrointestinal tract of a human. Identification of novel microbial species and quantification of their distributional variations among different samples that are sequenced using next-generation-sequencing technology hold the key to the success of most metagenomic studies. To achieve these goals, we propose a simple yet powerful metagenomic binning method, MetaBMF. The method does not require prior knowledge of reference genomes and produces highly accurate results, even at a strain level. Thus, it can be broadly used to identify disease-related microbial organisms that are not well-studied.

    Results

    Mathematically, we count the number of mapped reads on each assembled genomic fragment cross different samples as our input matrix and propose a scalable stratified angle regression algorithm to factorize this count matrix into a product of a binary matrix and a nonnegative matrix. The binary matrix can be used to separate microbial species and the nonnegative matrix quantifies the species distributions in different samples. In simulation and empirical studies, we demonstrate that MetaBMF has a high binning accuracy. It can not only bin DNA fragments accurately at a species level but also at a strain level. As shown in our example, we can accurately identify the Shiga-toxigenic Escherichia coli O104: H4 strain which led to the 2011 German E.coli outbreak. Our efforts in these areas should lead to (i) fundamental advances in metagenomic binning, (ii) development and refinement of technology for the rapid identification and quantification of microbial distributions and (iii) finding of potential probiotics or reliable pathogenic bacterial strains.

    Availability and implementation

    The software is available at https://github.com/didi10384/MetaBMF.

     
    more » « less
  8. Abstract Motivation

    Joint reconstruction of multiple gene regulatory networks (GRNs) using gene expression data from multiple tissues/conditions is very important for understanding common and tissue/condition-specific regulation. However, there are currently no computational models and methods available for directly constructing such multiple GRNs that not only share some common hub genes but also possess tissue/condition-specific regulatory edges.

    Results

    In this paper, we proposed a new graphic Gaussian model for joint reconstruction of multiple gene regulatory networks (JRmGRN), which highlighted hub genes, using gene expression data from several tissues/conditions. Under the framework of Gaussian graphical model, JRmGRN method constructs the GRNs through maximizing a penalized log likelihood function. We formulated it as a convex optimization problem, and then solved it with an alternating direction method of multipliers (ADMM) algorithm. The performance of JRmGRN was first evaluated with synthetic data and the results showed that JRmGRN outperformed several other methods for reconstruction of GRNs. We also applied our method to real Arabidopsis thaliana RNA-seq data from two light regime conditions in comparison with other methods, and both common hub genes and some conditions-specific hub genes were identified with higher accuracy and precision.

    Availability and implementation

    JRmGRN is available as a R program from: https://github.com/wenpingd.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  9. Abstract Motivation

    Genetic variation that disrupts gene function by altering gene splicing between individuals can substantially influence traits and disease. In those cases, accurately predicting the effects of genetic variation on splicing can be highly valuable for investigating the mechanisms underlying those traits and diseases. While methods have been developed to generate high quality computational predictions of gene structures in reference genomes, the same methods perform poorly when used to predict the potentially deleterious effects of genetic changes that alter gene splicing between individuals. Underlying that discrepancy in predictive ability are the common assumptions by reference gene finding algorithms that genes are conserved, well-formed and produce functional proteins.

    Results

    We describe a probabilistic approach for predicting recent changes to gene structure that may or may not conserve function. The model is applicable to both coding and non-coding genes, and can be trained on existing gene annotations without requiring curated examples of aberrant splicing. We apply this model to the problem of predicting altered splicing patterns in the genomes of individual humans, and we demonstrate that performing gene-structure prediction without relying on conserved coding features is feasible. The model predicts an unexpected abundance of variants that create de novo splice sites, an observation supported by both simulations and empirical data from RNA-seq experiments. While these de novo splice variants are commonly misinterpreted by other tools as coding or non-coding variants of little or no effect, we find that in some cases they can have large effects on splicing activity and protein products and we propose that they may commonly act as cryptic factors in disease.

    Availability and implementation

    The software is available from geneprediction.org/SGRF.

    Supplementary information

    Supplementary information is available at Bioinformatics online.

     
    more » « less
  10. Abstract Motivation

    Whole-genome sequencing of uncultured eukaryotic genomes is complicated by difficulties in acquiring sufficient amounts of tissue. Single-cell genomics (SCG) by multiple displacement amplification provides a technical workaround, yielding whole-genome libraries which can be assembled de novo. Downsides of multiple displacement amplification include coverage biases and exacerbation of contamination. These factors affect assembly continuity and fidelity, complicating discrimination of genomes from contamination and noise by available tools. Uncultured eukaryotes and their relatives are often underrepresented in large sequence data repositories, further impairing identification and separation.

    Results

    We compare the ability of filtering approaches to remove contamination and resolve eukaryotic draft genomes from SCG metagenomes, finding significant variation in outcomes. To address these inconsistencies, we introduce a consensus approach that is codified in the SCGid software package. SCGid parallelly filters assemblies using different approaches, yielding three intermediate drafts from which consensus is drawn. Using genuine and mock SCG metagenomes, we show that our approach corrects for variation among draft genomes predicted by individual approaches and outperforms them in recapitulating published drafts in a fast and repeatable way, providing a useful alternative to available methods and manual curation.

    Availability and implementation

    The SCGid package is implemented in python and R. Source code is available at http://www.github.com/amsesk/SCGid under the GNU GPL 3.0 license.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less