
  1. Abstract Motivation

    As an increasing amount of protein–protein interaction (PPI) data becomes available, their computational interpretation has become an important problem in bioinformatics. The alignment of PPI networks from different species provides valuable information about conserved subnetworks, evolutionary pathways and functional orthologs. Although several methods have been proposed for global network alignment, there is a pressing need for methods that produce more accurate alignments in terms of both topological and functional consistency.

    Results

    In this work, we present a novel global network alignment algorithm, named ModuleAlign, which makes use of local topology information to define a module-based homology score. Based on a hierarchical clustering of functionally coherent proteins involved in the same module, ModuleAlign employs a novel iterative scheme to find the alignment between two networks. Evaluated on a diverse set of benchmarks, ModuleAlign outperforms state-of-the-art methods in producing functionally consistent alignments. By aligning Pathogen–Human PPI networks, ModuleAlign also detects a novel set of conserved human genes that pathogens preferentially target to cause pathogenesis.

    Availability

    http://ttic.uchicago.edu/~hashemifar/ModuleAlign.html

    Contact

    canzar@ttic.edu or j3xu@ttic.edu

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  2. Abstract Motivation

    Intrinsically disordered regions (IDRs) of proteins play an important role in many biological processes. Two properties of IDRs make them challenging to predict accurately: (i) they occur proteome-wide and (ii) only about 6% of residues are disordered. Most IDR prediction methods use sequence profiles to improve accuracy, which prevents their application to proteome-wide prediction because sequence profiles are time-consuming to generate. On the other hand, methods that do not use sequence profiles fare much worse than those that do.

    Method

    This article formulates IDR prediction as a sequence labeling problem and employs a new machine learning method called Deep Convolutional Neural Fields (DeepCNF) to solve it. DeepCNF is an integration of deep convolutional neural networks (DCNN) and conditional random fields (CRF); it can model not only complex sequence–structure relationships in a hierarchical manner, but also correlations among adjacent residues. To deal with the highly imbalanced order/disorder ratio, instead of training DeepCNF by the widely used maximum-likelihood approach, we develop a novel approach that trains it by maximizing the area under the ROC curve (AUC), which is an unbiased measure for class-imbalanced data.
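To make the AUC objective concrete, the following minimal sketch computes AUC from per-residue scores using its rank interpretation: the probability that a randomly chosen disordered residue scores higher than a randomly chosen ordered one, with ties counted as one half. It only illustrates the quantity being maximized; it is not the DeepCNF training procedure, and the function name and toy scores are assumptions made for this example.

```python
from typing import Sequence


def auc_from_scores(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the AUC of `scores` against binary `labels` (1 = disordered).

    Uses the pairwise definition: the fraction of (positive, negative)
    pairs in which the positive scores higher, counting ties as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 2 disordered residues among 6.
toy_labels = [1, 0, 0, 0, 1, 0]
toy_scores = [0.9, 0.1, 0.4, 0.35, 0.3, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)

# A predictor that gives every residue the same score ranks nothing,
# so its AUC is 0.5 no matter how imbalanced the classes are.
constant_auc = auc_from_scores(toy_labels, [0.0] * 6)
```

Because the value depends only on how positives rank relative to negatives, it does not reward a predictor for simply labeling every residue as ordered, which plain accuracy would on data this imbalanced.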

    Results

    Our experimental results show that our IDR prediction method AUCpreD outperforms existing popular disorder predictors. More importantly, AUCpreD works very well even without sequence profiles, comparing favorably to or even outperforming many methods that use sequence profiles. Therefore, our method works for proteome-wide disorder prediction while yielding similar or better accuracy than the others.

    Availability and Implementation

    http://raptorx2.uchicago.edu/StructurePropertyPred/predict/

    Contact

    wangsheng@uchicago.edu, jinboxu@gmail.com

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  3. Abstract Motivation

    Chloroplast genomes are now produced in the hundreds for angiosperm phylogenetics projects, but current methods for annotation, alignment and tree estimation still require some manual intervention, which reduces throughput and increases analysis time for large chloroplast systematics projects.

    Results

    Verdant is a web-based software suite and database built to take advantage of a novel annotation program, annoBTD. Using annoBTD, Verdant provides accurate annotation of chloroplast genomes without manual intervention. Subsequent alignment and tree estimation can incorporate newly annotated and publicly available plastomes and can accommodate a large number of taxa. Verdant sharply reduces the time required for analysis of assembled chloroplast genomes and removes the need for pipelines and software on personal hardware.

    Availability and Implementation

    Verdant is available at: http://verdant.iplantcollaborative.org/plastidDB/. It is implemented in PHP, Perl, MySQL, JavaScript, HTML and CSS with all major browsers supported.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  4. Abstract Motivation

    The estimation of phylogenetic trees is a major part of many biological dataset analyses, but maximum likelihood approaches are NP-hard and Bayesian MCMC methods do not scale well to even moderate-sized datasets. Supertree methods, which are used to construct trees from trees computed on subsets, are critically important tools for enabling the statistical estimation of phylogenies for large and potentially heterogeneous datasets. Supertree estimation is itself NP-hard, and no current supertree method is accurate and scalable enough for the large datasets that supertree methods were designed for, which contain thousands of species and many subset trees.

    Results

    We present FastRFS, a new method based on a dynamic programming algorithm we have developed to find an exact solution to the Robinson-Foulds Supertree problem within a constrained search space. FastRFS has excellent accuracy in terms of criterion scores and topological accuracy of the resultant trees, substantially improving on competing methods on a large collection of biological and simulated data. In addition, FastRFS is extremely fast, finishing in minutes on even very large datasets, and in under an hour on a biological dataset with 2228 species.
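The criterion behind the Robinson-Foulds Supertree problem is the Robinson-Foulds (RF) distance, which counts the bipartitions found in one tree but not in the other. The sketch below computes that distance for two trees on the same leaf set; it is not the FastRFS dynamic program. Trees are written as nested tuples of leaf names, a representation chosen for this example only, and the helper names are hypothetical.

```python
from typing import FrozenSet, Set, Union

Tree = Union[str, tuple]  # a leaf name, or a tuple of subtrees


def leaf_set(tree: Tree) -> FrozenSet[str]:
    """Return the set of leaf names below `tree`."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[FrozenSet[str]]]:
    """Return the non-trivial leaf bipartitions induced by internal edges."""
    all_leaves = leaf_set(tree)
    splits: Set[FrozenSet[FrozenSet[str]]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        rest = all_leaves - clade
        # Skip trivial splits that separate at most one leaf from the rest.
        if len(clade) > 1 and len(rest) > 1:
            splits.add(frozenset([clade, rest]))
        return clade

    visit(tree)
    return splits


def rf_distance(t1: Tree, t2: Tree) -> int:
    """Size of the symmetric difference of the two bipartition sets."""
    if leaf_set(t1) != leaf_set(t2):
        raise ValueError("trees must share the same leaf set")
    return len(bipartitions(t1) ^ bipartitions(t2))


# Hypothetical five-taxon trees that differ in one grouping.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
distance_ab = rf_distance(tree_a, tree_b)
```

In the supertree setting, a candidate supertree is scored by summing such distances against each source tree after restricting the supertree to that source tree's leaves; the restriction step is omitted here to keep the sketch short.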

    Availability and Implementation

    FastRFS is available on GitHub at https://github.com/pranjalv123/FastRFS

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  5. Abstract Motivation

    Reconstructing regulatory networks from expression and interaction data is a major goal of systems biology. While much work has focused on trying to experimentally and computationally determine the set of transcription factors (TFs) and microRNAs (miRNAs) that regulate genes in these networks, relatively little work has focused on inferring the regulation of miRNAs by TFs. Such regulation can play an important role in several biological processes including development and disease. The main challenge for predicting such interactions is the very small positive training set currently available. Another challenge is that a large fraction of miRNAs are encoded within genes, making it hard to determine the specific way in which they are regulated.

    Results

    To enable genome-wide predictions of TF–miRNA interactions, we extended semi-supervised machine-learning approaches to integrate a large set of different types of data including sequence, expression, ChIP-seq and epigenetic data. As we show, the methods we develop achieve good performance both on a labeled test set and when analyzing general co-expression networks. We next analyze mRNA and miRNA cancer expression data, demonstrating the advantage of using the predicted set of interactions for identifying more coherent and relevant modules, genes, and miRNAs. The complete set of predictions is available on the supporting website and can be used by any method that combines miRNAs, genes, and TFs.

    Availability and Implementation

    Code and full set of predictions are available from the supporting website: http://cs.cmu.edu/~mruffalo/tf-mirna/.

    Contact

    zivbj@cs.cmu.edu

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  6. Abstract Motivation

    Transcription by RNA polymerase is a highly dynamic process involving multiple distinct points of regulation. Nascent transcription assays are a relatively new set of high-throughput techniques that measure the location of actively engaged RNA polymerase genome-wide. Hence, nascent transcription is a rich source of information on the regulation of RNA polymerase activity. Fully dissecting these data requires the development of stochastic models that can both deconvolve the stages of polymerase activity and identify significant changes in activity between experiments.

    Results

    We present a generative, probabilistic model of RNA polymerase that fully describes loading, initiation, elongation and termination. We fit this model genome-wide and profile the enzymatic activity of RNA polymerase across various loci and following experimental perturbation. We observe a striking correlation between predicted loading events and regulatory chromatin marks. We provide principled statistics that compute probabilities reminiscent of traveler’s and divergent ratios. We finish with a systematic comparison of RNA polymerase activity at promoter versus non-promoter associated loci.

    Availability and Implementation

    Transcription Fit (Tfit) is a freely available, open source software package written in C/C++ that requires GNU compilers 4.7.3 or greater. Tfit is available from GitHub (https://github.com/azofeifa/Tfit).

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  7. Abstract Motivation

    Centralizing and storing data for long periods on an end user’s computational resources is increasingly difficult for many scientific disciplines. For example, genomics data is increasingly large and distributed, and the data needs to be moved into workflow execution sites ranging from lab workstations to the cloud. However, the typical user is not always informed about emerging network technology or the most efficient methods to move and share data. Thus, the user defaults to using inefficient methods for transfer across the commercial internet.

    Results

    To accelerate large data transfer, we created a tool called the Big Data Smart Socket (BDSS) that abstracts data transfer methodology from the user. The user provides BDSS with a manifest of datasets stored in a remote storage repository. BDSS then queries a metadata repository for curated data transfer mechanisms and the optimal path to move each of the files in the manifest to the site of workflow execution. BDSS functions as a standalone tool or can be directly integrated into a computational workflow such as those provided by the Galaxy Project. To demonstrate applicability, we use BDSS within a biological context, although it is applicable to any scientific domain.

    Availability and Implementation

    BDSS is available under version 2 of the GNU General Public License at https://github.com/feltus/BDSS.

     
  8. Abstract Motivation

    The somatic mutations in the pathways that drive cancer development tend to be mutually exclusive across tumors, providing a signal for distinguishing driver mutations from a larger number of random passenger mutations. This mutual exclusivity signal can be confounded by high and highly variable mutation rates across a cohort of samples. Current statistical tests for exclusivity that incorporate both per-gene and per-sample mutational frequencies are computationally expensive and have limited precision.

    Results

    We formulate a weighted exact test for assessing the significance of mutual exclusivity in an arbitrary number of mutational events. Our test conditions on the number of samples with a mutation as well as per-event, per-sample mutation probabilities. We provide a recursive formula to compute P-values for the weighted test exactly as well as a highly accurate and efficient saddlepoint approximation of the test. We use our test to approximate a commonly used permutation test for exclusivity that conditions on per-event, per-sample mutation frequencies. However, our test is more efficient and it recovers more significant results than the permutation test. We use our Weighted Exclusivity Test (WExT) software to analyze hundreds of colorectal and endometrial samples from The Cancer Genome Atlas, which are two cancer types that often have extremely high mutation rates. On both cancer types, the weighted test identifies sets of mutually exclusive mutations in cancer genes with fewer false positives than earlier approaches.
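As a rough illustration of what an exact exclusivity test computes, the sketch below handles the simpler unweighted case by brute-force enumeration. It is not the WExT algorithm: every placement of an event's mutations across samples is treated as equally likely (conditioning only on the number of mutated samples per event), whereas the weighted test also uses per-event, per-sample mutation probabilities and avoids enumeration through a recursive formula or a saddlepoint approximation. The test statistic used here, the number of samples mutated in exactly one event, is an assumption made for this example, as are the function names.

```python
from fractions import Fraction
from itertools import combinations, product
from typing import Collection, Dict, Sequence


def exclusivity_statistic(events: Sequence[Collection[int]]) -> int:
    """Count samples that carry a mutation in exactly one event."""
    hits: Dict[int, int] = {}
    for mutated_samples in events:
        for sample in mutated_samples:
            hits[sample] = hits.get(sample, 0) + 1
    return sum(1 for count in hits.values() if count == 1)


def exact_exclusivity_pvalue(
    n_samples: int, observed: Sequence[Collection[int]]
) -> Fraction:
    """Exact p-value of the observed statistic under uniform placements.

    Each event keeps its observed number of mutated samples; all ways of
    assigning those mutations to samples are enumerated. Feasible only
    for very small inputs.
    """
    t_obs = exclusivity_statistic(observed)
    placements = [
        list(combinations(range(n_samples), len(mutated)))
        for mutated in observed
    ]
    total = 0
    at_least_as_exclusive = 0
    for assignment in product(*placements):
        total += 1
        if exclusivity_statistic(assignment) >= t_obs:
            at_least_as_exclusive += 1
    return Fraction(at_least_as_exclusive, total)


# Hypothetical cohort of 4 samples and 2 events, each mutated in 2 samples.
perfectly_exclusive = exact_exclusivity_pvalue(4, [{0, 1}, {2, 3}])
fully_overlapping = exact_exclusivity_pvalue(4, [{0, 1}, {0, 1}])
```

The enumeration grows combinatorially with the number of samples and events, which is why an efficient exact recursion or approximation matters at cohort scale.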

    Availability and Implementation

    See http://compbio.cs.brown.edu/projects/wext for software.

    Contact

    braphael@cs.brown.edu

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  9. Abstract Motivation

    Due to their high genomic variability, RNA viruses and retroviruses present a unique opportunity for detailed study of molecular evolution. Lentiviruses, with HIV being a notable example, are one of the best studied viral groups: hundreds of thousands of sequences are available together with experimentally resolved three-dimensional structures for most viral proteins. In this work, we use these data to study specific patterns of evolution of the viral proteins, and their relationship to protein interactions and immunogenicity.

    Results

    We propose a method for identification of two types of surface residue clusters with abnormal conservation: extremely conserved and extremely variable clusters. We identify them on the surface of proteins from HIV and other animal immunodeficiency viruses. Both types of clusters are overrepresented on the interaction interfaces of viral proteins with other proteins, nucleic acids or low molecular-weight ligands, both in the viral particle and between the virus and its host. In the immunodeficiency viruses, the interaction interfaces are not more conserved than the corresponding proteins on average, and we show that extremely conserved clusters coincide with protein–protein interaction hotspots, predicted as the residues with the largest energetic contribution to the interaction. Extremely variable clusters have been identified here for the first time. In the HIV-1 envelope protein gp120, they overlap with known antigenic sites. These antigenic sites also contain many residues from extremely conserved clusters, hence representing a unique interacting interface enriched both in extremely conserved and in extremely variable clusters of residues. This observation may have important implications for antiretroviral vaccine development.

    Availability and Implementation

    A Python package is available at https://bioinf.mpi-inf.mpg.de/publications/viral-ppi-pred/

    Contact

    voitenko@mpi-inf.mpg.de or kalinina@mpi-inf.mpg.de

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  10. Abstract Questions

    Urban ecosystems present an opportunity to study ecological communities in the context of unprecedented environmental change. In the face of urban land conversion, ecologists observe new patterns of species composition, dominance, behaviour and dispersal. We propose a hypothetical socioeconomic template that describes a gradient in human investment in community composition to aid in organizing the human role in shaping urban biodiversity. We asked: (1) what is the relative magnitude of taxonomic and functional turnover of urban woody plant communities across different land‐use types; and (2) do land uses exhibiting higher intensity of human management of biodiversity support higher turnover over those with less human influence?

    Location

    Baltimore, MD, USA (39°17′ N, 76°38′ W).

    Methods

    We examined patterns in woody plant biodiversity across 209 plots of different urban land uses. Six land‐use types were arranged along a gradient in the intensity through which humans are hypothesized to manage species composition at the plot scale. We calculated local, or α‐diversity, and compositional turnover, or β‐diversity, of taxonomic and functional diversity across plots within each land‐use type. We compared the magnitude of these biodiversity measures between land uses to test our conceptual template for how the intensity of human management can predict urban woody plant biodiversity.
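The two taxonomic measures described above can be sketched as follows: local diversity as species richness per plot, and turnover as the mean pairwise dissimilarity among plots of one land-use type. The abstract does not say which indices were used, so richness and Jaccard dissimilarity are assumptions chosen for illustration, and the plot contents below are made-up placeholders rather than data from the study.

```python
from itertools import combinations
from typing import Iterable, Sequence, Set


def alpha_richness(plot: Iterable[str]) -> int:
    """Local diversity as the number of distinct species in one plot."""
    return len(set(plot))


def jaccard_dissimilarity(plot_a: Iterable[str], plot_b: Iterable[str]) -> float:
    """1 - |shared species| / |all species| for a pair of plots."""
    a, b = set(plot_a), set(plot_b)
    union = a | b
    if not union:
        return 0.0  # two empty plots are treated as identical
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_turnover(plots: Sequence[Set[str]]) -> float:
    """Turnover as the mean dissimilarity over all pairs of plots."""
    pairs = list(combinations(plots, 2))
    if not pairs:
        raise ValueError("need at least two plots")
    return sum(jaccard_dissimilarity(a, b) for a, b in pairs) / len(pairs)


# Placeholder plots for two hypothetical land-use types.
managed_plots = [{"sp1", "sp2"}, {"sp3", "sp4"}, {"sp1", "sp5"}]
unmanaged_plots = [{"sp1", "sp2"}, {"sp1", "sp2"}, {"sp1", "sp2", "sp3"}]

managed_turnover = mean_pairwise_turnover(managed_plots)      # 8/9
unmanaged_turnover = mean_pairwise_turnover(unmanaged_plots)  # 2/9
```

Functional turnover could be computed the same way by replacing species names with functional trait categories, although the study's actual functional diversity measure may differ.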

    Results

    We observed high taxonomic turnover in residential and commercial plots compared with vacant or open space land‐use areas. This was accompanied by a weaker, but similar, pattern in functional diversity and by low total abundance in residential and commercial plots. Furthermore, the number of unique species was extremely high in the same land‐use types.

    Conclusions

    Our observations help explain why turnover can be high in heavily managed plots relative to vacant land. In patches without heavy human management, we found low levels of turnover. This highlights the importance of assessing diversity both locally and at the level of turnover between patches. Management and policy can benefit from the perspective embodied in the conceptual approach tested here.

     