Title: GO Bench: shared hub for universal benchmarking of machine learning-based protein functional annotations
Abstract
Motivation
Gene annotation is the problem of mapping proteins to their functions represented as Gene Ontology (GO) terms, typically inferred from the primary sequences. Gene annotation is a multi-label, multi-class classification problem, which has generated growing interest for its use in characterizing millions of proteins with unknown functions. However, there is no standard GO dataset for benchmarking newly developed machine learning models within the bioinformatics community. Thus, the significance of improvements for these models remains unclear.
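To make the multi-label framing concrete, the following minimal sketch encodes each protein's GO terms as a binary row vector, so that one protein can carry several labels at once. The protein IDs, the GO terms and the helper name `build_label_matrix` are hypothetical, chosen only for illustration; this is not the GO Bench data format.

```python
# Hypothetical toy data: protein IDs and GO terms are illustrative only.
annotations = {
    "P1": {"GO:0003677", "GO:0005634"},
    "P2": {"GO:0005634"},
    "P3": set(),  # a protein with no known function
}


def build_label_matrix(annotations):
    """Return (proteins, terms, matrix); matrix[i][j] is 1 if protein i has term j."""
    proteins = sorted(annotations)
    # The label space is the union of all GO terms seen in the data.
    terms = sorted(set().union(*annotations.values()))
    column = {term: j for j, term in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in proteins]
    for i, protein in enumerate(proteins):
        for term in annotations[protein]:
            matrix[i][column[term]] = 1
    return proteins, terms, matrix


proteins, terms, matrix = build_label_matrix(annotations)
for protein, row in zip(proteins, matrix):
    print(protein, row)
```

Each row is the target vector for one protein; a model for this task predicts a score per column rather than a single class.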
Results
The Gene Benchmarking database is the first effort to provide an easy-to-use and configurable hub for the learning and evaluation of gene annotation models. It provides easy access to pre-specified datasets and, through a web interface, handles the non-trivial steps of preprocessing and filtering all data according to custom presets. The GO Bench web application can also be used to evaluate any trained model and display it on leaderboards for annotation tasks.
Availability and implementation
The GO Benchmarking dataset is freely available at www.gobench.org. Code is hosted at github.com/mofradlab, with repositories for website code, core utilities and examples of usage (Supplementary Section S.7).
Supplementary information
Supplementary data are available at Bioinformatics online.
Gene lists are routinely produced from various omic studies. Enrichment analysis can link these gene lists with underlying molecular pathways and functional categories such as gene ontology (GO) and other databases.
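As a rough illustration of how enrichment analysis links a gene list to a functional category, the sketch below computes a one-sided hypergeometric tail probability. The choice of the hypergeometric test and the function name `enrichment_p_value` are assumptions made for this example; the text above does not state which statistic ShinyGO uses.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of seeing at least k category genes in a list of n genes.

    N: number of genes in the background
    K: number of background genes annotated with the category
    n: size of the submitted gene list
    k: number of list genes annotated with the category
    """
    total = comb(N, n)  # all ways to draw n genes from the background
    upper = min(n, K)   # the overlap cannot exceed either set size
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical numbers: 40 of 200 list genes fall in a 500-gene category,
# against a background of 20000 genes.
print(enrichment_p_value(k=40, n=200, K=500, N=20000))
```

In practice the p-values for many categories would also be corrected for multiple testing; that step is omitted here.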
Results
To complement existing tools, we developed ShinyGO based on a large annotation database derived from Ensembl and STRING-db for 59 plant, 256 animal, 115 archaeal and 1678 bacterial species. ShinyGO’s novel features include graphical visualization of enrichment results and gene characteristics, and application program interface access to KEGG and STRING for the retrieval of pathway diagrams and protein–protein interaction networks. ShinyGO is an intuitive, graphical web application that can help researchers gain actionable insights from gene-sets.
Availability and implementation
http://ge-lab.org/go/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Advances in sequencing technologies have led to a surge in genomic data, although the functions of many gene products coded by these genes remain unknown. While in-depth, targeted experiments that determine the functions of these gene products are crucial and routinely performed, they fail to keep up with the inflow of novel genomic data. In an attempt to address this gap, high-throughput experiments are being conducted in which a large number of genes are investigated in a single study. The annotations generated as a result of these experiments are generally biased towards a small subset of less informative Gene Ontology (GO) terms. Identifying and removing biases from protein function annotation databases is important since biases impact our understanding of protein function by providing a poor picture of the annotation landscape. Additionally, as machine learning methods for predicting protein function are becoming increasingly prevalent, it is essential that they are trained on unbiased datasets. Therefore, it is not only crucial to be aware of biases, but also to judiciously remove them from annotation datasets.
Results
We introduce GOThresher, a Python tool that identifies and removes biases in function annotations from protein function annotation databases.
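One simple way to picture this kind of bias removal is to drop annotations that come from references annotating an unusually large number of proteins, which is typical of high-throughput studies. The sketch below is an assumption-laden illustration: the tuple layout, the threshold and the function name `filter_high_throughput` are hypothetical and are not GOThresher's actual interface.

```python
from collections import defaultdict


def filter_high_throughput(annotations, max_proteins_per_reference):
    """Keep annotations whose reference annotates at most the given number of proteins.

    annotations: list of (protein, go_term, reference) tuples (hypothetical layout).
    """
    proteins_by_reference = defaultdict(set)
    for protein, _term, reference in annotations:
        proteins_by_reference[reference].add(protein)
    return [
        annotation
        for annotation in annotations
        if len(proteins_by_reference[annotation[2]]) <= max_proteins_per_reference
    ]


# Hypothetical toy data: REF:2 annotates three proteins with the same broad term.
toy = [
    ("P1", "GO:0003677", "REF:1"),
    ("P1", "GO:0005515", "REF:2"),
    ("P2", "GO:0005515", "REF:2"),
    ("P3", "GO:0005515", "REF:2"),
]
print(filter_high_throughput(toy, max_proteins_per_reference=2))
```

The threshold is a design choice: set too low, it discards careful multi-protein studies; set too high, it leaves the broad, less informative terms in place.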
Availability and implementation
GOThresher is written in Python and released via PyPI https://pypi.org/project/gothresher/ and on the Bioconda Anaconda channel https://anaconda.org/bioconda/gothresher. The source code is hosted on GitHub https://github.com/FriedbergLab/GOThresher and distributed under the GPL 3.0 license.
Supplementary information
Supplementary data are available at Bioinformatics online.
Due to the nature of experimental annotation, most protein function prediction methods operate at the protein-level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Most domain-centric function prediction methods depend on accurate domain family assignments to infer relationships between domains and functions, with regions that are unassigned to a known domain-family left out of functional evaluation. Given the abundance of residue-level annotations currently available, we present a function prediction methodology that automatically infers function labels of specific protein regions using protein-level annotations and multiple types of region-specific features.
Results
We apply this method to local features obtained from InterPro, UniProtKB and amino acid sequences and show that this method improves both the accuracy and region-specificity of protein function transfer and prediction. We compare region-level predictive performance of our method against that of a whole-protein baseline method using proteins with structurally verified binding sites and also compare protein-level temporal holdout predictive performances to expand the variety and specificity of GO terms we could evaluate. Our results can also serve as a starting point to categorize GO terms into region-specific and whole-protein terms and select prediction methods for different classes of GO terms.
Availability and implementation
The code and features are freely available at: https://github.com/ek1203/rsfp.
Supplementary information
Supplementary data are available at Bioinformatics online.
Species tree inference from multi-copy gene trees has long been a challenge in phylogenomics. The recent method ASTRAL-Pro has made strides by enabling multi-copy gene family trees as input and has been quickly adopted. Yet, its scalability, especially memory usage, needs to improve to accommodate the ever-growing dataset size.
Results
We present ASTRAL-Pro 2, an ultrafast and memory efficient version of ASTRAL-Pro that adopts a placement-based optimization algorithm for significantly better scalability without sacrificing accuracy.
Availability and implementation
The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER; data are available at https://github.com/chaoszhang/A-Pro2_data.
Supplementary information
Supplementary data are available at Bioinformatics online.
Abstract
Motivation
Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tied to a network for a single species, and many species lack biological networks.
Results
In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method and the BLAST annotation method used in the Critical Assessment of Functional Annotation. We demonstrate that our approach performs well even when a species has no network information available: when an organism’s PPI network is left out, our multispecies method can still make predictions for the left-out organism with good performance.
Availability and implementation
The code is freely available at https://github.com/nowittynamesleft/NetQuilt. The data, including sequences, PPI networks and GO annotations, are available at https://string-db.org/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Dickson, Andrew, Asgari, Ehsaneddin, McHardy, Alice C., and Mofrad, Mohammad R. K. GO Bench: shared hub for universal benchmarking of machine learning-based protein functional annotations. Edited by Lenore Cowen. Bioinformatics 39.2. doi:10.1093/bioinformatics/btad081.
@article{osti_10398370,
place = {Country unknown/Code not available},
title = {GO Bench: shared hub for universal benchmarking of machine learning-based protein functional annotations},
url = {https://par.nsf.gov/biblio/10398370},
DOI = {10.1093/bioinformatics/btad081},
journal = {Bioinformatics},
volume = {39},
number = {2},
publisher = {Oxford University Press},
author = {Dickson, Andrew and Asgari, Ehsaneddin and McHardy, Alice C. and Mofrad, Mohammad R. K.},
editor = {Cowen, Lenore},
}