Title: A roadmap for the functional annotation of protein families: a community perspective
Abstract: Over the last 25 years, biology has entered the genomic era and is becoming a science of ‘big data’. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3–4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
Award ID(s):
2129768
PAR ID:
10429083
Date Published:
Journal Name:
Database
Volume:
2022
ISSN:
1758-0463
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Schwartz, Russell (Ed.)
    Motivation: Identification and interpretation of non-coding variations that affect disease risk remain a paramount challenge in genome-wide association studies (GWAS) of complex diseases. Experimental efforts have provided comprehensive annotations of functional elements in the human genome. On the other hand, advances in computational biology, especially machine learning approaches, have facilitated accurate predictions of cell-type-specific functional annotations. Integrating functional annotations with GWAS signals has advanced the understanding of disease mechanisms. In previous studies, functional annotations were treated as static properties of a genomic region, ignoring potential functional differences imposed by different genotypes across individuals. Results: We develop a computational approach, Openness Weighted Association Studies (OWAS), to leverage and aggregate predictions of chromosome accessibility in personal genomes for prioritizing GWAS signals. The approach relies on an analytical expression we derived for identifying disease associated genomic segments whose effects in the etiology of complex diseases are evaluated. In extensive simulations and real data analysis, OWAS identifies genes/segments that explain more heritability than existing methods, and has a better replication rate in independent cohorts than GWAS. Moreover, the identified genes/segments show tissue-specific patterns and are enriched in disease relevant pathways. We use rheumatic arthritis and asthma as examples to demonstrate how OWAS can be exploited to provide novel insights on complex diseases. Availability and implementation: The R package OWAS that implements our method is available at https://github.com/shuangsong0110/OWAS. Supplementary information: Supplementary data are available at Bioinformatics online.
  2. Cowen, Lenore (Ed.)
    Motivation: Nearly 40% of the genes in sequenced genomes have no experimentally or computationally derived functional annotations. To fill this gap, we seek to develop methods for network-based gene function prediction that can integrate heterogeneous data for multiple species with experimentally based functional annotations and systematically transfer them to newly sequenced organisms on a genome-wide scale. However, the large sizes of such networks pose a challenge for the scalability of current methods. Results: We develop a label propagation algorithm called FastSinkSource. By formally bounding its rate of progress, we decrease the running time by a factor of 100 without sacrificing accuracy. We systematically evaluate many approaches to construct multi-species bacterial networks and apply FastSinkSource and other state-of-the-art methods to these networks. We find that the most accurate and efficient approach is to pre-compute annotation scores for species with experimental annotations, and then to transfer them to other organisms. In this manner, FastSinkSource runs in under 3 min for 200 bacterial species. Availability and implementation: An implementation of our framework and all data used in this research are available at https://github.com/Murali-group/multi-species-GOA-prediction. Supplementary information: Supplementary data are available at Bioinformatics online.
  3. Chloroviruses (family Phycodnaviridae) are dsDNA viruses found throughout the world’s inland waters. The open reading frames in the genomes of 41 sequenced chloroviruses (330 ± 40 kbp each) representing three virus types were analyzed for evidence of evolutionarily conserved local genomic “contexts”, the organization of biological information into units of a scale larger than a gene. Despite a general loss of synteny between virus types, we informatically detected a highly conserved genomic context defined by groups of three or more genes that we have termed “gene gangs”. Unlike previously described local genomic contexts, the definition of gene gangs requires only that member genes be consistently co-localized and does not constrain them by strand, regulatory sites, or intervening sequences (gene gangs therefore represent a new type of conserved structural genomic element). An analysis of functional annotations and transcriptomic data suggests that some of the gene gangs may organize genes involved in specific biochemical processes, but that this organization does not involve their coordinated expression.
  4. The contemporary capacity of genome sequence analysis significantly lags behind the rapidly evolving sequencing technologies. Retrieving biologically meaningful information from an ever-increasing amount of genome data would be significantly beneficial for functional genomic studies. For example, the duplication, organization, evolution, and function of superfamily genes are arguably important in many aspects of life. However, the incompleteness of annotations in many sequenced genomes often results in biased conclusions in comparative genomic studies of superfamilies. Here, we present a Perl program, called Closing Target Trimming (CTT), for automatically identifying most, if not all, members of a gene family in any sequenced genome on the CentOS 7 platform. To support broader application on other operating systems, we also created a Docker application package, CTTdocker. Our test data on the F-box gene superfamily showed 78.2 and 79% gene finding accuracies in two well annotated plant genomes, Arabidopsis thaliana and rice, respectively. To further demonstrate the effectiveness of this program, we ran it through 18 plant genomes and five non-plant genomes to compare the expansion of the F-box and the BTB superfamilies. The program discovered that on average 12.7 and 9.3% of the total F-box and BTB members, respectively, are new loci in plant genomes, while it only found a small number of new members in vertebrate genomes. Therefore, different evolutionary and regulatory mechanisms of cullin-RING ubiquitin ligases may be present in plants and animals. We also annotated and compared the Pkinase family members across a wide range of organisms, including 10 fungi, 10 metazoa, 10 vertebrates, and 10 additional plants, which were randomly selected from the Ensembl database. Our CTT annotation recovered on average 14% more loci, including pseudogenes, of the Pkinase superfamily in these 40 genomes, demonstrating its robust replicability and scalability in annotating superfamily members in any genome.
  5. Legumes are of great interest for sustainable agricultural production as they fix atmospheric nitrogen to improve the soil. Medicago truncatula is a well-established model legume, and extensive studies in fundamental molecular, physiological, and developmental biology have been undertaken to translate into trait improvements in economically important legume crops worldwide. However, the M. truncatula reference genome was generated in the accession Jemalong A17, which is highly recalcitrant to transformation. M. truncatula R108 is more attractive for genetic studies due to its high transformation efficiency and Tnt1-insertion population resource for functional genomics. The need to perform accurate synteny analysis and comprehensive genome-scale comparisons necessitates a chromosome-length genome assembly for M. truncatula cv. R108. Here, we performed in situ Hi-C (48×) to anchor, order, orient scaffolds, and correct misjoins of contigs in a previously published genome assembly (R108 v1.0), resulting in an improved genome assembly containing eight chromosome-length scaffolds that span 97.62% of the sequenced bases in the input assembly. The long-range physical information data generated using Hi-C allowed us to obtain a chromosome-length ordering of the genome assembly, better validate previous draft misjoins, and provide further insights accurately predicting synteny between A17 and R108 regions corresponding to the known chromosome 4/8 translocation. Furthermore, mapping the Tnt1 insertion landscape on this reference assembly presents an important resource for M. truncatula functional genomics by supporting efficient mutant gene identification in Tnt1 insertion lines. Our data provide a much-needed foundational resource that supports functional and molecular research into the Leguminosae for sustainable agriculture and feeding the future.