NSF PAR Search | NSF Public Access Repository


GO2Sum: generating human-readable functional summary of proteins from GO terms

https://doi.org/10.1038/s41540-024-00358-0

Giri, Swagarika Jaharlal; Ibtehaz, Nabil; Kihara, Daisuke (December 2024, npj Systems Biology and Applications)

Abstract Understanding the biological functions of proteins is of fundamental importance in modern biology. To represent a function of proteins, Gene Ontology (GO), a controlled vocabulary, is frequently used, because it is easy to handle by computer programs avoiding open-ended text interpretation. Particularly, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describe a protein function can pose challenges for biologists when it comes to interpretation. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.
Full Text Available
Chromosome level genome assembly of the Etruscan shrew Suncus etruscus

https://doi.org/10.1038/s41597-024-03011-x

Bukhman, Yury V; Meyer, Susanne; Chu, Li-Fang; Abueg, Linelle; Antosiewicz-Bourget, Jessica; Balacco, Jennifer; Brecht, Michael; Dinatale, Erica; Fedrigo, Olivier; Formenti, Giulio; et al (December 2024, Scientific Data)

Abstract Suncus etruscus is one of the world’s smallest mammals, with an average body mass of about 2 grams. The Etruscan shrew’s small body is accompanied by a very high energy demand and numerous metabolic adaptations. Here we report a chromosome-level genome assembly using PacBio long read sequencing, 10X Genomics linked short reads, optical mapping, and Hi-C linked reads. The assembly is partially phased, with the 2.472 Gbp primary pseudohaplotype and 1.515 Gbp alternate. We manually curated the primary assembly and identified 22 chromosomes, including X and Y sex chromosomes. The NCBI genome annotation pipeline identified 39,091 genes, 19,819 of them protein-coding. We also identified segmental duplications, inferred GO term annotations, and computed orthologs of human and mouse genes. This reference-quality genome will be an important resource for research on mammalian development, metabolism, and body size control.
Full Text Available
Domain-PFP allows protein function prediction using function-aware domain embedding representations

https://doi.org/10.1038/s42003-023-05476-9

Ibtehaz, Nabil; Kagaya, Yuki; Kihara, Daisuke (December 2023, Communications Biology)

Abstract Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, substantially outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment.
Full Text Available
A High-Quality Blue Whale Genome, Segmental Duplications, and Historical Demography

https://doi.org/10.1093/molbev/msae036

Bukhman, Yury V; Morin, Phillip A; Meyer, Susanne; Chu, Li-Fang; Jacobsen, Jeff K; Antosiewicz-Bourget, Jessica; Mamott, Daniel; Gonzales, Maylie; Argus, Cara; Bolin, Jennifer; et al (March 2024, Molecular Biology and Evolution)
Gaut, Brandon (Ed.)
Abstract The blue whale, Balaenoptera musculus, is the largest animal known to have ever existed, making it an important case study in longevity and resistance to cancer. To further this and other blue whale-related research, we report a reference-quality, long-read-based genome assembly of this fascinating species. We assembled the genome from PacBio long reads and utilized Illumina/10×, optical maps, and Hi-C data for scaffolding, polishing, and manual curation. We also provided long read RNA-seq data to facilitate the annotation of the assembly by NCBI and Ensembl. Additionally, we annotated both haplotypes using TOGA and measured the genome size by flow cytometry. We then compared the blue whale genome with other cetaceans and artiodactyls, including vaquita (Phocoena sinus), the world's smallest cetacean, to investigate blue whale's unique biological traits. We found a dramatic amplification of several genes in the blue whale genome resulting from a recent burst in segmental duplications, though the possible connection between this amplification and giant body size requires further study. We also discovered sites in the insulin-like growth factor-1 gene correlated with body size in cetaceans. Finally, using our assembly to examine the heterozygosity and historical demography of Pacific and Atlantic blue whale populations, we found that the genomes of both populations are highly heterozygous and that their genetic isolation dates to the last interglacial period. Taken together, these results indicate how a high-quality, annotated blue whale genome will serve as an important resource for biology, evolution, and conservation research.
Full Text Available
Proteomic Analysis of Unicellular Cyanobacterium Crocosphaera subtropica ATCC 51142 under Extended Light or Dark Growth

https://doi.org/10.1021/acs.jproteome.4c00439

Panda, Punyatoya; Giri, Swagarika J; Sherman, Louis A; Kihara, Daisuke; Aryal, Uma K (February 2025, Journal of Proteome Research)

Full Text Available
Bioinformatic Approaches for Characterizing Molecular Structure and Function of Food Proteins

https://doi.org/10.1146/annurev-food-060721-022222

Helmick, Harrison; Jain, Anika; Terashi, Genki; Liceaga, Andrea; Bhunia, Arun K.; Kihara, Daisuke; Kokini, Jozef L. (March 2023, Annual Review of Food Science and Technology)

Structural bioinformatics analyzes protein structural models with the goal of uncovering molecular drivers of food functionality. This field aims to develop tools that can rapidly extract relevant information from protein databases as well as organize this information for researchers interested in studying protein functionality. Food bioinformaticians take advantage of millions of protein amino acid sequences and structures contained within these databases, extracting features such as surface hydrophobicity that are then used to model functionality, including solubility, thermostability, and emulsification. This work is aided by a protein structure–function relationship framework, in which bioinformatic properties are linked to physicochemical experimentation. Strong bioinformatic correlations exist for protein secondary structure, electrostatic potential, and surface hydrophobicity. Modeling changes in protein structures through molecular mechanics is an increasingly accessible field that will continue to propel food science research.
Full Text Available
A haplotype-resolved genome assembly of the Nile rat facilitates exploration of the genetic basis of diabetes

https://doi.org/10.1186/s12915-022-01427-8

Toh, Huishi; Yang, Chentao; Formenti, Giulio; Raja, Kalpana; Yan, Lily; Tracey, Alan; Chow, William; Howe, Kerstin; Bergeron, Lucie A.; Zhang, Guojie; et al (December 2022, BMC Biology)

Abstract Background: The Nile rat (Arvicanthis niloticus) is an important animal model because of its robust diurnal rhythm, a cone-rich retina, and a propensity to develop diet-induced diabetes without chemical or genetic modifications. A closer similarity to humans in these aspects, compared to the widely used Mus musculus and Rattus norvegicus models, holds the promise of better translation of research findings to the clinic. Results: We report a 2.5 Gb, chromosome-level reference genome assembly with fully resolved parental haplotypes, generated with the Vertebrate Genomes Project (VGP). The assembly is highly contiguous, with contig N50 of 11.1 Mb, scaffold N50 of 83 Mb, and 95.2% of the sequence assigned to chromosomes. We used a novel workflow to identify 3613 segmental duplications and quantify duplicated genes. Comparative analyses revealed unique genomic features of the Nile rat, including some that affect genes associated with type 2 diabetes and metabolic dysfunctions. We discuss 14 genes that are heterozygous in the Nile rat or highly diverged from the house mouse. Conclusions: Our findings reflect the exceptional level of genomic resolution present in this assembly, which will greatly expand the potential of the Nile rat as a model organism.
Full Text Available
ContactPFP: Protein Function Prediction Using Predicted Contact Information

https://doi.org/10.3389/fbinf.2022.896295

Kagaya, Yuki; Flannery, Sean T.; Jain, Aashish; Kihara, Daisuke (June 2022, Frontiers in Bioinformatics)

Computational function prediction is one of the most important problems in bioinformatics as elucidating the function of genes is a central task in molecular biology and genomics. Most of the existing function prediction methods use protein sequences as the primary source of input information because the sequence is the most available information for query proteins. There are attempts to consider other attributes of query proteins. Among these attributes, the three-dimensional (3D) structure of proteins is known to be very useful in identifying the evolutionary relationship of proteins, from which functional similarity can be inferred. Here, we report a novel protein function prediction method, ContactPFP, which uses predicted residue-residue contact maps as input structural features of query proteins. Although 3D structure information is known to be useful, it has not been routinely used in function prediction because the 3D structure is not experimentally determined for many proteins. In ContactPFP, we overcome this limitation by using residue-residue contact prediction, which has become increasingly accurate due to rapid development in the protein structure prediction field. ContactPFP takes a query protein sequence as input and uses predicted residue-residue contact as a proxy for the 3D protein structure. To characterize how predicted contacts contribute to function prediction accuracy, we compared the performance of ContactPFP with several well-established sequence-based function prediction methods. The comparative study revealed the advantages and weaknesses of ContactPFP compared to contemporary sequence-based methods. There were many cases where it showed higher prediction accuracy. We examined factors that affected the accuracy of ContactPFP using several illustrative cases that highlight the strength of our method.
Full Text Available
Analyzing effect of quadruple multiple sequence alignments on deep learning based protein inter-residue distance prediction

https://doi.org/10.1038/s41598-021-87204-z

Jain, Aashish; Terashi, Genki; Kagaya, Yuki; Maddhuri Venkata Subramaniya, Sai Raghavendra; Christoffer, Charles; Kihara, Daisuke (December 2021, Scientific Reports)
Abstract Protein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA’s feature at the inter-residue level, we added an attention layer to the deep neural network. We show that combining four MSAs of different E-value cutoffs improved the model prediction performance as compared to single E-value MSA features. A further improvement was observed when an attention layer was used and even more when additional prediction tasks of bond angle predictions were added. The improvement in distance predictions was successfully transferred to achieve better protein tertiary structure modeling.
Full Text Available
Protein contact map refinement for improving structure prediction using generative adversarial networks

https://doi.org/10.1093/bioinformatics/btab220

Maddhuri Venkata Subramaniya, Sai Raghavendra; Terashi, Genki; Jain, Aashish; Kagaya, Yuki; Kihara, Daisuke (March 2021, Bioinformatics)
Valencia, Alfonso (Ed.)
Abstract Motivation: Protein structure prediction remains one of the most important problems in computational biology and biophysics. In the past few years, protein residue–residue contact prediction has undergone substantial improvement, which has made it a critical driving force for successful protein structure prediction. Boosting the accuracy of contact predictions has, therefore, become the forefront of protein structure prediction. Results: We show a novel contact map refinement method, ContactGAN, which uses Generative Adversarial Networks (GAN). ContactGAN was able to make a significant improvement over predictions made by recent contact prediction methods when tested on three datasets including protein structure modeling targets in CASP13 and CASP14. We show improvement of precision in contact prediction, which translated into improvement in the accuracy of protein tertiary structure models. On the other hand, observed improvement over trRosetta was relatively small, reasons for which are discussed. ContactGAN will be a valuable addition in the structure prediction pipeline to achieve an extra gain in contact prediction accuracy. Availability and implementation: https://github.com/kiharalab/ContactGAN. Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
