Abstract

Background: Natural language processing (NLP) tasks in the health domain often deal with limited amounts of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to the task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks often used training data at a scale that is unlikely to be available in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP.

Method: We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we incremented the number of training examples per disease from small to large and measured classification performance as the macro-averaged $F_1$ score.

Results: On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. However, under extreme conditions where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features.

Conclusion: As long as the number of training documents is not extremely small, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks such as disease classification. However, in extreme cases where each class has only one or two training documents and no more will become available, simple linear models using bag-of-words features should be considered.
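The evaluation described above increments the number of training documents per disease and reports macro-averaged $F_1$ at each point. A minimal sketch of such a learning-curve loop is shown below; the scikit-learn TF-IDF plus logistic regression pipeline stands in for the paper's bag-of-words baseline, and the data variables are hypothetical placeholders rather than the study's data sets.

```python
# Minimal learning-curve sketch: train on an increasing number of documents
# per class and report macro-averaged F1 (which weights every disease equally).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

def learning_curve(train_texts, train_labels, test_texts, test_labels, sizes):
    """Return macro-F1 scores for increasing training-set sizes per class."""
    scores = []
    for n_per_class in sizes:
        # Keep at most n_per_class training documents for each disease label.
        kept, counts = [], {}
        for text, label in zip(train_texts, train_labels):
            if counts.get(label, 0) < n_per_class:
                kept.append((text, label))
                counts[label] = counts.get(label, 0) + 1
        X, y = zip(*kept)
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(X, y)
        predictions = model.predict(test_texts)
        scores.append(f1_score(test_labels, predictions, average="macro"))
    return scores
```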
Annotating publicly-available samples and studies using interpretable modeling of unstructured metadata
Abstract

Reusing massive collections of publicly available biomedical data can significantly impact knowledge discovery. However, these public samples and studies are typically described using unstructured plain text, hindering the findability and further reuse of the data. To combat this problem, we propose txt2onto 2.0, a general-purpose method based on natural language processing and machine learning for annotating biomedical unstructured metadata to controlled vocabularies of diseases and tissues. Compared to the previous version (txt2onto 1.0), which uses numerical embeddings as features, this new version uses words as features, resulting in improved interpretability and performance, especially when few positive training instances are available. Txt2onto 2.0 uses embeddings from a large language model during prediction to deal with unseen-yet-relevant words related to each disease and tissue term being predicted from the input text, thereby explaining the basis of every annotation. We demonstrate the generalizability of txt2onto 2.0 by accurately predicting disease annotations for studies from independent datasets, using proteomics and clinical trials as examples. Overall, our approach can annotate biomedical text regardless of experimental types or sources. Code, data, and trained models are available at https://github.com/krishnanlab/txt2onto2.0.
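The abstract emphasizes that txt2onto 2.0 gains interpretability by using words as features for its disease and tissue classifiers. The actual pipeline lives in the linked repository; the snippet below is only a minimal sketch of that general idea, pairing word-level features with a linear classifier whose coefficients reveal which words drive a prediction, using hypothetical example metadata.

```python
# Sketch of word-level features feeding an interpretable linear classifier,
# in the spirit of the description above (not the actual txt2onto 2.0 code;
# see https://github.com/krishnanlab/txt2onto2.0 for the real implementation).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical sample/study descriptions labeled for one disease term.
metadata = [
    "whole blood samples from acute myeloid leukemia patients",
    "bone marrow collected from patients with relapsed leukemia",
    "healthy liver tissue biopsy from adult donors",
    "lung epithelial cells cultured under hypoxic conditions",
]
labels = [1, 1, 0, 0]  # 1 = relevant to the disease term, 0 = not relevant

vectorizer = CountVectorizer(binary=True, stop_words="english")
X = vectorizer.fit_transform(metadata)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Because the features are words, the largest coefficients show which words
# pushed a description toward a positive annotation.
words = vectorizer.get_feature_names_out()
top_words = sorted(zip(clf.coef_[0], words), reverse=True)[:5]
for weight, word in top_words:
    print(f"{word}: {weight:.3f}")
```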
- Award ID(s): 2328140
- PAR ID: 10561675
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: Briefings in Bioinformatics
- Volume: 26
- Issue: 1
- ISSN: 1467-5463
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed low-dimensional vectors to capture their semantics. The word embeddings so learned are used as the input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications because they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first overview a set of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective, and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate, and efficient text analyses.
- Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural language. Combining this text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.
- Abstract. Motivation: Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this cross-species transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein–protein interactions to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem, which pits network features against known orthology, or, more recently, as a joint embedding problem. Results: We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structure and then uses a natural language processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within- and between-species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA's embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential drug repurposing and translational studies. Availability and implementation: https://github.com/ylaboratory/ETNA
- Abstract. There is a growing interest in using social media content for natural language processing applications. However, it is not easy to computationally identify the most relevant set of tweets related to any specific event. Challenging semantics coupled with different ways of using natural language in social media make it difficult to retrieve the most relevant set of data from any social media outlet. This paper seeks to demonstrate a way to present the changing semantics of Twitter within the context of a crisis event, specifically tweets during Hurricane Irma. These methods can be used to identify the corpus of text most relevant to a specific incident such as a hurricane. Using an implementation of the Word2Vec method of neural network training to create word embeddings, this paper will: discuss how the relative meaning of words changes as events unfold; present a mechanism for scoring tweets based upon dynamic, relative context relatedness; and show that similarity between words is not necessarily static. We present different methods for training the vector model in Word2Vec for identification of the most relevant tweets for any search query. The impact of tuning parameters such as word window size, minimum word frequency, hidden layer dimensionality, and negative sampling on model performance was explored. The window containing the local maximum of the area under the ROC curve (AUROC) for each parameter serves as a guide for other studies using the methods presented here for social media data analysis.
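The last item above tunes Word2Vec hyperparameters such as word window size, minimum word frequency, hidden layer dimensionality, and negative sampling to score tweet relevance. A minimal, hypothetical sketch of that setup using gensim is shown below; the tweets, parameter values, and cosine-similarity scoring are illustrative assumptions, not the configuration reported in the paper.

```python
# Hypothetical sketch: train Word2Vec on a small tweet corpus and rank tweets
# against a query by cosine similarity of averaged word vectors.
import numpy as np
from gensim.models import Word2Vec

tweets = [
    "hurricane irma approaching the florida coast",
    "shelters open in miami ahead of the storm",
    "great pizza place downtown tonight",
]
corpus = [t.split() for t in tweets]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # hidden layer dimensionality
    window=5,         # word window size
    min_count=1,      # minimum word frequency
    negative=5,       # negative sampling
    epochs=50,
)

def avg_vector(tokens):
    """Average the word vectors of the tokens the model knows about."""
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

query_vec = avg_vector("hurricane storm".split())
for tweet in tweets:
    tweet_vec = avg_vector(tweet.split())
    score = float(np.dot(query_vec, tweet_vec) /
                  (np.linalg.norm(query_vec) * np.linalg.norm(tweet_vec) + 1e-9))
    print(f"{score:.3f}  {tweet}")
```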