Title: The diversity of experimental organisms in biomedical research may be influenced by biomedical funding
Contrary to concerns of some critics, we present evidence that biomedical research is not dominated by a small handful of model organisms. An exhaustive analysis of research literature suggests that the diversity of experimental organisms in biomedical research has increased substantially since 1975. There has been a longstanding worry that organism‐centric funding policies can lead to biases in experimental organism choice, and thus negatively impact the direction of research and the interpretation of results. Critics have argued that a focus on model organisms has unduly constrained the diversity of experimental organisms. The availability of large electronic databases of scientific literature, combined with interest in quantitative methods among philosophers of science, presents new opportunities for data‐driven investigations into organism choice in biomedical research. The diversity of organisms used in NIH‐funded research may be considerably lower than in the broader biomedical sciences, and may be subject to greater constraints on organism choice.
Award ID(s):
1656284
PAR ID:
10026900
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
BioEssays
Volume:
39
Issue:
5
ISSN:
0265-9247
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract. Background: As biomedical knowledge is rapidly evolving, concept enrichment of biomedical terminologies is an active research area involving automatic identification of missing or new concepts. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach in which concepts were derived by intersecting bags of words, to identify potentially missing concepts in the National Cancer Institute (NCI) Thesaurus. However, this prototype did not handle concept naming and positioning. In this paper, we introduce a sequence-based FCA approach to identify potentially missing concepts, supporting concept naming and positioning. Methods: We consider the concept name sequences as FCA attributes to construct the formal context. The concept-forming process is performed by computing the longest common substrings of concept name sequences. After new concepts are formalized, we further predict their potential positions in the original hierarchy by identifying their supertypes and subtypes from original concepts. Automated validation via external terminologies in the Unified Medical Language System (UMLS) and biomedical literature in PubMed is performed to evaluate the effectiveness of our approach. Results: We applied our sequence-based FCA approach to all the sub-hierarchies under Disease or Disorder in the NCI Thesaurus (19.08d version) and five sub-hierarchies under Clinical Finding and Procedure in the SNOMED CT (US Edition, March 2020 release). In total, 1397 potentially missing concepts were identified in the NCI Thesaurus and 7223 in the SNOMED CT. For the NCI Thesaurus, 85 potentially missing concepts were found in external terminologies and 315 of the remaining 1312 appeared in biomedical literature. For SNOMED CT, 576 were found in external terminologies and 1159 of the remaining 6647 were found in biomedical literature. Conclusion: Our sequence-based FCA approach has shown promise for identifying potentially missing concepts in biomedical terminologies.
  2. Abstract. Motivation: Large language models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains such as biomedicine. Solutions such as pretraining and domain-specific fine-tuning add substantial computational overhead and require further domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework by leveraging a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo, and GPT-4, to generate meaningful biomedical text rooted in established knowledge. Results: Compared to the existing RAG technique for Knowledge Graphs, the proposed method utilizes minimal graph schema for context extraction and uses embedding methods for context pruning. This optimization in context extraction results in more than 50% reduction in token consumption without compromising accuracy, making for a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (if available) to substantiate the claims. Further benchmarking on human-curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework’s capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5 and GPT-4. In summary, the proposed framework combines explicit and implicit knowledge of KG and LLM in a token-optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions in a cost-effective fashion. Availability and implementation: SPOKE KG can be accessed at https://spoke.rbvi.ucsf.edu/neighborhood.html. It can also be accessed using REST-API (https://spoke.rbvi.ucsf.edu/swagger/). KG-RAG code is made available at https://github.com/BaranziniLab/KG_RAG. Biomedical benchmark datasets used in this study are made available to the research community in the same GitHub repository.
  3. Abstract. Mormyroidea is a superfamily of weakly electric African fishes with great potential as a model in a variety of biomedical research areas including systems neuroscience, muscle cell and craniofacial development, ion channel biophysics, and flagellar/ciliary biology. However, they are currently difficult to breed in the laboratory, and reliable breeding is essential for any tractable model organism. As such, there is a need to better understand the reproductive biology of mormyroids so that they can be bred more reliably in the laboratory and used effectively as a biomedical research model. This review seeks to (1) briefly highlight the biomedically relevant phenotypes of mormyroids and (2) compile information about mormyroid reproduction including sex differences, breeding season, sexual maturity, gonads, gametes, and courtship/spawning behaviors. We also highlight areas of mormyroid reproductive biology that are currently unexplored and/or have the potential for further investigation that may provide insights into more successful mormyroid laboratory breeding methods.
  4. Entity linking is the task of linking mentions of named entities in natural language text to entities in a curated knowledge base. This is of significant importance in the biomedical domain, where it could be used to semantically annotate a large volume of clinical records and biomedical literature with standardized concepts described in an ontology such as the Unified Medical Language System (UMLS). We observe that with precise type information, entity disambiguation becomes a straightforward task. However, fine-grained type information is usually not available in the biomedical domain. Thus, we propose LATTE, a LATent Type Entity Linking model, that improves entity linking by modeling the latent fine-grained type information about mentions and entities. Unlike previous methods that perform entity linking directly between the mentions and the entities, LATTE jointly performs entity disambiguation and latent fine-grained type learning, without direct supervision. We evaluate our model on two biomedical datasets: MedMentions, a large-scale public dataset annotated with UMLS concepts, and a de-identified corpus of dictated doctor’s notes that has been annotated with ICD concepts. Extensive experimental evaluation shows our model achieves significant performance improvements over several state-of-the-art techniques.
  5. Abstract. Motivation: Figures in biomedical papers communicate essential information with the potential to identify relevant documents in biomedical and clinical settings. However, academic search interfaces mainly search over text fields. Results: We describe a search system for biomedical documents that leverages image modalities and an existing index server. We integrate a problem-specific taxonomy of image modalities and image-based data into a custom search system. Our solution features a front-end interface to enhance classical document search results with image-related data, including page thumbnails, figures, captions and image-modality information. We demonstrate the system on a subset of the CORD-19 document collection. A quantitative evaluation demonstrates higher precision and recall for biomedical document retrieval. A qualitative evaluation with domain experts further highlights our solution’s benefits to biomedical search. Availability and implementation: A demonstration is available at https://runachay.evl.uic.edu/scholar. Our code and image models can be accessed via github.com/uic-evl/bio-search. The dataset is continuously expanded.