Search for: All records

Creators/Authors contains: "Smith, Brett"

  1. Abstract

    Motivation

    Large language models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains such as biomedicine. Solutions such as pretraining and domain-specific fine-tuning add substantial computational overhead and require further domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework that combines a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo, and GPT-4 to generate meaningful biomedical text rooted in established knowledge.

    Results

    Compared to the existing RAG technique for knowledge graphs, the proposed method uses a minimal graph schema for context extraction and embedding methods for context pruning. This optimization in context extraction reduces token consumption by more than 50% without compromising accuracy, making for a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (if available) to substantiate the claims. Further benchmarking on human-curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework’s capacity to empower open-source models with fewer parameters for domain-specific questions. KG-RAG also enhanced the performance of the proprietary GPT models GPT-3.5 and GPT-4. In summary, the proposed framework combines the explicit knowledge of the KG with the implicit knowledge of the LLM in a token-optimized fashion, enhancing the adaptability of general-purpose LLMs to domain-specific questions at low cost.
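
    The context-pruning idea above can be illustrated with a minimal sketch. It is not the KG-RAG implementation: the `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model, and the statements, the question, and the `min_similarity` threshold are all hypothetical values chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_context(question: str, statements: list[str],
                  min_similarity: float = 0.2) -> list[str]:
    """Keep only the graph statements that are similar enough to the question."""
    q = embed(question)
    return [s for s in statements if cosine(q, embed(s)) >= min_similarity]


# Hypothetical graph context, written out as plain-text statements.
context = [
    "Disease psoriasis associates Gene IL23R",
    "Disease psoriasis resembles Disease eczema",
    "Compound aspirin treats Disease headache",
]
question = "Which gene is associated with psoriasis?"
pruned = prune_context(question, context)
```

    Only the pruned statements would be placed in the LLM prompt, which is where a saving in tokens would come from; a real system would swap `embed` for a learned embedding model and tune the threshold on held-out questions.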

    Availability and implementation

    The SPOKE KG can be accessed at https://spoke.rbvi.ucsf.edu/neighborhood.html and through a REST API (https://spoke.rbvi.ucsf.edu/swagger/). The KG-RAG code is available at https://github.com/BaranziniLab/KG_RAG. The biomedical benchmark datasets used in this study are available to the research community in the same GitHub repository.

     
  2. Abstract

    Extreme precision radial velocity (EPRV) measurements contend with internal noise (instrumental systematics) and external noise (intrinsic stellar variability) on the road to 10 cm s⁻¹ “exo-Earth” sensitivity. Both of these noise sources are well-probed using “Sun-as-a-star” RVs and cross-instrument comparisons. We built the Solar Calibrator (SoCal), an autonomous system that feeds stable, disk-integrated sunlight to the recently commissioned Keck Planet Finder (KPF) at the W. M. Keck Observatory. With SoCal, KPF acquires signal-to-noise ratio (S/N) ∼ 1200, R = 98,000 optical (445–870 nm) spectra of the Sun in 5 s exposures at unprecedented cadence for an EPRV facility using KPF’s fast readout mode (<16 s between exposures). Daily autonomous operation is achieved by defining an operations loop using state machine logic. Data affected by clouds are automatically flagged using a reliable quality control metric derived from simultaneous irradiance measurements. Comparing solar data across the growing global network of EPRV spectrographs with solar feeds will allow EPRV teams to disentangle internal and external noise sources and benchmark spectrograph performance. To facilitate this, all SoCal data products are immediately available to the public on the Keck Observatory Archive. We compared SoCal RVs to contemporaneous RVs from NEID, the only other immediately public EPRV solar data set. We find agreement at the 30–40 cm s⁻¹ level on timescales of several hours, which is comparable to the combined photon-limited precision. Data from SoCal were also used to assess a detector problem and wavelength calibration inaccuracies associated with KPF during early operations. Long-term SoCal operations will collect upwards of 1000 solar spectra per six-hour day using KPF’s fast readout mode, enabling stellar activity studies at high S/N on our nearest solar-type star.
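
    As a rough illustration of what an operations loop built on state machine logic can look like, the sketch below drives a table of transitions with a sequence of events. The states, events, and transitions are hypothetical and are not taken from the SoCal control software.

```python
from enum import Enum, auto


class State(Enum):
    STOWED = auto()
    OBSERVING = auto()
    WEATHER_HOLD = auto()


class Event(Enum):
    SUNRISE = auto()
    SUNSET = auto()
    BAD_WEATHER = auto()
    WEATHER_CLEARED = auto()


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.STOWED, Event.SUNRISE): State.OBSERVING,
    (State.OBSERVING, Event.BAD_WEATHER): State.WEATHER_HOLD,
    (State.WEATHER_HOLD, Event.WEATHER_CLEARED): State.OBSERVING,
    (State.OBSERVING, Event.SUNSET): State.STOWED,
    (State.WEATHER_HOLD, Event.SUNSET): State.STOWED,
}


def step(state: State, event: Event) -> State:
    """Apply one event; events with no defined transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events: list[Event], start: State = State.STOWED) -> list[State]:
    """Run the loop over a sequence of events and record every state visited."""
    state = start
    history = [state]
    for event in events:
        state = step(state, event)
        history.append(state)
    return history


# One simulated day: open, pause for weather, resume, close.
day = run([Event.SUNRISE, Event.BAD_WEATHER, Event.WEATHER_CLEARED, Event.SUNSET])
```

    Keeping the transitions in a single table makes the allowed behavior easy to audit, and ignoring undefined events means a stray signal (for example, a weather alert while stowed) cannot move the system into an unexpected state.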

     
  3. We have explored the ligand topology of high-valent Fe(IV)–oxo complexes for screening a large molecular database with machine learning.

     
  4. Abstract

    Motivation

    Knowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KGs present a challenge due to the complexity, size and heterogeneity of the underlying information.

    Results

    In this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by Python scripts that download each resource, check it for integrity and completeness, and then create a ‘parent table’ of nodes and edges. Graph queries are translated by a REST API, and users can submit searches directly via an API or a graphical user interface. Conclusions/Significance: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts.
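
    The build step described above (collect each resource, check it, then merge into one table of nodes and edges) might be sketched as follows. This is an assumption-laden illustration rather than the SPOKE build code: the `Node` and `Edge` records, the `build_parent_table` function, the identifiers, and the specific integrity check (every edge must point at known nodes) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    identifier: str
    node_type: str


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str


def build_parent_table(resources):
    """Merge (nodes, edges) pairs from several resources into one table.

    Returns the merged nodes keyed by identifier, the set of valid edges,
    and a list of edges rejected because an endpoint is unknown.
    """
    nodes = {}
    for resource_nodes, _ in resources:
        for node in resource_nodes:
            nodes[node.identifier] = node  # duplicates collapse by identifier

    edges, dangling = set(), []
    for _, resource_edges in resources:
        for edge in resource_edges:
            if edge.source in nodes and edge.target in nodes:
                edges.add(edge)
            else:
                dangling.append(edge)  # integrity check failed
    return nodes, edges, dangling


# Two hypothetical resources with made-up identifiers.
resource_a = (
    [Node("D1", "Disease"), Node("G1", "Gene")],
    [Edge("D1", "G1", "ASSOCIATES")],
)
resource_b = (
    [Node("G1", "Gene"), Node("C1", "Compound")],
    [Edge("C1", "D1", "TREATS"), Edge("C1", "G9", "BINDS")],
)
nodes, edges, dangling = build_parent_table([resource_a, resource_b])
```

    Collecting all nodes before validating any edge lets an edge from one resource refer to a node contributed by another, which is the point of merging heterogeneous sources into a single graph.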

    Availability and implementation

    The SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  5. null (Ed.)
  6. Abstract

    Knowledge representation and reasoning (KR&R) has been successfully implemented in many fields to enable computers to solve complex problems with AI methods. However, its application to biomedicine has been lagging, in part due to the daunting complexity of the molecular and cellular pathways that govern human physiology and pathology. In this article, we describe concrete uses of the Scalable PrecisiOn Medicine Knowledge Engine (SPOKE), an open knowledge network that connects curated information from thirty‐seven specialized and human‐curated databases into a single property graph, with 3 million nodes and 15 million edges to date. Applications discussed in this article include drug discovery, COVID‐19 research, and chronic disease diagnosis and management.

     