skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, June 13 until 2:00 AM ET on Friday, June 14 due to maintenance. We apologize for the inconvenience.

Title: DRG-LLaMA : tuning LLaMA model to predict diagnosis-related group for hospitalized patients

In the U.S. inpatient payment system, the Diagnosis-Related Group (DRG) is pivotal, but its assignment process is inefficient. The study introduces , an advanced large language model (LLM) fine-tuned on clinical notes to enhance DRGs assignment. Utilizing LLaMA as the foundational model and optimizing it through Low-Rank Adaptation (LoRA) on 236,192 MIMIC-IV discharge summaries, our -7B model exhibited a noteworthy macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged Area Under the Curve (AUC) of 0.986, with a maximum input token length of 512. This model surpassed the performance of prior leading models in DRG prediction, showing a relative improvement of 40.3% and 35.7% in macro-averaged F1 score compared to ClinicalBERT and CAML, respectively. Applied to base DRG and complication or comorbidity (CC)/major complication or comorbidity (MCC) prediction, achieved a top-1 prediction accuracy of 67.8% and 67.5%, respectively. Additionally, our findings indicate that ’s performance correlates with increased model parameters and input context lengths.

more » « less
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
npj Digital Medicine
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Aims. We present a variability-, color-, and morphology-based classifier designed to identify multiple classes of transients and persistently variable and non-variable sources from the Zwicky Transient Facility (ZTF) Data Release 11 (DR11) light curves of extended and point sources. The main motivation to develop this model was to identify active galactic nuclei (AGN) at different redshift ranges to be observed by the 4MOST Chilean AGN/Galaxy Evolution Survey (ChANGES). That being said, it also serves as a more general time-domain astronomy study. Methods. The model uses nine colors computed from CatWISE and Pan-STARRS1 (PS1), a morphology score from PS1, and 61 single-band variability features computed from the ZTF DR11 g and r light curves. We trained two versions of the model, one for each ZTF band, since ZTF DR11 treats the light curves observed in a particular combination of field, filter, and charge-coupled device (CCD) quadrant independently. We used a hierarchical local classifier per parent node approach-where each node is composed of a balanced random forest model. We adopted a taxonomy with 17 classes: non-variable stars, non-variable galaxies, three transients (SNIa, SN-other, and CV/Nova), five classes of stochastic variables (lowz-AGN, midz-AGN, highz-AGN, Blazar, and YSO), and seven classes of periodic variables (LPV, EA, EB/EW, DSCT, RRL, CEP, and Periodic-other). Results. The macro-averaged precision, recall, and F1-score are 0.61, 0.75, and 0.62 for the g -band model, and 0.60, 0.74, and 0.61, for the r -band model. When grouping the four AGN classes (lowz-AGN, midz-AGN, highz-AGN, and Blazar) into one single class, its precision-recall, and F1-score are 1.00, 0.95, and 0.97, respectively, for both the g and r bands. This demonstrates the good performance of the model in classifying AGN candidates. We applied the model to all the sources in the ZTF/4MOST overlapping sky (−28 ≤ Dec ≤ 8.5), avoiding ZTF fields that cover the Galactic bulge (| gal_b | ≤ 9 and gal_l ≤ 50). This area includes 86 576 577 light curves in the g band and 140 409 824 in the r band with 20 or more observations and with an average magnitude in the corresponding band lower than 20.5. Only 0.73% of the g -band light curves and 2.62% of the r -band light curves were classified as stochastic, periodic, or transient with high probability ( P init ≥ 0.9). Even though the metrics obtained for the two models are similar, we find that, in general, more reliable results are obtained when using the g -band model. With it, we identified 384 242 AGN candidates (including low-, mid-, and high-redshift AGN and Blazars), 287 156 of which have P init ≥ 0.9. 
    more » « less
  2. null (Ed.)
    Abstract The inter-residue contact prediction and deep learning showed the promise to improve the estimation of protein model accuracy (EMA) in the 13th Critical Assessment of Protein Structure Prediction (CASP13). To further leverage the improved inter-residue distance predictions to enhance EMA, during the 2020 CASP14 experiment, we integrated several new inter-residue distance features with the existing model quality assessment features in several deep learning methods to predict the quality of protein structural models. According to the evaluation of performance in selecting the best model from the models of CASP14 targets, our three multi-model predictors of estimating model accuracy (MULTICOM-CONSTRUCT, MULTICOM-AI, and MULTICOM-CLUSTER) achieve the averaged loss of 0.073, 0.079, and 0.081, respectively, in terms of the global distance test score (GDT-TS). The three methods are ranked first, second, and third out of all 68 CASP14 predictors. MULTICOM-DEEP, the single-model predictor of estimating model accuracy (EMA), is ranked within top 10 among all the single-model EMA methods according to GDT-TS score loss. The results demonstrate that inter-residue distance features are valuable inputs for deep learning to predict the quality of protein structural models. However, larger training datasets and better ways of leveraging inter-residue distance information are needed to fully explore its potentials. 
    more » « less
  3. Abstract Objective We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer to cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations which give both location-based information and document level labels for each pathology report. Materials and Methods Our data consists of 250 pathology reports each for kidney, colon, and lung cancer from 2002 to 2019 from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare HCTC and ZSS methods to the state-of-the-art including conventional machine learning methods as well as deep learning methods. Results For our HCTC method, we see an improvement of up to 0.1 micro-F1 score and 0.04 macro-F1 averaged across cancer and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancer and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations. Conclusions Methods based on transfer learning across cancers and augmenting information methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports. 
    more » « less
  4. Abstract

    Water monitoring in households provides occupants and utilities with key information to support water conservation and efficiency in the residential sector. High costs, intrusiveness, and practical complexity limit appliance-level monitoring via sub-meters on every water-consuming end use in households. Non-intrusive machine learning methods have emerged as promising techniques to analyze observed data collected by a single meter at the inlet of the house and estimate the disaggregated contribution of each water end use. While fine temporal resolution data allow for more accurate end-use disaggregation, there is an inevitable increase in the amount of data that needs to be stored and analyzed. To explore this tradeoff and advance previous studies based on synthetic data, we first collected 1 s resolution indoor water use data from a residential single-point smart water metering system installed at a four-person household, as well as ground-truth end-use labels based on a water diary recorded over a 4-week study period. Second, we trained a supervised machine learning model (random forest classifier) to classify six water end-use categories across different temporal resolutions and two different model calibration scenarios. Finally, we evaluated the results based on three different performance metrics (micro, weighted, and macro F1 scores). Our findings show that data collected at 1- to 5-s intervals allow for better end-use classification (weighted F-score higher than 0.85), particularly for toilet events; however, certain water end uses (e.g., shower and washing machine events) can still be predicted with acceptable accuracy even at coarser resolutions, up to 1 min, provided that these end-use categories are well represented in the training dataset. Overall, our study provides insights for further water sustainability research and widespread deployment of smart water meters.

    more » « less
  5. Abstract

    The timeliness of monitoring is essential to algal bloom management. However, acquiring algal bio-indicators can be time-consuming and laborious, and bloom biomass data often contain a large proportion of extreme values limiting the predictive models. Therefore, to predict algal blooms from readily water quality parameters (i.e. dissolved oxygen, pH, etc), and to provide a novel solution to the modeling challenges raised by the extremely distributed biomass data, a Bayesian scale-mixture of skew-normal (SMSN) model was proposed. In this study, our SMSN model accurately predicted over-dispersed biomass variations with skewed distributions in both rivers and lakes (in-sample and out-of-sample predictionR2ranged from 0.533 to 0.706 and 0.412 to 0.742, respectively). Moreover, we successfully achieve a probabilistic assessment of algal blooms with the Bayesian framework (accuracy >0.77 and macro-F1score >0.72), which robustly decreased the classic point-prediction-based inaccuracy by up to 34%. This work presented a promising Bayesian SMSN modeling technique, allowing for real-time prediction of algal biomass variations andin-situprobabilistic assessment of algal bloom.

    more » « less