Title: Large Language Model–Based Responses to Patients’ In-Basket Messages
Importance: Virtual patient-physician communications have increased since 2020 and negatively impacted primary care physician (PCP) well-being. Generative artificial intelligence (GenAI) drafts of patient messages could potentially reduce health care professional (HCP) workload and improve communication quality, but only if the drafts are considered useful. Objectives: To assess PCPs’ perceptions of GenAI drafts and to examine linguistic characteristics associated with equity and perceived empathy. Design, Setting, and Participants: This cross-sectional quality improvement study tested the hypothesis that PCPs’ ratings of GenAI drafts (created using the electronic health record [EHR] standard prompts) would be equivalent to HCP-generated responses on 3 dimensions. The study was conducted at NYU Langone Health using private patient-HCP communications at 3 internal medicine practices piloting GenAI. Exposures: Randomly assigned patient messages coupled with either an HCP message or the draft GenAI response. Main Outcomes and Measures: PCPs rated each response’s information content quality (eg, relevance) and communication quality (eg, verbosity) on Likert scales and indicated whether they would use the draft or start anew (usable vs unusable). Branching logic further probed for empathy, personalization, and professionalism of responses. Computational linguistics methods assessed content differences in HCP vs GenAI responses, focusing on equity and empathy. Results: A total of 16 PCPs (8 [50.0%] female) reviewed 344 messages (175 GenAI drafted; 169 HCP drafted). Both GenAI and HCP responses were rated favorably. GenAI responses were rated higher for communication style than HCP responses (mean [SD], 3.70 [1.15] vs 3.38 [1.20]; P = .01; U = 12 568.5) but were similar to HCP responses on information content (mean [SD], 3.53 [1.26] vs 3.41 [1.27]; P = .37; U = 13 981.0) and usable draft proportion (mean [SD], 0.69 [0.48] vs 0.65 [0.47]; P = .49; t = −0.6842). Usable GenAI responses were considered more empathetic than usable HCP responses (32 of 86 [37.2%] vs 13 of 79 [16.5%]; difference, 125.5%), possibly attributable to more subjective (mean [SD], 0.54 [0.16] vs 0.31 [0.23]; P < .001; difference, 74.2%) and positive (mean [SD] polarity, 0.21 [0.14] vs 0.13 [0.25]; P = .02; difference, 61.5%) language. They were also more linguistically complex (mean [SD] score, 125.2 [47.8] vs 95.4 [58.8]; P = .002; difference, 31.2%) and numerically longer (mean [SD] word count, 90.5 [32.0] vs 65.4 [62.6]; difference, 38.4%), although the difference in length was not statistically significant (P = .07). Conclusions: In this cross-sectional study of PCP perceptions of an EHR-integrated GenAI chatbot, GenAI was found to communicate information better and with more empathy than HCPs, highlighting its potential to enhance patient-HCP communication. However, GenAI drafts were less readable than HCPs’, a significant concern for patients with low health or English literacy.
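The "difference" percentages reported in the Results appear to be relative differences, that is, the GenAI value minus the HCP value, divided by the HCP value. The abstract does not state the formula, so this is an assumption; the short Python sketch below only checks that the reported figures are consistent with it. The function name `relative_difference` and the dictionary labels are illustrative, not taken from the study.

```python
def relative_difference(genai: float, hcp: float) -> float:
    """Percentage by which the GenAI value exceeds the HCP value,
    expressed relative to the HCP value (assumed formula)."""
    return (genai - hcp) / hcp * 100


# (GenAI value, HCP value, difference stated in the abstract)
reported = {
    "empathetic responses (%)": (37.2, 16.5, 125.5),
    "subjectivity (mean)": (0.54, 0.31, 74.2),
    "polarity (mean)": (0.21, 0.13, 61.5),
    "word count (mean)": (90.5, 65.4, 38.4),
    "linguistic complexity (mean)": (125.2, 95.4, 31.2),
}

for label, (genai, hcp, stated) in reported.items():
    # Round to one decimal place, matching the precision in the abstract.
    computed = round(relative_difference(genai, hcp), 1)
    print(f"{label}: computed {computed}%, stated {stated}%")
```

Under this reading, a difference of 125.5% for empathy means that the share of usable GenAI drafts judged empathetic was a little more than double the share for usable HCP responses, not that it was 125.5 percentage points higher.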
Award ID(s):
2129076 1928614
PAR ID:
10524046
Publisher / Repository:
JAMA
Date Published:
Journal Name:
JAMA Network Open
Volume:
7
Issue:
7
ISSN:
2574-3805
Page Range / eLocation ID:
e2422399
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Importance: Assessing nontechnical skills in operating rooms (ORs) is crucial for enhancing surgical performance and patient safety. However, automated and real-time evaluation of these skills remains challenging. Objective: To explore the feasibility of using motion features extracted from surgical video recordings to automatically assess nontechnical skills during cardiac surgical procedures. Design, Setting, and Participants: This cross-sectional study used video recordings of cardiac surgical procedures at a tertiary academic US hospital collected from January 2021 through May 2022. The OpenPose library was used to analyze videos to extract body pose estimations of team members and compute various team motion features. Three expert raters used the Non-Technical Skills for Surgeons (NOTSS) assessment tool to rate the OR team’s nontechnical skills. Main Outcomes and Measures: NOTSS overall score, with motion features extracted from surgical videos as measures. Results: A total of 30 complete cardiac surgery procedures were included: 26 (86.6%) were on-pump coronary artery bypass graft procedures and 4 (13.4%) were aortic valve replacement or repair procedures. All patients were male, and the mean (SD) age was 72 (6.3) years. All surgical teams were composed of 4 key roles (attending surgeon, attending anesthesiologist, primary perfusionist, and scrub nurse) with additional supporting roles. NOTSS scores correlated significantly with trajectory (r = 0.51; P = .005), acceleration (r = 0.48; P = .008), and entropy (r = −0.52; P = .004) of team displacement. Multiple linear regression, adjusted for patient factors, showed average team trajectory (adjusted R² = 0.335; coefficient, 10.51 [95% CI, 8.81-12.21]; P = .004) and team displacement entropy (adjusted R² = 0.304; coefficient, −12.64 [95% CI, −20.54 to −4.74]; P = .003) were associated with NOTSS scores. Conclusions and Relevance: This study suggests a significant link between OR team movements and nontechnical skills ratings by NOTSS during cardiac surgical procedures, indicating that automated surgical video analysis could enhance nontechnical skills assessment. Further investigation across different hospitals and specialties is necessary to validate these findings.
  2. Importance: Screening with low-dose computed tomography (CT) has been shown to reduce mortality from lung cancer in randomized clinical trials in which the rate of adherence to follow-up recommendations was over 90%; however, adherence to Lung Computed Tomography Screening Reporting & Data System (Lung-RADS) recommendations has been low in practice. Identifying patients who are at risk of being nonadherent to screening recommendations may enable personalized outreach to improve overall screening adherence. Objective: To identify factors associated with patient nonadherence to Lung-RADS recommendations across multiple screening time points. Design, Setting, and Participants: This cohort study was conducted at a single US academic medical center across 10 geographically distributed sites where lung cancer screening is offered. The study enrolled individuals who underwent low-dose CT screening for lung cancer between July 31, 2013, and November 30, 2021. Exposures: Low-dose CT screening for lung cancer. Main Outcomes and Measures: The main outcome was nonadherence to follow-up recommendations for lung cancer screening, defined as failing to complete a recommended or more invasive follow-up examination (ie, diagnostic dose CT, positron emission tomography–CT, or tissue sampling vs low-dose CT) within 15 months (Lung-RADS score, 1 or 2), 9 months (Lung-RADS score, 3), 5 months (Lung-RADS score, 4A), or 3 months (Lung-RADS score, 4B/X). Multivariable logistic regression was used to identify factors associated with patient nonadherence to baseline Lung-RADS recommendations. A generalized estimating equations model was used to assess whether the pattern of longitudinal Lung-RADS scores was associated with patient nonadherence over time. Results: Among 1979 included patients, 1111 (56.1%) were aged 65 years or older at baseline screening (mean [SD] age, 65.3 [6.6] years), and 1176 (59.4%) were male. The odds of being nonadherent were lower among patients with a baseline Lung-RADS score of 1 or 2 vs 3 (adjusted odds ratio [AOR], 0.35; 95% CI, 0.25-0.50), 4A (AOR, 0.21; 95% CI, 0.13-0.33), or 4B/X (AOR, 0.10; 95% CI, 0.05-0.19); with a postgraduate vs college degree (AOR, 0.70; 95% CI, 0.53-0.92); with a family history of lung cancer vs no family history (AOR, 0.74; 95% CI, 0.59-0.93); with a high age-adjusted Charlson Comorbidity Index score (≥4) vs a low score (0 or 1) (AOR, 0.67; 95% CI, 0.46-0.98); in the high vs low income category (AOR, 0.79; 95% CI, 0.65-0.98); and referred by physicians from pulmonary or thoracic-related departments vs another department (AOR, 0.56; 95% CI, 0.44-0.73). Among 830 eligible patients who had completed at least 2 screening examinations, the adjusted odds of being nonadherent to Lung-RADS recommendations at the following screening were increased in patients with consecutive Lung-RADS scores of 1 to 2 (AOR, 1.38; 95% CI, 1.12-1.69). Conclusions and Relevance: In this retrospective cohort study, patients with consecutive negative lung cancer screening results were more likely to be nonadherent with follow-up recommendations. These individuals are potential candidates for tailored outreach to improve adherence to recommended annual lung cancer screening.
  3. Importance: Identifying and tracking new infections during an emerging pandemic is crucial to design and deploy interventions to protect populations and mitigate the pandemic’s effects, yet it remains a challenging task. Objective: To characterize the ability of nonprobability online surveys to longitudinally estimate the number of COVID-19 infections in the population both in the presence and absence of institutionalized testing. Design, Setting, and Participants: Internet-based online nonprobability surveys were conducted among residents aged 18 years or older across 50 US states and the District of Columbia, using the PureSpectrum survey vendor, approximately every 6 weeks between June 1, 2020, and January 31, 2023, for a multiuniversity consortium, the COVID States Project. Surveys collected information on COVID-19 infections with representative state-level quotas applied to balance age, sex, race and ethnicity, and geographic distribution. Main Outcomes and Measures: The main outcomes were (1) survey-weighted estimates of new monthly confirmed COVID-19 cases in the US from January 2020 to January 2023 and (2) estimates of uncounted test-confirmed cases from February 1, 2022, to January 1, 2023. These estimates were compared with institutionally reported COVID-19 infections collected by Johns Hopkins University and wastewater viral concentrations for SARS-CoV-2 from Biobot Analytics. Results: The survey spanned 17 waves deployed from June 1, 2020, to January 31, 2023, with a total of 408 515 responses from 306 799 respondents (mean [SD] age, 42.8 [13.0] years; 202 416 women [66.0%]). Overall, 64 946 respondents (15.9%) self-reported a test-confirmed COVID-19 infection. National survey-weighted test-confirmed COVID-19 estimates were strongly correlated with institutionally reported COVID-19 infections (Pearson correlation, r = 0.96; P < .001) from April 2020 to January 2022 (50-state correlation mean [SD] value, r = 0.88 [0.07]). This was before the government-led mass distribution of at-home rapid tests. After January 2022, correlation was diminished and no longer statistically significant (r = 0.55; P = .08; 50-state correlation mean [SD] value, r = 0.48 [0.23]). In contrast, survey COVID-19 estimates correlated highly with SARS-CoV-2 viral concentrations in wastewater both before (r = 0.92; P < .001) and after (r = 0.89; P < .001) January 2022. Institutionally reported COVID-19 cases correlated (r = 0.79; P < .001) with wastewater viral concentrations before January 2022, but poorly (r = 0.31; P = .35) after, suggesting that both survey and wastewater estimates may have better captured test-confirmed COVID-19 infections after January 2022. Consistent correlation patterns were observed at the state level. Based on national-level survey estimates, approximately 54 million COVID-19 cases were likely unaccounted for in official records between January 2022 and January 2023. Conclusions and Relevance: This study suggests that nonprobability survey data can be used to estimate the temporal evolution of test-confirmed infections during an emerging disease outbreak. Self-reporting tools may enable government and health care officials to implement accessible and affordable at-home testing for efficient infection monitoring in the future.
  4. Importance: Marked elevations in levels of depressive symptoms compared with historical norms have been described during the COVID-19 pandemic, and understanding the extent to which these are associated with diminished in-person social interaction could inform public health planning for future pandemics or other disasters. Objective: To describe the association between living in a US county with diminished mobility during the COVID-19 pandemic and self-reported depressive symptoms, while accounting for potential local and state-level confounding factors. Design, Setting, and Participants: This survey study used 18 waves of a nonprobability internet survey conducted in the United States between May 2020 and April 2022. Participants included respondents who were 18 years and older and lived in 1 of the 50 US states or Washington DC. Main Outcome and Measure: Depressive symptoms measured by the Patient Health Questionnaire-9 (PHQ-9); county-level community mobility estimates from mobile apps; COVID-19 policies at the US state level from the Oxford stringency index. Results: The 192 271 survey respondents had a mean (SD) age of 43.1 (16.5) years, and 768 (0.4%) were American Indian or Alaska Native individuals, 11 448 (6.0%) were Asian individuals, 20 277 (10.5%) were Black individuals, 15 036 (7.8%) were Hispanic individuals, 1975 (1.0%) were Pacific Islander individuals, 138 702 (72.1%) were White individuals, and 4065 (2.1%) were individuals of another race. Additionally, 126 381 respondents (65.7%) identified as female and 65 890 (34.3%) as male. Mean (SD) depression severity by PHQ-9 was 7.2 (6.8). In a mixed-effects linear regression model, the mean county-level proportion of individuals not leaving home was associated with a greater level of depression symptoms (β, 2.58; 95% CI, 1.57-3.58) after adjustment for individual sociodemographic features. Results were similar after the inclusion in regression models of local COVID-19 activity, weather, and county-level economic features, and persisted after widespread availability of COVID-19 vaccination. They were attenuated by the inclusion of state-level pandemic restrictions. Two restrictions, mandatory mask-wearing in public (β, 0.23; 95% CI, 0.15-0.30) and policies cancelling public events (β, 0.37; 95% CI, 0.22-0.51), demonstrated modest independent associations with depressive symptom severity. Conclusions and Relevance: In this study, depressive symptoms were greater in locales and times with diminished community mobility. Strategies to understand the potential public health consequences of pandemic responses are needed.
  5. Objectives: Electronic health record systems are increasingly used to send messages to physicians, but research on physicians’ inbox use patterns is limited. This study’s aims were to (1) quantify the time primary care physicians (PCPs) spend managing inboxes; (2) describe daily patterns of inbox use; (3) investigate which types of messages consume the most time; and (4) identify factors associated with inbox work duration. Materials and Methods: We analyzed 1 month of electronic inbox data for 1275 PCPs in a large medical group and linked these data with physicians’ demographic data. Results: PCPs spent an average of 52 minutes on inbox management on workdays, including 19 minutes (37%) outside work hours. Temporal patterns of electronic inbox use differed from other EHR functions such as charting. Patient-initiated messages (28%) and results (29%) accounted for the most inbox work time. PCPs with higher inbox work duration were more likely to be female (P < .001), have more patient encounters (P < .001), have older patients (P < .001), spend proportionally more time on patient messages (P < .001), and spend more time per message (P < .001). Compared with PCPs with the lowest duration of time on inbox work, PCPs with the highest duration had more message views per workday (200 vs 109; P < .001) and spent more time on the inbox outside work hours (30 minutes vs 9.7 minutes; P < .001). Conclusions: Electronic inbox work by PCPs requires roughly an hour per workday, much of which occurs outside scheduled work hours. Interventions to assist PCPs in handling patient-initiated messages and results may help alleviate inbox workload.