Title: Predictive modeling of clinical trial terminations using feature engineering and embedding learning
Abstract In this study, we propose to use machine learning to understand terminated clinical trials. Our goal is to answer two fundamental questions: (1) what common factors/markers are associated with terminated clinical trials? and (2) how can we accurately predict whether a clinical trial will be terminated? The answer to the first question helps stakeholders understand the characteristics of terminated trials and better plan their own; the answer to the second question can directly estimate a clinical trial's chance of success in order to minimize costs. Using 311,260 trials to build a testbed with 68,999 samples, we use feature engineering to create 640 features reflecting clinical trial administration, eligibility, study information, criteria, etc. Using feature ranking, a handful of features, such as trial eligibility, trial inclusion/exclusion criteria, and sponsor types, are found to be related to clinical trial termination. By using sampling and ensemble learning, we achieve over 67% Balanced Accuracy and over 0.73 AUC (Area Under the Curve) in predicting clinical trial termination, indicating that machine learning can achieve satisfactory prediction results for clinical trial studies.
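The two evaluation metrics named in the abstract, Balanced Accuracy and AUC, can be illustrated with a minimal Python sketch. The labels, scores, 0.5 threshold, and function names below are made-up toy values and hypothetical helpers chosen for illustration; they are not the study's data, models, or code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (pairwise formulation of ROC AUC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, imbalanced example: 1 = terminated (minority), 0 = not terminated.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.35, 0.05]  # hypothetical model scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]       # assumed 0.5 threshold

print(balanced_accuracy(y_true, y_pred))
print(auc(y_true, scores))
```

Because Balanced Accuracy averages the per-class recall, it is not inflated by simply predicting the majority class, which is one reason it is commonly reported alongside AUC when terminated trials are a minority of the samples.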
Award ID(s):
2027339 1763452 1828181
NSF-PAR ID:
10230296
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Scientific Reports
Volume:
11
Issue:
1
ISSN:
2045-2322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Gadekallu, Thippa Reddy (Ed.)
    As of March 30, 2021, over 5,193 COVID-19 clinical trials have been registered through ClinicalTrials.gov. Among them, 191 trials were terminated, suspended, or withdrawn (indicating the cessation of the study). On the other hand, 909 trials have been completed (indicating the completion of the study). In this study, we propose to study the underlying factors of COVID-19 trial completion vs. cessation, and to design predictive models that accurately predict whether a COVID-19 trial may complete or cease in the future. We collect 4,441 COVID-19 trials from ClinicalTrials.gov to build a testbed, and design four types of features to characterize clinical trial administration, eligibility, study information, criteria, drug types, and study keywords, as well as embedding features commonly used in state-of-the-art machine learning. Our study shows that drug features and study keywords are the most informative features, but all four types of features are essential for accurate trial prediction. Using predictive models, our approach achieves more than 0.87 AUC (Area Under the Curve) and 0.81 balanced accuracy in predicting COVID-19 clinical trial completion vs. cessation. Our research shows that computational methods can deliver effective features for understanding the differences between completed and ceased COVID-19 trials. In addition, such models can predict COVID-19 trial status with satisfactory accuracy, and help stakeholders better plan trials and minimize costs.
  2. Abstract Overly restrictive eligibility criteria for clinical trials may limit the generalizability of the trial results to their target real-world patient populations. We developed a novel machine learning approach using large collections of real-world data (RWD) to better inform clinical trial eligibility criteria design. We extracted patients’ clinical events from electronic health records (EHRs), which include demographics, diagnoses, and drugs, and assumed that certain compositions of these clinical events within an individual’s EHRs can determine the subphenotypes: homogeneous clusters of patients, where patients within each subgroup share similar clinical characteristics. We introduced an outcome-guided probabilistic model to identify those subphenotypes, such that the patients within the same subgroup not only share similar clinical characteristics but are also at similar risk levels of encountering severe adverse events (SAEs). We evaluated our algorithm on two previously conducted clinical trials with EHRs from the OneFlorida+ Clinical Research Consortium. Our model can clearly identify the patient subgroups who are more likely to suffer or not suffer from SAEs as subphenotypes in a transparent and interpretable way. Our approach identified a set of clinical topics and derived novel patient representations based on them. Each clinical topic represents a certain clinical event composition pattern learned from the patient EHRs. Tested on both trials, patient subgroup (#SAE=0) and patient subgroup (#SAE>0) can be well separated by k-means clustering using the inferred topics. The inferred topics characterized as likely to align with the patient subgroup (#SAE>0) revealed meaningful combinations of clinical features and can provide data-driven recommendations for refining the exclusion criteria of clinical trials. The proposed supervised topic modeling approach can infer the clinical topics from the subphenotypes with or without SAEs. The potential rules for describing the patient subgroups with SAEs can be further derived to inform the design of clinical trial eligibility criteria.
  3. Clinical trials are crucial for the advancement of treatment and knowledge within the medical community. Since 2007, the US federal government has required organizations sponsoring clinical trials with at least one site in the United States to submit information on these trials to the ClinicalTrials.gov database, resulting in a rich source of information for clinical trial research. Nevertheless, only a handful of analytic studies have been carried out to understand this valuable data source. In this study, we propose to use network analysis to understand infectious disease clinical trial research. Our goal is to answer two important questions: (1) what are the concentrations and characteristics of infectious disease clinical trial research? and (2) how can we accurately predict what type of clinical trials a sponsor (or an investigator) is interested in? The answers to the first question provide effective ways to summarize clinical trial research related to particular disease(s), and the answers to the second question help match clinical trial sponsors and investigators for information recommendation. Using 4,228 clinical trials as the test bed, our study involves 4,864 sponsors and 1,879 research areas characterized by Medical Subject Heading (MeSH) keywords. We extract a set of network measures to show patterns of infectious disease clinical trials, and design a new community-based link prediction approach to predict sponsors' interests, with significant improvement compared to baselines. This transformative study concludes that network analysis can substantially aid the understanding of clinical trial research for effective summarization, characterization, and prediction.
  4. Attention allows us to select relevant and ignore irrelevant information from our complex environments. What happens when attention shifts from one item to another? To answer this question, it is critical to have tools that accurately recover neural representations of both feature and location information with high temporal resolution. In the present study, we used human electroencephalography (EEG) and machine learning to explore how neural representations of object features and locations update across dynamic shifts of attention. We demonstrate that EEG can be used to create simultaneous time courses of neural representations of attended features (time point-by-time point inverted encoding model reconstructions) and attended location (time point-by-time point decoding) during both stable periods and across dynamic shifts of attention. Each trial presented two oriented gratings that flickered at the same frequency but had different orientations; participants were cued to attend one of them and on half of trials received a shift cue midtrial. We trained models on a stable period from Hold attention trials and then reconstructed/decoded the attended orientation/location at each time point on Shift attention trials. Our results showed that both feature reconstruction and location decoding dynamically track the shift of attention and that there may be time points during the shifting of attention when 1) feature and location representations become uncoupled and 2) both the previously attended and currently attended orientations are represented with roughly equal strength. The results offer insight into our understanding of attentional shifts, and the noninvasive techniques developed in the present study lend themselves well to a wide variety of future applications. NEW & NOTEWORTHY We used human EEG and machine learning to reconstruct neural response profiles during dynamic shifts of attention. 
    Specifically, we demonstrated that we could simultaneously read out both location and feature information from an attended item in a multistimulus display. Moreover, we examined how that readout evolves over time during the dynamic process of attentional shifts. These results provide insight into our understanding of attention, and this technique carries substantial potential for versatile extensions and applications.
  5. Abstract Background:

    Randomized controlled trials (RCTs) play a central role in evidence-based healthcare. However, the clinical and policy implications of implementing RCT results in clinical practice are difficult to predict, as the studied population often differs from the target population to which the results are applied. This study illustrates the concepts of generalizability and transportability, demonstrating their utility in interpreting results from the National Lung Screening Trial (NLST).

    Methods:

    Using inverse-odds weighting, we demonstrate how generalizability and transportability techniques can be used to extrapolate treatment effect from (i) a subset of NLST to the entire NLST population and from (ii) the entire NLST to different target populations.

    Results:

    Our generalizability analysis revealed that lung cancer mortality reduction by LDCT screening across the entire NLST [16% (95% confidence interval [CI]: 4–24)] could have been estimated using a smaller subset of NLST participants. Using transportability analysis, we showed that populations with a higher prevalence of females and current smokers had a greater reduction in lung cancer mortality with LDCT screening [e.g., 27% (95% CI, 11–37) for the population with 80% females and 80% current smokers] than those with lower prevalence of females and current smokers.

    Conclusions:

    This article illustrates how generalizability and transportability methods extend estimation of RCTs' utility beyond trial participants, to external populations of interest, including those that more closely mirror real-world populations.

    Impact:

    Generalizability and transportability approaches can be used to quantify treatment effects for populations of interest, which may be used to design future trials or adjust lung cancer screening eligibility criteria.
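The inverse-odds weighting idea mentioned in the Methods can be sketched in a few lines of Python. This is a minimal illustration under strong simplifying assumptions (a single discrete covariate, a nonparametric estimate of trial participation, a binary outcome); the records, strata, and function names are hypothetical toy values, not NLST data or the authors' implementation.

```python
from collections import Counter


def inverse_odds_weights(trial_strata, target_strata):
    """Weight per stratum: estimated P(S=0 | x) / P(S=1 | x).

    S=1 marks trial participants and S=0 marks the target population.
    With P(S=1 | x) estimated as n_trial / (n_trial + n_target), the
    inverse odds reduce to n_target / n_trial within each stratum.
    """
    n_trial = Counter(trial_strata)
    n_target = Counter(target_strata)
    return {x: n_target[x] / n_trial[x] for x in n_trial}


def transported_risk_difference(trial, target_strata):
    """Weighted risk difference (treated minus control) in the target population."""
    weights = inverse_odds_weights([x for x, _, _ in trial], target_strata)
    risk = {}
    for arm in (0, 1):
        rows = [(weights[x], y) for x, a, y in trial if a == arm]
        risk[arm] = sum(w * y for w, y in rows) / sum(w for w, _ in rows)
    return risk[1] - risk[0]


# Toy trial records: (stratum, treated, outcome). Values are invented.
trial = (
    [("F", 1, 0), ("F", 1, 0), ("F", 0, 1), ("F", 0, 0)]
    + [("M", 1, 0), ("M", 1, 0), ("M", 1, 0), ("M", 1, 1)]
    + [("M", 0, 1), ("M", 0, 1), ("M", 0, 0), ("M", 0, 0)]
)
# Toy target population with a higher share of stratum "F" than the trial.
target_strata = ["F"] * 8 + ["M"] * 2

print(transported_risk_difference(trial, target_strata))
```

In this toy setup the weighted estimate matches what one would get by averaging the stratum-specific effects using the target population's stratum proportions, which is the intuition behind reweighting trial participants to resemble a different population.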

     