Abstract Overly restrictive eligibility criteria for clinical trials may limit the generalizability of the trial results to their target real-world patient populations. We developed a novel machine learning approach using large collections of real-world data (RWD) to better inform clinical trial eligibility criteria design. We extracted patients’ clinical events from electronic health records (EHRs), which include demographics, diagnoses, and drugs, and assumed certain compositions of these clinical events within an individual’s EHRs can determine the subphenotypes—homogeneous clusters of patients, where patients within each subgroup share similar clinical characteristics. We introduced an outcome-guided probabilistic model to identify those subphenotypes, such that the patients within the same subgroup not only share similar clinical characteristics but are also at similar risk levels of encountering severe adverse events (SAEs). We evaluated our algorithm on two previously conducted clinical trials with EHRs from the OneFlorida+ Clinical Research Consortium. Our model can clearly identify the patient subgroups who are more likely to suffer or not suffer from SAEs as subphenotypes in a transparent and interpretable way. Our approach identified a set of clinical topics and derived novel patient representations based on them. Each clinical topic represents a certain clinical event composition pattern learned from the patient EHRs. Tested on both trials, patient subgroup (#SAE=0) and patient subgroup (#SAE>0) can be well-separated by k-means clustering using the inferred topics. The inferred topics characterized as likely to align with the patient subgroup (#SAE>0) revealed meaningful combinations of clinical features and can provide data-driven recommendations for refining the exclusion criteria of clinical trials. The proposed supervised topic modeling approach can infer the clinical topics from the subphenotypes with or without SAEs. 
The potential rules for describing the patient subgroups with SAEs can be further derived to inform the design of clinical trial eligibility criteria.
Automatable Distributed Regression Analysis of Vertically Partitioned Data Facilitated by PopMedNet: Feasibility and Enhancement Study
Background In clinical research, important variables may be collected from multiple data sources. Physical pooling of patient-level data from multiple sources often raises several challenges, including proper protection of patient privacy and proprietary interests. We previously developed an SAS-based package to perform distributed regression—a suite of privacy-protecting methods that perform multivariable-adjusted regression analysis using only summary-level information—with horizontally partitioned data, a setting where distinct cohorts of patients are available from different data sources. We integrated the package with PopMedNet, an open-source file transfer software, to facilitate secure file transfer between the analysis center and the data-contributing sites. The feasibility of using PopMedNet to facilitate distributed regression analysis (DRA) with vertically partitioned data, a setting where the data attributes from a cohort of patients are available from different data sources, was unknown. Objective The objective of the study was to describe the feasibility of using PopMedNet and enhancements to PopMedNet to facilitate automatable vertical DRA (vDRA) in real-world settings. Methods We gathered the statistical and informatic requirements of using PopMedNet to facilitate automatable vDRA. We enhanced PopMedNet based on these requirements to improve its technical capability to support vDRA. Results PopMedNet can enable automatable vDRA. We identified and implemented two enhancements to PopMedNet that improved its technical capability to perform automatable vDRA in real-world settings. The first was the ability to simultaneously upload and download multiple files, and the second was the ability to directly transfer summary-level information between the data-contributing sites without a third-party analysis center. Conclusions PopMedNet can be used to facilitate automatable vDRA to protect patient privacy and support clinical research in real-world settings.
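To make "regression analysis using only summary-level information" concrete, the following is a minimal sketch for the simpler horizontally partitioned setting with ordinary linear regression. It is not the SAS-based package or the vDRA procedure described above: the function names (`site_summary`, `pooled_fit`, `solve`) and the toy data are assumptions chosen for illustration, and no file transfer through PopMedNet is modeled. Each site shares only the aggregates X^T X and X^T y, and the analysis center adds them and solves the normal equations.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]


def site_summary(X: Matrix, y: Vector) -> Tuple[Matrix, Vector]:
    """Run at a data-contributing site: return only X^T X and X^T y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return xtx, xty


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def pooled_fit(summaries: Sequence[Tuple[Matrix, Vector]]) -> Vector:
    """Run at the analysis center: add site summaries, solve the normal equations."""
    p = len(summaries[0][1])
    xtx = [[sum(s[0][i][j] for s in summaries) for j in range(p)] for i in range(p)]
    xty = [sum(s[1][i] for s in summaries) for i in range(p)]
    return solve(xtx, xty)


# Made-up toy data: columns are (intercept, covariate); y = 1 + 2x exactly.
site_a = ([[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0])
site_b = ([[1.0, 2.0], [1.0, 3.0]], [5.0, 7.0])
beta = pooled_fit([site_summary(*site_a), site_summary(*site_b)])
```

Because X^T X and X^T y are sums over patients, adding the per-site aggregates gives the same coefficients as fitting the pooled patient-level data. In the vertically partitioned setting the cross-site blocks of X^T X involve covariates held by different sites for the same patients, so this simple addition no longer applies.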
- Award ID(s): 1853209
- PAR ID: 10291646
- Date Published:
- Journal Name: JMIR Medical Informatics
- Volume: 9
- Issue: 4
- ISSN: 2291-9694
- Page Range / eLocation ID: e21459
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
The COVID-19 pandemic has infected millions of people around the world, spreading rapidly and causing a flood of patients that risks overwhelming clinical facilities. Whether in urban or rural areas, hospitals have limited resources and personnel to treat critical infections in intensive care units, which must be allocated effectively. To assist clinical staff in deciding which patients are in the greatest need of critical care, we develop a predictive model based on a publicly available data set that is rich in clinical markers. We perform statistical analysis to determine which clinical markers strongly correlate with hospital admission, semi-intensive care, and intensive care for COVID-19 patients. We create a predictive model that will assist clinical personnel in determining COVID-19 patient prognosis. Additionally, we take a step towards a global framework for COVID-19 prognosis prediction by incorporating statistical data for geographically and ethnically diverse COVID-19 patient sets into our own model. To the best of our knowledge, this is the first model that does not exclusively utilize local data.
-
Abstract Objective Emerging technologies (eg, wearable devices) have made it possible to collect data directly from individuals (eg, time-series), providing new insights on the health and well-being of individual patients. Broadening the access to these data would facilitate the integration with existing data sources (eg, clinical and genomic data) and advance medical research. Compared to traditional health data, these data are collected directly from individuals, are highly unique and provide fine-grained information, posing new privacy challenges. In this work, we study the applicability of a novel privacy model to enable individual-level time-series data sharing while maintaining the usability for data analytics. Methods and materials We propose a privacy-protecting method for sharing individual-level electrocardiography (ECG) time-series data, which leverages a dimensionality reduction technique and random sampling to achieve provable privacy protection. We show that our solution provides strong privacy protection against an informed adversarial model while enabling useful aggregate-level analysis. Results We conduct our evaluations on 2 real-world ECG datasets. Our empirical results show that the privacy risk is significantly reduced after sanitization while the data usability is retained for a variety of clinical tasks (eg, predictive modeling and clustering). Discussion Our study investigates the privacy risk in sharing individual-level ECG time-series data. We demonstrate that individual-level data can be highly unique, requiring new privacy solutions to protect data contributors. Conclusion The results suggest our proposed privacy-protection method provides strong privacy protections while preserving the usefulness of the data.
-
Abstract Background Logistic regression (LR) is a widely used classification method for modeling binary outcomes in many medical data classification tasks. Researchers who collect and combine datasets from various data custodians and jurisdictions can greatly benefit from the increased statistical power to support their analysis goals. However, combining data from different sources creates serious privacy concerns that need to be addressed. Methods In this paper, we propose two privacy-preserving protocols for performing logistic regression with the Newton–Raphson method in the estimation of parameters. Our proposals are based on secure Multi-Party Computation (MPC) and tailored to the honest majority and dishonest majority security settings. Results The proposed protocols are evaluated against both synthetic and real-world datasets in terms of efficiency and accuracy, and a comparison is made with the ordinary logistic regression. The experimental results demonstrate that the proposed protocols are highly efficient and accurate. Conclusions Our work introduces two iterative algorithms to enable the distributed training of a logistic regression model in a privacy-preserving manner. The implementation results show that our algorithms can handle large datasets from multiple sources.
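For orientation, the sketch below shows the ordinary (plaintext, single-site) Newton–Raphson iteration for logistic regression, i.e. the kind of baseline the abstract says the protocols are compared against. It is not either of the proposed MPC protocols, and it provides no privacy protection. The function names (`fit_logistic_newton`, `score`, `solve`), the stopping rule, and the toy data are all assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def predict(X: Matrix, beta: Vector) -> Vector:
    """Fitted probabilities under the logistic model."""
    return [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row)))) for row in X]


def score(X: Matrix, y: Vector, beta: Vector) -> Vector:
    """Gradient of the log-likelihood: X^T (y - p)."""
    probs = predict(X, beta)
    p = len(beta)
    return [sum(row[i] * (yi - pi) for row, yi, pi in zip(X, y, probs)) for i in range(p)]


def fit_logistic_newton(X: Matrix, y: Vector, max_iter: int = 25, tol: float = 1e-10) -> Vector:
    """Newton-Raphson updates: beta <- beta + (X^T W X)^{-1} X^T (y - p)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        probs = predict(X, beta)
        grad = score(X, y, beta)
        # Observed information X^T W X with W = diag(p_i * (1 - p_i)).
        info = [[sum(row[i] * row[j] * pi * (1.0 - pi) for row, pi in zip(X, probs))
                 for j in range(p)] for i in range(p)]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Made-up, non-separable toy data: columns are (intercept, covariate).
X = [[1.0, float(v)] for v in range(6)]
y = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
beta = fit_logistic_newton(X, y)
```

Each iteration needs only the gradient and the information matrix, both of which are sums over records. That additive structure is what makes the update a natural target for secure aggregation across data custodians, although the sketch itself computes everything in the clear.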
-
Patient-generated health data (PGHD), created and captured from patients via wearable devices and mobile apps, are proliferating outside of clinical settings. Examples include sleep tracking, fitness trackers, continuous glucose monitors, and RFID-enabled implants, with many additional biometric or health surveillance applications in development or envisioned. These data are included in growing stockpiles of personal health data being mined for insight via big data analytics and artificial intelligence/deep learning technologies. Governing these data resources to facilitate patient care and health research while preserving individual privacy and autonomy will be challenging, as PGHD are the least regulated domains of digitalized personal health data (U.S. Department of Health and Human Services, 2018). When patients themselves collect digitalized PGHD using “apps” provided by technology firms, these data fall outside of conventional health data regulation, such as HIPAA. Instead, PGHD are maintained primarily on the information technology infrastructure of vendors, and data are governed under the IT firm’s own privacy policies and within the firm’s intellectual property rights. Dominant narratives position these highly personal data as valuable resources to transform healthcare, stimulate innovation in medical research, and engage individuals in their health and healthcare. However, ensuring privacy, security, and equity of benefits from PGHD will be challenging. PGHD can be aggregated and, despite putative “deidentification,” be linked with other health, economic, and social data for predictive analytics. As large tech companies enter the healthcare sector (e.g., Google Health is partnering with Ascension Health to analyze the PHI of millions of people across 21 U.S. states), the lack of harmonization between regulatory regimes may render existing safeguards to preserve patient privacy and control over their PHI ineffective. 
While healthcare providers are bound to adhere to health privacy laws, Big Tech comes under more relaxed regulatory regimes that will facilitate monetizing PGHD. We explore three existing data protection regimes relevant to PGHD in the United States that are currently in tension with one another: federal and state health-sector laws, data use and reuse for research and innovation, and industry self-regulation by large tech companies. We then identify three types of structures (organizational, regulatory, technological/algorithmic), which synergistically could help enact needed regulatory oversight while limiting the friction and economic costs of regulation. This analysis provides a starting point for further discussions and negotiations among stakeholders and regulators to do so.

