Title: Synergies between centralized and federated approaches to data quality: a report from the National COVID Cohort Collaborative
Abstract

Objective: In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the largest publicly available HIPAA limited dataset in US history with over 6.4 million patients and is a testament to a partnership of over 100 organizations.

Materials and Methods: We developed a pipeline for ingesting, harmonizing, and centralizing data from 56 contributing data partners using 4 federated Common Data Models. N3C data quality (DQ) review involves both automated and manual procedures. In the process, several DQ heuristics were discovered in our centralized context, both within the pipeline and during downstream project-based analysis. Feedback to the sites led to many local and centralized DQ improvements.

Results: Beyond well-recognized DQ findings, we discovered 15 heuristics relating to source Common Data Model conformance, demographics, COVID tests, conditions, encounters, measurements, observations, coding completeness, and fitness for use. Of 56 sites, 37 sites (66%) demonstrated issues through these heuristics. These 37 sites demonstrated improvement after receiving feedback.

Discussion: We encountered site-to-site differences in DQ which would have been challenging to discover using federated checks alone. We have demonstrated that centralized DQ benchmarking reveals unique opportunities for DQ improvement that will support improved research analytics locally and in aggregate.

Conclusion: By combining rapid, continual assessment of DQ with a large volume of multisite data, it is possible to support more nuanced scientific questions with the scale and rigor that they require.
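As an illustration only, a centralized DQ heuristic of the kind described in this abstract might flag contributing sites whose measurement records disproportionately lack units. The data layout, function name, and threshold below are hypothetical sketches, not N3C's actual pipeline checks.

```python
from collections import defaultdict

def flag_sites_missing_units(rows, threshold=0.5):
    """rows: iterable of (site_id, unit) pairs, where unit is None when absent.
    Return the sites whose share of unit-less measurements exceeds threshold."""
    totals, missing = defaultdict(int), defaultdict(int)
    for site_id, unit in rows:
        totals[site_id] += 1
        if unit is None:
            missing[site_id] += 1
    return sorted(site for site, n in totals.items()
                  if missing[site] / n > threshold)

# Toy centralized view of two sites' lab measurements:
rows = [("site_a", "mg/dL"), ("site_a", None), ("site_a", "mg/dL"),
        ("site_b", None), ("site_b", None), ("site_b", "mmol/L")]
print(flag_sites_missing_units(rows))  # ['site_b']
```

A check like this is only possible once data from many sites sit side by side: a single site cannot know that its missing-unit rate is an outlier relative to its peers.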
Korobeinikov, Dmitrii; Chuprov, Sergei; Zatsarenko, Raman; Reznik, Leon
(19th Annual Symposium on Information Assurance (ASIA '24), June 4-5, 2024, Albany, NY)
Goal, S (Ed.)
Machine Learning (ML) models are widely utilized in a variety of applications, including Intelligent Transportation Systems (ITS). Because these systems operate in highly dynamic environments, they are exposed to numerous security threats that cause Data Quality (DQ) variations, among them network attacks that may cause data losses. We evaluate the influence of these factors on image DQ and, consequently, on image ML model performance. We propose and investigate Federated Learning (FL) as a way to enhance the overall level of privacy and security in ITS, as well as to improve ML model robustness to possible DQ variations in real-world applications. Our empirical study, conducted with traffic sign images and YOLO, VGG16, and ResNet models, demonstrated the greater robustness of an FL-based architecture over a centralized one.
Hu, Mengtong; Shi, Xu; Song, Peter X‐K
(Statistics in Medicine)
Data sharing barriers present paramount challenges arising from multicenter clinical studies, where multiple data sources are stored and managed in a distributed fashion at different local study sites. Merging such data sources into a common data storage for a centralized statistical analysis requires a data use agreement, which is often time-consuming. Data merging may become more burdensome when propensity score modeling is involved in the analysis, because it requires combining many confounding variables, and the systematic incorporation of this additional modeling into a meta-analysis has not been thoroughly investigated in the literature. Motivated by a multicenter clinical trial of basal insulin treatment for reducing the risk of post-transplantation diabetes mellitus, we propose a new inference framework that avoids merging subject-level raw data from multiple sites at a centralized facility and needs only the sharing of summary statistics. Unlike the architecture of federated learning, the proposed collaborative inference does not need a central site to combine local results and thus enjoys maximal protection of data privacy and minimal sensitivity to unbalanced data distributions across data sources. We show theoretically and numerically that the new distributed inference approach incurs little loss of statistical power compared to the centralized method that requires merging the entire data. We present large-sample properties and algorithms for the proposed method and illustrate its performance with simulation experiments and the motivating example on the differential average treatment effect of basal insulin in lowering the risk of diabetes among kidney-transplant patients compared to the standard of care.
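A minimal sketch of pooling site-level summary statistics, assuming each site shares only an effect estimate and its standard error. This is a generic fixed-effect, inverse-variance-weighted combination, not the paper's actual collaborative-inference estimator; the function name and numbers are illustrative.

```python
import math

def pool_estimates(estimates, std_errors):
    """Fixed-effect, inverse-variance-weighted pooling of site-level estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]  # more precise sites weigh more
    total = sum(weights)
    pooled = sum(w / total * est for w, est in zip(weights, estimates))
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se

# Three sites each share only (effect estimate, standard error):
est, se = pool_estimates([0.40, 0.55, 0.48], [0.10, 0.20, 0.15])
# est is pulled toward the most precise site (se = 0.10), and pooled_se
# is smaller than any single site's standard error.
```

The key property, as in the abstract above, is that no subject-level data leave a site: each site transmits two numbers, yet the pooled estimate gains precision from all three cohorts.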
With the wider availability of healthcare data such as Electronic Health Records (EHRs), more and more data-driven approaches have been proposed to improve the quality of care delivery. Predictive modeling, which aims at building computational models for predicting clinical risk, is a popular research topic in healthcare analytics. However, concerns about the privacy of healthcare data may hinder the development of effective predictive models that generalize well, because this often requires rich, diverse data from multiple clinical institutions. Recently, federated learning (FL) has demonstrated promise in addressing this concern. However, data heterogeneity across local participating sites may affect the prediction performance of federated models. Because acute kidney injury (AKI) and sepsis are highly prevalent among patients admitted to intensive care units (ICUs), the early AI-based prediction of these conditions is an important topic in critical care medicine. In this study, we take AKI and sepsis onset risk prediction in the ICU as two examples to explore the impact of data heterogeneity in the FL framework and to compare performance across frameworks. We built predictive models based on local, pooled, and FL frameworks using EHR data from multiple hospitals. The local framework used only data from each site itself. The pooled framework combined data from all sites. In the FL framework, each local site had no access to other sites' data: a model was updated locally, and its parameters were shared with a central aggregator, which updated the federated model's parameters and then shared them back with each site. We found that models built within the FL framework outperformed their local counterparts. We then analyzed variable-importance discrepancies across sites and frameworks. Finally, we explored potential sources of heterogeneity within the EHR data: differing distributions of demographic profiles, medication use, and site information contributed to data heterogeneity.
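The federated round described in this abstract (local update, parameters shared with a central aggregator, aggregated model shared back) can be sketched as FedAvg-style weighted averaging. The toy parameter vectors and function name below are illustrative, not the study's actual implementation.

```python
def federated_average(site_params, site_sizes):
    """Average each parameter across sites, weighted by local sample count."""
    total = sum(site_sizes)
    dim = len(site_params[0])
    return [sum(p[i] * n for p, n in zip(site_params, site_sizes)) / total
            for i in range(dim)]

# Each hospital trains locally and shares only its parameters and cohort size;
# raw patient records never leave the site.
site_params = [[0.2, 1.0], [0.4, 0.8], [0.3, 0.6]]
site_sizes = [100, 300, 100]
global_params = federated_average(site_params, site_sizes)
# The aggregate is dominated by the largest site (n = 300), which is one way
# site-to-site heterogeneity leaks into the federated model.
```

One aggregation round like this, repeated with the global parameters redistributed for further local training, is the basic loop that lets federated models outperform purely local ones while keeping data in place.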
Levac, Brett R.; Arvinte, Marius; Tamir, Jonathan I.
(Bioengineering)
Image reconstruction is the process of recovering an image from raw, under-sampled signal measurements, and is a critical step in diagnostic medical imaging, such as magnetic resonance imaging (MRI). Recently, data-driven methods have led to improved image quality in MRI reconstruction using a limited number of measurements, but these methods typically rely on the existence of a large, centralized database of fully sampled scans for training. In this work, we investigate federated learning for MRI reconstruction using end-to-end unrolled deep learning models as a means of training global models across multiple clients (data sites), while keeping individual scans local. We empirically identify a low-data regime across a large number of heterogeneous scans, where a small number of training samples per client are available and non-collaborative models lead to performance drops. In this regime, we investigate the performance of adaptive federated optimization algorithms as a function of client data distribution and communication budget. Experimental results show that adaptive optimization algorithms are well suited for the federated learning of unrolled models, even in a limited-data regime (50 slices per data site), and that client-sided personalization can improve reconstruction quality for clients that did not participate in training.
Hanke, Michael; Pestilli, Franco; Wagner, Adina S.; Markiewicz, Christopher J.; Poline, Jean-Baptiste; Halchenko, Yaroslav O.
(Neuroforum)
Decentralized research data management (dRDM) systems handle digital research objects across participating nodes without critically relying on central services. We present four perspectives in defense of dRDM, illustrating that, in contrast to centralized or federated RDM solutions, a dRDM system based on heterogeneous but interoperable components can offer a sustainable, resilient, inclusive, and adaptive infrastructure for scientific stakeholders: An individual scientist or lab, a research institute, a domain data archive or cloud computing platform, and a collaborative multi-site consortium. All perspectives share the use of a common, self-contained, portable data structure as an abstraction from current technology and service choices. In conjunction, the four perspectives review how varying requirements of independent scientific stakeholders can be addressed by a scalable, uniform dRDM solution, and present a working system as an exemplary implementation.
@article{osti_10349490,
title = {Synergies between centralized and federated approaches to data quality: a report from the National COVID Cohort Collaborative},
url = {https://par.nsf.gov/biblio/10349490},
DOI = {10.1093/jamia/ocab217},
journal = {Journal of the American Medical Informatics Association},
volume = {29},
number = {4},
author = {Pfaff, Emily R and Girvin, Andrew T and Gabriel, Davera L and Kostka, Kristin and Morris, Michele and Palchuk, Matvey B and Lehmann, Harold P and Amor, Benjamin and Bissell, Mark and Bradwell, Katie R and Gold, Sigfried and Hong, Stephanie S and Loomba, Johanna and Manna, Amin and McMurry, Julie A and Niehaus, Emily and Qureshi, Nabeel and Walden, Anita and Zhang, Xiaohan Tanner and Zhu, Richard L and Moffitt, Richard A and Haendel, Melissa A and Chute, Christopher G and Adams, William G and Al-Shukri, Shaymaa and Anzalone, Alfred and Baghal, Ahmad and Bennett, Tellen D and Bernstam, Elmer V and Bernstam, Elmer V and Bissell, Mark M and Bush, Brian and Campion, Thomas R and Castro, Victor and Chang, Jack and Chaudhari, Deepa D and Chen, Wenjin and Chu, San and Cimino, James J and Crandall, Keith A and Crooks, Mark and Davies, Sara J and DiPalazzo, John and Dorr, David and Eckrich, Dan and Eltinge, Sarah E and Fort, Daniel G and Golovko, George and Gupta, Snehil and Haendel, Melissa A and Hajagos, Janos G and Hanauer, David A and Harnett, Brett M and Horswell, Ronald and Huang, Nancy and Johnson, Steven G and Kahn, Michael and Khanipov, Kamil and Kieler, Curtis and Luzuriaga, Katherine Ruiz and Maidlow, Sarah and Martinez, Ashley and Mathew, Jomol and McClay, James C and McMahan, Gabriel and Melancon, Brian and Meystre, Stephane and Miele, Lucio and Morizono, Hiroki and Pablo, Ray and Patel, Lav and Phuong, Jimmy and Popham, Daniel J and Pulgarin, Claudia and Santos, Carlos and Sarkar, Indra Neil and Sazo, Nancy and Setoguchi, Soko and Soby, Selvin and Surampalli, Sirisha and Suver, Christine and Vangala, Uma Maheswara and Visweswaran, Shyam and Oehsen, James von and Walters, Kellie M and Wiley, Laura and Williams, David A and Zai, Adrian},
}